Jun 27, 2025

Many Small Evals Beat One Big Eval, Every Time

Evaluating AI products is notoriously difficult. In this article we walk through how using many small evals simplifies the process. This approach makes evals more maintainable, faster to build, and easier to iterate on. Maybe most importantly: your team will actually enjoy using them.

Key Insight
If creating an eval takes less than 10 minutes, your team will create them when they spot issues or fix bugs.
When evals become a habit instead of a chore, your AI system becomes dramatically more robust and your team moves faster.

We're going to walk through the benefits of this strategy and how to get started.

What is a Small Eval?

Evals are like tests for AI tasks. Like software tests, they run your system to ensure things work properly and don't regress over time. Unlike tests, they are statistical. They run many different inputs and produce a statistical score as output instead of a simple pass/fail. This approach is needed because AI systems are non-deterministic.
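
To make the contrast concrete, here's a minimal sketch of an eval loop in Python. The `run_system` and `passes_check` callables stand in for your own model call and grading logic; they're placeholders for illustration, not any specific library's API.

```python
from typing import Callable

def run_eval(
    inputs: list[str],
    run_system: Callable[[str], str],       # your AI system under test (hypothetical wrapper)
    passes_check: Callable[[str], bool],    # rule-based check or LLM-as-judge wrapper (hypothetical)
    runs_per_input: int = 3,                # repeat runs because model outputs are non-deterministic
) -> float:
    """Run many inputs through the system and return a pass rate, not a single pass/fail."""
    passes = 0
    total = 0
    for prompt in inputs:
        for _ in range(runs_per_input):
            output = run_system(prompt)
            if passes_check(output):
                passes += 1
            total += 1
    return passes / total  # a statistical score, e.g. 0.91
```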

'Small' evals focus on one concern: a single product goal or bug/issue. This contrasts with large or holistic evals. Large evals try to evaluate overall product performance with a single dataset:

Large Eval Scorecard (single holistic evaluation)
  • Overall Product Eval: 90%

Small Evals Scorecard (multiple focused evaluations)
  • Clarify unclear requests: 84%
  • Refuse to discuss competitors: 99%
  • Reject toxic requests: 100%
  • Offer rebate before cancellation: 89%
  • Follow brand styleguide: 86%
  • Only link to official docs: 99%
  • Avoid 'clickbait' titles: 91%
  • Knowledge base retrieval recall: 87%
  • Overall: 90%

The Benefits of Many Small Evals

Small evals are easier to create, and that means they get adopted by teams

Small evals are faster to create because when you're focused on a single issue:

  1. They require fewer people (no design by committee)
  2. You need fewer samples and fewer annotations
  3. Data generation is easier because it's focused
  4. You don't need to worry about class imbalance

When everyone can create evals, more regressions get caught earlier

Different team members bring unique insights: PMs see use cases, designers notice style issues, support tracks ticket patterns, and QA/Eng spot all sorts of problems. When everyone can create evals, more regressions get caught earlier, before they ever reach customers.

Large evals hide critical insights

Evals that cover multiple concerns can hide critical regressions and insights. Let's look at a hypothetical example where we try out a newly released model on our evals to see if it's a worthy upgrade:

Large Eval Scorecard (new model vs current model)
  • Overall Product Eval: 94% (+4%)

Small Evals Scorecard (new model vs current model)
  • Clarify unclear requests: 93% (+9%)
  • Refuse to discuss competitors: 100% (+1%)
  • Reject toxic requests: 100% (even)
  • Offer rebate before cancellation: 72% (-18%)
  • Follow brand styleguide: 85% (-1%)
  • Only link to official docs: 99% (even)
  • Avoid 'clickbait' titles: 96% (+5%)
  • Knowledge base retrieval recall: 94% (+7%)
  • Overall: 94% (+4%)

Looking only at the overall eval, the new model looks like an easy win. However, that hides the full picture: an important use case has collapsed in quality. Had you shipped the model based on the first eval, that regression would have hit customers with potentially serious impact. With small evals, we not only spot the regression, we also get the information needed to update our prompts and address it before we ship.

Real products are made up of dozens or hundreds of small decisions. In software those decisions tend to be captured in code, and are somewhat resistant to sudden undetected breakage (with proper testing). With AI, any new model or prompt change can regress unrelated things that had been working for years. A set of small evals helps you monitor the things that matter for your product.

Many small evals are more stable and easier to maintain

Products evolve over time. When goals change, you'll need to update evals.

With large evals, updates can be a nightmare. You have to review and update hundreds or thousands of data samples. Over time, this gets even harder: the people who created the original evals leave the company, change roles, or forget the nuances of the problems they previously solved.

With many small evals you'll need to make updates too. But they tend to be easier to manage. When you change product goals, an eval or two may become irrelevant or need updates. However, the majority of your evals won't need to be changed.

Fewer eval changes also means you'll have more reliable metrics over time, since each change breaks your ability to compare against historical performance.

Q: How small are small evals?
A: Small vs large is relative

While we're emphasizing "many small evals", what we're really saying is:

  • Make each eval focus on a specific goal.
  • Add enough data in each eval to give you confidence it works. If you're focusing on a single goal, this might not require a ton of samples. More complex goals will require more data, and that's okay.
  • There should be a low fixed-cost to adding a new eval.

These principles naturally lead to many evals instead of few holistic evals. On average, their size is smaller. Hence "many small evals" as a shorthand for the approach.
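
To illustrate the low-fixed-cost principle, here's a hypothetical sketch (not Kiln's API) of a small-eval structure where adding a new eval is a few lines: one goal, one dataset, one grader.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SmallEval:
    """One focused eval: a single product goal, its own dataset, and its own grader."""
    goal: str                            # the one concern this eval covers
    dataset: list[str]                   # just enough samples to be confident in this goal
    passes_check: Callable[[str], bool]  # grader for this goal only

# Adding a new eval should be a few lines, not a project:
evals = [
    SmallEval(
        goal="Refuse to discuss competitors",
        dataset=["How do you compare to AcmeCorp?", "Is AcmeCorp cheaper than you?"],
        passes_check=lambda output: "AcmeCorp" not in output,  # toy grader for illustration
    ),
    # ...one SmallEval per product goal or fixed bug
]
```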

Advanced uses of evals for larger teams

As teams get larger and more advanced, evals become more helpful. Here are some additional ways larger teams leverage evals throughout their workflows:

  • Detect fine-tuning data issues: Run each of your evals against fine-tuning datasets to find problematic data samples. Even a few bad data samples can have catastrophic impact on the quality of fine-tuned models.
  • Detect conflicts between evals: Run your evals against the eval set of every other eval. This is a helpful trick for finding conflicting priorities on larger teams. For example, the support team pushes for your agent to offer discount codes to raise satisfaction, while the revenue team adds evals ensuring it rarely does. A rough sketch of this cross-check follows below.
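
Here's a rough sketch of that conflict check, reusing the hypothetical `SmallEval` and `run_eval` helpers from the earlier sketches. The 0.8 threshold is arbitrary; the point is simply that a low cross-score flags two goals that may be pulling in opposite directions.

```python
# Cross-check evals for conflicting goals: run each eval's grader against every
# *other* eval's dataset and flag suspiciously low scores.

def find_conflicts(
    evals: list[SmallEval],                # hypothetical SmallEval from the earlier sketch
    run_system: Callable[[str], str],      # your model call, as in run_eval above
    threshold: float = 0.8,                # arbitrary cutoff for "these goals may conflict"
) -> list[tuple[str, str, float]]:
    conflicts = []
    for grader_eval in evals:
        for data_eval in evals:
            if grader_eval is data_eval:
                continue
            score = run_eval(data_eval.dataset, run_system, grader_eval.passes_check)
            if score < threshold:
                # e.g. ("Offer rebate before cancellation", "Rarely offer discount codes", 0.41)
                conflicts.append((data_eval.goal, grader_eval.goal, score))
    return conflicts
```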

Evals vs Unit Testing

With many small evals, the process starts to sound like software unit testing: many small checks that confirm individual pieces of the system work and don't regress. While the analogy can be helpful, it is important to know the difference and when each should be used.

So what's the difference? Unlike most software, AI models are non-deterministic. A single sample might always pass, even if 99% of similar samples fail. A specific sample can even change from run to run, based on the random seed or floating point differences between hardware. To address this, evals differ from unit tests in a few ways:

  • They measure many iterations, over many sample inputs
  • They produce statistical scores, both by running many iterations/samples and by looking at the probabilities of individual outputs with methods like G-Eval.
  • Statistical scores mean you need a range of acceptable scores. You really don't want to be paged because an eval regresses from 96% to 95.8%.

A good rule of thumb: if you're actually invoking a model, you should be creating an eval, not just a unit test. It's a bit more work, but it's much more robust and the right tool for the job.
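
As a sketch of what that looks like in practice, here's an eval-style check written with pytest, reusing the hypothetical `run_eval` helper from the earlier sketch. The dataset, system wrapper, and grader names are placeholders, and the 90% floor is just an example of an acceptable-range threshold, not a recommendation.

```python
# An eval-style check (not a unit test), assuming pytest. The dataset, system
# wrapper, and grader below are hypothetical placeholders for your own code.

PASS_RATE_FLOOR = 0.90  # an acceptable range, not an exact score: 95.8% vs 96% should not page anyone

def test_refuses_competitor_questions():
    score = run_eval(                          # run_eval from the earlier sketch
        inputs=competitor_questions,           # hypothetical dataset of prompts
        run_system=call_support_agent,         # hypothetical wrapper around your model call
        passes_check=never_names_competitors,  # hypothetical grader for this one goal
        runs_per_input=3,
    )
    assert score >= PASS_RATE_FLOOR, f"Competitor-refusal eval regressed: {score:.1%}"
```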

3 Steps to Set Up Your Team for Evals and Iteration

If you're sold, here's how to get started with evals for your AI product team, even if you don't have any prior experience:

Step 1: Set up a tool that makes creating evals easy

Back to our key insight: the easier evals are, the more your team will adopt them, and the more benefits you'll see.

Here's what to look for when choosing an easy-to-use eval toolkit:

  • Intuitive UI: It should have a user interface anyone can use, not a code-only workflow. You want PMs, support, designers, and the whole team to feel empowered to create evals.
  • Synthetic data generation: It should be able to generate synthetic data for your eval; manual data creation is typically too time consuming and slow. The synthetic data generation should support human guidance, allowing you to guide it to generate samples relevant to the goal of this eval.
  • Baseline to human preference: It should allow human annotation of a golden dataset, and confirm the LLM-as-judge is aligned to human preference. Without this, you won't have any confidence your evals actually work. (A rough sketch of this check follows the list.)
  • Rapid experimentation: It should be easy to re-run your evals against a new implementation method. This includes new models, new prompts, or new code. The easier this is, the more rapidly you can experiment.
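
For the human-preference baseline, the underlying check is simple to picture. Here's a rough sketch assuming a hypothetical `llm_judge` callable and a human-annotated golden set; eval tools automate this step, but the idea is just measuring agreement between the judge and human labels.

```python
# Check that an LLM-as-judge agrees with human annotations before trusting it.
# golden_set and llm_judge are hypothetical placeholders.

from typing import Callable

def judge_agreement(
    golden_set: list[tuple[str, bool]],   # (model_output, human_pass_label) pairs
    llm_judge: Callable[[str], bool],     # hypothetical judge wrapper returning pass/fail
) -> float:
    agree = sum(1 for output, human_label in golden_set if llm_judge(output) == human_label)
    return agree / len(golden_set)

# If agreement is low (say, well under 0.9), fix the judge prompt before trusting the eval.
```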

We are of course biased, but we suggest Kiln, which meets all of these requirements. Other tools tend not to integrate synthetic data generation, lack UIs entirely, or have UIs designed only for data scientists. Kiln is open and free to use.

Demo: Building an Eval in Under 5 Minutes with Kiln

In the video below, we show how you can quickly create new evals from scratch using synthetic data generation. The process doesn't require any expertise in AI evals.

The video and our eval docs walk through the full process.

Step 2: Teach your team to build evals

Once you've set up an easy-to-use system, this step should be easy, fun, and exciting for the team. A quick 30-minute demo is usually all you need.

Step 3: Create a culture of evaluation

Just like creating a testing culture on software teams, you need to foster an evals culture on AI product teams. I suggest:

  • Encourage QA/Eng to create small evals when issues are found, instead of filing bugs.
  • Encourage all bug fixes touching models or prompts to include an eval. This ensures the issue never regresses again. This is very similar to how you encourage bug-fixes in software to include a unit test. In both cases the goal is to reproduce the issue, show that the fix works, and prevent future regressions.

Let Kiln Help You Get Started with Evals

Kiln is a free app which makes it easy to create evals for your AI product. It will guide you through every step of the process. Check out the docs for everything you need to create an eval culture on your team.

You can download Kiln, check out our Github, join our Discord, or learn more on our website.
