We're going to walk through the benefits of this strategy, and how to get started:
- What is a small eval?
- The benefits of many small evals over a large eval
- Evals vs Unit Testing
- 3 Steps to Set Up Your Team for Evals and Iteration
- Demo: How to create a small eval in under 5 minutes with Kiln
What is a Small Eval?
Evals are like tests for AI tasks. Like software tests, they run your system to ensure things work properly and don't regress over time. Unlike tests, they are statistical. They run many different inputs and produce a statistical score as output instead of a simple pass/fail. This approach is needed because AI systems are non-deterministic.
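As a minimal sketch of the idea (the helper names `run_task` and `judge_output` are placeholders, not any particular library's API), an eval runs your system over many inputs, grades each output against the eval's goal, and reports a rate rather than a single pass/fail:

```python
# Minimal sketch of a small eval. `run_task` and `judge_output` are hypothetical
# stand-ins for your AI system and your grader (e.g. an LLM-as-judge); wire in
# your own implementations.

def run_task(user_input: str) -> str:
    """Run your AI system (model + prompt + tools) and return its output."""
    raise NotImplementedError

def judge_output(user_input: str, output: str) -> bool:
    """Return True if the output meets this eval's single goal."""
    raise NotImplementedError

def run_eval(inputs: list[str]) -> float:
    """Score many inputs and return a pass rate instead of one pass/fail."""
    passes = sum(judge_output(i, run_task(i)) for i in inputs)
    return passes / len(inputs)
```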
'Small' evals focus on one concern: a single product goal or bug/issue. This contrasts with large or holistic evals, which try to evaluate overall product performance with a single dataset.
The Benefits of Many Small Evals
Small evals are easier to create, and that means they get adopted by teams
Small evals are faster to create because when you're focused on a single issue:
- They require fewer people (no design by committee)
- You need fewer samples and fewer annotations
- Data generation is easier because it's focused
- You don't need to worry about class imbalance
When everyone can create evals, more regressions get caught earlier
Different team members bring unique insights — PMs see use cases, designers notice style issues, support tracks ticket patterns, and QA/Eng spot all sorts of problems. When everyone can create evals, more regressions get caught earlier, before they make it to customers.
Large evals hide critical insights
Evals that cover multiple concerns can hide critical regressions and insights. Let's look at a hypothetical example where we try out a newly released model on our evals to see if it's a worthy upgrade.
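To make the hypothetical concrete, here's an illustrative calculation with invented numbers (the eval names and scores are made up, not real benchmark results). The holistic average improves while one concern collapses:

```python
# Invented scores for a hypothetical model upgrade; eval names are placeholders.
current = {"tone": 0.90, "formatting": 0.88, "refund_policy": 0.91, "issue_detection": 0.93}
candidate = {"tone": 0.98, "formatting": 0.99, "refund_policy": 0.99, "issue_detection": 0.70}

def overall(scores: dict[str, float]) -> float:
    """A holistic score: the average across every concern."""
    return sum(scores.values()) / len(scores)

print(f"Overall: {overall(current):.3f} -> {overall(candidate):.3f}")  # 0.905 -> 0.915: looks like a win
print(f"issue_detection: {current['issue_detection']:.2f} -> {candidate['issue_detection']:.2f}")  # 0.93 -> 0.70: hidden regression
```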
When only looking at the overall eval, the new model looks like an easy win. However, that hides the full picture: an important use case has collapsed in quality. Had you shipped the model based on the first eval, those bugs would have hit customers with potentially serious impact. With small evals, we not only spot the regression, but also get the information needed to update our prompts and address it before we ship.
Real products are made up of dozens or hundreds of small decisions. In software those decisions tend to be captured in code, and are somewhat resistant to sudden undetected breakage (with proper testing). With AI, any new model or prompt change can regress unrelated things that had been working for years. A set of small evals helps you monitor the things that matter for your product.
Many small evals are more stable and easier to maintain
Products evolve over time. When goals change, you'll need to update evals.
With large evals, updating evals can be a nightmare. You have to review and update hundreds or thousands of data samples. Over time, this gets even harder: the people who created the original evals leave the company, change roles, or forget the nuance of problems they previously solved.
With many small evals you'll need to make updates too. But they tend to be easier to manage. When you change product goals, an eval or two may become irrelevant or need updates. However, the majority of your evals won't need to be changed.
Fewer eval changes also mean you'll have more reliable metrics over time, since each change breaks your ability to compare against historical performance.
While we're emphasizing "many small evals", what we're really saying is:
- Make each eval focus on a specific goal.
- Add enough data in each eval to give you confidence it works. If you're focusing on a single goal, this might not require a ton of samples. More complex goals will require more data, and that's okay.
- There should be a low fixed cost to adding a new eval.
These principles naturally lead to many evals instead of few holistic evals. On average, their size is smaller. Hence "many small evals" as a shorthand for the approach.
As teams get larger and more advanced, evals become more helpful. Here are some additional ways larger teams leverage evals throughout their workflows:
- Detect fine-tuning data issues: Run each of your evals against your fine-tuning datasets to find problematic data samples (see the sketch after this list). Even a few bad data samples can have a catastrophic impact on the quality of fine-tuned models.
- Detect conflicts between evals: Run your evals against the eval set of every other eval. This is a helpful trick for finding conflicting priorities in larger teams. For example, the support team pushes for your agent to offer discount codes to raise satisfaction, while the revenue team adds evals ensuring it rarely does.
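As a sketch of the first idea (the `Judge` type and function names are illustrative, not a specific library's API), you can reuse each small eval's judge to screen a fine-tuning dataset before training:

```python
from typing import Callable

# Hypothetical sketch: reuse each small eval's judge to screen fine-tuning data.
# A `Judge` returns True when a (prompt, target_output) pair satisfies that eval.
Judge = Callable[[str, str], bool]

def flag_bad_samples(
    dataset: list[tuple[str, str]], judges: dict[str, Judge]
) -> list[tuple[int, str]]:
    """Return (sample_index, eval_name) pairs where a training sample fails an eval."""
    flagged = []
    for idx, (prompt, target) in enumerate(dataset):
        for name, judge in judges.items():
            if not judge(prompt, target):
                flagged.append((idx, name))
    return flagged

# Review or drop flagged samples before fine-tuning; even a few bad ones can
# noticeably hurt the tuned model.
```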
Evals vs Unit Testing
With many small evals, the process starts to sound like software unit testing: many small checks that confirm individual pieces of the system work and don't regress. While the analogy can be helpful, it is important to know the difference and when each should be used.
So what's the difference? Unlike most software, AI models are non-deterministic. A single sample might always pass, even if 99% of similar samples fail. A specific sample can even change from run to run, based on the random seed or floating point differences between hardware. To address this, evals differ from unit tests in a few ways:
- They measure many iterations, over many sample inputs
- They produce statistical scores, both by running many iterations/samples and by looking at the probability of individual outputs using methods like G-Eval.
- Statistical scores mean you need a range of acceptable scores. You really don't want to be paged because an eval regresses from 96% to 95.8%.
A good rule of thumb: if you're actually invoking a model, you should be creating an eval and not just a unit test. It's a bit more work, but it's much more robust and the right tool for the job.
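To make the contrast concrete, here's a minimal sketch of an eval gate that uses an acceptable range rather than a unit test's hard pass/fail; the baseline and tolerance values are assumptions you'd tune for your own product:

```python
# Minimal sketch of an eval gate. Unlike a unit test's hard pass/fail, it
# compares a statistical score to a baseline with a tolerance band.
# The baseline and tolerance below are illustrative, not recommendations.

def eval_gate(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Pass unless the score drops more than `tolerance` below the baseline."""
    return score >= baseline - tolerance

# A dip from 0.96 to 0.958 stays within tolerance and shouldn't page anyone;
# a drop to 0.90 is a real regression and should fail the check.
assert eval_gate(0.958, baseline=0.96)
assert not eval_gate(0.90, baseline=0.96)
```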
3 Steps to Set Up Your Team for Evals and Iteration
If you're sold, here's how to get started with evals for your AI product team, even if you don't have any prior experience:
Step 1: Set up a tool that makes creating evals easy
Back to our key insight: the easier evals are, the more your team will adopt them, and the more benefits you'll see.
Here's what to look for when choosing an easy-to-use eval toolkit:
- Intuitive UI: It should have a user interface anyone can use, not just a code interface. You want PMs, support, designers, and the whole team feeling empowered to create evals.
- Synthetic data generation: It should be able to generate synthetic data for your eval; manual data creation is typically too slow and time consuming. The synthetic data generation should support human guidance, allowing you to steer it toward samples relevant to the goal of each eval.
- Baseline to human preference: It should allow human annotation of a golden dataset, and confirm the LLM-as-judge is aligned to human preference (see the sketch after this list). Without this, you won't have any confidence your evals actually work.
- Rapid experimentation: It should be easy to re-run your evals against a new implementation method. This includes new models, new prompts, or new code. The easier this is, the more rapidly you can experiment.
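As a sketch of the human-alignment check (the data and function names are invented for illustration, not any specific tool's API), you can measure how often the judge agrees with human labels on the golden dataset:

```python
# Hypothetical sketch of checking judge/human alignment on a golden dataset.
# `golden` holds human labels; `judge_labels` holds the LLM-as-judge's labels
# for the same samples. Sample ids and labels are invented for illustration.

def judge_agreement(golden: dict[str, bool], judge_labels: dict[str, bool]) -> float:
    """Fraction of golden samples where the judge matches the human label."""
    matches = sum(judge_labels[sample_id] == label for sample_id, label in golden.items())
    return matches / len(golden)

golden = {"s1": True, "s2": False, "s3": True, "s4": True}
judge_labels = {"s1": True, "s2": False, "s3": False, "s4": True}
print(f"Judge/human agreement: {judge_agreement(golden, judge_labels):.0%}")  # 75%
```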
We are of course biased, but we suggest Kiln, which meets all of these requirements. Other tools tend not to integrate synthetic data generation, or don't have UIs, or have UIs designed for data scientists. Kiln is open and free to use.
Demo: Building an Eval in Under 5 Minutes with Kiln
In the video below, we show how you can quickly create new evals from scratch using synthetic data generation. The process doesn't require any expertise in AI evals.
The video and our eval docs cover:
- Synthetic data generation
- Creating LLM-as-Judge evals
- Ensuring the judge aligns with human experts
- Using evals to find the best way to run your AI workload
- Adding product custom evals ("small evals")
Step 2: Teach your team to build evals
Once you've set up an easy-to-use system, this step should be quick, fun, and get the team excited. A 30-minute demo is usually all you need.
Step 3: Create a culture of evaluation
Just like creating a testing culture on software teams, you need to foster an evals culture on AI product teams. I suggest:
- Encourage QA/Eng to create small evals when issues are found, instead of filing bugs.
- Encourage all bug fixes touching models or prompts to include an eval. This ensures the issue never regresses again. This is very similar to how you encourage bug-fixes in software to include a unit test. In both cases the goal is to reproduce the issue, show that the fix works, and prevent future regressions.
Let Kiln Help You Get Started with Evals
Kiln is a free app that makes it easy to create evals for your AI product. It will guide you through every step of the process. Check out the docs for everything you need to create an eval culture on your team.
You can download Kiln, check out our GitHub, join our Discord, or learn more on our website.