In this video, we'll walk you through creating an LLM eval from scratch. LLM evals are rigorous tests of the quality of an AI system.
- Check out the documentation for LLM evals in Kiln AI.
- Download Kiln to try it yourself.
Video Walkthrough
Feature Overview
The eval feature of Kiln includes:
- Multiple state-of-the-art evaluation methods (G-Eval, LLM as Judge); a scoring sketch follows this list.
- Synthetic data generation makes it easy to generate hundreds or thousands of eval data samples in minutes.
- Includes tooling to find the best evaluation method for your task: it finds the eval algorithm and model combination that best correlates with human preferences (Kendall's Tau, Spearman's rho, MSE, etc.); see the correlation sketch after this list.
- Includes an eval dashboard for finding the highest-quality way to run your task (prompt+model combination).
- Fine-tunes: create and then evaluate custom fine-tunes for your task.
- Intuitive UI for eval dataset management: create eval sets, manage golden sets, add human ratings, etc.
- Automatic eval generation: it will examine your task definition, then automatically create an evaluator for you.
- Supports custom evaluators: create evals for any scores, goals, or instructions you want.
- Built in eval templates for common scenarios: toxicity, bias, jailbreaking, factual correctness, and maliciousness.
- Synthetic data templates to generate adversarial datasets using uncensored and unaligned models like Dolphin/Grok. This is an unusual case where highly inappropriate content serves a genuinely ethical purpose. The video includes a demo of Dolphin attempting to jailbreak the core model.
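To make the G-Eval method concrete, here's a minimal sketch of the core scoring idea: instead of taking the judge model's single emitted score token at face value, G-Eval weights each candidate score by the token's probability (recovered from log-probabilities) and takes the expected value. The variable names and logprob values below are illustrative assumptions, not Kiln's internals; a real run would pull `top_logprobs` from the judge model's API response.

```python
import math

# Hypothetical top-k log-probabilities a judge model assigned to the
# score token (1-5) when asked to rate an output. Illustrative values.
top_logprobs = {"1": -4.2, "2": -2.9, "3": -1.6, "4": -1.2, "5": -1.0}

def g_eval_score(top_logprobs: dict[str, float]) -> float:
    """G-Eval style score: probability-weighted average over score tokens."""
    # Convert log-probabilities back to probabilities.
    probs = {tok: math.exp(lp) for tok, lp in top_logprobs.items()}
    # Normalize over the observed score tokens, then take the expectation.
    total = sum(probs.values())
    return sum(int(tok) * p for tok, p in probs.items()) / total

print(f"G-Eval score: {g_eval_score(top_logprobs):.2f}")  # ~4.01
```

The weighted average yields a continuous score (here about 4.0 rather than a hard "4"), which gives finer-grained rankings than a single sampled token.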
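And to make the correlation criterion concrete, here's a minimal sketch (using SciPy and NumPy, with made-up ratings) of how an eval method's scores can be compared against human golden-set ratings using the metrics named above. This illustrates the metrics themselves, not Kiln's implementation.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

# Hypothetical data: human golden-set ratings vs. one eval method's scores.
human = np.array([5, 4, 4, 2, 1, 3, 5, 2])
judge = np.array([4.8, 4.1, 3.5, 2.2, 1.4, 3.1, 4.9, 2.6])

tau, _ = kendalltau(human, judge)            # rank agreement, robust to scale
rho, _ = spearmanr(human, judge)             # monotonic correlation
mse = float(np.mean((human - judge) ** 2))   # absolute score error

print(f"Kendall's Tau: {tau:.3f}, Spearman: {rho:.3f}, MSE: {mse:.3f}")
```

Running this comparison for each candidate eval method (algorithm plus judge model) and picking the one with the strongest agreement with human ratings is the idea behind the "find the best evaluation method" tooling.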