In this video, we'll walk you through creating an LLM eval from scratch. LLM evals are rigorous tests of the quality of an AI system.
- Check out the documentation for LLM evals in Kiln AI.
- Download Kiln to try it yourself.
Video Walkthrough
Feature Overview
The eval feature of Kiln includes:
- Multiple state-of-the-art evaluation methods (G-Eval, LLM as Judge); a scoring sketch follows this list.
- Synthetic data generation makes it easy to generate hundreds or thousands of eval data samples in minutes.
- Includes tooling to find the best evaluation method for your task: it finds the eval algorithm and model combination that best correlates with human preferences (Kendall's Tau, Spearman's rho, MSE, etc.); see the correlation sketch after this list.
- Includes an eval dashboard for finding the highest-quality way to run your task (prompt+model combination).
- Fine-tunes: create and then evaluate custom fine-tunes for your task.
- Intuitive UI for eval dataset management: create eval sets, manage golden sets, add human ratings, etc.
- Automatic eval generation: it will examine your task definition, then automatically create an evaluator for you.
- Supports custom evaluators: create evals for any scores, goals, or instructions you want.
- Built in eval templates for common scenarios: toxicity, bias, jailbreaking, factual correctness, and maliciousness.
- Synthetic data templates to generate adversarial datasets using uncensored and unaligned models like Dolphin/Grok. This is an unusual case where highly inappropriate content serves a genuinely ethical purpose. The video includes a demo of Dolphin attempting to jailbreak the core model.
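To make the G-Eval method concrete, here's a minimal sketch of the core scoring idea: instead of taking the judge model's single emitted score token at face value, G-Eval weights each candidate score by the token's probability (recovered from log-probabilities) and takes the expected value. The variable names and logprob values below are illustrative assumptions, not Kiln's internals; a real run would pull `top_logprobs` from the judge model's API response.

```python
import math

# Hypothetical top-k log-probabilities a judge model assigned to the
# score token (1-5) when asked to rate an output. Illustrative values.
top_logprobs = {"1": -4.2, "2": -2.9, "3": -1.6, "4": -1.2, "5": -1.0}

def g_eval_score(top_logprobs: dict[str, float]) -> float:
    """G-Eval style score: probability-weighted average over score tokens."""
    # Convert log-probabilities back to probabilities.
    probs = {tok: math.exp(lp) for tok, lp in top_logprobs.items()}
    # Normalize over the observed score tokens, then take the expectation.
    total = sum(probs.values())
    return sum(int(tok) * p for tok, p in probs.items()) / total

print(f"G-Eval score: {g_eval_score(top_logprobs):.2f}")  # ~4.01
```

The weighted average yields a continuous score (here about 4.0 rather than a hard "4"), which gives finer-grained rankings than a single sampled token.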
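And to make the correlation criterion concrete, here's a minimal sketch (using SciPy and NumPy, with made-up ratings) of how an eval method's scores can be compared against human golden-set ratings using the metrics named above. This illustrates the metrics themselves, not Kiln's implementation.

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

# Hypothetical data: human golden-set ratings vs. one eval method's scores.
human = np.array([5, 4, 4, 2, 1, 3, 5, 2])
judge = np.array([4.8, 4.1, 3.5, 2.2, 1.4, 3.1, 4.9, 2.6])

tau, _ = kendalltau(human, judge)            # rank agreement, robust to scale
rho, _ = spearmanr(human, judge)             # monotonic correlation
mse = float(np.mean((human - judge) ** 2))   # absolute score error

print(f"Kendall's Tau: {tau:.3f}, Spearman: {rho:.3f}, MSE: {mse:.3f}")
```

Running this comparison for each candidate eval method (algorithm plus judge model) and picking the one with the strongest agreement with human ratings is the idea behind the "find the best evaluation method" tooling.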