Mar 3, 2025

Video Guide: Create LLM Evals in under 20 minutes

Easily create your own LLM evals in minutes. Use powerful evaluation methods like LLM-as-Judge and G-Eval from Kiln's UI, without writing any code.

In this video, we'll walk you through the process of creating an LLM eval from scratch. LLM evals are a rigorous test of the quality of an AI system.

Video Walkthrough

Feature Overview

Kiln's eval feature includes:

  • Multiple state-of-the-art evaluation methods (G-Eval, LLM-as-Judge)
  • Synthetic data generation, making it easy to generate hundreds or thousands of eval data samples in minutes.
  • Tooling to find the best evaluation method for your task: it finds the eval algorithm + judge model that best correlates with human preferences (Kendall's Tau, Spearman's rho, MSE, etc.).
  • An eval dashboard to find the highest-quality method of running your task (prompt + model).
  • Fine-tunes: create and then evaluate custom fine-tunes for your task.
  • Intuitive UI for eval dataset management: create eval sets, manage golden sets, add human ratings, etc.
  • Automatic eval generation: it will examine your task definition, then automatically create an evaluator for you.
  • Support for custom evaluators: create evals for any scores, goals, or instructions you want.
  • Built-in eval templates for common scenarios: toxicity, bias, jailbreaking, factual correctness, and maliciousness.
  • Synthetic data templates to generate adversarial datasets using uncensored and unaligned models like Dolphin/Grok. It's an unusual case where highly inappropriate content serves a clearly ethical purpose: stress-testing your model. The video includes a demo of Dolphin attempting to jailbreak the core model.
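To make the "best correlates with human preference" point above concrete, here is a minimal sketch of how a rank correlation like Kendall's Tau can compare candidate judges against human ratings. The data and judge names are hypothetical, and this is a simplified tau-a (no tie correction), not Kiln's actual implementation:

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: fraction of concordant minus discordant pairs.

    Ranges from -1 (perfectly reversed ranking) to +1 (identical ranking).
    """
    n = len(xs)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical 1-5 star ratings over five eval samples.
human   = [5, 4, 4, 2, 1]
judge_a = [5, 5, 3, 2, 1]  # mostly agrees with the human ranking
judge_b = [2, 5, 1, 4, 3]  # near-random ordering

print(kendall_tau(human, judge_a))  # → 0.8
print(kendall_tau(human, judge_b))  # → -0.1
```

The judge whose scores yield the higher tau orders outputs most like your human raters, so it is the one to trust for automated evals.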