May 28, 2025

When Fine-Tuning Actually Makes Sense: A Developer's Guide

Most teams are curious about fine-tuning and how it might help their AI products, but they don't know what to expect from the process, how to measure success, or where to start.

Fine-tuning solves specific, measurable problems: models that produce inconsistent JSON schemas, inference costs that scale beyond your budget, prompts so complex they hurt performance, and specialized behavior that's impossible to achieve through prompting alone.

This guide walks through the concrete benefits of fine-tuning, helps you identify which goals matter for your use case, and shows you how to get started with a clear path to measurable results.

We'll cover the real use cases where fine-tuning makes sense—and when it doesn't. Let's start with the problems it actually solves:

Improve Quality

"Quality" means different things for different tasks. You should already have evals set up for the quality metrics you care about (if not, check out our evals guide).

Fine-tuning excels in specific quality areas:

Task-Specific Quality Score

Most products have an overall quality metric—often a 1-5 star rating. Fine-tuning can improve this metric by teaching the model how to respond through examples.

Improve Style Conformance

A customer service chatbot for a bank needs a very different style and tone than a fantasy roleplaying agent. Fine-tuning enforces specific styles more effectively than style-prompting.

Better JSON Formatting

We've seen JSON formatting accuracy jump from under 5% to over 99% after fine-tuning, compared to the same untuned base model.

When you need a model to produce output in a specific JSON format, smaller models often struggle out of the box. Even if your model produces valid JSON, it commonly produces the wrong schema (incorrect key names, missing required fields, etc). Fine-tuning significantly improves LLMs' ability to produce valid JSON in the correct schema.

The same applies to other formats like function calls, XML, YAML, and markdown.
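
A quick way to quantify this is to validate each model response against your target schema, then run the same eval set through the base model and the fine-tune to compare. Here's a minimal sketch using the jsonschema library (the schema and helper below are placeholders, not a specific recommendation):

```python
import json

from jsonschema import ValidationError, validate

# Hypothetical target schema for a structured-output task.
TARGET_SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
    "additionalProperties": False,
}


def schema_conformance_rate(responses: list[str]) -> float:
    """Fraction of responses that are valid JSON and match the target schema."""
    passed = 0
    for text in responses:
        try:
            validate(instance=json.loads(text), schema=TARGET_SCHEMA)
            passed += 1
        except (json.JSONDecodeError, ValidationError):
            pass
    return passed / len(responses) if responses else 0.0
```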

Lower Cost and Faster Speed

Fine-tuning is a great way to make your AI app faster and cheaper.

Fine-tuning for Shorter Prompts

Prompts grow quickly as you add details needed to perform a task (task description, style guides, rules, formatting, chain-of-thought instructions, etc). Long prompts create several problems:

  • Longer processing time, impacting speed
  • More tokens/GPU usage, impacting cost
  • Reduced prompt conformance as length increases, impacting quality

Fine-tuning addresses this by moving most of these requirements from prompts to the model itself. We typically recommend:

  • Formatting instructions: use fine-tuning
  • Tone/Style: use fine-tuning
  • Rules/logic (if A reply X, if A+B reply Y): use fine-tuning
  • Chain of thought guidance: use fine-tuning (with chain of thought examples)
  • Core task prompt: keep this, but make it much shorter after the above changes

You can also explore prompt caching as another method for improving the speed and cost of long prompts. However, unlike fine-tuning, prompt caching doesn't help with reduced prompt conformance.
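
To make "move it into the training data" concrete, here's a hypothetical before/after. The bank name, prompts, and OpenAI-style chat JSONL format are illustrative assumptions; the point is that formatting, tone, and rules are demonstrated by examples rather than described in the runtime prompt:

```python
import json

# Before fine-tuning: everything crammed into the runtime prompt.
LONG_PROMPT = (
    "You are a support assistant for Acme Bank. Respond in formal English. "
    "Always reply as JSON with keys 'answer' and 'follow_up'. If the user asks "
    "about a lost card, direct them to the card-freeze flow. Think step by step..."
)

# After fine-tuning: the behavior is demonstrated in training data instead.
SHORT_PROMPT = "You are a support assistant for Acme Bank."

training_example = {
    "messages": [
        {"role": "system", "content": SHORT_PROMPT},
        {"role": "user", "content": "I lost my card, what do I do?"},
        {
            "role": "assistant",
            "content": json.dumps({
                "answer": "Please freeze your card right away from the Cards tab in the app.",
                "follow_up": "Would you like me to start a replacement card request?",
            }),
        },
    ]
}

# Hundreds of examples like this are written out as JSONL for training.
with open("train.jsonl", "w") as f:
    f.write(json.dumps(training_example) + "\n")
```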

Smaller Models

Fine-tuning often achieves similar quality on much smaller models. Smaller models are faster and cheaper to run.

Example: Qwen 14B is a good fine-tuning candidate that runs about 6x faster than GPT 4.1 and costs roughly 3% as much.

Local Models

For products you distribute to users, fine-tuning can create small models that run locally on their devices, reducing your inference cost to zero. However, expect a speed tradeoff—local models won't be particularly fast.

Privacy

Many users don't want to send private data to providers like OpenAI and Anthropic. Fine-tuning can create smaller models that run locally while maintaining the same quality metrics for a given task. This benefits everyone from hobbyists to companies handling sensitive data.
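
As a rough sketch of what this looks like in practice (the server URL and model name are placeholders): a local fine-tune can be served behind an OpenAI-compatible endpoint, for example via llama.cpp, Ollama, or vLLM, and called just like a hosted API, except no data leaves the machine.

```python
from openai import OpenAI

# Point the client at a locally hosted server instead of a cloud provider.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="my-fine-tuned-qwen-4b",  # placeholder name for your local fine-tune
    messages=[{"role": "user", "content": "Summarize this contract clause: ..."}],
)
print(response.choices[0].message.content)
```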

Tool Calling

Fine-tuning effectively teaches LLMs how and when to use specific tools. A training set showing the right tools at the right time is easier to manage than defining the behavior in prompts. Training samples also help models learn the tool calling format (parameters, order), reducing errors.
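
A hypothetical training sample for tool calling might look like the following. The tool name, arguments, and message schema are assumptions for illustration; the exact format depends on the model and training framework you use:

```python
import json

# One training sample teaching the model WHEN to call a tool and with WHAT arguments.
tool_call_example = {
    "messages": [
        {
            "role": "system",
            "content": "You can call get_weather(city) when the user asks about weather.",
        },
        {"role": "user", "content": "Do I need an umbrella in Seattle today?"},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "type": "function",
                    "function": {
                        "name": "get_weather",
                        "arguments": json.dumps({"city": "Seattle"}),
                    },
                }
            ],
        },
    ]
}

# Also include contrasting samples where no tool call is needed,
# so the model learns when NOT to call a tool.
```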

Better Logic / Rule Following

AI agents need to handle a wide range of inputs. While you can put all needed instructions into a prompt, this becomes unwieldy quickly, and instruction following can decrease as prompts grow. Providing examples of your logic/rules through fine-tuning helps models learn rules more effectively than prompt-based logic.

Product Logic

Fine-tuning helps models learn to respond based on content and context:

  • Conditional logic: If A: respond with X. If B: respond with Y. If A+B: respond with Z.
  • Long-tail situations: off-topic requests, questions in other languages, profanity/aggression, etc.
  • Uncertain responses: models usually aren't trained to say "I don't know," but doing so is often good practice in real products.
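
A few hypothetical training rows covering these cases might look like this (the inputs, policies, and responses are illustrative only):

```python
# Each pair demonstrates a rule directly, instead of describing it in the prompt.
logic_examples = [
    # Conditional logic: refund request within the return window.
    {"input": "I bought this 3 days ago and want a refund.",
     "output": "You're within the 30-day window, so I've started a refund for you."},
    # Conditional logic: refund request outside the window.
    {"input": "I bought this 9 months ago and want a refund.",
     "output": "That purchase is outside our 30-day return window, but I can offer store credit."},
    # Long-tail: off-topic request.
    {"input": "Can you write me a poem about volcanoes?",
     "output": "I can only help with questions about your orders and account."},
    # Uncertainty: teach the model it's allowed to say it doesn't know.
    {"input": "What's the warranty period for a product you don't sell?",
     "output": "I'm not sure about that product. Could you share the order number so I can check?"},
]
```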

Fixing Bugs

LLMs can have unexpected or undesired behavior — we'll call these "bugs." Fine-tuning with examples of bugs effectively eliminates them. Add examples of common failure modes to your training set (with the correct expected output), and the model will learn to avoid these pitfalls. Simple supervised fine-tuning is usually sufficient, though more advanced techniques like reinforcement learning can also help.

Alignment and Safety

Fine-tuning effectively aligns models with human values and safety requirements, including teaching models to refuse harmful requests, follow content policies, and behave according to specific ethical guidelines.

Distillation From Larger Models

"Distillation" is the process of getting a larger model to teach a smaller model how to perform a task. The process involves:

  • Using a large model to produce hundreds or thousands of task samples across a range of inputs/conditions
  • Fine-tuning a smaller model using the samples from the larger model
  • Evaluating the resulting fine-tune and optionally iterating with more samples if quality issues remain

See our guide on distilling models — it's easier than you think and only takes about 20 minutes.
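
As a rough sketch of that loop (the teacher model, prompts, and OpenAI-style client below are assumptions, not a prescribed setup):

```python
import json

from openai import OpenAI

client = OpenAI()  # any provider hosting a strong "teacher" model

TASK_PROMPT = "Classify the support ticket and draft a reply as JSON."
inputs = [
    "Ticket: my card was charged twice for the same purchase.",
    "Ticket: how do I reset my PIN?",
]  # hundreds or thousands of inputs in practice

# 1. Use the large model to generate training samples across many inputs.
with open("distilled.jsonl", "w") as f:
    for user_input in inputs:
        completion = client.chat.completions.create(
            model="gpt-4.1",  # placeholder teacher model
            messages=[
                {"role": "system", "content": TASK_PROMPT},
                {"role": "user", "content": user_input},
            ],
        )
        sample = {
            "messages": [
                {"role": "system", "content": TASK_PROMPT},
                {"role": "user", "content": user_input},
                {"role": "assistant", "content": completion.choices[0].message.content},
            ]
        }
        f.write(json.dumps(sample) + "\n")

# 2. Fine-tune a smaller model on distilled.jsonl (via Kiln or your training stack).
# 3. Run your evals on the fine-tune; add more samples and repeat if quality falls short.
```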

Better Thinking / Reasoning / Chain of Thought

Reasoning models and chain-of-thought prompts can improve response quality, but unoptimized models often spend their "thinking tokens" without actually improving the final answer.

Specialized reasoning models like R1 and o3 excel at this but tend to be large and expensive. They've learned to reason about numerous tasks from PhD-level science problems to creative writing to financial analysis.

If your model targets a specific task, you can easily teach it the necessary "thinking patterns" through fine-tuning examples:

  • Generate hundreds or thousands of samples that include "thinking" tokens. These can come from distilling a large, high-quality thinking model like R1 (see above), or from a custom chain-of-thought prompt with detailed reasoning instructions.
  • Fine-tune a model from these samples, including the thinking tokens
  • Run the model and verify the thinking tokens mirror the thinking patterns from your samples.
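
A hypothetical thinking-style training sample might look like this. The <think> tag convention follows models like R1; the exact delimiter depends on the model you fine-tune:

```python
# One sample demonstrating the reasoning pattern the model should learn.
thinking_example = {
    "messages": [
        {"role": "user", "content": "A customer was charged $42 twice. How much should we refund?"},
        {
            "role": "assistant",
            "content": (
                "<think>\n"
                "The customer was double-charged. One charge of $42 is legitimate; "
                "the duplicate charge of $42 should be refunded.\n"
                "</think>\n"
                "You should refund $42, the duplicate charge."
            ),
        },
    ]
}

# After training, spot-check outputs to confirm the thinking section mirrors
# the reasoning patterns from your samples (step 3 above).
```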

Knowledge: Not an Ideal Use Case for Fine-Tuning

Fine-tuning helps with many things, but we advise against using it to add knowledge to a model. If adding knowledge is your goal, consider other techniques:

  • RAG: let the model search for relevant information
  • Context loading: provide the model with a system prompt containing needed knowledge
  • Tool calls: allow the model to call tools to fetch knowledge

You can use all of these methods in conjunction with a fine-tuned model.

Choosing Models to Fine-Tune

Select your goals from the list above, as these will guide which models you try fine-tuning. There's no point training a 32B parameter model if you want it to run on a phone, or a 1B parameter model if you're maximizing quality.

Here's high-level guidance for selecting models based on your goal:

  • Run locally on mobile: select tiny models like Gemma 3n/1B or Qwen 3 1.7B
  • Run locally on desktops: select small models like Qwen 3 4B/8B or Gemma 3 4B
  • Cost reduction or speed: choose a range of model sizes from 1B-32B to compare quality/speed/cost tradeoffs
  • Maximal quality: larger models like Llama 70B, Gemma 3 27B, Qwen 3, GPT 4.1, or Gemini Flash/Pro (yes, you can fine-tune Gemini and OpenAI models using APIs from Google/OpenAI)

Iterate and Experiment When Fine-Tuning

Data science is a "science" — you need to hypothesize, test, and measure to get results. Try a few different base models. Try different training data. Try training fine-tunes with and without thinking data (reasoning mode). Try more/fewer training epochs.

If you've properly set up evals, it's easy to compare results and find the best model for your task.

Conclusion and How to Start Fine-Tuning

Fine-tuning solves real problems that prompting can't. If you're dealing with inconsistent outputs, bloated inference costs, or models that won't follow your rules, it's worth the investment.

With the right tooling, the process isn't complicated: pick a goal, generate training data (synthetic works well), train a few candidates, and measure what matters. Most teams see meaningful improvements within a few iterations. You'll be set up for future iterations, whether fixing bugs, changing product goals, or adopting new state-of-the-art models.

Use Kiln: The Easiest Fine-Tuning Tool

Kiln is a free app that makes it easy to fine-tune models. It guides you through every step of the process, including creating training data, fine-tuning models, and evaluating the resulting fine-tunes to find the best one for your project. It handles the boring parts so you can focus on what actually moves the needle.
