It has been a wild ride keeping up with the ever-growing zoo of LLM base models: Anthropic's new Claude Sonnet 4, and Google's new "small" Gemma 3 27B, which fits on a single A100. That gave us at Kiln an idea: can we teach Gemma 3 to do what Sonnet 4 does using synthetic data generation and distillation? This setup emulates a common archetype: a product company that wants the quality of a large proprietary model without paying its price in dollars or latency. Alright, let's start with some open questions:
- Is the relatively small Gemma 3 27B capable of solving multi-objective, real-world problems that involve instruction following, language understanding, and structure/style?
- To optimize Gemma 3 on a task, do we fine-tune it with Sonnet 4 synthetic data or can we get away with clever prompts and examples contained in-context (few-shot prompting)?
Setup
- Data source: Utilized Kiln's synthetic data generator with Sonnet 4 to create both inputs and outputs (synthetic data generation guide).
- Data problem type: Language understanding with instruction following (parameterized summarization).
- Data: The input (user prompt) is a news article plus a desired summary length in sentences. The output is the summary. The instruction-following canary injected into the task is that the second word of the output summary must start with the letter "P". Caveat: this is an OK test rather than a great one. Most modern models use sub-word tokenizers, where a word can span several tokens but usually not individual characters; Gemma uses a SentencePiece tokenizer. So the canary measures how well the model has memorized which words start with "P" more than any on-the-fly character reasoning. Even so, the model still needs to learn the JSON structure, juggle a constrained summarization task, and remember to start the second word with "P".
- Size: Approximately 250 training examples were generated from Claude Sonnet 4.
- Params: The parameters were kept straightforward, with a LoRA rank of 8, and default settings for learning rate (1e-4) and batch size.
- Evaluation: A combination of straightforward tests (canary tests) and more complex evaluations, such as summarization quality, were conducted using Kiln's evaluation stack with LLM-as-a-Judge GPT-4.1 models.
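Concretely, each synthetic training example pairs a parameterized request with a Sonnet 4 summary. A minimal sketch of what one record might look like (the field names are illustrative assumptions, not Kiln's actual schema):

```python
import json

# Illustrative record; field names are assumptions, not Kiln's actual schema.
record = {
    "input": {
        "article": "Paris officials unveiled a new riverfront park on Tuesday...",
        "summary_length_sentences": 2,
    },
    # Canary satisfied: the second word ("Paris") starts with "P".
    "output": {
        "summary": "A Paris riverfront park opened Tuesday. Officials expect heavy use."
    },
}

# The model is trained to emit the output as JSON, so structure matters too.
serialized = json.dumps(record["output"])
print(serialized)
```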
Results
Fine-tuning ablations:
The methodology was straightforward. We evaluated whether to use few-shot examples at inference time (even when they were not included in the training prompt) and tested the impact of training over the same dataset multiple times (i.e., epochs). The evaluation was conducted using 64 test samples with GPT-4.1 serving as an LLM-as-a-Judge to assess outputs across different metrics.
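Summarization quality needs an LLM judge, but the two instruction-following metrics can be scored deterministically. A rough sketch of such checks (whitespace tokenization and the naive sentence split are simplifying assumptions):

```python
import re

def canary_pass(summary: str) -> bool:
    """True if the second whitespace-delimited word starts with 'P'."""
    words = summary.split()
    return len(words) >= 2 and words[1].lstrip("\"'(").upper().startswith("P")

def length_pass(summary: str, target_sentences: int) -> bool:
    """Naive sentence count: split on ., !, or ? followed by whitespace."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    return len(sentences) == target_sentences

summary = "The Paris museum reopened after renovations. Crowds returned quickly."
print(canary_pass(summary), length_pass(summary, 2))
```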
| Metric (higher is better) | Gemma 3 27B + LoRA (R=8), 10 epochs, zero-shot eval | Gemma 3 27B + LoRA (R=8), 10 epochs, few-shot eval (no few-shot in training prompt) | Gemma 3 27B + LoRA (R=8), 1 epoch, few-shot train + eval | Gemma 3 27B + LoRA (R=8), 10 epochs, few-shot train + eval |
|---|---|---|---|---|
| Summarization Quality | 3.83 | 3.95 | 4.23 | 4.42 |
| Instruction Following: Summarization Length | 0.86 | 0.98 | 1.0 | 1.0 |
| Instruction Following: Canary | 0.23 | 0.38 | 0.38 | 0.38 |
Comparing columns 1 and 2 shows that adding few-shot examples at inference improves performance even when the model was not trained with them. Comparing columns 3 and 4 shows the impact of additional training epochs with prompts held constant: a modest improvement in summarization quality while the other metrics remained stable.
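One reason these ablations are cheap to run: a rank-8 LoRA adapter trains only a tiny fraction of the model's weights. A back-of-envelope sketch (the hidden size, layer count, and per-layer projection count are rough assumptions for illustration, not Gemma 3's exact architecture):

```python
# For a weight matrix W of shape (d_out, d_in), LoRA adds two low-rank
# factors A (r x d_in) and B (d_out x r), so the trainable parameter count
# per adapted matrix is r * (d_in + d_out).
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    return rank * (d_in + d_out)

# Assumed, illustrative numbers only.
hidden = 5376          # assumed hidden size
layers = 62            # assumed transformer layers
mats_per_layer = 4     # q, k, v, o projections (a common LoRA target set)

total = layers * mats_per_layer * lora_params(hidden, hidden, rank=8)
print(f"{total:,} trainable LoRA params")  # roughly 21M, vs ~27B base weights
```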
The table below compares these fine-tuned LoRA models to base models.
Final comparison to baselines:
| Metric (higher is better) | Gemma 3 27B Base, zero-shot | Gemma 3 27B Base, few-shot | Gemma 3 27B Best LoRA, few-shot | GPT-4o Baseline, few-shot |
|---|---|---|---|---|
| Summarization Quality | 3.78 | 4.14 | 4.42 | 4.06 |
| Instruction Following: Summarization Length | 0.73 | 0.98 | 1.0 | 1.0 |
| Instruction Following: Canary | 0.25 | 0.13 | 0.38 | 0.38 |
These results show notable improvements. The base Gemma 3 model improves significantly with few-shot Sonnet 4 examples but still struggles with instruction following, while GPT-4o handles instruction following better than the base Gemma 3 model, as expected. The fine-tuned Gemma 3 model achieved the best overall performance on this toy dataset, beating both GPT-4o and the base Gemma 3 model, which is expected given how narrow the dataset is.
Key takeaways:
- LoRA supervised fine-tuning demonstrates clear value: Consistent improvements across all metrics compared to the base Gemma 3 27B model
- Inference-time prompting provides measurable benefits: Adding few-shot examples at test time improved performance even when not used in training. It should be noted that this approach increases time-to-first-token (TTFT) and overall latency for prompt processing, though this can be addressed through prompt caching in future implementations.
- Increased epochs yield diminishing returns: Training from 1 to 10 epochs improved summarization performance (4.23 → 4.42) while other metrics plateaued. Generally, increasing epochs leads to greater memorization and potential overfitting, though this remains a viable approach when training data is limited.
- Performance exceeded GPT-4o: The best fine-tuned model outperformed GPT-4o on summarization tasks and matched its instruction-following capabilities
- Small datasets prove effective: Just 250 synthetic examples produced significant improvements
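At inference time, the few-shot setup simply prepends worked examples to every request, which is exactly why it raises TTFT: the model must process those extra prompt tokens on each call unless they are cached. A minimal sketch of assembling such a prompt (the OpenAI-style message format and the example content are illustrative assumptions):

```python
# Hypothetical few-shot example; in practice these would be drawn from the
# Sonnet 4 synthetic dataset.
FEW_SHOT = [
    {"article": "A Portland bakery announced a second location downtown...",
     "length": 1,
     "summary": "A Portland bakery is expanding downtown."},
]

SYSTEM = (
    "Summarize the article in exactly the requested number of sentences. "
    "The second word of the summary must start with the letter 'P'. "
    'Respond as JSON: {"summary": ...}'
)

def build_messages(article: str, length: int) -> list[dict]:
    messages = [{"role": "system", "content": SYSTEM}]
    for ex in FEW_SHOT:  # every example adds prompt tokens -> higher TTFT
        messages.append({"role": "user",
                         "content": f"Length: {ex['length']}\n\n{ex['article']}"})
        messages.append({"role": "assistant",
                         "content": f'{{"summary": "{ex["summary"]}"}}'})
    messages.append({"role": "user", "content": f"Length: {length}\n\n{article}"})
    return messages

msgs = build_messages("City council approved the new transit plan...", 2)
print(len(msgs))  # system + 2 per few-shot example + final user turn
```

With prompt caching, the system message and few-shot turns are identical across requests, so a serving stack can reuse their processed prefix and pay the TTFT cost only once.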
Kiln AI - How we built this (and how you can too)
We did all of this work with the local Kiln AI desktop app, a tool for fine-tuning models, evaluating results, and generating synthetic training data. No code required, just the UI!
- How to Fine Tune a LLM
- Synthetic Data Generation Guide
- Star Kiln on GitHub
- Download Kiln for Free
- Ask questions on Discord