It has been a wild ride keeping up with the ever-growing zoo of LLM base models: Anthropic's new Claude Sonnet 4, and Google's new "small" Gemma 3 27B, which fits on a single A100. That gave us at Kiln an idea: can we teach Gemma 3 to do what Sonnet 4 does using synthetic data generation and distillation? This setup emulates a common archetype: a product company that wants the quality of a large proprietary model without paying its price in dollars or latency. Alright, let's start with some open questions:
- Is the relatively small Gemma 3 27B capable of solving multi-objective, real-world problems that involve instruction following, language understanding, and structure/style?
- To optimize Gemma 3 on a task, do we fine-tune it with Sonnet 4 synthetic data or can we get away with clever prompts and examples contained in-context (few-shot prompting)?
Setup
- Data source: Utilized Kiln's synthetic data generator with Sonnet 4 to create both inputs and outputs (synthetic data generation guide).
- Data problem type: Language understanding with instruction following (parameterized summarization).
- Data: The input (user prompt) is a news article plus a desired summary length in sentences. The output is the summary. The instruction-following canary injected into the task is that the second word of the output summary must start with the letter "P". Caveat: this is an OK test rather than a great one. Most modern models use sub-word tokenizers, where a word can span several tokens but usually not individual characters; Gemma uses a SentencePiece tokenizer. So the canary measures how well the model has memorized which words start with "P" more than any on-the-fly character reasoning. Even so, the model still needs to learn the JSON structure, juggle a constrained summarization task, and remember to start the second word with "P".
- Size: Approximately 250 training examples were generated from Claude Sonnet 4.
- Params: The parameters were kept straightforward, with a LoRA rank of 8, and default settings for learning rate (1e-4) and batch size.
- Evaluation: A combination of straightforward tests (canary tests) and more complex evaluations, such as summarization quality, were conducted using Kiln's evaluation stack with LLM-as-a-Judge GPT-4.1 models.
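Concretely, each synthetic training example pairs a parameterized request with a Sonnet 4 summary. A minimal sketch of what one record might look like (the field names are illustrative assumptions, not Kiln's actual schema):

```python
import json

# Illustrative record; field names are assumptions, not Kiln's actual schema.
record = {
    "input": {
        "article": "Paris officials unveiled a new riverfront park on Tuesday...",
        "summary_length_sentences": 2,
    },
    # Canary satisfied: the second word ("Paris") starts with "P".
    "output": {
        "summary": "A Paris riverfront park opened Tuesday. Officials expect heavy use."
    },
}

# The model is trained to emit the output as JSON, so structure matters too.
serialized = json.dumps(record["output"])
print(serialized)
```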
Results
Fine-tuning ablations:
The methodology was straightforward. We evaluated whether to use few-shot examples at inference time (even when they were not included in the training prompt) and tested the impact of training over the same dataset multiple times (i.e., epochs). The evaluation was conducted using 64 test samples with GPT-4.1 serving as an LLM-as-a-Judge to assess outputs across different metrics.
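Summarization quality needs an LLM judge, but the two instruction-following metrics can be scored deterministically. A rough sketch of such checks (whitespace tokenization and the naive sentence split are simplifying assumptions):

```python
import re

def canary_pass(summary: str) -> bool:
    """True if the second whitespace-delimited word starts with 'P'."""
    words = summary.split()
    return len(words) >= 2 and words[1].lstrip("\"'(").upper().startswith("P")

def length_pass(summary: str, target_sentences: int) -> bool:
    """Naive sentence count: split on ., !, or ? followed by whitespace."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    return len(sentences) == target_sentences

summary = "The Paris museum reopened after renovations. Crowds returned quickly."
print(canary_pass(summary), length_pass(summary, 2))
```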
| Metric (higher is better) | Gemma 3 27B + LoRA (R=8), 10 epochs, zero-shot eval | Gemma 3 27B + LoRA (R=8), 10 epochs, few-shot eval (no few-shot in training prompt) | Gemma 3 27B + LoRA (R=8), 1 epoch, few-shot train + eval | Gemma 3 27B + LoRA (R=8), 10 epochs, few-shot train + eval |
|---|---|---|---|---|
| Summarization Quality | 3.83 | 3.95 | 4.23 | 4.42 |
| Instruction Following: Summarization Length | 0.86 | 0.98 | 1.0 | 1.0 |
| Instruction Following: Canary | 0.23 | 0.38 | 0.38 | 0.38 |
Comparing columns 1 and 2 shows that adding few-shot examples at inference improves performance even when the model was not trained with them. Comparing columns 3 and 4 shows the impact of additional training epochs with prompts held constant: a modest improvement in summarization quality while the other metrics remained stable.
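One reason these ablations are cheap to run: a rank-8 LoRA adapter trains only a tiny fraction of the model's weights. A back-of-envelope sketch (the hidden size, layer count, and per-layer projection count are rough assumptions for illustration, not Gemma 3's exact architecture):

```python
# For a weight matrix W of shape (d_out, d_in), LoRA adds two low-rank
# factors A (r x d_in) and B (d_out x r), so the trainable parameter count
# per adapted matrix is r * (d_in + d_out).
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    return rank * (d_in + d_out)

# Assumed, illustrative numbers only.
hidden = 5376          # assumed hidden size
layers = 62            # assumed transformer layers
mats_per_layer = 4     # q, k, v, o projections (a common LoRA target set)

total = layers * mats_per_layer * lora_params(hidden, hidden, rank=8)
print(f"{total:,} trainable LoRA params")  # roughly 21M, vs ~27B base weights
```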
The table below compares these fine-tuned LoRA models to base models.
Final comparison to baselines:
| Metric (higher is better) | Gemma 3 27B Base, zero-shot | Gemma 3 27B Base, few-shot | Gemma 3 27B Best LoRA, few-shot | GPT-4o Baseline, few-shot |
|---|---|---|---|---|
| Summarization Quality | 3.78 | 4.14 | 4.42 | 4.06 |
| Instruction Following: Summarization Length | 0.73 | 0.98 | 1.0 | 1.0 |
| Instruction Following: Canary | 0.25 | 0.13 | 0.38 | 0.38 |
These results show notable improvements. The base Gemma 3 model improves significantly with few-shot Sonnet 4 examples but still struggles with instruction following, while GPT-4o handles instruction following better than the base Gemma 3 model, as expected. The fine-tuned Gemma 3 model achieved the best overall performance on this toy dataset, beating both GPT-4o and the base Gemma 3 model, which is expected given how narrow the dataset is.
Key takeaways:
- LoRA supervised fine-tuning demonstrates clear value: Consistent improvements across all metrics compared to the base Gemma 3 27B model
- Inference-time prompting provides measurable benefits: Adding few-shot examples at test time improved performance even when not used in training. It should be noted that this approach increases time-to-first-token (TTFT) and overall latency for prompt processing, though this can be addressed through prompt caching in future implementations.
- Increased epochs yield diminishing returns: Training from 1 to 10 epochs improved summarization performance (4.23 → 4.42) while other metrics plateaued. Generally, increasing epochs leads to greater memorization and potential overfitting, though this remains a viable approach when training data is limited.
- Performance exceeded GPT-4o: The best fine-tuned model outperformed GPT-4o on summarization tasks and matched its instruction-following capabilities
- Small datasets prove effective: Just 250 synthetic examples produced significant improvements
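At inference time, the few-shot setup simply prepends worked examples to every request, which is exactly why it raises TTFT: the model must process those extra prompt tokens on each call unless they are cached. A minimal sketch of assembling such a prompt (the OpenAI-style message format and the example content are illustrative assumptions):

```python
# Hypothetical few-shot example; in practice these would be drawn from the
# Sonnet 4 synthetic dataset.
FEW_SHOT = [
    {"article": "A Portland bakery announced a second location downtown...",
     "length": 1,
     "summary": "A Portland bakery is expanding downtown."},
]

SYSTEM = (
    "Summarize the article in exactly the requested number of sentences. "
    "The second word of the summary must start with the letter 'P'. "
    'Respond as JSON: {"summary": ...}'
)

def build_messages(article: str, length: int) -> list[dict]:
    messages = [{"role": "system", "content": SYSTEM}]
    for ex in FEW_SHOT:  # every example adds prompt tokens -> higher TTFT
        messages.append({"role": "user",
                         "content": f"Length: {ex['length']}\n\n{ex['article']}"})
        messages.append({"role": "assistant",
                         "content": f'{{"summary": "{ex["summary"]}"}}'})
    messages.append({"role": "user", "content": f"Length: {length}\n\n{article}"})
    return messages

msgs = build_messages("City council approved the new transit plan...", 2)
print(len(msgs))  # system + 2 per few-shot example + final user turn
```

With prompt caching, the system message and few-shot turns are identical across requests, so a serving stack can reuse their processed prefix and pay the TTFT cost only once.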
Kiln AI - How we built this (and how you can too)
We did all of this work with the local Kiln AI desktop app, a tool for fine-tuning models, evaluating results, and generating synthetic training data. No code required, just the UI!
- How to Fine Tune a LLM
- Synthetic Data Generation Guide
- Star Kiln on GitHub
- Download Kiln for Free
- Ask questions on Discord