Fine-tuning Open Source LLMs
A practical guide to fine-tuning open source LLMs on your own data.
Fine-tuning open source LLMs has gone from a research exercise to a practical engineering tool. With LoRA (Low-Rank Adaptation), and its quantized variant QLoRA for the largest models, you can adapt even a 70B parameter model on a single GPU with 80GB of VRAM, and get results that rival GPT-4 on domain-specific tasks.
We fine-tuned Llama 3 8B and Mistral 7B on our internal documentation, support tickets, and codebase. Here’s the complete process, from data preparation to deployment.
The quality of your fine-tuning data matters more than the model choice. We collected 15,000 instruction-response pairs from three sources (internal documentation, support tickets, and code reviews):
```python
import json

from datasets import Dataset, DatasetDict

def prepare_training_data(sources: list[dict]) -> DatasetDict:
    """Convert raw data into instruction-tuning format."""
    records = []

    for source in sources:
        if source['type'] == 'documentation':
            records.extend(doc_to_instructions(source))
        elif source['type'] == 'support_ticket':
            records.extend(ticket_to_instructions(source))
        elif source['type'] == 'code_review':
            records.extend(review_to_instructions(source))

    # Deduplicate near-identical examples
    records = deduplicate(records, threshold=0.92)

    # Split: 90% train, 5% validation, 5% test
    dataset = Dataset.from_list(records)
    train_rest = dataset.train_test_split(test_size=0.1, seed=42)
    val_test = train_rest['test'].train_test_split(test_size=0.5, seed=42)
    return DatasetDict({
        'train': train_rest['train'],
        'validation': val_test['train'],
        'test': val_test['test'],
    })

def doc_to_instructions(doc: dict) -> list[dict]:
    """Convert documentation sections into Q&A pairs."""
    return [
        {
            "instruction": f"How do I {section['topic']}?",
            "input": "",
            "output": section['content'],
            "source": doc['id'],
        }
        for section in doc['sections']
    ]
```

Our final dataset had 12,400 training examples, 680 validation examples, and 680 test examples.
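The `deduplicate` helper isn't shown above. As a rough sketch of what near-duplicate filtering at a 0.92 threshold can look like, assuming a simple character-level similarity check with `difflib` (our production version is more involved):

```python
from difflib import SequenceMatcher

def deduplicate(records: list[dict], threshold: float = 0.92) -> list[dict]:
    """Drop records whose output is nearly identical to an earlier record.

    Quadratic, character-level sketch; MinHash or embedding-based dedup
    scales better on large datasets.
    """
    kept: list[dict] = []
    for record in records:
        is_duplicate = any(
            SequenceMatcher(None, record['output'], existing['output']).ratio() >= threshold
            for existing in kept
        )
        if not is_duplicate:
            kept.append(record)
    return kept
```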
LoRA freezes the base model weights and trains small adapter matrices. This reduces trainable parameters by 99% while preserving quality.
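The savings are easy to sanity-check by hand. For one of Llama 3 8B's 4096x4096 attention projections, rank-16 LoRA trains two small factor matrices instead of the full weight (illustrative arithmetic only):

```python
d_in, d_out, r = 4096, 4096, 16

full_params = d_in * d_out        # 16,777,216 weights in the frozen projection
lora_params = r * (d_in + d_out)  # 131,072 weights across the two low-rank factors

print(f"trainable fraction: {lora_params / full_params:.2%}")  # ~0.78% of this one matrix
```

Summed over all seven target projections and 32 layers, this works out to the 41.9M trainable parameters (0.52% of the model) reported by `print_trainable_parameters()` below.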
```python
import torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"

lora_config = LoraConfig(
    r=16,              # Rank of adapter matrices
    lora_alpha=32,     # Scaling factor
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,343,552 || trainable%: 0.52%
```

Key configuration decisions:
- Rank 16 — higher ranks (32, 64) gave diminishing returns for our domain
- All linear layers — targeting attention AND MLP layers improved quality by 8% over attention-only
- 5% dropout — prevented overfitting on our relatively small dataset
We used Unsloth for 2x faster training with 60% less VRAM:
```python
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./llama3-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,  # match the bfloat16 base weights
    logging_steps=10,
    save_strategy="epoch",
    evaluation_strategy="epoch",
    report_to="wandb",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    dataset_text_field="output",
    max_seq_length=2048,
    packing=False,
)

trainer.train()
```

Training took 4.5 hours on a single A100 80GB. The loss curve converged cleanly with no signs of overfitting.
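At inference and evaluation time, each example is rendered into a prompt by a `format_prompt` helper (used in the evaluation code below) that we don't show in full; a minimal sketch, assuming an Alpaca-style instruction template:

```python
def format_prompt(instruction: str, input_text: str = "") -> str:
    """Render an instruction (and optional input) into a generation prompt."""
    if input_text:
        return (
            "### Instruction:\n" + instruction + "\n\n"
            "### Input:\n" + input_text + "\n\n"
            "### Response:\n"
        )
    return "### Instruction:\n" + instruction + "\n\n### Response:\n"
```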
We evaluated on three dimensions:
```python
def evaluate_model(model, tokenizer, test_data: list[dict]) -> dict:
    """Evaluate fine-tuned model on held-out test set."""
    results = {"exact_match": 0, "rouge_l": 0, "hallucination_rate": 0}

    for example in test_data:
        prompt = format_prompt(example['instruction'], example['input'])
        generated = generate(model, tokenizer, prompt, max_new_tokens=256)

        results['exact_match'] += compute_exact_match(generated, example['output'])
        results['rouge_l'] += compute_rouge_l(generated, example['output'])
        results['hallucination_rate'] += check_hallucination(generated, example['output'])

    return {k: v / len(test_data) for k, v in results.items()}
```

| Model | Exact Match | ROUGE-L | Hallucination Rate |
|---|---|---|---|
| Llama 3 8B (base) | 12% | 34% | 28% |
| Llama 3 8B (LoRA) | 41% | 62% | 8% |
| Mistral 7B (LoRA) | 38% | 59% | 10% |
| GPT-4 (few-shot) | 45% | 65% | 5% |
Our fine-tuned Llama 3 reached 91% of GPT-4’s performance on our domain tasks, at a fraction of the inference cost.
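The metric helpers (`generate`, `compute_exact_match`, `compute_rouge_l`, `check_hallucination`) are project-specific and omitted above. As an illustration only, here is one way a hallucination check could be approximated, flagging responses that assert numbers or proper nouns absent from the reference answer; this is an assumed sketch, not our exact implementation:

```python
import re

def check_hallucination(generated: str, reference: str) -> int:
    """Return 1 if the generation asserts specifics absent from the reference, else 0.

    Heuristic sketch comparing numbers and capitalized tokens; a production
    check would use an NLI model or an LLM judge instead.
    """
    def specifics(text: str) -> set[str]:
        numbers = set(re.findall(r"\b\d[\d.,]*\b", text))
        proper_nouns = set(re.findall(r"\b[A-Z][a-zA-Z0-9_]+\b", text))
        return numbers | proper_nouns

    unsupported = specifics(generated) - specifics(reference)
    return 1 if unsupported else 0
```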
We deployed the LoRA adapter alongside the base model using vLLM for high-throughput serving:
```bash
# Serve with vLLM
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B \
  --enable-lora \
  --lora-modules my-adapter=/path/to/adapter \
  --max-lora-rank=16 \
  --tensor-parallel-size=1
```

The adapter is only 160MB on disk, making it trivial to version, roll back, and A/B test.
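Requests select the adapter by the name registered via `--lora-modules`. A minimal client sketch against the OpenAI-compatible endpoint (the host, port, and prompt below are placeholders):

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",  # default vLLM server address (assumed)
    json={
        "model": "my-adapter",               # LoRA name registered via --lora-modules
        "prompt": format_prompt("How do I rotate the API keys?"),
        "max_tokens": 256,
        "temperature": 0.0,
    },
)
print(response.json()["choices"][0]["text"])
```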
- Data quality beats model size — 10K clean examples outperform 100K noisy ones
- LoRA rank 16 is the sweet spot for most domain adaptation tasks
- Evaluate on your actual use case — benchmark scores don’t predict domain performance
- Keep the base model frozen — full fine-tuning is rarely worth the cost and risk of catastrophic forgetting
- Version your adapters — they’re small enough to treat as artifacts, not infrastructure
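With PEFT, versioning is just `save_pretrained` on the adapter and `PeftModel.from_pretrained` to re-attach it to a freshly loaded base; a sketch with placeholder version paths:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# After training: write only the adapter weights and config (~160MB), not the ~16GB base model
model.save_pretrained("adapters/support-bot-v3")

# Later, or on another machine: reload the frozen base and attach a specific adapter version
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model_v3 = PeftModel.from_pretrained(base, "adapters/support-bot-v3")
```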