Fine-tuning for the wrong reason
Teams treat fine-tuning as a way to inject knowledge when the symptom is actually a prompt or retrieval problem. RAG would do the job in a fraction of the time, at none of the maintenance cost.
Most enterprises don’t need a fine-tuned model, and many of the ones that do reach for the wrong tool, the wrong technique, and a lifecycle they can’t sustain. We test the escalation ladder before climbing it, build small sovereign models when the math actually works, and engineer the revalidation discipline that keeps them relevant as base models evolve.
Fine-tuning is for form, not facts. Yet most teams reach for it as a knowledge-injection mechanism, build the wrong thing, and watch the fine-tune silently degrade the next time the base model updates. Gartner projects that by 2027 enterprises will use small task-specific models three times as much as general-purpose LLMs. The hard question isn’t how to fine-tune. It’s when, and most teams answer it too fast.
The base model gets an update; the adapter degrades silently. Without quarterly revalidation tied to a structured eval set, the model rots in production and nobody sees it until users do.
A private endpoint is not a sovereign deployment. The full lifecycle — training data, eval set, weights, inference, monitoring — has to land on infrastructure you control, with an audit trail a regulator will accept.
A 200-example eval set looks great in the report. Production data exposes the brittleness within weeks. Without domain-specific evals and regression tests sized to the real distribution, you ship a model that benchmarks well and fails quietly.
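To make the eval-size point concrete, here is a minimal sketch of the sampling uncertainty around a single accuracy number. It uses the standard Wilson score interval. The 94% figure and the 200-example size are borrowed from this page purely for illustration, the 2,000-example comparison is an assumption, and `wilson_interval` is a hypothetical helper name.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an observed accuracy (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Same measured accuracy, two eval-set sizes (illustrative sizes only).
for n in (200, 2000):
    low, high = wilson_interval(round(0.94 * n), n)
    print(f"n={n}: measured 94%, interval width {high - low:.1%}")
```

At 200 examples the interval is nearly seven points wide, so a model that reports 94% could plausibly sit near 90% on the real distribution. At 2,000 examples the width shrinks to about two points. This only covers sampling noise; it says nothing about whether the eval set resembles production traffic.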
Six steps for deciding when to customize and how to do it without creating maintenance debt your team can’t carry.
Three production patterns — each engineered for a different shape of customization work.
Fine-tuned 7B-class models on your task — cheaper, faster, sovereign. A fine-tuned 7B legal SLM hits 94% on contract tasks against 87% for the frontier alternative, at 10–30× lower cost. We pick the base, build the data pipeline, run the alignment, and engineer the eval set.
Teacher–student distillation from a frontier model into a small, deployable model that holds the same accuracy on your task. The right pattern when you need frontier behavior at edge-deployable cost and latency.
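As a rough illustration of the teacher–student idea, the sketch below computes the classic soft-label distillation loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the square of the temperature. This is one textbook formulation, not necessarily the one used in any given engagement; LLM distillation is often done at the sequence level by training on teacher-generated data instead. The logits, the temperature, and the function names are all assumptions.

```python
import math

def softmax(logits: list[float], temperature: float) -> list[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits: list[float],
                      student_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over three classes.
teacher = [2.0, 1.0, 0.1]
print(distillation_loss(teacher, [2.0, 1.0, 0.1]))  # student matches teacher
print(distillation_loss(teacher, [0.1, 1.0, 2.0]))  # student disagrees
```

The loss is zero when the student reproduces the teacher's distribution and grows as they diverge, which is the signal the student is trained to minimize.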
Full-stack custom-model lifecycle on infrastructure you control — for EU AI Act obligations, sector regulators, and sovereignty mandates. Training, weights, evals, inference, monitoring — all inside your perimeter, with the audit trail to prove it.
The model is one piece. These are the layers that make it trustworthy after the launch deck closes.
Synthetic generation from teacher models, expert curation, augmentation, deduplication, and contamination checks. The eval set and the training set are designed together — not as an afterthought.
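A contamination check can be sketched as a simple n-gram overlap test between training and eval text. This is a minimal version under stated assumptions: whitespace tokenization, lowercase matching, and a hypothetical 8-token window. Production checks typically add normalization and fuzzy matching, and the function names here are illustrative.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All n-token windows in the text (empty if the text is shorter than n tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated_eval_items(train_texts: list[str],
                            eval_texts: list[str],
                            n: int = 8) -> list[int]:
    """Indices of eval items that share at least one n-gram with the training data."""
    train_grams: set[tuple[str, ...]] = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    return [i for i, text in enumerate(eval_texts) if ngrams(text, n) & train_grams]

# Hypothetical examples, using a short window so the overlap is visible.
train = ["the supplier shall indemnify the buyer against all losses"]
evals = ["Clause 4: the supplier shall indemnify the customer",
         "payment is due within thirty days"]
print(contaminated_eval_items(train, evals, n=3))
```

Any eval item flagged this way should be dropped or rewritten before the eval set is used to gate a release, since a model can score well on it by memorization alone.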
Domain-specific evals sized to the real distribution. Regression tests on every release. Drift monitoring against the base model and against your production data. Without them, you won’t know the model is wrong until users tell you.
Quarterly revalidation cycles, base-model drift detection, structured release gates, and a deprecation playbook. A fine-tune is a living artifact, and the lifecycle is what keeps it earning its cost.
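A structured release gate can be as small as a comparison between the last accepted eval run and the candidate's. The sketch below is one possible shape, not a prescribed implementation: the suite names, the scores, and the one-point regression tolerance are all hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    suite: str
    accuracy: float

def release_gate(baseline: list[EvalResult],
                 candidate: list[EvalResult],
                 max_regression: float = 0.01) -> tuple[bool, list[str]]:
    """Pass only if every baseline suite is present and has not dropped by more than max_regression."""
    current = {r.suite: r.accuracy for r in candidate}
    failures = []
    for ref in baseline:
        score = current.get(ref.suite)
        if score is None:
            failures.append(f"{ref.suite}: missing from candidate run")
        elif ref.accuracy - score > max_regression:
            failures.append(f"{ref.suite}: {ref.accuracy:.3f} -> {score:.3f}")
    return (not failures, failures)

# Placeholder scores: the candidate is an adapter re-run after a base-model update.
baseline = [EvalResult("contracts", 0.94), EvalResult("clauses", 0.90)]
candidate = [EvalResult("contracts", 0.945), EvalResult("clauses", 0.87)]
passed, failures = release_gate(baseline, candidate)
print(passed, failures)
```

Running the same gate on a schedule, and again whenever the base model changes, turns silent degradation into a failed check with a named suite attached.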
The strongest cases share a shape: narrow, repeatable, high-stakes work where a small model on infrastructure you control beats a frontier API on cost, latency, or sovereignty — and earns its keep.
Domain SLMs outperform frontier models on narrow legal tasks — and run on infrastructure firms can defend in a regulator meeting.
Privacy and sovereignty drive on-prem deployment. The model never leaves the hospital network, and the evidence trail is built in.
Custom models trained on your internal code outperform generic copilots on the languages, frameworks, and patterns that matter inside your codebase.
Narrow, repeatable tasks where small task-specific models ship cheaper, run faster, and degrade more predictably than a frontier API.
Regulated sectors needing EU-only or jurisdiction-specific deployment — finance, defense, public sector — where the residency of the weights matters as much as the residency of the data.
Real-time CX, trading, and operations workloads where frontier-API latency is unacceptable and a small model on dedicated hardware is the only path that works.
Usually a prompt or RAG problem. Roughly four out of five “we need fine-tuning” conversations resolve on the lower rungs of the ladder — better prompts, better retrieval, better grounding. We test the ladder against your data before recommending you climb it, because customization is a commitment with a lifecycle attached.
When the task is narrow and you have examples. A fine-tuned 7B legal SLM hits 94% on contract tasks against 87% for the frontier alternative, at 10–30× lower cost. Narrow tasks, latency budgets, and sovereignty requirements are the three signals that an SLM is the right pattern.
Almost always LoRA, QLoRA, or DoRA — full fine-tunes are rare in 2026 production. Distillation is the right answer when you need frontier behavior at edge-deployable cost. DPO, ORPO, or KTO for alignment work — PPO-RLHF has been displaced in production. The technique falls out of the use case, not the other way around.
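Part of the reason adapter methods dominate is simple arithmetic. LoRA freezes a weight matrix and trains two low-rank factors in its place, so the trainable parameter count scales with the rank instead of the full matrix size. The sketch below works that out for one projection matrix; the 4096×4096 shape and rank 16 are hypothetical values chosen for illustration, and QLoRA and DoRA variations are not modeled.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA trains two factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

d_out = d_in = 4096   # hypothetical projection size
rank = 16             # hypothetical adapter rank

full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, rank)
print(f"full fine-tune: {full:,} params")
print(f"LoRA adapter:   {lora:,} params ({lora / full:.2%} of full)")
```

Under these assumptions the adapter trains 131,072 parameters against 16,777,216 for the full matrix, which is under 1%. The same ratio applies to each adapted matrix, so the saving carries through the whole model.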
The whole lifecycle moves, not just the inference endpoint. Training data, evals, weights, inference, and monitoring all sit on infrastructure you control — typically VPC, dedicated hardware, or sovereign cloud (Together AI, Fireworks, Modal, Databricks Mosaic, or your own GPUs). The audit trail is engineered alongside it, because a regulator will ask.
Quarterly, at minimum. Base models update, your data drifts, and the eval set you launched with stops representing the real distribution. We build the revalidation cadence into the project so it’s a scheduled engineering activity, not a fire drill triggered by a user complaint.
Yes. The weights, the training data pipeline, the eval harness, the lifecycle runbook — all yours. We design every engagement so your team can revalidate, retrain, and redeploy independently. Enablement is built into the project, not sold as an upsell.
We test the escalation ladder against your data — prompt, RAG, PEFT, distill — benchmark each rung for accuracy, cost, and latency, and deliver a recommendation with the trade-offs spelled out. If a custom model is the right call, you leave with the deployment architecture, lifecycle plan, and technique choice. If it isn’t, you save the quarter.
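The ladder logic itself is easy to sketch: benchmark each rung, then take the lowest one that clears every target. The version below is a minimal illustration under assumptions. The rung names follow the ladder above, but every number is a made-up placeholder and not a benchmark result, and the thresholds and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RungResult:
    name: str
    accuracy: float
    cost_per_1k: float      # cost per 1,000 requests, arbitrary currency units
    p95_latency_ms: float

def lowest_sufficient_rung(results: list[RungResult],
                           min_accuracy: float,
                           max_cost_per_1k: float,
                           max_latency_ms: float) -> Optional[RungResult]:
    """Results must be ordered from least to most customization; return the first that meets every target."""
    for r in results:
        if (r.accuracy >= min_accuracy
                and r.cost_per_1k <= max_cost_per_1k
                and r.p95_latency_ms <= max_latency_ms):
            return r
    return None

# Placeholder numbers for illustration only.
ladder = [
    RungResult("prompt", 0.81, 4.0, 900),
    RungResult("rag", 0.92, 5.0, 1200),
    RungResult("peft", 0.94, 0.6, 300),
    RungResult("distill", 0.93, 0.3, 120),
]
choice = lowest_sufficient_rung(ladder, min_accuracy=0.90, max_cost_per_1k=6.0, max_latency_ms=1500)
print(choice.name if choice else "no rung meets the targets")
```

With loose latency targets the retrieval rung wins and no custom model is needed. Tighten the latency budget and the same function points further up the ladder, which is why the targets have to be agreed before the benchmark is run.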
What you get: a rung-by-rung benchmark on your data with accuracy and cost numbers; a recommendation on technique (LoRA/QLoRA/DoRA, distillation, full fine-tune, or none); a target deployment architecture matched to your sovereignty and latency constraints; a lifecycle plan with revalidation cadence and base-model drift monitoring; and one workshop with your engineering and data leads. Led by a senior consultant — fixed scope, fixed fee.
Book a Custom Model Review →
A 30-minute conversation with a senior consultant. Bring the use case driving the question: a workflow where frontier API cost is unsustainable, a domain where a generic model doesn’t speak the language, or a regulator asking where the weights live. We’ll tell you whether customization is the right rung, what the trade-offs look like, and what a Custom Model Review would surface.
Book a Custom Model Review →