Data preparation is where most fine-tuning projects fail in practice. The training run completes. The fine-tuned model looks better on the examples used to validate it. It is deployed. In production, it fails in ways that are hard to attribute to the training data, because no one reviewed that data carefully enough before training to know what was in it.
01
Inconsistent quality standards across the training dataset
Training data collected from multiple sources — different authors, time periods, quality review processes, annotation guidelines — contains examples of varying quality treated as equivalent. A high-quality example and a mediocre example formatted identically are given equal weight during training. The model learns from both. The outputs it produces are a mixture of the quality distribution in the training data — not the quality the organisation intended to teach.
What this produces in the fine-tuned model
An organisation fine-tunes a customer communication model on 5 years of historical emails. Quality varied across the collection period: early emails were verbose and inconsistently formatted; later emails were tighter. The fine-tuned model blends old and new styles inconsistently — worse than the best recent examples, better than the worst historical ones. The team expected the model to learn the current standard. It learned the average of the full collection period.
Data preparation approach that prevents this
Quality stratification before dataset assembly: every candidate training example assessed against a defined quality rubric with a minimum threshold for inclusion. The rubric must be specific enough for consistent application — not “good quality” but specific criteria a reviewer can apply to any example and reach the same conclusion. Quality-filtered datasets are smaller but produce significantly better results than large, heterogeneous datasets with broad quality variation.
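The stratification step can be sketched as a filter over a rubric of boolean criteria. The criteria, field names, and 0.75 threshold below are illustrative assumptions, not a recommended rubric; a real rubric would encode the organisation's own quality standard.

```python
# Quality stratification sketch: each rubric criterion is a boolean check
# applied to a candidate example; examples below the threshold are excluded.
# Criteria, field names, and the threshold are illustrative assumptions.

def rubric_score(example: dict) -> float:
    """Fraction of rubric criteria the example passes."""
    checks = [
        len(example["output"]) <= 600,              # concise, current style
        "TODO" not in example["output"],            # no placeholder text
        example["output"].strip().endswith("."),    # complete sentences
        example.get("reviewed", False),             # passed human review
    ]
    return sum(checks) / len(checks)

def stratify(candidates: list[dict], threshold: float = 0.75) -> list[dict]:
    """Keep only examples meeting the minimum rubric threshold."""
    return [ex for ex in candidates if rubric_score(ex) >= threshold]

candidates = [
    {"output": "Short, clear answer.", "reviewed": True},
    {"output": "TODO fill this in later", "reviewed": False},
]
kept = stratify(candidates)
print(len(kept))  # only the reviewed, complete example survives -> 1
```

The point of expressing the rubric as individual checks is that every exclusion is explainable: a rejected example can be traced to the specific criteria it failed, which is what makes reviewer application consistent.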
02
Training data distribution does not match production input distribution
Fine-tuning teaches the model the patterns in the training data. If the training distribution differs systematically from the production distribution, the model’s improved performance is in the training distribution, not in production. Validation results look good because validation is conducted on held-out training data — which has the same distribution as the training data, not the same distribution as production. The degradation is only visible when the model is deployed to production inputs.
What this produces in the fine-tuned model
A legal document assistant is fine-tuned on executed contracts. The production use case includes draft contracts under negotiation. Draft contracts use different language — placeholder text, conditional phrasing, bracketed alternatives — rare in executed contracts. The fine-tuned model handles executed contract language better than the base model and handles draft contract language worse. The application required both.
Data preparation approach that prevents this
Production input distribution analysis before dataset assembly: characterise the full range of production inputs including the tail, and verify the training dataset covers the production distribution proportionally. Validation must be conducted on a held-out set drawn from production inputs, not from the same source as the training data.
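One way to operationalise the coverage check is to compare category proportions between the training set and a sample of production inputs, flagging under-represented categories. The category labels and tolerance below are illustrative assumptions.

```python
# Distribution coverage check sketch: compare the category mix of the
# training set against a sample of production inputs and flag categories
# that are under-represented. Labels and tolerance are assumptions.

from collections import Counter

def coverage_gaps(train_labels, prod_labels, tolerance=0.10):
    """Return categories whose training share trails their production
    share by more than `tolerance` (as a proportion)."""
    train, prod = Counter(train_labels), Counter(prod_labels)
    n_train, n_prod = len(train_labels), len(prod_labels)
    gaps = {}
    for cat, count in prod.items():
        prod_share = count / n_prod
        train_share = train.get(cat, 0) / n_train
        if prod_share - train_share > tolerance:
            gaps[cat] = round(prod_share - train_share, 2)
    return gaps

# Executed contracts dominate training; drafts are common in production.
train = ["executed"] * 95 + ["draft"] * 5
prod = ["executed"] * 60 + ["draft"] * 40
print(coverage_gaps(train, prod))  # {'draft': 0.35}
```

The check requires a labelled sample of real production inputs, which is itself the analysis the section calls for: categories that appear in production but never in training are the failures waiting to happen.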
03
Insufficient training volume for the fine-tuning objective
Different fine-tuning objectives require different data volumes. Teaching a model a new output format may require a few hundred examples. Teaching a complex domain reasoning pattern may require tens of thousands. Organisations that collect whatever is available and train on it do not know how their data volume relates to their objective. The result may be a model that has learned the surface pattern without the underlying structure — high validation accuracy, poor generalisation.
What this produces in the fine-tuned model
The fine-tuned model performs well on training-distribution examples in validation but fails to generalise to production inputs that differ slightly — the overfit signature of insufficient data volume. The failure typically becomes visible only once enough production examples are available to assess generalisation beyond the training distribution.
Data preparation approach that prevents this
Data volume requirement estimation before collection: based on fine-tuning objective type, base model’s existing coverage of the target domain, and complexity of patterns to be learned. Staged training with incremental data: train on a subset, measure generalisation, add more data and retrain until the generalisation metric plateaus. Do not commit to a fine-tuning programme before verifying sufficient collectable data exists or can be generated.
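The staged-training loop can be sketched as follows. The `train_and_eval` stub below is a stand-in for a real training run returning a held-out generalisation metric; its diminishing-returns curve, the step size, and the plateau threshold are all illustrative assumptions.

```python
# Staged-training sketch: grow the training subset until the generalisation
# metric plateaus. `train_and_eval` is a stub standing in for a real run;
# its score curve, the step size, and min_gain are illustrative assumptions.

def train_and_eval(n_examples: int) -> float:
    # Stub with diminishing returns; replace with an actual training run
    # evaluated on a production-distribution held-out set.
    return round(0.9 - 0.4 / (1 + n_examples / 500), 3)

def staged_training(total: int, step: int = 500, min_gain: float = 0.005):
    """Add `step` examples per stage; stop when the metric gain plateaus."""
    history = []
    n, prev = step, None
    while n <= total:
        score = train_and_eval(n)
        history.append((n, score))
        if prev is not None and score - prev < min_gain:
            break  # generalisation has plateaued
        prev, n = score, n + step
    return history

history = staged_training(total=5000)
print(history[-1])  # the stage at which gains fell below min_gain
```

The plateau point is the empirical answer to "how much data does this objective need" — measured rather than guessed, which is the discipline the section argues for.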
04
Label noise corrupts the training signal
In supervised fine-tuning, training examples have labels or preferred outputs that teach the model what correct looks like. If labels are incorrect — annotators who disagreed and the disagreement was averaged rather than arbitrated by an expert, domain experts who applied different standards at different times, automatically generated labels from a weaker model that was itself imperfect — the model is trained towards incorrect targets. The resulting model has learned to produce outputs that look like the noisy labels.
What this produces in the fine-tuned model
A clinical coding model fine-tuned on historical ICD codes learns the coding biases and errors present in the historical dataset — including systematic undercoding of specific conditions that hospital coders historically undercoded for billing reasons. The model is more consistent than the base model in applying these biases, appearing more accurate in validation against the same historical dataset. It is less accurate coding against clinical guidelines.
Data preparation approach that prevents this
Label quality review: every training label reviewed by a domain expert against a defined labelling standard. Disagreement protocol: where reviewers disagree, the example is escalated to a senior expert rather than averaged. Inter-annotator agreement measurement: systematic measurement of agreement between annotators, identifying those whose standards differ from the established standard. Examples below the agreement threshold are excluded.
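Inter-annotator agreement for two annotators is commonly measured with Cohen's kappa, which corrects raw agreement for chance. A minimal from-scratch sketch, with illustrative labels (an exclusion threshold such as 0.6 would be a project-specific choice):

```python
# Inter-annotator agreement sketch: Cohen's kappa between two annotators,
# correcting observed agreement for agreement expected by chance.
# The label sequences below are illustrative.

from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

ann_a = ["yes", "yes", "no", "no", "yes", "no"]
ann_b = ["yes", "yes", "no", "yes", "yes", "no"]
kappa = cohens_kappa(ann_a, ann_b)
print(round(kappa, 2))  # 0.67: substantial but imperfect agreement
```

Computing kappa per annotator pair, rather than a single pooled figure, is what identifies the individual annotators whose standards diverge from the established one.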
05
Catastrophic forgetting removes capabilities the application depends on
Fine-tuning on a domain-specific dataset updates the model’s weights towards domain-specific patterns. This also reduces the strength of connections not reinforced by the domain-specific training — catastrophic forgetting. Capabilities the base model had that were not represented in the fine-tuning dataset may be degraded or eliminated. This is especially significant when the application depends on general reasoning capabilities that the fine-tuning dataset did not include because those capabilities are assumed but not directly taught.
What this produces in the fine-tuned model
A financial analysis model fine-tuned on financial report summaries produces more accurate, consistently formatted summaries than the base model. It is also significantly worse at multi-document synthesis — correlating information across multiple reports to identify trends — because the fine-tuning data consisted entirely of single-document summaries. The application required multi-document synthesis as a core capability. This was discovered in user testing after deployment.
Data preparation approach that prevents this
Capability inventory before fine-tuning: identify all capabilities the application requires, not just those fine-tuning is intended to improve. Verify the fine-tuning dataset includes examples of required non-target capabilities. Replay data: include a proportion of general-capability examples in the fine-tuning dataset to preserve general capabilities — standard regularisation against catastrophic forgetting.
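The replay-data step can be sketched as a deterministic mixing function. The 20% replay ratio below is an illustrative assumption, not a recommendation; the right proportion depends on how much the target capabilities and the general capabilities compete.

```python
# Replay-mix sketch: blend a fixed proportion of general-capability examples
# into the domain dataset to guard against catastrophic forgetting. The 20%
# replay ratio and example fields are illustrative assumptions.

import random

def build_mixture(domain, general, replay_ratio=0.2, seed=0):
    """Return a shuffled dataset in which `replay_ratio` of examples
    are general-capability replay data."""
    n_replay = int(len(domain) * replay_ratio / (1 - replay_ratio))
    rng = random.Random(seed)  # seeded for reproducible dataset assembly
    replay = rng.sample(general, min(n_replay, len(general)))
    mixture = domain + replay
    rng.shuffle(mixture)
    return mixture

domain = [{"task": "summarise_report", "id": i} for i in range(80)]
general = [{"task": "multi_doc_synthesis", "id": i} for i in range(100)]
mixture = build_mixture(domain, general)
print(len(mixture))  # 80 domain + 20 replay = 100
```

Seeding the sampler matters in practice: the exact mixture should be reproducible so that a regression in general capability can be traced back to the dataset that was actually trained on.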
06
Evaluation contamination inflates performance estimates
If the validation dataset was drawn from the same source as the training dataset without a clean separation, or if validation examples are semantically similar to training examples rather than genuinely held-out, validation performance is an overestimate of production performance. The model appears to generalise well because validation examples are similar to training examples. Production inputs that differ from the training distribution reveal the actual generalisation gap, which is larger than validation suggested.
What this produces in the fine-tuned model
A fine-tuned model achieves 94% accuracy on its validation set. In production, accuracy measured on a random sample of actual user queries is 71%. The 23-point gap is not measurement error — it is the result of a validation set drawn from the same document corpus as the training data, with the same authorial style and terminology. Real user queries used different phrasings and question structures not represented in the training corpus.
Data preparation approach that prevents this
Strict train-validation-test split: validation and test sets drawn from separate sources, or separated at the document level. Validation set designed to match the production input distribution including terminology variation and the full range of query types expected in production. Post-deployment evaluation on actual production queries is the only reliable measure of true generalisation — establish this infrastructure before deployment.
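Document-level separation can be enforced with a deterministic hash of the document identifier, so that every example from a given document always lands in the same split regardless of when it is processed. The field names and split fractions below are illustrative assumptions.

```python
# Document-level split sketch: hash each document id into a split bucket,
# so no document contributes examples to more than one split. Field names
# and the 80/10/10 fractions are illustrative assumptions.

import hashlib

def split_for(doc_id: str, val_frac=0.1, test_frac=0.1) -> str:
    """Deterministically map a document id to train/validation/test."""
    h = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16) % 100
    if h < test_frac * 100:
        return "test"
    if h < (test_frac + val_frac) * 100:
        return "validation"
    return "train"

examples = [
    {"doc_id": "contract_001", "text": "..."},
    {"doc_id": "contract_001", "text": "..."},
    {"doc_id": "contract_002", "text": "..."},
]
# Both examples from contract_001 necessarily land in the same split.
assignments = {ex["doc_id"]: split_for(ex["doc_id"]) for ex in examples}
print(assignments)
```

Hash-based assignment also survives dataset growth: adding new documents later never moves an existing document between splits, which a random shuffle would not guarantee.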
07
Fine-tuning objective drifts from the production task during data collection
Data collection for fine-tuning projects frequently runs over weeks or months, during which the production task definition evolves — the application’s requirements change, the user population’s needs become clearer, or a regulatory change shifts what the application must do. The training data collected at the start reflects an earlier, possibly obsolete definition of the task. Training on this data teaches the model the earlier version of the task, which may not be what the current application requires.
What this produces in the fine-tuned model
A compliance document assistant is fine-tuned on examples collected across a period during which a regulatory update changed assessment criteria for one area. Data collected before and after the update is mixed in the training set. The fine-tuned model blends old and new criteria — correct for most areas, systematically incorrect for the updated area. It is deployed with the belief that it reflects the current regulatory standard.
Data preparation approach that prevents this
Task definition locking before data collection begins: the production task is formally specified before a single training example is collected, and the specification is under change control for the collection period. Any change to the specification during collection triggers a review of whether previously collected data still aligns. Data collected under a superseded specification is quarantined and reviewed before inclusion.
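The quarantine step can be sketched by recording the specification version each example was collected under and partitioning on it. The version labels and example fields below are illustrative assumptions.

```python
# Spec-version quarantine sketch: each example records the task-specification
# version it was collected under; examples from superseded versions are set
# aside for review rather than silently included. Versions are illustrative.

CURRENT_SPEC = "v2"  # the specification after the regulatory update

def partition_by_spec(examples, current=CURRENT_SPEC):
    """Split examples into (aligned, quarantined) by spec version."""
    aligned = [ex for ex in examples if ex["spec_version"] == current]
    quarantined = [ex for ex in examples if ex["spec_version"] != current]
    return aligned, quarantined

examples = [
    {"id": 1, "spec_version": "v1"},   # collected before the update
    {"id": 2, "spec_version": "v2"},
    {"id": 3, "spec_version": "v2"},
]
aligned, quarantined = partition_by_spec(examples)
print(len(aligned), len(quarantined))  # 2 1
```

The prerequisite is that the spec version is stamped onto every example at collection time; it cannot be reconstructed reliably afterwards, which is why the section puts the specification under change control before collection begins.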
08
Objective mis-specification: optimising the wrong success criteria
Fine-tuning aligns a model to the objective encoded in its training data and evaluation metrics. If those metrics do not reflect the real-world success condition of the application, the model will optimise for proxy signals rather than actual performance. This frequently occurs when measurable attributes (format compliance, lexical similarity, brevity) are prioritised over harder-to-measure outcomes (decision correctness, downstream business impact, regulatory compliance, user trust). The model becomes highly efficient at satisfying the metric while failing the purpose the metric was intended to represent.
What this produces in the fine-tuned model
A support automation model is fine-tuned to maximise response speed and template adherence. Validation shows high scores across both metrics. In production, resolution rates decline and escalation volume increases because the model prioritises fast, structured responses over accurate problem diagnosis. It produces outputs that pass evaluation while degrading operational outcomes. The system appears optimised but is economically and functionally regressing.
Data preparation approach that prevents this
Outcome-aligned objective design: define success in terms of measurable real-world impact before dataset construction — resolution rate, error rate, compliance adherence, financial impact, or task completion accuracy. Training data must encode these outcomes explicitly, not indirectly. Multi-layer evaluation: combine surface-level metrics with outcome-based validation tied to production KPIs. Where direct measurement is difficult, construct validated proxies with proven correlation to real outcomes. Fine-tuning programmes must be governed by the same performance definitions used to evaluate the application in production, ensuring the model is optimised for what actually matters rather than what is easiest to measure.
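The multi-layer evaluation can be sketched as a gate that requires both a surface metric and an outcome metric to clear their thresholds before a candidate model is accepted. The metric names, field names, and thresholds below are illustrative assumptions mirroring the support-automation example above.

```python
# Multi-layer evaluation sketch: a candidate model must clear both a surface
# metric (format compliance) and an outcome metric (resolution rate) before
# acceptance. Field names and thresholds are illustrative assumptions.

THRESHOLDS = {"format": 0.95, "resolution": 0.80}

def evaluate(responses, thresholds=THRESHOLDS):
    """Return per-layer scores and an overall pass/fail decision."""
    n = len(responses)
    scores = {
        "format": sum(r["well_formatted"] for r in responses) / n,
        "resolution": sum(r["resolved"] for r in responses) / n,
    }
    passed = all(scores[k] >= t for k, t in thresholds.items())
    return scores, passed

# Fast, well-formatted responses that often fail to resolve the problem:
responses = [{"well_formatted": True, "resolved": i < 6} for i in range(10)]
scores, passed = evaluate(responses)
print(passed)  # False: format passes (1.0) but resolution (0.6) does not
```

The gate makes the failure mode in the example above impossible to miss in validation: a model that satisfies the proxy metric while degrading the outcome metric fails evaluation rather than shipping.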