
LLM Fine-Tuning Strategy & Data Preparation


Fine-tuning a language model is the most consequential and least reversible decision in an enterprise LLM programme. It is also the decision most commonly made for the wrong reasons: with inadequate data, without understanding the cost to the base model’s safety properties, and without a plan for what happens when the fine-tuned model must be updated as the base model evolves. Most fine-tuning projects that fail do not fail at the training step. They fail at the decision step — the organisation decided to fine-tune to solve a problem that a better prompt should have solved, or decided to fine-tune before the training data was in a state from which fine-tuning could succeed.

Fine-tuning shifts the model’s probability distribution towards the training data. Everything that is gained in domain-specific behaviour comes from somewhere — typically from reduction in coverage of the training distribution the base model was trained on. A model fine-tuned intensively on legal contract language becomes more capable with legal contracts and less reliable in the general reasoning tasks that underlie reliable contract analysis. A model fine-tuned on clinical notes becomes more fluent with clinical terminology and less reliable in its calibration of uncertainty — which matters significantly in clinical contexts.

This service addresses the decision, the data, and the safety properties — in that order. The decision phase produces a documented recommendation on whether to fine-tune, which base model, and what the fine-tuning is expected to achieve that cannot be achieved without it. The data phase produces a training data preparation specification addressing quality, coverage, format, and volume. The safety phase produces a pre-training baseline assessment and a post-training evaluation protocol measuring what the fine-tuning cost and whether those costs are acceptable for the deployment context.

Price Range
£7,500 – £55,000+
Decision assessment, data preparation specification, and safety evaluation framework. Fine-tuning execution is separate and additional.
Duration
3 – 16 weeks
Strategy and specification only. Training runs add 2–8 weeks depending on dataset size and available compute.
Scope boundary
Strategy, data preparation specification, and evaluation framework design. We do not execute training runs, manage compute infrastructure, or operate fine-tuned models in production.
Decision first
The Decision Assessment (£7,500) is a standalone engagement that always precedes the others. Its fee is credited in full against any subsequent tier. Its recommendation may be not to fine-tune.
Contract
Fixed-price. 50% on signing, 50% on delivery acceptance.
Fine-tuning is not the first resort
Most enterprise LLM capability gaps can be closed with better prompt engineering, better retrieval architecture, or a different base model — faster, cheaper, and without the safety property degradation that fine-tuning causes. This engagement begins by establishing whether fine-tuning is the correct intervention. If the Decision Assessment concludes it is not, we will say so. The £7,500 decision fee will have saved you significantly more than it cost.

When fine-tuning is the right decision and when it is not. Four signals for each, and both sides stated equally directly.

The decision to fine-tune is made with incomplete information under time pressure in most enterprise LLM programmes. Vendors have incentives to recommend fine-tuning because it creates lock-in and generates compute revenue. Engineering teams have incentives to recommend it because it is technically interesting. Neither incentive is aligned with the organisation’s interest in the fastest, cheapest route to the capability it actually needs. The decision framework below is the one the Decision Assessment applies — with no preference for the outcome.

Fine-tune when…
The domain vocabulary and patterns cannot be learned from context
Some domains have vocabulary and reasoning patterns sparse in general pre-training data — clinical documentation, legal drafting in a specific jurisdiction’s style, highly specialised engineering notation. When the gap between the model’s base vocabulary and the domain’s required vocabulary is large enough that even extensive system prompt examples do not close it reliably, fine-tuning shifts the model’s probability distribution towards domain-correct outputs in a way that prompting cannot replicate.
The test that confirms this signal is real
Measure model performance on the domain task with the best achievable prompt including extensive few-shot examples. If performance remains significantly below the required threshold despite exhausting prompt engineering, and a domain expert assesses that the failures are vocabulary and pattern failures rather than reasoning failures, fine-tuning is likely to help. If the failures are reasoning failures — the model cannot perform the underlying inference even with the vocabulary — fine-tuning will not help. A more capable base model is the correct intervention.
A consistent output format is required that prompting cannot reliably produce
When the output format is sufficiently specific that even detailed format instructions and examples produce unacceptable deviation rates at production scale, fine-tuning on correctly-formatted examples teaches the format as a pattern rather than as an instruction to follow. Format consistency typically improves significantly and measurably with modest fine-tuning. There is also a token cost benefit: the format specification consumed in the context window at inference time is eliminated once the format is learned.
The test that confirms this signal is real
Measure format compliance rate on the production input distribution with the best achievable prompt. If more than 2–3% of production inputs produce non-compliant outputs, and that non-compliance represents a genuine operational problem, format-focused fine-tuning on correctly-formatted examples is likely to close the gap efficiently.
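The compliance measurement described above can be sketched as a simple scoring pass over a sample of production outputs. This is an illustrative sketch only: the JSON-with-required-keys contract and the sample outputs are hypothetical stand-ins for whatever format contract the application actually specifies.

```python
import json

def is_compliant(output: str, required_keys: set) -> bool:
    """Check one model output against the format contract:
    must parse as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys <= parsed.keys()

def compliance_rate(outputs: list, required_keys: set) -> float:
    """Fraction of outputs that meet the format contract."""
    if not outputs:
        return 0.0
    return sum(is_compliant(o, required_keys) for o in outputs) / len(outputs)

# Hypothetical sample of production outputs against a two-key contract.
sample = [
    '{"summary": "ok", "risk": "low"}',
    '{"summary": "ok"}',                  # missing key -> non-compliant
    'Sure! Here is the JSON: {...}',      # preamble -> non-compliant
    '{"summary": "ok", "risk": "high"}',
]
rate = compliance_rate(sample, {"summary", "risk"})
```

Run on the real production input distribution, a rate below roughly 97–98% would trigger the signal described above.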
Cost at production volume makes the frontier model unaffordable and a smaller fine-tuned model would meet the capability requirement
A frontier model meeting the capability requirement may cost 10–50× more per request than a smaller model fine-tuned for the specific task. For high-volume applications where the differential is material, fine-tuning a smaller model may produce acceptable task performance at dramatically lower operating cost. This is legitimate when the task is specific and bounded enough that a smaller fine-tuned model can match frontier performance on that specific task.
The test that confirms this signal is real
Run the smaller model on the task with the best achievable prompt. If performance is already close with prompting, fine-tuning may not provide the cost justification. Model the full TCO: fine-tuning data preparation cost + training compute cost + re-training cost when the base model is updated, against the cost differential over the deployment horizon.
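The TCO comparison above reduces to a few lines of arithmetic. The figures below are illustrative placeholders, not benchmarks: substitute your own data preparation cost, retraining cadence, request volume, and per-request prices.

```python
def fine_tune_tco(data_prep, training_compute, retrains_per_year,
                  retrain_cost, horizon_years):
    """One-off plus recurring costs of the fine-tuning route
    over the deployment horizon."""
    return (data_prep + training_compute
            + retrains_per_year * retrain_cost * horizon_years)

def inference_saving(requests_per_year, frontier_cost_per_req,
                     small_cost_per_req, horizon_years):
    """Inference spend avoided by serving the smaller fine-tuned model."""
    return ((frontier_cost_per_req - small_cost_per_req)
            * requests_per_year * horizon_years)

# Illustrative numbers only.
tco = fine_tune_tco(data_prep=40_000, training_compute=8_000,
                    retrains_per_year=2, retrain_cost=8_000,
                    horizon_years=2)
saving = inference_saving(requests_per_year=5_000_000,
                          frontier_cost_per_req=0.01,
                          small_cost_per_req=0.001,
                          horizon_years=2)
worthwhile = saving > tco
```

The comparison only favours fine-tuning when the saving clears the full TCO with margin; a marginal result usually means prompting the smaller model, or staying on the frontier model, is the safer route.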
Privacy or latency constraints require a self-hosted model and no suitable self-hostable model performs adequately without fine-tuning
Organisations with data that cannot leave their infrastructure — classified government data, certain healthcare records, commercially sensitive trading data — may not be able to use API-based models at all. If the available self-hostable models of appropriate size do not perform adequately on the task without fine-tuning, fine-tuning is a constraint, not a preference. The choice is between fine-tuning an available self-hostable model and not deploying an LLM at all.
The test that confirms this signal is real
Verify that no compliant hosted solution exists at the required security classification — some cloud providers now offer compliant environments for specific government classifications that remove the self-hosting requirement. Evaluate the best available self-hostable models with prompt engineering before concluding fine-tuning is needed.
Do not fine-tune when…
The problem is a prompt engineering problem, not a capability problem
The most common reason organisations decide to fine-tune when they should not: the model produces inconsistent or incorrect outputs because the prompt is poorly designed — contradictory instructions, missing edge case handling, no format specification. Fine-tuning on examples of correct outputs does not fix a broken prompt system. It teaches the model to mimic the training examples, which may improve average performance while leaving the underlying prompt design problems intact. The fine-tuned model will still fail on inputs the training examples did not cover.
How to verify this is the reason before committing
Engage the Prompt Engineering & System Design service before committing to fine-tuning. In our experience, organisations that conduct a systematic prompt redesign before deciding to fine-tune consistently find that fine-tuning is unnecessary in a significant proportion of cases. The prompt redesign costs £12,000 and takes 4 weeks — a significantly lower-cost intervention to try first.
The training data does not yet exist in sufficient volume and quality
Fine-tuning requires training data that is representative of the production task, correctly labelled, of sufficient volume, and free from quality problems that corrupt training. Organisations that decide to fine-tune before assessing their training data consistently discover it is insufficient. Fine-tuning on insufficient or low-quality data produces a model that is overfit to bad examples — worse than the base model in specific ways that are hard to diagnose because the training data is believed to be correct.
How to verify this before committing
The Decision Assessment includes a training data audit on a 200-item sample. If the audit reveals insufficient volume, poor quality, distributional mismatch, or labelling inconsistency, the recommendation will be to address data problems before proceeding. Data preparation is included in subsequent engagements — but the data must be preparable, which the audit establishes first.
The required knowledge is dynamic and changes more frequently than the fine-tuning cadence
Fine-tuning teaches the model patterns and fixed knowledge from the training data at training time. An organisation that fine-tunes on its product catalogue, regulatory guidance library, or policy documentation and then updates those sources will have a fine-tuned model whose learned knowledge is progressively more outdated. The options are re-fine-tuning (expensive and slow) or supplementing with RAG — which works, but if RAG can supply the knowledge, fine-tuning to bake it in was never necessary.
What to do instead
RAG is almost always the correct approach for dynamic knowledge. Fine-tuning is appropriate for static patterns — style, format, domain vocabulary, task-specific reasoning structure. The combination — fine-tuned model for style and domain pattern, plus RAG for current knowledge — is often the best approach when both requirements exist. See RAG Architecture & Knowledge System Design.
The safety property degradation is unacceptable for the deployment context
Every fine-tuning run degrades the base model’s safety properties to some degree. In regulated environments — healthcare, financial services, legal services — the degradation may be unacceptable regardless of the capability gain. A clinical LLM that becomes more fluent with clinical terminology through fine-tuning but less reliable in its calibration of uncertainty may be worse, not better, for clinical deployment. If the pre-fine-tuning assessment shows fine-tuning will push the model below the acceptable safety threshold, fine-tuning is the wrong approach regardless of the capability benefit.
How to assess this before committing
The Decision Assessment establishes the base model’s safety property baseline in the dimensions relevant to the deployment context and estimates expected degradation from the fine-tuning intensity required. If estimated post-fine-tuning safety levels are below the deployment context’s acceptable threshold, the Decision Assessment will say so explicitly. The guardrail layer design (see LLM Architecture Services) may compensate for some degradation — but not all, and the model must retain the properties the guardrail cannot substitute for.

Eight data preparation failure modes. Every one produces a fine-tuned model that is worse than the base model in specific, hard-to-diagnose ways.

Fine-tuning data preparation is where most fine-tuning projects fail in practice. The training run completes. The fine-tuned model looks better on the examples used to validate it. It is deployed. In production, it fails in ways that are hard to attribute to the training data because no one reviewed it carefully enough before training to know what was in it.

01
Inconsistent quality standards across the training dataset
Training data collected from multiple sources — different authors, time periods, quality review processes, annotation guidelines — contains examples of varying quality treated as equivalent. A high-quality example and a mediocre example formatted identically are given equal weight during training. The model learns from both. The outputs it produces are a mixture of the quality distribution in the training data — not the quality the organisation intended to teach.
What this produces in the fine-tuned model
An organisation fine-tunes a customer communication model on 5 years of historical emails. Quality varied across the collection period: early emails were verbose and inconsistently formatted; later emails were tighter. The fine-tuned model blends old and new styles inconsistently — worse than the best recent examples, better than the worst historical ones. The team expected the model to learn the current standard. It learned the average of the full collection period.
Data preparation approach that prevents this
Quality stratification before dataset assembly: every candidate training example assessed against a defined quality rubric with a minimum threshold for inclusion. The rubric must be specific enough for consistent application — not “good quality” but specific criteria a reviewer can apply to any example and reach the same conclusion. Quality-filtered datasets are smaller but produce significantly better results than large, heterogeneous datasets with broad quality variation.
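The stratification step above can be sketched as a pass/fail filter over reviewer scores against a fixed rubric. The criteria names and examples here are hypothetical; the point is that inclusion is a binary decision against explicit criteria, not a judgement of "good quality".

```python
from dataclasses import dataclass

# Hypothetical rubric: every criterion must pass for inclusion.
RUBRIC = ("follows_current_format", "factually_verified", "complete_answer")

@dataclass
class Example:
    text: str
    scores: dict  # criterion name -> 0 or 1, assigned by a reviewer

def passes_rubric(ex: Example) -> bool:
    """An example is included only if it passes every rubric criterion."""
    return all(ex.scores.get(c, 0) == 1 for c in RUBRIC)

def stratify(candidates):
    """Split candidates into the training pool and the excluded pool."""
    included = [e for e in candidates if passes_rubric(e)]
    excluded = [e for e in candidates if not passes_rubric(e)]
    return included, excluded

candidates = [
    Example("recent, reviewed email",
            {"follows_current_format": 1, "factually_verified": 1,
             "complete_answer": 1}),
    Example("verbose 2019 email",
            {"follows_current_format": 0, "factually_verified": 1,
             "complete_answer": 1}),
]
included, excluded = stratify(candidates)
```

The excluded pool is retained and reviewed, not discarded silently: a large exclusion rate is itself a finding about the source collection.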
02
Training data distribution does not match production input distribution
Fine-tuning teaches the model the patterns in the training data. If the training distribution differs systematically from the production distribution, the model’s improved performance is in the training distribution, not in production. Validation results look good because validation is conducted on held-out training data — which has the same distribution as the training data, not the same distribution as production. The degradation is only visible when the model is deployed to production inputs.
What this produces in the fine-tuned model
A legal document assistant is fine-tuned on executed contracts. The production use case includes draft contracts under negotiation. Draft contracts use different language — placeholder text, conditional phrasing, bracketed alternatives — rare in executed contracts. The fine-tuned model handles executed contract language better than the base model and handles draft contract language worse. The application required both.
Data preparation approach that prevents this
Production input distribution analysis before dataset assembly: characterise the full range of production inputs including the tail, and verify the training dataset covers the production distribution proportionally. Validation must be conducted on a held-out set drawn from production inputs, not from the same source as the training data.
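The coverage verification above can be sketched as a per-category share comparison between the training set and a production sample. The category labels and the 5% tolerance are illustrative assumptions; real coverage requirements are set per category in the specification.

```python
from collections import Counter

def coverage_gaps(train_labels, prod_labels, tolerance=0.05):
    """Return categories whose share of the training set falls short
    of their share of production inputs by more than `tolerance`."""
    n_train, n_prod = len(train_labels), len(prod_labels)
    train_counts, prod_counts = Counter(train_labels), Counter(prod_labels)
    gaps = {}
    for cat, count in prod_counts.items():
        prod_share = count / n_prod
        train_share = train_counts.get(cat, 0) / n_train
        if prod_share - train_share > tolerance:
            gaps[cat] = (train_share, prod_share)
    return gaps

# Hypothetical labels: training set skewed towards executed contracts,
# production traffic containing a large share of drafts.
train = ["executed"] * 90 + ["draft"] * 10
prod = ["executed"] * 60 + ["draft"] * 40
gaps = coverage_gaps(train, prod)
```

Any category the check flags is either added to the collection plan or explicitly excluded from the application's supported scope before training proceeds.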
03
Insufficient training volume for the fine-tuning objective
Different fine-tuning objectives require different data volumes. Teaching a model a new output format may require a few hundred examples. Teaching a complex domain reasoning pattern may require tens of thousands. Organisations that collect whatever is available and train on it are training with an unknown data volume relationship to their objective. The result may be a model that has learned the surface pattern without the underlying structure — high validation accuracy, poor generalisation.
What this produces in the fine-tuned model
The fine-tuned model performs well on training-distribution examples in validation but fails to generalise to production inputs that differ slightly — the overfit signature of insufficient data volume. It is typically only visible when enough production examples are available to assess generalisation beyond the training distribution.
Data preparation approach that prevents this
Data volume requirement estimation before collection: based on fine-tuning objective type, base model’s existing coverage of the target domain, and complexity of patterns to be learned. Staged training with incremental data: train on a subset, measure generalisation, add more data and retrain until the generalisation metric plateaus. Do not commit to a fine-tuning programme before verifying sufficient collectable data exists or can be generated.
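The staged-training loop above can be sketched as a stop condition on the held-out generalisation metric. The subset sizes, scores, and 0.01 minimum gain are illustrative; in practice each stage trains on the enlarged subset and evaluates on a fixed production-representative held-out set.

```python
def plateaued(scores, min_gain=0.01):
    """True once the latest data increment improved the held-out
    generalisation metric by less than `min_gain`."""
    return len(scores) >= 2 and scores[-1] - scores[-2] < min_gain

# Illustrative generalisation scores per training-subset size.
stage_scores = {500: 0.61, 1000: 0.70, 2000: 0.76, 4000: 0.78, 8000: 0.783}

history = []
stop_size = None
for size, score in stage_scores.items():
    # In practice: train on the first `size` examples, then
    # score = evaluate(model, held_out_production_set)
    history.append(score)
    if plateaued(history):
        stop_size = size
        break
```

If the metric is still climbing when the collectable data runs out, that is the signal that the fine-tuning programme should not yet be committed to.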
04
Label noise corrupts the training signal
In supervised fine-tuning, training examples have labels or preferred outputs that teach the model what correct looks like. If labels are incorrect — annotators who disagreed and the disagreement was averaged rather than arbitrated by an expert, domain experts who applied different standards at different times, automatically generated labels from a weaker model that was itself imperfect — the model is trained towards incorrect targets. The resulting model has learned to produce outputs that look like the noisy labels.
What this produces in the fine-tuned model
A clinical coding model fine-tuned on historical ICD codes learns the coding biases and errors present in the historical dataset — including systematic undercoding of specific conditions that hospital coders historically undercoded for billing reasons. The model is more consistent than the base model in applying these biases, appearing more accurate in validation against the same historical dataset. It is less accurate when its coding is assessed against clinical guidelines.
Data preparation approach that prevents this
Label quality review: every training label reviewed by a domain expert against a defined labelling standard. Disagreement protocol: where reviewers disagree, the example is escalated to a senior expert rather than averaged. Inter-annotator agreement measurement: systematic measurement of agreement between annotators, identifying those whose standards differ from the established standard. Examples below the agreement threshold are excluded.
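The inter-annotator agreement measurement above is typically computed as Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The labels below are hypothetical; production annotation tooling or scikit-learn's `cohen_kappa_score` would normally supply this.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same examples:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    if expected == 1:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators on 8 shared examples.
ann_a = ["x", "x", "y", "y", "x", "y", "x", "y"]
ann_b = ["x", "x", "y", "x", "x", "y", "x", "x"]
kappa = cohens_kappa(ann_a, ann_b)
```

Annotator pairs whose kappa falls below the agreed threshold are the trigger for the escalation protocol; their disagreed examples are arbitrated by the senior expert rather than averaged.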
05
Catastrophic forgetting removes capabilities the application depends on
Fine-tuning on a domain-specific dataset updates the model’s weights towards domain-specific patterns. This also reduces the strength of connections not reinforced by the domain-specific training — catastrophic forgetting. Capabilities the base model had that were not represented in the fine-tuning dataset may be degraded or eliminated. This is especially significant when the application depends on general reasoning capabilities that the fine-tuning dataset did not include because those capabilities are assumed but not directly taught.
What this produces in the fine-tuned model
A financial analysis model fine-tuned on financial report summaries produces more accurate, consistently formatted summaries than the base model. It is also significantly worse at multi-document synthesis — correlating information across multiple reports to identify trends — because the fine-tuning data consisted entirely of single-document summaries. The application required multi-document synthesis as a core capability. This was discovered in user testing after deployment.
Data preparation approach that prevents this
Capability inventory before fine-tuning: identify all capabilities the application requires, not just those fine-tuning is intended to improve. Verify the fine-tuning dataset includes examples of required non-target capabilities. Replay data: include a proportion of general-capability examples in the fine-tuning dataset to preserve general capabilities — standard regularisation against catastrophic forgetting.
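The replay-data step above can be sketched as a deterministic mixing function. The 15% replay fraction is an illustrative default, not a recommendation; the right proportion depends on the fine-tuning intensity and which general capabilities the application relies on.

```python
import random

def assemble_with_replay(domain_examples, general_examples,
                         replay_fraction=0.15, seed=0):
    """Mix a fixed fraction of general-capability 'replay' examples
    into the fine-tuning set to mitigate catastrophic forgetting.
    replay_fraction is the share of the FINAL mixed dataset."""
    rng = random.Random(seed)
    n_replay = round(len(domain_examples)
                     * replay_fraction / (1 - replay_fraction))
    n_replay = min(n_replay, len(general_examples))
    mixed = domain_examples + rng.sample(general_examples, n_replay)
    rng.shuffle(mixed)
    return mixed

domain = [f"domain-{i}" for i in range(85)]      # hypothetical examples
general = [f"general-{i}" for i in range(200)]
mixed = assemble_with_replay(domain, general)
```

The replay pool itself should cover the specific non-target capabilities identified in the capability inventory — multi-document synthesis, in the example above — not general text at random.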
06
Evaluation contamination inflates performance estimates
If the validation dataset was drawn from the same source as the training dataset without a clean separation, or if validation examples are semantically similar to training examples rather than genuinely held-out, validation performance is an overestimate of production performance. The model appears to generalise well because validation examples are similar to training examples. Production inputs that differ from the training distribution reveal the actual generalisation gap, which is larger than validation suggested.
What this produces in the fine-tuned model
A fine-tuned model achieves 94% accuracy on its validation set. In production, accuracy measured on a random sample of actual user queries is 71%. The 23-point gap is not measurement error — it is the result of a validation set drawn from the same document corpus as the training data, with the same authorial style and terminology. Real user queries used different phrasings and question structures not represented in the training corpus.
Data preparation approach that prevents this
Strict train-validation-test split: validation and test sets drawn from separate sources, or separated at the document level. Validation set designed to match the production input distribution including terminology variation and the full range of query types expected in production. Post-deployment evaluation on actual production queries is the only reliable measure of true generalisation — establish this infrastructure before deployment.
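The document-level separation above can be sketched as a deterministic hash-based assignment: every example carries the identifier of its source document, and the split is decided per document, so near-duplicate passages from one document can never straddle train and validation. The 80/10/10 proportions are illustrative.

```python
import hashlib

def split_bucket(doc_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministic document-level split assignment. All examples
    drawn from the same source document land in the same bucket,
    preventing near-duplicate leakage between train and validation."""
    h = int(hashlib.sha256(doc_id.encode()).hexdigest(), 16) % 100
    if h < test_pct:
        return "test"
    if h < test_pct + val_pct:
        return "validation"
    return "train"

bucket = split_bucket("contract-2023-0147")   # hypothetical document id
```

Note this only enforces separation at the document level; validation examples drawn from the production query stream, as the text recommends, remain the stronger check because they also differ in authorial style.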
07
Fine-tuning objective drifts from the production task during data collection
Data collection for fine-tuning projects frequently runs over weeks or months, during which the production task definition evolves — the application’s requirements change, the user population’s needs become clearer, or a regulatory change shifts what the application must do. The training data collected at the start reflects an earlier, possibly obsolete definition of the task. Training on this data teaches the model the earlier version of the task, which may not be what the current application requires.
What this produces in the fine-tuned model
A compliance document assistant is fine-tuned on examples collected across a period during which a regulatory update changed assessment criteria for one area. Data collected before and after the update is mixed in the training set. The fine-tuned model blends old and new criteria — correct for most areas, systematically incorrect for the updated area. It is deployed with the belief that it reflects the current regulatory standard.
Data preparation approach that prevents this
Task definition locking before data collection begins: the production task is formally specified before a single training example is collected, and the specification is under change control for the collection period. Any change to the specification during collection triggers a review of whether previously collected data still aligns. Data collected under a superseded specification is quarantined and reviewed before inclusion.
08
Objective mis-specification: optimising the wrong success criteria
Fine-tuning aligns a model to the objective encoded in its training data and evaluation metrics. If those metrics do not reflect the real-world success condition of the application, the model will optimise for proxy signals rather than actual performance. This frequently occurs when measurable attributes (format compliance, lexical similarity, brevity) are prioritised over harder-to-measure outcomes (decision correctness, downstream business impact, regulatory compliance, user trust). The model becomes highly efficient at satisfying the metric while failing the purpose the metric was intended to represent.
What this produces in the fine-tuned model
A support automation model is fine-tuned to maximise response speed and template adherence. Validation shows high scores across both metrics. In production, resolution rates decline and escalation volume increases because the model prioritises fast, structured responses over accurate problem diagnosis. It produces outputs that pass evaluation while degrading operational outcomes. The system appears optimised but is economically and functionally regressing.
Data preparation approach that prevents this
Outcome-aligned objective design: define success in terms of measurable real-world impact before dataset construction — resolution rate, error rate, compliance adherence, financial impact, or task completion accuracy. Training data must encode these outcomes explicitly, not indirectly. Multi-layer evaluation: combine surface-level metrics with outcome-based validation tied to production KPIs. Where direct measurement is difficult, construct validated proxies with proven correlation to real outcomes. Fine-tuning programmes must be governed by the same performance definitions used to evaluate the application in production, ensuring the model is optimised for what actually matters rather than what is easiest to measure.

Three engagement types. Decision first, then single-task strategy, then multi-task or domain adaptation.

The Decision Assessment is always first. Its fee is credited against any subsequent engagement. It may conclude that fine-tuning is the wrong decision — in which case the credit is not applied because there is no subsequent engagement, and the £7,500 will have prevented a significantly more expensive programme from being initiated incorrectly. Fine-tuning execution — the training runs themselves — is outside the scope of all three engagements.

Engagement Type 1 — Always First
Fine-Tuning Decision Assessment
For any organisation considering fine-tuning an LLM for any purpose. The prerequisite for any subsequent fine-tuning engagement — it is not possible to design a good data preparation specification without first establishing that fine-tuning is the correct decision, which base model is correct, what the fine-tuning is expected to achieve, and what safety properties are at risk. This engagement produces a written decision recommendation with documented evidence. Its fee is credited in full against any subsequent engagement if the recommendation is to proceed.
£7,500
Fixed · VAT excl. · credited if proceeding
3 weeks
The recommendation may be not to fine-tune. If so, the engagement concludes at delivery. The £7,500 is not refunded — an honest negative recommendation takes the same work as a positive one.
Decision Analysis
Application of the four-signal decision framework: does this use case meet the criteria for fine-tuning or for not fine-tuning?
Prompt engineering gap analysis: measure current task performance with the best achievable prompt — establishing whether fine-tuning is needed or whether the gap can be closed without it
Alternative intervention assessment: for each capability gap, what is the fastest and cheapest intervention that would close it?
Base model selection: which base models are candidates given task type, data privacy requirements, inference cost target, and self-hosting requirements
Fine-tuning method selection: full fine-tuning vs. LoRA vs. QLoRA vs. instruction fine-tuning
Data & Safety Pre-Assessment
Training data audit on a 200-item sample: quality distribution, distribution coverage, labelling consistency, volume estimate relative to objective
Base model safety baseline: current safety property measurements in the dimensions most relevant to the deployment context
Safety degradation estimate: expected safety property changes from the fine-tuning intensity required, and whether acceptable for the deployment context
EU AI Act provider assessment: if fine-tuning is recommended, the obligations as a new AI system provider and whether they change the decision
Decision Document
Recommendation: fine-tune or do not fine-tune, with specific evidence from the decision framework
If not to fine-tune: the alternative intervention recommendation and expected outcome
If to fine-tune: base model recommendation, fine-tuning method, data readiness assessment, safety risk assessment, and scope and timeline for the appropriate subsequent engagement
Decision rationale documented for senior technical leadership review
Uncertainty disclosures: where the recommendation depends on assumptions with meaningful uncertainty, those are stated explicitly
Fee credit on subsequent engagement
If the Decision Assessment recommends proceeding and the organisation engages Type 2 or Type 3 within 90 days of delivery, the £7,500 fee is credited in full against the subsequent engagement fee. The credit does not apply if the recommendation is not to fine-tune.
What Your Team Must Provide
200-item sample of candidate training data — representative of the full intended dataset, not the best examples only
Current production prompt and performance data: what the application currently produces, on what inputs, with what accuracy
Deployment context documentation: regulatory environment, intended user population, consequence of errors, specific safety requirements
ML lead: 2-hour technical discussion in week 1 and 90-minute decision presentation in week 3
What the Decision Assessment Does Not Cover
Full data preparation specification: the Decision Assessment audits readiness; the full specification is in Type 2 or Type 3
Fine-tuning execution: not in scope at any tier
Post-training evaluation execution: the Decision Assessment specifies the evaluation framework; running it after training is your ML team’s work from our specification
Engagement Type 2 — Requires Type 1 first
Single-Task Fine-Tuning Strategy & Data Preparation
For organisations where the Decision Assessment has confirmed fine-tuning is the correct intervention and the objective is one specific task — one output type, one domain, one fine-tuning run. Examples: a clinical note summarisation model, a legal clause extraction model, a customer communication style model, a code generation model for a specific language or codebase. If the organisation has multiple distinct fine-tuning objectives requiring separate datasets and separate training runs, Type 3 is appropriate.
£28,000
Fixed · less £7,500 Type 1 credit · VAT excl.
10 weeks
Excludes training run duration, which depends on dataset size and available compute. Allow 2–6 weeks for the training run after specifications are delivered.
Data Preparation Specification
Training data quality standard: the specific quality rubric for inclusion — precise enough for consistent application
Dataset assembly specification: sources, collection methodology, quality filtering process, volume targets per input category
Distribution coverage specification: how the training dataset must cover the production input distribution with specific coverage requirements per category
Labelling standard: exact criteria, labelling process, inter-annotator agreement requirement, escalation protocol for disagreements
Train-validation-test split methodology: designed for genuine held-out evaluation on a production-representative validation set
Data format specification: exact format for the chosen fine-tuning method, ready to ingest without transformation
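As an illustration of what the data format specification produces, a single supervised example in the chat-style JSONL layout that several fine-tuning APIs accept might look like the following. The field names and roles are a common convention, not a universal requirement; the actual specification matches whatever the chosen base model and fine-tuning method ingest.

```python
import json

# Hypothetical single-task training example in chat-style JSONL layout.
# "messages"/"role"/"content" follow a widely used convention but must
# match the chosen fine-tuning method's actual schema.
example = {
    "messages": [
        {"role": "system",
         "content": "Summarise the clinical note in the house format."},
        {"role": "user", "content": "<source note text>"},
        {"role": "assistant", "content": "<gold-standard summary>"},
    ]
}

line = json.dumps(example)   # one JSON object per line in the .jsonl file
```

A format validator that parses every line and checks the schema before ingestion is a cheap guard against silently malformed training files.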
Training Specification
Hyperparameter specification: learning rate schedule, batch size, training duration, regularisation approach
Catastrophic forgetting mitigation: replay data specification and proportion to preserve non-target capabilities
Checkpoint strategy: how often to checkpoint, which to evaluate, how to select the best checkpoint
Convergence criteria: the validation metrics indicating when training is complete
Compute requirements: minimum GPU specification, VRAM, training time estimate for infrastructure planning
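The replay data item above can be sketched concretely. This is a minimal illustration, not the delivered methodology: it blends task-specific examples with replay examples drawn from general-purpose data at a specified proportion, which is the basic mechanism for preserving non-target capabilities.

```python
import random

def build_training_mix(task_data, replay_data, replay_fraction=0.1, seed=0):
    """Blend task-specific examples with replay examples so the model
    retains non-target capabilities. replay_fraction is the share of
    the final mix that is replay data (0.1 = 10%)."""
    rng = random.Random(seed)
    # number of replay examples needed for the target final proportion
    n_replay = round(len(task_data) * replay_fraction / (1 - replay_fraction))
    replay = rng.sample(replay_data, min(n_replay, len(replay_data)))
    mix = list(task_data) + replay
    rng.shuffle(mix)
    return mix

# 900 task examples blended with general-purpose replay data at 10%
mix = build_training_mix(["task"] * 900, ["replay"] * 500, replay_fraction=0.1)
```

The specification sets the actual proportion and the replay data sources for the specific base model and deployment context; 10% here is purely illustrative.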
Safety & Evaluation
Pre-training baseline documentation: base model safety properties from the Decision Assessment, formalised as the comparison baseline
Post-training evaluation protocol: specific tests, metrics, and pass/fail criteria covering task capability AND safety property retention
Safety property test suite: specific inputs and expected behaviours testing the safety properties most at risk
Acceptance criteria: minimum post-training thresholds on both task capability and safety properties — the model must pass both to be accepted for deployment
Failure protocol: root cause assessment process, remediation options, decision authority if acceptance criteria are not met
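The dual-threshold acceptance logic can be illustrated with a short sketch. The metric names are hypothetical; the point is the structure: the model must clear every task threshold AND every safety threshold, and failing either side rejects it for deployment.

```python
def acceptance_gate(task_metrics, safety_metrics,
                    task_thresholds, safety_thresholds):
    """Return (accepted, failures). Acceptance requires every task
    metric AND every safety metric to meet its minimum threshold."""
    observed = {**task_metrics, **safety_metrics}
    thresholds = {**task_thresholds, **safety_thresholds}
    failures = []
    for name, minimum in thresholds.items():
        value = observed.get(name)
        if value is None or value < minimum:
            failures.append((name, value, minimum))
    return (not failures), failures

ok, failures = acceptance_gate(
    task_metrics={"clause_f1": 0.91},            # hypothetical metric names
    safety_metrics={"refusal_rate_on_unsafe": 0.97},
    task_thresholds={"clause_f1": 0.88},
    safety_thresholds={"refusal_rate_on_unsafe": 0.95},
)
```

A model that gains task capability while dropping below a safety threshold fails this gate, which is exactly the case the failure protocol above exists to handle.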
What Your Team Must Provide
Access to the complete candidate training dataset — not a sample, the full intended dataset — for the full audit in weeks 1–2
Senior domain expert: 3–5 hours in weeks 3–5 for labelling standard validation and quality rubric review
ML lead: available for training specification review in weeks 6–7 and compute planning discussion
Infrastructure plan: the compute environment must be identified before week 6 so compute requirements can be calibrated to available infrastructure
What Is Not in This Engagement
Training run execution: your ML team or a specialist ML infrastructure partner executes the training from our specification
Post-training evaluation execution: your ML team runs the evaluation against our acceptance criteria
Data labelling labour: labelling work performed by your domain experts or a specialist annotation partner
Post-deployment monitoring: see LLMOps for ongoing operational monitoring of the fine-tuned model
Engagement Type 3 — Requires Type 1 first
Multi-Task or Domain Adaptation Fine-Tuning Strategy
For organisations with multiple distinct fine-tuning objectives, or a domain adaptation objective broader than a single task. Multi-task: fine-tuning a model across several related tasks simultaneously where task similarity allows a single run to benefit all tasks. Domain adaptation: fine-tuning on a large domain corpus to improve general domain capability before task-specific fine-tuning — appropriate when the domain vocabulary gap is large enough that task-specific fine-tuning alone cannot close it. All Type 3 engagements individually scoped.
From £55,000
Individually scoped · less £7,500 credit · VAT excl.
From 14 weeks
Domain adaptation programmes covering large corpora commonly run 18–24 weeks due to data volume requirements and multi-stage training design.
Multi-Task Additions
Task interaction analysis: where training objectives reinforce or compete, and how the training data mixture must be balanced
Multi-task dataset design: separate dataset specifications per task, with mixture proportions and sampling strategy for the combined training dataset
Per-task evaluation framework: separate evaluation per task, plus portfolio evaluation assessing whether the model meets all task requirements simultaneously
Task failure protocol: what happens when the fine-tuned model meets some task criteria and fails others
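One common mechanism for the mixture proportions and sampling strategy above is temperature-scaled sampling, sketched here as an assumption-laden illustration: at temperature 1.0 tasks are sampled in proportion to dataset size, and lower temperatures up-weight small tasks so they are not drowned out in the combined training dataset.

```python
def mixture_weights(dataset_sizes, temperature=0.5):
    """Per-task sampling weights for a multi-task training mix.
    temperature=1.0 -> proportional to size; lower -> flatter mix."""
    scaled = {task: n ** temperature for task, n in dataset_sizes.items()}
    total = sum(scaled.values())
    return {task: w / total for task, w in scaled.items()}

# hypothetical task names and dataset sizes
weights = mixture_weights({"summarise": 40000, "extract": 10000},
                          temperature=0.5)
```

With temperature 0.5, the 4:1 size ratio becomes a 2:1 sampling ratio; the engagement specifies the actual temperature and per-task proportions from the task interaction analysis.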
Domain Adaptation Additions
Domain corpus specification: sources, volume, quality filtering for the domain training corpus — typically significantly larger than a task-specific dataset
Two-stage training design: the domain adaptation stage and the task-specific fine-tuning stage, separately specified with separate evaluation checkpoints
Domain vocabulary coverage assessment: systematic measurement of base model coverage before and after domain adaptation
Continual learning strategy: how the domain-adapted model is updated as the domain evolves
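The vocabulary coverage assessment can be illustrated with a simple fragmentation measure: the share of domain terms the base model's tokenizer splits into multiple tokens. A high rate suggests a large vocabulary gap; re-measuring after domain adaptation quantifies the improvement. The `tokenize` callable here stands in for whatever tokenizer the base model uses, and the whitespace tokenizer is a toy for illustration only.

```python
def fragmentation_rate(domain_terms, tokenize):
    """Share of domain terms split into more than one token.
    `tokenize` is assumed to be a callable: str -> list of tokens."""
    fragmented = sum(1 for term in domain_terms if len(tokenize(term)) > 1)
    return fragmented / len(domain_terms)

# toy tokenizer (splits on whitespace) purely for illustration
rate = fragmentation_rate(["myocardial infarction", "troponin"], str.split)
```

In practice this runs over a curated domain term list with the base model's actual tokenizer, before and after the domain adaptation stage.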
Type 3 Requirements
Significantly larger training datasets: domain adaptation corpora typically range from hundreds of thousands to millions of examples — data availability must be confirmed before scope is agreed
ML infrastructure plan confirmed: compute requirements for domain adaptation are substantially more demanding than for task-specific fine-tuning
Named ML programme lead with seniority to make training infrastructure and data access decisions
Legal review: for domain adaptation on proprietary domain text, licensing for use in model training must be reviewed before data collection begins

Client Obligations
Provide the actual training data — including poor quality, incorrectly labelled, and edge case examples
The data audit cannot be conducted on a curated sample of the best training data. It must be conducted on a representative sample of the actual intended dataset — including examples collected under inconsistent labelling standards, marginal cases on the quality rubric, and examples from early collection when the labelling standard was not yet finalised. These are precisely the examples that most significantly affect the quality of the resulting fine-tuned model.
If only the best data is provided for the audit
The audit will recommend a quality standard the withheld data does not meet. When the full dataset is used for training, the fine-tuned model will be worse than the specification predicted. The specification was correct — the data was not.
The post-training evaluation protocol must be executed before the model is deployed
The post-training evaluation is not optional. It determines whether the fine-tuned model meets the acceptance criteria — both for task capability and for safety property retention. A fine-tuned model deployed without evaluation against the acceptance criteria may be below the safety property threshold for the deployment context. The EU AI Act treats the organisation that fine-tunes a model as the provider of the resulting AI system, with all provider obligations including the conformity assessment. The post-training evaluation is the technical foundation of that conformity assessment.
If the model is deployed without executing the post-training evaluation
The organisation is deploying a fine-tuned model whose safety properties relative to the deployment context’s requirements are unknown. We document explicitly that the evaluation protocol we specified was not executed before deployment.
RJV Obligations
The Decision Assessment recommendation is based on evidence, not on the direction that generates more revenue
A recommendation to fine-tune leads to a larger engagement at higher cost than a recommendation not to fine-tune. We have a financial incentive to recommend fine-tuning. We declare this conflict explicitly and manage it by: basing the recommendation solely on documented evidence from the decision framework, having the recommendation reviewed by a second RJV consultant, and documenting the rationale in sufficient detail that the client can independently assess whether the conclusion follows from the evidence. If we recommend fine-tuning and the programme subsequently fails in a way the Decision Assessment should have predicted, we will review the decision documentation and acknowledge if the recommendation was not well-founded.
If the Decision Assessment recommends fine-tuning and subsequent evidence shows prompt engineering alone would have been sufficient
This is a recommendation error. We will acknowledge it and discuss what compensation, if any, is appropriate.
Safety property degradation findings reported honestly regardless of whether they support proceeding
The pre-fine-tuning baseline and post-fine-tuning safety property evaluation may reveal that the fine-tuning has degraded safety properties below the acceptable threshold for the deployment context. This finding does not support proceeding with deployment — and we will not present it as if it does. If the safety property evaluation fails the acceptance criteria, the finding is reported as a failure, with the specific properties that failed, the degree of failure, and the options: iterate on the training approach, accept the degraded properties with additional guardrail design, or conclude that fine-tuning for this use case and deployment context is not viable.
If safety property findings are disputed
We review the methodology and results with the client’s ML team and domain expert. If the dispute is substantive — a measurement methodology error or a domain expert assessment that a specific safety property is less important for this deployment context — we revise accordingly. If the dispute is motivated by a preference for a different result rather than a substantive concern, the original finding stands.

Questions that determine whether fine-tuning is the right programme to initiate

Our ML team wants to fine-tune but leadership is sceptical. What does the Decision Assessment give us to resolve the disagreement?
The Decision Assessment produces a documented recommendation with the evidence that supports it — the prompt engineering gap measurement, the data readiness assessment, the alternative intervention comparison, and the safety property risk assessment. This is sufficient evidence for leadership to make an informed decision and for the ML team to understand whether their recommendation is supported by the evidence or contradicted by it. Either outcome resolves the disagreement on the basis of evidence rather than opinion — significantly more useful than an internal debate without a structured evaluation framework.
We have a vendor offering to fine-tune a model for us as part of a platform agreement. Do we need this service?
The vendor offering to fine-tune for you has a commercial interest in the outcome. They will not conduct the Decision Assessment with genuine neutrality on the “do not fine-tune” side, because that outcome terminates the conversation about their fine-tuning service. They will not assess safety property degradation from their fine-tuning with the same rigour as an independent party. The Decision Assessment, data preparation specification, and post-training evaluation protocol are independent assessments that protect your interests regardless of which vendor or infrastructure executes the training. They are also the technical documentation that satisfies your EU AI Act provider obligations — obligations the vendor’s fine-tuning service does not create documentation for.
How much training data do we actually need? We have heard everything from dozens to millions.
The range is real. Format fine-tuning — teaching a specific output format the model does not already produce reliably — may require a few hundred high-quality examples. Style fine-tuning: typically 500–2,000. Domain vocabulary extension: typically 5,000–50,000. Domain adaptation: typically 100,000–1,000,000+. The Decision Assessment produces a data volume estimate calibrated to the specific fine-tuning objective and the base model’s existing coverage of the target domain. This estimate is the starting point for the full data audit in Type 2 and Type 3.
What happens when the base model we fine-tuned is updated or deprecated by the vendor?
This is the least discussed and most significant operational risk of fine-tuning. When the base model vendor releases a new version, your fine-tuned model is based on the old version. You face a choice: continue on the old version (which the vendor will eventually cease supporting), re-fine-tune on the new version (repeating the data preparation and training cost), or switch to a prompt-engineering-based approach on the new model. The Training Specification we deliver for Type 2 and Type 3 includes a re-training protocol defining when re-training is triggered, what data updates are required, and how the re-training is validated. It does not make re-training free — it makes it planned, documented, and executable without repeating the full strategy process.
What is the relationship between fine-tuning and the other LLM services?
Enterprise LLM Strategy establishes whether fine-tuning is part of the platform strategy — the Decision Assessment in this service provides the detailed technical evidence. Prompt Engineering is the intervention to exhaust before concluding fine-tuning is required — we consistently recommend attempting prompt engineering first. RAG Architecture addresses the dynamic knowledge problem that fine-tuning cannot solve. LLMOps is the operational framework for the fine-tuned model in production — the version management, quality monitoring, and re-training protocol that keeps it performing as the domain and base model evolve.
What are your payment terms?
50% on contract signature, 50% on delivery acceptance for all engagement types. The £7,500 Type 1 fee is credited against Type 2 or Type 3 if engaged within 90 days of Type 1 delivery. The credit applies to the signing payment — the effective signing payment for Type 2 is £6,500 (50% of £28,000 minus the £7,500 credit). If the Type 1 recommendation is not to fine-tune, there is no subsequent engagement and the credit is not applied. Scope additions are invoiced as agreed in writing before execution. If a data preparation specification cannot be completed because the full data audit reveals the data is fundamentally unsuitable — a finding the Type 1 200-item sample should have surfaced but did not — we will discuss the appropriate resolution at the time of the finding.

Start with the Decision Assessment. Bring your use case, your data, and the reasons your team believes fine-tuning is the correct intervention.

90 minutes to review the use case, the current model performance with the best available prompt, and a sample of the intended training data. We apply the decision framework in the session and give you a preliminary view on whether fine-tuning is the correct intervention, what alternative interventions might close the gap faster or more cheaply, and what data quality problems are visible in the sample. This is the fastest route to an informed decision on whether to initiate a fine-tuning programme.

If the preliminary assessment suggests fine-tuning is likely the right intervention, we proceed to the formal Decision Assessment. If it suggests it is not — that prompt engineering or RAG would close the gap more effectively — we will say so in the session. The session is free. The Decision Assessment is £7,500 and produces the documented evidence that either grounds the fine-tuning programme or prevents an incorrect one.

Format
Video call or in-person in London. 90 minutes.
Cost
Free. No commitment. The formal Decision Assessment (£7,500) follows if you want to proceed to a documented recommendation.
Lead time
Within 5 business days of contact.
Bring
Your use case: what the fine-tuned model should do that the current model cannot. Your current performance data: what the model produces today with the best available prompt. A 20–30 item sample of your intended training data — including the difficult examples. Your deployment context: the regulatory environment, the consequence of errors, any specific safety requirements.
Attendees
ML lead or principal engineer and the business owner of the use case. Both are needed. From RJV: a senior ML strategist with no infrastructure vendor affiliation.
After
Written summary of preliminary findings within 2 business days. Formal Decision Assessment proposal within 5 business days if you want to proceed.