The LLM vendor market is the fastest-moving enterprise technology market in a generation. That speed creates specific decision traps that do not exist in slower markets: the evaluation criteria that were correct six months ago may be wrong today, the PoC conditions that validated the choice may not hold in production, and the contract terms that seemed reasonable at signing may be catastrophic at the scale the platform reaches in year two. Each failure mode below is drawn from enterprise deployments that have been publicly reported, or reflects a pattern we observe consistently in the organisations we assess.
01
The PoC used clean, curated data. Production uses the actual data.
Proof-of-concept evaluations almost always use the organisation’s best data: the well-formatted documents, the clearly written tickets, the clean customer records. The model performs well. The organisation commits. Production ingests the actual data estate: inconsistently formatted records, partial documents, legacy extracts with encoding errors, free-text fields written by people in a hurry with abbreviations specific to the organisation’s culture. The model that performed at 94% accuracy on curated PoC data performs at 61% on production data. The difference was always there — the PoC was not designed to find it.
Cost of discovering this in production
Rework: re-evaluation against production data, potential platform switch, re-implementation. Timeline: typically 4–8 months. Cost: typically 2–4× the original implementation cost. In one reported case: a 14-month customer service LLM programme halted and restarted from vendor selection after production accuracy was unacceptable. Total rework cost: £2.3M.
How to prevent it
Evaluation conducted on a representative sample of production data — not curated data, not the best 1% of the data estate. The evaluation sample must include the tail: the malformed records, the ambiguous inputs, the edge cases that are rare but consequential. A model that performs well only on clean data is not ready for production. The strategy engagement specifies the evaluation dataset before any vendor trial begins.
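One way to make "include the tail" operational is to build the evaluation set by stratified sampling that forces a fixed share of tail cases, rather than sampling uniformly and letting clean records dominate. The sketch below is illustrative only: `build_eval_sample`, the `is_tail` predicate, and the toy record structure are hypothetical, and a real engagement would define tail membership from the organisation's own data quality audit.

```python
import random

def build_eval_sample(records, is_tail, n=1000, tail_fraction=0.2, seed=7):
    """Draw an evaluation set from production records, forcing a fixed
    share of tail cases (malformed, ambiguous, rare) so the sample
    cannot silently become the best 1% of the data estate."""
    rng = random.Random(seed)
    tail = [r for r in records if is_tail(r)]
    head = [r for r in records if not is_tail(r)]
    n_tail = min(len(tail), int(n * tail_fraction))
    sample = rng.sample(tail, n_tail) + rng.sample(head, min(len(head), n - n_tail))
    rng.shuffle(sample)
    return sample

# Toy production estate: 95% clean records, 5% malformed free text.
estate = [{"text": "clean record", "malformed": False}] * 950 + \
         [{"text": "\x00garbled//", "malformed": True}] * 50
sample = build_eval_sample(estate, lambda r: r["malformed"], n=100)
```

Fixing the tail fraction in advance is the point: the sample composition is specified before any vendor trial begins, so no vendor's accuracy number can be an artefact of a lucky draw.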
02
The model was selected for capability. The cost at production volume is unaffordable.
Token-based pricing means that LLM cost scales with usage in a way that is easy to underestimate at PoC scale and catastrophic to discover at production scale. A model that costs £200/month in PoC costs £85,000/month at production volume for a high-throughput application. This is not a surprise that emerges from unusual usage — it is the direct mathematical consequence of applying the per-token pricing to the production usage volume. Organisations that did not model this before selecting a platform are locked into a contract at a price point that makes the application economically unviable.
Cost of discovering this in production
Options at this point: renegotiate contract (unlikely to succeed), switch to a cheaper model mid-production (requires re-evaluation, re-testing, re-deployment), reduce throughput to fit budget (defeats the purpose of the application), or absorb the cost (eroding the ROI case entirely). In documented cases: organisations have switched from frontier models to smaller open-source models mid-deployment and accepted 6–8 months of parallel operation while the switch was validated.
How to prevent it
Total cost of ownership modelled before vendor selection: production volume forecast, token count per request at production input distribution (not PoC), output token count at production (not estimated), context window usage per request type, and the cost curve at 1×, 5×, and 10× expected volume for headroom. Model selection explicitly includes cost-per-request as a first-class evaluation criterion alongside capability. Smaller models that meet the capability threshold at lower cost are preferred over frontier models that exceed the capability threshold at unaffordable cost.
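The cost-curve check above is simple arithmetic that can be written down before any contract is signed. A minimal sketch, with entirely illustrative token counts and per-million-token prices (not any specific vendor's rates):

```python
def monthly_cost(requests_per_month, in_tokens, out_tokens,
                 price_in_per_m, price_out_per_m):
    """Projected monthly spend from per-token pricing.
    Prices are per million tokens; token counts are per-request averages
    taken from the production input distribution, not the PoC."""
    per_request = (in_tokens * price_in_per_m
                   + out_tokens * price_out_per_m) / 1_000_000
    return requests_per_month * per_request

# Illustrative assumptions: 400k requests/month at baseline,
# 2,500 input tokens and 500 output tokens per request,
# £2.50/M input and £10.00/M output tokens.
projection = {m: monthly_cost(400_000 * m, 2_500, 500, 2.50, 10.00)
              for m in (1, 5, 10)}  # cost at 1x, 5x, 10x volume
```

Because cost is linear in volume under per-token pricing, the 10× figure is not a forecast risk to be debated later: it is the baseline figure multiplied by ten, and it belongs in the vendor evaluation scorecard next to capability.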
03
The selected model’s context window is too small for the actual production context.
Context window size determines how much text the model can process in a single request — the document being analysed, the conversation history, the system prompt, and the instructions all consume context window capacity. At PoC, requests are small and the context window is not a constraint. In production, the average request exceeds the context window: the customer service LLM needs to process a 30-page policy document that is larger than the context window; the legal review LLM needs to hold multiple contract sections in context simultaneously. The model cannot do the task it was selected to do.
What happens next
Chunking and retrieval-augmented generation (RAG) are the standard workarounds — but they introduce their own failure modes: retrieval misses relevant context, chunk boundaries split reasoning, the model produces answers that are locally correct but globally inconsistent across the full document. The workaround changes the system’s behaviour in ways that must be re-evaluated before the original accuracy claims can be maintained.
How to prevent it
Context window requirement analysis conducted on the actual production inputs before vendor evaluation begins. Every request type: the system prompt size, the instruction size, the input document or conversation size at the 50th and 95th percentile, and the expected output size. Context window requirement is an explicit minimum specification. Any model that does not meet the 95th-percentile context requirement is excluded from consideration before capability evaluation begins — capability on inputs the model cannot fit in its context window is irrelevant.
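The requirement analysis above reduces to one calculation per request type: fixed overhead (system prompt, instructions, expected output) plus the observed input size at the 50th and 95th percentile. A sketch, with a hypothetical request type and made-up token counts — a real analysis would tokenise actual production inputs with the candidate model's tokeniser:

```python
def percentile(values, p):
    """Nearest-rank percentile over a list of observed token counts."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def context_requirement(system_tokens, instruction_tokens,
                        input_token_samples, expected_output_tokens):
    """Minimum context window needed at the 50th and 95th percentile
    of the production input size distribution."""
    fixed = system_tokens + instruction_tokens + expected_output_tokens
    return {p: fixed + percentile(input_token_samples, p) for p in (50, 95)}

# Hypothetical request type: long policy documents as inputs.
input_samples = [18_000, 22_000, 25_000, 31_000, 44_000, 52_000, 60_000, 75_000]
req = context_requirement(800, 400, input_samples, 1_000)

# Binary gate: a model below the 95th-percentile requirement is
# excluded before capability evaluation begins.
fits_128k = req[95] <= 128_000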
04
Data residency and sovereignty requirements were not assessed before platform selection.
Many LLM vendor platforms default to US-based data processing. For organisations with NHS data, regulated financial data, government data, or client contractual obligations requiring UK or EU data processing, this means the default platform configuration is immediately non-compliant. The organisation discovers this either during a data protection impact assessment conducted after platform selection, or during a supplier security review, or — worst — after data has been processed through a non-compliant endpoint. Switching to a compliant configuration may require a different pricing tier, a different API endpoint, or in some cases a different vendor entirely.
What this looks like in practice
An NHS-connected digital health company selects a US-based LLM provider for patient-facing functionality. Post-selection data protection impact assessment identifies that the default API endpoint processes data in US-East. The UK endpoint is available but only on the Enterprise tier — at 3× the cost of the tier selected. Re-negotiation required. Timeline delay: 11 weeks. Organisations under stricter constraints (government security classification, patient-identifiable data under the DSP Toolkit) have found no compliant configuration available from their selected vendor and have restarted vendor selection entirely.
How to prevent it
Data residency requirements mapped and documented before vendor shortlisting begins. Every data type the application will process: its regulatory classification, the jurisdiction it must remain in, and the vendor configurations that satisfy those requirements. Vendors that cannot offer a compliant configuration are excluded from the shortlist regardless of capability. Data residency compliance is a binary requirement — it is not traded against capability.
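Because residency is binary, the shortlisting step is a straightforward set check: a vendor survives only if it offers a configuration covering every required jurisdiction. The vendor names, data types, and region sets below are entirely illustrative:

```python
# Required jurisdiction per data type the application will process
# (hypothetical mapping from the regulatory classification exercise).
DATA_REQUIREMENTS = {
    "patient_records": "UK",
    "support_tickets": "EU",
}

# Processing regions each candidate vendor can contractually guarantee
# (illustrative — taken from vendor documentation and DPA terms in practice).
VENDOR_REGIONS = {
    "vendor_a": {"US"},
    "vendor_b": {"US", "EU", "UK"},
    "vendor_c": {"EU"},
}

def compliant_vendors(requirements, vendor_regions):
    """A vendor is shortlisted only if every required jurisdiction is
    covered; capability is never consulted at this stage."""
    needed = set(requirements.values())
    return sorted(v for v, regions in vendor_regions.items()
                  if needed <= regions)

shortlist = compliant_vendors(DATA_REQUIREMENTS, VENDOR_REGIONS)
```

Running this before capability evaluation keeps the trade explicit: a vendor excluded here is excluded regardless of how well it scores on anything else.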
05
The latency of the selected model is incompatible with the application’s user experience requirements.
Frontier LLMs with the largest context windows and highest capability also have the highest latency. Time-to-first-token and time-to-completion vary significantly between models, between vendors, between API tiers, and under load. A model that responds in 800ms under PoC conditions may respond in 4.2 seconds under production load. For a customer-facing application where the user waits for a response, 4.2 seconds is unacceptable. The latency was in the model’s specification; it was not measured under production-representative load conditions before selection.
What happens next
Streaming responses mitigate the perception of latency for conversational applications but do not reduce total processing time for batch applications. Smaller, faster models are evaluated post-selection as alternatives — adding the evaluation and re-implementation cost that should have been part of the original selection process. In real-time applications (voice assistants, live customer service, trading systems) the latency constraint may exclude all frontier models and require an open-source deployment on controlled infrastructure to achieve the required response time.
How to prevent it
Latency requirements defined by application type before vendor evaluation: maximum acceptable time-to-first-token, maximum acceptable total response time at the 95th percentile, and whether streaming is acceptable for the application’s UX. Vendor evaluation conducted under load — not single-request timing, but representative concurrent request load. Any model that does not meet the 95th-percentile latency requirement under representative load is excluded regardless of capability scores.
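Measuring the 95th-percentile under concurrent load, rather than single-request timing, is a small amount of harness code. The sketch below uses a stub in place of a real API call — `call_model`, the simulated latency, and the 2-second budget are all placeholders; in a real evaluation the stub is replaced with the vendor SDK and the concurrency level matches forecast production traffic:

```python
import concurrent.futures
import random
import time

def call_model(_):
    """Stand-in for a vendor API call; replace with the real SDK call.
    Returns the observed wall-clock latency for one request."""
    started = time.perf_counter()
    time.sleep(random.uniform(0.001, 0.005))  # placeholder work
    return time.perf_counter() - started

def p95_latency(n_requests=200, concurrency=20):
    """Issue n_requests with the given concurrency and return the
    95th-percentile latency — the figure single-request timing hides."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(call_model, range(n_requests)))
    return latencies[int(0.95 * len(latencies)) - 1]

REQUIREMENT_S = 2.0  # hypothetical p95 budget for this application type
meets_requirement = p95_latency() <= REQUIREMENT_S
```

The pass/fail is applied as an exclusion gate, matching the rule above: a model that misses the p95 budget under representative load is out, whatever its capability scores.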
06
The vendor’s terms allow training on your data. You did not read them.
LLM vendor terms of service have changed materially and repeatedly since 2022. Default terms for several major providers have included clauses permitting the vendor to use customer input data for model training or improvement. Opting out of data training typically requires a specific enterprise agreement, a specific API configuration, or both. Organisations that signed standard agreements and did not read the data usage provisions may have been permitting their proprietary data — client communications, internal documents, financial data, clinical data — to be used in ways their data protection obligations prohibit and their clients did not consent to.
Regulatory consequence
UK GDPR Article 28 requires a data processing agreement with any processor handling personal data. If the vendor’s terms permitted uses beyond the organisation’s stated processing purpose — specifically, model training — the data processing agreement may not have satisfied Article 28 requirements. ICO enforcement and client contractual liability both arise from this. The data that was processed under non-compliant terms cannot be un-processed.
How to prevent it
Commercial terms review as a mandatory component of vendor evaluation — before any data is sent to any vendor API during evaluation. Terms assessed for: data retention and deletion policy, training data opt-out provisions and how to invoke them, data processing geography, sub-processor list, breach notification obligations, and compliance with UK GDPR Article 28 requirements. Any vendor whose default terms do not satisfy the organisation’s data processing requirements is excluded or placed on a negotiated-terms-required shortlist.
07
The chosen architecture is more complex than the use case requires.
The fastest-moving area of LLM deployment is also the area with the strongest engineering pull towards complexity: agentic systems, multi-model pipelines, RAG with vector databases, fine-tuned models with custom inference infrastructure. These are the right choices for some problems. They are not the right choices for organisations deploying an LLM to classify incoming customer enquiries into one of eight categories. A small, fast, cheap model with a well-designed system prompt solves that problem in 4 weeks. A fine-tuned model with a vector database and an agent orchestration layer solves it in 6 months at 20× the cost and with 5× the operational complexity. The engineering team chose the architecture that was technically interesting, not the architecture that was appropriate.
What this costs
Over-engineered LLM architectures are expensive to build, expensive to operate, expensive to debug, and expensive to change. They also tend to fail in more complex ways — a simple model with a system prompt fails in ways that are immediately visible; a multi-agent pipeline fails in ways that require significant investigation to trace. The maintenance cost of a complex architecture is carried indefinitely. The value delivered is identical to what a simple architecture would have delivered, in less time, at lower cost.
How to prevent it
Architecture selection begins from the use case requirements, not from the available technology. Each requirement is mapped to the minimum architecture that satisfies it: if a single model call with a well-designed system prompt meets the accuracy requirement, the strategy specifies that. RAG is specified only when the model genuinely needs access to information that cannot be in its system prompt or context window. Fine-tuning is specified only when the use case cannot be met by a general-purpose model with prompt engineering. Agentic architectures are specified only when the task genuinely requires the model to take sequential actions with external tool access.
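For the eight-category enquiry classifier described above, the minimum architecture really is this small: one model call, one well-designed system prompt, a fail-closed fallback. The sketch below is hypothetical — `chat` stands in for whichever vendor client the strategy selects, and the category labels are invented:

```python
# Minimal-architecture sketch: no RAG, no fine-tuning, no agents.
CATEGORIES = ["billing", "delivery", "returns", "complaints",
              "technical", "account", "sales", "other"]

SYSTEM_PROMPT = (
    "You classify customer enquiries. Reply with exactly one label from: "
    + ", ".join(CATEGORIES) + ". Reply with the label only."
)

def classify(enquiry, chat):
    """One model call; any response outside the label set fails closed
    to 'other' for human triage."""
    label = chat(system=SYSTEM_PROMPT, user=enquiry).strip().lower()
    return label if label in CATEGORIES else "other"

# Stub client so the sketch runs without a vendor dependency;
# a real deployment swaps in the selected vendor's SDK call.
def stub_chat(system, user):
    return "billing" if "invoice" in user.lower() else "unrecognised"

result = classify("Where is my invoice?", stub_chat)
```

Everything beyond this — retrieval, fine-tuning, orchestration — is added only when a specific requirement demonstrably cannot be met by the simple version.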
08
The organisation is locked into a vendor whose model no longer leads on the relevant task.
The LLM capability landscape moves on a 6-month cycle. The model that led on coding tasks in Q1 2024 was not the leader by Q3 2024. The model that led on long-document analysis in Q2 2024 was not the leader by Q4 2024. Organisations that built tightly coupled integrations with a specific model’s API — using model-specific features, fine-tuning on model-specific formats, building prompt templates that depend on model-specific behaviour — are significantly more expensive to migrate when the landscape changes than organisations that built against an abstraction layer. The migration cost was not zero when the decision to build tightly coupled was made; it was deferred and compounded.
What this looks like at migration time
A tightly coupled integration built against GPT-4 in 2023 required 4–6 months to migrate to a different frontier model in 2024 for several documented enterprise deployments, because: prompt templates depended on GPT-4-specific response formatting that other models did not replicate; fine-tuning had been conducted on OpenAI’s fine-tuning infrastructure with a format not portable to other platforms; monitoring and evaluation infrastructure was built against OpenAI’s API response structure. Total migration cost: typically 40–70% of the original implementation cost.
How to prevent it
Architecture designed with a model abstraction layer from the start: the application calls an interface, not a model. The interface routes to the current selected model. Swapping the model requires changing one configuration value, not refactoring the application. Prompt templates written to be model-agnostic — testing across multiple models before finalisation, not optimised for a single model's response characteristics. Fine-tuning avoided unless the capability gain over prompt engineering is demonstrated to be material on the production task — because fine-tuned models are the hardest to migrate.
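The abstraction layer can be as simple as a registry of provider adapters behind one configuration entry. A minimal sketch — the provider names and adapter bodies are stubs, and a real adapter would wrap the vendor SDK and normalise its response format:

```python
# Model abstraction layer: the application calls generate(); a single
# configuration value decides which provider adapter actually runs.
ADAPTERS = {}

def register(name):
    """Decorator that adds a provider adapter to the registry."""
    def wrap(fn):
        ADAPTERS[name] = fn
        return fn
    return wrap

@register("provider_a")
def _provider_a(prompt):
    # Real adapter: call provider A's SDK, normalise the response.
    return f"[provider_a] {prompt}"

@register("provider_b")
def _provider_b(prompt):
    # Real adapter: call provider B's SDK, normalise the response.
    return f"[provider_b] {prompt}"

CONFIG = {"model": "provider_a"}  # swapping models = changing this value

def generate(prompt):
    """The only entry point application code ever calls."""
    return ADAPTERS[CONFIG["model"]](prompt)
```

Because application code only ever imports `generate`, a migration when the capability landscape shifts touches the registry and the configuration, not the application — the deferred-and-compounded migration cost described above is paid up front, once, and it is small.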