
Enterprise LLM Strategy & Vendor Selection


Organisations are committing to language model platforms under commercial pressure that does not allow the time the decision actually requires. The vendor landscape is moving fast enough that a model that was the right choice six months ago may not be the right choice today. The contracts being signed lock organisations into platforms, pricing structures, and data arrangements for 24–36 months — long enough for the landscape to shift significantly and for the locked-in choice to become a constraint rather than a capability.

Most organisations making these decisions are choosing between models and platforms they have evaluated with proof-of-concept prompts on clean test data, in isolation from the production constraints that will determine whether the choice works at scale. The model that performed best in the PoC is not always the model that performs best in production, under load, on the actual distribution of real-world inputs, with the latency requirements of the production application, and at the cost structure of production usage volumes.

This service produces an LLM strategy and vendor selection that is grounded in your specific use case, your specific constraints, and an honest evaluation of the available options against those constraints — not a ranked list of the most popular models, not a framework that tells you what questions to ask, but a specific recommendation with documented reasoning that you can take to your board and defend. If the recommendation is that no currently available model is right for your use case, we will say that rather than recommend one anyway.

Price Range
£18,000 – £120,000+
Strategy, use case analysis, vendor evaluation, and architecture design. Implementation is separate and additional.
Duration
6 – 18 weeks
Strategy phase only. Implementation and deployment of the selected platform are outside this engagement’s scope.
Independence
We have no commercial relationship with any LLM vendor. No referral fees, no implementation revenue, no preferred partner arrangements. The recommendation is based on your requirements — not on which vendor pays us.
Scope boundary
Strategy, use case definition, vendor evaluation, architecture design, commercial terms guidance, and implementation specification. Procurement negotiation, platform deployment, and ongoing operations are outside scope.
Tiers
Focused (single use case) · Professional (multi-use case portfolio) · Enterprise (organisation-wide strategy)
Contract
Fixed-price. 50% on signing, 50% on delivery acceptance.
Vendor neutrality is non-negotiable
Any LLM strategy consultant with a commercial relationship with a specific vendor has a conflict of interest in your selection process — whether or not they disclose it. We declare all vendor relationships before engagement begins. Currently: none. If that changes, we will disclose it before any affected recommendation is made. We will not make a recommendation influenced by a commercial relationship we have not disclosed.

Eight specific failure modes. Each one costs more to correct after commitment than it would have cost to avoid before it.

The LLM vendor market is the fastest-moving enterprise technology market in a generation. That speed creates specific decision traps that do not exist in slower markets: the evaluation criteria that were correct six months ago may be wrong today, the PoC conditions that validated the choice may not hold in production, and the contract terms that seemed reasonable at signing may be catastrophic at the scale the platform reaches in year two. Each failure mode below is documented from enterprise deployments that have been publicly reported or that represent a pattern we observe consistently in the organisations we assess.

01
The PoC used clean, curated data. Production uses the actual data.
Proof-of-concept evaluations almost always use the organisation’s best data: the well-formatted documents, the clearly written tickets, the clean customer records. The model performs well. The organisation commits. Production ingests the actual data estate: inconsistently formatted records, partial documents, legacy extracts with encoding errors, free-text fields written by people in a hurry with abbreviations specific to the organisation’s culture. The model that performed at 94% accuracy on curated PoC data performs at 61% on production data. The difference was always there — the PoC was not designed to find it.
Cost of discovering this in production
Rework: re-evaluation against production data, potential platform switch, re-implementation. Timeline: typically 4–8 months. Cost: typically 2–4× the original implementation cost. In one reported case: a 14-month customer service LLM programme halted and restarted from vendor selection after production accuracy was unacceptable. Total rework cost: £2.3M.
How to prevent it
Evaluation conducted on a representative sample of production data — not curated data, not the best 1% of the data estate. The evaluation sample must include the tail: the malformed records, the ambiguous inputs, the edge cases that are rare but consequential. A model that performs well only on clean data is not ready for production. The strategy engagement specifies the evaluation dataset before any vendor trial begins.
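One way to guarantee the tail is represented is to sample the evaluation set by stratum rather than uniformly. A minimal sketch — the stratum labels, quotas, and record counts below are illustrative assumptions, not a prescribed taxonomy:

```python
# Sketch: build an evaluation sample that deliberately includes the tail,
# rather than the best-formatted records. Stratum labels and quotas are
# illustrative assumptions.
import random

def stratified_eval_sample(records, strata_quotas, seed=0):
    """Draw a fixed quota from each stratum so rare-but-consequential
    inputs cannot be diluted out of the evaluation set."""
    rng = random.Random(seed)
    sample = []
    for stratum, quota in strata_quotas.items():
        pool = [r for r in records if r["stratum"] == stratum]
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample

records = (
    [{"stratum": "clean", "text": f"well-formed record {i}"} for i in range(900)]
    + [{"stratum": "malformed", "text": f"bad encoding {i}"} for i in range(60)]
    + [{"stratum": "ambiguous", "text": f"unclear intent {i}"} for i in range(40)]
)
quotas = {"clean": 120, "malformed": 40, "ambiguous": 40}  # oversample the tail
sample = stratified_eval_sample(records, quotas)
print(len(sample))  # 200-item sample spanning the full distribution
```

Uniform sampling from a mostly-clean data estate would leave the malformed and ambiguous records underrepresented; fixed quotas per stratum make the tail an explicit, auditable part of the evaluation set.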
02
The model was selected for capability. The cost at production volume is unaffordable.
Token-based pricing means that LLM cost scales with usage in a way that is easy to underestimate at PoC scale and catastrophic to discover at production scale. A model that costs £200/month in PoC costs £85,000/month at production volume for a high-throughput application. This is not a surprise that emerges from unusual usage — it is the direct mathematical consequence of applying the per-token pricing to the production usage volume. Organisations that did not model this before selecting a platform are locked into a contract at a price point that makes the application economically unviable.
Cost of discovering this in production
Options at this point: renegotiate contract (unlikely to succeed), switch to a cheaper model mid-production (requires re-evaluation, re-testing, re-deployment), reduce throughput to fit budget (defeats the purpose of the application), or absorb the cost (eroding the ROI case entirely). In documented cases: organisations have switched from frontier models to smaller open-source models mid-deployment and accepted 6–8 months of parallel operation while the switch was validated.
How to prevent it
Total cost of ownership modelled before vendor selection: production volume forecast, token count per request at production input distribution (not PoC), output token count at production (not estimated), context window usage per request type, and the cost curve at 1×, 3×, and 10× expected volume for headroom. Model selection explicitly includes cost-per-request as a first-class evaluation criterion alongside capability. Smaller models that meet the capability threshold at lower cost are preferred over frontier models that exceed the capability threshold at unaffordable cost.
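The cost curve itself is simple arithmetic once the inputs are measured. A minimal sketch — the per-token rates, token counts, and volumes below are placeholder assumptions, not any vendor's actual pricing:

```python
# Illustrative token-cost model. All prices and volumes are placeholder
# assumptions -- substitute your vendor's actual per-token rates and your
# measured production request profile.

def monthly_cost(requests_per_month: int,
                 avg_input_tokens: int,
                 avg_output_tokens: int,
                 input_price_per_1k: float,
                 output_price_per_1k: float) -> float:
    """Projected monthly spend for one request type at one volume."""
    per_request = (avg_input_tokens / 1000 * input_price_per_1k
                   + avg_output_tokens / 1000 * output_price_per_1k)
    return requests_per_month * per_request

# Cost curve at multiples of expected volume, as the text recommends.
base_volume = 500_000  # hypothetical requests/month
for multiple in (1, 3, 10):
    cost = monthly_cost(base_volume * multiple,
                        avg_input_tokens=2_400,     # measured, not PoC-estimated
                        avg_output_tokens=350,
                        input_price_per_1k=0.0025,  # placeholder rate
                        output_price_per_1k=0.0100) # placeholder rate
    print(f"{multiple:>2}x volume: £{cost:,.0f}/month")
```

The point of the sketch is that nothing here is hard to compute; the failures described above come from not measuring the inputs (real token counts, real volumes) before the contract is signed.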
03
The selected model’s context window is too small for the actual production context.
Context window size determines how much text the model can process in a single request — the document being analysed, the conversation history, the system prompt, and the instructions all consume context window. At PoC, requests are small and the context window is not a constraint. At production, the average request exceeds the context window: the customer service LLM needs to process a 30-page policy document that is larger than the context window; the legal review LLM needs to hold multiple contract sections in context simultaneously. The model cannot do the task it was selected to do.
What happens next
Chunking and retrieval-augmented generation (RAG) are the standard workarounds — but they introduce their own failure modes: retrieval misses relevant context, chunk boundaries split reasoning, the model produces answers that are locally correct but globally inconsistent across the full document. The workaround changes the system’s behaviour in ways that must be re-evaluated before the original accuracy claims can be maintained.
How to prevent it
Context window requirement analysis conducted on the actual production inputs before vendor evaluation begins. Every request type: the system prompt size, the instruction size, the input document or conversation size at the 50th and 95th percentile, and the expected output size. Context window requirement is an explicit minimum specification. Any model that does not meet the 95th-percentile context requirement is excluded from consideration before capability evaluation begins — capability on inputs the model cannot fit in its context window is irrelevant.
04
Data residency and sovereignty requirements were not assessed before platform selection.
Many LLM vendor platforms default to US-based data processing. For organisations with NHS data, regulated financial data, government data, or client contractual obligations requiring UK or EU data processing, this means the default platform configuration is immediately non-compliant. The organisation discovers this either during a data protection impact assessment conducted after platform selection, or during a supplier security review, or — worst — after data has been processed through a non-compliant endpoint. Switching to a compliant configuration may require a different pricing tier, a different API endpoint, or in some cases a different vendor entirely.
What this looks like in practice
An NHS-connected digital health company selects a US-based LLM provider for patient-facing functionality. Post-selection data protection impact assessment identifies that the default API endpoint processes data in US-East. The UK endpoint is available but only on the Enterprise tier — at 3× the cost of the tier selected. Re-negotiation required. Timeline delay: 11 weeks. Organisations under stricter constraints (government security classification, patient-identifiable data under DSP Toolkit) found no compliant configuration available from their selected vendor and restarted vendor selection entirely.
How to prevent it
Data residency requirements mapped and documented before vendor shortlisting begins. Every data type the application will process: its regulatory classification, the jurisdiction it must remain in, and the vendor configurations that satisfy those requirements. Vendors that cannot offer a compliant configuration are excluded from the shortlist regardless of capability. Data residency compliance is a binary requirement — it is not traded against capability.
05
The latency of the selected model is incompatible with the application’s user experience requirements.
Frontier LLMs with the largest context windows and highest capability also have the highest latency. Time-to-first-token and time-to-completion vary significantly between models, between vendors, between API tiers, and under load. A model that responds in 800ms under PoC conditions may respond in 4.2 seconds under production load. For a customer-facing application where the user waits for a response, 4.2 seconds is unacceptable. The latency was in the model’s specification; it was not measured under production-representative load conditions before selection.
What happens next
Streaming responses mitigate the perception of latency for conversational applications but do not reduce total processing time for batch applications. Smaller, faster models are evaluated post-selection as alternatives — adding the evaluation and re-implementation cost that should have been part of the original selection process. In real-time applications (voice assistants, live customer service, trading systems) the latency constraint may exclude all frontier models and require an open-source deployment on controlled infrastructure to achieve the required response time.
How to prevent it
Latency requirements defined by application type before vendor evaluation: maximum acceptable time-to-first-token, maximum acceptable total response time at the 95th percentile, and whether streaming is acceptable for the application’s UX. Vendor evaluation conducted under load — not single-request timing, but representative concurrent request load. Any model that does not meet the 95th-percentile latency requirement under representative load is excluded regardless of capability scores.
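Measuring under concurrent load rather than single-request timing can be sketched as follows. The `call_model` function is a stand-in for the vendor API under evaluation, and the concurrency level is an illustrative assumption:

```python
# Sketch: p95 latency measured under concurrent load, not single-request
# timing. call_model is a stand-in for the real vendor client; the sleep
# simulates an API round trip for this illustration only.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str) -> str:
    time.sleep(0.01)  # placeholder for the real API round trip
    return "response"

def timed_call(prompt: str) -> float:
    start = time.perf_counter()
    call_model(prompt)
    return time.perf_counter() - start

prompts = ["representative production request"] * 200
with ThreadPoolExecutor(max_workers=25) as pool:  # representative concurrency
    latencies = list(pool.map(timed_call, prompts))

p95 = statistics.quantiles(latencies, n=20)[-1]
print(f"p95 latency under load: {p95 * 1000:.0f} ms")
# Compare p95 against the documented latency requirement; exclude any
# model that misses it regardless of capability score.
```

The concurrency level and request mix should mirror the production application's peak, because the point of the measurement is precisely the difference between single-request and under-load behaviour.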
06
The vendor’s terms allow training on your data. You did not read them.
LLM vendor terms of service have changed materially and repeatedly since 2022. Default terms for several major providers have included clauses permitting the vendor to use customer input data for model training or improvement. Opting out of data training typically requires a specific enterprise agreement, a specific API configuration, or both. Organisations that signed standard agreements and did not read the data usage provisions may have been permitting their proprietary data — client communications, internal documents, financial data, clinical data — to be used in ways their data protection obligations prohibit and their clients did not consent to.
Regulatory consequence
UK GDPR Article 28 requires a data processing agreement with any processor handling personal data. If the vendor’s terms permitted uses beyond the organisation’s stated processing purpose — specifically, model training — the data processing agreement may not have satisfied Article 28 requirements. ICO enforcement and client contractual liability both arise from this. The data that was processed under non-compliant terms cannot be un-processed.
How to prevent it
Commercial terms review as a mandatory component of vendor evaluation — before any data is sent to any vendor API during evaluation. Terms assessed for: data retention and deletion policy, training data opt-out provisions and how to invoke them, data processing geography, sub-processor list, breach notification obligations, and compliance with UK GDPR Article 28 requirements. Any vendor whose default terms do not satisfy the organisation’s data processing requirements is excluded or placed on a negotiated-terms-required shortlist.
07
The chosen architecture is more complex than the use case requires.
The fastest-moving area of LLM deployment is also the area with the strongest engineering pull towards complexity: agentic systems, multi-model pipelines, RAG with vector databases, fine-tuned models with custom inference infrastructure. These are the right choices for some problems. They are not the right choices for organisations deploying an LLM to classify incoming customer enquiries into one of eight categories. A small, fast, cheap model with a well-designed system prompt solves that problem in 4 weeks. A fine-tuned model with a vector database and an agent orchestration layer solves it in 6 months at 20× the cost and with 5× the operational complexity. The engineering team chose the architecture that was technically interesting, not the architecture that was appropriate.
What this costs
Over-engineered LLM architectures are expensive to build, expensive to operate, expensive to debug, and expensive to change. They also tend to fail in more complex ways — a simple model with a system prompt fails in ways that are immediately visible; a multi-agent pipeline fails in ways that require significant investigation to trace. The maintenance cost of a complex architecture is carried indefinitely. The value delivered is identical to what a simple architecture would have delivered, in less time, at lower cost.
How to prevent it
Architecture selection begins from the use case requirements, not from the available technology. Each requirement is mapped to the minimum architecture that satisfies it: if a single model call with a well-designed system prompt meets the accuracy requirement, the strategy specifies that. RAG is specified only when the model genuinely needs access to information that cannot be in its system prompt or context window. Fine-tuning is specified only when the use case cannot be met by a general-purpose model with prompt engineering. Agentic architectures are specified only when the task genuinely requires the model to take sequential actions with external tool access.
08
The organisation is locked into a vendor whose model no longer leads on the relevant task.
The LLM capability landscape moves on a 6-month cycle. The model that led on coding tasks in Q1 2024 was not the leader by Q3 2024. The model that led on long-document analysis in Q2 2024 was not the leader by Q4 2024. Organisations that built tightly coupled integrations with a specific model’s API — using model-specific features, fine-tuning on model-specific formats, building prompt templates that depend on model-specific behaviour — are significantly more expensive to migrate when the landscape changes than organisations that built against an abstraction layer. The migration cost was not zero when the decision to build tightly coupled was made; it was deferred and compounded.
What this looks like at migration time
A tightly coupled integration built against GPT-4 in 2023 required 4–6 months to migrate to a different frontier model in 2024 for several documented enterprise deployments, because: prompt templates depended on GPT-4-specific response formatting that other models did not replicate; fine-tuning had been conducted on OpenAI’s fine-tuning infrastructure with a format not portable to other platforms; monitoring and evaluation infrastructure was built against OpenAI’s API response structure. Total migration cost: typically 40–70% of the original implementation cost.
How to prevent it
Architecture designed with a model abstraction layer from the start: the application calls an interface, not a model. The interface routes to the current selected model. Swapping the model requires changing one configuration, not refactoring the application. Prompt templates written to be model-agnostic — testing across multiple models before finalisation, not optimised for a single model’s response characteristics. Fine-tuning avoided unless the capability gain over prompt engineering is demonstrated to be material on the production task — because fine-tuned models are the hardest to migrate.
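The abstraction layer described above can be sketched as a single interface with per-vendor adapters behind a configuration-driven registry. Vendor names and adapter bodies here are illustrative stand-ins, not real SDK calls:

```python
# Sketch of a model abstraction layer: the application depends on one
# interface, and the concrete model behind it is a configuration choice.
# Provider names and adapter bodies are illustrative stand-ins.
from typing import Protocol

class CompletionModel(Protocol):
    def complete(self, system_prompt: str, user_input: str) -> str: ...

class VendorAAdapter:
    def complete(self, system_prompt: str, user_input: str) -> str:
        # In a real system: call vendor A's SDK and normalise the response.
        return f"[vendor-a] {user_input}"

class VendorBAdapter:
    def complete(self, system_prompt: str, user_input: str) -> str:
        # In a real system: call vendor B's SDK and normalise the response.
        return f"[vendor-b] {user_input}"

REGISTRY = {"vendor-a": VendorAAdapter, "vendor-b": VendorBAdapter}

def get_model(config: dict) -> CompletionModel:
    """Swapping models is a one-line configuration change, not a refactor."""
    return REGISTRY[config["model_provider"]]()

model = get_model({"model_provider": "vendor-a"})
print(model.complete("You classify enquiries.", "Where is my order?"))
```

The adapters are where model-specific response formats are normalised; everything above the interface stays model-agnostic, which is what keeps the 18-month migration cost estimate low.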

Six evaluation dimensions. Weighted to your specific use case. Every exclusion criterion binary before any scoring begins.

The evaluation framework is built from your requirements, not from a generic model benchmark. Benchmarks measure model performance on standardised tasks. Your use case is not a standardised task. The evaluation is conducted on your data, against your success criteria, under your production constraints. Models are first screened against binary exclusion criteria — data residency, latency, context window, commercial terms — before any capability scoring begins. A model that fails any exclusion criterion is removed from consideration regardless of its capability score. This ordering matters: capability scores from vendors who cannot meet your data residency requirements are not relevant.
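The ordering — binary exclusion first, weighted scoring only for survivors — can be sketched directly. The criteria, weights, and vendor records below are illustrative, not real evaluation data:

```python
# Sketch of the ordering described above: binary exclusion screening first,
# weighted capability scoring only for the survivors. All criteria, weights,
# and vendor records are illustrative.
def shortlist(vendors, exclusion_checks):
    """Remove any vendor failing ANY binary criterion before scoring."""
    survivors = []
    for v in vendors:
        failed = [name for name, check in exclusion_checks if not check(v)]
        if failed:
            print(f"excluded {v['name']}: fails {', '.join(failed)}")
        else:
            survivors.append(v)
    return survivors

def weighted_score(vendor, weights):
    return sum(vendor["scores"][dim] * w for dim, w in weights.items())

vendors = [
    {"name": "A", "uk_residency": True,  "context_window": 128_000,
     "scores": {"capability": 0.92, "tco": 0.55, "migration": 0.70}},
    {"name": "B", "uk_residency": False, "context_window": 200_000,
     "scores": {"capability": 0.95, "tco": 0.80, "migration": 0.60}},
]
checks = [
    ("UK data residency", lambda v: v["uk_residency"]),
    ("context window >= 96k", lambda v: v["context_window"] >= 96_000),
]
weights = {"capability": 0.5, "tco": 0.3, "migration": 0.2}  # use-case specific

ranked = sorted(shortlist(vendors, checks),
                key=lambda v: weighted_score(v, weights), reverse=True)
print([v["name"] for v in ranked])  # B is excluded despite higher capability
```

Vendor B never reaches the scoring stage: its capability score is irrelevant because it fails a binary criterion, which is exactly the ordering the framework enforces.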

01
Compliance & Data Governance
The non-negotiable dimension. Every requirement in this dimension is binary — pass or fail. No capability score compensates for a compliance failure. Evaluated before any other dimension.
What is evaluated
Data residency: which API endpoints process data, in which jurisdictions, under which legal framework
Training data opt-out: default terms vs. negotiated terms, what must be done to opt out, whether opt-out applies retroactively
UK GDPR Article 28 compliance: whether the vendor’s standard DPA satisfies the requirements or requires negotiation
Sub-processor list: who processes data on the vendor’s behalf, where they are located, whether their locations are acceptable
Breach notification: timeline and procedure for notifying the organisation of a data breach affecting their data
Sector-specific: NHS DSPT, FCA, CQC, GovAssure, or other sector obligations where applicable
Binary exclusion criterion
02
Technical Compatibility
Hard technical requirements that determine whether the model can physically do the job. Binary exclusion criteria assessed before capability scoring. Evaluated on production-representative measurements, not vendor specifications.
What is evaluated
Context window: 95th-percentile request size vs. model context window — measured on actual production input distribution, not estimated
Latency: time-to-first-token and total response time at 50th and 95th percentile under representative concurrent load — not single-request benchmarks
Rate limits: tokens per minute and requests per minute at the relevant tier vs. production peak throughput requirement
API stability: version history, deprecation timelines, breaking change frequency — relevant to operational maintenance burden
Modality requirements: if the use case requires vision, audio, structured output, function calling, or code execution — which models support them at production quality
Binary exclusion criterion
03
Task Capability
The dimension most organisations evaluate exclusively — and first. It is third in our sequence, because it only matters for vendors that have passed compliance and technical compatibility screening. Evaluated on your task, your data, your success criteria.
What is evaluated
Primary task accuracy: performance on the specific task the model will perform in production, measured on representative production data including the tail distribution
Failure mode characterisation: what the model produces when it fails — does it fail safely (low-confidence abstention) or unsafely (confident wrong answer)?
Instruction following: consistency of adherence to system prompt instructions across the production input distribution, not just on clean test inputs
Output format consistency: whether the model produces reliably structured outputs (JSON, specific formats) at production scale or degrades to prose when inputs are ambiguous
Robustness to adversarial inputs: performance on inputs specifically designed to make the model deviate from its intended behaviour
Scored and ranked
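The output-format consistency check in the dimension above can be sketched as a validity rate over model responses. The schema keys and the canned responses below are illustrative, not outputs from any real model:

```python
# Sketch: output-format consistency -- does the model reliably return valid
# JSON matching the expected schema, or degrade to prose on ambiguous
# inputs? The responses below are canned examples for illustration.
import json

REQUIRED_KEYS = {"category", "confidence"}  # illustrative schema

def is_valid_structured_output(raw: str) -> bool:
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys()

outputs = [
    '{"category": "billing", "confidence": 0.91}',
    'The category is probably billing.',        # degraded to prose
    '{"category": "returns"}',                  # missing a required key
]
valid_rate = sum(map(is_valid_structured_output, outputs)) / len(outputs)
print(f"structured-output consistency: {valid_rate:.0%}")
```

Run across the full 200-item evaluation sample, this produces the format-consistency figure that feeds the capability score — and it is the production tail, not the clean inputs, where the rate typically drops.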
04
Total Cost of Ownership
The dimension that produces the most budget surprises when not modelled correctly. Token pricing is transparent. The inputs to the model cost calculation at production scale are often not. Evaluated across a 3-year horizon at three volume scenarios.
What is modelled
Input token cost: average tokens per request at production input distribution (not PoC), at 1×, 3×, and 10× expected volume
Output token cost: average output tokens per response, modelled separately from input — output tokens are typically priced higher and are harder to forecast
Context window overhead: if RAG is required, the retrieval context added to every request — often the largest single contributor to token cost for document-heavy use cases
Committed vs. on-demand pricing: vendor pricing tiers, commitment discount availability, break-even analysis for committed spend
Infrastructure cost: if self-hosted or private deployment is in scope, the infrastructure cost over the assessment period
Operational overhead: monitoring, prompt maintenance, version management, re-evaluation cost as models are updated
Scored and ranked
05
Architecture Fit & Migration Risk
How well the selected model and platform fit into the organisation’s existing architecture, and how much it will cost to change the decision when the landscape moves. The dimension that is most often omitted from vendor evaluations and most often regretted.
What is evaluated
Integration complexity: how the model integrates with existing systems — authentication, data flow, response handling, error management
Abstraction layer feasibility: can the integration be built against an abstraction layer that allows model substitution? What would need to be model-specific vs. model-agnostic?
Migration cost estimate: what it would cost to migrate to a different model in 18 months if required — the lower this number, the better, regardless of how confident the organisation is in the current selection
Vendor lock-in risks: proprietary features, fine-tuning portability, data export provisions, contractual exit terms
Operational tooling: observability, logging, cost monitoring, version management — what the vendor provides vs. what must be built
Scored and ranked
06
Commercial Terms & Vendor Stability
LLM vendors are a young market. Several significant vendors have changed their pricing, terms, or product availability materially within 12 months of enterprise commitments. The commercial dimension assesses the risk of signing a multi-year commitment with a vendor whose commercial position may change.
What is evaluated
Pricing stability: history of price changes, committed pricing availability, price escalation clauses in enterprise agreements
Model continuity commitments: how long the vendor commits to maintaining a specific model version, deprecation notice periods, what happens to fine-tuned models when a base model is deprecated
SLA terms: uptime commitments, credit structure for downtime, maximum outage duration covered by SLA vs. typical incident durations from status page history
Contractual exit provisions: termination rights, data export on termination, notice periods, financial exposure on early exit
Vendor financial position: public information on funding, revenue, and runway — relevant for 24–36 month enterprise commitments
Scored and ranked

Three tiers. The work differs in scale. The rigour does not.

All three tiers apply all six evaluation dimensions. The difference is the number of use cases assessed, the depth of the vendor evaluation, and whether the engagement produces a single-use-case recommendation or an organisation-wide LLM portfolio strategy. Implementation of the selected platform — procurement, integration, deployment, testing — is outside scope in all tiers. The strategy engagement ends when you have a recommendation, the documented evidence supporting it, and the implementation specification your team or a partner executes from.

Focused Strategy
Single Use Case LLM Strategy
For organisations evaluating LLMs for a single, well-defined use case: one application, one primary task type, one team implementing it. Examples: customer service enquiry classification, internal document search and retrieval, contract clause extraction, meeting summarisation, code review assistance, HR policy Q&A. If you have multiple use cases requiring different model characteristics or your LLM programme spans more than one business unit, the Professional tier is appropriate.
£18,000
Fixed · VAT excl.
6 weeks
Assumes production data is available for evaluation and the use case is defined with sufficient specificity to begin evaluation in week 1.
Use Case Analysis
Use case definition: task type, input specification, output specification, success criteria defined as measurable thresholds
Production data profiling: actual input distribution characterised — not estimated, measured on a sample of real production data
Context window requirement: 95th-percentile request size measured on production data sample
Latency requirement: maximum acceptable response time for the application’s UX, documented with reasoning
Data residency mapping: data types the application will process, their regulatory classification, the jurisdiction requirements
Volume forecast: requests per day/month at current and 12-month projected scale for cost modelling
Architecture complexity assessment: minimum architecture that satisfies the use case — is a single model call sufficient, or is RAG/agentic architecture genuinely required?
Vendor Evaluation
Shortlist: up to 5 vendors/models evaluated across all 6 dimensions
Exclusion screening: compliance, data residency, context window, latency, rate limits — binary pass/fail before capability scoring
Capability evaluation: models tested on 200-item production data sample spanning the full input distribution including tail
TCO model: 3-year cost projection per vendor at baseline, 3×, and 10× volume — input tokens, output tokens, context overhead, infrastructure where applicable
Commercial terms review: data processing terms, training opt-out, DPA adequacy, pricing stability, exit provisions
Scored comparison table: all 6 dimensions, all evaluated vendors, with dimension weights calibrated to the specific use case
Strategy Output
Recommended vendor/model with documented scoring rationale
Architecture recommendation: the minimum architecture appropriate for the use case, with the reasoning for each architectural decision
Implementation specification: what to build, how to integrate, what the abstraction layer should look like, what monitoring is required from day one
Prompt design specification: system prompt structure, input formatting requirements, output format specification, edge case handling
Evaluation harness specification: the automated test suite that verifies the deployed system is performing against the defined success criteria
30-day post-delivery advisory support (email)
Implementation execution
Vendor contract negotiation
Ongoing model monitoring programme
Timeline — 6 Weeks
Wk 1
Use Case Definition
Task specification, success criteria, data profiling, context window measurement, latency and volume requirements.
Use cases that are not defined precisely enough to write measurable success criteria cannot be evaluated. We will work to define them — but vague use cases produce vague recommendations.
Wk 2
Data Residency & Exclusion Screening
Data classification, regulatory mapping, vendor exclusion screening against all binary criteria.
Some organisations discover in this stage that no publicly available vendor meets their data residency requirements. This is the correct time to discover it — before evaluation spend, not after implementation spend.
Wk 3–4
Capability Evaluation
200-item evaluation on production data across all shortlisted models. Failure mode characterisation. Output format consistency testing.
Access to a representative sample of production data is required before week 3. If data access is delayed, the evaluation timeline extends. Curated or synthetic data is not an acceptable substitute.
Wk 5
TCO & Commercial Terms
3-year cost models. Commercial terms review for each shortlisted vendor. Scored comparison table.
Volume forecast accuracy affects the cost model. Organisations without a production volume forecast for the use case receive a cost model with a wider uncertainty range.
Wk 6
Strategy & Handover
Final recommendation, architecture specification, prompt design spec, evaluation harness spec, implementation guidance.
The recommendation session must include both the technical lead and the decision-maker. A recommendation delivered only to the technical lead without the decision-maker present produces a different outcome than one delivered to both.
What Your Team Must Provide
A representative sample of 500+ real production inputs for the use case — the actual data the model will process, including the difficult, malformed, and ambiguous examples, not a curated sample of easy examples
Ground truth labels for 200 of those inputs — the correct output for each, agreed by the domain expert who will assess the model’s outputs
Technical lead: available for use case definition workshop (2 hours, week 1) and evaluation harness review (2 hours, week 6)
Decision-maker: available for the final recommendation session (90 minutes, week 6)
Legal or compliance: available for 1 hour in week 2 to confirm data classification and regulatory requirements
What Is Not in This Engagement
Implementation: all integration, deployment, testing, and monitoring are outside scope and separately resourced. Typical implementation cost for a single-use-case LLM application: £15,000–£80,000 depending on integration complexity
Vendor contract negotiation: we provide commercial terms guidance and the comparison on which to base negotiation — the negotiation itself is your legal and procurement team’s work
More than 5 vendors in the shortlist: additional vendor evaluation at £1,800/vendor
Ongoing strategy review as the model landscape evolves: available as a 6-month retainer at £4,500 per review cycle
RAG architecture design if required: this engagement recommends whether RAG is needed; the RAG architecture design is a separate engagement
Professional Strategy
Multi-Use-Case LLM Portfolio Strategy
For organisations with 3–10 LLM use cases under evaluation or development across multiple business units, where a single platform decision affects multiple use cases and the interaction between use cases matters for vendor selection. A use case that demands UK data residency may constrain the platform choice for all use cases. A use case that requires a 200k-token context window may require a different model family from one that needs fast low-latency responses. These interactions must be resolved at portfolio level, not use-case-by-use-case.
£52,000
Fixed · VAT excl.
12 weeks
Multi-unit data access provisioning and ground truth labelling by multiple domain teams are the most common sources of timeline extension at this tier.
Portfolio Analysis
Up to 10 use cases defined, profiled, and evaluated — each with independent success criteria
Cross-use-case constraint analysis: requirements that, if satisfied for one use case, constrain or enable others — platform selection interactions mapped explicitly
Use case prioritisation: ROI ranking of the portfolio — which use cases deliver the highest return per unit of implementation investment, and in which sequence they should be deployed
Platform consolidation analysis: whether a single platform can serve the full portfolio, or whether different platforms are required for different use cases, and the operational overhead of each scenario
Build vs. buy analysis per use case: for each use case, whether a commercial LLM with prompt engineering is sufficient or whether a fine-tuned or custom model is required and justified
Portfolio risk assessment: concentration risk from single-platform dependency, vendor stability risk across the portfolio horizon
Evaluation & Recommendation
Up to 8 vendors evaluated across the full portfolio — tested on representative production data per use case
Platform recommendation: primary platform for the majority of use cases, with documented reasoning for any use cases requiring a different platform
Architecture recommendations per use case: model, architecture pattern (single call, RAG, agent), integration approach
Implementation sequencing: which use cases to build first, in which order, with what dependencies between them
3-year portfolio TCO: total cost across all use cases at three volume scenarios, broken down by use case and by cost component
Commercial terms guidance: negotiation priorities for enterprise agreement, volume discount structure, data processing terms, SLA requirements
Strategy Outputs
Portfolio strategy document: 40–60 pages covering all 10 use cases, platform recommendation, implementation roadmap, and cost model
Implementation specifications per use case: architecture, prompt design, evaluation harness, integration approach
Board presentation pack: portfolio investment case, ROI model, risk assessment, phased deployment plan
Vendor evaluation evidence pack: scored comparison table, test results, TCO models — suitable for procurement audit trail
60-day post-delivery advisory support: email plus 2 × scheduled video calls
6-month landscape review included: one structured review of the recommendation at 6 months to assess whether the vendor landscape has changed materially
Timeline — 12 Weeks
Wk 1–2
Portfolio Inventory & Use Case Definition
All 10 use cases defined with success criteria. Data profiling per use case. Cross-use-case constraint mapping.
Use cases proposed by different business units often overlap or conflict. Clarifying this in week 1–2 prevents duplication and competing requirements from complicating the evaluation.
Wk 3
Exclusion Screening & Shortlisting
Binary exclusion screening across all vendors for all use cases. Portfolio-level shortlist that satisfies the most constrained use case requirements.
The most constrained use case sets the floor for the full portfolio shortlist. Organisations sometimes push back on this — they want the constrained use case on a different platform. This is a legitimate strategy that must be costed explicitly.
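The binary exclusion screening described above can be sketched as a simple set operation: a vendor survives only if it satisfies every hard constraint across the portfolio. The constraint names and data shapes here are hypothetical illustrations, not the engagement's actual screening criteria:

```python
def shortlist(vendors, use_cases):
    """Binary exclusion screening at portfolio level.

    A vendor survives only if its capabilities cover the union of hard
    constraints across all use cases — the most constrained use case
    therefore sets the floor for the whole shortlist. Constraint names
    (e.g. 'uk_data_residency') are illustrative.
    """
    required = set().union(*(uc["hard_constraints"] for uc in use_cases))
    return [v["name"] for v in vendors if required <= v["capabilities"]]
```

Moving the most constrained use case to its own platform amounts to screening it separately and running this filter on the remaining portfolio — which is precisely the two-platform scenario whose operational overhead must be costed explicitly.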
Wk 4–7
Capability Evaluation (all use cases)
200-item evaluation per use case across shortlisted vendors. Ground truth labelling by domain experts required before this phase begins.
Ground truth labelling across 10 use cases by multiple domain teams is the most common cause of Professional tier timeline extension. Begin ground truth labelling in week 1, not week 4.
Wk 8–9
TCO, Commercial Terms & Portfolio Analysis
3-year portfolio cost model. Commercial terms review. Platform consolidation analysis. Build vs. buy analysis.
Volume forecasts for 10 use cases are harder to produce accurately than for one. Ranges are acceptable in the cost model but must be explicitly documented as ranges, not point estimates.
Wk 10–11
Strategy Document & Recommendations
Portfolio strategy document. Implementation specifications. Board pack. Vendor evaluation evidence pack.
Review cycle: each business unit whose use case is covered will review their section. Conflicting comments between units are the most common issue — managed by the executive sponsor, not by us.
Wk 12
Handover & Board Presentation
Board presentation session. Procurement negotiation briefing. Implementation team handover.
Board approval before procurement begins — do not start vendor contract negotiation before the board has seen and approved the strategy. The recommendation may change after board feedback.
What Your Team Must Provide
Production data samples for all 10 use cases — 500+ real inputs each, including difficult and edge cases, collected before week 4 when evaluation begins
Domain expert for each use case: responsible for defining success criteria and providing ground truth labels for 200 evaluation items. This is the hardest resource requirement to meet at the Professional tier and the most common source of delay.
Executive sponsor: available for cross-use-case constraint resolution workshop (2 hours, week 2) and final strategy approval session (2 hours, week 12)
Legal/compliance: available for 2 hours in week 3 to confirm data classification and regulatory requirements across all use cases
Finance: volume forecasts and budget parameters for the 3-year TCO model
What Is Not in This Engagement
Implementation of any use case — separately resourced, separately costed. Typical portfolio implementation: £100,000–£500,000 depending on use case complexity and number of integrations
More than 10 use cases: scope addition at £3,500 per additional use case
More than 8 vendors evaluated: additional vendor at £2,200
RAG architecture design for use cases that require it: separate engagement following this strategy
12-month landscape review (beyond the included 6-month review): £4,500 per additional review cycle
Enterprise Strategy
Organisation-Wide LLM Strategy
For organisations establishing an enterprise-wide position on LLM adoption: governance framework, platform standards, build vs. buy policy, centre of excellence design, and an organisation-wide LLM portfolio strategy covering more than 10 use cases. Also appropriate for organisations that have made an LLM platform commitment they now have concerns about and want an independent assessment of whether it remains the right choice or whether a course correction is required. All enterprise engagements individually scoped.
From £95,000
Individually scoped · fixed · VAT excl.
From 16 weeks
Enterprise strategies covering 20+ use cases and organisation-wide governance commonly run 20–28 weeks.
What Enterprise Adds
No ceiling on use cases — all active and planned LLM initiatives across the organisation in scope
LLM governance framework: policy for what LLMs can and cannot be used for, how new use cases are approved, who owns the decision, how compliance is verified
Platform standards: technical standards for LLM integration across the organisation — abstraction layer design, prompt template standards, evaluation harness standards, monitoring requirements
Centre of excellence design: the internal capability that the organisation needs to build and maintain LLM systems without external dependency on a single vendor or consultant
Build vs. buy policy: a documented decision framework for future use cases — the criteria under which the organisation builds its own model vs. prompts a commercial one vs. fine-tunes a base model
Platform commitment assessment: if the organisation has an existing commitment, an independent assessment of whether it remains appropriate and what the cost-benefit of staying vs. switching is
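The abstraction layer mentioned under platform standards can take the shape of a thin provider-agnostic interface. The sketch below is a hypothetical illustration of the design idea, not a prescribed standard:

```python
from abc import ABC, abstractmethod

class LLMClient(ABC):
    """Provider-agnostic interface: application code depends on this,
    never on a vendor SDK directly, so a platform switch is confined
    to one adapter per provider rather than touching every use case."""

    @abstractmethod
    def complete(self, prompt: str, *, max_tokens: int = 1024) -> str: ...

class StubClient(LLMClient):
    """Illustrative adapter. A real adapter would wrap a vendor SDK and
    add the monitoring hooks the platform standard requires."""

    def complete(self, prompt: str, *, max_tokens: int = 1024) -> str:
        return f"echo: {prompt[:max_tokens]}"
```

The point of the standard is that switching costs are decided at design time: with an interface like this, a vendor change is an adapter swap rather than an organisation-wide rewrite.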
Why Enterprise Takes Longer
Evaluating 20+ use cases with independent domain experts requires sustained coordination across the organisation — the scheduling burden alone typically adds 3–4 weeks beyond the Professional tier timeline
Governance framework design requires engagement with legal, compliance, HR, and IT security — functions that have different risk tolerances for LLM deployment and whose concerns must be reconciled before a policy can be written
Centre of excellence design requires an organisational design component — what skills are needed, where they sit, how they are funded, how their authority is established — which involves HR and organisational strategy input that moves on different timescales
Platform commitment assessment for an organisation with an existing enterprise agreement requires legal review of the contract before any commercial assessment can be completed
Enterprise Requirements
Named C-suite sponsor — CDO, CTO, or equivalent — with authority to set LLM policy across the organisation. Without this, the governance framework will not be implemented.
Dedicated internal programme coordinator with access to all business units
Legal and compliance team: available throughout the programme for governance framework review and commercial terms assessment
All business unit leads: available for use case inventory workshops — typically 3–4 sessions of 2 hours each spread across weeks 1–4
If existing platform commitment is being assessed: full contract, current usage data, and cost invoices for the past 12 months before the engagement begins

What both parties commit to. What follows when either fails.

Client Obligations
Provide real production data — not curated or synthetic data
The evaluation is only as accurate as the data it is conducted on. Curated data — the best examples, the easy cases, the records that have been cleaned for a previous project — produces evaluation results that overstate the model’s production performance. We will specify the data sample requirements before the engagement begins. If the organisation cannot provide representative production data — because it does not yet exist, because it is too sensitive to share in an evaluation context, or because the data is held by a third party — we will discuss mitigation approaches and document the confidence limitation they create.
If only curated data is available
The evaluation proceeds with explicit documentation that the results reflect curated data and that production performance may differ. The recommendation carries an explicit caveat about this limitation. A production validation phase is recommended before full deployment commitment.
Ground truth labels provided by the domain expert, not by the team managing the engagement
The success criteria for an LLM evaluation must be defined by someone who knows what a correct output looks like for the specific task — the clinician who knows what a correct triage classification is, the underwriter who knows what a correct risk assessment looks like, the customer service manager who knows what a good response to a difficult enquiry is. Ground truth labels provided by a project manager, a developer, or a consultant who is not a domain expert in the task produce evaluation results that do not reflect real-world quality. Providing access to the right domain expert for ground truth labelling is a client obligation.
If domain experts are unavailable
Ground truth labelling is conducted by the best available proxy with explicit documentation of who labelled what and their relationship to the domain. The evaluation confidence is correspondingly lower. We will say so in the recommendation.
Do not begin vendor contract negotiation before the strategy is delivered
Organisations sometimes begin vendor discussions in parallel with the strategy engagement — either because a vendor is pushing for early commitment or because a business unit is impatient to start. Early commitment before the evaluation is complete means the commitment may not reflect the evaluation’s findings. If the evaluation recommends a different vendor from the one already under negotiation, the organisation faces a choice between following the recommendation and absorbing the cost of the committed negotiation, or ignoring the recommendation and committing to a vendor the evaluation did not recommend. Neither outcome is good. If vendor pressure requires earlier engagement, notify us — we can accelerate specific components of the evaluation to provide an earlier preliminary view on specific vendors.
If commitment has already been made before engagement
We assess the committed platform as one of the evaluated vendors. If it scores well, the strategy confirms the commitment. If it does not, we present the cost-benefit of honouring the commitment vs. switching before go-live. The honest answer may be uncomfortable.
RJV Obligations
Vendor neutrality declared before engagement and maintained throughout
Before the engagement begins, we declare in writing: any commercial relationships we hold with LLM vendors (currently: none), any financial interest in the outcome of the evaluation, and any prior work for the vendors under evaluation that might create a conflict. This declaration is updated if circumstances change during the engagement. If a conflict emerges that we cannot manage — a vendor relationship that develops during the engagement that would affect the recommendation — we disclose it immediately and discuss with the client whether to proceed or engage an alternative evaluator for the affected vendor.
If an undisclosed relationship is discovered
If you discover a commercial relationship with a vendor we recommended that we did not disclose, you have grounds to request a re-evaluation of the affected recommendation at our cost.
Evaluation conducted on production data under production-representative conditions
We will not conduct the evaluation under conditions that we know are more favourable than production. If we are asked to evaluate only on the clean subset of the data, or to exclude the difficult edge cases from the evaluation sample, or to test under single-request conditions rather than under representative concurrent load, we will refuse and explain why. The evaluation must reflect production conditions to produce a reliable recommendation. An evaluation that produces a more favourable result by excluding the difficult cases is not useful to you — it is useful to a vendor who wants to look good in an evaluation.
If the client requests evaluation conditions we believe are unrepresentative
We document the request, explain our concern in writing, and discuss. If the client insists, we conduct the evaluation as requested and document prominently in the recommendation that the evaluation conditions were not representative of production and that production performance may differ significantly.
Recommendation includes the case for alternatives the recommendation rejected
The recommendation document includes the scored comparison table, the reasoning for ranking the recommended vendor above its alternatives on each dimension, and the conditions under which those alternatives would be preferable — what would need to change in the use case, the organisation’s constraints, or the vendor landscape for the recommendation to change. A recommendation without the case for alternatives cannot be revised when circumstances change. Twelve months after delivery, when the landscape has moved, the question “why did we choose this vendor over the others?” must have a documented, revisable answer.
If you disagree with the recommendationThe scored comparison table is yours. The documentation of why each alternative scored lower is yours. You can make a different choice from the evidence we produced. We do not require you to follow the recommendation — we require that the recommendation and its evidence base are honest, which they are regardless of which option you choose.

Questions to answer before committing to any LLM platform

We have already run a PoC and the model performed well. Do we still need a strategy engagement?
It depends on what the PoC evaluated. If the PoC was conducted on production-representative data including the tail distribution, under production-representative load conditions, with a commercial terms review and a TCO model at production volume — then you have most of what this engagement produces and the value is lower. If the PoC was conducted on curated data, with single-request timing, without a TCO model, and without a commercial terms review — then the PoC validated that the model can produce good outputs on your best data, which was never the question that mattered. The strategy engagement starts from where a well-designed PoC left off. We will assess at the discovery session what the PoC covered and recommend only the additional components you need.
We are being pressured by a vendor to commit quickly. How should we handle this?
A vendor who creates urgency around a commitment decision is a vendor who wants you to commit before you have completed your evaluation. This is a sales tactic, not a reflection of genuine market scarcity. LLM platform capacity is not constrained in the way that, say, a physical data centre slot might be. The urgency is artificial. The appropriate response is to tell the vendor that your commitment timeline is determined by your evaluation timeline, not by their sales cycle. If the vendor withdraws the offer under those conditions, they were not offering the terms they represented. If a time-limited pricing offer is genuinely the concern, we can assess it during the commercial terms review and model whether the pricing premium from waiting outweighs the cost of committing before the evaluation is complete.
What if the strategy recommends no currently available model is right for our use case?
We will say so. This happens in specific circumstances: use cases with very strict latency requirements that frontier models cannot meet; regulated use cases where no vendor can offer a compliant data residency configuration; use cases where the task accuracy requirement on the production data distribution exceeds what any currently available model achieves; or use cases where the cost at production volume makes the LLM approach economically unviable. In each case, we will explain specifically what the blocking constraint is, what would need to change for an LLM approach to become viable (whether a model improvement, a vendor configuration change, or a relaxation of the use case requirement), and what alternative approaches are available in the interim. A recommendation not to proceed is the most useful outcome we can produce for a use case that is genuinely not ready — it prevents an expensive failed implementation.
We have an existing enterprise agreement with a major LLM provider. Can you assess whether it remains the right choice?
Yes, and this is one of the most useful engagements we do. The assessment reviews the committed platform against the current landscape on all six evaluation dimensions, models whether the committed terms are still competitive, assesses the specific use cases deployed against the performance of current alternative models, and produces a cost-benefit analysis of staying versus switching. The analysis is honest — if the committed platform remains the best choice, the assessment says so and you have independent evidence to support the decision. If a better option exists, the assessment quantifies the cost of switching versus staying on an increasingly suboptimal platform and gives you the evidence to make the decision with your eyes open. We understand that switching has a real cost. We also understand that staying on the wrong platform has a cost that grows over time.
How do you stay current on a landscape that moves this fast?
We run structured evaluations on all major models on a defined schedule, covering the task categories most relevant to our clients’ use cases. We review vendor term changes as they are announced — major providers update their terms materially 3–5 times per year and we track these changes. We do not rely on vendor marketing materials or third-party benchmark rankings as the primary source — those are starting points for directing evaluation attention, not substitutes for direct evaluation on representative tasks. The evaluation harnesses we build for client engagements are designed to be re-run as the landscape changes, so the 6-month landscape review included in the Professional tier can be conducted efficiently against the same methodology used in the original strategy.
What are your payment terms?
50% on contract signature, 50% on written acceptance of the final deliverables. No milestone payments during execution. Scope additions — additional use cases, additional vendors, additional regulatory frameworks — are invoiced as agreed in writing before execution, never retrospectively. The final payment is contingent on written acceptance. If a deliverable does not meet the agreed specification, we remediate before raising the final invoice. The 6-month landscape review in the Professional tier is covered by the programme fee — no additional invoice. Advisory retainer engagements beyond included post-delivery support are billed monthly in arrears for days actually worked.

Start with a 90-minute strategy assessment. Bring your current LLM thinking — including the vendor you are already leaning towards.

The discovery session reviews your use cases, your constraints, and your current thinking on vendor selection. If you already have a preferred vendor, we will assess it against the six dimensions in the session — so you leave knowing whether that preference is well-grounded or whether there are dimensions you have not yet considered. If you have no current preference, we will identify the constraints that should shape your shortlist. Either way, you leave with a clear picture of what the evaluation must cover and what the primary decision risks are.

The landscape moves fast enough that we have seen organisations whose PoC selection was appropriate at PoC time be the wrong choice by implementation time, six months later. The strategy engagement is designed to produce a recommendation that is current at delivery and to give you the evidence you need to revisit it when the landscape moves again.

Format
Video call or in-person in London. 90 minutes.
Cost
Free. No commitment.
Lead time
Within 5 business days of contact.
Bring
Your LLM use cases — even if loosely defined. Any PoC work already conducted and its results. Your current vendor shortlist or preference if one exists. The constraints you are aware of: data residency requirements, latency needs, budget parameters, regulatory obligations. Any existing vendor discussions or agreements.
Attendees
Technical lead or ML engineer and the business-side owner of the use case. Both are needed — the technical constraints and the business requirements must be in the same room. From RJV: a senior LLM strategist with no vendor affiliation.
After
Written summary of session findings within 2 business days. Fixed-price scope for the appropriate tier within 5 business days if you want to proceed.