The LLM vendor market is the fastest-moving enterprise technology market in a generation. That speed creates specific decision traps that do not exist in slower markets: the evaluation criteria that were correct six months ago may be wrong today, the PoC conditions that validated the choice may not hold in production, and the contract terms that seemed reasonable at signing may be catastrophic at the scale the platform reaches in year two. Each failure mode below is drawn from enterprise deployments that have been publicly reported, or reflects a pattern we observe consistently in the organisations we assess.
01
The PoC used clean, curated data. Production uses the actual data.
Proof-of-concept evaluations almost always use the organisation’s best data: the well-formatted documents, the clearly written tickets, the clean customer records. The model performs well. The organisation commits. Production ingests the actual data estate: inconsistently formatted records, partial documents, legacy extracts with encoding errors, free-text fields written by people in a hurry with abbreviations specific to the organisation’s culture. The model that performed at 94% accuracy on curated PoC data performs at 61% on production data. The difference was always there — the PoC was not designed to find it.
Cost of discovering this in production
Rework: re-evaluation against production data, potential platform switch, re-implementation. Timeline: typically 4–8 months. Cost: typically 2–4× the original implementation cost. In one reported case: a 14-month customer service LLM programme halted and restarted from vendor selection after production accuracy was unacceptable. Total rework cost: £2.3M.
How to prevent it
Evaluation conducted on a representative sample of production data — not curated data, not the best 1% of the data estate. The evaluation sample must include the tail: the malformed records, the ambiguous inputs, the edge cases that are rare but consequential. A model that performs well only on clean data is not ready for production. The strategy engagement specifies the evaluation dataset before any vendor trial begins.
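One way to make "include the tail" operational is to build the evaluation set by stratified sampling that forces a fixed share of tail cases, rather than sampling uniformly and letting clean records dominate. The sketch below is illustrative only: `build_eval_sample`, the `is_tail` predicate, and the toy record structure are hypothetical, and a real engagement would define tail membership from the organisation's own data quality audit.

```python
import random

def build_eval_sample(records, is_tail, n=1000, tail_fraction=0.2, seed=7):
    """Draw an evaluation set from production records, forcing a fixed
    share of tail cases (malformed, ambiguous, rare) so the sample
    cannot silently become the best 1% of the data estate."""
    rng = random.Random(seed)
    tail = [r for r in records if is_tail(r)]
    head = [r for r in records if not is_tail(r)]
    n_tail = min(len(tail), int(n * tail_fraction))
    sample = rng.sample(tail, n_tail) + rng.sample(head, min(len(head), n - n_tail))
    rng.shuffle(sample)
    return sample

# Toy production estate: 95% clean records, 5% malformed free text.
estate = [{"text": "clean record", "malformed": False}] * 950 + \
         [{"text": "\x00garbled//", "malformed": True}] * 50
sample = build_eval_sample(estate, lambda r: r["malformed"], n=100)
```

Fixing the tail fraction in advance is the point: the sample composition is specified before any vendor trial begins, so no vendor's accuracy number can be an artefact of a lucky draw.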
02
The model was selected for capability. The cost at production volume is unaffordable.
Token-based pricing means that LLM cost scales with usage in a way that is easy to underestimate at PoC scale and catastrophic to discover at production scale. A model that costs £200/month in PoC costs £85,000/month at production volume for a high-throughput application. This is not a surprise that emerges from unusual usage — it is the direct mathematical consequence of applying the per-token pricing to the production usage volume. Organisations that did not model this before selecting a platform are locked into a contract at a price point that makes the application economically unviable.
Cost of discovering this in production
Options at this point: renegotiate contract (unlikely to succeed), switch to a cheaper model mid-production (requires re-evaluation, re-testing, re-deployment), reduce throughput to fit budget (defeats the purpose of the application), or absorb the cost (eroding the ROI case entirely). In documented cases: organisations have switched from frontier models to smaller open-source models mid-deployment and accepted 6–8 months of parallel operation while the switch was validated.
How to prevent it
Total cost of ownership modelled before vendor selection: production volume forecast, token count per request at production input distribution (not PoC), output token count at production (not estimated), context window usage per request type, and the cost curve at 1×, 5×, and 10× expected volume for headroom. Model selection explicitly includes cost-per-request as a first-class evaluation criterion alongside capability. Smaller models that meet the capability threshold at lower cost are preferred over frontier models that exceed the capability threshold at unaffordable cost.
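The cost-curve check above is simple arithmetic that can be written down before any contract is signed. A minimal sketch, with entirely illustrative token counts and per-million-token prices (not any specific vendor's rates):

```python
def monthly_cost(requests_per_month, in_tokens, out_tokens,
                 price_in_per_m, price_out_per_m):
    """Projected monthly spend from per-token pricing.
    Prices are per million tokens; token counts are per-request averages
    taken from the production input distribution, not the PoC."""
    per_request = (in_tokens * price_in_per_m
                   + out_tokens * price_out_per_m) / 1_000_000
    return requests_per_month * per_request

# Illustrative assumptions: 400k requests/month at baseline,
# 2,500 input tokens and 500 output tokens per request,
# £2.50/M input and £10.00/M output tokens.
projection = {m: monthly_cost(400_000 * m, 2_500, 500, 2.50, 10.00)
              for m in (1, 5, 10)}  # cost at 1x, 5x, 10x volume
```

Because cost is linear in volume under per-token pricing, the 10× figure is not a forecast risk to be debated later: it is the baseline figure multiplied by ten, and it belongs in the vendor evaluation scorecard next to capability.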
03
The selected model’s context window is too small for the actual production context.
Context window size determines how much text the model can process in a single request — the document being analysed, the conversation history, the system prompt, and the instructions all consume context window capacity. At PoC, requests are small and the context window is not a constraint. In production, the average request exceeds the context window: the customer service LLM needs to process a 30-page policy document that is larger than the context window; the legal review LLM needs to hold multiple contract sections in context simultaneously. The model cannot do the task it was selected to do.
What happens next
Chunking and retrieval-augmented generation (RAG) are the standard workarounds — but they introduce their own failure modes: retrieval misses relevant context, chunk boundaries split reasoning, the model produces answers that are locally correct but globally inconsistent across the full document. The workaround changes the system’s behaviour in ways that must be re-evaluated before the original accuracy claims can be maintained.
How to prevent it
Context window requirement analysis conducted on the actual production inputs before vendor evaluation begins. Every request type: the system prompt size, the instruction size, the input document or conversation size at the 50th and 95th percentile, and the expected output size. Context window requirement is an explicit minimum specification. Any model that does not meet the 95th-percentile context requirement is excluded from consideration before capability evaluation begins — capability on inputs the model cannot fit in its context window is irrelevant.
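The requirement analysis above reduces to one calculation per request type: fixed overhead (system prompt, instructions, expected output) plus the observed input size at the 50th and 95th percentile. A sketch, with a hypothetical request type and made-up token counts — a real analysis would tokenise actual production inputs with the candidate model's tokeniser:

```python
def percentile(values, p):
    """Nearest-rank percentile over a list of observed token counts."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def context_requirement(system_tokens, instruction_tokens,
                        input_token_samples, expected_output_tokens):
    """Minimum context window needed at the 50th and 95th percentile
    of the production input size distribution."""
    fixed = system_tokens + instruction_tokens + expected_output_tokens
    return {p: fixed + percentile(input_token_samples, p) for p in (50, 95)}

# Hypothetical request type: long policy documents as inputs.
input_samples = [18_000, 22_000, 25_000, 31_000, 44_000, 52_000, 60_000, 75_000]
req = context_requirement(800, 400, input_samples, 1_000)

# Binary gate: a model below the 95th-percentile requirement is
# excluded before capability evaluation begins.
fits_128k = req[95] <= 128_000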
04
Data residency and sovereignty requirements were not assessed before platform selection.
Many LLM vendor platforms default to US-based data processing. For organisations with NHS data, regulated financial data, government data, or client contractual obligations requiring UK or EU data processing, this means the default platform configuration is immediately non-compliant. The organisation discovers this either during a data protection impact assessment conducted after platform selection, or during a supplier security review, or — worst — after data has been processed through a non-compliant endpoint. Switching to a compliant configuration may require a different pricing tier, a different API endpoint, or in some cases a different vendor entirely.
What this looks like in practice
An NHS-connected digital health company selects a US-based LLM provider for patient-facing functionality. Post-selection data protection impact assessment identifies that the default API endpoint processes data in US-East. The UK endpoint is available but only on the Enterprise tier — at 3× the cost of the tier selected. Re-negotiation required. Timeline delay: 11 weeks. Organisations under stricter constraints (government security classification, patient-identifiable data under the DSP Toolkit) have found no compliant configuration available from their selected vendor and have restarted vendor selection entirely.
How to prevent it
Data residency requirements mapped and documented before vendor shortlisting begins. Every data type the application will process: its regulatory classification, the jurisdiction it must remain in, and the vendor configurations that satisfy those requirements. Vendors that cannot offer a compliant configuration are excluded from the shortlist regardless of capability. Data residency compliance is a binary requirement — it is not traded against capability.
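Because residency is binary, the shortlisting step is a straightforward set check: a vendor survives only if it offers a configuration covering every required jurisdiction. The vendor names, data types, and region sets below are entirely illustrative:

```python
# Required jurisdiction per data type the application will process
# (hypothetical mapping from the regulatory classification exercise).
DATA_REQUIREMENTS = {
    "patient_records": "UK",
    "support_tickets": "EU",
}

# Processing regions each candidate vendor can contractually guarantee
# (illustrative — taken from vendor documentation and DPA terms in practice).
VENDOR_REGIONS = {
    "vendor_a": {"US"},
    "vendor_b": {"US", "EU", "UK"},
    "vendor_c": {"EU"},
}

def compliant_vendors(requirements, vendor_regions):
    """A vendor is shortlisted only if every required jurisdiction is
    covered; capability is never consulted at this stage."""
    needed = set(requirements.values())
    return sorted(v for v, regions in vendor_regions.items()
                  if needed <= regions)

shortlist = compliant_vendors(DATA_REQUIREMENTS, VENDOR_REGIONS)
```

Running this before capability evaluation keeps the trade explicit: a vendor excluded here is excluded regardless of how well it scores on anything else.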
05
The latency of the selected model is incompatible with the application’s user experience requirements.
Frontier LLMs with the largest context windows and highest capability also have the highest latency. Time-to-first-token and time-to-completion vary significantly between models, between vendors, between API tiers, and under load. A model that responds in 800ms under PoC conditions may respond in 4.2 seconds under production load. For a customer-facing application where the user waits for a response, 4.2 seconds is unacceptable. The latency was in the model’s specification; it was not measured under production-representative load conditions before selection.
What happens next
Streaming responses mitigate the perception of latency for conversational applications but do not reduce total processing time for batch applications. Smaller, faster models are evaluated post-selection as alternatives — adding the evaluation and re-implementation cost that should have been part of the original selection process. In real-time applications (voice assistants, live customer service, trading systems) the latency constraint may exclude all frontier models and require an open-source deployment on controlled infrastructure to achieve the required response time.
How to prevent it
Latency requirements defined by application type before vendor evaluation: maximum acceptable time-to-first-token, maximum acceptable total response time at the 95th percentile, and whether streaming is acceptable for the application’s UX. Vendor evaluation conducted under load — not single-request timing, but representative concurrent request load. Any model that does not meet the 95th-percentile latency requirement under representative load is excluded regardless of capability scores.
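Measuring the 95th-percentile under concurrent load, rather than single-request timing, is a small amount of harness code. The sketch below uses a stub in place of a real API call — `call_model`, the simulated latency, and the 2-second budget are all placeholders; in a real evaluation the stub is replaced with the vendor SDK and the concurrency level matches forecast production traffic:

```python
import concurrent.futures
import random
import time

def call_model(_):
    """Stand-in for a vendor API call; replace with the real SDK call.
    Returns the observed wall-clock latency for one request."""
    started = time.perf_counter()
    time.sleep(random.uniform(0.001, 0.005))  # placeholder work
    return time.perf_counter() - started

def p95_latency(n_requests=200, concurrency=20):
    """Issue n_requests with the given concurrency and return the
    95th-percentile latency — the figure single-request timing hides."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(call_model, range(n_requests)))
    return latencies[int(0.95 * len(latencies)) - 1]

REQUIREMENT_S = 2.0  # hypothetical p95 budget for this application type
meets_requirement = p95_latency() <= REQUIREMENT_S
```

The pass/fail is applied as an exclusion gate, matching the rule above: a model that misses the p95 budget under representative load is out, whatever its capability scores.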
06
The vendor’s terms allow training on your data. You did not read them.
LLM vendor terms of service have changed materially and repeatedly since 2022. Default terms for several major providers have included clauses permitting the vendor to use customer input data for model training or improvement. Opting out of data training typically requires a specific enterprise agreement, a specific API configuration, or both. Organisations that signed standard agreements and did not read the data usage provisions may have been permitting their proprietary data — client communications, internal documents, financial data, clinical data — to be used in ways their data protection obligations prohibit and their clients did not consent to.
Regulatory consequence
UK GDPR Article 28 requires a data processing agreement with any processor handling personal data. If the vendor’s terms permitted uses beyond the organisation’s stated processing purpose — specifically, model training — the data processing agreement may not have satisfied Article 28 requirements. ICO enforcement and client contractual liability both arise from this. The data that was processed under non-compliant terms cannot be un-processed.
How to prevent it
Commercial terms review as a mandatory component of vendor evaluation — before any data is sent to any vendor API during evaluation. Terms assessed for: data retention and deletion policy, training data opt-out provisions and how to invoke them, data processing geography, sub-processor list, breach notification obligations, and compliance with UK GDPR Article 28 requirements. Any vendor whose default terms do not satisfy the organisation’s data processing requirements is excluded or placed on a negotiated-terms-required shortlist.
07
The chosen architecture is more complex than the use case requires.
The fastest-moving area of LLM deployment is also the area with the strongest engineering pull towards complexity: agentic systems, multi-model pipelines, RAG with vector databases, fine-tuned models with custom inference infrastructure. These are the right choices for some problems. They are not the right choices for organisations deploying an LLM to classify incoming customer enquiries into one of eight categories. A small, fast, cheap model with a well-designed system prompt solves that problem in 4 weeks. A fine-tuned model with a vector database and an agent orchestration layer solves it in 6 months at 20× the cost and with 5× the operational complexity. The engineering team chose the architecture that was technically interesting, not the architecture that was appropriate.
What this costs
Over-engineered LLM architectures are expensive to build, expensive to operate, expensive to debug, and expensive to change. They also tend to fail in more complex ways — a simple model with a system prompt fails in ways that are immediately visible; a multi-agent pipeline fails in ways that require significant investigation to trace. The maintenance cost of a complex architecture is carried indefinitely. The value delivered is identical to what a simple architecture would have delivered, in less time, at lower cost.
How to prevent it
Architecture selection begins from the use case requirements, not from the available technology. Each requirement is mapped to the minimum architecture that satisfies it: if a single model call with a well-designed system prompt meets the accuracy requirement, the strategy specifies that. RAG is specified only when the model genuinely needs access to information that cannot be in its system prompt or context window. Fine-tuning is specified only when the use case cannot be met by a general-purpose model with prompt engineering. Agentic architectures are specified only when the task genuinely requires the model to take sequential actions with external tool access.
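For the eight-category enquiry classifier described above, the minimum architecture really is this small: one model call, one well-designed system prompt, a fail-closed fallback. The sketch below is hypothetical — `chat` stands in for whichever vendor client the strategy selects, and the category labels are invented:

```python
# Minimal-architecture sketch: no RAG, no fine-tuning, no agents.
CATEGORIES = ["billing", "delivery", "returns", "complaints",
              "technical", "account", "sales", "other"]

SYSTEM_PROMPT = (
    "You classify customer enquiries. Reply with exactly one label from: "
    + ", ".join(CATEGORIES) + ". Reply with the label only."
)

def classify(enquiry, chat):
    """One model call; any response outside the label set fails closed
    to 'other' for human triage."""
    label = chat(system=SYSTEM_PROMPT, user=enquiry).strip().lower()
    return label if label in CATEGORIES else "other"

# Stub client so the sketch runs without a vendor dependency;
# a real deployment swaps in the selected vendor's SDK call.
def stub_chat(system, user):
    return "billing" if "invoice" in user.lower() else "unrecognised"

result = classify("Where is my invoice?", stub_chat)
```

Everything beyond this — retrieval, fine-tuning, orchestration — is added only when a specific requirement demonstrably cannot be met by the simple version.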
08
The organisation is locked into a vendor whose model no longer leads on the relevant task.
The LLM capability landscape moves on a 6-month cycle. The model that led on coding tasks in Q1 2024 was not the leader by Q3 2024. The model that led on long-document analysis in Q2 2024 was not the leader by Q4 2024. Organisations that built tightly coupled integrations with a specific model’s API — using model-specific features, fine-tuning on model-specific formats, building prompt templates that depend on model-specific behaviour — are significantly more expensive to migrate when the landscape changes than organisations that built against an abstraction layer. The migration cost was not zero when the decision to build tightly coupled was made; it was deferred and compounded.
What this looks like at migration time
A tightly coupled integration built against GPT-4 in 2023 required 4–6 months to migrate to a different frontier model in 2024 for several documented enterprise deployments, because: prompt templates depended on GPT-4-specific response formatting that other models did not replicate; fine-tuning had been conducted on OpenAI’s fine-tuning infrastructure with a format not portable to other platforms; monitoring and evaluation infrastructure was built against OpenAI’s API response structure. Total migration cost: typically 40–70% of the original implementation cost.
How to prevent it
Architecture designed with a model abstraction layer from the start: the application calls an interface, not a model. The interface routes to the current selected model. Swapping the model requires changing one configuration value, not refactoring the application. Prompt templates written to be model-agnostic — testing across multiple models before finalisation, not optimised for a single model's response characteristics. Fine-tuning avoided unless the capability gain over prompt engineering is demonstrated to be material on the production task — because fine-tuned models are the hardest to migrate.
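The abstraction layer can be as simple as a registry of provider adapters behind one configuration entry. A minimal sketch — the provider names and adapter bodies are stubs, and a real adapter would wrap the vendor SDK and normalise its response format:

```python
# Model abstraction layer: the application calls generate(); a single
# configuration value decides which provider adapter actually runs.
ADAPTERS = {}

def register(name):
    """Decorator that adds a provider adapter to the registry."""
    def wrap(fn):
        ADAPTERS[name] = fn
        return fn
    return wrap

@register("provider_a")
def _provider_a(prompt):
    # Real adapter: call provider A's SDK, normalise the response.
    return f"[provider_a] {prompt}"

@register("provider_b")
def _provider_b(prompt):
    # Real adapter: call provider B's SDK, normalise the response.
    return f"[provider_b] {prompt}"

CONFIG = {"model": "provider_a"}  # swapping models = changing this value

def generate(prompt):
    """The only entry point application code ever calls."""
    return ADAPTERS[CONFIG["model"]](prompt)
```

Because application code only ever imports `generate`, a migration when the capability landscape shifts touches the registry and the configuration, not the application — the deferred-and-compounded migration cost described above is paid up front, once, and it is small.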