
LLM Deployment & Operations


Deploying an LLM application to production is not the end of the work. It is the beginning of a different kind of work — one that most organisations are less prepared for than they were for the build. Production LLM systems degrade in ways that are invisible to standard application monitoring: the model’s behaviour drifts when the vendor updates the underlying model, user query distributions shift in ways that move the system into regions of its performance space that were not evaluated before deployment, costs accumulate in patterns that were not in the original business case, and incidents occur in ways that are unique to probabilistic systems and that standard incident response playbooks do not cover.

The organisations that manage production LLM systems well treat them as a distinct operational discipline from conventional software systems. The monitoring is different. The incident classification is different. The change management process is different. The cost governance is different. Most of these differences are not intuitive to engineering and operations teams whose experience is with deterministic software — and the consequences of applying deterministic software operational practices to probabilistic AI systems are silent quality degradation, uncontrolled cost growth, and incidents that are discovered by users before they are detected by operations.

This service designs the operational framework — monitoring, incident response, cost governance, change management, and continuous evaluation — for LLM systems in production. We do not operate the system on your behalf. We design the framework your team operates from: the runbooks, the metric thresholds, the escalation procedures, the cost controls, and the governance processes that turn a deployed LLM application into a managed, observable, cost-controlled production system.

Price Range
£18,000 – £110,000+
Operations framework design, runbooks, monitoring specifications, cost governance design, and incident response playbooks. We do not operate systems.
Duration
6 – 16 weeks
Framework design phase. Implementation of monitoring tooling and operational processes by your team is separate and additional.
Scope boundary
We design the operational framework. We do not operate monitoring tools, manage incidents on your behalf, configure observability platforms, or act as a managed service provider. Your engineering and operations team runs the system from our framework.
Applicable to
Any production LLM application — API-based, fine-tuned, RAG-augmented, agentic, or combinations. New deployments and existing systems without adequate operational frameworks.
Contract
Fixed-price. 50% on signing, 50% on delivery acceptance.
LLMs fail differently from conventional software
A conventional application fails with an error code. An LLM application fails by producing a plausible-sounding wrong answer — silently, without a stack trace, with no signal to standard uptime monitoring. The monitoring approaches, incident definitions, and operational responses designed for conventional software do not detect these failures. Applying them to LLM systems creates a false sense of operational visibility. Your dashboards show green while the system is degrading.

Eight failure modes specific to production LLM systems. None of them produce an error code. All of them are detectable with the right operational framework.

Production LLM failures are qualitatively different from conventional software failures. They do not throw exceptions. They do not time out. They do not return HTTP 500. They return HTTP 200 with a response that looks correct and is wrong. The detection requires purpose-built operational mechanisms — not the monitoring infrastructure that works perfectly well for the rest of the application stack. Each failure mode below has a specific detection mechanism and a specific operational response.

01
Model version updates change system behaviour without warning
API-based LLM providers update their models on rolling schedules. These updates may improve average performance on benchmark tasks while shifting the model’s behaviour on your specific application’s task in ways that were not anticipated. The same prompt produces meaningfully different outputs — in tone, in format, in content, in the way edge cases are handled. Unless the system is being evaluated continuously against a stable test set, this change is invisible. The system’s dashboards show normal throughput and latency. The output quality has shifted.
How organisations discover this
A customer service manager notices that responses feel different — more verbose, or less helpful on a specific topic category. An automated downstream process that parsed the model’s structured JSON output starts failing because the output format changed subtly. A quarterly user satisfaction survey shows a drop that correlates with a model version change that was not noticed at the time. In each case, the degradation had been in place for weeks before detection.
Detection mechanism
Continuous evaluation against a pinned test set run on every model version — triggered automatically when the provider’s API returns a new model version identifier. Comparison of output distributions between the previous and current model version on the evaluation set. Alert when distribution shift exceeds a defined threshold. Evaluation results reviewed by a human before the new version is accepted into production. Model version pinning where the provider offers it.
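As an illustration, the sketch below shows the shape of that trigger, assuming the provider reports the served model version in its response metadata; the version identifier, pass-rate threshold, and the evaluation and alerting callables are hypothetical placeholders rather than a prescribed implementation.

```python
from typing import Callable

ACCEPTED_VERSION = "provider-model-2024-06-01"   # last human-approved version (placeholder)
MAX_PASS_RATE_DROP = 0.03                        # tolerated drop vs. the baseline pass rate

def gate_model_version(
    served_version: str,
    baseline_pass_rate: float,
    run_eval: Callable[[str], float],    # runs the pinned test set, returns its pass rate
    alert: Callable[[str], None],        # routes a message to the human review queue
) -> bool:
    """Return True only while the served version is the accepted, evaluated one."""
    if served_version == ACCEPTED_VERSION:
        return True   # still on the pinned, evaluated version: nothing to do
    pass_rate = run_eval(served_version)   # automatic evaluation run on version change
    if baseline_pass_rate - pass_rate > MAX_PASS_RATE_DROP:
        alert(f"{served_version}: pass rate {pass_rate:.1%} vs baseline "
              f"{baseline_pass_rate:.1%}: hold for human review")
    else:
        alert(f"{served_version}: within threshold, queue for reviewer sign-off")
    return False   # a new version is never auto-accepted; a human approves it
```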
02
Query distribution drift moves the system into unevaluated territory
Pre-deployment evaluation covers the query types anticipated at design time. Production users ask questions that were not anticipated. Over time, the proportion of queries in categories that were not well-evaluated grows — either because user behaviour shifts, because the user base expands to a different population, or because users discover edge cases the design team did not encounter. The system’s performance on the evaluated query categories remains good. Its performance on the unevaluated categories is unknown. The aggregate performance metric conceals the degradation in the growing tail.
What this looks like in practice
An internal HR policy Q&A system was evaluated on policy lookup and procedure questions. After 6 months in production, query analysis shows that 31% of queries are now about edge cases — part-time employee scenarios, international assignments, newly joined employees during policy transition periods — none of which were well-represented in the evaluation set. The aggregate satisfaction metric has not changed. The edge case satisfaction is unknown and likely poor.
Detection mechanism
Query distribution monitoring: cluster production queries by topic and intent type, track the distribution over time, alert when query categories that are not in the evaluation set exceed a defined proportion of volume. New query category detection triggers evaluation set expansion: domain experts add ground truth answers for the new category, the evaluation is run on the new category, and the results are reviewed before the category grows further.
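A minimal sketch of the coverage-gap check, assuming query categories come from an upstream topic classifier; the 10% threshold is illustrative.

```python
from collections import Counter

UNEVALUATED_SHARE_THRESHOLD = 0.10   # illustrative: alert when >10% of volume is uncovered

def coverage_gap(weekly_query_categories: list[str], evaluated_categories: set[str]) -> dict:
    counts = Counter(weekly_query_categories)
    total = sum(counts.values())
    uncovered = {cat: n for cat, n in counts.items() if cat not in evaluated_categories}
    share = sum(uncovered.values()) / total if total else 0.0
    return {
        "uncovered_share": round(share, 3),
        "alert": share > UNEVALUATED_SHARE_THRESHOLD,
        "top_uncovered": Counter(uncovered).most_common(5),   # candidates for eval-set expansion
    }

# Example: 3 of 8 queries fall outside evaluated coverage, so the alert fires.
print(coverage_gap(
    ["policy_lookup"] * 5 + ["intl_assignment", "part_time_edge", "part_time_edge"],
    evaluated_categories={"policy_lookup", "procedure"},
))
```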
03
Token costs grow faster than usage — and no one notices until the invoice arrives
LLM costs scale with token consumption. Token consumption grows for reasons beyond simple volume increase: prompts are lengthened when developers add more instructions to fix edge cases; conversation histories accumulate in multi-turn applications and are included in every subsequent request; RAG retrieval returns more chunks as the knowledge base grows; output length increases when the model produces more verbose responses after a version update. Each of these is a small incremental cost increase. Together, over months, they can double or triple the cost per request while usage volume remains flat.
What the invoice looks like
Month 1: £4,200. Month 6: £11,800. Same number of users. Same feature set. The cost increase is entirely attributable to prompt length growth (developers added instructions), conversation history accumulation (no truncation logic implemented), and a RAG context window increase (more chunks retrieved after knowledge base expansion). None of these were tracked. The month 6 invoice is a surprise.
Detection mechanism
Per-request cost monitoring tracking input tokens, output tokens, and context window usage — not just total monthly spend. Cost per request trended over time, broken down by request type. Budget alerts at defined thresholds — weekly and monthly — before overspend is discovered on the invoice. Prompt length versioning: every change to system prompts tracked with the date of change and the cost impact measured. Conversation history truncation policy with cost impact monitoring.
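A minimal sketch of per-request cost tracking with a pre-breach budget alert; the token prices and weekly budget are illustrative placeholders, not vendor rates.

```python
PRICE_PER_1K_INPUT = 0.0025    # hypothetical £ per 1,000 input tokens
PRICE_PER_1K_OUTPUT = 0.0100   # hypothetical £ per 1,000 output tokens
WEEKLY_BUDGET = 1_500.00       # approved weekly spend (placeholder)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

def weekly_report(requests: list[dict]) -> dict:
    costs = [request_cost(r["input_tokens"], r["output_tokens"]) for r in requests]
    total = sum(costs)
    return {
        "total_spend": round(total, 2),
        "avg_cost_per_request": round(total / len(costs), 4) if costs else 0.0,
        "budget_alert": total > 0.8 * WEEKLY_BUDGET,   # alert before the budget is breached
    }

# Prompt-length creep shows up as a rising avg_cost_per_request at flat request volume.
print(weekly_report([{"input_tokens": 3200, "output_tokens": 450}] * 500))
```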
04
Prompt changes are made without version control or impact assessment
System prompts are the primary mechanism for controlling LLM behaviour. They are also typically not under version control, not subject to change management review, and modified by multiple team members in response to user complaints without systematic assessment of whether the change fixes the reported problem without creating new problems elsewhere. A change that fixes a formatting issue in one output category breaks the output structure for another category. A change that makes responses more concise for simple queries makes complex queries inadequate. None of this is visible until a different complaint arrives weeks later.
How this accumulates
An LLM application’s system prompt has been modified 47 times over 8 months. No version history exists. The current prompt is a composite of contributions from 6 different team members responding to different complaints at different times. It contains contradictory instructions that were added to fix opposing problems. The model’s behaviour in edge cases is unpredictable because the instructions it receives are internally inconsistent. No one knows what the prompt looked like when the application was working best.
Detection mechanism
Prompt version control: every system prompt change is committed to a version-controlled store with the change rationale, the author, and the date. Pre-deployment evaluation: every prompt change is evaluated against the full test set before deployment. Change review gate: prompt changes above a defined scope require a second reviewer and a test set pass rate above the defined threshold. Rollback capability: the previous prompt version can be deployed in under 5 minutes if a new version causes regression.
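A minimal sketch of the change record and the evaluation gate; the pass-rate threshold is illustrative, and in practice the records would live in the team's existing version control system rather than in memory.

```python
import hashlib
import time

PASS_RATE_REQUIRED = 0.95       # illustrative test-set pass rate gate
history: list[dict] = []        # newest last; enables rollback to any prior accepted version

def propose_prompt(text: str, author: str, rationale: str, test_pass_rate: float) -> bool:
    """Record a prompt change with its metadata; accept it only if it passes the gate."""
    record = {
        "sha": hashlib.sha256(text.encode()).hexdigest()[:12],
        "author": author,
        "rationale": rationale,
        "timestamp": time.time(),
        "test_pass_rate": test_pass_rate,
        "text": text,
    }
    if test_pass_rate < PASS_RATE_REQUIRED:
        print(f"REJECTED {record['sha']}: pass rate {test_pass_rate:.1%} below gate")
        return False
    history.append(record)
    return True

def rollback() -> str | None:
    """Revert to the previously accepted version (the sub-5-minute rollback path)."""
    if len(history) < 2:
        return None          # nothing earlier to revert to
    history.pop()            # discard the regressing version
    return history[-1]["text"]
```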
05
Latency degrades under load without triggering availability alerts
LLM API latency is variable in ways that conventional application latency is not. It varies with model size, context window length, concurrent load on the provider’s infrastructure, and network conditions. A system that responds in 1.2 seconds at low load responds in 8 seconds at peak load. Standard uptime monitoring reports the system as available — it is responding to every request. Users experience the system as broken. The 95th-percentile latency that makes the application unusable is not captured by average latency monitoring or by binary up/down availability checks.
Where this manifests
A customer-facing LLM assistant hangs for users during peak hours. Users retry, amplifying the load. The frontend timeout threshold is 10 seconds — just above the 8-second API response time — so no errors appear in frontend error logs. Operations sees: 100% availability. Users see: unusable application between 9am and 11am on business days.
Detection mechanism
Percentile latency monitoring: track P50, P90, P95, and P99 latency by request type — not just average. Separate latency budgets by request complexity: short queries have different acceptable latency than long-context document analysis requests. Latency alerting at the percentile that affects user experience, not at the threshold that indicates total failure. Load-based latency testing in staging before production deployment of prompt changes that increase context length.
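A minimal sketch of percentile latency checks with per-request-type budgets; the budget values are illustrative and would be calibrated to the system's measured baseline.

```python
import statistics

# Illustrative P95 budgets per request type; real values come from the measured baseline.
LATENCY_BUDGETS_P95_SECONDS = {"short_query": 3.0, "doc_analysis": 12.0}

def p95(samples: list[float]) -> float:
    return statistics.quantiles(samples, n=100)[94]   # approximate 95th percentile

def latency_alerts(samples_by_type: dict[str, list[float]]) -> list[str]:
    alerts = []
    for req_type, samples in samples_by_type.items():
        budget = LATENCY_BUDGETS_P95_SECONDS.get(req_type)
        if budget is None or len(samples) < 20:
            continue   # no budget defined for this type, or not enough data yet
        observed = p95(samples)
        if observed > budget:
            alerts.append(f"{req_type}: P95 {observed:.1f}s exceeds budget {budget:.1f}s")
    return alerts

# Availability can read 100% while this alert fires: every request is answered,
# just too slowly for users at peak load.
```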
06
The system produces harmful or inappropriate outputs that are not caught by standard logging
LLM systems can produce outputs that are factually wrong, inappropriately styled, potentially harmful, or that violate the organisation’s content policies — not as a result of adversarial attack, but as a consequence of the model’s probabilistic nature applied to edge case inputs. Standard application logging captures request and response payloads. It does not evaluate whether those responses are appropriate. Without a content review mechanism, inappropriate outputs accumulate in production, potentially damaging user trust or creating regulatory liability, with no operational signal until a user reports a specific instance.
What gets missed
A healthcare information LLM produces a response that contradicts established clinical guidance when asked about a medication dosage in an unusual context — the combination of patient age and medication that the model handles correctly individually but incorrectly in combination. The response is confident. The response is logged. No system reviews it. A nurse reports it to the vendor six weeks later. In those six weeks: unknown number of similar responses on similar inputs.
Detection mechanism
Automated output review pipeline: a secondary model or rule-based classifier evaluates each production response against defined content policy dimensions — factual consistency, policy compliance, tone, harmful content. Responses flagged by the classifier are queued for human review on a defined SLA. High-confidence flags trigger immediate human review. Aggregate flagging rates monitored over time — a rising flagging rate is an early signal of model drift or query distribution shift before human reviewers report it explicitly.
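A minimal sketch of the flagging and flag-rate steps, using a rule-based first pass for brevity; in practice the classifier is typically a secondary model, and the patterns and threshold shown are placeholders.

```python
POLICY_PATTERNS = ["dosage", "diagnosis", "legal advice"]   # placeholder review triggers
FLAG_RATE_ALERT = 0.05                                      # illustrative weekly flag-rate threshold

def flag_response(response_text: str) -> bool:
    """First-pass flag: does the response touch a domain that requires human review?"""
    text = response_text.lower()
    return any(pattern in text for pattern in POLICY_PATTERNS)

def weekly_flag_report(responses: list[str]) -> dict:
    flagged = [r for r in responses if flag_response(r)]
    rate = len(flagged) / len(responses) if responses else 0.0
    return {
        "flag_rate": round(rate, 4),
        "queued_for_human_review": len(flagged),   # reviewed against the defined SLA
        "drift_alert": rate > FLAG_RATE_ALERT,     # a rising rate is an early drift signal
    }
```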
07
Vendor API incidents have no documented response procedure
LLM vendor APIs experience outages, degraded performance, rate limiting, and partial failures — where some request types fail while others succeed, where latency increases substantially for specific model versions, or where the API returns unexpected response formats during a backend update. These incidents are different from infrastructure incidents the operations team is familiar with: there is nothing to restart, nothing to scale up, nothing to patch. The correct response is to fall back to a different model, queue requests, serve cached responses, or present a meaningful degraded state to users. Without a documented response procedure, the incident response is improvised under pressure.
What improvised response looks like
A major LLM provider experiences a 40-minute partial outage. The customer-facing application returns errors. The engineering team on call has no playbook for this scenario. 22 minutes pass while they investigate whether the problem is their infrastructure or the vendor’s. They establish it is the vendor’s. No fallback model is configured. No graceful degradation mode exists. The application is unavailable for the outage duration plus 22 minutes of investigation. A simple status page check would have identified the vendor incident in under 2 minutes. A configured fallback would have kept the application functional.
Detection and response framework
Vendor status monitoring integrated into the operational dashboard — automated checks against vendor status APIs, not manual status page polling. Tiered response playbook: synthetic request health check that detects partial failures before users do; automatic fallback routing to a secondary model when primary model failure rate exceeds threshold; graceful degradation mode that serves cached responses or reduced functionality with clear user communication; defined escalation path to vendor support with account information pre-documented. Recovery procedure for re-enabling primary model after vendor incident resolution.
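A minimal sketch of the fallback routing decision; the threshold and route names are placeholders, and the failure rates are assumed to come from a synthetic health-check loop.

```python
FAILOVER_THRESHOLD = 0.25   # illustrative failure rate over the synthetic health-check window

def choose_route(primary_failure_rate: float, secondary_failure_rate: float) -> str:
    """Decide which tier of the playbook applies, based on synthetic-check failure rates."""
    if primary_failure_rate <= FAILOVER_THRESHOLD:
        return "primary-model"
    if secondary_failure_rate <= FAILOVER_THRESHOLD:
        return "secondary-model"        # automatic fallback routing
    return "cached-degraded-mode"       # cached responses plus a clear user-facing notice

# During a vendor incident the primary fails a synthetic check 40% of the time,
# so traffic is routed to the fallback model rather than returning errors to users.
print(choose_route(primary_failure_rate=0.40, secondary_failure_rate=0.02))
```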
08
No process exists for incorporating user feedback into system improvement
Users of production LLM systems are the richest source of information about where the system fails. They know when an answer is wrong, when it misses their intent, when it is unhelpful for their specific situation. Most production LLM applications collect thumbs-up/thumbs-down feedback or allow users to flag responses. Almost none have a systematic process for converting that feedback into action: analysing the distribution of negative feedback by category, identifying the most common failure patterns, prioritising improvements, implementing them, and verifying that the implementation improved the flagged cases without degrading others. The feedback accumulates. The system does not improve.
What the feedback backlog looks like at 12 months
A production LLM system has 8,400 negative feedback items collected over 12 months. No one has analysed them systematically. There is no categorisation, no frequency analysis, no prioritisation. Three engineers have individually looked at samples and made prompt changes in response to specific complaints they noticed. The changes did not address the most common failure patterns because no one had identified what the most common failure patterns were. The 8,400 items represent a detailed map of the system’s failure modes that has never been read.
The operational process that prevents this
Weekly feedback triage: automated categorisation of negative feedback by query type and failure mode. Weekly human review of the categorisation, calibrated by the categoriser’s confidence. Monthly root cause analysis on the top failure categories: what is causing these failures, is it a prompt issue, a model issue, a knowledge base issue, or a query type the system should not be handling? Monthly improvement cycle: one targeted improvement per month, evaluated against the test set before deployment and against the feedback rate for the improved category after deployment.
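A minimal sketch of the weekly triage step, assuming negative feedback items have already been categorised by an upstream classifier.

```python
from collections import Counter

def triage(negative_feedback: list[dict]) -> list[tuple[str, int]]:
    """Weekly triage: surface the most common failure categories for root cause analysis."""
    counts = Counter(item["category"] for item in negative_feedback)
    return counts.most_common(5)   # the top categories drive the monthly improvement cycle

# The most frequent category, not the loudest individual complaint, is what the
# month's single targeted improvement should address.
print(triage([
    {"category": "wrong_policy_cited"}, {"category": "wrong_policy_cited"},
    {"category": "too_verbose"}, {"category": "missed_intent"},
]))
```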

Five distinct operational requirements. Each one different from its equivalent in conventional software operations.

The operational framework this engagement designs covers five domains. Each addresses a class of operational challenge that is specific to production LLM systems and requires a different approach from conventional software operations. The framework is not a list of monitoring metrics — it is a complete operational design that specifies who does what, when, in response to which signals, following which procedures, with which authority to make which decisions.

Domain 1
Quality Monitoring & Drift Detection
The most operationally critical domain and the one most absent from organisations’ current monitoring. Quality monitoring for LLM systems requires measuring the quality of outputs — not just whether the system is responding. This requires a continuous evaluation pipeline running alongside the production system, comparing current output quality against a stable baseline, and detecting degradation before it is reported by users.
What the framework covers
Evaluation pipeline design: automated test set execution on a defined schedule, triggered additionally by model version changes and significant prompt changes
Quality metrics specification: which metrics are measured (faithfulness, relevance, output format consistency, confidence calibration), at what frequency, and what the acceptable ranges are
Drift detection thresholds: at what level of metric shift an alert is raised, who receives it, and what the response procedure is at each threshold level
Query distribution monitoring: weekly analysis of production query distribution compared to the evaluation set coverage, alert when coverage gap grows
Output content review: automated classifier for flagging outputs that may violate content policy, with human review SLA and escalation path
Model version change protocol: what happens when the provider pushes a model update — evaluation, comparison, acceptance criteria, rollback if criteria not met
Why this is different from conventional monitoring
Conventional monitoring measures system availability and performance. Quality monitoring measures the meaning of the system’s outputs. There is no metric in a standard APM tool that captures whether the LLM’s answers are correct. This requires purpose-built evaluation infrastructure.
Domain 2
Cost Governance & Budget Control
LLM operating costs have more variables than conventional API costs, and those variables change over time in ways that are hard to predict from first principles. The cost governance framework establishes the instrumentation and controls that keep costs observable, predictable, and within approved boundaries — without requiring budget reviews to be triggered by invoice surprise.
What the framework covers
Cost instrumentation: per-request tracking of input token count, output token count, model tier, and request type — the granularity required to understand what is driving cost growth
Cost allocation: tagging requests by user, team, application feature, and request type so costs can be attributed and chargeback or showback applied
Budget alert thresholds: daily, weekly, and monthly alerts at defined spend levels — calibrated to give response time before a threshold is breached, not notification after it is
Cost anomaly detection: automated detection of per-request cost increases that are not explained by volume growth — prompt length creep, context window expansion, output verbosity increases
Token optimisation review process: quarterly review of token consumption patterns against the baseline established at deployment, with specific optimisation recommendations
Rate limiting and quota management: protecting against cost spikes from user behaviour changes or downstream application bugs that generate unusually high token consumption
Why this is different from conventional cost management
Conventional API costs scale linearly with request count. LLM costs scale with token count per request, which changes independently of request count. A 20% increase in average context length can have the same cost impact as a 20% increase in user volume — but appears as a completely different signal.
Domain 3
Change Management & Prompt Governance
Changes to LLM systems — prompt modifications, model version updates, RAG knowledge base changes, configuration parameter adjustments — have a different risk profile from conventional software changes. A prompt change that looks minor can produce unexpected behaviour changes in output categories that were not in the developer’s test cases. The change management process must account for this: every change is a system behaviour change, and every system behaviour change requires evaluation before deployment.
What the framework covers
Prompt version control specification: storage, naming conventions, change metadata (author, date, rationale, ticket reference), review requirements by change scope
Change classification: minor (punctuation, formatting), moderate (instruction addition or modification), major (structural prompt change or model switch) — with different review requirements per class
Pre-deployment evaluation gate: every change evaluated against the full test set before deployment, with a defined pass rate requirement per change class
A/B testing framework specification: for moderate and major changes, how to run controlled A/B tests in production before full deployment, what the success criteria are, and when to conclude the test
Rollback procedure: how to revert to the previous version in under 5 minutes if a deployed change causes unexpected regression, with the decision authority for rollback at each severity level
Knowledge base change management: for RAG systems, the process for adding, updating, or removing documents from the knowledge base with associated re-evaluation requirements
Why this is different from conventional change management
Conventional change management focuses on functional correctness — does the feature work as specified? LLM change management must also assess probabilistic output quality — does the change maintain or improve output quality across the full distribution of inputs, including edge cases that were not in the change’s immediate scope?
Domain 4
Incident Response & Escalation
LLM incidents require a different incident classification system from conventional software incidents. The classification must distinguish between infrastructure incidents (the API is down — conventional response), quality incidents (the API is up but outputs have degraded — LLM-specific response), and content incidents (the system produced an inappropriate or harmful output — LLM-specific response with potential regulatory or reputational implications). Each incident type has a different response procedure, a different escalation path, and different resolution criteria.
What the framework covers
Incident taxonomy: infrastructure incidents, quality degradation incidents, content incidents, cost incidents, and vendor incidents — definitions, detection signals, and severity classification for each
Response playbooks per incident type: step-by-step response procedure, decision authority at each step, escalation thresholds, and resolution criteria for each incident class
Content incident playbook: specific to outputs that may be harmful, inappropriate, or factually dangerous — immediate containment (can the output be recalled?), scope assessment (how many users may have seen it?), regulatory notification assessment, stakeholder communication
Vendor incident protocol: how to detect vendor incidents before users do, which status page and API health check signals to monitor, what the fallback procedure is, when and how to contact vendor support
Post-incident review process: root cause analysis template, contributing factor assessment, preventive measure identification, and framework update procedure for incidents that reveal framework gaps
On-call documentation: what the on-call engineer needs to know about the LLM system to respond to incidents at 3am — architecture overview, dependency map, access credentials location, escalation contacts
Why this is different from conventional incident response
A conventional P1 incident is binary — the system is down or it is not. An LLM quality incident is probabilistic — the system is partially degraded on a subset of queries that may not be immediately obvious. The incident classification requires measuring quality, not just availability.
Domain 5
Continuous Improvement & Feedback Integration
The most underinvested operational domain for most organisations. The mechanisms for collecting user feedback, analysing it systematically, converting it into prioritised improvement actions, implementing those actions through the change management process, and measuring whether the improvements worked — this closed loop is what distinguishes LLM systems that improve over their lifetime from those that remain at deployment-day quality or silently degrade.
What the framework covers
Feedback collection specification: what feedback signals to collect (explicit thumbs-up/down, follow-up queries that signal dissatisfaction, session abandonment, correction patterns), how to store them, how to link them to the generating request
Weekly feedback triage: automated categorisation of negative feedback by query type and failure mode, human review of categorisation, identification of emerging patterns
Monthly improvement cycle: root cause analysis on the top feedback categories, prioritisation of improvements, implementation through the change management process, measurement of impact
Evaluation set expansion process: how new query types surfaced through feedback analysis are added to the evaluation set, with ground truth from domain experts
Improvement tracking: how improvement actions are logged, how their impact is measured, and how the results are communicated to stakeholders
Quarterly operational review: structured review of all five operational domains — metrics trends, cost performance, incident history, change management compliance, and improvement cycle effectiveness — producing a prioritised operational improvement plan for the next quarter
Why this requires a defined process, not goodwill
Without a defined process, feedback analysis happens when someone has time for it — which is never, because operations teams always have more urgent demands. The monthly improvement cycle must be a scheduled, resourced activity with a named owner, not an aspiration.

Three engagement types. New deployments, existing systems, and multi-system portfolios.

This service is available for new LLM deployments — where the operational framework is designed alongside the system before go-live — and for existing deployments — where a system is already in production without adequate operational processes. The approach differs: new deployments allow the framework to be designed with full knowledge of the system’s architecture; existing deployments require an audit of current operations before the framework can be designed. The cost and timeline differ accordingly.

Engagement Type 1
Operations Framework — New Deployment
For organisations deploying an LLM application to production for the first time, or deploying an additional LLM application into a production environment that already has an operational framework but where the new application has materially different characteristics (a RAG system being added where only API-call applications previously existed, or a real-time agentic application being added where only request-response applications existed). The framework is designed alongside the system before go-live, not retrofitted after the problems emerge.
£18,000
Fixed · VAT excl.
6 weeks
Must begin no later than 4 weeks before the planned production go-live date. Beginning after go-live means operating without a framework during the most incident-prone period of any deployment.
Quality & Monitoring Design
Evaluation pipeline design: continuous test set execution schedule, trigger conditions for additional evaluation runs, result storage and comparison approach
Quality metrics specification: metrics, acceptable ranges, and alert thresholds for the specific application’s task type
Drift detection specification: model version change detection, query distribution monitoring, output distribution baseline and alert thresholds
Output content review specification: classifier requirements and configuration, human review queue design, review SLA and escalation
Vendor status monitoring integration: which signals to monitor, how to integrate them into the operational dashboard
Observability instrumentation requirements: what logging, tracing, and metrics must be emitted from the application to enable the monitoring framework
Cost Governance & Change Management
Cost instrumentation requirements: per-request token tracking, request type tagging, cost attribution design
Budget alert thresholds: daily, weekly, monthly — calibrated to the approved cost model for the deployment
Cost anomaly detection specification: what constitutes an anomalous per-request cost, alert threshold design
Prompt version control specification: storage, naming conventions, metadata requirements, review process per change class
Change classification and review requirements: minor/moderate/major criteria, test set pass rate requirements by change class, A/B testing framework for major changes
Rollback procedure: step-by-step prompt rollback, model version rollback, and — for RAG systems — knowledge base rollback
Incident Response & Improvement
Incident taxonomy for this application: infrastructure, quality, content, cost, vendor incidents — definitions, detection signals, severity classification
Response playbooks per incident type: step-by-step procedures, decision authorities, escalation paths, resolution criteria
On-call runbook: what the on-call engineer needs to know to respond to incidents at 3am — architecture overview, access details, escalation contacts, first-response procedures
Feedback collection specification: signals to collect, storage, linkage to generating request, triage process
Weekly triage and monthly improvement cycle process: named owners, scheduled cadence, input/output definition for each cycle
Quarterly operational review template: metrics to review, decision framework for framework updates
Timeline — 6 Weeks (must begin 4+ weeks before go-live)
Wk 1
System Assessment
Architecture review, observability current state, cost model, existing monitoring gaps, go-live timeline confirmed.
If go-live is less than 6 weeks away, scope must be prioritised: incident response playbooks and cost monitoring are the highest-priority deliverables for an imminent go-live.
Wk 2–3
Monitoring & Cost Framework Design
Quality metrics, drift detection, content review, cost instrumentation, budget alert thresholds.
Monitoring instrumentation requirements must reach the engineering team in week 2 to allow implementation before go-live. This is the most time-constrained deliverable.
Wk 4
Change Management Design
Prompt version control specification, change classification criteria, rollback procedures.
Prompt version control must be set up before go-live, not after. Post-go-live prompt changes without version control are the most common source of untraceable quality degradation.
Wk 5
Incident Response & Playbooks
Incident taxonomy, response playbooks per type, on-call runbook, escalation paths.
On-call team must review and accept the playbooks before go-live. A playbook that on-call engineers have not read is not a playbook — it is a document.
Wk 6
Improvement Process & Handover
Feedback collection spec, improvement cycle process, quarterly review template, full framework handover.
Operations team and product owner must both attend the handover. The improvement process requires both — operations owns the monitoring, product owns the improvement prioritisation.
What Your Team Must Provide
Architecture documentation for the LLM system: model, prompt structure, integration points, data flows, and any RAG or agentic components
Current observability infrastructure: what logging and monitoring tools are in place for the wider application stack — the LLM monitoring integrates into the existing stack rather than running separately
Engineering lead: 3 hours across weeks 1–3 for system assessment and monitoring instrumentation requirements review
Operations or SRE lead: 3 hours across weeks 4–5 for incident response playbook review and on-call runbook validation
Product owner: 2 hours in week 6 for improvement cycle process review and handover
Confirmed go-live date before the engagement begins — the timeline is calibrated to go-live, and a moving go-live date complicates prioritisation
What Is Not in This Engagement
Implementation of monitoring infrastructure: all observability tooling, dashboards, alerting, and feedback collection implemented by your engineering team from our specifications
Ongoing monitoring operations: we design the framework; your team operates it. We are not an MSSP and do not provide 24/7 monitoring services.
Multi-system portfolio operational framework: if the organisation has multiple LLM applications requiring a coordinated operational approach, this is the Multi-System Portfolio tier
Post-delivery quarterly review facilitation: available at £3,500 per quarterly review if you want RJV to facilitate the structured review process rather than running it internally
Engagement Type 2
Operations Audit & Framework Remediation — Existing System
For organisations with one or more LLM applications already in production that do not have an adequate operational framework — or that have experienced a quality incident, a cost surprise, or a degradation event that revealed the inadequacy of current operations. This engagement first audits the current operational state, then designs the missing or inadequate components, and produces a prioritised remediation roadmap calibrated to the system’s actual risk profile. The audit findings may reveal that the gaps are critical and require immediate action — or that the current state is closer to adequate than expected and only targeted improvements are needed.
£32,000
Fixed · VAT excl.
8 weeks
Audit phase (weeks 1–3) must have full access to production logs, monitoring data, and current operational processes before it can begin.
Operational Audit (Weeks 1–3)
Current quality monitoring assessment: what is monitored today, what is not, what metrics exist and whether they are fit for purpose
Cost governance audit: current cost visibility, billing data review, cost anomaly detection capability, historical cost trend analysis
Prompt version history audit: is version control in place? How many undocumented changes have been made? What is the current prompt state vs. the deployed-at-launch prompt?
Incident history review: what incidents have occurred, how were they detected, how were they classified, how were they resolved, what was the time to detection and resolution?
Feedback backlog analysis: what feedback exists, has it been analysed systematically, what are the most common failure patterns in the accumulated feedback?
Gap severity assessment: each operational domain rated — adequate, partial, inadequate — with specific evidence from the audit for each rating
Framework Design (Weeks 4–7)
Full operational framework for all five domains, calibrated to the specific system and the gaps identified in the audit
Prioritised remediation roadmap: which gaps to address first based on the risk they represent — immediately critical gaps prioritised for week-1 remediation after handover
Immediate action plan: for gaps identified as immediately critical (no incident response for a system with content risk, no cost monitoring for a system approaching budget limits), specific actions that can be taken by the operations team before the full framework is implemented
Retrospective on past incidents: root cause analysis of significant incidents in the audit history, contributing operational gaps identified, and specific framework components that would have prevented or shortened each incident
Feedback backlog action plan: how to process the accumulated feedback backlog systematically to extract improvement actions before beginning the monthly improvement cycle
Handover (Week 8)
Complete operational framework documentation
Prioritised remediation roadmap with estimated implementation effort per gap
Immediate action plan: actions the operations team can take this week, before full framework implementation begins
Implementation guidance: what the engineering team needs to implement each monitoring component, with technology stack recommendations for any gaps
Handover session with operations team, engineering lead, and product owner (3 hours): framework walkthrough, questions, immediate action assignments
60-day post-delivery advisory support: email plus 2 × scheduled check-in calls during the initial framework implementation period
Timeline — 8 Weeks
Wk 1–2
Audit: Monitoring, Cost, Prompts
Review of current monitoring data, cost records, and prompt history. Interviews with engineering and operations leads.
Access to production logs and billing data must be arranged before week 1. Without these, the audit is incomplete and the gap severity assessment is less reliable.
Wk 3
Audit: Incidents & Feedback
Incident history review, feedback backlog analysis, gap severity assessment. Preliminary findings to engineering lead.
Incident history and feedback data must be accessible. If incidents were not logged or feedback was not stored, the audit of those domains relies on interviews — which are less reliable than data.
Wk 4–6
Framework Design
All five operational domains designed. Immediate action plan. Prioritised remediation roadmap.
Framework design is calibrated to the audit findings. If audit findings are more complex than anticipated — more gaps, more severe — the design phase may require additional time. This is assessed at the end of week 3.
Wk 7
Review
Client review of framework and remediation roadmap. Feedback incorporated. Final revisions.
Review must include both the engineering lead and the product owner. The remediation roadmap has both technical and product implications — both perspectives are needed to validate prioritisation.
Wk 8
Handover
3-hour handover session. Framework walkthrough. Immediate action assignments. Implementation guidance.
Immediate action items must have named owners and a completion date assigned at the handover session. Immediate actions without assigned owners are not completed.
What Your Team Must Provide
Access to production application logs for the past 3–6 months: request/response logs, error logs, latency logs — sufficient to characterise current behaviour and incident history
Billing data: monthly cost breakdown for the LLM application from inception, broken down by the granularity available from the vendor billing dashboard
Prompt history: all versions of the system prompt that have been used in production, with dates of change where available. If no version history exists, the current prompt and a description of what has changed since deployment.
Incident records: any documented incidents, user complaints, or quality issues that have been reported since deployment, however informally documented
Feedback data: all stored user feedback from the application since deployment
Engineering lead, operations lead, and product owner: 90-minute interview each in weeks 1–2
What Is Not in This Engagement
Implementation of the remediation roadmap: all implementation by your engineering team from the specifications and prioritised roadmap we deliver
Ongoing monitoring operations: framework design only — your team operates the monitoring
More than one LLM application: if the audit covers multiple applications, the Professional Portfolio tier is appropriate
Post-delivery quarterly review facilitation: available at £3,500 per quarterly review
Engagement Type 3
Multi-System LLMOps Portfolio Framework
For organisations running or deploying multiple LLM applications across different teams, where an organisation-wide operational framework — shared monitoring infrastructure, common incident classification, portfolio cost governance, and consistent change management standards — is more efficient and more reliable than each team maintaining independent operational processes. Also appropriate when one team’s operational framework has become a de facto standard and needs to be formalised and extended to other teams. All portfolio engagements individually scoped. Starting price reflects 3–5 applications; larger portfolios are scoped at assessment.
From £65,000
Individually scoped · fixed · VAT excl.
From 12 weeks
Portfolio frameworks covering 6+ applications with different architectures and teams commonly run 16–20 weeks.
What Portfolio Adds
Cross-application operational audit: current operational state per application, comparison across applications, identification of which application’s practices represent the best baseline for the organisation
Shared monitoring infrastructure design: common observability layer that aggregates quality, cost, and incident signals from all applications into a unified operational view
Portfolio cost governance: organisation-wide cost allocation, cross-application budget visibility, portfolio-level cost optimisation (shared caching, rate limit coordination, commitment discount planning across all applications)
LLMOps standards: organisation-wide standards for prompt version control, change management, incident classification, and feedback collection that all teams follow — enabling cross-team learning from incidents and improvements
LLMOps centre of excellence design: the internal capability that owns the shared framework, supports individual application teams, and evolves the standards as the landscape changes
Why Portfolio Frameworks Are Difficult
Standardisation resistance: teams with established practices resist external standards, especially when the standard requires changes to tools they have already invested in learning
Architecture heterogeneity: different applications have different architectures (API-call vs. RAG vs. fine-tuned vs. agentic) with different operational requirements — the framework must accommodate variation without being so flexible it provides no actual standard
Shared infrastructure ownership: who owns the shared monitoring infrastructure when each application team previously owned its own? This is an organisational design question that requires explicit resolution before the framework can be designed
Prioritisation conflicts: when multiple applications have critical operational gaps, the remediation roadmap creates competing priorities for a shared engineering resource — this requires explicit prioritisation with executive authority, not consensus
Portfolio Requirements
Named LLMOps programme sponsor: a senior technology leader with authority to mandate standards across all application teams — without this, teams will adopt the standards they agree with and ignore the ones they find inconvenient
Engineering leads from all application teams: available for audit interviews and framework review sessions throughout the engagement
Shared infrastructure decision made before framework design begins: whether shared monitoring infrastructure is the approach, who will own it, and what the resourcing plan is
Existing operational data from all applications: billing data, logs, incident records, and feedback data as available — the richer the existing data, the more accurate the audit

Client Obligations
The framework must be implemented before it has value — and implementation is the client’s responsibility
The operational framework we deliver is a design — runbooks, specifications, process descriptions, metric definitions, alert threshold recommendations, and decision authority charts. Its value is entirely contingent on implementation. An organisation that receives the framework, files it, and continues operating the system without implementing the monitoring, the version control, the incident response procedures, and the improvement cycle has received a document. The obligation to implement the framework on a defined schedule, starting with the immediate priority items, is a client obligation from the day of handover.
If the framework is not implemented after delivery
We cannot be responsible for operational failures that the framework would have prevented. If you engage us for an incident post-mortem after a quality or cost event, and the post-mortem reveals that the event would have been prevented or detected earlier by the monitoring framework we delivered, the root cause is framework non-implementation — a client obligation.
Honest disclosure of the current operational state — including gaps and past incidents
The operational audit for existing systems is only as accurate as the information provided. Organisations sometimes present their current operational state more positively than it is — because the assessment feels like an evaluation of the operations team’s performance. We are assessing the operational framework, not the team. Gaps in the current framework are not failures of the team — they are consequences of operating in a domain that was new and where operational best practices are not yet settled. Accurate disclosure of gaps, incidents, and operational shortcomings is a client obligation. It is what enables us to design the framework that addresses the actual risks rather than the perceived ones.
If gaps are concealed during the audit and discovered later
A gap we were not told about cannot be addressed by a framework designed without knowing about it. Post-handover discoveries of significant gaps require a framework update, scoped and priced at the time of discovery.
RJV Obligations
Operational thresholds and alert levels calibrated to your system — not to generic defaults
Monitoring thresholds — the quality metric levels that trigger alerts, the cost anomaly thresholds that trigger review, the latency percentiles that trigger escalation — must be calibrated to your system’s baseline performance, your application’s acceptable performance range, and your team’s capacity to respond to alerts. Generic thresholds produce either too many alerts (alert fatigue, real alerts ignored) or too few (degradation not detected). We calibrate every threshold to your system’s measured baseline. We document the calibration rationale so that thresholds can be recalibrated as the system’s baseline changes over time.
If calibrated thresholds produce alert fatigue after implementation
Raise within 30 days. We review the threshold calibration and recalibrate based on the production alert data. Threshold calibration is an iterative process — the first calibration is based on available baseline data, and production data will refine it.
Runbooks written at the level of specificity that an on-call engineer can execute at 3am without asking questions
Every incident response runbook is reviewed against a single criterion before delivery: can an engineer who is not the primary owner of this system execute every step of this runbook, under stress, at 3am, without calling anyone? If the answer is no — because a step says “check the logs” without specifying which logs, because an escalation step names a role without naming a person, because a resolution step requires knowledge that is not in the runbook — it is not ready to deliver. We review every runbook against this criterion. Where the answer is no, we add the missing specificity before delivery.
If an on-call engineer cannot execute a runbook step during an incident
Raise within 5 business days of the incident. We review the step and add the missing specificity within 3 business days at no cost. A runbook that passes our pre-delivery review but fails in production because of information that was not available at design time (a new access credential, a changed escalation path) is updated as a standard maintenance activity, not a delivery failure.

Questions to answer before — or immediately after — going live with an LLM application

We already have standard application monitoring in place. Isn’t that sufficient?
Standard application monitoring measures availability, latency, and error rate — the signals that indicate whether the system is running. It does not measure whether the system’s outputs are correct, whether the model’s behaviour has drifted, whether the cost per request is increasing, or whether the query distribution has moved outside the evaluated region. All of these are failure conditions for an LLM application that produce no signal in standard monitoring. A system that is 100% available, under its latency SLA, and returning zero error codes can simultaneously be producing incorrect answers on 40% of queries, spending 3× its intended budget, and drifting away from its evaluated behaviour. Standard monitoring would not detect any of these. The operational framework we design adds the LLM-specific layer on top of the existing monitoring infrastructure — it does not replace it.
Our LLM application is small — a single use case for an internal team. Is this level of operational framework justified?
The minimum viable operational framework for a small internal application is lighter than for a customer-facing system with regulatory implications. For a small internal application: at minimum, you need prompt version control (so you know what changed when quality changes), a basic cost monitoring alert (so a budget spike is detected before the month-end invoice), and a feedback collection mechanism that is reviewed monthly (so you have a signal when something is wrong). This does not require a full engagement — it requires an afternoon of setup and a defined review rhythm. The New Deployment engagement is the right scope for a production deployment of any size. A team that says “our application is too small for operational rigour” is usually a team that has not yet experienced a silent quality degradation or an unexpected cost spike. The framework is cheaper to set up before the incident than to retrofit after it.
We experienced a significant quality degradation event. Where do we start?
The first priority is establishing what happened and whether it is ongoing. If the degradation is currently affecting users, the immediate priority is determining whether the model version changed, whether a prompt change was deployed, whether the vendor had an incident, or whether the query distribution shifted. These are the four most common causes of sudden quality degradation and each has a different response. Once the immediate cause is identified and the degradation is stabilised, the Operations Audit engagement provides the structured post-incident analysis and framework design that prevents recurrence. Contact us for a rapid assessment session — 90 minutes to review what you know about the degradation and provide an initial view on the most likely cause and the fastest route to stabilisation.
How does LLMOps relate to the other LLM services on this site?
The LLM services form a logical sequence. Enterprise LLM Strategy & Vendor Selection (service page) establishes what to build and which platform to build on. RAG Architecture (service page) designs the knowledge system. LLMOps designs how it is operated after deployment. Each service produces outputs that the next service builds from — the vendor selection informs which vendor APIs the operations framework monitors; the RAG architecture informs the knowledge governance component of the operations framework. They can be engaged sequentially or in parallel depending on where in the deployment lifecycle the organisation is. For organisations already in production, the Operations Audit can be engaged independently without having previously engaged the other services.
What is the difference between this service and an AI monitoring tool vendor?
Monitoring tool vendors sell software that implements specific monitoring capabilities — LLM tracing, evaluation runners, cost dashboards, prompt management. They provide the infrastructure for monitoring. They do not provide the operational design decisions: which metrics matter for your specific application and task type, what thresholds are calibrated to your system’s baseline, how incidents are classified and escalated, who has the decision authority for which response, and how the improvement cycle is integrated with your team’s existing processes. This engagement designs those decisions. The monitoring tools your team then selects and implements are the mechanism for executing the decisions. In many cases we recommend specific tools during the engagement — as the best implementation of a specific component of the framework. We have no commercial relationship with any tool vendor.
What are your payment terms?
50% on contract signature, 50% on written acceptance of the final framework deliverables. No milestone payments during execution. Scope additions — additional applications in the audit, additional playbooks for use cases not in the original scope — are invoiced as agreed in writing before execution, never retrospectively. The final payment is contingent on written acceptance. If a deliverable — a runbook, a monitoring specification, a change management procedure — does not meet the agreed definition of completeness, we remediate before raising the final invoice. The 60-day post-delivery advisory support included in the Audit engagement is part of the programme fee. Post-delivery quarterly review facilitation at £3,500 per review cycle is separately invoiced when sessions are scheduled.

Start with an operations assessment. Tell us what the last unexpected event was — a cost spike, a quality complaint, a model behaviour change — and we will tell you what operational gap it exposed.

90 minutes. We review your current LLM application’s operational state: what monitoring exists, how prompts are managed, how costs are tracked, what the last incident was and how it was detected and resolved. We identify the most significant operational gaps and give you an initial assessment of their risk. Whether the engagement is a new deployment or an existing system audit, the assessment session tells you which gaps require immediate action and which can be addressed in a structured framework engagement.

If you have a production LLM application and you are not continuously evaluating its output quality, you do not know whether it is working correctly today. That is not a judgment — it is the baseline condition of most production LLM deployments. The assessment session is 90 minutes to find out what you do not know.

Format
Video call or in-person in London. 90 minutes.
Cost
Free. No commitment.
Lead time
Within 5 business days. For systems experiencing active incidents: contact us directly by email.
Bring
Your LLM application architecture overview. Your current monitoring setup — what exists today. Your last 3 months of cost data if available. The last unexpected event — a quality complaint, a cost spike, a model behaviour change, an incident — and how it was detected and resolved. Your go-live date if the application is not yet in production.
Attendees
Engineering lead or SRE who owns the system day-to-day. Optionally, the product owner. From RJV: a senior LLMOps consultant. Not a monitoring tool vendor representative.
After
Written summary of session findings within 2 business days. Fixed-price scope for the appropriate engagement type within 5 business days if you want to proceed.