Prompt Engineering & System Design -

Services / Prompt Engineering & System Design

The system prompt is the primary mechanism for controlling what a language model does in production. It defines the model’s role, its constraints, its output format, how it handles edge cases, what it refuses to do, and how it behaves when user input is ambiguous or adversarial. In most production LLM applications, the system prompt was written quickly, tested informally, and modified repeatedly by different people in response to individual complaints — without a coherent design, without version control, and without systematic evaluation of whether each change improved the system or introduced new failure modes elsewhere.

Prompt engineering as a discipline is frequently underestimated in two opposite directions. It is underestimated as too simple — “it’s just writing instructions” — by organisations that have not yet encountered the precision required to produce consistent, reliable behaviour from a probabilistic model across a real production input distribution. And it is overestimated as sufficient — “better prompts will fix this” — by organisations whose underlying problem is not prompt design but model capability, knowledge base quality, or architecture. Distinguishing between these is the first task this engagement performs.

This service designs the system prompt architecture, the input and output format specifications, the edge case and adversarial input handling, the evaluation framework for measuring prompt performance, and the governance process for maintaining prompt quality over time. The output is a complete prompt system — not a single prompt — with the documentation, evaluation evidence, and governance process that allows it to be maintained, improved, and handed between team members without loss of the design rationale.

Book a Prompt Assessment →

Pricing & Scope

Price Range

£8,500 – £55,000+
Prompt system design, evaluation framework, and governance process. Does not include implementation of tooling or ongoing prompt management.

Duration

3 – 14 weeks
Depends on number of prompts in scope, whether existing prompts are being redesigned, and whether the engagement is a single application or portfolio.

Scope

System prompt architecture, I/O format specifications, edge case and adversarial handling, evaluation framework, governance process. Not ongoing prompt management.

Applicable to

New LLM applications being designed – Existing applications with accumulated prompt debt – Applications where output consistency is inadequate – Organisations establishing prompt governance standards

Contract

Fixed-price. 50% on signing, 50% on delivery acceptance.

Prompt engineering is not sufficient if the underlying problem is architecturalBetter prompts cannot compensate for an embedding model that does not handle the domain vocabulary, a knowledge base with contradictory information, a model lacking the required capability, or an architecture that routes queries incorrectly. This engagement begins by assessing whether prompt design is the correct intervention. If it is not, we will say so before any prompt design begins.

How Prompt Systems Fail in Production

Eight specific failure modes. Each one is a design error, not a model limitation.

The failures below are not caused by the model being incapable. They are caused by prompts that are imprecise, contradictory, incomplete, or designed for a narrow range of inputs that does not cover the full production input distribution. Most of them are visible in the first systematic evaluation of a production prompt against a representative test set — which most organisations have not conducted.

The prompt contains contradictory instructions that create inconsistent behaviour

Prompts modified incrementally by multiple people accumulate contradictory instructions. An instruction to be concise contradicts a later instruction to be thorough. An instruction to answer only from context contradicts a later instruction to use general knowledge when context is insufficient. The model resolves contradictions inconsistently, producing output quality that varies unpredictably across superficially similar inputs.

What this looks like in production

A legal document assistant produces responses that are sometimes 2 sentences and sometimes 12 sentences for equivalent complexity questions. The inconsistency is caused by three length instructions added at different times that contradict each other — none visible to the individual who added each one because each was tested only against the specific scenario they had just fixed.