The architecture delivered by this engagement covers every stage of the RAG pipeline. Each stage has multiple technically valid options. The correct option for your system depends on your document types, your knowledge domain, your query distribution, your access control requirements, and your performance constraints. The architecture is not a selection from a menu of standard patterns — it is a set of decisions made from the analysis of your specific situation. Below is what each stage involves and why it cannot be treated as a default choice.
Stage 1 — Document Processing
Source ingestion and structure extraction
Before any embedding or indexing, documents must be converted from their source formats — PDF, Word, HTML, SharePoint, Confluence, database exports, email archives, scanned images — into structured text that preserves the meaning relationships in the original. This is more than format conversion. It requires decisions about how tables are represented, how heading hierarchies are encoded as metadata, how footnotes are associated with the text they modify, and how images and diagrams are handled.
Key architecture decisions
Extraction method per document type: native text extraction vs. OCR vs. layout-aware parsing (for complex PDFs) vs. structured API (for SharePoint/Confluence)
Table handling: extract as structured objects with row/column metadata vs. linearise into text vs. exclude and flag for human review
Metadata schema: what metadata is attached to each document at ingestion (date, author, version, access permissions, document type, topic tags)
Image and diagram handling: exclude, OCR, or generate alt-text descriptions using a vision model
Pre-processing quality gates: what document quality checks are applied before ingestion — minimum length, language detection, duplicate detection, superseded document flagging
What drives these decisions
Source format mix, document complexity, proportion of tables and images in the knowledge base, available processing budget per document type
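To make the pre-processing quality gates concrete, here is a minimal sketch of two of the checks named above, minimum length and exact-duplicate detection. The threshold and return values are illustrative assumptions, not part of any delivered architecture; a production gate would also cover language detection and superseded-document flagging.

```python
import hashlib

MIN_CHARS = 200  # hypothetical minimum-length threshold


def quality_gate(doc_text: str, seen_hashes: set) -> tuple:
    """Apply simple pre-ingestion checks: minimum length and exact-duplicate detection."""
    if len(doc_text.strip()) < MIN_CHARS:
        return False, "too_short"
    # Hash the full text so an identical re-upload is caught before re-embedding.
    digest = hashlib.sha256(doc_text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False, "duplicate"
    seen_hashes.add(digest)
    return True, "ok"
```

Documents that fail a gate would be routed to a review queue rather than silently dropped, so content owners can see what was excluded and why.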
Stage 2 — Chunking Strategy
How documents are divided into retrievable units
Chunking determines the granularity at which content is retrieved. Too large and retrieval returns more context than needed, diluting the relevant content and consuming context window. Too small and answers that require multi-sentence reasoning are split across chunks, never retrieved together. The right chunking strategy depends on the document structure, the expected query types, and the retrieval architecture — there is no universal correct chunk size.
Key architecture decisions
Chunking approach: fixed-size (simple, document-structure-agnostic), semantic (paragraph/section boundaries), or hierarchical (parent document + child chunks for multi-granularity retrieval)
Chunk size and overlap: determined by the 95th-percentile answer length in the domain, not by default settings
Chunk enrichment: adding document title, section heading, and topic summary as context prefix to each chunk — so that chunks can be understood without retrieving the full document
Multi-hop support: whether the architecture needs to support questions that require retrieving from multiple documents and reasoning across them — if yes, this drives specific chunking and context assembly requirements
What drives these decisions
Document length distribution, proportion of questions requiring multi-paragraph answers, whether the knowledge base has hierarchical structure that semantic chunking can exploit
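A minimal sketch of the semantic option above: pack paragraphs into chunks up to a size cap, carrying a trailing paragraph into the next chunk as overlap. The size and overlap values here are placeholders; as noted above, the real values come from the domain's answer-length distribution.

```python
def chunk_by_paragraphs(text: str, max_chars: int = 800, overlap: int = 1) -> list:
    """Greedily pack paragraphs into chunks, respecting paragraph boundaries.

    When a chunk fills up, the last `overlap` paragraphs are repeated at the
    start of the next chunk so answers spanning a boundary are not split.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for p in paragraphs:
        if current and size + len(p) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:] if overlap else []
            size = sum(len(x) for x in current)
        current.append(p)
        size += len(p)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A hierarchical variant would additionally record each chunk's parent document id so retrieval can expand a matching child chunk to its surrounding section.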
Stage 3 — Embedding & Indexing
Converting chunks into searchable vector representations
The embedding model determines how the semantic similarity between queries and documents is computed. The choice affects every subsequent retrieval operation. A general-purpose embedding model is the correct starting point — but it is not always the correct finishing point. Domain-specific vocabulary, cross-lingual requirements, and the specific semantic distinctions that matter in the domain all influence which embedding model produces the best retrieval performance on the actual query distribution.
Key architecture decisions
Embedding model selection: evaluated on a representative sample of domain query-document pairs, not on general benchmarks
Retrieval approach: dense-only (embedding similarity), sparse-only (BM25 keyword), or hybrid — and the balance between them for the domain’s query distribution
Index architecture: single flat index vs. multiple specialised indices by document type, topic, or access tier
Metadata indexing: which metadata fields are indexed for filtering vs. stored for display only
Vector database selection: evaluated on query latency at the index size, filtering capabilities, update performance, and operational requirements
What drives these decisions
Domain vocabulary specialisation, index size, query latency requirements, multilingual requirements, proportion of queries requiring exact term matching
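One common way to combine the dense and sparse results mentioned above is reciprocal rank fusion, which merges ranked lists without needing the raw scores to be comparable. This is a sketch of the fusion step only, assuming each retriever returns an ordered list of document ids; the constant k = 60 is the value commonly used in the RRF literature.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse multiple ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, it sidesteps the problem that embedding similarities and BM25 scores live on different scales; a weighted-score alternative would require calibrating that balance per domain.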
Stage 4 — Query Processing & Retrieval
How user queries are transformed before retrieval
User queries are rarely in the form that produces the best retrieval results. They may be ambiguous, abbreviated, conversational, or phrased differently from the terminology in the indexed documents. Query processing transforms the user’s input to improve retrieval before the actual search is executed. The appropriate transformations depend on the user population, the document vocabulary, and the conversation context.
Key architecture decisions
Query expansion: expanding the query with synonyms, related terms, or reformulations — when the domain has significant vocabulary variation between users and documents
Hypothetical document embedding (HyDE): generating a hypothetical answer to the query and using that as the retrieval query — improves recall for questions whose phrasing differs significantly from document language
Conversation history management: how previous turns in a multi-turn conversation are incorporated into the retrieval query — query rewriting vs. full conversation embedding vs. context-aware filtering
Query routing: directing different query types to different indices or retrieval strategies — factual questions to the dense index, exact terminology questions to the sparse index, date-sensitive questions through the metadata filter first
What drives these decisions
Gap between user vocabulary and document vocabulary, proportion of multi-turn conversations in the expected query distribution, query type diversity
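The simplest form of the query expansion decision above can be sketched with a static synonym table. The table contents here are invented for illustration; in practice the mapping would be built from the actual gap between user vocabulary and document vocabulary, or generated by a model.

```python
# Hypothetical domain vocabulary map: user shorthand -> document terminology.
SYNONYMS = {
    "pto": ["paid time off", "vacation"],
    "comp": ["compensation", "salary"],
}


def expand_query(query: str) -> str:
    """Append known document-side synonyms for any shorthand term in the query."""
    extras = []
    for term in query.lower().split():
        extras.extend(SYNONYMS.get(term, []))
    return query if not extras else query + " " + " ".join(extras)
```

The expanded string feeds the sparse retriever; the dense retriever is typically less sensitive to surface vocabulary and may use the original query unchanged.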
Stage 5 — Re-ranking
Promoting the most relevant retrieved chunks to the top
Initial retrieval by vector similarity produces a ranked list of candidate chunks. Re-ranking applies a more expensive but more accurate relevance model to the top N candidates from initial retrieval, producing a re-ordered list where the chunks that are most directly relevant to the query appear first. Re-ranking is the component that most consistently improves RAG system quality on real-world queries — and the one most commonly omitted in initial implementations because it adds latency and cost.
Key architecture decisions
Re-ranker type: cross-encoder (high accuracy, higher latency, best for low-throughput applications), LLM-based relevance scoring (highest accuracy, highest cost, appropriate for high-value queries), or a smaller fine-tuned re-ranker (balanced)
Re-ranking scope: how many initial candidates are re-ranked before context assembly — wider scope improves recall at the cost of latency
Relevance threshold: whether to filter out re-ranked chunks below a minimum relevance score — relevant for systems where “I don’t know” is a better response than a low-confidence answer
Diversity re-ranking: preventing the re-ranked list from being dominated by semantically identical chunks from the same source document
What drives these decisions
Acceptable latency budget, query throughput, frequency of queries where initial retrieval returns the wrong document at top position
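The relevance-threshold and diversity decisions above compose naturally into one post-re-ranking pass. A minimal sketch, assuming each re-ranked chunk is a dict with hypothetical `doc_id` and `score` fields; the threshold and per-document cap are placeholders to be tuned per application.

```python
def prune_and_diversify(ranked: list, min_score: float = 0.2, max_per_doc: int = 2) -> list:
    """Drop chunks below the relevance threshold, then cap chunks per source document.

    Input must already be sorted by re-ranker score, best first; order is preserved.
    """
    counts, out = {}, []
    for chunk in ranked:
        if chunk["score"] < min_score:
            continue
        doc = chunk["doc_id"]
        if counts.get(doc, 0) < max_per_doc:
            out.append(chunk)
            counts[doc] = counts.get(doc, 0) + 1
    return out
```

If the pruned list comes back empty, that is the signal for the "I don't know" response path rather than generating from weak context.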
Stage 6 — Context Assembly & Generation
Assembling retrieved content for the generation model
The context passed to the generation model — the system prompt, the retrieved chunks, the conversation history, and the user query — must be assembled in a way that maximises the probability of the model generating an accurate, grounded answer. Context assembly is not a concatenation operation. It is a deliberate design decision about what information the model needs, in what order, and within what length constraints, to produce the best possible answer for each query type.
Key architecture decisions
Context ordering: most relevant chunks at beginning and end of context, not in score order — addressing the “lost in the middle” attention pattern
Source attribution: how chunk provenance is communicated to the model and how citations are included in the generated response
Context length management: how to handle cases where retrieved chunks exceed the context window — priority-based truncation vs. summarisation vs. multi-call strategies
System prompt design: instructions that constrain the model to answer from the context, specify the response format, define the handling for questions without sufficient context, and set the tone appropriate to the application
Faithfulness enforcement: system prompt and output validation to detect and handle responses that contradict or go beyond the retrieved context
What drives these decisions
Model context window size, required response format (structured vs. prose vs. citations), proportion of queries expected to have no answer in the knowledge base
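The "lost in the middle" ordering decision above can be sketched as a simple reshuffle: alternate chunks between the front and the back of the context, so the weakest chunks land in the middle where model attention is least reliable. This assumes the input list is already sorted best-first by re-ranker score.

```python
def order_for_attention(chunks_by_score: list) -> list:
    """Reorder best-first chunks so the strongest sit at the start and end of the context.

    Even-indexed (stronger) chunks go to the front, odd-indexed to the back,
    which is then reversed so the second-best chunk ends up last.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_score):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

The reordered list is then joined into the prompt, typically with per-chunk source labels so the attribution decision above can be enforced in the generated citations.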
Stage 7 — Evaluation Framework
Measuring whether the system actually works
The evaluation framework is not a post-deployment concern — it is an architecture component that must be designed alongside the pipeline, because the metrics and methodology determine what “working correctly” means for this specific system. Without a defined evaluation framework before deployment, there is no baseline against which to measure degradation, no signal that document staleness is affecting answer quality, and no evidence with which to justify the system to stakeholders who are sceptical of AI outputs.
What the evaluation framework covers
Retrieval metrics: Recall@K (are the relevant documents in the top K retrieved?), MRR (is the most relevant document ranked first?), precision (what proportion of retrieved documents are relevant?)
Generation metrics: answer faithfulness (does the answer contradict the retrieved context?), answer relevance (does the answer address the query?), groundedness (can every claim in the answer be traced to a retrieved chunk?)
Test set design: how the test set is constructed to reflect real user queries, how it is maintained as the knowledge base evolves, and how production query logs are incorporated over time
Continuous monitoring: what is monitored in production, at what frequency, with what alerting thresholds, and what the response protocol is when metrics fall below threshold
Human evaluation workflow: for queries where automated metrics are insufficient — how human reviewers assess answer quality and how those assessments feed back into system improvement
What drives these decisions
Acceptable false-positive and false-negative rates for the application, regulatory requirements for evidence of system performance, available resource for human evaluation
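The two retrieval metrics named above are short enough to state exactly. A minimal sketch, assuming each test case pairs a ranked list of retrieved doc ids with the set of ids judged relevant:

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0


def mrr(queries: list) -> float:
    """Mean reciprocal rank of the first relevant doc, averaged over (retrieved, relevant) pairs."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries) if queries else 0.0
```

Tracking both matters: Recall@K tells you whether the answer was retrievable at all, while MRR tells you whether re-ranking is putting it where the generation model will actually use it.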
Cross-cutting — Knowledge Governance
Keeping the knowledge base current and trusted
Knowledge governance is not a pipeline stage — it is a set of operational processes and technical mechanisms that run continuously after deployment to maintain the accuracy and currency of the knowledge base. A RAG system without knowledge governance becomes progressively less reliable as the underlying documents change. The governance framework defines who owns each document in the knowledge base, how changes are detected and propagated to the index, how conflicting information across documents is identified and resolved, and how the system communicates uncertainty to users when its knowledge is outdated or incomplete.
What the governance framework covers
Document ownership: named owners per document or document category, with defined responsibilities for keeping their documents current
Change detection and propagation: automated pipeline that detects source document changes and triggers re-ingestion within a defined SLA
Conflict detection: process for identifying and resolving cases where different documents in the knowledge base give conflicting answers to the same question
Gap detection: process for identifying questions the system cannot answer from current knowledge and flagging missing documentation to content owners
User feedback integration: mechanism for collecting user signals about incorrect or incomplete answers and routing them to the appropriate document owner
What drives these decisions
Rate of change of the knowledge base, consequence of outdated answers for the application's users, available content ownership structure in the organisation
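The change-detection mechanism above reduces, at its core, to comparing content hashes recorded at last ingestion against the current sources. This sketch assumes documents are available as id-to-text mappings; a real pipeline would pull from the source systems' change APIs where available and fall back to hashing.

```python
import hashlib


def detect_changes(current: dict, indexed_hashes: dict) -> dict:
    """Compare live source documents against hashes recorded at last ingestion.

    Returns the doc ids needing re-ingestion ("new", "changed") and the ids
    whose chunks should be deleted from the index ("removed").
    """
    new, changed = [], []
    for doc_id, text in current.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if doc_id not in indexed_hashes:
            new.append(doc_id)
        elif indexed_hashes[doc_id] != digest:
            changed.append(doc_id)
    removed = [d for d in indexed_hashes if d not in current]
    return {"new": new, "changed": changed, "removed": removed}
```

The governance SLA then becomes measurable: time from a document appearing in "changed" to its re-embedded chunks being live in the index.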