The architecture delivered by this engagement covers every stage of the RAG pipeline. Each stage has multiple technically valid options. The correct option for your system depends on your document types, your knowledge domain, your query distribution, your access control requirements, and your performance constraints. The architecture is not a selection from a menu of standard patterns — it is a set of decisions made from the analysis of your specific situation. Below is what each stage involves and why it cannot be treated as a default choice.
Stage 1 — Document Processing
Source ingestion and structure extraction
Before any embedding or indexing, documents must be converted from their source formats — PDF, Word, HTML, SharePoint, Confluence, database exports, email archives, scanned images — into structured text that preserves the meaning relationships in the original. This is more than format conversion. It requires decisions about how tables are represented, how heading hierarchies are encoded as metadata, how footnotes are associated with the text they modify, and how images and diagrams are handled.
Key architecture decisions
Extraction method per document type: native text extraction vs. OCR vs. layout-aware parsing (for complex PDFs) vs. structured API (for SharePoint/Confluence)
Table handling: extract as structured objects with row/column metadata vs. linearise into text vs. exclude and flag for human review
Metadata schema: what metadata is attached to each document at ingestion (date, author, version, access permissions, document type, topic tags)
Image and diagram handling: exclude, OCR, or generate alt-text descriptions using a vision model
Pre-processing quality gates: what document quality checks are applied before ingestion — minimum length, language detection, duplicate detection, superseded document flagging
What drives these decisions
Source format mix, document complexity, proportion of tables and images in the knowledge base, available processing budget per document type
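To make the pre-processing quality gates concrete, here is a minimal sketch of two of the checks named above, minimum length and exact-duplicate detection. The threshold and return values are illustrative assumptions, not part of any delivered architecture; a production gate would also cover language detection and superseded-document flagging.

```python
import hashlib

MIN_CHARS = 200  # hypothetical minimum-length threshold


def quality_gate(doc_text: str, seen_hashes: set) -> tuple:
    """Apply simple pre-ingestion checks: minimum length and exact-duplicate detection."""
    if len(doc_text.strip()) < MIN_CHARS:
        return False, "too_short"
    # Hash the full text so an identical re-upload is caught before re-embedding.
    digest = hashlib.sha256(doc_text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return False, "duplicate"
    seen_hashes.add(digest)
    return True, "ok"
```

Documents that fail a gate would be routed to a review queue rather than silently dropped, so content owners can see what was excluded and why.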
Stage 2 — Chunking Strategy
How documents are divided into retrievable units
Chunking determines the granularity at which content is retrieved. Too large and retrieval returns more context than needed, diluting the relevant content and consuming context window. Too small and answers that require multi-sentence reasoning are split across chunks, never retrieved together. The right chunking strategy depends on the document structure, the expected query types, and the retrieval architecture — there is no universal correct chunk size.
Key architecture decisions
Chunking approach: fixed-size (simple, document-structure-agnostic), semantic (paragraph/section boundaries), or hierarchical (parent document + child chunks for multi-granularity retrieval)
Chunk size and overlap: determined by the 95th-percentile answer length in the domain, not by default settings
Chunk enrichment: adding document title, section heading, and topic summary as context prefix to each chunk — so that chunks can be understood without retrieving the full document
Multi-hop support: whether the architecture needs to support questions that require retrieving from multiple documents and reasoning across them — if yes, this drives specific chunking and context assembly requirements
What drives these decisions
Document length distribution, proportion of questions requiring multi-paragraph answers, whether the knowledge base has hierarchical structure that semantic chunking can exploit
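A minimal sketch of the semantic option above: pack paragraphs into chunks up to a size cap, carrying a trailing paragraph into the next chunk as overlap. The size and overlap values here are placeholders; as noted above, the real values come from the domain's answer-length distribution.

```python
def chunk_by_paragraphs(text: str, max_chars: int = 800, overlap: int = 1) -> list:
    """Greedily pack paragraphs into chunks, respecting paragraph boundaries.

    When a chunk fills up, the last `overlap` paragraphs are repeated at the
    start of the next chunk so answers spanning a boundary are not split.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, size = [], [], 0
    for p in paragraphs:
        if current and size + len(p) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:] if overlap else []
            size = sum(len(x) for x in current)
        current.append(p)
        size += len(p)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A hierarchical variant would additionally record each chunk's parent document id so retrieval can expand a matching child chunk to its surrounding section.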
Stage 3 — Embedding & Indexing
Converting chunks into searchable vector representations
The embedding model determines how the semantic similarity between queries and documents is computed. The choice affects every subsequent retrieval operation. A general-purpose embedding model is the correct starting point — but it is not always the correct finishing point. Domain-specific vocabulary, cross-lingual requirements, and the specific semantic distinctions that matter in the domain all influence which embedding model produces the best retrieval performance on the actual query distribution.
Key architecture decisions
Embedding model selection: evaluated on a representative sample of domain query-document pairs, not on general benchmarks
Retrieval approach: dense-only (embedding similarity), sparse-only (BM25 keyword), or hybrid — and the balance between them for the domain’s query distribution
Index architecture: single flat index vs. multiple specialised indices by document type, topic, or access tier
Metadata indexing: which metadata fields are indexed for filtering vs. stored for display only
Vector database selection: evaluated on query latency at the index size, filtering capabilities, update performance, and operational requirements
What drives these decisions
Domain vocabulary specialisation, index size, query latency requirements, multilingual requirements, proportion of queries requiring exact term matching
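One common way to combine the dense and sparse results mentioned above is reciprocal rank fusion, which merges ranked lists without needing the raw scores to be comparable. This is a sketch of the fusion step only, assuming each retriever returns an ordered list of document ids; the constant k = 60 is the value commonly used in the RRF literature.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse multiple ranked lists of doc ids: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, it sidesteps the problem that embedding similarities and BM25 scores live on different scales; a weighted-score alternative would require calibrating that balance per domain.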
Stage 4 — Query Processing & Retrieval
How user queries are transformed before retrieval
User queries are rarely in the form that produces the best retrieval results. They may be ambiguous, abbreviated, conversational, or phrased differently from the terminology in the indexed documents. Query processing transforms the user’s input to improve retrieval before the actual search is executed. The appropriate transformations depend on the user population, the document vocabulary, and the conversation context.
Key architecture decisions
Query expansion: expanding the query with synonyms, related terms, or reformulations — when the domain has significant vocabulary variation between users and documents
Hypothetical document embedding (HyDE): generating a hypothetical answer to the query and using that as the retrieval query — improves recall for questions whose phrasing differs significantly from document language
Conversation history management: how previous turns in a multi-turn conversation are incorporated into the retrieval query — query rewriting vs. full conversation embedding vs. context-aware filtering
Query routing: directing different query types to different indices or retrieval strategies — factual questions to the dense index, exact terminology questions to the sparse index, date-sensitive questions through the metadata filter first
What drives these decisions
Gap between user vocabulary and document vocabulary, proportion of multi-turn conversations in the expected query distribution, query type diversity
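The simplest form of the query expansion decision above can be sketched with a static synonym table. The table contents here are invented for illustration; in practice the mapping would be built from the actual gap between user vocabulary and document vocabulary, or generated by a model.

```python
# Hypothetical domain vocabulary map: user shorthand -> document terminology.
SYNONYMS = {
    "pto": ["paid time off", "vacation"],
    "comp": ["compensation", "salary"],
}


def expand_query(query: str) -> str:
    """Append known document-side synonyms for any shorthand term in the query."""
    extras = []
    for term in query.lower().split():
        extras.extend(SYNONYMS.get(term, []))
    return query if not extras else query + " " + " ".join(extras)
```

The expanded string feeds the sparse retriever; the dense retriever is typically less sensitive to surface vocabulary and may use the original query unchanged.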
Stage 5 — Re-ranking
Promoting the most relevant retrieved chunks to the top
Initial retrieval by vector similarity produces a ranked list of candidate chunks. Re-ranking applies a more expensive but more accurate relevance model to the top N candidates from initial retrieval, producing a re-ordered list where the chunks that are most directly relevant to the query appear first. Re-ranking is the component that most consistently improves RAG system quality on real-world queries — and the one most commonly omitted in initial implementations because it adds latency and cost.
Key architecture decisions
Re-ranker type: cross-encoder (high accuracy, higher latency, best for low-throughput applications), LLM-based relevance scoring (highest accuracy, highest cost, appropriate for high-value queries), or a smaller fine-tuned re-ranker (balanced)
Re-ranking scope: how many initial candidates are re-ranked before context assembly — wider scope improves recall at the cost of latency
Relevance threshold: whether to filter out re-ranked chunks below a minimum relevance score — relevant for systems where “I don’t know” is a better response than a low-confidence answer
Diversity re-ranking: preventing the re-ranked list from being dominated by semantically identical chunks from the same source document
What drives these decisions
Acceptable latency budget, query throughput, frequency of queries where initial retrieval returns the wrong document at top position
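The relevance-threshold and diversity decisions above compose naturally into one post-re-ranking pass. A minimal sketch, assuming each re-ranked chunk is a dict with hypothetical `doc_id` and `score` fields; the threshold and per-document cap are placeholders to be tuned per application.

```python
def prune_and_diversify(ranked: list, min_score: float = 0.2, max_per_doc: int = 2) -> list:
    """Drop chunks below the relevance threshold, then cap chunks per source document.

    Input must already be sorted by re-ranker score, best first; order is preserved.
    """
    counts, out = {}, []
    for chunk in ranked:
        if chunk["score"] < min_score:
            continue
        doc = chunk["doc_id"]
        if counts.get(doc, 0) < max_per_doc:
            out.append(chunk)
            counts[doc] = counts.get(doc, 0) + 1
    return out
```

If the pruned list comes back empty, that is the signal for the "I don't know" response path rather than generating from weak context.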
Stage 6 — Context Assembly & Generation
Assembling retrieved content for the generation model
The context passed to the generation model — the system prompt, the retrieved chunks, the conversation history, and the user query — must be assembled in a way that maximises the probability of the model generating an accurate, grounded answer. Context assembly is not a concatenation operation. It is a deliberate design decision about what information the model needs, in what order, and within what length constraints, to produce the best possible answer for each query type.
Key architecture decisions
Context ordering: most relevant chunks at beginning and end of context, not in score order — addressing the “lost in the middle” attention pattern
Source attribution: how chunk provenance is communicated to the model and how citations are included in the generated response
Context length management: how to handle cases where retrieved chunks exceed the context window — priority-based truncation vs. summarisation vs. multi-call strategies
System prompt design: instructions that constrain the model to answer from the context, specify the response format, define the handling for questions without sufficient context, and set the tone appropriate to the application
Faithfulness enforcement: system prompt and output validation to detect and handle responses that contradict or go beyond the retrieved context
What drives these decisions
Model context window size, required response format (structured vs. prose vs. citations), proportion of queries expected to have no answer in the knowledge base
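The "lost in the middle" ordering decision above can be sketched as a simple reshuffle: alternate chunks between the front and the back of the context, so the weakest chunks land in the middle where model attention is least reliable. This assumes the input list is already sorted best-first by re-ranker score.

```python
def order_for_attention(chunks_by_score: list) -> list:
    """Reorder best-first chunks so the strongest sit at the start and end of the context.

    Even-indexed (stronger) chunks go to the front, odd-indexed to the back,
    which is then reversed so the second-best chunk ends up last.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_score):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

The reordered list is then joined into the prompt, typically with per-chunk source labels so the attribution decision above can be enforced in the generated citations.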
Stage 7 — Evaluation Framework
Measuring whether the system actually works
The evaluation framework is not a post-deployment concern — it is an architecture component that must be designed alongside the pipeline, because the metrics and methodology determine what “working correctly” means for this specific system. Without a defined evaluation framework before deployment, there is no baseline against which to measure degradation, no signal that document staleness is affecting answer quality, and no evidence with which to justify the system to stakeholders who are sceptical of AI outputs.
What the evaluation framework covers
Retrieval metrics: Recall@K (are the relevant documents in the top K retrieved?), MRR (is the most relevant document ranked first?), precision (what proportion of retrieved documents are relevant?)
Generation metrics: answer faithfulness (does the answer contradict the retrieved context?), answer relevance (does the answer address the query?), groundedness (can every claim in the answer be traced to a retrieved chunk?)
Test set design: how the test set is constructed to reflect real user queries, how it is maintained as the knowledge base evolves, and how production query logs are incorporated over time
Continuous monitoring: what is monitored in production, at what frequency, with what alerting thresholds, and what the response protocol is when metrics fall below threshold
Human evaluation workflow: for queries where automated metrics are insufficient — how human reviewers assess answer quality and how those assessments feed back into system improvement
What drives these decisions
Acceptable false-positive and false-negative rates for the application, regulatory requirements for evidence of system performance, available resource for human evaluation
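The two retrieval metrics named above are short enough to state exactly. A minimal sketch, assuming each test case pairs a ranked list of retrieved doc ids with the set of ids judged relevant:

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0


def mrr(queries: list) -> float:
    """Mean reciprocal rank of the first relevant doc, averaged over (retrieved, relevant) pairs."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries) if queries else 0.0
```

Tracking both matters: Recall@K tells you whether the answer was retrievable at all, while MRR tells you whether re-ranking is putting it where the generation model will actually use it.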
Cross-cutting — Knowledge Governance
Keeping the knowledge base current and trusted
Knowledge governance is not a pipeline stage — it is a set of operational processes and technical mechanisms that run continuously after deployment to maintain the accuracy and currency of the knowledge base. A RAG system without knowledge governance becomes progressively less reliable as the underlying documents change. The governance framework defines who owns each document in the knowledge base, how changes are detected and propagated to the index, how conflicting information across documents is identified and resolved, and how the system communicates uncertainty to users when its knowledge is outdated or incomplete.
What the governance framework covers
Document ownership: named owners per document or document category, with defined responsibilities for keeping their documents current
Change detection and propagation: automated pipeline that detects source document changes and triggers re-ingestion within a defined SLA
Conflict detection: process for identifying and resolving cases where different documents in the knowledge base give conflicting answers to the same question
Gap detection: process for identifying questions the system cannot answer from current knowledge and flagging missing documentation to content owners
User feedback integration: mechanism for collecting user signals about incorrect or incomplete answers and routing them to the appropriate document owner
What drives these decisions
Rate of change of the knowledge base, consequence of outdated answers for the application's users, available content ownership structure in the organisation
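The change-detection mechanism above reduces, at its core, to comparing content hashes recorded at last ingestion against the current sources. This sketch assumes documents are available as id-to-text mappings; a real pipeline would pull from the source systems' change APIs where available and fall back to hashing.

```python
import hashlib


def detect_changes(current: dict, indexed_hashes: dict) -> dict:
    """Compare live source documents against hashes recorded at last ingestion.

    Returns the doc ids needing re-ingestion ("new", "changed") and the ids
    whose chunks should be deleted from the index ("removed").
    """
    new, changed = [], []
    for doc_id, text in current.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if doc_id not in indexed_hashes:
            new.append(doc_id)
        elif indexed_hashes[doc_id] != digest:
            changed.append(doc_id)
    removed = [d for d in indexed_hashes if d not in current]
    return {"new": new, "changed": changed, "removed": removed}
```

The governance SLA then becomes measurable: time from a document appearing in "changed" to its re-embedded chunks being live in the index.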