RAG — Retrieval-Augmented Generation

RAG is the technique that grounds AI in your actual data. Instead of relying on what a model memorised during training, RAG retrieves relevant documents at query time and injects them into the prompt. The result: answers based on facts, not guesses.

How RAG Works

RAG is a pipeline with two phases: an offline ingestion phase that prepares your data, and a runtime retrieval phase that uses it. Here is what happens at each step.

1. Document ingestion

Raw documents — PDFs, web pages, internal wikis, support tickets, whatever your knowledge source is — get loaded into the pipeline. The ingestion step handles format conversion, metadata extraction, and cleaning. Garbage in, garbage out applies here more than anywhere else in the stack.
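As a sketch, an ingestion step for plain text might look like this — the function name, the cleaning rules, and the metadata fields are all illustrative; real pipelines also handle format conversion for PDFs, HTML, and the rest:

```python
import re

def ingest(raw: str, source: str) -> dict:
    """Toy ingestion: clean raw text and attach basic metadata.
    Format conversion (PDF, HTML) is deliberately omitted."""
    # Collapse runs of spaces/tabs, then drop blank lines
    cleaned = re.sub(r"[ \t]+", " ", raw)
    cleaned = "\n".join(
        line.strip() for line in cleaned.splitlines() if line.strip()
    )
    return {"text": cleaned, "metadata": {"source": source}}
```

The metadata attached here pays off later: retrieval-time filtering (covered below) can only use what ingestion recorded.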

2. Chunking

Documents get split into smaller pieces — chunks. Chunk size matters enormously. Too large and you waste context window space on irrelevant content. Too small and you lose the surrounding context that makes the information meaningful. Chunking strategy is one of the most impactful decisions in a RAG pipeline.
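A minimal sketch of the idea is a fixed-size chunker with overlap — the sizes and the character-based split are illustrative; production pipelines usually split on sentence or section boundaries instead:

```python
def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size character chunking with overlap between neighbours.
    Overlap preserves some surrounding context at chunk edges."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Even in this toy version the trade-off from the paragraph above is visible: raise `size` and each chunk carries more padding; shrink it and sentences get cut apart.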

3. Embedding

Each chunk gets converted into a numerical vector — an embedding — that captures its semantic meaning. Similar concepts produce similar vectors, which is what makes semantic search possible. The embedding model you choose affects retrieval quality directly. Not all embedding models are equal, and the best choice depends on your domain.
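To make "similar vectors" concrete, here is a toy bag-of-words embedding over a tiny fixed vocabulary, compared with cosine similarity. It only captures shared words — a real embedding model also places synonyms and related concepts close together, which is the whole point of using one:

```python
import math

VOCAB = ["refund", "policy", "return", "shipping", "invoice"]

def embed(text: str) -> list[float]:
    """Toy embedding: word counts over a fixed vocabulary.
    Stand-in for a trained embedding model."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 for identical direction, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

Texts about refunds end up near each other and far from texts about shipping, and that geometric closeness is what the vector database searches over.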

4. Vector storage

Embeddings get stored in a vector database — a specialised data store optimised for similarity search. When a query comes in, the database finds the most semantically similar chunks without scanning every document. This is what makes RAG fast enough for real-time applications.
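A minimal in-memory version of a vector store shows the interface, though not the performance: this one scans every stored vector, whereas real vector databases use approximate nearest-neighbour indexes (HNSW, IVF) precisely to avoid that scan:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    """Minimal in-memory vector store with brute-force similarity search."""

    def __init__(self) -> None:
        self.items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], chunk_text: str) -> None:
        self.items.append((vector, chunk_text))

    def search(self, query_vec: list[float], k: int = 3) -> list[tuple[float, str]]:
        # Score every stored chunk, return the k most similar
        scored = [(cosine(query_vec, v), t) for v, t in self.items]
        return sorted(scored, key=lambda s: -s[0])[:k]
```

The interface — add vectors with payloads, search by a query vector — is essentially what every production vector database exposes; the index structure underneath is what changes.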

5. Retrieval

At query time, the user's question gets embedded using the same model, and the vector database returns the most relevant chunks. This is where ranking, filtering, and re-ranking come in — techniques that improve the quality of what gets returned beyond simple similarity scores.
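The query-time flow, sketched end to end with a stub embedding (character trigram counts standing in for a real model) — the one non-negotiable detail it illustrates is that documents and queries must pass through the same embedding function:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stub embedding: character-trigram counts. Queries and documents
    MUST use the same function, or similarities are meaningless."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(max(len(t) - 2, 1)))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[g] * b[g] for g in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Embed the query, rank all chunks by similarity, return top-k."""
    qv = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)
    return ranked[:k]
```

In production the ranking happens inside the vector database rather than over a Python list, and the raw top-k is then refined by the filtering and re-ranking techniques discussed under Common Pitfalls.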

6. Generation

The retrieved chunks get injected into the prompt alongside the user's question. The model generates its answer grounded in the retrieved content. When done well, this means the model cites real documents, provides accurate information, and avoids making things up.
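Prompt assembly is the simplest step to sketch. The wording below is one reasonable template, not the canonical one — what matters is numbering the chunks so the model can cite them, and explicitly permitting "the context does not cover this":

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Inject retrieved chunks into a prompt, numbered for citation."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using ONLY the context below. Cite sources as [n]. "
        "If the context does not cover the question, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

The resulting string goes to the model as-is; the escape hatch in the instructions is what lets a well-grounded system decline instead of invent.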

Why RAG Matters

RAG solves the fundamental problem with large language models: they sound confident even when they are wrong. Grounding generation in retrieved facts changes the equation.

Grounding and accuracy

Without RAG, a model answers from its training data — which may be outdated, incomplete, or simply wrong for your domain. With RAG, the model answers from your documents. It can cite sources, quote specific passages, and flag when the retrieved context does not cover the question. This is the difference between a chatbot that sounds right and one that is right.

Reduced hallucination

Hallucination — the model generating plausible but false information — is the biggest trust problem in AI. RAG does not eliminate it entirely, but it dramatically reduces it by giving the model factual content to work with. When the model can draw from retrieved documents, it is far less likely to invent answers.

No retraining required

Fine-tuning a model on your data is expensive, slow, and creates a maintenance burden. RAG gives you most of the same benefit — domain-specific, accurate answers — without touching the model itself. Update your documents, re-index, and the system immediately reflects the new information.

Transparent reasoning

Because RAG retrieves specific documents, you can show users where the answer came from. Source attribution builds trust — users can verify the information themselves. This is particularly important in regulated industries, legal contexts, and any domain where "trust me" is not good enough.

Common Pitfalls

RAG is conceptually simple but operationally tricky. Most RAG implementations I encounter have the same set of problems.

Bad chunking

The most common problem. Chunks that split mid-sentence, break apart tables, or separate a heading from its content. Bad chunking means the retrieved context is fragmented and confusing — the model gets pieces of information that do not make sense in isolation. I spend more time on chunking strategy than any other part of the pipeline.

Retrieval without re-ranking

Basic vector similarity search returns the most semantically similar chunks, but "similar" and "relevant" are not the same thing. A chunk might be semantically close to the query without actually answering it. Re-ranking — using a cross-encoder or other model to score the relevance of retrieved chunks — is often the difference between a mediocre RAG system and a good one.
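The shape of a re-ranking stage, with a deliberately naive scorer (exact query-term overlap) standing in for the cross-encoder — a real cross-encoder reads query and chunk together and produces a far better relevance score, but the pipeline position is the same: score the candidates vector search returned, then keep the best:

```python
def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    """Toy re-ranker: score candidates by query-term overlap.
    Stand-in for a cross-encoder relevance model."""
    q_terms = set(query.lower().split())

    def score(chunk: str) -> int:
        return len(q_terms & set(chunk.lower().split()))

    return sorted(candidates, key=score, reverse=True)[:top_n]
```

The usual pattern is to over-retrieve (say, top 20 by vector similarity) and let the re-ranker pick the final handful, since the re-ranker is too slow to run against the whole corpus.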

Ignoring metadata

Not all documents are created equal. Recency, source authority, document type — these all matter for retrieval quality. A policy document from last week should outrank one from three years ago. Metadata filtering lets you apply these rules at retrieval time, rather than hoping the embedding captures them.
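A sketch of that retrieval-time rule, applied to hits after vector search — the field names (`date`, `source`) are illustrative and depend on what your ingestion step recorded:

```python
from datetime import date

def filter_and_boost(hits: list[dict], min_date: date,
                     allowed_sources: set[str]) -> list[dict]:
    """Drop stale or untrusted chunks, then put newer documents first.
    Runs on retrieved hits, after vector search."""
    kept = [
        h for h in hits
        if h["date"] >= min_date and h["source"] in allowed_sources
    ]
    return sorted(kept, key=lambda h: h["date"], reverse=True)
```

Most vector databases can apply this kind of filter inside the search itself, which is preferable: filtering before ranking means the top-k slots are never wasted on chunks you were going to discard.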

Context window overflow

Retrieving too many chunks stuffs the context window and degrades generation quality. The model has to process a wall of text to find the relevant bits. Fewer, higher-quality chunks almost always produce better answers than a larger pile of lower-quality ones. Retrieval precision matters more than recall.
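One simple guard is a token budget applied to the ranked chunks — here tokens are approximated as whitespace-split words, which is crude; a real implementation would count with the model's actual tokenizer:

```python
def fit_budget(chunks: list[str], max_tokens: int) -> list[str]:
    """Keep best-first chunks until an approximate token budget is spent.
    Word count stands in for real tokenization."""
    kept: list[str] = []
    used = 0
    for c in chunks:  # assumed sorted best-first
        cost = len(c.split())
        if used + cost > max_tokens:
            break
        kept.append(c)
        used += cost
    return kept
```

Because the cut-off walks the list best-first, tightening the budget sheds the weakest chunks first — which is exactly the precision-over-recall behaviour you want.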

Where It Fits

RAG is one of the most connected building blocks in the stack. It touches storage, agents, memory, and safety — and the quality of each connection affects the whole system.

Memory

RAG and memory solve related but different problems. RAG retrieves from institutional knowledge — documents, wikis, knowledge bases that exist independently of any user. Memory persists personal context — what this user prefers, what they asked last week, what they corrected. The best systems use both.

Storage

The vector database where embeddings live is part of the storage layer. But RAG also depends on the document store (where raw files live), the metadata store (where document attributes are tracked), and potentially a cache layer for frequently retrieved chunks. Storage architecture directly shapes RAG performance.

Agents

Agents are the primary consumers of RAG. An agent decides when to search, what query to use, and how to incorporate the results into its response. A smart agent might reformulate a query that returns poor results, retrieve from multiple sources, or combine RAG results with other context before generating.
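The reformulate-and-retry behaviour can be sketched as a small control loop. Both callables here are hypothetical stand-ins: `search` for the retrieval pipeline (returning scored hits), `reformulate` for an LLM call that rewrites the query:

```python
def agentic_retrieve(question: str, search, reformulate,
                     threshold: float = 0.5) -> list[tuple[float, str]]:
    """If the best hit scores below threshold, rewrite the query once
    and retry. `search` returns (score, chunk) pairs, best first."""
    hits = search(question)
    if not hits or hits[0][0] < threshold:
        hits = search(reformulate(question))
    return hits
```

Real agents extend this in the obvious directions — multiple retries, multiple sources, merging result sets — but the core move is the same: treat retrieval as a tool whose output the agent inspects before trusting.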

Safety and observability

RAG introduces a new surface for monitoring: retrieval quality. Are the right documents being returned? Is sensitive content being surfaced to users who should not see it? Logging what gets retrieved — and what gets generated from it — is essential for both quality assurance and compliance.
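At minimum, that logging means recording a structured event per retrieval — something like this, where the record fields are illustrative and the `log` sink would be a real logging or observability backend rather than a list:

```python
import time

def log_retrieval(query: str, hits: list[dict], log: list) -> None:
    """Record what was retrieved for a query: the raw material for
    retrieval-quality monitoring, debugging, and compliance audits."""
    log.append({
        "ts": time.time(),
        "query": query,
        "chunks": [{"source": h["source"], "score": h["score"]} for h in hits],
    })
```

With these records in place you can answer the monitoring questions directly: which sources dominate results, how scores drift over time, and whether restricted documents ever appear in a user-facing retrieval.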

Need a RAG pipeline that actually works?

I build RAG systems that go beyond basic vector search — with proper chunking strategies, re-ranking, metadata filtering, and the retrieval quality monitoring that makes the difference between a demo and a production system.