We hit 89% retrieval accuracy. The answers were still wrong.
Retrieval accuracy is not the metric that matters. Here is the one that does.
The client had roughly 4,200 internal documents — compliance policies, product terms, rate schedules, regulatory updates going back three years. The kind of knowledge base where a wrong answer is not just an inconvenience.
We built the pipeline over two weeks. LangChain, ChromaDB, text-embedding-ada-002, GPT-4o on top. Chunked documents at 512 tokens with 50-token overlap — the defaults most teams use without thinking about it too hard. Felt like a solid stack.
Before launch we built an eval set: 120 questions drawn from real support tickets, manually matched to the relevant source documents. Ran retrieval. Checked top-3 accuracy — whether the right document appeared in the first three results. Landed at 89%. We were happy with that. We shipped.
Two weeks later, support tickets.
Not many — maybe a dozen in the first week. But the right kind of bad. Users were getting answers that were almost correct. A policy effective date off by a quarter. A rate that applied to one product tier being cited for another. Numbers that looked plausible until someone pulled the source document and held it next to the answer.
We went back to the logs and started tracing individual queries. The retrieved chunks looked right. The vector search was surfacing the correct documents, in roughly the right order. The 89% was holding. But the answers the model was generating had diverged from the source text in small, specific ways.
It took a while to see it because we were looking at the wrong thing. We kept checking what was retrieved. The problem was where the chunk ended.
A chunk that ends mid-clause is an open invitation for the model to complete the thought. It always accepts.
Our compliance documents were dense with conditional logic. Sentences like: "This rate applies to accounts opened after 1 April, subject to the terms outlined in Schedule 3, provided that—" — and then the chunk boundary. The rest of the condition was in the next 512-token window, which did not make it into the context.
The model received an incomplete premise and reasoned forward from it. Not randomly — plausibly. It knew the general shape of how these policies work, so it completed the clause in a way that sounded coherent. Which is exactly what makes it dangerous. There was no hallucination alarm. No uncertainty hedge. Just a confident, smoothly written answer that was wrong in a specific and consequential way.
Fixed-size chunking at 512 tokens does not care about your document's logic. It cuts where the token count says to cut. For straightforward prose this is mostly fine. For compliance documents with nested conditionals, cross-references and dense tables, it is quietly catastrophic.
The second thing we found: we had no output evals. None. We had optimised the retrieval step thoroughly and assumed good retrieval would produce good answers. It does not — and we had no mechanism to catch the gap.
RAG failures are quiet. The system always returns something. There is no 500 error, no exception, no red flag in the logs. Just an answer that sounds right until someone checks it against the source.
What we changed
Chunk by structure, not by token count.
We switched to LangChain's RecursiveCharacterTextSplitter with explicit separators matching our document structure, then moved to a parent-document retriever pattern for anything with tables. Chunk boundaries now fall at paragraph ends, section headers and clause breaks — not wherever 512 tokens runs out.
Retrieval accuracy and answer accuracy are different metrics. Measure both.
We kept our 120-question retrieval eval set and built a second eval set — 80 questions with manually verified expected answers — to score the generation step directly. LLM-as-judge pipelines are imperfect but they catch the class of error we were seeing.
Add a reranker between retrieval and generation.
A bi-encoder (your embedding model) retrieves fast but scores for semantic similarity, not true relevance. We added Cohere Rerank as a cross-encoder pass over the top-10 retrieved chunks before passing the top-3 to the model. Answer quality improved measurably — particularly on queries where the most relevant chunk was not the most semantically similar one.
The model will complete an incomplete context. Design around this.
If a chunk ends mid-thought, the model does not say it does not know. It reasons forward from partial information. This is not a bug in the model — it is what it is trained to do. Your chunking has to be complete enough that it does not need to guess. Where that is not possible, explicit system prompt instructions to surface uncertainty actually help.
Build output evals before launch. Not after.
A 50-question human-labelled output eval set, run before the first user hits the system, would have caught every one of the issues we found in production. The retrieval eval gave us false confidence. Output evals are the ones that matter.
The retrieval pipeline we have now looks more complicated than the one we shipped. It is also actually correct. Sometimes that is the trade.
Building a RAG system and hitting a wall?
We have been there. Tell us what you are trying to build — we will give you an honest read on what is likely going wrong.
Get in touch