RAG without hallucinations: 7 tricks from production
Good sources aren't enough. Here's how to make your agent factually reliable.
The real problem with RAG
Everyone talks about embeddings, chunking and vector DBs. That's the easy half. The hard part: making sure the LLM only uses what's in the context, and doesn't hallucinate when the answer isn't there.
Trick 1: Hybrid retrieval
Pure semantic search misses precise keywords (order numbers, product codes). We combine BM25 + dense embeddings + a lightweight cross-encoder reranking pass. Result: an 18% improvement in Recall@5.
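A common way to merge the keyword and dense result lists is reciprocal rank fusion (RRF). A minimal sketch, with made-up document IDs standing in for real retriever output:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse multiple ranked result lists (e.g. from BM25 and a dense
    retriever) into one ranking via reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-5 lists from a keyword (BM25) and a dense retriever.
bm25_hits  = ["doc_17", "doc_3", "doc_42", "doc_8", "doc_1"]
dense_hits = ["doc_3", "doc_42", "doc_17", "doc_9", "doc_5"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# Documents ranked highly by both retrievers rise to the top;
# the cross-encoder then rescores only this fused top-K.
```

RRF is attractive here because it needs no score normalization between the two retrievers, only ranks.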
Trick 2: Force source coverage
We tag every sentence in the answer with the source ID. If the LLM writes anything without a source, we reject the answer and retry with a stricter prompt.
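The coverage check itself is simple. A sketch of the reject-and-retry gate, assuming sentences carry tags like [S1] and the tag format and helper name are illustrative:

```python
import re

def uncited_sentences(answer, valid_ids):
    """Return sentences that lack a citation to a known source ID.
    Each sentence is expected to carry a tag like [S1]."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    bad = []
    for s in sentences:
        cited = set(re.findall(r"\[(S\d+)\]", s))
        if not cited or not cited <= valid_ids:
            bad.append(s)
    return bad

answer = "The Pro plan includes 10 support hours [S1]. Extra hours cost $50 each."
problems = uncited_sentences(answer, {"S1", "S2"})
# The second sentence has no source tag, so the whole answer is
# rejected and the LLM is retried with a stricter prompt.
```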
Trick 3: 'I don't know' is a feature
The LLM must not guess. We train with examples where 'I can't find information on that' is the right answer. On low source confidence: human handoff.
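The routing side of this can be sketched as a simple threshold check on the best retrieval score. The threshold values below are illustrative, not our production numbers:

```python
def route_answer(top_score, threshold=0.45):
    """Decide between answering, admitting uncertainty, or human handoff,
    based on the best source-confidence score (thresholds illustrative)."""
    if top_score >= threshold:
        return "answer"
    if top_score >= threshold * 0.5:
        return "say_unknown"   # "I can't find information on that."
    return "human_handoff"     # too uncertain even to say "unknown"
```

The point is that "I don't know" and handoff are explicit branches, not failure modes the LLM stumbles into.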
Trick 4: Real-time re-indexing
When the source changes, the embedding is stale. We hook a webhook into CMS updates and re-index in under 30 seconds. No stale answers.
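Structurally, the webhook handler just re-embeds the changed document and overwrites its vector. A toy sketch (the character-count "embedding" is a stand-in for a real model, and the class and method names are invented for illustration):

```python
import time

class Index:
    """Minimal in-memory index that re-embeds a document on change."""
    def __init__(self, embed):
        self.embed = embed
        self.vectors = {}
        self.updated_at = {}

    def on_cms_update(self, doc_id, new_text):
        # Called by the CMS webhook; overwrites the stale embedding.
        self.vectors[doc_id] = self.embed(new_text)
        self.updated_at[doc_id] = time.time()

# Toy "embedding": character-frequency map (stand-in for a real model).
def toy_embed(text):
    return {ch: text.count(ch) for ch in set(text)}

index = Index(toy_embed)
index.on_cms_update("pricing", "Pro plan: $99/month")
```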
Trick 5: Multi-hop decomposition
'How many support hours are in the Pro plan, and what does adding more cost?' — that's two questions. The agent splits them and retrieves for each. Better coverage, shorter answers.
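In production the decomposition is done by an LLM; a toy heuristic that splits on conjunctions shows the shape of it:

```python
import re

def decompose(question):
    """Split a compound question into sub-questions (toy heuristic;
    in production an LLM performs the decomposition)."""
    parts = re.split(r",?\s+and\s+", question)
    return [p.strip().rstrip("?") + "?" for p in parts if p.strip()]

subs = decompose(
    "How many support hours are in the Pro plan, and what does adding more cost?"
)
# Each sub-question gets its own retrieval pass before answering.
```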
Trick 6: Embedding cache
Same question twice in a row? We cache embeddings + top-K results. Saves 60% latency on repeated queries.
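In Python this can be as small as an LRU cache around the retrieval call. The embedding and search inside are placeholders; the caching pattern is the point:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def retrieve(query):
    """Embed the query and fetch top-K results. Repeated queries hit
    the cache and skip both steps (both stand-in implementations)."""
    embedding = tuple(ord(c) % 7 for c in query)   # placeholder embedding
    top_k = [f"doc_{sum(embedding) % 5}"]          # placeholder search
    return top_k

retrieve("What does the Pro plan cost?")
retrieve("What does the Pro plan cost?")  # served from cache
hits = retrieve.cache_info().hits
```

One caveat worth noting: normalize the query (case, whitespace) before caching, or near-identical questions will miss.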
Trick 7: Eval suite with real questions
Every week we run an eval suite with 200 real customer questions and compare answer quality against previous versions. Regressions block the build: no rollout without green CI.
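The CI gate boils down to comparing per-question scores against a baseline. A sketch, assuming scores in [0, 1] and an illustrative tolerance:

```python
def gate(current_scores, baseline_scores, tolerance=0.02):
    """Block the build if answer quality regressed beyond tolerance
    on any eval question (scoring scale and tolerance illustrative)."""
    regressions = [
        qid for qid in baseline_scores
        if current_scores.get(qid, 0.0) < baseline_scores[qid] - tolerance
    ]
    return len(regressions) == 0, regressions

baseline = {"q1": 0.90, "q2": 0.85}
current  = {"q1": 0.91, "q2": 0.70}
passed, bad = gate(current, baseline)
# q2 dropped by 0.15 > tolerance, so CI goes red and rollout is blocked.
```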
Result
Combined, these seven tricks dropped hallucination rate to <2% — across 50,000+ conversations per month.