RAG without hallucinations: 7 tricks from production
Good sources aren't enough. Here's how to make your agent factually reliable.
The real problem with RAG
Everyone talks about embeddings, chunking and vector DBs. That's the easy half. The hard part: making sure the LLM only uses what's in the context, and doesn't hallucinate when the answer isn't there.
Trick 1: Hybrid retrieval
Pure semantic search misses precise keywords (order numbers, product codes). We combine BM25 + dense embeddings + a lightweight cross-encoder reranking pass. Result: an 18% improvement in Recall@5.
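A common way to merge the keyword and dense result lists is reciprocal rank fusion (RRF). A minimal sketch, with made-up document IDs standing in for real retriever output:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse multiple ranked result lists (e.g. from BM25 and a dense
    retriever) into one ranking via reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top-5 lists from a keyword (BM25) and a dense retriever.
bm25_hits  = ["doc_17", "doc_3", "doc_42", "doc_8", "doc_1"]
dense_hits = ["doc_3", "doc_42", "doc_17", "doc_9", "doc_5"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# Documents ranked highly by both retrievers rise to the top;
# the cross-encoder then rescores only this fused top-K.
```

RRF is attractive here because it needs no score normalization between the two retrievers, only ranks.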
Trick 2: Force source coverage
We tag every sentence in the answer with the source ID. If the LLM writes anything without a source, we reject the answer and retry with a stricter prompt.
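The coverage check itself is simple. A sketch of the reject-and-retry gate, assuming sentences carry tags like [S1] and the tag format and helper name are illustrative:

```python
import re

def uncited_sentences(answer, valid_ids):
    """Return sentences that lack a citation to a known source ID.
    Each sentence is expected to carry a tag like [S1]."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    bad = []
    for s in sentences:
        cited = set(re.findall(r"\[(S\d+)\]", s))
        if not cited or not cited <= valid_ids:
            bad.append(s)
    return bad

answer = "The Pro plan includes 10 support hours [S1]. Extra hours cost $50 each."
problems = uncited_sentences(answer, {"S1", "S2"})
# The second sentence has no source tag, so the whole answer is
# rejected and the LLM is retried with a stricter prompt.
```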
Trick 3: 'I don't know' is a feature
The LLM must not guess. We train with examples where 'I can't find information on that' is the right answer. On low source confidence: human handoff.
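The routing side of this can be sketched as a simple threshold check on the best retrieval score. The threshold values below are illustrative, not our production numbers:

```python
def route_answer(top_score, threshold=0.45):
    """Decide between answering, admitting uncertainty, or human handoff,
    based on the best source-confidence score (thresholds illustrative)."""
    if top_score >= threshold:
        return "answer"
    if top_score >= threshold * 0.5:
        return "say_unknown"   # "I can't find information on that."
    return "human_handoff"     # too uncertain even to say "unknown"
```

The point is that "I don't know" and handoff are explicit branches, not failure modes the LLM stumbles into.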
Trick 4: Real-time re-indexing
When the source changes, the embedding is stale. We hook a webhook into CMS updates and re-index in under 30 seconds. No stale answers.
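Structurally, the webhook handler just re-embeds the changed document and overwrites its vector. A toy sketch (the character-count "embedding" is a stand-in for a real model, and the class and method names are invented for illustration):

```python
import time

class Index:
    """Minimal in-memory index that re-embeds a document on change."""
    def __init__(self, embed):
        self.embed = embed
        self.vectors = {}
        self.updated_at = {}

    def on_cms_update(self, doc_id, new_text):
        # Called by the CMS webhook; overwrites the stale embedding.
        self.vectors[doc_id] = self.embed(new_text)
        self.updated_at[doc_id] = time.time()

# Toy "embedding": character-frequency map (stand-in for a real model).
def toy_embed(text):
    return {ch: text.count(ch) for ch in set(text)}

index = Index(toy_embed)
index.on_cms_update("pricing", "Pro plan: $99/month")
```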
Trick 5: Multi-hop decomposition
'How many support hours are in the Pro plan, and what does adding more cost?' — that's two questions. The agent splits them and retrieves for each. Better coverage, shorter answers.
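In production the decomposition is done by an LLM; a toy heuristic that splits on conjunctions shows the shape of it:

```python
import re

def decompose(question):
    """Split a compound question into sub-questions (toy heuristic;
    in production an LLM performs the decomposition)."""
    parts = re.split(r",?\s+and\s+", question)
    return [p.strip().rstrip("?") + "?" for p in parts if p.strip()]

subs = decompose(
    "How many support hours are in the Pro plan, and what does adding more cost?"
)
# Each sub-question gets its own retrieval pass before answering.
```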
Trick 6: Embedding cache
Same question twice in a row? We cache embeddings + top-K results. Saves 60% latency on repeated queries.
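In Python this can be as small as an LRU cache around the retrieval call. The embedding and search inside are placeholders; the caching pattern is the point:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def retrieve(query):
    """Embed the query and fetch top-K results. Repeated queries hit
    the cache and skip both steps (both stand-in implementations)."""
    embedding = tuple(ord(c) % 7 for c in query)   # placeholder embedding
    top_k = [f"doc_{sum(embedding) % 5}"]          # placeholder search
    return top_k

retrieve("What does the Pro plan cost?")
retrieve("What does the Pro plan cost?")  # served from cache
hits = retrieve.cache_info().hits
```

One caveat worth noting: normalize the query (case, whitespace) before caching, or near-identical questions will miss.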
Trick 7: Eval suite with real questions
Every week we run an eval suite with 200 real customer questions and compare answer quality against previous versions. Regressions block the build: no rollout without green CI.
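The CI gate boils down to comparing per-question scores against a baseline. A sketch, assuming scores in [0, 1] and an illustrative tolerance:

```python
def gate(current_scores, baseline_scores, tolerance=0.02):
    """Block the build if answer quality regressed beyond tolerance
    on any eval question (scoring scale and tolerance illustrative)."""
    regressions = [
        qid for qid in baseline_scores
        if current_scores.get(qid, 0.0) < baseline_scores[qid] - tolerance
    ]
    return len(regressions) == 0, regressions

baseline = {"q1": 0.90, "q2": 0.85}
current  = {"q1": 0.91, "q2": 0.70}
passed, bad = gate(current, baseline)
# q2 dropped by 0.15 > tolerance, so CI goes red and rollout is blocked.
```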
Result
Combined, these seven tricks dropped hallucination rate to <2% — across 50,000+ conversations per month.