
What is RAG?

RAG: how AI agents answer using your data — without hallucinating it.

Retrieval-Augmented Generation in 8 minutes: from concept to flow to the pitfalls we've learned to avoid in two years of practice.

What RAG is

<p id="definition"><strong>Retrieval-Augmented Generation</strong> (RAG) is the default technique for getting language models to answer based on your own content. Instead of <em>retraining</em> the model (fine-tuning), we <em>give</em> it the relevant document chunks at every request.</p><p>The acronym breaks down as:</p><ul><li><strong>Retrieval</strong> — we fetch the most similar chunks from your knowledge base (FAQs, PDFs, Notion, Confluence, databases …).</li><li><strong>Augmented</strong> — those are passed to the LLM as context.</li><li><strong>Generation</strong> — the model writes the answer grounded in real sources.</li></ul><p>Result: current, precise, with source attribution — and no need for the model to guess or hallucinate. More background in the <a href="/glossary/rag">RAG glossary entry</a>.</p>

Why not just fine-tune?

Three hard advantages RAG has over retraining the model.

  • Current (live index)

    You change a FAQ and the bot knows it 30 seconds later. Fine-tuning would need a new training run.

  • Cheaper (100×)

    Re-indexing costs cents per document. Fine-tuning costs hundreds, depending on model and data, per iteration.

  • Transparent (source enforcement)

    Every answer comes with a source ID: you see exactly which paragraph was quoted. With fine-tuning, the model itself is a black box.

How a RAG system works, step by step.

  1. Index

     Your content is split into chunks. Each chunk gets an embedding, a vector representing its semantic meaning, and is stored in a vector DB (we use pgvector).

  2. Retrieve

     On a question, the query itself becomes an embedding. We fetch the k most similar chunks (typically 5–10) by cosine similarity. Optional: hybrid search with a BM25 keyword boost. Both steps are sketched in code after this list.

  3. Generate

     Question + top chunks go to the LLM. It answers grounded in those sources, with source IDs for each statement.
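
As a sketch of the index and retrieve steps, here is what they can look like in Python with psycopg and pgvector. The chunks table, its vector(1536) column, and the embed() placeholder are assumptions for illustration; swap in your own schema and embedding model:

    import psycopg  # assumes Postgres with the pgvector extension installed

    def embed(text: str) -> list[float]:
        """Placeholder: call your embedding model of choice here."""
        raise NotImplementedError

    # Step 1 (Index): one embedding per chunk. Assumed schema:
    #   CREATE TABLE chunks (id serial PRIMARY KEY, text text, embedding vector(1536));
    def index_chunks(conn: psycopg.Connection, texts: list[str]) -> None:
        with conn.cursor() as cur:
            for text in texts:
                cur.execute(
                    "INSERT INTO chunks (text, embedding) VALUES (%s, %s::vector)",
                    (text, str(embed(text))),  # pgvector parses '[0.1, 0.2, ...]'
                )
        conn.commit()

    # Step 2 (Retrieve): <=> is pgvector's cosine-distance operator, so the
    # smallest distances come first and we keep the k nearest chunks.
    def retrieve(conn: psycopg.Connection, question: str, k: int = 5) -> list[tuple]:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT id, text FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
                (str(embed(question)), k),
            )
            return cur.fetchall()

In production you would batch the inserts and put an index on the embedding column; ivfflat and hnsw are both stock pgvector options.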


What separates great RAG from mediocre.

Common pitfalls (and how we avoid them)

<p id="pitfalls">We had to learn these the hard way:</p><ul><li><strong>Embeddings without an update pipeline</strong>. Source changes, embedding stays stale → half-knowledge. Fix: webhook on every CMS change, automatic re-indexing in &lt;30 seconds.</li><li><strong>Chunks too small</strong>. 100-token chunks lose context; the model can't answer coherently. We use 500–800 tokens as default.</li><li><strong>No source enforcement</strong>. Model writes something not in the chunks — hallucination. We reject answers without a source ID and retry with a stricter prompt.</li><li><strong>No eval suite</strong>. Without weekly runs against 200+ real customer questions, we only notice regressions when the customer complains. Today: every deploy needs a green eval, otherwise rollback.</li></ul>

Want to see RAG running on your use case?

30-minute demo with your real content — free, no commitment.