Engineering · February 19, 2026 · 7 min read
Observability for AI agents: what to track
Latency, cost, quality, safety — four dimensions you can't ignore.
Classic metrics aren't enough
For a REST API we track latency, error rate, and throughput, and we're done. For an AI agent, that isn't enough.
Latency, but multidimensional
- TTFT (time-to-first-token): when does the user see the first words?
- TTLT (time-to-last-token): when is the whole answer ready?
- Tool latency: how long do tool calls (CRM lookup, email send) take?
In the UI we optimize for TTFT: users tolerate long answers as long as tokens keep streaming.
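A minimal sketch of how we capture TTFT and TTLT around a streaming response; `agent.stream()` and `render()` are hypothetical stand-ins for your actual agent call and UI:

```python
import time
from typing import Iterable, Iterator

def measure_latency(token_stream: Iterable[str], metrics: dict) -> Iterator[str]:
    """Wrap a streaming response and record TTFT/TTLT into `metrics`."""
    start = time.monotonic()
    first = True
    for token in token_stream:
        if first:
            metrics["ttft_s"] = time.monotonic() - start  # time to first token
            first = False
        yield token
    metrics["ttlt_s"] = time.monotonic() - start  # time to last token

# Usage (hypothetical agent stream):
# metrics = {}
# for token in measure_latency(agent.stream(prompt), metrics):
#     render(token)
# metrics -> {"ttft_s": 0.41, "ttlt_s": 3.87}
```

Tool latency is tracked the same way, with one timer per tool call instead of per token.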
Cost — per conversation, not per token
Token costs are abstract; per-conversation cost is the number your CFO understands. In our setups the average is about $0.04 per conversation, and outliers usually point to tool loops or other pathologies.
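A sketch of the aggregation, assuming example per-token prices and a simple per-turn usage record (both are placeholders, not our actual rates or schema):

```python
# Hypothetical per-1K-token prices; substitute your provider's real rates.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}  # USD

def conversation_cost(turns: list[dict]) -> float:
    """Sum token usage across every LLM call in one conversation and convert to dollars."""
    total = 0.0
    for turn in turns:  # each turn: {"input_tokens": int, "output_tokens": int}
        total += turn["input_tokens"] / 1000 * PRICE_PER_1K["input"]
        total += turn["output_tokens"] / 1000 * PRICE_PER_1K["output"]
    return round(total, 4)

# A tool loop shows up immediately:
# conversation_cost(short_chat)   -> ~0.04  (near the average)
# conversation_cost(looping_chat) -> ~0.90  (outlier worth alerting on)
```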
Quality — the hardest dimension
- Relevance: did the agent answer the question? Auto-scored via LLM-as-judge (see the sketch after this list).
- Source coverage: do all statements come from retrieved documents?
- Tone fit: does the tone match the brand persona?
We correlate quality scores with real CSAT surveys — correlation 0.71, good enough to act on.
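A minimal LLM-as-judge sketch for the relevance score, assuming the OpenAI Python client and an API key in the environment; the judge model name is just an example:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

JUDGE_PROMPT = """Rate from 1 to 5 how well the answer addresses the question.
Question: {question}
Answer: {answer}
Respond with the number only."""

def relevance_score(question: str, answer: str) -> int:
    """Ask a separate judge model to grade relevance on a 1-5 scale."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example judge model, not a recommendation
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```

Source coverage and tone fit are scored the same way with different judge prompts.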
Safety — non-negotiable
- PII leak detector: outgoing answers pass a regex layer plus an LLM detector before they are sent (see the sketch after this list)
- Jailbreak attempts: incoming prompts are run through a classifier
- Rate limiting per user session
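A sketch of the regex layer of the PII check; the pattern set here is illustrative, and `redact()` is a hypothetical helper:

```python
import re

# First, cheap pass of the PII check; the LLM detector runs as a second pass.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "phone": re.compile(r"\+?\d[\d\s\-()]{7,}\d"),
}

def pii_findings(answer: str) -> list[str]:
    """Return the names of PII patterns found in an outgoing answer."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(answer)]

# Block or redact before the answer leaves the system:
# if pii_findings(draft_answer):
#     draft_answer = redact(draft_answer)  # redact() is hypothetical
```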
Stack
We use LangSmith for tracing and a custom Grafana dashboard for aggregate metrics. On request, we deploy the stack in your own cloud.
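Tracing an agent turn is mostly a decorator. A minimal sketch using the langsmith Python SDK, assuming an API key in the environment; `run_agent()` stands in for your actual agent:

```python
from langsmith import traceable  # pip install langsmith; set LANGSMITH_API_KEY

@traceable(run_type="chain", name="agent_turn")
def handle_turn(question: str) -> str:
    """One agent turn; nested LLM and tool calls made inside this function
    appear as child runs in the LangSmith trace."""
    return run_agent(question)  # run_agent() is a placeholder for your agent
```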