Engineering · February 19, 2026 · 7 min read
Observability for AI agents: what to track
Latency, cost, quality, safety — four dimensions you can't ignore.
Classic metrics aren't enough
For a REST API we track latency, error rate, and throughput, and we're done. For an AI agent, that isn't enough.
Latency, but multidimensional
- TTFT (time-to-first-token): when does the user see the first words?
- TTLT (time-to-last-token): when is the whole answer ready?
- Tool latency: how long do tool calls (CRM lookup, email send) take?
In the UI we optimize for TTFT: users tolerate long answers as long as tokens keep streaming.
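A minimal sketch of how we capture TTFT and TTLT around a streaming response; `agent.stream()` and `render()` are hypothetical stand-ins for your actual agent call and UI:

```python
import time
from typing import Iterable, Iterator

def measure_latency(token_stream: Iterable[str], metrics: dict) -> Iterator[str]:
    """Wrap a streaming response and record TTFT/TTLT into `metrics`."""
    start = time.monotonic()
    first = True
    for token in token_stream:
        if first:
            metrics["ttft_s"] = time.monotonic() - start  # time to first token
            first = False
        yield token
    metrics["ttlt_s"] = time.monotonic() - start  # time to last token

# Usage (hypothetical agent stream):
# metrics = {}
# for token in measure_latency(agent.stream(prompt), metrics):
#     render(token)
# metrics -> {"ttft_s": 0.41, "ttlt_s": 3.87}
```

Tool latency is tracked the same way, with one timer per tool call instead of per token.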
Cost — per conversation, not per token
Token costs are abstract; per-conversation cost is the number your CFO understands. In our setups the average is about $0.04 per conversation, and outliers usually point to tool loops or other pathologies.
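A sketch of the aggregation, assuming example per-token prices and a simple per-turn usage record (both are placeholders, not our actual rates or schema):

```python
# Hypothetical per-1K-token prices; substitute your provider's real rates.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}  # USD

def conversation_cost(turns: list[dict]) -> float:
    """Sum token usage across every LLM call in one conversation and convert to dollars."""
    total = 0.0
    for turn in turns:  # each turn: {"input_tokens": int, "output_tokens": int}
        total += turn["input_tokens"] / 1000 * PRICE_PER_1K["input"]
        total += turn["output_tokens"] / 1000 * PRICE_PER_1K["output"]
    return round(total, 4)

# A tool loop shows up immediately:
# conversation_cost(short_chat)   -> ~0.04  (near the average)
# conversation_cost(looping_chat) -> ~0.90  (outlier worth alerting on)
```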
Quality — the hardest dimension
- Relevance: did the agent answer the question? Auto-scored via LLM-as-judge (see the sketch after this list).
- Source coverage: do all statements come from retrieved documents?
- Tone fit: does the tone match the brand persona?
We correlate quality scores with real CSAT surveys — correlation 0.71, good enough to act on.
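A minimal LLM-as-judge sketch for the relevance score, assuming the OpenAI Python client and an API key in the environment; the judge model name is just an example:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

JUDGE_PROMPT = """Rate from 1 to 5 how well the answer addresses the question.
Question: {question}
Answer: {answer}
Respond with the number only."""

def relevance_score(question: str, answer: str) -> int:
    """Ask a separate judge model to grade relevance on a 1-5 scale."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example judge model, not a recommendation
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```

Source coverage and tone fit are scored the same way with different judge prompts.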
Safety — non-negotiable
- PII leak detector: outgoing answers pass a regex layer plus an LLM detector before they are sent (see the sketch after this list)
- Jailbreak attempts: incoming prompts are run through a classifier
- Rate limiting per user session
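A sketch of the regex layer of the PII check; the pattern set here is illustrative, and `redact()` is a hypothetical helper:

```python
import re

# First, cheap pass of the PII check; the LLM detector runs as a second pass.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
    "phone": re.compile(r"\+?\d[\d\s\-()]{7,}\d"),
}

def pii_findings(answer: str) -> list[str]:
    """Return the names of PII patterns found in an outgoing answer."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(answer)]

# Block or redact before the answer leaves the system:
# if pii_findings(draft_answer):
#     draft_answer = redact(draft_answer)  # redact() is hypothetical
```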
Stack
We use LangSmith for tracing and a custom Grafana dashboard for aggregate metrics. On request, we deploy the stack in your own cloud.
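Tracing an agent turn is mostly a decorator. A minimal sketch using the langsmith Python SDK, assuming an API key in the environment; `run_agent()` stands in for your actual agent:

```python
from langsmith import traceable  # pip install langsmith; set LANGSMITH_API_KEY

@traceable(run_type="chain", name="agent_turn")
def handle_turn(question: str) -> str:
    """One agent turn; nested LLM and tool calls made inside this function
    appear as child runs in the LangSmith trace."""
    return run_agent(question)  # run_agent() is a placeholder for your agent
```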