Agent Hub
Engineering · February 19, 2026 · 7 min read

Observability for AI agents: what to track

Latency, cost, quality, safety — four dimensions you can't ignore.

by Vernes Perviz

Classic metrics aren't enough

For a REST API we track latency, error rate, and throughput, and we're done. For an AI agent, that's not enough.

Latency, but multidimensional

  • TTFT (time-to-first-token): when does the user see the first words?
  • TTLT (time-to-last-token): when is the whole answer ready?
  • Tool latency: how long do tool calls (CRM lookup, email send) take?

In the UI we optimize for TTFT — users tolerate long answers when tokens stream.
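The split between TTFT and TTLT can be captured with a thin wrapper around the token stream. A minimal sketch (the helper name and the shape of the token iterator are our own, not a library API):

```python
import time

def stream_with_latency(token_stream):
    """Consume a token iterator, recording TTFT and TTLT.

    Hypothetical helper: token_stream is any iterable of string chunks,
    e.g. the streaming response of an LLM client.
    """
    start = time.monotonic()
    ttft = None
    tokens = []
    for tok in token_stream:
        if ttft is None:
            ttft = time.monotonic() - start  # time-to-first-token
        tokens.append(tok)
    ttlt = time.monotonic() - start          # time-to-last-token
    return "".join(tokens), ttft, ttlt
```

Emitting both numbers per request lets you optimize TTFT for the UI while still alerting on pathological TTLT.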

Cost — per conversation, not per token

Token costs are abstract. Per-conversation cost is the number your CFO understands. In our setups the average is $0.04/conversation; outliers usually point to tool-call loops or other pathologies.
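Rolling token usage up to a per-conversation figure is a small aggregation. A sketch, assuming per-1K-token prices and a per-turn usage record (both the price table and the dict keys are illustrative placeholders, not real rates):

```python
# Hypothetical per-1K-token prices; substitute your model's actual rates.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}

def conversation_cost(turns):
    """Sum the cost of one conversation.

    Each turn is a dict with 'input_tokens' and 'output_tokens' counts,
    as typically reported in an LLM API's usage metadata.
    """
    cost = 0.0
    for t in turns:
        cost += t["input_tokens"] / 1000 * PRICE_PER_1K["input"]
        cost += t["output_tokens"] / 1000 * PRICE_PER_1K["output"]
    return cost
```

Aggregating at this level also makes the outliers visible: a conversation far above the mean is the first place to look for a tool loop.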

Quality — the hardest dimension

  • Relevance: did the agent answer the question? Auto-scored via LLM-as-judge.
  • Source coverage: do all statements come from retrieved documents?
  • Tone fit: does the tone match the brand persona?

We correlate quality scores with real CSAT surveys — correlation 0.71, good enough to act on.
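Checking whether automated scores track real CSAT is a plain Pearson correlation over paired samples. A self-contained sketch (the variable names are ours; any stats library would do the same):

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score series,
    e.g. LLM-as-judge quality scores vs. CSAT survey ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value around 0.7 means the automated score is a usable proxy: not a replacement for surveys, but good enough to alert on regressions between them.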

Safety — non-negotiable

  • PII leak detector: answers are checked against regex + LLM detector before going out
  • Jailbreak attempts: incoming prompts are run through a classifier
  • Rate limiting per user session
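The regex half of the PII check is the cheap first pass before the LLM detector. A minimal sketch with two illustrative patterns (real deployments carry a much larger, locale-aware pattern set):

```python
import re

# Hypothetical minimal patterns; production setups pair these with an LLM detector.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def pii_findings(answer):
    """Return the PII categories detected in an outgoing answer.

    An empty list means the answer passes the regex stage and moves
    on to the LLM-based detector.
    """
    return [name for name, pat in PII_PATTERNS.items() if pat.search(answer)]
```

The point of the two-stage design: regexes are fast and deterministic for well-structured identifiers, while the LLM catches PII that has no fixed shape, like names and addresses.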

Stack

We use LangSmith for tracing and custom Grafana dashboards for aggregated metrics. On request, we deploy the stack in your own cloud.