Topics

Agent observability (logs, traces, and failure modes)

Evergreen topic pages updated with new evidence

Last reviewed: 2026-05-12 · Policy: Editorial standards · Methodology

Answer

Agent observability means instrumenting every LLM call, tool invocation, and inter-step state transition with a shared trace_id—so you can reconstruct the full decision chain after a failure. In 2026, LangSmith and LangFuse are the default starting points; silent failures (no error thrown, wrong output) remain the hardest class of bug to catch.

Key points

  • Three-pillar model: Logging (structured JSON per call), Tracing (trace_id propagation across steps), Metrics (success rate, p95 latency, token spend).
  • Silent failures—agent completes without error but produces wrong output—require assertion-based guards, not just exception handling.
  • LangSmith and LangFuse provide visual trace replay; open-source LangFuse is a low-cost entry point for small teams.
  • Sample 10–20% of requests for full tracing in production; capture 100% on anomaly.
  • Set alert thresholds: tool call fail rate >5%, token spike >3×, avg latency >10s.
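As a sketch, the thresholds above can be wired into a rolling-window check. The `window` dict shape, metric names, and `check_alerts` function are assumptions for illustration, not part of any specific monitoring tool:

```python
# Illustrative alert check over a rolling metrics window.
# Keys (tool_calls, tokens_baseline, ...) are assumed names, not a standard schema.
def check_alerts(window):
    """window: dict of rolling metrics for the last N minutes."""
    alerts = []
    # tool call fail rate > 5%
    if window["tool_calls"] and window["tool_failures"] / window["tool_calls"] > 0.05:
        alerts.append("tool_fail_rate")
    # token spend > 3x the baseline for this window
    if window["tokens"] > 3 * window["tokens_baseline"]:
        alerts.append("token_spike")
    # average latency > 10 seconds
    if window["avg_latency_s"] > 10:
        alerts.append("high_latency")
    return alerts
```

In practice the window would come from your metrics store and the returned list would feed a pager or Slack hook.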

What changed recently

  • Harness Engineering methodology (2026) formalizes agent reliability patterns including trace-based root cause analysis.
  • LangFuse v3 supports async trace ingestion with sub-100ms overhead—viable for latency-sensitive production agents.
  • MCP protocol adoption means tool calls now carry structured action traces by default, reducing custom instrumentation.

Explanation

A single user request to an agent can trigger 20+ tool calls and multiple LLM round-trips. Traditional request/response monitoring (status codes, response times) misses this internal decision graph entirely; propagating a trace_id through every step is the minimum viable observability baseline.
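The propagation pattern can be sketched with Python's `contextvars`, so every log record in a request automatically carries the same trace_id. The function names (`log_step`, `handle_request`) and step labels are illustrative, not any library's API:

```python
import contextvars
import json
import time
import uuid

# One ContextVar holds the trace_id for the current request; every log
# record reads it implicitly, so no function needs to pass it around.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def log_step(step, **fields):
    record = {"trace_id": trace_id_var.get(), "step": step,
              "ts": time.time(), **fields}
    print(json.dumps(record))  # in production: ship to your trace backend

def handle_request(user_query):
    # Attach the trace_id once at request entry...
    trace_id_var.set(f"req_{uuid.uuid4().hex[:8]}")
    # ...and every downstream step correlates automatically.
    log_step("rag_lookup", query=user_query)
    log_step("llm_call")
    log_step("tool_execution", tool="search")
    return trace_id_var.get()
```

The same idea transfers to async code, since `contextvars` values follow each task.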

Silent failures are the most dangerous: the agent returns a response, logs show no errors, but the answer is wrong or incomplete. Catching these requires validating output schema and key fields at each step—not just catching exceptions.
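A minimal sketch of such an assertion-based guard, assuming a response payload with illustrative `answer`/`sources` fields:

```python
# Expected shape of a step's output; field names here are illustrative.
REQUIRED_FIELDS = {"answer": str, "sources": list}

def validate_output(payload):
    """Return a list of schema problems; an empty list means the output passed.

    Runs even when no exception was raised, which is exactly the
    silent-failure case exception handling misses.
    """
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing:{field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong_type:{field}")
    return problems
```

Any non-empty result would be logged as a schema mismatch and can feed the same alerting path as thrown exceptions.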

Tools / Examples

  • Attach trace_id='req_20260508_001' at request entry; pass it through RAG lookup, LLM call, and tool execution; correlate all logs in LangFuse dashboard.
  • After a tool returns, assert response schema matches expected fields—log error_type: schema_mismatch and trigger alert if not, even if no exception was raised.

FAQ

How do I distinguish a model failure from a tool failure?

Check the trace: if the LLM output is well-formed but the downstream tool call fails, the problem is in the tool. If the LLM output is malformed or off-topic, check your prompt or model version.
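That triage rule can be written as a small classifier over a trace step. The record flags (`llm_output_valid`, `tool_ok`) are assumptions about your trace schema, not a standard format:

```python
# Illustrative triage over one trace record: was the failure in the
# model's output or in the downstream tool call?
def classify_failure(step):
    """step: dict with boolean 'llm_output_valid' and 'tool_ok' flags."""
    if not step["llm_output_valid"]:
        return "model_failure"   # check prompt or model version
    if not step["tool_ok"]:
        return "tool_failure"    # LLM output was fine; debug the tool
    return "no_failure"
```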

Is full tracing too expensive for high-volume production?

Use 10–20% sampling for full traces in production; always capture full traces on errors and anomalies. LangFuse async ingestion adds <100ms overhead.
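The sampling policy fits in a few lines; the 15% rate and flag names below are illustrative:

```python
import random

SAMPLE_RATE = 0.15  # within the 10-20% range; tune to volume and budget

def should_trace(is_error=False, is_anomaly=False):
    """Decide whether to capture a full trace for this request."""
    if is_error or is_anomaly:
        return True  # always capture failures and anomalies in full
    return random.random() < SAMPLE_RATE
```

The decision should be made once at request entry and propagated with the trace_id, so a request is either fully traced or not traced at all.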

What's the minimum observability setup for a small team?

LangFuse open-source + two metrics (success rate and p95 latency) + one alert (tool fail rate >5%). Build from there once the system is stable.
