Evaluation and observability

Evaluation and observability are often conflated, but they do different jobs. Observability shows what your agents do in production (traces, costs, latency). Evaluation judges whether their outputs are good (regression tests, LLM-as-a-judge scores, guardrails).
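
To make the observability half concrete, here is a minimal sketch of what a trace record might capture. Everything in it is hypothetical: `TraceRecord`, `traced_call`, `fake_agent`, the whitespace-based token count, and the `usd_per_1k_tokens` price are illustrative assumptions, not the schema or API of Langfuse or any other tool.

```python
import time
from dataclasses import dataclass


@dataclass
class TraceRecord:
    """Observability data for one call (hypothetical schema)."""
    name: str
    latency_s: float
    input_tokens: int
    output_tokens: int
    cost_usd: float


def traced_call(name, fn, prompt, usd_per_1k_tokens=0.002):
    """Run fn(prompt) and record latency, token counts, and an estimated cost.

    Tokens are approximated by whitespace splitting and the price is made up;
    a real system would use the provider's reported usage and pricing.
    """
    start = time.perf_counter()
    output = fn(prompt)
    latency = time.perf_counter() - start
    in_tok = len(prompt.split())
    out_tok = len(output.split())
    cost = (in_tok + out_tok) / 1000 * usd_per_1k_tokens
    return output, TraceRecord(name, latency, in_tok, out_tok, cost)


def fake_agent(prompt):
    # Stand-in for a model call; it deliberately returns a wrong answer.
    return "The capital of France is Lyon."


output, trace = traced_call(
    "capital-lookup", fake_agent, "What is the capital of France?"
)
print(trace)
```

The record reports latency, token counts, and cost, and all of them look healthy. Nothing in it indicates that the answer is wrong, which is the gap evaluation is meant to fill.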

  • Langfuse does both but, per its own docs, its evaluation features trail dedicated eval tools; pair it with something like DeepEval or Braintrust for real regression testing.
  • Tracing alone will not catch semantic quality drift; you need evals that score outputs against expectations.
  • Self-hosting changes the calculus: open-source observability you can run yourself is attractive for private traces, while evaluation tools are often cloud-first.
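
The point about scoring outputs against expectations can be sketched as a tiny regression check. This is an assumption-laden illustration, not the API of DeepEval, Braintrust, or Langfuse: `EvalCase`, `score`, `run_regression`, and the two stand-in agents are hypothetical, and the scorer is a plain substring match where a real suite might use an LLM-as-a-judge or a richer metric.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One expectation: the output for `prompt` should mention these strings."""
    prompt: str
    must_contain: list


def score(output, case):
    """Fraction of expected substrings found in the output (case-insensitive)."""
    if not case.must_contain:
        return 1.0
    lowered = output.lower()
    hits = sum(1 for s in case.must_contain if s.lower() in lowered)
    return hits / len(case.must_contain)


def run_regression(agent, cases, threshold=1.0):
    """Return (prompt, score) for every case that scores below the threshold."""
    failures = []
    for case in cases:
        s = score(agent(case.prompt), case)
        if s < threshold:
            failures.append((case.prompt, s))
    return failures


CASES = [
    EvalCase("What is the capital of France?", ["Paris"]),
    EvalCase("What is 2 + 2?", ["4"]),
]


def agent_v1(prompt):
    # Hypothetical baseline agent with correct answers.
    answers = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is 2 + 2?": "2 + 2 equals 4.",
    }
    return answers[prompt]


def agent_v2(prompt):
    # Hypothetical later version whose answer to one prompt has drifted.
    answers = {
        "What is the capital of France?": "The capital of France is Lyon.",
        "What is 2 + 2?": "2 + 2 equals 4.",
    }
    return answers[prompt]


print(run_regression(agent_v1, CASES))
print(run_regression(agent_v2, CASES))
```

Both agents would produce similar traces in terms of latency and cost. Only the scored comparison against expectations separates the baseline from the drifted version.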

See the directory’s /tags/evaluation and /tags/observability for the full tool lists.


Sources: 0004-langfuse-docs.md
