LangSmith vs Langfuse vs Arize vs Braintrust: comparing AI observability platforms.
Four platforms, four philosophies. We've shipped on all of them. Here's the honest comparison — what each does well, what each doesn't, and how to pick.

LangSmith is the deepest fit if you're on LangChain/LangGraph. Langfuse is the best self-hostable option and framework-agnostic. Arize/Phoenix has the strongest eval primitives for regulated workloads. Braintrust leads on eval rigor and prompt-iteration tooling. Most teams pick one and stick with it — switching costs are real.
AI observability is now a separate category from classic APM. Datadog, New Relic, and Honeycomb don't see the failure modes that LLM-driven systems hit — tool-call failures, context truncation, runaway loops, prompt regressions. These four platforms emerged to fill that gap and now own the production conversation.
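To make those failure modes concrete, here is a minimal sketch of the kind of check an LLM-aware tracer can run and a request-level APM cannot, since every HTTP call in a bad agent run may still return successfully. The `Step` and `Trace` schema, the flag names, and the thresholds are all hypothetical, chosen for illustration; none of the four platforms exposes this exact shape.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One step of an agent run (hypothetical schema, not a vendor format)."""
    kind: str                      # "llm" or "tool"
    name: str                      # model or tool name
    prompt_tokens: int = 0         # tokens sent to the model on this step
    error: Optional[str] = None    # error message if the step failed


@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)


def flag_failures(trace: Trace, context_limit: int = 8000, max_steps: int = 20) -> List[str]:
    """Return LLM-specific failure flags for one trace.

    The limits are illustrative defaults, not recommendations.
    """
    flags = []
    # A tool step that recorded an error.
    if any(s.kind == "tool" and s.error for s in trace.steps):
        flags.append("tool_call_failure")
    # A prompt at or over the context limit was likely truncated.
    if any(s.kind == "llm" and s.prompt_tokens >= context_limit for s in trace.steps):
        flags.append("context_truncation_risk")
    # Far more steps than expected suggests the agent is looping.
    if len(trace.steps) > max_steps:
        flags.append("runaway_loop")
    return flags
```

Prompt regressions are left out on purpose: catching them means comparing scores across prompt versions, which is an eval problem rather than a single-trace check.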
LangSmith.
Built by the LangChain team. The integration with LangChain and LangGraph is the deepest in the field — node-by-node state diffs, full agent execution graphs, replay against new model versions. If you're already on LangChain, this is the path of least resistance.
The downside is framework lock-in: if you're not on LangChain, the integration story is weaker. It is also sold primarily as a hosted SaaS product, with self-hosting limited to the enterprise plan.
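For readers who haven't seen a node-by-node state diff, the idea can be sketched in a few lines of plain Python. This is an illustration of the concept only, not LangSmith's API or output format; the `diff_state` helper and the state keys are hypothetical.

```python
from typing import Any, Dict


def diff_state(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Compare the graph state before and after one node runs."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: {"from": before[k], "to": after[k]}
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# State entering and leaving a hypothetical "retrieve" node.
before = {"question": "refund policy?", "docs": [], "attempts": 0}
after = {"question": "refund policy?", "docs": ["policy.md"], "attempts": 1, "answer": None}

print(diff_state(before, after))
```

Seeing which keys each node added or changed is what makes a multi-step agent run debuggable; a flat log of LLM calls doesn't give you that.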
Langfuse.
Self-hostable (Postgres + ClickHouse), framework-agnostic, supports any LLM SDK or agent framework via OpenTelemetry traces. The most flexible option and the best fit for teams with infrastructure or data-residency requirements.
Eval features have caught up to LangSmith over the last year. Self-hosting is real work; the cloud product is a fast on-ramp.

Arize / Phoenix.
Built on ML-observability heritage. Eval primitives are deeper and more rigorous than the others — meaningful for regulated or accuracy-critical workloads. Phoenix (the open-source companion) is one of the cleanest local-dev observability experiences available.
If your domain requires audit trails, drift detection, and statistically rigorous eval scoring, this is the strongest pick. The product is more complex to learn than LangSmith or Langfuse.
Braintrust.
Eval-first by design. Dataset management, scoring function authoring, experiment tracking, and prompt-iteration with statistical confidence indicators are best-in-class. If your team is iterating on prompts and eval rubrics weekly, Braintrust feels purpose-built.
Production tracing exists but isn't the headline. Some teams pair Braintrust (for evals + iteration) with Langfuse (for production tracing).
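As a rough picture of what "scoring function" and "statistical confidence indicator" mean in practice, here is a minimal sketch using only the Python standard library. It is not Braintrust's SDK or its statistical method; `exact_match` and `mean_diff_ci` are hypothetical names, and the interval uses a plain normal approximation that is crude for small datasets.

```python
import math
import statistics
from typing import List, Tuple


def exact_match(output: str, expected: str) -> float:
    """Scoring function: 1.0 if the answers match, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def mean_diff_ci(scores_a: List[float], scores_b: List[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Mean per-example score change (b - a) with an approximate 95% interval.

    Both lists must score the same dataset rows, in the same order.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length score lists with at least 2 rows")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    mean = statistics.mean(diffs)
    half_width = z * statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean, mean - half_width, mean + half_width


# Scores for prompt version A and prompt version B on the same four rows.
mean, low, high = mean_diff_ci([0.0, 0.0, 1.0, 0.0], [1.0, 1.0, 1.0, 1.0])
print(f"mean change {mean:+.2f}, interval [{low:.2f}, {high:.2f}]")
```

If the interval excludes zero, the prompt change probably moved the score; if it straddles zero, you likely need more rows before calling it an improvement.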

How to pick.
- On LangChain or LangGraph? Default to LangSmith.
- Need self-hosting or a framework-agnostic setup? Pick Langfuse.
- Regulated workload, audit-grade evals? Pick Arize.
- Evals and prompt iteration are the daily work? Pick Braintrust.
- Still unsure? Start with Langfuse cloud — it's the safest bet that won't lock you in.
Don't roll your own. Building a tracing pipeline from scratch is a multi-quarter project that produces a worse version of what's already on the market.
What we actually use.
Across SmartDuke client engagements, the split is roughly 50% Langfuse (most clients want self-hosting or a framework-agnostic stack), 25% LangSmith (LangGraph-heavy projects), 15% Arize (regulated work), and 10% Braintrust (when eval rigor is the headline). The single most common mistake we see is teams switching platforms mid-project. Pick once, stick with it.