Long-form shipping notes.
Essays, teardowns, and patterns from real client engagements. Written for the engineers and product builders shipping AI products.
Browse by topic.
The disciplines we write about most. Pick what you're shipping today.
Evals
Eval suites, regression scoring, LLM-as-judge, frozen test sets.
Agents & patterns
Agent loop architectures: planner-executor, ReAct, supervisor-worker, hierarchical.
Guardrails
Input validation, output verification, escape hatches, human handoff design.
Production engineering
Observability, telemetry, infra choices, and the boring stack around model calls.
GEO + AEO
Earning citations from ChatGPT, Perplexity, and Google AI Overviews.
Latest from the lab.
Patterns
RAG vs agents vs fine-tuning: when each one wins.
Three techniques. Three different problems. Most teams reach for the wrong one because they're picking based on hype, not problem shape. Here's the honest decision framework.
Read essay
Opinion
How much does it cost to build an AI agent in 2026?
Pricing for AI work is opaque. Here's the honest breakdown — what a prototype costs, what production costs, what operations costs, and what makes the numbers move.
Read essay
Field notes
LangSmith vs Langfuse vs Arize vs Braintrust: comparing AI observability platforms.
Four platforms, four philosophies. We've shipped on all of them. Here's the honest comparison — what each does well, what each doesn't, and how to pick.
Read essay
Engineering
How to write your first AI eval suite without a framework.
You don't need LangSmith, Braintrust, or any platform to ship your first eval suite. Most production-grade evals start as 100 prompts in a JSON file and a script. Here's the playbook.
Read essay
Operations
AI safety in production: a checklist that actually ships.
Safety isn't a content filter you add at the end. It's an architecture. These six layers are non-negotiable before any AI product touches real users.
Read essay
Engineering
Schema.org markup for AI engines: what actually works in 2026.
Most schema markup is wasted effort. Here are the four types that actually move citation rates on ChatGPT, Perplexity, and Google AI Overviews, and what to skip.
Read essay
Engineering
Evals that actually catch regressions before users do.
The eval suite most teams ship with is a confidence-builder, not a regression detector. Here's the structure we use to catch real failures earlier.
Read essay
Patterns
Five agent-loop patterns we keep reaching for.
Planner-executor, ReAct, supervisor-worker, hierarchical, pure tool-calling. When each one fits, when it doesn't, and when to mix them.
Read essay
Operations
Guardrails that survive contact with real users.
Why bolt-on safety layers fail and what production-grade guardrail architecture actually looks like in 2026.
Read essay
Field notes
GEO and AEO: the new search stack for AI-native brands.
Citation rate has replaced rank position. Here's how we instrument content for ChatGPT, Perplexity, and Google AI Overviews, and what we measure.
Read essay