How to write your first AI eval suite without a framework.

You don't need LangSmith, Braintrust, or any platform to ship your first eval suite. Most production-grade evals start as 100 prompts in a JSON file and a script. Here's the playbook.

By SmartDuke Team·May 9, 2026·10 min

In brief

You don't need an eval platform to ship your first eval suite. Start with a frozen test set of 50–100 prompts, an explicit rubric (not vibes-based scoring), an LLM-as-judge using a stronger model than the one you're testing, and a script that runs on every deploy and blocks on regressions. Frameworks help when you scale; they slow you down at the start.

Most teams over-engineer evals on day one. They evaluate eval platforms before they have eval data. They stand up dashboards before they know what they're measuring. They try to score everything before they've decided what "better" actually means.

The teams that ship reliably skip all of that. They start with a JSON file. Here's the playbook.

Step 1 — Pick the use case that matters most.

You can't eval everything from day one. Pick the single highest-stakes path in your product — the one where a regression would matter most to a real user. That's your eval target. Ignore the rest until this works.

Step 2 — Build a frozen test set of 50–100 prompts.

Curate real or representative inputs. Cover the happy path, the edge cases, the adversarial inputs you've already seen, and the ones you predict you'll see. Save them as a JSON file. Commit it to the repo. Don't change it without versioning.

If you keep rewriting the test set around the bug of the week, it isn't a frozen test set. You'll never measure regressions because the baseline keeps moving.
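One way to keep the set honest is to fingerprint the file and refuse to run if it changed without anyone noticing. The sketch below assumes a JSON file holding a list of cases, each with an `id` and an `input` field; the function name, the field names, and the hash check are illustrative choices, not a required layout.

```python
import hashlib
import json
from pathlib import Path


def load_frozen_test_set(path, expected_sha256=None):
    """Load test cases from a JSON file and check the file has not drifted.

    `expected_sha256` is a hypothetical pinned fingerprint you would commit
    next to the test set. Pass None to skip the check (e.g. on first run).
    """
    raw = Path(path).read_bytes()
    digest = hashlib.sha256(raw).hexdigest()

    if expected_sha256 is not None and digest != expected_sha256:
        raise ValueError(
            f"Test set changed: expected {expected_sha256}, got {digest}. "
            "Version the file and re-baseline instead of editing in place."
        )

    cases = json.loads(raw)

    # Assumed schema: every case carries a unique "id".
    ids = [case["id"] for case in cases]
    if len(ids) != len(set(ids)):
        raise ValueError("Duplicate case ids in test set")

    return cases, digest
```

On the first run, print the returned digest and commit it alongside the file. After that, any edit to the test set has to be a deliberate version bump.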

Step 3 — Write an explicit rubric.

"Is this answer good" is not a rubric. "Does this answer cite at least one source from the provided documents, refuse if no sources are relevant, and avoid hallucinated facts" is a rubric. Be specific. The more specific, the more reliable LLM-as-judge scoring becomes.
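A rubric like that is easier to keep specific when it lives in code as a list of named yes/no checks. Here is a minimal sketch that uses the three criteria from the example above. The criterion ids, the prompt wording, and the `build_judge_prompt` helper are all assumptions for illustration, so adapt them to your own product.

```python
import json

# Each criterion is a single yes/no question the judge can answer.
RUBRIC = [
    {
        "id": "cites_source",
        "question": "Does the answer cite at least one source from the provided documents?",
    },
    {
        "id": "refuses_when_unsupported",
        "question": "If no provided source is relevant, does the answer refuse instead of guessing?",
    },
    {
        "id": "no_hallucinated_facts",
        "question": "Is every factual claim supported by the provided documents?",
    },
]


def build_judge_prompt(rubric, user_input, answer):
    """Render the rubric into a prompt asking for one boolean per criterion."""
    checks = "\n".join(f"- {c['id']}: {c['question']}" for c in rubric)
    # Show the judge the exact JSON shape we expect back.
    reply_shape = json.dumps({c["id"]: True for c in rubric})
    return (
        "You are grading an answer against a rubric.\n\n"
        f"USER INPUT:\n{user_input}\n\n"
        f"ANSWER:\n{answer}\n\n"
        f"CRITERIA:\n{checks}\n\n"
        "Reply with JSON only, one true/false value per criterion id, "
        f"in this shape: {reply_shape}"
    )
```

Splitting the rubric into separate checks also tells you which criterion regressed, not just that the overall score moved.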

Step 4 — Score with LLM-as-judge.

Use a stronger model than the one you're testing. If you're shipping a Claude Sonnet agent, score it with Claude Opus or a GPT-4-class model. Have the judge return a structured score against your rubric. Save the scores per prompt, then aggregate them into a pass rate.
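The scoring loop itself is small. The sketch below keeps the model calls out of the picture: `generate` and `judge` are hypothetical callables you would wire to whatever client you use, and the judge is assumed to reply with a JSON object of true/false values keyed by criterion id. Anything unparseable or missing counts as a fail, which is a deliberately strict choice.

```python
import json


def parse_judge_reply(reply_text, criterion_ids):
    """Turn the judge's JSON reply into {criterion_id: bool}.

    Malformed JSON, missing keys, and non-boolean values all score False.
    """
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {cid: data.get(cid) is True for cid in criterion_ids}


def run_suite(cases, generate, judge, criterion_ids):
    """Score every case; a case passes only if every criterion passes."""
    results = []
    for case in cases:
        answer = generate(case["input"])  # model under test (hypothetical callable)
        reply = judge(case["input"], answer)  # stronger judge model (hypothetical callable)
        scores = parse_judge_reply(reply, criterion_ids)
        results.append(
            {"id": case["id"], "scores": scores, "passed": all(scores.values())}
        )
    return results


def pass_rate(results):
    """Fraction of cases that passed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["passed"]) / len(results)
```

Write `results` to disk on every run. The per-prompt scores are what you will read when the aggregate number moves.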

Step 5 — Run before every deploy. Fail loudly.

Wire the eval suite into your CI or your deploy script. If pass-rate drops by more than a defined threshold versus the baseline, the deploy fails. No human discretion. The whole point is removing the temptation to ship through a regression.
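The gate can be a few lines. This sketch assumes you already have the current pass rate and a stored baseline as fractions between 0 and 1. The `gate` function name and the 0.02 default threshold are placeholders, so pick a threshold that matches how noisy your judge is.

```python
import sys


def gate(current, baseline, max_drop=0.02):
    """Return a process exit code: 0 to allow the deploy, 1 to block it.

    `max_drop` is the largest tolerated fall in pass rate versus the baseline
    (0.02 means two percentage points). Improvements always pass.
    """
    drop = baseline - current
    if drop > max_drop:
        print(
            f"EVAL REGRESSION: pass rate {current:.1%} vs baseline {baseline:.1%} "
            f"(drop {drop:.1%} exceeds allowed {max_drop:.1%})",
            file=sys.stderr,
        )
        return 1
    print(f"Evals OK: pass rate {current:.1%} vs baseline {baseline:.1%}")
    return 0
```

In a deploy script you would end with something like `sys.exit(gate(current, baseline))`. A nonzero exit code is what makes the CI step fail, so nobody has to decide whether to block.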

When to graduate to a framework: when you have multiple eval suites, multiple environments, multiple people authoring scoring rubrics, and you need replay against new model versions. That's a real moment. Day one isn't.

What this gets you in two weeks.

A working baseline you can compare against. Confidence to ship. A vocabulary for talking about regressions with your team. The ability to swap models without flying blind. None of those need a framework — they need 100 prompts and a script that won't ship through a regression.
