Evals As The Contract: How We Deliver AI Safely

The Problem

A demo proves the AI works once. An eval proves the AI works repeatedly under conditions you've defined.

Most AI agencies ship the demo and call it done. We've watched the result enough times to know what comes next. Six weeks in, the model is upgraded. The output silently degrades. Nobody notices for a month. Then someone files a complaint, the team scrambles, and the engagement renews at "support pricing" because nobody documented the eval criteria in the first place.

We don't do that. Evals are the contract.

What An Eval Suite Actually Is

For every AI call type in the product:

Inputs. A structured set of test cases, including edge cases
Expected behaviors. What the output should contain or look like
Quality thresholds. The minimum score the output must hit
Grading mechanism. Automated (LLM-as-judge, regex, structured comparison) or human (queue-based review)
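The four parts above can be modeled in a few lines. What follows is a minimal Python sketch under stated assumptions: `EvalSpec`, `run_eval`, and `has_kill_criterion` are hypothetical names, and the output shape (a dict with a `verdict_section` key) is assumed purely for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# Assumption: an "output" is whatever the AI call returns, modeled as a dict.
Output = Dict[str, Any]
Grader = Callable[[Output], bool]


@dataclass
class EvalSpec:
    name: str
    inputs: List[str]   # test cases, including edge cases
    grader: Grader      # automated check; a human review queue could stand in here
    min_passes: int     # quality threshold, as a count of passing cases


def has_kill_criterion(output: Output) -> bool:
    """Structured check: the verdict section must contain a kill criterion."""
    return bool(output.get("verdict_section", {}).get("kill_criterion"))


def run_eval(spec: EvalSpec, generate: Callable[[str], Output]) -> Tuple[int, bool]:
    """Run every input through the AI call and grade each result.

    Returns (number of passing cases, whether the threshold was met).
    """
    passes = sum(1 for case in spec.inputs if spec.grader(generate(case)))
    return passes, passes >= spec.min_passes
```

The `generate` argument is the AI call under test. Passing it in keeps the eval definition independent of any particular model or prompt.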

For Verdikt, our evals look something like this:

```yaml
- name: verdict_includes_kill_criterion
  input_set: 30 founder briefs from validation
  expected: output.verdict_section.kill_criterion exists
  threshold: 100% pass

- name: market_section_quality
  input_set: same 30
  grader: claude-3-5-sonnet with rubric
  expected: score >= 4/5 on 4 dimensions (market sizing, competitive density, customer specificity, evidence quality)
  threshold: 28/30 pass at 4/5+

- name: source_citation_count
  input_set: same 30
  expected: at least 20 distinct sources per verdict
  threshold: 28/30 pass
```

The eval suite runs on every PR. If a prompt change drops the market_section_quality below threshold, the build fails. We don't ship a regression.
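A CI gate for a rubric-graded eval like market_section_quality might look like the sketch below. It assumes the judge's scores have already been collected into one dict per brief (the grader call itself is omitted), and that a brief passes only when all four dimensions score 4 or higher, which is one reading of the rubric. `case_passes`, `gate`, and the dimension keys are illustrative names, not a fixed format.

```python
from typing import Dict, List

# Assumed key names for the four rubric dimensions.
DIMENSIONS = ("market_sizing", "competitive_density",
              "customer_specificity", "evidence_quality")


def case_passes(scores: Dict[str, int], min_score: int = 4) -> bool:
    """A brief passes only if every rubric dimension scores min_score or higher."""
    return all(scores.get(dim, 0) >= min_score for dim in DIMENSIONS)


def gate(all_scores: List[Dict[str, int]], min_passes: int = 28) -> int:
    """Return a process exit code: 0 if enough briefs pass, 1 otherwise."""
    passes = sum(case_passes(s) for s in all_scores)
    print(f"market_section_quality: {passes}/{len(all_scores)} passed "
          f"(need {min_passes})")
    return 0 if passes >= min_passes else 1
```

In a pipeline, the script would finish with something like `sys.exit(gate(scores))`, so a non-zero exit code fails the build.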

How This Lives In A Client Contract

When we sign a project, the SOW includes:

The eval suite as a named deliverable. Not just the feature. The evals that grade the feature.
Quality thresholds as acceptance criteria. "Done" means evals pass at the documented thresholds.
The eval definitions belong to the client. Full IP transfer at engagement end.
A regression policy. If a future model upgrade or prompt change drops a metric below threshold, the work is not shipped.
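The regression policy reduces to a comparison between a candidate run and the documented thresholds. Here is a rough sketch: `find_regressions` is a hypothetical helper, and the pass counts after the upgrade are made-up numbers for illustration.

```python
from typing import Dict, List


def find_regressions(thresholds: Dict[str, int],
                     candidate: Dict[str, int]) -> List[str]:
    """List the metrics whose pass count falls below the documented threshold.

    A metric missing from the candidate run counts as a regression, since
    an eval that did not run cannot be shown to pass.
    """
    return sorted(name for name, minimum in thresholds.items()
                  if candidate.get(name, 0) < minimum)


# Thresholds expressed as passing briefs out of 30.
thresholds = {"verdict_includes_kill_criterion": 30,
              "market_section_quality": 28,
              "source_citation_count": 28}

# Illustrative pass counts from a hypothetical run after a model upgrade.
after_upgrade = {"verdict_includes_kill_criterion": 30,
                 "market_section_quality": 26,
                 "source_citation_count": 29}

blocked_by = find_regressions(thresholds, after_upgrade)
ship = not blocked_by  # the change ships only if nothing regressed
```

Treating a missing metric as a failure is a deliberate choice in this sketch: it prevents a change from shipping just because an eval was quietly removed or skipped.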

The eval suite is the operational contract. It survives us leaving.

Why Most AI Agencies Skip This

Three honest reasons:

Evals are unsexy. They're internal infrastructure. They don't demo well.
Evals are hard. Writing a good eval is harder than writing the prompt it grades.
Evals constrain the agency. They make it provable when the agency's output degrades.

The third reason is the real one. An agency that ships demos can pretend its work is great forever. An agency that ships evals signs a measurable contract.

We sign the measurable contract. So should every studio you evaluate.

What "Good" Looks Like

The eval suite is a deliverable when:

It's in your repo, not ours
It runs in your CI pipeline
Thresholds are documented in plain English
A future engineer (not us) can read it and understand what the AI is supposed to do
It has actually caught at least one regression during the engagement

If those are not true, the evals are aspirational, not contractual.

The Buyer's Filter

When you evaluate any AI vendor, ask:

> "Show me a sample eval suite from a past engagement. Show me one regression it caught."

Most vendors will not have an answer. The ones who do are the ones to hire.

If you want to see what this looks like in practice, the Verdikt case study breaks down the eval suite we built for our own product.

