The Problem
A demo proves the AI works once. An eval proves the AI works repeatedly under conditions you've defined.
Most AI agencies ship the demo and call it done. We've watched the result enough times to know what comes next. Six weeks in, the model is upgraded. The output silently degrades. Nobody notices for a month. Then someone files a complaint, the team scrambles, and the engagement renews at "support pricing" because nobody documented the eval criteria in the first place.
We don't do that. Evals are the contract.
What An Eval Suite Actually Is
For every AI call type in the product:
- Inputs. A structured set of test cases, including edge cases
- Expected behaviors. What the output should contain or look like
- Quality thresholds. The minimum score the output must hit
- Grading mechanism. Automated (LLM-as-judge, regex, structured comparison) or human (queue-based review)
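As a rough sketch of how those four parts might fit together in code: the `EvalSpec` class, `run_eval` function, regex grader, and stand-in model below are all hypothetical names invented for illustration, not our production harness. The expected behavior lives inside the grading function, and the threshold is expressed as a minimum count of passing cases.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalSpec:
    """One eval for one AI call type (hypothetical structure)."""
    name: str
    inputs: list[str]                  # structured test cases, edge cases included
    grade: Callable[[str, str], bool]  # grading mechanism: (input, output) -> pass/fail
    min_passes: int                    # quality threshold, as a count of passing cases


def run_eval(spec: EvalSpec, call_model: Callable[[str], str]) -> tuple[int, bool]:
    """Run every input through the model, grade each output, compare to the threshold."""
    passes = sum(1 for case in spec.inputs if spec.grade(case, call_model(case)))
    return passes, passes >= spec.min_passes


def has_kill_criterion(_case: str, output: str) -> bool:
    """Regex grader: the expected behavior is that a kill criterion is stated."""
    return re.search(r"kill criterion\s*:", output, re.IGNORECASE) is not None


spec = EvalSpec(
    name="verdict_includes_kill_criterion",
    inputs=["brief A", "brief B"],
    grade=has_kill_criterion,
    min_passes=2,
)


def fake_model(brief: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    return f"Verdict for {brief}. Kill criterion: fewer than 10 paid pilots in 60 days."


passes, ok = run_eval(spec, fake_model)
print(f"{spec.name}: {passes}/{len(spec.inputs)} passed, ok={ok}")
```

Swapping `has_kill_criterion` for an LLM-as-judge call or a human review queue changes the grader, not the shape of the spec.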
For Verdikt, our evals look something like this:
```yaml
- name: verdict_includes_kill_criterion
  input_set: 30 founder briefs from validation
  expected: output.verdict_section.kill_criterion exists
  threshold: 100% pass
- name: market_section_quality
  input_set: same 30
  grader: claude-3-5-sonnet with rubric
  expected: score >= 4/5 on 4 dimensions (market sizing, competitive density, customer specificity, evidence quality)
  threshold: 28/30 pass at 4/5+
- name: source_citation_count
  input_set: same 30
  expected: at least 20 distinct sources per verdict
  threshold: 28/30 pass
```
The eval suite runs on every PR. If a prompt change drops market_section_quality below its threshold, the build fails. We don't ship a regression.
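A minimal sketch of that gate for market_section_quality, under two assumptions of ours: that "4/5 on 4 dimensions" means every dimension must score at least 4, and that the judge's scores arrive as one dictionary per brief. The function names and the scores are made up for illustration.

```python
DIMENSIONS = (
    "market_sizing",
    "competitive_density",
    "customer_specificity",
    "evidence_quality",
)


def brief_passes(scores: dict[str, int], minimum: int = 4) -> bool:
    """Assumed reading of the rubric: every dimension scores at least `minimum` out of 5."""
    return all(scores.get(dim, 0) >= minimum for dim in DIMENSIONS)


def build_exit_code(all_scores: list[dict[str, int]], required_passes: int = 28) -> int:
    """Return 0 when enough briefs pass, 1 otherwise (a non-zero exit fails the CI job)."""
    passes = sum(brief_passes(scores) for scores in all_scores)
    print(f"market_section_quality: {passes}/{len(all_scores)} passed (need {required_passes})")
    return 0 if passes >= required_passes else 1


# Made-up judge scores: 27 strong briefs and 3 with weak evidence quality.
strong = {dim: 5 for dim in DIMENSIONS}
weak = {**strong, "evidence_quality": 3}
exit_code = build_exit_code([strong] * 27 + [weak] * 3)
# A CI entry point would finish with sys.exit(exit_code).
```

With 27 of 30 briefs passing, the gate returns a failing exit code, which is the behavior we want from a prompt change that quietly weakens one dimension.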
How This Lives In A Client Contract
When we sign a project, the SOW includes:
- The eval suite as a named deliverable. Not just the feature. The evals that grade the feature.
- Quality thresholds as acceptance criteria. "Done" means evals pass at the documented thresholds.
- The eval definitions belong to the client. Full IP transfer at engagement end.
- A regression policy. If a future model upgrade or prompt change drops a metric below threshold, the work is not shipped.
The eval suite is the operational contract. It survives us leaving.
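The regression policy can be sketched as a comparison between the last accepted run and a candidate run, for example after a model upgrade. Everything here is illustrative: `blocked_metrics`, `may_ship`, and the pass counts are hypothetical, and the thresholds simply restate the Verdikt example as minimum passing cases out of 30.

```python
# Documented thresholds, as minimum passing cases out of 30 (from the Verdikt example).
THRESHOLDS = {
    "verdict_includes_kill_criterion": 30,
    "market_section_quality": 28,
    "source_citation_count": 28,
}


def blocked_metrics(
    before: dict[str, int],
    after: dict[str, int],
    thresholds: dict[str, int],
) -> list[str]:
    """Describe every metric the candidate run leaves below threshold.

    A metric missing from the candidate run counts as zero passes.
    """
    blocked = []
    for name, minimum in thresholds.items():
        new = after.get(name, 0)
        if new < minimum:
            blocked.append(f"{name}: {before.get(name, 0)} -> {new} (threshold {minimum})")
    return blocked


def may_ship(before: dict[str, int], after: dict[str, int], thresholds: dict[str, int]) -> bool:
    """The policy in one line: ship only if nothing is below threshold."""
    return not blocked_metrics(before, after, thresholds)


# Made-up pass counts before and after a hypothetical model upgrade.
accepted_run = {
    "verdict_includes_kill_criterion": 30,
    "market_section_quality": 29,
    "source_citation_count": 30,
}
candidate_run = {
    "verdict_includes_kill_criterion": 30,
    "market_section_quality": 26,
    "source_citation_count": 29,
}

report = blocked_metrics(accepted_run, candidate_run, THRESHOLDS)
for line in report:
    print("BLOCKED", line)
```

Because the report names the metric and its before and after values, a future engineer can see what degraded without needing us in the room.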
Why Most AI Agencies Skip This
Three honest reasons:
- Evals are unsexy. They're internal infrastructure. They don't demo well.
- Evals are hard. Writing a good eval is harder than writing the prompt it grades.
- Evals constrain the agency. They make it provable when the agency's output degrades.
The third reason is the real one. An agency that ships demos can pretend its work is great forever. An agency that ships evals signs a measurable contract.
We sign the measurable contract. So should every studio you evaluate.
What "Good" Looks Like
The eval suite is a deliverable when:
- It's in your repo, not ours
- It runs in your CI pipeline
- Thresholds are documented in plain English
- A future engineer (not us) can read it and understand what the AI is supposed to do
- It has actually caught at least one regression during the engagement
If any of those is not true, the evals are aspirational, not contractual.
The Buyer's Filter
When you evaluate any AI vendor, ask:
> "Show me a sample eval suite from a past engagement. Show me one regression it caught."
Most vendors will not have an answer. The ones who do are the ones to hire.
If you want to see what this looks like in practice, the Verdikt case study breaks down the eval suite we built for our own product.