Engineering · 6 min read

Evals As The Contract: How We Deliver AI Safely

The number one differentiator between AI projects that ship and AI projects that die is whether the team wrote an evaluation suite first. Here is what 'evals as the contract' actually means in a Kastling engagement.

The Problem

A demo proves the AI works once. An eval proves the AI works repeatedly under conditions you've defined.

Most AI agencies ship the demo and call it done. We've watched the result enough times to know what comes next. Six weeks in, the model is upgraded. The output silently degrades. Nobody notices for a month. Then someone files a complaint, the team scrambles, and the engagement renews at "support pricing" because nobody documented the eval criteria in the first place.

We don't do that. Evals are the contract.

What An Eval Suite Actually Is

For every AI call type in the product:

  1. Inputs. A structured set of test cases, including edge cases
  2. Expected behaviors. What the output should contain or look like
  3. Quality thresholds. The minimum score the output must hit
  4. Grading mechanism. Automated (LLM-as-judge, regex, structured comparison) or human (queue-based review)
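As a rough illustration, the four parts above can be modeled as one small data structure plus a runner. This is a minimal sketch, not our production harness: `EvalSpec`, `run_eval`, and `fake_generate` are hypothetical names, the grader is a simple substring check standing in for whichever mechanism you choose, and the sample verdict text is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalSpec:
    """One eval: inputs, an expectation expressed as a grader, and a threshold."""
    name: str
    inputs: Sequence[str]           # 1. structured test cases, edge cases included
    grader: Callable[[str], bool]   # 2 and 4. expected behavior, checked automatically
    min_pass_rate: float            # 3. quality threshold, as a fraction of cases


def run_eval(spec: EvalSpec, generate: Callable[[str], str]) -> tuple[float, bool]:
    """Run every input through `generate`, grade each output, compare to threshold."""
    if not spec.inputs:
        raise ValueError(f"{spec.name}: input set is empty")
    passed = sum(1 for case in spec.inputs if spec.grader(generate(case)))
    pass_rate = passed / len(spec.inputs)
    return pass_rate, pass_rate >= spec.min_pass_rate


# Hypothetical stand-in for the real model call; the output text is made up.
def fake_generate(brief: str) -> str:
    return f"Verdict for {brief}. Kill criterion: fewer than 10 paid pilots in 90 days."


spec = EvalSpec(
    name="verdict_includes_kill_criterion",
    inputs=["brief A", "brief B", "brief C"],
    grader=lambda output: "kill criterion" in output.lower(),
    min_pass_rate=1.0,
)
pass_rate, ok = run_eval(spec, fake_generate)
print(f"{spec.name}: {pass_rate:.0%} pass, threshold met: {ok}")
```

Keeping the grader as a plain callable means a regex check, a structured comparison, or an LLM-as-judge call can all plug into the same runner.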

For Verdikt, our evals look something like this:

```yaml
- name: verdict_includes_kill_criterion
  input_set: 30 founder briefs from validation
  expected: output.verdict_section.kill_criterion exists
  threshold: 100% pass

- name: market_section_quality
  input_set: same 30
  grader: claude-3-5-sonnet with rubric
  expected: score >= 4/5 on 4 dimensions (market sizing, competitive density, customer specificity, evidence quality)
  threshold: 28/30 pass at 4/5+

- name: source_citation_count
  input_set: same 30
  expected: at least 20 distinct sources per verdict
  threshold: 28/30 pass
```
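Not every eval needs a model as the grader. A check like source_citation_count can be deterministic. The sketch below assumes sources appear in the verdict as URLs, which may not match how a given product formats citations. The function names and the normalization (lowercasing, stripping trailing punctuation) are illustrative choices, and lowercasing the whole URL is a deliberate simplification.

```python
import re

# Assumption: a "source" is any http(s) URL in the verdict text.
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


def distinct_sources(verdict_text: str) -> set[str]:
    """Collect unique URLs, ignoring case and trailing punctuation."""
    return {m.rstrip(".,;:").lower() for m in URL_PATTERN.findall(verdict_text)}


def meets_citation_floor(verdict_text: str, minimum: int = 20) -> bool:
    """True when the verdict cites at least `minimum` distinct sources."""
    return len(distinct_sources(verdict_text)) >= minimum


sample = "See https://example.com/a, https://example.com/b and https://EXAMPLE.com/a."
print(len(distinct_sources(sample)))            # 2
print(meets_citation_floor(sample))             # False
print(meets_citation_floor(sample, minimum=2))  # True
```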

The eval suite runs on every PR. If a prompt change drops the market_section_quality below threshold, the build fails. We don't ship a regression.
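The gate itself can be very small. Here is a minimal sketch of the pass/fail decision for market_section_quality, assuming the judge returns one integer score per rubric dimension and that a case passes only when every dimension reaches 4. That reading of the rubric, the dimension keys, and the function names are assumptions made for the example.

```python
DIMENSIONS = (
    "market_sizing",
    "competitive_density",
    "customer_specificity",
    "evidence_quality",
)


def case_passes(scores: dict[str, int], min_score: int = 4) -> bool:
    """Assumed rubric reading: every dimension must score min_score or higher."""
    return all(scores[dim] >= min_score for dim in DIMENSIONS)


def count_passes(results: list[dict[str, int]]) -> int:
    return sum(1 for scores in results if case_passes(scores))


def gate_exit_code(results: list[dict[str, int]], required_passes: int) -> int:
    """Return 0 when the threshold is met, 1 otherwise (non-zero fails a CI job)."""
    return 0 if count_passes(results) >= required_passes else 1


good = {dim: 4 for dim in DIMENSIONS}
weak = {**good, "evidence_quality": 3}

baseline = [good] * 28 + [weak] * 2    # 28/30 pass: threshold met
regressed = [good] * 27 + [weak] * 3   # 27/30 pass: one more failing case

print(gate_exit_code(baseline, required_passes=28))   # 0
print(gate_exit_code(regressed, required_passes=28))  # 1
```

A CI script would hand the returned value to `sys.exit()`, so a single additional failing case is enough to block the merge.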

How This Lives In A Client Contract

When we sign a project, the SOW includes:

  1. The eval suite as a named deliverable. Not just the feature. The evals that grade the feature.
  2. Quality thresholds as acceptance criteria. "Done" means evals pass at the documented thresholds.
  3. The eval definitions belong to the client. Full IP transfer at engagement end.
  4. A regression policy. If a future model upgrade or prompt change drops a metric below threshold, the work is not shipped.

The eval suite is the operational contract. It survives us leaving.

Why Most AI Agencies Skip This

Three honest reasons:

  1. Evals are unsexy. They're internal infrastructure. They don't demo well.
  2. Evals are hard. Writing a good eval is harder than writing the prompt it grades.
  3. Evals constrain the agency. They make it provable when the agency's output degrades.

The third reason is the real one. An agency that ships demos can pretend its work is great forever. An agency that ships evals signs a measurable contract.

We sign the measurable contract. So should every studio you evaluate.

What "Good" Looks Like

The eval suite is a deliverable when:

  • It's in your repo, not ours
  • It runs in your CI pipeline
  • Thresholds are documented in plain English
  • A future engineer (not us) can read it and understand what the AI is supposed to do
  • It has actually caught at least one regression during the engagement

If those are not true, the evals are aspirational, not contractual.

The Buyer's Filter

When you evaluate any AI vendor, ask:

> "Show me a sample eval suite from a past engagement. Show me one regression it caught."

Most vendors will not have an answer. The ones who do are the ones to hire.

If you want to see what this looks like in practice, the Verdikt case study breaks down the eval suite we built for our own product.

