Engineering · 8 min read

The AI MVP Checklist: From Prompt to Production

An AI MVP that ships looks completely different from an AI MVP that demos. This is the 50-item checklist we run against every AI product engagement before we call it production-ready.

The Gap

Getting an AI MVP to the point where it demos well is about twenty percent of the work. The remaining eighty percent is what turns a prototype into a product. This checklist is how we close that gap on every engagement.


Architecture (8 items)

  • [ ] Model-agnostic routing layer (LiteLLM or equivalent; expect roughly 100-300 lines of code)
  • [ ] No direct openai / anthropic imports at application layer
  • [ ] Provider fallback configured (primary + at least one fallback per call type)
  • [ ] Streaming responses where user-facing latency matters
  • [ ] Background job queue for any call that exceeds 5 seconds (Inngest, Trigger.dev, or equivalent)
  • [ ] Idempotency keys on every external write
  • [ ] Database schema designed for prompt + output audit logging
  • [ ] Feature flag system in place (LaunchDarkly, PostHog feature flags, or equivalent)
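
The provider-fallback item can be sketched in a few lines. This is a minimal illustration, not a production router: `Route`, `complete`, `flaky_primary`, and `steady_fallback` are hypothetical names, and the providers are plain callables standing in for whatever routing layer you use.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# A provider is any callable that takes a prompt and returns text.
# These are stand-ins; the application layer never imports a vendor SDK.
Provider = Callable[[str], str]


class AllProvidersFailed(Exception):
    """Raised when the primary and every fallback have failed."""


@dataclass(frozen=True)
class Route:
    name: str
    providers: Sequence[tuple[str, Provider]]  # ordered, primary first


def complete(route: Route, prompt: str) -> tuple[str, str]:
    """Try each provider in order and return (provider_name, output)."""
    errors: list[str] = []
    for provider_name, call in route.providers:
        try:
            return provider_name, call(prompt)
        except Exception as exc:  # a real router would catch narrower errors
            errors.append(f"{provider_name}: {exc}")
    raise AllProvidersFailed(f"{route.name} failed: " + "; ".join(errors))


def flaky_primary(prompt: str) -> str:
    raise TimeoutError("upstream timeout")


def steady_fallback(prompt: str) -> str:
    return f"summary of: {prompt}"


summarize = Route("summarize", [("primary", flaky_primary), ("fallback", steady_fallback)])
used, output = complete(summarize, "quarterly report")
```

Returning the provider name alongside the output makes it easy to log which provider actually served each call, which feeds the audit-logging item above.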

Evals (7 items)

  • [ ] Eval suite written BEFORE the agent
  • [ ] Inputs, expected behaviors, and quality thresholds defined per call type
  • [ ] Evals run in CI on every PR
  • [ ] CI fails the build when eval scores drop below threshold
  • [ ] Held-out test set separate from training/iteration set
  • [ ] Manual review queue for outputs that fail evals in production
  • [ ] Quarterly eval refresh as production patterns evolve
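
As a rough sketch of what "CI fails the build when eval scores drop below threshold" can look like, assuming evals are expressed as pass/fail predicates: `EvalCase`, `run_suite`, `gate`, and `toy_model` are hypothetical names, and the 0.9 threshold is illustrative.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # the expected behavior, as a predicate


def run_suite(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose output passes its check."""
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


def gate(score: float, threshold: float) -> int:
    """Exit code for CI: 0 when the score meets the threshold, 1 otherwise."""
    return 0 if score >= threshold else 1


def toy_model(prompt: str) -> str:
    # Stand-in for the real model call.
    return prompt.upper()


CASES = [
    EvalCase("refund policy", lambda out: "REFUND" in out),
    EvalCase("shipping times", lambda out: "SHIPPING" in out),
    EvalCase("cancel order", lambda out: "sorry" in out),  # fails on purpose
]

score = run_suite(toy_model, CASES)      # 2 of 3 cases pass
exit_code = gate(score, threshold=0.9)   # below threshold, so non-zero
```

In a CI job, the script would end with `sys.exit(exit_code)` so that a score below threshold fails the build.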

Auth and Secrets (5 items)

  • [ ] No API keys in client-side code (ever)
  • [ ] All model provider keys server-side only
  • [ ] Per-user rate limiting on AI endpoints
  • [ ] Secrets rotated quarterly or after any team change
  • [ ] Audit log of who accessed which secrets when
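
Per-user rate limiting can start as a sliding-window counter. The sketch below is one possible shape, assuming a single process and in-memory state; `SlidingWindowLimiter` is a hypothetical name, and a multi-instance deployment would need a shared store instead.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `max_calls` per user within the last `window_seconds`."""

    def __init__(self, max_calls: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests do not have to sleep
        self._calls: dict[str, deque] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        calls = self._calls[user_id]
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True
```

An AI endpoint would call `limiter.allow(user_id)` before making the model call and return an HTTP 429 response when it comes back `False`.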

Cost Engineering (6 items)

  • [ ] Per-call cost logged to your analytics
  • [ ] Per-feature dollar dashboard (token spend by capability)
  • [ ] Per-customer cost dashboard (token spend by user/account)
  • [ ] Cost-per-output metric tracked weekly (not cost-per-token)
  • [ ] Budget alerts wired (Slack/email when daily spend exceeds threshold)
  • [ ] Cheaper-model fallback for non-critical calls
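
To make the difference between cost-per-token and cost-per-output concrete, here is a small sketch. The prices and model names are invented for illustration, and "output" is assumed to mean a result the user actually kept; adjust both to your own provider rates and product definition.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens. Substitute real provider rates.
PRICES = {
    "large-model": {"input": 3.00, "output": 15.00},
    "small-model": {"input": 0.25, "output": 1.25},
}


@dataclass(frozen=True)
class CallRecord:
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    accepted: bool  # assumption: True when the user kept the result


def call_cost(record: CallRecord) -> float:
    """Dollar cost of a single model call."""
    price = PRICES[record.model]
    return (record.input_tokens * price["input"]
            + record.output_tokens * price["output"]) / 1_000_000


def cost_per_output(records: list[CallRecord]) -> float:
    """Total spend divided by accepted outputs; rejected calls still cost money."""
    total = sum(call_cost(r) for r in records)
    accepted = sum(1 for r in records if r.accepted)
    return total / accepted if accepted else float("inf")


records = [
    CallRecord("summarize", "large-model", 2000, 500, accepted=True),
    CallRecord("summarize", "large-model", 2000, 500, accepted=False),
    CallRecord("summarize", "small-model", 4000, 1000, accepted=True),
]
weekly_cost_per_output = cost_per_output(records)
```

Grouping the same records by `feature` or by an account ID would give the per-feature and per-customer dashboards listed above.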

Monitoring and Observability (8 items)

  • [ ] Latency tracked per call type (p50, p95, p99)
  • [ ] Success rate tracked per call type
  • [ ] Cost tracked per call
  • [ ] Eval score tracked per call type, with a sample of production calls sent to a review queue
  • [ ] Error rate alerts wired (Slack/PagerDuty)
  • [ ] User-facing error states designed (not raw stack traces)
  • [ ] Retry logic with exponential backoff
  • [ ] Circuit breakers on critical paths
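
The last two items, retries with exponential backoff and circuit breakers, are sketched below under simple assumptions: every exception is treated as retryable, jitter is uniform up to the current cap, and the breaker keeps its state in memory. `retry_with_backoff`, `CircuitBreaker`, and `CircuitOpen` are hypothetical names, not a specific library's API.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(call: Callable[[], T], max_attempts: int = 4,
                       base_delay: float = 0.5, max_delay: float = 8.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `call`, doubling the delay cap after each failure, with jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")


class CircuitOpen(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self.clock() - self._opened_at < self.reset_after:
                raise CircuitOpen("failing fast while the dependency recovers")
            self._opened_at = None  # cooldown elapsed, allow one trial call
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self.clock()
            raise
        self._failures = 0  # any success closes the circuit fully
        return result
```

A common arrangement is to wrap the retried call inside the breaker, for example `breaker.call(lambda: retry_with_backoff(fetch))`, so that one exhausted retry sequence counts as a single failure.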

Security (5 items)

  • [ ] Prompt injection mitigations in place for user-input → AI calls
  • [ ] Output validation before user display (no executing AI-generated code without review)
  • [ ] PII filtering on AI inputs where regulation requires it
  • [ ] Zero-retention model endpoints where required by compliance
  • [ ] Penetration test or external security review before paid launch
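
For the PII filtering item, a regex pass over inputs is a reasonable first layer. The sketch below is deliberately narrow: it only covers email addresses and one US-style phone format, the patterns and placeholders are illustrative, and it should not be read as sufficient for any particular regulation.

```python
import re

# Illustrative patterns only. Real deployments usually need more categories
# (names, addresses, government IDs) and often a dedicated detection service.
PATTERNS = [
    ("[EMAIL]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
]


def redact(text: str) -> str:
    """Replace each matched span with a placeholder before the model sees it."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


prompt = redact("Contact jane@example.com or 555-123-4567.")
# prompt == "Contact [EMAIL] or [PHONE]."
```

Running the redaction at the routing layer, rather than in each feature, means every call type gets the same treatment.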

User Experience (5 items)

  • [ ] Loading states designed (streaming, skeletons, progress indicators)
  • [ ] Error states designed (specific to AI failure modes)
  • [ ] "Why did the AI say this?" affordance where useful
  • [ ] User can override / correct the AI output
  • [ ] Feedback collection on every AI output (thumbs up/down or equivalent)

Documentation (3 items)

  • [ ] Runbook for common production issues
  • [ ] Architecture diagram up to date
  • [ ] Eval definitions documented with examples

Pre-Launch (3 items)

  • [ ] Alpha tested with 10-25 real users for at least 2 weeks
  • [ ] Eval scores stable at acceptable thresholds for 7 days
  • [ ] Cost-per-output stable and within target
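
One way to turn "stable at acceptable thresholds for 7 days" into a check you can run is shown below. The definition of stable is an assumption here: every one of the last seven daily scores meets the threshold, and the gap between the best and worst day stays within a chosen spread. `stable_for` and its defaults are hypothetical.

```python
def stable_for(daily_scores: list[float], threshold: float,
               days: int = 7, max_spread: float = 0.05) -> bool:
    """True when the most recent `days` scores meet the threshold and stay close."""
    if len(daily_scores) < days:
        return False  # not enough history yet
    recent = daily_scores[-days:]
    meets_threshold = all(score >= threshold for score in recent)
    within_spread = max(recent) - min(recent) <= max_spread
    return meets_threshold and within_spread


scores = [0.81, 0.92, 0.93, 0.91, 0.94, 0.92, 0.93, 0.92]
ready = stable_for(scores, threshold=0.9)
```

The same function works for cost-per-output if you negate the values or flip the comparison, since there the target is a ceiling rather than a floor.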

When to Skip Items

The honest answer: never skip evals, never skip cost tracking, never skip auth. Skip nothing in those three categories.

Everything else is negotiable based on scale, stage, and risk profile. A pre-seed alpha can ship without circuit breakers. A regulated-industry product cannot.

If you want a structured audit of your AI MVP against this checklist (and a working demo of what "production-ready" looks like on your data), that's exactly the Kastling AI Readiness Audit.
