The AI MVP Checklist: From Prompt to Production

The Gap

An AI MVP that demos well is roughly twenty percent of the work. The remaining eighty percent is what turns a prototype into a product. This checklist is how we close that gap on every engagement.

Architecture (8 items)

[ ] Model-agnostic routing layer (LiteLLM or equivalent; typically 100-300 lines of code)
[ ] No direct openai / anthropic imports at the application layer (all calls go through the routing layer)
[ ] Provider fallback configured (primary + at least one fallback per call type)
[ ] Streaming responses where user-facing latency matters
[ ] Background job queue for any call that exceeds 5 seconds (Inngest, Trigger.dev, or equivalent)
[ ] Idempotency keys on every external write
[ ] Database schema designed for prompt + output audit logging
[ ] Feature flag system in place (LaunchDarkly, PostHog feature flags, or equivalent)
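
The routing and fallback items above can be sketched in a few lines. This is a minimal illustration, not LiteLLM's API: the Router class, the AllProvidersFailed error, and the two stub providers are hypothetical names invented for the example, and a real layer would wrap actual provider SDK calls behind the same callable shape.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a "provider" is any callable that takes a prompt and returns text.
# In a real system each one would wrap a vendor SDK call.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider configured for a call type has failed."""


@dataclass
class Router:
    # Maps a call type (e.g. "summarize") to providers in priority order:
    # primary first, then one or more fallbacks.
    routes: Dict[str, List[Provider]]

    def complete(self, call_type: str, prompt: str) -> str:
        errors = []
        for provider in self.routes[call_type]:
            try:
                return provider(prompt)
            except Exception as exc:  # real code should catch narrower errors
                errors.append(exc)
        raise AllProvidersFailed(
            f"{call_type}: {len(errors)} provider(s) failed"
        )


# Hypothetical stub providers standing in for real SDK wrappers.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary provider timed out")


def stable_fallback(prompt: str) -> str:
    return f"fallback:{prompt}"


router = Router(routes={"summarize": [flaky_primary, stable_fallback]})
```

Application code calls router.complete("summarize", text) and never imports a provider SDK directly, so swapping or reordering providers is a configuration change rather than a refactor.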

Evals (7 items)

[ ] Eval suite written BEFORE the agent
[ ] Inputs, expected behaviors, and quality thresholds defined per call type
[ ] Evals run in CI on every PR
[ ] CI fails the build when eval scores drop below threshold
[ ] Held-out test set separate from training/iteration set
[ ] Manual review queue for outputs that fail evals in production
[ ] Quarterly eval refresh as production patterns evolve
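
One way to wire "CI fails the build when eval scores drop below threshold" is a small gate that turns a pass rate into a process exit code. The sketch below assumes evals are simple pass/fail checks on an agent's output; EvalCase, run_evals, gate, and stub_agent are hypothetical names, and the 0.6 threshold in the usage note is an arbitrary illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    # Returns True when the output shows the expected behavior.
    check: Callable[[str], bool]


def run_evals(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases whose check passes."""
    if not cases:
        raise ValueError("eval suite is empty")
    passed = sum(1 for case in cases if case.check(agent(case.prompt)))
    return passed / len(cases)


def gate(score: float, threshold: float) -> int:
    """Map a score to an exit code: 0 passes the build, 1 fails it."""
    return 0 if score >= threshold else 1


# Hypothetical stand-in for the real agent under test.
def stub_agent(prompt: str) -> str:
    return prompt.upper()


CASES = [
    EvalCase("refund policy", lambda out: "REFUND" in out),
    EvalCase("hello", lambda out: out == "HELLO"),
    EvalCase("bye", lambda out: "never" in out),  # deliberately failing case
]
```

In CI, a script would finish with sys.exit(gate(run_evals(agent, CASES), 0.6)), so a score below the threshold fails the PR check.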

Auth and Secrets (5 items)

[ ] No API keys in client-side code (ever)
[ ] All model provider keys server-side only
[ ] Per-user rate limiting on AI endpoints
[ ] Secrets rotated quarterly or after any team change
[ ] Audit log of who accessed which secrets when
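
Per-user rate limiting is often implemented as a token bucket. The sketch below is an in-memory version under stated assumptions: TokenBucketLimiter and its parameters are hypothetical, state lives in a single process, and a multi-instance deployment would need a shared store instead.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Each user gets a bucket of `capacity` tokens that refills over time."""

    def __init__(
        self,
        capacity: int,
        refill_per_sec: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock  # injectable so tests can control time
        # user_id -> (tokens remaining, timestamp of last update)
        self._state: Dict[str, Tuple[float, float]] = {}

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        tokens, last = self._state.get(user_id, (float(self.capacity), now))
        # Refill based on elapsed time, never exceeding capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if tokens >= 1.0:
            self._state[user_id] = (tokens - 1.0, now)
            return True
        self._state[user_id] = (tokens, now)
        return False
```

An AI endpoint would call limiter.allow(user_id) before invoking the model and return an HTTP 429 response when it is False.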

Cost Engineering (6 items)

[ ] Per-call cost logged to your analytics
[ ] Per-feature dollar dashboard (token spend by capability)
[ ] Per-customer cost dashboard (token spend by user/account)
[ ] Cost-per-output metric tracked weekly (not cost-per-token)
[ ] Budget alerts wired (Slack/email when daily spend exceeds threshold)
[ ] Cheaper-model fallback for non-critical calls
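
The difference between cost-per-token and cost-per-output is easiest to see in code. The sketch below assumes that cost-per-output means total spend on a feature, including retries and rejected calls, divided by the number of accepted outputs. The model names and prices are made up for illustration, and CallRecord, call_cost, and cost_per_output are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical prices in USD per 1M tokens as (input, output).
# These are illustration values, not real provider pricing.
PRICES = {"big-model": (3.00, 15.00), "small-model": (0.25, 1.25)}


@dataclass
class CallRecord:
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    accepted: bool  # did this call produce an output the user kept?


def call_cost(rec: CallRecord) -> float:
    """Dollar cost of a single call."""
    in_price, out_price = PRICES[rec.model]
    return (rec.input_tokens * in_price + rec.output_tokens * out_price) / 1_000_000


def cost_per_output(records: List[CallRecord]) -> Dict[str, Optional[float]]:
    """Total spend per feature divided by accepted outputs (None if none)."""
    spend: Dict[str, float] = defaultdict(float)
    accepted: Dict[str, int] = defaultdict(int)
    for rec in records:
        spend[rec.feature] += call_cost(rec)
        accepted[rec.feature] += int(rec.accepted)
    return {
        feature: (spend[feature] / accepted[feature] if accepted[feature] else None)
        for feature in spend
    }
```

With these assumptions, a feature that needs two attempts per accepted output costs twice as much per output as its per-call cost suggests, which a per-token view would hide.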

Monitoring and Observability (8 items)

[ ] Latency tracked per call type (p50, p95, p99)
[ ] Success rate tracked per call type
[ ] Cost tracked per call
[ ] Eval scores tracked on a sample of production calls, with sampled outputs routed to a review queue
[ ] Error rate alerts wired (Slack/PagerDuty)
[ ] User-facing error states designed (not raw stack traces)
[ ] Retry logic with exponential backoff
[ ] Circuit breakers on critical paths
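
The last two items can be illustrated together. This is a minimal sketch, not a production library: retry_with_backoff, CircuitBreaker, and CircuitOpen are hypothetical names, the default delays and thresholds are arbitrary, and the jitter scheme (a random fraction of the capped exponential delay) is one common choice among several.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Call fn, retrying on any exception with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * rng())  # jitter spreads out simultaneous retries
    raise ValueError("max_attempts must be at least 1")


class CircuitOpen(RuntimeError):
    """Raised when the breaker is open and calls fail fast."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen("circuit open; failing fast")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Retries handle transient errors on a single request, while the breaker stops the whole service from hammering a provider that is clearly down.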

Security (5 items)

[ ] Prompt injection mitigations in place for user-input → AI calls
[ ] Output validation before user display (no executing AI-generated code without review)
[ ] PII filtering on AI inputs where regulation requires it
[ ] Zero-retention model endpoints where required by compliance
[ ] Penetration test or external security review before paid launch
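
As a rough illustration of PII filtering on AI inputs, the sketch below redacts email addresses and one phone-number format before a prompt leaves your servers. The patterns and the redact_pii name are hypothetical and deliberately narrow. Regexes like these miss many kinds of PII, so treat this as a starting point, not a compliance control.

```python
import re

# Assumption: simple patterns for illustration only. They cover common email
# addresses and phone numbers written as 3-3-4 digit groups, nothing more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)
```

Running redaction inside the routing layer, rather than in each feature, means every model call gets the same treatment by default.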

User Experience (5 items)

[ ] Loading states designed (streaming, skeletons, progress indicators)
[ ] Error states designed (specific to AI failure modes)
[ ] "Why did the AI say this?" affordance where useful
[ ] User can override / correct the AI output
[ ] Feedback collection on every AI output (thumbs up/down or equivalent)

Documentation (3 items)

[ ] Runbook for common production issues
[ ] Architecture diagram up to date
[ ] Eval definitions documented with examples

Pre-Launch (3 items)

[ ] Alpha tested with 10-25 real users for at least 2 weeks
[ ] Eval scores stable at acceptable thresholds for 7 days
[ ] Cost-per-output stable and within target

When to Skip Items

The honest answer: never skip evals, never skip cost tracking, never skip auth. Skip nothing in those three categories.

Everything else is negotiable based on scale, stage, and risk profile. A pre-seed alpha can ship without circuit breakers. A regulated-industry product cannot.

If you want a structured audit of your AI MVP against this checklist (and a working demo of what "production-ready" looks like on your data), that's exactly the Kastling AI Readiness Audit.