The Gap
An AI MVP that demos is twenty percent of the work. The remaining eighty percent is what turns a prototype into a product. This checklist is how we close that gap on every engagement.
Architecture (8 items)
- [ ] Model-agnostic routing layer (LiteLLM or equivalent, 100-300 lines)
- [ ] No direct openai/anthropic imports at the application layer
- [ ] Provider fallback configured (primary + at least one fallback per call type)
- [ ] Streaming responses where user-facing latency matters
- [ ] Background job queue for any call that exceeds 5 seconds (Inngest, Trigger.dev, or equivalent)
- [ ] Idempotency keys on every external write
- [ ] Database schema designed for prompt + output audit logging
- [ ] Feature flag system in place (LaunchDarkly, PostHog feature flags, or equivalent)
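
To make the routing and fallback items concrete, here is a minimal sketch of a provider-agnostic router. It is not LiteLLM's API: the `Router` class, the `Provider` callable type, and the two stub providers are hypothetical names invented for illustration, and a real layer would wrap each vendor SDK behind the same callable shape.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Assumption: a provider is any callable that takes a prompt and returns text.
# In a real system each one would wrap a single vendor SDK.
Provider = Callable[[str], str]


class AllProvidersFailed(Exception):
    """Raised when every configured provider for a call type has failed."""


@dataclass
class Router:
    # Maps a call type (e.g. "summarize") to providers in priority order.
    routes: dict[str, Sequence[Provider]]

    def complete(self, call_type: str, prompt: str) -> str:
        last_error = None
        for provider in self.routes[call_type]:
            try:
                return provider(prompt)
            except Exception as exc:  # a real router would catch narrower error types
                last_error = exc
        raise AllProvidersFailed(f"no provider succeeded for {call_type!r}") from last_error


# Hypothetical stubs standing in for a primary and a fallback provider.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary timed out")


def stable_fallback(prompt: str) -> str:
    return f"fallback answer to: {prompt}"


router = Router(routes={"summarize": [flaky_primary, stable_fallback]})
print(router.complete("summarize", "hello"))
```

Application code calls `router.complete(...)` and never imports a vendor SDK directly, so swapping or reordering providers is a config change rather than a refactor.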
Evals (7 items)
- [ ] Eval suite written BEFORE the agent
- [ ] Inputs, expected behaviors, and quality thresholds defined per call type
- [ ] Evals run in CI on every PR
- [ ] CI fails the build when eval scores drop below threshold
- [ ] Held-out test set separate from training/iteration set
- [ ] Manual review queue for outputs that fail evals in production
- [ ] Quarterly eval refresh as production patterns evolve
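
One way the "CI fails the build" item could look is a small gate script. This is a sketch under assumptions: `EvalCase`, `pass_rate`, `gate`, the 0.9 threshold, and the toy `fake_generate` function are all hypothetical, and real evals usually score quality on a scale rather than pass/fail.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # returns True when the output is acceptable


def pass_rate(cases: Sequence[EvalCase], generate: Callable[[str], str]) -> float:
    """Fraction of cases whose generated output passes its check."""
    passed = sum(1 for case in cases if case.check(generate(case.prompt)))
    return passed / len(cases)


def gate(score: float, threshold: float) -> int:
    """Return a process exit code: 0 passes the build, 1 fails it."""
    return 0 if score >= threshold else 1


# Hypothetical stand-in for the model call under test.
def fake_generate(prompt: str) -> str:
    return prompt.upper()


cases = [
    EvalCase("hello", lambda out: out == "HELLO"),
    EvalCase("refund policy", lambda out: "REFUND" in out),
    EvalCase("bye", lambda out: out.endswith("!")),  # fails: no "!" is added
]

THRESHOLD = 0.9  # assumed value; set per call type
score = pass_rate(cases, fake_generate)
exit_code = gate(score, THRESHOLD)
print(f"pass rate {score:.2f}, exit code {exit_code}")
# In CI, finish with: raise SystemExit(exit_code)
```

A non-zero exit code is all most CI systems need to mark the job as failed.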
Auth and Secrets (5 items)
- [ ] No API keys in client-side code (ever)
- [ ] All model provider keys server-side only
- [ ] Per-user rate limiting on AI endpoints
- [ ] Secrets rotated quarterly or after any team change
- [ ] Audit log of who accessed which secrets when
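
Per-user rate limiting can be sketched as an in-memory sliding window. The class name, limits, and injectable `clock` below are assumptions for illustration; a multi-instance deployment would need shared storage (for example Redis) instead of a local dict.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most max_calls per user within any window_seconds span."""

    def __init__(self, max_calls: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._calls = defaultdict(deque)  # user_id -> timestamps of allowed calls

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        calls = self._calls[user_id]
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True


limiter = SlidingWindowLimiter(max_calls=2, window_seconds=60)
print([limiter.allow("user-1") for _ in range(3)])
```

An endpoint would call `limiter.allow(user_id)` before invoking the model and return HTTP 429 when it comes back `False`.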
Cost Engineering (6 items)
- [ ] Per-call cost logged to your analytics
- [ ] Per-feature dollar dashboard (token spend by capability)
- [ ] Per-customer cost dashboard (token spend by user/account)
- [ ] Cost-per-output metric tracked weekly (not cost-per-token)
- [ ] Budget alerts wired (Slack/email when daily spend exceeds threshold)
- [ ] Cheaper-model fallback for non-critical calls
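
The cost items reduce to a little arithmetic over logged calls. In the sketch below the model names and per-token prices are invented placeholders, not real rates, and "cost per output" is interpreted as total spend divided by outputs the user actually accepted, which is one reasonable reading rather than the only one.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per 1M tokens; replace with your providers' real rates.
PRICES = {
    "big-model": {"input": 3.00, "output": 15.00},
    "small-model": {"input": 0.25, "output": 1.25},
}


@dataclass
class CallRecord:
    model: str
    feature: str
    customer: str
    input_tokens: int
    output_tokens: int
    accepted: bool  # assumption: whether the output was actually used


def call_cost(rec: CallRecord) -> float:
    price = PRICES[rec.model]
    return (rec.input_tokens * price["input"]
            + rec.output_tokens * price["output"]) / 1_000_000


def spend_by(records, key: str) -> dict[str, float]:
    """Total spend grouped by a record attribute such as 'feature' or 'customer'."""
    totals = defaultdict(float)
    for rec in records:
        totals[getattr(rec, key)] += call_cost(rec)
    return dict(totals)


def cost_per_output(records) -> float:
    """Total spend divided by the number of accepted outputs."""
    accepted = sum(1 for rec in records if rec.accepted)
    total = sum(call_cost(rec) for rec in records)
    return total / accepted if accepted else float("inf")


records = [
    CallRecord("big-model", "summarize", "acme", 1000, 500, True),
    CallRecord("big-model", "summarize", "acme", 1000, 500, False),
    CallRecord("small-model", "classify", "globex", 2000, 100, True),
]
print(spend_by(records, "feature"))
print(spend_by(records, "customer"))
print(cost_per_output(records))
```

Dividing by accepted outputs rather than tokens means a rejected or regenerated answer raises the metric, which is the behavior the checklist item is after.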
Monitoring and Observability (8 items)
- [ ] Latency tracked per call type (p50, p95, p99)
- [ ] Success rate tracked per call type
- [ ] Cost tracked per call
- [ ] Eval score tracked on a sample of production calls routed to a review queue
- [ ] Error rate alerts wired (Slack/PagerDuty)
- [ ] User-facing error states designed (not raw stack traces)
- [ ] Retry logic with exponential backoff
- [ ] Circuit breakers on critical paths
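
The last two items can be sketched together. Everything here is illustrative: the function and class names, the default attempt counts, delays, and thresholds are assumptions, and production code would retry only on errors known to be transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(fn: Callable[[], T], max_attempts: int = 4,
                       base_delay: float = 0.5, max_delay: float = 8.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on any exception with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self.clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            self._opened_at = None  # cool-down elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self.clock()
            raise
        self._failures = 0  # any success closes the circuit fully
        return result
```

A common arrangement is to wrap the retrying call inside the breaker, for example `breaker.call(lambda: retry_with_backoff(call_model))` where `call_model` is your own function, so that a provider outage stops generating retries once the breaker opens.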
Security (5 items)
- [ ] Prompt injection mitigations in place for user-input → AI calls
- [ ] Output validation before user display (no executing AI-generated code without review)
- [ ] PII filtering on AI inputs where regulation requires it
- [ ] Zero-retention model endpoints where required by compliance
- [ ] Penetration test or external security review before paid launch
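
As a rough illustration of the PII filtering item, the sketch below masks email addresses and US-style phone numbers before text is sent to a model. The patterns and the `redact` helper are assumptions for demonstration only; regexes like these miss many PII forms and are not sufficient for regulatory compliance on their own.

```python
import re

# Deliberately simple patterns: they catch common formats, not every valid one.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text: str) -> str:
    """Replace matched emails and phone numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(redact("Reach jane.doe@example.com or 555-123-4567."))
```

Running redaction server-side, immediately before the provider call, keeps the filter on the same code path as the routing layer so no call type can bypass it.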
User Experience (5 items)
- [ ] Loading states designed (streaming, skeletons, progress indicators)
- [ ] Error states designed (specific to AI failure modes)
- [ ] "Why did the AI say this?" affordance where useful
- [ ] User can override / correct the AI output
- [ ] Feedback collection on every AI output (thumbs up/down or equivalent)
Documentation (3 items)
- [ ] Runbook for common production issues
- [ ] Architecture diagram up to date
- [ ] Eval definitions documented with examples
Pre-Launch (3 items)
- [ ] Alpha tested with 10-25 real users for at least 2 weeks
- [ ] Eval scores stable at acceptable thresholds for 7 days
- [ ] Cost-per-output stable and within target
When to Skip Items
The honest answer: never skip evals, never skip cost tracking, never skip auth. Skip nothing in those three categories.
Everything else is negotiable based on scale, stage, and risk profile. A pre-seed alpha can ship without circuit breakers. A regulated-industry product cannot.
If you want a structured audit of your AI MVP against this checklist (and a working demo of what "production-ready" looks like on your data), that's exactly the Kastling AI Readiness Audit.