Strategy · 6 min read

How To Evaluate AI Vendors: 10 Disqualifying Questions

The AI services market is full of vendors who can't actually ship. These are the ten questions buyers should ask before signing any AI engagement. Most vendors fail at least four of them.

The Filter

The AI services market in 2026 is filled with vendors who can't ship what they claim. The pattern is consistent: an impressive sales call, a vague proposal, missed deadlines, demoware that crumbles in production, and emergency upgrade quotes when something breaks.

The solution is upstream qualification. These are the ten questions every buyer should ask every vendor. A serious vendor should be able to answer at least eight cleanly. If a vendor can't answer even five cleanly, walk.

1. Show me one AI workflow you've shipped in your own business.

This is the most important question. A vendor that has never shipped AI for themselves is selling theoretical capability.

Good answer: "Here is the product. Here are the live metrics. Here is the case study with the methodology we used."

Red flag: Deflection, hand-wave, "we mostly do client work."

2. Will the code live in our repo or yours from day one?

Good answer: "Yours. Day one. Branches, PRs, CI in your environment."

Red flag: "Our staging environment" / "we'll hand it off at the end" / "our platform handles that."

3. How do you define 'done' before the engagement begins?

Good answer: "An eval suite with inputs, expected behaviors, and quality thresholds. CI fails the build when thresholds slip. The eval suite is a contractual deliverable."

Red flag: "We'll show you when it's working" / "the demo speaks for itself."
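To make the "CI fails the build when thresholds slip" idea concrete, here is a minimal sketch of an eval gate. The names `EvalCase`, `pass_rate`, and `gate` are hypothetical and chosen for illustration; a real suite would call the deployed workflow rather than a stand-in function, and would likely track more than a single pass rate.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str                   # input sent to the system under test
    check: Callable[[str], bool]  # returns True if the output is acceptable


def pass_rate(system: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Run every case through the system and return the fraction that pass."""
    if not cases:
        raise ValueError("eval suite is empty")
    passed = sum(1 for case in cases if case.check(system(case.prompt)))
    return passed / len(cases)


def gate(system: Callable[[str], str], cases: List[EvalCase], threshold: float) -> int:
    """Return a process exit code: 0 if the threshold is met, 1 otherwise."""
    rate = pass_rate(system, cases)
    print(f"pass rate {rate:.2%} (threshold {threshold:.2%})")
    return 0 if rate >= threshold else 1
```

A CI step could end with `sys.exit(gate(system, cases, threshold))`, so a drop below the agreed threshold fails the build. The threshold value itself is what belongs in the contract.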

4. Are you model-agnostic?

Good answer: "Yes. We use a routing layer (LiteLLM or equivalent). You pay providers directly through your own accounts. We can show you the config."

Red flag: "We use [provider] exclusively because it's the best" / "our platform handles all the AI stuff."
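What "model-agnostic" means in code can be sketched roughly as follows. This is not LiteLLM's API. It is a hypothetical, stripped-down `Router` class that illustrates the shape of a routing layer: features map to providers through config, so changing the model behind a feature is a config change rather than a rewrite. The provider names and stub functions are assumptions for illustration.

```python
from typing import Callable, Dict

# A provider is any callable that takes a prompt and returns text.
Provider = Callable[[str], str]


class Router:
    """Maps feature names to providers via config."""

    def __init__(self, providers: Dict[str, Provider], routes: Dict[str, str]):
        unknown = set(routes.values()) - set(providers)
        if unknown:
            raise ValueError(f"routes reference unknown providers: {sorted(unknown)}")
        self.providers = providers
        self.routes = routes

    def complete(self, feature: str, prompt: str) -> str:
        # Look up which provider serves this feature, then delegate to it.
        provider_name = self.routes[feature]
        return self.providers[provider_name](prompt)


# Stub providers standing in for real SDK calls made with the buyer's own API keys.
providers = {
    "provider_a": lambda prompt: f"[a] {prompt}",
    "provider_b": lambda prompt: f"[b] {prompt}",
}
router = Router(providers, routes={"summarize": "provider_a"})
```

Pointing `"summarize"` at `"provider_b"` changes the model without touching application code. That is the property to ask the vendor to demonstrate in their actual config.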

5. Do you mark up tokens?

Good answer: "No. You pay providers directly. Costs land in your billing, not ours. We provide per-feature cost dashboards."

Red flag: "Our pricing includes inference costs" / "we have negotiated rates we pass through" / "a flat monthly fee covers everything."
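A per-feature cost dashboard is, at its core, an aggregation over usage records. The sketch below assumes a hypothetical `UsageRecord` shape and a price table keyed by model; the model name and prices used with it are placeholders, not real provider rates.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class UsageRecord:
    feature: str        # product feature that made the call
    model: str          # model identifier as billed by the provider
    input_tokens: int
    output_tokens: int


def cost_by_feature(
    records: Iterable[UsageRecord],
    prices: Dict[str, Tuple[float, float]],
) -> Dict[str, float]:
    """Total USD cost per feature.

    prices maps model -> (USD per 1M input tokens, USD per 1M output tokens).
    """
    totals: Dict[str, float] = defaultdict(float)
    for r in records:
        in_price, out_price = prices[r.model]
        totals[r.feature] += (
            r.input_tokens * in_price + r.output_tokens * out_price
        ) / 1_000_000
    return dict(totals)
```

Because the buyer pays providers directly, these totals can be reconciled against the provider's own invoice. With a marked-up or bundled fee, that reconciliation is not possible.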

6. What happens to the IP at engagement end?

Good answer: "Full transfer on final invoice. Prompts, evals, fine-tunes, routing logic, all yours. We document the handoff."

Red flag: "We retain the underlying tooling" / "you get the application but not the underlying agent" / "we license our platform to you."

7. Who specifically on your team will be doing the work?

Good answer: Named people, by role. Senior engineers, not "account managers" or "delivery leads."

Red flag: "Our delivery team will be assigned at kickoff" / "we have a pool of specialists" / refusal to name people.

8. What's your largest single AI engagement that has been in production for 6+ months?

Good answer: Specific example. Client name (where permitted). Metrics. What was learned.

Red flag: Long pause. Examples that ended in pilots. Vague references.

9. What's your refund policy if the engagement fails?

Good answer: Clear cancellation terms. Earned work invoiced through cancellation date. Refund of unearned portion. No hidden lock-ins.

Red flag: "We don't refund" / "we'll work it out" / multi-year contracts with cancellation penalties.

10. Who on your team has worked on AI specifically (not just software) for 3+ years?

Good answer: Named people with verifiable backgrounds in AI/ML, prompt engineering, or model evaluation.

Red flag: "Our whole team is AI-trained" / "everyone is using AI now" / no specific named experience.

The Scoring

For each question:

  • Cleanly answered with specifics: 1 point
  • Hedged or vague: 0 points
  • Deflected or refused: -1 point

  • 8+ points: serious vendor. Worth a follow-up.
  • 5-7 points: marginal. Push for clarification on the weak answers.
  • <5 points: walk.
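The rubric is simple enough to tally by hand, but when comparing several vendors it can help to encode it once. A minimal sketch, with hypothetical names (`Answer`, `score_vendor`, `verdict`):

```python
from enum import Enum
from typing import List


class Answer(Enum):
    CLEAN = 1       # cleanly answered with specifics
    VAGUE = 0       # hedged or vague
    DEFLECTED = -1  # deflected or refused


def score_vendor(answers: List[Answer]) -> int:
    """Sum the per-question points for one vendor's ten answers."""
    if len(answers) != 10:
        raise ValueError("expected exactly 10 answers, one per question")
    return sum(a.value for a in answers)


def verdict(points: int) -> str:
    """Map a total score onto the three bands of the rubric."""
    if points >= 8:
        return "serious"
    if points >= 5:
        return "marginal"
    return "walk"
```

Note that deflections cost a point rather than merely earning none: a vendor with eight clean answers and two deflections scores 6, which lands in the marginal band, not the serious one.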

Send The List To Every Vendor

This is the most useful thing buyers can do during evaluation. Send the same 10 questions to every shortlisted vendor. Score them on the same scale. Compare.

You will be amazed how many vendors fail this simple filter.

How We Score Ourselves

We're transparent about this. Against our own 10 questions, Kastling scores 10/10. The full case for why is in our methodology. The seven engineering principles map directly to the disqualifying questions.

If you want to apply the questions to us in a real evaluation, the discovery call is designed for exactly that.

Start an audit

Tell us what you are building. We will tell you if we can help.

A brief takes three minutes. We read every one. If there is a fit, you hear back within one business day with a scope call and a proposal. If there is not, we say so and point you somewhere better.
