Yes, I'm opening the topic of evals. It's not as scary as it sounds — but it's extremely important. This post is inspired by Aakash Gupta's breakdown of AI evals, which is the most useful eval content I've found. Think of this as my summary, plus a real example with numbers. Go watch the original too.
What's an Eval?
As LLMs get more capable, we can deliver value to users in many different ways. But these systems aren't deterministic like 2 × 2 = 4. They're black boxes, and the real question is: are they actually reliable?
If you're vibe coding a side project, maybe it doesn't matter. But in a product with 10,000 users, a 1% error rate means 100 people are affected. At 5%, that's 500. Understanding where these systems fail and how reliable they really are — that's not optional.
This is where evals come in. There's been talk that this is becoming the PM's job. Whether the PM role stays as-is or merges into some "builder" type — I don't know. But if your product uses AI, evals matter. That much is certain.
Eval Types
Two questions: when do you evaluate, and what do you evaluate?
When
| Type | Description |
|---|---|
| Offline — before launch | Run your test dataset, fix what fails, ship when it passes. |
| Online — after launch | Monitor production: latency, error rates, quality scores. |
| Human — continuous spot-checks | Domain experts review samples regularly. Catches what automation misses and feeds errors back into the model. |
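The online row boils down to a thin wrapper around the model call that records the metrics in the table. A minimal sketch, purely illustrative: `call_model` and the length-based quality check are stand-ins, not anything a real product (or ViziAI) uses as-is.

```python
import time

def monitored_call(call_model, prompt, log):
    """Wrap a model call with the online metrics from the table:
    latency, error flag, and a quality score (a stand-in heuristic here)."""
    start = time.perf_counter()
    try:
        answer = call_model(prompt)
        error = False
    except Exception:
        answer, error = None, True
    latency = time.perf_counter() - start
    # Stand-in quality check: a real system would use a rubric or grader model.
    quality = 1.0 if answer else 0.0
    log.append({"latency_s": latency, "error": error, "quality": quality})
    return answer

# Usage with a fake model so the sketch runs without an API key:
log = []
monitored_call(lambda p: p.upper(), "hello", log)
error_rate = sum(e["error"] for e in log) / len(log)
```

In production you'd aggregate `log` into dashboards and alerts; the point is that every call feeds the same three numbers the table names.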
What
| Dimension | Examples |
|---|---|
| Quality | Correctness, completeness, coherence |
| UX | Clarity, tone, helpfulness |
| Safety | Toxicity, hallucination, PII leakage |
| Performance | Latency, cost, token usage |
| Behaviour | Instruction following, persona adherence |
Pick 3-4 that matter for your product. You don't need all of them.
Maturity Ladder
Aakash Gupta and Ankit Shukla published a 7-step maturity model for AI evals. It maps out where most teams are and where they should aim.
Instead of explaining each step, let me show you how I implemented the first four in a day — for my side project, ViziAI.
Example: ViziAI
ViziAI is a side project. I built it to track my family's health records — upload a PDF lab report in any language, and it extracts every biomarker, value, and reference range so you can track changes over time. I've been using it for 5-6 months. It seemed fine. But after watching Aakash's breakdown, I decided to actually measure it.
I took 10 real PDFs, wrote every expected value by hand, and ran the pipeline against all of them.
| Run | Metric Match | Value Accuracy | Cost |
|---|---|---|---|
| First run (GPT-4o) | 58.3% | 85.1% | baseline |
| After fixes (GPT-4o) | 95.2% | 99.7% | baseline |
| Model swap (GPT-5-mini) | 94.4% | 99.7% | 90% cheaper |
A couple of iterations with Claude Code — fixing decimal shifts, image-based PDFs, Spanish abbreviations — and I went from 58% to 95%. Then I swapped to a cheaper model. Same accuracy, 90% less cost. Without the eval, I never would have trusted that swap.
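The two columns in the table map to two simple ratios. This is my reconstruction, not the actual scoring code; the biomarker names, the 1% relative tolerance, and `score_extraction` itself are all assumptions.

```python
def score_extraction(expected, extracted, tol=0.01):
    """expected/extracted: dicts of biomarker name -> numeric value.
    Metric match   = share of expected biomarkers the pipeline found at all.
    Value accuracy = share of matched biomarkers whose value is within tol
    (relative) of the hand-written expected value."""
    matched = {name: v for name, v in expected.items() if name in extracted}
    metric_match = len(matched) / len(expected)
    if not matched:
        return metric_match, 0.0
    correct = sum(
        1 for name, v in matched.items()
        if abs(extracted[name] - v) <= tol * abs(v)
    )
    return metric_match, correct / len(matched)

# Hypothetical example: one missed biomarker, one decimal-shift error.
expected = {"HDL": 55.0, "LDL": 120.0, "Glucose": 5.4, "TSH": 2.1}
extracted = {"HDL": 55.0, "LDL": 120.0, "Glucose": 54.0}  # TSH missing
mm, va = score_extraction(expected, extracted)  # mm = 0.75
```

Splitting the score this way is what makes failure modes legible: a decimal shift tanks value accuracy but not metric match, while a missed biomarker does the opposite.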
Takeaway
Once you measure, errors become visible.
Once errors are visible, fixes become obvious.
Once fixes land, cost reduction becomes possible.
Evals didn't just tell me how good my AI was. They unlocked decisions I was avoiding: which failures to fix first, whether a cheaper model could do the job, and where my prompt was actually broken vs. where I was just getting lucky.
You don't need a framework. You don't need infrastructure. You need 10 real examples, expected outputs, and a script that compares them. Start there.
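The "script that compares them" really can be this small. A sketch under two assumptions of mine: each PDF sits next to a hand-written `*.expected.json` file, and `run_pipeline` is your own extraction function.

```python
import json
from pathlib import Path

def run_eval(cases_dir, run_pipeline):
    """For each test case, compare the pipeline's output to the
    hand-written expected JSON and report the pass rate."""
    passed, total = 0, 0
    for expected_path in sorted(Path(cases_dir).glob("*.expected.json")):
        expected = json.loads(expected_path.read_text())
        pdf_path = expected_path.with_name(
            expected_path.name.replace(".expected.json", ".pdf"))
        actual = run_pipeline(pdf_path)  # your extraction pipeline here
        total += 1
        if actual == expected:
            passed += 1
        else:
            print(f"FAIL {pdf_path.name}")
    print(f"{passed}/{total} passed")
    return passed, total
```

Exact-match comparison is the crudest possible scorer, but it's enough to get the loop going; swap in per-field tolerances once the obvious failures are fixed.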
What's Next
Next I'll find more PDFs online and expand the test set. Every time I hit an edge case, I add it to the eval. The test set grows, the system stays accurate. Quite simple, actually.