Yes, I'm opening the topic of evals. It's not as scary as it sounds — but it's extremely important. This post is inspired by Aakash Gupta's breakdown of AI evals, which is the most useful eval content I've found. Think of this as my summary, plus a real example with numbers. Go watch the original too.
What's an Eval?
As LLMs get more capable, we can deliver value to users in many different ways. But these systems aren't deterministic like 2 × 2 = 4. They're black boxes, and the real question is: are they actually reliable?
If you're vibe coding a side project, maybe it doesn't matter. But in a product with 10,000 users, a 1% error rate means 100 people are affected. At 5%, that's 500. Understanding where these systems fail and how reliable they really are — that's not optional.
This is where evals come in. There's been talk that this is becoming the PM's job. Whether the PM role stays as-is or merges into some "builder" type — I don't know. But if your product uses AI, evals matter. That much is certain.
Eval Types
Two questions: when do you evaluate, and what do you evaluate?
When
| Type | Description |
|---|---|
| Offline — before launch | Run your test dataset, fix what fails, ship when it passes. |
| Online — after launch | Monitor production: latency, error rates, quality scores. |
| Human — continuous spot-checks | Domain experts review samples regularly. Catches what automation misses and feeds errors back into the model. |
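The online row boils down to a thin wrapper around the model call that records the metrics in the table. A minimal sketch, purely illustrative: `call_model` and the length-based quality check are stand-ins, not anything a real product (or ViziAI) uses as-is.

```python
import time

def monitored_call(call_model, prompt, log):
    """Wrap a model call with the online metrics from the table:
    latency, error flag, and a quality score (a stand-in heuristic here)."""
    start = time.perf_counter()
    try:
        answer = call_model(prompt)
        error = False
    except Exception:
        answer, error = None, True
    latency = time.perf_counter() - start
    # Stand-in quality check: a real system would use a rubric or grader model.
    quality = 1.0 if answer else 0.0
    log.append({"latency_s": latency, "error": error, "quality": quality})
    return answer

# Usage with a fake model so the sketch runs without an API key:
log = []
monitored_call(lambda p: p.upper(), "hello", log)
error_rate = sum(e["error"] for e in log) / len(log)
```

In production you'd aggregate `log` into dashboards and alerts; the point is that every call feeds the same three numbers the table names.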
What
| Dimension | Examples |
|---|---|
| Quality | Correctness, completeness, coherence |
| UX | Clarity, tone, helpfulness |
| Safety | Toxicity, hallucination, PII leakage |
| Performance | Latency, cost, token usage |
| Behaviour | Instruction following, persona adherence |
Pick 3-4 that matter for your product. You don't need all of them.
Maturity Ladder
Aakash Gupta and Ankit Shukla published a 7-step maturity model for AI evals. It maps out where most teams are and where they should aim.
Instead of explaining each step, let me show you how I implemented the first four in a day — for my side project, ViziAI.
Example: ViziAI
ViziAI is a side project. I built it to track my family's health records — upload a PDF lab report in any language, and it extracts every biomarker, value, and reference range so you can track changes over time. I've been using it for 5-6 months. It seemed fine. But after watching Aakash's breakdown, I decided to actually measure it.
I took 10 real PDFs, wrote every expected value by hand, and ran the pipeline against all of them.
| Run | Metric Match | Value Accuracy | Cost |
|---|---|---|---|
| First run (GPT-4o) | 58.3% | 85.1% | baseline |
| After fixes (GPT-4o) | 95.2% | 99.7% | baseline |
| Model swap (GPT-5-mini) | 94.4% | 99.7% | 90% cheaper |
A couple of iterations with Claude Code — fixing decimal shifts, image-based PDFs, Spanish abbreviations — and I went from 58% to 95%. Then I swapped to a cheaper model. Same accuracy, 90% less cost. Without the eval, I never would have trusted that swap.
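The two columns in the table map to two simple ratios. This is my reconstruction, not the actual scoring code; the biomarker names, the 1% relative tolerance, and `score_extraction` itself are all assumptions.

```python
def score_extraction(expected, extracted, tol=0.01):
    """expected/extracted: dicts of biomarker name -> numeric value.
    Metric match   = share of expected biomarkers the pipeline found at all.
    Value accuracy = share of matched biomarkers whose value is within tol
    (relative) of the hand-written expected value."""
    matched = {name: v for name, v in expected.items() if name in extracted}
    metric_match = len(matched) / len(expected)
    if not matched:
        return metric_match, 0.0
    correct = sum(
        1 for name, v in matched.items()
        if abs(extracted[name] - v) <= tol * abs(v)
    )
    return metric_match, correct / len(matched)

# Hypothetical example: one missed biomarker, one decimal-shift error.
expected = {"HDL": 55.0, "LDL": 120.0, "Glucose": 5.4, "TSH": 2.1}
extracted = {"HDL": 55.0, "LDL": 120.0, "Glucose": 54.0}  # TSH missing
mm, va = score_extraction(expected, extracted)  # mm = 0.75
```

Splitting the score this way is what makes failure modes legible: a decimal shift tanks value accuracy but not metric match, while a missed biomarker does the opposite.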
Takeaway
Once you measure, errors become visible.
Once errors are visible, fixes become obvious.
Once fixes land, cost reduction becomes possible.
Evals didn't just tell me how good my AI was. They unlocked decisions I was avoiding: which failures to fix first, whether a cheaper model could do the job, and where my prompt was actually broken vs. where I was just getting lucky.
You don't need a framework. You don't need infrastructure. You need 10 real examples, expected outputs, and a script that compares them. Start there.
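The "script that compares them" really can be this small. A sketch under two assumptions of mine: each PDF sits next to a hand-written `*.expected.json` file, and `run_pipeline` is your own extraction function.

```python
import json
from pathlib import Path

def run_eval(cases_dir, run_pipeline):
    """For each test case, compare the pipeline's output to the
    hand-written expected JSON and report the pass rate."""
    passed, total = 0, 0
    for expected_path in sorted(Path(cases_dir).glob("*.expected.json")):
        expected = json.loads(expected_path.read_text())
        pdf_path = expected_path.with_name(
            expected_path.name.replace(".expected.json", ".pdf"))
        actual = run_pipeline(pdf_path)  # your extraction pipeline here
        total += 1
        if actual == expected:
            passed += 1
        else:
            print(f"FAIL {pdf_path.name}")
    print(f"{passed}/{total} passed")
    return passed, total
```

Exact-match comparison is the crudest possible scorer, but it's enough to get the loop going; swap in per-field tolerances once the obvious failures are fixed.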
What's Next
Next I'll find more PDFs online and expand the test set. Every time I hit an edge case, I add it to the eval. The test set grows, the system stays accurate. Quite simple, actually.