When we build AI systems, especially ones that summarize customer complaints or assign priorities, it’s not enough for the output to look correct. We need ways to evaluate whether the AI is actually doing the right job.
Let’s understand this using a simple real-world example.
The Customer Message
“My payment failed twice yesterday. Amount got debited but order didn’t go through. Support hasn’t replied in 24 hours. This is very frustrating.”
The AI Output
{
"summary": "Customer reports payment failure with amount debited and no response from support.",
"priority": "High"
}

Now let’s see how this output is evaluated using four common evaluation approaches.
1. Code-Based Evaluations (Heuristic / Deterministic Evals)
This is the fastest and cheapest way to evaluate AI outputs.
Instead of humans checking every response, we write code that verifies whether the output meets predefined rules.
Typical checks include:
Is the output valid JSON?
Does it contain the required keys like summary and priority?
Is priority one of {High, Medium, Low}?
Is the summary short (e.g., under 30 words)?
Does it avoid banned or risky phrases?
In simple terms:
Code-based evals answer one question: “Does this output follow our rules?”
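The checks above can be sketched as a single validator function. This is a minimal illustration, not a production implementation; the banned-phrase list and the 30-word cap are example values, and the function name is ours.

```python
import json

REQUIRED_KEYS = {"summary", "priority"}
ALLOWED_PRIORITIES = {"High", "Medium", "Low"}
BANNED_PHRASES = {"guaranteed refund", "legal action"}  # illustrative list

def validate_output(raw: str) -> list:
    """Return a list of rule violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]

    errors = []

    missing = REQUIRED_KEYS - data.keys()
    if missing:
        errors.append("missing keys: %s" % sorted(missing))

    if data.get("priority") not in ALLOWED_PRIORITIES:
        errors.append("priority must be one of High/Medium/Low")

    summary = data.get("summary", "")
    if len(summary.split()) > 30:
        errors.append("summary exceeds 30 words")

    if any(phrase in summary.lower() for phrase in BANNED_PHRASES):
        errors.append("summary contains a banned phrase")

    return errors
```

Because these checks are deterministic, they can run on every single response at near-zero cost, which is exactly why they make a good first gate.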
2. LLM-as-a-Judge Evaluation
Here, another AI model evaluates the output.
It scores the response (usually 0–5) on criteria like:
Accuracy of the summary.
Appropriateness of the priority.
Whether any critical signals were missed.
This method adds contextual judgment that rules alone can’t capture.
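One way to set this up is to build a grading prompt for the judge model and parse its scored reply. The sketch below assumes the judge is asked to return JSON scores; the template wording, key names, and the pass threshold are all our own illustrative choices, and the actual model call is deliberately left out.

```python
import json

JUDGE_PROMPT_TEMPLATE = """You are grading an AI-generated ticket triage.

Customer message:
{message}

AI output:
{output}

Score each criterion from 0 to 5 and reply as JSON with keys
"accuracy", "priority_fit", "missed_signals", and "rationale":
- accuracy: does the summary reflect the message?
- priority_fit: is the priority level appropriate?
- missed_signals: were critical signals (financial risk, SLA, sentiment) captured?
"""

def build_judge_prompt(message: str, output: str) -> str:
    """Fill the grading template with the case under review."""
    return JUDGE_PROMPT_TEMPLATE.format(message=message, output=output)

def judge_passed(judge_reply: str, threshold: int = 3) -> bool:
    """Parse the judge's JSON reply; pass only if every score meets the threshold."""
    scores = json.loads(judge_reply)
    return all(scores[k] >= threshold
               for k in ("accuracy", "priority_fit", "missed_signals"))
```

Asking the judge for structured JSON rather than free text keeps this eval machine-readable, so its results can feed dashboards or regression tests.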
3. Human Evaluation (Expert Judgment)
A human reviewer, such as a support lead, looks deeper.
In this case:
Summary is technically correct ✅
“Amount debited” signals financial risk ✅
24-hour silence violates SLA ❌
Customer frustration isn’t explicitly flagged ❌
Verdict: Acceptable, but needs improvement
Feedback: Add urgency markers like “customer distressed” and “SLA breach.”
Humans bring accountability and nuance that AI still struggles with.
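Human reviews are most useful when they are captured in a structured form rather than ad-hoc notes, so they can be aggregated later. A minimal sketch of such a record, with field names and verdict rules of our own invention mirroring the checklist above:

```python
from dataclasses import dataclass

@dataclass
class HumanReview:
    """A structured record of one expert review of an AI triage output."""
    summary_correct: bool
    financial_risk_flagged: bool
    sla_breach_flagged: bool
    sentiment_flagged: bool
    feedback: str = ""

    def verdict(self) -> str:
        checks = [self.summary_correct, self.financial_risk_flagged,
                  self.sla_breach_flagged, self.sentiment_flagged]
        if all(checks):
            return "Approved"
        if self.summary_correct:
            return "Acceptable, needs improvement"
        return "Rejected"
```

For the example ticket, the reviewer would record the SLA breach and the unflagged frustration as failures, yielding the "Acceptable, needs improvement" verdict.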
4. User Evaluation (Real-World Impact)
Finally, the most important test:
Did the customer feel helped?
Was the issue resolved quickly?
If not, even a “perfect” AI output has failed.
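User evaluation can also be made measurable by tying each ticket to its real-world outcome. The sketch below is one hypothetical way to do that; the 24-hour window and the CSAT cutoff are illustrative thresholds, not fixed standards.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TicketOutcome:
    """What actually happened after the AI triaged the ticket."""
    resolved: bool
    hours_to_resolution: Optional[float]
    csat: Optional[int]  # 1-5 post-resolution survey score, if answered

def user_eval_passed(outcome: TicketOutcome,
                     max_hours: float = 24.0,
                     min_csat: int = 4) -> bool:
    """The output 'worked' only if the issue was resolved quickly
    and the customer reported being satisfied."""
    if not outcome.resolved:
        return False
    if outcome.hours_to_resolution is None or outcome.hours_to_resolution > max_hours:
        return False
    return outcome.csat is not None and outcome.csat >= min_csat
```

Note that this check can fail even when the other three evals pass, which is precisely the point of measuring real-world impact separately.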
Final Takeaway
Each evaluation type answers a different question:
Code evals: Is it valid?
LLM-as-judge: Is it reasonable?
Human evals: Is it truly correct?
User evals: Did it actually work?
Strong AI systems use all four together.
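One common way to combine the four is as a cheapest-first gate: run the fast, free checks on everything and only escalate survivors to the more expensive evals. This is a sketch of that pattern with placeholder check functions; how each stage is actually implemented is up to the system.

```python
from typing import Callable, List, Tuple

def run_eval_pipeline(output: str,
                      code_check: Callable[[str], bool],
                      judge_check: Callable[[str], bool],
                      human_check: Callable[[str], bool],
                      user_check: Callable[[str], bool]) -> str:
    """Run the four evals cheapest-first; stop at the first failing stage."""
    stages: List[Tuple[str, Callable[[str], bool]]] = [
        ("code", code_check),        # deterministic rules, runs on everything
        ("llm-judge", judge_check),  # contextual scoring by another model
        ("human", human_check),      # expert spot-checks
        ("user", user_check),        # real-world outcome
    ]
    for name, check in stages:
        if not check(output):
            return "failed at: " + name
    return "passed all four"
```

Ordering the stages by cost means an invalid output never wastes an expensive human review, while only outputs that clear every gate count as genuinely working.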