Evals for agents
How do you build evaluations for agents? Model capabilities are evolving fast, user expectations are shifting, and both inputs and outputs are highly variable. This series walks through how to think about agent evals — from the kinds of agents you might be building, to identifying risk, defining quality, and combining qualitative research with metrics.
What will we discuss?
- Why evaluating agents is different from evaluating prompts or models
- The two big families: agents that help vs. agents that take action
- How to identify the risks that actually matter
- How to turn fuzzy "quality" into something testable
- Where qualitative research fits in
- Where metrics genuinely help (and where they mislead)
- A practical playbook for agent evals that survive model upgrades
Who is this for?
- Anyone building agents (copilots, assistants, action-takers) who needs to know whether those agents are actually any good
- PMs, researchers, and designers working alongside engineers on agent products
- Anyone who's done basic evals and now needs to handle the variability and risk that agents introduce
1. Evals for agents
Evaluating agents — introduction
Why evaluating agents is different from evaluating prompts or models. Sets the stage for the series: the shifting ground (models, users, inputs and outputs) and what makes a good agent eval.
Agents that help
Agents whose job is to assist a person — copilots, researchers, summarizers. What "good" looks like when the human stays in the loop, and how that shapes what you measure.
Agents that take action
Agents that do things in the world — book, send, write, deploy. The eval bar is higher: correctness, reversibility, and trust become first-class concerns.
Identifying risk
Where agents can go wrong and which failures actually matter. A practical way to map risks so your evals cover what's costly, not just what's easy to measure.
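As a rough illustration of what mapping risks might look like in practice, the sketch below scores each failure mode by likelihood times cost and lists the ones no eval covers yet. The 1-to-5 scales, the example failures, and the names `Risk`, `ranked`, and `coverage_gaps` are all assumptions made for illustration, not the method taught in the episode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    failure: str
    likelihood: int        # assumed scale: 1 (rare) to 5 (frequent)
    cost: int              # assumed scale: 1 (minor) to 5 (severe)
    covered_by_eval: bool  # does any existing eval exercise this failure?

    @property
    def score(self) -> int:
        # Simple priority heuristic: likelihood multiplied by cost.
        return self.likelihood * self.cost


# Hypothetical risk register for an email-drafting agent.
RISKS = [
    Risk("awkward phrasing in reply", likelihood=4, cost=1, covered_by_eval=True),
    Risk("sends email to wrong recipient", likelihood=2, cost=5, covered_by_eval=False),
    Risk("summary omits a key decision", likelihood=3, cost=3, covered_by_eval=True),
]


def ranked(risks: list[Risk]) -> list[Risk]:
    """Return risks ordered from highest to lowest priority score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


def coverage_gaps(risks: list[Risk]) -> list[Risk]:
    """Return risks that no eval covers yet, highest priority first."""
    return [r for r in ranked(risks) if not r.covered_by_eval]


if __name__ == "__main__":
    for risk in ranked(RISKS):
        status = "covered" if risk.covered_by_eval else "NOT covered"
        print(f"{risk.score:>2}  {risk.failure}  ({status})")
```

In this made-up register the frequent but cheap failure (awkward phrasing) ranks last, while the rare but costly one (wrong recipient) ranks first and has no eval, which is the kind of gap a risk map is meant to expose.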
Defining quality
What does "good" even mean for an agent? Turning fuzzy expectations into concrete, testable criteria that hold up across variable inputs and outputs.
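One way to picture turning a fuzzy expectation into testable criteria is a small rubric of named checks applied to many outputs, reporting a pass rate per criterion rather than a single verdict. This is a minimal sketch under stated assumptions: the criteria, the 50-word threshold, the sample outputs, and the names `Criterion` and `pass_rates` are hypothetical, and real criteria would come from your own users and product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    check: Callable[[str], bool]  # returns True when the output passes


# Hypothetical criteria standing in for "a good meeting summary".
CRITERIA = [
    Criterion("is_concise", lambda out: len(out.split()) <= 50),
    Criterion("names_an_owner", lambda out: "owner:" in out.lower()),
    Criterion("has_next_step", lambda out: "next step" in out.lower()),
]


def pass_rates(outputs: list[str], criteria: list[Criterion]) -> dict[str, float]:
    """Return the fraction of outputs that pass each criterion."""
    if not outputs:
        raise ValueError("need at least one output to evaluate")
    return {
        c.name: sum(c.check(out) for out in outputs) / len(outputs)
        for c in criteria
    }


# Made-up agent outputs for the same task; agent outputs vary run to run,
# so each criterion is scored across the whole set.
SAMPLE_OUTPUTS = [
    "Owner: Dana. Next step: send the revised budget by Friday.",
    "The team talked about the budget for a while.",
]

if __name__ == "__main__":
    for name, rate in pass_rates(SAMPLE_OUTPUTS, CRITERIA).items():
        print(f"{name}: {rate:.0%}")
```

Scoring per criterion across a set of outputs keeps the result interpretable when inputs and outputs vary: a drop in one rate points at a specific quality dimension instead of a vague overall regression.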
The role of qualitative research
Why you can't eval your way out of not understanding users. How qualitative research surfaces the failure modes and quality dimensions that metrics alone will miss.
The role of metrics
Where metrics genuinely help, where they mislead, and how to build a metric set that complements — rather than replaces — human judgment.
Wrapping it up
Pulling the threads together: a practical playbook for building agent evals that survive model upgrades, shifting user expectations, and the inherent variability of agent work.
What participants say
"This training has provided valuable insights into AI product development methodologies and practical implementation strategies."
— Course Participant
"Highly relevant for our enterprise software organization as we scale AI feature evaluation processes."
— Course Participant
"The evaluation framework training has been instrumental in establishing our AI quality processes."
— Course Participant