Evals for agents

How do you build evaluations for agents? Model capabilities are evolving fast, user expectations are shifting, and both inputs and outputs are highly variable. This series walks through how to think about agent evals — from the kinds of agents you might be building, to identifying risk, defining quality, and combining qualitative research with metrics.

What will we discuss?

  • Why evaluating agents is different from evaluating prompts or models
  • The two big families: agents that help vs. agents that take action
  • How to identify the risks that actually matter
  • How to turn fuzzy "quality" into something testable
  • Where qualitative research fits in
  • Where metrics genuinely help (and where they mislead)
  • A practical playbook for agent evals that survive model upgrades

Who is this for?

  • Anyone building agents — copilots, assistants, action-takers — who needs to know if they're actually any good
  • PMs, researchers, and designers working alongside engineers on agent products
  • Anyone who's done basic evals and now needs to handle the variability and risk that agents introduce

Start your free trial

Start a 15-day free trial to unlock every episode — cancel any time.

1. Evals for agents

1.1 Evaluating agents: introduction

Why evaluating agents is different from evaluating prompts or models. This episode sets the stage for the series: the shifting ground (models, users, inputs and outputs) and what makes a good agent eval.

Coming Soon

1.2 Agents that help

Agents whose job is to assist a person — copilots, researchers, summarizers. What "good" looks like when the human stays in the loop, and how that shapes what you measure.

Coming Soon

1.3 Agents that take action

Agents that do things in the world: book, send, write, deploy. The eval bar is higher than for agents that only assist, because correctness, reversibility, and trust become first-class concerns.

Coming Soon

1.4 Identifying risk

Where agents can go wrong and which failures actually matter. A practical way to map risks so your evals cover what's costly, not just what's easy to measure.

Coming Soon

1.5 Defining quality

What does "good" even mean for an agent? Turning fuzzy expectations into concrete, testable criteria that hold up across variable inputs and outputs.
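
As a rough illustration of what "concrete, testable criteria" can mean, here is a minimal sketch in Python. It is not taken from the episode. The `Criterion` class, the three example checks, and the 50-word cap are all hypothetical choices for a summarizer-style agent. Real criteria would come from your own product and users.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    """One named, testable expectation about an agent's output (hypothetical)."""
    name: str
    check: Callable[[str], bool]


# Assumed example criteria; the 50-word cap is an arbitrary illustration.
CRITERIA = [
    Criterion("non_empty", lambda out: bool(out.strip())),
    Criterion("concise", lambda out: len(out.split()) <= 50),
    Criterion("no_unfilled_placeholder", lambda out: "TODO" not in out),
]


def evaluate(output: str) -> dict[str, bool]:
    """Return a pass/fail result per criterion for a single output."""
    return {c.name: c.check(output) for c in CRITERIA}


def pass_rate(outputs: list[str]) -> float:
    """Fraction of outputs that satisfy every criterion."""
    if not outputs:
        return 0.0
    passed = sum(all(evaluate(o).values()) for o in outputs)
    return passed / len(outputs)
```

Because each criterion is evaluated per output, the same checks can be run over many varied outputs and reported both per criterion and as an overall pass rate.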

Coming Soon

1.6 The role of qual research

Why you can't eval your way out of not understanding users. How qualitative research surfaces the failure modes and quality dimensions that metrics alone will miss.

Coming Soon

1.7 The role of metrics

Where metrics genuinely help, where they mislead, and how to build a metric set that complements — rather than replaces — human judgment.

Coming Soon

1.8 Wrapping it up

Pulling the threads together: a practical playbook for building agent evals that survive model upgrades, shifting user expectations, and the inherent variability of agent work.

Coming Soon

What participants say

"This training has provided valuable insights into AI product development methodologies and practical implementation strategies."

— Course Participant

"Highly relevant for our enterprise software organization as we scale AI feature evaluation processes."

— Course Participant

"The evaluation framework training has been instrumental in establishing our AI quality processes."

— Course Participant
