Evals for agents

How do you build evaluations for agents? Model capabilities are evolving fast, user expectations are shifting, and both inputs and outputs are highly variable. This series walks through how to think about agent evals — from the kinds of agents you might be building, to identifying risk, defining quality, and combining qualitative research with metrics.

What will we discuss?

Why evaluating agents is different from evaluating prompts or models
The two big families: agents that help vs. agents that take action
How to identify the risks that actually matter
How to turn fuzzy "quality" into something testable
Where qualitative research fits in
Where metrics genuinely help (and where they mislead)
A practical playbook for agent evals that survive model upgrades

Who is this for?

Anyone building agents (copilots, assistants, action-takers) who needs to know whether those agents are actually any good
PMs, researchers, and designers working alongside engineers on agent products
Anyone who's done basic evals and now needs to handle the variability and risk that agents introduce

Start your free trial

Start a 15-day free trial to unlock every episode. Cancel any time.


1. Evals for agents

Episode 1.1

Evaluating agents — introduction

Why evaluating agents is different from evaluating prompts or models. Sets the stage for the series: the shifting ground (models, users, inputs and outputs) and what makes a good agent eval.

Coming Soon

Episode 1.2

Agents that help

Agents whose job is to assist a person — copilots, researchers, summarizers. What "good" looks like when the human stays in the loop, and how that shapes what you measure.

Coming Soon

Episode 1.3

Agents that take action

Agents that do things in the world — book, send, write, deploy. The eval bar is higher: correctness, reversibility, and trust become first-class concerns.

Coming Soon

Episode 1.4

Identifying risk

Where agents can go wrong and which failures actually matter. A practical way to map risks so your evals cover what's costly, not just what's easy to measure.

Coming Soon

Episode 1.5

Defining quality

What does "good" even mean for an agent? Turning fuzzy expectations into concrete, testable criteria that hold up across variable inputs and outputs.

Coming Soon

Episode 1.6

The role of qual research

Why you can't eval your way out of not understanding users. How qualitative research surfaces the failure modes and quality dimensions that metrics alone will miss.

Coming Soon

Episode 1.7

The role of metrics

Where metrics genuinely help, where they mislead, and how to build a metric set that complements — rather than replaces — human judgment.

Coming Soon

Episode 1.8

Wrapping it up

Pulling the threads together: a practical playbook for building agent evals that survive model upgrades, shifting user expectations, and the inherent variability of agent work.

Coming Soon

What participants say

“This is opening up all sorts of new neural pathways for me to see under the hood more of how the sausage is made! 🙏”

Maya Lindgren · Senior UX Designer

“Very timely at my enterprise software company as evaluation of AI features scales.”

Daniel Reyes · Principal Product Designer

“Everything I know about evals is from Peter's talk, which is why I'm back to find out more!”

Anneke Visser · UX Researcher

