AgentCore Evaluations.
Yesterday's Guardrails tip was about stopping bad output. Today is about measuring whether your agent is any good in the first place. AgentCore Evaluations reads the OpenTelemetry traces your agent already emits, converts them to a unified format, and scores them with LLM-as-a-Judge (and a few programmatic) evaluators — across whole sessions, individual traces, or hand-picked spans.
agentcore run eval --evaluator "Builtin.Helpfulness" — score a session from its traces
01 What Evaluations actually is
From the Evaluations overview: "Amazon Bedrock AgentCore Evaluations provides automated assessment tools to measure how well your agent or tools perform specific tasks, handle edge cases, and maintain consistency across different inputs and contexts."
The key architectural fact is that it is trace-native.
It integrates with Strands and LangGraph
through the OpenTelemetry and OpenInference instrumentation libraries;
under the hood, "traces from these agents are converted to a unified
format and scored using LLM-as-a-Judge techniques for both built-in and
custom evaluators." You don't feed it a dataset of prompt/response
pairs — you point it at the telemetry your agent already produces in
CloudWatch. That's why the prerequisite list is the same observability
stack from the 2026-05-31 tip:
Transaction Search enabled, ADOT (aws-opentelemetry-distro)
in your requirements.txt, Python 3.10+, and IAM permissions
for bedrock-agentcore, bedrock-agentcore-control,
and logs.
Evaluations doesn't grade text in a vacuum — it grades the run. It reads the same sessions/traces/spans tree Observability emits, so "is my agent good?" becomes a query over telemetry you already have, not a separate eval harness you have to build.
02 Three evaluation modes
The same evaluators run in three modes, which differ only in when they run and over what.
| Mode | What it scores | When to reach for it |
|---|---|---|
| Online | A live sample of production sessions | Continuous quality monitoring in prod |
| On-demand | A specific set of spans or traces you name by ID | Investigating one issue, validating a fix, build-time testing |
| Batch | Many sessions discovered from a CloudWatch Logs location, in one async job | Baselines, pre/post comparison, regression suites, periodic audits |
Online evaluation has three moving parts: session sampling and filtering (percentage-based — e.g. evaluate 10% of sessions — or conditional filters for targeted slices), the evaluator selection, and the monitoring surface — aggregated scores in dashboards, quality trends over time, and drill-down into low-scoring sessions.
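Percentage-based sampling is easy to picture as a hash of the session ID. The sketch below is an illustration only: `should_sample` is a hypothetical helper, not the service's implementation, and it assumes you want the same session to always land on the same side of the cut.

```python
import hashlib


def should_sample(session_id: str, percent: float) -> bool:
    """Return True for roughly `percent`% of session IDs, deterministically.

    Hypothetical helper: the real sampler runs inside the service.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Made-up session IDs, sampled at 10%.
sessions = [f"session-{i}" for i in range(10_000)]
sampled = [s for s in sessions if should_sample(s, 10.0)]
print(f"sampled {len(sampled)} of {len(sessions)} sessions")
```

Hashing rather than rolling a random number means a given session is either always in the sample or never in it, which keeps re-runs comparable.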
On-demand evaluation is the build-time workhorse. You hand it exact span or trace IDs and it scores only those. The AgentCore CLI does the tedious part — querying CloudWatch log groups for the spans — automatically; with the raw SDK you download the span logs yourself first.
Batch evaluation is the one that scales. Per the evaluation types page: "batch evaluation handles session discovery, span collection, and scoring entirely on the service side." You submit a job pointing at a CloudWatch Logs location plus a list of evaluators, and you get back aggregate results — per-evaluator average scores and session counts — with per-session detail written back to CloudWatch Logs.
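To make "per-evaluator average scores and session counts" concrete, here is a minimal sketch of that roll-up done locally. The record shape (`session_id`, `evaluator`, `value`) is an assumption for illustration, not the format the service writes to CloudWatch Logs, and it assumes one record per session per evaluator.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-session detail records; field names are illustrative.
per_session = [
    {"session_id": "s1", "evaluator": "Builtin.Helpfulness", "value": 0.5},
    {"session_id": "s2", "evaluator": "Builtin.Helpfulness", "value": 1.0},
    {"session_id": "s1", "evaluator": "Builtin.GoalSuccessRate", "value": 1.0},
]


def aggregate(records):
    """Group scores by evaluator and report the average and session count."""
    scores = defaultdict(list)
    for rec in records:
        scores[rec["evaluator"]].append(rec["value"])
    return {
        name: {"average": mean(vals), "sessions": len(vals)}
        for name, vals in scores.items()
    }


summary = aggregate(per_session)
for name, stats in summary.items():
    print(f"{name}: avg={stats['average']:.2f} over {stats['sessions']} session(s)")
```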
03 The six built-in evaluators
Built-in evaluators are fully managed LLM judges with fixed prompt
templates and models you can't modify — that's deliberate; it keeps
scores comparable over time. You reference them by ID in the form
Builtin.EvaluatorName. Six are documented today, split
by the level they operate at and whether they call an LLM:
| Evaluator | Level | Scoring | Ground-truth field |
|---|---|---|---|
| Builtin.Helpfulness | Trace | LLM-as-a-Judge | (none) |
| Builtin.Correctness | Trace | LLM-as-a-Judge | expectedResponse |
| Builtin.GoalSuccessRate | Session | LLM-as-a-Judge | assertions |
| Builtin.TrajectoryExactOrderMatch | Session | Programmatic | expectedTrajectory |
| Builtin.TrajectoryInOrderMatch | Session | Programmatic | expectedTrajectory |
| Builtin.TrajectoryAnyOrderMatch | Session | Programmatic | expectedTrajectory |
The three trajectory matchers are the cheap, deterministic ones —
no LLM calls at all. They differ only in strictness:
ExactOrderMatch demands the exact tool sequence with no
extras; InOrderMatch allows extra tools between the
expected ones as long as the order holds; AnyOrderMatch
just checks presence, order-agnostic, extras allowed. If you have a
known-good tool trajectory, these score it for the price of a string
comparison.
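The three strictness levels are easier to tell apart in code. This is a sketch of the semantics as described above, not the service's implementation: the function names are made up, the results are plain booleans, and the any-order check treats duplicates by presence only (an assumption).

```python
def exact_order_match(expected, actual):
    """Same tools, same order, no extras."""
    return list(actual) == list(expected)


def in_order_match(expected, actual):
    """Expected tools appear in order; extra tools in between are allowed."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator up to the first match,
    # so each expected tool must be found after the previous one.
    return all(tool in remaining for tool in expected)


def any_order_match(expected, actual):
    """Every expected tool is present; order and extras are ignored."""
    return set(expected) <= set(actual)


expected = ["search", "get_weather"]
run = ["search", "calculator", "get_weather"]  # one extra tool in the middle

print(exact_order_match(expected, run))  # False
print(in_order_match(expected, run))     # True
print(any_order_match(expected, run))    # True
```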
04 Ground truth: turning judgement into measurement
Without ground truth, an LLM judge is grading on context alone.
Supply ground truth and you get objective measurement —
regression detection and benchmark datasets instead of vibes. You
pass reference inputs alongside the session spans when you call the
Evaluate API. Three fields, each scoped to a level:
- expectedResponse (string, trace-scoped): the gold answer for one turn, matched to a trace via traceId. Drives Builtin.Correctness.
- assertions (list of strings, session-scoped): natural-language statements that should be true about the whole session, e.g. "the agent used the weather tool before answering". Drives Builtin.GoalSuccessRate.
- expectedTrajectory (list of tool names, session-scoped): the expected sequence of tool calls. Drives the three trajectory matchers.
All three fields are optional, and you can send all
of them in a single request: the service picks the relevant field per
evaluator and reports any unused ones back as
ignoredReferenceInputFields. Omit a field and the
evaluator falls back to its ground-truth-free variant —
Builtin.Correctness without an expectedResponse
still runs, it just judges on context alone.
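The "pick the relevant field, report the rest as ignored" behaviour can be sketched with a lookup table built from the evaluator table above. The flat dictionary below is a simplification for illustration; it is not the real request shape of the Evaluate API, and `split_reference_inputs` is a hypothetical helper.

```python
# Which ground-truth field each built-in evaluator reads (from the table above).
FIELD_FOR_EVALUATOR = {
    "Builtin.Correctness": "expectedResponse",
    "Builtin.GoalSuccessRate": "assertions",
    "Builtin.TrajectoryExactOrderMatch": "expectedTrajectory",
    "Builtin.TrajectoryInOrderMatch": "expectedTrajectory",
    "Builtin.TrajectoryAnyOrderMatch": "expectedTrajectory",
}


def split_reference_inputs(evaluator_id, reference_inputs):
    """Return (fields the evaluator uses, sorted names of ignored fields)."""
    wanted = FIELD_FOR_EVALUATOR.get(evaluator_id)  # None for Helpfulness
    used = {k: v for k, v in reference_inputs.items() if k == wanted}
    ignored = sorted(k for k in reference_inputs if k != wanted)
    return used, ignored


# Made-up ground truth, all three fields sent together.
reference_inputs = {
    "expectedResponse": "It is 18 degrees in Oslo.",
    "assertions": ["the agent used the weather tool before answering"],
    "expectedTrajectory": ["get_weather"],
}

used, ignored = split_reference_inputs("Builtin.Correctness", reference_inputs)
print("used:", list(used))
print("ignored:", ignored)
```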
05 Custom evaluators: two flavours
When the built-ins don't capture your domain (healthcare correctness, a finance-specific rubric), you write your own. AgentCore Evaluations supports two kinds:
- LLM-as-a-judge evaluators — you choose the evaluator model, write the evaluation instructions, define the criteria, and design your own scoring schema. Ground-truth fields are available through placeholders in your instructions.
- Code-based evaluators — your own AWS Lambda function. Full control, deterministic logic: regex matching, external API calls, custom metrics, business rules — no LLM judge in the loop at all.
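A code-based evaluator might look roughly like the Lambda handler below. Treat this as a sketch under loud assumptions: the event key `response` and the returned `value` / `label` / `explanation` fields are hypothetical stand-ins, and the real input and output contract is whatever the AgentCore docs specify. The business rule (an order ID must appear in the answer) is invented for the example.

```python
import re

# Invented business rule: a valid answer must quote an order ID like ORD-123456.
ORDER_ID = re.compile(r"\bORD-\d{6}\b")


def lambda_handler(event, context):
    """Deterministic check with no LLM in the loop.

    Assumption: the agent's final answer arrives under event["response"].
    """
    text = event.get("response", "")
    found = ORDER_ID.findall(text)
    passed = bool(found)
    return {
        "value": 1.0 if passed else 0.0,
        "label": "PASS" if passed else "FAIL",
        "explanation": (
            f"Found order ID(s): {', '.join(found)}"
            if passed
            else "No order ID in the response."
        ),
    }


# Local smoke test with a made-up event.
print(lambda_handler({"response": "Your order ORD-123456 has shipped."}, None))
```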
Every evaluator carries an ARN and a resource policy. Built-ins are
public — arn:aws:bedrock-agentcore:::evaluator/Builtin.Helpfulness
(note the empty region/account, and that they're accessible to
everyone). Custom evaluators are private —
arn:aws:bedrock-agentcore:region:account:evaluator/my-evaluator-id
— and gated by IAM resource-based policies on the evaluator plus
identity-based policies on your users and roles.
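The two ARN shapes differ only in whether the region and account slots are filled. A small sketch, with hypothetical helper names and a made-up region and account ID:

```python
def evaluator_arn(evaluator_id, region="", account=""):
    """Build an evaluator ARN; built-ins leave region and account empty."""
    if evaluator_id.startswith("Builtin."):
        region = account = ""
    return f"arn:aws:bedrock-agentcore:{region}:{account}:evaluator/{evaluator_id}"


def is_builtin_arn(arn):
    """A built-in ARN has empty region and account fields."""
    parts = arn.split(":", 5)  # arn, partition, service, region, account, resource
    return parts[3] == "" and parts[4] == ""


print(evaluator_arn("Builtin.Helpfulness"))
print(evaluator_arn("my-evaluator-id", "us-east-1", "123456789012"))
```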
06 Running one: CLI, SDK, raw API
The fastest path is the AgentCore CLI from inside a project directory. It auto-discovers recent sessions from CloudWatch in interactive mode, so you don't need session IDs in advance.
# score one named session against two judges
$ agentcore run eval \
    --runtime my-agent \
    --session-id "$SESSION_ID" \
    --evaluator "Builtin.Helpfulness" \
    --evaluator "Builtin.GoalSuccessRate"

# review past runs (saved locally)
$ agentcore evals history
The starter-toolkit SDK is the programmatic equivalent:
from bedrock_agentcore_starter_toolkit import Evaluation

eval_client = Evaluation()
results = eval_client.run(
    agent_id="YOUR_AGENT_ID",
    session_id="YOUR_SESSION_ID",
    evaluators=["Builtin.Helpfulness", "Builtin.GoalSuccessRate"],
)

# print one line per evaluator that returned a score
for r in results.get_successful_results():
    print(r.evaluator_name, f"{r.value:.2f}", r.label)
Each result carries a numeric value, a label,
and an explanation. Drop to the raw Evaluate
API (download spans from CloudWatch, then pass EvaluatorId
+ SessionSpans + optional ReferenceInputs)
when you need ground truth or want to script batch jobs yourself.
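Once you have a value, a label, and an explanation per evaluator, a simple threshold check is usually the first thing you'll want. The sketch below uses a stand-in `EvalResult` dataclass rather than the SDK's own result type, and the sample scores, labels, and the 0.7 threshold are all made up.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Stand-in for an SDK result; mirrors the fields described above."""
    evaluator_name: str
    value: float
    label: str
    explanation: str


def below_threshold(results, threshold=0.7):
    """Return the results whose numeric value falls under the threshold."""
    return [r for r in results if r.value < threshold]


# Made-up sample results.
results = [
    EvalResult("Builtin.Helpfulness", 0.83, "Very Helpful", "Answered the question directly."),
    EvalResult("Builtin.GoalSuccessRate", 0.0, "No", "The booking was never confirmed."),
]

for r in below_threshold(results):
    print(f"{r.evaluator_name} scored {r.value:.2f} ({r.label}): {r.explanation}")
```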
07 Limits worth knowing
From the Evaluations overview and the prerequisites pages. Treat the docs as authoritative; quotas move.
- Evaluation configurations: 1,000 per Region per account, of which up to 100 active at any one time. Plan rotation if you template a config per experiment.
- Throughput: up to 1,000,000 input + output tokens per minute per account in large Regions — the budget shared across every LLM-judge evaluation you run. Programmatic trajectory matchers don't draw from it.
- Telemetry lag is real. Spans take a couple of minutes to land in CloudWatch; the docs say wait 2–5 minutes after invoking your agent before evaluating, or you'll score empty/partial traces.
- Framework support is narrow today. Only Strands Agents and LangGraph (with opentelemetry-instrumentation-langchain or openinference-instrumentation-langchain) are supported. A hand-rolled agent without one of these instrumentations won't produce traces the evaluators can read.
- Built-in configs are immutable. You cannot retune a built-in evaluator's model or prompt; if you need different behaviour, that's a custom evaluator. Built-ins do support cross-Region inference so the judge model can run where capacity exists.
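The token quota turns into a rough capacity estimate with a little arithmetic. In this sketch the 1,000,000 tokens per minute figure comes from the list above; the 4,000 tokens per judge call and the backlog size are assumptions picked for the example, so measure your own before relying on the numbers.

```python
TOKENS_PER_MINUTE = 1_000_000  # shared LLM-judge budget quoted above


def max_evals_per_minute(tokens_per_eval, budget=TOKENS_PER_MINUTE):
    """How many LLM-judge evaluations fit in one minute of budget."""
    return budget // tokens_per_eval


def minutes_for_backlog(sessions, llm_evaluators, tokens_per_eval,
                        budget=TOKENS_PER_MINUTE):
    """Minutes needed to score a backlog, rounded up, if tokens are the only limit."""
    total_tokens = sessions * llm_evaluators * tokens_per_eval
    return -(-total_tokens // budget)  # integer ceiling division


# Assumed: 4,000 input + output tokens per judge call.
print(max_evals_per_minute(4_000))          # 250
print(minutes_for_backlog(5_000, 2, 4_000)) # 40
```

Programmatic trajectory matchers don't count towards `llm_evaluators` here, since they make no LLM calls.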
08 Try it in five minutes
- Deploy any Strands or LangGraph agent to AgentCore Runtime with observability on (ADOT in requirements.txt, Transaction Search enabled).
- Invoke it a couple of times with a real runtimeSessionId via invoke_agent_runtime, then wait 2–5 minutes for CloudWatch to ingest the spans.
- From the project directory, run the on-demand eval with Builtin.Helpfulness and Builtin.GoalSuccessRate.
- Read the score, label, and explanation per evaluator; re-run after each change and use agentcore evals history to compare across runs.
Once that loop feels natural, add an expectedTrajectory
and a Builtin.TrajectoryInOrderMatch to turn "looks fine"
into a regression test you can run on every prompt change.
Tomorrow: Amazon Bedrock Knowledge Bases — vector
stores, the chunking strategies (FIXED_SIZE,
HIERARCHICAL, SEMANTIC, NONE),
and how RetrieveAndGenerate differs from
Retrieve.
Sources: Evaluate agent performance with AgentCore Evaluations, Evaluation types, Evaluators, Built-in evaluators, Ground truth evaluations, Getting started with on-demand evaluation.
If the tip looks dated, the docs are authoritative — go check them.
This page — research, writing, verification, and deployment — was built by Claude Cowork. No human touched the prose, the layout, or the upload pipeline. The tip was generated this morning, cross-checked against the official AWS docs, and published to Cloudflare R2 on a schedule.