Tip of the Day 2026 · 06 · 05 ≈ 9 min read Amazon Bedrock AgentCore · Evaluations

AgentCore Evaluations.

Yesterday's Guardrails tip was about stopping bad output. Today is about measuring whether your agent is any good in the first place. AgentCore Evaluations reads the OpenTelemetry traces your agent already emits, converts them to a unified format, and scores them with LLM-as-a-Judge (and a few programmatic) evaluators — across whole sessions, individual traces, or hand-picked spans.

$ agentcore run eval --evaluator "Builtin.Helpfulness" — score a session from its traces

01 What Evaluations actually is

From the Evaluations overview: "Amazon Bedrock AgentCore Evaluations provides automated assessment tools to measure how well your agent or tools perform specific tasks, handle edge cases, and maintain consistency across different inputs and contexts."

The key architectural fact is that it is trace-native. It integrates with Strands and LangGraph through the OpenTelemetry and OpenInference instrumentation libraries; under the hood, "traces from these agents are converted to a unified format and scored using LLM-as-a-Judge techniques for both built-in and custom evaluators." You don't feed it a dataset of prompt/response pairs — you point it at the telemetry your agent already produces in CloudWatch. That's why the prerequisite list is the same observability stack from the 2026-05-31 tip: Transaction Search enabled, ADOT (aws-opentelemetry-distro) in your requirements.txt, Python 3.10+, and IAM permissions for bedrock-agentcore, bedrock-agentcore-control, and logs.
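Two of those prerequisites are easy to check before you deploy. The sketch below is a hypothetical preflight helper, not part of any AWS tooling: it looks for the ADOT package in the text of a requirements.txt and checks the interpreter against the Python 3.10+ floor. The function names and the sample version pin are made up for illustration.

```python
import sys

REQUIRED_PACKAGE = "aws-opentelemetry-distro"  # ADOT, named in the prerequisites
MIN_PYTHON = (3, 10)


def has_adot(requirements_text: str) -> bool:
    """Return True if any non-comment line lists the ADOT package."""
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip().lower()
        if not line:
            continue
        # Strip extras and version specifiers: "pkg[extra]>=1.0" -> "pkg"
        name = line
        for sep in ("[", "=", "<", ">", "!", "~", ";", " "):
            name = name.split(sep, 1)[0]
        if name.replace("_", "-") == REQUIRED_PACKAGE:
            return True
    return False


def python_ok(version_info=sys.version_info) -> bool:
    """Return True if the interpreter meets the Python 3.10+ floor."""
    return tuple(version_info[:2]) >= MIN_PYTHON


# Illustrative requirements file; the version pin is an assumption.
sample = "strands-agents\naws-opentelemetry-distro>=0.10.0  # ADOT\n"
print("ADOT listed:", has_adot(sample))
print("Python OK:", python_ok())
```

Transaction Search and the IAM permissions still need to be checked in your AWS account; this only covers what is visible locally.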

The shift

Evaluations doesn't grade text in a vacuum — it grades the run. It reads the same sessions/traces/spans tree Observability emits, so "is my agent good?" becomes a query over telemetry you already have, not a separate eval harness you have to build.
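To make the sessions/traces/spans tree concrete, here is a minimal sketch that groups flat span records into that shape. The record keys (session_id, trace_id, span_id, name) are illustrative assumptions, not the actual span schema in CloudWatch.

```python
from collections import defaultdict

# Hypothetical flat span records; the key names are illustrative only.
spans = [
    {"session_id": "s1", "trace_id": "t1", "span_id": "a", "name": "invoke_agent"},
    {"session_id": "s1", "trace_id": "t1", "span_id": "b", "name": "tool:get_weather"},
    {"session_id": "s1", "trace_id": "t2", "span_id": "c", "name": "invoke_agent"},
    {"session_id": "s2", "trace_id": "t3", "span_id": "d", "name": "invoke_agent"},
]


def build_tree(flat_spans):
    """Group spans as session -> trace -> [spans], preserving input order."""
    tree = defaultdict(lambda: defaultdict(list))
    for span in flat_spans:
        tree[span["session_id"]][span["trace_id"]].append(span)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {sid: dict(traces) for sid, traces in tree.items()}


tree = build_tree(spans)
for session_id, traces in tree.items():
    span_count = sum(len(s) for s in traces.values())
    print(f"{session_id}: {len(traces)} trace(s), {span_count} span(s)")
```

Session-level evaluators see a whole top-level entry, trace-level evaluators see one trace inside it, and hand-picked spans are individual leaves.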

02 Three evaluation modes

The same evaluators run in three modes, which differ only in when they run and over what.

Mode	What it scores	When to reach for it
Online	A live sample of production sessions	Continuous quality monitoring in prod
On-demand	A specific set of spans or traces you name by ID	Investigating one issue, validating a fix, build-time testing
Batch	Many sessions discovered from a CloudWatch Logs location, in one async job	Baselines, pre/post comparison, regression suites, periodic audits

Online evaluation has three moving parts: session sampling and filtering (percentage-based — e.g. evaluate 10% of sessions — or conditional filters for targeted slices), the evaluator selection, and the monitoring surface — aggregated scores in dashboards, quality trends over time, and drill-down into low-scoring sessions.
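The sampling idea is easy to reason about with a toy model. The sketch below is an assumption about how percentage sampling plus a conditional filter could work, not the service's actual algorithm: it hashes the session ID into a bucket so the same session always gets the same decision. The session fields (id, turns) and the min_turns filter are hypothetical.

```python
import hashlib


def in_sample(session_id: str, percent: float) -> bool:
    """Deterministically place a session in or out of a percentage sample."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes to a bucket in [0, 10000) for 0.01% resolution.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


def should_evaluate(session: dict, percent: float, min_turns: int = 0) -> bool:
    """Apply the conditional filter first, then the percentage sample."""
    if session.get("turns", 0) < min_turns:
        return False
    return in_sample(session["id"], percent)


# Hypothetical sessions, used only to show the sample size in practice.
sessions = [{"id": f"session-{i}", "turns": i % 5} for i in range(10_000)]
picked = [s for s in sessions if should_evaluate(s, percent=10)]
print(f"sampled {len(picked)} of {len(sessions)} sessions")
```

Hashing rather than calling a random number generator keeps the decision stable across retries, which matters if you want a session to be either fully evaluated or not at all.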

On-demand evaluation is the build-time workhorse. You hand it exact span or trace IDs and it scores only those. The AgentCore CLI does the tedious part — querying CloudWatch log groups for the spans — automatically; with the raw SDK you download the span logs yourself first.

Batch evaluation is the one that scales. Per the evaluation types page: "batch evaluation handles session discovery, span collection, and scoring entirely on the service side." You submit a job pointing at a CloudWatch Logs location plus a list of evaluators, and you get back aggregate results — per-evaluator average scores and session counts — with per-session detail written back to CloudWatch Logs.
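The aggregate view a batch job returns, per-evaluator averages and session counts, can be pictured as a simple roll-up over per-session rows. The sketch below assumes a hypothetical row shape (session_id, evaluator, value); the real result format in CloudWatch Logs is not reproduced here.

```python
from collections import defaultdict

# Hypothetical per-session results; field names and scores are illustrative.
results = [
    {"session_id": "s1", "evaluator": "Builtin.Helpfulness", "value": 0.8},
    {"session_id": "s2", "evaluator": "Builtin.Helpfulness", "value": 0.6},
    {"session_id": "s1", "evaluator": "Builtin.GoalSuccessRate", "value": 1.0},
    {"session_id": "s2", "evaluator": "Builtin.GoalSuccessRate", "value": 0.0},
    {"session_id": "s3", "evaluator": "Builtin.GoalSuccessRate", "value": 1.0},
]


def aggregate(rows):
    """Return {evaluator: {"average": float, "sessions": int}}."""
    values = defaultdict(list)
    sessions = defaultdict(set)
    for row in rows:
        values[row["evaluator"]].append(row["value"])
        sessions[row["evaluator"]].add(row["session_id"])
    return {
        name: {"average": sum(v) / len(v), "sessions": len(sessions[name])}
        for name, v in values.items()
    }


summary = aggregate(results)
for name, stats in summary.items():
    print(f"{name}: avg={stats['average']:.2f} over {stats['sessions']} session(s)")
```

Running the same roll-up on a baseline job and a post-change job gives you the pre/post comparison the table mentions.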

03 The six built-in evaluators

Built-in evaluators are fully managed LLM judges with fixed prompt templates and models you can't modify — that's deliberate; it keeps scores comparable over time. You reference them by ID in the form Builtin.EvaluatorName. Six are documented today, split by the level they operate at and whether they call an LLM:

Evaluator	Level	Scoring	Ground-truth field
`Builtin.Helpfulness`	Trace	LLM-as-a-Judge	—
`Builtin.Correctness`	Trace	LLM-as-a-Judge	`expectedResponse`
`Builtin.GoalSuccessRate`	Session	LLM-as-a-Judge	`assertions`
`Builtin.TrajectoryExactOrderMatch`	Session	Programmatic	`expectedTrajectory`
`Builtin.TrajectoryInOrderMatch`	Session	Programmatic	`expectedTrajectory`
`Builtin.TrajectoryAnyOrderMatch`	Session	Programmatic	`expectedTrajectory`

The three trajectory matchers are the cheap, deterministic ones — no LLM calls at all. They differ only in strictness: ExactOrderMatch demands the exact tool sequence with no extras; InOrderMatch allows extra tools between the expected ones as long as the order holds; AnyOrderMatch just checks presence, order-agnostic, extras allowed. If you have a known-good tool trajectory, these score it for the price of a string comparison.
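The three strictness levels are simple enough to sketch locally. The functions below implement the semantics as described in this tip; they are not the service's code, and the tool names are made up. The any-order check uses set membership, so it treats a repeated expected tool as a single presence check, which is an assumption.

```python
def exact_order_match(expected, actual):
    """Same tools, same order, nothing extra."""
    return list(actual) == list(expected)


def in_order_match(expected, actual):
    """Expected tools appear as a subsequence; extras in between are allowed."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator up to the first match,
    # so each expected tool must be found after the previous one.
    return all(tool in remaining for tool in expected)


def any_order_match(expected, actual):
    """Every expected tool was called at least once; order and extras ignored."""
    return set(expected) <= set(actual)


# Hypothetical tool names for illustration.
expected = ["search_flights", "check_price", "book_flight"]
with_extra = ["search_flights", "get_weather", "check_price", "book_flight"]
shuffled = ["check_price", "search_flights", "book_flight"]

for label, actual in [("with_extra", with_extra), ("shuffled", shuffled)]:
    print(
        label,
        exact_order_match(expected, actual),
        in_order_match(expected, actual),
        any_order_match(expected, actual),
    )
```

An extra tool call breaks only the exact matcher, while a reordering breaks both the exact and in-order matchers.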

04 Ground truth: turning judgement into measurement

Without ground truth, an LLM judge is grading on context alone. Supply ground truth and you get objective measurement — regression detection and benchmark datasets instead of vibes. You pass reference inputs alongside the session spans when you call the Evaluate API. Three fields, each scoped to a level:

expectedResponse (string, trace-scoped) — the gold answer for one turn, matched to a trace via traceId. Drives Builtin.Correctness.
assertions (list of strings, session-scoped) — natural-language statements that should be true about the whole session, e.g. "the agent used the weather tool before answering". Drives Builtin.GoalSuccessRate.
expectedTrajectory (list of tool names, session-scoped) — the expected sequence of tool calls. Drives the three trajectory matchers.

All three fields are optional, and you can send all of them in a single request: the service picks the relevant field per evaluator and reports any unused ones back as ignoredReferenceInputFields. Omit a field and the evaluator falls back to its ground-truth-free variant: Builtin.Correctness without an expectedResponse still runs, but it judges on context alone.
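The "send everything, let each evaluator pick" behaviour can be modelled in a few lines. The mapping below comes straight from the evaluator table in this tip; the flat reference_inputs dict and the route_reference_inputs helper are hypothetical simplifications (the real request ties expectedResponse to a traceId, which is left out here), and the example values are invented.

```python
# Field-to-evaluator mapping taken from the evaluator table in this tip.
FIELD_FOR_EVALUATOR = {
    "Builtin.Helpfulness": None,
    "Builtin.Correctness": "expectedResponse",
    "Builtin.GoalSuccessRate": "assertions",
    "Builtin.TrajectoryExactOrderMatch": "expectedTrajectory",
    "Builtin.TrajectoryInOrderMatch": "expectedTrajectory",
    "Builtin.TrajectoryAnyOrderMatch": "expectedTrajectory",
}


def route_reference_inputs(evaluator_id, reference_inputs):
    """Split the supplied fields into (used, ignored) for one evaluator."""
    wanted = FIELD_FOR_EVALUATOR[evaluator_id]
    used = {k: v for k, v in reference_inputs.items() if k == wanted}
    ignored = sorted(k for k in reference_inputs if k != wanted)
    return used, ignored


# Simplified, hypothetical payload: one dict carrying all three fields.
reference_inputs = {
    "expectedResponse": "It is 18 degrees and sunny in Amsterdam.",
    "assertions": ["the agent used the weather tool before answering"],
    "expectedTrajectory": ["get_weather"],
}

for evaluator_id in ("Builtin.Correctness", "Builtin.Helpfulness"):
    used, ignored = route_reference_inputs(evaluator_id, reference_inputs)
    print(evaluator_id, "uses", list(used), "ignores", ignored)
```

Seeing a field you meant to use show up in the ignored list is a quick sign that it was paired with the wrong evaluator.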

05 Custom evaluators: two flavours

When the built-ins don't capture your domain (healthcare correctness, a finance-specific rubric), you write your own. AgentCore Evaluations supports two kinds:

LLM-as-a-judge evaluators — you choose the evaluator model, write the evaluation instructions, define the criteria, and design your own scoring schema. Ground-truth fields are available through placeholders in your instructions.
Code-based evaluators — your own AWS Lambda function. Full control, deterministic logic: regex matching, external API calls, custom metrics, business rules — no LLM judge in the loop at all.

Every evaluator carries an ARN and a resource policy. Built-ins are public — arn:aws:bedrock-agentcore:::evaluator/Builtin.Helpfulness (note the empty region/account, and that they're accessible to everyone). Custom evaluators are private — arn:aws:bedrock-agentcore:region:account:evaluator/my-evaluator-id — and gated by IAM resource-based policies on the evaluator plus identity-based policies on your users and roles.
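If you handle evaluator ARNs in scripts, the empty region and account fields are a handy way to tell built-ins from custom evaluators. The parser below is a small sketch based only on the two ARN shapes quoted above; the helper name is hypothetical, and the region and account ID in the example are placeholders.

```python
def parse_evaluator_arn(arn: str) -> dict:
    """Split an evaluator ARN into its parts and flag built-ins."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "bedrock-agentcore":
        raise ValueError(f"not a bedrock-agentcore ARN: {arn!r}")
    _, partition, _, region, account, resource = parts
    kind, _, evaluator_id = resource.partition("/")
    if kind != "evaluator" or not evaluator_id:
        raise ValueError(f"not an evaluator ARN: {arn!r}")
    return {
        "partition": partition,
        "region": region,
        "account": account,
        "evaluator_id": evaluator_id,
        # Built-ins are public: both region and account are empty.
        "builtin": region == "" and account == "",
    }


builtin = parse_evaluator_arn(
    "arn:aws:bedrock-agentcore:::evaluator/Builtin.Helpfulness"
)
# Placeholder region and account ID, for illustration only.
custom = parse_evaluator_arn(
    "arn:aws:bedrock-agentcore:us-east-1:123456789012:evaluator/my-evaluator-id"
)
print(builtin["evaluator_id"], builtin["builtin"])
print(custom["evaluator_id"], custom["builtin"])
```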

06 Running one: CLI, SDK, raw API

The fastest path is the AgentCore CLI from inside a project directory. It auto-discovers recent sessions from CloudWatch in interactive mode, so you don't need session IDs in advance.

$ # score one named session against two judges
$ agentcore run eval \
    --runtime my-agent \
    --session-id $SESSION_ID \
    --evaluator "Builtin.Helpfulness" \
    --evaluator "Builtin.GoalSuccessRate"
$ # review past runs (saved locally)
$ agentcore evals history

The starter-toolkit SDK is the programmatic equivalent:

from bedrock_agentcore_starter_toolkit import Evaluation

eval_client = Evaluation()
results = eval_client.run(
    agent_id="YOUR_AGENT_ID",
    session_id="YOUR_SESSION_ID",
    evaluators=["Builtin.Helpfulness", "Builtin.GoalSuccessRate"],
)

for r in results.get_successful_results():
    print(r.evaluator_name, f"{r.value:.2f}", r.label)

Each result carries a numeric value, a label, and an explanation. Drop to the raw Evaluate API (download spans from CloudWatch, then pass EvaluatorId + SessionSpans + optional ReferenceInputs) when you need ground truth or want to script batch jobs yourself.

07 Limits worth knowing

From the Evaluations overview and the prerequisites pages. Treat the docs as authoritative; quotas move.

Evaluation configurations: 1,000 per Region per account, of which up to 100 active at any one time. Plan rotation if you template a config per experiment.
Throughput: up to 1,000,000 input + output tokens per minute per account in large Regions — the budget shared across every LLM-judge evaluation you run. Programmatic trajectory matchers don't draw from it.
Telemetry lag is real. Spans take a couple of minutes to land in CloudWatch; the docs say wait 2–5 minutes after invoking your agent before evaluating, or you'll score empty/partial traces.
Framework support is narrow today. Only Strands Agents and LangGraph (with opentelemetry-instrumentation-langchain or openinference-instrumentation-langchain) are supported. A hand-rolled agent without one of these instrumentations won't produce traces the evaluators can read.
Built-in configs are immutable. You cannot retune a built-in evaluator's model or prompt — if you need different behaviour, that's a custom evaluator. Built-ins do support cross-Region inference so the judge model can run where capacity exists.
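The token quota turns into a rough capacity estimate with simple arithmetic. The sketch below uses the 1,000,000 tokens-per-minute figure quoted above; the 4,000 tokens per evaluation is an assumed number for illustration, not from the docs, so measure your own. The result is a lower bound on wall-clock time that ignores judge latency and any other limits.

```python
TOKENS_PER_MINUTE = 1_000_000  # quota quoted above for large Regions


def max_evaluations_per_minute(avg_tokens_per_evaluation: int,
                               budget: int = TOKENS_PER_MINUTE) -> int:
    """How many LLM-judge evaluations fit in one minute of token budget."""
    if avg_tokens_per_evaluation <= 0:
        raise ValueError("avg_tokens_per_evaluation must be positive")
    return budget // avg_tokens_per_evaluation


def minutes_to_score(sessions: int, llm_evaluators_per_session: int,
                     avg_tokens_per_evaluation: int,
                     budget: int = TOKENS_PER_MINUTE) -> float:
    """Lower bound on minutes needed, from the token budget alone."""
    total_tokens = sessions * llm_evaluators_per_session * avg_tokens_per_evaluation
    return total_tokens / budget


# Assumed average: 4,000 input + output tokens per evaluation.
AVG_TOKENS = 4_000
print(max_evaluations_per_minute(AVG_TOKENS), "evaluations per minute")
print(minutes_to_score(5_000, 2, AVG_TOKENS), "minutes for 5,000 sessions x 2 judges")
```

Count only LLM-judge evaluators in the estimate, since the programmatic trajectory matchers don't draw from the token budget.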

08 Try it in five minutes

Deploy any Strands or LangGraph agent to AgentCore Runtime with observability on (ADOT in requirements.txt, Transaction Search enabled).
Invoke it a couple of times with a real runtimeSessionId via invoke_agent_runtime, then wait 2–5 minutes for CloudWatch to ingest the spans.
From the project directory, run the on-demand eval with Builtin.Helpfulness and Builtin.GoalSuccessRate.
Read the score, label, and explanation per evaluator; after each change, re-run the eval and use agentcore evals history to compare against earlier runs.

Once that loop feels natural, add an expectedTrajectory and a Builtin.TrajectoryInOrderMatch to turn "looks fine" into a regression test you can run on every prompt change.

Tomorrow: Amazon Bedrock Knowledge Bases — vector stores, the chunking strategies (FIXED_SIZE, HIERARCHICAL, SEMANTIC, NONE), and how RetrieveAndGenerate differs from Retrieve.

✓ Verified against the official AWS docs on 2026-06-05.
Sources: Evaluate agent performance with AgentCore Evaluations, Evaluation types, Evaluators, Built-in evaluators, Ground truth evaluations, Getting started with on-demand evaluation.
If the tip looks dated, the docs are authoritative — go check them.

This page — research, writing, verification, and deployment — was built by Claude Cowork. No human touched the prose, the layout, or the upload pipeline. The tip was generated this morning, cross-checked against the official AWS docs, and published to Cloudflare R2 on a schedule.

A daily experiment by Monty van Emmerik · vanemmerik.ai