vanemmerik.ai / aws-ai
Tip of the Day · 2026-06-05 · ≈ 9 min read · Amazon Bedrock AgentCore · Evaluations

AgentCore Evaluations.

Yesterday's Guardrails tip was about stopping bad output. Today is about measuring whether your agent is any good in the first place. AgentCore Evaluations reads the OpenTelemetry traces your agent already emits, converts them to a unified format, and scores them with LLM-as-a-Judge (and a few programmatic) evaluators — across whole sessions, individual traces, or hand-picked spans.

$ agentcore run eval --evaluator "Builtin.Helpfulness"  — score a session from its traces

01 What Evaluations actually is

From the Evaluations overview: "Amazon Bedrock AgentCore Evaluations provides automated assessment tools to measure how well your agent or tools perform specific tasks, handle edge cases, and maintain consistency across different inputs and contexts."

The key architectural fact is that it is trace-native. It integrates with Strands and LangGraph through the OpenTelemetry and OpenInference instrumentation libraries; under the hood, "traces from these agents are converted to a unified format and scored using LLM-as-a-Judge techniques for both built-in and custom evaluators." You don't feed it a dataset of prompt/response pairs — you point it at the telemetry your agent already produces in CloudWatch. That's why the prerequisite list is the same observability stack from the 2026-05-31 tip: Transaction Search enabled, ADOT (aws-opentelemetry-distro) in your requirements.txt, Python 3.10+, and IAM permissions for bedrock-agentcore, bedrock-agentcore-control, and logs.
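The two prerequisites that live in your project (the Python version and the ADOT package) can be checked offline before you touch AWS. The sketch below is a local convenience, not part of AgentCore: `check_prerequisites` is a hypothetical helper, and it does not verify Transaction Search or IAM permissions.

```python
import re
import sys


def check_prerequisites(requirements_text, python_version=None):
    """Return a list of problems; an empty list means the local checks pass.

    Hypothetical helper: it covers only Python 3.10+ and the presence of
    aws-opentelemetry-distro in the text of a requirements file.
    """
    version = tuple(python_version or sys.version_info[:2])
    problems = []
    if version < (3, 10):
        problems.append(f"Python 3.10+ required, found {version[0]}.{version[1]}")

    # Collect package names, ignoring comments, blank lines, and version pins.
    packages = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name = re.split(r"[\s=<>!~\[;]", line, maxsplit=1)[0]
        packages.add(name.lower().replace("_", "-"))

    if "aws-opentelemetry-distro" not in packages:
        problems.append("aws-opentelemetry-distro missing from requirements.txt")
    return problems
```

Call it with the contents of `requirements.txt`; passing `python_version` explicitly is only useful when checking a target environment other than the current interpreter.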

The shift

Evaluations doesn't grade text in a vacuum — it grades the run. It reads the same sessions/traces/spans tree Observability emits, so "is my agent good?" becomes a query over telemetry you already have, not a separate eval harness you have to build.

02 Three evaluation modes

The same evaluators run in three modes, which differ only in when they run and over what.

| Mode | What it scores | When to reach for it |
| --- | --- | --- |
| Online | A live sample of production sessions | Continuous quality monitoring in prod |
| On-demand | A specific set of spans or traces you name by ID | Investigating one issue, validating a fix, build-time testing |
| Batch | Many sessions discovered from a CloudWatch Logs location, in one async job | Baselines, pre/post comparison, regression suites, periodic audits |

Online evaluation has three moving parts: session sampling and filtering (percentage-based — e.g. evaluate 10% of sessions — or conditional filters for targeted slices), the evaluator selection, and the monitoring surface — aggregated scores in dashboards, quality trends over time, and drill-down into low-scoring sessions.
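Percentage-based sampling can be pictured as a deterministic bucket test on the session ID. This is only an illustration of the idea: the source does not say how the service selects sessions, so `should_sample` and its hashing scheme are assumptions.

```python
import hashlib


def should_sample(session_id: str, percent: float) -> bool:
    """Decide whether a session falls inside a percentage-based sample.

    Hashing the ID (an assumed scheme, for illustration) keeps the decision
    stable: the same session is always in or always out of the sample.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes onto 10,000 buckets for 0.01% granularity.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Made-up session IDs; roughly 10% of them should land in the sample.
session_ids = [f"session-{i}" for i in range(10_000)]
sampled = [s for s in session_ids if should_sample(s, 10)]
print(f"sampled {len(sampled)} of {len(session_ids)} sessions")
```

A stable decision per session matters for monitoring: re-running the sampler over the same day of traffic scores the same sessions, so trend lines are comparable.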

On-demand evaluation is the build-time workhorse. You hand it exact span or trace IDs and it scores only those. The AgentCore CLI does the tedious part — querying CloudWatch log groups for the spans — automatically; with the raw SDK you download the span logs yourself first.

Batch evaluation is the one that scales. Per the evaluation types page: "batch evaluation handles session discovery, span collection, and scoring entirely on the service side." You submit a job pointing at a CloudWatch Logs location plus a list of evaluators, and you get back aggregate results — per-evaluator average scores and session counts — with per-session detail written back to CloudWatch Logs.
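The aggregate a batch job returns (per-evaluator average scores and session counts) is a plain reduction over per-session results. The sketch below reproduces that reduction locally on made-up rows; the record shape (`session_id`, `evaluator`, `value`) is an assumption for illustration, not the service's output schema.

```python
from collections import defaultdict


def aggregate(rows):
    """Reduce per-session scores to {evaluator: {"average": ..., "sessions": ...}}."""
    scores = defaultdict(list)
    sessions = defaultdict(set)
    for row in rows:
        scores[row["evaluator"]].append(row["value"])
        sessions[row["evaluator"]].add(row["session_id"])
    return {
        name: {
            "average": sum(values) / len(values),
            "sessions": len(sessions[name]),
        }
        for name, values in scores.items()
    }


# Made-up rows standing in for the per-session detail.
rows = [
    {"session_id": "s1", "evaluator": "Builtin.Helpfulness", "value": 0.5},
    {"session_id": "s2", "evaluator": "Builtin.Helpfulness", "value": 1.0},
    {"session_id": "s1", "evaluator": "Builtin.GoalSuccessRate", "value": 1.0},
]
summary = aggregate(rows)
print(summary["Builtin.Helpfulness"])  # {'average': 0.75, 'sessions': 2}
```

Running the same reduction over a "before" and an "after" set of rows gives the pre/post comparison the batch mode is meant for.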

03 The six built-in evaluators

Built-in evaluators are fully managed LLM judges with fixed prompt templates and models you can't modify — that's deliberate; it keeps scores comparable over time. You reference them by ID in the form Builtin.EvaluatorName. Six are documented today, split by the level they operate at and whether they call an LLM:

| Evaluator | Level | Scoring | Ground-truth field |
| --- | --- | --- | --- |
| Builtin.Helpfulness | Trace | LLM-as-a-Judge | (none) |
| Builtin.Correctness | Trace | LLM-as-a-Judge | expectedResponse |
| Builtin.GoalSuccessRate | Session | LLM-as-a-Judge | assertions |
| Builtin.TrajectoryExactOrderMatch | Session | Programmatic | expectedTrajectory |
| Builtin.TrajectoryInOrderMatch | Session | Programmatic | expectedTrajectory |
| Builtin.TrajectoryAnyOrderMatch | Session | Programmatic | expectedTrajectory |

The three trajectory matchers are the cheap, deterministic ones — no LLM calls at all. They differ only in strictness: ExactOrderMatch demands the exact tool sequence with no extras; InOrderMatch allows extra tools between the expected ones as long as the order holds; AnyOrderMatch just checks presence, order-agnostic, extras allowed. If you have a known-good tool trajectory, these score it for the price of a string comparison.
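The three strictness levels are easy to pin down in code. The functions below are a minimal local sketch of the semantics described above, not the service's implementation; the tool names are hypothetical, and treating "presence" as set membership (ignoring repeat counts) is an assumption.

```python
def exact_order_match(actual, expected):
    """Exact tool sequence, no extras."""
    return list(actual) == list(expected)


def in_order_match(actual, expected):
    """Expected tools appear in order; extra tools in between are allowed."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator up to the first match,
    # so each expected tool must be found after the previous one.
    return all(tool in remaining for tool in expected)


def any_order_match(actual, expected):
    """Every expected tool is present; order is ignored, extras are allowed."""
    return set(expected) <= set(actual)


expected = ["search", "fetch", "summarize"]       # hypothetical tool names
actual = ["search", "log", "fetch", "summarize"]  # one extra call in between
print(exact_order_match(actual, expected))  # False: "log" is an extra
print(in_order_match(actual, expected))     # True
print(any_order_match(actual, expected))    # True
```

Each check is a single pass over a short list, which is why these evaluators are cheap enough to run on every session.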

04 Ground truth: turning judgement into measurement

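Without ground truth, an LLM judge is grading on context alone. Supply ground truth and you get objective measurement: regression detection and benchmark datasets instead of vibes. You pass reference inputs alongside the session spans when you call the Evaluate API. Three fields, each scoped to a level: expectedResponse (trace level, read by Builtin.Correctness), assertions (session level, read by Builtin.GoalSuccessRate), and expectedTrajectory (session level, read by the three trajectory matchers).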

All three fields are optional, and you can send all of them in a single request: the service picks the relevant field per evaluator and reports any unused ones back as ignoredReferenceInputFields. Omit a field and the evaluator falls back to its ground-truth-free variant — Builtin.Correctness without an expectedResponse still runs, it just judges on context alone.
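The "send everything, let the service pick" behaviour can be modelled locally in a few lines. This is a sketch of the rule as described, not the service's code: the mapping comes from the built-in evaluators table, `split_reference_inputs` is a hypothetical helper, and the value shapes (a string, lists of strings) are assumptions.

```python
# Which ground-truth field each built-in evaluator reads (None = no field).
GROUND_TRUTH_FIELD = {
    "Builtin.Helpfulness": None,
    "Builtin.Correctness": "expectedResponse",
    "Builtin.GoalSuccessRate": "assertions",
    "Builtin.TrajectoryExactOrderMatch": "expectedTrajectory",
    "Builtin.TrajectoryInOrderMatch": "expectedTrajectory",
    "Builtin.TrajectoryAnyOrderMatch": "expectedTrajectory",
}


def split_reference_inputs(evaluator_id, reference_inputs):
    """Return (used, ignored): the field this evaluator reads and the rest."""
    wanted = GROUND_TRUTH_FIELD[evaluator_id]
    used = {k: v for k, v in reference_inputs.items() if k == wanted}
    ignored = sorted(k for k in reference_inputs if k != wanted)
    return used, ignored


# Hypothetical reference inputs covering all three fields.
reference_inputs = {
    "expectedResponse": "Your order ships on Friday.",
    "assertions": ["The agent confirms a shipping date."],
    "expectedTrajectory": ["lookup_order", "get_shipping_date"],
}
used, ignored = split_reference_inputs("Builtin.Correctness", reference_inputs)
print(sorted(used))  # ['expectedResponse']
print(ignored)       # ['assertions', 'expectedTrajectory']
```

In this model the `ignored` list plays the role of ignoredReferenceInputFields, and an empty `used` dict corresponds to the ground-truth-free fallback.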

05 Custom evaluators: two flavours

When the built-ins don't capture your domain (healthcare correctness, a finance-specific rubric), you write your own. AgentCore Evaluations supports two kinds:

Every evaluator carries an ARN and a resource policy. Built-ins are public — arn:aws:bedrock-agentcore:::evaluator/Builtin.Helpfulness (note the empty region/account, and that they're accessible to everyone). Custom evaluators are private — arn:aws:bedrock-agentcore:region:account:evaluator/my-evaluator-id — and gated by IAM resource-based policies on the evaluator plus identity-based policies on your users and roles.
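The two ARN shapes differ only in whether the region and account segments are filled in, which makes them easy to tell apart in tooling. A minimal sketch, assuming the two formats quoted above are the only ones you need to handle; `parse_evaluator_arn` is a hypothetical helper, and the region and account in the example are placeholders.

```python
def parse_evaluator_arn(arn):
    """Split an evaluator ARN into its parts and flag built-in evaluators."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "bedrock-agentcore":
        raise ValueError(f"not a bedrock-agentcore ARN: {arn!r}")
    _, partition, _, region, account, resource = parts
    kind, _, evaluator_id = resource.partition("/")
    if kind != "evaluator" or not evaluator_id:
        raise ValueError(f"not an evaluator ARN: {arn!r}")
    return {
        "partition": partition,
        "region": region,
        "account": account,
        "evaluator_id": evaluator_id,
        # Built-ins are public: both region and account are empty.
        "builtin": region == "" and account == "",
    }


builtin = parse_evaluator_arn(
    "arn:aws:bedrock-agentcore:::evaluator/Builtin.Helpfulness"
)
custom = parse_evaluator_arn(
    "arn:aws:bedrock-agentcore:us-east-1:123456789012:evaluator/my-evaluator-id"
)
print(builtin["builtin"], custom["builtin"])  # True False
```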

06 Running one: CLI, SDK, raw API

The fastest path is the AgentCore CLI from inside a project directory. It auto-discovers recent sessions from CloudWatch in interactive mode, so you don't need session IDs in advance.

$ # score one named session against two judges
$ agentcore run eval \
    --runtime my-agent \
    --session-id $SESSION_ID \
    --evaluator "Builtin.Helpfulness" \
    --evaluator "Builtin.GoalSuccessRate"

$ # review past runs (saved locally)
$ agentcore evals history

The starter-toolkit SDK is the programmatic equivalent:

from bedrock_agentcore_starter_toolkit import Evaluation

eval_client = Evaluation()
results = eval_client.run(
    agent_id="YOUR_AGENT_ID",
    session_id="YOUR_SESSION_ID",
    evaluators=["Builtin.Helpfulness", "Builtin.GoalSuccessRate"],
)

for r in results.get_successful_results():
    print(r.evaluator_name, f"{r.value:.2f}", r.label)

Each result carries a numeric value, a label, and an explanation. Drop to the raw Evaluate API (download spans from CloudWatch, then pass EvaluatorId + SessionSpans + optional ReferenceInputs) when you need ground truth or want to script batch jobs yourself.
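Because every result carries a numeric value, a simple quality gate is one list comprehension away. The sketch below uses a local `Result` dataclass as a stand-in for the SDK's result objects (same attribute names as the snippet above, plus `explanation`); the thresholds and the assumption that scores fall between 0 and 1 are illustrative, not documented values.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Stand-in for an evaluation result: value, label, explanation."""
    evaluator_name: str
    value: float
    label: str
    explanation: str


def failing(results, thresholds, default=0.5):
    """Return the results whose value is below the threshold for their evaluator."""
    return [
        r for r in results
        if r.value < thresholds.get(r.evaluator_name, default)
    ]


# Made-up results and thresholds, for illustration only.
results = [
    Result("Builtin.Helpfulness", 0.83, "Very Helpful", "Answered directly."),
    Result("Builtin.GoalSuccessRate", 0.0, "No", "The booking was never made."),
]
thresholds = {"Builtin.Helpfulness": 0.7, "Builtin.GoalSuccessRate": 1.0}

for r in failing(results, thresholds):
    print(f"FAIL {r.evaluator_name}: {r.value:.2f} ({r.explanation})")
```

Printing the explanation next to the score keeps the gate useful for debugging: you see why the judge marked the session down, not just that it did.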

07 Limits worth knowing

From the Evaluations overview and the prerequisites pages. Treat the docs as authoritative; quotas move.

08 Try it in five minutes

Once that loop feels natural, add an expectedTrajectory and a Builtin.TrajectoryInOrderMatch to turn "looks fine" into a regression test you can run on every prompt change.

Tomorrow: Amazon Bedrock Knowledge Bases — vector stores, the chunking strategies (FIXED_SIZE, HIERARCHICAL, SEMANTIC, NONE), and how RetrieveAndGenerate differs from Retrieve.

Verified against the official AWS docs on 2026-06-05.
Sources: Evaluate agent performance with AgentCore Evaluations, Evaluation types, Evaluators, Built-in evaluators, Ground truth evaluations, Getting started with on-demand evaluation.
If the tip looks dated, the docs are authoritative — go check them.

This page — research, writing, verification, and deployment — was built by Claude Cowork. No human touched the prose, the layout, or the upload pipeline. The tip was generated this morning, cross-checked against the official AWS docs, and published to Cloudflare R2 on a schedule.

A daily experiment by Monty van Emmerik · vanemmerik.ai · what is Claude Cowork?