---
title: "AgentCore Evaluations — three evaluation modes, six built-in judges, and the ground-truth fields that turn vibes into scores"
date: 2026-06-05
service: "Amazon Bedrock AgentCore"
component: "Evaluations"
tags: [agentcore, evaluations, llm-as-a-judge, built-in-evaluators, custom-evaluators, online-evaluation, on-demand-evaluation, batch-evaluation, ground-truth, trajectory-matching, helpfulness, goal-success-rate, correctness, transaction-search, quotas]
source: https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/evaluations.html
verified_on: 2026-06-05
url: https://vanemmerik.ai/aws-ai/2026-06-05.html
---

# AWS Bedrock & AgentCore · Tip of the Day · 2026-06-05

## AgentCore Evaluations — scoring agent behaviour, not just model output

Yesterday's Guardrails tip was about *stopping* bad output. Today is about *measuring* whether your agent is any good in the first place. **AgentCore Evaluations** reads the OpenTelemetry traces your agent already emits, converts them to a unified format, and scores them with LLM-as-a-Judge (and a few programmatic) evaluators — across whole sessions, individual traces, or hand-picked spans.

    agentcore run eval \
      --runtime my-agent \
      --session-id $SESSION_ID \
      --evaluator "Builtin.Helpfulness" \
      --evaluator "Builtin.GoalSuccessRate"

≈ 9 min read · Amazon Bedrock AgentCore · Evaluations

## 01 · What Evaluations actually is

From the [Evaluations overview](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/evaluations.html): *"Amazon Bedrock AgentCore Evaluations provides automated assessment tools to measure how well your agent or tools perform specific tasks, handle edge cases, and maintain consistency across different inputs and contexts."*

The key architectural fact is that it is **trace-native**. It integrates with **Strands** and **LangGraph** through the OpenTelemetry and OpenInference instrumentation libraries; under the hood, *"traces from these agents are converted to a unified format and scored using LLM-as-a-Judge techniques for both built-in and custom evaluators."* You don't feed it a dataset of prompt/response pairs — you point it at the telemetry your agent already produces in CloudWatch. That's why the prerequisite list is the same observability stack we covered on 2026-05-31: Transaction Search enabled, ADOT (`aws-opentelemetry-distro`) in your `requirements.txt`, Python 3.10+, and IAM permissions for `bedrock-agentcore`, `bedrock-agentcore-control`, and `logs`.

## 02 · Three evaluation modes

The same evaluators run in three modes, which differ only in *when* and *over what* they run.

| Mode | What it scores | When to reach for it |
| --- | --- | --- |
| **Online** | A live sample of production sessions | Continuous quality monitoring in prod |
| **On-demand** | A specific set of spans or traces you name by ID | Investigating one customer issue, validating a fix, build-time testing |
| **Batch** | Many sessions discovered from a CloudWatch Logs location, in one async job | Baselines, pre/post comparison, regression suites, periodic audits |

**Online evaluation** has three moving parts: session sampling and filtering (percentage-based, e.g. evaluate 10% of sessions, or conditional filters for targeted slices); evaluator selection; and the monitoring surface, which covers aggregated scores in dashboards, quality trends over time, and drill-down into low-scoring sessions.
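
To make "percentage-based sampling" concrete, here is a minimal sketch of one common way to do it: hash the session ID into a bucket so the same session always gets the same decision. This is an illustration only, not how the service samples internally; `is_sampled` and the bucket scheme are made up for this example.

```python
import hashlib


def is_sampled(session_id: str, sample_percent: float) -> bool:
    """Deterministically decide whether a session falls inside the sample.

    Hypothetical helper: hashes the session ID into one of 10,000 buckets
    and keeps the session if its bucket is below the requested percentage.
    """
    if not 0 <= sample_percent <= 100:
        raise ValueError("sample_percent must be between 0 and 100")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer, then pick a bucket.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < sample_percent * 100


# Roughly one in ten of these made-up session IDs ends up in the sample.
sessions = [f"session-{i}" for i in range(1000)]
sampled = [s for s in sessions if is_sampled(s, 10)]
print(f"{len(sampled)} of {len(sessions)} sessions sampled")
```

Hashing rather than calling `random` means a given session is either always in or always out, which keeps re-runs comparable.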

**On-demand evaluation** is the build-time workhorse. You hand it exact span or trace IDs and it scores only those. The AgentCore CLI does the tedious part — querying CloudWatch log groups for the spans — automatically; with the raw SDK you download the span logs yourself first.

**Batch evaluation** is the one that scales. Per the [evaluation types page](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/evaluations-types.html): *"batch evaluation handles session discovery, span collection, and scoring entirely on the service side."* You submit a job pointing at a CloudWatch Logs location plus a list of evaluators, and you get back aggregate results — per-evaluator average scores and session counts — with per-session detail written back to CloudWatch Logs.
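
The aggregate shape described above (per-evaluator average scores and session counts) is easy to picture as a roll-up over per-session rows. A minimal sketch, assuming you have already pulled the per-session detail into a list of dicts; the row keys `evaluator`, `session_id`, and `score` are hypothetical, not the service's log schema.

```python
from collections import defaultdict
from statistics import mean


def aggregate(per_session_results):
    """Roll per-session scores up into per-evaluator averages and session counts."""
    scores = defaultdict(list)
    sessions = defaultdict(set)
    for row in per_session_results:
        scores[row["evaluator"]].append(row["score"])
        sessions[row["evaluator"]].add(row["session_id"])
    return {
        name: {
            "average_score": mean(values),
            "session_count": len(sessions[name]),
        }
        for name, values in scores.items()
    }


# Made-up rows standing in for per-session detail.
rows = [
    {"evaluator": "Builtin.Helpfulness", "session_id": "s1", "score": 0.5},
    {"evaluator": "Builtin.Helpfulness", "session_id": "s2", "score": 1.0},
    {"evaluator": "Builtin.GoalSuccessRate", "session_id": "s1", "score": 1.0},
]
summary = aggregate(rows)
print(summary)
```

Running the same roll-up before and after a change gives you the pre/post comparison the batch mode is meant for.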

## 03 · The six built-in evaluators

Built-in evaluators are fully managed: the LLM judges ship with fixed prompt templates and models you can't modify, which is deliberate because it keeps scores comparable over time. You reference them by ID in the form `Builtin.EvaluatorName`. Six are documented today, split by the *level* they operate at and whether they call an LLM:

| Evaluator | Level | Scoring | Ground-truth field |
| --- | --- | --- | --- |
| `Builtin.Helpfulness` | Trace | LLM-as-a-Judge | — |
| `Builtin.Correctness` | Trace | LLM-as-a-Judge | `expectedResponse` |
| `Builtin.GoalSuccessRate` | Session | LLM-as-a-Judge | `assertions` |
| `Builtin.TrajectoryExactOrderMatch` | Session | Programmatic | `expectedTrajectory` |
| `Builtin.TrajectoryInOrderMatch` | Session | Programmatic | `expectedTrajectory` |
| `Builtin.TrajectoryAnyOrderMatch` | Session | Programmatic | `expectedTrajectory` |

The three trajectory matchers are the cheap, deterministic ones — **no LLM calls at all**. They differ only in strictness: `ExactOrderMatch` demands the exact tool sequence with no extras; `InOrderMatch` allows extra tools between the expected ones as long as the order holds; `AnyOrderMatch` just checks presence, order-agnostic, extras allowed. If you have a known-good tool trajectory, these score it for the price of a string comparison.
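
The three strictness levels are easier to compare side by side. The sketch below re-implements the semantics as described in the paragraph above, in plain Python; it is not the service's code, the function names are invented here, and treating `AnyOrderMatch` as a set check (ignoring repeated tool calls) is an assumption.

```python
from typing import Sequence


def exact_order_match(actual: Sequence[str], expected: Sequence[str]) -> bool:
    """Same tools, same order, nothing extra."""
    return list(actual) == list(expected)


def in_order_match(actual: Sequence[str], expected: Sequence[str]) -> bool:
    """Expected tools appear in order; extra tools in between are allowed."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator up to the first hit, so each
    # expected tool must be found after the previous one.
    return all(tool in remaining for tool in expected)


def any_order_match(actual: Sequence[str], expected: Sequence[str]) -> bool:
    """Every expected tool was called at least once; order and extras ignored.

    Assumption: duplicates in `expected` are not counted separately.
    """
    return set(expected) <= set(actual)


actual = ["search", "lookup", "weather", "answer"]
print(exact_order_match(actual, ["search", "weather"]))  # False: extras present
print(in_order_match(actual, ["search", "weather"]))     # True: order holds
print(in_order_match(actual, ["weather", "search"]))     # False: wrong order
print(any_order_match(actual, ["weather", "search"]))    # True: both present
```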

## 04 · Ground truth — turning judgement into measurement

Without ground truth, an LLM judge is grading on context alone. Supply ground truth and you get *objective* measurement — regression detection and benchmark datasets instead of vibes. You pass reference inputs alongside the session spans when you call the `Evaluate` API. Three fields, each scoped to a level:

- **`expectedResponse`** (string, trace-scoped) — the gold answer for one turn, matched to a trace via `traceId`. Drives `Builtin.Correctness`.
- **`assertions`** (list of strings, session-scoped) — natural-language statements that should be true about the whole session, e.g. *"the agent used the weather tool before answering"*. Drives `Builtin.GoalSuccessRate`.
- **`expectedTrajectory`** (list of tool names, session-scoped) — the expected sequence of tool calls. Drives the three trajectory matchers.

All three fields are **optional**, and you can send all of them in a single request: the service picks the relevant field per evaluator and reports any unused ones back as `ignoredReferenceInputFields`. Omit a field and the evaluator falls back to its ground-truth-free variant — `Builtin.Correctness` without an `expectedResponse` still runs, it just judges on context alone.
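
Here is a rough local model of that "send everything, the service picks" behaviour, using the evaluator-to-field mapping from the table in section 03. The flat dict is a simplification for illustration and is not the real `Evaluate` request schema (for one thing, `expectedResponse` is matched to a trace via `traceId`); `split_reference_inputs` is a hypothetical helper.

```python
# Which ground-truth field each built-in evaluator reads (from the table above).
GROUND_TRUTH_FIELD = {
    "Builtin.Helpfulness": None,
    "Builtin.Correctness": "expectedResponse",
    "Builtin.GoalSuccessRate": "assertions",
    "Builtin.TrajectoryExactOrderMatch": "expectedTrajectory",
    "Builtin.TrajectoryInOrderMatch": "expectedTrajectory",
    "Builtin.TrajectoryAnyOrderMatch": "expectedTrajectory",
}


def split_reference_inputs(evaluator_id: str, reference_inputs: dict):
    """Return (fields the evaluator uses, names of fields it ignores)."""
    wanted = GROUND_TRUTH_FIELD.get(evaluator_id)
    used = {k: v for k, v in reference_inputs.items() if k == wanted}
    ignored = sorted(k for k in reference_inputs if k != wanted)
    return used, ignored


# Made-up ground truth for a single weather session.
reference_inputs = {
    "expectedResponse": "It will rain in Seattle tomorrow.",
    "assertions": ["the agent used the weather tool before answering"],
    "expectedTrajectory": ["get_weather"],
}

used, ignored = split_reference_inputs("Builtin.Correctness", reference_inputs)
print(used)     # only expectedResponse
print(ignored)  # ['assertions', 'expectedTrajectory']
```

The `ignored` list plays the role of `ignoredReferenceInputFields`: a quick way to see that a field you supplied had no effect on a given evaluator.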

## 05 · Custom evaluators — two flavours

When the built-ins don't capture your domain (healthcare correctness, a finance-specific rubric), you write your own. AgentCore Evaluations supports two kinds:

- **LLM-as-a-judge evaluators** — you choose the evaluator model, write the evaluation instructions, define the criteria, and design your own scoring schema. Ground-truth fields are available through placeholders in your instructions.
- **Code-based evaluators** — your own AWS **Lambda** function. Full control, deterministic logic: regex matching, external API calls, custom metrics, business rules — no LLM judge in the loop at all.

Every evaluator carries an ARN and a resource policy. Built-ins are public — `arn:aws:bedrock-agentcore:::evaluator/Builtin.Helpfulness` (note the empty region/account, and that they're accessible to everyone). Custom evaluators are private — `arn:aws:bedrock-agentcore:region:account:evaluator/my-evaluator-id` — and gated by IAM resource-based policies on the evaluator plus identity-based policies on your users and roles.
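
If you template IAM policies, it helps to build and classify these ARNs in code. A small sketch based only on the two ARN shapes quoted above; `evaluator_arn` and `is_builtin_arn` are hypothetical helpers, and the region and account values are placeholders.

```python
def evaluator_arn(evaluator_id: str, region: str = "", account: str = "") -> str:
    """Build an evaluator ARN; leave region and account empty for built-ins."""
    return f"arn:aws:bedrock-agentcore:{region}:{account}:evaluator/{evaluator_id}"


def is_builtin_arn(arn: str) -> bool:
    """True when the ARN has empty region/account and a Builtin.* evaluator ID."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[:3] != ["arn", "aws", "bedrock-agentcore"]:
        raise ValueError(f"not a bedrock-agentcore ARN: {arn!r}")
    region, account, resource = parts[3], parts[4], parts[5]
    return region == "" and account == "" and resource.startswith("evaluator/Builtin.")


builtin = evaluator_arn("Builtin.Helpfulness")
custom = evaluator_arn("my-evaluator-id", region="us-east-1", account="123456789012")
print(builtin, is_builtin_arn(builtin))
print(custom, is_builtin_arn(custom))
```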

## 06 · Running one — CLI, SDK, raw API

The fastest path is the AgentCore CLI from inside a project directory. In interactive mode it auto-discovers recent sessions from CloudWatch, so you don't need session IDs in advance; to score a specific session, pass its ID:

    # named session
    agentcore run eval \
      --runtime my-agent \
      --session-id $SESSION_ID \
      --evaluator "Builtin.Helpfulness" \
      --evaluator "Builtin.GoalSuccessRate"

    # review past runs (saved locally)
    agentcore evals history

The starter-toolkit SDK is the programmatic equivalent:

    from bedrock_agentcore_starter_toolkit import Evaluation

    eval_client = Evaluation()
    results = eval_client.run(
        agent_id="YOUR_AGENT_ID",
        session_id="YOUR_SESSION_ID",
        evaluators=["Builtin.Helpfulness", "Builtin.GoalSuccessRate"],
    )

    for r in results.get_successful_results():
        print(r.evaluator_name, f"{r.value:.2f}", r.label)

Each result carries a numeric `value`, a `label`, and an `explanation`. Drop to the raw `Evaluate` API (download spans from CloudWatch, then pass `EvaluatorId` + `SessionSpans` + optional `ReferenceInputs`) when you need ground truth or want to script batch jobs yourself.
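
Because every result has a numeric `value`, you can turn scores into a pass/fail gate. A minimal sketch: `EvalResult` is a stand-in dataclass that mirrors the attributes named above, not the toolkit's real result class, and the thresholds and the 0-to-1 score scale are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Stand-in for a result object with the fields described above."""
    evaluator_name: str
    value: float
    label: str
    explanation: str


def failing(results, thresholds):
    """Return the results whose value is below that evaluator's threshold."""
    return [
        r for r in results
        if r.evaluator_name in thresholds and r.value < thresholds[r.evaluator_name]
    ]


# Made-up results and thresholds.
results = [
    EvalResult("Builtin.Helpfulness", 0.83, "Helpful", "Answered the question directly."),
    EvalResult("Builtin.GoalSuccessRate", 0.40, "Partial", "Skipped the booking step."),
]
thresholds = {"Builtin.Helpfulness": 0.7, "Builtin.GoalSuccessRate": 0.8}

for r in failing(results, thresholds):
    print(f"FAIL {r.evaluator_name}: {r.value:.2f} ({r.label}) {r.explanation}")
```

Printing the `explanation` next to the score matters: it is usually the fastest route from "the number dropped" to "here is why".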

## 07 · Limits worth knowing

- **Evaluation configurations: 1,000 per Region per account**, of which **up to 100 active** at any one time. Plan rotation if you template a config per experiment.
- **Throughput: up to 1,000,000 input + output tokens per minute per account** in large Regions. That budget is shared across every LLM-judge evaluation you run; programmatic trajectory matchers don't draw from it.
- **Telemetry lag is real.** Spans take a couple of minutes to land in CloudWatch; the docs say wait **2–5 minutes** after invoking your agent before evaluating, or you'll score empty/partial traces.
- **Framework support is narrow today.** Only **Strands Agents** and **LangGraph** (with `opentelemetry-instrumentation-langchain` or `openinference-instrumentation-langchain`) are supported. A hand-rolled agent without one of these instrumentations won't produce traces the evaluators can read.
- **Built-in configs are immutable.** You cannot retune a built-in evaluator's model or prompt — if you need different behaviour, that's a custom evaluator. Built-ins do support cross-Region inference so the judge model can run where capacity exists.
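
The token quota above turns into a rough planning number with a little arithmetic. A back-of-the-envelope sketch, assuming one judge call per session per LLM evaluator and a made-up average of 4,000 tokens per call; real usage depends on your trace sizes and on trace-level evaluators, which run once per trace rather than once per session.

```python
TOKENS_PER_MINUTE = 1_000_000  # account-level quota quoted above (large Regions)


def minutes_needed(sessions: int, llm_evaluators: int, tokens_per_evaluation: int,
                   tokens_per_minute: int = TOKENS_PER_MINUTE) -> int:
    """Lower bound on wall-clock minutes to push a job through the token quota."""
    if min(sessions, llm_evaluators, tokens_per_evaluation) < 0 or tokens_per_minute <= 0:
        raise ValueError("counts must be non-negative and the quota positive")
    total_tokens = sessions * llm_evaluators * tokens_per_evaluation
    # Integer ceiling division: a partially used minute still counts.
    return -(-total_tokens // tokens_per_minute)


# 5,000 sessions x 2 LLM judges x 4,000 tokens = 40,000,000 tokens.
print(minutes_needed(5_000, 2, 4_000))  # 40
```

Trajectory matchers stay out of this calculation entirely, since they make no LLM calls.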

## 08 · Try it in five minutes

1. Deploy any Strands or LangGraph agent to AgentCore Runtime with observability on (ADOT in `requirements.txt`, Transaction Search enabled).
2. Invoke it a couple of times with a real `runtimeSessionId` via `invoke_agent_runtime`, then **wait 2–5 minutes** for CloudWatch to ingest the spans.
3. From the project directory, run the on-demand eval:

       agentcore run eval \
         --session-id $SESSION_ID \
         --evaluator "Builtin.Helpfulness" \
         --evaluator "Builtin.GoalSuccessRate"

4. Read the score, label, and explanation per evaluator. After each change to the agent, re-run the eval and use `agentcore evals history` to compare runs.
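
If you script step 2 instead of watching the clock, a polling loop is one way to handle the ingestion delay. A minimal sketch: `fetch_spans` is a hypothetical callable you supply (for example, a CloudWatch Logs query for the session), and the 300-second timeout simply mirrors the upper end of the wait quoted above.

```python
import time


def wait_for_spans(fetch_spans, session_id, timeout_s=300, interval_s=30,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until fetch_spans(session_id) returns something, or time out.

    Note: a non-empty result does not prove the trace is complete, so a
    short extra wait before evaluating can still be worthwhile.
    """
    deadline = clock() + timeout_s
    while True:
        spans = fetch_spans(session_id)
        if spans:
            return spans
        if clock() >= deadline:
            raise TimeoutError(f"no spans for {session_id} after {timeout_s}s")
        sleep(interval_s)


# Demo with a fake fetcher that finds spans on the third attempt; the sleep
# is replaced with a recorder so the demo finishes instantly.
attempts = iter([[], [], [{"spanId": "s1"}]])
waits = []
spans = wait_for_spans(lambda sid: next(attempts), "demo-session", sleep=waits.append)
print(spans, waits)
```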

Once that loop feels natural, add an `expectedTrajectory` and a `Builtin.TrajectoryInOrderMatch` to turn "looks fine" into a regression test you can run on every prompt change.

Tomorrow: **Amazon Bedrock Knowledge Bases** — vector stores, the chunking strategies (`FIXED_SIZE`, `HIERARCHICAL`, `SEMANTIC`, `NONE`), and how `RetrieveAndGenerate` differs from `Retrieve`.

---

**Verified against the official AWS docs on 2026-06-05.**

Sources:
- [Evaluate agent performance with Amazon Bedrock AgentCore Evaluations](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/evaluations.html)
- [Evaluation types](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/evaluations-types.html)
- [Evaluators](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/evaluators.html)
- [Built-in evaluators](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/built-in-evaluators-overview.html)
- [Ground truth evaluations](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/ground-truth-evaluations.html)
- [Getting started with on-demand evaluation](https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/getting-started-on-demand.html)

If this tip looks dated, the docs are authoritative — go check them.

---

*This page — research, writing, verification, and deployment — was built by Claude Cowork. The tip was generated this morning, cross-checked against the official AWS docs, and published to Cloudflare R2 on a schedule. A daily experiment by Monty van Emmerik · vanemmerik.ai*