Agent Evaluation

This page shows you how to score an AI agent on the tools it called, not just its final reply.

AI agents pick tools on their own, reason through multi-step problems, and call external APIs. Checking a single response is not enough. You want to know what tools the agent used, how it used them, and whether it finished the task.

Dokimos gives you nine agent evaluators and a portable data model for tool calls and tool definitions. The data model works with any framework. Five evaluators are deterministic and need no LLM, so you can run them in a unit test or a CI gate with no API key.

Quick Start

Capture the agent's tool calls, list the tools it has, then run evaluators. Copy this and adjust the tool names to your agent.

// 1. List the tools your agent can use
List<ToolDefinition> tools = List.of(
    ToolDefinition.of("search_flights", "Search for available flights", flightSchema),
    ToolDefinition.of("book_hotel", "Book a hotel room", hotelSchema)
);

// 2. Set up a judge LLM (needed for task completion and hallucination checks)
JudgeLM judge = prompt -> openAiClient.generate(prompt);

// 3. Run your agent and capture its trace
AgentTrace trace = AgentTrace.builder()
    .addToolCall(ToolCall.of("search_flights", Map.of("origin", "JFK", "destination", "CDG")))
    .addToolCall(ToolCall.of("book_hotel", Map.of("city", "Paris", "nights", 5)))
    .finalResponse("Found flights and booked your hotel in Paris.")
    .build();

// 4. Build a test case
var testCase = EvalTestCase.builder()
    .input("Find flights from NYC to Paris and book a hotel for 5 nights")
    .actualOutput("toolCalls", trace.toolCalls())
    .actualOutput("output", trace.finalResponse())
    .expectedOutput("toolCalls", List.of(
        ToolCall.of("search_flights", Map.of()),
        ToolCall.of("book_hotel", Map.of())
    ))
    .metadata("tools", tools)
    .metadata("tasks", List.of("Search for flights", "Book a hotel"))
    .build();

// 5. Pick the evaluators you need and run them
var results = List.of(
    ToolCallValidityEvaluator.builder().build().evaluate(testCase),
    ToolCorrectnessEvaluator.builder().build().evaluate(testCase),
    TaskCompletionEvaluator.builder().judge(judge).build().evaluate(testCase),
    ToolArgumentHallucinationEvaluator.builder().judge(judge).build().evaluate(testCase)
);

Evaluators

Pick from these nine. The first five need no LLM. The next two always need a judge. The last two take an optional judge.

| Evaluator | What it checks | LLM required? | Default threshold |
| --- | --- | --- | --- |
| ToolCallValidityEvaluator | Tool calls match their JSON schema (names, required params, types, enums) | No | 1.0 |
| ToolCorrectnessEvaluator | Agent used the expected set of tools | No | 1.0 |
| ToolTrajectoryEvaluator | Tool-call sequence matches an expected trajectory | No | 1.0 |
| ToolErrorEvaluator | Tool calls succeeded (no error results) | No | 1.0 |
| ToolEfficiencyEvaluator | No redundant tool calls | No | 1.0 |
| TaskCompletionEvaluator | Agent completed the user's requested tasks | Yes | 0.5 |
| ToolArgumentHallucinationEvaluator | Tool call arguments are grounded in user input | Yes | 0.8 |
| ToolNameReliabilityEvaluator | Tool names follow naming conventions (snake_case, conciseness, clarity, ordering, intent) | Optional | 0.8 |
| ToolDescriptionReliabilityEvaluator | Tool descriptions are well written (structure, clarity, args documented, examples, usage notes) | Optional | 0.8 |

ToolCallValidityEvaluator

Checks each tool call against its JSON schema. It confirms the tool name exists, required params are present, types match, enum values are valid, and no unexpected params slip in (in strict mode, or when the schema sets additionalProperties: false).

Score = fraction of valid tool calls.
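
To make the scoring rule concrete, here is a minimal standalone sketch of the idea. It is not Dokimos's validator: the ValiditySketch class and its Call record are hypothetical stand-ins, it covers only name, required-param, type, and enum checks, and it tolerates unknown params (so it models non-strict mode only). Returning 1.0 for an empty call list is an assumption.

```java
import java.util.List;
import java.util.Map;

/** Illustrative only: a tiny subset of JSON-schema checking, not the library's implementation. */
final class ValiditySketch {

    /** Hypothetical stand-in for a captured tool call. */
    record Call(String name, Map<String, Object> args) {}

    /** True when the call names a known tool, supplies all required params, and matches declared types. */
    @SuppressWarnings("unchecked")
    static boolean isValid(Call call, Map<String, Map<String, Object>> schemasByTool) {
        Map<String, Object> schema = schemasByTool.get(call.name());
        if (schema == null) return false; // unknown tool name

        List<String> required = (List<String>) schema.getOrDefault("required", List.of());
        if (!call.args().keySet().containsAll(required)) return false; // missing required param

        Map<String, Map<String, Object>> props =
                (Map<String, Map<String, Object>>) schema.getOrDefault("properties", Map.of());
        for (Map.Entry<String, Object> arg : call.args().entrySet()) {
            Map<String, Object> prop = props.get(arg.getKey());
            if (prop == null) continue; // non-strict: params missing from the schema are tolerated
            if (!matchesType(arg.getValue(), (String) prop.get("type"))) return false;
            if (prop.get("enum") instanceof List<?> allowed && !allowed.contains(arg.getValue())) return false;
        }
        return true;
    }

    static boolean matchesType(Object value, String type) {
        if (type == null) return true; // no declared type, nothing to check
        return switch (type) {
            case "string" -> value instanceof String;
            case "integer" -> value instanceof Integer || value instanceof Long;
            case "number" -> value instanceof Number;
            case "boolean" -> value instanceof Boolean;
            case "object" -> value instanceof Map;
            case "array" -> value instanceof List;
            default -> true;
        };
    }

    /** Score = valid calls / total calls. An empty list scores 1.0 here (assumption). */
    static double score(List<Call> calls, Map<String, Map<String, Object>> schemasByTool) {
        if (calls.isEmpty()) return 1.0;
        long valid = calls.stream().filter(c -> isValid(c, schemasByTool)).count();
        return (double) valid / calls.size();
    }

    public static void main(String[] args) {
        Map<String, Object> flightSchema = Map.of(
                "type", "object",
                "properties", Map.of("origin", Map.of("type", "string")),
                "required", List.of("origin"));
        Map<String, Map<String, Object>> schemas = Map.of("search_flights", flightSchema);
        List<Call> calls = List.of(
                new Call("search_flights", Map.of("origin", "JFK")), // valid
                new Call("search_flights", Map.of()));               // missing required param
        System.out.println(score(calls, schemas)); // 0.5
    }
}
```

With the default threshold of 1.0, a single invalid call like the second one above would fail the evaluation.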

ToolCorrectnessEvaluator

Compares the tools the agent used against the tools you expected. Pick one of three match modes.

| Mode | Comparison |
| --- | --- |
| NAMES_ONLY (default) | Set of tool names (F1 score) |
| NAMES_AND_ORDER | Names plus invocation order (LCS similarity) |
| NAMES_AND_ARGS | Full structural comparison including arguments |

In NAMES_AND_ARGS mode, arguments use a tolerant matcher by default, so numerically equal values like 1 and 1.0 count as equal. See Argument Matching below.
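
For intuition about the NAMES_ONLY score, the sketch below computes a standard F1 over two sets of tool names. The ToolNameF1 class is hypothetical and the edge-case handling (two empty sets score 1.0) is an assumption, so treat it as an illustration of the metric rather than the evaluator's code.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative F1 over tool-name sets; not the library's implementation. */
final class ToolNameF1 {

    static double f1(Set<String> expected, Set<String> actual) {
        if (expected.isEmpty() && actual.isEmpty()) return 1.0; // assumption: nothing expected, nothing called
        Set<String> hits = new HashSet<>(actual);
        hits.retainAll(expected);                    // names that are both expected and used
        if (hits.isEmpty()) return 0.0;
        double precision = (double) hits.size() / actual.size();   // how many used tools were expected
        double recall = (double) hits.size() / expected.size();    // how many expected tools were used
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        Set<String> expected = Set.of("search_flights", "book_hotel");
        System.out.println(f1(expected, Set.of("search_flights", "book_hotel"))); // 1.0
        System.out.println(f1(expected, Set.of("search_flights")));               // roughly 0.667
    }
}
```

An agent that skips one of two expected tools lands at roughly 0.667, which is below the default threshold of 1.0.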

ToolTrajectoryEvaluator

Scores the agent's tool-call sequence against an expected one. Deterministic, no LLM. Use it to assert how an agent should move through a task, and choose how strict the order and arguments need to be.

| Mode | Meaning | Score |
| --- | --- | --- |
| STRICT | Same calls, same order, arguments match | 0 or 1 |
| IN_ORDER | Expected appears as an ordered subsequence | graded (LCS) |
| ANY_ORDER | Same calls in any order | graded |
| SUPERSET | Actual contains every expected call (extras allowed) | 0 or 1 |
| SUBSET | Every actual call is in expected (omissions allowed) | 0 or 1 |
| PRECISION | Matched / number of actual calls | graded |
| RECALL | Matched / number of expected calls | graded |

It reads toolCalls from actualOutputs and expectedOutputs. The unordered modes use maximum bipartite matching, so repeated tool names are counted in the best possible way.

ToolTrajectoryEvaluator trajectory = ToolTrajectoryEvaluator.builder()
    .matchMode(ToolTrajectoryEvaluator.MatchMode.IN_ORDER)
    .build();

var testCase = EvalTestCase.builder()
    .actualOutput("toolCalls", trace.toolCalls())
    .expectedOutput("toolCalls", List.of(
        ToolCall.of("search_flights", Map.of()),
        ToolCall.of("book_hotel", Map.of())
    ))
    .build();
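
To see what a graded LCS score means for IN_ORDER, here is a small standalone sketch over tool names only. The LcsSketch class is hypothetical, and dividing the LCS length by the number of expected calls is an assumption about the normalization, not a documented formula.

```java
import java.util.List;

/** Illustrative longest-common-subsequence score over tool names; not the library's implementation. */
final class LcsSketch {

    /** Classic dynamic-programming LCS length. */
    static int lcsLength(List<String> expected, List<String> actual) {
        int[][] dp = new int[expected.size() + 1][actual.size() + 1];
        for (int i = 1; i <= expected.size(); i++) {
            for (int j = 1; j <= actual.size(); j++) {
                dp[i][j] = expected.get(i - 1).equals(actual.get(j - 1))
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[expected.size()][actual.size()];
    }

    /** Assumed normalization: share of expected calls found in order within the actual calls. */
    static double inOrderScore(List<String> expected, List<String> actual) {
        if (expected.isEmpty()) return 1.0; // assumption: nothing expected means nothing to miss
        return (double) lcsLength(expected, actual) / expected.size();
    }

    public static void main(String[] args) {
        List<String> expected = List.of("search_flights", "book_hotel");
        // An extra call in between does not break the ordered subsequence.
        System.out.println(inOrderScore(expected, List.of("search_flights", "get_weather", "book_hotel"))); // 1.0
        // Swapped order keeps only one of the two calls in sequence.
        System.out.println(inOrderScore(expected, List.of("book_hotel", "search_flights"))); // 0.5
    }
}
```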

By default arguments use a tolerant matcher, so numerically equal values like 1 and 1.0 match. To compare tool names and order only, pass ArgumentMatcher.of(ArgMatchMode.IGNORE). You can also override the matcher for one tool.

ToolTrajectoryEvaluator trajectory = ToolTrajectoryEvaluator.builder()
    .matchMode(ToolTrajectoryEvaluator.MatchMode.ANY_ORDER)
    .argumentMatcher(ArgumentMatcher.tolerant()) // default for every tool
    .argumentMatcher("book_hotel", ArgumentMatcher.of(ArgMatchMode.SUBSET)) // override one tool
    .build();
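
The unordered modes rely on maximum bipartite matching, and the sketch below shows why that matters once matching is looser than plain equality. It uses Kuhn's augmenting-path algorithm, which is one common way to compute a maximum matching; whether Dokimos uses this exact algorithm is not stated. The MatchingSketch class and the wildcard rule in looseNameMatch are hypothetical, chosen only so that one expected entry can match several actual calls.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiPredicate;

/** Illustrative maximum bipartite matching between expected and actual calls. */
final class MatchingSketch {

    /** Returns the size of a maximum matching (Kuhn's augmenting-path algorithm). */
    static <T> int maxMatching(List<T> expected, List<T> actual, BiPredicate<T, T> matches) {
        int[] owner = new int[actual.size()]; // owner[a] = index of the expected call matched to actual a
        Arrays.fill(owner, -1);
        int matched = 0;
        for (int e = 0; e < expected.size(); e++) {
            if (augment(e, expected, actual, matches, owner, new boolean[actual.size()])) matched++;
        }
        return matched;
    }

    private static <T> boolean augment(int e, List<T> expected, List<T> actual,
                                       BiPredicate<T, T> matches, int[] owner, boolean[] visited) {
        for (int a = 0; a < actual.size(); a++) {
            if (visited[a] || !matches.test(expected.get(e), actual.get(a))) continue;
            visited[a] = true;
            // Take a free actual call, or re-route its current owner to another actual call.
            if (owner[a] == -1 || augment(owner[a], expected, actual, matches, owner, visited)) {
                owner[a] = e;
                return true;
            }
        }
        return false;
    }

    /** Hypothetical loose rule: an expected name ending in '*' matches by prefix. */
    static boolean looseNameMatch(String expected, String actual) {
        return expected.endsWith("*")
                ? actual.startsWith(expected.substring(0, expected.length() - 1))
                : expected.equals(actual);
    }

    public static void main(String[] args) {
        List<String> expected = List.of("search_*", "search_flights");
        List<String> actual = List.of("search_flights", "search_hotels");
        // A greedy pass would give "search_*" the first actual call and strand "search_flights".
        int matched = maxMatching(expected, actual, MatchingSketch::looseNameMatch);
        System.out.println(matched);                           // 2
        System.out.println((double) matched / actual.size());  // PRECISION: matched / actual calls
        System.out.println((double) matched / expected.size()); // RECALL: matched / expected calls
    }
}
```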

ToolErrorEvaluator

Looks at each tool call's result and scores the fraction that succeeded. Deterministic, no LLM. A call counts as failed when its result is null or blank, when it is a JSON object with a top-level error field, or when it matches a custom predicate you supply.

ToolErrorEvaluator toolError = ToolErrorEvaluator.builder()
    .errorDetector(result -> result.contains("HTTP 500")) // optional, on top of the defaults
    .build();
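
The failure rules above can be approximated in a few lines of plain Java. This is a rough sketch, not the evaluator's code: ToolErrorSketch is a hypothetical name, the top-level error check is a hand-rolled scan rather than a real JSON parser (it does not validate the JSON), and scoring an empty list as 1.0 is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative failure detection for tool results; not the library's implementation. */
final class ToolErrorSketch {

    /** A result fails when it is null or blank, has a top-level "error" key, or trips the custom check. */
    static boolean failed(String result, Predicate<String> extra) {
        if (result == null || result.isBlank()) return true;
        return hasTopLevelErrorField(result) || extra.test(result);
    }

    /** Rough scan for an "error" key at depth 1 of a JSON object. */
    static boolean hasTopLevelErrorField(String json) {
        String s = json.strip();
        if (!s.startsWith("{")) return false;
        int depth = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"') {
                int end = closingQuote(s, i);
                if (end < 0) return false; // unterminated string, give up
                boolean isKey = depth == 1 && nextNonSpace(s, end + 1) == ':';
                if (isKey && s.substring(i + 1, end).equals("error")) return true;
                i = end; // skip past the string so braces inside it are ignored
            } else if (c == '{' || c == '[') {
                depth++;
            } else if (c == '}' || c == ']') {
                depth--;
            }
        }
        return false;
    }

    private static int closingQuote(String s, int open) {
        for (int i = open + 1; i < s.length(); i++) {
            if (s.charAt(i) == '\\') i++;            // skip the escaped character
            else if (s.charAt(i) == '"') return i;
        }
        return -1;
    }

    private static char nextNonSpace(String s, int from) {
        for (int i = from; i < s.length(); i++) {
            if (!Character.isWhitespace(s.charAt(i))) return s.charAt(i);
        }
        return '\0';
    }

    /** Score = succeeded calls / total calls. An empty list scores 1.0 here (assumption). */
    static double score(List<String> results, Predicate<String> extra) {
        if (results.isEmpty()) return 1.0;
        long ok = results.stream().filter(r -> !failed(r, extra)).count();
        return (double) ok / results.size();
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("{\"ok\": true}", null, "{\"error\": \"timeout\"}", "done");
        System.out.println(score(results, r -> r.contains("HTTP 500"))); // 0.5
    }
}
```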

ToolEfficiencyEvaluator

Finds redundant tool calls. The score is the ratio of distinct calls to total calls, so 1.0 means no redundancy. Two calls are redundant when they share a name and matching arguments. Consecutive duplicates also show up in the result metadata as a loop signal. Deterministic, no LLM.

ToolEfficiencyEvaluator efficiency = ToolEfficiencyEvaluator.builder().build();

Treat efficiency as a signal, not a hard gate. A legitimately repeated call (a retry, say) lowers the score, so tune the threshold to your case.
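
As a rough illustration of the ratio, the sketch below counts distinct calls with plain equals on name and arguments. The EfficiencySketch class and its Call record are hypothetical, and exact equality is a simplification of whatever argument matching the evaluator applies.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Illustrative redundancy score; not the library's implementation. */
final class EfficiencySketch {

    /** Hypothetical stand-in for a tool call; record equality compares name and arguments. */
    record Call(String name, Map<String, Object> args) {}

    /** Ratio of distinct calls to total calls. An empty list scores 1.0 here (assumption). */
    static double score(List<Call> calls) {
        if (calls.isEmpty()) return 1.0;
        return (double) new HashSet<>(calls).size() / calls.size();
    }

    /** Counts calls that repeat the call immediately before them, a possible loop signal. */
    static int consecutiveDuplicates(List<Call> calls) {
        int count = 0;
        for (int i = 1; i < calls.size(); i++) {
            if (calls.get(i).equals(calls.get(i - 1))) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<Call> calls = List.of(
                new Call("search_flights", Map.of("origin", "JFK")),
                new Call("search_flights", Map.of("origin", "JFK")), // repeat of the previous call
                new Call("book_hotel", Map.of("city", "Paris")));
        System.out.println(score(calls));                 // 2 distinct out of 3 calls
        System.out.println(consecutiveDuplicates(calls)); // 1
    }
}
```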

TaskCompletionEvaluator

Sends the user-agent dialog and a task list to a judge LLM, which decides which tasks were completed. Score = fraction of completed tasks.

Provide tasks with metadata("tasks", List.of("Search flights", "Book hotel")) and optional constraints with metadata("constraints", "Budget under $500").

ToolArgumentHallucinationEvaluator

Uses a judge LLM to check whether each tool call's argument values can be derived from the user's input. Score = fraction of non-hallucinated tool calls.

ToolNameReliabilityEvaluator

Runs up to 5 checks on each tool name. Three rule-based checks always run: snakecase_format (strict snake_case), conciseness (7 segments or fewer), and intent_over_implementation (a blocklist for patterns like _with_llm and _via_api). The LLM checks need a judge: clarity (the purpose is clear from the name alone) and name_order (the name follows operation_system_entity_data ordering), plus a semantic version of intent_over_implementation.

Without a judge, only the 3 rule-based checks run. The score is based on the checks that actually ran.
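
The three rule-based checks are simple enough to sketch with a regex and a blocklist. Everything here is an approximation: NameChecksSketch is a hypothetical class, the exact snake_case pattern is a guess, the blocklist holds only the two patterns named above, and scoring as passed checks over checks run is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Illustrative rule-based tool-name checks; not the library's implementation. */
final class NameChecksSketch {

    // Assumed pattern for strict snake_case: lowercase segments joined by single underscores.
    private static final Pattern SNAKE_CASE = Pattern.compile("[a-z][a-z0-9]*(_[a-z0-9]+)*");
    // Only the two patterns mentioned in the text; the real blocklist may be longer.
    private static final List<String> IMPLEMENTATION_HINTS = List.of("_with_llm", "_via_api");

    /** Runs the three rule-based checks and reports pass/fail per check name. */
    static Map<String, Boolean> check(String toolName) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        results.put("snakecase_format", SNAKE_CASE.matcher(toolName).matches());
        results.put("conciseness", toolName.split("_").length <= 7);
        results.put("intent_over_implementation",
                IMPLEMENTATION_HINTS.stream().noneMatch(toolName::contains));
        return results;
    }

    /** Assumed aggregation: share of the checks that ran which passed. */
    static double score(String toolName) {
        Map<String, Boolean> results = check(toolName);
        long passed = results.values().stream().filter(Boolean::booleanValue).count();
        return (double) passed / results.size();
    }

    public static void main(String[] args) {
        System.out.println(check("search_flights"));        // all three checks pass
        System.out.println(check("searchFlights_via_api")); // fails snake_case and the blocklist
    }
}
```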

ToolDescriptionReliabilityEvaluator

Runs up to 13 checks on each tool description. Four rule-based checks always run: input_arguments_clarity (params have descriptions), input_arguments_types (params have types), max_num_input_arguments (5 or fewer by default), and max_optional_input_arguments (3 or fewer by default). Nine LLM checks need a judge: general_structure, has_examples, has_usage_notes, intent_over_implementation, clarity, redundancy, input_arguments_enum, input_arguments_format, and return_statement_quality.

Without a judge, only the 4 rule-based checks run. The score is based on the checks that actually ran.
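
The four rule-based checks only need the tool's JSON schema, so they can be sketched against a plain Map. DescriptionChecksSketch is a hypothetical class, and reading "optional" as "listed in properties but not in required" is an assumption about how the evaluator counts optional arguments.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative rule-based checks on a tool's argument schema; not the library's implementation. */
final class DescriptionChecksSketch {

    @SuppressWarnings("unchecked")
    static Map<String, Boolean> check(Map<String, Object> schema, int maxArgs, int maxOptionalArgs) {
        Map<String, Map<String, Object>> props =
                (Map<String, Map<String, Object>>) schema.getOrDefault("properties", Map.of());
        List<String> required = (List<String>) schema.getOrDefault("required", List.of());
        // Assumption: an argument is optional when it is not listed under "required".
        long optional = props.keySet().stream().filter(name -> !required.contains(name)).count();

        Map<String, Boolean> results = new LinkedHashMap<>();
        results.put("input_arguments_clarity",
                props.values().stream().allMatch(p -> hasText(p.get("description"))));
        results.put("input_arguments_types",
                props.values().stream().allMatch(p -> hasText(p.get("type"))));
        results.put("max_num_input_arguments", props.size() <= maxArgs);
        results.put("max_optional_input_arguments", optional <= maxOptionalArgs);
        return results;
    }

    private static boolean hasText(Object value) {
        return value instanceof String s && !s.isBlank();
    }

    public static void main(String[] args) {
        Map<String, Object> schema = Map.of(
                "type", "object",
                "properties", Map.of(
                        "origin", Map.of("type", "string", "description", "Origin airport code"),
                        "destination", Map.of("type", "string")), // no description
                "required", List.of("origin", "destination"));
        System.out.println(check(schema, 5, 3)); // input_arguments_clarity is false, the rest are true
    }
}
```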

Argument Matching

ToolTrajectoryEvaluator and ToolCorrectnessEvaluator (in NAMES_AND_ARGS mode) compare arguments through an ArgumentMatcher. The default, TolerantArgumentMatcher, compares structurally with a few deliberate tolerances.

  • Numbers compare by value, so 1, 1.0, and 1L are equal. This is always on. Treating 1 and 1.0 as different is a JSON number-widening artifact, not a real difference.
  • Strings compare exactly by default. Whitespace trimming and case-insensitivity are opt-in, so turning them on never silently changes existing pass/fail outcomes.
  • Maps and lists compare recursively with the same rules.

ArgMatchMode sets how the key sets are compared.

| Mode | Actual arguments must... |
| --- | --- |
| EXACT | have the same keys as expected, all values matching |
| SUBSET | contain every expected entry (extra keys allowed) |
| SUPERSET | be contained in expected (omissions allowed) |
| IGNORE | not be compared at all |

ArgumentMatcher matcher = TolerantArgumentMatcher.builder()
    .mode(ArgMatchMode.SUBSET) // only the expected arguments must be present and correct
    .trimStrings(true)
    .caseInsensitive(true)
    .build();

Shortcuts: ArgumentMatcher.tolerant() gives the default tolerant matcher in EXACT mode, and ArgumentMatcher.of(mode) gives a tolerant matcher in another mode. For anything custom, pass a lambda: (expected, actual) -> ....
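
To show how the tolerances and key-set modes fit together, here is a minimal standalone sketch. It is not TolerantArgumentMatcher: the TolerantMatchSketch class and its Mode enum are hypothetical, numbers are compared through BigDecimal (which assumes finite values), nested maps are required to have identical keys as a simplification, and the opt-in string options are left out.

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Illustrative tolerant argument comparison; not the library's implementation. */
final class TolerantMatchSketch {

    enum Mode { EXACT, SUBSET, SUPERSET, IGNORE }

    static boolean argsMatch(Map<String, Object> expected, Map<String, Object> actual, Mode mode) {
        if (mode == Mode.IGNORE) return true;
        boolean keysOk = switch (mode) {
            case EXACT -> expected.keySet().equals(actual.keySet());
            case SUBSET -> actual.keySet().containsAll(expected.keySet());   // extra actual keys allowed
            case SUPERSET -> expected.keySet().containsAll(actual.keySet()); // omissions allowed
            case IGNORE -> true;
        };
        if (!keysOk) return false;
        // Every key present on both sides must carry a matching value.
        for (String key : expected.keySet()) {
            if (actual.containsKey(key) && !valuesMatch(expected.get(key), actual.get(key))) return false;
        }
        return true;
    }

    static boolean valuesMatch(Object e, Object a) {
        if (e instanceof Number x && a instanceof Number y) {
            // Compare by value so 1, 1.0, and 1L are equal (assumes finite numbers).
            return new BigDecimal(x.toString()).compareTo(new BigDecimal(y.toString())) == 0;
        }
        if (e instanceof Map<?, ?> x && a instanceof Map<?, ?> y) {
            if (!x.keySet().equals(y.keySet())) return false; // simplification: nested keys must match exactly
            for (Object key : x.keySet()) {
                if (!valuesMatch(x.get(key), y.get(key))) return false;
            }
            return true;
        }
        if (e instanceof List<?> x && a instanceof List<?> y) {
            if (x.size() != y.size()) return false;
            for (int i = 0; i < x.size(); i++) {
                if (!valuesMatch(x.get(i), y.get(i))) return false;
            }
            return true;
        }
        return Objects.equals(e, a); // strings compare exactly
    }

    public static void main(String[] args) {
        System.out.println(argsMatch(Map.of("nights", 5), Map.of("nights", 5.0), Mode.EXACT)); // true
        System.out.println(argsMatch(Map.of("city", "Paris"),
                Map.of("city", "Paris", "nights", 5), Mode.SUBSET));                            // true
        System.out.println(argsMatch(Map.of("city", "Paris"),
                Map.of("city", "Paris", "nights", 5), Mode.EXACT));                             // false
    }
}
```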

Data Model

Three records in dev.dokimos.core.agents hold agent execution data.

ToolCall

A single tool invocation: name, arguments, optional result, and metadata.

// Quick
ToolCall call = ToolCall.of("search_flights", Map.of("origin", "NYC", "destination", "LAX"));

// Full builder
ToolCall call = ToolCall.builder()
    .name("book_hotel")
    .argument("city", "Paris")
    .argument("nights", 3)
    .result("{\"confirmation\": \"ABC123\"}")
    .build();

The result is a single string. result(String) stores whatever you pass, exactly as is. Use it when your tool already produced a string. When the tool produced a structured value (a record, POJO, map, or list), use resultJson(Object) instead. It serializes the value to a compact, single-line JSON string and stores it in the same result component, so you stop hand-escaping JSON. A null value serializes to the JSON literal null.

record Confirmation(String confirmation, double total) {}

// Before: hand-escaped JSON, easy to get wrong
ToolCall.builder()
    .name("book_hotel")
    .result("{\"confirmation\": \"ABC123\", \"total\": 540.0}")
    .build();

// After: serialize the value, no escaping
ToolCall.builder()
    .name("book_hotel")
    .resultJson(new Confirmation("ABC123", 540.0))
    .build();

Read a structured result back, type-safe, with resultAs(Class<T>) or resultAs(OutputType<T>), the counterpart of resultJson. This is what makes a sequential agent's output -> input -> output chain assertable: capture each step's structured result, then read it back as a real object. Tool-call arguments read back the same way with argumentsAs(Class<T>) and argumentsAs(OutputType<T>). This is one stop on Dokimos's typed-data pipeline. See the Structured & Typed Data hub for how it connects to typed task outputs, structural matching, and the typed accessors on EvalTestCase.

ToolCall call = ToolCall.builder()
    .name("book_hotel")
    .resultJson(new Confirmation("ABC123", 540.0))
    .build();

Confirmation booked = call.resultAs(Confirmation.class); // back to a typed object

// Generic types go through OutputType, for example when a result holds a JSON array
List<Confirmation> many =
    call.resultAs(new OutputType<List<Confirmation>>() {});
Note: Both writers set the same result field, so downstream evaluators (ToolErrorEvaluator, the hallucination judge, and anything reading ToolCall.result()) see an identical string either way. resultAs parses that string as JSON (the form resultJson produces). A null or blank result returns null, and a raw non-JSON string from result(String) is not parseable, so use result() for that.

ToolDefinition

A tool's contract: name, description, and JSON schema for arguments.

ToolDefinition tool = ToolDefinition.of("search_flights", "Search for available flights", Map.of(
    "type", "object",
    "properties", Map.of(
        "origin", Map.of("type", "string", "description", "Origin airport code"),
        "destination", Map.of("type", "string", "description", "Destination airport code")
    ),
    "required", List.of("origin", "destination")
));

AgentTrace

Wraps a complete agent execution. Use toOutputMap() to produce the map format that evaluators expect ("output", "toolCalls", "reasoningSteps").

Task agentTask = example -> {
    AgentTrace trace = runAgent(example.input());
    return trace.toOutputMap();
};

When you evaluate a single trace directly, toTestCase() is a shortcut that builds a ready-to-use EvalTestCase. The tool calls, final response, and reasoning steps go into the actual outputs, and the tool definitions and tasks go into metadata. Use it so the validity and completion evaluators don't fail just because the tools or tasks entries were left out.

EvalTestCase testCase = trace.toTestCase(
    "Find flights from NYC to Paris", // user input
    tools,                            // List<ToolDefinition>, optional
    List.of("Search flights"));       // tasks, optional

// Shorter overloads when you don't need every part:
EvalTestCase justInput = trace.toTestCase("Find flights from NYC to Paris");
EvalTestCase withTools = trace.toTestCase("Find flights from NYC to Paris", tools);

Multi-turn agents

These evaluators score one set of tool calls. When tools are called across a back-and-forth conversation, attach the calls to each assistant turn and score the conversation per turn, with the same evaluators and no LLM. A ConversationTrajectory exposes toolCallsByTurn() for per-turn scoring and toTestCase(tools) / toTestCase(tools, tasks) for the whole-conversation deterministic and judge paths. See Tool Calls on Turns.

Extracting Traces from Your Framework

The examples above assume you already have an AgentTrace. In practice your agent runs on a framework, and Dokimos ships extractors that turn a framework's own run result into an AgentTrace, so you don't hand-write the mapping. Each extractor captures the tool calls (name, parsed arguments, and result) and the final response.

AiServices methods that return Result<T> carry the tool executions for a run. Pass the result to LangChain4jSupport.toAgentTrace, and convert the tool specifications with toToolDefinitions so the validity and reliability evaluators can see the tools the agent was given.

import dev.dokimos.langchain4j.LangChain4jSupport;

Result<String> result = assistant.chat(userMessage);

AgentTrace trace = LangChain4jSupport.toAgentTrace(result);
List<ToolDefinition> tools = LangChain4jSupport.toToolDefinitions(toolSpecifications);

EvalTestCase testCase = trace.toTestCase(userMessage, tools);

EvalTestCase Keys

Agent evaluators read these keys from EvalTestCase.

| Map | Key | Type | Used by |
| --- | --- | --- | --- |
| actualOutputs | "toolCalls" | List<ToolCall> | Validity, Correctness, Trajectory, Tool Error, Tool Efficiency, Hallucination |
| actualOutputs | "output" | String | Task Completion |
| expectedOutputs | "toolCalls" | List<ToolCall> | Correctness, Trajectory |
| metadata | "tools" | List<ToolDefinition> | Validity, Name Reliability, Description Reliability |
| metadata | "tasks" | List<String> | Task Completion |
| metadata | "constraints" | String | Task Completion |

Evaluator Configuration

Every evaluator uses the builder pattern. Common options:

// Rule-based: no judge needed, set the options and a threshold
ToolCallValidityEvaluator.builder()
    .strictMode(true) // Fail on any unexpected param
    .threshold(1.0)
    .build();

ToolCorrectnessEvaluator.builder()
    .matchMode(ToolCorrectnessEvaluator.MatchMode.NAMES_AND_ORDER)
    .build();

// LLM-based: provide a judge
TaskCompletionEvaluator.builder()
    .judge(judgeLM)
    .threshold(0.5)
    .build();

// Tool reliability: optional judge for semantic checks
ToolNameReliabilityEvaluator.builder()
    .judge(judgeLM) // optional
    .threshold(0.8)
    .build();

ToolDescriptionReliabilityEvaluator.builder()
    .maxInputArgs(5) // default 5
    .maxOptionalArgs(3) // default 3
    .judge(judgeLM) // optional, enables 9 additional LLM checks
    .threshold(0.8)
    .build();

Running as an Experiment

To evaluate an agent across a dataset, put tool definitions and task lists in each Example's metadata. That is where evaluators look for them at runtime.

JudgeLM judge = prompt -> openAiClient.generate(prompt);

List<ToolDefinition> tools = List.of(
ToolDefinition.of("search_flights", "Search for flights", flightSchema),
ToolDefinition.of("book_hotel", "Book a hotel room", hotelSchema)
);

// Tools and tasks go in each Example's metadata
Dataset dataset = Dataset.builder()
    .name("Travel Agent")
    .addExample(Example.builder()
        .input("input", "Find flights to Paris and book a hotel for 5 nights")
        .expectedOutput("toolCalls", List.of(
            ToolCall.of("search_flights", Map.of()),
            ToolCall.of("book_hotel", Map.of())
        ))
        .metadata("tools", tools)
        .metadata("tasks", List.of("Search flights", "Book hotel"))
        .build())
    .build();

ExperimentResult result = Experiment.builder()
    .name("Travel Agent Evaluation")
    .dataset(dataset)
    .task(example -> {
        AgentTrace trace = travelAgent.run(example.input());
        return trace.toOutputMap();
    })
    .evaluators(List.of(
        ToolCallValidityEvaluator.builder().build(),
        ToolCorrectnessEvaluator.builder().build(),
        TaskCompletionEvaluator.builder().judge(judge).build(),
        ToolArgumentHallucinationEvaluator.builder().judge(judge).build()
    ))
    .build()
    .run();

OpenAI Integration

Here is a full example that captures tool calls from an OpenAI agent and evaluates them. There are three bridge points:

  1. Convert your ToolDefinition to OpenAI's ChatCompletionTool format.
  2. Extract tool call names and arguments from the OpenAI response.
  3. Build an AgentTrace from the captured execution.

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.core.JsonValue;
import com.openai.models.*;
import com.openai.models.chat.completions.*;
import dev.dokimos.core.agents.*;

OpenAIClient client = OpenAIOkHttpClient.fromEnv();

// Define tools once, use them for both OpenAI and evaluation
List<ToolDefinition> tools = List.of(
ToolDefinition.of("search_flights", "Search for flights", flightSchema),
ToolDefinition.of("book_hotel", "Book a hotel room", hotelSchema)
);

// Convert to OpenAI format
ChatCompletionTool toOpenAITool(ToolDefinition def) {
    // Copy each top-level schema entry (type, properties, required) into the function parameters
    var params = FunctionParameters.builder();
    for (var entry : def.inputSchema().entrySet()) {
        params.putAdditionalProperty(entry.getKey(), JsonValue.from(entry.getValue()));
    }
    return ChatCompletionTool.ofFunction(
        ChatCompletionFunctionTool.builder()
            .function(FunctionDefinition.builder()
                .name(def.name())
                .description(def.description())
                .parameters(params.build())
                .build())
            .build());
}

// Run the tool-calling loop
var traceBuilder = AgentTrace.builder();
var paramsBuilder = ChatCompletionCreateParams.builder()
    .model(ChatModel.GPT_5_NANO)
    .addUserMessage("Find flights to Paris and book a hotel for 5 nights");
tools.forEach(t -> paramsBuilder.addTool(toOpenAITool(t)));

// Up to 10 round-trips, because the model may call tools across several turns
for (int i = 0; i < 10; i++) {
    var completion = client.chat().completions().create(paramsBuilder.build());
    var message = completion.choices().get(0).message();
    paramsBuilder.addMessage(message);

    var toolCalls = message.toolCalls().orElse(List.of());
    if (toolCalls.isEmpty()) {
        // No tool calls: the model returned its final text response
        traceBuilder.finalResponse(message.content().orElse(""));
        break;
    }

    for (var toolCall : toolCalls) {
        var func = toolCall.asFunction();
        var function = func.function();
        // Parse the arguments once and reuse them for execution and for the trace
        var arguments = function.arguments(Map.class);
        String result = yourApp.executeTool(function.name(), arguments);

        traceBuilder.addToolCall(ToolCall.builder()
            .name(function.name())
            .arguments(arguments)
            .result(result)
            .build());

        paramsBuilder.addMessage(ChatCompletionToolMessageParam.builder()
            .toolCallId(func.id())
            .content(result)
            .build());
    }
}

AgentTrace trace = traceBuilder.build();

// Evaluate
var testCase = EvalTestCase.builder()
    .input("Find flights to Paris and book a hotel for 5 nights")
    .actualOutput("toolCalls", trace.toolCalls())
    .actualOutput("output", trace.finalResponse())
    .metadata("tools", tools)
    .metadata("tasks", List.of("Search for flights", "Book a hotel"))
    .build();

var result = ToolCallValidityEvaluator.builder().build().evaluate(testCase);

The loop runs up to 10 iterations because the model may call tools across several turns. It might search first, then book based on those results. Each iteration is one API round-trip, and the loop exits when the model returns a final text response instead of tool calls.

See OpenAIAgentEvaluationExample.java for a complete runnable example.

Best Practices

  • Start with rule-based evaluators. ToolCallValidityEvaluator and ToolCorrectnessEvaluator need no LLM and give fast, deterministic feedback. Add LLM-based evaluators once the basics pass.
  • Evaluate tool definitions in CI. Use ToolNameReliabilityEvaluator and ToolDescriptionReliabilityEvaluator to catch tool definition quality issues before they change agent behavior.
  • Use AgentTrace for consistent data flow. Build AgentTrace objects in your Task and call toOutputMap() to produce the standard format every evaluator expects.
  • Combine with standard evaluators. Use LLMJudgeEvaluator to check the quality of the agent's final response alongside the tool-level checks.