Agent Evaluation
This page shows you how to score an AI agent on the tools it called, not just its final reply.
AI agents pick tools on their own, reason through multi-step problems, and call external APIs. Checking a single response is not enough. You want to know what tools the agent used, how it used them, and whether it finished the task.
Dokimos gives you nine agent evaluators and a portable data model for tool calls and tool definitions. The data model works with any framework. Five evaluators are deterministic and need no LLM, so you can run them in a unit test or a CI gate with no API key.
Quick Start
Capture the agent's tool calls, list the tools it has, then run evaluators. Copy this and adjust the tool names to your agent.
- Java
- Kotlin
// 1. List the tools your agent can use
List<ToolDefinition> tools = List.of(
ToolDefinition.of("search_flights", "Search for available flights", flightSchema),
ToolDefinition.of("book_hotel", "Book a hotel room", hotelSchema)
);
// 2. Set up a judge LLM (needed for task completion and hallucination checks)
JudgeLM judge = prompt -> openAiClient.generate(prompt);
// 3. Run your agent and capture its trace
AgentTrace trace = AgentTrace.builder()
.addToolCall(ToolCall.of("search_flights", Map.of("origin", "JFK", "destination", "CDG")))
.addToolCall(ToolCall.of("book_hotel", Map.of("city", "Paris", "nights", 5)))
.finalResponse("Found flights and booked your hotel in Paris.")
.build();
// 4. Build a test case
var testCase = EvalTestCase.builder()
.input("Find flights from NYC to Paris and book a hotel for 5 nights")
.actualOutput("toolCalls", trace.toolCalls())
.actualOutput("output", trace.finalResponse())
.expectedOutput("toolCalls", List.of(
ToolCall.of("search_flights", Map.of()),
ToolCall.of("book_hotel", Map.of())
))
.metadata("tools", tools)
.metadata("tasks", List.of("Search for flights", "Book a hotel"))
.build();
// 5. Pick the evaluators you need and run them
var results = List.of(
ToolCallValidityEvaluator.builder().build().evaluate(testCase),
ToolCorrectnessEvaluator.builder().build().evaluate(testCase),
TaskCompletionEvaluator.builder().judge(judge).build().evaluate(testCase),
ToolArgumentHallucinationEvaluator.builder().judge(judge).build().evaluate(testCase)
);
val judge = JudgeLM { prompt -> openAiClient.generate(prompt) }
val result = experiment {
name = "Travel Agent Evaluation"
dataset(dataset)
task { example ->
val trace = travelAgent.run(example.input())
trace.toOutputMap()
}
evaluators {
toolCallValidity { }
toolCorrectness { }
taskCompletion(judge) { }
toolArgumentHallucination(judge) { }
}
}.run()
Evaluators
Pick from these nine. The first five need no LLM. The next two always need a judge. The last two take an optional judge.
| Evaluator | What it checks | LLM required? | Default threshold |
|---|---|---|---|
| ToolCallValidityEvaluator | Tool calls match their JSON schema (names, required params, types, enums) | No | 1.0 |
| ToolCorrectnessEvaluator | Agent used the expected set of tools | No | 1.0 |
| ToolTrajectoryEvaluator | Tool-call sequence matches an expected trajectory | No | 1.0 |
| ToolErrorEvaluator | Tool calls succeeded (no error results) | No | 1.0 |
| ToolEfficiencyEvaluator | No redundant tool calls | No | 1.0 |
| TaskCompletionEvaluator | Agent completed the user's requested tasks | Yes | 0.5 |
| ToolArgumentHallucinationEvaluator | Tool call arguments are grounded in user input | Yes | 0.8 |
| ToolNameReliabilityEvaluator | Tool names follow naming conventions (snake_case, conciseness, clarity, ordering, intent) | Optional | 0.8 |
| ToolDescriptionReliabilityEvaluator | Tool descriptions are well written (structure, clarity, args documented, examples, usage notes) | Optional | 0.8 |
ToolCallValidityEvaluator
Checks each tool call against its JSON schema. It confirms the tool name exists, required params are present, types match, enum values are valid, and no unexpected params slip in (in strict mode, or when the schema sets additionalProperties: false).
Score = fraction of valid tool calls.
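To make the scoring rule concrete, here is a minimal sketch in plain Java. It is not the Dokimos implementation: the class ValidityScoreSketch and its Call record are hypothetical, it checks only that the tool name is known and the required parameters are present (no type or enum checks), and scoring an empty call list as 1.0 is an assumption.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: a stand-in for the idea, not the Dokimos implementation. */
final class ValidityScoreSketch {

    /** Hypothetical minimal call shape: a tool name plus its arguments. */
    record Call(String name, Map<String, Object> args) {}

    /** Valid here means: the tool is known and every required parameter is present. */
    static boolean isValid(Call call, Map<String, Set<String>> requiredByTool) {
        Set<String> required = requiredByTool.get(call.name());
        return required != null && call.args().keySet().containsAll(required);
    }

    /** Fraction of valid calls. An empty list scores 1.0 (an assumption of this sketch). */
    static double score(List<Call> calls, Map<String, Set<String>> requiredByTool) {
        if (calls.isEmpty()) {
            return 1.0;
        }
        long valid = calls.stream().filter(c -> isValid(c, requiredByTool)).count();
        return (double) valid / calls.size();
    }

    public static void main(String[] args) {
        Map<String, Set<String>> required = Map.of(
                "search_flights", Set.of("origin", "destination"),
                "book_hotel", Set.of("city", "nights"));
        List<Call> calls = List.of(
                new Call("search_flights", Map.of("origin", "JFK", "destination", "CDG")),
                new Call("book_hotel", Map.of("city", "Paris")),  // missing "nights"
                new Call("rent_car", Map.of("city", "Paris")));   // unknown tool
        System.out.println(score(calls, required)); // 1 of 3 calls is valid
    }
}
```

With the default threshold of 1.0, a single invalid call is enough to fail the test case, which is usually what you want in a CI gate.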
ToolCorrectnessEvaluator
Compares the tools the agent used against the tools you expected. Pick one of three match modes.
| Mode | Comparison |
|---|---|
| NAMES_ONLY (default) | Set of tool names (F1 score) |
| NAMES_AND_ORDER | Names plus invocation order (LCS similarity) |
| NAMES_AND_ARGS | Full structural comparison including arguments |
In NAMES_AND_ARGS mode, arguments use a tolerant matcher by default, so numerically equal values like 1 and 1.0 count as equal. See Argument Matching below.
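For intuition about the NAMES_ONLY score, the sketch below computes a standard F1 over two sets of tool names. It is plain Java, not Dokimos code: the class ToolNameF1Sketch is hypothetical, and the handling of empty sets (both empty scores 1.0, no overlap scores 0.0) is an assumption rather than documented library behavior.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: standard set-based F1, not the Dokimos implementation. */
final class ToolNameF1Sketch {

    static double f1(List<String> expectedNames, List<String> actualNames) {
        Set<String> expected = new HashSet<>(expectedNames);
        Set<String> actual = new HashSet<>(actualNames);
        if (expected.isEmpty() && actual.isEmpty()) {
            return 1.0; // nothing expected, nothing called (assumed convention)
        }
        Set<String> common = new HashSet<>(expected);
        common.retainAll(actual);
        if (common.isEmpty()) {
            return 0.0;
        }
        double precision = (double) common.size() / actual.size();
        double recall = (double) common.size() / expected.size();
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        List<String> expected = List.of("search_flights", "book_hotel");
        // One expected tool is missing and one unexpected tool was called:
        // precision and recall are both 1/2.
        System.out.println(f1(expected, List.of("search_flights", "get_weather")));
    }
}
```

Because the comparison is over sets, calling the same tool twice does not change this score. Use NAMES_AND_ORDER or the trajectory evaluator when repetition or order matters.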
ToolTrajectoryEvaluator
Scores the agent's tool-call sequence against an expected one. Deterministic, no LLM. Use it to assert how an agent should move through a task, and choose how strict the order and arguments need to be.
| Mode | Meaning | Score |
|---|---|---|
| STRICT | Same calls, same order, arguments match | 0 or 1 |
| IN_ORDER | Expected appears as an ordered subsequence | graded (LCS) |
| ANY_ORDER | Same calls in any order | graded |
| SUPERSET | Actual contains every expected call (extras allowed) | 0 or 1 |
| SUBSET | Every actual call is in expected (omissions allowed) | 0 or 1 |
| PRECISION | Matched / number of actual calls | graded |
| RECALL | Matched / number of expected calls | graded |
It reads toolCalls from actualOutputs and expectedOutputs. The unordered modes use maximum bipartite matching, so repeated tool names are counted in the best possible way.
- Java
- Kotlin
ToolTrajectoryEvaluator trajectory = ToolTrajectoryEvaluator.builder()
.matchMode(ToolTrajectoryEvaluator.MatchMode.IN_ORDER)
.build();
var testCase = EvalTestCase.builder()
.actualOutput("toolCalls", trace.toolCalls())
.expectedOutput("toolCalls", List.of(
ToolCall.of("search_flights", Map.of()),
ToolCall.of("book_hotel", Map.of())
))
.build();
val trajectory = toolTrajectory {
matchMode = ToolTrajectoryEvaluator.MatchMode.IN_ORDER
}
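To see what a graded IN_ORDER score can look like, the sketch below computes a longest common subsequence (LCS) over tool names. It is plain Java, not Dokimos code: the class InOrderScoreSketch is hypothetical, it ignores arguments, and dividing the LCS length by the number of expected calls is an assumed normalization that the library may not share.

```java
import java.util.List;

/** Illustrative only: LCS over tool names, not the Dokimos implementation. */
final class InOrderScoreSketch {

    /** Classic dynamic-programming LCS length. */
    static int lcsLength(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                dp[i][j] = a.get(i - 1).equals(b.get(j - 1))
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
            }
        }
        return dp[a.size()][b.size()];
    }

    /** Share of expected calls that appear in the actual sequence in the right order. */
    static double inOrderScore(List<String> expected, List<String> actual) {
        if (expected.isEmpty()) {
            return 1.0; // nothing to match (assumed convention)
        }
        return (double) lcsLength(expected, actual) / expected.size();
    }

    public static void main(String[] args) {
        List<String> expected = List.of("search_flights", "book_hotel");
        // An extra call in the middle does not break the ordered subsequence.
        System.out.println(inOrderScore(expected,
                List.of("search_flights", "get_weather", "book_hotel")));
        // Reversed order: only one of the two expected calls can be matched in order.
        System.out.println(inOrderScore(expected,
                List.of("book_hotel", "search_flights")));
    }
}
```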
By default arguments use a tolerant matcher, so numerically equal values like 1 and 1.0 match. To compare tool names and order only, pass ArgumentMatcher.of(ArgMatchMode.IGNORE). You can also override the matcher for one tool.
- Java
- Kotlin
ToolTrajectoryEvaluator trajectory = ToolTrajectoryEvaluator.builder()
.matchMode(ToolTrajectoryEvaluator.MatchMode.ANY_ORDER)
.argumentMatcher(ArgumentMatcher.tolerant()) // default for every tool
.argumentMatcher("book_hotel", ArgumentMatcher.of(ArgMatchMode.SUBSET)) // override one tool
.build();
val trajectory = toolTrajectory {
matchMode = ToolTrajectoryEvaluator.MatchMode.ANY_ORDER
argumentMatcher = ArgumentMatcher.tolerant() // default for every tool
argumentMatcher("book_hotel", ArgumentMatcher.of(ArgMatchMode.SUBSET)) // override one tool
}
ToolErrorEvaluator
Looks at each tool call's result and scores the fraction that succeeded. Deterministic, no LLM. A call counts as failed when its result is null or blank, when it is a JSON object with a top-level error field, or when it matches a custom predicate you supply.
- Java
- Kotlin
ToolErrorEvaluator toolError = ToolErrorEvaluator.builder()
.errorDetector(result -> result.contains("HTTP 500")) // optional, on top of the defaults
.build();
val toolError = toolError {
errorDetector = { it.contains("HTTP 500") } // optional, on top of the defaults
}
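The sketch below shows how a custom detector can sit on top of a default rule when computing the success fraction. It is plain Java, not Dokimos code: the class ToolErrorScoreSketch is hypothetical, it covers only the null-or-blank rule and the custom predicate, and it leaves out the JSON error-field rule because that needs a JSON parser. Scoring an empty list as 1.0 is an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative only: a partial stand-in for the failure rules, not the Dokimos implementation. */
final class ToolErrorScoreSketch {

    /** A result fails when it is null or blank, or when the custom detector flags it. */
    static boolean failed(String result, Predicate<String> customDetector) {
        if (result == null || result.isBlank()) {
            return true;
        }
        return customDetector.test(result);
    }

    /** Fraction of results that did not fail. */
    static double score(List<String> results, Predicate<String> customDetector) {
        if (results.isEmpty()) {
            return 1.0; // no calls, no errors (assumed convention)
        }
        long succeeded = results.stream().filter(r -> !failed(r, customDetector)).count();
        return (double) succeeded / results.size();
    }

    public static void main(String[] args) {
        // Arrays.asList is used because List.of rejects null elements.
        List<String> results = Arrays.asList(
                "{\"confirmation\": \"ABC123\"}", // fine
                "",                               // blank: failed
                null,                             // null: failed
                "HTTP 500 from upstream");        // flagged by the custom detector
        System.out.println(score(results, r -> r.contains("HTTP 500"))); // 1 of 4 succeeded
    }
}
```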
ToolEfficiencyEvaluator
Finds redundant tool calls. The score is the ratio of distinct calls to total calls, so 1.0 means no redundancy. Two calls are redundant when they share a name and matching arguments. Consecutive duplicates also show up in the result metadata as a loop signal. Deterministic, no LLM.
- Java
- Kotlin
ToolEfficiencyEvaluator efficiency = ToolEfficiencyEvaluator.builder().build();
val efficiency = toolEfficiency { }
Treat efficiency as a signal, not a hard gate. A legitimately repeated call (a retry, say) lowers the score, so tune the threshold to your case.
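The ratio is easy to reproduce by hand. The sketch below is plain Java, not Dokimos code: the class EfficiencySketch and its Call record are hypothetical, arguments are compared with plain equals() instead of the library's argument matching, and scoring an empty list as 1.0 is an assumption.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Illustrative only: distinct-to-total ratio, not the Dokimos implementation. */
final class EfficiencySketch {

    /** Hypothetical minimal call shape. Record equality compares name and arguments. */
    record Call(String name, Map<String, Object> args) {}

    /** Distinct calls divided by total calls, so 1.0 means no redundancy. */
    static double score(List<Call> calls) {
        if (calls.isEmpty()) {
            return 1.0; // assumed convention
        }
        return (double) new HashSet<>(calls).size() / calls.size();
    }

    /** Counts calls that repeat the call immediately before them, a possible loop signal. */
    static int consecutiveDuplicates(List<Call> calls) {
        int count = 0;
        for (int i = 1; i < calls.size(); i++) {
            if (calls.get(i).equals(calls.get(i - 1))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Call search = new Call("search_flights", Map.of("origin", "JFK", "destination", "CDG"));
        Call book = new Call("book_hotel", Map.of("city", "Paris", "nights", 5));
        List<Call> calls = List.of(search, search, book);
        System.out.println(score(calls));                 // 2 distinct calls out of 3
        System.out.println(consecutiveDuplicates(calls)); // the second search repeats the first
    }
}
```

In this example a single retry drops the score to two thirds, which is why a threshold below 1.0 is often the practical choice for agents that retry.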
TaskCompletionEvaluator
Sends the user-agent dialog and a task list to a judge LLM, which decides which tasks were completed. Score = fraction of completed tasks.
Provide tasks with metadata("tasks", List.of("Search flights", "Book hotel")) and optional constraints with metadata("constraints", "Budget under $500").
ToolArgumentHallucinationEvaluator
Uses a judge LLM to check whether each tool call's argument values can be derived from the user's input. Score = fraction of non-hallucinated tool calls.
ToolNameReliabilityEvaluator
Runs 5 checks on each tool name. Three rule-based checks always run: snakecase_format (strict snake_case), conciseness (7 segments or fewer), and intent_over_implementation (a blocklist for patterns like _with_llm and _via_api). The LLM checks need a judge: clarity (purpose clear from the name alone), name_order (follows operation_system_entity_data ordering), and a semantic version of intent_over_implementation.
Without a judge, only the 3 rule-based checks run. The score is based on the checks that actually ran.
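The three rule-based checks are simple enough to sketch in plain Java. This is not Dokimos code: the class ToolNameChecksSketch is hypothetical, the regular expression is one assumed reading of strict snake_case, the blocklist holds only the two patterns named above, and averaging the passed checks is an assumed scoring rule.

```java
import java.util.List;
import java.util.regex.Pattern;

/** Illustrative only: rule-based name checks, not the Dokimos implementation. */
final class ToolNameChecksSketch {

    // Assumed definition: lowercase letters and digits, words joined by single underscores.
    private static final Pattern SNAKE_CASE = Pattern.compile("[a-z][a-z0-9]*(_[a-z0-9]+)*");

    // Only the two example patterns from the text; the real blocklist is likely longer.
    private static final List<String> BLOCKLIST = List.of("_with_llm", "_via_api");

    static boolean isSnakeCase(String name) {
        return SNAKE_CASE.matcher(name).matches();
    }

    static boolean isConcise(String name) {
        return name.split("_").length <= 7;
    }

    static boolean avoidsImplementationDetail(String name) {
        return BLOCKLIST.stream().noneMatch(name::contains);
    }

    /** Fraction of the three checks that pass (assumed scoring rule). */
    static double score(String name) {
        List<Boolean> checks = List.of(
                isSnakeCase(name), isConcise(name), avoidsImplementationDetail(name));
        long passed = checks.stream().filter(Boolean::booleanValue).count();
        return (double) passed / checks.size();
    }

    public static void main(String[] args) {
        System.out.println(score("search_flights"));            // passes all three checks
        System.out.println(score("searchFlights"));             // fails the snake_case check
        System.out.println(score("summarize_ticket_with_llm")); // fails the blocklist check
    }
}
```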
ToolDescriptionReliabilityEvaluator
Runs 13 checks on each tool description. Four rule-based checks always run: input_arguments_clarity (params have descriptions), input_arguments_types (params have types), max_num_input_arguments (5 or fewer by default), and max_optional_input_arguments (3 or fewer by default). Nine LLM checks need a judge: general_structure, has_examples, has_usage_notes, intent_over_implementation, clarity, redundancy, input_arguments_enum, input_arguments_format, and return_statement_quality.
Without a judge, only the 4 rule-based checks run. The score is based on the checks that actually ran.
Argument Matching
ToolTrajectoryEvaluator and ToolCorrectnessEvaluator (in NAMES_AND_ARGS mode) compare arguments through an ArgumentMatcher. The default, TolerantArgumentMatcher, compares structurally with a few deliberate tolerances.
- Numbers compare by value, so 1, 1.0, and 1L are equal. This is always on. Treating 1 and 1.0 as different is a JSON number-widening artifact, not a real difference.
- Strings compare exactly by default. Whitespace trimming and case-insensitivity are opt-in, so existing pass/fail outcomes never change unless you turn them on.
- Maps and lists compare recursively with the same rules.
ArgMatchMode sets how the key sets are compared.
| Mode | Actual arguments must... |
|---|---|
| EXACT | have the same keys as expected, all values matching |
| SUBSET | contain every expected entry (extra keys allowed) |
| SUPERSET | be contained in expected (omissions allowed) |
| IGNORE | not be compared at all |
- Java
- Kotlin
ArgumentMatcher matcher = TolerantArgumentMatcher.builder()
.mode(ArgMatchMode.SUBSET) // only the expected arguments must be present and correct
.trimStrings(true)
.caseInsensitive(true)
.build();
val matcher = TolerantArgumentMatcher.builder()
.mode(ArgMatchMode.SUBSET) // only the expected arguments must be present and correct
.trimStrings(true)
.caseInsensitive(true)
.build()
Shortcuts: ArgumentMatcher.tolerant() gives the default EXACT matcher, and ArgumentMatcher.of(mode) gives a tolerant matcher in another mode. For anything custom, pass a lambda: (expected, actual) -> ....
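To show what these rules mean in practice, here is a minimal sketch in plain Java of value-based number comparison with recursive map and list handling. It is not the Dokimos TolerantArgumentMatcher: the class TolerantMatchSketch is hypothetical, it models only EXACT and SUBSET through a boolean flag, strings always compare exactly, and applying the same mode to nested maps is an assumption.

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Illustrative only: a stand-in for the matching rules, not the Dokimos matcher. */
final class TolerantMatchSketch {

    /** Compares two argument maps. With subset = true, extra keys in actual are allowed. */
    static boolean mapsMatch(Map<?, ?> expected, Map<?, ?> actual, boolean subset) {
        if (!subset && !expected.keySet().equals(actual.keySet())) {
            return false;
        }
        for (Map.Entry<?, ?> entry : expected.entrySet()) {
            if (!actual.containsKey(entry.getKey())
                    || !valuesMatch(entry.getValue(), actual.get(entry.getKey()), subset)) {
                return false;
            }
        }
        return true;
    }

    /** Numbers compare by value, maps and lists recurse, everything else uses equals(). */
    static boolean valuesMatch(Object expected, Object actual, boolean subset) {
        if (expected instanceof Number && actual instanceof Number) {
            return numbersEqual((Number) expected, (Number) actual);
        }
        if (expected instanceof Map && actual instanceof Map) {
            return mapsMatch((Map<?, ?>) expected, (Map<?, ?>) actual, subset);
        }
        if (expected instanceof List && actual instanceof List) {
            List<?> e = (List<?>) expected;
            List<?> a = (List<?>) actual;
            if (e.size() != a.size()) {
                return false;
            }
            for (int i = 0; i < e.size(); i++) {
                if (!valuesMatch(e.get(i), a.get(i), subset)) {
                    return false;
                }
            }
            return true;
        }
        return Objects.equals(expected, actual);
    }

    /** 1, 1.0, and 1L are equal here; NaN and infinities fall back to Double.compare. */
    private static boolean numbersEqual(Number a, Number b) {
        double da = a.doubleValue();
        double db = b.doubleValue();
        if (!Double.isFinite(da) || !Double.isFinite(db)) {
            return Double.compare(da, db) == 0;
        }
        return new BigDecimal(a.toString()).compareTo(new BigDecimal(b.toString())) == 0;
    }

    public static void main(String[] args) {
        Map<String, Object> expected = Map.of("city", "Paris");
        Map<String, Object> actual = Map.of("city", "Paris", "nights", 5);
        System.out.println(mapsMatch(Map.of("nights", 1), Map.of("nights", 1.0), false)); // true
        System.out.println(mapsMatch(expected, actual, false)); // false: extra key in EXACT mode
        System.out.println(mapsMatch(expected, actual, true));  // true: extra key allowed in SUBSET mode
    }
}
```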
Data Model
Three records in dev.dokimos.core.agents hold agent execution data.
ToolCall
A single tool invocation: name, arguments, optional result, and metadata.
// Quick
ToolCall call = ToolCall.of("search_flights", Map.of("origin", "NYC", "destination", "LAX"));
// Full builder
ToolCall call = ToolCall.builder()
.name("book_hotel")
.argument("city", "Paris")
.argument("nights", 3)
.result("{\"confirmation\": \"ABC123\"}")
.build();
The result is a single string. result(String) stores whatever you pass, exactly as is. Use it when your tool already produced a string. When the tool produced a structured value (a record, POJO, map, or list), use resultJson(Object) instead. It serializes the value to a compact, single-line JSON string and stores it in the same result component, so you stop hand-escaping JSON. A null value serializes to the JSON literal null.
record Confirmation(String confirmation, double total) {}
// Before: hand-escaped JSON, easy to get wrong
ToolCall.builder()
.name("book_hotel")
.result("{\"confirmation\": \"ABC123\", \"total\": 540.0}")
.build();
// After: serialize the value, no escaping
ToolCall.builder()
.name("book_hotel")
.resultJson(new Confirmation("ABC123", 540.0))
.build();
Read a structured result back, type-safe, with resultAs(Class<T>) or resultAs(OutputType<T>), the counterpart of resultJson. This is what makes a sequential agent's output -> input -> output chain assertable: capture each step's structured result, then read it back as a real object. Tool-call arguments read back the same way with argumentsAs(Class<T>) and argumentsAs(OutputType<T>). This is one stop on Dokimos's typed-data pipeline. See the Structured & Typed Data hub for how it connects to typed task outputs, structural matching, and the typed accessors on EvalTestCase.
ToolCall call = ToolCall.builder()
.name("book_hotel")
.resultJson(new Confirmation("ABC123", 540.0))
.build();
Confirmation booked = call.resultAs(Confirmation.class); // back to a typed object
List<Confirmation> many =
call.resultAs(new OutputType<List<Confirmation>>() {}); // generics via OutputType
Both writers set the same result field, so downstream evaluators (ToolErrorEvaluator, the hallucination judge, and anything reading ToolCall.result()) see an identical string either way. resultAs parses that string as JSON (the form resultJson produces). A null or blank result returns null, and a raw non-JSON string from result(String) is not parseable, so use result() for that.
ToolDefinition
A tool's contract: name, description, and JSON schema for arguments.
ToolDefinition tool = ToolDefinition.of("search_flights", "Search for available flights", Map.of(
"type", "object",
"properties", Map.of(
"origin", Map.of("type", "string", "description", "Origin airport code"),
"destination", Map.of("type", "string", "description", "Destination airport code")
),
"required", List.of("origin", "destination")
));
AgentTrace
Wraps a complete agent execution. Use toOutputMap() to produce the map format that evaluators expect ("output", "toolCalls", "reasoningSteps").
Task agentTask = example -> {
AgentTrace trace = runAgent(example.input());
return trace.toOutputMap();
};
When you evaluate a single trace directly, toTestCase() is a shortcut that builds a ready-to-use EvalTestCase. The tool calls, final response, and reasoning steps go into the actual outputs, and the tool definitions and tasks go into metadata. Use it so the validity and completion evaluators don't fail just because the tools or tasks entries were left out.
- Java
- Kotlin
EvalTestCase testCase = trace.toTestCase(
"Find flights from NYC to Paris", // user input
tools, // List<ToolDefinition>, optional
List.of("Search flights")); // tasks, optional
// Shorter overloads when you don't need every part:
EvalTestCase justInput = trace.toTestCase("Find flights from NYC to Paris");
EvalTestCase withTools = trace.toTestCase("Find flights from NYC to Paris", tools);
val testCase = trace.toTestCase(
"Find flights from NYC to Paris", // user input
tools, // List<ToolDefinition>, optional
listOf("Search flights")) // tasks, optional
// Shorter overloads when you don't need every part:
val justInput = trace.toTestCase("Find flights from NYC to Paris")
val withTools = trace.toTestCase("Find flights from NYC to Paris", tools)
These evaluators score one set of tool calls. When tools are called across a back-and-forth conversation, attach the calls to each assistant turn and score the conversation per turn, with the same evaluators and no LLM. A ConversationTrajectory exposes toolCallsByTurn() for per-turn scoring and toTestCase(tools) / toTestCase(tools, tasks) for the whole-conversation deterministic and judge paths. See Tool Calls on Turns.
Extracting Traces from Your Framework
The examples above assume you already have an AgentTrace. In practice your agent runs on a framework, and Dokimos ships extractors that turn a framework's own run result into an AgentTrace, so you don't hand-write the mapping. Each extractor captures the tool calls (name, parsed arguments, and result) and the final response.
- LangChain4j
- Spring AI
- Koog
- Embabel
- Spring AI Alibaba
- OpenAI
In LangChain4j, AiServices methods that return Result<T> carry the tool executions for a run. Pass the result to LangChain4jSupport.toAgentTrace, and convert the tool specifications with toToolDefinitions so the validity and reliability evaluators can see the tools the agent was given.
import dev.dokimos.langchain4j.LangChain4jSupport;
Result<String> result = assistant.chat(userMessage);
AgentTrace trace = LangChain4jSupport.toAgentTrace(result);
List<ToolDefinition> tools = LangChain4jSupport.toToolDefinitions(toolSpecifications);
EvalTestCase testCase = trace.toTestCase(userMessage, tools);
In Spring AI, an AssistantMessage carries the tool calls the model made, and the results come back in the ToolResponseMessages. Pass both so the trace carries the calls and what the tools returned (matched by tool-call id).
import dev.dokimos.springai.SpringAiSupport;
AgentTrace trace = SpringAiSupport.toAgentTrace(assistantMessage, toolResponseMessages);
List<ToolDefinition> tools = SpringAiSupport.toToolDefinitions(toolDefinitions);
EvalTestCase testCase = trace.toTestCase(userMessage, tools);
Koog reports tool calls through its event handler. Install a KoogTraceCollector with collectAgentTrace, run the agent, then read the trace.
import dev.dokimos.koog.KoogTraceCollector
import dev.dokimos.koog.collectAgentTrace
val collector = KoogTraceCollector()
val agent = AIAgent(/* ... */) {
install(EventHandler) { collectAgentTrace(collector) }
}
val response = agent.run(userInput)
val testCase = collector.toAgentTrace(response).toTestCase(userInput, tools)
The collector works across Koog versions: it reads the completion context reflectively, so one build works with Koog 0.6.4 through 1.0.0.
Embabel reports tool calls through its AgenticEventListener. Attach an EmbabelTraceCollector to your run with EmbabelSupport.attach, run the agent, then read the trace. The tool definitions are synthesized from the observed tool names with an empty schema, so build them by hand for full ToolDescriptionReliabilityEvaluator coverage.
import dev.dokimos.embabel.EmbabelSupport;
import dev.dokimos.embabel.EmbabelTraceCollector;
EmbabelTraceCollector collector = EmbabelSupport.attach(invocationBuilder);
String response = invocationBuilder.build(String.class).invoke(userInput);
AgentTrace trace = collector.trace();
List<ToolDefinition> tools = EmbabelSupport.toToolDefinitions(collector);
EvalTestCase testCase = trace.toTestCase(userInput, tools);
See the Embabel integration for the full flow and limitations.
For Spring AI Alibaba Graph agents, SpringAiAlibabaSupport.toAgentTrace reads the run's OverAllState and windows over its messages list to recover the tool calls per turn. Convert the tool callbacks the agent was given with toToolDefinitions.
import dev.dokimos.springai.alibaba.SpringAiAlibabaSupport;
AgentTrace trace = SpringAiAlibabaSupport.toAgentTrace(state);
List<ToolDefinition> tools = SpringAiAlibabaSupport.toToolDefinitions(toolCallbacks);
EvalTestCase testCase = trace.toTestCase(userInput, tools);
The OpenAI Java SDK has no published Dokimos module, so a small reusable bridge lives in the examples module (copy it into your project). It turns the SDK's tool calls into Dokimos ToolCalls as your tool-calling loop runs.
AgentTrace.Builder trace = AgentTrace.builder();
for (var toolCall : message.toolCalls().orElse(List.of())) {
String result = myApp.execute(toolCall);
trace.addToolCall(OpenAiAgentTraces.toToolCall(toolCall, result));
}
trace.finalResponse(finalMessage.content().orElse(""));
EvalTestCase testCase = trace.build().toTestCase(userMessage, tools);
EvalTestCase Keys
Agent evaluators read these keys from EvalTestCase.
| Map | Key | Type | Used by |
|---|---|---|---|
| actualOutputs | "toolCalls" | List<ToolCall> | Validity, Correctness, Trajectory, Tool Error, Tool Efficiency, Hallucination |
| actualOutputs | "output" | String | Task Completion |
| expectedOutputs | "toolCalls" | List<ToolCall> | Correctness, Trajectory |
| metadata | "tools" | List<ToolDefinition> | Validity, Name Reliability, Description Reliability |
| metadata | "tasks" | List<String> | Task Completion |
| metadata | "constraints" | String | Task Completion |
Evaluator Configuration
Every evaluator uses the builder pattern. Common options:
- Java
- Kotlin
// Rule-based: just set threshold
ToolCallValidityEvaluator.builder()
.strictMode(true) // Fail on any unexpected param
.threshold(1.0)
.build();
ToolCorrectnessEvaluator.builder()
.matchMode(ToolCorrectnessEvaluator.MatchMode.NAMES_AND_ORDER)
.build();
// LLM-based: provide a judge
TaskCompletionEvaluator.builder()
.judge(judgeLM)
.threshold(0.5)
.build();
// Tool reliability: optional judge for semantic checks
ToolNameReliabilityEvaluator.builder()
.judge(judgeLM) // optional
.threshold(0.8)
.build();
ToolDescriptionReliabilityEvaluator.builder()
.maxInputArgs(5) // default 5
.maxOptionalArgs(3) // default 3
.judge(judgeLM) // optional, enables 9 additional LLM checks
.threshold(0.8)
.build();
evaluators {
toolCallValidity { strictMode = true }
toolCorrectness { matchMode = ToolCorrectnessEvaluator.MatchMode.NAMES_AND_ORDER }
taskCompletion(judge) { threshold = 0.5 }
toolArgumentHallucination(judge) { threshold = 0.8 }
toolNameReliability { judge = judgeLM }
toolDescriptionReliability { maxInputArgs = 5; maxOptionalArgs = 3; judge = judgeLM }
}
Running as an Experiment
To evaluate an agent across a dataset, put tool definitions and task lists in each Example's metadata. That is where evaluators look for them at runtime.
- Java
- Kotlin
JudgeLM judge = prompt -> openAiClient.generate(prompt);
List<ToolDefinition> tools = List.of(
ToolDefinition.of("search_flights", "Search for flights", flightSchema),
ToolDefinition.of("book_hotel", "Book a hotel room", hotelSchema)
);
// Tools and tasks go in each Example's metadata
Dataset dataset = Dataset.builder()
.name("Travel Agent")
.addExample(Example.builder()
.input("input", "Find flights to Paris and book a hotel for 5 nights")
.expectedOutput("toolCalls", List.of(
ToolCall.of("search_flights", Map.of()),
ToolCall.of("book_hotel", Map.of())
))
.metadata("tools", tools)
.metadata("tasks", List.of("Search flights", "Book hotel"))
.build())
.build();
ExperimentResult result = Experiment.builder()
.name("Travel Agent Evaluation")
.dataset(dataset)
.task(example -> {
AgentTrace trace = travelAgent.run(example.input());
return trace.toOutputMap();
})
.evaluators(List.of(
ToolCallValidityEvaluator.builder().build(),
ToolCorrectnessEvaluator.builder().build(),
TaskCompletionEvaluator.builder().judge(judge).build(),
ToolArgumentHallucinationEvaluator.builder().judge(judge).build()
))
.build()
.run();
val judge = JudgeLM { prompt -> openAiClient.generate(prompt) }
val tools = listOf(
ToolDefinition.of("search_flights", "Search for flights", flightSchema),
ToolDefinition.of("book_hotel", "Book a hotel room", hotelSchema)
)
// Tools and tasks go in each Example's metadata
val dataset = Dataset.builder()
.name("Travel Agent")
.addExample(Example.builder()
.input("input", "Find flights to Paris and book a hotel for 5 nights")
.expectedOutput("toolCalls", listOf(
ToolCall.of("search_flights", mapOf()),
ToolCall.of("book_hotel", mapOf())
))
.metadata("tools", tools)
.metadata("tasks", listOf("Search flights", "Book hotel"))
.build())
.build()
val result = experiment {
name = "Travel Agent Evaluation"
dataset(dataset)
task { example ->
val trace = travelAgent.run(example.input())
trace.toOutputMap()
}
evaluators {
toolCallValidity { }
toolCorrectness { }
taskCompletion(judge) { }
toolArgumentHallucination(judge) { }
}
}.run()
OpenAI Integration
Here is a full example that captures tool calls from an OpenAI agent and evaluates them. There are three bridge points:
- Convert your ToolDefinition to OpenAI's ChatCompletionTool format.
- Extract tool call names and arguments from the OpenAI response.
- Build an AgentTrace from the captured execution.
import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.core.JsonValue;
import com.openai.models.*;
import com.openai.models.chat.completions.*;
import dev.dokimos.core.agents.*;
OpenAIClient client = OpenAIOkHttpClient.fromEnv();
// Define tools once, use them for both OpenAI and evaluation
List<ToolDefinition> tools = List.of(
ToolDefinition.of("search_flights", "Search for flights", flightSchema),
ToolDefinition.of("book_hotel", "Book a hotel room", hotelSchema)
);
// Convert to OpenAI format
ChatCompletionTool toOpenAITool(ToolDefinition def) {
var params = FunctionParameters.builder();
for (var entry : def.inputSchema().entrySet()) {
params.putAdditionalProperty(entry.getKey(), JsonValue.from(entry.getValue()));
}
return ChatCompletionTool.ofFunction(
ChatCompletionFunctionTool.builder()
.function(FunctionDefinition.builder()
.name(def.name())
.description(def.description())
.parameters(params.build())
.build())
.build());
}
// Run the tool-calling loop
var traceBuilder = AgentTrace.builder();
var paramsBuilder = ChatCompletionCreateParams.builder()
.model(ChatModel.GPT_5_NANO)
.addUserMessage("Find flights to Paris and book a hotel for 5 nights");
tools.forEach(t -> paramsBuilder.addTool(toOpenAITool(t)));
for (int i = 0; i < 10; i++) {
var completion = client.chat().completions().create(paramsBuilder.build());
var message = completion.choices().get(0).message();
paramsBuilder.addMessage(message);
var toolCalls = message.toolCalls().orElse(List.of());
if (toolCalls.isEmpty()) {
traceBuilder.finalResponse(message.content().orElse(""));
break;
}
for (var toolCall : toolCalls) {
var func = toolCall.asFunction();
var function = func.function();
String result = yourApp.executeTool(function.name(), function.arguments(Map.class));
traceBuilder.addToolCall(ToolCall.builder()
.name(function.name())
.arguments(function.arguments(Map.class))
.result(result)
.build());
paramsBuilder.addMessage(ChatCompletionToolMessageParam.builder()
.toolCallId(func.id())
.content(result)
.build());
}
}
AgentTrace trace = traceBuilder.build();
// Evaluate
var testCase = EvalTestCase.builder()
.input("Find flights to Paris and book a hotel for 5 nights")
.actualOutput("toolCalls", trace.toolCalls())
.actualOutput("output", trace.finalResponse())
.metadata("tools", tools)
.metadata("tasks", List.of("Search for flights", "Book a hotel"))
.build();
var result = ToolCallValidityEvaluator.builder().build().evaluate(testCase);
The loop runs up to 10 iterations because the model may call tools across several turns. It might search first, then book based on those results. Each iteration is one API round-trip, and the loop exits when the model returns a final text response instead of tool calls.
See OpenAIAgentEvaluationExample.java for a complete runnable example.
Best Practices
- Start with rule-based evaluators. ToolCallValidityEvaluator and ToolCorrectnessEvaluator need no LLM and give fast, deterministic feedback. Add LLM-based evaluators once the basics pass.
- Evaluate tool definitions in CI. Use ToolNameReliabilityEvaluator and ToolDescriptionReliabilityEvaluator to catch tool definition quality issues before they change agent behavior.
- Use AgentTrace for consistent data flow. Build AgentTrace objects in your Task and call toOutputMap() to produce the standard format every evaluator expects.
- Combine with standard evaluators. Use LLMJudgeEvaluator to check the quality of the agent's final response alongside the tool-level checks.