Agent Evaluation

AI agents autonomously use tools, reason through multi-step problems, and interact with external APIs. Evaluating them requires more than checking a single response — you need to assess what tools they used, how they used them, and whether they accomplished the task.

Dokimos provides a framework-agnostic agent evaluation system with six evaluators and a portable data model for tool calls and tool definitions.

Quick Start

The simplest way to evaluate an agent: capture its tool calls, define what tools it has, and run evaluators.

// 1. Define the tools your agent can use
List<ToolDefinition> tools = List.of(
    ToolDefinition.of("search_flights", "Search for available flights", flightSchema),
    ToolDefinition.of("book_hotel", "Book a hotel room", hotelSchema)
);

// 2. Set up a judge LLM (needed for task completion and hallucination checks)
JudgeLM judge = prompt -> openAiClient.generate(prompt);

// 3. Run your agent and capture its trace
AgentTrace trace = AgentTrace.builder()
    .addToolCall(ToolCall.of("search_flights", Map.of("origin", "JFK", "destination", "CDG")))
    .addToolCall(ToolCall.of("book_hotel", Map.of("city", "Paris", "nights", 5)))
    .finalResponse("Found flights and booked your hotel in Paris.")
    .build();

// 4. Build a test case and evaluate
var testCase = EvalTestCase.builder()
    .input("Find flights from NYC to Paris and book a hotel for 5 nights")
    .actualOutput("toolCalls", trace.toolCalls())
    .actualOutput("output", trace.finalResponse())
    .expectedOutput("toolCalls", List.of(
        ToolCall.of("search_flights", Map.of()),
        ToolCall.of("book_hotel", Map.of())
    ))
    .metadata("tools", tools)
    .metadata("tasks", List.of("Search for flights", "Book a hotel"))
    .build();

// 5. Pick the evaluators you need
var results = List.of(
    ToolCallValidityEvaluator.builder().build().evaluate(testCase),
    ToolCorrectnessEvaluator.builder().build().evaluate(testCase),
    TaskCompletionEvaluator.builder().judge(judge).build().evaluate(testCase),
    ToolArgumentHallucinationEvaluator.builder().judge(judge).build().evaluate(testCase)
);

Evaluators

| Evaluator | What it checks | LLM required? | Default threshold |
| --- | --- | --- | --- |
| ToolCallValidityEvaluator | Tool calls match their JSON schema (names, required params, types, enums) | No | 1.0 |
| ToolCorrectnessEvaluator | Agent used the expected set of tools | No | 1.0 |
| TaskCompletionEvaluator | Agent completed the user's requested tasks | Yes | 0.5 |
| ToolArgumentHallucinationEvaluator | Tool call arguments are grounded in user input | Yes | 0.8 |
| ToolNameReliabilityEvaluator | Tool names follow naming conventions (snake_case, conciseness, clarity, ordering, intent) | Optional | 0.8 |
| ToolDescriptionReliabilityEvaluator | Tool descriptions are well-crafted (structure, clarity, args documented, examples, usage notes) | Optional | 0.8 |

ToolCallValidityEvaluator

Validates each tool call against its JSON schema. Checks: tool name exists, required params present, types match, enum values valid, no unexpected params (in strict mode or when additionalProperties: false).

Score = fraction of valid tool calls.
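In spirit, the per-call check works like the following sketch. This is an illustrative approximation written for this page, not the Dokimos implementation: `isValid` verifies required params and primitive types against a JSON-schema-like Map, the same shape `ToolDefinition` uses.

```java
import java.util.List;
import java.util.Map;

public class ValiditySketch {
    // Approximation of one call's validity: required params present, declared types match.
    static boolean isValid(Map<String, Object> args, Map<String, Object> schema) {
        // Every required parameter must be present
        for (Object req : (List<?>) schema.getOrDefault("required", List.of())) {
            if (!args.containsKey((String) req)) return false;
        }
        // Each provided argument must match its declared primitive type
        Map<?, ?> props = (Map<?, ?>) schema.getOrDefault("properties", Map.of());
        for (var e : args.entrySet()) {
            Map<?, ?> prop = (Map<?, ?>) props.get(e.getKey());
            if (prop == null) continue; // non-strict mode: unexpected params allowed
            String type = (String) prop.get("type");
            if (type == null) continue;
            Object v = e.getValue();
            boolean ok = switch (type) {
                case "string" -> v instanceof String;
                case "integer" -> v instanceof Integer || v instanceof Long;
                case "number" -> v instanceof Number;
                case "boolean" -> v instanceof Boolean;
                default -> true;
            };
            if (!ok) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Object> schema = Map.of(
            "type", "object",
            "properties", Map.of("origin", Map.of("type", "string")),
            "required", List.of("origin"));
        System.out.println(isValid(Map.of("origin", "JFK"), schema)); // true
        System.out.println(isValid(Map.of("origin", 42), schema));    // false: wrong type
    }
}
```

The evaluator then averages this per-call result over all tool calls in the trace.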

ToolCorrectnessEvaluator

Compares actual vs expected tool usage. Three match modes:

| Mode | Comparison |
| --- | --- |
| NAMES_ONLY (default) | Set of tool names (F1 score) |
| NAMES_AND_ORDER | Names + invocation order (LCS similarity) |
| NAMES_AND_ARGS | Full structural comparison including arguments |
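The scoring math behind the first two modes can be sketched as follows. These are assumed formulas derived from the descriptions above (set F1 and longest-common-subsequence similarity), not the library's exact code:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CorrectnessScores {
    // NAMES_ONLY: F1 over the sets of actual vs expected tool names.
    static double f1(List<String> actual, List<String> expected) {
        Set<String> a = new LinkedHashSet<>(actual), e = new LinkedHashSet<>(expected);
        long tp = a.stream().filter(e::contains).count();
        if (a.isEmpty() || e.isEmpty() || tp == 0) return 0.0;
        double precision = (double) tp / a.size(), recall = (double) tp / e.size();
        return 2 * precision * recall / (precision + recall);
    }

    // NAMES_AND_ORDER: longest common subsequence length over the longer sequence.
    static double lcsSimilarity(List<String> actual, List<String> expected) {
        int n = actual.size(), m = expected.size();
        int[][] dp = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
                dp[i][j] = actual.get(i - 1).equals(expected.get(j - 1))
                        ? dp[i - 1][j - 1] + 1
                        : Math.max(dp[i - 1][j], dp[i][j - 1]);
        return Math.max(n, m) == 0 ? 1.0 : (double) dp[n][m] / Math.max(n, m);
    }

    public static void main(String[] args) {
        // Agent called the right tools but in the wrong order:
        var actual = List.of("book_hotel", "search_flights");
        var expected = List.of("search_flights", "book_hotel");
        System.out.println(f1(actual, expected));            // 1.0 (order ignored)
        System.out.println(lcsSimilarity(actual, expected)); // 0.5 (order penalized)
    }
}
```

The contrast in `main` shows why the mode choice matters: the same trace passes `NAMES_ONLY` but scores 0.5 under `NAMES_AND_ORDER`.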

TaskCompletionEvaluator

Sends the user-agent dialog and a task list to a judge LLM to evaluate which tasks were completed. Score = fraction of completed tasks.

Provide tasks via metadata("tasks", List.of("Search flights", "Book hotel")) and optional constraints via metadata("constraints", "Budget under $500").

ToolArgumentHallucinationEvaluator

Uses a judge LLM to check whether each tool call's argument values can be derived from the user's input. Score = fraction of non-hallucinated tool calls.
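The real evaluator delegates the groundedness judgment to the judge LLM. For intuition only, here is a deliberately naive string-containment stand-in (not what Dokimos does) showing what "grounded in user input" means at the argument level:

```java
import java.util.Map;

public class HallucinationSketch {
    // Naive proxy for the judge's verdict: every argument value's text
    // must appear somewhere in the user's input.
    static boolean grounded(Map<String, Object> args, String userInput) {
        String haystack = userInput.toLowerCase();
        return args.values().stream()
                .allMatch(v -> haystack.contains(String.valueOf(v).toLowerCase()));
    }

    public static void main(String[] args) {
        String input = "Book a hotel in Paris for 5 nights";
        System.out.println(grounded(Map.of("city", "Paris", "nights", 5), input)); // true
        System.out.println(grounded(Map.of("city", "London"), input));             // false: invented value
    }
}
```

A judge LLM handles what this proxy cannot, such as "NYC" in the input grounding an argument of "JFK".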

ToolNameReliabilityEvaluator

Evaluates tool names with 5 checks. Rule-based checks (always run): snakecase_format (strict snake_case), conciseness (≤ 7 segments), intent_over_implementation (blocklist for patterns like _with_llm, _via_api). LLM checks (require judge): clarity (purpose clear from name alone), name_order (follows operation_system_entity_data ordering), plus semantic intent_over_implementation.

Without a judge, only the 3 rule-based checks run. Score is based on checks that actually ran.
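The three rule-based checks can be approximated like this. The regex, segment limit, and blocklist below are assumed equivalents written for illustration; the exact rules live in Dokimos:

```java
import java.util.List;

public class NameChecksSketch {
    // Assumed blocklist for implementation-leaking suffixes
    static final List<String> BLOCKLIST = List.of("_with_llm", "_via_api");

    // snakecase_format: strict lowercase snake_case
    static boolean snakeCaseFormat(String name) {
        return name.matches("[a-z][a-z0-9]*(_[a-z0-9]+)*");
    }

    // conciseness: at most 7 underscore-separated segments
    static boolean conciseness(String name) {
        return name.split("_").length <= 7;
    }

    // intent_over_implementation (rule-based part): reject blocklisted patterns
    static boolean intentOverImplementation(String name) {
        return BLOCKLIST.stream().noneMatch(name::endsWith);
    }

    public static void main(String[] args) {
        System.out.println(snakeCaseFormat("search_flights"));              // true
        System.out.println(snakeCaseFormat("SearchFlights"));               // false
        System.out.println(intentOverImplementation("summarize_with_llm")); // false
    }
}
```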

ToolDescriptionReliabilityEvaluator

Evaluates tool descriptions with 13 checks. Rule-based checks (always run): input_arguments_clarity (params have descriptions), input_arguments_types (params have types), max_num_input_arguments (≤ 5 by default), max_optional_input_arguments (≤ 3 by default). LLM checks (require judge): general_structure, has_examples, has_usage_notes, intent_over_implementation, clarity, redundancy, input_arguments_enum, input_arguments_format, return_statement_quality.

Without a judge, only the 4 rule-based checks run. Score is based on checks that actually ran.
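The four rule-based checks amount to simple structural tests over the tool's JSON schema. This sketch shows assumed equivalents (not the library source), operating on the same schema `Map` shape that `ToolDefinition` takes:

```java
import java.util.List;
import java.util.Map;

public class DescriptionChecksSketch {
    @SuppressWarnings("unchecked")
    static Map<String, Map<String, Object>> props(Map<String, Object> schema) {
        return (Map<String, Map<String, Object>>) schema.getOrDefault("properties", Map.of());
    }

    // input_arguments_clarity: every parameter carries a description
    static boolean argsHaveDescriptions(Map<String, Object> schema) {
        return props(schema).values().stream().allMatch(p -> p.containsKey("description"));
    }

    // input_arguments_types: every parameter declares a type
    static boolean argsHaveTypes(Map<String, Object> schema) {
        return props(schema).values().stream().allMatch(p -> p.containsKey("type"));
    }

    // max_num_input_arguments: total parameter count within the limit
    static boolean maxInputArgs(Map<String, Object> schema, int max) {
        return props(schema).size() <= max;
    }

    // max_optional_input_arguments: non-required parameter count within the limit
    static boolean maxOptionalArgs(Map<String, Object> schema, int max) {
        int required = ((List<?>) schema.getOrDefault("required", List.of())).size();
        return props(schema).size() - required <= max;
    }

    public static void main(String[] args) {
        Map<String, Object> schema = Map.of(
            "type", "object",
            "properties", Map.of(
                "origin", Map.of("type", "string", "description", "Origin airport code")),
            "required", List.of("origin"));
        System.out.println(argsHaveDescriptions(schema)); // true
        System.out.println(maxOptionalArgs(schema, 3));   // true
    }
}
```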

Data Model

Three records in dev.dokimos.core.agents represent agent execution data.

ToolCall

A single tool invocation: name, arguments, optional result, and metadata.

// Quick
ToolCall call = ToolCall.of("search_flights", Map.of("origin", "NYC", "destination", "LAX"));

// Full builder
ToolCall call = ToolCall.builder()
    .name("book_hotel")
    .argument("city", "Paris")
    .argument("nights", 3)
    .result("{\"confirmation\": \"ABC123\"}")
    .build();

ToolDefinition

A tool's contract: name, description, and JSON schema for arguments.

ToolDefinition tool = ToolDefinition.of("search_flights", "Search for available flights", Map.of(
    "type", "object",
    "properties", Map.of(
        "origin", Map.of("type", "string", "description", "Origin airport code"),
        "destination", Map.of("type", "string", "description", "Destination airport code")
    ),
    "required", List.of("origin", "destination")
));

AgentTrace

Wraps a complete agent execution. Use toOutputMap() to produce the map format that evaluators expect ("output", "toolCalls", "reasoningSteps").

Task agentTask = example -> {
    AgentTrace trace = runAgent(example.input());
    return trace.toOutputMap();
};

EvalTestCase Keys

Agent evaluators use these keys in EvalTestCase:

| Map | Key | Type | Used by |
| --- | --- | --- | --- |
| actualOutputs | "toolCalls" | List<ToolCall> | Validity, Correctness, Hallucination |
| actualOutputs | "output" | String | Task Completion |
| expectedOutputs | "toolCalls" | List<ToolCall> | Correctness |
| metadata | "tools" | List<ToolDefinition> | Validity, Name Reliability, Description Reliability |
| metadata | "tasks" | List<String> | Task Completion |
| metadata | "constraints" | String | Task Completion |

Evaluator Configuration

All evaluators use the builder pattern. Common options:

// Rule-based — just set threshold
ToolCallValidityEvaluator.builder()
    .strictMode(true) // Fail on any unexpected param
    .threshold(1.0)
    .build();

ToolCorrectnessEvaluator.builder()
    .matchMode(ToolCorrectnessEvaluator.MatchMode.NAMES_AND_ORDER)
    .build();

// LLM-based — provide a judge
TaskCompletionEvaluator.builder()
    .judge(judgeLM)
    .threshold(0.5)
    .build();

// Tool reliability — optional judge for semantic checks
ToolNameReliabilityEvaluator.builder()
    .judge(judgeLM) // optional
    .threshold(0.8)
    .build();

ToolDescriptionReliabilityEvaluator.builder()
    .maxInputArgs(5) // default 5
    .maxOptionalArgs(3) // default 3
    .judge(judgeLM) // optional, enables 9 additional LLM checks
    .threshold(0.8)
    .build();

Running as an Experiment

To evaluate an agent across a dataset, put tool definitions and task lists in each Example's metadata — this is where evaluators look for them at runtime.

JudgeLM judge = prompt -> openAiClient.generate(prompt);

List<ToolDefinition> tools = List.of(
    ToolDefinition.of("search_flights", "Search for flights", flightSchema),
    ToolDefinition.of("book_hotel", "Book a hotel room", hotelSchema)
);

// Tools and tasks go in each Example's metadata
Dataset dataset = Dataset.builder()
    .name("Travel Agent")
    .addExample(Example.builder()
        .input("input", "Find flights to Paris and book a hotel for 5 nights")
        .expectedOutput("toolCalls", List.of(
            ToolCall.of("search_flights", Map.of()),
            ToolCall.of("book_hotel", Map.of())
        ))
        .metadata("tools", tools)
        .metadata("tasks", List.of("Search flights", "Book hotel"))
        .build())
    .build();

ExperimentResult result = Experiment.builder()
    .name("Travel Agent Evaluation")
    .dataset(dataset)
    .task(example -> {
        AgentTrace trace = travelAgent.run(example.input());
        return trace.toOutputMap();
    })
    .evaluators(List.of(
        ToolCallValidityEvaluator.builder().build(),
        ToolCorrectnessEvaluator.builder().build(),
        TaskCompletionEvaluator.builder().judge(judge).build(),
        ToolArgumentHallucinationEvaluator.builder().judge(judge).build()
    ))
    .build()
    .run();

OpenAI Integration

Here's a complete example showing how to capture tool calls from an OpenAI agent and evaluate them. The key bridge points are:

  1. Convert your ToolDefinition to OpenAI's ChatCompletionTool format
  2. Extract tool call names and arguments from the OpenAI response
  3. Build an AgentTrace from the captured execution

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.core.JsonValue;
import com.openai.models.*;
import com.openai.models.chat.completions.*;
import dev.dokimos.core.agents.*;

OpenAIClient client = OpenAIOkHttpClient.fromEnv();

// Define tools once — used for both OpenAI and evaluation
List<ToolDefinition> tools = List.of(
    ToolDefinition.of("search_flights", "Search for flights", flightSchema),
    ToolDefinition.of("book_hotel", "Book a hotel room", hotelSchema)
);

// Convert to OpenAI format
ChatCompletionTool toOpenAITool(ToolDefinition def) {
    var params = FunctionParameters.builder();
    for (var entry : def.inputSchema().entrySet()) {
        params.putAdditionalProperty(entry.getKey(), JsonValue.from(entry.getValue()));
    }
    return ChatCompletionTool.ofFunction(
        ChatCompletionFunctionTool.builder()
            .function(FunctionDefinition.builder()
                .name(def.name())
                .description(def.description())
                .parameters(params.build())
                .build())
            .build());
}

// Run the tool-calling loop
var traceBuilder = AgentTrace.builder();
var paramsBuilder = ChatCompletionCreateParams.builder()
    .model(ChatModel.GPT_5_NANO)
    .addUserMessage("Find flights to Paris and book a hotel for 5 nights");
tools.forEach(t -> paramsBuilder.addTool(toOpenAITool(t)));

for (int i = 0; i < 10; i++) {
    var completion = client.chat().completions().create(paramsBuilder.build());
    var message = completion.choices().get(0).message();
    paramsBuilder.addMessage(message);

    var toolCalls = message.toolCalls().orElse(List.of());
    if (toolCalls.isEmpty()) {
        traceBuilder.finalResponse(message.content().orElse(""));
        break;
    }

    for (var toolCall : toolCalls) {
        var func = toolCall.asFunction();
        var function = func.function();
        String result = yourApp.executeTool(function.name(), function.arguments(Map.class));

        traceBuilder.addToolCall(ToolCall.builder()
            .name(function.name())
            .arguments(function.arguments(Map.class))
            .result(result)
            .build());

        paramsBuilder.addMessage(ChatCompletionToolMessageParam.builder()
            .toolCallId(func.id())
            .content(result)
            .build());
    }
}

AgentTrace trace = traceBuilder.build();

// Evaluate
var testCase = EvalTestCase.builder()
    .input("Find flights to Paris and book a hotel for 5 nights")
    .actualOutput("toolCalls", trace.toolCalls())
    .actualOutput("output", trace.finalResponse())
    .metadata("tools", tools)
    .metadata("tasks", List.of("Search for flights", "Book a hotel"))
    .build();

var result = ToolCallValidityEvaluator.builder().build().evaluate(testCase);

The loop runs up to 10 iterations because the model may call tools across multiple turns — for example, searching first and then booking based on those results. Each iteration is one API round-trip, and the loop exits when the model produces a final text response instead of tool calls.

See OpenAIAgentEvaluationExample.java for a complete runnable example.

Best Practices

  • Start with rule-based evaluators. ToolCallValidityEvaluator and ToolCorrectnessEvaluator don't need an LLM and give fast, deterministic feedback. Add LLM-based evaluators once the basics pass.
  • Evaluate tool definitions in CI. Use ToolNameReliabilityEvaluator and ToolDescriptionReliabilityEvaluator to catch tool definition quality issues before they affect agent behavior.
  • Use AgentTrace for consistent data flow. Build AgentTrace objects in your Task and use toOutputMap() to produce the standard format all evaluators expect.
  • Combine with standard evaluators. Use LLMJudgeEvaluator to check the quality of the agent's final response alongside tool-level checks.