Embabel Integration

This page shows you how to capture an Embabel agent run as a Dokimos AgentTrace and score it with the agent evaluators. You register a listener, run the agent as you normally would, then read the trace out.

Java 21+

Embabel's published artifacts are built for Java 21, so dokimos-embabel requires Java 21 or later. The rest of Dokimos keeps the Java 17 baseline.

What this integration gives you

Trace capture from an event listener. EmbabelTraceCollector implements Embabel's AgenticEventListener. It listens to the process events your agent emits and assembles an AgentTrace from the tool calls it observes.

No change to how you run the agent. You attach the collector to your ProcessOptions or AgentInvocation.Builder, run the agent, then read collector.trace(). The agent code stays the same.

Straight into the agent evaluators. The captured AgentTrace feeds the agent evaluators through trace.toTestCase(input, tools).

Setup

Add the integration dependency. It pulls in dokimos-core. You bring your own Embabel SDK version.

Maven

<dependency>
<groupId>dev.dokimos</groupId>
<artifactId>dokimos-embabel</artifactId>
<version>${dokimos.version}</version>
</dependency>

You also need the Embabel agent API on your classpath:

<dependency>
<groupId>com.embabel.agent</groupId>
<artifactId>embabel-agent-api</artifactId>
<version>0.4.0</version>
</dependency>

Gradle (Groovy DSL)

implementation "dev.dokimos:dokimos-embabel:${dokimosVersion}"
implementation 'com.embabel.agent:embabel-agent-api:0.4.0'

Capture a trace

The flow has four steps: attach a collector to your run, run the agent, read the trace, and build a test case from it.

EmbabelSupport.attach has two forms. One adds the collector to an existing ProcessOptions. The other attaches a fresh collector to an AgentInvocation.Builder and hands it back to you.

import com.embabel.agent.api.common.autonomy.AgentInvocation;
import com.embabel.agent.core.ProcessOptions;
import dev.dokimos.core.EvalTestCase;
import dev.dokimos.core.agents.AgentTrace;
import dev.dokimos.core.agents.ToolDefinition;
import dev.dokimos.embabel.EmbabelSupport;
import dev.dokimos.embabel.EmbabelTraceCollector;
import java.util.List;

// 1. Attach a collector to an invocation builder
AgentInvocation.Builder<String> builder = AgentInvocation.builder(agentPlatform)
.options(ProcessOptions.DEFAULT);
EmbabelTraceCollector collector = EmbabelSupport.attach(builder);

// 2. Run the agent as usual
String response = builder.build(String.class).invoke(userInput);

// 3. Read the trace and the tools the agent was observed using
AgentTrace trace = collector.trace();
List<ToolDefinition> tools = EmbabelSupport.toToolDefinitions(collector);

// 4. Build a test case for the agent evaluators
EvalTestCase testCase = trace.toTestCase(userInput, tools);

If you already build your own ProcessOptions, create the collector yourself and attach it with the other overload:

import com.embabel.agent.core.ProcessOptions;
import dev.dokimos.embabel.EmbabelSupport;
import dev.dokimos.embabel.EmbabelTraceCollector;

EmbabelTraceCollector collector = new EmbabelTraceCollector();

// Returns a ProcessOptions with the collector wired in as a listener
ProcessOptions options = EmbabelSupport.attach(ProcessOptions.DEFAULT, collector);

Score the trace

trace.toTestCase(input, tools) builds the EvalTestCase the agent evaluators expect: the tool calls and final response go into the actual outputs, and the tool definitions go into metadata. Each evaluator is created through its static builder() method.

import dev.dokimos.core.EvalResult;
import dev.dokimos.core.EvalTestCase;
import dev.dokimos.core.agents.AgentTrace;
import dev.dokimos.core.agents.ToolDefinition;
import dev.dokimos.core.evaluators.agents.ToolCallValidityEvaluator;
import dev.dokimos.core.evaluators.agents.ToolCorrectnessEvaluator;
import dev.dokimos.core.evaluators.agents.ToolEfficiencyEvaluator;
import dev.dokimos.embabel.EmbabelSupport;
import java.util.List;

AgentTrace trace = collector.trace();
List<ToolDefinition> tools = EmbabelSupport.toToolDefinitions(collector);

EvalTestCase testCase = trace.toTestCase("Find flights to Paris", tools);

// Deterministic checks, no judge needed
EvalResult validity = ToolCallValidityEvaluator.builder().build().evaluate(testCase);
EvalResult efficiency = ToolEfficiencyEvaluator.builder().build().evaluate(testCase);
EvalResult correctness = ToolCorrectnessEvaluator.builder().build().evaluate(testCase);

For the LLM-based checks, pass a judge. See Agent Evaluation for the full list of nine evaluators and what each one checks.

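import dev.dokimos.core.EvalResult;
import dev.dokimos.core.EvalTestCase;
import dev.dokimos.core.JudgeLM;
import dev.dokimos.core.evaluators.agents.TaskCompletionEvaluator;
import dev.dokimos.core.evaluators.agents.ToolArgumentHallucinationEvaluator;
import java.util.List;

// openAiClient stands for whatever LLM client you already use
JudgeLM judge = prompt -> openAiClient.generate(prompt);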

EvalTestCase testCase = trace.toTestCase(
"Find flights to Paris",
tools,
List.of("Search for flights")); // tasks, for TaskCompletionEvaluator

EvalResult completion = TaskCompletionEvaluator.builder()
.judge(judge)
.build()
.evaluate(testCase);

EvalResult hallucination = ToolArgumentHallucinationEvaluator.builder()
.judge(judge)
.build()
.evaluate(testCase);

Inspect what was captured

Beyond trace(), the collector exposes the raw observations. Use these to debug or to assert directly on the calls.

  • collector.toolCalls() returns the captured List<ToolCall> (name, arguments, result).
  • collector.observedToolNames() returns the distinct tool names seen, in order.
  • collector.trace() assembles the full AgentTrace.
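
As a rough illustration of asserting directly on captured calls, the sketch below works on a stand-in CapturedCall record rather than the real Dokimos ToolCall, whose accessors may differ. CapturedCall, CapturedCallChecks, and both helper methods are hypothetical names introduced here. It shows one way to derive distinct tool names in first-seen order, which is the behavior described for observedToolNames(), and to check that a tool was called with a given argument.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Stand-in for a captured call (name, arguments, result). Not the Dokimos ToolCall type.
record CapturedCall(String name, Map<String, Object> arguments, String result) {}

final class CapturedCallChecks {
    private CapturedCallChecks() {}

    // Distinct tool names in the order they were first seen.
    static List<String> distinctNamesInOrder(List<CapturedCall> calls) {
        LinkedHashSet<String> names = new LinkedHashSet<>();
        for (CapturedCall call : calls) {
            names.add(call.name());
        }
        return List.copyOf(names);
    }

    // True when the named tool was called at least once with the expected argument value.
    static boolean calledWith(List<CapturedCall> calls, String tool, String argName, Object expected) {
        return calls.stream()
                .filter(call -> call.name().equals(tool))
                .anyMatch(call -> expected.equals(call.arguments().get(argName)));
    }
}
```

In a real test you would map collector.toolCalls() onto whatever assertions your test framework provides; the helpers above only show the shape of such checks.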

Cost, tokens, and latency

The same collector captures metrics. After the run, call collector.callMetrics(model, priceTable) to get a CallMetrics with tokensIn, tokensOut, costUsd, and latencyMs (any of which may be null), or call collector.callMetrics() for tokens and latency only. Feed the result into a MeasuredTask's TaskResult so the run detail shows Total Tokens, Total Cost, and Avg Latency.

CallMetrics metrics = collector.callMetrics("your-model", priceTable);

Embabel reports its own cost on the completed agent process, so cost precedence here differs from the other adapters: Embabel's own non-zero totalCost() wins, and the PriceTable is consulted only when Embabel reported $0 and a model id is supplied. All-zero token usage is treated as "not measured" (null), and callMetrics() returns null when nothing was captured. See Cost and Pricing for the pricing seam.
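
The precedence rule above can be sketched in plain Java. This is not the Dokimos implementation: RunUsage, CostPrecedenceSketch, and the per-token price map are hypothetical stand-ins, and two details are assumptions made for the sketch, namely that all-zero token usage also yields a null cost and that a model missing from the price table yields null.

```java
import java.util.Map;

// Hypothetical stand-in for what a completed run reports. Not an Embabel or Dokimos type.
record RunUsage(long tokensIn, long tokensOut, double reportedCostUsd) {}

final class CostPrecedenceSketch {
    private CostPrecedenceSketch() {}

    // pricePerToken is a simplified, hypothetical price table: USD per token, keyed by model id.
    static Double resolveCostUsd(RunUsage usage, String modelId, Map<String, Double> pricePerToken) {
        if (usage.reportedCostUsd() > 0.0) {
            return usage.reportedCostUsd(); // the framework's own non-zero cost wins
        }
        if (usage.tokensIn() == 0 && usage.tokensOut() == 0) {
            return null; // assumption: all-zero usage means "not measured"
        }
        if (modelId == null || !pricePerToken.containsKey(modelId)) {
            return null; // assumption: no model id or no price entry means no fallback cost
        }
        // Fallback: price the observed tokens from the table.
        return (usage.tokensIn() + usage.tokensOut()) * pricePerToken.get(modelId);
    }
}
```

The point of the ordering is that the price table is a fallback only: a cost reported by the framework is never overwritten by a locally computed estimate.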

Limitations

Two limitations follow from how Embabel reports events. Keep them in mind when you pick evaluators.

Synthesized tool definitions

EmbabelSupport.toToolDefinitions(collector) builds one ToolDefinition per observed tool name, with an empty input schema. Embabel's events carry the tool names and call arguments, not the full tool contracts. So ToolDescriptionReliabilityEvaluator has little to score (no descriptions, no documented arguments), and its coverage is weakened. For real coverage, build the ToolDefinition list by hand from your actual tool contracts and pass that to trace.toTestCase(input, tools) instead.
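
To make the gap concrete, here is a minimal sketch that contrasts synthesized definitions with hand-written ones. It uses a stand-in ToolSpec record instead of the real ToolDefinition, whose constructor or builder is not shown on this page. ToolSpec, ToolSpecSketch, and the merge strategy are assumptions for illustration only.

```java
import java.util.List;
import java.util.Map;

// Stand-in for a tool contract. Not the Dokimos ToolDefinition type.
record ToolSpec(String name, String description, Map<String, String> parameters) {}

final class ToolSpecSketch {
    private ToolSpecSketch() {}

    // Mirrors the synthesized case: a name only, no description, empty input schema.
    static List<ToolSpec> synthesize(List<String> observedNames) {
        return observedNames.stream()
                .map(name -> new ToolSpec(name, "", Map.of()))
                .toList();
    }

    // One possible strategy: use the hand-written contract when one exists,
    // otherwise keep the synthesized placeholder.
    static List<ToolSpec> preferHandWritten(List<ToolSpec> synthesized, Map<String, ToolSpec> handWritten) {
        return synthesized.stream()
                .map(spec -> handWritten.getOrDefault(spec.name(), spec))
                .toList();
    }
}
```

With this approach, any tool that still has an empty description after merging is one whose contract you have not written down yet.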

Single-run collector

A collector captures one run. It is not thread-safe, and reusing it without clearing it appends a second run's calls onto the first. Call collector.reset() before reusing it, or create a fresh EmbabelTraceCollector per run.

collector.reset(); // clears tool calls and observed names before the next run
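
The append-on-reuse behavior is easy to see in a toy recorder. RunRecorder below is a hypothetical stand-in, not EmbabelTraceCollector; it only assumes that calls accumulate in a plain list until reset() clears it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a single-run collector. Not the Dokimos class.
final class RunRecorder {
    // A plain ArrayList: no synchronization, so not safe to share across threads.
    private final List<String> toolCalls = new ArrayList<>();

    void onToolCall(String name) {
        toolCalls.add(name);
    }

    // Returns an immutable snapshot of the calls recorded so far.
    List<String> toolCalls() {
        return List.copyOf(toolCalls);
    }

    // Clears recorded calls so the next run starts from an empty state.
    void reset() {
        toolCalls.clear();
    }
}
```

Recording a second run without calling reset() leaves the first run's calls in the list, so any trace built from it mixes both runs. Creating a fresh instance per run avoids the problem and also sidesteps sharing one instance between threads.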

Tip

The agent evaluators are framework-agnostic. Once you have an AgentTrace, scoring is identical across Embabel, Spring AI, LangChain4j, Koog, and OpenAI. See Agent Evaluation for the data model and every evaluator option.
