
Data Model

This page shows the classes Dokimos uses to hold your test cases, run your LLM, and report scores, so you know exactly what to build and what comes back.

How the Pieces Fit Together

The flow is short:

  1. A Dataset holds a list of Examples (your test cases).
  2. An Experiment runs a Task (your LLM) on each example.
  3. Evaluators score the outputs and return EvalResults.
  4. Everything lands in one ExperimentResult.

// The flow in code
var result = Experiment.builder()
    .dataset(myDataset)              // Examples to test
    .task(myTask)                    // Your LLM
    .evaluators(List.of(evaluator))  // How to judge outputs
    .run();                          // Returns ExperimentResult

Core Classes

Dataset

A list of test cases you want to evaluate.

| Attribute | Type | Required | Description |
|---|---|---|---|
| name | String | Yes | Name of the dataset |
| description | String | No | Description of the dataset |
| examples | List<Example> | Yes | Your test cases |

Methods you will use most:

  • size() returns the number of examples.
  • get(int index) returns one example.
  • iterator() lets you loop through them.

System.out.println("Examples: " + dataset.size());
Example first = dataset.get(0);
for (Example ex : dataset) {
    System.out.println(ex.input());
}
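
The for-each loop above works because the dataset is iterable. A minimal sketch of that shape follows, assuming nothing about the real implementation: `DatasetSketch` is a hypothetical stand-in that holds plain strings instead of Examples.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in: implementing Iterable is what enables the for-each loop.
final class DatasetSketch implements Iterable<String> {
    private final List<String> examples;

    DatasetSketch(List<String> examples) {
        this.examples = List.copyOf(examples); // defensive, immutable copy
    }

    // Number of examples held.
    int size() {
        return examples.size();
    }

    // One example by zero-based index.
    String get(int index) {
        return examples.get(index);
    }

    @Override
    public Iterator<String> iterator() {
        return examples.iterator();
    }
}
```

Copying the list in the constructor keeps later changes to the caller's list from altering the dataset mid-run; whether Dokimos does the same is not stated here.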

Build a dataset like this:

Dataset dataset = Dataset.builder()
    .name("Support Questions")
    .examples(List.of(
        Example.of("How do I reset my password?", "Click 'Forgot Password'..."),
        Example.of("What's your refund policy?", "We offer 30-day refunds...")
    ))
    .build();

Belongs to: Nothing (top level)
Contains: Many Examples


Example

One test case: input, expected output, and optional metadata.

| Attribute | Type | Required | Description |
|---|---|---|---|
| inputs | Map<String, Object> | No | Input values |
| expectedOutputs | Map<String, Object> | No | What you expect as output |
| metadata | Map<String, Object> | No | Extra info (tags, categories, etc.) |

Two shortcuts read the primary values:

  • input() returns inputs.get("input").
  • expectedOutput() returns expectedOutputs.get("output").
Example ex = Example.of("What's 2+2?", "4");
String primaryInput = ex.input(); // "What's 2+2?"
String primaryExpected = ex.expectedOutput(); // "4"
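
A minimal sketch of how the two shortcuts could delegate to the underlying maps. `ExampleSketch` is a hypothetical stand-in written for illustration, not Dokimos source; in particular, returning null for a missing key is an assumption.

```java
import java.util.Map;

// Hypothetical stand-in for Example; it only illustrates the shortcut delegation.
record ExampleSketch(Map<String, Object> inputs, Map<String, Object> expectedOutputs) {

    // Mirrors the short form: stores both values under the conventional keys.
    static ExampleSketch of(String input, String expectedOutput) {
        return new ExampleSketch(Map.of("input", input), Map.of("output", expectedOutput));
    }

    // Shortcut for inputs.get("input"); null when the key is absent (an assumption).
    String input() {
        Object value = inputs.get("input");
        return value == null ? null : value.toString();
    }

    // Shortcut for expectedOutputs.get("output"); null when the key is absent (an assumption).
    String expectedOutput() {
        Object value = expectedOutputs.get("output");
        return value == null ? null : value.toString();
    }
}
```

The point of the sketch is that the short form and the builder end up in the same place: a value stored under "input" or "output" is what the shortcut reads back.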

Start with the short form. Switch to the builder when you need more keys or metadata.

// Simple example (just input and output)
Example simple = Example.of(
    "What's 2+2?",
    "4"
);

// Full example with metadata
Example detailed = Example.builder()
    .inputs(Map.of(
        "input", "What's 2+2?",
        "language", "en"
    ))
    .expectedOutputs(Map.of(
        "output", "4",
        "confidence", 1.0
    ))
    .metadata(Map.of("category", "math"))
    .build();

Belongs to: Dataset
Becomes: EvalTestCase (after task runs)


Experiment

Runs your task on a dataset and scores the results.

| Attribute | Type | Required | Description |
|---|---|---|---|
| name | String | No | Experiment name |
| description | String | No | What you're testing |
| dataset | Dataset | Yes | Test cases to run |
| task | Task | Yes | Your LLM or system |
| evaluators | List<Evaluator> | No | How to judge outputs |
| metadata | Map<String, Object> | No | Custom tracking info |

Call run() to execute everything. It returns an ExperimentResult.

ExperimentResult result = experiment.run();
System.out.println("Pass rate: " + result.passRate());

A full experiment with two evaluators:

ExperimentResult result = Experiment.builder()
    .name("Test GPT-5.2 on support questions")
    .dataset(supportDataset)
    .task(chatbotTask)
    .evaluators(List.of(
        ExactMatchEvaluator.builder().build(),
        FaithfulnessEvaluator.builder().judge(judge).build()
    ))
    .run();

Uses: Dataset, Task, Evaluators
Produces: ExperimentResult


ExperimentResult

The summary of how your experiment did.

| Attribute | Type | Required | Description |
|---|---|---|---|
| name | String | Yes | Experiment name |
| description | String | Yes | Experiment description |
| metadata | Map<String, Object> | No | Custom metadata |
| itemResults | List<ItemResult> | No | Results for each example |

The metrics you will read:

  • totalCount() returns the number of examples evaluated.
  • passCount() returns how many passed every evaluator.
  • failCount() returns how many failed at least one evaluator.
  • passRate() returns the fraction that passed (0.0 to 1.0).
  • averageScore(String) returns the average score for one named evaluator.

System.out.println("Pass rate: " + result.passRate());
System.out.println("Average faithfulness: " + result.averageScore("Faithfulness"));

// Check individual results
for (ItemResult item : result.itemResults()) {
    if (!item.success()) {
        System.out.println("Failed: " + item.example().input());
    }
}
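
The arithmetic behind these metrics can be sketched as follows. `MetricsSketch` and its `Eval` record are hypothetical stand-ins that model only the definitions above, not the real classes. Two choices in the sketch are assumptions: an empty run reports 0.0, and an item with no evaluator results counts as passed.

```java
import java.util.List;

// Hypothetical stand-ins; they only model the arithmetic of the metrics.
final class MetricsSketch {

    // One evaluator's verdict on one example.
    record Eval(String name, double score, boolean success) {}

    // An item passes only if every evaluator passed.
    static boolean itemSuccess(List<Eval> evals) {
        return evals.stream().allMatch(Eval::success);
    }

    // Fraction of items that passed, from 0.0 to 1.0.
    static double passRate(List<List<Eval>> items) {
        if (items.isEmpty()) {
            return 0.0; // assumption: an empty run reports 0.0
        }
        long passed = items.stream().filter(MetricsSketch::itemSuccess).count();
        return (double) passed / items.size();
    }

    // Mean score of the evaluator with the given name across all items.
    static double averageScore(List<List<Eval>> items, String evaluatorName) {
        return items.stream()
                .flatMap(List::stream)
                .filter(e -> e.name().equals(evaluatorName))
                .mapToDouble(Eval::score)
                .average()
                .orElse(0.0);
    }
}
```

With two items, where one passes both evaluators and the other fails one of them, the pass rate is 0.5 even though the per-evaluator averages can be higher. That is why it helps to read passRate() and averageScore(String) together.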

Contains: Many ItemResults


ItemResult

The result of evaluating one example.

| Attribute | Type | Required | Description |
|---|---|---|---|
| example | Example | Yes | The original test case |
| actualOutputs | Map<String, Object> | No | What your task produced |
| evalResults | List<EvalResult> | No | Results from each evaluator |

Call success() to check if every evaluator passed.

for (ItemResult item : experimentResult.itemResults()) {
    System.out.println("Input: " + item.example().input());
    System.out.println("Expected: " + item.example().expectedOutput());
    System.out.println("Actual: " + item.actualOutputs().get("output"));
    System.out.println("Passed: " + item.success());

    // See why it failed
    for (EvalResult eval : item.evalResults()) {
        if (!eval.success()) {
            System.out.println(eval.name() + ": " + eval.reason());
        }
    }
}

Contains: Example, EvalResults
Part of: ExperimentResult


EvalTestCase

A test case ready for evaluation. It combines an example with the actual output.

| Attribute | Type | Required | Description |
|---|---|---|---|
| inputs | Map<String, Object> | No | Original inputs |
| actualOutputs | Map<String, Object> | No | What the task produced |
| expectedOutputs | Map<String, Object> | No | What you expected |
| metadata | Map<String, Object> | No | Additional metadata |

Three shortcuts read the primary values:

  • input() returns the primary input.
  • actualOutput() returns the primary actual output.
  • expectedOutput() returns the primary expected output.

This is the object Dokimos passes to each evaluator. You rarely build one yourself. Dokimos builds it when an experiment runs.

Created from: Example + actual outputs
Passed to: Evaluators


Typed outputs

The output and expected-output maps hold Object values, so the usual habit is to stringify everything. A task can instead return a structured object (a record, a list, a POJO) and read it back type-safely later. This keeps your task body honest (you return the thing you built, not a hand-assembled map) and lets custom evaluators work with real domain objects instead of parsing strings.

Tip: For the whole typed pipeline in one place (authoring a typed output, comparing it, reading it back, judging it as JSON, and typing tool-call results), see the Structured & Typed Data hub. The sections below are the per-method reference it links into.

Returning a typed value from a task

Task.typed(fn) wraps a function that returns a single value and stores it under the conventional "output" key. In Kotlin, the reified typedTask<T> { ... } DSL does the same thing.

Note: Task.typed rejects a null return with NullPointerException, because the output map cannot hold a null value. If you genuinely need an absent output, use a raw Task. As a convenience guard, if your function already returns a Map, that map is used directly as the output map rather than being nested under "output", so a multi-key task can adopt typed without double-nesting.

record Movie(String title, String director, int year) {}

Task task = Task.typed(example -> {
    String json = llm.chat(example.input());
    return Json.parseMovie(json); // returns a Movie record
});
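
The wrapping rules described above (null is rejected, a Map passes through, anything else lands under "output") can be sketched as follows. `TypedTaskSketch` is a hypothetical stand-in built on java.util.function.Function; it is not the Dokimos implementation and only mirrors the stated behavior.

```java
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

// Hypothetical sketch of the wrapping rules; not Dokimos source.
final class TypedTaskSketch {

    // Wraps a single-value function into one that returns an output map.
    @SuppressWarnings("unchecked")
    static <I> Function<I, Map<String, Object>> typed(Function<I, ?> fn) {
        return input -> {
            // A null return is rejected because the output map cannot hold null.
            Object value = Objects.requireNonNull(fn.apply(input), "typed task returned null");
            if (value instanceof Map<?, ?> map) {
                // A Map is used directly instead of being nested under "output".
                return (Map<String, Object>) map;
            }
            return Map.of("output", value);
        };
    }
}
```

Passing a Map through unchanged is what lets a multi-key task adopt the typed form without its keys ending up one level deeper than the evaluators expect.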

Reading typed values back

Both EvalTestCase and Example expose typed accessors. For a non-generic target, pass a Class<T>. The accessors default to the "output" key, and keyed overloads read any other key.

| Method | Reads | Returns |
|---|---|---|
| actualOutputAs(Class<T>) | actual "output" | converted value or null |
| actualOutputAs(OutputType<T>) | actual "output" | converted value or null |
| actualOutputAs(String, Class<T>) | actual under key | converted value or null |
| actualOutputAs(String, OutputType<T>) | actual under key | converted value or null |
| expectedOutputAs(Class<T>) | expected "output" | converted value or null |
| expectedOutputAs(OutputType<T>) | expected "output" | converted value or null |
| expectedOutputAs(String, Class<T>) | expected under key | converted value or null |
| expectedOutputAs(String, OutputType<T>) | expected under key | converted value or null |

Example carries the expectedOutputAs(...) twins only (it has no actual output yet). EvalTestCase carries both the actual and expected variants.

public class MovieEvaluator implements Evaluator {
    @Override
    public EvalResult evaluate(EvalTestCase testCase) {
        Movie actual = testCase.actualOutputAs(Movie.class);
        Movie expected = testCase.expectedOutputAs(Movie.class);

        // Guard both sides: either accessor returns null when its key is absent.
        boolean match = actual != null
            && expected != null
            && actual.director().equals(expected.director());

        return EvalResult.builder()
            .name("Movie Director")
            .score(match ? 1.0 : 0.0)
            .success(match)
            .reason(match ? "Director matches" : "Wrong director")
            .build();
    }

    @Override
    public String name() { return "Movie Director"; }

    @Override
    public double threshold() { return 1.0; }
}

Generic types with OutputType<T>

A plain Class<T> cannot express a generic target like List<Movie>, because type arguments are erased at runtime. OutputType<T> is a super-type token (the "Gafter gadget", like Jackson's TypeReference or Spring's ParameterizedTypeReference) that captures the full generic type. Always instantiate it as an anonymous subclass so the type argument is recorded:

// Task produces a List<Movie>
Task task = Task.typed(example -> parseMovies(llm.chat(example.input())));

// Read it back, preserving the element type
List<Movie> movies =
    testCase.actualOutputAs(new OutputType<List<Movie>>() {});

// A keyed, non-"output" variant works the same way
List<Movie> shortlist =
    testCase.actualOutputAs("shortlist", new OutputType<List<Movie>>() {});

Tip: Constructing an OutputType raw (new OutputType() {}) throws IllegalArgumentException, because there is no type argument to capture. Use the Class<T> accessors for non-generic targets, and reach for OutputType<T> only when the target is generic.
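
For readers unfamiliar with super-type tokens, here is a minimal sketch of the general technique using only the standard reflection API. `TypeTokenSketch` is a hypothetical stand-in, not the real OutputType, whose internals may differ.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

// Hypothetical stand-in showing the super-type-token technique; not the real OutputType.
abstract class TypeTokenSketch<T> {
    private final Type type;

    protected TypeTokenSketch() {
        // An anonymous subclass records its type argument in its generic superclass.
        Type superclass = getClass().getGenericSuperclass();
        if (!(superclass instanceof ParameterizedType parameterized)) {
            // Raw construction leaves nothing to capture.
            throw new IllegalArgumentException(
                    "Missing type argument; use new TypeTokenSketch<Foo>() {}");
        }
        type = parameterized.getActualTypeArguments()[0];
    }

    // The full generic type, such as List<String>, that erasure would otherwise lose.
    Type type() {
        return type;
    }
}
```

The trailing `{}` matters: it creates a subclass whose superclass declaration carries the type argument, which is the one place the argument remains visible to reflection at runtime.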

Conversion contract

The typed accessors share one conversion contract across EvalTestCase and Example:

  • Absent key returns null. If the requested key is missing from the map, the accessor returns null instead of throwing.
  • Already the right type is returned as-is. For the Class<T> accessors, a stored value that is already an instance of the target type is cast directly without going through serialization.
  • Otherwise it is converted, or it throws. Any other value is converted (via Jackson under the hood). If the value cannot be converted to the requested type, the accessor throws DokimosTypeConversionException (in dev.dokimos.core.exceptions).

This is why a typed task pairs naturally with structural matching: StructuralMatchEvaluator compares the stored structured value against the expected structure, and your custom evaluators can read the same value back as a real object.
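
The three-step conversion contract can be sketched without any library dependency. Everything in this block is hypothetical: `ConversionSketch`, its `ConversionException` (a stand-in for DokimosTypeConversionException), and the `converter` parameter, which takes the place of the Jackson conversion the real accessors perform.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the contract; the real accessors convert via Jackson.
final class ConversionSketch {

    // Stand-in for DokimosTypeConversionException.
    static final class ConversionException extends RuntimeException {
        ConversionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    static <T> T readAs(Map<String, Object> outputs, String key, Class<T> type,
                        Function<Object, T> converter) {
        Object value = outputs.get(key);
        if (value == null) {
            return null;                   // 1. absent key returns null
        }
        if (type.isInstance(value)) {
            return type.cast(value);       // 2. already the right type: returned as-is
        }
        try {
            return converter.apply(value); // 3. otherwise it is converted...
        } catch (RuntimeException e) {
            throw new ConversionException( //    ...or the accessor throws
                    "Cannot convert '" + key + "' to " + type.getSimpleName(), e);
        }
    }
}
```

Note the asymmetry a caller has to handle: a missing key is a quiet null, while a present but unconvertible value is an exception.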


EvalResult

The score and feedback from one evaluator.

| Attribute | Type | Required | Description |
|---|---|---|---|
| name | String | Yes | Evaluator name |
| score | double | Yes | Score (0.0 to 1.0) |
| success | boolean | Yes | Whether it passed the threshold |
| reason | String | Yes | Why this score was given |
| metadata | Map<String, Object> | No | Extra info from evaluator |

for (EvalResult eval : itemResult.evalResults()) {
    System.out.println(eval.name() + ": " + eval.score());
    if (!eval.success()) {
        System.out.println(" Failed because: " + eval.reason());
    }
}

Produced by: Evaluator
Part of: ItemResult


Interfaces

Task

The function that runs your LLM or system.

@FunctionalInterface
public interface Task {
    Map<String, Object> run(Example example);
}

Return a single output, or return several keys at once:

// Simple task
Task simple = example -> {
    String response = llm.chat(example.input());
    return Map.of("output", response);
};

// Task with multiple outputs
Task detailed = example -> {
    String response = llm.chat(example.input());
    return Map.of(
        "output", response,
        "tokens", 150,
        "latency_ms", 320
    );
};
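
The multi-output example uses fixed numbers for illustration. If you want a measured value instead, one possible approach is sketched below. `TimedTaskSketch` and its `model` parameter are hypothetical; `model` stands in for whatever client call your task makes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper; `model` stands in for whatever client call your task makes.
final class TimedTaskSketch {

    // Runs the model once and records the wall-clock latency next to the primary output.
    static Map<String, Object> runTimed(Function<String, String> model, String input) {
        long start = System.nanoTime();
        String response = model.apply(input);
        long latencyMs = (System.nanoTime() - start) / 1_000_000;

        Map<String, Object> outputs = new HashMap<>();
        outputs.put("output", response);      // primary output
        outputs.put("latency_ms", latencyMs); // custom key for your own evaluators
        return outputs;
    }
}
```

Inside a task body this would be called with the example's primary input, and the returned map would be the task's result.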

Evaluator

The interface for judging outputs.

public interface Evaluator {
    EvalResult evaluate(EvalTestCase testCase);
    String name();
    double threshold();
}

Dokimos ships these built-in implementations:

  • ExactMatchEvaluator checks for an exact match.
  • RegexEvaluator matches a pattern.
  • LLMJudgeEvaluator uses another LLM to judge.
  • FaithfulnessEvaluator checks that the answer is grounded in the context.
  • Agent evaluators cover tool call validation, task completion, argument hallucination, and tool reliability.

Write your own by implementing the three methods:

public class LengthEvaluator implements Evaluator {
    @Override
    public EvalResult evaluate(EvalTestCase testCase) {
        String output = testCase.actualOutput();
        // Treat a missing output as a failure instead of throwing NullPointerException.
        boolean inRange = output != null
            && output.length() >= 50 && output.length() <= 500;

        return EvalResult.builder()
            .name("Length Check")
            .score(inRange ? 1.0 : 0.0)
            .success(inRange)
            .reason(inRange ? "Good length" : "Missing, too short, or too long")
            .build();
    }

    @Override
    public String name() { return "Length Check"; }

    @Override
    public double threshold() { return 1.0; }
}
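
An evaluator does not have to be all-or-nothing. The sketch below shows the scoring logic for a graded check, kept separate from the Evaluator interface so it stands alone. `KeywordCoverageSketch` is hypothetical, and the rule that a score passes when it meets or exceeds the threshold is an assumption about how thresholds are applied.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical scoring helper for a graded (non-binary) evaluator.
final class KeywordCoverageSketch {

    // Fraction of required keywords found in the output, ignoring case.
    static double score(String output, List<String> keywords) {
        if (output == null || keywords.isEmpty()) {
            return 0.0;
        }
        String haystack = output.toLowerCase(Locale.ROOT);
        long found = keywords.stream()
                .filter(k -> haystack.contains(k.toLowerCase(Locale.ROOT)))
                .count();
        return (double) found / keywords.size();
    }

    // Assumption: an evaluator passes when its score meets or exceeds the threshold.
    static boolean passes(double score, double threshold) {
        return score >= threshold;
    }
}
```

In an evaluate method, the score would go into score(...) on the builder and the comparison against threshold() into success(...).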

Working with Maps

Most attributes use Map<String, Object> so you can store anything. These are the keys Dokimos recognizes:

| Key | Used In | Description |
|---|---|---|
| "input" | inputs | Primary input text |
| "output" | outputs | Primary output text |
| "context" | outputs | Retrieved documents (for RAG) |
| "query" | inputs | Search query (for RAG) |
| "toolCalls" | outputs / expected | Tool calls made by an agent (for agent evaluation) |
| "tools" | metadata | Available tool definitions (for agent evaluation) |
| "tasks" | metadata | Task list for agent completion evaluation |

For a RAG task, put the retrieved docs under "context" so evaluators can read them:

Task ragTask = example -> {
    List<String> docs = retriever.search(example.input());
    String answer = llm.generate(example.input(), docs);

    return Map.of(
        "output", answer,
        "context", docs, // Evaluators can check this
        "num_docs", docs.size()
    );
};

Add any custom keys you need. Built-in evaluators read the standard keys, and custom evaluators can read anything you put in the map.
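
Because map values are typed as Object, a custom evaluator has to check what it reads. A minimal sketch of a defensive reader for the "context" key follows. `ContextReaderSketch` is a hypothetical helper, and returning an empty list for a missing or malformed value is a design choice of the sketch rather than documented behavior.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for reading a list-valued key out of an output map.
final class ContextReaderSketch {

    // Reads "context" as a list of strings; a missing or non-list value yields an empty list.
    static List<String> contextOf(Map<String, Object> outputs) {
        Object raw = outputs.get("context");
        if (!(raw instanceof List<?> list)) {
            return List.of();
        }
        return list.stream().map(item -> String.valueOf(item)).toList();
    }
}
```

Returning an empty list keeps the evaluator's scoring code free of null checks; an evaluator that should fail loudly on a missing context would throw instead.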
