Data Model
This page shows the classes Dokimos uses to hold your test cases, run your LLM, and report scores, so you know exactly what to build and what comes back.
How the Pieces Fit Together
The flow is short:
- A Dataset holds a list of Examples (your test cases).
- An Experiment runs a Task (your LLM) on each example.
- Evaluators score the outputs and return EvalResults.
- Everything lands in one ExperimentResult.
- Java
- Kotlin
// The flow in code
var result = Experiment.builder()
.dataset(myDataset) // Examples to test
.task(myTask) // Your LLM
.evaluators(List.of(evaluator)) // How to judge outputs
.run(); // Returns ExperimentResult
// The flow in code
val result = experiment {
dataset(myDataset) // Examples to test
task(myTask) // Your LLM
evaluator(evaluator) // How to judge outputs
}.run() // Returns ExperimentResult
Core Classes
Dataset
A list of test cases you want to evaluate.
| Attribute | Type | Required | Description |
|---|---|---|---|
| name | String | Yes | Name of the dataset |
| description | String | No | Description of the dataset |
| examples | List<Example> | Yes | Your test cases |
Methods you will use most:
- size() returns the number of examples.
- get(int index) returns one example.
- iterator() lets you loop through them.
- Java
- Kotlin
System.out.println("Examples: " + dataset.size());
Example first = dataset.get(0);
for (Example ex : dataset) {
System.out.println(ex.input());
}
println("Examples: ${dataset.size()}")
val first = dataset[0]
dataset.forEach { ex -> println(ex.input()) }
Build a dataset like this:
- Java
- Kotlin
Dataset dataset = Dataset.builder()
.name("Support Questions")
.examples(List.of(
Example.of("How do I reset my password?", "Click 'Forgot Password'..."),
Example.of("What's your refund policy?", "We offer 30-day refunds...")
))
.build();
val dataset = dataset {
name = "Support Questions"
example {
input = "How do I reset my password?"
expected = "Click 'Forgot Password'..."
}
example {
input = "What's your refund policy?"
expected = "We offer 30-day refunds..."
}
}
Belongs to: Nothing (top level)
Contains: Many Examples
Example
One test case: input, expected output, and optional metadata.
| Attribute | Type | Required | Description |
|---|---|---|---|
| inputs | Map<String, Object> | No | Input values |
| expectedOutputs | Map<String, Object> | No | What you expect as output |
| metadata | Map<String, Object> | No | Extra info (tags, categories, etc.) |
Two shortcuts read the primary values:
- input() returns inputs.get("input").
- expectedOutput() returns expectedOutputs.get("output").
- Java
- Kotlin
Example ex = Example.of("What's 2+2?", "4");
String primaryInput = ex.input(); // "What's 2+2?"
String primaryExpected = ex.expectedOutput(); // "4"
val ex = example {
input = "What's 2+2?"
expected = "4"
}
val primaryInput = ex.input() // "What's 2+2?"
val primaryExpected = ex.expectedOutput() // "4"
Start with the short form. Switch to the builder when you need more keys or metadata.
- Java
- Kotlin
// Simple example (just input and output)
Example simple = Example.of(
"What's 2+2?",
"4"
);
// Full example with metadata
Example detailed = Example.builder()
.inputs(Map.of(
"input", "What's 2+2?",
"language", "en"
))
.expectedOutputs(Map.of(
"output", "4",
"confidence", 1.0
))
.metadata(Map.of("category", "math"))
.build();
// Simple example (just input and output)
val simple = example {
input = "What's 2+2?"
expected = "4"
}
// Full example with metadata
val detailed = example {
input("input", "What's 2+2?")
input("language", "en")
expected("output", "4")
expected("confidence", 1.0)
metadata("category", "math")
}
Belongs to: Dataset
Becomes: EvalTestCase (after task runs)
Experiment
Runs your task on a dataset and scores the results.
| Attribute | Type | Required | Description |
|---|---|---|---|
| name | String | No | Experiment name |
| description | String | No | What you're testing |
| dataset | Dataset | Yes | Test cases to run |
| task | Task | Yes | Your LLM or system |
| evaluators | List<Evaluator> | No | How to judge outputs |
| metadata | Map<String, Object> | No | Custom tracking info |
Call run() to execute everything. It returns an ExperimentResult.
- Java
- Kotlin
ExperimentResult result = experiment.run();
System.out.println("Pass rate: " + result.passRate());
val result = experiment.run()
println("Pass rate: ${result.passRate()}")
A full experiment with two evaluators:
- Java
- Kotlin
ExperimentResult result = Experiment.builder()
.name("Test GPT-5.2 on support questions")
.dataset(supportDataset)
.task(chatbotTask)
.evaluators(List.of(
ExactMatchEvaluator.builder().build(),
FaithfulnessEvaluator.builder().judge(judge).build()
))
.run();
val result = experiment {
name = "Test GPT-5.2 on support questions"
dataset(supportDataset)
task(chatbotTask)
evaluators {
exactMatch { }
faithfulness(judge) {
contextKey = "ctx"
threshold = 0.4
}
}
}.run()
Uses: Dataset, Task, Evaluators
Produces: ExperimentResult
ExperimentResult
The summary of how your experiment did.
| Attribute | Type | Required | Description |
|---|---|---|---|
| name | String | Yes | Experiment name |
| description | String | Yes | Experiment description |
| metadata | Map<String, Object> | No | Custom metadata |
| itemResults | List<ItemResult> | No | Results for each example |
The metrics you will read:
- totalCount() returns the number of examples evaluated.
- passCount() returns how many passed every evaluator.
- failCount() returns how many failed at least one evaluator.
- passRate() returns the fraction that passed (0.0 to 1.0).
- averageScore(String) returns the average score for one named evaluator.
- Java
- Kotlin
System.out.println("Pass rate: " + result.passRate());
System.out.println("Average faithfulness: " + result.averageScore("Faithfulness"));
// Check individual results
for (ItemResult item : result.itemResults()) {
if (!item.success()) {
System.out.println("Failed: " + item.example().input());
}
}
println("Pass rate: ${result.passRate()}")
println("Average faithfulness: ${result.averageScore("Faithfulness")}")
// Check individual results
result.itemResults().filterNot { it.success() }.forEach { item ->
println("Failed: ${item.example().input()}")
}
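The metrics listed above follow directly from the per-item results. The sketch below works through that arithmetic with stand-in records, not the real Dokimos classes. It assumes that an empty result list yields 0.0, which the library may handle differently.

```java
import java.util.List;

// Stand-in types for illustration only; these are NOT the Dokimos classes.
final class MetricsSketch {
    record EvalResult(String name, double score, boolean success) {}

    record ItemResult(List<EvalResult> evalResults) {
        // An item passes only when every evaluator passed.
        boolean success() {
            return evalResults.stream().allMatch(EvalResult::success);
        }
    }

    // Number of items that passed every evaluator.
    static long passCount(List<ItemResult> items) {
        return items.stream().filter(ItemResult::success).count();
    }

    // Fraction of items that passed, between 0.0 and 1.0.
    // Assumption: an empty list yields 0.0.
    static double passRate(List<ItemResult> items) {
        return items.isEmpty() ? 0.0 : (double) passCount(items) / items.size();
    }

    // Mean score of one named evaluator across all items.
    static double averageScore(List<ItemResult> items, String evaluatorName) {
        return items.stream()
                .flatMap(item -> item.evalResults().stream())
                .filter(eval -> eval.name().equals(evaluatorName))
                .mapToDouble(EvalResult::score)
                .average()
                .orElse(0.0);
    }

    // Two sample items: the first passes everything, the second fails one check.
    static List<ItemResult> sample() {
        return List.of(
                new ItemResult(List.of(
                        new EvalResult("Exact Match", 1.0, true),
                        new EvalResult("Faithfulness", 1.0, true))),
                new ItemResult(List.of(
                        new EvalResult("Exact Match", 0.0, false),
                        new EvalResult("Faithfulness", 0.5, true))));
    }

    public static void main(String[] args) {
        List<ItemResult> items = sample();
        System.out.println("Passed: " + passCount(items) + " of " + items.size());
        System.out.println("Pass rate: " + passRate(items));
        System.out.println("Average faithfulness: " + averageScore(items, "Faithfulness"));
    }
}
```

With the two sample items, only the first passes both evaluators, so the pass rate is 0.5 and the average faithfulness is 0.75.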
Contains: Many ItemResults
ItemResult
The result of evaluating one example.
| Attribute | Type | Required | Description |
|---|---|---|---|
| example | Example | Yes | The original test case |
| actualOutputs | Map<String, Object> | No | What your task produced |
| evalResults | List<EvalResult> | No | Results from each evaluator |
Call success() to check if every evaluator passed.
- Java
- Kotlin
for (ItemResult item : experimentResult.itemResults()) {
System.out.println("Input: " + item.example().input());
System.out.println("Expected: " + item.example().expectedOutput());
System.out.println("Actual: " + item.actualOutputs().get("output"));
System.out.println("Passed: " + item.success());
// See why it failed
for (EvalResult eval : item.evalResults()) {
if (!eval.success()) {
System.out.println(eval.name() + ": " + eval.reason());
}
}
}
experimentResult.itemResults().forEach { item ->
println("Input: ${item.example().input()}")
println("Expected: ${item.example().expectedOutput()}")
println("Actual: ${item.actualOutputs()["output"]}")
println("Passed: ${item.success()}")
// See why it failed
item.evalResults().filterNot { it.success() }.forEach { eval ->
println("${eval.name()}: ${eval.reason()}")
}
}
Contains: Example, EvalResults
Part of: ExperimentResult
EvalTestCase
A test case ready for evaluation. It combines an example with the actual output.
| Attribute | Type | Required | Description |
|---|---|---|---|
| inputs | Map<String, Object> | No | Original inputs |
| actualOutputs | Map<String, Object> | No | What the task produced |
| expectedOutputs | Map<String, Object> | No | What you expected |
| metadata | Map<String, Object> | No | Additional metadata |
Three shortcuts read the primary values:
- input() returns the primary input.
- actualOutput() returns the primary actual output.
- expectedOutput() returns the primary expected output.
This is the object Dokimos passes to each evaluator. You rarely build one yourself. Dokimos builds it when an experiment runs.
Created from: Example + actual outputs
Passed to: Evaluators
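To make the "Example + actual outputs" relationship concrete, here is a minimal sketch using stand-in records. The `toTestCase` helper and the `text` conversion are hypothetical, and the real Dokimos classes may build and read the test case differently.

```java
import java.util.Map;

// Stand-in records for illustration only; the real Dokimos classes may differ.
final class TestCaseSketch {
    record Example(Map<String, Object> inputs,
                   Map<String, Object> expectedOutputs,
                   Map<String, Object> metadata) {}

    record EvalTestCase(Map<String, Object> inputs,
                        Map<String, Object> actualOutputs,
                        Map<String, Object> expectedOutputs,
                        Map<String, Object> metadata) {
        // Shortcuts for the primary values stored under the conventional keys.
        String input() { return text(inputs.get("input")); }
        String actualOutput() { return text(actualOutputs.get("output")); }
        String expectedOutput() { return text(expectedOutputs.get("output")); }

        // Assumption: primary values are rendered with toString(); null stays null.
        private static String text(Object value) {
            return value == null ? null : value.toString();
        }
    }

    // Hypothetical helper: copies the example's maps and adds what the task produced.
    static EvalTestCase toTestCase(Example example, Map<String, Object> actualOutputs) {
        return new EvalTestCase(
                example.inputs(), actualOutputs, example.expectedOutputs(), example.metadata());
    }

    public static void main(String[] args) {
        Example example = new Example(
                Map.of("input", "What's 2+2?"),
                Map.of("output", "4"),
                Map.of("category", "math"));
        EvalTestCase testCase = toTestCase(example, Map.of("output", "4", "tokens", 12));

        System.out.println("Input: " + testCase.input());
        System.out.println("Expected: " + testCase.expectedOutput());
        System.out.println("Actual: " + testCase.actualOutput());
    }
}
```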
Typed outputs
The output and expected-output maps hold Object values, so the usual habit is to stringify everything. A task can instead return a structured object (a record, a list, a POJO) and read it back type-safely later. This keeps your task body honest (you return the thing you built, not a hand-assembled map) and lets custom evaluators work with real domain objects instead of parsing strings.
For the whole typed pipeline in one place (authoring a typed output, comparing it, reading it back, judging it as JSON, and typing tool-call results), see the Structured & Typed Data hub. The sections below are the per-method reference it links into.
Returning a typed value from a task
Task.typed(fn) wraps a function that returns a single value and stores it under the conventional "output" key. In Kotlin, the reified typedTask<T> { ... } DSL does the same thing.
Task.typed rejects a null return with NullPointerException, because the output map cannot hold a null value. If you genuinely need an absent output, use a raw Task. As a convenience guard, if your function already returns a Map, that map is used directly as the output map rather than being nested under "output", so a multi-key task can adopt typed without double-nesting.
- Java
- Kotlin
record Movie(String title, String director, int year) {}
Task task = Task.typed(example -> {
String json = llm.chat(example.input());
return Json.parseMovie(json); // returns a Movie record
});
data class Movie(val title: String, val director: String, val year: Int)
val task = typedTask<Movie> { example ->
val json = llm.chat(example.input())
parseMovie(json) // returns a Movie
}
Inside experiment { ... } you can also set it directly with the typedTask builder method:
val experiment = experiment {
name = "Movie extraction"
dataset(movieDataset)
typedTask<Movie> { example -> parseMovie(llm.chat(example.input())) }
evaluator(StructuralMatchEvaluator.builder().build())
}
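The wrapping rules described for Task.typed (store under "output", reject null, pass a Map through unchanged) can be sketched as follows. This is an illustration under assumptions, not the library's implementation: a plain `Function` stands in for the Task interface, and the `typed` helper here is hypothetical.

```java
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

// Illustration only: a plain Function stands in for the Dokimos Task interface.
final class TypedTaskSketch {
    record Movie(String title, String director, int year) {}

    @SuppressWarnings("unchecked")
    static <I> Function<I, Map<String, Object>> typed(Function<I, ?> fn) {
        return input -> {
            // A null return is rejected, because the output map cannot hold null.
            Object value = Objects.requireNonNull(fn.apply(input), "typed task returned null");
            // A Map is used as the output map itself, so it is not double-nested.
            if (value instanceof Map<?, ?> map) {
                return (Map<String, Object>) map;
            }
            // Any other value is stored under the conventional "output" key.
            return Map.of("output", value);
        };
    }

    public static void main(String[] args) {
        Function<String, Map<String, Object>> single =
                typed((String title) -> new Movie(title, "Unknown", 1999));
        System.out.println(single.apply("Example Title").keySet()); // [output]

        Function<String, Map<String, Object>> multi =
                typed((String text) -> Map.of("output", text, "tokens", 3));
        System.out.println(multi.apply("hi").containsKey("tokens")); // true
    }
}
```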
Reading typed values back
Both EvalTestCase and Example expose typed accessors. For a non-generic target, pass a Class<T>. The accessors default to the "output" key, and keyed overloads read any other key.
| Method | Reads | Returns |
|---|---|---|
| actualOutputAs(Class<T>) | actual "output" | converted value or null |
| actualOutputAs(OutputType<T>) | actual "output" | converted value or null |
| actualOutputAs(String, Class<T>) | actual under key | converted value or null |
| actualOutputAs(String, OutputType<T>) | actual under key | converted value or null |
| expectedOutputAs(Class<T>) | expected "output" | converted value or null |
| expectedOutputAs(OutputType<T>) | expected "output" | converted value or null |
| expectedOutputAs(String, Class<T>) | expected under key | converted value or null |
| expectedOutputAs(String, OutputType<T>) | expected under key | converted value or null |
Example carries the expectedOutputAs(...) twins only (it has no actual output yet). EvalTestCase carries both the actual and expected variants.
- Java
- Kotlin
public class MovieEvaluator implements Evaluator {
@Override
public EvalResult evaluate(EvalTestCase testCase) {
Movie actual = testCase.actualOutputAs(Movie.class);
Movie expected = testCase.expectedOutputAs(Movie.class);
// Guard both sides: either accessor returns null when its key is absent
boolean match = actual != null
&& expected != null
&& actual.director().equals(expected.director());
return EvalResult.builder()
.name("Movie Director")
.score(match ? 1.0 : 0.0)
.success(match)
.reason(match ? "Director matches" : "Wrong director")
.build();
}
@Override
public String name() { return "Movie Director"; }
@Override
public double threshold() { return 1.0; }
}
class MovieEvaluator : Evaluator {
override fun evaluate(testCase: EvalTestCase): EvalResult {
val actual = testCase.actualOutputAs(Movie::class.java)
val expected = testCase.expectedOutputAs(Movie::class.java)
val match = actual != null && actual.director == expected?.director
return EvalResult(
name = "Movie Director",
score = if (match) 1.0 else 0.0,
success = match,
reason = if (match) "Director matches" else "Wrong director",
)
}
override fun name(): String = "Movie Director"
override fun threshold(): Double = 1.0
}
Generic types with OutputType<T>
A plain Class<T> cannot express a generic target like List<Movie>, because type arguments are erased at runtime. OutputType<T> is a super-type token (the "Gafter gadget", like Jackson's TypeReference or Spring's ParameterizedTypeReference) that captures the full generic type. Always instantiate it as an anonymous subclass so the type argument is recorded:
- Java
- Kotlin
// Task produces a List<Movie>
Task task = Task.typed(example -> parseMovies(llm.chat(example.input())));
// Read it back, preserving the element type
List<Movie> movies =
testCase.actualOutputAs(new OutputType<List<Movie>>() {});
// A keyed, non-"output" variant works the same way
List<Movie> shortlist =
testCase.actualOutputAs("shortlist", new OutputType<List<Movie>>() {});
// Task produces a List<Movie>
val task = typedTask<List<Movie>> { example -> parseMovies(llm.chat(example.input())) }
// Read it back, preserving the element type
val movies: List<Movie> =
testCase.actualOutputAs(object : OutputType<List<Movie>>() {})
// A keyed, non-"output" variant works the same way
val shortlist: List<Movie> =
testCase.actualOutputAs("shortlist", object : OutputType<List<Movie>>() {})
Constructing an OutputType raw (new OutputType() {}) throws IllegalArgumentException, because there is no type argument to capture. Use the Class<T> accessors for non-generic targets, and reach for OutputType<T> only when the target is generic.
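For readers unfamiliar with super-type tokens, the general technique looks roughly like this. `TypeToken` below is a hypothetical stand-in written for this page, not the Dokimos `OutputType` source. It only shows why the anonymous subclass matters and why a raw construction has nothing to capture.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

final class TypeTokenSketch {
    // Hypothetical stand-in for a super-type token; not the Dokimos class.
    abstract static class TypeToken<T> {
        private final Type type;

        protected TypeToken() {
            // The anonymous subclass records its generic superclass, e.g. TypeToken<List<String>>.
            Type superclass = getClass().getGenericSuperclass();
            if (!(superclass instanceof ParameterizedType parameterized)) {
                // A raw "new TypeToken() {}" has no type argument to read.
                throw new IllegalArgumentException("TypeToken must be created with a type argument");
            }
            this.type = parameterized.getActualTypeArguments()[0];
        }

        Type type() {
            return type;
        }
    }

    public static void main(String[] args) {
        Type captured = new TypeToken<List<String>>() {}.type();
        // The element type survives, unlike with List.class.
        System.out.println(captured.getTypeName());
    }
}
```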
Conversion contract
The typed accessors share one conversion contract across EvalTestCase and Example:
- Absent key returns null. If the requested key is missing from the map, the accessor returns null instead of throwing.
- Already the right type is returned as-is. For the Class<T> accessors, a stored value that is already an instance of the target type is cast directly without going through serialization.
- Otherwise it is converted, or it throws. Any other value is converted (via Jackson under the hood). If the value cannot be converted to the requested type, the accessor throws DokimosTypeConversionException (in dev.dokimos.core.exceptions).
This is why a typed task pairs naturally with structural matching: StructuralMatchEvaluator compares the stored structured value against the expected structure, and your custom evaluators can read the same value back as a real object.
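The three branches of the conversion contract can be sketched in a few lines. This is an assumption-laden illustration: `readAs`, the `converter` parameter (standing in for the Jackson conversion), and `ConversionException` (standing in for DokimosTypeConversionException) are all hypothetical names.

```java
import java.util.Map;
import java.util.function.Function;

final class ConversionContractSketch {
    // Hypothetical stand-in for the library's conversion exception.
    static final class ConversionException extends RuntimeException {
        ConversionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    static <T> T readAs(Map<String, Object> values, String key,
                        Class<T> target, Function<Object, T> converter) {
        Object stored = values.get(key);
        if (stored == null) {
            return null; // 1. Absent key: return null instead of throwing.
        }
        if (target.isInstance(stored)) {
            return target.cast(stored); // 2. Already the right type: cast directly.
        }
        try {
            return converter.apply(stored); // 3. Otherwise convert...
        } catch (RuntimeException e) {
            // ...or throw a dedicated conversion exception.
            throw new ConversionException(
                    "Cannot convert '" + key + "' to " + target.getSimpleName(), e);
        }
    }

    public static void main(String[] args) {
        // Stand-in converter: parses the stored value's text as an integer.
        Function<Object, Integer> toInt = value -> Integer.valueOf(value.toString());

        Map<String, Object> outputs = Map.of("output", "42", "count", 7);
        System.out.println(readAs(outputs, "output", Integer.class, toInt));  // converted
        System.out.println(readAs(outputs, "count", Integer.class, toInt));   // returned as-is
        System.out.println(readAs(outputs, "missing", Integer.class, toInt)); // null
    }
}
```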
EvalResult
The score and feedback from one evaluator.
| Attribute | Type | Required | Description |
|---|---|---|---|
| name | String | Yes | Evaluator name |
| score | double | Yes | Score (0.0 to 1.0) |
| success | boolean | Yes | Whether it passed the threshold |
| reason | String | Yes | Why this score was given |
| metadata | Map<String, Object> | No | Extra info from evaluator |
- Java
- Kotlin
for (EvalResult eval : itemResult.evalResults()) {
System.out.println(eval.name() + ": " + eval.score());
if (!eval.success()) {
System.out.println(" Failed because: " + eval.reason());
}
}
itemResult.evalResults().forEach { eval ->
println("${eval.name()}: ${eval.score()}")
// Print the reason directly under the score it explains
if (!eval.success()) {
println(" Failed because: ${eval.reason()}")
}
}
Produced by: Evaluator
Part of: ItemResult
Interfaces
Task
The function that runs your LLM or system.
- Java
- Kotlin
@FunctionalInterface
public interface Task {
Map<String, Object> run(Example example);
}
fun interface Task {
fun run(example: Example): Map<String, Any>
}
Return a single output, or return several keys at once:
- Java
- Kotlin
// Simple task
Task simple = example -> {
String response = llm.chat(example.input());
return Map.of("output", response);
};
// Task with multiple outputs
Task detailed = example -> {
String response = llm.chat(example.input());
return Map.of(
"output", response,
"tokens", 150,
"latency_ms", 320
);
};
// Simple task
val simple: Task = Task { example ->
val response = llm.chat(example.input())
mapOf("output" to response)
}
// Task with multiple outputs
val detailed: Task = Task { example ->
val response = llm.chat(example.input())
mapOf(
"output" to response,
"tokens" to 150,
"latency_ms" to 320
)
}
Evaluator
The interface for judging outputs.
- Java
- Kotlin
public interface Evaluator {
EvalResult evaluate(EvalTestCase testCase);
String name();
double threshold();
}
interface Evaluator {
fun evaluate(testCase: EvalTestCase): EvalResult
fun name(): String
fun threshold(): Double
}
Dokimos ships these built-in implementations:
- ExactMatchEvaluator checks for an exact match.
- RegexEvaluator matches a pattern.
- LLMJudgeEvaluator uses another LLM to judge.
- FaithfulnessEvaluator checks that the answer is grounded in the context.
- Agent evaluators cover tool call validation, task completion, argument hallucination, and tool reliability.
Write your own by implementing the three methods:
- Java
- Kotlin
public class LengthEvaluator implements Evaluator {
@Override
public EvalResult evaluate(EvalTestCase testCase) {
String output = testCase.actualOutput();
boolean inRange = output.length() >= 50 && output.length() <= 500;
return EvalResult.builder()
.name("Length Check")
.score(inRange ? 1.0 : 0.0)
.success(inRange)
.reason(inRange ? "Good length" : "Too short or too long")
.build();
}
@Override
public String name() { return "Length Check"; }
@Override
public double threshold() { return 1.0; }
}
class LengthEvaluator : Evaluator {
override fun evaluate(testCase: EvalTestCase): EvalResult {
val output = testCase.actualOutput()
val inRange = output.length in 50..500
return EvalResult(
name = "Length Check",
score = if (inRange) 1.0 else 0.0,
success = inRange,
reason = if (inRange) "Good length" else "Too short or too long"
)
}
override fun name(): String = "Length Check"
override fun threshold(): Double = 1.0
}
Working with Maps
Most attributes use Map<String, Object> so you can store anything. These are the keys Dokimos recognizes:
| Key | Used In | Description |
|---|---|---|
| "input" | inputs | Primary input text |
| "output" | outputs | Primary output text |
| "context" | outputs | Retrieved documents (for RAG) |
| "query" | inputs | Search query (for RAG) |
| "toolCalls" | outputs / expected | Tool calls made by an agent (for agent evaluation) |
| "tools" | metadata | Available tool definitions (for agent evaluation) |
| "tasks" | metadata | Task list for agent completion evaluation |
For a RAG task, put the retrieved docs under "context" so evaluators can read them:
- Java
- Kotlin
Task ragTask = example -> {
List<String> docs = retriever.search(example.input());
String answer = llm.generate(example.input(), docs);
return Map.of(
"output", answer,
"context", docs, // Evaluators can check this
"num_docs", docs.size()
);
};
val ragTask: Task = Task { example ->
val docs = retriever.search(example.input())
val answer = llm.generate(example.input(), docs)
mapOf(
"output" to answer,
"context" to docs, // Evaluators can check this
"num_docs" to docs.size
)
}
Add any custom keys you need. Built-in evaluators read the standard keys, and custom evaluators can read anything you put in the map.
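As a rough sketch of what reading custom keys might look like, the check below inspects the "context" and "num_docs" entries that the RAG task above returns. `Verdict` and `checkContext` are hypothetical stand-ins. In a real evaluator the same logic would sit inside evaluate(...) and read the actual outputs from the test case.

```java
import java.util.List;
import java.util.Map;

final class ContextCheckSketch {
    // Hypothetical stand-in for an evaluation result; not the Dokimos EvalResult.
    record Verdict(double score, boolean success, String reason) {}

    static Verdict checkContext(Map<String, Object> actualOutputs) {
        // "context" is a standard key; it should hold the retrieved documents.
        Object context = actualOutputs.get("context");
        if (!(context instanceof List<?> docs) || docs.isEmpty()) {
            return new Verdict(0.0, false, "No retrieved documents under \"context\"");
        }
        // "num_docs" is a custom key; values come back as Object, so check the type.
        Object numDocs = actualOutputs.get("num_docs");
        boolean consistent = numDocs instanceof Integer n && n == docs.size();
        return new Verdict(
                consistent ? 1.0 : 0.0,
                consistent,
                consistent ? "Context present and count matches" : "num_docs does not match context size");
    }

    public static void main(String[] args) {
        Map<String, Object> outputs = Map.of(
                "output", "Refunds are available for 30 days.",
                "context", List.of("Refund policy: 30 days.", "Shipping policy."),
                "num_docs", 2);
        System.out.println(checkContext(outputs).reason());
    }
}
```

Because the maps hold Object values, the sketch checks each value's type before using it instead of casting blindly.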