Experiments

An experiment runs your LLM application against a whole dataset, scores every output, and hands you the totals. It is the main way to measure how well your application performs.

The pieces fit together like this. You wrap your application in a Task. You point the experiment at a Dataset. You attach one or more Evaluators to grade the outputs. You call run(), and you get an ExperimentResult with pass rates, scores, and per-item details.

Here is the shortest path from nothing to a number.

Run your first experiment

This builds a three-example dataset, runs your bot against it, grades each answer with an LLM judge, and prints the pass rate. Copy it, swap in your own bot and judge, and run it.

import dev.dokimos.core.*;
import java.util.List;
import java.util.Map;

// 1. Build a dataset (input + expected output per example)
Dataset dataset = Dataset.builder()
.name("Product Support Questions")
.addExample(Example.of(
"How do I reset my password?",
"Click 'Forgot Password' on the login page and follow the email instructions"
))
.addExample(Example.of(
"Where can I track my order?",
"Go to your account dashboard and click on 'Order History'"
))
.addExample(Example.of(
"What payment methods do you accept?",
"We accept credit cards, PayPal, and bank transfers"
))
.build();

// 2. Wrap your application in a Task. It returns a map of outputs.
Task task = example -> {
String answer = customerSupportBot.generateAnswer(example.input());
return Map.of("output", answer);
};

// 3. Add evaluators to grade the outputs
List<Evaluator> evaluators = List.of(
LLMJudgeEvaluator.builder()
.name("Answer Quality")
.criteria("Is the answer helpful and accurate?")
.judge(judge)
.threshold(0.8)
.build()
);

// 4. Run it
ExperimentResult result = Experiment.builder()
.name("QA Evaluation")
.dataset(dataset)
.task(task)
.evaluators(evaluators)
.build()
.run();

// 5. Read the totals
System.out.println("Pass rate: " + String.format("%.2f%%", result.passRate() * 100));
System.out.println("Total examples: " + result.totalCount());
System.out.println("Passed: " + result.passCount());
System.out.println("Failed: " + result.failCount());

That is the full loop. The rest of this page goes deeper on each piece: tasks, datasets, parallelism, evaluators, results, CI, and exports.

When to use experiments vs JUnit

Dokimos also plugs into JUnit (see the @DatasetSource annotation). The two tools solve different problems.

| Aspect | JUnit tests with @DatasetSource | Experiments |
|--------|---------------------------------|-------------|
| Purpose | Unit and integration testing | Full-dataset evaluation and benchmarking |
| Execution | Individual test assertions | Batch run with aggregation |
| Results | Pass or fail per test | Pass rates, average scores, totals |
| Use case | CI/CD quality gates | Performance analysis and reporting |
| Flexibility | One example at a time | Whole datasets, trends over time |
| Output | Test reports (JUnit format) | Detailed results with statistics |

Reach for JUnit tests when you want to:

  • Fail the build if critical cases don't pass
  • Catch regressions fast during development
  • Get immediate feedback on specific examples

Reach for experiments when you want to:

  • Measure performance across a whole dataset
  • Generate reports with metrics and trends
  • Compare models or prompt versions
  • Understand overall application behavior

Most projects use both.

Why bother

Manual testing with a few prompts does not scale. Experiments give you:

  • Numbers you can track. Pass rates, average scores, and counts over time. Now you know whether a prompt change or model swap actually helped.
  • Coverage. Run the whole dataset automatically instead of trying inputs by hand.
  • Comparisons. Run different models, prompts, or retrieval strategies against the same cases.
  • Regression alarms. Wire experiments into CI/CD so changes don't quietly break things.
  • Failure patterns. When outputs go wrong, see which kinds of inputs fail and why.

Writing the Task

A Task runs your application for one example and returns its outputs. It is a single-method functional interface.

@FunctionalInterface
public interface Task {
Map<String, Object> run(Example example);
}

The simplest task calls your model and returns one output:

Task task = example -> {
String response = myLlmService.generate(example.input());
return Map.of("output", response);
};

For RAG or other multi-step systems, return more than one value. Evaluators read these by key.

Task ragTask = example -> {
// Retrieve the 3 most relevant documents (topK = 3)
List<String> retrievedDocs = vectorStore.search(example.input(), 3);

// Generate a response using the retrieved context
String response = ragSystem.generate(example.input(), retrievedDocs);

// Capture a confidence score
double confidence = ragSystem.getConfidenceScore();

return Map.of(
"output", response,
"retrievedContext", retrievedDocs,
"confidence", confidence
);
};

Recording tokens, cost, and latency

A plain Task returns only outputs, so each ItemResult carries null metrics. To record tokens, cost, and latency, return a MeasuredTask instead. It returns a TaskResult that holds the outputs plus a CallMetrics record, and those metrics flow through to every ItemResult.metrics().

@FunctionalInterface
public interface MeasuredTask {
TaskResult run(Example example);
}

CallMetrics is a record with four nullable fields: tokensIn, tokensOut, costUsd, and latencyMs. Fill in what you can measure. Leave the rest null.

MeasuredTask task = example -> {
long start = System.nanoTime(); // monotonic clock, suited to measuring elapsed time
LlmResponse response = myLlmService.generate(example.input());
long latencyMs = (System.nanoTime() - start) / 1_000_000;

CallMetrics metrics = new CallMetrics(
response.promptTokens(),
response.completionTokens(),
response.costUsd(),
latencyMs
);

return new TaskResult(Map.of("output", response.text()), metrics);
};

ExperimentResult result = Experiment.builder()
.name("QA Evaluation")
.dataset(dataset)
.measuredTask(task)
.evaluators(evaluators)
.build()
.run();

The plain task(Task) path still works the same. Use measuredTask(MeasuredTask) only when you want metrics on the results. The builder method has a separate name so a lambda passed to task(...) is never ambiguous between the two interfaces.
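
Because any metrics field can be null, and a plain Task leaves the whole metrics value null, code that totals them has to skip the gaps. The sketch below shows one way to do that. The `Metrics` record is a hypothetical stand-in for CallMetrics, and its field types (Integer, Double, Long) are assumptions, not the library's declared types.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for CallMetrics. The field types are assumptions.
final class MetricsTotalsSketch {

    record Metrics(Integer tokensIn, Integer tokensOut, Double costUsd, Long latencyMs) {}

    // Sums input and output tokens, skipping missing metrics and null fields.
    static long totalTokens(List<Metrics> all) {
        return all.stream()
                .filter(Objects::nonNull)
                .mapToLong(m -> orZero(m.tokensIn()) + orZero(m.tokensOut()))
                .sum();
    }

    // Sums cost in USD, skipping missing metrics and null fields.
    static double totalCostUsd(List<Metrics> all) {
        return all.stream()
                .filter(Objects::nonNull)
                .map(Metrics::costUsd)
                .filter(Objects::nonNull)
                .mapToDouble(Double::doubleValue)
                .sum();
    }

    private static long orZero(Integer value) {
        return value == null ? 0L : value;
    }

    public static void main(String[] args) {
        // Arrays.asList permits null elements; List.of does not.
        List<Metrics> perItem = Arrays.asList(
                new Metrics(120, 45, 0.0021, 830L),
                null,                                // item produced by a plain Task
                new Metrics(null, 60, null, 910L));  // only some fields measured
        System.out.println("Total tokens: " + totalTokens(perItem));
        System.out.println("Total cost (USD): " + totalCostUsd(perItem));
    }
}
```

In a real run you would collect the values from each ItemResult.metrics() instead of building the list by hand.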

Running against a dataset

Load a dataset from a file

Experiments take any Dataset, including ones loaded from JSON or CSV on the classpath.

// Load a dataset from the classpath
Dataset dataset = DatasetResolverRegistry.getInstance()
.resolve("classpath:datasets/qa-dataset.json");

// Run the experiment
ExperimentResult result = Experiment.builder()
.name("QA Evaluation")
.dataset(dataset)
.task(task)
.evaluators(evaluators)
.build()
.run();

Inspect each result

After a run, loop over the items to see what happened on each example.

ExperimentResult result = experiment.run();

// Walk every item result
for (ItemResult itemResult : result.itemResults()) {
System.out.println("\nInput: " + itemResult.example().input());
System.out.println("Expected: " + itemResult.example().expectedOutput());
System.out.println("Actual: " + itemResult.actualOutputs().get("output"));
System.out.println("Success: " + itemResult.success());

// Check each evaluator's result for this item
for (EvalResult evalResult : itemResult.evalResults()) {
System.out.println(" " + evalResult.name() +
": " + (evalResult.success() ? "PASS" : "FAIL") +
" (score: " + evalResult.score() + ")");
}
}

Find the failures

To debug, filter for items that did not pass.

ExperimentResult result = experiment.run();

List<ItemResult> failures = result.itemResults().stream()
.filter(item -> !item.success())
.toList();

System.out.println("Failed cases: " + failures.size());
for (ItemResult failure : failures) {
System.out.println("Failed input: " + failure.example().input());
System.out.println("Expected: " + failure.example().expectedOutput());
System.out.println("Got: " + failure.actualOutputs().get("output"));
}

One bad item never kills the run

If a task or evaluator throws on one example, the run keeps going. That example is recorded as a failed item (its success() is false, with no eval results), and execution moves to the next example. Sequential and parallel runs behave the same way, so one flaky call or one malformed output never costs you the rest of the dataset. Filter for !item.success(), as shown above, to inspect what failed.
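
One way to picture this isolation is a try/catch around each example. The sketch below is not Dokimos's implementation. `Outcome` and `runAll` are hypothetical names, and plain strings stand in for examples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-ins, not Dokimos types: a string plays the role of an
// example, and Outcome plays the role of an item result.
final class IsolationSketch {

    record Outcome(String input, Map<String, Object> outputs, boolean success) {}

    // Runs the task once per input. An exception on one input is recorded as
    // a failed outcome, and the loop moves on to the next input.
    static List<Outcome> runAll(List<String> inputs,
                                Function<String, Map<String, Object>> task) {
        List<Outcome> outcomes = new ArrayList<>();
        for (String input : inputs) {
            try {
                outcomes.add(new Outcome(input, task.apply(input), true));
            } catch (RuntimeException e) {
                outcomes.add(new Outcome(input, Map.of(), false));
            }
        }
        return outcomes;
    }

    public static void main(String[] args) {
        List<Outcome> outcomes = runAll(List.of("a", "boom", "c"), input -> {
            if (input.equals("boom")) {
                throw new IllegalStateException("simulated flaky call");
            }
            return Map.of("output", input.toUpperCase());
        });
        long failed = outcomes.stream().filter(o -> !o.success()).count();
        System.out.println("Failed: " + failed + " of " + outcomes.size());
    }
}
```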

Parallelism and multiple runs

Two builder settings control speed and statistical confidence: parallelism and runs.

Run examples concurrently

Set .parallelism(n) to process n examples at once within each run.

ExperimentResult result = Experiment.builder()
.name("Knowledge Assistant Evaluation")
.dataset(dataset)
.task(task)
.evaluators(evaluators)
.parallelism(4) // run 4 examples at once
.build()
.run();

The default is 1 (sequential). Raise it for speed, but watch your API rate limits. When you set parallelism above 1, make sure your task is thread-safe.
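
Thread safety here mostly comes down to how the task handles shared state. The sketch below keeps per-call data in local variables and uses an AtomicInteger for the one field that every call updates. The class and method names are hypothetical, and the task is a plain method standing in for a Dokimos Task.

```java
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ThreadSafeTaskSketch {

    // Shared state touched by every invocation. AtomicInteger keeps the count
    // correct under concurrent updates; a plain int field could lose increments.
    private final AtomicInteger calls = new AtomicInteger();

    Map<String, Object> run(String input) {
        calls.incrementAndGet();
        // Everything else is local to this call, so threads cannot interfere.
        String answer = "echo: " + input;
        return Map.of("output", answer);
    }

    int calls() {
        return calls.get();
    }

    // Invokes the task `invocations` times from a pool of `threads` workers
    // and returns the recorded call count.
    static int countCalls(int threads, int invocations) {
        ThreadSafeTaskSketch task = new ThreadSafeTaskSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < invocations; i++) {
            String input = "question " + i;
            pool.execute(() -> task.run(input));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return task.calls();
    }

    public static void main(String[] args) {
        System.out.println("Calls recorded: " + countCalls(4, 1_000));
    }
}
```

The same rule applies to any client, cache, or buffer the task closes over: either confirm it is documented as thread-safe or give each call its own instance.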

Repeat the run for stability

Set .runs(n) to run the whole experiment n times.

ExperimentResult result = Experiment.builder()
.name("Knowledge Assistant Evaluation")
.dataset(dataset)
.task(task)
.evaluators(evaluators)
.runs(3) // run the experiment 3 times
.parallelism(4) // parallelism within each run
.build()
.run();

Runs go one after another. Parallelism applies inside each run. Repeating runs smooths out LLM non-determinism and gives you confidence in the numbers.

Read the run statistics:

result.averageScore("Faithfulness"); // mean across all runs
result.scoreStdDev("Faithfulness");  // standard deviation across runs
result.runCount();                   // number of runs performed
result.runs();                       // individual run results

A high standard deviation means your task or evaluator output is unstable.
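
To make those two numbers concrete, the sketch below computes a mean and a standard deviation over per-run scores by hand. It uses the population formula (divide by the number of runs). Which formula Dokimos applies is not stated on this page, so treat that choice as an assumption. The score lists are made-up illustrations, not measured results.

```java
import java.util.List;

final class RunStatsSketch {

    // Arithmetic mean of the per-run scores.
    static double mean(List<Double> scores) {
        return scores.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    // Population standard deviation: square root of the mean squared deviation.
    static double populationStdDev(List<Double> scores) {
        double m = mean(scores);
        double variance = scores.stream()
                .mapToDouble(s -> (s - m) * (s - m))
                .average()
                .orElse(0.0);
        return Math.sqrt(variance);
    }

    public static void main(String[] args) {
        List<Double> steady = List.of(0.85, 0.86, 0.84);   // illustrative scores
        List<Double> jumpy = List.of(0.55, 0.95, 0.70);    // illustrative scores
        System.out.printf("steady: mean=%.3f stdDev=%.3f%n",
                mean(steady), populationStdDev(steady));
        System.out.printf("jumpy:  mean=%.3f stdDev=%.3f%n",
                mean(jumpy), populationStdDev(jumpy));
    }
}
```

Two configurations can share a similar mean while one swings far more from run to run, which is why it helps to read both numbers together.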

Asynchronous tasks

The task and measuredTask paths block one thread per in-flight example. That is fine for blocking SDK calls. It is a poor fit when your task is already non-blocking, such as a Kotlin suspend function, a Reactor or CompletableFuture pipeline, or an agent runtime that hands you a future. For those, use an AsyncTask. It returns a CompletableFuture<TaskResult>, so the experiment drives many examples without parking a thread on each one.

@FunctionalInterface
public interface AsyncTask {
CompletableFuture<TaskResult> run(Example example);
}

The completed future carries the same TaskResult (outputs plus optional CallMetrics) that measuredTask uses, so call metrics flow through to each ItemResult.metrics() just like on the synchronous paths.

Set it with asyncTask(...). An async task satisfies the task requirement on its own. You do not also call task(...) or measuredTask(...).

import java.util.Map;
import java.util.concurrent.CompletableFuture;

AsyncTask task = example ->
myAsyncLlmService
.generateAsync(example.input()) // returns CompletableFuture<String>
.thenApply(answer -> TaskResult.of(Map.of("output", answer)));

ExperimentResult result = Experiment.builder()
.name("QA Evaluation")
.dataset(dataset)
.asyncTask(task)
.evaluators(evaluators)
.parallelism(8) // caps in-flight invocations at 8
.build()
.run();

How the in-flight cap works

When you set an async task, the experiment runs on a dedicated non-blocking path. This path takes precedence over the sequential and parallel paths. Here parallelism no longer sizes a thread pool. Instead it caps the number of in-flight invocations with a semaphore. The experiment takes a permit before calling asyncTask.run(...) and releases it when that example's future settles, so at most parallelism invocations are ever outstanding. That stops a non-blocking task from launching the entire dataset at once and flooding a downstream service or rate limit. Dataset order is preserved in the returned results.
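
The permit pattern described above can be sketched with a plain java.util.concurrent.Semaphore. This is a minimal illustration of the idea, not the library's code: `runCapped` is a hypothetical helper, strings stand in for examples, and error handling is left out (a call that throws before returning a future would leak its permit here).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class InFlightCapSketch {

    // Starts one async call per input and never lets more than `cap` calls be
    // outstanding at once. Results come back in input order.
    static <T> List<T> runCapped(List<String> inputs,
                                 Function<String, CompletableFuture<T>> asyncCall,
                                 int cap) {
        Semaphore permits = new Semaphore(cap);
        List<CompletableFuture<T>> futures = new ArrayList<>();
        for (String input : inputs) {
            permits.acquireUninterruptibly();                 // wait for a free slot
            CompletableFuture<T> future = asyncCall.apply(input);
            future.whenComplete((value, error) -> permits.release()); // free it once settled
            futures.add(future);
        }
        List<T> results = new ArrayList<>();
        for (CompletableFuture<T> future : futures) {
            results.add(future.join());
        }
        return results;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        try {
            List<String> answers = runCapped(
                    List.of("q1", "q2", "q3", "q4", "q5", "q6"),
                    input -> {
                        // Track how many calls are outstanding right now.
                        peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
                        return CompletableFuture
                                .supplyAsync(() -> slowAnswer(input), pool)
                                .whenComplete((value, error) -> inFlight.decrementAndGet());
                    },
                    2);
            System.out.println(answers);
            System.out.println("Peak in flight: " + peak.get());
        } finally {
            pool.shutdown();
        }
    }

    private static String slowAnswer(String input) {
        try {
            Thread.sleep(20);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "answer to " + input;
    }
}
```

The pool has eight threads, but the semaphore admits only two calls at a time, so the recorded peak cannot exceed the cap.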

note

For tasks that bridge a blocking call onto a future (for example via CompletableFuture.supplyAsync(..., executor)), the real concurrency is the smaller of two limits: the experiment's parallelism cap, or the executor backing those calls. The semaphore caps how many futures are outstanding. The executor caps how many actually run at once. The Kotlin suspendTask {} DSL dispatches on Dispatchers.IO by default. The framework integrations build their async tasks on top of asyncTask(...). See the Koog, LangChain4j, and Spring AI pages for details.
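
The executor half of that limit is easy to see in isolation. In the sketch below, eight futures are created up front, so all eight are outstanding, yet a two-thread executor means no more than two bodies run at the same moment. `peakRunning` is a hypothetical helper written for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

final class ExecutorLimitSketch {

    // Submits `tasks` blocking bodies to an executor with `executorThreads`
    // workers and returns the highest number observed running at once.
    static int peakRunning(int tasks, int executorThreads) {
        ExecutorService executor = Executors.newFixedThreadPool(executorThreads);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        try {
            List<CompletableFuture<Void>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(CompletableFuture.runAsync(() -> {
                    peak.accumulateAndGet(running.incrementAndGet(), Math::max);
                    try {
                        Thread.sleep(10);                    // simulated blocking call
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    running.decrementAndGet();
                }, executor));
            }
            for (CompletableFuture<Void> future : futures) {
                future.join();
            }
            return peak.get();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("Peak running with 8 futures on 2 threads: " + peakRunning(8, 2));
    }
}
```

If you raise parallelism and throughput does not change, the executor behind the bridged calls is a likely place to look.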

Failure isolation works the same

Async tasks isolate failures exactly like the synchronous paths. A future that completes exceptionally becomes a failed ItemResult (its success() is false, with no eval results), and the run continues with the rest. A task that throws synchronously from run(...), or returns a null future, is isolated the same way instead of aborting the run. Filter for !item.success() to see what failed, just like on the sequential and parallel paths.
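
The three failure shapes named above (an exceptionally completed future, a synchronous throw, and a null future) can all be folded into a failed outcome with a small wrapper. The sketch below shows the idea with standard CompletableFuture calls. `isolate` and `Outcome` are hypothetical names, not Dokimos API.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

final class AsyncIsolationSketch {

    // Hypothetical stand-in for an item result.
    record Outcome(String input, String output, boolean success) {}

    // Wraps one async call so that every failure shape becomes a failed
    // Outcome instead of an exception that escapes to the caller.
    static CompletableFuture<Outcome> isolate(
            String input, Function<String, CompletableFuture<String>> asyncCall) {
        CompletableFuture<String> future;
        try {
            future = asyncCall.apply(input);                 // may throw synchronously
        } catch (RuntimeException e) {
            return CompletableFuture.completedFuture(new Outcome(input, null, false));
        }
        if (future == null) {                                // task returned no future
            return CompletableFuture.completedFuture(new Outcome(input, null, false));
        }
        // handle() sees both normal and exceptional completion.
        return future.handle((value, error) -> error == null
                ? new Outcome(input, value, true)
                : new Outcome(input, null, false));
    }

    public static void main(String[] args) {
        Function<String, CompletableFuture<String>> flaky = input -> switch (input) {
            case "throws" -> throw new IllegalStateException("sync failure");
            case "null" -> null;
            case "fails" -> CompletableFuture.failedFuture(new RuntimeException("async failure"));
            default -> CompletableFuture.completedFuture("answer to " + input);
        };
        for (String input : List.of("ok", "throws", "null", "fails")) {
            System.out.println(isolate(input, flaky).join());
        }
    }
}
```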

The Kotlin suspendTask {} DSL

In Kotlin you rarely build an AsyncTask by hand. The suspendTask {} block inside experiment {} takes a suspend body that returns a TaskResult and bridges it to a CompletableFuture for you. There is also a top-level suspendTask(...) function, plus a suspendMapTask(...) overload that returns an output Map and wraps it in a TaskResult with no metrics, for building the task outside the DSL.

import dev.dokimos.core.TaskResult
import dev.dokimos.kotlin.dsl.suspendTask

val task = suspendTask { example ->
val answer = myAsyncLlmService.generate(example.input())
TaskResult.of(mapOf("output" to answer))
}

val result = experiment {
name = "QA Evaluation"
dataset(dataset)
asyncTask(task)
parallelism = 8
evaluators(evaluators)
}.run()

Each invocation launches the suspend body on the given CoroutineScope (the IO dispatcher by default). Pass your own scope to either form to control where the work runs. A suspend exception surfaces as an exceptionally completed future, which the experiment isolates as a failed item.

tip

Use an async task only when your caller is truly non-blocking. If your task is a plain blocking SDK call, the synchronous task(...) or measuredTask(...) path with parallelism(n) is simpler and gives you the same concurrency through its thread pool.

Configuring the experiment

Add a name, a description, evaluators, and metadata on the builder.

Name and description

Experiment.builder()
.name("Customer Support QA Evaluation")
.description("Evaluating the assistant's ability to answer customer support questions accurately")
.dataset(dataset)
.task(task)
.build();

Add evaluators

Add evaluators one at a time or as a list.

// Add evaluators one by one
Experiment.builder()
.name("QA Evaluation")
.dataset(dataset)
.task(task)
.evaluator(exactMatchEvaluator)
.evaluator(faithfulnessEvaluator)
.evaluator(relevanceEvaluator)
.build();

// Or add several at once
List<Evaluator> evaluators = List.of(
exactMatchEvaluator,
faithfulnessEvaluator,
relevanceEvaluator
);

Experiment.builder()
.name("QA Evaluation")
.dataset(dataset)
.task(task)
.evaluators(evaluators)
.build();

build() validates the experiment before it constructs it. It throws IllegalStateException if there is no dataset or task, if the dataset has no examples, or if no evaluators were added. You see configuration mistakes when you build the experiment, not partway through a run.

Close the reporter automatically

When you attach a Reporter with .reporter(...), you own its lifecycle by default. Set .autoCloseReporter(true) to have run() close the reporter once all runs finish, on top of flushing it. The default is false, which leaves the reporter open so you can reuse it across experiments.

Experiment.builder()
.name("QA Evaluation")
.dataset(dataset)
.task(task)
.evaluators(evaluators)
.reporter(reporter)
.autoCloseReporter(true) // run() closes the reporter when done
.build()
.run();

Record configuration with metadata

Use metadata to record the settings behind each run. This helps when you compare results across model versions or configurations later.

Experiment.builder()
.name("GPT-5.2 Evaluation")
.dataset(dataset)
.task(task)
.evaluators(evaluators)
.metadata("model", "gpt-5.2")
.metadata("temperature", 0.7)
.metadata("timestamp", Instant.now().toString())
.metadata("version", "1.0.0")
.build();

// Or add several entries at once
Map<String, Object> metadata = Map.of(
"model", "gpt-5.2",
"temperature", 0.7,
"maxTokens", 500
);

Experiment.builder()
.name("GPT-5.2 Evaluation")
.dataset(dataset)
.task(task)
.evaluators(evaluators)
.metadata(metadata)
.build();

Metadata rides along in the ExperimentResult, so you can use it to tell configurations apart.

Working with evaluators

Each evaluator gives a score from 0.0 to 1.0 and decides pass or fail against a threshold you set. Here are the common ones.

// For deterministic outputs like calculations
Evaluator exactMatch = ExactMatchEvaluator.builder()
.name("Exact Match")
.threshold(1.0)
.build();

// For output format checks (dates, phone numbers, etc.)
Evaluator formatCheck = RegexEvaluator.builder()
.name("Date Format")
.pattern("\\d{4}-\\d{2}-\\d{2}") // YYYY-MM-DD
.threshold(1.0)
.build();

// For semantic correctness, using an LLM as judge
Evaluator semanticCorrectness = LLMJudgeEvaluator.builder()
.name("Answer Correctness")
.criteria("Is the answer factually correct and complete?")
.evaluationParams(List.of(
EvalTestCaseParam.INPUT,
EvalTestCaseParam.EXPECTED_OUTPUT,
EvalTestCaseParam.ACTUAL_OUTPUT
))
.threshold(0.8)
.judge(prompt -> judgeModel.generate(prompt))
.build();

// For checking that RAG outputs are grounded in retrieved docs
Evaluator faithfulness = FaithfulnessEvaluator.builder()
.name("Faithfulness")
.threshold(0.7)
.judge(prompt -> judgeModel.generate(prompt))
.contextKey("retrievedContext")
.build();
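
Before wiring a pattern or threshold into an evaluator, it can help to try it against a few sample outputs with plain Java. The sketch below makes two assumptions that this page does not confirm: that a score passes when it is greater than or equal to the threshold, and that the pattern must match the whole output. The helper names are hypothetical.

```java
import java.util.regex.Pattern;

final class ThresholdSketch {

    // Assumed pass rule: the score meets or exceeds the threshold.
    static boolean passes(double score, double threshold) {
        return score >= threshold;
    }

    // Scores 1.0 when the whole output matches the regex, else 0.0.
    // Whole-string matching is an assumption about how a regex check behaves.
    static double regexScore(String output, String regex) {
        return Pattern.matches(regex, output) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        String datePattern = "\\d{4}-\\d{2}-\\d{2}"; // YYYY-MM-DD
        for (String output : new String[] {"2025-01-02", "Jan 2, 2025"}) {
            double score = regexScore(output, datePattern);
            System.out.println(output + " -> score " + score
                    + ", passes at 1.0: " + passes(score, 1.0));
        }
    }
}
```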

Score several dimensions at once

Real applications usually need more than one check. Add several evaluators and read each one's average score.

List<Evaluator> evaluators = List.of(
// Factual correctness
LLMJudgeEvaluator.builder()
.name("Correctness")
.criteria("Is the answer factually correct?")
.threshold(0.8)
.judge(judge)
.build(),

// Relevance
LLMJudgeEvaluator.builder()
.name("Relevance")
.criteria("Is the answer relevant to the question?")
.threshold(0.7)
.judge(judge)
.build(),

// Faithfulness to source
FaithfulnessEvaluator.builder()
.name("Faithfulness")
.threshold(0.8)
.judge(judge)
.contextKey("retrievedContext")
.build()
);

ExperimentResult result = Experiment.builder()
.name("Multi-dimensional Evaluation")
.dataset(dataset)
.task(task)
.evaluators(evaluators)
.build()
.run();

// Average score per evaluator
System.out.println("Correctness: " + result.averageScore("Correctness"));
System.out.println("Relevance: " + result.averageScore("Relevance"));
System.out.println("Faithfulness: " + result.averageScore("Faithfulness"));

Reading the results

ExperimentResult carries the totals and the per-item detail. With multiple runs, all metrics are averaged across runs for you.

Totals

ExperimentResult result = experiment.run();

// Overall metrics
System.out.println("Experiment: " + result.name());
System.out.println("Description: " + result.description());
System.out.println("Total examples: " + result.totalCount());
System.out.println("Passed: " + result.passCount());
System.out.println("Failed: " + result.failCount());
System.out.println("Pass rate: " + String.format("%.2f%%", result.passRate() * 100));

// Per-evaluator metrics
System.out.println("\nAverage scores:");
System.out.println("Exact Match: " + result.averageScore("Exact Match"));
System.out.println("Relevance: " + result.averageScore("Relevance"));

// For multi-run experiments, check stability
if (result.runCount() > 1) {
System.out.println("\nScore stability (standard deviation):");
System.out.println("Exact Match: " + result.scoreStdDev("Exact Match"));
System.out.println("Relevance: " + result.scoreStdDev("Relevance"));
}

Per-item detail

// Access individual results
List<ItemResult> itemResults = result.itemResults();

for (ItemResult item : itemResults) {
Example example = item.example();
Map<String, Object> actualOutputs = item.actualOutputs();
List<EvalResult> evalResults = item.evalResults();
boolean success = item.success();

// Your analysis here
}

Metadata

// Read experiment metadata
Map<String, Object> metadata = result.metadata();
System.out.println("Model: " + metadata.get("model"));
System.out.println("Temperature: " + metadata.get("temperature"));

Running experiments in CI/CD

Run experiments in CI to catch regressions before they ship. There are two ways to wire it up.

Option 1: a main class with an exit code

Write a main class that exits non-zero when results fall below your threshold.

public class EvaluationPipeline {
public static void main(String[] args) {
// Assumes task and evaluators are built as shown earlier on this page
Dataset dataset = DatasetResolverRegistry.getInstance()
.resolve("classpath:datasets/qa-dataset.json");

ExperimentResult result = Experiment.builder()
.name("CI Validation")
.dataset(dataset)
.task(task)
.evaluators(evaluators)
.build()
.run();

System.out.println("Pass rate: " + result.passRate() * 100 + "%");

// Fail the build if the pass rate is below threshold
if (result.passRate() < 0.95) {
System.err.println("❌ Evaluation failed: pass rate below 95%");
System.exit(1);
}

System.out.println("✅ Evaluation passed!");
System.exit(0);
}
}

Option 2: a JUnit test

Wrap the experiment in a JUnit test for better reporting and IDE integration.

import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.*;

class LLMEvaluationTest {

@Test
void experimentShouldPassQualityThreshold() {
// Assumes task and evaluators are built as shown earlier on this page
Dataset dataset = DatasetResolverRegistry.getInstance()
.resolve("classpath:datasets/qa-dataset.json");

ExperimentResult result = Experiment.builder()
.name("QA Evaluation")
.dataset(dataset)
.task(task)
.evaluators(evaluators)
.build()
.run();

// Assert the pass rate threshold
assertTrue(result.passRate() >= 0.95,
"Pass rate " + result.passRate() + " is below threshold 0.95");

// Assert per-evaluator performance
assertTrue(result.averageScore("Correctness") >= 0.8,
"Correctness score too low");
}
}

GitHub Actions example

name: LLM Evaluation

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up JDK 21
        uses: actions/setup-java@v4
        with:
          java-version: '21'
          distribution: 'temurin'

      - name: Run LLM Evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: mvn test -Dtest=LLMEvaluationTest

      - name: Upload Evaluation Report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-results
          path: target/evaluation-results/

CI/CD tips

  • Keep CI datasets small. Use a subset (20 to 50 examples) so builds stay fast. Run the full dataset nightly or weekly.
  • Set realistic thresholds. Don't expect 100% right away. Start at something you can hit (say 80%) and raise it over time.
  • Cache responses where you can. If you test the same examples often, cache LLM responses to save on API cost.
  • Fail early. Put your most important evaluators first so obvious problems surface fast.
  • Save detailed results. Upload results as build artifacts so you can review failures later.
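
The caching tip above can be sketched as a small wrapper around the model call. This is a minimal in-memory version under stated assumptions: `CachingModelSketch` is a hypothetical class, the model is represented as a plain Function from prompt to response, and the cache lives only for the lifetime of the JVM, so sharing it across CI jobs would need a persisted store.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class CachingModelSketch {

    private final Function<String, String> model;   // the real, expensive call
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    CachingModelSketch(Function<String, String> model) {
        this.model = model;
    }

    // Returns the cached response for a prompt, calling the model only on a miss.
    String generate(String prompt) {
        return cache.computeIfAbsent(prompt, model);
    }

    public static void main(String[] args) {
        AtomicInteger realCalls = new AtomicInteger();
        CachingModelSketch cached = new CachingModelSketch(prompt -> {
            realCalls.incrementAndGet();
            return "answer to: " + prompt;
        });
        cached.generate("How do I reset my password?");
        cached.generate("How do I reset my password?"); // served from the cache
        System.out.println("Real model calls: " + realCalls.get());
    }
}
```

Caching hides run-to-run variation, so it fits quick CI gates better than multi-run stability checks.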

LangChain4j integration

If you use LangChain4j, the dokimos-langchain4j module turns an AI Service into a Task in one call.

import dev.dokimos.langchain4j.LangChain4jSupport;

// Your LangChain4j AI Service
interface Assistant {
Result<String> chat(String userMessage);
}

Assistant assistant = AiServices.builder(Assistant.class)
.chatLanguageModel(chatModel)
.retrievalAugmentor(retrievalAugmentor)
.build();

// Wrap it as a Task
Task task = LangChain4jSupport.ragTask(assistant::chat);

// Run the experiment
ExperimentResult result = Experiment.builder()
.name("RAG Evaluation")
.dataset(dataset)
.task(task)
.evaluators(evaluators)
.build()
.run();

ragTask() pulls the retrieved context out of Result.sources() and adds it to the outputs, so faithfulness evaluation works out of the box.

Best practices

Start small, then grow

Don't build a giant dataset up front. Start with 10 to 20 strong examples that cover your main cases. Run experiments often and add examples as you find edge cases.

Name experiments clearly

When you compare results later, you want to know exactly what each run tested.

.name("gpt-5-nano-customer-support-temp0.7-2025-12-27")

Track everything with metadata

Record model settings, versions, and timestamps so you can reproduce a result.

.metadata("model", "gpt-5-nano")
.metadata("temperature", 0.7)
.metadata("prompt_version", "v3")
.metadata("timestamp", Instant.now().toString())

Match evaluators to the job

  • Use exact match for deterministic factual answers (like calculations).
  • Use LLM judges when you need meaning, not exact text (like whether an explanation holds up).
  • Use faithfulness for RAG, to confirm answers stay grounded in your documents.
  • Build custom evaluators for domain-specific rules.

Set thresholds you can hit

Don't aim for perfect on day one. Start at 70 to 80% and raise the bar as the application improves.

Version your datasets

As you add cases, keep old versions so you can track how the application improves over time.

src/test/resources/datasets/
├── support-v1-initial.json
├── support-v2-edge-cases.json
└── support-v3-current.json

Run experiments regularly

Schedule nightly or weekly runs to catch regressions early. Run a quick experiment on a smaller dataset during development.

Exporting results

Dokimos exports results to four formats for reporting, analysis, or handoff to other tools.

Pick a format

| Format | Best for |
|--------|----------|
| JSON | Programmatic access, storing results, further processing |
| HTML | Human-readable reports, sharing with stakeholders |
| Markdown | CI/CD logs, GitHub PR comments |
| CSV | Spreadsheet analysis, exploration |

Export to files or strings

Write to a file, or get the content back as a string.

ExperimentResult result = experiment.run();

// Write to files
result.exportJson(Path.of("results/experiment.json"));
result.exportHtml(Path.of("results/report.html"));
result.exportMarkdown(Path.of("results/summary.md"));
result.exportCsv(Path.of("results/data.csv"));

// Get as strings (for inline use, PR comments, etc.)
String json = result.toJson();
String html = result.toHtml();
String markdown = result.toMarkdown();
String csv = result.toCsv();

JSON format

The JSON export holds the full experiment data.

{
"version": 1,
"experimentName": "QA Evaluation",
"timestamp": "2025-01-02T14:30:00Z",
"description": "Testing customer support bot",
"metadata": { "model": "gpt-5-nano" },
"config": { "runs": 3 },
"summary": {
"totalExamples": 50,
"passCount": 45,
"failCount": 5,
"passRate": 0.9,
"runCount": 3,
"evaluators": {
"Faithfulness": {
"averageScore": 0.85,
"stdDev": 0.03,
"passRate": 0.92
}
}
},
"items": [...]
}

For multi-run experiments, each item's evaluations include aggregated statistics.

{
"evaluator": "Faithfulness",
"averageScore": 0.85,
"stdDev": 0.03,
"scores": [0.82, 0.87, 0.86],
"threshold": 0.8,
"success": true
}

Markdown format

Markdown suits CI/CD logs and readable summaries.

# Experiment: QA Evaluation

**Date:** 2025-01-02 14:30:00
**Pass Rate:** 90% (45/50)

## Evaluator Summary

| Evaluator | Avg Score | Std Dev | Pass Rate |
|-----------|-----------|---------|-----------|
| Faithfulness | 0.85 | 0.03 | 92% |

## Failed Examples

### What is your return policy?
**Expected:** 30 days, full refund
**Actual:** You can return items within 60 days...
**Faithfulness:** 0.45 (FAIL): Claim not supported by context

HTML reports

Generate a standalone HTML report with styling built in.

result.exportHtml(Path.of("reports/evaluation-report.html"));

HTML reports include:

  • Summary cards with pass rate and counts
  • A sortable evaluator statistics table
  • A results table with expandable rows for detail
  • Pass and fail color coding
  • Dark mode support

Here is what the layout looks like:

[Image: HTML Report Example]

CSV export

CSV is handy for spreadsheet analysis.

result.exportCsv(Path.of("results/data.csv"));

The columns are dynamic, based on the evaluators you used.

input,expected_output,actual_output,success,faithfulness_score,faithfulness_pass
"What is..?","30 days","You can...",true,0.92,true

Exporting in CI/CD

Export every format and print the markdown summary to the console.

ExperimentResult result = experiment.run();

// Export all formats
Path outputDir = Path.of("target/evaluation-results");
result.exportJson(outputDir.resolve("results.json"));
result.exportHtml(outputDir.resolve("report.html"));
result.exportMarkdown(outputDir.resolve("summary.md"));
result.exportCsv(outputDir.resolve("data.csv"));

// Print the markdown summary to the console
System.out.println(result.toMarkdown());