JUnit Integration

Run your LLM evaluations as JUnit tests, so a bad output fails the build the same way a broken function does.

Dokimos plugs into JUnit parameterized tests. You load a dataset, run your LLM on each example, and assert that your evaluators pass. JUnit runs the test once per example and fails fast when an output misses your threshold.

Quick start

Three steps: add the dependency, point @DatasetSource at a dataset, call Assertions.assertEval.

Add the dependency to your pom.xml:

<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-junit</artifactId>
    <version>${dokimos.version}</version>
    <scope>test</scope>
</dependency>

Works with JUnit 5.x and 6.x.

Write the test:

import dev.dokimos.junit.DatasetSource;
import dev.dokimos.core.*;
import dev.dokimos.core.evaluators.*;
import org.junit.jupiter.params.ParameterizedTest;

@ParameterizedTest
@DatasetSource("classpath:datasets/support-qa.json")
void shouldAnswerSupportQuestions(Example example) {
    // Run your LLM on the example input.
    String answer = supportBot.generate(example.input());

    // Build a test case from the example plus the answer.
    EvalTestCase testCase = example.toTestCase(answer);

    // Build the evaluator: an LLM judge with a pass threshold of 0.7.
    Evaluator helpfulness = LLMJudgeEvaluator.builder()
            .name("Helpfulness")
            .criteria("Is the response helpful and does it address the customer's issue?")
            .judge(judgeLM)
            .threshold(0.7)
            .build();

    // Assert the evaluator passes. The test fails if it misses its threshold.
    Assertions.assertEval(testCase, helpfulness);
}

JUnit runs this test once for each example in the dataset. If any evaluator misses its threshold, the test fails.

When to use JUnit tests

Use JUnit tests when you want quick checks that fail fast:

  • Fast feedback during development. A test fails the moment an output misses your criteria. You do not wait for a full evaluation run.
  • CI/CD quality gates. Fail the build when critical test cases break, just like a regular unit test.
  • Familiar tooling. Use the test runners, IDE integration, and reports you already have.

Reach for JUnit tests for:

  • Critical examples that should never break
  • Quick validation during development
  • CI/CD pipelines where you want to fail fast
  • Test-driven development of LLM features

Reach for experiments instead when you want:

  • Analysis across large datasets
  • Comparison of different models or configurations
  • Detailed reports with metrics
  • Exploratory evaluation of new features

See Experiments vs JUnit Testing for the full comparison.

Tip: Your task can return a typed record, not just a string. A JUnit test reads it back with actualOutputAs(...) or compares it with StructuralMatchEvaluator. See the Structured & Typed Data hub.

Load a dataset

@DatasetSource accepts a path or inline data. Pick the form that fits.

From the classpath (for example src/test/resources):

@DatasetSource("classpath:datasets/support-qa.json")
@DatasetSource("classpath:datasets/support-qa.jsonl")

From the file system:

@DatasetSource("file:testdata/support-qa.json")
@DatasetSource("file:testdata/support-qa.jsonl")

Inline JSON for quick tests:

@DatasetSource(json = """
    {
      "examples": [
        {"input": "Reset password", "expectedOutput": "Click Forgot Password"},
        {"input": "Track order", "expectedOutput": "Check Order History"}
      ]
    }
    """)

Inline JSONL for quick tests:

@DatasetSource(jsonl = """
{"input": "Reset password", "expectedOutput": "Click Forgot Password"}
{"input": "Track order", "expectedOutput": "Check Order History"}
""")

Assert with assertEval

Assertions.assertEval() runs your evaluators and fails the test if any miss their threshold:

Assertions.assertEval(testCase, evaluators);

When a test fails, you get a clear message:

Evaluation 'Answer Quality' failed: score=0.65 (threshold=0.80)
Reason: The answer is incomplete and doesn't mention the 30-day policy.
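
To make the pass/fail rule concrete, here is a minimal plain-Java sketch of a threshold check that produces a message in the same shape as the one above. It is not Dokimos's implementation: `ThresholdCheckSketch`, `Scored`, and `assertMeetsThreshold` are hypothetical names, and the rule that a score passes when it is greater than or equal to its threshold is an assumption.

```java
import java.util.List;
import java.util.Locale;

final class ThresholdCheckSketch {

    // Hypothetical stand-in for an evaluation result; not a Dokimos type.
    record Scored(String name, double score, double threshold, String reason) {
        // Assumption: a score passes when it is at least the threshold.
        boolean passed() {
            return score >= threshold;
        }
    }

    // Throws AssertionError for the first result that misses its threshold.
    static void assertMeetsThreshold(List<Scored> results) {
        for (Scored r : results) {
            if (!r.passed()) {
                throw new AssertionError(String.format(Locale.ROOT,
                        "Evaluation '%s' failed: score=%.2f (threshold=%.2f)%nReason: %s",
                        r.name(), r.score(), r.threshold(), r.reason()));
            }
        }
    }

    public static void main(String[] args) {
        List<Scored> results = List.of(
                new Scored("No Placeholders", 1.0, 1.0, "No placeholder text found."),
                new Scored("Answer Quality", 0.65, 0.80, "The answer is incomplete."));
        try {
            assertMeetsThreshold(results);
        } catch (AssertionError e) {
            // JUnit would report this error as a failed test.
            System.out.println(e.getMessage());
        }
    }
}
```

Because the check throws on the first miss, evaluators listed later are not reported in this sketch. Whether Dokimos stops at the first failure or collects all of them is not stated here.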

Full example

This test class sets up two evaluators once, then checks every example in the dataset.

import dev.dokimos.junit.DatasetSource;
import dev.dokimos.core.*;
import dev.dokimos.core.evaluators.*;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.params.ParameterizedTest;
import java.util.List;

class CustomerSupportTest {

    private static List<Evaluator> evaluators;
    private static CustomerSupportBot supportBot;

    @BeforeAll
    static void setup() {
        // apiKey and judgeModel come from your own configuration.
        supportBot = new CustomerSupportBot(apiKey);
        JudgeLM judge = prompt -> judgeModel.generate(prompt);

        evaluators = List.of(
                LLMJudgeEvaluator.builder()
                        .name("Answer Quality")
                        .criteria("Is the answer helpful and does it address the user's question?")
                        .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
                        .threshold(0.80)
                        .judge(judge)
                        .build(),
                RegexEvaluator.builder()
                        .name("No Placeholders")
                        .pattern(".*\\[.*\\].*") // Catch [PLACEHOLDER] text.
                        .threshold(0.0) // Should NOT match.
                        .build()
        );
    }

    @ParameterizedTest(name = "[{index}] {0}")
    @DatasetSource("classpath:datasets/support-qa-v3.json")
    void shouldAnswerSupportQuestions(Example example) {
        String response = supportBot.generate(example.input());
        EvalTestCase testCase = example.toTestCase(response);
        Assertions.assertEval(testCase, evaluators);
    }
}

Test RAG systems

For RAG, put the retrieved context in the test case so a faithfulness check can use it. Pass a map and store the context under a key like retrievedContext, then point FaithfulnessEvaluator at that key.

@ParameterizedTest
@DatasetSource("classpath:datasets/product-docs-qa.json")
void shouldAnswerFromDocumentation(Example example) {
    // Retrieve the five most relevant documents.
    List<String> docs = vectorStore.search(example.input(), 5);

    // Generate the answer with RAG.
    String answer = ragSystem.generate(example.input(), docs);

    // Put the answer and the context in the test case.
    EvalTestCase testCase = example.toTestCase(Map.of(
            "output", answer,
            "retrievedContext", docs
    ));

    // Check both quality and faithfulness.
    Assertions.assertEval(testCase, List.of(
            LLMJudgeEvaluator.builder()
                    .name("Answer Quality")
                    .criteria("Is the answer helpful?")
                    .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
                    .threshold(0.8)
                    .judge(judge)
                    .build(),
            FaithfulnessEvaluator.builder()
                    .threshold(0.85)
                    .judge(judge)
                    .contextKey("retrievedContext")
                    .build()
    ));
}

Name your tests

Set the name on @ParameterizedTest to control how each case shows up in output:

@ParameterizedTest(name = "{index}: {0}")
@DatasetSource("classpath:datasets/support-qa.json")
void shouldAnswerQuestions(Example example) {
    // Output: "1: How do I reset my password?"
}
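
The name pattern can be approximated in plain Java to see how the placeholders expand. This is a sketch that assumes the usual JUnit Jupiter behaviour, where {index} is the 1-based invocation number and {0} is the first argument formatted in MessageFormat style. `DisplayNameSketch` and `render` are hypothetical names, and a plain string stands in for the Example argument, since its string form is not shown here.

```java
import java.text.MessageFormat;

final class DisplayNameSketch {

    // Expands {index} first, then lets MessageFormat fill {0}, {1}, ...
    // with the string form of each argument.
    static String render(String pattern, int index, Object... arguments) {
        String withIndex = pattern.replace("{index}", Integer.toString(index));
        Object[] rendered = new Object[arguments.length];
        for (int i = 0; i < arguments.length; i++) {
            rendered[i] = String.valueOf(arguments[i]);
        }
        return MessageFormat.format(withIndex, rendered);
    }

    public static void main(String[] args) {
        // prints "1: How do I reset my password?"
        System.out.println(render("{index}: {0}", 1, "How do I reset my password?"));
        // prints "[2] Where is my order?"
        System.out.println(render("[{index}] {0}", 2, "Where is my order?"));
    }
}
```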

Report real outputs to a server

Declare a static @DatasetReporter field, and @DatasetSource opens a run and reports each invocation as an item result. By default, that item result is empty: it carries no outputs and no eval results.

To carry the real outputs and eval results, add a DatasetItemRecorder parameter to your test method and fill it in. The extension supplies a fresh recorder per invocation, so you never reset it between examples.

import dev.dokimos.core.EvalResult;
import dev.dokimos.core.Reporter;
import dev.dokimos.junit.DatasetRunExtension.DatasetItemRecorder;
import dev.dokimos.junit.DatasetReporter;
import dev.dokimos.junit.DatasetSource;
import org.junit.jupiter.params.ParameterizedTest;

class SupportEvaluationTest {

    @DatasetReporter
    static final Reporter reporter = new DokimosServerReporter(serverConfig);

    @ParameterizedTest
    @DatasetSource("classpath:datasets/support-qa.json")
    void shouldAnswerSupportQuestions(Example example, DatasetItemRecorder recorder) {
        String answer = supportBot.generate(example.input());
        EvalTestCase testCase = example.toTestCase(answer);

        recorder.actualOutput("output", answer);
        for (Evaluator evaluator : evaluators) {
            EvalResult result = evaluator.evaluate(testCase);
            recorder.evalResult(result);
        }

        Assertions.assertEval(testCase, evaluators);
    }
}

The recorder methods are chainable:

  • actualOutput(String key, Object value)
  • actualOutputs(Map<String, Object> outputs)
  • evalResult(EvalResult result)
  • evalResults(List<EvalResult> results)

Add run metadata

When a @DatasetReporter field is present, @DatasetSource forwards metadata to the reporter. Use entries for type-safe key-value pairs:

@ParameterizedTest
@DatasetSource(
    value = "classpath:datasets/support-qa.json",
    entries = {
        @MetadataEntry(key = "model", value = "gpt-4"),
        @MetadataEntry(key = "temperature", value = "0")
    })
void shouldAnswerSupportQuestions(Example example) {
    // ...
}

The alternating-string form metadata = {"model", "gpt-4", "temperature", "0"} also works. When you set both, entries wins.
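
To see why the entries form is the safer of the two, here is a small plain-Java sketch of how a flat alternating array reads as key-value pairs. It is illustrative only: `MetadataPairsSketch` and `fromAlternating` are hypothetical names, and rejecting an odd-length array is an assumption, not documented Dokimos behaviour.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MetadataPairsSketch {

    // Reads {"k1", "v1", "k2", "v2"} as key-value pairs, preserving order.
    static Map<String, String> fromAlternating(String... flat) {
        if (flat.length % 2 != 0) {
            // Assumption: a dangling key is treated as a mistake.
            throw new IllegalArgumentException(
                    "Expected an even number of strings, got " + flat.length);
        }
        Map<String, String> pairs = new LinkedHashMap<>();
        for (int i = 0; i < flat.length; i += 2) {
            pairs.put(flat[i], flat[i + 1]);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // prints "{model=gpt-4, temperature=0}"
        System.out.println(fromAlternating("model", "gpt-4", "temperature", "0"));
    }
}
```

With the flat form, dropping one string shifts every later key into a value position, and the compiler cannot catch it. Each @MetadataEntry names its key and value explicitly, so that mistake cannot happen.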

Run in CI/CD

Maven

Run the tests in your pipeline:

mvn test

GitHub Actions

name: LLM Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Set up JDK 21
        uses: actions/setup-java@v3
        with:
          java-version: '21'
          distribution: 'temurin'

      - name: Run LLM Tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: mvn test

      - name: Publish Test Report
        if: always()
        uses: dorny/test-reporter@v1
        with:
          name: JUnit Tests
          path: target/surefire-reports/*.xml
          reporter: java-junit

Test reports

When you run the tests with Maven, Surefire writes standard JUnit XML reports that CI tools read:

target/surefire-reports/
├── TEST-CustomerSupportTest.xml
└── CustomerSupportTest.txt

Run tests in parallel

JUnit 5 and 6 support parallel test execution, but it is off by default. Turn it on to speed up suites with many examples.

Turn it on

Create src/test/resources/junit-platform.properties:

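junit.jupiter.execution.parallel.enabled=true
junit.jupiter.execution.parallel.mode.default=concurrent
# The fixed.parallelism setting only applies when the strategy is fixed (the default is dynamic).
junit.jupiter.execution.parallel.config.strategy=fixed
junit.jupiter.execution.parallel.config.fixed.parallelism=4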

It works with @DatasetSource

Parameterized tests that use @DatasetSource get parallel execution automatically:

@ParameterizedTest
@DatasetSource("classpath:datasets/qa-dataset.json")
void shouldAnswerCorrectly(Example example) {
    String answer = assistant.answer(example.input());
    EvalTestCase testCase = example.toTestCase(answer);
    Assertions.assertEval(testCase, evaluators);
}

With parallelism on, JUnit runs multiple examples at the same time.

Watch for rate limits

LLM APIs have rate limits. If you hit them:

  • Lower parallelism in the properties file.
  • Or use the programmatic Experiment API with explicit .parallelism() control.
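
A further option, separate from the two above, is to cap concurrent LLM calls inside your own client wrapper while leaving test parallelism alone. The sketch below uses a standard java.util.concurrent.Semaphore. `ThrottledClientSketch` and its methods are hypothetical names, and the Function stands in for whatever client your tests call. This limits calls in flight, not requests per minute, so treat it as a starting point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class ThrottledClientSketch {

    private final Semaphore permits;
    private final Function<String, String> delegate;

    ThrottledClientSketch(int maxConcurrent, Function<String, String> delegate) {
        this.permits = new Semaphore(maxConcurrent);
        this.delegate = delegate;
    }

    // Blocks until a permit is free, so at most maxConcurrent calls run at once.
    String generate(String prompt) throws InterruptedException {
        permits.acquire();
        try {
            return delegate.apply(prompt);
        } finally {
            permits.release();
        }
    }

    // Demo helper: runs `tasks` calls on `tasks` threads and reports the
    // highest number of calls that were in flight at the same moment.
    static int maxObservedConcurrency(int tasks, int limit) throws Exception {
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger maxSeen = new AtomicInteger();
        ThrottledClientSketch client = new ThrottledClientSketch(limit, prompt -> {
            maxSeen.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
            try {
                Thread.sleep(20); // Simulate network latency.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                inFlight.decrementAndGet();
            }
            return "answer to " + prompt;
        });
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                String prompt = "q" + i;
                futures.add(pool.submit(() -> client.generate(prompt)));
            }
            for (Future<String> future : futures) {
                future.get();
            }
        } finally {
            pool.shutdown();
        }
        return maxSeen.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("max in flight: " + maxObservedConcurrency(8, 2));
    }
}
```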

Keep it thread-safe

Make your task implementation and any shared state thread-safe before you run tests in parallel.
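
As an illustration of what thread-safe shared state can look like, the sketch below records results from several threads using ConcurrentHashMap and AtomicInteger from the standard library. `SharedStateSketch` and its methods are hypothetical names, not part of Dokimos. A plain HashMap or int counter in their place could lose updates under parallel execution.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedStateSketch {

    // Both fields tolerate concurrent writes from parallel test invocations.
    private final Map<String, Double> scoresByExample = new ConcurrentHashMap<>();
    private final AtomicInteger llmCalls = new AtomicInteger();

    void record(String exampleId, double score) {
        llmCalls.incrementAndGet();
        scoresByExample.put(exampleId, score);
    }

    int llmCalls() {
        return llmCalls.get();
    }

    int recordedExamples() {
        return scoresByExample.size();
    }

    // Demo helper: each worker records `perThread` distinct examples.
    static SharedStateSketch recordConcurrently(int threads, int perThread)
            throws InterruptedException {
        SharedStateSketch state = new SharedStateSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            int worker = t;
            pool.execute(() -> {
                for (int i = 0; i < perThread; i++) {
                    state.record("example-" + worker + "-" + i, 1.0);
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
            throw new IllegalStateException("Workers did not finish in time");
        }
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        SharedStateSketch state = recordConcurrently(8, 1_000);
        // prints "8000 calls, 8000 examples"
        System.out.println(state.llmCalls() + " calls, " + state.recordedExamples() + " examples");
    }
}
```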

Best practices

  • Keep datasets in version control. Store them next to your code so tests stay reproducible.
  • Start with critical examples. Do not test everything. Focus on the cases that must never break.
  • Use clear test names. Make it obvious what each test checks.
  • Split CI from full evaluation. Use a small dataset for CI (10 to 20 examples) and run full evaluations separately.
  • Test at multiple levels. Combine unit tests (JUnit) with full evaluations (Experiments) for the best coverage.