JUnit Integration

Run your LLM evaluations as JUnit tests, so a bad output fails the build the same way a broken function does.

Dokimos plugs into JUnit parameterized tests. You load a dataset, run your LLM on each example, and assert that your evaluators pass. JUnit runs the test once per example and fails fast when an output misses your threshold.

Quick start

Three steps: add the dependency, point @DatasetSource at a dataset, call Assertions.assertEval.

Add the dependency to your pom.xml:

<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-junit</artifactId>
    <version>${dokimos.version}</version>
    <scope>test</scope>
</dependency>

Works with JUnit 5.x and 6.x.

Write the test:

Java
Kotlin

import dev.dokimos.junit.DatasetSource;
import dev.dokimos.core.*;
import dev.dokimos.core.evaluators.*;
import org.junit.jupiter.params.ParameterizedTest;

@ParameterizedTest
@DatasetSource("classpath:datasets/support-qa.json")
void shouldAnswerSupportQuestions(Example example) {
    // Run your LLM on the example input.
    String answer = supportBot.generate(example.input());

    // Build a test case from the example plus the answer.
    EvalTestCase testCase = example.toTestCase(answer);

    // Assert the evaluator passes. The test fails if it misses its threshold.
    Evaluator correctness = LLMJudgeEvaluator.builder()
        .name("Helpfulness")
        .criteria("Is the response helpful and does it address the customer's issue?")
        .judge(judgeLM)
        .threshold(0.7)
        .build();

    Assertions.assertEval(testCase, correctness);
}

import dev.dokimos.core.Assertions
import dev.dokimos.core.Example
import dev.dokimos.core.EvalTestCaseParam
import dev.dokimos.junit.DatasetSource
import dev.dokimos.kotlin.dsl.llmJudge
import org.junit.jupiter.params.ParameterizedTest

class SupportTests {
    private val correctness = llmJudge(judgeLM) {
        name = "Helpfulness"
        criteria = "Is the response helpful and does it address the customer's issue?"
        params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
        threshold = 0.7
    }

    @ParameterizedTest
    @DatasetSource("classpath:datasets/support-qa.json")
    fun shouldAnswerSupportQuestions(example: Example) {
        val answer = supportBot.generate(example.input())
        val testCase = example.toTestCase(answer)
        Assertions.assertEval(testCase, listOf(correctness))
    }
}

JUnit runs this test once for each example in the dataset. If any evaluator misses its threshold, the test fails.

When to use JUnit tests

Use JUnit tests when you want fast, fail-fast checks:

Fast feedback during development. A test fails the moment an output misses your criteria. You do not wait for a full evaluation run.
CI/CD quality gates. Fail the build when critical test cases break, just like a regular unit test.
Familiar tooling. Use the test runners, IDE integration, and reports you already have.

Reach for JUnit tests for:

Critical examples that should never break
Quick validation during development
CI/CD pipelines where you want to fail fast
Test-driven development of LLM features

Reach for experiments instead when you want:

Analysis across large datasets
Comparison of different models or configurations
Detailed reports with metrics
Exploratory evaluation of new features

See Experiments vs JUnit Testing for the full comparison.

Tip: Your task can return a typed record, not just a string. A JUnit test reads it back with actualOutputAs(...) or compares it with StructuralMatchEvaluator. See the Structured & Typed Data hub.

Load a dataset

@DatasetSource accepts a path or inline data. Pick the form that fits.

From the classpath (for example src/test/resources):

@DatasetSource("classpath:datasets/support-qa.json")
@DatasetSource("classpath:datasets/support-qa.jsonl")

From the file system:

@DatasetSource("file:testdata/support-qa.json")
@DatasetSource("file:testdata/support-qa.jsonl")

Inline JSON for quick tests:

@DatasetSource(json = """
    {
      "examples": [
        {"input": "Reset password", "expectedOutput": "Click Forgot Password"},
        {"input": "Track order", "expectedOutput": "Check Order History"}
      ]
    }
    """)

Inline JSONL for quick tests:

@DatasetSource(jsonl = """
    {"input": "Reset password", "expectedOutput": "Click Forgot Password"}
    {"input": "Track order", "expectedOutput": "Check Order History"}
    """)

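The difference between the two inline forms is only the container: JSON wraps the examples in an "examples" array, while JSONL puts one example object on each line. The sketch below walks a JSONL block line by line to make that visible. JsonlSketch and its regex lookup are hypothetical illustrations, not how Dokimos parses datasets, and the lookup deliberately ignores escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsonlSketch {

    // Naive field lookup for illustration only; it does not handle escaped quotes.
    private static final Pattern INPUT = Pattern.compile("\"input\"\\s*:\\s*\"([^\"]*)\"");

    // Returns the "input" value of every non-blank line, one per example.
    static List<String> inputs(String jsonl) {
        List<String> result = new ArrayList<>();
        for (String line : jsonl.lines().toList()) {
            if (line.isBlank()) {
                continue; // Blank lines carry no example.
            }
            Matcher m = INPUT.matcher(line);
            if (m.find()) {
                result.add(m.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String jsonl = """
            {"input": "Reset password", "expectedOutput": "Click Forgot Password"}
            {"input": "Track order", "expectedOutput": "Check Order History"}
            """;
        System.out.println(inputs(jsonl)); // prints [Reset password, Track order]
    }
}
```
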
Assert with assertEval

Assertions.assertEval() runs your evaluators and fails the test if any miss their threshold:

Assertions.assertEval(testCase, evaluators);

When a test fails, you get a clear message:

Evaluation 'Answer Quality' failed: score=0.65 (threshold=0.80)
Reason: The answer is incomplete and doesn't mention the 30-day policy.
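
The message above reflects a simple rule: each evaluator produces a score, and the assertion fails when a score misses that evaluator's threshold. The sketch below imitates the rule in plain Java. ThresholdSketch, Result, and check are hypothetical stand-ins, not Dokimos classes, and treating a score equal to the threshold as a pass is an assumption.

```java
import java.util.List;
import java.util.Locale;

class ThresholdSketch {

    // Hypothetical stand-in for one evaluator's outcome; not a Dokimos type.
    record Result(String name, double score, double threshold, String reason) {
        boolean passed() {
            // Assumption: a score equal to the threshold counts as a pass.
            return score >= threshold;
        }
    }

    // Throws on the first result that misses its threshold, like a fail-fast assertion.
    static void check(List<Result> results) {
        for (Result r : results) {
            if (!r.passed()) {
                throw new AssertionError(String.format(Locale.ROOT,
                    "Evaluation '%s' failed: score=%.2f (threshold=%.2f)%nReason: %s",
                    r.name(), r.score(), r.threshold(), r.reason()));
            }
        }
    }

    public static void main(String[] args) {
        List<Result> results = List.of(
            new Result("No Placeholders", 1.0, 1.0, "No placeholder text found."),
            new Result("Answer Quality", 0.65, 0.80, "The answer is incomplete."));
        try {
            check(results);
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Throwing AssertionError is what lets JUnit mark the invocation as failed rather than errored.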

Full example

This test class sets up two evaluators once, then checks every example in the dataset.

Java
Kotlin

import dev.dokimos.junit.DatasetSource;
import dev.dokimos.core.*;
import dev.dokimos.core.evaluators.*;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.params.ParameterizedTest;
import java.util.List;

class CustomerSupportTest {

    private static List<Evaluator> evaluators;
    private static CustomerSupportBot supportBot;

    @BeforeAll
    static void setup() {
        supportBot = new CustomerSupportBot(apiKey);
        JudgeLM judge = prompt -> judgeModel.generate(prompt);

        evaluators = List.of(
            LLMJudgeEvaluator.builder()
                .name("Answer Quality")
                .criteria("Is the answer helpful and does it address the user's question?")
                .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
                .threshold(0.80)
                .judge(judge)
                .build(),
            RegexEvaluator.builder()
                .name("No Placeholders")
                .pattern(".*\\[.*\\].*")  // Catch [PLACEHOLDER] text.
                .threshold(0.0)  // Should NOT match.
                .build()
        );
    }

    @ParameterizedTest(name = "[{index}] {0}")
    @DatasetSource("classpath:datasets/support-qa-v3.json")
    void shouldAnswerSupportQuestions(Example example) {
        String response = supportBot.generate(example.input());
        EvalTestCase testCase = example.toTestCase(response);
        Assertions.assertEval(testCase, evaluators);
    }
}

import dev.dokimos.core.Assertions
import dev.dokimos.core.Example
import dev.dokimos.core.Evaluator
import dev.dokimos.core.JudgeLM
import dev.dokimos.core.evaluators.RegexEvaluator
import dev.dokimos.core.evaluators.LLMJudgeEvaluator
import dev.dokimos.junit.DatasetSource
import dev.dokimos.kotlin.dsl.*  // llmJudge lives here; evaluators and regex are assumed to as well.
import org.junit.jupiter.api.BeforeAll
import org.junit.jupiter.params.ParameterizedTest

class CustomerSupportTest {

    companion object {
        private lateinit var evaluators: List<Evaluator>
        private lateinit var supportBot: CustomerSupportBot

        @JvmStatic
        @BeforeAll
        fun setup() {
            supportBot = CustomerSupportBot(apiKey)
            val judge = JudgeLM { prompt -> judgeModel.generate(prompt) }

            evaluators =
                evaluators {
                    llmJudge(judge) {
                        name = "Answer Quality"
                        criteria = "Is the answer helpful and does it address the user's question?"
                        threshold = 0.80
                    }
                    regex {
                        name = "No Placeholders"
                        pattern = """.*\[.*\].*"""  // Catch [PLACEHOLDER] text.
                        threshold = 0.0                // Should NOT match.
                    }
                }
        }
    }

    @ParameterizedTest(name = "[{index}] {0}")
    @DatasetSource("classpath:datasets/support-qa-v3.json")
    fun shouldAnswerSupportQuestions(example: Example) {
        val response = supportBot.generate(example.input())
        val testCase = example.toTestCase(response)
        Assertions.assertEval(testCase, evaluators)
    }
}
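
You can check the No Placeholders pattern on its own with java.util.regex before wiring it into an evaluator. The sketch below runs the same pattern in two modes. This page does not say whether RegexEvaluator matches the whole string or searches within it, so both are shown; PlaceholderPatternSketch is a hypothetical helper, not part of Dokimos.

```java
import java.util.regex.Pattern;

class PlaceholderPatternSketch {

    // Same pattern as the "No Placeholders" evaluator above.
    static final Pattern PLACEHOLDER = Pattern.compile(".*\\[.*\\].*");

    // Whole-string match: '.' does not cross line breaks by default.
    static boolean matchesWhole(String text) {
        return PLACEHOLDER.matcher(text).matches();
    }

    // Substring search: finds a placeholder on any single line.
    static boolean findsAnywhere(String text) {
        return PLACEHOLDER.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("Hello [NAME], your order shipped.")); // prints true
        System.out.println(matchesWhole("Your order shipped."));               // prints false
        System.out.println(matchesWhole("Hi,\nyour code is [CODE]."));         // prints false
        System.out.println(findsAnywhere("Hi,\nyour code is [CODE]."));        // prints true
    }
}
```

The third case is the one to watch: a multi-line response hides its placeholder from a whole-string match unless the pattern is compiled with Pattern.DOTALL.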

Test RAG systems

For RAG, put the retrieved context in the test case so a faithfulness check can use it. Pass a map and store the context under a key like retrievedContext, then point FaithfulnessEvaluator at that key.

Java
Kotlin

@ParameterizedTest
@DatasetSource("classpath:datasets/product-docs-qa.json")
void shouldAnswerFromDocumentation(Example example) {
    // Retrieve relevant documents.
    List<String> docs = vectorStore.search(example.input(), 5);  // Top 5 results.

    // Generate the answer with RAG.
    String answer = ragSystem.generate(example.input(), docs);

    // Put the answer and the context in the test case.
    EvalTestCase testCase = example.toTestCase(Map.of(
        "output", answer,
        "retrievedContext", docs
    ));

    // Check both quality and faithfulness.
    Assertions.assertEval(testCase, List.of(
        LLMJudgeEvaluator.builder()
            .name("Answer Quality")
            .criteria("Is the answer helpful?")
            .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
            .threshold(0.8)
            .judge(judge)
            .build(),
        FaithfulnessEvaluator.builder()
            .threshold(0.85)
            .judge(judge)
            .contextKey("retrievedContext")
            .build()
    ));
}

@ParameterizedTest
@DatasetSource("classpath:datasets/product-docs-qa.json")
fun shouldAnswerFromDocumentation(example: Example) {
    // Retrieve relevant documents.
    val docs = vectorStore.search(example.input(), topK = 5)

    // Generate the answer with RAG.
    val answer = ragSystem.generate(example.input(), docs)

    // Put the answer and the context in the test case.
    val testCase = example.toTestCase(
        mapOf(
            "output" to answer,
            "retrievedContext" to docs
        )
    )

    // Check both quality and faithfulness.
    val answerQuality = llmJudge(judge) {
        name = "Answer Quality"
        criteria = "Is the answer helpful?"
        threshold = 0.8
    }

    val faithfulness = faithfulness(judge) {
        threshold = 0.85
        contextKey = "retrievedContext"
    }

    Assertions.assertEval(testCase, listOf(answerQuality, faithfulness))
}

Name your tests

Set the name on @ParameterizedTest to control how each case shows up in output:

Java
Kotlin

@ParameterizedTest(name = "{index}: {0}")
@DatasetSource("classpath:datasets/support-qa.json")
void shouldAnswerQuestions(Example example) {
    // Output: "1: How do I reset my password?"
}

@ParameterizedTest(name = "{index}: {0}")
@DatasetSource("classpath:datasets/support-qa.json")
fun shouldAnswerQuestions(example: Example) {
    // Output: "1: How do I reset my password?"
}
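
JUnit treats the name attribute as a java.text.MessageFormat pattern, so {0} is filled with the first argument's text and a literal single quote must be doubled (''). The sketch below imitates that expansion for one invocation. DisplayNameSketch is a hypothetical helper, and passing the input string directly stands in for whatever text the Example argument renders as.

```java
import java.text.MessageFormat;

class DisplayNameSketch {

    // Expands a name template for one invocation: {index} first, then {0}, {1}, ...
    static String expand(String template, int index, Object... arguments) {
        String withIndex = template.replace("{index}", Integer.toString(index));
        return MessageFormat.format(withIndex, arguments);
    }

    public static void main(String[] args) {
        String name = expand("{index}: {0}", 1, "How do I reset my password?");
        System.out.println(name); // prints 1: How do I reset my password?
    }
}
```
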

Report real outputs to a server

Declare a static @DatasetReporter field, and @DatasetSource opens a run and reports each invocation as an item result. By default, that item result carries no outputs or eval results.

To carry the real outputs and eval results, add a DatasetItemRecorder parameter to your test method and fill it in. The extension supplies a fresh recorder per invocation, so you never reset it between examples.

import dev.dokimos.core.*;
import dev.dokimos.junit.DatasetRunExtension.DatasetItemRecorder;
import dev.dokimos.junit.DatasetReporter;
import dev.dokimos.junit.DatasetSource;
import org.junit.jupiter.params.ParameterizedTest;

class SupportEvaluationTest {

    @DatasetReporter
    static final Reporter reporter = new DokimosServerReporter(serverConfig);

    @ParameterizedTest
    @DatasetSource("classpath:datasets/support-qa.json")
    void shouldAnswerSupportQuestions(Example example, DatasetItemRecorder recorder) {
        String answer = supportBot.generate(example.input());
        EvalTestCase testCase = example.toTestCase(answer);

        recorder.actualOutput("output", answer);
        for (Evaluator evaluator : evaluators) {
            EvalResult result = evaluator.evaluate(testCase);
            recorder.evalResult(result);
        }

        Assertions.assertEval(testCase, evaluators);
    }
}

The recorder methods are chainable:

actualOutput(String key, Object value)
actualOutputs(Map<String, Object> outputs)
evalResult(EvalResult result)
evalResults(List<EvalResult> results)
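
Chainable here means each method returns the recorder itself, so several calls fit in one expression. The sketch below shows that shape with a stand-in class. RecorderSketch is hypothetical and is not the Dokimos recorder; a plain String stands in for EvalResult, and the latencyMs key is made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in that mirrors the chainable shape; not the Dokimos recorder.
class RecorderSketch {

    private final Map<String, Object> outputs = new LinkedHashMap<>();
    private final List<String> results = new ArrayList<>();

    RecorderSketch actualOutput(String key, Object value) {
        outputs.put(key, value);
        return this; // Returning this is what makes the call chainable.
    }

    RecorderSketch actualOutputs(Map<String, Object> more) {
        outputs.putAll(more);
        return this;
    }

    // A plain String stands in for EvalResult to keep the sketch self-contained.
    RecorderSketch evalResult(String result) {
        results.add(result);
        return this;
    }

    Map<String, Object> outputs() {
        return outputs;
    }

    List<String> results() {
        return results;
    }

    public static void main(String[] args) {
        RecorderSketch recorder = new RecorderSketch()
            .actualOutput("output", "Click Forgot Password")
            .actualOutput("latencyMs", 420)
            .evalResult("Answer Quality: 0.91");
        System.out.println(recorder.outputs().keySet()); // prints [output, latencyMs]
        System.out.println(recorder.results().size());   // prints 1
    }
}
```
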

Add run metadata

When a @DatasetReporter field is present, @DatasetSource forwards metadata to the reporter. Use entries for type-safe key-value pairs:

@ParameterizedTest
@DatasetSource(
    value = "classpath:datasets/support-qa.json",
    entries = {
        @MetadataEntry(key = "model", value = "gpt-4"),
        @MetadataEntry(key = "temperature", value = "0")
    })
void shouldAnswerSupportQuestions(Example example) {
    // ...
}

The alternating-string form metadata = {"model", "gpt-4", "temperature", "0"} also works. When you set both, entries wins.
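
The alternating form is easy to get wrong because nothing ties a key to its value except position. The sketch below folds such an array into a map to show the pairing. MetadataPairsSketch is a hypothetical helper, not Dokimos code, and rejecting an odd number of strings is an assumption about sensible behavior rather than a documented rule.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MetadataPairsSketch {

    // Pairs up key, value, key, value, ... in declaration order.
    static Map<String, String> fromAlternating(String... pairs) {
        if (pairs.length % 2 != 0) {
            // Assumption: a dangling key is treated as a mistake.
            throw new IllegalArgumentException("Expected key/value pairs, got " + pairs.length + " strings");
        }
        Map<String, String> metadata = new LinkedHashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            metadata.put(pairs[i], pairs[i + 1]);
        }
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(fromAlternating("model", "gpt-4", "temperature", "0"));
        // prints {model=gpt-4, temperature=0}
    }
}
```

With @MetadataEntry, the compiler enforces that every key has a value, which is the sense in which entries is the type-safe form.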

Run in CI/CD

Maven

Run the tests in your pipeline:

mvn test

GitHub Actions

name: LLM Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up JDK 21
        uses: actions/setup-java@v4
        with:
          java-version: '21'
          distribution: 'temurin'

      - name: Run LLM Tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: mvn test

      - name: Publish Test Report
        if: always()
        uses: dorny/test-reporter@v1
        with:
          name: JUnit Tests
          path: target/surefire-reports/*.xml
          reporter: java-junit

Test reports

When you run mvn test, Maven Surefire writes standard JUnit XML reports that CI tools read:

target/surefire-reports/
  ├── TEST-CustomerSupportTest.xml
  └── CustomerSupportTest.txt
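
If you want a custom gate on top of the build status, you can read the XML yourself with the JDK's built-in parser. The sketch below pulls the failure count from a report's root element. SurefireReportSketch is a hypothetical helper, and the sample XML is hand-written in the usual JUnit XML shape, not real output from a Dokimos run.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

class SurefireReportSketch {

    // Reads the failures attribute from the root <testsuite> element.
    static int failures(String xml) {
        try {
            Element suite = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            return Integer.parseInt(suite.getAttribute("failures"));
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a readable report", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample; real reports contain more attributes and elements.
        String xml = """
            <testsuite name="CustomerSupportTest" tests="2" failures="1" errors="0" skipped="0">
              <testcase name="[1] Reset password" classname="CustomerSupportTest" time="1.2"/>
              <testcase name="[2] Track order" classname="CustomerSupportTest" time="0.9">
                <failure type="java.lang.AssertionError">score=0.65 (threshold=0.80)</failure>
              </testcase>
            </testsuite>
            """;
        System.out.println(failures(xml)); // prints 1
    }
}
```
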

Run tests in parallel

JUnit 5 and 6 support parallel test execution out of the box, but it is off by default. Turn it on to speed up suites with many examples.

Turn it on

Create src/test/resources/junit-platform.properties:

junit.jupiter.execution.parallel.enabled=true
junit.jupiter.execution.parallel.mode.default=concurrent
junit.jupiter.execution.parallel.config.strategy=fixed
junit.jupiter.execution.parallel.config.fixed.parallelism=4

It works with @DatasetSource

Once parallel execution is enabled, parameterized tests that use @DatasetSource run their examples concurrently with no extra code:

Java
Kotlin

@ParameterizedTest
@DatasetSource("classpath:datasets/qa-dataset.json")
void shouldAnswerCorrectly(Example example) {
    String answer = assistant.answer(example.input());
    EvalTestCase testCase = example.toTestCase(answer);
    Assertions.assertEval(testCase, evaluators);
}

@ParameterizedTest
@DatasetSource("classpath:datasets/qa-dataset.json")
fun shouldAnswerCorrectly(example: Example) {
    val answer = assistant.answer(example.input())
    val testCase = example.toTestCase(answer)
    Assertions.assertEval(testCase, evaluators)
}

With parallelism on, JUnit runs multiple examples at the same time.

Watch for rate limits

LLM APIs have rate limits. If you hit them:

Lower parallelism in the properties file.
Or use the programmatic Experiment API with explicit .parallelism() control.
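
Both options above cap the whole run. A different technique, which is plain Java and not a Dokimos feature, is to cap only the LLM calls with a java.util.concurrent.Semaphore inside your own client wrapper, so the rest of the test keeps its parallelism. The sketch below is one way to do that; RateLimitSketch and its method names are hypothetical, and a short sleep stands in for the network call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class RateLimitSketch {

    private final Semaphore permits;

    RateLimitSketch(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs the request once a permit is free and always returns the permit.
    <T> T call(Callable<T> request) throws Exception {
        permits.acquire();
        try {
            return request.call();
        } finally {
            permits.release();
        }
    }

    // Fires `threads` simulated calls at once and reports the peak number in flight.
    static int maxObserved(int maxConcurrent, int threads) {
        RateLimitSketch limiter = new RateLimitSketch(maxConcurrent);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            Callable<String> task = () -> limiter.call(() -> {
                peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
                Thread.sleep(20); // Stand-in for the LLM request.
                inFlight.decrementAndGet();
                return "ok";
            });
            pool.submit(task);
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("Timed out waiting for calls");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // Eight threads, but never more than two calls in flight.
        System.out.println(maxObserved(2, 8) <= 2); // prints true
    }
}
```

A semaphore limits concurrency, not requests per minute, so it helps with burst limits but does not replace provider-side quota handling.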

Keep it thread-safe

Make your task implementation and any shared state thread-safe before you run tests in parallel.
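
Shared counters and caches are the usual culprits. The sketch below shows the safe version of a call counter that several test threads update at once, using AtomicLong from the JDK. SharedStateSketch is a hypothetical example, not Dokimos code; a plain long field incremented with ++ in its place could lose updates under the same load.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class SharedStateSketch {

    // Counts increments made from several threads without losing any.
    static long countCalls(int threads, int callsPerThread) {
        AtomicLong calls = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < callsPerThread; i++) {
                    calls.incrementAndGet(); // Atomic read-modify-write.
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("Timed out waiting for workers");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println(countCalls(8, 1000)); // prints 8000
    }
}
```
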

Best practices

Keep datasets in version control. Store them next to your code so tests stay reproducible.
Start with critical examples. Do not test everything. Focus on the cases that must never break.
Use clear test names. Make it obvious what each test checks.
Split CI from full evaluation. Use a small dataset for CI (10 to 20 examples) and run full evaluations separately.
Test at multiple levels. Combine unit tests (JUnit) with full evaluations (Experiments) for the best coverage.
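
One way to get the small CI dataset is to sample it from the full one with a fixed seed, so the subset stays the same from run to run. The sketch below does that over a list of JSONL lines. CiSubsetSketch is a hypothetical helper, not a Dokimos feature, and reading and writing the files is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class CiSubsetSketch {

    // Picks up to `size` lines; the same seed always yields the same subset.
    static List<String> sample(List<String> lines, int size, long seed) {
        List<String> copy = new ArrayList<>(lines);
        Collections.shuffle(copy, new Random(seed));
        return List.copyOf(copy.subList(0, Math.min(size, copy.size())));
    }

    public static void main(String[] args) {
        List<String> all = new ArrayList<>();
        for (int i = 1; i <= 100; i++) {
            all.add("{\"input\": \"question " + i + "\"}");
        }
        List<String> ci = sample(all, 10, 42L);
        System.out.println(ci.size()); // prints 10
    }
}
```

Random sampling suits broad smoke coverage. For the critical examples that must never break, a hand-picked list is the better fit.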
