JUnit Integration
Run your LLM evaluations as JUnit tests, so a bad output fails the build the same way a broken function does.
Dokimos plugs into JUnit parameterized tests. You load a dataset, run your LLM on each example, and assert that your evaluators pass. JUnit runs the test once per example and fails fast when an output misses your threshold.
Quick start
Three steps: add the dependency, point @DatasetSource at a dataset, call Assertions.assertEval.
Add the dependency to your pom.xml:
<dependency>
<groupId>dev.dokimos</groupId>
<artifactId>dokimos-junit</artifactId>
<version>${dokimos.version}</version>
<scope>test</scope>
</dependency>
Works with JUnit 5.x and 6.x.
Write the test:
Java:
import dev.dokimos.junit.DatasetSource;
import dev.dokimos.core.*;
import dev.dokimos.core.evaluators.*;
import org.junit.jupiter.params.ParameterizedTest;

@ParameterizedTest
@DatasetSource("classpath:datasets/support-qa.json")
void shouldAnswerSupportQuestions(Example example) {
    // Run your LLM on the example input.
    String answer = supportBot.generate(example.input());

    // Build a test case from the example plus the answer.
    EvalTestCase testCase = example.toTestCase(answer);

    // Build the evaluator, then assert it passes. The test fails if it misses its threshold.
    Evaluator correctness = LLMJudgeEvaluator.builder()
        .name("Helpfulness")
        .criteria("Is the response helpful and does it address the customer's issue?")
        .judge(judgeLM)
        .threshold(0.7)
        .build();
    Assertions.assertEval(testCase, correctness);
}
Kotlin:
import dev.dokimos.core.Assertions
import dev.dokimos.core.Example
import dev.dokimos.core.EvalTestCaseParam
import dev.dokimos.junit.DatasetSource
import dev.dokimos.kotlin.dsl.llmJudge
import org.junit.jupiter.params.ParameterizedTest

class SupportTests {
    private val correctness = llmJudge(judgeLM) {
        name = "Helpfulness"
        criteria = "Is the response helpful and does it address the customer's issue?"
        params(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT)
        threshold = 0.7
    }

    @ParameterizedTest
    @DatasetSource("classpath:datasets/support-qa.json")
    fun shouldAnswerSupportQuestions(example: Example) {
        val answer = supportBot.generate(example.input())
        val testCase = example.toTestCase(answer)
        Assertions.assertEval(testCase, listOf(correctness))
    }
}
JUnit runs this test once for each example in the dataset. If any evaluator misses its threshold, the test fails.
When to use JUnit tests
Use JUnit tests when you want fast, fail-fast checks:
- Fast feedback during development. A test fails the moment an output misses your criteria. You do not wait for a full evaluation run.
- CI/CD quality gates. Fail the build when critical test cases break, just like a regular unit test.
- Familiar tooling. Use the test runners, IDE integration, and reports you already have.
Reach for JUnit tests for:
- Critical examples that should never break
- Quick validation during development
- CI/CD pipelines where you want to fail fast
- Test-driven development of LLM features
Reach for experiments instead when you want:
- Analysis across large datasets
- Comparison of different models or configurations
- Detailed reports with metrics
- Exploratory evaluation of new features
See Experiments vs JUnit Testing for the full comparison.
Your task can return a typed record, not just a string. A JUnit test reads it back with actualOutputAs(...) or compares it with StructuralMatchEvaluator. See the Structured & Typed Data hub.
Load a dataset
@DatasetSource accepts a path or inline data. Pick the form that fits.
From the classpath (for example src/test/resources):
@DatasetSource("classpath:datasets/support-qa.json")
@DatasetSource("classpath:datasets/support-qa.jsonl")
From the file system:
@DatasetSource("file:testdata/support-qa.json")
@DatasetSource("file:testdata/support-qa.jsonl")
Inline JSON for quick tests:
@DatasetSource(json = """
    {
      "examples": [
        {"input": "Reset password", "expectedOutput": "Click Forgot Password"},
        {"input": "Track order", "expectedOutput": "Check Order History"}
      ]
    }
    """)
Inline JSONL for quick tests:
@DatasetSource(jsonl = """
{"input": "Reset password", "expectedOutput": "Click Forgot Password"}
{"input": "Track order", "expectedOutput": "Check Order History"}
""")
Assert with assertEval
Assertions.assertEval() runs your evaluators and fails the test if any miss their threshold:
Assertions.assertEval(testCase, evaluators);
When a test fails, you get a clear message:
Evaluation 'Answer Quality' failed: score=0.65 (threshold=0.80)
Reason: The answer is incomplete and doesn't mention the 30-day policy.
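To make the pass/fail rule concrete, here is a minimal plain-Java sketch of a threshold check that produces a message in the format above. It is not Dokimos source code: ThresholdCheck, Result, and assertPasses are hypothetical names, and treating a score equal to the threshold as a pass is an assumption.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for illustration; not part of the Dokimos API.
final class ThresholdCheck {

    // One evaluator outcome: a score, the threshold it must reach, and the judge's reason.
    record Result(String name, double score, double threshold, String reason) {
        // Assumption: a score equal to the threshold counts as a pass.
        boolean passed() {
            return score >= threshold;
        }
    }

    // Throws AssertionError for the first result that misses its threshold.
    static void assertPasses(List<Result> results) {
        for (Result r : results) {
            if (!r.passed()) {
                throw new AssertionError(String.format(Locale.ROOT,
                        "Evaluation '%s' failed: score=%.2f (threshold=%.2f)\nReason: %s",
                        r.name(), r.score(), r.threshold(), r.reason()));
            }
        }
    }
}
```

Because the failure is an ordinary AssertionError, JUnit reports it like any other failed assertion.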
Full example
This test class sets up two evaluators once, then checks every example in the dataset.
Java:
import dev.dokimos.junit.DatasetSource;
import dev.dokimos.core.*;
import dev.dokimos.core.evaluators.*;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.params.ParameterizedTest;
import java.util.List;

class CustomerSupportTest {
    private static List<Evaluator> evaluators;
    private static CustomerSupportBot supportBot;

    @BeforeAll
    static void setup() {
        supportBot = new CustomerSupportBot(apiKey);
        JudgeLM judge = prompt -> judgeModel.generate(prompt);
        evaluators = List.of(
            LLMJudgeEvaluator.builder()
                .name("Answer Quality")
                .criteria("Is the answer helpful and does it address the user's question?")
                .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
                .threshold(0.80)
                .judge(judge)
                .build(),
            RegexEvaluator.builder()
                .name("No Placeholders")
                .pattern(".*\\[.*\\].*") // Catch [PLACEHOLDER] text.
                .threshold(0.0) // Should NOT match.
                .build()
        );
    }

    @ParameterizedTest(name = "[{index}] {0}")
    @DatasetSource("classpath:datasets/support-qa-v3.json")
    void shouldAnswerSupportQuestions(Example example) {
        String response = supportBot.generate(example.input());
        EvalTestCase testCase = example.toTestCase(response);
        Assertions.assertEval(testCase, evaluators);
    }
}
Kotlin:
import dev.dokimos.core.Assertions
import dev.dokimos.core.Example
import dev.dokimos.core.Evaluator
import dev.dokimos.core.JudgeLM
import dev.dokimos.core.evaluators.RegexEvaluator
import dev.dokimos.core.evaluators.LLMJudgeEvaluator
import dev.dokimos.junit.DatasetSource
import org.junit.jupiter.api.BeforeAll
import org.junit.jupiter.params.ParameterizedTest

class CustomerSupportTest {
    companion object {
        private lateinit var evaluators: List<Evaluator>
        private lateinit var supportBot: CustomerSupportBot

        @JvmStatic
        @BeforeAll
        fun setup() {
            supportBot = CustomerSupportBot(apiKey)
            val judge = JudgeLM { prompt -> judgeModel.generate(prompt) }
            evaluators =
                evaluators {
                    llmJudge(judge) {
                        name = "Answer Quality"
                        criteria = "Is the answer helpful and does it address the user's question?"
                        threshold = 0.80
                    }
                    regex {
                        name = "No Placeholders"
                        pattern = """.*\[.*\].*""" // Catch [PLACEHOLDER] text.
                        threshold = 0.0 // Should NOT match.
                    }
                }
        }
    }

    @ParameterizedTest(name = "[{index}] {0}")
    @DatasetSource("classpath:datasets/support-qa-v3.json")
    fun shouldAnswerSupportQuestions(example: Example) {
        val response = supportBot.generate(example.input())
        val testCase = example.toTestCase(response)
        Assertions.assertEval(testCase, evaluators)
    }
}
Test RAG systems
For RAG, put the retrieved context in the test case so a faithfulness check can use it. Pass a map and store the context under a key like retrievedContext, then point FaithfulnessEvaluator at that key.
Java:
@ParameterizedTest
@DatasetSource("classpath:datasets/product-docs-qa.json")
void shouldAnswerFromDocumentation(Example example) {
    // Retrieve the 5 most relevant documents (Java has no named arguments).
    List<String> docs = vectorStore.search(example.input(), 5);

    // Generate the answer with RAG.
    String answer = ragSystem.generate(example.input(), docs);

    // Put the answer and the context in the test case.
    EvalTestCase testCase = example.toTestCase(Map.of(
        "output", answer,
        "retrievedContext", docs
    ));

    // Check both quality and faithfulness.
    Assertions.assertEval(testCase, List.of(
        LLMJudgeEvaluator.builder()
            .name("Answer Quality")
            .criteria("Is the answer helpful?")
            .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
            .threshold(0.8)
            .judge(judge)
            .build(),
        FaithfulnessEvaluator.builder()
            .threshold(0.85)
            .judge(judge)
            .contextKey("retrievedContext")
            .build()
    ));
}
Kotlin:
@ParameterizedTest
@DatasetSource("classpath:datasets/product-docs-qa.json")
fun shouldAnswerFromDocumentation(example: Example) {
    // Retrieve relevant documents.
    val docs = vectorStore.search(example.input(), topK = 5)

    // Generate the answer with RAG.
    val answer = ragSystem.generate(example.input(), docs)

    // Put the answer and the context in the test case.
    val testCase = example.toTestCase(
        mapOf(
            "output" to answer,
            "retrievedContext" to docs
        )
    )

    // Check both quality and faithfulness.
    val answerQuality = llmJudge(judge) {
        name = "Answer Quality"
        criteria = "Is the answer helpful?"
        threshold = 0.8
    }
    val faithfulness = faithfulness(judge) {
        threshold = 0.85
        contextKey = "retrievedContext"
    }
    Assertions.assertEval(testCase, listOf(answerQuality, faithfulness))
}
Name your tests
Set the name on @ParameterizedTest to control how each case shows up in output:
Java:
@ParameterizedTest(name = "{index}: {0}")
@DatasetSource("classpath:datasets/support-qa.json")
void shouldAnswerQuestions(Example example) {
    // Output: "1: How do I reset my password?"
}
Kotlin:
@ParameterizedTest(name = "{index}: {0}")
@DatasetSource("classpath:datasets/support-qa.json")
fun shouldAnswerQuestions(example: Example) {
    // Output: "1: How do I reset my password?"
}
Report real outputs to a server
Declare a static @DatasetReporter field, and @DatasetSource opens a run and reports each invocation as an item result. By default that item is empty.
To carry the real outputs and eval results, add a DatasetItemRecorder parameter to your test method and fill it in. The extension supplies a fresh recorder per invocation, so you never reset it between examples.
import dev.dokimos.core.EvalResult;
import dev.dokimos.core.Reporter;
import dev.dokimos.junit.DatasetRunExtension.DatasetItemRecorder;
import dev.dokimos.junit.DatasetReporter;
import dev.dokimos.junit.DatasetSource;
import org.junit.jupiter.params.ParameterizedTest;

class SupportEvaluationTest {
    @DatasetReporter
    static final Reporter reporter = new DokimosServerReporter(serverConfig);

    @ParameterizedTest
    @DatasetSource("classpath:datasets/support-qa.json")
    void shouldAnswerSupportQuestions(Example example, DatasetItemRecorder recorder) {
        String answer = supportBot.generate(example.input());
        EvalTestCase testCase = example.toTestCase(answer);

        // Record the real output and each eval result before asserting.
        recorder.actualOutput("output", answer);
        for (Evaluator evaluator : evaluators) {
            EvalResult result = evaluator.evaluate(testCase);
            recorder.evalResult(result);
        }
        Assertions.assertEval(testCase, evaluators);
    }
}
The recorder methods are chainable:
- actualOutput(String key, Object value)
- actualOutputs(Map<String, Object> outputs)
- evalResult(EvalResult result)
- evalResults(List<EvalResult> results)
Add run metadata
When a @DatasetReporter field is present, @DatasetSource forwards metadata to the reporter. Use entries for type-safe key-value pairs:
@ParameterizedTest
@DatasetSource(
    value = "classpath:datasets/support-qa.json",
    entries = {
        @MetadataEntry(key = "model", value = "gpt-4"),
        @MetadataEntry(key = "temperature", value = "0")
    })
void shouldAnswerSupportQuestions(Example example) {
    // ...
}
The alternating-string form metadata = {"model", "gpt-4", "temperature", "0"} also works. When you set both, entries wins.
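Here is a minimal plain-Java sketch of how the two forms could combine. It assumes the alternating strings are read as key, value, key, value, and that entries wins per key on conflict; the text above does not say whether entries overrides only the overlapping keys or replaces the whole set. MetadataMerge and merge are hypothetical names, not Dokimos code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of the Dokimos API.
final class MetadataMerge {

    // Reads {"k1", "v1", "k2", "v2"} as pairs, then lets explicit entries override.
    static Map<String, String> merge(String[] alternating, Map<String, String> entries) {
        if (alternating.length % 2 != 0) {
            throw new IllegalArgumentException("metadata needs an even number of strings");
        }
        Map<String, String> merged = new LinkedHashMap<>();
        for (int i = 0; i < alternating.length; i += 2) {
            merged.put(alternating[i], alternating[i + 1]);
        }
        // Assumption: entries win per key when both forms set the same key.
        merged.putAll(entries);
        return merged;
    }
}
```

Setting only one of the two forms avoids the question of precedence altogether.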
Run in CI/CD
Maven
Run the tests in your pipeline:
mvn test
GitHub Actions
name: LLM Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up JDK 21
        uses: actions/setup-java@v3
        with:
          java-version: '21'
          distribution: 'temurin'
      - name: Run LLM Tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: mvn test
      - name: Publish Test Report
        if: always()
        uses: dorny/test-reporter@v1
        with:
          name: JUnit Tests
          path: target/surefire-reports/*.xml
          reporter: java-junit
Test reports
JUnit writes standard reports that CI tools read:
target/surefire-reports/
├── TEST-CustomerSupportTest.xml
└── CustomerSupportTest.txt
Run tests in parallel
JUnit 5 and 6 support parallel test execution, but it is off by default. Turn it on to speed up suites with many examples.
Turn it on
Create src/test/resources/junit-platform.properties:
junit.jupiter.execution.parallel.enabled=true
junit.jupiter.execution.parallel.mode.default=concurrent
junit.jupiter.execution.parallel.config.strategy=fixed
junit.jupiter.execution.parallel.config.fixed.parallelism=4
It works with @DatasetSource
Parameterized tests that use @DatasetSource get parallel execution automatically:
Java:
@ParameterizedTest
@DatasetSource("classpath:datasets/qa-dataset.json")
void shouldAnswerCorrectly(Example example) {
    String answer = assistant.answer(example.input());
    EvalTestCase testCase = example.toTestCase(answer);
    Assertions.assertEval(testCase, evaluators);
}
Kotlin:
@ParameterizedTest
@DatasetSource("classpath:datasets/qa-dataset.json")
fun shouldAnswerCorrectly(example: Example) {
    val answer = assistant.answer(example.input())
    val testCase = example.toTestCase(answer)
    Assertions.assertEval(testCase, evaluators)
}
With parallelism on, JUnit runs multiple examples at the same time.
Watch for rate limits
LLM APIs have rate limits. If you hit them:
- Lower parallelism in the properties file.
- Or use the programmatic Experiment API with explicit .parallelism() control.
Keep it thread-safe
Make your task implementation and any shared state thread-safe before you run tests in parallel.
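As one illustration, here is a sketch of shared state that stays correct when JUnit calls the test method from several threads at once. SharedStats is a hypothetical helper, not part of Dokimos; it uses only java.util.concurrent.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper shared by all test invocations; not part of Dokimos.
final class SharedStats {

    // ConcurrentHashMap and LongAdder tolerate concurrent updates without external locks.
    private final Map<String, LongAdder> callsPerModel = new ConcurrentHashMap<>();

    // Safe to call from many test threads at once.
    void recordCall(String model) {
        callsPerModel.computeIfAbsent(model, k -> new LongAdder()).increment();
    }

    // Returns the number of recorded calls, or 0 for an unknown model.
    long calls(String model) {
        LongAdder adder = callsPerModel.get(model);
        return adder == null ? 0 : adder.sum();
    }
}
```

A plain HashMap with an int counter in the same role could lose updates once parallel execution is on.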
Best practices
- Keep datasets in version control. Store them next to your code so tests stay reproducible.
- Start with critical examples. Do not test everything. Focus on the cases that must never break.
- Use clear test names. Make it obvious what each test checks.
- Split CI from full evaluation. Use a small dataset for CI (10 to 20 examples) and run full evaluations separately.
- Test at multiple levels. Combine unit tests (JUnit) with full evaluations (Experiments) for the best coverage.