JUnit 5 Integration
Dokimos works with JUnit 5's parameterized tests so you can test LLM applications the same way you test regular code - with fast-failing tests that catch regressions.
Why Use JUnit 5 Integration?
Fast feedback during development - Tests fail immediately when an output doesn't meet your criteria. You don't have to wait for a full evaluation run to finish.
CI/CD quality gates - Fail your build if critical test cases don't pass, just like you would with regular unit tests.
Familiar tooling - Use the JUnit tools you already know: test runners, IDE integration, and reporting.
When to use JUnit tests:
- Testing critical examples that should never break
- Quick validation during development
- CI/CD pipelines where you want to fail fast
- Test-driven development of LLM features
When to use experiments instead:
- Analyzing performance across large datasets
- Comparing different models or configurations
- Generating detailed reports with metrics
- Exploratory evaluation of new features
See Experiments vs JUnit Testing for more details.
Setup
Add the JUnit 5 integration dependency:
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-junit5</artifactId>
    <version>${dokimos.version}</version>
    <scope>test</scope>
</dependency>
Basic Usage
Using @DatasetSource
Load datasets with the @DatasetSource annotation:
import dev.dokimos.junit5.DatasetSource;
import dev.dokimos.core.*;
import org.junit.jupiter.params.ParameterizedTest;

@ParameterizedTest
@DatasetSource("classpath:datasets/support-qa.json")
void shouldAnswerSupportQuestions(Example example) {
    // supportBot and evaluators are fields initialized elsewhere (see the Complete Example below)

    // Generate an answer from your LLM
    String answer = supportBot.generate(example.input());

    // Create a test case from the example and the generated output
    EvalTestCase testCase = example.toTestCase(answer);

    // Assert that all evaluators pass (fails the test if any don't)
    Assertions.assertEval(testCase, evaluators);
}
JUnit runs this test once for each example in the dataset. If any evaluator doesn't pass its threshold, the test fails.
Loading Datasets
From classpath (like src/test/resources):
@DatasetSource("classpath:datasets/support-qa.json")
From file system:
@DatasetSource("file:testdata/support-qa.json")
Inline for quick tests:
@DatasetSource(json = """
    {
      "examples": [
        {"input": "Reset password", "expectedOutput": "Click Forgot Password"},
        {"input": "Track order", "expectedOutput": "Check Order History"}
      ]
    }
    """)
Using assertEval
Assertions.assertEval() runs your evaluators and fails the test if any don't pass:
Assertions.assertEval(testCase, evaluators);
When a test fails, you get a clear error message:
Evaluation 'Answer Quality' failed: score=0.65 (threshold=0.80)
Reason: The answer is incomplete and doesn't mention the 30-day policy.
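assertEval also composes with ordinary JUnit assertions in the same test method, so you can fail fast on cheap structural checks before spending judge calls. A minimal sketch, assuming supportBot and judge are configured as in the Complete Example below (the JUnit Assertions class is fully qualified to avoid clashing with Dokimos' Assertions):

@ParameterizedTest
@DatasetSource("classpath:datasets/support-qa.json")
void shouldGiveHelpfulAnswers(Example example) {
    String answer = supportBot.generate(example.input());

    // Plain JUnit assertion: fail immediately on empty output
    org.junit.jupiter.api.Assertions.assertFalse(answer.isBlank(), "Bot returned an empty answer");

    // Evaluator-based assertion, as shown above
    Assertions.assertEval(example.toTestCase(answer), List.of(
        LLMJudgeEvaluator.builder()
            .name("Helpfulness")
            .criteria("Does the answer address the user's question?")
            .threshold(0.8)
            .judge(judge)
            .build()
    ));
}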
Complete Example
import dev.dokimos.junit5.DatasetSource;
import dev.dokimos.core.*;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.params.ParameterizedTest;

import java.util.List;

class CustomerSupportTest {

    private static List<Evaluator> evaluators;
    private static CustomerSupportBot supportBot;

    @BeforeAll
    static void setup() {
        // apiKey and judgeModel come from your own configuration (e.g. environment variables)
        supportBot = new CustomerSupportBot(apiKey);
        JudgeLM judge = prompt -> judgeModel.generate(prompt);

        evaluators = List.of(
            LLMJudgeEvaluator.builder()
                .name("Answer Quality")
                .criteria("Is the answer helpful, and does it address the user's question?")
                .threshold(0.80)
                .judge(judge)
                .build(),
            RegexEvaluator.builder()
                .name("No Placeholders")
                .pattern(".*\\[.*\\].*") // Catch [PLACEHOLDER] text
                .threshold(0.0)          // Should NOT match
                .build()
        );
    }

    @ParameterizedTest(name = "[{index}] {0}")
    @DatasetSource("classpath:datasets/support-qa-v3.json")
    void shouldAnswerSupportQuestions(Example example) {
        String response = supportBot.generate(example.input());
        EvalTestCase testCase = example.toTestCase(response);
        Assertions.assertEval(testCase, evaluators);
    }
}
Advanced Usage
Testing RAG Systems
For RAG applications, include the retrieved context in your test case:
@ParameterizedTest
@DatasetSource("classpath:datasets/product-docs-qa.json")
void shouldAnswerFromDocumentation(Example example) {
    // Retrieve the top 5 relevant documents
    List<String> docs = vectorStore.search(example.input(), 5);

    // Generate an answer with RAG
    String answer = ragSystem.generate(example.input(), docs);

    // Include the retrieved context in the test case
    EvalTestCase testCase = example.toTestCase(Map.of(
        "output", answer,
        "retrievedContext", docs
    ));

    // Check both answer quality and faithfulness to the retrieved context
    Assertions.assertEval(testCase, List.of(
        LLMJudgeEvaluator.builder()
            .name("Answer Quality")
            .criteria("Is the answer helpful?")
            .threshold(0.8)
            .judge(judge)
            .build(),
        FaithfulnessEvaluator.builder()
            .threshold(0.85)
            .judge(judge)
            .contextKey("retrievedContext")
            .build()
    ));
}
Readable Test Names
Customize how tests appear in output:
@ParameterizedTest(name = "{index}: {0}")
@DatasetSource("classpath:datasets/support-qa.json")
void shouldAnswerQuestions(Example example) {
    // Reported as: "1: How do I reset my password?"
}
CI/CD Integration
Maven
Run tests in your CI pipeline:
mvn test
GitHub Actions
name: LLM Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up JDK 21
        uses: actions/setup-java@v3
        with:
          java-version: '21'
          distribution: 'temurin'

      - name: Run LLM Tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: mvn test

      - name: Publish Test Report
        if: always()
        uses: dorny/test-reporter@v1
        with:
          name: JUnit Tests
          path: target/surefire-reports/*.xml
          reporter: java-junit
Test Reports
JUnit generates standard test reports that integrate with CI tools:
target/surefire-reports/
├── TEST-CustomerSupportTest.xml
└── CustomerSupportTest.txt
Best Practices
Keep datasets in version control - Store them alongside your code so tests are reproducible.
Start with critical examples - Don't try to test everything. Focus on the most important cases that should never break.
Use clear test names - Make it obvious what each test is checking.
Separate CI and comprehensive testing - Use a smaller dataset for CI (maybe 10-20 examples) and run full evaluations separately; see the tagging sketch below.
Test at multiple levels - Combine unit tests (JUnit) with comprehensive evaluations (Experiments) for best coverage.
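One way to keep the CI run small is JUnit 5's standard @Tag mechanism: tag the fast, critical test class and restrict the pipeline to that tag (with Maven Surefire this maps to the groups property). A sketch, assuming a trimmed support-qa-ci.json dataset and the supportBot/evaluators setup from the Complete Example:

import dev.dokimos.junit5.DatasetSource;
import dev.dokimos.core.*;
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.params.ParameterizedTest;

@Tag("ci") // select only tagged tests in the pipeline, e.g. mvn test -Dgroups=ci
class CriticalSupportTest {

    // supportBot and evaluators are initialized in @BeforeAll, as in the Complete Example above

    @ParameterizedTest(name = "[{index}] {0}")
    @DatasetSource("classpath:datasets/support-qa-ci.json") // hypothetical trimmed dataset (10-20 examples)
    void mustNeverRegress(Example example) {
        String answer = supportBot.generate(example.input());
        Assertions.assertEval(example.toTestCase(answer), evaluators);
    }
}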