Test your LLM in JUnit: evaluate and gate model output in Java

This page shows you how to check whether your model's output is good from inside a plain JUnit test, so a quality drop turns your build red.

Most LLM evaluation tooling is Python-first. If you ship on the JVM, that means a second language, a second toolchain, and a separate pipeline just to grade model output. Dokimos runs where your code already runs. You write the test in Java or Kotlin, run the same mvn test your team already runs, and let the CI that already gates your merges gate model quality too. No new service. No Python.

By the end you will have:

  • A JUnit test that calls a model, asserts its output, and fails the build when quality drops
  • A deterministic check (exact match), a semantic check (an LLM judge), and a structured-output check
  • A dataset-driven test that runs many cases from one method

Prerequisites

  • Java 17 or later
  • Maven or Gradle
  • An OpenAI API key exported as OPENAI_API_KEY

This tutorial calls OpenAI directly through the OpenAI Java SDK, so there is no framework prerequisite. If you already use Spring AI or LangChain4j, see the Spring AI agent evaluation tutorial instead.

Step 1: Add the dependency

Add the Dokimos JUnit integration and core library in test scope.

Maven

<dependencies>
  <!-- Dokimos core: evaluators and test cases -->
  <dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-core</artifactId>
    <version>${dokimos.version}</version>
    <scope>test</scope>
  </dependency>

  <!-- Dokimos JUnit integration: @DatasetSource -->
  <dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-junit</artifactId>
    <version>${dokimos.version}</version>
    <scope>test</scope>
  </dependency>

  <!-- The model client used in this tutorial -->
  <dependency>
    <groupId>com.openai</groupId>
    <artifactId>openai-java</artifactId>
    <version>4.11.0</version>
    <scope>test</scope>
  </dependency>
</dependencies>

Gradle

dependencies {
    // Double quotes are required: Groovy does not interpolate ${...} in single-quoted strings.
    testImplementation "dev.dokimos:dokimos-core:${dokimosVersion}"
    testImplementation "dev.dokimos:dokimos-junit:${dokimosVersion}"
    testImplementation 'com.openai:openai-java:4.11.0'
}

See Installation for the current version and other build setups.

Step 2: Call the model and get text out

Dokimos does not call the model for you. You bring your own call and hand the result to an evaluator. Here is a small helper that calls a gpt-5.x model through the OpenAI Responses API and returns the output text.

import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.ChatModel;
import com.openai.models.responses.Response;
import com.openai.models.responses.ResponseCreateParams;

static final OpenAIClient CLIENT = OpenAIOkHttpClient.fromEnv(); // reads OPENAI_API_KEY

static String ask(String prompt) {
    Response response = CLIENT
        .responses()
        .create(ResponseCreateParams.builder()
            .model(ChatModel.GPT_5_2)
            .input(prompt)
            .build());
    return response.output().stream()
        .filter(item -> item.isMessage())
        .flatMap(item -> item.asMessage().content().stream())
        .filter(content -> content.isOutputText())
        .map(content -> content.asOutputText().text())
        .reduce("", String::concat)
        .trim();
}

OpenAIOkHttpClient.fromEnv() reads OPENAI_API_KEY from the environment, so you keep no secrets in your code.

Step 3: Write a deterministic eval

Some questions have one correct answer: math, extraction, a known fact. For these, use ExactMatchEvaluator. It compares the actual output to the expected output, and the test fails when they differ.

Drive the cases from a dataset file so adding a case is a one-line edit. Create src/test/resources/datasets/junit-tutorial-qa.json:

{
  "name": "JUnit Tutorial QA",
  "examples": [
    {
      "input": "What is the capital of France? Reply with only the city name.",
      "expectedOutput": "Paris",
      "metadata": { "category": "geography" }
    },
    {
      "input": "What is the capital of Japan? Reply with only the city name.",
      "expectedOutput": "Tokyo",
      "metadata": { "category": "geography" }
    },
    {
      "input": "What is the capital of Italy? Reply with only the city name.",
      "expectedOutput": "Rome",
      "metadata": { "category": "geography" }
    }
  ]
}

@DatasetSource turns each example into one run of a parameterized test. example.toTestCase(answer) builds the EvalTestCase. Assertions.assertEval(...) fails the test if any evaluator does not pass.

import dev.dokimos.core.Assertions;
import dev.dokimos.core.EvalTestCase;
import dev.dokimos.core.Evaluator;
import dev.dokimos.core.Example;
import dev.dokimos.core.evaluators.ExactMatchEvaluator;
import dev.dokimos.junit.DatasetSource;
import org.junit.jupiter.params.ParameterizedTest;

@ParameterizedTest(name = "[{index}] {0}")
@DatasetSource("classpath:datasets/junit-tutorial-qa.json")
void factualAnswerMatchesExactly(Example example) {
    String answer = ask(example.input());

    EvalTestCase testCase = example.toTestCase(answer);
    Evaluator exactMatch = ExactMatchEvaluator.builder()
        .name("Exact Match")
        .threshold(1.0)
        .build();

    Assertions.assertEval(testCase, exactMatch);
}

Run it with mvn test. Each dataset row shows up as a separate test case in your IDE and your CI report.
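
Exact match is unforgiving about surface noise: an answer of "Paris." or " paris" is a different string from "Paris". This page does not say whether ExactMatchEvaluator normalizes its inputs, so the sketch below assumes it does not and shows one way to canonicalize short answers before comparing. AnswerNormalizer and its methods are hypothetical names for illustration, not part of Dokimos.

```java
/** Hypothetical helper, not part of Dokimos: canonicalizes short answers before comparison. */
final class AnswerNormalizer {
    private AnswerNormalizer() {}

    /** Trims whitespace, one pair of surrounding double quotes, and trailing . ! ? characters. */
    static String normalize(String raw) {
        String s = raw.strip();
        // Drop one pair of surrounding straight double quotes.
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            s = s.substring(1, s.length() - 1).strip();
        }
        // Drop trailing sentence punctuation such as the period in "Paris."
        s = s.replaceAll("[.!?]+$", "");
        return s.strip();
    }

    /** Case-insensitive comparison of the normalized forms. */
    static boolean matches(String expected, String actual) {
        return normalize(expected).equalsIgnoreCase(normalize(actual));
    }
}
```

In the test above, this would mean passing AnswerNormalizer.normalize(answer) to example.toTestCase(...). The trade-off is that normalization hides formatting regressions, so leave it out when the exact format is part of the contract.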

Step 4: Add an LLM judge for open-ended answers

Exact match breaks as soon as an answer has more than one correct phrasing. For open-ended output, use LLMJudgeEvaluator, which scores the answer against criteria you write in plain English, using an LLM as the grader. The judge is called on every run, so the example below uses a smaller, cheaper model for it.

The judge is a JudgeLM, a one-method functional interface that takes a prompt and returns text, so you can wrap the same OpenAI client in a lambda.

import dev.dokimos.core.Assertions;
import dev.dokimos.core.EvalTestCase;
import dev.dokimos.core.EvalTestCaseParam;
import dev.dokimos.core.Evaluator;
import dev.dokimos.core.JudgeLM;
import dev.dokimos.core.evaluators.LLMJudgeEvaluator;
import com.openai.models.ChatModel;
import com.openai.models.responses.ResponseCreateParams;
import java.util.List;
import org.junit.jupiter.api.Test;

JudgeLM judge() {
    return prompt -> CLIENT
        .responses()
        .create(ResponseCreateParams.builder()
            .model(ChatModel.GPT_5_MINI)
            .input(prompt)
            .build())
        .output()
        .stream()
        .filter(item -> item.isMessage())
        .flatMap(item -> item.asMessage().content().stream())
        .filter(content -> content.isOutputText())
        .map(content -> content.asOutputText().text())
        .reduce("", String::concat);
}

@Test
void openEndedAnswerIsHelpful() {
    String answer = ask("In one sentence, what does an LLM evaluation framework do?");

    EvalTestCase testCase = EvalTestCase.builder()
        .input("What does an LLM evaluation framework do?")
        .actualOutput(answer)
        .build();
    Evaluator helpfulness = LLMJudgeEvaluator.builder()
        .name("Helpfulness")
        .criteria("Is the answer accurate, clear, and genuinely helpful?")
        .evaluationParams(List.of(EvalTestCaseParam.INPUT, EvalTestCaseParam.ACTUAL_OUTPUT))
        .threshold(0.7)
        .judge(judge())
        .build();

    Assertions.assertEval(testCase, helpfulness);
}

The judge returns a score in [0, 1]. The test passes when the score meets the threshold. See LLMJudgeEvaluator for scoring details.
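
The pass rule above can be pictured as a plain comparison. The sketch below assumes that "meets the threshold" means score >= threshold, which this page does not spell out. It also adds one optional idea that is not taken from Dokimos: averaging several judge scores for the same answer, since an LLM judge may not return the same score on every run. ThresholdGate is a hypothetical name.

```java
import java.util.List;

/** Illustrative only: mirrors the pass rule described above; not Dokimos code. */
final class ThresholdGate {
    private ThresholdGate() {}

    /** Assumption: a score passes when it is greater than or equal to the threshold. */
    static boolean passes(double score, double threshold) {
        requireUnitInterval(score);
        requireUnitInterval(threshold);
        return score >= threshold;
    }

    /** Arithmetic mean of several judge scores for the same answer. */
    static double mean(List<Double> scores) {
        if (scores.isEmpty()) {
            throw new IllegalArgumentException("at least one score is required");
        }
        double sum = 0.0;
        for (double s : scores) {
            requireUnitInterval(s);
            sum += s;
        }
        return sum / scores.size();
    }

    /** Rejects NaN and anything outside [0, 1]. */
    private static void requireUnitInterval(double v) {
        if (!(v >= 0.0 && v <= 1.0)) {
            throw new IllegalArgumentException("expected a value in [0, 1], got " + v);
        }
    }
}
```

Under these assumptions, a score of exactly 0.7 passes a 0.7 threshold. Averaging multiplies the number of judge calls, so it only pays off where a single noisy score would otherwise flip the build.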

Step 5 (bonus): Assert on structured output

Models increasingly return JSON, and comparing JSON as a string is fragile: 21 versus 21.0, reordered keys, and extra whitespace all break String.equals. StructuralMatchEvaluator compares the two payloads as JSON structures, so numbers match by value and you choose how strict to be about field sets and array order.

Ask the model for JSON, parse it into a Map, store it under the output key, and compare it against the expected contract. Then read the same output back through the typed accessor actualOutputAs(...), with no manual map juggling.

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import dev.dokimos.core.Assertions;
import dev.dokimos.core.EvalTestCase;
import dev.dokimos.core.Evaluator;
import dev.dokimos.core.evaluators.StructuralMatchEvaluator;
import dev.dokimos.core.evaluators.StructuralMatchMode;
import java.util.Map;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

static final ObjectMapper JSON = new ObjectMapper();

record WeatherReport(String city, int temperatureCelsius, String condition) {}

@Test
void structuredOutputMatchesContract() throws Exception {
    String raw = ask("Return ONLY compact JSON with keys city (string), temperatureCelsius "
        + "(integer), and condition (string) for this report: it is 21 degrees Celsius and "
        + "sunny in Paris. Do not wrap it in markdown.");

    Map<String, Object> actual = JSON.readValue(raw, new TypeReference<>() {});

    EvalTestCase testCase = EvalTestCase.builder()
        .input("weather report for Paris")
        .actualOutput("output", actual)
        .expectedOutput("output", Map.of(
            "city", "Paris", "temperatureCelsius", 21, "condition", "sunny"))
        .build();

    Evaluator structuralMatch = StructuralMatchEvaluator.builder()
        .name("Structural Match")
        .mode(StructuralMatchMode.LENIENT) // tolerate extra fields, ignore array order
        .threshold(1.0)
        .build();

    Assertions.assertEval(testCase, structuralMatch);

    // Typed accessor: read the same output back as a record.
    WeatherReport report = testCase.actualOutputAs(WeatherReport.class);
    assertEquals("Paris", report.city());
    assertEquals(21, report.temperatureCelsius());
}
}
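
The prompt above tells the model not to wrap the JSON in markdown, but a model can still ignore that instruction, and JSON.readValue would then fail on the fence characters. A defensive helper, sketched here with the hypothetical name JsonFenceStripper, could unwrap a fenced reply before parsing. It only handles a fence that sits on its own lines around the body; anything else is returned trimmed and unchanged.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Dokimos or the OpenAI SDK. */
final class JsonFenceStripper {
    private JsonFenceStripper() {}

    // Opening fence (three backticks, optional language tag) on its own line,
    // then the body, then a closing fence of three backticks.
    private static final Pattern FENCED = Pattern.compile(
            "^\\s*`{3}[A-Za-z]*[ \\t]*\\R(.*?)\\s*`{3}\\s*$", Pattern.DOTALL);

    /** Returns the fenced body if the reply is fenced, otherwise the trimmed reply. */
    static String strip(String raw) {
        Matcher m = FENCED.matcher(raw);
        return m.matches() ? m.group(1).strip() : raw.strip();
    }
}
```

In the test above, this would mean calling JSON.readValue(JsonFenceStripper.strip(raw), ...) instead of parsing raw directly. If a fenced reply should count as a failure of the contract, skip the helper and let the parse error fail the test.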

LENIENT mode lets the model add fields you do not care about, and it ignores array order. Switch to StructuralMatchMode.STRICT when the contract must be exact. See StructuralMatchEvaluator for the full scoring and mode rules.
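
To make the two modes concrete, here is a minimal stand-in comparison over plain Java maps and lists. It is not the Dokimos implementation: StructuralCompare is a hypothetical name, and the reading of STRICT as "same key set, same array order" is an assumption based on the description above. The any-order list match is a simple greedy pairing, which is enough for illustration.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for illustration; not the Dokimos implementation. */
final class StructuralCompare {
    private StructuralCompare() {}

    /**
     * lenient = true: extra keys in actual are ignored and list order does not matter.
     * lenient = false: key sets and list order must match (assumed meaning of STRICT).
     */
    static boolean matches(Object expected, Object actual, boolean lenient) {
        if (expected instanceof Number e && actual instanceof Number a) {
            // Compare by numeric value so 21 and 21.0 are equal (finite numbers only).
            return new BigDecimal(e.toString()).compareTo(new BigDecimal(a.toString())) == 0;
        }
        if (expected instanceof Map<?, ?> e && actual instanceof Map<?, ?> a) {
            if (!lenient && !e.keySet().equals(a.keySet())) {
                return false;
            }
            for (Map.Entry<?, ?> entry : e.entrySet()) {
                if (!a.containsKey(entry.getKey())
                        || !matches(entry.getValue(), a.get(entry.getKey()), lenient)) {
                    return false;
                }
            }
            return true;
        }
        if (expected instanceof List<?> e && actual instanceof List<?> a) {
            if (e.size() != a.size()) {
                return false;
            }
            return lenient ? matchesAnyOrder(e, a) : matchesInOrder(e, a);
        }
        return Objects.equals(expected, actual);
    }

    private static boolean matchesInOrder(List<?> expected, List<?> actual) {
        for (int i = 0; i < expected.size(); i++) {
            if (!matches(expected.get(i), actual.get(i), false)) {
                return false;
            }
        }
        return true;
    }

    private static boolean matchesAnyOrder(List<?> expected, List<?> actual) {
        List<Object> remaining = new ArrayList<>(actual);
        for (Object item : expected) {
            boolean found = false;
            for (int i = 0; i < remaining.size(); i++) {
                if (matches(item, remaining.get(i), true)) {
                    remaining.remove(i); // remove by index so each actual element pairs once
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }
}
```

With this sketch, a payload that adds a humidity field to the weather report matches in lenient mode and fails in strict mode, while 21 and 21.0 compare equal in both.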

Step 6: Gate your build in CI

Here is the payoff. These are ordinary JUnit tests, so any CI that runs your tests already gates on them. When the model regresses below your thresholds, the build goes red.

The only setup is making the API key available. In GitHub Actions:

name: LLM Evaluation

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up JDK 17
        uses: actions/setup-java@v4
        with:
          java-version: '17'
          distribution: 'temurin'

      - name: Run LLM evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: mvn test

Keep model calls off the critical path

Tests that hit a live model cost money and add latency. A common pattern is to tag them and run the full set on a schedule or on demand, so that ordinary commits stay fast. Annotate model-calling tests with JUnit's @Tag("integration") and guard them with @EnabledIfEnvironmentVariable(named = "OPENAI_API_KEY", matches = ".+") so they are skipped when no key is set. Run them with mvn verify -Dgroups=integration, and leave them out of per-commit builds with mvn test -DexcludedGroups=integration.

If this saved you from standing up a Python pipeline just to test your model, consider giving the repository a star on GitHub ⭐.