All Classes and Interfaces
Class
Description
Wraps a complete agent execution trace for evaluation.
Builder for constructing agent traces.
Strategies for aggregating scores from multiple evaluation criteria.
Filter that enforces API key authentication for write operations on /api/v1/** endpoints.
Configuration properties for API key authentication.
Assertion utilities for evaluation-based testing.
Base class for implementing concrete evaluators.
Evaluator that measures how relevant retrieved context chunks are to a user's input query.
Builder for constructing ContextualRelevanceEvaluator instances.
Represents the relevance score for a single context chunk.
Functional interface representing an application that can engage in multi-turn conversations.
Orchestrates multi-turn conversations between a simulated user and an application.
Builder for constructing conversation simulators.
Represents a complete conversation trajectory between a simulated user and an application.
Builder for constructing conversation trajectories.
A collection of examples for evaluation.
Builder for constructing datasets.
JUnit ArgumentsProvider that loads Examples from a Dataset.
Thrown when a dataset cannot be correctly resolved or loading fails.
Resolves a dataset URI to a Dataset.
Singleton registry for dataset resolvers.
Provides Examples from a Dataset as arguments to a parameterized test.
An async HTTP implementation of Reporter that sends experiment results to a Dokimos server.
Builder for DokimosServerReporter.
The result of an evaluation.
Builder for constructing evaluation results.
A test case for evaluation.
Builder for constructing test cases with multiple inputs and outputs.
Defines a single evaluation dimension for trajectory evaluation.
Thrown when an evaluation cannot be executed successfully.
Evaluates test cases and produces scored results.
Evaluator that checks for exact string match between actual and expected outputs.
A dataset example with inputs, expected outputs, and metadata.
Builder for constructing examples with multiple inputs and outputs.
An evaluation experiment that runs a task against a dataset and evaluates the results.
Aggregated results from an experiment.
Utility class for exporting ExperimentResult to various formats.
Evaluator that uses an LLM to check how much of the actual output is backed by the given context.
Resolves datasets from the filesystem.
Evaluator that uses an LLM to detect hallucinations in the actual output.
A language model used for evaluation.
Utilities for integrating with LangChain4j.
Evaluator that uses an LLM to evaluate outputs based on the specified criteria.
Utility methods for processing LLM responses.
An LLM-based simulated user for multi-turn conversation testing.
Builder for constructing LLM simulated users.
Strategy for determining if a retrieved item matches an expected item.
Represents a single message in a conversation.
The role of a message sender in a conversation.
A no-op implementation of Reporter that does nothing.
Evaluator that measures retrieval precision.
Builder for constructing PrecisionEvaluator instances.
Evaluator that measures retrieval recall.
Builder for constructing RecallEvaluator instances.
Evaluator that checks if the actual output matches a regular expression pattern.
Interface for reporting experiment results to an external system.
A handle representing an active experiment run.
Results from a single run of an experiment.
Status of an experiment run.
Functional interface for simulating user behavior in multi-turn conversations.
Utilities for integrating with Spring AI.
Evaluates whether an AI agent completed the user's requested tasks.
Builder for constructing the evaluator.
Uses a judge LLM to assess whether tool call argument values are factually grounded in the user's input and preceding tool call results.
Builder for constructing the evaluator.
Represents a single tool invocation made by an AI agent.
Builder for constructing tool calls.
Validates that tool calls are syntactically correct per their JSON schema definitions.
Builder for constructing the evaluator.
Checks whether the agent used the expected set of tools.
Builder for constructing the evaluator.
The comparison mode for evaluating tool correctness.
Describes an available tool's contract including its name, description, and JSON schema.
Builder for constructing tool definitions.
Evaluates tool description quality using a mix of rule-based checks and optional LLM checks.
Builder for constructing the evaluator.
Evaluates tool naming quality using a mix of rule-based checks and optional LLM checks.
Builder for constructing the evaluator.
Factory for creating pre-built evaluation criteria for trajectory evaluation.
Evaluates complete conversation trajectories using LLM-as-judge patterns.
Builder for constructing trajectory evaluators.
Factory for creating pre-built simulated user personas.
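
The retrieval precision and recall entries above name the metrics without defining them. The following is a minimal, self-contained sketch of the standard set-based definitions. It does not use the library's PrecisionEvaluator or RecallEvaluator API: the class name `RetrievalMetricsSketch`, its methods, the exact-string matching rule, and the choice to return 0.0 for empty inputs are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the documented API.
final class RetrievalMetricsSketch {

    // Counts distinct retrieved items that also appear in the expected set.
    // Assumed matching strategy: exact string equality.
    static int matches(List<String> retrieved, Set<String> expected) {
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(expected);
        return hits.size();
    }

    // Precision: share of distinct retrieved items that were expected.
    // Assumption: returns 0.0 when nothing was retrieved.
    static double precision(List<String> retrieved, Set<String> expected) {
        Set<String> distinct = new HashSet<>(retrieved);
        if (distinct.isEmpty()) {
            return 0.0;
        }
        return (double) matches(retrieved, expected) / distinct.size();
    }

    // Recall: share of expected items that were retrieved.
    // Assumption: returns 0.0 when nothing was expected.
    static double recall(List<String> retrieved, Set<String> expected) {
        if (expected.isEmpty()) {
            return 0.0;
        }
        return (double) matches(retrieved, expected) / expected.size();
    }

    public static void main(String[] args) {
        List<String> retrieved = List.of("doc-a", "doc-b", "doc-c", "doc-d");
        Set<String> expected = Set.of("doc-a", "doc-c", "doc-e");
        System.out.println(String.format(Locale.ROOT,
                "precision=%.2f recall=%.2f",
                precision(retrieved, expected),
                recall(retrieved, expected)));
    }
}
```

In this example two of the four retrieved items are expected, so precision is 2/4, and two of the three expected items were retrieved, so recall is 2/3. A different matching strategy (for instance, comparing by document ID or by substring) would change only the `matches` step.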