Evaluation | Dokimos | LLM Evaluation Framework for Java

📄️ Datasets

A dataset is a collection of examples that represent the scenarios you want to test your LLM application against. Each example typically contains an input (like a user question or prompt) and an expected output (the correct or desired response).

📄️ Experiments

An experiment runs your LLM application (called a Task) against a dataset, applies evaluators to check the outputs, and gives you aggregated results. It's the main way to systematically evaluate how well your application performs.
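
As a rough illustration of that flow, the sketch below runs a task over a small dataset, applies one check to each output, and aggregates a pass count. Every name in it (`ExperimentSketch`, `Example`, `Summary`, `run`) is hypothetical and chosen for this example only; it is not the Dokimos API, and the "task" is a canned lookup standing in for a real LLM call.

```java
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Function;

// Hypothetical sketch: none of these types are Dokimos classes.
public class ExperimentSketch {

    // One dataset example: an input and the output we expect for it.
    record Example(String input, String expectedOutput) {}

    // Aggregated result of a run.
    record Summary(int total, int passed) {
        double passRate() {
            return total == 0 ? 0.0 : (double) passed / total;
        }
    }

    // Runs the task on every example and counts how many outputs the check accepts.
    static Summary run(List<Example> dataset,
                       Function<String, String> task,
                       BiPredicate<String, String> check) {
        int passed = 0;
        for (Example example : dataset) {
            String actual = task.apply(example.input());
            if (check.test(actual, example.expectedOutput())) {
                passed++;
            }
        }
        return new Summary(dataset.size(), passed);
    }

    public static void main(String[] args) {
        List<Example> dataset = List.of(
            new Example("2 + 2", "4"),
            new Example("capital of France", "Paris"));

        // Stand-in for a real LLM call: a canned lookup that only knows one answer.
        Function<String, String> task = input -> input.equals("2 + 2") ? "4" : "unknown";

        // Exact string equality plays the role of the evaluator here.
        Summary summary = run(dataset, task, String::equals);
        System.out.println(summary.passed() + "/" + summary.total() + " passed");
    }
}
```

Passing the task and the check in as functions keeps the loop itself independent of any particular model or scoring rule, which is the separation the description above implies.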

📄️ Evaluators

Evaluators check the quality of your LLM's outputs. Each one assigns a score between 0.0 and 1.0 and decides whether the output passes based on a threshold you set.
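
To make the score-and-threshold idea concrete, here is a minimal sketch of one possible evaluator. It is not a Dokimos class: `ThresholdEvaluatorSketch`, `Result`, and the word-overlap metric are assumptions made for illustration, and treating a score equal to the threshold as a pass is also an assumption.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch: not the Dokimos evaluator API.
public class ThresholdEvaluatorSketch {

    // A score in [0.0, 1.0] plus the pass/fail decision derived from it.
    record Result(double score, boolean passed) {}

    // Splits text into a set of lowercase words; blank text yields an empty set.
    static Set<String> words(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return Set.of();
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Fraction of distinct expected words that also appear in the actual output.
    // Returns 0.0 when the expected text has no words (an arbitrary choice).
    static double wordOverlap(String actual, String expected) {
        Set<String> expectedWords = words(expected);
        if (expectedWords.isEmpty()) {
            return 0.0;
        }
        Set<String> actualWords = words(actual);
        long matched = expectedWords.stream().filter(actualWords::contains).count();
        return (double) matched / expectedWords.size();
    }

    // Assumption: a score at or above the threshold counts as a pass.
    static Result evaluate(String actual, String expected, double threshold) {
        double score = wordOverlap(actual, expected);
        return new Result(score, score >= threshold);
    }

    public static void main(String[] args) {
        Result partial = evaluate("Paris", "Paris France", 0.8);
        Result full = evaluate("paris france", "Paris France", 0.8);
        System.out.println(partial.score() + " " + partial.passed());
        System.out.println(full.score() + " " + full.passed());
    }
}
```

With a threshold of 0.8, an output that covers only one of the two expected words scores 0.5 and fails, while a case-insensitive full match scores 1.0 and passes.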

📄️ Data Model

Understanding Dokimos's data model helps you work more effectively with datasets, experiments, and evaluation results. This guide covers the core classes and how they fit together.

📄️ Multi-Turn Conversations

Evaluating multi-turn conversations is more complex than evaluating single-turn interactions. You need to test how your AI system handles back-and-forth exchanges, maintains context, and achieves user goals over multiple turns.