📄️ Datasets
A dataset is your list of test cases. Each example holds an input (a user question or prompt) and the expected output (the answer you want back). You run your LLM application against every example in a single run instead of trying prompts by hand.
📄️ Experiments
An experiment runs your LLM application against a whole dataset, scores every output, and hands you the totals. It is the main way to measure how well your application performs.
📄️ Evaluators
An evaluator scores one of your LLM's outputs and tells you whether it passes. Each evaluator returns a score from 0.0 to 1.0 and compares it against a threshold you set to decide pass or fail. Use this page to pick a built-in evaluator, configure it, and read its result.
📄️ Data Model
This page shows the classes Dokimos uses to hold your test cases, run your LLM, and report scores, so you know exactly what to build and what comes back.
📄️ Multi-Turn Conversations
This page shows you how to test a chat assistant across a full back-and-forth conversation, not just one prompt and reply.
📄️ Structured & Typed Data
Return real domain objects from your tasks, compare them structurally, and read them back type-safely. This page shows you how.
📄️ Agent Evaluation
This page shows you how to score an AI agent on the tools it called, not just its final reply.
📄️ Cost and Pricing
Dokimos can record what each LLM call cost, roll it up per run, and flag when a cost total covers only part of a run. This page explains how cost capture works, the pluggable pricing seam, and the partial-coverage signal you will see in the UI.
📄️ Regression gate (server-free)
Run your evals as a test and fail the build when quality drops. You commit a baseline next to your test, and on every run the gate compares the fresh result against it and throws on a real regression. No server, no account, no API key for the gate itself. The failing test is the gate, and it fires the same way locally and in CI.