Class RunComparison
Each side (baseline and candidate) may contain one or more runs (repetitions). Items are grouped by an item-identity key, aggregated across repetitions into a per-item pass probability and a per-evaluator mean, then paired across sides by key. The engine emits per-evaluator and overall deltas, each classified as IMPROVED, REGRESSED, or UNCHANGED and backed by a significance test.
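The grouping and pairing step can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `ItemResult` and `Paired` records, the method names, and the choice to skip keys that appear on only one side are hypothetical and are not taken from RunComparison's API. A per-evaluator mean would be aggregated the same way as the pass probability shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: hypothetical types, not RunComparison's actual API.
final class PairingSketch {
    // One item outcome from one repetition.
    record ItemResult(String key, boolean passed) {}

    // Baseline and candidate pass probabilities for the same item key.
    record Paired(String key, double baseline, double candidate) {}

    // Pools all repetitions of one side and computes passes / total per key.
    static Map<String, Double> passProbability(List<List<ItemResult>> runs) {
        Map<String, int[]> counts = new TreeMap<>(); // key -> {passes, total}
        for (List<ItemResult> run : runs) {
            for (ItemResult r : run) {
                int[] c = counts.computeIfAbsent(r.key(), k -> new int[2]);
                if (r.passed()) c[0]++;
                c[1]++;
            }
        }
        Map<String, Double> out = new TreeMap<>();
        counts.forEach((k, c) -> out.put(k, (double) c[0] / c[1]));
        return out;
    }

    // Pairs the two sides by key. Assumption: keys missing on either side are skipped.
    static List<Paired> pair(Map<String, Double> baseline, Map<String, Double> candidate) {
        List<Paired> out = new ArrayList<>();
        for (Map.Entry<String, Double> e : baseline.entrySet()) {
            Double c = candidate.get(e.getKey());
            if (c != null) out.add(new Paired(e.getKey(), e.getValue(), c));
        }
        return out;
    }

    // Two baseline repetitions and one candidate repetition; only q1 is on both sides.
    static List<Paired> demo() {
        List<List<ItemResult>> baseline = List.of(
                List.of(new ItemResult("q1", true), new ItemResult("q2", false)),
                List.of(new ItemResult("q1", false), new ItemResult("q2", false)));
        List<List<ItemResult>> candidate = List.of(
                List.of(new ItemResult("q1", true), new ItemResult("q3", true)));
        return pair(passProbability(baseline), passProbability(candidate));
    }

    public static void main(String[] args) {
        for (Paired p : demo()) {
            System.out.println(p.key() + " delta=" + (p.candidate() - p.baseline()));
        }
    }
}
```

In the demo data, q1 passes in one of two baseline repetitions (0.5) and in the single candidate repetition (1.0), so it is the only paired item.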
For single-run binary outcomes, the pass-rate test is McNemar's test with continuity correction;
otherwise it is a paired sign-flip permutation test with a bootstrap percentile confidence interval. A
change is flagged only when |delta| > epsilon and the test is significant at alpha.
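As a rough illustration of the single-run case, the sketch below computes a McNemar p-value with the textbook continuity correction, (|b - c| - 1)^2 / (b + c), and then applies the epsilon and alpha rule. It is not RunComparison's implementation: the class, method, and enum names are hypothetical, "significant" is taken to mean p < alpha, a positive delta is assumed to be an improvement, and the chi-square tail uses a standard polynomial approximation of erfc (Abramowitz and Stegun 7.1.26) because the Java standard library has none.

```java
// Illustrative only: hypothetical names, not RunComparison's actual API.
final class McNemarSketch {
    enum Verdict { IMPROVED, REGRESSED, UNCHANGED }

    // Complementary error function for x >= 0
    // (Abramowitz-Stegun 7.1.26, absolute error about 1.5e-7).
    static double erfc(double x) {
        double t = 1.0 / (1.0 + 0.3275911 * x);
        double poly = t * (0.254829592 + t * (-0.284496736
                + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
        return poly * Math.exp(-x * x);
    }

    // b: items that passed in baseline and failed in candidate.
    // c: items that failed in baseline and passed in candidate.
    static double pValue(int b, int c) {
        if (b + c == 0) return 1.0; // no discordant pairs, nothing to test
        double d = Math.abs(b - c) - 1.0; // continuity correction
        double stat = d * d / (b + c); // chi-square, 1 degree of freedom
        return erfc(Math.sqrt(stat / 2.0)); // upper tail of chi-square(1)
    }

    // Flags a change only when |delta| > epsilon and p < alpha (assumed strict).
    static Verdict classify(double delta, double p, double epsilon, double alpha) {
        if (Math.abs(delta) <= epsilon || p >= alpha) return Verdict.UNCHANGED;
        return delta > 0 ? Verdict.IMPROVED : Verdict.REGRESSED;
    }

    public static void main(String[] args) {
        // 50 paired items, 2 regressions and 12 fixes: delta = (12 - 2) / 50.
        double p = pValue(2, 12);
        System.out.println("p = " + p);
        System.out.println(classify(10 / 50.0, p, 0.01, 0.05));
    }
}
```

With 2 regressions against 12 fixes the statistic is 81/14, which lies between the chi-square(1) critical values for 0.025 and 0.01, so the example is classified as IMPROVED at alpha = 0.05.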
Randomized procedures are deterministic for a fixed seed and evaluator set. The shared
Random is consumed in evaluator-name order, so adding or removing evaluators shifts
the p-values of the others.
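The interaction between the seed and the evaluator set is easier to see in a small sketch. The code below runs a Monte Carlo sign-flip permutation test per evaluator while drawing from one shared `java.util.Random` in sorted evaluator-name order. It is a sketch under assumptions, not the engine's code: the names, the 2000 rounds, the add-one p-value estimate, and the demo data are all hypothetical, the bootstrap confidence interval is omitted, and empty inputs are not handled.

```java
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Illustrative only: hypothetical names, not RunComparison's actual API.
final class SignFlipSketch {
    // Two-sided sign-flip test on paired differences (candidate minus baseline).
    static double pValue(double[] diffs, int rounds, Random rng) {
        double sum = 0;
        for (double d : diffs) sum += d;
        double observed = Math.abs(sum / diffs.length);
        int atLeastAsExtreme = 0;
        for (int r = 0; r < rounds; r++) {
            double flipped = 0;
            // Under the null hypothesis each difference is equally likely to be +d or -d.
            for (double d : diffs) flipped += rng.nextBoolean() ? d : -d;
            if (Math.abs(flipped / diffs.length) >= observed) atLeastAsExtreme++;
        }
        return (atLeastAsExtreme + 1.0) / (rounds + 1.0); // add-one estimate, never 0
    }

    // One shared Random, consumed in evaluator-name order (TreeMap iterates sorted keys).
    static Map<String, Double> pValues(Map<String, double[]> diffsByEvaluator, long seed) {
        Random rng = new Random(seed);
        Map<String, Double> out = new TreeMap<>();
        for (Map.Entry<String, double[]> e : new TreeMap<>(diffsByEvaluator).entrySet()) {
            out.put(e.getKey(), pValue(e.getValue(), 2000, rng));
        }
        return out;
    }

    // Optionally adds "brevity", which sorts between "accuracy" and "fluency".
    static Map<String, Double> demo(boolean withExtra) {
        Map<String, double[]> diffs = new TreeMap<>();
        diffs.put("accuracy", new double[] {0.2, 0.1, 0.3, -0.1, 0.2, 0.4, 0.0, 0.1});
        diffs.put("fluency", new double[] {0.05, -0.05, 0.1, -0.1, 0.0, 0.05, -0.02, 0.03});
        if (withExtra) {
            diffs.put("brevity", new double[] {0.1, -0.2, 0.0, 0.1, -0.1, 0.2, 0.0, -0.1});
        }
        return pValues(diffs, 42L);
    }

    public static void main(String[] args) {
        System.out.println(demo(false));
        System.out.println(demo(true));
    }
}
```

Repeated calls with the same seed and evaluator set return identical maps. When "brevity" is added, "accuracy" still sees the same random stream because it is processed first, while "fluency" now draws from a later part of the stream, so its p-value may change even though its data did not.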
Nested Class Summary
Nested Classes
Method Summary
Modifier and Type | Method | Description
static RunComparison.Builder | builder() | New builder with default configuration.
 | compare | Compares baseline runs against candidate runs.
static RunComparison | create() | Engine with default settings.
Method Details
builder
New builder with default configuration.
create
Engine with default settings.
compare
Compares baseline runs against candidate runs. Either list may be empty.
Throws:
NullPointerException - if either argument is null