Calibration Report
Every /cohort response returns a calibrated_return_pct band alongside the raw quantiles. These are the empirical coverage numbers behind those calibrated bands — validated on held-out anchors, not claimed. See the write-up at /blog/calibrated-retrieval.
Per-horizon offsets (uniform across cohort sizes)
| horizon | n_val | offset [p10,p90] | raw cov [p10,p90] | calibrated cov [p10,p90] | offset [p25,p75] | raw cov [p25,p75] | calibrated cov [p25,p75] |
|---|---|---|---|---|---|---|---|
| 5d | 999 | ±1.72pp | 70.2% | 82.3% | ±0.51pp | 38.6% | 48.9% |
| 10d | 999 | ±2.24pp | 65.5% | 78.8% | ±0.91pp | 35.8% | 49.4% |
Target coverage: 80% for [p10,p90], 50% for [p25,p75]. Raw bands systematically under-cover, so the conformal offset widens each side by the same amount until empirical coverage on held-out data lands near nominal (within about 2.3 percentage points in the table above).
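The report does not spell out how the offsets are fit. As a minimal sketch, assuming a split-conformal procedure in the style of conformalized quantile regression, a symmetric offset could be derived from held-out anchors as below. The function names and the toy numbers are hypothetical, not part of the /cohort API.

```python
import math

def conformal_offset(lower, upper, realized, target=0.80):
    """Symmetric offset to add on each side of the raw band [lower, upper].

    Assumed method: split conformal with the score max(lower - y, y - upper).
    """
    # Score is positive when the realized value fell outside the raw band,
    # negative when it landed inside.
    scores = sorted(max(lo - y, y - hi) for lo, hi, y in zip(lower, upper, realized))
    n = len(scores)
    # Finite-sample rank used by split conformal prediction, capped at n.
    k = min(n, math.ceil((n + 1) * target))
    return scores[k - 1]

def coverage(lower, upper, realized, offset=0.0):
    """Fraction of realized values inside the band widened by `offset`."""
    hits = sum(lo - offset <= y <= hi + offset
               for lo, hi, y in zip(lower, upper, realized))
    return hits / len(realized)

# Toy data (invented for illustration): ten anchors sharing a raw band of [-1, 1].
lower = [-1] * 10
upper = [1] * 10
realized = [-3, -2, -0.5, 0, 0.2, 0.5, 0.9, 1.5, 2.5, 4]

offset = conformal_offset(lower, upper, realized)
print(offset, coverage(lower, upper, realized), coverage(lower, upper, realized, offset))
```

For brevity the toy example fits and checks the offset on the same ten points. A real validation would fit on one split and measure coverage on another, which is what "held-out" refers to above. If the raw bands already over-cover, this score produces a negative offset that narrows the band.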
Bucket-aware offsets (by cohort size)
| horizon | bucket | n_val | offset [p10,p90] | raw cov | calibrated cov |
|---|---|---|---|---|---|
| 5d | large | 204 | ±1.10pp | 66.2% | 76.0% |
| 5d | medium | 790 | ±1.38pp | 68.7% | 79.2% |
| 10d | large | 126 | ±2.61pp | 71.4% | 84.9% |
| 10d | medium | 868 | ±2.75pp | 67.3% | 82.5% |
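Cohort size buckets: small (n<100), medium (100-199), large (200+), where n is the cohort size. When a cohort falls in a bucket with too few calibration samples, the lookup falls back to the uniform per-horizon offset. No small row appears in the table above: the medium and large rows already account for 994 of the 999 validation anchors at each horizon.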
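The bucket lookup with fallback can be sketched as follows. The offsets are copied from the two tables above. The function names are hypothetical, and treating a missing table entry as "too few calibration samples" is an assumption about how the fallback is encoded.

```python
# Offsets in percentage points for the [p10,p90] band, copied from the tables above.
UNIFORM_OFFSET_PP = {"5d": 1.72, "10d": 2.24}
BUCKET_OFFSET_PP = {
    ("5d", "large"): 1.10,
    ("5d", "medium"): 1.38,
    ("10d", "large"): 2.61,
    ("10d", "medium"): 2.75,
}

def cohort_bucket(cohort_size: int) -> str:
    """Map a cohort size to small (<100), medium (100-199), or large (200+)."""
    if cohort_size < 100:
        return "small"
    if cohort_size < 200:
        return "medium"
    return "large"

def offset_pp(horizon: str, cohort_size: int) -> float:
    """Bucket-aware offset, falling back to the uniform per-horizon value."""
    bucket = cohort_bucket(cohort_size)
    # Assumption: a bucket without a fitted offset has no entry in the table.
    return BUCKET_OFFSET_PP.get((horizon, bucket), UNIFORM_OFFSET_PP[horizon])

print(offset_pp("5d", 250))   # large bucket
print(offset_pp("10d", 150))  # medium bucket
print(offset_pp("5d", 40))    # small bucket, uniform fallback
```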
How to read this in an agent
- Use calibrated_return_pct for sizing, stop placement, or any statement about uncertainty.
- Use raw return_pct for ranking cohorts against each other: the raw medians are unbiased and comparable across horizons.
- The calibration.coverage_80_validated field is the empirical 80%-band hit rate on the held-out calibration set. If your agent needs stricter guarantees, use that number to derate your confidence.
- Offsets are refit after material changes to the embedding index or as calibration sample size grows. The current n_validation is the honest bound on how tight any coverage claim can be.
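A rough sketch of how an agent might apply these rules is shown below. The response layout is hypothetical: only the field names calibrated_return_pct, return_pct, calibration.coverage_80_validated, and n_validation come from this report, while the p10/p50/p90 keys, the placement of n_validation under calibration, and the band values are invented for illustration. The error margin uses a normal approximation to the binomial and assumes the validation anchors are independent.

```python
import math

# Hypothetical response shape; band values are made up, the coverage
# and n_validation figures are the 5d row of the per-horizon table.
response = {
    "return_pct": {"p10": -3.0, "p50": 0.8, "p90": 4.5},
    "calibrated_return_pct": {"p10": -4.72, "p90": 6.22},
    "calibration": {"coverage_80_validated": 0.823, "n_validation": 999},
}

def coverage_margin(coverage: float, n_validation: int, z: float = 1.96) -> float:
    """Approximate 95% half-width of the sampling error on a coverage estimate."""
    return z * math.sqrt(coverage * (1.0 - coverage) / n_validation)

def conservative_coverage(resp: dict) -> float:
    """Validated coverage minus its sampling margin, for derating confidence."""
    cal = resp["calibration"]
    cov = cal["coverage_80_validated"]
    return cov - coverage_margin(cov, cal["n_validation"])

def ranking_key(resp: dict) -> float:
    """Rank cohorts on the raw median, not on the calibrated band."""
    return resp["return_pct"]["p50"]

def sizing_band(resp: dict) -> tuple:
    """Use the calibrated band for sizing and stop placement."""
    band = resp["calibrated_return_pct"]
    return band["p10"], band["p90"]

print(sizing_band(response))
print(round(conservative_coverage(response), 3))
```

With about 999 validation anchors the margin comes out near 2.4 percentage points, which is one way to read the remark that n_validation bounds how tight any coverage claim can be.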