Calibration Report

Every /cohort response returns a calibrated_return_pct band alongside the raw quantiles. These are the empirical coverage numbers behind those calibrated bands, measured on held-out anchors rather than assumed. See the write-up at /blog/calibrated-retrieval.

Per-horizon offsets (uniform across cohort sizes)

horizon  n_val  offset [p10,p90]  raw cov  calibrated cov  offset [p25,p75]  raw cov  calibrated cov
5d       999    ±1.72pp           70.2%    82.3%           ±0.51pp           38.6%    48.9%
10d      999    ±2.24pp           65.5%    78.8%           ±0.91pp           35.8%    49.4%

Target coverage: 80% for [p10,p90], 50% for [p25,p75]. Raw bands systematically under-cover; the conformal offset widens each side so that empirical coverage reaches the nominal level on held-out data.
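As a rough illustration of where a symmetric offset like the ones above comes from, here is a minimal split-conformal sketch: score each held-out anchor by how far its realized return fell outside the raw band, then take the finite-sample-corrected quantile of those scores. The function name and signature are illustrative, not the service's actual fitting code.

```python
import math

def fit_symmetric_offset(bands, realized, target_cov):
    """Illustrative split-conformal fit (not the production code).

    bands: list of (lo, hi) raw quantile pairs in percent
    realized: matching realized returns in percent
    Returns the per-side widening (in pp) so that empirical
    coverage on this held-out set is at least target_cov."""
    # Nonconformity score: distance outside the raw band
    # (<= 0 when the raw band already covered the outcome).
    scores = sorted(max(lo - r, r - hi) for (lo, hi), r in zip(bands, realized))
    n = len(scores)
    # Split-conformal quantile index with finite-sample correction.
    k = min(n - 1, math.ceil((n + 1) * target_cov) - 1)
    # Never shrink the band: clamp negative offsets to zero.
    return max(scores[k], 0.0)
```

Applying the fitted offset is then just `[p10 - offset, p90 + offset]` on each side of the raw band.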

Bucket-aware offsets (by cohort size)

horizon  bucket  n_val  offset [p10,p90]  raw cov  calibrated cov
5d       large   204    ±1.10pp           66.2%    76.0%
5d       medium  790    ±1.38pp           68.7%    79.2%
10d      large   126    ±2.61pp           71.4%    84.9%
10d      medium  868    ±2.75pp           67.3%    82.5%

Cohort size buckets: small (n<100), medium (100-199), large (200+). When a cohort falls in a bucket with too few calibration samples, the lookup falls back to the uniform per-horizon offset.
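The bucketing and fallback rule can be sketched as a small lookup. The offsets mirror the tables above, but the minimum-sample threshold and all names here are assumptions for illustration, not the real service configuration.

```python
MIN_BUCKET_SAMPLES = 150  # assumed threshold, not from the report

# (horizon, bucket) -> (n_val, offset_pp), mirroring the bucket-aware table
BUCKET_OFFSETS = {
    ("5d", "large"): (204, 1.10), ("5d", "medium"): (790, 1.38),
    ("10d", "large"): (126, 2.61), ("10d", "medium"): (868, 2.75),
}
# Per-horizon uniform [p10,p90] offsets used as the fallback
UNIFORM_OFFSETS = {"5d": 1.72, "10d": 2.24}

def size_bucket(n):
    # small (n<100), medium (100-199), large (200+)
    return "small" if n < 100 else "medium" if n < 200 else "large"

def offset_for(horizon, cohort_n):
    entry = BUCKET_OFFSETS.get((horizon, size_bucket(cohort_n)))
    if entry is None or entry[0] < MIN_BUCKET_SAMPLES:
        # Too few calibration samples in this bucket: fall back
        # to the uniform per-horizon offset.
        return UNIFORM_OFFSETS[horizon]
    return entry[1]
```

Under this (assumed) threshold, the 10d/large bucket with n_val=126 would fall back to the uniform ±2.24pp offset, while 5d/medium would keep its own ±1.38pp.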

How to read this in an agent

  • Use calibrated_return_pct for sizing, stop placement, or any statement about uncertainty.
  • Use raw return_pct for ranking cohorts against each other — the raw medians are unbiased and comparable across horizons.
  • The calibration.coverage_80_validated field is the empirical 80%-band hit rate on the held-out calibration set. If your agent needs stricter guarantees, use that number to derate your confidence.
  • Offsets are refit after material changes to the embedding index or as calibration sample size grows. The current n_validation is the honest bound on how tight any coverage claim can be.
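The guidance above can be condensed into a small consumption sketch. The top-level field names (calibrated_return_pct, return_pct, calibration.coverage_80_validated) are the ones this report describes; the p10/p90/p50 sub-keys, the example response, and the derating threshold are assumptions for illustration.

```python
def assess(resp, min_coverage=0.78):
    """Use the calibrated band for uncertainty, the raw median for
    ranking, and derate trust when validated coverage falls short
    of the 80% target. Sketch only; field shapes are assumed."""
    band = (resp["calibrated_return_pct"]["p10"],
            resp["calibrated_return_pct"]["p90"])      # sizing / stops
    rank_score = resp["return_pct"]["p50"]             # raw median: comparable across cohorts
    cov = resp["calibration"]["coverage_80_validated"] # empirical hit rate on held-out set
    return {"band": band, "rank_score": rank_score, "trusted": cov >= min_coverage}

# Hypothetical response fragment, not real API output:
example = {
    "calibrated_return_pct": {"p10": -2.0, "p90": 3.1},
    "return_pct": {"p50": 0.6},
    "calibration": {"coverage_80_validated": 0.823},
}
```

The point of the split is that ranking stays on the unbiased raw medians while any risk statement uses the band whose coverage was actually validated.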