Chart Library

From Retrieval to Calibrated Retrieval: Conformal Prediction on Agent Base Rates

Chart Library Team · 6 min read

The problem we caught in our own product

Our cohort API returns a distribution of forward returns for a chart pattern: p10, p25, median, p75, p90. An agent calling it should be able to trust those numbers for sizing. If the agent reads '[p10, p90] is [-3%, +5%]', it should be able to act as if there is roughly an 80% chance the outcome lands in that range.

It didn't. We audited our own endpoint against 400 held-out anchors with known forward returns and measured how often the actual return fell inside the band we published.

  • Nominal [p10, p90] coverage: 80%. Empirical: 68.2% (5d), 64.2% (10d).
  • Nominal [p25, p75] coverage: 50%. Empirical: 40.0% (5d), 43.2% (10d).
  • The medians were fine — 0.19% actual vs 0.15% predicted at 5d. The failure was entirely in the band widths.

Put bluntly: if an agent used our raw bands to size a position, its outcome fell outside the band roughly 1.6× as often as it assumed (about 32% of the time instead of the nominal 20%). That's the kind of silent failure mode that makes people distrust AI-assisted trading tools, and it was hiding in our own product.

Why retrieved quantiles miscalibrate

The raw cohort quantiles come from nearest-neighbor retrieval in embedding space — we pull 200 historical patterns similar to the anchor and read off their return percentiles. The math treats those 200 matches as if they were an iid sample from the same distribution the anchor came from. They aren't.

Near-neighbor matches are systematically closer to each other than to the anchor — they're selected for shape similarity, not randomness. That shrinks the variance of the empirical quantiles. p10 reads as -3% when the real tail is closer to -5%. The mechanism is structural, not a bug, and it shows up in any system that publishes quantiles derived from retrieval without a calibration step.
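The shrinkage is easy to reproduce with a toy simulation (synthetic data, not our production cohorts): generate returns correlated with a similarity feature, "retrieve" the 200 nearest neighbors of an anchor in that feature, and compare the cohort's [p10, p90] band to the true one.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=100_000)      # pattern-similarity feature
y = x + rng.normal(size=100_000)  # forward return, correlated with the feature

# Nearest-neighbor "cohort": the 200 points most similar to the anchor.
anchor_x = 0.0
nearest = np.argsort(np.abs(x - anchor_x))[:200]

true_p10, true_p90 = np.percentile(y, [10, 90])
cohort_p10, cohort_p90 = np.percentile(y[nearest], [10, 90])

# The cohort band is systematically narrower than the true band: the
# selected sample sits near the anchor in feature space, so the
# feature-driven spread of y is filtered out and only noise remains.
```

Selecting for similarity strips out the feature-driven component of the return variance, so the empirical tails read too thin, exactly the p10-reads-as-−3%-when-the-truth-is-−5% pattern above.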

The fix: split conformal prediction

Conformal prediction is the standard statistical tool for this. The version we use is split conformal for quantile regression (CQR-style):

  • Hold out a calibration set of anchors with known forward returns.
  • For each calibration sample, compute a nonconformity score: max(p_lo - y, y - p_hi) — how far outside the raw band the actual outcome was.
  • Take the ⌈(1 − α)(n + 1)⌉/n empirical quantile of those scores. That's your additive offset q.
  • Calibrated band = [p_lo - q, p_hi + q]. By construction this hits ~(1 - α) coverage on exchangeable data.
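The recipe above is a few lines of numpy. A sketch, assuming the calibration set arrives as aligned arrays of raw band edges and realized returns (names and signature are illustrative, not our production code):

```python
import numpy as np

def conformal_offset(p_lo, p_hi, y, alpha=0.2):
    """Additive CQR offset q from a calibration set of (band, outcome) rows.

    p_lo, p_hi: raw retrieved quantiles; y: realized forward returns.
    """
    scores = np.maximum(p_lo - y, y - p_hi)  # how far outside the band y fell
    n = len(scores)
    # Finite-sample corrected level: ceil((n + 1)(1 - alpha)) / n, capped at 1.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

# Calibrated band on a new anchor: [p_lo - q, p_hi + q].
```

Note that scores are negative when the outcome lands inside the raw band, so q can in principle shrink an over-wide band as well as widen an over-tight one.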

It is a band correction, not a median shift — the p50 is already unbiased. The offset is fit per-horizon, checked into the repo as services/conformal_offsets.json, and applied on every /cohort response.
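At serve time the correction is a per-horizon lookup plus an additive widening. A minimal sketch, assuming a flat {horizon: offset} schema for the JSON file (the real file layout may differ); the offset magnitudes are the per-horizon values we fit:

```python
import json

# Hypothetical shape for services/conformal_offsets.json -- the real schema
# may differ. Offsets are in percentage points, one per horizon.
OFFSETS = json.loads('{"5d": 1.4, "10d": 2.6}')

def calibrate_band(p_lo, p_hi, horizon):
    """Widen a raw [p_lo, p_hi] band by the per-horizon conformal offset."""
    q = OFFSETS[horizon]
    return p_lo - q, p_hi + q
```

Because the offset is additive and symmetric, the p50 passes through untouched; only the band edges move.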

Validation on the held-out half

Split the 800-row calibration sample 50/50. Fit the conformal offsets on one half, measure empirical coverage on the other:

  • 5d [p10, p90] — raw 68.0%, calibrated 82.5% (target 80%)
  • 5d [p25, p75] — raw 40.0%, calibrated 48.5% (target 50%)
  • 10d [p10, p90] — raw 64.5%, calibrated 80.5% (target 80%)
  • 10d [p25, p75] — raw 42.5%, calibrated 53.5% (target 50%)

Numbers are now inside the tolerance a reasonable agent would expect. The offsets themselves are small in absolute terms — ±1.4pp for 5d, ±2.6pp for 10d — but the coverage gap they close is large because it compounds at the tails.
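The fit-then-measure loop itself is short. A sketch of the 50/50 split, again assuming aligned arrays of band edges and realized returns (illustrative names, not production code):

```python
import numpy as np

def fit_and_validate(p_lo, p_hi, y, alpha=0.2, seed=0):
    """Fit the conformal offset on a random half, report coverage on the other."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    fit, val = idx[: len(y) // 2], idx[len(y) // 2 :]
    # Offset from the fit half (same score and quantile as in the recipe).
    scores = np.maximum(p_lo[fit] - y[fit], y[fit] - p_hi[fit])
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(scores, level, method="higher")
    # Empirical coverage of the widened band on the held-out half.
    covered = (y[val] >= p_lo[val] - q) & (y[val] <= p_hi[val] + q)
    return float(q), float(covered.mean())
```

Measuring on the half the offset never saw is what makes the coverage numbers above evidence rather than an in-sample artifact.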

What agents should ask of any retrieval API

If you're building on any historical-pattern API, demand three things before you size a trade off a retrieved quantile:

  • An empirical coverage number validated on held-out anchors, not just 'p10' as a label.
  • A calibrated band you can use for sizing, separate from the raw band you use for ranking.
  • The calibration set size and method, so you can judge whether their 80% means the same thing as your 80%.

We now return calibrated_return_pct alongside return_pct on every cohort response, plus a calibration meta block with coverage_80_validated, coverage_50_validated, and n_validation. That's the evidence, not the claim. The MCP tool description tells agents which band to use for which job. None of this was here a week ago.
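On the agent side, that meta block is checkable before any sizing decision. A minimal sketch using the field names above; the thresholds are illustrative, not a recommendation:

```python
def calibration_ok(meta, target=0.80, tol=0.05, min_n=200):
    """Gate: only size off a calibrated band whose validated coverage is
    close to nominal and was measured on a reasonably sized holdout."""
    return (
        meta.get("n_validation", 0) >= min_n
        and abs(meta.get("coverage_80_validated", 0.0) - target) <= tol
    )
```

If the gate fails, the raw band is still usable for ranking cohorts against each other; it just shouldn't set position size.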

What we still owe you

Split conformal is the minimum viable calibration — it gives one offset per horizon across all cohort configurations. Small cohorts and regime-extreme buckets are almost certainly miscalibrated in their own ways, and a uniform offset under-corrects for them. The next version is bucket-aware: separate offsets by cohort size and by regime bin. That work is queued.

Longer term, the calibration model should consume cohort features (size, filter-stack, distance distribution) and output band widths directly. But the honest version of the story is that 800 anchors isn't enough to fit that yet, so we shipped the simpler correction that already closes most of the gap and we'll keep widening the calibration set.

Ready to build on calibrated retrieval? Grab an API key at chartlibrary.io/developers and the MCP server on PyPI (chartlibrary-mcp v1.4.0). Every cohort response now includes calibrated_return_pct and a validated coverage number.
