Calibrated stock forecasting — what it means and why it matters.
A forecast is calibrated if, when it says “80% of the time the outcome falls inside this band,” the actual outcome falls inside that band 80% of the time. Sounds obvious. Almost no AI stock forecasting tool actually achieves it.
This page covers what calibration means in practice, why most models fail it, the two methodology fixes that get you there honestly (conformal correction and held-out evaluation), and how Chart Library validates calibration on every release.
The shape of an uncalibrated forecast
Most AI stock-prediction tools return point forecasts: “NVDA: +2.3% over 5 days.” A point forecast can’t be calibrated because there’s no probability claim to test. It’s a guess, not a forecast. Calibration is undefined.
The slightly-better tools return bands: “NVDA: 5d return between −2% and +6% with 80% confidence.” These are testable. Run the model on 1,000 held-out anchors, count how often the realized 5d return fell inside the stated band. If it’s 65% when the model claimed 80%, the model is over-confident — its bands are too narrow. If it’s 92%, the model is under-confident — bands too wide.
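A minimal sketch of that coverage check, assuming you already have arrays of realized returns and stated band edges (illustrative Python; the variable names are hypothetical, not Chart Library internals):

```python
import numpy as np

def empirical_coverage(realized: np.ndarray, lo: np.ndarray, hi: np.ndarray) -> float:
    """Fraction of realized outcomes that land inside the stated [lo, hi] band."""
    inside = (realized >= lo) & (realized <= hi)
    return float(inside.mean())

# Hypothetical usage on a held-out set:
#   realized_5d  -- actual 5-day returns for ~1,000 held-out anchors
#   lo_80, hi_80 -- the model's stated 80% band edges for each anchor
# coverage = empirical_coverage(realized_5d, lo_80, hi_80)
# A result near 0.65 means over-confident (bands too narrow);
# a result near 0.92 means under-confident (bands too wide).
```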
The standard finding when you actually run this test: raw ML-model bands cover 50-70% of outcomes when claiming 80%. That’s a model that lies about its uncertainty. It’s a worse-than-useless tool because the user trusts the bands more than they should.
Why ML models are usually uncalibrated
Three structural reasons:
- Training distribution ≠ deployment distribution. Training data spans different vol regimes than the deployment period, so the model’s learned uncertainty estimate reflects training-set variance, not deployment variance (the short simulation after this list makes the effect concrete).
- Models optimize for point accuracy, not coverage. MSE / MAE losses don’t penalize over-confident bands. The model can be brilliant at point prediction and terrible at uncertainty.
- Bayesian methods help, but they don’t fix the training-distribution problem. A Bayesian neural net’s posterior is calibrated under its prior. When the prior doesn’t match the deployment regime, the posterior is miscalibrated too.
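To make the first point concrete, here is a minimal simulation, assuming made-up volatilities rather than anything from Chart Library: a model sizes its 80% band from training-set volatility, then meets a deployment regime with twice that volatility.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical volatilities: the model learned its uncertainty in a calm regime,
# then gets deployed into a regime with twice the volatility.
train_vol = 0.01     # daily-return standard deviation seen in training
deploy_vol = 0.02    # daily-return standard deviation during deployment

# Under a normal assumption, an 80% band is roughly +/- 1.28 standard deviations,
# and the model sizes it from training-set variance.
half_width = 1.28 * train_vol

deploy_returns = rng.normal(0.0, deploy_vol, size=100_000)
coverage = np.mean(np.abs(deploy_returns) <= half_width)

print(f"claimed 80% coverage, observed {coverage:.0%}")  # roughly 48% here
```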
Conformal prediction: calibration as a wrapper
Conformal prediction is a method-agnostic correction layer. You take any model with a point prediction and a (possibly miscalibrated) uncertainty estimate, run it on a held-out calibration set, and widen the bands by exactly the amount needed to hit the claimed coverage on that set. The math is simple and distribution-free.
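A minimal sketch of the split-conformal wrapper for band predictions, assuming the raw model already emits a lower and upper band edge per anchor (names are illustrative, not Chart Library's API):

```python
import numpy as np

def conformal_offset(y_cal: np.ndarray, lo_cal: np.ndarray, hi_cal: np.ndarray,
                     alpha: float = 0.2) -> float:
    """Offset that makes raw [lo, hi] bands hit (1 - alpha) coverage on the calibration set."""
    # Conformity score: how far each calibration outcome falls outside its raw band
    # (negative when it sits comfortably inside).
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = len(scores)
    # Finite-sample-adjusted quantile level from split conformal prediction.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(scores, level))

def calibrated_band(lo: float, hi: float, offset: float) -> tuple[float, float]:
    """Apply the stored offset to a new query's raw band."""
    return lo - offset, hi + offset
```

The stored offset is then added to both edges of every new band. If the raw bands already cover what they claim, the offset comes out near zero (or slightly negative, which would tighten them); if they under-cover, it is positive and the bands widen.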
The full details are on the conformal prediction page. For this page the key claim is: conformal correction is the honest fix when your raw model bands don’t cover what they claim. It widens the bands until they do.
How Chart Library validates calibration
Every release goes through a three-step calibration validation:
- Held-out anchors. A symbol-disjoint test set (see symbol-disjoint evaluation) of (symbol, date, timeframe) anchors not seen in training. For each, we compute the cohort and the nominal 80% band.
- Realized outcome lookup. Each anchor’s actual 1d/5d/10d return is in forward_returns_cache. We compute empirical coverage: of the held-out anchors, what fraction had actual returns inside the nominal 80% band?
- Conformal correction. If empirical coverage is less than nominal (the typical case for raw retrieval), we apply a split conformal correction (sketched below). The correction is computed once on a calibration split and then applied to all subsequent queries.
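Conceptually, the three steps chain into a single release check. The sketch below uses stand-in names (anchors, band_for, lookup_return); it is not the production pipeline:

```python
NOMINAL = 0.80      # claimed coverage
TOLERANCE = 0.03    # acceptable deviation on the held-out set (illustrative)

def release_coverage_check(anchors, band_for, lookup_return, offset):
    """Empirical coverage of conformal-corrected 80% bands on held-out anchors."""
    hits = 0
    for anchor in anchors:                           # symbol-disjoint (symbol, date, timeframe)
        lo, hi = band_for(anchor)                    # nominal 80% band from the anchor's cohort
        realized = lookup_return(anchor, horizon=5)  # e.g. 5d return from forward_returns_cache
        if lo - offset <= realized <= hi + offset:   # apply the stored conformal offset
            hits += 1
    coverage = hits / len(anchors)
    return coverage, abs(coverage - NOMINAL) <= TOLERANCE
```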
For our V2 embeddings:
raw nominal 80% band → 68% empirical coverage (under-covers)
→ conformal correction widens bands by ~17%
→ corrected nominal 80% band → 82.5% empirical coverage (slight over-cover)
→ stable on rolling held-out anchors over 6+ months

For V5 embeddings, raw bands are already approximately well-calibrated:
raw nominal 80% band → ~80% empirical coverage
→ conformal offset is near-zero (essentially self-calibrated)

We still apply the conformal layer regardless: the day a regime change makes the raw bands too tight or too loose, the conformal layer protects users from miscalibrated outputs.
What this looks like in the API
Every cohort_analyze response includes calibrated percentile bands:
```json
{
  "outcome_distribution": {
    "5": {
      "n": 285,
      "median": -1.3,
      "p10": -11.3,   // 10th percentile (raw)
      "p25": -4.1,
      "p75": 2.4,
      "p90": 6.8,     // 90th percentile (raw)
      "win_rate": 0.44,
      ...
    }
  }
}
```

The p10 and p90 are computed from the cohort’s realized returns (raw empirical quantiles), then widened by the conformal offset. The 80% band is [p10, p90]. On the held-out test set, the realized 5-day return falls inside that band 80% of the time within statistical error.
Why calibration matters for AI agents
An AI agent reasoning about a trade decision needs to know how much to trust the prediction. A miscalibrated forecast means the agent over-confidently sizes positions, overrides risk management, and makes decisions a properly-calibrated forecast wouldn’t support.
When the cohort says “5-day return between −11.3% and +6.8% with 80% confidence” and that band is calibrated, the agent has actionable information. It can decide: that’s a wide band, this setup is high-uncertainty, I should size small or skip. An uncalibrated forecast doesn’t support that reasoning.
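As a toy illustration of that reasoning, using the field names from the response above (the thresholds are invented for the example and are not a recommendation):

```python
def sizing_guidance(outcome_5d: dict, max_band_width: float = 15.0) -> str:
    """Toy agent rule keyed off the width of the calibrated 80% band."""
    band_width = outcome_5d["p90"] - outcome_5d["p10"]   # e.g. 6.8 - (-11.3) = 18.1 points
    if band_width > max_band_width:
        return "high-uncertainty setup: size small or skip"
    if outcome_5d["median"] > 0 and outcome_5d["win_rate"] > 0.5:
        return "favorable cohort: size within risk limits"
    return "no clear edge at this horizon"
```

The specific thresholds are beside the point; a width-based rule like this only means anything when the band's stated coverage is true.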
Frequently asked questions
- What's the difference between accuracy and calibration?
- Accuracy = how close the point prediction is to the realized outcome. Calibration = whether the stated probability matches empirical frequency. A model can be accurate but miscalibrated (right on average, wrong about its uncertainty), or calibrated but inaccurate (correct uncertainty estimates around bad point predictions). For decision-making, calibration is usually more important than accuracy.
- Does Chart Library publish calibration metrics?
- Yes. The methodology page includes empirical coverage on held-out anchors, broken down by horizon. Calibration is re-validated quarterly and after any embedding update.
- How does calibration interact with conformal prediction?
- Conformal prediction is the correction mechanism that achieves calibration. Calibration is the property; conformal correction is one way to enforce it. See the conformal prediction page for the underlying math.
- Can a single forecast be calibrated?
- Strictly no — calibration is a property of a forecast distribution measured over many predictions. A single forecast can be “consistent with a calibrated model,” but you can’t audit calibration from one point. You need a held-out test set with realized outcomes.
See calibrated bands in a live cohort.
Run cohort_analyze on any (symbol, date, timeframe) and see the conformal-corrected p10/p90 bands. Free Sandbox tier.