Chart Library

From Retrieval to Calibrated Retrieval: Conformal Prediction on Agent Base Rates

Chart Library Team · 6 min read

The problem we caught in our own product

Our cohort API returns a distribution of forward returns for a chart pattern: p10, p25, median, p75, p90. An agent calling it should be able to trust those numbers for sizing. If the agent reads '[p10, p90] is [-3%, +5%]', it should be able to act as if there is roughly an 80% chance the outcome lands in that range.

It didn't. We audited our own endpoint against 400 held-out anchors with known forward returns and measured how often the actual return fell inside the band we published.

  • Nominal [p10, p90] coverage: 80%. Empirical: 68.2% (5d), 64.2% (10d).
  • Nominal [p25, p75] coverage: 50%. Empirical: 40.0% (5d), 43.2% (10d).
  • The medians were fine — 0.19% actual vs 0.15% predicted at 5d. The failure was entirely in the band widths.

Put bluntly: if an agent used our raw bands to size a position, its outcome fell outside the band roughly 1.6× as often as it assumed (about 32% of the time instead of the nominal 20%). That's the kind of silent failure mode that makes people distrust AI-assisted trading tools, and it was hiding in our own product.

Why retrieved quantiles miscalibrate

The raw cohort quantiles come from nearest-neighbor retrieval in embedding space — we pull 200 historical patterns similar to the anchor and read off their return percentiles. The math treats those 200 matches as if they were an iid sample from the same distribution the anchor came from. They aren't.

Near-neighbor matches are systematically closer to each other than to the anchor — they're selected for shape similarity, not randomness. That shrinks the variance of the empirical quantiles. p10 reads as -3% when the real tail is closer to -5%. The mechanism is structural, not a bug, and it shows up in any system that publishes quantiles derived from retrieval without a calibration step.
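The shrinkage is easy to reproduce with a toy simulation (synthetic data, not our production cohorts): generate returns correlated with a similarity feature, "retrieve" the 200 nearest neighbors of an anchor in that feature, and compare the cohort's [p10, p90] band to the true one.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=100_000)      # pattern-similarity feature
y = x + rng.normal(size=100_000)  # forward return, correlated with the feature

# Nearest-neighbor "cohort": the 200 points most similar to the anchor.
anchor_x = 0.0
nearest = np.argsort(np.abs(x - anchor_x))[:200]

true_p10, true_p90 = np.percentile(y, [10, 90])
cohort_p10, cohort_p90 = np.percentile(y[nearest], [10, 90])

# The cohort band is systematically narrower than the true band: the
# selected sample sits near the anchor in feature space, so the
# feature-driven spread of y is filtered out and only noise remains.
```

Selecting for similarity strips out the feature-driven component of the return variance, so the empirical tails read too thin, exactly the p10-reads-as-−3%-when-the-truth-is-−5% pattern above.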

The fix: split conformal prediction

Conformal prediction is the standard statistical tool for this. The version we use is split conformal for quantile regression (CQR-style):

  • Hold out a calibration set of anchors with known forward returns.
  • For each calibration sample, compute a nonconformity score: max(p_lo - y, y - p_hi) — how far outside the raw band the actual outcome was.
  • Take the ⌈(1 − α)(n + 1)⌉/n empirical quantile of those scores. That's your additive offset q.
  • Calibrated band = [p_lo - q, p_hi + q]. By construction this hits ~(1 - α) coverage on exchangeable data.
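The recipe above is a few lines of numpy. A sketch, assuming the calibration set arrives as aligned arrays of raw band edges and realized returns (names and signature are illustrative, not our production code):

```python
import numpy as np

def conformal_offset(p_lo, p_hi, y, alpha=0.2):
    """Additive CQR offset q from a calibration set of (band, outcome) rows.

    p_lo, p_hi: raw retrieved quantiles; y: realized forward returns.
    """
    scores = np.maximum(p_lo - y, y - p_hi)  # how far outside the band y fell
    n = len(scores)
    # Finite-sample corrected level: ceil((n + 1)(1 - alpha)) / n, capped at 1.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, level, method="higher")

# Calibrated band on a new anchor: [p_lo - q, p_hi + q].
```

Note that scores are negative when the outcome lands inside the raw band, so q can in principle shrink an over-wide band as well as widen an over-tight one.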

It is a band correction, not a median shift — the p50 is already unbiased. The offset is fit per-horizon, checked into the repo as services/conformal_offsets.json, and applied on every /cohort response.
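At serve time the correction is a per-horizon lookup plus an additive widening. A minimal sketch, assuming a flat {horizon: offset} schema for the JSON file (the real file layout may differ); the offset magnitudes are the per-horizon values we fit:

```python
import json

# Hypothetical shape for services/conformal_offsets.json -- the real schema
# may differ. Offsets are in percentage points, one per horizon.
OFFSETS = json.loads('{"5d": 1.4, "10d": 2.6}')

def calibrate_band(p_lo, p_hi, horizon):
    """Widen a raw [p_lo, p_hi] band by the per-horizon conformal offset."""
    q = OFFSETS[horizon]
    return p_lo - q, p_hi + q
```

Because the offset is additive and symmetric, the p50 passes through untouched; only the band edges move.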

Validation on the held-out half

Split the 800-row calibration sample 50/50. Fit the conformal offsets on one half, measure empirical coverage on the other:

  • 5d [p10, p90] — raw 68.0%, calibrated 82.5% (target 80%)
  • 5d [p25, p75] — raw 40.0%, calibrated 48.5% (target 50%)
  • 10d [p10, p90] — raw 64.5%, calibrated 80.5% (target 80%)
  • 10d [p25, p75] — raw 42.5%, calibrated 53.5% (target 50%)

Numbers are now inside the tolerance a reasonable agent would expect. The offsets themselves are small in absolute terms — ±1.4pp for 5d, ±2.6pp for 10d — but the coverage gap they close is large because it compounds at the tails.
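The fit-then-measure loop itself is short. A sketch of the 50/50 split, again assuming aligned arrays of band edges and realized returns (illustrative names, not production code):

```python
import numpy as np

def fit_and_validate(p_lo, p_hi, y, alpha=0.2, seed=0):
    """Fit the conformal offset on a random half, report coverage on the other."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    fit, val = idx[: len(y) // 2], idx[len(y) // 2 :]
    # Offset from the fit half (same score and quantile as in the recipe).
    scores = np.maximum(p_lo[fit] - y[fit], y[fit] - p_hi[fit])
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(scores, level, method="higher")
    # Empirical coverage of the widened band on the held-out half.
    covered = (y[val] >= p_lo[val] - q) & (y[val] <= p_hi[val] + q)
    return float(q), float(covered.mean())
```

Measuring on the half the offset never saw is what makes the coverage numbers above evidence rather than an in-sample artifact.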

What agents should ask of any retrieval API

If you're building on any historical-pattern API, demand three things before you size a trade off a retrieved quantile:

  • An empirical coverage number validated on held-out anchors, not just 'p10' as a label.
  • A calibrated band you can use for sizing, separate from the raw band you use for ranking.
  • The calibration set size and method, so you can judge whether their 80% means the same thing as your 80%.

We now return calibrated_return_pct alongside return_pct on every cohort response, plus a calibration meta block with coverage_80_validated, coverage_50_validated, and n_validation. That's the evidence, not the claim. The MCP tool description tells agents which band to use for which job. None of this was here a week ago.
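On the agent side, that meta block is checkable before any sizing decision. A minimal sketch using the field names above; the thresholds are illustrative, not a recommendation:

```python
def calibration_ok(meta, target=0.80, tol=0.05, min_n=200):
    """Gate: only size off a calibrated band whose validated coverage is
    close to nominal and was measured on a reasonably sized holdout."""
    return (
        meta.get("n_validation", 0) >= min_n
        and abs(meta.get("coverage_80_validated", 0.0) - target) <= tol
    )
```

If the gate fails, the raw band is still usable for ranking cohorts against each other; it just shouldn't set position size.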

What we still owe you

Split conformal is the minimum viable calibration — it gives one offset per horizon across all cohort configurations. Small cohorts and regime-extreme buckets are almost certainly miscalibrated in their own ways, and a uniform offset under-corrects for them. The next version is bucket-aware: separate offsets by cohort size and by regime bin. That work is queued.

Longer term, the calibration model should consume cohort features (size, filter-stack, distance distribution) and output band widths directly. But the honest version of the story is that 800 anchors isn't enough to fit that yet, so we shipped the simpler correction that already closes most of the gap and we'll keep widening the calibration set.

Ready to build on calibrated retrieval? Grab an API key at chartlibrary.io/developers and the MCP server on PyPI (chartlibrary-mcp v1.4.0). Every cohort response now includes calibrated_return_pct and a validated coverage number.
