What is Chart Library?

Chart Library is a chart-pattern intelligence engine for humans and AI agents. Anchor any (symbol, date, timeframe) and it returns the cohort of historical analogs — from 25M+ indexed patterns across 19,000+ symbols and 10 years — what those analogs did next, and a calibrated forward-return distribution, over web, REST API, and MCP.

Does Chart Library predict stock prices?

No. Chart Library never forecasts. It returns historical distributions — what actually happened after setups like yours (median, p10/p90 band, up rate) — plus the receipts to audit them. Direction and decisions stay with you or your agent.

How calibrated are Chart Library's bands?

The nominal 80% forward-return band held 80.8% across 300,000+ live, audited cases under symbol-disjoint evaluation. The coverage record is public and recomputed continuously at chartlibrary.io/calibration.

AI Agents · Backtesting · Calibration · Methodology · Trading

The Oracle Fallacy: Why Your Trading Agent's Backtest Lies — and What Calibrated Base Rates Fix

Chart Library Team · June 2, 2026 · 6 min read

Your backtest looks brilliant. Be suspicious.

You build a trading agent. You backtest it. The equity curve goes up and to the right. You feel great. You should be suspicious instead.

The single most common reason a trading-agent backtest looks brilliant is that, somewhere in the pipeline, the agent already knew the answer. It peeked. Not maliciously — structurally. There's a name we use for this failure mode: the 'Oracle Fallacy', when a model or backtest quietly consults the very outcome it's supposed to be predicting.

It inflates nearly every paper result and slick demo you've seen: a 2026 survey of LLM trading agents (arXiv 2605.19337) found that only 2 of 19 reported a time-consistent train/test split, the basic guard against it. Here's how it sneaks in, and what an honest crew does instead.

Three ways the future leaks in

The leak is almost never deliberate. It hides in three common places.

Classic lookahead bias. Your backtest computes a feature using data that wasn't available at decision time — using a closing price to 'decide' to buy at the open, or a z-score normalized over the whole dataset including future rows. Restated-vs-point-in-time fundamentals do this too: a company's '2023 revenue' as it reads today is not what you knew in Q1 2023.
Hindsight base rates. The agent-specific one. You ask an LLM, 'what usually happens after a high-volume breakout?' and it answers with confident specifics — 'about 70% of the time'. Where did that number come from? Its training data, which includes the outcomes. The model has read the ending of the book and is telling you how the story 'usually' goes. It feels like a base rate. It's hindsight wearing a base rate's clothes.
Survivorship and selection. You backtest on today's index constituents, so the companies that blew up and got delisted aren't in your universe. Or you tune on the same data you report on. Both are quieter cousins of the same disease.

The through-line: the agent is graded by an oracle it isn't supposed to have. Take the oracle away and most of the edge evaporates.
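To make the first of these leaks concrete, here is a minimal sketch in Python (standard library only). It contrasts a z-score normalized over the whole series with one that uses only rows available at decision time. The price series, the function names, and the min_history cutoff are all made up for illustration.

```python
from statistics import mean, pstdev

def full_sample_zscores(values):
    """Leaky: every row is normalized with statistics that include future rows."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def point_in_time_zscores(values, min_history=3):
    """Honest: row i is normalized using only rows 0..i-1 (strictly past data)."""
    out = []
    for i, v in enumerate(values):
        past = values[:i]
        if len(past) < min_history or pstdev(past) == 0:
            out.append(None)  # not enough history to score this row yet
            continue
        out.append((v - mean(past)) / pstdev(past))
    return out

# Hypothetical closes: quiet for a while, then a late spike.
closes = [10.0, 10.2, 9.9, 10.1, 10.0, 14.0]

leaky = full_sample_zscores(closes)
honest = point_in_time_zscores(closes)

# Recompute as if the spike had never happened.
leaky_without_spike = full_sample_zscores(closes[:-1])
honest_without_spike = point_in_time_zscores(closes[:-1])

# The future spike rewrites the leaky score of every earlier row,
# but it cannot touch the honest score of any earlier row.
print(leaky[:5] == leaky_without_spike)    # False
print(honest[:5] == honest_without_spike)  # True
```

The test is the useful part: if deleting a future row changes a feature you computed for a past row, that feature peeked.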

Why LLM agents make this worse, not better

A pre-LLM quant at least had to write the leak — a line of code referencing a future column. An LLM agent can manufacture the leak from nothing, in fluent prose, on demand. Ask it for the odds and it will give you odds. They'll sound calibrated. They are not measurements; they're plausible-sounding numbers pattern-matched from a corpus that already contains the future relative to any given historical setup.

And because the answer is fluent, it's trusted. A downstream agent — a risk sizer, a portfolio manager — has no way to tell '70%, measured over N real analogs' from '70%, vibes'. It consumes both as fact. The fallacy doesn't just inflate your backtest; it propagates through the crew as ungrounded confidence.

The honest version needs two things vibes can't supply

If you want a real answer to 'what usually happens next', you need two ingredients that training-data pattern-matching structurally cannot provide.

A large library of real historical analogs. Not the model's memory of markets — actual past setups that genuinely resemble the one in front of you, retrieved by structural similarity, each with its real recorded outcome. 'What did charts that looked like this one actually do next?' is an empirical question with an empirical answer, if you have the library to look it up in.
Time-gated calibration. Each analog's outcome computed using only information available at that analog's own decision point — no peeking forward — and then the probability bands calibrated against held-out reality so that an '80% band' actually contains the outcome about 80% of the time. That's the part that turns a pile of analogs into a number you can trust.

Do both and you've replaced the oracle with a measurement.
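As a rough sketch of how those two ingredients fit together, the snippet below gates a cohort of analogs by time and then summarizes what they did next. The Analog fields, the gating rule (an analog counts only if its outcome was already known on the decision date), and the sample data are assumptions for illustration, not Chart Library's actual schema or method.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median, quantiles

@dataclass(frozen=True)
class Analog:
    symbol: str            # placeholder tickers below, not real results
    anchor: date           # the analog's own decision point
    outcome_date: date     # when its forward return became known
    forward_return: float  # realized return from anchor to outcome_date

def time_gated(analogs, decision_date):
    """Keep only analogs whose outcome was already known at decision_date."""
    return [a for a in analogs if a.outcome_date <= decision_date]

def summarize(analogs):
    """Empirical base rate: median, p10/p90 band, and up rate of the cohort."""
    returns = sorted(a.forward_return for a in analogs)
    deciles = quantiles(returns, n=10, method="inclusive")  # 9 cut points
    return {
        "n": len(returns),
        "median": median(returns),
        "p10": deciles[0],
        "p90": deciles[-1],
        "up_rate": sum(r > 0 for r in returns) / len(returns),
    }

# Made-up cohort; the last analog resolves after the decision date.
cohort = [
    Analog("AAA", date(2021, 3, 1), date(2021, 3, 8), 0.02),
    Analog("BBB", date(2022, 6, 1), date(2022, 6, 8), -0.01),
    Analog("CCC", date(2023, 1, 9), date(2023, 1, 17), 0.03),
    Analog("DDD", date(2024, 5, 6), date(2024, 5, 13), 0.40),
]

decision = date(2023, 6, 1)
usable = time_gated(cohort, decision)
stats = summarize(usable)
print(stats["n"], stats["median"], round(stats["up_rate"], 3))
```

Dropping the gate would let the 40% winner from 2024 into a decision made in 2023, which is exactly the kind of quiet peek that makes a backtest look brilliant.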

What 'calibrated' should actually mean (a receipt)

'Calibrated' is a word people throw around. It has a testable meaning: if you draw an 80% band, the realized outcome should land inside it about 80% of the time, across a large out-of-sample set. Not 95% (you're sandbagging), not 65% (you're overconfident). Eighty.
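That definition is easy to check in code. The sketch below measures empirical coverage on synthetic data: outcomes drawn from a standard normal, scored against three fixed bands. The function names, the two-point tolerance, and the data are illustrative assumptions; a real check would use held-out market outcomes, not random draws.

```python
import random

def empirical_coverage(cases):
    """Fraction of cases whose realized outcome fell inside its band.

    Each case is a (lower, upper, realized) tuple.
    """
    hits = sum(lower <= realized <= upper for lower, upper, realized in cases)
    return hits / len(cases)

def diagnose(coverage, nominal=0.80, tolerance=0.02):
    """Label a band against its nominal level (the tolerance is arbitrary)."""
    if coverage > nominal + tolerance:
        return "too wide (sandbagging)"
    if coverage < nominal - tolerance:
        return "too narrow (overconfident)"
    return "calibrated"

random.seed(0)  # synthetic outcomes, seeded only so the demo repeats
outcomes = [random.gauss(0.0, 1.0) for _ in range(20_000)]

def with_band(half_width):
    """Attach the same symmetric band to every synthetic outcome."""
    return [(-half_width, half_width, y) for y in outcomes]

honest = empirical_coverage(with_band(1.2816))  # central 80% of N(0, 1)
padded = empirical_coverage(with_band(1.9600))  # actually a 95% interval
cocky = empirical_coverage(with_band(0.9346))   # actually a 65% interval

for name, cov in [("honest", honest), ("padded", padded), ("cocky", cocky)]:
    print(f"{name}: coverage={cov:.3f} -> {diagnose(cov)}")
```

All three bands could be sold as an '80% band'. Only the coverage count tells them apart.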

Here's the receipt from the node we build at Chart Library: the nominal 80% band held 80.8% across 302,880 audited historical cases. That's the difference between a band that means something and a confident-sounding guess. It's measured on held-out reality, not asserted.

Note: We're deliberately not parading a backtested equity curve here, because that would be the very thing this post warns about. The honest receipt is coverage: did the bands hold, out-of-sample? They did.

Provenance: so the crew can tell measurement from vibes

The fix isn't just having a real number — it's labeling it so downstream agents trust it correctly. Every base-rate figure should carry its provenance: 'per N historical analogs, calibrated 80% band'. That tag is what lets a risk-sizer or portfolio-manager agent distinguish a grounded measurement from a fluent guess, and weight it accordingly. Without provenance, your best number and your worst hallucination arrive looking identical. With it, an honest number stops getting flagged as a hallucination by a skeptical teammate.
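One way to carry that label through a crew is to make provenance part of the value itself. This is a minimal sketch: the BaseRate class, its field names, and the trust_weight rule are hypothetical, and the numbers are placeholders rather than measured results.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BaseRate:
    up_rate: float                      # the headline number, e.g. 0.70
    n_analogs: Optional[int] = None     # None means nobody measured it
    band_level: Optional[float] = None  # e.g. 0.80 for a calibrated 80% band

    @property
    def grounded(self) -> bool:
        """True only when the figure carries both pieces of provenance."""
        return bool(self.n_analogs) and self.band_level is not None

    def tag(self) -> str:
        """Human-readable provenance string that travels with the number."""
        if not self.grounded:
            return "unverified (no provenance)"
        return (f"per {self.n_analogs} historical analogs, "
                f"calibrated {self.band_level:.0%} band")

def trust_weight(rate: BaseRate, min_analogs: int = 30) -> float:
    """Hypothetical downstream rule: ignore untagged numbers, ramp up with N."""
    if not rate.grounded:
        return 0.0
    return min(1.0, rate.n_analogs / min_analogs)

measured = BaseRate(up_rate=0.70, n_analogs=412, band_level=0.80)
vibes = BaseRate(up_rate=0.70)  # same headline number, no receipt

for rate in (measured, vibes):
    print(f"{rate.up_rate:.0%} | {rate.tag()} | weight={trust_weight(rate)}")
```

Both objects say 70%. The risk sizer only has a reason to act on one of them, and with the tag attached it can tell which.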

The takeaway

If your trading agent's backtest looks amazing, your first question shouldn't be 'how do I deploy this' — it should be 'where did it peek?' The Oracle Fallacy is the default state of trading-agent evaluation, not a rare bug. The way out isn't a cleverer backtest; it's refusing to let the future leak in: real time-gated analogs, calibration tested against held-out reality, and provenance on every number so the honesty survives the trip through the rest of your crew.

You can't predict the future. But you can measure what honestly happened after setups like this one — and that, calibrated, is worth more than any oracle.

See the calibrated base-rate node in action — the runnable reference crew is at https://github.com/grahammccain/chart-library-agent-crew, and Chart Library is free to start at chartlibrary.io.
