The Oracle Fallacy: Why Your Trading Agent's Backtest Lies — and What Calibrated Base Rates Fix

Chart Library Team · 6 min read

Your backtest looks brilliant. Be suspicious.

You build a trading agent. You backtest it. The equity curve goes up and to the right. You feel great. You should be suspicious instead.

The single most common reason a trading-agent backtest looks brilliant is that, somewhere in the pipeline, the agent already knew the answer. It peeked. Not maliciously, but structurally. This failure mode has a name, increasingly discussed in the agent-evaluation literature: the 'Oracle Fallacy', in which a model or backtest quietly consults the very outcome it's supposed to be predicting.

It inflates nearly every paper result and slick demo you've seen. Here's how it sneaks in, and what an honest crew does instead.

Three ways the future leaks in

The leak is almost never deliberate. It hides in three common places.

  • Classic lookahead bias. Your backtest computes a feature using data that wasn't available at decision time: for example, using a closing price to 'decide' to buy at that day's open, or a z-score normalized over the whole dataset, future rows included. Using restated instead of point-in-time fundamentals does this too: a company's '2023 revenue' as it reads today is not what you knew in Q1 2023.
  • Hindsight base rates. The agent-specific one. You ask an LLM, 'what usually happens after a high-volume breakout?' and it answers with confident specifics — 'about 70% of the time'. Where did that number come from? Its training data, which includes the outcomes. The model has read the ending of the book and is telling you how the story 'usually' goes. It feels like a base rate. It's hindsight wearing a base rate's clothes.
  • Survivorship and selection. You backtest on today's index constituents, so the companies that blew up and got delisted aren't in your universe. Or you tune on the same data you report on. Both are quieter cousins of the same disease.
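
To make the first of these leaks concrete, here is a minimal Python sketch of the z-score case, assuming nothing more than a plain list of closing prices. The function names and the toy `closes` series are illustrative, not taken from any Chart Library code.

```python
from statistics import mean, pstdev


def full_sample_zscores(values):
    """Leaky: every row is normalized with the mean/std of the WHOLE series,
    so early rows are scaled by prices that had not happened yet."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def point_in_time_zscores(values, min_history=3):
    """Time-gated: row i is normalized using rows 0..i-1 only.
    Rows without enough history get None instead of a made-up score."""
    scores = []
    for i, v in enumerate(values):
        history = values[:i]  # strictly earlier rows, nothing from the future
        if len(history) < min_history:
            scores.append(None)
            continue
        mu, sigma = mean(history), pstdev(history)
        scores.append((v - mu) / sigma if sigma > 0 else None)
    return scores


# Toy series (made up): quiet for five bars, then a jump on the last one.
closes = [10.0, 10.2, 10.1, 10.3, 10.2, 14.0]
leaky = full_sample_zscores(closes)
honest = point_in_time_zscores(closes)

# Rewrite only the final bar and recompute both versions.
closes_alt = closes[:-1] + [20.0]
leaky_alt = full_sample_zscores(closes_alt)
honest_alt = point_in_time_zscores(closes_alt)
```

Changing only the last price moves the early full-sample z-scores, while the point-in-time scores for the earlier rows stay exactly where they were. That sensitivity to future rows is a quick test for this kind of leak.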

The through-line: the agent is graded by an oracle it isn't supposed to have. Take the oracle away and most of the edge evaporates.

Why LLM agents make this worse, not better

A pre-LLM quant at least had to write the leak — a line of code referencing a future column. An LLM agent can manufacture the leak from nothing, in fluent prose, on demand. Ask it for the odds and it will give you odds. They'll sound calibrated. They are not measurements; they're plausible-sounding numbers pattern-matched from a corpus that already contains the future relative to any given historical setup.

And because the answer is fluent, it's trusted. A downstream agent — a risk sizer, a portfolio manager — has no way to tell '70%, measured over N real analogs' from '70%, vibes'. It consumes both as fact. The fallacy doesn't just inflate your backtest; it propagates through the crew as ungrounded confidence.

The honest version needs two things vibes can't supply

If you want a real answer to 'what usually happens next', you need two ingredients that training-data pattern-matching structurally cannot provide.

  • A large library of real historical analogs. Not the model's memory of markets — actual past setups that genuinely resemble the one in front of you, retrieved by structural similarity, each with its real recorded outcome. 'What did charts that looked like this one actually do next?' is an empirical question with an empirical answer, if you have the library to look it up in.
  • Time-gated calibration. Each analog's outcome is computed using only information available at that analog's own decision point, with no peeking forward, and the probability bands are then calibrated against held-out reality so that an '80% band' actually contains the outcome about 80% of the time. That's the part that turns a pile of analogs into a number you can trust.

Do both and you've replaced the oracle with a measurement.
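
One way to picture the two ingredients together is the rough sketch below. It assumes a hypothetical `Analog` record with a feature vector, a decision date, a resolution date, and a realized forward return, and it uses plain Euclidean distance as a stand-in for whatever similarity measure a real library would use. None of these names or choices come from Chart Library's implementation. The gate shown here is one reading of 'time-gated': when evaluating a historical setup, only analogs whose outcomes were already known on that date are eligible.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Analog:
    """Hypothetical record for one past setup and what actually followed."""
    features: tuple         # setup description, as known on decided_on
    decided_on: date        # the analog's own decision point
    resolved_on: date       # when its forward outcome became known
    forward_return: float   # realized return over the outcome window


def time_gated_analogs(library, query_features, as_of, k=3):
    """Return the k nearest analogs whose outcomes were known by as_of."""
    eligible = [a for a in library if a.resolved_on <= as_of]
    eligible.sort(key=lambda a: math.dist(a.features, query_features))
    return eligible[:k]


def base_rate_up(analogs):
    """Share of analogs with a positive outcome, plus the sample size N."""
    if not analogs:
        return None, 0
    ups = sum(1 for a in analogs if a.forward_return > 0)
    return ups / len(analogs), len(analogs)


# Toy library (made-up numbers). The last entry matches the query exactly,
# but it sits in the query's future and must not be used.
library = [
    Analog((1.0, 0.5), date(2020, 1, 6), date(2020, 1, 13), 0.02),
    Analog((1.1, 0.4), date(2021, 3, 1), date(2021, 3, 8), -0.01),
    Analog((0.9, 0.6), date(2022, 5, 2), date(2022, 5, 9), 0.03),
    Analog((1.0, 0.5), date(2023, 6, 5), date(2023, 6, 12), 0.05),
]

picked = time_gated_analogs(library, (1.0, 0.5), as_of=date(2022, 1, 3))
rate, n = base_rate_up(picked)
```

The gate is deliberately applied before the similarity ranking, so the closest match in the toy library is excluded because its outcome had not happened yet as of the query date. The function returns N alongside the rate because a base rate without its sample size is hard to weigh.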

What 'calibrated' should actually mean (a receipt)

'Calibrated' is a word people throw around. It has a testable meaning: if you draw an 80% band, the realized outcome should land inside it about 80% of the time, across a large out-of-sample set. Not 95% (you're sandbagging), not 65% (you're overconfident). Eighty.
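
That test is simple enough to write down. The sketch below assumes each prediction is a (low, high) band paired with the outcome later observed out-of-sample. `empirical_coverage` is an illustrative helper name and the numbers are toy values, not Chart Library data.

```python
def empirical_coverage(bands, outcomes):
    """Fraction of realized outcomes that landed inside their predicted band.

    bands    -- list of (low, high) tuples, one per out-of-sample case
    outcomes -- list of realized values, in the same order
    """
    if not bands or len(bands) != len(outcomes):
        raise ValueError("need one band per outcome, and at least one case")
    hits = sum(1 for (low, high), y in zip(bands, outcomes) if low <= y <= high)
    return hits / len(bands)


# Toy check (made-up values): ten nominal 80% bands, ten realized outcomes.
bands = [(-1.0, 1.0)] * 10
outcomes = [0.1, -0.5, 0.9, 1.4, 0.0, -0.2, 0.7, -1.3, 0.3, 0.5]

coverage = empirical_coverage(bands, outcomes)  # two misses out of ten
```

With a nominal level of 0.80, coverage well above that suggests the bands are too wide, and coverage well below it suggests overconfidence. Ten cases is far too few to conclude anything; the point of a large held-out set is that the measured fraction stops being noise.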

Here's the receipt from the node we build at Chart Library: the nominal 80% band contained the realized outcome 80.8% of the time across 303,000+ real historical cases. That's the difference between a band that means something and a confident-sounding guess. It's measured on held-out reality, not asserted.

Note: We're deliberately not parading a backtested equity curve here, since that would be the very thing this post warns about. The honest receipt is coverage: did the bands hold, out-of-sample? They did.

Provenance: so the crew can tell measurement from vibes

The fix isn't just having a real number — it's labeling it so downstream agents trust it correctly. Every base-rate figure should carry its provenance: 'per N historical analogs, calibrated 80% band'. That tag is what lets a risk-sizer or portfolio-manager agent distinguish a grounded measurement from a fluent guess, and weight it accordingly. Without provenance, your best number and your worst hallucination arrive looking identical. With it, an honest number stops getting flagged as a hallucination by a skeptical teammate.
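
As a rough illustration of what such a tag could look like in code, here is a sketch of a base-rate value that carries its own provenance, plus a downstream consumer that refuses to size on an untagged number. The `BaseRate` class, its field names, the `min_analogs` threshold, and the all-or-nothing weighting are assumptions made for this example, not the schema used by the Chart Library node. The figures are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BaseRate:
    """A probability plus the evidence behind it (hypothetical schema)."""
    probability: float
    n_analogs: Optional[int] = None            # how many real analogs back it
    band_level: Optional[float] = None         # nominal band, e.g. 0.80
    measured_coverage: Optional[float] = None  # out-of-sample coverage

    @property
    def is_grounded(self) -> bool:
        """True only when every provenance field is present."""
        return (
            bool(self.n_analogs)
            and self.band_level is not None
            and self.measured_coverage is not None
        )

    def describe(self) -> str:
        if not self.is_grounded:
            return f"{self.probability:.0%} (no provenance; treat as a guess)"
        return (
            f"{self.probability:.0%} per {self.n_analogs} historical analogs, "
            f"calibrated {self.band_level:.0%} band"
        )


def sizing_weight(rate: BaseRate, min_analogs: int = 30) -> float:
    """Deliberately crude rule: ignore any number that cannot show its work."""
    if not rate.is_grounded or rate.n_analogs < min_analogs:
        return 0.0
    return 1.0


# Same headline probability, very different evidence (placeholder figures).
measured = BaseRate(0.70, n_analogs=500, band_level=0.80, measured_coverage=0.81)
vibes = BaseRate(0.70)
```

Both objects say 70%, but only one can show where the number came from, and a risk-sizer that checks `is_grounded` can treat them differently. A real crew would likely use a graded weight rather than a 0-or-1 switch; the binary rule here just keeps the example short.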

The takeaway

If your trading agent's backtest looks amazing, your first question shouldn't be 'how do I deploy this' — it should be 'where did it peek?' The Oracle Fallacy is the default state of trading-agent evaluation, not a rare bug. The way out isn't a cleverer backtest; it's refusing to let the future leak in: real time-gated analogs, calibration tested against held-out reality, and provenance on every number so the honesty survives the trip through the rest of your crew.

You can't predict the future. But you can measure what honestly happened after setups like this one — and that, calibrated, is worth more than any oracle.

See the calibrated base-rate node in action — the runnable reference crew is at https://github.com/grahammccain/chart-library-agent-crew, and Chart Library is free to start at chartlibrary.io.
