Chart Library
Methodology · AI Agents · Evaluation · Research

Why we stopped backtesting our intelligence layer (and what we found instead)

Chart Library Team · 9 min read

The backtest paradox we kept running into

For most of the last quarter we evaluated Chart Library the way every quant team evaluates a signal: by backtesting it. Define a rule, apply it uniformly across history, ask whether the resulting equity curve has alpha.

We tested every form of the cohort signal we could construct. Cluster-based pattern matching. Composite multi-factor scoring. Long-only ranking. Cross-sectional long-short. Each variant passed fewer of our tests than the one before.

The headline number that kept hitting us in the face: CAPM alpha not statistically distinguishable from zero. Total return looked good — one variant returned +136% over five years — but the CAPM regression showed t-stats of 0.7 to −0.5 OOS, within noise of zero. Whatever return we were generating was leveraged beta, not idiosyncratic edge.
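For readers who want to run that check on their own returns, here is a minimal sketch of a CAPM alpha regression with statsmodels. The data handling, risk-free treatment, and annualization convention are placeholder assumptions, not our actual backtest pipeline.

```python
# Minimal sketch of a CAPM alpha check (illustrative; not our backtest pipeline).
# Assumes daily strategy and market returns aligned on the same dates.
import pandas as pd
import statsmodels.api as sm

def capm_alpha(strategy_returns: pd.Series, market_returns: pd.Series,
               risk_free_daily: float = 0.0):
    """Regress strategy excess returns on market excess returns.

    Returns (annualized_alpha, alpha_t_stat, beta).
    """
    strat_excess = strategy_returns - risk_free_daily
    mkt_excess = market_returns - risk_free_daily
    X = sm.add_constant(mkt_excess.values)   # column of ones = the alpha term
    fit = sm.OLS(strat_excess.values, X).fit()
    daily_alpha, beta = fit.params
    return daily_alpha * 252, fit.tvalues[0], beta

# An alpha t-stat in the 0.7 to -0.5 range means the +136% headline return
# is explained by beta (market exposure), not by idiosyncratic edge.
```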

This was confusing, because the engine clearly does something. Its calibrated forward-return distributions hit at advertised coverage rates. Its cohort retrieval surfaces non-obvious historical analogs. Its feature attributions correctly identify what's driving recent moves. The information is real. So why didn't the rules-applied-mechanically version produce alpha?
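Concretely, "hit at advertised coverage rates" is the standard calibration check: a published 80% forward-return interval should contain the realized return roughly 80% of the time. A minimal sketch of that check, with illustrative field names rather than the engine's actual output schema:

```python
# Sketch of an interval-coverage check (field names are illustrative,
# not the engine's actual output schema).
def empirical_coverage(predictions, nominal=0.80):
    """predictions: iterable of (lo, hi, realized) forward-return triples.

    Returns the observed hit rate; a calibrated interval lands near `nominal`.
    """
    preds = list(predictions)
    hits = sum(1 for lo, hi, realized in preds if lo <= realized <= hi)
    return hits / len(preds)
```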

The reframe

The breakthrough was admitting we'd been evaluating the wrong product. We aren't building a trading strategy. We're building an intelligence layer that AI agents call. A trading strategy is a deterministic rule. An intelligence layer is reasoning input. The two require different evaluations.

Here's the calculator-vs-AI metaphor that made it click. A calculator does deterministic arithmetic. An AI reasons about when to apply which arithmetic. We'd been treating Chart Library like a fancy calculator — input pattern, output prediction — and asking whether mechanical use of the prediction produced returns. But the actual value of an intelligence layer is what reasoning it enables in a downstream agent.

The right parallel is Bloomberg Terminal. If you backtested Bloomberg as a screener — buying every stock its filter ranked 'undervalued' — the Sharpe would be terrible. But Bloomberg is a multibillion-dollar-a-year business because analysts use it to reason, not to apply mechanical rules.

We needed to stop asking 'does our signal have alpha' and start asking 'does our intelligence layer help an agent reason.'

The right experiment — ADQE

We built ADQE: Agent Decision Quality Evaluation. The setup is simple and reproducible.

  • Scenario bank: 50 trading-decision scenarios from out-of-sample data (2024 onward), balanced across winners (at least +2% over the next 5 trading days), losers (at least −2%), and neutral.
  • Two Claude Haiku agents with identical prompts. Agent A has get_recent_ohlc and get_recent_headlines. Agent B has those plus cohort_analyze, get_market_context, and narrative_pulse.
  • Both run an Anthropic tool-call loop (sketched after this list), choose what to investigate, then output a JSON decision (decision, confidence, stop, target, top-3 factors, rationale).
  • Claude Sonnet judges. It sees both full traces and final responses, and scores each on six reasoning-quality dimensions from 1 to 5: reasoning_rigor, evidence_use, confidence_calibration, risk_awareness, decision_quality, investigation_quality. It does NOT know which agent has which toolkit.
  • Rigor controls on the n=30 follow-up: A/B presentation order randomized per scenario; each pair judged twice with swapped order; scores averaged; winner by consensus.
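For concreteness, here is a condensed sketch of what one agent run looks like. The tool names come from the list above, but the schemas, prompts, model alias, and dispatch below are simplified placeholders, not the actual scripts/adqe_v03.py.

```python
# Condensed sketch of one ADQE agent run (placeholders, not scripts/adqe_v03.py).
import json
import anthropic

client = anthropic.Anthropic()                  # reads ANTHROPIC_API_KEY
MODEL = "claude-3-5-haiku-latest"               # assumption: any Haiku-class model

BASELINE_TOOLS = ["get_recent_ohlc", "get_recent_headlines"]
LAYER_TOOLS = BASELINE_TOOLS + ["cohort_analyze", "get_market_context", "narrative_pulse"]

def tool_spec(name):
    # Placeholder schema: every tool takes a ticker and an as-of date.
    return {
        "name": name,
        "description": f"{name} for a given ticker as of a given date",
        "input_schema": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}, "as_of": {"type": "string"}},
            "required": ["ticker", "as_of"],
        },
    }

def run_agent(scenario_prompt, tool_names, dispatch):
    """dispatch maps tool name -> callable(**tool_input) -> str.

    scenario_prompt must instruct the model to end with a pure-JSON decision
    (decision, confidence, stop, target, top-3 factors, rationale).
    """
    messages = [{"role": "user", "content": scenario_prompt}]
    tools = [tool_spec(n) for n in tool_names]
    while True:
        resp = client.messages.create(
            model=MODEL, max_tokens=1024, tools=tools, messages=messages
        )
        if resp.stop_reason != "tool_use":
            return json.loads(resp.content[0].text)   # the final JSON decision
        # Execute every requested tool and feed the results back to the agent.
        messages.append({"role": "assistant", "content": resp.content})
        results = [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": dispatch[b.name](**b.input)}
            for b in resp.content if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```

In this sketch, the judging step would be one more messages.create call with a Sonnet-class model: both anonymized traces in a single prompt, returning the six 1–5 dimension scores as JSON.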

The result

Combined across the pilot (n=20) and rigor-controlled follow-up (n=30), Agent B won every dimension on every scenario. Combined judge winners: A=0, B=50, ties=0. Under H₀ (the agents are equally good), the probability of a 50-0 sweep is about 1 in 1.13 quadrillion.

From the rigor-controlled n=30: investigation_quality +2.75 (paired t=32.13), evidence_use +1.88 (t=26.66), reasoning_rigor +1.40 (t=18.11), risk_awareness +1.22 (t=13.24), decision_quality +1.13 (t=11.16), confidence_calibration +0.87 (t=12.84). A paired t-statistic above 10 indicates an effect so large that conventional significance testing has nothing left to say: the difference is essentially certainly real.
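Two pieces of arithmetic behind those headline numbers, for anyone who wants to sanity-check them (the per-scenario scores themselves live in the repo linked below): the sweep probability is just 0.5^50, and from each reported mean delta and paired t-statistic we can back out the implied per-scenario spread on the 1–5 scale.

```python
# Sanity-check arithmetic for the headline numbers (illustrative only).
import math

# Probability of a 50-0 sweep if the two agents were actually equally good:
p_sweep = 0.5 ** 50
print(f"{p_sweep:.2e}")                     # ~8.88e-16, i.e. ~1 in 1.13 quadrillion

# A paired t-stat is mean(delta) / (std(delta) / sqrt(n)), so the reported
# mean and t imply the per-scenario spread of the deltas:
def implied_sd(mean_delta, t_stat, n=30):
    return mean_delta * math.sqrt(n) / t_stat

print(implied_sd(2.75, 32.13))              # investigation_quality: ~0.47 points
print(implied_sd(0.87, 12.84))              # confidence_calibration: ~0.37 points
```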

Notable observation: the rigor-controlled run produced LARGER deltas than the pilot. The controls didn't expose hidden bias: they revealed that measurement noise in the pilot had been working slightly against our result, not for it.

The scenario that explains what we actually do

The most revealing scenario in the run was NUVL on 2024-09-13. A biotech with a strong recent run. Both agents investigated. Both reached the same decision: no_position. Over the next 5 trading days NUVL ran +23.5%. Both agents got the OUTCOME wrong.

The judge still ranked Agent B substantially higher. Its rationale, verbatim:

"Agent B's investigation was substantially more rigorous, leveraging market context, cohort analytics, and narrative pulse to produce a multi-factor, data-grounded argument, whereas Agent A relied solely on price action from a single tool call and skipped obvious available evidence."

This is the whole proposition. An intelligence layer doesn't make your agent right more often. It makes your agent reason better. Sometimes the better-reasoned conclusion is to stay out of a trade that ends up being a winner. The expected value of the trade was unfavorable given the evidence; the realized outcome was the right tail of a distribution. That's how research works.

The bidirectional value — GEHC saved a loss, JEF caught a winner

Two scenarios in the run showed the agents reaching DIFFERENT decisions — and in both, the agent with Chart Library made the better call.

GEHC 2025-02-26: Baseline went long with confidence 6. With-layer said no_position with confidence 7. Actual 5-day return: −6.56%. With-layer avoided the loss. The judge cited 'quantitative base rates and multiple converging bearish signals rather than narrative optimism.'

JEF 2025-08-22: Baseline said no_position. With-layer went long with confidence 6. Actual 5-day return: +3.69%. With-layer caught a winner the baseline agent missed. The judge cited 'quantitative base rates, sector dynamics, and macro backdrop' against the baseline's narrow momentum-exhaustion read.

The common thread isn't that one direction works better. The common thread is that the agent with Chart Library was making decisions on a broader evidence base — and that's true whether the right answer was 'stay out' or 'take the trade.'

What this changes

The pivot from 'signal generator' to 'intelligence layer' has cascading consequences.

  • Eval methodology: agent decision quality replaces backtest Sharpe as our primary measurement. Calibration of published distributions stays as a continuous secondary measure.
  • Product: API + MCP server, sold by call volume and seat count. Not AUM-priced.
  • Customer: AI agent builders and quantitative researchers. Not retail traders looking for buy signals.
  • Marketing: case studies of agent traces, calibration curves, decision-quality deltas. Not backtest equity curves.

Reproducing this result

The code that ran this evaluation is at github.com/grahammccain/chart-library-adqe. To reproduce: clone the repo, set ANTHROPIC_API_KEY, run python scripts/adqe_v03.py --n-scenarios 20. Cost: roughly $6 for 20 scenarios with dual judge. The eval is fully open — anyone can re-run it on any subset of scenarios, with any toolkit split, with any judge model.

Try the engine that won 50–0 — chartlibrary.io/developers (API) · pip install chartlibrary-mcp (MCP) · chartlibrary.io/app (live playground)

The thing we learned

For a year we asked 'does our signal have alpha?' because that's what trading systems are evaluated on. The answer kept being 'maybe, but not significantly.' We took it as evidence that the signal needed more work.

It took us longer than it should have to ask the question whose answer was already in front of us: does the system help an agent reason better?

Sometimes you don't need a better signal. You need a better evaluation.

Ready to try Chart Library?

Anchor any ticker + date — see what history says about your setup, with cohort statistics, feature attribution, and AI narrative.

Try it free