Evaluation · the coverage record

The number we are judged by — audited, not asserted.

The headline proof is the coverage record: our calibrated band held what it claimed across hundreds of thousands of audited cases, and you can check it live. Then the clean comparison against an ungrounded LLM — same coverage, 44% tighter band, reproducible via the public API. And below, the blind-judge paired evaluation, framed honestly about exactly what it does and doesn't show.

The coverage record

80.8%
Coverage record · 5-day band, held out
Our nominal 80% forward-return band held this share of the time across the audited historical analyses below (5-day horizon, symbol-disjoint evaluation, no same-stock leakage). This is an audit of past predicted-vs-realized coverage, not a forward claim. It is the number we are judged by — and it is live: call /api/v1/calibration and check it yourself.
A real desk note
comp_strength: low · match_quality: loose
conditions: normal · coverage: 0.81
drivers: sector_lagging, narrative_passive, earnings_near
“This one reads like a coin flip — the comp set doesn’t separate up from down here.”
This is the product. A calibrated comp set and the flags that qualify it — not a call.
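The coverage figure is a simple hit rate: the share of audited cases whose realized 5-day return landed inside the band predicted for it. A minimal sketch of that audit, assuming each case is a (lower, upper, realized) triple in percentage points. The function name and the toy numbers are illustrative only; they are not the service's schema or data.

```python
from typing import Iterable, Tuple

# (band_lower, band_upper, realized_return), all in percent. Hypothetical layout.
Case = Tuple[float, float, float]


def empirical_coverage(cases: Iterable[Case]) -> float:
    """Share of cases whose realized return fell inside its band (inclusive)."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one audited case")
    hits = sum(1 for lo, hi, realized in cases if lo <= realized <= hi)
    return hits / len(cases)


# Toy data, made up for illustration: 4 of 5 realized moves land in band.
toy_cases = [
    (-4.0, 5.0, 1.2),
    (-6.5, 6.0, -7.1),  # miss: the move fell below the band
    (-3.0, 3.5, 0.4),
    (-5.0, 4.5, 4.5),   # hit: the boundary counts as inside
    (-2.5, 2.0, -1.9),
]

coverage = empirical_coverage(toy_cases)
print(f"coverage = {coverage:.1%}")
```

A band that is honestly calibrated at a nominal 80% should produce a hit rate close to 80% when run over a large held-out set.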

Grounded vs a guess: same coverage, 44% tighter

The coverage record above proves our band holds what it claims. This is the comparison that matters for an agent builder: against an ungrounded LLM doing the same job. We took 300 out-of-sample setups, asked a model (Claude Haiku) for its own 80% interval on the 5-day forward return, and put it next to Chart Library’s calibrated band — both scored against what actually happened.

80% interval for the 5-day move | Coverage | Mean width
Ungrounded LLM (its own interval) | 82.7% | 18.5 pts
Chart Library calibrated band | 82.7% | 10.3 pts (44% tighter)
Raw cohort (uncalibrated baseline) | 87.0% | 11.7 pts

Identical coverage, 44% tighter band. We are explicit about what this is not: it is not a claim that we predict returns more accurately. The coverage is the same (~83%), so both are equally honest about how often the move lands in range. The difference is precision: to stay calibrated, the ungrounded model had to hedge about 1.8× as wide (18.5 vs 10.3 points). Chart Library gives the same honesty in a little over half the width, with a setup-specific band instead of a generic one, and a coverage receipt the model structurally cannot produce.
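The width comparison is plain arithmetic on the mean widths reported in the table above. A small sketch; the variable names are ours.

```python
# Mean 80%-interval widths reported in the table above, in percentage points.
llm_width = 18.5         # ungrounded LLM
calibrated_width = 10.3  # Chart Library calibrated band

# "Tighter" is the relative reduction in mean width versus the LLM interval.
tightening = 1 - calibrated_width / llm_width

# The same gap, expressed as how much wider the LLM had to hedge.
width_ratio = llm_width / calibrated_width

print(f"calibrated band is {tightening:.0%} tighter")
print(f"LLM interval is {width_ratio:.1f}x as wide")
```

Because both intervals cover about 83% of realized moves, width is the only axis on which the two rows differ.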

Method, stated plainly. 300 out-of-sample setups across five months (Feb–Jun 2026), drawn from large-cap and high-volatility names. Every test date is after the model’s training cutoff, so the LLM cannot have memorized the outcome; this is the fair version of the ungrounded-agent task. The calibrated band and the realized return both come straight from the public /api/v1/replay endpoint, so you can reproduce it yourself: pull a band, ask any model for its interval, score both against the realized move. The result held in every one of the five months. (An earlier 80-setup pilot showed a coverage edge too; at n=300 that washed out to zero, which is the honest result and why the claim here is width, not accuracy.)
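The reproduction recipe reduces to a small scoring loop. The sketch below assumes you have already collected, for each setup, the calibrated band, the realized return, and a model's own 80% interval. The response schema of /api/v1/replay is not shown on this page, so the `Setup` fields are hypothetical, and the symbols and numbers are placeholders.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[float, float]  # (lower, upper) 5-day return, in percent


@dataclass
class Setup:
    # Hypothetical record layout; map your own replay output into it.
    symbol: str
    date: str
    calibrated_band: Interval
    llm_interval: Interval
    realized: float


def score(intervals: List[Interval], realized: List[float]) -> Tuple[float, float]:
    """Return (coverage, mean_width) for intervals scored against realized moves."""
    if not intervals or len(intervals) != len(realized):
        raise ValueError("need matching, non-empty intervals and outcomes")
    hits = sum(1 for (lo, hi), r in zip(intervals, realized) if lo <= r <= hi)
    widths = [hi - lo for lo, hi in intervals]
    return hits / len(intervals), sum(widths) / len(widths)


# Placeholder setups, made up for illustration.
setups = [
    Setup("AAA", "2026-02-03", (-4.0, 5.0), (-8.0, 9.0), 1.5),
    Setup("BBB", "2026-03-10", (-5.0, 6.0), (-9.0, 10.0), -3.0),
    Setup("CCC", "2026-04-14", (-3.0, 4.0), (-7.0, 8.0), 9.0),  # both intervals miss
    Setup("DDD", "2026-05-19", (-4.5, 4.5), (-8.5, 8.5), -2.0),
]

realized = [s.realized for s in setups]
cal_cov, cal_width = score([s.calibrated_band for s in setups], realized)
llm_cov, llm_width = score([s.llm_interval for s in setups], realized)

print(f"calibrated: coverage={cal_cov:.0%}, mean width={cal_width:.1f} pts")
print(f"llm:        coverage={llm_cov:.0%}, mean width={llm_width:.1f} pts")
```

Scoring both intervals with the same function against the same realized moves keeps the comparison symmetric: neither side gets a different definition of a hit or of width.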

It isn’t a small-model artifact. We re-ran the comparison against three model tiers — Haiku 4.5, Sonnet 4.6, and Opus 4.8, a frontier model. Every tier hedged to roughly the same width (~18–19 points) at comparable coverage, leaving our calibrated band 42–46% tighter against all three. The telling part: the frontier model was not sharper — Opus produced the widest interval of the three and over-covered, hedging more, not less. Capability doesn’t buy precision on this task. Grounding does.

What a blind judge actually preferred

Two identical Claude agents, identical prompts, 50 out-of-sample subjects. One could call Chart Library; the other worked from raw price and headlines alone. A second model, blind to which agent held which tools, scored the reasoning and preferred the grounded reasoning on every scenario. We’re explicit about what this does and doesn’t show: an agent given a research desk will investigate more, so a gap is expected by construction. What the blind judge adds is that the grounded reasoning was preferred even when both agents reached the same conclusion — and even when that conclusion later proved wrong (see NUVL). This is not evidence that our comp set predicts returns. It’s evidence that an agent reasons more like a careful analyst when it can pull the historical record instead of guessing it.

The harder test, specified and queued. The genuinely un-rigged version of this is same-toolkit-both-arms, the only difference being whether the calibration receipt is in context — isolating the moat (calibration), not tool count. That run doesn’t exist yet; we’ll publish its design, and its result, whether or not it favors us.

Per-dimension lift

Score deltas from the rigor-controlled run (n=30, A/B presentation order randomized per scenario, dual-judge averaged). Every dimension positive, with the largest lift on investigation_quality — exactly the one you’d expect when the only thing changing is the toolkit.

[Figure: Reasoning quality lift with Chart Library. Bar chart showing positive deltas across 6 dimensions.]
Dimension | Baseline | With-layer | Δ | Paired t
Investigation quality | 2.17 | 4.92 | +2.75 | 32.13
Evidence use | 3.07 | 4.95 | +1.88 | 26.66
Reasoning rigor | 3.13 | 4.53 | +1.40 | 18.11
Risk awareness | 3.28 | 4.50 | +1.22 | 13.24
Decision quality | 3.10 | 4.23 | +1.13 | 11.16
Confidence calibration | 3.15 | 4.02 | +0.87 | 12.84
A paired t-statistic above 10 at n=30 is far beyond any conventional significance threshold: the lift is, for practical purposes, certain to be real rather than noise.
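For reference, the paired t-statistic is the mean per-scenario score difference divided by its standard error: t = mean(d) / (sd(d) / sqrt(n)). A sketch with made-up scores, followed by the standard deviation of differences implied by the investigation-quality row (Δ = 2.75, t = 32.13), assuming n = 30 pairs as stated for the rigor-controlled run.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(baseline, treated):
    """Paired t-statistic for two equal-length lists of per-scenario scores."""
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [t - b for b, t in zip(baseline, treated)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences are constant; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs)))


# Made-up judge scores for five scenarios, for illustration only.
toy_baseline = [2.0, 3.0, 2.5, 3.0, 2.0]
toy_treated = [4.5, 5.0, 4.5, 5.0, 5.0]
toy_t = paired_t(toy_baseline, toy_treated)

# Inverting the formula: sd of differences implied by a reported delta and t.
n_pairs = 30  # assumption: the n=30 rigor-controlled run
implied_sd = 2.75 * sqrt(n_pairs) / 32.13

print(f"toy t = {toy_t:.2f}")
print(f"implied sd of differences = {implied_sd:.2f} points")
```

Under that n = 30 assumption, the investigation-quality row implies per-scenario differences with a standard deviation of roughly 0.47 points around a mean lift of 2.75, which is why the t-statistic is so large.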

The methodology

Two Claude Haiku agents. Identical prompts. Identical out-of-sample scenarios.

  • Agent A (baseline): tools for get_recent_ohlc and get_recent_headlines. Raw data only.
  • Agent B (with-layer): same plus cohort_analyze, get_market_context, narrative_pulse. The three intelligence-layer tools.
  • Scenarios: 50 random (symbol, date) subjects from 2024 onward, balanced across winners, losers, and neutral outcomes.
  • Both agents run an Anthropic tool-call loop (max 8 iterations), choosing what to investigate, then output a JSON decision.
  • Judge: Claude Sonnet, sees both full traces + final responses, scores each agent on 6 dimensions of reasoning quality. The judge does not know which agent has which toolkit.
  • Rigor controls (on the n=30 follow-up): A/B presentation order randomized per scenario; each pair judged twice with swapped order; scores averaged; winner by consensus.
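The rigor controls in the last bullet can be sketched as follows. The `judge` callable, its signature, and the score values are hypothetical stand-ins, since the actual judge prompt is not shown on this page. The point is the structure: judge each pair twice with presentation order swapped, average the scores, and declare a winner only when both passes agree.

```python
from typing import Callable, Optional, Tuple

# Hypothetical judge: takes (first_trace, second_trace) and returns
# (score_for_first, score_for_second). It is never told which agent is which.
Judge = Callable[[str, str], Tuple[float, float]]


def pick(a: float, b: float) -> Optional[str]:
    """Winner of a single pass, or None on a tie."""
    return "A" if a > b else "B" if b > a else None


def judge_pair(judge: Judge, trace_a: str, trace_b: str) -> Tuple[float, float, Optional[str]]:
    """Judge a pair in both orders; return averaged scores and consensus winner."""
    a1, b1 = judge(trace_a, trace_b)  # pass 1: A shown first
    b2, a2 = judge(trace_b, trace_a)  # pass 2: B shown first
    w1, w2 = pick(a1, b1), pick(a2, b2)
    consensus = w1 if w1 == w2 else None  # None when passes disagree or tie
    return (a1 + a2) / 2, (b1 + b2) / 2, consensus


def position_biased_stub(first: str, second: str) -> Tuple[float, float]:
    """Stub judge with a deliberate +0.5 bias toward whichever trace is shown first."""
    quality = {"thin trace": 2.0, "thorough trace": 4.5}
    return quality[first] + 0.5, quality[second]


score_a, score_b, consensus = judge_pair(
    position_biased_stub, "thin trace", "thorough trace"
)
print(score_a, score_b, consensus)
```

With the stub's position bias, each agent receives the bonus exactly once across the two passes, so the averaged gap equals the stub's underlying quality gap of 2.5 points. That cancellation is the reason for judging in both orders.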

The notable observation: the rigor-controlled run produced larger deltas than the pilot. The controls didn’t expose hidden bias — they revealed that pilot measurement noise had been working slightly against our result, not for it.

The scenario that explains what we actually do — NUVL 2024-09-13

The most revealing scenario in the run involved Nuvalent (NUVL) on September 13, 2024. Both agents were asked: should this be a long entry, 5-day hold?

Both agents reached the same conclusion: no_position. Over the next 5 trading days, NUVL ran +23.5%. Both agents got the outcome wrong.

The judge still ranked Agent B substantially higher. Verbatim:

“Agent B’s investigation was substantially more rigorous, leveraging market context, comp set analytics, and narrative pulse to produce a multi-factor, data-grounded argument, whereas Agent A relied solely on price action from a single tool call and skipped obvious available evidence.”

This is the whole proposition. An intelligence layer doesn’t make your agent right more often. It makes your agent reason better. Sometimes the better-reasoned conclusion is to stay out of a trade that turns out to be a winner. That’s how research works: the realized outcome was the right tail of a distribution; the reasoning that said “stay out” was correct given the available evidence.

The bidirectional value — saves losses, catches winners

Two scenarios in the run showed the agents reaching different decisions. In both, the agent with Chart Library made the better call.

GEHC 2025-02-26 · Saves a loss
−6.56% loss avoided
Baseline: long (conf 6)
With-layer: no_position (conf 7)
Actual 5d return: −6.56%
Judge: “Materially superior, grounding the decision in quantitative base rates and multiple converging bearish signals rather than narrative optimism.”

JEF 2025-08-22 · Catches a winner
+3.69% captured
Baseline: no_position (conf 7)
With-layer: long (conf 6)
Actual 5d return: +3.69%
Judge: “Synthesized quantitative base rates, sector dynamics, and macro backdrop into a well-structured probabilistic case, while Agent First relied on a narrow momentum-exhaustion narrative without broader context.”

What this means about the product

Most “AI trading” tools promise to predict markets. They mostly don’t work — no signal applied mechanically beats SPY net of costs. Chart Library does something different. We’re not in the prediction business. We’re in the reasoning substrate business.

We give AI agents the kind of structured historical context that lets them think well about uncertain situations — the way a Bloomberg Terminal supports an analyst’s reasoning. The validation that matters for an intelligence layer isn’t whether mechanical use of its outputs produces alpha (almost certainly not). It’s whether agents using it reason better. The evaluation on this page measured exactly that, and the answer was yes, decisively.

Try it

Try the layer a blind judge preferred on every scenario.

Same engine, same tools the agents used — grounded in a coverage record you can verify. Free sandbox tier: 1,000 calls/day, no card required.
