Evaluation · agent decision quality

How we tested ourselves — and what 50 Claude agents found.

A paired-toolkit evaluation methodology for AI intelligence layers. Open methodology, reproducible code, and the judge's verbatim rationales on this page.

The result in three numbers

50 / 50

Scenarios won by AI agents using Chart Library

Combined n=50 across pilot (n=20) and rigor-controlled (n=30) runs

6 / 6

Reasoning dimensions improved

Deltas from +0.87 to +2.75 on a 1–5 scale · paired t-statistic > 10 on every dimension

<1e-15

Probability of a 50-0 sweep arising by chance

Binomial under H₀ = agents equally good (50-0 sweep)
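The headline probability can be checked with a short calculation. This is a minimal sketch, assuming a one-sided exact binomial (sign) test in which each of the 50 scenarios is an independent coin flip under H₀ and no scenario is a tie; the function name is illustrative and not taken from the eval code.

```python
from math import comb

def sweep_p_value(wins: int, n: int, p_null: float = 0.5) -> float:
    """One-sided exact binomial p-value: P(X >= wins) when each trial is won with p_null."""
    return sum(
        comb(n, k) * p_null**k * (1 - p_null) ** (n - k)
        for k in range(wins, n + 1)
    )

# 50 wins out of 50 scenarios under "agents equally good" (p = 0.5)
p = sweep_p_value(50, 50)
print(f"{p:.2e}")  # 8.88e-16
```

For a clean sweep the sum collapses to a single term, 0.5^50, which is why the result sits just under 1e-15. The number is the probability of a 50-0 sweep if the null were true, under the independence assumption stated above.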

Per-dimension lift

Score deltas from the rigor-controlled run (n=30, A/B presentation order randomized per scenario, dual-judge averaged). Every dimension positive. The dimension with the largest lift — investigation_quality — is exactly the one we’d expect to move when the only thing changing is the toolkit available to the agent.

[Figure: bar chart of reasoning quality lift with Chart Library, showing positive deltas across 6 dimensions]

Dimension	Baseline	With-layer	Δ	paired t
Investigation quality	2.17	4.92	+2.75	32.13
Evidence use	3.07	4.95	+1.88	26.66
Reasoning rigor	3.13	4.53	+1.40	18.11
Risk awareness	3.28	4.50	+1.22	13.24
Decision quality	3.10	4.23	+1.13	11.16
Confidence calibration	3.15	4.02	+0.87	12.84

With 30 paired scenarios, a paired t-statistic above 10 corresponds to a p-value far below any conventional significance threshold: the difference in judged scores is essentially certain to be real rather than sampling noise.
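For readers who want to sanity-check the table, the sketch below shows how a paired t-statistic is formed and what the reported figures imply. It assumes the standard paired formula (mean per-scenario difference divided by its standard error) with n=30; the per-scenario scores are not published on this page, so the code only backs out the implied standard deviation of the differences. The function names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(baseline, with_layer):
    """Paired t-statistic: mean per-scenario difference over its standard error."""
    diffs = [b - a for a, b in zip(baseline, with_layer)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

def implied_sd(delta, t, n=30):
    """SD of per-scenario differences implied by a reported delta and t (assumes n pairs)."""
    return delta * sqrt(n) / t

# (delta, paired t) copied from the table above
reported = {
    "Investigation quality": (2.75, 32.13),
    "Evidence use": (1.88, 26.66),
    "Reasoning rigor": (1.40, 18.11),
    "Risk awareness": (1.22, 13.24),
    "Decision quality": (1.13, 11.16),
    "Confidence calibration": (0.87, 12.84),
}

for name, (delta, t) in reported.items():
    print(f"{name:24s} implied SD of differences: {implied_sd(delta, t):.2f}")
```

Under these assumptions every implied standard deviation comes out at roughly half a point on the 1-5 scale, so the large t-values reflect deltas that are both sizeable and consistent from scenario to scenario.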

The methodology

Two Claude Haiku agents. Identical prompts. Identical out-of-sample scenarios.

Agent A (baseline): tools for get_recent_ohlc and get_recent_headlines. Raw data only.
Agent B (with-layer): same plus cohort_analyze, get_market_context, narrative_pulse. The three intelligence-layer tools.
Scenarios: 50 random (symbol, date) anchors from 2024 onward, balanced across winners, losers, and neutral outcomes.
Both agents run an Anthropic tool-call loop (max 8 iterations), choosing what to investigate, then output a JSON decision.
Judge: Claude Sonnet, sees both full traces + final responses, scores each agent on 6 dimensions of reasoning quality. The judge is not told which agent has which toolkit.
Rigor controls (on the n=30 follow-up): A/B presentation order randomized per scenario; each pair judged twice with swapped order; scores averaged; winner by consensus.
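The setup above can be sketched as a bounded tool-call loop in which the only difference between the two agents is the toolkit they are handed. This is a rough illustration, not the eval code: `call_model` is a hypothetical stand-in for the LLM call, the tools are stubs, and the decision fields `action` and `confidence` are assumed names. Only the tool names and the 8-iteration cap come from the description above.

```python
import json
from typing import Callable

BASELINE_TOOLS = ["get_recent_ohlc", "get_recent_headlines"]
LAYER_TOOLS = BASELINE_TOOLS + ["cohort_analyze", "get_market_context", "narrative_pulse"]
MAX_ITERATIONS = 8

def run_agent(call_model: Callable, tools: dict, scenario: dict) -> dict:
    """Loop until the model emits a decision or the iteration cap is reached.

    call_model(scenario, tool_names, trace) is hypothetical: it returns either
    {"tool": name, "args": {...}} or {"decision": {...}}.
    """
    trace = []
    for _ in range(MAX_ITERATIONS):
        step = call_model(scenario, list(tools), trace)
        if "decision" in step:
            return {"trace": trace, "decision": step["decision"]}
        name, args = step["tool"], step.get("args", {})
        result = tools[name](**args) if name in tools else {"error": f"unknown tool {name}"}
        trace.append({"tool": name, "args": args, "result": result})
    return {"trace": trace, "decision": None}  # cap hit without a decision

def stub_model(scenario, tool_names, trace):
    """Toy policy: call each available tool once, then stay flat."""
    used = {s["tool"] for s in trace}
    unused = [t for t in tool_names if t not in used]
    if unused:
        return {"tool": unused[0], "args": {"symbol": scenario["symbol"]}}
    return {"decision": {"action": "no_position", "confidence": 5}}

def make_tools(names):
    """Stub tools that echo their name instead of fetching real data."""
    return {n: (lambda symbol, _n=n: {"tool": _n, "symbol": symbol}) for n in names}

scenario = {"symbol": "NUVL", "date": "2024-09-13"}
run_a = run_agent(stub_model, make_tools(BASELINE_TOOLS), scenario)
run_b = run_agent(stub_model, make_tools(LAYER_TOOLS), scenario)
print(len(run_a["trace"]), len(run_b["trace"]))
print(json.dumps(run_b["decision"]))
```

Keeping the loop, the prompt, and the scenario fixed while swapping only the `tools` mapping is what makes the comparison paired: any difference in the resulting traces is attributable to the toolkit.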

The notable observation: the rigor-controlled run produced larger deltas than the pilot. The controls didn’t expose hidden bias — they revealed that pilot measurement noise had been working slightly against our result, not for it.
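To see why judging each pair in both orders matters, here is a small sketch of the swap-and-average step. It is an illustration under assumptions: `judge` is a hypothetical callable returning one score per agent (the real judge scores 6 dimensions), and "winner by consensus" is read here as both passes agreeing. The stub judge deliberately favors whichever agent it sees first.

```python
from statistics import mean
from typing import Callable

def judge_pair(judge: Callable, agent_a: str, agent_b: str) -> dict:
    """Judge a pair twice with swapped presentation order and average the scores.

    judge(first, second) is hypothetical: it returns (score_first, score_second).
    """
    a1, b1 = judge(agent_a, agent_b)  # pass 1: A presented first
    b2, a2 = judge(agent_b, agent_a)  # pass 2: B presented first
    votes = [
        "A" if a > b else "B" if b > a else "tie"
        for a, b in [(a1, b1), (a2, b2)]
    ]
    winner = votes[0] if votes[0] == votes[1] else "no_consensus"
    return {"A": mean([a1, a2]), "B": mean([b1, b2]), "winner": winner}

# Illustrative "true" quality values, invented for this demo only
TRUE_QUALITY = {"trace_a": 3.0, "trace_b": 4.5}
POSITION_BONUS = 0.5  # stub judge inflates whichever agent is shown first

def biased_judge(first, second):
    return TRUE_QUALITY[first] + POSITION_BONUS, TRUE_QUALITY[second]

result = judge_pair(biased_judge, "trace_a", "trace_b")
print(result)
```

In this toy setup each agent receives the position bonus exactly once, so the averaged gap between the two equals the true gap and the order effect cancels out.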

Eval code on GitHub

The scenario that explains what we actually do — NUVL 2024-09-13

The most revealing scenario in the run involved Nuvalent (NUVL) on September 13, 2024. Both agents were asked: should this be a long entry, 5-day hold?

Both agents reached the same conclusion: no_position. Over the next 5 trading days, NUVL ran +23.5%. Both agents got the outcome wrong.

The judge still ranked Agent B substantially higher. Verbatim:

“Agent B’s investigation was substantially more rigorous, leveraging market context, cohort analytics, and narrative pulse to produce a multi-factor, data-grounded argument, whereas Agent A relied solely on price action from a single tool call and skipped obvious available evidence.”

This is the whole proposition. An intelligence layer doesn’t make your agent right more often. It makes your agent reason better. Sometimes the better-reasoned conclusion is to stay out of a trade that turns out to be a winner. That’s how research works: the realized outcome was the right tail of a distribution; the reasoning that said “stay out” was correct given the available evidence.

The bidirectional value — saves losses, catches winners

Two scenarios in the run showed the agents reaching different decisions. In both, the agent with Chart Library made the better call.

GEHC 2025-02-26 · Saves a loss

−6.56% loss avoided

Baseline: long (conf 6)

With-layer: no_position (conf 7)

Actual 5d return: −6.56%

Judge: “Materially superior, grounding the decision in quantitative base rates and multiple converging bearish signals rather than narrative optimism.”

JEF 2025-08-22 · Catches a winner

+3.69% captured

Baseline: no_position (conf 7)

With-layer: long (conf 6)

Actual 5d return: +3.69%

Judge: “Synthesized quantitative base rates, sector dynamics, and macro backdrop into a well-structured probabilistic case, while Agent First relied on a narrow momentum-exhaustion narrative without broader context.”
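The arithmetic behind the two cards is simple enough to spell out. A minimal sketch, assuming a long position earns the actual 5-day return, no_position earns zero, and costs are ignored; the helper name is illustrative. The NUVL row uses the figures from the earlier scenario, where both agents stayed flat.

```python
def realized_return(action: str, actual_5d_pct: float) -> float:
    """Return over the hold: the 5-day move if long, zero if flat (costs ignored)."""
    return actual_5d_pct if action == "long" else 0.0

# (symbol, actual 5d return %, baseline action, with-layer action) from this page
cases = [
    ("GEHC", -6.56, "long", "no_position"),
    ("JEF", 3.69, "no_position", "long"),
    ("NUVL", 23.5, "no_position", "no_position"),
]

for symbol, ret, baseline, with_layer in cases:
    edge = realized_return(with_layer, ret) - realized_return(baseline, ret)
    print(f"{symbol}: with-layer minus baseline = {edge:+.2f} pts")
```

The GEHC and JEF rows reproduce the 6.56 and 3.69 point differences shown above, while NUVL comes out at zero because both agents made the same call.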

What this means about the product

Most “AI trading” tools promise to predict markets. They mostly don’t work — no signal applied mechanically beats SPY net of costs. Chart Library does something different. We’re not in the prediction business. We’re in the reasoning substrate business.

We give AI agents the kind of structured historical context that lets them think well about uncertain situations — the way a Bloomberg Terminal supports an analyst’s reasoning. The validation that matters for an intelligence layer isn’t whether mechanical use of its outputs produces alpha (almost certainly not). It’s whether agents using it reason better. The evaluation on this page measured exactly that, and the answer was yes, decisively.
