
Backtest a chart pattern honestly — without survivorship bias or leakage.

If you’ve ever seen a chart pattern backtest claim 70%+ win rates, the methodology was almost certainly broken. The two most common failure modes — survivorship bias and train/test leakage — silently inflate results by 5 to 30 percentage points and almost no retail backtesting tool defends against either.

This guide covers what survivorship bias is, why date-based splits leak, and working code for a backtest that avoids both.

Failure mode 1: Survivorship bias

Most retail data feeds (yfinance, Alpha Vantage’s free tier, every “current S&P 500” CSV on Kaggle) only contain currently-trading symbols. Symbols that went bankrupt, got delisted, or merged are silently absent.

When you backtest on this data, you’re testing “patterns from companies that survived the period.” That’s a population conditioned on survival — by construction biased toward winners. A pattern strategy backtested on survivors-only data will show ~5 to 15 percentage points higher win rate than it would on the full population.
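
To see the mechanism, here's a toy simulation. All numbers are synthetic and only illustrate the direction of the bias, not its real-world size:

import random

random.seed(0)

# Synthetic universe: each symbol has a true per-trade win probability,
# and symbols with worse odds are more likely to end up delisted.
universe = [{"p_win": random.uniform(0.3, 0.7)} for _ in range(5000)]
for sym in universe:
    sym["delisted"] = random.random() < (0.8 - sym["p_win"])

def realized_win_rate(symbols, trades_per_symbol=20):
    wins, trials = 0, 0
    for sym in symbols:
        for _ in range(trades_per_symbol):
            wins += random.random() < sym["p_win"]
            trials += 1
    return wins / trials

survivors = [s for s in universe if not s["delisted"]]
print(f"Full universe win rate:  {realized_win_rate(universe):.1%}")
print(f"Survivors-only win rate: {realized_win_rate(survivors):.1%}")
# The survivors-only figure is higher purely because the losers were filtered out.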

The fix: include delisted symbols. They’re what survivorship bias systematically excludes. Forward returns for delisted symbols use the last trading price (occasionally pessimistic, never optimistic).

# Wrong: yfinance default — survivors-only
import yfinance as yf
data = yf.download(['AAPL', 'NVDA'] + current_sp500)  # missing GE-old, K-old, BBBY, ...

# Better: include delisted (Polygon, Norgate, EODHistoricalData)
import polygon
client = polygon.RESTClient(...)
all_tickers = list_all_tickers_including_delisted(client)  # placeholder helper: wrap the provider's ticker listing, including inactive symbols
# As of 2026-05: ~19,000 US equity symbols, of which ~7,000 are delisted
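
The last-trading-price rule is easy to implement once your price history keeps the final prints before a delisting. A minimal sketch, where prices is assumed to be a plain date-to-close dict rather than any specific provider's format:

def forward_return_pct(prices, entry_date, horizon_bars=5):
    # prices: {datetime.date: close}, including the last prints before a delisting
    dates = sorted(d for d in prices if d >= entry_date)
    if not dates:
        return None
    entry = prices[dates[0]]
    # Exit at `horizon_bars` bars out, or at the final bar if the symbol stopped trading
    exit_price = prices[dates[min(horizon_bars, len(dates) - 1)]]
    return (exit_price / entry - 1) * 100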

Failure mode 2: Train/test leakage via date splits

The standard ML practice — train on 2016-2024, test on 2025 — is silently broken in finance because the same symbols appear on both sides. A model that has seen NVDA 2,500 times during training and is now being tested on NVDA isn’t being evaluated. It’s being asked to recall.

The fix is symbol-disjoint splits: hold out entire tickers, not date ranges. Train on 80% of symbols across all years; evaluate on the remaining 20% of symbols. The evaluation symbols never appear in training, period.

Add a 10-day embargo window on top: even within a held-out symbol, don’t use a date that falls within 10 trading days of an in-sample date, because adjacent days are too autocorrelated.

That’s the full discipline: symbol-disjoint evaluation, plus the embargo.
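
A minimal sketch of that split, assuming your samples are sparse pattern anchors (dicts with 'symbol' and 'date' keys) and approximating the 10-trading-day embargo with calendar days:

import random

def symbol_disjoint_split(samples, train_frac=0.8, embargo_days=10, seed=42):
    # samples: list of {"symbol": str, "date": datetime.date, ...}
    symbols = sorted({s["symbol"] for s in samples})
    random.Random(seed).shuffle(symbols)
    train_symbols = set(symbols[:int(len(symbols) * train_frac)])

    train = [s for s in samples if s["symbol"] in train_symbols]
    train_dates = {s["date"] for s in train}

    def embargoed(d):
        # Drop held-out anchors too close in time to any in-sample anchor
        return any(abs((d - td).days) <= embargo_days for td in train_dates)

    test = [s for s in samples
            if s["symbol"] not in train_symbols and not embargoed(s["date"])]
    return train, test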

A working honest backtest

Here’s a minimal but honest backtest of a chart pattern. Anchor: NVDA on 2024-08-05 on the 1h timeframe. Find the cohort, look at what those analogs did over the next 5 days, and score against forward returns.

import requests

API = "https://chartlibrary.io/api/v1"

# 1. Get cohort with proper hygiene applied
resp = requests.post(f"{API}/cohort_analyze", json={
    "anchor": {"symbol": "NVDA", "date": "2024-08-05", "timeframe": "1h"},
    "cohort_size": 300,
    "horizons": [1, 5, 10],
    "options": {
        "include_cohort_anchors": True,  # return the actual analogs
        "exclude_same_symbol_days": 10,  # cohort hygiene (default)
    },
})
cohort = resp.json()

# 2. Inspect the distribution (p10/p90 are conformal-corrected)
out_5d = cohort["outcome_distribution"]["5"]
print(f"Cohort n: {out_5d['n']}")
print(f"5d median: {out_5d['median']:.2f}%")
print(f"5d p10/p90 (calibrated): {out_5d['p10']:.2f}% / {out_5d['p90']:.2f}%")
print(f"5d win rate: {out_5d['win_rate']:.0%}")

# 3. Inspect feature importance — what separated cohort winners from losers?
for f in cohort["feature_importance"]["5"][:5]:
    sign = "+" if f["direction"] == "positive" else "-"
    print(f"  {sign} {f['feature']} (importance {f['importance']:.2f})")

# 4. The cohort anchors themselves — auditable
for analog in cohort["cohort_anchors"][:5]:
    print(f"  {analog['symbol']} on {analog['date']}: 5d return = {analog['ret_5d']}%")

The output for our example anchor (real numbers from the live API):

Cohort n: 285
5d median: -1.30%
5d p10/p90 (calibrated): -11.30% / +6.80%
5d win rate: 44%
  + credit_spread_state=tight (importance 0.31)
  + macro_state=bullish (importance 0.27)
  - vol_regime=low (importance 0.22)

  PFE on 2019-03-12: 5d return = +2.4%
  RIO on 2022-08-08: 5d return = -3.1%
  AMD on 2017-04-14: 5d return = -8.9%
  ...

You see the cohort, the distribution, the features, and the actual analogs. You can audit any one of them. No survivorship bias (delisted symbols are in the index). No train/test leakage (the embedding model was trained on symbol-disjoint data; this query happens at inference time on held-out anchors).
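
To close the loop on “score against forward returns,” compare the cohort’s prediction with what the anchor actually did. This continues the script above and uses the same forward-returns endpoint that appears in the strategy loop below:

# 5. Score the prediction against the anchor's realized 5d return
realized = requests.get(f"{API}/forward-returns/NVDA/2024-08-05").json()
if realized and realized.get("ret_5d") is not None:
    predicted, actual = out_5d["median"], realized["ret_5d"]
    direction_hit = (predicted > 0) == (actual > 0)
    inside_band = out_5d["p10"] <= actual <= out_5d["p90"]
    print(f"Predicted median {predicted:+.2f}% vs realized {actual:+.2f}%")
    print(f"Direction hit: {direction_hit} | inside p10/p90 band: {inside_band}")

Repeated over many held-out anchors, direction_hit aggregates into a win rate, and inside_band should land near the nominal 80% if the bands are honestly calibrated.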

If you want to backtest a strategy, not just a pattern

The cohort_analyze response gives you the empirical conditional distribution for one anchor. To backtest a strategy that picks one or more anchors per day:

import datetime
import requests

API = "https://chartlibrary.io/api/v1"

equity = 10000.0
trades = []

# Walk forward through 60 calendar days, skipping weekends
start = datetime.date(2026, 3, 1)
for i in range(60):
    d = start + datetime.timedelta(days=i)
    if d.weekday() >= 5: continue  # skip weekends

    # Get top picks for that day (mirrors what an agent would do)
    setups = requests.get(f"{API}/agent/setups", params={"top": 3, "timeframe": "1d"}).json()

    if not setups: continue

    # Equal-weight across up to 3 picks
    daily_returns = []
    for s in setups[:3]:
        sym = s["symbol"]
        # Look up actual 5d return from forward_returns_cache
        ret = requests.get(f"{API}/forward-returns/{sym}/{d.isoformat()}").json()
        if ret and ret.get("ret_5d") is not None:
            daily_returns.append(ret["ret_5d"])

    if daily_returns:
        portfolio_return = sum(daily_returns) / len(daily_returns) / 100
        equity *= (1 + portfolio_return)
        trades.append((d, equity))

print(f"Final equity: ${equity:.2f}")  # vs $10,000 start

For larger backtests with proper benchmarks, our offline backtest harness (scripts/research/agent_cohort_backtest.py in the open-source MCP repo) runs five strategies (SPY hold, random, K=10 picker, cohort full-distribution, cohort + LLM memory) over N days and reports total return, Sharpe, hit rate, and max drawdown. Email graham@chartlibrary.io if you want the harness.

The numbers that change when you do this honestly

When papers and products claim “X% win rate on chart patterns,” the leakage typically inflates the number by 5-30 percentage points. Specific things we’ve seen on internal experiments:

  • Date-only split + survivors-only data: claimed 78% win rate
  • Symbol-disjoint + 10d embargo + survivors-only: 64% (14 pp drop)
  • Symbol-disjoint + 10d embargo + delisted included: 56% (8 more pp drop, ~22 pp from the original)

That ~22 pp gap is the size of the lie that naive evaluation tells. Models that look great in a paper often look mediocre when evaluated honestly. We’d rather know.

Frequently asked questions

Does Chart Library have survivorship bias?
No. The 19,000+ symbol universe includes ~7,000 delisted tickers (bankruptcies, mergers, voluntary delistings). Forward returns for delisted symbols use the last trading price.

What other biases should I watch for?
Look-ahead bias (using future information that wasn't available at decision time), regime overfitting (the training period sat in a different volatility regime than the test period), and cherry-picking the test window. Symbol-disjoint splits + a fixed test universe + held-out time periods address most of these.

Can I trust academic papers' backtests?
Mixed. Top-tier finance journals (JFE, RFS) generally enforce honest splits. Quant blogs and practitioner papers often don't. The tells: no mention of survivorship, no embargo, no out-of-sample period, a claimed Sharpe > 3 on a multi-decade backtest.

Why do the calibration bands matter for backtesting?
If you size positions by claimed uncertainty (e.g., risk parity, Kelly fraction) and the uncertainty estimates are wrong, your sizing is wrong. Calibrated bands mean Kelly-like sizing actually achieves its theoretical risk profile. See our calibrated stock forecasting page.
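
As a concrete example of why that matters, here's a rough half-Kelly sizing sketch driven by the cohort distribution from earlier. It uses the p10/p90 band edges as crude loss/win proxies (a real implementation would use the full distribution), so any miscalibration flows straight into the position size:

# Rough half-Kelly sizing from the 5d cohort distribution (illustrative only)
p = out_5d["win_rate"]                  # e.g. 0.44
avg_win = abs(out_5d["p90"])            # crude payoff proxies from the band edges
avg_loss = abs(out_5d["p10"])
b = avg_win / avg_loss                  # win/loss ratio
kelly = p - (1 - p) / b                 # classic Kelly fraction
position = max(0.0, 0.5 * kelly)        # half-Kelly, clamped at zero when there's no edge
print(f"Suggested position fraction: {position:.1%}")
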
Try it

Run an honest cohort backtest now.

cohort_analyze on the API is the production version. Free Sandbox tier (200 calls/day, no auth). Cohort hygiene + calibration baked in.
