
Backtest a chart pattern honestly — without survivorship bias or leakage.

If you’ve ever seen a chart pattern backtest claim 70%+ win rates, the methodology was almost certainly broken. The two most common failure modes — survivorship bias and train/test leakage — silently inflate results by 5 to 30 percentage points and almost no retail backtesting tool defends against either.

This guide covers what survivorship bias is, why date-based splits leak, and working code for a backtest that avoids both.

Failure mode 1: Survivorship bias

Most retail data feeds (yfinance, Alpha Vantage’s free tier, every “current S&P 500” CSV on Kaggle) only contain currently-trading symbols. Symbols that went bankrupt, got delisted, or merged are silently absent.

When you backtest on this data, you’re testing “patterns from companies that survived the period.” That’s a population conditioned on survival — by construction biased toward winners. A pattern strategy backtested on survivors-only data will show ~5 to 15 percentage points higher win rate than it would on the full population.
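
To see the mechanism, here's a toy simulation. All numbers are synthetic and only illustrate the direction of the bias, not its real-world size:

import random

random.seed(0)

# Synthetic universe: each symbol has a true per-trade win probability,
# and symbols with worse odds are more likely to end up delisted.
universe = [{"p_win": random.uniform(0.3, 0.7)} for _ in range(5000)]
for sym in universe:
    sym["delisted"] = random.random() < (0.8 - sym["p_win"])

def realized_win_rate(symbols, trades_per_symbol=20):
    wins, trials = 0, 0
    for sym in symbols:
        for _ in range(trades_per_symbol):
            wins += random.random() < sym["p_win"]
            trials += 1
    return wins / trials

survivors = [s for s in universe if not s["delisted"]]
print(f"Full universe win rate:  {realized_win_rate(universe):.1%}")
print(f"Survivors-only win rate: {realized_win_rate(survivors):.1%}")
# The survivors-only figure is higher purely because the losers were filtered out.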

The fix: include delisted symbols. They’re what survivorship bias systematically excludes. Forward returns for delisted symbols use the last trading price (occasionally pessimistic, never optimistic).

# Wrong: yfinance default — survivors-only
import yfinance as yf
data = yf.download(['AAPL', 'NVDA'] + current_sp500)  # missing GE-old, K-old, BBBY, ...

# Better: include delisted (Polygon, Norgate, EODHistoricalData)
import polygon
client = polygon.RESTClient(...)
all_tickers = list_all_tickers_including_delisted(client)  # placeholder helper: wrap the provider's ticker listing, including inactive symbols
# As of 2026-05: ~19,000 US equity symbols, of which ~7,000 are delisted
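
The last-trading-price rule is easy to implement once your price history keeps the final prints before a delisting. A minimal sketch, where prices is assumed to be a plain date-to-close dict rather than any specific provider's format:

def forward_return_pct(prices, entry_date, horizon_bars=5):
    # prices: {datetime.date: close}, including the last prints before a delisting
    dates = sorted(d for d in prices if d >= entry_date)
    if not dates:
        return None
    entry = prices[dates[0]]
    # Exit at `horizon_bars` bars out, or at the final bar if the symbol stopped trading
    exit_price = prices[dates[min(horizon_bars, len(dates) - 1)]]
    return (exit_price / entry - 1) * 100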

Failure mode 2: Train/test leakage via date splits

The standard ML practice — train on 2016-2024, test on 2025 — is silently broken in finance because the same symbols appear on both sides. A model that has seen NVDA 2,500 times during training and is now being tested on NVDA isn’t being evaluated. It’s being asked to recall.

The fix is symbol-disjoint splits: hold out entire tickers, not date ranges. Train on 80% of symbols across all years; evaluate on the remaining 20% of symbols. The evaluation symbols never appear in training, period.

Add a 10-day embargo window on top: even within a held-out symbol, don’t use a date that falls within 10 trading days of an in-sample date, because adjacent days are too autocorrelated.

That’s the full discipline: symbol-disjoint evaluation, plus the embargo.
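
A minimal sketch of that split, assuming your samples are sparse pattern anchors (dicts with 'symbol' and 'date' keys) and approximating the 10-trading-day embargo with calendar days:

import random

def symbol_disjoint_split(samples, train_frac=0.8, embargo_days=10, seed=42):
    # samples: list of {"symbol": str, "date": datetime.date, ...}
    symbols = sorted({s["symbol"] for s in samples})
    random.Random(seed).shuffle(symbols)
    train_symbols = set(symbols[:int(len(symbols) * train_frac)])

    train = [s for s in samples if s["symbol"] in train_symbols]
    train_dates = {s["date"] for s in train}

    def embargoed(d):
        # Drop held-out anchors too close in time to any in-sample anchor
        return any(abs((d - td).days) <= embargo_days for td in train_dates)

    test = [s for s in samples
            if s["symbol"] not in train_symbols and not embargoed(s["date"])]
    return train, test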

A working honest backtest

Here’s a minimal but honest backtest of a chart pattern. Anchor: NVDA on 2024-08-05 on the 1h timeframe. Find the cohort, look at what those analogs did over the next 5 days, and score against forward returns.

import requests

API = "https://chartlibrary.io/api/v1"

# 1. Get cohort with proper hygiene applied
resp = requests.post(f"{API}/cohort_analyze", json={
    "anchor": {"symbol": "NVDA", "date": "2024-08-05", "timeframe": "1h"},
    "cohort_size": 300,
    "horizons": [1, 5, 10],
    "options": {
        "include_cohort_anchors": True,  # return the actual analogs
        "exclude_same_symbol_days": 10,  # cohort hygiene (default)
    },
})
cohort = resp.json()

# 2. Inspect the distribution (p10/p90 are conformal-corrected)
out_5d = cohort["outcome_distribution"]["5"]
print(f"Cohort n: {out_5d['n']}")
print(f"5d median: {out_5d['median']:.2f}%")
print(f"5d p10/p90 (calibrated): {out_5d['p10']:.2f}% / {out_5d['p90']:.2f}%")
print(f"5d win rate: {out_5d['win_rate']:.0%}")

# 3. Inspect feature importance — what separated cohort winners from losers?
for f in cohort["feature_importance"]["5"][:5]:
    sign = "+" if f["direction"] == "positive" else "-"
    print(f"  {sign} {f['feature']} (importance {f['importance']:.2f})")

# 4. The cohort anchors themselves — auditable
for analog in cohort["cohort_anchors"][:5]:
    print(f"  {analog['symbol']} on {analog['date']}: 5d return = {analog['ret_5d']}%")

The output for our example anchor (real numbers from the live API):

Cohort n: 285
5d median: -1.30%
5d p10/p90 (calibrated): -11.30% / +6.80%
5d win rate: 44%
  + credit_spread_state=tight (importance 0.31)
  + macro_state=bullish (importance 0.27)
  - vol_regime=low (importance 0.22)

  PFE on 2019-03-12: 5d return = +2.4%
  RIO on 2022-08-08: 5d return = -3.1%
  AMD on 2017-04-14: 5d return = -8.9%
  ...

You see the cohort, the distribution, the features, and the actual analogs. You can audit any one of them. No survivorship bias (delisted symbols are in the index). No train/test leakage (the embedding model was trained on symbol-disjoint data; this query happens at inference time on held-out anchors).
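
To close the loop on “score against forward returns,” compare the cohort’s prediction with what the anchor actually did. This continues the script above and uses the same forward-returns endpoint that appears in the strategy loop below:

# 5. Score the prediction against the anchor's realized 5d return
realized = requests.get(f"{API}/forward-returns/NVDA/2024-08-05").json()
if realized and realized.get("ret_5d") is not None:
    predicted, actual = out_5d["median"], realized["ret_5d"]
    direction_hit = (predicted > 0) == (actual > 0)
    inside_band = out_5d["p10"] <= actual <= out_5d["p90"]
    print(f"Predicted median {predicted:+.2f}% vs realized {actual:+.2f}%")
    print(f"Direction hit: {direction_hit} | inside p10/p90 band: {inside_band}")

Repeated over many held-out anchors, direction_hit aggregates into a win rate, and inside_band should land near the nominal 80% if the bands are honestly calibrated.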

If you want to backtest a strategy, not just a pattern

The cohort_analyze response gives you the empirical conditional distribution for one anchor. To backtest a strategy that picks one or more anchors per day:

import datetime
import requests

API = "https://chartlibrary.io/api/v1"

equity = 10000.0
trades = []

# Walk forward through 60 calendar days, skipping weekends
start = datetime.date(2026, 3, 1)
for i in range(60):
    d = start + datetime.timedelta(days=i)
    if d.weekday() >= 5: continue  # skip weekends

    # Get top picks for that day (mirrors what an agent would do)
    setups = requests.get(f"{API}/agent/setups", params={"top": 3, "timeframe": "1d"}).json()

    if not setups: continue

    # Equal-weight across up to 3 picks
    daily_returns = []
    for s in setups[:3]:
        sym = s["symbol"]
        # Look up actual 5d return from forward_returns_cache
        ret = requests.get(f"{API}/forward-returns/{sym}/{d.isoformat()}").json()
        if ret and ret.get("ret_5d") is not None:
            daily_returns.append(ret["ret_5d"])

    if daily_returns:
        portfolio_return = sum(daily_returns) / len(daily_returns) / 100
        equity *= (1 + portfolio_return)
        trades.append((d, equity))

print(f"Final equity: ${equity:.2f}")  # vs $10,000 start

For larger backtests with proper benchmarks, our offline backtest harness (scripts/research/agent_cohort_backtest.py in the open-source MCP repo) runs five strategies (SPY hold, random, K=10 picker, cohort full-distribution, cohort + LLM memory) over N days and reports total return, Sharpe, hit rate, and max drawdown. Email graham@chartlibrary.io if you want the harness.

The numbers that change when you do this honestly

When papers and products claim “X% win rate on chart patterns,” the leakage typically inflates the number by 5-30 percentage points. Specific things we’ve seen on internal experiments:

  • Date-only split + survivors-only data: claimed 78% win rate
  • Symbol-disjoint + 10d embargo + survivors-only: 64% (14 pp drop)
  • Symbol-disjoint + 10d embargo + delisted included: 56% (8 more pp drop, ~22 pp from the original)

That ~22 pp gap is the size of the lie that naive evaluation tells. Models that look great in a paper often look mediocre when evaluated honestly. We’d rather know.

Frequently asked questions

Does Chart Library have survivorship bias?
No. The 19,000+ symbol universe includes ~7,000 delisted tickers (bankruptcies, mergers, voluntary delistings). Forward returns for delisted symbols use the last trading price.

What other biases should I watch for?
Look-ahead bias (using future information that wasn't available at decision time), regime overfitting (the training period sat in a different volatility regime than the test period), and cherry-picking the test window. Symbol-disjoint splits + a fixed test universe + held-out time periods address most of these.

Can I trust academic papers' backtests?
Mixed. Top-tier finance journals (JFE, RFS) generally enforce honest splits. Quant blogs and practitioner papers often don't. The tells: no mention of survivorship, no embargo, no out-of-sample period, a claimed Sharpe > 3 on a multi-decade backtest.

Why do the calibration bands matter for backtesting?
If you size positions by claimed uncertainty (e.g., risk parity, Kelly fraction) and the uncertainty estimates are wrong, your sizing is wrong. Calibrated bands mean Kelly-like sizing actually achieves its theoretical risk profile. See our calibrated stock forecasting page.
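
As a concrete example of why that matters, here's a rough half-Kelly sizing sketch driven by the cohort distribution from earlier. It uses the p10/p90 band edges as crude loss/win proxies (a real implementation would use the full distribution), so any miscalibration flows straight into the position size:

# Rough half-Kelly sizing from the 5d cohort distribution (illustrative only)
p = out_5d["win_rate"]                  # e.g. 0.44
avg_win = abs(out_5d["p90"])            # crude payoff proxies from the band edges
avg_loss = abs(out_5d["p10"])
b = avg_win / avg_loss                  # win/loss ratio
kelly = p - (1 - p) / b                 # classic Kelly fraction
position = max(0.0, 0.5 * kelly)        # half-Kelly, clamped at zero when there's no edge
print(f"Suggested position fraction: {position:.1%}")
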
Try it

Run an honest cohort backtest now.

cohort_analyze on the API is the production version. Free Sandbox tier (200 calls/day, no auth). Cohort hygiene + calibration baked in.
