Backtest a chart pattern honestly — without survivorship bias or leakage.
If you’ve ever seen a chart pattern backtest claim a 70%+ win rate, the methodology was almost certainly broken. The two most common failure modes — survivorship bias and train/test leakage — silently inflate results by 5 to 30 percentage points, and almost no retail backtesting tool defends against either.
This guide covers what survivorship bias is, why date-based splits leak, and working code for a backtest that doesn’t.
Failure mode 1: Survivorship bias
Most retail data feeds (yfinance, Alpha Vantage’s free tier, every “current S&P 500” CSV on Kaggle) only contain currently-trading symbols. Symbols that went bankrupt, got delisted, or merged are silently absent.
When you backtest on this data, you’re testing “patterns from companies that survived the period.” That’s a population conditioned on survival — by construction biased toward winners. A pattern strategy backtested on survivors-only data will show ~5 to 15 percentage points higher win rate than it would on the full population.
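The mechanism is easy to demonstrate with a toy simulation (illustrative numbers, not our data): give every symbol a noisy 5-day return, drop the worst performers the way a survivors-only feed does, and measure the win rate both ways.
import random
random.seed(0)
# 5,000 hypothetical symbols; each "pattern trade" has a 5d return in percent
returns = [random.gauss(-1, 5) for _ in range(5000)]
# Survivors-only feed: the worst performers delist and vanish from the data
survivors = [r for r in returns if r > -6]
full_win = sum(r > 0 for r in returns) / len(returns)
surv_win = sum(r > 0 for r in survivors) / len(survivors)
print(f"Full population win rate: {full_win:.0%}")   # ~42%
print(f"Survivors-only win rate:  {surv_win:.0%}")   # ~50%: inflated purely by conditioning on survival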
The fix: include delisted symbols. They’re what survivorship bias systematically excludes. Forward returns for delisted symbols use the last trading price (occasionally pessimistic, never optimistic).
# Wrong: yfinance default — survivors-only
import yfinance as yf
data = yf.download(['AAPL', 'NVDA', *current_sp500])  # missing GE-old, K-old, BBBY, ...

# Better: include delisted (Polygon, Norgate, EODHistoricalData)
import polygon
client = polygon.RESTClient(...)
# placeholder helper: e.g. page through the reference tickers endpoint with active=False
all_tickers = list_all_tickers_including_delisted(client)
# As of 2026-05: ~19,000 US equity symbols, of which ~7,000 are delisted

Failure mode 2: Train/test leakage via date splits
The standard ML practice — train on 2016-2024, test on 2025 — is silently broken in finance because the same symbols appear on both sides. A model that has seen NVDA 2,500 times during training and is now being tested on NVDA isn’t being evaluated. It’s being asked to recall.
The fix is symbol-disjoint splits: hold out entire tickers, not date ranges. Train on 80% of symbols across all years; evaluate on the remaining 20% of symbols. The evaluation symbols never appear in training, period.
Plus a 10-day embargo window: even within a held-out symbol, don’t use a date that’s within 10 trading days of an in-sample date. Adjacent days are too autocorrelated.
Together, that’s the full discipline: symbol-disjoint evaluation plus a time embargo.
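Here is a minimal sketch of that split, assuming your samples are sparse (symbol, date) anchors in a pandas DataFrame; the column names and the calendar-day embargo (rather than trading days) are simplifications:
import pandas as pd

def symbol_disjoint_split(anchors: pd.DataFrame, test_frac: float = 0.2,
                          embargo_days: int = 10, seed: int = 42):
    """Hold out whole symbols; then embargo held-out anchors that fall
    too close in time to any in-sample anchor date."""
    symbols = pd.Series(anchors["symbol"].unique())
    test_syms = set(symbols.sample(frac=test_frac, random_state=seed))
    train = anchors[~anchors["symbol"].isin(test_syms)]
    test = anchors[anchors["symbol"].isin(test_syms)]

    # Calendar-day embargo (a simplification of 10 *trading* days)
    train_dates = pd.DatetimeIndex(pd.to_datetime(train["date"]).unique()).sort_values()
    def embargoed(d):
        i = train_dates.searchsorted(d)
        nearby = train_dates[max(0, i - 1):i + 1]  # nearest in-sample dates before/after
        return any(abs((d - t).days) <= embargo_days for t in nearby)

    test_dates = pd.to_datetime(test["date"])
    return train, test[~test_dates.map(embargoed)]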
A working honest backtest
Here’s a minimal but honest backtest of a chart pattern. Anchor: NVDA on 2024-08-05, 1h timeframe. Find the cohort, look at what those analogs did over the next 5 days, and score against forward returns.
import requests

API = "https://chartlibrary.io/api/v1"

# 1. Get cohort with proper hygiene applied
resp = requests.post(f"{API}/cohort_analyze", json={
    "anchor": {"symbol": "NVDA", "date": "2024-08-05", "timeframe": "1h"},
    "cohort_size": 300,
    "horizons": [1, 5, 10],
    "options": {
        "include_cohort_anchors": True,   # return the actual analogs
        "exclude_same_symbol_days": 10,   # cohort hygiene (default)
    },
})
cohort = resp.json()

# 2. Inspect the distribution (p10/p90 are conformal-corrected)
out_5d = cohort["outcome_distribution"]["5"]
print(f"Cohort n: {out_5d['n']}")
print(f"5d median: {out_5d['median']:.2f}%")
print(f"5d p10/p90 (calibrated): {out_5d['p10']:.2f}% / {out_5d['p90']:.2f}%")
print(f"5d win rate: {out_5d['win_rate']:.0%}")

# 3. Inspect feature importance — what separated cohort winners from losers?
for f in cohort["feature_importance"]["5"][:5]:
    sign = "+" if f["direction"] == "positive" else "-"
    print(f"  {sign} {f['feature']} (importance {f['importance']:.2f})")

# 4. The cohort anchors themselves — auditable
for analog in cohort["cohort_anchors"][:5]:
    print(f"  {analog['symbol']} on {analog['date']}: 5d return = {analog['ret_5d']}%")

The output for our example anchor (real numbers from the live API):
Cohort n: 285
5d median: -1.30%
5d p10/p90 (calibrated): -11.30% / +6.80%
5d win rate: 44%
  + credit_spread_state=tight (importance 0.31)
  + macro_state=bullish (importance 0.27)
  - vol_regime=low (importance 0.22)
  PFE on 2019-03-12: 5d return = +2.4%
  RIO on 2022-08-08: 5d return = -3.1%
  AMD on 2017-04-14: 5d return = -8.9%
  ...

You see the cohort, the distribution, the features, and the actual analogs. You can audit any one of them. No survivorship bias (delisted symbols are in the index). No train/test leakage (the embedding model was trained on symbol-disjoint data; this query happens at inference time on held-out anchors).
If you want to backtest a strategy, not just a pattern
The cohort_analyze response gives you the empirical conditional distribution for one anchor. To backtest a strategy that picks one or more anchors per day:
import datetime
import requests

API = "https://chartlibrary.io/api/v1"
equity = 10000.0
trades = []

# Walk forward over 60 calendar days, skipping weekends (~42 trading days)
start = datetime.date(2026, 3, 1)
for i in range(60):
    d = start + datetime.timedelta(days=i)
    if d.weekday() >= 5:
        continue  # skip weekends

    # Get top picks for that day (mirrors what an agent would do)
    setups = requests.get(f"{API}/agent/setups", params={"top": 3, "timeframe": "1d"}).json()
    if not setups:
        continue

    # Equal-weight across up to 3 picks
    daily_returns = []
    for s in setups[:3]:
        sym = s["symbol"]
        # Look up actual 5d return from forward_returns_cache
        ret = requests.get(f"{API}/forward-returns/{sym}/{d.isoformat()}").json()
        if ret and ret.get("ret_5d") is not None:
            daily_returns.append(ret["ret_5d"])

    if daily_returns:
        # Note: compounding a 5d return every day overlaps holding periods;
        # a deliberate simplification in this minimal walk-forward loop
        portfolio_return = sum(daily_returns) / len(daily_returns) / 100
        equity *= (1 + portfolio_return)
        trades.append((d, equity))

print(f"Final equity: ${equity:.2f}")  # vs $10,000 start

For larger backtests with proper benchmarks, our offline backtest harness (scripts/research/agent_cohort_backtest.py in the open-source MCP repo) runs five strategies (SPY hold, random, K=10 picker, cohort full-distribution, cohort + LLM memory) over N days and reports total return, Sharpe, hit rate, and max drawdown. Email graham@chartlibrary.io if you want the harness.
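If you just want those headline stats from the minimal loop above rather than the full harness, here is a back-of-envelope version that computes the same metrics from the (date, equity) curve the loop builds:
import statistics

# Continues from the loop above: `trades` is a list of (date, equity) tuples
equities = [e for _, e in trades]
rets = [b / a - 1 for a, b in zip(equities, equities[1:])]

total_return = equities[-1] / 10000.0 - 1
hit_rate = sum(r > 0 for r in rets) / len(rets)
sharpe = statistics.mean(rets) / statistics.stdev(rets) * 252 ** 0.5  # annualized daily Sharpe

peak, max_dd = equities[0], 0.0
for e in equities:
    peak = max(peak, e)                 # running equity high-water mark
    max_dd = max(max_dd, 1 - e / peak)  # worst peak-to-trough drop

print(f"Total return: {total_return:+.1%}  Sharpe: {sharpe:.2f}  "
      f"Hit rate: {hit_rate:.0%}  Max drawdown: {max_dd:.1%}")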
The numbers that change when you do this honestly
When papers and products claim “X% win rate on chart patterns,” broken methodology (survivorship bias, leakage, or both) typically inflates the number by 5 to 30 percentage points. Specific things we’ve seen on internal experiments:
- Date-only split + survivors-only data: claimed 78% win rate
- Symbol-disjoint + 10d embargo + survivors-only: 64% (14 pp drop)
- Symbol-disjoint + 10d embargo + delisted included: 56% (8 more pp drop, ~22 pp from the original)
The 22 pp gap is the size of the lie that naive evaluation tells. Models that look great in a paper often look mediocre when evaluated honestly. We’d rather know.
Frequently asked questions
- Does Chart Library have survivorship bias?
- No. The 19,000+ symbol universe includes ~7,000 delisted tickers (bankruptcies, mergers, voluntary delistings). Forward returns for delisted symbols use the last trading price.
- What other biases should I watch for?
- Look-ahead bias (using future information that wasn't available at decision time), regime overfit (training period had different vol than test), and cherry-picking the test window. Symbol-disjoint splits + a fixed test universe + held-out time periods address most of these.
- Can I trust academic papers' backtests?
- Mixed. Top-tier finance journals (JFE, RFS) generally enforce honest splits. Quant blogs and practitioner papers often don't. The tells: no mention of survivorship, no embargo, no out-of-sample period, claimed Sharpe > 3 on a multi-decade backtest.
- Why do the calibration bands matter for backtesting?
- If you size positions by claimed uncertainty (e.g., risk parity, Kelly fraction) and the uncertainty estimates are wrong, your sizing is wrong. Calibrated bands mean Kelly-like sizing actually achieves its theoretical risk profile; a sizing sketch follows this list. See our calibrated stock forecasting page.
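To make that concrete, here is an illustrative (not prescriptive) fractional-Kelly sketch that treats the calibrated p10/p90 band as a volatility estimate under a rough normal approximation; the constants and the quarter-Kelly cap are assumptions:
# Illustrative only: sizing from calibrated quantiles, reusing the cohort output above
median, p10, p90 = -1.30, -11.30, 6.80        # percent, from out_5d

mu = median / 100
sigma = (p90 - p10) / 100 / 2.563             # p90 - p10 ≈ 2.563σ for a normal
kelly = mu / sigma ** 2                       # continuous-Kelly approximation
position = max(-1.0, min(1.0, 0.25 * kelly))  # quarter-Kelly, capped at ±100%

print(f"mu={mu:+.3f} sigma={sigma:.3f} kelly={kelly:+.2f} -> size {position:+.2f}")
# Negative size means the cohort tilts against going long. If the bands were
# overconfident (too narrow), sigma would be understated and kelly oversized,
# which is exactly why calibration matters for sizing.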
Run an honest cohort backtest now.
cohort_analyze on the API is the production version. Free Sandbox tier (200 calls/day, no auth). Cohort hygiene + calibration baked in.