Symbol-disjoint evaluation — the eval discipline that prevents leakage in financial ML.
The standard ML practice of “train on data before 2024-01-01, test on data after” is silently broken in finance. The problem is that the same symbols appear on both sides of the split. A model that has seen NVDA 2,500 times during training and is now being evaluated on NVDA isn’t being evaluated — it’s being asked to recall, with a slight noise perturbation.
Symbol-disjoint evaluation is the fix. Hold out entire tickers, not date ranges. Combined with embargo windows that prevent autocorrelation leakage, it’s the only honest way to evaluate a model that learns patterns-by-symbol.
The leakage you don't see in date-based splits
Suppose you have 10 years of daily bars on 19,000 symbols and you do the standard split: train on 2016-2024, test on 2025. What your model sees during training:
- NVDA price patterns 2016, 2017, 2018, … 2024 (~2,000 bars)
- AAPL price patterns 2016 … 2024 (~2,000 bars)
- … and so on for all 19,000 symbols
What it sees during evaluation:
- NVDA price patterns 2025 (~250 bars)
- AAPL price patterns 2025 (~250 bars)
- … same 19,000 symbols
The model has learned per-symbol idiosyncrasies: NVDA’s typical mean reversion, AAPL’s response to earnings, TSLA’s volatility regimes. When you evaluate on the same symbols in 2025, you’re measuring how well the model recalls those idiosyncrasies, not how well it generalizes to genuinely new patterns.
For a pattern-similarity model, this is even worse. The embedding for NVDA on 2025-03-15 will be near the embedding for NVDA on 2024-12-20 simply because both days share NVDA’s general regime, sector, and shape conventions. The cohort retrieval finds NVDA’s own past as the closest analog, and the “forward return” lookup is effectively asking “what did NVDA do after this point?” — using NVDA’s own future as the ground truth.
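To see the mechanism concretely, here is a minimal sketch of naive cosine-similarity retrieval with no same-symbol guard (illustrative names, not Chart Library's actual API). On a date-based split, the self-share it reports is often close to 1.0:

import numpy as np

def naive_cohort(query_idx, emb, symbols, k=50):
    """Top-k nearest anchors by cosine similarity, no same-symbol guard.

    emb: (N, d) anchor embeddings; symbols: length-N numpy array of tickers.
    """
    q = emb[query_idx]
    sims = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q) + 1e-12)
    sims[query_idx] = -np.inf  # exclude the query anchor itself
    top = np.argsort(-sims)[:k]
    # Fraction of the cohort that is the query's own symbol; on a leaky
    # date split this is often near 1.0, dominated by adjacent days.
    self_share = float((symbols[top] == symbols[query_idx]).mean())
    return top, self_share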
What symbol-disjoint splits look like
Hold out symbols, not dates. Train on 80% of all 19,000 symbols across all years. Evaluate on the remaining 20% of symbols. The evaluation symbols never appear in training, period.
For Chart Library:
import random

# Deterministic, sector-stratified symbol split.
# Assumes `dataset` is a pandas DataFrame with symbol, date, and sector columns.
rng = random.Random(42)  # fixed seed so the split is reproducible

train_symbols, test_symbols = [], []
# Stratify by sector to keep test-set sectoral balance
for sector, group in dataset.groupby("sector"):
    syms = sorted(group["symbol"].unique())
    rng.shuffle(syms)  # seeded shuffle; without it the seed does nothing
    cut = int(len(syms) * 0.8)
    train_symbols.extend(syms[:cut])
    test_symbols.extend(syms[cut:])

train = dataset[dataset.symbol.isin(train_symbols)]
test = dataset[dataset.symbol.isin(test_symbols)]
assert len(set(train.symbol) & set(test.symbol)) == 0  # disjoint

Now when we evaluate, the embeddings for test-set symbols are genuinely held-out. The retrieval finds analogs from the training set (other symbols, real history), and the forward returns are ground truth that the model couldn't have memorized.
The embargo window — why it matters
Symbol-disjoint splits solve cross-symbol leakage. There's a second leak channel even within the same symbol: temporal autocorrelation. Adjacent trading days for the same symbol share most of their price path. NVDA on 2024-08-05 and NVDA on 2024-08-06 are nearly the same chart. Treating both as independent samples overstates the effective sample size of the evaluation.
The fix: an embargo window. After any in-sample date for a symbol, the next N days for that symbol are excluded from out-of-sample evaluation. We use N=10 trading days. So a held-out anchor for symbol X must be at least 10 trading days from any in-sample date for symbol X.
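A minimal sketch of that filter, assuming each bar carries a per-symbol trading-day index column t (its position in that symbol's trading calendar); the frame layout and function name are our own:

import numpy as np
import pandas as pd

def embargo_filter(candidates: pd.DataFrame, in_sample: pd.DataFrame,
                   embargo: int = 10) -> pd.DataFrame:
    """Keep held-out anchors at least `embargo` trading days away from
    every in-sample bar of the same symbol."""
    kept = []
    for symbol, group in candidates.groupby("symbol"):
        ts = in_sample.loc[in_sample["symbol"] == symbol, "t"].to_numpy()
        if ts.size == 0:  # symbol never appears in-sample: keep everything
            kept.append(group)
            continue
        # Distance from each candidate bar to its nearest in-sample bar
        dist = np.abs(group["t"].to_numpy()[:, None] - ts[None, :]).min(axis=1)
        kept.append(group[dist > embargo])
    return pd.concat(kept) if kept else candidates.iloc[0:0]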
In Chart Library’s case, this is enforced at retrieval time too: when we pull the cohort of historical analogs to a query anchor, we exclude same-symbol matches within ±10 calendar days. Without this guard, every cohort would secretly include the query symbol’s adjacent days as “analogs,” producing a meaninglessly tight cohort that’s almost-the-same-anchor.
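At retrieval time the guard reduces to a mask over the candidate cohort. A sketch under the same assumptions (our names, pandas datetime columns):

import pandas as pd

def guard_cohort(cohort: pd.DataFrame, query_symbol: str,
                 query_date: pd.Timestamp, window_days: int = 10) -> pd.DataFrame:
    """Drop same-symbol analogs within ±window_days calendar days of the query."""
    same = cohort["symbol"] == query_symbol
    near = (cohort["date"] - query_date).abs() <= pd.Timedelta(days=window_days)
    return cohort[~(same & near)]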
The numbers that change when you do this honestly
When papers and products claim “our model predicts stock returns at X% accuracy,” the leakage from naive splits typically inflates that number by 5-30 percentage points. We’ve seen this on multiple internal experiments:
- Date-only split, no embargo: nominal 80% calibration band covers ~85% of held-out outcomes (looks better than nominal, but the surplus coverage is memorization, not calibration).
- Symbol-disjoint, 10-day embargo: same nominal 80% band covers ~68% of outcomes (under-covers — model is over-confident on genuinely new symbols).
- After conformal correction on the disjoint split: 80% band covers 82.5% of outcomes. Calibrated.
The 17-percentage-point difference between the leaky split and the honest split is the size of the lie that naive evaluation tells. Models that look great in a paper often look mediocre when evaluated honestly. We’d rather know.
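For concreteness: coverage is the fraction of held-out outcomes landing inside the nominal band, and the conformal step is the standard split-conformal widening. A minimal sketch of both, not the exact production procedure:

import numpy as np

def coverage(y, lo, hi):
    """Empirical coverage: fraction of outcomes inside [lo, hi]."""
    y, lo, hi = map(np.asarray, (y, lo, hi))
    return float(np.mean((y >= lo) & (y <= hi)))

def conformal_margin(y_cal, lo_cal, hi_cal, alpha=0.2):
    """Split-conformal correction for a nominal (1 - alpha) band.

    Returns the margin q to add on both sides of future intervals:
    positive q widens an over-confident band, negative q tightens it.
    """
    y_cal, lo_cal, hi_cal = map(np.asarray, (y_cal, lo_cal, hi_cal))
    # Nonconformity: signed distance outside the band (negative = inside)
    score = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = len(score)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(score, level, method="higher"))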
Why we publish honest negatives
Symbol-disjoint evaluation has killed multiple internal Chart Library experiments. The methodology page lists them by name: an outcome-aware embedding fine-tune that beat baseline on a date split but failed on the disjoint test, a regime-conditioning experiment whose effect went to zero on held-out symbols, a transformer-based encoder that didn’t justify its compute.
We publish these because the alternative is shipping things that don’t work. The honest discipline is what makes the things we do ship credible.
Frequently asked questions
- Isn't 20% of symbols a small held-out set?
- It's ~3,800 symbols × 10 years × ~250 trading days per year ≈ 9.5M held-out chart-pattern anchors. Statistical power is high. The constraint is sectoral balance and survivorship: we stratify the split by sector and include delisted symbols on both sides.
- What about purged group time-series cross-validation?
- We use it for hyperparameter tuning within the train fold (a minimal sketch follows this list). The final eval is symbol-disjoint with embargo. PGTSCV is a strong technique inside training; symbol-disjoint is the right discipline for the held-out final evaluation.
- Does this work for time series with strong cross-symbol dependence (e.g. sector-wide moves)?
- Symbol-disjoint splits don't eliminate cross-symbol regime correlation — if the held-out symbols experienced the same 2025 vol regime as the training symbols, that's still a regime-level confound. We address this by stratifying across regimes during the split and reporting per-regime calibration metrics.
- How do you handle delisted / merged symbols?
- Delisted symbols are included on both sides of the split. Their forward returns use the last trading price (occasionally pessimistic, never optimistic). This prevents survivorship bias — a problem most retail backtesting tools have but few people notice.
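For the purged cross-validation mentioned above, here is a minimal sketch of the idea: contiguous time folds with an embargo purge around each validation block. This is a simplification, not a full PurgedKFold, and the names are ours:

import numpy as np

def purged_time_folds(t, n_folds=5, embargo=10):
    """Yield (train_idx, val_idx) over integer trading-day indices `t`.

    Each fold validates on one contiguous time block; training rows within
    `embargo` days of the block are purged to cut autocorrelation leakage.
    """
    t = np.asarray(t)
    edges = np.linspace(t.min(), t.max() + 1, n_folds + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        val = (t >= lo) & (t < hi)
        purged = (t >= lo - embargo) & (t < hi + embargo)
        yield np.where(~purged)[0], np.where(val)[0]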
Read the full methodology page.
Eval discipline, calibration metrics, honest negatives, and the per-release validation procedure.