Published audits and honest negatives.
Research artifacts from Chart Library’s own eval pipeline. Positive findings ship into the product; null results get the same publication treatment — the review discipline is the point. Methodology at /learn/methodology.
What held up
The raw retrieval [p10, p90] interval covers only ~68% of outcomes empirically, against a nominal 80% band. Conformal offsets (CQR-style) restore coverage to 82.5% on held-out data.
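For readers unfamiliar with conformal offsets, the following is a minimal sketch of one standard CQR-style calibration, not the pipeline's actual code. The function names (`cqr_offset`, `apply_offset`, `coverage`) and the toy numbers are hypothetical; the only assumption carried over from the text is a nominal 80% band (alpha = 0.2).

```python
import math

def cqr_offset(lo, hi, y, alpha=0.2):
    """Conformal offset from a calibration set (CQR-style).

    lo, hi: raw lower/upper bounds per calibration example (e.g. p10, p90).
    y:      realized outcome per calibration example.
    alpha:  target miscoverage (0.2 for a nominal 80% band).
    """
    # Score is positive when y falls outside the band, negative when inside.
    scores = sorted(max(l - v, v - h) for l, h, v in zip(lo, hi, y))
    n = len(scores)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]

def apply_offset(lo, hi, offset):
    """Widen (or shrink, if offset is negative) a raw band symmetrically."""
    return lo - offset, hi + offset

def coverage(lo, hi, y):
    """Fraction of outcomes that land inside their band."""
    return sum(l <= v <= h for l, h, v in zip(lo, hi, y)) / len(y)

# Toy calibration data (hypothetical): a fixed raw band of [-1, 1].
y_cal = [0.0, 0.5, -0.5, 1.5, -2.0, 0.2, 0.9, -1.2, 3.0, 0.1]
lo_cal = [-1.0] * len(y_cal)
hi_cal = [1.0] * len(y_cal)

offset = cqr_offset(lo_cal, hi_cal, y_cal, alpha=0.2)
adjusted = [apply_offset(l, h, offset) for l, h in zip(lo_cal, hi_cal)]
```

In practice the offset would be fitted on a calibration split and coverage reported on a separate held-out split, as the 82.5% figure above is.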
A prior cross-symbol eval silently reused symbols across splits and overstated accuracy. Moving to symbol-disjoint, MD5-bucketed splits with a 10-day purge/embargo closed the leak. The inflated baseline claim was retracted.
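A symbol-disjoint split with a purge/embargo window can be sketched as follows. This is an illustration under stated assumptions, not the production splitter: the bucket count, the choice of test buckets, the use of calendar days for the 10-day window, and all function names are hypothetical.

```python
import hashlib
from datetime import date, timedelta

def symbol_bucket(symbol, n_buckets=10):
    """Deterministic bucket from the MD5 hash of the ticker symbol."""
    digest = hashlib.md5(symbol.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def assign_split(symbol, test_buckets=frozenset({8, 9}), n_buckets=10):
    """Every row of a symbol lands in the same split, so splits share no symbols."""
    return "test" if symbol_bucket(symbol, n_buckets) in test_buckets else "train"

def purge_train(train_rows, test_start, test_end, embargo_days=10):
    """Drop training rows dated within embargo_days of the test window.

    train_rows: iterable of (symbol, date) pairs.
    Assumption: calendar days; a trading-day calendar would need a lookup.
    """
    gap = timedelta(days=embargo_days)
    return [
        (sym, d) for sym, d in train_rows
        if d < test_start - gap or d > test_end + gap
    ]

rows = [
    ("AAA", date(2023, 1, 1)),
    ("AAA", date(2023, 3, 25)),
    ("AAA", date(2023, 4, 5)),
    ("AAA", date(2023, 5, 1)),
]
kept = purge_train(rows, date(2023, 4, 1), date(2023, 4, 10))
```

Hashing the symbol rather than sampling rows at random keeps the assignment stable across re-runs and makes it impossible for one ticker to appear on both sides of the split.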
9,400 rows for delisted tickers were backfilled into the pattern library. Forward returns now include the part of the past where companies did not survive, a conservative correction against survivorship bias, which is common on the retrieval side.
What didn’t — and why we published it anyway
Five independent regime filters (variance risk premium, VIX term structure, credit spread, yield curve, market breadth) were layered on top of shape retrieval. Across 200 anchors × 6 modes, IQR shifts were below 0.4pp, which suggests that shape already captures regime implicitly.
Caveats: Bucketing was loose (±0.15 percentile). Tighter bucketing (±0.05) may restore a meaningful effect at the cost of higher variance. Filter stacking is untested.
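To make the IQR-shift metric concrete, here is a small sketch. It assumes that "IQR shift" means the change in interquartile range of forward returns when a filter is applied, which the text does not spell out; the helper names and toy values are hypothetical.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range (Q3 - Q1) using inclusive quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

def iqr_shift(unfiltered, filtered):
    """Change in IQR after a regime filter; near zero means the filter adds little."""
    return iqr(filtered) - iqr(unfiltered)

# Toy forward returns in percentage points (hypothetical).
all_matches = [1, 2, 3, 4, 5, 6, 7, 8, 9]
regime_matches = [2, 3, 4, 5, 6, 7, 8]
shift = iqr_shift(all_matches, regime_matches)
```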
One anchor suggested that earnings-window patterns underperform by −3.65pp. Re-run across 100 anchors, the population effect is −0.52pp, and the paired test against a dividend placebo yields p = 0.08. The single-anchor result was an outlier, not a generalization.
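The text does not say which paired test was used. As one possible illustration, the sketch below runs an exact paired sign-flip (permutation) test on per-anchor differences between a treatment window and a placebo window. The function name and the toy data are hypothetical.

```python
from itertools import product

def paired_sign_flip_test(treated, placebo, max_n=20):
    """Exact two-sided sign-flip test on paired differences.

    Under the null, each paired difference is equally likely to be
    positive or negative, so every sign pattern is enumerated.
    Returns (mean difference, p-value).
    """
    diffs = [t - p for t, p in zip(treated, placebo)]
    n = len(diffs)
    if n == 0 or n > max_n:
        # 2**n patterns: enumerate only small samples; larger ones
        # would need random sign flips instead.
        raise ValueError("need 1..max_n pairs for exact enumeration")
    observed = abs(sum(diffs))
    hits = 0
    for signs in product((1, -1), repeat=n):
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float round-off
            hits += 1
    return sum(diffs) / n, hits / 2 ** n

# Toy example (hypothetical): every pair favors the treated window.
mean_diff, p_value = paired_sign_flip_test([1, 2, 3, 4, 5, 6], [0] * 6)
```

With six pairs that all point the same way, only the all-plus and all-minus patterns match the observed magnitude, so the p-value is 2/64. A sample of 100 anchors would require Monte Carlo sign flips rather than full enumeration.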
Extended the H3 sample to 2020–2023 to include real COVID-era high-VIX anchors. The Q4 VIX bucket shows a −0.69pp paired difference, versus −0.35pp for Q1. This is directionally consistent with the hypothesis, but no paired CI excludes zero at any VIX threshold.
Caveats: The extreme tail (VIX > 40) has only n = 14. The result is directionally clean but statistically underpowered.
A distance ensemble between two retrieval spaces was tested at α ∈ {0.3, 0.5, 0.7}. The best ensemble edges V5-alone by 0.2pp MAE at n = 20 anchors, which is within noise. V5-alone cleanly beats V2-alone. The two top-500 lists typically share fewer than 5 pairs, so strict intersection ensembles are structurally unworkable.
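The two ensemble styles mentioned above can be sketched like this. The blending formula (a convex combination with α weighting the first space), the dictionary representation, and all names are assumptions for illustration; the text does not specify which space α weights or whether distances are normalized first.

```python
def blend(dist_a, dist_b, alpha):
    """Convex blend of two distance maps over the candidates both spaces scored.

    Assumption: both maps are on comparable scales; real distances from
    different embedding spaces would likely need normalizing first.
    """
    shared = dist_a.keys() & dist_b.keys()
    return {k: alpha * dist_a[k] + (1 - alpha) * dist_b[k] for k in shared}

def top_k(dist, k):
    """Candidate ids of the k smallest distances, nearest first."""
    return [key for key, _ in sorted(dist.items(), key=lambda kv: kv[1])[:k]]

def overlap(dist_a, dist_b, k):
    """Candidates present in both top-k lists (the strict-intersection ensemble)."""
    return set(top_k(dist_a, k)) & set(top_k(dist_b, k))

# Toy distances from one anchor to four candidates (hypothetical values).
d_v5 = {"a": 0.1, "b": 0.2, "c": 0.9, "d": 0.8}
d_v2 = {"a": 0.9, "b": 0.3, "c": 0.1, "d": 0.7}

nearest_blended = top_k(blend(d_v5, d_v2, alpha=0.5), 1)
shared_top2 = overlap(d_v5, d_v2, 2)
```

When the two spaces rank candidates very differently, `overlap` returns almost nothing, which is the failure mode described for the top-500 lists.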
Earnings-window underperformance was stratified by GICS sector. Only three sectors cleared the n ≥ 10 anchor minimum, and with Bonferroni correction all three paired CIs straddle zero. The aggregate effect is cross-sector, not concentrated.
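As a rough illustration of a Bonferroni-adjusted paired interval, the sketch below splits the error budget across the number of sectors tested. It uses a normal approximation for brevity (a t-interval would be more appropriate at n near 10); the function names and the toy differences are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def paired_ci(diffs, alpha=0.05, n_comparisons=1):
    """Two-sided CI for the mean paired difference.

    Bonferroni: each of n_comparisons intervals uses alpha / n_comparisons.
    Assumption: normal approximation to the sampling distribution.
    """
    adjusted_alpha = alpha / n_comparisons
    z = NormalDist().inv_cdf(1 - adjusted_alpha / 2)
    center = mean(diffs)
    half_width = z * stdev(diffs) / sqrt(len(diffs))
    return center - half_width, center + half_width

def straddles_zero(ci):
    """True when the interval cannot rule out a zero effect."""
    lo, hi = ci
    return lo <= 0.0 <= hi

# Toy per-anchor differences for one sector, in pp (hypothetical).
sector_diffs = [-1.0, 0.5, -0.8, 0.3, -0.2, 0.1, -0.6, 0.4, -0.3, 0.2]
plain = paired_ci(sector_diffs)
bonferroni = paired_ci(sector_diffs, n_comparisons=3)
```

The adjusted interval is always at least as wide as the unadjusted one, so a sector has to show a stronger effect to count as significant once several sectors are tested together.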