Concept · Methodology

Cohort filtering by regime —
when to narrow and when not to.

By default, cohort intelligence returns the 300 nearest historical analogs to a (symbol, date, timeframe) anchor based purely on chart-shape similarity. Most of the time that’s what you want. But not always.

When the macro regime is shifting, when the sector is doing something unusual, when the anchor has news divergence — the unfiltered cohort can mix apples-and-oranges analogs. Filters let you narrow. Filters can also break the math if you over-narrow.

This piece walks through the seven filter dimensions cohort_analyze supports, the n=30 floor that protects against over-filtering, and three calibrated patterns for when to filter and when to leave the cohort wide.

The seven filter dimensions

cohort_analyze accepts a filters object that narrows the cohort along any combination of seven dimensions:

  1. vol_regime — low / medium / high, cut from rolling 20-day realized vol vs 1-year median
  2. macro_state — bullish / neutral / bearish, composite of SPY momentum, breadth, yield curve
  3. sector — restrict to analogs from the anchor’s sector ETF, or a specific sector
  4. has_news — boolean, was there material news within ±2 days of the analog
  5. days_since_earnings — integer range (e.g. 5-30); useful to exclude earnings-driven analogs
  6. days_since_ath — integer range; proximity to all-time high (the “is this a base breakout or a recovery?” question)
  7. relative_volume — minimum threshold vs 20-day average (filters out illiquid analogs)

Filters compose. You can ask “analogs in low-vol AND bullish-macro AND days_since_earnings > 14.” The API will narrow the cohort to whatever satisfies all conditions.
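That composed query can be sketched as a request body in Python. This is a minimal illustration: the build_request helper and its client-side key check are not part of the API, and the open-ended {"min": 15} range (encoding "> 14") assumes the API accepts a range with no max.

```python
# Illustrative sketch: assemble a composed-filter cohort_analyze request body
# and sanity-check the filter keys against the seven documented dimensions.
# The helper and the validation are client-side conveniences, not API behavior.

VALID_FILTER_KEYS = {
    "vol_regime", "macro_state", "sector", "has_news",
    "days_since_earnings", "days_since_ath", "relative_volume",
}

def build_request(anchor, cohort_size=300, **filters):
    """Assemble a request dict, rejecting unknown filter dimensions."""
    unknown = set(filters) - VALID_FILTER_KEYS
    if unknown:
        raise ValueError(f"unknown filter dimension(s): {sorted(unknown)}")
    return {"anchor": anchor, "cohort_size": cohort_size, "filters": filters}

# "low-vol AND bullish-macro AND days_since_earnings > 14"
req = build_request(
    {"symbol": "NVDA", "date": "today", "timeframe": "1d"},
    vol_regime="low",
    macro_state="bullish",
    days_since_earnings={"min": 15},  # assumes an open-ended max is accepted
)
```

Serializing req with json.dumps yields the same shape as the JSON examples later in this piece.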

The n=30 floor

The math behind cohort intelligence rests on having enough analogs to compute meaningful distribution statistics. The 95% confidence interval on a binomial win-rate from n=30 is roughly ±18 percentage points. From n=10 it’s roughly ±31pp — essentially useless.
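Those widths come from the normal approximation at p = 0.5; a quick check (the helper is illustrative, not part of any client library):

```python
import math

def winrate_ci_halfwidth(n, p=0.5, z=1.96):
    """95% normal-approximation half-width for a binomial win-rate."""
    return z * math.sqrt(p * (1 - p) / n)

print(round(winrate_ci_halfwidth(30), 3))  # 0.179, i.e. roughly ±18pp
print(round(winrate_ci_halfwidth(10), 3))  # 0.31, i.e. roughly ±31pp
```

The half-width shrinks with the square root of n, which is why loosening a filter from n=10 to n=30 buys so much.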

The API enforces a floor: if your filter combination drops cohort_size_actual below 30, the response includes a warning. Below 15, the warning is louder. Below 5, the response is flagged as unreliable.

The over-filtering trap is real. A natural user instinct is to narrow to “analogs that look exactly like the current situation” — low-vol AND bullish macro AND same sector AND no news AND post-earnings AND near ATH — and end up with n=4. Four analogs tell you nothing; the cohort has lost its statistical power.

Three calibrated narrowing patterns

Three patterns produce useful filtered cohorts without breaking the math:

Pattern 1: One-dimension filter for regime alignment

Pick the one most operative regime dimension and filter on it alone. Usually macro_state or vol_regime.

{
  "anchor": {"symbol": "NVDA", "date": "today", "timeframe": "1d"},
  "cohort_size": 300,
  "filters": {
    "macro_state": "bullish"
  }
}

Result: a 300-analog cohort drawn from bullish-macro periods. The returned cohort_size_actual likely stays at 300 (the index has plenty of bullish-macro days). The statistics are still robust, and the analogs are conditioned on the regime you care about.

This is the safest filter pattern. Use it when you have a strong view on which regime you’re in.

Pattern 2: Two-dimension filter with cohort-size monitoring

Two filters together: macro + vol, or macro + sector, or vol + days_since_earnings.

{
  "anchor": {"symbol": "NVDA", "date": "today", "timeframe": "1d"},
  "cohort_size": 300,
  "filters": {
    "macro_state": "bullish",
    "vol_regime": "high"
  }
}

Result: typically still n=200-300 (most of the analog space is covered). Read cohort_size_actual first. If it dropped below 100, the filter combination might be picking a tight slice — proceed carefully.
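The read-cohort_size_actual-first discipline can be made mechanical. A minimal sketch, assuming a parsed response dict with the cohort_size_actual field named in this piece; the threshold tiers mirror the floor described above:

```python
# Check the realized cohort size before reading any statistics.
# `response` stands in for a parsed cohort_analyze reply.

def size_verdict(response):
    n = response["cohort_size_actual"]
    if n < 30:
        return "unreliable: below the n=30 floor, loosen the filters"
    if n < 100:
        return "caution: tight slice, read the stats with wide error bars"
    return "ok"

assert size_verdict({"cohort_size_actual": 240}) == "ok"
```

Gating every downstream read on this verdict keeps an over-narrowed cohort from silently feeding n=4 statistics into a decision.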

Two-dimension filters work when the two dimensions are roughly independent. Macro and vol are weakly correlated (high-vol regimes overlap with bear macros) but not identical. Macro and sector are nearly orthogonal.

Pattern 3: Earnings-window exclusion

The most common useful filter is excluding earnings-window analogs. Earnings moves are structurally different from non-earnings moves; including them blurs the signal.

{
  "anchor": {"symbol": "NVDA", "date": "today", "timeframe": "1d"},
  "cohort_size": 300,
  "filters": {
    "days_since_earnings": {"min": 5, "max": 60}
  }
}

Result: cohort restricted to analogs in the 5-to-60-days-after-earnings window. Excludes the chaotic earnings-day moves and the pre-earnings drift. The cohort becomes much more apples-to-apples if your current question is about non-earnings price action.

When NOT to filter

Sometimes the right answer is the unfiltered cohort. Specifically:

  • When you’re trying to discover the regime split. The whole point of regime_stratification is to surface differences across regimes. If you filter on regime first, you lose the comparison.
  • When the live regime is transitioning. If macro_state has been ambiguous for the past two weeks, conditioning on one label is an over-claim. Stay wide.
  • When you’re forming a base-rate prior. Base rates are deliberately wide. You apply conditioning later (in the analyst’s head, in the agent’s reasoning loop) — not in the cohort retrieval.
  • When the filter you want isn’t reliably labeled for older analogs. Macro_state and vol_regime are computed back to 2014. News context and narrative scores have less history. If the filter requires data we don’t have for old analogs, the cohort silently shrinks to whatever has the label.

Filtering vs reading regime_stratification

These two approaches achieve similar conditional reads, by different means:

Approach A: Filter at retrieval

Pass filters: { macro_state: "bullish" }. Cohort comes back with 300 bullish-macro analogs, no non-bullish ones. outcome_distribution describes only those analogs.

Approach B: Don’t filter, read stratification

Pass no filter. Get the full cohort, then read regime_stratification_5d.bull_macro to extract the bull-macro slice (typically n=120 of the 300 in our example).

Approach B is usually preferable. It gives you both the conditioned read AND the contrast (you can compare bull-macro to bear-macro and see how much regime matters). Approach A loses that information.

Approach A is preferable when you want the full 300 analogs conditioned on the regime instead of a slice of the 300. If macro_state matters a lot to your decision and you’re willing to commit to one regime, filter at retrieval to maximize the conditioned sample.
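Approach B's advantage is concrete: one unfiltered response yields both the conditioned read and the contrast. A sketch of that read, with an assumed response shape (the nesting under regime_stratification_5d is inferred from the field names in this piece, and the numbers are illustrative):

```python
# Reading both the slice and the contrast from one unfiltered response.
# The stratification shape and the win rates below are illustrative assumptions.

strat = {
    "bull_macro": {"n": 120, "win_rate": 0.61},
    "bear_macro": {"n": 90,  "win_rate": 0.48},
}

bull, bear = strat["bull_macro"], strat["bear_macro"]
contrast_pp = (bull["win_rate"] - bear["win_rate"]) * 100
print(f"bull-macro slice: n={bull['n']}, regime contrast: {contrast_pp:.0f}pp")
```

A filtered retrieval (Approach A) would return only the bull_macro-style analogs, so the contrast_pp number is unrecoverable from it.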

A worked example: filtering NVDA pre-earnings

Suppose NVDA reports earnings in 4 days. The unfiltered cohort includes both pre-earnings and non-earnings analogs. Pre-earnings analogs often have a structural drift — implied vol creep, options positioning — that doesn’t match a typical “does the chart look bullish?” question.

Filter to non-earnings-window analogs and the live anchor (which IS pre-earnings) ends up matched against analogs that don’t share that property — a different mistake.

Better: filter to same-distance-from-earnings analogs.

{
  "anchor": {"symbol": "NVDA", "date": "today", "timeframe": "1d"},
  "cohort_size": 300,
  "filters": {
    "days_since_earnings": {"min": -7, "max": -3}
  }
}

Negative days_since_earnings = days before earnings. The filter restricts to analogs that were also 3-7 days before an earnings event. The cohort now genuinely mirrors the live situation.

Watch cohort_size_actual. Pre-earnings windows are sparse — you might drop from 300 to 80 analogs. Still well above the n=30 floor, statistically meaningful, and now apples-to-apples.

Filtering checklist

Before adding any filter, ask:

  1. Is the dimension actually operative? If the feature shows up in feature_importance with importance > 0.05, yes. If it doesn’t, filtering on it adds noise without signal.
  2. Is the live anchor cleanly on one side of the cut? If macro is borderline-bullish, don’t filter on bullish-macro — you’re assuming a regime that isn’t reliable.
  3. Will I still have n > 30 after the filter? Run with the filter, check cohort_size_actual, loosen if below 30.
  4. Am I better served by regime_stratification on an unfiltered cohort? Usually yes. The stratification gives you both the conditional and the contrast.
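Checklist item 3 can be mechanized as a loosen-until-floor loop. This is a sketch under stated assumptions: call_cohort_analyze is a stub standing in for a real API call, and the drop-the-last-added-filter loosening order is a judgment call, not API behavior.

```python
# Drop filters one at a time until the cohort clears the n=30 floor.

def call_cohort_analyze(anchor, filters):
    # Stub for the demo: pretend each filter dimension halves a 300-analog pool.
    return {"cohort_size_actual": 300 // (2 ** len(filters))}

def retrieve_with_floor(anchor, filters, floor=30):
    filters = dict(filters)
    while True:
        resp = call_cohort_analyze(anchor, filters)
        if resp["cohort_size_actual"] >= floor or not filters:
            return resp, filters
        filters.popitem()  # loosen: drop the most recently added dimension

anchor = {"symbol": "NVDA", "date": "today", "timeframe": "1d"}
tight = {"macro_state": "bullish", "vol_regime": "high", "sector": "tech",
         "has_news": False}  # four dimensions: 300 // 16 = 18, below the floor
resp, kept = retrieve_with_floor(anchor, tight)
print(resp["cohort_size_actual"], sorted(kept))  # 37 ['macro_state', 'sector', 'vol_regime']
```

Ordering the filters dict from most to least important makes popitem discard the least important dimension first.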

Frequently asked questions

Why does my cohort_size_actual sometimes come back lower than my requested cohort_size?
Two reasons. Either filters narrowed the candidate pool, or the request was made very early in the trading history of the symbol and not enough analogs satisfy the look-ahead-free retrieval. Check cohort_size_actual on every response; act on the response only if it's above n=30.

Can I filter on multiple feature values within one dimension?
Yes — pass a list. Example: macro_state: ['bullish', 'neutral'] includes both buckets and excludes bearish. Useful when you want to be permissive about one side of a binary cut.

What if I want analogs from a specific date range?
Pass start_date and end_date in the filters object. Useful for excluding extreme periods (2008 GFC, 2020 COVID crash) when their inclusion would distort the cohort. Be honest with yourself about why you're excluding — easy to overfit by trimming inconvenient periods.

Are filters cumulative or replaceable on re-query?
Each call to cohort_analyze is independent — there's no session state. Filters in one call don't affect the next. If you want to compare filtered vs unfiltered, run both calls and compare the responses.

What's the worst filter I could apply?
Anything that filters on a dimension correlated with forward returns directly — e.g. filtering to analogs that had positive 5-day returns. That's reading the answer into the question. The API doesn't expose forward-return-based filters for exactly this reason.
Try it

Run a cohort_analyze call.

Free Sandbox tier — 200 calls/day, no authentication. MCP install for Claude or Cursor takes 30 seconds.

Related