2026-06-18 · 9f68c4a

05 — Modeling Methods: Building, Validating, and Calibrating

TL;DR. Build the rating spine with ridge-regularized least-squares power ratings (or an Elo/state-space dynamic model) — never plain OLS on ~272 games. Convert margins to probabilities with Normal(line, σ≈13.5) for spreads and Normal(total, ~10) for totals, then check calibration (reliability diagram, Brier, log loss) — not just accuracy. Validate walk-forward only (k-fold leaks the future). And the highest-leverage idea of all: blend your model toward the market line (Benter-style) and judge yourself on CLV / ROI vs. the close, not on score R².


1. The rating spine

Encode each game as one row: +1 in the home team's column, −1 in the away team's column, 1 in the HFA column; response = home margin of victory (MOV). One regression recovers all 32 team ratings + HFA:

MOV = Rating_home − Rating_away + HFA
  • This approach calls ~66% of game winners correctly and recovers HFA ≈ 2.5–3 pts historically.
  • Use ridge (L2), not OLS. With ~272 games and 33 parameters (32 teams + HFA), the design matrix is near-collinear early in the season, and OLS ratings blow up for teams with small, extreme samples. Ridge's λ‖β‖² penalty shrinks ratings toward league average, which is the bias/variance trade you want. Tune λ by walk-forward validation (§3).
  • Refinements: cap/shrink MOV symmetrically (e.g. clip(MOV, −28, 28)) so blowouts in either direction don't dominate; exponential time-decay to weight recent games more heavily.
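
The ridge spine described above can be sketched in a few lines of NumPy. This is a minimal illustration, not our engine's code: the function name fit_ridge_ratings, the integer team indices, and the default λ are all assumptions, and the choice to leave the HFA column unpenalized is a design decision made here for illustration.

```python
import numpy as np

def fit_ridge_ratings(home_idx, away_idx, margins, n_teams, lam=1.0, cap=28.0):
    """Return (ratings, hfa) fitted to home-team margins of victory.

    home_idx / away_idx are integer team indices in [0, n_teams).
    lam is the ridge penalty; cap clips blowouts symmetrically.
    """
    home_idx = np.asarray(home_idx)
    away_idx = np.asarray(away_idx)
    y = np.clip(np.asarray(margins, dtype=float), -cap, cap)
    n_games = len(y)

    # One row per game: +1 home team, -1 away team, last column = HFA.
    X = np.zeros((n_games, n_teams + 1))
    rows = np.arange(n_games)
    X[rows, home_idx] = 1.0
    X[rows, away_idx] = -1.0
    X[:, -1] = 1.0

    # Penalize team ratings only (assumption: HFA is left unshrunk).
    penalty = lam * np.eye(n_teams + 1)
    penalty[-1, -1] = 0.0

    # Closed-form ridge solution: (X'X + penalty)^-1 X'y
    beta = np.linalg.solve(X.T @ X + penalty, X.T @ y)
    return beta[:-1], beta[-1]
```

With λ > 0 the system is always solvable even though the team columns alone are collinear, and larger λ pulls every rating toward zero (league average).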

1b. Elo (cheap, online, self-correcting)

Full 538 spec in Doc 2 §3 (K=20, HFA=65 Elo ≈ 2.6 pts at 25 Elo ≈ 1 pt, MOV multiplier with autocorrelation control, 1/3 preseason reversion). Good as a secondary rating and a sanity check; trivially updatable each week.
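
As a rough sketch of the update loop using only the constants quoted above: the MOV multiplier is deliberately omitted here (its spec lives in Doc 2 and is not reproduced), the league mean of 1500 is an assumption, and the function names are illustrative. The 400-point logistic is the standard Elo expectation curve.

```python
K = 20.0              # update speed (from the spec quoted above)
HFA_ELO = 65.0        # home-field advantage in Elo points
ELO_PER_POINT = 25.0  # 25 Elo ~ 1 point of spread
MEAN_ELO = 1500.0     # assumption: league-average rating

def expected_home_win(home_elo, away_elo):
    """Standard Elo logistic expectation with home-field added."""
    diff = home_elo + HFA_ELO - away_elo
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def implied_spread(home_elo, away_elo):
    """Expected home margin in points implied by the Elo gap."""
    return (home_elo + HFA_ELO - away_elo) / ELO_PER_POINT

def update(home_elo, away_elo, home_result):
    """home_result: 1 = home win, 0.5 = tie, 0 = home loss.
    No MOV multiplier in this simplified sketch."""
    delta = K * (home_result - expected_home_win(home_elo, away_elo))
    return home_elo + delta, away_elo - delta

def preseason_revert(elo):
    """Pull a rating one third of the way back to the mean."""
    return elo + (MEAN_ELO - elo) / 3.0
```

For example, two equally rated teams imply a home spread of 65 / 25 = 2.6 points, and a home team rated 65 Elo below its opponent is a 50/50 proposition.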

1c. Glicko / Glicko-2

Elo + a ratings deviation (RD) = explicit uncertainty that grows on layoffs and shrinks with games played (Glicko-2 adds a volatility term). Effectively a per-team adaptive K — more principled than Elo for sparse schedules.

1d. Bayesian state-space (Glickman & Stern 1998, JASA)

Team strengths are latent states evolving as a first-order autoregressive process (a random walk with shrinkage toward the mean: week-to-week drift + season-to-season reversion); point differential is the noisy measurement of θ_home − θ_away + HFA. Estimated via Kalman filter / Gibbs sampling. Biggest advantage: it yields full predictive distributions → direct cover/total probabilities with principled uncertainty, and the data chooses how fast strengths move instead of a fixed K. This is the "gold-plated" version of our rating spine if/when we want it.
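
To make the state-space idea concrete, here is a minimal linear-Gaussian Kalman filter over team strengths. It is a simplified stand-in, not Glickman & Stern's estimator: HFA and all variances are fixed by hand rather than learned, and the class name and default values (prior_var, drift_var, obs_var) are assumptions chosen for illustration.

```python
import numpy as np

class StrengthFilter:
    """Kalman filter: margin = theta_home - theta_away + hfa + noise."""

    def __init__(self, n_teams, prior_var=36.0, drift_var=1.0,
                 obs_var=13.5 ** 2, hfa=2.5):
        self.theta = np.zeros(n_teams)           # strength means
        self.P = prior_var * np.eye(n_teams)     # strength covariance
        self.drift_var = drift_var
        self.obs_var = obs_var
        self.hfa = hfa

    def _h(self, home, away):
        h = np.zeros(len(self.theta))
        h[home], h[away] = 1.0, -1.0
        return h

    def advance(self, rho=1.0):
        """Time step: rho < 1 reverts toward the mean, drift adds noise."""
        self.theta = rho * self.theta
        self.P = rho ** 2 * self.P + self.drift_var * np.eye(len(self.theta))

    def predict(self, home, away):
        """Predictive mean and variance of the home margin."""
        h = self._h(home, away)
        return h @ self.theta + self.hfa, h @ self.P @ h + self.obs_var

    def observe(self, home, away, margin):
        """Measurement update after a game is played."""
        h = self._h(home, away)
        mean, var = self.predict(home, away)
        gain = self.P @ h / var
        self.theta = self.theta + gain * (margin - mean)
        self.P = self.P - np.outer(gain, h @ self.P)
```

The predictive variance returned by predict() is what feeds a cover probability directly: it combines game-level noise with the current uncertainty about both teams, so early-season predictions are automatically less confident.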

1e. ML (gradient boosting / RF / nets)

XGBoost/LightGBM/CatBoost on tabular features (EPA, DVOA, success rate, pace, QB-adjusted ratings) are the workhorses. Reported reality is sobering: best setups land ~55.8% ATS accuracy, and "even simple baselines like bookmaker odds remain hard to outperform." Treat vendor "58% ATS" claims as marketing.

1f. Poisson / bivariate Poisson

Standard for soccer (Dixon-Coles); a poor fit for the NFL because scoring comes in 3s/7s/2s, not Poisson units. For the NFL, model the margin as Normal (§2), not scores as Poisson. Keep Poisson in mind only if we model discrete scoring events.

Sources: EdsCave MOV model, Glickman & Stern (JASA), 538 forecast.py, ML comparison.


2. Calibration: margin → probability

Model the margin as Normal(mean = projected spread, σ):

  • Spreads: σ ≈ 13.45–13.86 (Winston & Stern's classic 13.86; use ~13.5).
  • Totals: σ ≈ 10 around the total.
P(win)   = Φ(s / 13.5)              # favored by s points
P(cover) = Φ((μ − m) / 13.5)        # model margin μ vs market spread m
P(over)  = Φ((T̂ − L) / 10)          # model total T̂ vs market total L
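
The three formulas above translate directly into code. A minimal sketch using only the standard library; the function names are illustrative, and it assumes μ and m are both expressed as expected margin for the same side (positive = that side favored).

```python
from math import erf, sqrt

SIGMA_MARGIN = 13.5  # spread sigma from the text
SIGMA_TOTAL = 10.0   # totals sigma from the text

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_win(spread_pts):
    """Win probability for a team favored by spread_pts."""
    return phi(spread_pts / SIGMA_MARGIN)

def p_cover(model_margin, market_spread):
    """Cover probability: model margin vs. market spread, same side."""
    return phi((model_margin - market_spread) / SIGMA_MARGIN)

def p_over(model_total, market_total):
    """Over probability: model total vs. market total."""
    return phi((model_total - market_total) / SIGMA_TOTAL)
```

As a sanity check, a 3-point favorite comes out at roughly 59% to win and a 7-point favorite at roughly 70%, and any model number equal to the market number gives exactly 50%.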

Refinement: σ varies with spread size — a per-spread σ table (Boyd's Bets) beats one constant. And near key numbers, layer the discrete margin distribution (Doc 2 §4) on top of the smooth normal.

Watch the interaction with score_dispersion (added recently). Our engine now stretches each team's projected points away from the league mean by a factor of ~1.45 so the displayed scores spread realistically (expected-value predictions are naturally flatter than real outcomes, so the instinct is correct). But that also widens the predicted margin. If you then plug that already-stretched margin into Φ((μ−m)/13.5), you count variance twice and get over-confident probabilities. Decide where variance lives: either compute cover/over probability from the pre-dispersion margin with the empirical σ, or drop dispersion and let σ own the spread. Don't do both. (See Doc 7, Gap F / Phase 3.)
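
A small numeric illustration of the double-counting problem. It assumes the dispersion step scales each team's deviation from the league mean by 1.45, which in turn scales the margin by 1.45; the example margin and spread are made up.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

SIGMA_MARGIN = 13.5
DISPERSION = 1.45        # stretch factor described in the text

raw_margin = 5.0         # hypothetical pre-dispersion model margin
market_spread = 3.0      # hypothetical market spread, same side

# Stretching both teams' points about the same mean scales their difference.
stretched_margin = DISPERSION * raw_margin

p_raw = phi((raw_margin - market_spread) / SIGMA_MARGIN)
p_stretched = phi((stretched_margin - market_spread) / SIGMA_MARGIN)

print(f"pre-dispersion cover prob:  {p_raw:.3f}")
print(f"post-dispersion cover prob: {p_stretched:.3f}")
```

In this made-up case a 2-point disagreement with the market becomes a 4.25-point one, and the cover probability moves from roughly 0.56 to roughly 0.62 with no new information behind it.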

Evaluate the probabilities, not just W/L

  • Brier score = mean((p − outcome)²); decomposes into reliability − resolution + uncertainty (lower is better, so you want a small reliability term and a large resolution term).
  • Log loss = −mean(y·ln p + (1−y)·ln(1−p)); punishes confident wrong calls.
  • Reliability diagram — bin predicted probs, plot vs. observed frequency; 45° = perfect calibration. A model can be accurate but miscalibrated (says 80%, hits 65%) — and miscalibration silently destroys Kelly sizing (Doc 6).
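
The three checks above fit in a short NumPy sketch. Function names and the 10-bin default are illustrative choices, and reliability_table returns the numbers behind the diagram rather than plotting it.

```python
import numpy as np

def brier_score(p, y):
    """Mean squared error between predicted prob and 0/1 outcome."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean((p - y) ** 2))

def log_loss(p, y, eps=1e-12):
    """Negative mean log-likelihood; eps guards against log(0)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def reliability_table(p, y, n_bins=10):
    """Rows of (mean predicted, observed frequency, count) per bin.
    Empty bins are skipped. Perfect calibration: the first two match."""
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.digitize(p, edges[1:-1])  # bin index in 0 .. n_bins-1
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rows.append((float(p[mask].mean()), float(y[mask].mean()),
                         int(mask.sum())))
    return rows
```

With only a few hundred bets per season the per-bin counts get small quickly, so it is worth reading the count column before reacting to any single bin.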

Sources: Boyd's per-spread SD, nfelo cover-prob calculator, reliability diagrams.


3. Validation: walk-forward only

Sports results are an ordered, non-stationary time series. Standard k-fold CV shuffles rows, so a fold's training set contains games that happened after the test games — pure lookahead leakage that produces optimistic numbers that never reproduce live.

Use walk-forward / rolling-origin: train on weeks 1…t, predict week t+1, append, retrain, advance. Mirrors real deployment (retrain weekly or per season). All feature engineering (means, ratings, scalers, λ tuning) must be fit on the training window only.
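
A rolling-origin splitter is only a few lines. This sketch assumes each game carries a sortable time key (a week number, or a (season, week) tuple when pooling seasons); the function name and the min_train_weeks default are illustrative.

```python
def walk_forward_splits(weeks, min_train_weeks=4):
    """Yield (train_idx, test_idx) pairs in chronological order.

    weeks[i] is the time key of game i. Every training game is
    strictly earlier than every test game in the same split.
    """
    ordered = sorted(set(weeks))
    for t in ordered[min_train_weeks:]:
        train = [i for i, w in enumerate(weeks) if w < t]
        test = [i for i, w in enumerate(weeks) if w == t]
        yield train, test
```

Anything fitted from data (scalers, ratings, the ridge λ) belongs inside the loop body, refit on train only, so that each test week is scored by a model that could actually have existed at the time.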

Our backtester already does chronological splits. Verify there is no leakage — e.g., season-long stats that include the target game, post-game injury news, or using the closing line as a feature when we intend to bet the opener. Documented case: a naive split looked 12% better in backtest but dropped 28% live.

Sources: cross-validation for betting, data leakage.


4. The most important practical idea: blend toward the market

The closing line is the best public predictor — so don't ignore it, shrink toward it. The blueprint is Bill Benter's horse model: fit a fundamental model, then a second stage combining the log of your model probability and the log of the market's implied probability:

combined_i ∝ exp( α·ln(f_model_i) + β·ln(π_market_i) )

with α, β fit by maximum likelihood. NFL analogue for point estimates:

final = w · model + (1 − w) · market_line          # w well below 1

Benter's diagnostic was ΔR² = R²_combined − R²_market — the incremental information your model adds beyond the market. A tiny ΔR² ≈ 0.018 was enough to profit; a tipster's ΔR² ≈ 0.0002 added nothing. The NFL test: regress outcomes on (your predicted margin, the market line) — if the market coefficient swamps yours, you have no edge and should shrink hard toward the line.
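
The NFL test and the point-estimate blend can be sketched as follows. Note the swap: this uses ordinary least-squares R² on game margins as an analogue, not Benter's own second-stage fit on probabilities. The function names and the default w = 0.2 are assumptions, and all three inputs are assumed to be expressed as home margin.

```python
import numpy as np

def blend(model_margin, market_line, w=0.2):
    """final = w * model + (1 - w) * market, with w well below 1."""
    return (w * np.asarray(model_margin, dtype=float)
            + (1.0 - w) * np.asarray(market_line, dtype=float))

def _ols_r2(X, y):
    """Least-squares fit; returns (coefficients, R-squared)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ss_tot = ((y - y.mean()) ** 2).sum()
    return coef, 1.0 - (resid ** 2).sum() / ss_tot

def incremental_edge(outcome, model_margin, market_line):
    """Regress outcomes on (model, market) and report delta R-squared."""
    y = np.asarray(outcome, dtype=float)
    ones = np.ones_like(y)
    _, r2_market = _ols_r2(np.column_stack([ones, market_line]), y)
    coef, r2_both = _ols_r2(
        np.column_stack([ones, model_margin, market_line]), y)
    return {"model_coef": float(coef[1]),
            "market_coef": float(coef[2]),
            "delta_r2": float(r2_both - r2_market)}
```

If model_coef is near zero and delta_r2 is indistinguishable from zero on out-of-sample games, the model is adding nothing beyond the line and w should be small. Run this on walk-forward predictions only; in-sample ΔR² is never negative by construction and so proves nothing.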

This is the philosophical pivot for our engine. Right now we predict in a vacuum and then compare to Vegas to flag bets. Better: make the de-vigged market line an explicit input/prior, output our independent number, and bet only the residual disagreement. This both improves accuracy (the market is a strong prior) and makes our edge measurable (ΔR², CLV).

Sources: annotated Benter paper.


5. How to score the model the right way

  • Break-even at −110 = 52.38%. That's the bar.
  • Beat the closing line (CLV), not the scoreboard. A model that nails scores but always agrees with the line makes zero bets and zero profit. Edge = systematic, correct disagreement with the no-vig close.
  • Report ROI and CLV with confidence intervals, plus calibration (Brier/log loss) — not score R² as the headline.
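
The arithmetic behind these bullets, as a small sketch. Proportional normalization is only one of several de-vig methods and is used here as an assumption; the function names are illustrative, and spread lines are taken from the perspective of the side you bet (e.g. −2.5 for a favorite, +3 for an underdog).

```python
def implied_prob(american_odds):
    """Break-even win probability for American odds (vig included)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100.0)
    return 100.0 / (american_odds + 100.0)

def devig_two_way(odds_a, odds_b):
    """No-vig probabilities for a two-way market (proportional method)."""
    pa, pb = implied_prob(odds_a), implied_prob(odds_b)
    total = pa + pb
    return pa / total, pb / total

def spread_clv_points(bet_line, closing_line):
    """Points of closing-line value on a spread bet.
    Positive means you got a better number than the close."""
    return bet_line - closing_line
```

implied_prob(-110) reproduces the 52.38% bar (110 / 210), and a bet placed at −2.5 that closes at −3.5 has +1 point of CLV.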

6. Pitfalls checklist (print this)

  • [ ] Tiny data (~272 games/yr) → regularize hard, few features, pool seasons with time-decay.
  • [ ] Data leakage → no future info in features; fit everything on the train window only.
  • [ ] k-fold on time series → use walk-forward instead.
  • [ ] Optimizing accuracy/MSE instead of CLV/ROI → align the objective with profit.
  • [ ] Miscalibration → check reliability diagram before trusting probabilities for sizing.
  • [ ] Overfitting to backtest → expect live performance well below backtest; demand margin.
  • [ ] Ignoring the market → blend toward the line; measure incremental edge.

7. Implications for our engine

  1. Introduce a ridge power-rating module as the new spine (replaces the ad-hoc scored/allowed baseline + power-rank nudge). Tune λ walk-forward.
  2. Add a probability layer (Φ((μ−m)/σ), Φ((T̂−L)/10)) + a per-spread σ table + key-number margin distribution. Output cover/over probabilities.
  3. Make the de-vigged market line a model input and blend toward it; report ΔR² so we can see if we add anything.
  4. Add CLV + ROI + Brier/log loss to the backtester as first-class metrics; retarget the optimizer's objective toward CLV/ROI, not raw error.
  5. Audit for leakage in the existing backtest pipeline (closing-line features, season-aggregate stats including the target game).

→ Continue to Doc 6 — Betting Strategy & Bankroll.