05 — Modeling Methods: Building, Validating, and Calibrating
TL;DR. Build the rating spine with ridge-regularized least-squares power ratings (or an Elo/state-space dynamic model) — never plain OLS on ~272 games. Convert margins to probabilities with Normal(line, σ≈13.5) for spreads and Normal(total, ~10) for totals, then check calibration (reliability diagram, Brier, log loss) — not just accuracy. Validate walk-forward only (k-fold leaks the future). And the highest-leverage idea of all: blend your model toward the market line (Benter-style) and judge yourself on CLV / ROI vs. the close, not on score R².
1. The rating spine
1a. Ridge least-squares power ratings (recommended primary)
Encode each game as one row: +1 in the home team's column, −1 in the away team's column, 1 in the HFA column;
response = home margin of victory (home points − away points). One regression recovers all 32 team ratings + HFA:
MOV = Rating_home − Rating_away + HFA
- This style picks roughly 66% of straight-up winners and recovers HFA ≈ 2.5–3 points historically.
- Use ridge (L2), not OLS. With ~272 games and 32 teams + HFA, the design matrix is near-collinear early in the season, and OLS ratings blow up for teams with very small samples. Ridge's λ‖β‖² penalty shrinks ratings toward league average, which is the bias/variance trade you want. Tune λ by walk-forward validation (§3).
- Refinements: cap/shrink MOV so blowouts don't dominate (e.g. clip to ±28, i.e. max(−28, min(MOV, 28)), since the margin is signed); exponential time-decay to weight recent games.
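The encoding and ridge fit described above can be sketched in closed form with NumPy. This is a minimal illustration, not the engine's implementation: the function name, the default λ, and the choice to leave the HFA column unpenalized are all assumptions, and time-decay weighting is omitted.

```python
import numpy as np

def ridge_power_ratings(games, teams, lam=5.0, mov_cap=28.0):
    """Fit ridge power ratings.

    games: list of (home_team, away_team, home_margin) tuples.
    Returns (dict of team -> rating in points, HFA in points).
    """
    idx = {team: i for i, team in enumerate(teams)}
    n = len(teams)
    X = np.zeros((len(games), n + 1))
    y = np.zeros(len(games))
    for row, (home, away, margin) in enumerate(games):
        X[row, idx[home]] = 1.0    # home team column
        X[row, idx[away]] = -1.0   # away team column
        X[row, n] = 1.0            # HFA column
        y[row] = np.clip(margin, -mov_cap, mov_cap)  # symmetric blowout cap
    penalty = lam * np.eye(n + 1)
    penalty[n, n] = 0.0            # assumption: do not shrink HFA
    # Closed-form ridge solution: (X'X + penalty)^-1 X'y
    beta = np.linalg.solve(X.T @ X + penalty, X.T @ y)
    ratings = {team: float(beta[idx[team]]) for team in teams}
    return ratings, float(beta[n])

# Hypothetical three-team round robin, margins from the home team's side.
games = [("A", "B", 7), ("B", "A", -3), ("A", "C", 12),
         ("C", "A", -8), ("B", "C", 7), ("C", "B", -3)]
ratings, hfa = ridge_power_ratings(games, ["A", "B", "C"], lam=1.0)
print(ratings, hfa)
```

Because every row has one +1 and one −1 in the team columns, the fitted ratings sum to zero, so each rating reads as points above or below league average.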
1b. Elo (cheap, online, self-correcting)
Full 538 spec in Doc 2 §3 (K=20, HFA=65 Elo ≈ 2.5 pts, 25 Elo ≈ 1 pt, MOV multiplier with autocorrelation control, 1/3 preseason reversion). Good as a secondary rating and a sanity check; trivially updatable each week.
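For orientation, a single-game Elo update with the parameters quoted above might look like the sketch below. Doc 2 §3 remains the authoritative spec. The margin-of-victory multiplier uses the commonly published FiveThirtyEight form, and the reversion target of 1500 is an assumption; check both against Doc 2 before relying on them.

```python
import math

K = 20.0         # update speed (from the text)
HFA_ELO = 65.0   # home-field advantage in Elo points (from the text)

def expected_home_win(elo_home, elo_away):
    """Pre-game home win probability from the Elo difference."""
    diff = elo_home + HFA_ELO - elo_away
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def elo_update(elo_home, elo_away, home_margin):
    """Return (new_home, new_away) after one game."""
    diff = elo_home + HFA_ELO - elo_away
    expected = expected_home_win(elo_home, elo_away)
    if home_margin > 0:
        result = 1.0
    elif home_margin < 0:
        result = 0.0
    else:
        result = 0.5
    # Elo difference from the winner's point of view; this damps the
    # multiplier when a heavy favorite wins big (autocorrelation control).
    winner_diff = diff if home_margin >= 0 else -diff
    # Assumed multiplier form; a tie gives log(1) = 0, so no update here.
    mov_mult = math.log(abs(home_margin) + 1.0) * 2.2 / (winner_diff * 0.001 + 2.2)
    shift = K * mov_mult * (result - expected)
    return elo_home + shift, elo_away - shift

def preseason_revert(elo, mean=1500.0):
    """Pull a rating one third of the way back to the mean (mean is assumed)."""
    return elo + (mean - elo) / 3.0

print(elo_update(1500.0, 1500.0, 7))
```

The update is zero-sum: whatever the home team gains, the away team loses.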
1c. Glicko / Glicko-2
Elo + a ratings deviation (RD) = explicit uncertainty that grows on layoffs and shrinks with games played (Glicko-2 adds a volatility term). Effectively a per-team adaptive K — more principled than Elo for sparse schedules.
1d. Bayesian state-space (Glickman & Stern 1998, JASA)
Team strengths are latent states evolving as a random-walk (week-to-week drift +
season-to-season reversion); point differential is the noisy measurement of
θ_home − θ_away + HFA. Estimated via Kalman filter / Gibbs sampling. Biggest
advantage: it yields full predictive distributions → direct cover/total
probabilities with principled uncertainty, and the data chooses how fast
strengths move instead of a fixed K. This is the "gold-plated" version of our
rating spine if/when we want it.
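To make the state-space idea concrete, here is a heavily simplified linear-Gaussian filter: one Kalman update per game plus a weekly random-walk drift step. It is not Glickman & Stern's model (they estimate the variances and the season-to-season reversion, which this sketch omits). HFA, the observation variance, and the drift variance are fixed, hypothetical values.

```python
import numpy as np

def kalman_game_update(theta, P, home, away, margin,
                       hfa=2.5, obs_var=13.5 ** 2):
    """One measurement update for a single game.

    theta: team strengths in points; P: their covariance matrix.
    home, away: integer team indices; margin: home points minus away points.
    Returns (theta, P, predicted_margin, predictive_variance).
    """
    h = np.zeros(len(theta))
    h[home], h[away] = 1.0, -1.0
    predicted = h @ theta + hfa           # theta_home - theta_away + HFA
    pred_var = h @ P @ h + obs_var        # rating uncertainty + game noise
    gain = P @ h / pred_var               # Kalman gain
    theta = theta + gain * (margin - predicted)
    P = P - np.outer(gain, h @ P)
    return theta, P, predicted, pred_var

def weekly_drift(P, drift_var=0.5):
    """Random-walk step: strengths become slightly less certain each week."""
    return P + drift_var * np.eye(len(P))

theta = np.zeros(2)          # two teams, both start at league average
P = 25.0 * np.eye(2)         # hypothetical prior variance
theta, P, predicted, pred_var = kalman_game_update(theta, P, 0, 1, margin=10.0)
print(theta, predicted, pred_var)
```

The predictive variance is what feeds cover and total probabilities: it is wider for teams whose strengths are still uncertain, which a fixed-K Elo cannot express.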
1e. ML (gradient boosting / RF / nets)
XGBoost/LightGBM/CatBoost on tabular features (EPA, DVOA, success rate, pace, QB-adjusted ratings) are the workhorses. Reported reality is sobering: best setups land ~55.8% ATS accuracy, and "even simple baselines like bookmaker odds remain hard to outperform." Treat vendor "58% ATS" claims as marketing.
1f. Poisson / bivariate Poisson
Standard for soccer (Dixon-Coles); a poor fit for the NFL because scoring comes in 3s/7s/2s, not Poisson units. For the NFL, model the margin as Normal (§2), not scores as Poisson. Keep Poisson in mind only if we model discrete scoring events.
Sources: EdsCave MOV model, Glickman & Stern (JASA), 538 forecast.py, ML comparison.
2. Calibration: margin → probability
Model the margin as Normal(mean = projected spread, σ):
- Spreads: σ ≈ 13.45–13.86 (Winston & Stern's classic 13.86; use ~13.5).
- Totals: σ ≈ 10 around the total.
P(win) = Φ(s / 13.5) # favored by s points
P(cover) = Φ((μ − m) / 13.5) # model margin μ vs market spread m
P(over) = Φ((T̂ − L) / 10) # model total T̂ vs market total L
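The three formulas above translate directly into code using the standard library's normal CDF. One assumed convention: both the model margin μ and the market number m are expressed as expected margin for the same team (positive means that team is favored), not as a signed betting spread such as −3. Function names are illustrative.

```python
from statistics import NormalDist

PHI = NormalDist().cdf   # standard normal CDF

def p_win(favored_by, sigma=13.5):
    """Win probability for a team favored by `favored_by` points."""
    return PHI(favored_by / sigma)

def p_cover(model_margin, market_margin, sigma=13.5):
    """Probability the team beats the market margin, given the model margin."""
    return PHI((model_margin - market_margin) / sigma)

def p_over(model_total, market_total, sigma=10.0):
    """Probability the game goes over the market total."""
    return PHI((model_total - market_total) / sigma)

print(p_win(3.0))            # 3-point favorite
print(p_cover(6.0, 3.0))     # model says 6, market says 3
print(p_over(47.0, 44.0))    # model total 47 vs. market 44
```

A pick'em (favored by 0) returns exactly 0.5, and a model that agrees with the market returns 0.5 for both cover and over, which is a quick sanity check on the sign convention.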
Refinement: σ varies with spread size — a per-spread σ table (Boyd's Bets) beats one constant. And near key numbers, layer the discrete margin distribution (Doc 2 §4) on top of the smooth normal.
Watch the interaction with score_dispersion (added recently). Our engine now multiplies each team's points by ~1.45 around the league mean to make the displayed scores spread realistically (predicted EV is naturally flatter than real outcomes, so the instinct is correct). But that also widens the predicted margin. If you then plug that already-stretched margin into Φ((μ−m)/13.5), you count variance twice and get over-confident probabilities. Decide where variance lives: compute cover/over probability from the pre-dispersion margin with the empirical σ, or drop dispersion and let σ own the spread. Don't do both. (See Doc 7, Gap F / Phase 3.)
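A small numeric illustration of the double count, using the 1.45 factor and σ = 13.5 from the text. The margins are hypothetical; stretching each team's points around the league mean by 1.45 stretches the predicted margin by the same factor, while the market number stays put.

```python
from statistics import NormalDist

SIGMA_MARGIN = 13.5   # empirical sd of margin around the spread
DISPERSION = 1.45     # score_dispersion stretch factor

def p_cover(model_margin, market_margin, sigma=SIGMA_MARGIN):
    return NormalDist().cdf((model_margin - market_margin) / sigma)

raw_margin = 6.0                              # hypothetical pre-dispersion margin
market_margin = 3.0                           # hypothetical market number
stretched_margin = DISPERSION * raw_margin    # what the display layer shows

p_raw = p_cover(raw_margin, market_margin)           # variance counted once
p_double = p_cover(stretched_margin, market_margin)  # variance counted twice

print(f"pre-dispersion margin:  P(cover) = {p_raw:.3f}")
print(f"post-dispersion margin: P(cover) = {p_double:.3f}")
```

The second number is always further from 0.5 when the model already favors the team, which is exactly the over-confidence that inflates Kelly stakes.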
Evaluate the probabilities, not just W/L
- Brier score = mean((p − outcome)²); decomposes into reliability − resolution + uncertainty (a lower reliability term and a higher resolution term are better).
- Log loss = −mean(y·ln p + (1−y)·ln(1−p)); punishes confident wrong calls.
- Reliability diagram — bin predicted probs, plot vs. observed frequency; 45° = perfect calibration. A model can be accurate but miscalibrated (says 80%, hits 65%) — and miscalibration silently destroys Kelly sizing (Doc 6).
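The three checks above fit in a few lines of standard-library Python. This is a sketch with illustrative names; the reliability function returns the table behind the diagram (mean predicted probability, observed frequency, count per bin) rather than drawing it, and equal-width bins are an assumption.

```python
import math

def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_loss(probs, outcomes, eps=1e-12):
    """Negative mean log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)

def reliability_table(probs, outcomes, n_bins=10):
    """Rows of (mean predicted prob, observed frequency, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        b = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the last bin
        bins[b].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            freq = sum(y for _, y in members) / len(members)
            rows.append((mean_p, freq, len(members)))
    return rows

# Hypothetical predictions: the model says 80% five times and hits four.
probs = [0.8, 0.8, 0.8, 0.8, 0.8]
outcomes = [1, 1, 1, 1, 0]
print(brier_score(probs, outcomes), log_loss(probs, outcomes))
print(reliability_table(probs, outcomes))
```

With a season of ~272 games, ten bins leave few games per bin, so pooling seasons or using fewer bins is likely needed before reading much into any single row.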
Sources: Boyd's per-spread SD, nfelo cover-prob calculator, reliability diagrams.
3. Validation: walk-forward only
Sports results are an ordered, non-stationary time series. Standard k-fold CV ignores time order: whether or not rows are shuffled, most folds train on games played after the test games. That is pure lookahead leakage, and it produces optimistic numbers that never reproduce live.
Use walk-forward / rolling-origin: train on weeks 1…t, predict week t+1, append, retrain, advance. Mirrors real deployment (retrain weekly or per season). All feature engineering (means, ratings, scalers, λ tuning) must be fit on the training window only.
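A rolling-origin splitter along these lines is short enough to write by hand. The sketch below is illustrative and is not the existing backtester; the function name and the `min_train_weeks` parameter are assumptions. It takes one week label per game and yields index lists, so any fitting (ratings, scalers, λ tuning) can be done on the training indices only.

```python
def walk_forward_splits(weeks, min_train_weeks=4):
    """Yield (train_idx, test_idx) pairs in chronological order.

    weeks: one sortable week label per game (e.g. an integer week number
    that keeps increasing across seasons). Each split trains on every game
    strictly before the test week and tests on that week alone.
    """
    ordered = sorted(set(weeks))
    for i in range(min_train_weeks, len(ordered)):
        test_week = ordered[i]
        train_idx = [j for j, w in enumerate(weeks) if w < test_week]
        test_idx = [j for j, w in enumerate(weeks) if w == test_week]
        yield train_idx, test_idx

# Hypothetical schedule: two games in each of weeks 1-3, one in week 4.
weeks = [1, 1, 2, 2, 3, 3, 4]
for train_idx, test_idx in walk_forward_splits(weeks, min_train_weeks=2):
    print("train:", train_idx, "test:", test_idx)
```

The strict `w < test_week` comparison is the leakage guard: no game from the test week, or any later week, can reach the training set.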
Our backtester already does chronological splits. Verify there is no leakage — e.g., season-long stats that include the target game, post-game injury news, or using the closing line as a feature when we intend to bet the opener. Documented case: a naive split looked 12% better in backtest but dropped 28% live.
Sources: cross-validation for betting, data leakage.
4. The most important practical idea: blend toward the market
The closing line is the best public predictor — so don't ignore it, shrink toward it. The blueprint is Bill Benter's horse model: fit a fundamental model, then a second stage combining the log of your model probability and the log of the market's implied probability:
combined_i ∝ exp( α·ln(f_model_i) + β·ln(π_market_i) )
with α, β fit by maximum likelihood. NFL analogue for point estimates:
final = w · model + (1 − w) · market_line # w well below 1
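Both blends can be sketched in a few lines. In the log-linear version the proportionality is resolved by normalizing over the outcomes of one event (for a spread bet, cover and not-cover). The α, β, and w values below are placeholders; in practice they would be fit by maximum likelihood or walk-forward validation, and the function names are illustrative.

```python
import math

def benter_blend(model_probs, market_probs, alpha, beta):
    """Combine model and market probabilities for the outcomes of one event.

    combined_i is proportional to exp(alpha*ln(f_i) + beta*ln(pi_i)),
    then normalized so the combined probabilities sum to 1.
    """
    weights = [math.exp(alpha * math.log(f) + beta * math.log(pi))
               for f, pi in zip(model_probs, market_probs)]
    total = sum(weights)
    return [w / total for w in weights]

def blend_line(model_margin, market_line, w=0.25):
    """Point-estimate analogue: shrink the model margin toward the market."""
    return w * model_margin + (1.0 - w) * market_line

# Hypothetical inputs: model likes the cover at 60%, de-vigged market says 50%.
print(benter_blend([0.60, 0.40], [0.50, 0.50], alpha=0.4, beta=0.9))
print(blend_line(model_margin=7.0, market_line=3.0, w=0.25))
```

Setting α = 1, β = 0 returns the model unchanged and α = 0, β = 1 returns the market, so the fitted coefficients show directly how much weight the data gives each source.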
Benter's diagnostic was ΔR² = R²_combined − R²_market — the incremental information your model adds beyond the market. A tiny ΔR² ≈ 0.018 was enough to profit; a tipster's ΔR² ≈ 0.0002 added nothing. The NFL test: regress outcomes on (your predicted margin, the market line) — if the market coefficient swamps yours, you have no edge and should shrink hard toward the line.
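The NFL version of that test is two least-squares regressions. The sketch below uses ordinary linear R² on actual margins, which is the analogue described here rather than Benter's own likelihood-based measure; the function names are illustrative and the data is synthetic. In-sample ΔR² can never be negative for nested regressions, so the number that matters is the one computed on held-out (walk-forward) games.

```python
import numpy as np

def fit_r2(features, y):
    """OLS with intercept; returns (R^2, coefficients)."""
    X = np.column_stack([np.ones(len(y)), features])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return float(r2), beta

def delta_r2(model_margin, market_line, actual_margin):
    """R^2(model + market) minus R^2(market only), plus the joint coefficients."""
    model = np.asarray(model_margin, dtype=float)
    market = np.asarray(market_line, dtype=float)
    y = np.asarray(actual_margin, dtype=float)
    r2_market, _ = fit_r2(market[:, None], y)
    r2_both, beta = fit_r2(np.column_stack([model, market]), y)
    return r2_both - r2_market, beta   # beta = [intercept, model, market]

# Synthetic example: the model sees a small signal the market misses.
rng = np.random.default_rng(0)
market = rng.normal(0.0, 6.0, size=500)
extra_signal = rng.normal(0.0, 2.0, size=500)
actual = market + extra_signal + rng.normal(0.0, 13.0, size=500)
gain, beta = delta_r2(market + extra_signal, market, actual)
print(gain, beta)
```

If the fitted model coefficient is near zero while the market coefficient is near one, the model is adding nothing beyond the line.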
This is the philosophical pivot for our engine. Right now we predict in a vacuum and then compare to Vegas to flag bets. Better: make the de-vigged market line an explicit input/prior, output our independent number, and bet only the residual disagreement. This both improves accuracy (the market is a strong prior) and makes our edge measurable (ΔR², CLV).
Sources: annotated Benter paper.
5. How to score the model the right way
- Break-even at −110 = 52.38%. That's the bar.
- Beat the closing line (CLV), not the scoreboard. A model that nails scores but always agrees with the line makes zero bets and zero profit. Edge = systematic, correct disagreement with the no-vig close.
- Report ROI and CLV with confidence intervals, plus calibration (Brier/log loss) — not score R² as the headline.
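The arithmetic behind these metrics is simple enough to sketch. Assumptions to note: the de-vig below uses proportional normalization of a two-way market (one of several methods), CLV is measured in points with spreads written from the bettor's side (negative = laying points), and all names are illustrative. Confidence intervals are left out.

```python
def implied_prob(american_odds):
    """Break-even win probability for American odds (vig included)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100.0)
    return 100.0 / (american_odds + 100.0)

def devig_two_way(odds_a, odds_b):
    """No-vig probabilities by proportional normalization (one common method)."""
    pa, pb = implied_prob(odds_a), implied_prob(odds_b)
    total = pa + pb
    return pa / total, pb / total

def clv_points(bet_spread, closing_spread):
    """Points of closing-line value; positive means we beat the close.

    Example: we bet -2.5 and it closes -3.5, so we got a full point.
    """
    return bet_spread - closing_spread

def flat_roi(wins, losses, american_odds=-110):
    """ROI per unit staked for flat one-unit bets at a fixed price (no pushes)."""
    if american_odds < 0:
        payout = 100.0 / -american_odds
    else:
        payout = american_odds / 100.0
    return (wins * payout - losses) / (wins + losses)

print(implied_prob(-110))          # the break-even bar at -110
print(devig_two_way(-120, 100))    # hypothetical two-way price
print(clv_points(-2.5, -3.5))
print(flat_roi(wins=55, losses=45))
```

An 11-10 record at −110 returns an ROI of zero, which is the 52.38% break-even bar expressed as a record.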
6. Pitfalls checklist (print this)
- [ ] Tiny data (~272 games/yr) → regularize hard, few features, pool seasons with time-decay.
- [ ] Data leakage → no future info in features; fit everything on the train window only.
- [ ] k-fold on time series → use walk-forward instead.
- [ ] Optimizing accuracy/MSE instead of CLV/ROI → align the objective with profit.
- [ ] Miscalibration → check reliability diagram before trusting probabilities for sizing.
- [ ] Overfitting to backtest → expect live performance well below backtest; demand margin.
- [ ] Ignoring the market → blend toward the line; measure incremental edge.
7. Implications for our engine
- Introduce a ridge power-rating module as the new spine (replaces the ad-hoc scored/allowed baseline + power-rank nudge). Tune λ walk-forward.
- Add a probability layer (Φ((μ−m)/σ), Φ((T̂−L)/10)) + a per-spread σ table + key-number margin distribution. Output cover/over probabilities.
- Make the de-vigged market line a model input and blend toward it; report ΔR² so we can see if we add anything.
- Add CLV + ROI + Brier/log loss to the backtester as first-class metrics; retarget the optimizer's objective toward CLV/ROI, not raw error.
- Audit for leakage in the existing backtest pipeline (closing-line features, season-aggregate stats including the target game).
→ Continue to Doc 6 — Betting Strategy & Bankroll.