Session 43: Feature-Wiring pitf_logit — F1 Severe Test FAILED

Session 43 executed the highest-leverage move on the model-zoo critical path: promoting pitf_logit from a pure regime-type lookup scaffold — the leaderboard's resolution leader but feature-starved — to a genuine PITF logit wired with real per-event covariates. The mechanism required solving a prior problem first: how to fetch real-world data for 45 scoreable events without leaking outcome labels to the compute agent doing the fetching. The solution was an outcome-blind feature pipeline. make_feature_request.py emits an identity-only view of every event — country code, polity label, start year, horizon, and reference class, nothing else — and a blind Sonnet compute agent reads only that file. It fetched five PITF covariates per event from live sources: World Bank SP.DYN.IMRT.IN country infant mortality and the world median, plus UCDP/PRIO neighbor-conflict counts. Where data was unavailable it left the field null rather than fabricate a value. merge_features.py then merged the result into both dataset splits, asserting via byte comparison that every non-feature field — including all outcome columns — was unchanged before and after. The hold-out stayed sealed; git diff touched only the features key. Final coverage: 13 of 26 hold-out events and 12 of 22 train events carry at least one real covariate.
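
The outcome-blind flow described above can be sketched roughly as follows. This is a minimal illustration, not the contents of make_feature_request.py or merge_features.py: the field key names, the `event_key` helper, and the dict-based event layout are all assumptions made for the example.

```python
import json

# Identity fields listed in the text; the exact key names are assumed here.
IDENTITY_FIELDS = ("country_code", "polity_label", "start_year",
                   "horizon", "reference_class")


def identity_view(event):
    """Return only the identity fields, dropping outcomes and everything else."""
    return {k: event[k] for k in IDENTITY_FIELDS}


def event_key(event):
    """Hypothetical join key built from the identity fields alone."""
    return tuple(event[k] for k in IDENTITY_FIELDS)


def _bytes_without_features(event):
    """Canonical byte serialization of an event with the 'features' key removed."""
    stripped = {k: v for k, v in event.items() if k != "features"}
    return json.dumps(stripped, sort_keys=True).encode("utf-8")


def merge_features(events, fetched):
    """Attach fetched covariates and assert that no non-feature field changed.

    `fetched` maps event_key(...) to a dict of covariates; unavailable values
    stay None rather than being imputed.
    """
    merged = []
    for event in events:
        updated = dict(event)
        updated["features"] = fetched.get(event_key(event), {})
        # Byte comparison over everything except 'features', outcomes included.
        assert _bytes_without_features(updated) == _bytes_without_features(event)
        merged.append(updated)
    return merged
```

The point of the byte comparison is that the merge step can prove, rather than assume, that outcome columns passed through untouched.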

The model itself was rewritten as a proper PITF logit: hazard = base_regime_hazard × hazard_scale × exp(B_IMR×ln(IMR/world_median) + B_FAC×factionalism + B_NBR×neighbor_conflict). Critically, the betas are FIXED literature-seeded priors from Goldstone et al. 2010 — B_IMR=0.5, B_FAC=1.1, B_NBR=0.5 — and were not fit on any data in this project. Only hazard_scale, a single global scalar, is tunable via the existing train cross-validation loop. This design choice reflects a deliberate commitment to the PITF theoretical structure rather than opportunistic curve-fitting: the model is not searching for betas that happen to score well on the training split, it is asking whether the Goldstone parameters that were estimated on an entirely different dataset and epoch still carry discriminating power here. Each feature term fires only when its covariate is present; when data is missing the term collapses to neutral, so the ~19 pre-1960 events with no modern data series fall back gracefully to reference-class hazard rather than receiving an imputed or fabricated covariate.
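
As a rough illustration of the hazard formula above, the following sketch applies the three fixed betas and lets each term drop out when its covariate is missing. The function name, argument names, and the use of None for missing data are assumptions for the example. The source does not say whether the result is clipped to a valid probability, so no clipping is applied here.

```python
import math

# Fixed literature-seeded priors as stated in the text; never fit in this project.
B_IMR, B_FAC, B_NBR = 0.5, 1.1, 0.5


def pitf_hazard(base_regime_hazard, hazard_scale=1.0, imr=None,
                world_median_imr=None, factionalism=None,
                neighbor_conflict=None):
    """Sketch of hazard = base * scale * exp(sum of available feature terms)."""
    linear_term = 0.0
    # Each term fires only when its covariate is present; otherwise it adds 0,
    # which leaves the multiplier exp(0) = 1 (neutral).
    if imr is not None and world_median_imr is not None:
        linear_term += B_IMR * math.log(imr / world_median_imr)
    if factionalism is not None:
        linear_term += B_FAC * factionalism
    if neighbor_conflict is not None:
        linear_term += B_NBR * neighbor_conflict
    return base_regime_hazard * hazard_scale * math.exp(linear_term)
```

With no covariates at all, the function returns base_regime_hazard × hazard_scale, which is the reference-class fallback described for the pre-1960 events. An event whose IMR equals the world median also gets a neutral IMR term, since ln(1) = 0.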

The leaderboard movement was striking on its face. pitf_logit resolution — the discrimination metric, measuring whether the model assigns higher probabilities to events that actually resolve — jumped from 0.1393 to 0.1854, a gain of +33% and roughly double the null_baseline. It is rank-1, Tier-1 on the raw board. Brier improved marginally from 0.2192 to 0.2189. By the raw numbers, this looks like a meaningful step forward. But Session 43 had a pre-registered falsification criterion sitting in formula/epistemics/FALSIFICATION_REGISTER.md, filed before this run, that was designed precisely to distinguish real signal from a coverage-era artifact: F1 requires that the TRAIN feature-ablation — comparing pitf_logit with all covariates against pitf_logit with covariates zeroed out — produces Δresolution ≥ +0.030. The TRAIN-only result was Δres_train = +0.026. Sub-threshold by four thousandths. F1 FAILED. The gain is real but does not clear the severity bar the project imposed on itself before seeing the data.
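
The F1 ablation check can be sketched as below. The text does not specify how resolution is computed, so this example assumes the resolution term of the Murphy decomposition of the Brier score with equal-width probability bins; the function names and the bin count are hypothetical.

```python
F1_THRESHOLD = 0.030  # pre-registered minimum train ablation gain from the text


def murphy_resolution(probs, outcomes, n_bins=10):
    """Resolution term of the Murphy decomposition (assumed metric definition).

    Forecasts are grouped into equal-width bins; resolution is the weighted
    squared distance between each bin's observed frequency and the base rate.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n
    bins = {}
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins.setdefault(idx, []).append(o)
    return sum(len(obs) * (sum(obs) / len(obs) - base_rate) ** 2
               for obs in bins.values()) / n


def f1_check(res_full, res_ablated, threshold=F1_THRESHOLD):
    """Return (delta, passed) for the covariates-on versus covariates-zeroed run."""
    delta = res_full - res_ablated
    return delta, delta >= threshold
```

Under this check, the reported Δres_train of +0.026 falls below 0.030, so the second return value is False regardless of how large the headline resolution looks on the raw board.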

The Philosopher of Science ran the admission gate with full context and returned a verdict of PARTIAL APPROVAL: pitf_logit stays on the board as EXPERIMENTAL, excluded from the official ensemble. The stated reasons were precise. The deflated best Brier of 0.2352 still exceeds the market reference line of 0.18, so the model has not yet demonstrated that it beats informed crowd consensus even on the training side. PBO is 0.600, above the 0.5 threshold, so the project cannot rule out that the best observed score owes more to selection among the variants tried than to skill. The negative-control Brier worsened from 0.1876 to 0.1942, which means the model is now slightly more confused about events that should be trivially unpredictable than it was before the covariate wiring. And the flagship factionalism channel, the term with the largest beta in the Goldstone model (B_FAC=1.1), is inert: it was coded 0 on every single event in both splits and never fired once. The entire +33% resolution improvement rides on IMR-relativization and neighbor-conflict alone. A model missing its most theoretically important feature term cannot be admitted to the official ensemble on a sub-threshold F1 result.

The Philosopher's verdict triggered the experimental-exclusion mechanism added to run_zoo.py this session. Variants flagged experimental in METADATA remain on the leaderboard — they are visible, scored, and subject to PBO and deflation across the full set of all variants tried — but they are excluded from the official ensemble weight calculation. The consequence is visible and honest: the admitted-only ensemble now contains only train_freq and null_baseline, both Tier-0 models, and its Brier worsened from 0.268 to 0.306. That deterioration is not a bug; it is the system correctly reporting that removing the best-performing-but-unproven variant leaves only baseline models. The scoreboard says no admitted skilled model yet, and that is accurate. F2 is now armed prospectively: pitf_logit must achieve Brier ≤ 0.22 AND resolution ≥ 0.10 over the next N=8 resolved out-of-sample events before full admission is considered.
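
A minimal sketch of the exclusion rule and the F2 gate follows. It is not the code in run_zoo.py: the METADATA layout, the `experimental` flag name, and the equal weighting of admitted variants are assumptions, since the text does not describe the actual weight calculation.

```python
def ensemble_weights(metadata):
    """Equal weights over admitted variants; experimental variants get 0.

    Every variant stays in the returned mapping, mirroring the idea that
    experimental variants remain visible on the leaderboard but do not
    contribute to the official ensemble.
    """
    admitted = [name for name, meta in metadata.items()
                if not meta.get("experimental", False)]
    if not admitted:
        raise ValueError("no admitted variants to build an ensemble from")
    share = 1.0 / len(admitted)
    return {name: (share if name in admitted else 0.0) for name in metadata}


def f2_check(brier, resolution, n_resolved, n_required=8):
    """F2 as stated in the text: Brier <= 0.22 AND resolution >= 0.10 over N=8.

    Returns None while fewer than n_required out-of-sample events have resolved.
    """
    if n_resolved < n_required:
        return None
    return brier <= 0.22 and resolution >= 0.10
```

Flagging pitf_logit as experimental in such a mapping would leave only train_freq and null_baseline with nonzero weight, which is the state the scoreboard reports.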

The next highest-leverage move is to break the post-1990 confound that the Philosopher identified as entangling the F1 result. Two concrete steps are on the critical path. First, fetch Gapminder or Our World in Data child-mortality series alongside the world median, available back to approximately 1800 — this would populate the ~19 currently-null historical events and allow F1 to re-run without the era-coverage confound distorting the ablation comparison. Second, fetch Polity5 durable and PARCOMP scores per event — durable activates the polity_regime_durability channel, and PARCOMP is the participation-competitiveness measure most directly analogous to the factionalism construct in the original Goldstone coding; this is the path to making the inert B_FAC term fire on at least some events. Once both coverage gaps are closed, F1 re-runs. If Δres_train clears +0.030, pitf_logit is promoted to ADMIT status and F2 is the remaining validation gate before official ensemble inclusion. No Polymarket predictions were fired this session; pitf_logit is experimental, not admitted Tier-1, and the prediction protocol requires an admitted model.
