Lead agent: Orchestrator + Zoo
Ratchet engine replaced: UCB1 bandit + isotropic Gaussian swapped for per-variant scipy differential_evolution (champion seeded into initial population). Fixes the date-seed replay bug (same-day reruns replayed identical proposals) and the regression bug (Gaussian mutation could demote champions). Budget raised 40 → 400 evaluations; two same-day runs now explore distinct proposals with zero regression.
F1 ablation FAILED a third consecutive time: Δres_train ran +0.0263 (2026-06-08, coverage 12/22) → +0.0132 (widened, stale champion) → −0.0044 (widened, after TRAIN-only re-tune). Philosopher verdict: the fixed-prior PITF hypothesis is FALSIFIED, not merely unproven. A real signal should have risen through +0.030 as the coverage confound was removed and factionalism came online; it fell monotonically through zero instead. Full write-up: FALSIFICATION_REGISTER.md, entry 7.
Report-only hold-out Brier 0.1745 (best ever, below the 0.18 market line) explicitly DISREGARDED: not the locked criterion; hold-out resolution DROPPED 0.185 → 0.141; neg-ctrl Brier WORSENED 0.194 → 0.256 — the anti-signature of the claimed mechanism. Accepting it would have been a HARKing trap.
Phase 1 — country-year panel: 11,130 country-years (209 polities, 1946–2020) from COW + Polity5 + UCDP/PRIO v24.1 + Cline coups v2.0 + V-Dem ERT. Train n=8,410 / sealed hold-out n=1,198 (sha256 in HISTORY.md). The binding data constraint (n≈19, Brier CI ±0.22) is broken.
Phase 2 — honesty stack: EVT sqrt(2 ln N) deflation (binding), bootstrap CIs, permutation + BH-FDR, anytime-valid e-values. EVT-deflated best Brier = 0.2500 = chance — the system honestly reports zero deflated skill evidence. Grammar sweep on 236 structural candidates: 11 FDR survivors (q=0.10) at panel scale, factionalism in every survivor (p=0.002); with n=19 the same machinery produces ~0 survivors.
Phase 3 — nightly bounded agentic evolution live-tested: one claude -p Sonnet build/night, $3 cap, 45-min timeout, protected-file hash quarantine + git-restore, forced experimental via metadata.json. Built variant_sdt_turchin (Brier 0.349, resolution 0.073, Tier 0 as expected without real MMP proxies). End-to-end integrity verified.
Recovery batch — all skipped data sources obtained in parallel: COW Direct Contiguity v3.2 (414 land-contiguous dyads, now primary neighbor relation), REIGN 2021-8 (138,600 country-months, 94.1% coverage), EPR Core 2021 (92.4%), Maddison 2023 via OWID mirror (78.7%), UN WPP youth bulge via OWID (83.3%). Polymarket unblocked in-process via ISP DoH bypass (polymarket/net_access.py, HTTP 200 ×3). Metaculus identified as auth gate (token policy late 2024), not Cloudflare.
Panel v1.1 re-sealed (one-time sanctioned re-seal — v1.0 was never scored): same split design and outcome logic, 5 new features wired, --verify PASSED. Seal is now hard. All six Phase-4 variants (hierarchical_bayes, firth_logit, hazard_spline, reign_logit, gbm_honest, conformal_wrapper) built with per-variant falsification criteria locked in the register BEFORE run_zoo produced any number.
First 10-variant board after 216-eval ratchet pass: firth_logit resolution 0.1918 = board record (the fitted-coefficient route delivering exactly what the grammar sweep predicted); sdt_turchin resolution 0.073 → 0.173 after youth_bulge landed; ratchet capacity proof — 216 evals improved 5 variants in one pass vs 1–7 scalar accepts on the old engine. PBO honestly 0.700 with 10 variants tried. Official ensemble unchanged at 0.2909 (2 admitted variants) — correct, admission only via pre-registered gates.
Agent architecture v2.0: 10×15 ceremonial org retired to knowledge bases; 7 artifact-owning roles (Orchestrator, Data Engineer, ML Engineer, Demographer, Red-Team Forecaster, Calibration Auditor, Philosopher-judges-only). start.md rewritten to queue-driven protocol. CLAUDE.md rewritten to describe the real system.
EVT-deflated best Brier = 0.2500 = chance: zero deflated evidence of skill yet — the honest redesign baseline
Fixed-prior PITF bundle FALSIFIED (register entry 7) — re-entry only via fitted coefficients or panel-scale evidence at the same +0.030 Δres bar with neg-control guard
firth_logit posts board-record resolution 0.1918 but poor calibration (Brier 0.2693); its pre-registered FL-1 severe test decides admission — do not interpret the resolution figure as a pass
Metaculus needs an operator API token (auth gate, not Cloudflare); Polymarket permanent fix = system DNS to 1.1.1.1 or equivalent DoH — in-process bypass is fragile
Panel hold-out re-sealed ONCE (sanctioned — v1.0 was never scored against); seal is now hard — no further re-seals permitted
Session 44 was the throughput redesign — the single most consequential architectural day in the project's history. The problem it addressed had been accumulating for months: the nightly optimization loop was tuning only three scalar parameters via a UCB1 bandit with isotropic Gaussian proposals, working on approximately 19 events that produced Brier confidence intervals of roughly ±0.22 — wider than any gap between models on the leaderboard. The system could not distinguish signal from noise, and its best candidate hypothesis, the fixed-prior PITF bundle, had failed its locked F1 severity test twice already. Something had to change structurally, not marginally.
Phase 0 addressed the ratchet engine first. The UCB1 bandit was replaced with per-variant scipy differential evolution, with the current champion seeded into the initial population. This fixed two concrete bugs that had been silently wasting compute. The date-seeded RNG bug meant that same-day reruns replayed identical proposals, so every rerun within a day burned its budget on duplicates. The regression bug meant that Gaussian mutation could and did demote champions between sessions (hazard_scale 2.5 falling to 1.986 between two 2026-06-08 runs). Differential evolution replaces a population member only when the trial candidate scores better, so a seeded champion cannot be displaced by a worse proposal. The default evaluation budget was raised from 40 to 400 evaluations. Verification was direct: two same-day runs explored distinct proposals, and the second run accepted zero regressions. Alongside the ratchet fix, the outcome-blind feature widening pipeline added OWID child mortality back to 1751, wired Polity5 factionalism to 33 of 45 events (the flagship B_FAC=1.1 channel finally fired), and raised hold-out feature coverage from 13/26 to 21/26.
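The two properties the ratchet fix relies on (a champion seeded into the initial population, and strict keep-if-better selection) can be illustrated with a minimal, stdlib-only sketch. This is a hand-rolled DE/rand/1/bin loop, not the scipy call the project uses; `toy_loss`, `de_ratchet`, the bounds, and the default settings are hypothetical names and values chosen for illustration. The seed is passed in explicitly, standing in for whatever per-run entropy source replaced the date-derived seed.

```python
import random

def toy_loss(x):
    # Hypothetical objective standing in for a variant's TRAIN loss.
    return sum((xi - 0.3) ** 2 for xi in x)

def de_ratchet(loss, champion, bounds, seed, pop_size=8, generations=10,
               f=0.7, cr=0.9):
    """Minimal DE/rand/1/bin with the champion as population member 0."""
    if pop_size < 4:
        raise ValueError("DE/rand/1 needs at least 4 population members")
    rng = random.Random(seed)  # per-run seed, not derived from the date
    dim = len(champion)
    # Seed the champion, then fill the rest uniformly inside the bounds.
    pop = [list(champion)]
    pop += [[rng.uniform(lo, hi) for lo, hi in bounds]
            for _ in range(pop_size - 1)]
    scores = [loss(p) for p in pop]
    proposals = []  # every trial vector, kept so runs can be compared
    for _ in range(generations):
        for i in range(pop_size):
            others = [j for j in range(pop_size) if j != i]
            a, b, c = rng.sample(others, 3)
            forced = rng.randrange(dim)  # at least one mutated coordinate
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    v = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    lo, hi = bounds[d]
                    v = min(max(v, lo), hi)  # clip into the bounds
                else:
                    v = pop[i][d]
                trial.append(v)
            proposals.append(tuple(trial))
            s = loss(trial)
            # Keep-if-better: a member is only ever replaced by a
            # strictly better trial, so no slot can regress.
            if s < scores[i]:
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best], proposals
```

Because slot 0 starts at the champion's score and can only improve, the returned best score can never be worse than the champion's, whatever the seed. Two calls with different seeds draw different populations and therefore different proposals.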
With wider feature coverage, the locked F1 severity test ran for the third time. The criterion was unchanged: Δres_train ≥ +0.030. The results ran +0.0263 on 2026-06-08 with coverage 12/22 and B_FAC inert, then +0.0132 with the widened features and a stale champion, then −0.0044 after a sanctioned TRAIN-only re-tune. The Philosopher's verdict was unambiguous: the structural hypothesis, that fixed-prior PITF channels add discrimination beyond reference-class structure, is FALSIFIED, not merely unproven. A real signal should have risen through +0.030 as the coverage confound was removed and factionalism came online. It fell monotonically through zero instead. The full entry was appended to FALSIFICATION_REGISTER.md as entry 7. The report-only hold-out Brier of 0.1745 (the best ever recorded, below the 0.18 market line) was explicitly disregarded, for three reasons: it was not the locked criterion; hold-out resolution actually dropped from 0.185 to 0.141; and the negative-control Brier worsened from 0.194 to 0.256, which is the anti-signature of the claimed mechanism. Accepting that number as evidence would have been a HARKing trap. The system correctly ruled against its own most flattering result.
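The reasoning behind the verdict is simple arithmetic on the three reported runs, and can be sketched as follows. The function name `f1_verdict` and the exact fields it returns are hypothetical; only the threshold and the three Δres_train values come from the text.

```python
F1_THRESHOLD = 0.030  # locked criterion from the text: delta_res_train >= +0.030

def f1_verdict(delta_res_runs, threshold=F1_THRESHOLD):
    """Summarize a sequence of ablation runs against the locked bar."""
    pairs = list(zip(delta_res_runs, delta_res_runs[1:]))
    return {
        # Did any run clear the locked bar?
        "ever_passed": any(r >= threshold for r in delta_res_runs),
        # Did the effect shrink at every step as confounds were removed?
        "monotone_decline": all(later < earlier for earlier, later in pairs),
        # Did it start positive and end negative?
        "crossed_zero": delta_res_runs[0] > 0 > delta_res_runs[-1],
    }

# The three runs reported above, in order.
runs = [0.0263, 0.0132, -0.0044]
verdict = f1_verdict(runs)
```

A real effect masked by a coverage confound would be expected to show `ever_passed` becoming true as coverage improved. The observed pattern is the opposite on all three fields.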
Phase 1 broke the data constraint. The panel builder constructed a spine of 11,130 country-years covering 209 polities from 1946 to 2020, drawing outcomes from UCDP/PRIO v24.1, Cline coups v2.0, Polity5 adverse changes, and V-Dem ERT, with features at 76–100% coverage. The split: TRAIN 1946–2004 at n=8,410 with a base rate of 21.8%, an embargoed 2005–2009 buffer, and a SEALED hold-out of 2010–2015 at n=1,198 with a base rate of 13.8% — its sha256 recorded in formula/HISTORY.md and covered by the evolve.py tamper guard. Going from 19 events to 8,410 training observations is not an incremental improvement. It changes what hypotheses are even testable.
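The split and the seal can be sketched in a few lines. This is an illustration under stated assumptions, not the panel builder's code: `assign_split`, `seal_digest`, `verify_seal`, the row layout, and the JSON canonicalization are all hypothetical. The year boundaries are the ones given above. The text does not say how years outside those three ranges are used, so they are labeled "unassigned" here.

```python
import hashlib
import json

def assign_split(year):
    """Map a country-year to its partition by calendar year."""
    if 1946 <= year <= 2004:
        return "train"
    if 2005 <= year <= 2009:
        return "embargo"   # buffer between train and hold-out
    if 2010 <= year <= 2015:
        return "holdout"   # sealed
    return "unassigned"    # assumption: not specified in the text

def seal_digest(rows):
    """sha256 over a canonical serialization of the hold-out rows.

    Rows and keys are sorted so the digest does not depend on row order
    or dict ordering, only on content.
    """
    ordered = sorted(rows, key=lambda r: (r["country"], r["year"]))
    canonical = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify_seal(rows, recorded_digest):
    """True only if the rows still hash to the recorded digest."""
    return seal_digest(rows) == recorded_digest
```

Recording the digest once and re-checking it before every scoring run makes any change to the sealed rows detectable, which is the property a tamper guard needs.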
Phase 2 installed the honesty stack. The EVT sqrt(2 ln N) deflation is the binding correction: applied to the best observed Brier across the zoo of 10 variants, it produces 0.2500, exactly chance. The leaderboard now reports this number prominently. It is not a failure of the session; it is the redesign's honest baseline. The grammar sweep ran 236 structural candidates through two-fidelity purged cross-validation on the panel. Eleven candidates survived FDR control at q=0.10, and factionalism appeared in every one of them at p=0.002, fitted at panel scale rather than assumed at a fixed prior. With the same machinery at n=19, the expected number of survivors is approximately zero. The mechanism the falsification register rejected at fixed priors is precisely the mechanism the panel-scale evidence is now pointing toward, via a different and sanctioned path.
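Two pieces of the honesty stack lend themselves to a short sketch. The first uses the standard approximation that the expected maximum of N independent standard normal draws is about sqrt(2 ln N), which gives a hurdle of roughly 2.15 standard errors for N = 10. How the project turns that hurdle into a deflated Brier is not spelled out above, so `deflate_brier` is an assumption: it subtracts the hurdle times a standard error from the raw skill and floors the result at zero skill, that is, at the chance Brier of 0.25. The second function is the textbook Benjamini-Hochberg step-up procedure. All function names and the example inputs are hypothetical.

```python
import math

def evt_hurdle(n_trials):
    """Approximate expected max of n_trials standard normals."""
    return math.sqrt(2.0 * math.log(n_trials)) if n_trials > 1 else 0.0

def deflate_brier(best_brier, std_err, n_trials, chance=0.25):
    """Assumed deflation rule: penalize skill by hurdle * std_err.

    Skill is measured as (chance - Brier); deflated skill is floored at
    zero, so the deflated Brier never exceeds chance.
    """
    raw_skill = chance - best_brier
    deflated_skill = max(0.0, raw_skill - evt_hurdle(n_trials) * std_err)
    return chance - deflated_skill

def benjamini_hochberg(p_values, q=0.10):
    """Return sorted indices of hypotheses rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    # Step-up rule: find the largest rank k with p_(k) <= q * k / m.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])
```

Under this rule, any best Brier whose raw skill is smaller than about 2.15 standard errors deflates all the way back to 0.25 when ten variants have been tried, which is one way a leaderboard can end up reporting exactly chance.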
Phase 3 put the nightly agentic evolution loop into live operation. Each night runs one bounded claude -p Sonnet invocation under a $3 cost cap and a 45-minute timeout, with protected-file hash quarantine and automatic git-restore on any tamper attempt. A forced experimental flag in metadata.json is enforced by run_zoo, so the evolve stage cannot promote a model to admitted status by itself. The live test built variant_sdt_turchin from scratch, producing a Brier of 0.349 and resolution of 0.073 at Tier 0, exactly as expected given that the Turchin secular-dynamics theory requires real MMP proxy data that was not yet wired. The end-to-end integrity check passed both before and after the build. The queue and the MAP-Elites family archive updated automatically. The kill switch at loop/EVOLVE_DISABLED is operational.
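The guard rails around the nightly build follow a common pattern: check a kill switch, hash the protected files, run the job under a timeout, then re-hash and compare. A minimal sketch of that pattern follows. It is not the project's loop code; `snapshot`, `changed_files`, and `run_bounded` are hypothetical names, the command is whatever the caller passes in, and neither the cost cap nor the git restore is modeled (the caller would handle the restore when the status is "quarantined").

```python
import hashlib
import subprocess
from pathlib import Path

def snapshot(paths):
    """sha256 of each protected file; a missing file gets a sentinel."""
    digests = {}
    for p in paths:
        path = Path(p)
        if path.exists():
            digests[str(p)] = hashlib.sha256(path.read_bytes()).hexdigest()
        else:
            digests[str(p)] = "MISSING"
    return digests

def changed_files(before, after):
    """Protected paths whose digest differs between two snapshots."""
    return sorted(p for p in before if before[p] != after.get(p))

def run_bounded(cmd, protected, kill_switch, timeout_s=45 * 60):
    """Run cmd once under a timeout and report any protected-file change."""
    if Path(kill_switch).exists():
        return {"status": "disabled", "tampered": []}
    before = snapshot(protected)
    try:
        subprocess.run(cmd, timeout=timeout_s, check=False)
        status = "finished"
    except subprocess.TimeoutExpired:
        status = "timeout"
    tampered = changed_files(before, snapshot(protected))
    if tampered:
        status = "quarantined"  # caller should restore these files
    return {"status": status, "tampered": tampered}
```

Hashing after the run, even on timeout, matters: a job that is killed partway through may already have written to a protected file.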
The recovery batch addressed every data source that had been marked skipped or blocked. Six parallel agents worked simultaneously. COW Direct Contiguity v3.2 had returned 404s due to a site reorganization; the real DirectContiguity320.zip was found at the new wp-content path, delivering 414 land-contiguous dyads from 1816 to 2016 as the primary neighbor relation. REIGN 2021-8 came from the GitHub archive at OEFDataScience: 138,600 country-months yielding leader_tenure_years and irregular_regime at 94.1% coverage. EPR Core 2021 came directly from ETH at 92.4%. Maddison 2023 came via the OWID mirror after the Dataverse route proved DDoS-gated, yielding gdp_pc_ln at 78.7%. UN WPP age structure via OWID gave youth_bulge at 83.3%, using the real definition with no proxy fallback. The Polymarket block was diagnosed as an ISP-level silent DNS drop on *.polymarket.com by the Vivo resolver: the hosts file was clean, and 1.1.1.1 and 8.8.8.8 both resolved the domain correctly. An in-process DoH plus CURLOPT_RESOLVE bypass in polymarket/net_access.py returned HTTP 200 three times. The live read showed the pred_004 Starmer market at 75.5% versus our locked 58%. Metaculus was confirmed as an auth gate, not Cloudflare: the token policy changed in late 2024, and an operator-provided API token is the fix.
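The DoH plus CURLOPT_RESOLVE approach has two mechanical steps: ask a DNS-over-HTTPS resolver for the A records, then pin the hostname to one of those addresses for the HTTP client. The sketch below covers those steps only and is not the contents of polymarket/net_access.py. It assumes Cloudflare's public DoH JSON endpoint and that the local resolver can still reach that endpoint; the function names are hypothetical. The string built by `curl_resolve_entry` uses the `host:port:address` form that curl's resolve option expects.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Cloudflare's public DNS-over-HTTPS JSON endpoint.
DOH_URL = "https://cloudflare-dns.com/dns-query"

def fetch_doh_json(hostname, timeout=10):
    """Query the DoH resolver for A records; returns the raw JSON text."""
    query = urllib.parse.urlencode({"name": hostname, "type": "A"})
    req = urllib.request.Request(
        f"{DOH_URL}?{query}",
        headers={"Accept": "application/dns-json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

def a_records(doh_json_text):
    """Extract IPv4 addresses (DNS record type 1) from a DoH JSON reply."""
    payload = json.loads(doh_json_text)
    return [ans["data"] for ans in payload.get("Answer", [])
            if ans.get("type") == 1]

def curl_resolve_entry(hostname, ip, port=443):
    """Build a 'host:port:address' pin for curl's resolve option."""
    return f"{hostname}:{port}:{ip}"
```

Pinning the address rather than rewriting the URL keeps the hostname intact for TLS certificate checks and the Host header. It remains fragile for the reason noted earlier in this log: the pinned address goes stale whenever the service changes IPs.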
Panel v1.1 was re-sealed in a one-time sanctioned ceremony. The justification: the v1.0 hold-out had never been scored, so no information was extracted — the re-seal was epistemically clean. Same split design and outcome logic, five new features wired, --verify passed. The sha256 is recorded in HISTORY.md. The seal is now hard. All six Phase-4 variants — hierarchical_bayes, firth_logit, hazard_spline, reign_logit, gbm_honest, conformal_wrapper — were built using the evolve stage, and this was the first session in the project where per-variant falsification criteria were locked in the register before run_zoo produced any number. The pipeline order enforced pre-registration. firth_logit is explicitly bound to entry 7's re-entry terms: fitted-ablation Δres ≥ +0.030 plus a neg-control guard.
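The ordering guarantee described here (criteria locked before any score exists) can be enforced mechanically by refusing to score a variant that has no locked entry. The sketch below shows one way to do that. It is not the project's register or run_zoo code: `Register`, `NotPreRegistered`, `score_variant`, and the criterion string are hypothetical.

```python
class NotPreRegistered(Exception):
    """Raised when a variant is scored without a locked criterion."""

class Register:
    """Append-only store of per-variant falsification criteria."""

    def __init__(self):
        self._locked = {}

    def lock(self, variant, criterion):
        # A criterion can be locked once; re-locking would allow the
        # bar to be moved after results are seen.
        if variant in self._locked:
            raise ValueError(f"criterion for {variant!r} is already locked")
        self._locked[variant] = criterion

    def criterion_for(self, variant):
        if variant not in self._locked:
            raise NotPreRegistered(variant)
        return self._locked[variant]

def score_variant(register, variant, scorer):
    """Score a variant only if its criterion was locked beforehand."""
    criterion = register.criterion_for(variant)  # raises if missing
    return {"variant": variant, "criterion": criterion,
            "score": scorer(variant)}
```

The scorer is never called for an unregistered variant, so no number exists that could influence the choice of criterion.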
The first honest 10-variant board emerged from a 216-evaluation ratchet pass. firth_logit posted resolution 0.1918 — a board record, surpassing even the falsified fixed-beta peak of 0.1854 — delivering exactly what the grammar sweep predicted: the fitted-coefficient route works where the fixed-prior route did not. sdt_turchin's resolution improved from 0.073 to 0.173 immediately after youth_bulge landed, validating the theory-grounded variant's sensitivity to its intended proxies. The ratchet capacity proof is concrete: 216 evaluations improved five variants in a single pass, versus one to seven scalar accepts on the old three-tunable engine. PBO honestly reached 0.700 with ten variants tried. The official ensemble remains at 0.2909 with two admitted variants — correct, because admission flows only through the six pre-registered severe tests, none of which have run yet. firth_logit's FL-1 test is first in the queue.
The honest bottom line: EVT-deflated best Brier is still at chance. The system has zero deflated skill evidence in its official numbers. What it has instead is a much sharper set of bets: a training panel of 8,410 observations where there had been roughly 19 events, a falsification register that caught its best model's mechanism failing, a grammar sweep pointing at the same factionalism channel through a different and sanctioned route, a ratchet that can actually explore the space, and six pre-registered severe tests waiting to decide whether the first panel-fitted models earn their admission. The next /start runs FL-1.