For 41 sessions the project polished a single equation it could never actually run. We rebuilt it the way real forecasting sciences work: a population of competing, runnable models on a leaderboard, scored by a frozen oracle, driven by an autonomous nightly loop. The June throughput build then broke the binding data constraint by adding a country-year panel of 11,130 observations, gave the night a mass structural-hypothesis search and a bounded agentic builder, and put the whole thing behind a multiple-testing-honest evaluation stack.
The honest question 41 sessions never answered — what is the actual forecast skill? — now has an honest answer: EVT-deflated, still at chance. We say so, and we built the machinery to move it.
Why we were stuck
A ground-truth audit of the repository plus a 10-report research review converged on the same conclusion: the stagnation was architectural, not a matter of effort. An autonomous discovery loop needs four legs. The old design had none of them.
One living formula, refined by consensus.
There was exactly one CURRENT.md. Every session re-polished the same document — gradient descent from a single seed. Productive discovery needs a population of competing candidates (FunSearch, AlphaEvolve, ShinkaEvolve). With one model, the diversity term that powers every ensemble is zero by construction.
The gate was prose, not a number.
A formula update was approved by the Philosopher's written verdict. Prose cannot detect overfitting or compute calibration — and it selects for coherence with the existing formula, filtering out exactly the diverse variants an ensemble needs. Every serious system gates on an immutable scalar an agent cannot edit.
Agents read papers; they never ran code.
All nine lead agents and their 135 sub-agents only searched and scored literature. None fit a model, ran a simulation, or executed a backtest. The 36 'calibrated' parameters were argued into existence from papers, never fit to data — curve-fitting by citation.
Backtests were circular and qualitative.
Historical 'tests' were narrative precondition assessments scored PASS/PARTIAL, with the calibration cases reused as validation cases. No locked hold-out, no numeric score, no negative controls. After 40 sessions the state vector and solver were still undefined — so the formula had never produced a single numerical prediction.
The forks we chose
| Fork | Choice | What it means |
|---|---|---|
| Scope | Clean-slate engine, keep the corpus | Rebuild the session engine around the model zoo; archive the single-formula apparatus; reuse the 74 accumulated parameters as priors. |
| Mathematics | A portfolio of tractable models now | Runnable models that can be backtested today (hazard models, regime logits, ABMs, ensembles). The grand unified PDE becomes one variant and a non-blocking long-horizon research track. |
| Compute | Agents that run code | A new tier of compute agents fit models, run simulations, and execute backtests every session. Literature research becomes an input, not the deliverable. |
| Targets | Dense, scoreable benchmarks | UCDP, V-Dem, PITF, the Cline Center coup data, Seshat, and IMF feed continuous, numerically-scored feedback. Polymarket becomes one signal among several. |
| Publication | Publish pre-resolution, labeled | Predictions are hash-locked before they are published, then shown live tagged by reflexivity class. The clean validation signal is held-out retrodiction; live markets are secondary. |
| Cadence | Nightly autonomous loop | A scheduled ratchet runs many fast backtest experiments each night: keep what improves the score, revert what does not. |
The model zoo
Each variant is a genuinely different structural hypothesis with a runnable model — not a parameter tweak. They are scored identically by the frozen oracle and combined into an ensemble of the ADMITTED variants only. A variant earns admission by beating the null baseline on discrimination AND passing a pre-registered severe test; until then it stays on the board marked experimental. The June build added six new variants, most fit on the country-year panel.
null
Reference-class base rate only. The mandatory benchmark — beat it or you are noise.
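To make the benchmark concrete, here is a minimal sketch of a Brier score and a base-rate forecaster. The function names and the toy data are hypothetical, and the sketch uses a single pooled base rate rather than per-reference-class rates; it is not the frozen oracle.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def null_forecast(train_outcomes, n_cases):
    """Predict the training base rate for every one of n_cases hold-out cases."""
    base_rate = sum(train_outcomes) / len(train_outcomes)
    return [base_rate] * n_cases


# Hypothetical toy data, not the project's events.
train = [0, 0, 0, 1]      # base rate 0.25
holdout = [1, 0, 1, 0]    # crisis-skewed: half the cases are events

null_brier = brier_score(null_forecast(train, len(holdout)), holdout)
coin_flip_brier = brier_score([0.5] * len(holdout), holdout)
print(null_brier, coin_flip_brier)
```

On this toy set the base-rate forecaster scores worse than a coin flip because the hold-out contains more events than the training history, which is one way a null baseline can sit above the 0.25 chance line on a crisis-skewed set.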
empirical
Smoothed frequency learned from training events — a control that exposes data contamination.
regime logit
PITF regime-type inverted-U with FIXED literature betas on infant mortality, factionalism, neighbor conflict.
regime logit (fitted)
The same PITF channels but with coefficients FITTED on the panel under a Firth penalty — the Philosopher-sanctioned re-entry route. Board-record discrimination.
structural-demographic
Turchin's Political Stress Index — elite overproduction, mass mobilization, state fiscal stress; now fed real youth-bulge data.
hazard (partial pooling)
Beta-Binomial empirical-Bayes shrinkage across reference classes — the statistically correct small-N regularizer.
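A rough sketch of the shrinkage idea, under simplifying assumptions: the prior mean is the pooled event rate and the prior strength is fixed by hand, whereas a full empirical-Bayes fit would estimate it from the data. `shrink_rates`, `prior_strength`, and the class counts are hypothetical names and numbers.

```python
def shrink_rates(counts, prior_strength=10.0):
    """Posterior-mean event rate per class under a shared Beta prior.

    counts maps class name -> (events, trials).
    The prior is Beta(alpha, beta) with mean equal to the pooled rate and
    alpha + beta = prior_strength (an assumption; not estimated here).
    """
    total_events = sum(k for k, _ in counts.values())
    total_trials = sum(n for _, n in counts.values())
    pooled = total_events / total_trials
    alpha = pooled * prior_strength
    beta = (1.0 - pooled) * prior_strength
    return {
        name: (alpha + k) / (alpha + beta + n)
        for name, (k, n) in counts.items()
    }


# Hypothetical reference classes: (events, country-years observed).
counts = {"anocracy": (3, 4), "democracy": (1, 40), "autocracy": (6, 56)}
shrunk = shrink_rates(counts)
for name, (k, n) in counts.items():
    print(name, "raw", round(k / n, 3), "shrunk", round(shrunk[name], 3))
```

The small class (4 trials) is pulled strongly toward the pooled rate, while the large classes barely move, which is the small-N behaviour the variant relies on.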
hazard (splines)
Discrete-time logistic hazard with natural cubic splines on polity score and IMR — captures the anocracy inverted-U linear logits miss.
leader tenure
Regime/leader-duration hazard (REIGN): risk falls with regime durability and leader tenure, rises with factionalism.
ML ensemble
Gradient boosting on panel features with hard anti-overfit constraints (ViEWS paradigm), early-stopped on a chronological tail.
calibration meta-layer
Split-conformal calibration over the admitted ensemble — inherits, never creates, discrimination; reports honest intervals.
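One way the meta-layer could work is the textbook split-conformal recipe with absolute-residual scores, sketched below. This is an assumption about the construction, not the project's wrapper; the function names, the `alpha` level, and the calibration data are hypothetical.

```python
import math


def conformal_halfwidth(cal_probs, cal_outcomes, alpha=0.2):
    """Half-width q such that [p - q, p + q] covers the outcome with
    probability >= 1 - alpha, assuming exchangeable calibration cases."""
    scores = sorted(abs(y - p) for p, y in zip(cal_probs, cal_outcomes))
    n = len(scores)
    # Finite-sample rank; the tiny offset guards against float round-up.
    rank = math.ceil((n + 1) * (1 - alpha) - 1e-9)
    if rank > n:
        return 1.0  # too few calibration cases: the interval is all of [0, 1]
    return scores[rank - 1]


def conformal_interval(prob, halfwidth):
    """Clip the symmetric interval to the valid probability range."""
    return max(0.0, prob - halfwidth), min(1.0, prob + halfwidth)


# Hypothetical calibration split: ensemble probabilities and realized outcomes.
cal_probs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.4, 0.9, 0.5]
cal_outcomes = [0, 0, 0, 1, 1, 1, 0, 1, 1]
q = conformal_halfwidth(cal_probs, cal_outcomes)
print(conformal_interval(0.9, q))
```

The wrapper only widens or narrows the band around the ensemble's own probability; it never reorders cases, which is why it can inherit discrimination but not create it.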
The autonomous nightly pipeline
The June build turned a near-idle parameter ratchet into a four-stage engine. Stages 1–2 are pure compute; stage 2b runs a mass structural search; stage 3 is a single bounded LLM build (the AlphaEvolve pattern — LLM as mutation operator, frozen scorer as selection); stage 4 resolves live predictions. Anything the LLM builds is auto-marked experimental and can never enter the ensemble without the Philosopher gate.
Ratchet
Differential evolution searches each variant's parameter box against TRAINING purged cross-validation (keep-if-better, ~400 evals). It replaced the old UCB1 + Gaussian engine, fixing a same-day-replay seed bug and a champion-regression bug.
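The ratchet's search step can be sketched as a plain DE/rand/1/bin loop with greedy replacement. This is an illustrative stdlib implementation, not the project's engine; `toy_loss` stands in for the training purged-CV score, and every parameter default is an assumption apart from the roughly 400-evaluation budget.

```python
import random


def differential_evolution(loss, bounds, pop_size=12, max_evals=400,
                           f=0.7, cr=0.9, seed=0):
    """Minimize loss inside a box. Requires pop_size >= 4."""
    rng = random.Random(seed)  # explicit seed makes a run reproducible
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [loss(x) for x in pop]
    evals = pop_size
    while evals < max_evals:
        for i in range(pop_size):
            if evals >= max_evals:
                break
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate mutates
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < cr:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                else:
                    value = pop[i][d]
                trial.append(min(hi, max(lo, value)))  # clip to the box
            trial_score = loss(trial)
            evals += 1
            # Keep-if-better: a member is replaced only by a strictly
            # better trial, so the best score can never regress.
            if trial_score < scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def toy_loss(x):
    """Hypothetical stand-in for a variant's training purged-CV Brier."""
    return (x[0] - 0.3) ** 2 + (x[1] + 0.5) ** 2


best_params, best_score = differential_evolution(
    toy_loss, [(-1.0, 1.0), (-1.0, 1.0)])
print(best_params, best_score)
```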
Grammar sweep
Thousands of structural candidates — feature subsets × transforms × link functions × bases — fit and ranked by out-of-fold discrimination; survivors of a permutation null + Benjamini-Hochberg FDR become queued hypotheses (labeled hypotheses, never discoveries).
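The two filters named above can be sketched as follows: a label-permutation p-value for one candidate's score, then the Benjamini-Hochberg step-up rule across all candidates. The scoring function `mean_gap`, the FDR level `q`, and the example p-values are hypothetical; the sweep itself scores out-of-fold discrimination.

```python
import random


def mean_gap(features, labels):
    """Toy discrimination score: mean feature in events minus non-events."""
    ones = [x for x, y in zip(features, labels) if y == 1]
    zeros = [x for x, y in zip(features, labels) if y == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)


def permutation_pvalue(score_fn, features, labels, n_perm=200, seed=0):
    """Share of label shuffles scoring at least as well as the real labels."""
    rng = random.Random(seed)
    observed = score_fn(features, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if score_fn(features, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one keeps the p-value above zero


def benjamini_hochberg(pvalues, q=0.05):
    """Indices of hypotheses that survive the BH step-up rule at FDR level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            cutoff = rank  # keep the largest rank that passes
    return sorted(order[:cutoff])


features = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
p_one = permutation_pvalue(mean_gap, features, labels)

candidate_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042,
                     0.06, 0.074, 0.205, 0.212, 0.216]
survivors = benjamini_hochberg(candidate_pvalues)
print(p_one, survivors)
```

Note that several candidates with p below 0.05 are dropped once the threshold is scaled by rank, which is the sense in which survivors are queued as hypotheses rather than counted as discoveries.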
Evolve
ONE bounded headless agent build per night: it pulls the top queued hypothesis, writes a runnable variant, the frozen scorer judges it. Hard dollar cap, wall-clock cap, protected-file hash quarantine, automatic experimental flag.
Resolve
Registered predictions are checked against live markets (Manifold; Polymarket via an in-process DNS bypass) and appended to an append-only scoreboard.
Anti-degeneracy safeguards
The frozen oracle + honesty stack
The single evaluation function for the whole project is frozen and hash-verified. Around it sits an honesty stack that scales with the thousands of hypotheses now tested per night, because testing more stays honest only if the haircut grows with the trial count.
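Hash verification of a frozen file can be as simple as comparing SHA-256 digests against a recorded manifest before each run. The sketch below assumes a manifest that maps file paths to expected digests; the function names and the manifest layout are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_frozen(manifest):
    """Return the protected paths that are missing or whose content changed.

    manifest maps path -> expected hex digest recorded when the file was frozen.
    An empty list means every protected file is intact.
    """
    changed = []
    for path, expected in manifest.items():
        if not Path(path).is_file() or sha256_of(path) != expected:
            changed.append(path)
    return changed
```

A pre-flight check would refuse to score anything when `verify_frozen` returns a non-empty list, so an agent that edits the scorer invalidates the run instead of improving its own number.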
Integrity rules
Live leaderboard
Scored by the frozen oracle on a locked hold-out of 26 historical events including 10 negative controls (high-stress societies that did NOT collapse). The set is deliberately crisis-skewed, so the real test is resolution (discrimination), not a low average Brier. With ten variants now tried, the probability of backtest overfitting (PBO) is 0.70, and the binding number is the best Brier after extreme-value-theory (EVT) deflation, which sits exactly at the chance line (0.25): no deflated evidence of skill yet. New variants are auto-experimental and excluded from the official ensemble until they pass a pre-registered severe test.
| Model | Family | Brier | Resolution | Neg-ctrl | Tier |
|---|---|---|---|---|---|
| ensemble | equal-weight | 0.291 | 0.096 | 0.068 | T0 |
| pitf_logit (excl.) | regime_logit | 0.175 | 0.141 | 0.256 | T2 |
| hierarchical_bayes (excl.) | empirical_bayes | 0.219 | 0.084 | 0.278 | T1 |
| hazard_spline (excl.) | hazard_spline | 0.220 | 0.095 | 0.405 | T1 |
| conformal_wrapper (excl.) | calibration_meta | 0.221 | 0.121 | 0.158 | T1 |
| sdt_turchin (excl.) | structural_demographic | 0.230 | 0.173 | 0.222 | T1 |
| train_freq | empirical_frequency | 0.234 | 0.064 | 0.161 | T1 |
| firth_logit (excl.) | penalised_logit | 0.269 | 0.192 | 0.297 | T0 |
| gbm_honest (excl.) | gradient_boosting | 0.281 | 0.115 | 0.144 | T0 |
| reign_logit (excl.) | duration_logit | 0.330 | 0.060 | 0.209 | T0 |
| null_baseline | null | 0.370 | 0.095 | 0.038 | T0 |
The headline is a falsification and a lead, not a victory. The fixed-prior pitf_logit posts the lowest raw Brier (0.175) but its pre-registered F1 ablation FAILED a third time as feature coverage widened — so the fixed-beta PITF hypothesis is FALSIFIED, and that low Brier is a calibration artifact (its discrimination dropped and its negative-control error rose). The genuinely interesting result is firth_logit: the SAME PITF channels, but with coefficients fitted on the country-year panel, reach the board's highest discrimination ever (resolution 0.192) — the Philosopher-sanctioned re-entry route, now awaiting its own pre-registered severe test. The admitted ensemble (just the two baselines) stays conservative on purpose. The binding honest number: EVT-deflated best Brier = 0.25, exactly chance. No validated skill yet — admission flows only through the six pre-registered gates.
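For readers who want to see how a Brier score and a resolution term can pull in different directions, here is a sketch of the standard Murphy decomposition (Brier = reliability − resolution + uncertainty), binning by distinct forecast value. Whether the Resolution column uses exactly this definition is an assumption; the function name and data are hypothetical.

```python
from collections import defaultdict


def murphy_decomposition(probs, outcomes):
    """Return (reliability, resolution, uncertainty) for 0/1 outcomes.

    Forecasts are grouped by exact probability value, so the identity
    brier == reliability - resolution + uncertainty holds exactly.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n
    groups = defaultdict(list)
    for p, y in zip(probs, outcomes):
        groups[p].append(y)
    reliability = sum(
        len(ys) * (p - sum(ys) / len(ys)) ** 2 for p, ys in groups.items()
    ) / n
    resolution = sum(
        len(ys) * (sum(ys) / len(ys) - base_rate) ** 2 for ys in groups.values()
    ) / n
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty


# Hypothetical forecasts on four events.
probs = [0.2, 0.2, 0.8, 0.8]
outcomes = [0, 1, 1, 1]
rel, res, unc = murphy_decomposition(probs, outcomes)
brier = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)
print(rel, res, unc, brier)
```

Resolution rewards separating events from non-events regardless of calibration, so a variant can post high resolution alongside a mediocre Brier if its reliability term is poor, and a low Brier can come from calibration alone.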
The progress ladder
A single objective metric, tracked every night: ensemble Brier on the frozen hold-out, against three reference lines. The legacy formula sits at Tier 0 — it cannot emit a probability at all. A variant can dip under the chance line yet be held experimental (excluded from the ensemble) until it passes a severe test, so the admitted ensemble stays conservative. We don't claim robust Tier-1 skill yet, and we say so.
| Rung | Threshold | Status |
|---|---|---|
| Produces a numeric prediction at all | n/a | legacy formula is here (cannot) |
| Beats chance | Brier < 0.25 | current target |
| Beats market consensus | Brier < 0.18 | future |
| Approaches superforecaster level | Brier < 0.15 | genuine psychohistory progress |
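The ladder's thresholds translate into a small lookup. The sketch below assumes the rungs are numbered 0 to 3 from the bottom and that a model which emits no probability is reported at Tier 0; the function name is hypothetical.

```python
def tier_for(brier):
    """Map a hold-out Brier score to a ladder tier (0 to 3).

    brier=None stands for a model that cannot emit a probability at all.
    """
    if brier is None:
        return 0
    if brier < 0.15:
        return 3  # approaches superforecaster level
    if brier < 0.18:
        return 2  # beats market consensus
    if brier < 0.25:
        return 1  # beats chance
    return 0


# Raw Brier scores taken from the leaderboard above.
for name, score in [("pitf_logit", 0.175), ("hierarchical_bayes", 0.219),
                    ("firth_logit", 0.269), ("null_baseline", 0.370)]:
    print(name, "T%d" % tier_for(score))
```

The tier reflects raw Brier only; admission to the ensemble is a separate decision made by the severe-test gate.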
The research team, rebuilt
The 2024–2026 evidence is clear: homogeneous multi-agent debate underperforms, and beyond a small panel more agents degrade results. The ceremonial roster of 10 leads × 15 prose-only sub-agents was retired to domain knowledge bases (workers cite them; nobody 'runs' them). In its place: seven heterogeneous roles, each owning a concrete artifact, with strict proposer ≠ scorer ≠ critic separation and parallel cold-briefed workers instead of sequential deliberation.
The Orchestrator routes each session on a lean context; the Philosopher of Science runs the adversarial admission gate. Both reason; neither writes the model code.
Cold-briefed workers whose output is an artifact: the Data Engineer owns the panel and feature pipelines, the ML/Forecasting Engineer owns variant construction, the Demographer owns age-structure features, the Calibration Auditor owns reliability and CI reporting, the Red-Team Forecaster owns the adversarial counter-case before any prediction is registered.
| Lead agent | Verdict | What changes |
|---|---|---|
| Orchestrator | ROUTES | Lean-context session router; runs the integrity pre-flight and chooses the one move. |
| Data Engineer | OWNS DATA | The country-year panel, feature pipelines, and source caches — the role that broke the data constraint. |
| ML / Forecasting Engineer | OWNS THE ZOO | Builds and maintains the runnable variants and grooms the hypothesis backlog. |
| Demographer | NEW | Owns age-structure / youth-bulge / urbanization features (UN WPP) — feeds the structural-demographic variant. |
| Calibration Auditor | NEW | Reads reliability curves and confidence intervals after every session; flags train-vs-hold-out drift. |
| Red-Team Forecaster | NEW | Writes the adversarial counter-case before any prediction is registered. |
| Philosopher of Science | JUDGES ONLY | Owns the falsification register and the admission gate; pre-registers severe tests before scores are read — and never selects the work. |
| Legacy 10 leads | → KNOWLEDGE BASES | Cliodynamicist, Econophysicist, Stat Physicist, et al. survive as domain references the workers cite, not as dispatch targets. |
Validation, rebuilt
A locked hold-out
Twenty-six historical events, ten of them negative controls, which the model-building agents are forbidden to read.
Numeric Brier, not PASS/FAIL
Every retrodiction emits a probability and is scored. Narrative precondition assessments are gone.
Leakage-safe backtesting
Purged k-fold with an embargo longer than the cycle being modeled; standard k-fold is banned.
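A minimal sketch of the split logic, assuming samples are already sorted by time and each sample is a single time point. `purged_kfold` and the embargo length (in observations) are hypothetical; in practice the embargo would be set from the cycle length being modeled.

```python
def purged_kfold(n_samples, n_splits=5, embargo=2):
    """Yield (train_indices, test_indices) for contiguous test folds.

    Training rows within `embargo` positions of the test fold, on either
    side, are dropped so that temporally adjacent observations cannot leak.
    """
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = [
            i for i in range(n_samples)
            if i < start - embargo or i >= stop + embargo
        ]
        yield train, test
        start = stop


for train_idx, test_idx in purged_kfold(20, n_splits=4, embargo=2):
    print("test", test_idx, "train", train_idx)
```

Unlike standard k-fold, the folds are never shuffled, and every training index sits more than `embargo` steps away from every test index.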
Pre-registration
The probability is hash-locked before the outcome can be read — technical, not procedural, honesty.
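A hash lock of this kind is a commitment scheme: publish the digest now, reveal the prediction and nonce later. The sketch below is one plausible construction using SHA-256 over a canonical JSON payload; the field names, the nonce handling, and the function names are assumptions, not the project's registry format.

```python
import hashlib
import json
import secrets


def commit(prediction, nonce=None):
    """Return (digest, nonce) for a prediction dict.

    The random nonce stops anyone from brute-forcing a low-entropy
    prediction (for example a two-decimal probability) from the digest.
    """
    if nonce is None:
        nonce = secrets.token_hex(16)
    payload = json.dumps(prediction, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256((nonce + payload).encode("utf-8")).hexdigest()
    return digest, nonce


def verify(prediction, nonce, digest):
    """Check that a revealed prediction matches the published digest."""
    return commit(prediction, nonce)[0] == digest


# Hypothetical registration record.
prediction = {"event_id": "demo-001", "probability": 0.31}
digest, nonce = commit(prediction)
print(digest, verify(prediction, nonce, digest))
```

Because the digest is published before the outcome exists, any later edit to the probability changes the hash and is detectable by anyone holding the original digest.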
Severe testing
Every variant admitted to the zoo carries a pre-specified falsification criterion (Mayo severity ≥ 0.8).
Reflexivity audit
Each published prediction is classified immune, self-fulfilling, or self-defeating — the Seldon problem, handled explicitly.
What's next
The data constraint is broken and ten variants are on the board, but none has earned admission. The next session runs the six severe tests that were locked before any score was read — starting with FL-1 for firth_logit, the fitted-coefficient re-entry route that now posts the board's best discrimination. That test, not the leaderboard, decides whether the PITF channels carry real structural signal at panel scale. In parallel, the nightly engine keeps running its four stages, and the open operator items are a free Metaculus API token, an optional permanent DNS fix for live Polymarket data, and confirming one candidate market link.