§ Chapter V · Research Methodology & Field Notes

The work, on the record.

Methodology, sacred test sets, calibrated uncertainty, and a working log of recent milestones. The full archive of what we've shipped, what we've falsified, and what we're preparing for external review.

§ V.1 — Methodology Five commitments

How we work, in five parts.

The homepage names three commitments. Here are all five — the full discipline that governs every reported number, every checkpoint shipped, and every experiment we mark as closed.

i. Pipeline consistency. A single declared metric drives checkpoint selection, early stopping, learning-rate schedule, ensemble assembly, and final reported number. No silent divergence between training and reporting. Enforced by a typed checkpoint tracker that is the only sanctioned way to drive best-checkpoint selection in our codebase.
† Internal rule. Forensic audit closed April 2026. Documentation available on request.
ii. Sacred test sets. Every published number is computed once, on a test set defined before the first experiment in the series ran. No tuning on the holdout. Reproduction chain preserved end-to-end. For HydroField: 626 basins across 6 Caravan datasets, locked April 2026 and untouched since.
‡ Sacred test. Anchor-test harness preserved; full reproduction available to academic collaborators.
iii. Calibrated uncertainty. Every model goes through a pre-publication calibration test: a 90% predicted interval should contain 90% of observed values, scored on the held-out test set. Models that pass ship as production claims; models that fail are reported as methodology findings, not buried.
§ Calibration cycle. V2 cycle closed May 2026; methodology paper in preparation.
iv. Discovery override. If a running experiment is discovered to be flawed, redundant, or suboptimal due to new information, we do not protect the running experiment for the sake of completion. We surface the finding, evaluate impact, and ask before restarting. Wasting compute on a known-flawed experiment is as much a methodology failure as stacking variables blindly.
¶ Decision rule. Codified internally; multiple training cycles redirected mid-run under this rule.
v. Data-quality gate. No model trains on data that hasn't been forensically audited. Dead-feature elimination at cache-build time, not after the fact. Coverage and density audits per station and per period before training begins. Audit reports persist alongside the cache they document.
‖ Audit gate. Pre-training contract; reproducibility kit ships the audit report.
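
The calibration test in commitment iii can be read as an empirical coverage check: count how many held-out observations fall inside their predicted interval and compare that fraction with the nominal level. The sketch below is illustrative only. The function name, the result type, and especially the `tolerance` used to decide pass or fail are assumptions made for the example; the pass criterion actually applied is not stated here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CoverageResult:
    nominal: float    # target coverage, e.g. 0.90
    empirical: float  # fraction of observations inside their interval
    passed: bool      # True if |empirical - nominal| <= tolerance


def interval_coverage(
    lower: Sequence[float],
    upper: Sequence[float],
    observed: Sequence[float],
    nominal: float = 0.90,
    tolerance: float = 0.02,  # assumed threshold, not taken from the source
) -> CoverageResult:
    """Compare empirical interval coverage on a held-out set with the nominal level."""
    if not (len(lower) == len(upper) == len(observed)) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # An observation is a hit when it lies inside its own predicted interval.
    hits = sum(lo <= y <= hi for lo, hi, y in zip(lower, upper, observed))
    empirical = hits / len(observed)
    return CoverageResult(nominal, empirical, abs(empirical - nominal) <= tolerance)


if __name__ == "__main__":
    # Ten toy observations, nine of which fall inside the interval [0, 1].
    result = interval_coverage([0.0] * 10, [1.0] * 10, [0.5] * 9 + [2.0])
    print(result)
```

A two-sided tolerance is used so that intervals that are too wide (over-coverage) fail the check as well as intervals that are too narrow.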
§ V.2 — Publications In preparation

What's in the pipeline.

Methodology papers and benchmark reports we are preparing for external review. Order is approximate; pre-prints will be linked here as they post.

· HydroField benchmark report. Multi-continental streamflow benchmark: public medNSE 0.830 (k-fold prediction in ungauged basins), 0.894 on held-out Canadian basins, 0.874 on a held-out test period. Full per-region distributions, worst-decile diagnostics, and comparison to published baselines under matched protocol, including the CAMELS-GB case we do not yet win.
Target: Hydrology / ML venue · 2026.
· Sacred-test-set discipline. A short methodology note on locked test sets, the data-quality gate, and the reproducibility kit, offered as a public artifact and a contribution to the broader ML-rigor conversation.
Target: Short-format venue or pre-print · 2026.
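
For readers outside hydrology: medNSE is read here as the median, across basins, of the Nash-Sutcliffe efficiency, which is the usual meaning of the term. The page does not spell out the definition, so treat that reading as an assumption. A minimal sketch of the computation follows; the function names and the basin identifiers are hypothetical.

```python
from statistics import mean, median
from typing import Mapping, Sequence, Tuple


def nse(observed: Sequence[float], simulated: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations about their mean."""
    if len(observed) != len(simulated) or len(observed) < 2:
        raise ValueError("need two equal-length series with at least two points")
    obs_mean = mean(observed)
    denom = sum((o - obs_mean) ** 2 for o in observed)
    if denom == 0:
        raise ValueError("NSE is undefined for a constant observed series")
    num = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    return 1.0 - num / denom


def median_nse(
    basins: Mapping[str, Tuple[Sequence[float], Sequence[float]]]
) -> float:
    """Median of per-basin NSE; each value is an (observed, simulated) pair."""
    return median(nse(obs, sim) for obs, sim in basins.values())


if __name__ == "__main__":
    toy = {
        "basin_a": ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),  # perfect simulation
        "basin_b": ([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]),  # predicts the mean
        "basin_c": ([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]),  # one point off by 1
    }
    print(median_nse(toy))
```

An NSE of 1 is a perfect match and 0 is no better than predicting the observed mean. Taking the median rather than the mean keeps a few very poor basins from dominating the headline, which is why worst-decile diagnostics are reported alongside it.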
§ V.3 — Field notes Working log · 2026

Field notes archive.

The full chronological log of recent milestones, methodology updates, falsified hypotheses, and operations work. Quiet on the moat; substantive on the work.

2026 · Jun 8 Water main-break prediction validated across three North American utilities. Calibrated probabilities (expected calibration error ≤ 0.008) on an out-of-time test. Benchmark
2026 · Jun 7 HydroField Canadian benchmark: medNSE 0.894 on 181 held-out basins with zero training overlap. Verifier-reproduced to a tolerance of 1e-6. Benchmark
2026 · May 24 HydroField public benchmark locked: medNSE 0.830 on a k-fold prediction-in-ungauged-basins evaluation. In line with or ahead of published results (Kratzert 2019, HydroDL) under matched, like-for-like protocols, with one honest exception on CAMELS-GB. Benchmark
2026 · May 14 AvalancheWatch operational hardening. A security audit closed with the critical findings remediated, and a test-coverage gate now runs before each release. We treat the service as production infrastructure, not a research demo. Operations
2026 · Apr 22 Provenance snapshot preserved across local and NAS storage. Conda environment, git tag, run logs, and sidecar artifacts captured for every defensibility-relevant result. Discipline
2026 · Apr 18 Sacred test set locked for HydroField scale-up to 16,299 basins on the Caravan v1.6 multi-source assembly. Anchor-test harness preserved. Discipline
2026 · Apr 2 HydroField headline locked: medNSE 0.8316 across the 626-basin sacred test set. Seven-seed ensemble at BF16 precision; the first result shipped under the pipeline-consistency contract. Benchmark
2026 · Mar 2 Elysium Fields AI Inc. incorporated in British Columbia (BC CCPC). Headquarters Cranbrook, BC. Company
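
The Jun 8 entry quotes an expected calibration error (ECE). The log does not say how it was computed, so the sketch below shows one common definition for binary event probabilities, as an assumption: group predictions into equal-width probability bins, then take the sample-weighted average gap between the mean predicted probability and the observed event frequency in each bin. The function name and the choice of ten bins are illustrative.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    probs: Sequence[float],
    labels: Sequence[int],
    n_bins: int = 10,  # assumed bin count; the source does not state one
) -> float:
    """Equal-width binned ECE for predicted probabilities of a binary event."""
    if len(probs) != len(labels) or len(probs) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        # p == 1.0 would index one past the end, so clamp it into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        event_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(event_rate - mean_prob)
    return ece


if __name__ == "__main__":
    # Four predictions of 0.25 with one event in four: perfectly calibrated.
    print(expected_calibration_error([0.25] * 4, [1, 0, 0, 0]))
```

Binned ECE depends on the bin count and binning scheme, so a quoted value is only comparable with others computed the same way.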
§ V.4 — Reproducibility Audit-ready

The reproduction chain.

For any reported number, we can hand a reviewer: the data sources at version, the configuration file, the git commit hash, the random seeds, the cache content hash, the sacred-test-set definition, and the audit report that closed the loop. The reproducibility kit is the answer to "prove your number is real."

· Data sources at version. Public benchmark data and public corpora, each pinned to the date and version used. Status: pinned.
· Config & code. Every checkpoint pairs with a training config JSON, the git commit hash of the producing code, a cache content hash, and a seed list, all stored alongside the checkpoint. Status: co-located.
· Audit report. The pre-training data-quality audit report ships with the cache it documents. Dead-feature elimination decisions are recorded as a manifest. Status: cache-paired.
· Downloads. Public reproducibility artifacts for the published HydroField benchmark will be made available on request. Email below for access; we are deliberately careful with the request list during the pre-paper window. Status: on request.
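
To make the "Config & code" pairing concrete, here is a minimal sketch of writing a manifest next to a checkpoint. It is not the kit's actual format: the field names, the `.manifest.json` suffix, and the helper names are all hypothetical, and the git commit is passed in as a plain string rather than read from a repository.

```python
import hashlib
import json
from pathlib import Path
from typing import Sequence


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Content hash of a file, read in chunks so large caches fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(
    checkpoint: Path,
    config: dict,
    git_commit: str,
    cache_file: Path,
    seeds: Sequence[int],
) -> Path:
    """Write a JSON manifest alongside the checkpoint and return its path."""
    manifest = {
        "checkpoint": checkpoint.name,
        "config": config,                            # training config as loaded
        "git_commit": git_commit,                    # commit of the producing code
        "cache_sha256": sha256_of_file(cache_file),  # cache content hash
        "seeds": list(seeds),
    }
    # Hypothetical naming scheme: <checkpoint file name>.manifest.json
    out = checkpoint.parent / (checkpoint.name + ".manifest.json")
    # sort_keys keeps the file byte-stable across runs with identical inputs.
    out.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return out
```

A reviewer can then recompute the cache hash and compare it with `cache_sha256` before rerunning anything, which catches a swapped or modified cache early.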
§ Correspond

Academic reviewers, methodology readers,
reproducibility researchers.

If you'd like the reproducibility kit, the per-region distributions, the worst-decile diagnostics, or the calibration sidecars for any reported number — the door is open. Same inbox handles everything.