docs/bootstrap.md

# Bootstrap (Build 4)

This document describes the parametric bootstrap on the IO-HMM-fitted
population, the warning-shift sweep that rides on top of it, and how to read
the resulting Fig. 6 (failed-evacuation count vs. shift, with quantile
bands).

## What's bootstrapped, what's swept

* **Bootstrap dimension (B replicates):** resample households *with
  replacement* from the baseline simulation output, refit the IO-HMM on each
  resample, yielding `B` parameter sets `θ̂⁽¹⁾, …, θ̂⁽ᴮ⁾`.
* **Shift dimension (S shifts):** vary the warning-lead-time shift `δ` in
  hours relative to baseline. A shift of `δ` means the voluntary order fires
  at `t = 60 + δ` and the mandatory at `t = 84 + δ`. Defaults:
  `δ ∈ {-24, -16, -8, 0, +8, +16, +24}`, so `S = 7`.
* **For each `(θ̂⁽ᵇ⁾, δ)` cell:** simulate trajectories using `θ̂⁽ᵇ⁾` as
  the (DGP-equivalent) parameter set with the shifted timeline, then compute
  the four network metrics. The *failed-evacuation count* is the headline
  metric for Fig. 6.

The `(B × S)` grid is the long-format `shift_sweep.parquet`. Aggregating
along `B` at each `δ` gives the median, the 25–75% inter-quartile band, and
the 5–95% outer band that the figure plots.

## Why this captures the right uncertainty

The bootstrap quantifies **parameter uncertainty** in `θ̂`. Each `δ` column
of the grid asks: *"what's the failed-evac count under this policy, given
our uncertainty about the true behavioral parameters?"* The bands directly
answer the chapter's policy question — *"is the shift in the failed-evac
curve under each policy distinguishable from sampling noise?"*

The simulation step is itself stochastic (each call uses an RNG), but we
hold the simulator seed *deterministic* given `(b, δ)` so this noise does
not enter the bands. Concretely the cell-level seed is:

```
seed(b, s_index) = base_seed * 10000 + replicate_id * 100 + shift_index
```

This isolates parameter uncertainty from simulation noise. The chapter only
needs the parameter-uncertainty story; if a future extension wants to also
display simulation noise, it can sweep multiple seeds per `(b, δ)`.

## Note on terminology

Strictly, this is a **parametric bootstrap on the IO-HMM-fitted population**:
we resample the *fitted-data* households, refit, and simulate from each
refit. This is the most direct way to propagate parameter-estimation
uncertainty into downstream policy outcomes. The chapter can call it either
"bootstrap" or "parametric bootstrap"; the procedure documented here is the
authoritative reference.

## Why warm-starting

Each replicate's EM problem is a small perturbation of the original
full-data fit. Cold (random) starts converge in ~20–30 iterations on the
baseline data; warm-starting from the original full-data `theta.toml`
typically converges in 5–10. Since EM dominates the per-replicate runtime,
warm-starting halves wall-clock time for the production 50-replicate run.

`--warm-start` is optional. Pass `./output/fit/` (the directory containing
the existing single-fit `theta.toml`); leave it off to fit each replicate
cold from a random init (works, but slower).

## Substituting `θ̂⁽ᵇ⁾` into the simulator

Each `θ̂⁽ᵇ⁾` is an IO-HMM `FitParameters` object. The IO-HMM's transition
logits have the same multinomial-logit form as the DGP's transition rows
(`α + β · u`), but the IO-HMM input vector

```
u = [vol, mand, ρ, r, v, τ]
```

is a strict subset of the DGP's: the endogenous features `π` (peer share),
`c` (congestion), and `tir` (time-in-ER) are absent by design — the IO-HMM
sees only exogenous inputs. The bootstrap simulator
(`simulate_with_iohmm_transitions` in
`src/iohmm_evac/bootstrap/shift_sweep.py`) consequently:

1. Reuses the DGP's population, timeline, emissions, and feedback machinery
   from the supplied `SimulationConfig`.
2. At each step, builds `u_{i,t}` exactly as `inference.data.bundle_to_fit_data`
   does, so the simulator and the IO-HMM agree on the input vector.
3. Samples the next state from `softmax(α[k, :] + β[k, :, :] @ u_{i,t})`
   using `θ̂⁽ᵇ⁾`'s `(α, β)`. SH remains absorbing.
4. Continues to track `tir`, `π`, `c` because the *emission* model and the
   network metrics depend on them — but they no longer enter the
   *transition* probabilities.

This isolates the IO-HMM's view of policy effects (warning shifts) from any
DGP-specific feedback wiring.

## Worked example: reading Fig. 6

A typical Fig. 6 (failed-evac count vs. `δ`) shows:

* Median curve sloping upward from `δ = -24` to `δ = +24` — later warnings
  ⇒ more failed evacuations.
* 25–75% inner band giving the typical replicate-to-replicate spread.
* 5–95% outer band giving the worst-case range a reader should expect.

To answer *"is δ = -8 distinguishably better than baseline (δ = 0)?"*,
compare the medians at `δ = -8` and `δ = 0`. If the 25–75% interval at
`δ = -8` does not overlap with the 25–75% interval at `δ = 0`, the
improvement is robust to the IO-HMM's parameter uncertainty.

The `bootstrap summary` CLI prints the same information as a table:

```
shift   failed_evacuation_count           ...
+0      120.00 [115.50,123.00] (109.20,128.10)
-8      102.50 [99.00,106.00] (95.40,110.20)
...
```

Each cell is `median [P25,P75] (P5,P95)`.

## Resampling-only bootstrap, not nonparametric on trajectories

The household-level resample changes which trajectories enter the EM fit,
so each replicate's `θ̂` differs. We do **not** resample at the
trajectory-realization level (i.e. drawing fresh DGP trajectories per
replicate). That's a different kind of bootstrap which mixes
parameter-estimation and DGP-realization uncertainty; we want only the
former for the chapter's policy story.

## Production run

```bash
uv run iohmm-evac bootstrap fit \
    --input ./output/baseline.parquet \
    --warm-start ./output/fit/ \
    --output-dir ./output/bootstrap/fits/ \
    --n-replicates 50 --jobs -1 --seed 0
uv run iohmm-evac bootstrap shift-sweep \
    --bootstrap-dir ./output/bootstrap/fits/ \
    --output ./output/bootstrap/shift_sweep.parquet --seed 0
uv run iohmm-evac report bootstrap-bands \
    --input ./output/bootstrap/shift_sweep.parquet \
    --output ./output/figures/bootstrap_bands.png
```

The 5-replicate smoke test (used by the test suite and the acceptance
gate) substitutes `--n-replicates 5` for the full 50.