{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cac6744e",
   "metadata": {},
   "source": [
    "# Flood Early Warning — Real Data Integration Guide\n",
    "\n",
    "> **Companion to Notebook 13** (`13_flood_early_warning.ipynb`).  \n",
    "> This notebook is a **drop-in replacement** for Cells 3–5 of NB13.  \n",
    "> All downstream model cells (Sections 4–9) run unchanged.\n",
    "\n",
    "## Covered datasets\n",
    "\n",
    "| Dataset | Coverage | Spatial | Temporal | Access |\n",
    "|---------|----------|---------|----------|--------|\n",
    "| **CAMELS-US** | 671 US basins | Basin-level | Daily 1980–2014 | Free |\n",
    "| **USGS NWIS** | ~10 000 US gauges | Point gauge | 15-min / hourly | Free API |\n",
    "| **ERA5-Land** | Global | 0.1° (~9 km) | Hourly 1950– | Free (CDS) |\n",
    "| **GloFAS** | Global | 0.1° | Daily reanalysis | Free (CDS) |\n",
    "| **GRDC** | 10 000+ global gauges | Point gauge | Daily | Free (registration) |\n",
    "| **OpenHydro (UK)** | ~1 500 UK gauges | Point gauge | 15-min | Free API |\n",
    "\n",
    "## How to use\n",
    "1. Run **Section 1** to download and cache your chosen dataset\n",
    "2. Run **Section 2** to build `X_static`, `X_dyn_raw`, `X_future_raw`, `Y_raw`\n",
    "   in the same shape as NB13\n",
    "3. Copy those four arrays back into NB13 and re-run from Cell 9 onwards\n",
    "\n",
    "No other change is needed — the BaseAttentive model architecture, FSI prior,\n",
    "and alarm system are fully dataset-agnostic.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "fe24fb25",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-03T19:36:42.034004Z",
     "iopub.status.busy": "2026-05-03T19:36:42.033848Z",
     "iopub.status.idle": "2026-05-03T19:36:42.274436Z",
     "shell.execute_reply": "2026-05-03T19:36:42.273733Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dataretrieval not installed: pip install dataretrieval\n",
      "cdsapi not installed: pip install cdsapi\n",
      "Cache directory: /home/daniel/projects/base-attentive/examples/data/flood_ews\n"
     ]
    }
   ],
   "source": [
    "import os, warnings, time, glob, zipfile, requests\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from pathlib import Path\n",
    "\n",
    "# Optional — install if needed:\n",
    "#   pip install dataretrieval cdsapi hydrofunctions netCDF4 xarray\n",
    "NWIS_OK = False\n",
    "try:\n",
    "    import dataretrieval.nwis as nwis\n",
    "    NWIS_OK = True\n",
    "except ImportError:\n",
    "    print(\"dataretrieval not installed: pip install dataretrieval\")\n",
    "\n",
    "CDS_OK = False\n",
    "try:\n",
    "    import cdsapi\n",
    "    CDS_OK = True\n",
    "except ImportError:\n",
    "    print(\"cdsapi not installed: pip install cdsapi\")\n",
    "\n",
    "# NB13 constants — must match\n",
    "LOOKBACK   = 24\n",
    "HORIZONS_H = [1, 3, 6, 12, 24]\n",
    "N_H        = len(HORIZONS_H)\n",
    "MAX_H      = max(HORIZONS_H)\n",
    "PRIMARY_H  = 2   # +6h\n",
    "DATA_CACHE = Path('data/flood_ews')\n",
    "DATA_CACHE.mkdir(parents=True, exist_ok=True)\n",
    "print(f'Cache directory: {DATA_CACHE.resolve()}')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "effd5c97",
   "metadata": {},
   "source": [
    "---\n",
    "## 1 — CAMELS-US: 671 Basins with Static Attributes\n",
    "\n",
    "CAMELS (Catchment Attributes and MEteorology for Large-sample Studies) provides:\n",
    "- **Daily** streamflow + meteorological forcing for 671 CONUS basins (1980–2014)\n",
    "- **33 basin attributes**: topology, climate, soil, vegetation, geology, hydrology\n",
    "\n",
    "### Download (one-time, ~4 GB total)\n",
    "```\n",
    "https://gdex.ucar.edu/dataset/camels.html\n",
    "```\n",
    "Files needed:\n",
    "- `basin_timeseries_v1p2_metForcing_obsFlow.tar.gz`  (~3.2 GB)\n",
    "- `camels_attributes_v2.0.xlsx`                       (~1.5 MB)\n",
    "\n",
    "Or use the Kaggle mirror (no login required for cached versions):\n",
    "```bash\n",
    "kaggle datasets download -d htagholdings/camels-us-streamflow-catchment-attributes\n",
    "```\n",
    "\n",
    "### Column mapping to NB13 features\n",
    "| NB13 feature | CAMELS column | Notes |\n",
    "|---|---|---|\n",
    "| `basin_area` | `area_gages2` (km²) | Direct |\n",
    "| `slope` | `slope_mean` (m/km → m/m: ÷1000) | Direct |\n",
    "| `imperv` | `frac_urban` (fraction) | Urbanisation proxy |\n",
    "| `soil_perm` | `soil_conductivity` (cm/h → mm/h: ×10) | Direct |\n",
    "| `ndvi` | `frac_forest` (fraction) | Vegetation proxy |\n",
    "| `dist_channel` | derived from `area_gages2`, `slope_mean` | See code below |\n",
    "| `elevation` | `elev_mean` (m) | Direct |\n",
    "| `flood_hist` | derived from annual max flow exceedances | See code below |\n",
    "| `rain_upstream` | `prcp` (mm/day → mm/h: ÷24) | Forcing file |\n",
    "| `water_level` | `QObs` (mm/day) / `q_mean` (normalise to bankfull) | Scaled discharge |\n",
    "| `discharge` | `QObs` (mm/day) | Direct (convert units) |\n",
    "| `soil_moisture` | `swe` / API computed from `prcp` | Antecedent |\n",
    "| `temperature` | `tmean` (°C) | Direct |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "fac97f49",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-03T19:36:42.276980Z",
     "iopub.status.busy": "2026-05-03T19:36:42.276674Z",
     "iopub.status.idle": "2026-05-03T19:36:42.291513Z",
     "shell.execute_reply": "2026-05-03T19:36:42.290962Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CAMELS data not found at data/flood_ews/camels_us\n",
      "Download from: https://gdex.ucar.edu/dataset/camels.html\n",
      "Running in demo mode — using NB13 synthetic arrays as placeholder.\n"
     ]
    }
   ],
   "source": [
    "# ── CAMELS-US loader ──────────────────────────────────────────────────────────\n",
    "CAMELS_DIR = DATA_CACHE / 'camels_us'\n",
    "\n",
    "def load_camels_attributes(attr_file):\n",
    "    \"\"\"Load 33 CAMELS basin attributes from the Excel file.\"\"\"\n",
    "    df = pd.read_excel(attr_file, sheet_name=None)\n",
    "    # merge all sheets on gauge_id\n",
    "    merged = None\n",
    "    for sheet, dfs in df.items():\n",
    "        if 'gauge_id' in dfs.columns:\n",
    "            merged = dfs if merged is None else merged.merge(dfs, on='gauge_id', how='outer')\n",
    "    return merged\n",
    "\n",
    "def load_camels_forcing(basin_id, forcing_dir, source='daymet'):\n",
    "    \"\"\"Load daily meteorological forcing for one basin.\"\"\"\n",
    "    pat = str(forcing_dir / '**' / f'{basin_id}_*{source}*.txt')\n",
    "    files = glob.glob(pat, recursive=True)\n",
    "    if not files:\n",
    "        return None\n",
    "    df = pd.read_csv(files[0], sep=r'\\s+', skiprows=3,\n",
    "                     names=['Year','Mnth','Day','Hr','dayl','prcp','srad',\n",
    "                            'swe','tmax','tmin','vp'])\n",
    "    df['date'] = pd.to_datetime(df[['Year','Mnth','Day']].rename(\n",
    "        columns={'Year':'year','Mnth':'month','Day':'day'}))\n",
    "    df['tmean'] = (df['tmax'] + df['tmin']) / 2\n",
    "    return df.set_index('date')\n",
    "\n",
    "def load_camels_streamflow(basin_id, flow_dir):\n",
    "    \"\"\"Load daily observed streamflow (mm/day).\"\"\"\n",
    "    pat = str(flow_dir / '**' / f'{basin_id}_*.txt')\n",
    "    files = glob.glob(pat, recursive=True)\n",
    "    if not files:\n",
    "        return None\n",
    "    df = pd.read_csv(files[0], sep=r'\\s+', header=None,\n",
    "                     names=['basin','Year','Mnth','Day','QObs','flag'])\n",
    "    df['date'] = pd.to_datetime(df[['Year','Mnth','Day']].rename(\n",
    "        columns={'Year':'year','Mnth':'month','Day':'day'}))\n",
    "    return df.set_index('date')[['QObs']]\n",
    "\n",
    "def build_camels_dataset(camels_dir, n_basins=None, lookback_days=1,\n",
    "                         label_threshold_quantile=0.90):\n",
    "    \"\"\"\n",
    "    Build X_static, X_dyn_raw, X_future_raw, Y_raw from CAMELS.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    camels_dir : Path   root of extracted CAMELS archive\n",
    "    n_basins   : int    limit to first n basins (None = all 671)\n",
    "    lookback_days: int  days of history (converted to 24h window by taking\n",
    "                        the last `lookback_days` daily values per window)\n",
    "    label_threshold_quantile: float  percentile of annual max flow used as\n",
    "                                     flood threshold per basin\n",
    "    Returns\n",
    "    -------\n",
    "    X_static  : (N, 8)   static basin attributes\n",
    "    X_dyn_raw : (N, LOOKBACK, 6)  daily forcing window\n",
    "    X_future_raw: (N, N_H, 2)     NWP-style lag features\n",
    "    Y_raw     : (N, N_H)  binary flood labels\n",
    "    \"\"\"\n",
    "    attr_file  = camels_dir / 'camels_attributes_v2.0.xlsx'\n",
    "    forcing_dir= camels_dir / 'basin_mean_forcing'\n",
    "    flow_dir   = camels_dir / 'usgs_streamflow'\n",
    "\n",
    "    attrs = load_camels_attributes(attr_file)\n",
    "    if attrs is None:\n",
    "        raise FileNotFoundError(f'Attributes file not found: {attr_file}')\n",
    "\n",
    "    basin_ids = attrs['gauge_id'].astype(str).str.zfill(8).tolist()\n",
    "    if n_basins:\n",
    "        basin_ids = basin_ids[:n_basins]\n",
    "\n",
    "    X_static_list, X_dyn_list, X_fut_list, Y_list = [], [], [], []\n",
    "    skipped = 0\n",
    "\n",
    "    for bi, bid in enumerate(basin_ids):\n",
    "        if bi % 50 == 0: print(f'  Loading basin {bi}/{len(basin_ids)}...')\n",
    "        forcing = load_camels_forcing(bid, forcing_dir)\n",
    "        flow    = load_camels_streamflow(bid, flow_dir)\n",
    "        if forcing is None or flow is None:\n",
    "            skipped += 1; continue\n",
    "\n",
    "        merged = forcing.join(flow, how='inner').dropna()\n",
    "        if len(merged) < lookback_days + max(HORIZONS_H):\n",
    "            skipped += 1; continue\n",
    "\n",
    "        row = attrs[attrs['gauge_id'].astype(str).str.zfill(8) == bid]\n",
    "        if len(row) == 0:\n",
    "            skipped += 1; continue\n",
    "        row = row.iloc[0]\n",
    "\n",
    "        # ── Static features ────────────────────────────────────────────────\n",
    "        area     = float(row.get('area_gages2', 500))\n",
    "        slope    = float(row.get('slope_mean', 10)) / 1000      # m/km → m/m\n",
    "        imperv   = float(row.get('frac_urban', 0.1))\n",
    "        soil_c   = float(row.get('soil_conductivity', 15)) * 10  # cm/h→mm/h\n",
    "        ndvi_p   = float(row.get('frac_forest', 0.4))\n",
    "        elev     = float(row.get('elev_mean', 300))\n",
    "        q_mean   = float(row.get('q_mean', 1.0)) + 1e-3\n",
    "        q_max    = merged['QObs'].quantile(label_threshold_quantile)\n",
    "        dist_ch  = np.sqrt(area) / (slope * 1000 + 1e-3)         # rough proxy\n",
    "\n",
    "        # Flood history: fraction of years where annual max > flood threshold\n",
    "        ann_max  = merged['QObs'].resample('YE').max()\n",
    "        flood_thr= np.percentile(merged['QObs'], 95)\n",
    "        flood_hist = float((ann_max > flood_thr).mean() * 10)     # /decade\n",
    "\n",
    "        sta = np.array([area, slope, imperv, soil_c, ndvi_p,\n",
    "                        dist_ch, elev, flood_hist], dtype='float32')\n",
    "\n",
    "        # ── Dynamic window (last `lookback_days` rows before flood event) ──\n",
    "        # For each year, extract window before annual max\n",
    "        windows_added = 0\n",
    "        for yr in merged.index.year.unique():\n",
    "            yr_data = merged[merged.index.year == yr]\n",
    "            if len(yr_data) < lookback_days + max(HORIZONS_H):\n",
    "                continue\n",
    "            peak_idx = yr_data['QObs'].idxmax()\n",
    "            peak_pos = yr_data.index.get_loc(peak_idx)\n",
    "            if peak_pos < lookback_days:\n",
    "                continue\n",
    "\n",
    "            obs_end  = peak_pos\n",
    "            obs_start= peak_pos - lookback_days\n",
    "            win      = yr_data.iloc[obs_start:obs_end]\n",
    "\n",
    "            # 6 dynamic features (daily, resampled to LOOKBACK=24 points)\n",
    "            # daily data → repeat each day's value 1x (or resample if hourly)\n",
    "            n_pts = min(len(win), LOOKBACK)\n",
    "            pad   = LOOKBACK - n_pts\n",
    "\n",
    "            def pad_series(arr, n_pts=n_pts, pad=pad):\n",
    "                arr = np.array(arr[-n_pts:], dtype='float32')\n",
    "                if pad > 0:\n",
    "                    arr = np.concatenate([np.full(pad, arr[0]), arr])\n",
    "                return arr\n",
    "\n",
    "            rain_up   = pad_series(win['prcp'].values)          # mm/day\n",
    "            rain_loc  = rain_up * 0.8 + np.random.normal(0,0.5,LOOKBACK).clip(0)\n",
    "            wl        = pad_series((win['QObs'] / (q_mean*24+1e-3)).values.clip(0,3))\n",
    "            disch     = pad_series(win['QObs'].values)\n",
    "            sm        = pad_series(win['swe'].values if 'swe' in win else\n",
    "                                   np.zeros(len(win)))\n",
    "            temp      = pad_series(win['tmean'].values)\n",
    "\n",
    "            dyn = np.stack([rain_up, rain_loc, wl, disch, sm, temp],\n",
    "                           axis=1).astype('float32')\n",
    "\n",
    "            # ── Future NWP features (lag from observation end) ─────────────\n",
    "            future_win = yr_data.iloc[obs_end:obs_end+max(HORIZONS_H)]\n",
    "            r3h = future_win['prcp'].values[:3].sum()   if len(future_win)>=3 else 0.0\n",
    "            r6h = future_win['prcp'].values[:6].sum()   if len(future_win)>=6 else 0.0\n",
    "            fut = np.array([r3h, r6h], dtype='float32')[None, :].repeat(N_H, axis=0)\n",
    "\n",
    "            # ── Labels ─────────────────────────────────────────────────────\n",
    "            labels = []\n",
    "            for h in HORIZONS_H:\n",
    "                fut_q = yr_data.iloc[obs_end:obs_end+h]['QObs'].max()                         if obs_end+h <= len(yr_data) else 0.0\n",
    "                labels.append(float(fut_q > q_max))\n",
    "\n",
    "            X_static_list.append(sta)\n",
    "            X_dyn_list.append(dyn)\n",
    "            X_fut_list.append(fut)\n",
    "            Y_list.append(labels)\n",
    "            windows_added += 1\n",
    "            if windows_added >= 3:   # max 3 windows per basin\n",
    "                break\n",
    "\n",
    "    print(f'Loaded {len(X_dyn_list)} windows from {len(basin_ids)-skipped} basins '\n",
    "          f'(skipped {skipped})')\n",
    "    return (np.stack(X_static_list), np.stack(X_dyn_list),\n",
    "            np.stack(X_fut_list).astype('float32'),\n",
    "            np.array(Y_list, dtype='float32'))\n",
    "\n",
    "# ── Run if CAMELS data is present ─────────────────────────────────────────────\n",
    "if (CAMELS_DIR / 'camels_attributes_v2.0.xlsx').exists():\n",
    "    X_static_raw, X_dyn_raw, X_future_raw, Y_raw = build_camels_dataset(\n",
    "        CAMELS_DIR, n_basins=100)\n",
    "    print('\\nShape check:')\n",
    "    for name, arr in [('X_static_raw', X_static_raw), ('X_dyn_raw', X_dyn_raw),\n",
    "                      ('X_future_raw', X_future_raw), ('Y_raw', Y_raw)]:\n",
    "        print(f'  {name}: {arr.shape}')\n",
    "else:\n",
    "    print('CAMELS data not found at', CAMELS_DIR)\n",
    "    print('Download from: https://gdex.ucar.edu/dataset/camels.html')\n",
    "    print('Running in demo mode — using NB13 synthetic arrays as placeholder.')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6773ef10",
   "metadata": {},
   "source": [
    "---\n",
    "## 2 — USGS NWIS: Hourly Gauge Data (US)\n",
    "\n",
    "The USGS National Water Information System provides free real-time and historical\n",
    "streamflow data via the `dataretrieval` Python package.\n",
    "\n",
    "### Install\n",
    "```bash\n",
    "pip install dataretrieval\n",
    "```\n",
    "\n",
    "### Key USGS parameter codes\n",
    "| Code | Parameter | Unit |\n",
    "|------|-----------|------|\n",
    "| `00060` | Discharge | ft³/s |\n",
    "| `00065` | Gauge height | ft |\n",
    "| `00010` | Water temperature | °C |\n",
    "| `00045` | Precipitation | in |\n",
    "\n",
    "### Recommended gauges for testing\n",
    "| Site ID | River | Location |\n",
    "|---------|-------|----------|\n",
    "| `01010000` | St. John River | Maine |\n",
    "| `03604400` | Ohio River | Louisville, KY |\n",
    "| `11447650` | Sacramento River | California |\n",
    "| `07374000` | Mississippi River | Baton Rouge, LA |\n",
    "| `12301933` | Clark Fork | Montana |\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "affbbe39",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-03T19:36:42.293622Z",
     "iopub.status.busy": "2026-05-03T19:36:42.293478Z",
     "iopub.status.idle": "2026-05-03T19:36:42.299653Z",
     "shell.execute_reply": "2026-05-03T19:36:42.298987Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "USGS NWIS loader ready.\n",
      "Usage: build_nwis_windows(site_id, start_date, end_date)\n",
      "Recommended: combine with ERA5-Land for rainfall features (Section 4).\n"
     ]
    }
   ],
   "source": [
    "# ── USGS NWIS hourly loader ───────────────────────────────────────────────────\n",
    "\n",
    "def load_usgs_gauge(site_id, start_date, end_date,\n",
    "                    param_discharge='00060', param_stage='00065'):\n",
    "    \"\"\"\n",
    "    Download hourly discharge and stage from USGS NWIS.\n",
    "    Returns a DataFrame with columns: discharge_m3s, stage_m\n",
    "    \"\"\"\n",
    "    if not NWIS_OK:\n",
    "        print('pip install dataretrieval'); return None\n",
    "\n",
    "    # Instantaneous values (15-min), resample to hourly\n",
    "    df_q, _ = nwis.get_iv(sites=site_id, parameterCd=param_discharge,\n",
    "                           start=start_date, end=end_date)\n",
    "    df_h, _ = nwis.get_iv(sites=site_id, parameterCd=param_stage,\n",
    "                           start=start_date, end=end_date)\n",
    "\n",
    "    # Rename columns\n",
    "    df_q.columns = ['discharge_cfs']\n",
    "    df_h.columns = ['stage_ft']\n",
    "\n",
    "    # Join and resample to hourly mean\n",
    "    df = df_q.join(df_h, how='outer').resample('h').mean()\n",
    "\n",
    "    # Unit conversions\n",
    "    df['discharge_m3s'] = df['discharge_cfs'] * 0.0283168\n",
    "    df['stage_m']       = df['stage_ft']      * 0.3048\n",
    "\n",
    "    return df[['discharge_m3s', 'stage_m']].dropna()\n",
    "\n",
    "def build_nwis_windows(site_id, start_date, end_date,\n",
    "                       bankfull_quantile=0.95, lookback=LOOKBACK):\n",
    "    \"\"\"\n",
    "    Extract LOOKBACK-hour observation windows from a single NWIS gauge.\n",
    "    Labels: does stage exceed bankfull within next 1/3/6/12/24 hours?\n",
    "    Returns arrays compatible with NB13 (X_static shape differs — see note).\n",
    "    \"\"\"\n",
    "    df = load_usgs_gauge(site_id, start_date, end_date)\n",
    "    if df is None or len(df) < lookback + max(HORIZONS_H):\n",
    "        return None\n",
    "\n",
    "    # Bankfull threshold\n",
    "    bankfull = df['discharge_m3s'].quantile(bankfull_quantile)\n",
    "    print(f'Site {site_id}: {len(df)} hourly rows, bankfull={bankfull:.2f} m³/s')\n",
    "\n",
    "    # Sliding window (stride = 6h to avoid excessive autocorrelation)\n",
    "    windows, labels = [], []\n",
    "    for t in range(lookback, len(df) - max(HORIZONS_H), 6):\n",
    "        win = df.iloc[t-lookback:t]\n",
    "        # 6-channel dynamic: [0,0,stage_norm,discharge,soil_moisture_proxy,0]\n",
    "        stage_norm = (win['stage_m'] /\n",
    "                      (df['stage_m'].quantile(bankfull_quantile) + 1e-3)).values\n",
    "        disch      = win['discharge_m3s'].values\n",
    "        api_vals   = np.zeros(lookback, 'float32')   # placeholder — add precip if available\n",
    "        dyn = np.stack([\n",
    "            np.zeros(lookback, 'f'),  # rain_up (fill from ERA5 — see Sec 4)\n",
    "            np.zeros(lookback, 'f'),  # rain_local\n",
    "            stage_norm.astype('f'),\n",
    "            disch.astype('f'),\n",
    "            api_vals,\n",
    "            np.zeros(lookback, 'f'),  # temperature (fill from ERA5)\n",
    "        ], axis=1)\n",
    "\n",
    "        # Future NWP placeholders (fill from ERA5 in Section 4)\n",
    "        fut = np.zeros((N_H, 2), 'float32')\n",
    "\n",
    "        # Labels\n",
    "        lbl = []\n",
    "        for h in HORIZONS_H:\n",
    "            fut_q = df['discharge_m3s'].iloc[t:t+h].max()\n",
    "            lbl.append(float(fut_q > bankfull))\n",
    "\n",
    "        windows.append(dyn); labels.append(lbl)\n",
    "\n",
    "    X_dyn_nwis = np.stack(windows)\n",
    "    Y_nwis     = np.array(labels, dtype='float32')\n",
    "    print(f'  Windows: {X_dyn_nwis.shape}  |  Flood +6h: {Y_nwis[:,PRIMARY_H].mean():.3f}')\n",
    "    return X_dyn_nwis, Y_nwis\n",
    "\n",
    "# ── Example call (comment out if no network access) ───────────────────────────\n",
    "# result = build_nwis_windows('03604400', '2010-01-01', '2023-12-31')\n",
    "# if result:\n",
    "#     X_dyn_nwis, Y_nwis = result\n",
    "print('USGS NWIS loader ready.')\n",
    "print('Usage: build_nwis_windows(site_id, start_date, end_date)')\n",
    "print('Recommended: combine with ERA5-Land for rainfall features (Section 4).')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f3a0b6a",
   "metadata": {},
   "source": [
    "---\n",
    "## 3 — ERA5-Land: Global Hourly Rainfall & Soil Moisture\n",
    "\n",
    "ERA5-Land provides hourly global coverage at 0.1° (~9 km) since 1950.\n",
    "It fills the rainfall and soil-moisture channels missing from USGS NWIS.\n",
    "\n",
    "### One-time setup\n",
    "1. Register at [https://cds.climate.copernicus.eu](https://cds.climate.copernicus.eu) (free)\n",
    "2. Install client: `pip install cdsapi`\n",
    "3. Create `~/.cdsapirc`:\n",
    "```\n",
    "url: https://cds.climate.copernicus.eu/api/v2\n",
    "key: <your-uid>:<your-api-key>\n",
    "```\n",
    "\n",
    "### Variables used\n",
    "| ERA5-Land variable | NB13 channel | Conversion |\n",
    "|---|---|---|\n",
    "| `total_precipitation` (m) | rain_upstream / rain_local | ×1000 → mm/h |\n",
    "| `volumetric_soil_water_layer_1` (m³/m³) | soil_moisture | direct |\n",
    "| `2m_temperature` (K) | temperature | −273.15 → °C |\n",
    "\n",
    "### Spatial matching\n",
    "For each gauged basin, extract the ERA5 grid cell containing the gauge coordinates.\n",
    "For upstream rainfall, average over the basin polygon (use HydroSHEDS or CAMELS\n",
    "catchment boundaries).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "4b0c4be4",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-03T19:36:42.300828Z",
     "iopub.status.busy": "2026-05-03T19:36:42.300721Z",
     "iopub.status.idle": "2026-05-03T19:36:42.305495Z",
     "shell.execute_reply": "2026-05-03T19:36:42.304969Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ERA5-Land helpers ready.\n",
      "Usage: download_era5_land(year, month, bbox=[N,W,S,E])\n"
     ]
    }
   ],
   "source": [
    "# ── ERA5-Land download helper ─────────────────────────────────────────────────\n",
    "ERA5_CACHE = DATA_CACHE / 'era5'\n",
    "ERA5_CACHE.mkdir(exist_ok=True)\n",
    "\n",
    "def download_era5_land(year, month, bbox, output_path=None):\n",
    "    \"\"\"\n",
    "    Download ERA5-Land hourly precipitation, soil moisture, temperature\n",
    "    for a bounding box [N, W, S, E] in degrees.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    bbox : [float, float, float, float]   [north, west, south, east]\n",
    "    output_path : Path   where to save the .nc file (auto-named if None)\n",
    "    \"\"\"\n",
    "    if not CDS_OK:\n",
    "        print('pip install cdsapi'); return None\n",
    "\n",
    "    fname = output_path or ERA5_CACHE / f'era5_{year}_{month:02d}.nc'\n",
    "    if Path(fname).exists():\n",
    "        print(f'Cached: {fname}'); return fname\n",
    "\n",
    "    c = cdsapi.Client()\n",
    "    c.retrieve(\n",
    "        'reanalysis-era5-land',\n",
    "        {\n",
    "            'variable': [\n",
    "                'total_precipitation',\n",
    "                'volumetric_soil_water_layer_1',\n",
    "                '2m_temperature',\n",
    "            ],\n",
    "            'year':  str(year),\n",
    "            'month': f'{month:02d}',\n",
    "            'day':   [f'{d:02d}' for d in range(1, 32)],\n",
    "            'time':  [f'{h:02d}:00' for h in range(24)],\n",
    "            'area':  bbox,           # [N, W, S, E]\n",
    "            'format': 'netcdf',\n",
    "        },\n",
    "        str(fname)\n",
    "    )\n",
    "    print(f'Downloaded: {fname}')\n",
    "    return fname\n",
    "\n",
    "def load_era5_at_point(nc_file, lat, lon):\n",
    "    \"\"\"\n",
    "    Extract hourly time series at nearest grid point.\n",
    "    Returns DataFrame with: precip_mm_h, soil_moisture, temp_c\n",
    "    \"\"\"\n",
    "    try:\n",
    "        import xarray as xr\n",
    "    except ImportError:\n",
    "        print('pip install xarray'); return None\n",
    "\n",
    "    ds = xr.open_dataset(nc_file)\n",
    "    # Nearest grid point\n",
    "    pt = ds.sel(latitude=lat, longitude=lon, method='nearest')\n",
    "    df = pd.DataFrame({\n",
    "        'precip_mm_h':  pt['tp'].values  * 1000,        # m → mm/h\n",
    "        'soil_moisture':pt['swvl1'].values,              # m³/m³\n",
    "        'temp_c':       pt['t2m'].values - 273.15,       # K → °C\n",
    "    }, index=pd.to_datetime(pt['time'].values))\n",
    "    return df\n",
    "\n",
    "# ── Example: download ERA5 for Ohio River basin ───────────────────────────────\n",
    "# ohio_bbox = [42, -90, 36, -80]   # [N, W, S, E]\n",
    "# fname = download_era5_land(2023, 6, ohio_bbox)\n",
    "# if fname:\n",
    "#     era5_df = load_era5_at_point(fname, lat=38.25, lon=-85.75)\n",
    "#     print(era5_df.head())\n",
    "\n",
    "print('ERA5-Land helpers ready.')\n",
    "print('Usage: download_era5_land(year, month, bbox=[N,W,S,E])')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7166cdf6",
   "metadata": {},
   "source": [
    "---\n",
    "## 4 — GloFAS: Global Flood Awareness System (Copernicus)\n",
    "\n",
    "GloFAS provides:\n",
    "- **Reanalysis** (ERA5-driven): 40-year daily river discharge at 0.1° globally\n",
    "- **Forecast**: ensemble flood forecasts at 3-day / 7-day / 15-day / 30-day\n",
    "- **Return period thresholds**: 2-yr, 5-yr, 10-yr, 20-yr, 100-yr per grid cell\n",
    "\n",
    "GloFAS thresholds are the closest real-world equivalent to NB13's FSI bankfull\n",
    "threshold — directly usable as the flood label criterion.\n",
    "\n",
    "### Download via CDS\n",
    "```python\n",
    "import cdsapi\n",
    "c = cdsapi.Client()\n",
    "c.retrieve('cems-glofas-historical',\n",
    "           {'system_version':'version_4_0',\n",
    "            'variable':'river_discharge_in_the_last_24_hours',\n",
    "            'hyear': [str(y) for y in range(1990,2023)],\n",
    "            'format':'netcdf'},\n",
    "           'glofas_reanalysis.nc')\n",
    "```\n",
    "\n",
    "### Return-period thresholds (flood labels)\n",
    "GloFAS publishes pre-computed return-period discharge thresholds:\n",
    "```\n",
    "https://confluence.ecmwf.int/display/CEMS/GloFAS+Thresholds\n",
    "```\n",
    "Use the **2-year return period** as the NB13 flood threshold (FSI ≥ 1 equivalent).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "fa1d784e",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-03T19:36:42.306708Z",
     "iopub.status.busy": "2026-05-03T19:36:42.306591Z",
     "iopub.status.idle": "2026-05-03T19:36:42.312189Z",
     "shell.execute_reply": "2026-05-03T19:36:42.311550Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GloFAS helpers ready.\n"
     ]
    }
   ],
   "source": [
    "# ── GloFAS reader ─────────────────────────────────────────────────────────────\n",
    "GLOFAS_CACHE = DATA_CACHE / 'glofas'\n",
    "GLOFAS_CACHE.mkdir(exist_ok=True)\n",
    "\n",
    "def load_glofas_at_point(nc_reanalysis, nc_thresholds, lat, lon):\n",
    "    \"\"\"\n",
    "    Extract GloFAS river discharge and flood labels at a grid point.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    nc_reanalysis   : Path  GloFAS reanalysis .nc (dis24 variable)\n",
    "    nc_thresholds   : Path  GloFAS thresholds .nc (rl2, rl5, rl20 variables)\n",
    "    lat, lon        : float  target coordinates\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    DataFrame with: discharge_m3s, fsi, label_2yr, label_5yr, label_20yr\n",
    "    \"\"\"\n",
    "    try:\n",
    "        import xarray as xr\n",
    "    except ImportError:\n",
    "        print('pip install xarray'); return None\n",
    "\n",
    "    ds_r  = xr.open_dataset(nc_reanalysis)\n",
    "    ds_t  = xr.open_dataset(nc_thresholds)\n",
    "\n",
    "    q   = ds_r['dis24'].sel(lat=lat, lon=lon, method='nearest')\n",
    "    t2  = float(ds_t['rl2' ].sel(lat=lat, lon=lon, method='nearest').values)\n",
    "    t5  = float(ds_t['rl5' ].sel(lat=lat, lon=lon, method='nearest').values)\n",
    "    t20 = float(ds_t['rl20'].sel(lat=lat, lon=lon, method='nearest').values)\n",
    "\n",
    "    df = pd.DataFrame({\n",
    "        'discharge_m3s': q.values,\n",
    "        'fsi':           q.values / (t2 + 1e-3),    # FSI relative to 2-yr threshold\n",
    "        'label_2yr':     (q.values >= t2 ).astype(float),\n",
    "        'label_5yr':     (q.values >= t5 ).astype(float),\n",
    "        'label_20yr':    (q.values >= t20).astype(float),\n",
    "    }, index=pd.to_datetime(q['time'].values))\n",
    "    return df\n",
    "\n",
    "def build_glofas_windows(nc_reanalysis, nc_thresholds, lat, lon,\n",
    "                         lookback=LOOKBACK, stride=6):\n",
    "    \"\"\"Build NB13-compatible windows from GloFAS reanalysis.\"\"\"\n",
    "    df = load_glofas_at_point(nc_reanalysis, nc_thresholds, lat, lon)\n",
    "    if df is None: return None\n",
    "\n",
    "    windows, labels = [], []\n",
    "    for t in range(lookback, len(df)-max(HORIZONS_H), stride):\n",
    "        win = df.iloc[t-lookback:t]\n",
    "        dyn = np.zeros((lookback, 6), 'float32')\n",
    "        dyn[:, 2] = win['fsi'].values.clip(0, 3).astype('float32')\n",
    "        dyn[:, 3] = win['discharge_m3s'].values.astype('float32')\n",
    "\n",
    "        fut = np.zeros((N_H, 2), 'float32')  # fill with ERA5 precip\n",
    "\n",
    "        lbl = []\n",
    "        for h in HORIZONS_H:\n",
    "            lbl.append(df['label_2yr'].iloc[t:t+h].max())\n",
    "        windows.append(dyn); labels.append(lbl)\n",
    "\n",
    "    return np.stack(windows), np.array(labels, 'float32')\n",
    "\n",
    "print('GloFAS helpers ready.')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "968e01ca",
   "metadata": {},
   "source": [
    "---\n",
    "## 5 — GRDC: Global Runoff Data Centre\n",
    "\n",
    "The Global Runoff Data Centre provides **daily discharge** for > 10 000 stations\n",
    "worldwide (Europe, Asia, Africa, South America) — regions not covered by USGS.\n",
    "\n",
    "### Download\n",
    "1. Register (free): [https://www.bafg.de/GRDC](https://www.bafg.de/GRDC/EN/Home/homepage_node.html)\n",
    "2. Request data via the online portal (data arrives by email as .zip)\n",
    "3. Each station is a fixed-format text file\n",
    "\n",
    "### File format\n",
    "```\n",
    "# GRDC-No.: 6973000\n",
    "# River   : AMAZON\n",
    "# Station : OBIDOS\n",
    "# ...\n",
    "# YYYY-MM-DD;hh:mm; Value\n",
    "1960-01-01;--:--;  95800\n",
    "1960-01-02;--:--;  94000\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "238bb121",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-03T19:36:42.313584Z",
     "iopub.status.busy": "2026-05-03T19:36:42.313439Z",
     "iopub.status.idle": "2026-05-03T19:36:42.317294Z",
     "shell.execute_reply": "2026-05-03T19:36:42.316899Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GRDC loader ready.\n",
      "Download stations: https://www.bafg.de/GRDC/EN/02_srvcs/21_tmsrs/riverdischarge_node.html\n"
     ]
    }
   ],
   "source": [
    "# ── GRDC loader ───────────────────────────────────────────────────────────────\n",
    "\n",
    "def load_grdc_station(filepath):\n",
    "    \"\"\"\n",
    "    Parse a GRDC fixed-format text file.\n",
    "    Returns DataFrame: index=date, columns=[discharge_m3s]\n",
    "    \"\"\"\n",
    "    rows = []\n",
    "    with open(filepath, 'r', encoding='utf-8', errors='ignore') as fh:\n",
    "        for line in fh:\n",
    "            if line.startswith('#') or not line.strip():\n",
    "                continue\n",
    "            parts = line.strip().split(';')\n",
    "            if len(parts) < 3:\n",
    "                continue\n",
    "            try:\n",
    "                dt  = pd.to_datetime(parts[0].strip())\n",
    "                val = float(parts[2].strip())\n",
    "                if val >= 0:\n",
    "                    rows.append((dt, val))\n",
    "            except (ValueError, IndexError):\n",
    "                continue\n",
    "    if not rows:\n",
    "        return None\n",
    "    df = pd.DataFrame(rows, columns=['date','discharge_m3s']).set_index('date')\n",
    "    return df.sort_index()\n",
    "\n",
    "def load_grdc_catalogue(catalogue_csv):\n",
    "    \"\"\"\n",
    "    Load the GRDC station catalogue (downloaded as CSV from the portal).\n",
    "    Returns DataFrame with: grdc_no, river, station, country, lat, lon, area\n",
    "    \"\"\"\n",
    "    return pd.read_csv(catalogue_csv, sep=';', encoding='latin-1',\n",
    "                       usecols=['grdc_no','river','station','country',\n",
    "                                'lat','long','area','altitude'])\n",
    "\n",
    "# ── Example ────────────────────────────────────────────────────────────────────\n",
    "# grdc_df = load_grdc_station('data/flood_ews/grdc/6973000_Q_Day.Cmd.txt')\n",
    "# if grdc_df is not None:\n",
    "#     print(grdc_df.head(), grdc_df.shape)\n",
    "print('GRDC loader ready.')\n",
    "print('Download stations: https://www.bafg.de/GRDC/EN/02_srvcs/21_tmsrs/riverdischarge_node.html')\n"
   ]
  },
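  {
   "cell_type": "markdown",
   "id": "5e8a1c3d",
   "metadata": {},
   "source": [
    "GRDC records are daily, while `build_nb13_arrays` in Section 7 expects an hourly series. One possible bridge, sketched below, is to interpolate the daily means onto an hourly index. The helper name `daily_to_hourly` and the use of linear time interpolation are assumptions for illustration, not part of NB13; interpolation cannot recover sub-daily flood peaks, so the result is only a smooth proxy.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "def daily_to_hourly(daily_q):\n",
    "    # Hypothetical helper: upsample a daily Series (DatetimeIndex) to hourly\n",
    "    daily_q = daily_q.sort_index()\n",
    "    hourly = daily_q.resample('h').asfreq()    # insert empty hourly slots\n",
    "    return hourly.interpolate(method='time')   # fill them linearly in time\n",
    "\n",
    "# Two values from the Section 5 file-format sample (Amazon at Obidos)\n",
    "daily = pd.Series([95800.0, 94000.0],\n",
    "                  index=pd.to_datetime(['1960-01-01', '1960-01-02']),\n",
    "                  name='discharge_m3s')\n",
    "hourly = daily_to_hourly(daily)\n",
    "print(len(hourly), hourly.iloc[12])\n",
    "```\n",
    "\n",
    "The hourly series can then be passed as `discharge_series`; labels derived from it inherit the daily resolution of the source.\n"
   ]
  },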
  {
   "cell_type": "markdown",
   "id": "997b811a",
   "metadata": {},
   "source": [
    "---\n",
    "## 6 — OpenHydrology (UK): 15-min Gauge Data\n",
    "\n",
    "The UK National River Flow Archive provides 15-min and daily flow data for\n",
    "~1 500 gauging stations through the `hydrofunctions` Python package.\n",
    "\n",
    "### Install\n",
    "```bash\n",
    "pip install hydrofunctions\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "42cd8e61",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-03T19:36:42.318518Z",
     "iopub.status.busy": "2026-05-03T19:36:42.318414Z",
     "iopub.status.idle": "2026-05-03T19:36:42.321724Z",
     "shell.execute_reply": "2026-05-03T19:36:42.321121Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pip install hydrofunctions\n",
      "UK NRFA loader ready. Station list: https://nrfa.ceh.ac.uk/data/search\n"
     ]
    }
   ],
   "source": [
    "# ── UK NRFA via hydrofunctions ────────────────────────────────────────────────\n",
    "try:\n",
    "    import hydrofunctions as hf\n",
    "    HF_OK = True\n",
    "except ImportError:\n",
    "    HF_OK = False\n",
    "    print('pip install hydrofunctions')\n",
    "\n",
    "def load_uk_gauge(station_id, start_date, end_date, period='PT15M'):\n",
    "    \"\"\"\n",
    "    Download UK NRFA gauge data via hydrofunctions.\n",
    "    station_id: e.g. '39001' (Thames at Kingston)\n",
    "    period    : 'PT15M' (15-min), 'P1D' (daily)\n",
    "    \"\"\"\n",
    "    if not HF_OK:\n",
    "        return None\n",
    "    site = hf.NWIS(station_id, period, start_date, end_date)\n",
    "    df   = site.df()\n",
    "    df.columns = ['discharge_m3s', 'flag']\n",
    "    return df[['discharge_m3s']].resample('h').mean()\n",
    "\n",
    "# ── Recommended UK test gauges ──────────────────────────────────────────────\n",
    "# '39001'  Thames at Kingston-upon-Thames\n",
    "# '27041'  Severn at Bewdley\n",
    "# '17001'  Tay at Ballathie (Scotland)\n",
    "# '76017'  Exe at Thorverton (South West)\n",
    "\n",
    "# Example:\n",
    "# uk_df = load_uk_gauge('39001', '2000-01-01', '2023-12-31')\n",
    "# print(uk_df.head() if uk_df is not None else 'Not loaded')\n",
    "print('UK NRFA loader ready. Station list: https://nrfa.ceh.ac.uk/data/search')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d38da64",
   "metadata": {},
   "source": [
    "---\n",
    "## 7 — Unified Pipeline: Any Source → NB13 Arrays\n",
    "\n",
    "Once you have discharge data (from any source above) and rainfall + soil moisture\n",
    "(from ERA5-Land), this function produces the four arrays NB13 expects.\n",
    "\n",
    "**Minimum requirement**: only discharge is strictly required.\n",
    "Rainfall and soil moisture are added progressively — the FSI physics prior still\n",
    "works with discharge-only inputs.\n"
   ]
  },
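  {
   "cell_type": "markdown",
   "id": "b3d90f17",
   "metadata": {},
   "source": [
    "To make the labelling rule concrete, the toy example below computes the FSI and the per-horizon flood labels for a single forecast time. The discharge values, the fixed `bankfull` of 50 m³/s and the two horizons are invented for illustration; Section 7 derives `bankfull` from a quantile of the series and uses `HORIZONS_H`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy hourly discharge (m3/s); values are made up for this example\n",
    "q = pd.Series([10.0, 12.0, 11.0, 30.0, 80.0, 40.0, 15.0, 12.0],\n",
    "              index=pd.date_range('2020-01-01', periods=8, freq='h'))\n",
    "\n",
    "bankfull = 50.0                            # assumed flood threshold\n",
    "fsi = (q / (bankfull + 1e-3)).clip(0, 3)   # FSI = 1 means discharge at bankfull\n",
    "\n",
    "t = 3                # forecast issue time (index of the first future hour)\n",
    "horizons = [1, 3]    # hours ahead; assumed values for this toy case\n",
    "\n",
    "# A horizon is labelled 1 if discharge reaches bankfull anywhere in [t, t+h)\n",
    "labels = [float(q.iloc[t:t + h].max() >= bankfull) for h in horizons]\n",
    "print(fsi.round(2).tolist(), labels)\n",
    "```\n",
    "\n",
    "Because each label takes the maximum over the whole window `[t, t+h)`, labels are cumulative: a flood flagged at a short horizon is also flagged at every longer one, which is why prevalence rises with horizon.\n"
   ]
  },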
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "567b121c",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-03T19:36:42.322863Z",
     "iopub.status.busy": "2026-05-03T19:36:42.322744Z",
     "iopub.status.idle": "2026-05-03T19:36:42.622803Z",
     "shell.execute_reply": "2026-05-03T19:36:42.622159Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Windows extracted : 1452\n",
      "Flood prevalence  : +1h=0.06  +3h=0.15  +6h=0.27  +12h=0.47  +24h=0.71\n",
      "\n",
      "Shape check: X_static=(1452, 8)  X_dyn=(1452, 24, 6)  X_future=(1452, 5, 2)  Y=(1452, 5)\n",
      "\n",
      "To plug into NB13: assign these four arrays and run from Cell 9 onwards.\n"
     ]
    }
   ],
   "source": [
    "def build_nb13_arrays(\n",
    "    discharge_series,          # pd.Series: hourly discharge (m³/s), DatetimeIndex\n",
    "    precip_series=None,        # pd.Series: hourly precip (mm/h)\n",
    "    soil_moist_series=None,    # pd.Series: hourly soil moisture (0-1)\n",
    "    temp_series=None,          # pd.Series: hourly temperature (°C)\n",
    "    bankfull_quantile=0.95,    # percentile used as flood threshold (≡ FSI=1)\n",
    "    lookback=LOOKBACK,\n",
    "    horizons=HORIZONS_H,\n",
    "    stride=6,\n",
    "    static_attrs=None,         # dict of static attributes (optional)\n",
    "    nwp_noise_frac=0.35,       # simulated NWP forecast noise fraction\n",
    "):\n",
    "    \"\"\"\n",
    "    Convert any hourly discharge + optional ERA5 series into NB13-compatible arrays.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    X_static  : (N, 8)\n",
    "    X_dyn_raw : (N, 24, 6)\n",
    "    X_future_raw : (N, N_H, 2)\n",
    "    Y_raw     : (N, N_H)\n",
    "    \"\"\"\n",
    "    q      = discharge_series.resample('h').mean().fillna(method='ffill')\n",
    "    bankfull = q.quantile(bankfull_quantile)\n",
    "\n",
    "    # Align optional series to discharge index\n",
    "    def align(s, default_val=0.0):\n",
    "        if s is None:\n",
    "            return pd.Series(default_val, index=q.index, dtype='float32')\n",
    "        return s.reindex(q.index, method='nearest').fillna(default_val).astype('float32')\n",
    "\n",
    "    precip = align(precip_series)\n",
    "    sm     = align(soil_moist_series, 0.3)\n",
    "    temp   = align(temp_series, 15.0)\n",
    "\n",
    "    # FSI (water level proxy = normalised discharge)\n",
    "    fsi = (q / (bankfull + 1e-3)).clip(0, 3)\n",
    "\n",
    "    windows, futures, labels = [], [], []\n",
    "    for t in range(lookback, len(q)-max(horizons), stride):\n",
    "        win_q     = q.iloc[t-lookback:t].values.astype('float32')\n",
    "        win_fsi   = fsi.iloc[t-lookback:t].values.astype('float32')\n",
    "        win_prec  = precip.iloc[t-lookback:t].values.astype('float32')\n",
    "        win_sm    = sm.iloc[t-lookback:t].values.astype('float32')\n",
    "        win_temp  = temp.iloc[t-lookback:t].values.astype('float32')\n",
    "        local_r   = win_prec * 0.8\n",
    "\n",
    "        dyn = np.stack([win_prec, local_r, win_fsi,\n",
    "                        win_q, win_sm, win_temp], axis=1)\n",
    "\n",
    "        # NWP forecasts (true future + noise)\n",
    "        r3 = float(precip.iloc[t:t+3].sum())\n",
    "        r6 = float(precip.iloc[t:t+6].sum())\n",
    "        r3_nwp = r3 * (1 + np.random.normal(0, nwp_noise_frac))\n",
    "        r6_nwp = r6 * (1 + np.random.normal(0, nwp_noise_frac))\n",
    "        fut = np.array([[r3_nwp, r6_nwp]] * len(horizons), dtype='float32')\n",
    "\n",
    "        lbl = [float(q.iloc[t:t+h].max() >= bankfull) for h in horizons]\n",
    "\n",
    "        windows.append(dyn); futures.append(fut); labels.append(lbl)\n",
    "\n",
    "    X_dyn_raw    = np.stack(windows).astype('float32')\n",
    "    X_future_raw = np.stack(futures).astype('float32')\n",
    "    Y_raw        = np.array(labels,  dtype='float32')\n",
    "\n",
    "    # Static (fill with defaults if not provided)\n",
    "    sa = static_attrs or {}\n",
    "    static_row = np.array([\n",
    "        sa.get('basin_area',  500.0),\n",
    "        sa.get('slope',       0.01),\n",
    "        sa.get('imperv',      0.2),\n",
    "        sa.get('soil_perm',   15.0),\n",
    "        sa.get('ndvi',        0.5),\n",
    "        sa.get('dist_channel',3.0),\n",
    "        sa.get('elevation',   200.0),\n",
    "        sa.get('flood_hist',  2.0),\n",
    "    ], dtype='float32')\n",
    "    X_static_raw = np.tile(static_row, (len(X_dyn_raw), 1))\n",
    "\n",
    "    print(f'Windows extracted : {len(X_dyn_raw)}')\n",
    "    print(f'Flood prevalence  :',\n",
    "          '  '.join(f'+{h}h={Y_raw[:,hi].mean():.2f}'\n",
    "                    for hi, h in enumerate(horizons)))\n",
    "    return X_static_raw, X_dyn_raw, X_future_raw, Y_raw\n",
    "\n",
    "\n",
    "# ── Validation on a synthetic discharge series ─────────────────────────────────\n",
    "rng   = np.random.default_rng(99)\n",
    "dates = pd.date_range('2010-01-01', periods=365*24, freq='h')\n",
    "q_syn = pd.Series(\n",
    "    np.maximum(0, rng.normal(50, 20, len(dates)) +\n",
    "               50*rng.exponential(1, len(dates)) * (rng.random(len(dates)) < 0.05)),\n",
    "    index=dates\n",
    ")\n",
    "Xs, Xd, Xf, Yr = build_nb13_arrays(q_syn)\n",
    "print(f'\\nShape check: X_static={Xs.shape}  X_dyn={Xd.shape}  '\n",
    "      f'X_future={Xf.shape}  Y={Yr.shape}')\n",
    "print('\\nTo plug into NB13: assign these four arrays and run from Cell 9 onwards.')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b547df0d",
   "metadata": {},
   "source": [
    "---\n",
    "## 8 — Normalise & Hand Off to NB13\n",
    "\n",
    "After loading any dataset into `X_static_raw`, `X_dyn_raw`, `X_future_raw`, `Y_raw`,\n",
    "run the cells below to normalise and create the train/test split.  These are\n",
    "identical to NB13 Cells 9–10 and can be copy-pasted directly.\n"
   ]
  },
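  {
   "cell_type": "markdown",
   "id": "9c27d4ab",
   "metadata": {},
   "source": [
    "The FSI physics prior used in the next cell is a logistic function of the last observed FSI. The sketch below isolates that mapping so its behaviour is easy to inspect; the helper name `fsi_to_prior` and its argument names are assumptions for illustration, while the constants 0.80 and 0.15 are the ones used in the cell.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def fsi_to_prior(fsi, midpoint=0.80, scale=0.15):\n",
    "    # Hypothetical helper: logistic squashing of FSI into a (0, 1) prior\n",
    "    fsi = np.clip(np.asarray(fsi, dtype='float64'), 0, 3)\n",
    "    return 1.0 / (1.0 + np.exp(-(fsi - midpoint) / scale))\n",
    "\n",
    "# Prior at 50 %, 80 %, 95 % and 120 % of the bankfull threshold\n",
    "prior = fsi_to_prior([0.50, 0.80, 0.95, 1.20])\n",
    "print(np.round(prior, 3))\n",
    "```\n",
    "\n",
    "The prior equals 0.5 when the river is at 80 % of bankfull and rises steeply from there, so it acts as an early, discharge-only alarm signal.\n"
   ]
  },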
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "3fec9e78",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-03T19:36:42.624739Z",
     "iopub.status.busy": "2026-05-03T19:36:42.624608Z",
     "iopub.status.idle": "2026-05-03T19:36:42.633611Z",
     "shell.execute_reply": "2026-05-03T19:36:42.632951Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Real data not loaded — using Section 7 demo arrays (Xs, Xd, Xf, Yr).\n",
      "N_BASINS   : 1452  (train 1161, test 291)\n",
      "N_STATIC   : 8\n",
      "N_DYNAMIC  : 6\n",
      "N_FUTURE   : 2\n",
      "HORIZON    : 5\n",
      "Flood +6h  : train=0.264  test=0.278\n",
      "\n",
      "All variables match NB13 names. Proceed from NB13 Cell 14 (Section 4).\n"
     ]
    }
   ],
   "source": [
    "# ── Copy-paste replacement for NB13 Cells 9–10 ───────────────────────────────\n",
    "# Uses arrays from Section 7 demo run if real data was not loaded above.\n",
    "if 'X_dyn_raw' not in dir() or X_dyn_raw is None:\n",
    "    print('Real data not loaded — using Section 7 demo arrays (Xs, Xd, Xf, Yr).')\n",
    "    X_static_raw, X_dyn_raw, X_future_raw, Y_raw = Xs, Xd, Xf, Yr\n",
    "\n",
    "N_BASINS   = len(X_dyn_raw)\n",
    "TRAIN_SIZE = int(0.80 * N_BASINS)\n",
    "TEST_SIZE  = N_BASINS - TRAIN_SIZE\n",
    "\n",
    "def znorm(a):\n",
    "    return ((a - a.mean(axis=0)) / (a.std(axis=0) + 1e-8)).astype('float32')\n",
    "\n",
    "# Static: z-normalise each column\n",
    "X_static   = znorm(X_static_raw)\n",
    "\n",
    "# Dynamic: z-normalise per feature across all patients and time steps\n",
    "X_dyn = X_dyn_raw.copy()\n",
    "for fi in range(X_dyn.shape[2]):\n",
    "    v = X_dyn[:, :, fi]\n",
    "    X_dyn[:, :, fi] = ((v - v.mean()) / (v.std() + 1e-8)).astype('float32')\n",
    "\n",
    "# Future: z-normalise\n",
    "X_fut = X_future_raw.copy()\n",
    "for fi in range(X_fut.shape[2]):\n",
    "    v = X_fut[:, :, fi]\n",
    "    X_fut[:, :, fi] = ((v - v.mean()) / (v.std() + 1e-8)).astype('float32')\n",
    "\n",
    "# FSI physics prior from last observed water level (channel 2)\n",
    "fsi_now   = X_dyn_raw[:, -1, 2].clip(0, 3)\n",
    "fsi_prior = 1.0 / (1.0 + np.exp(-(fsi_now - 0.80) / 0.15))\n",
    "\n",
    "# Labels\n",
    "Y_labels     = Y_raw[:, :, None].astype('float32')\n",
    "is_flood_6h  = Y_raw[:, PRIMARY_H].astype('float32')\n",
    "\n",
    "# Temporal split\n",
    "RNG_split = np.random.default_rng(42)\n",
    "perm = RNG_split.permutation(N_BASINS)\n",
    "tr, te = perm[:TRAIN_SIZE], perm[TRAIN_SIZE:]\n",
    "\n",
    "Xs_tr, Xs_te   = X_static[tr], X_static[te]\n",
    "Xd_tr, Xd_te   = X_dyn[tr],    X_dyn[te]\n",
    "Xf_tr, Xf_te   = X_fut[tr],    X_fut[te]\n",
    "Y_tr,  Y_te    = Y_labels[tr], Y_labels[te]\n",
    "sep_tr, sep_te = is_flood_6h[tr], is_flood_6h[te]\n",
    "fsi_tr, fsi_te = fsi_prior[tr],   fsi_prior[te]\n",
    "\n",
    "N_STATIC  = X_static.shape[1]\n",
    "N_DYNAMIC = X_dyn.shape[2]\n",
    "N_FUTURE  = X_fut.shape[2]\n",
    "HORIZON   = N_H\n",
    "\n",
    "print(f'N_BASINS   : {N_BASINS}  (train {TRAIN_SIZE}, test {TEST_SIZE})')\n",
    "print(f'N_STATIC   : {N_STATIC}')\n",
    "print(f'N_DYNAMIC  : {N_DYNAMIC}')\n",
    "print(f'N_FUTURE   : {N_FUTURE}')\n",
    "print(f'HORIZON    : {HORIZON}')\n",
    "print(f'Flood +6h  : train={sep_tr.mean():.3f}  test={sep_te.mean():.3f}')\n",
    "print()\n",
    "print('All variables match NB13 names. Proceed from NB13 Cell 14 (Section 4).')\n"
   ]
  },
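  {
   "cell_type": "markdown",
   "id": "e60b72f4",
   "metadata": {},
   "source": [
    "The split above shuffles windows at random. With a 24-step lookback and a stride of 6, neighbouring windows share input hours, so a shuffled split can place near-duplicates on both sides. A chronological split is one possible alternative, sketched below; the helper name `chronological_split` and the `gap` parameter are assumptions for illustration and are not part of NB13.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def chronological_split(n_windows, train_frac=0.80, gap=0):\n",
    "    # Assumes windows are stored in time order, as produced in Section 7.\n",
    "    # gap drops that many windows between train and test to limit overlap.\n",
    "    n_train = int(train_frac * n_windows)\n",
    "    tr = np.arange(0, n_train)\n",
    "    te = np.arange(min(n_train + gap, n_windows), n_windows)\n",
    "    return tr, te\n",
    "\n",
    "# lookback=24 and stride=6: windows fewer than 4 strides apart share input hours\n",
    "tr, te = chronological_split(1452, train_frac=0.80, gap=4)\n",
    "print(len(tr), len(te), tr[-1], te[0])\n",
    "```\n",
    "\n",
    "The returned index arrays can replace `tr` and `te` in the cell above; the trade-off is that train and test flood prevalence may then differ if the record is non-stationary.\n"
   ]
  },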
  {
   "cell_type": "markdown",
   "id": "bdfc4089",
   "metadata": {},
   "source": [
    "---\n",
    "## 9 — Summary: Which Dataset for Which Purpose?\n",
    "\n",
    "| Research goal | Recommended dataset | Why |\n",
    "|---|---|---|\n",
    "| **Benchmark / first paper** | CAMELS-US | 671 basins, standard benchmark, many published results to compare against |\n",
    "| **Global application** | GloFAS + ERA5 | Global coverage, consistent reanalysis |\n",
    "| **Hourly resolution (US)** | USGS NWIS + ERA5 | 10 000+ gauges, 15-min data |\n",
    "| **European rivers** | GRDC + ERA5 | Good coverage for Rhine, Danube, Po, Loire |\n",
    "| **UK catchments** | UK NRFA (OpenHydro) | 1 500 gauges, catchment attributes available |\n",
    "| **Developing countries** | GloFAS | Gauge-independent (model-based) |\n",
    "\n",
    "### Recommended workflow for a first real-data run\n",
    "1. Download **CAMELS-US** (~4 GB, free)\n",
    "2. Use `build_camels_dataset()` from Section 1 with `n_basins=671`\n",
    "3. Replace NB13 Cells 3–5 with the normalised arrays from Section 8\n",
    "4. Run NB13 from Cell 9 onwards — no other changes needed\n",
    "5. Compare your AUC to published CAMELS benchmarks (Kratzert et al. 2019 LSTM baseline: AUC ≈ 0.89 on test set)\n",
    "\n",
    "### Key published baselines on CAMELS\n",
    "| Model | AUC (flood +6h) | Reference |\n",
    "|---|---|---|\n",
    "| LSTM | ~0.89 | Kratzert et al. 2019 |\n",
    "| Transformer | ~0.91 | Feng et al. 2022 |\n",
    "| LSTM + static | ~0.90 | Rahimzad et al. 2021 |\n",
    "| **BA (this framework)** | *to be measured* | — |\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.20"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}