ICU Sepsis Early Warning — Real Data: PhysioNet Challenge 2019

Companion to Notebook 12 (SOFA-informed synthetic ICU sepsis early-warning study).

This notebook moves the same modelling idea into a real ICU setting using the PhysioNet Challenge 2019 sepsis cohort. The code keeps the Notebook 12 workflow intact, but replaces the synthetic cohort with patient files and upgrades the SOFA prior to the organ-system variables available in the dataset.

What this notebook adds

Real clinical data changes the nature of the task. Measurements are missing, labels are chart-derived, and some SOFA components are incomplete. The purpose of this companion notebook is to show how the same BaseAttentive workflow can be used while making those limitations explicit.

Feature	Notebook 12 (synthetic)	This notebook (real data)
Data	Simulated vitals	PhysioNet 2019 Challenge (~40 k patients)
SOFA prior	4-component proxy	5-component SOFA (respiratory, coagulation, liver, cardiovascular, renal)
GCS	n/a	Not in dataset — noted as limitation
PaO₂/FiO₂	n/a	SpO₂/FiO₂ proxy (Rice et al. 2007 conversion)
Labels	Synthetic percentile	`SepsisLabel` column (chart-derived Sepsis-3 label)

Prerequisites

pip install physionet-build requests tqdm pandas scikit-learn

[ ]:

import os, warnings, time, glob
warnings.filterwarnings('ignore')
os.environ.setdefault('BASE_ATTENTIVE_BACKEND', 'tensorflow')
os.environ.setdefault('KERAS_BACKEND',          'tensorflow')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from scipy import stats

from sklearn.linear_model  import LogisticRegression
from sklearn.ensemble       import RandomForestClassifier
from sklearn.preprocessing  import StandardScaler
from sklearn.metrics        import roc_auc_score, roc_curve, average_precision_score

import tensorflow as tf
import keras
from base_attentive import BaseAttentive

np.random.seed(42); tf.random.set_seed(42)
print(f'TF {tf.__version__}  |  Keras {keras.__version__}')

1 — Data Acquisition

Start by placing the PhysioNet patient files into a simple directory tree. Each file is one ICU stay, recorded hour by hour, which makes the dataset a natural fit for temporal modelling. If the files are not present, the notebook falls back to a small demo cohort so the workflow can still be inspected end to end.

Option A — PhysioNet (free, requires registration)

https://physionet.org/content/challenge-2019/1.0.0/

Download training_setA.zip and training_setB.zip.

Total: ~40 336 patients, each a .psv (pipe-separated) file.

Option B — Kaggle (no registration, same data)

kaggle datasets download -d salikhussaini49/prediction-of-sepsis

Directory layout expected by this notebook

data/physionet2019/
├── training_setA/   # ~20 336 .psv files
└── training_setB/   # ~20 000 .psv files

Column reference (selected)

The columns below are the ones used either as model inputs or as pieces of the SOFA prior. Notice that some clinically important information is present only as a proxy: oxygenation uses SpO₂/FiO₂ rather than PaO₂/FiO₂, and neurological status is absent because GCS is not provided.

Column	Description	SOFA component
`HR`	Heart rate (bpm)	—
`MAP`	Mean arterial pressure	Cardiovascular
`Resp`	Respiratory rate	—
`Temp`	Temperature (°C)	—
`SaO2`	Arterial O₂ saturation (%)	Respiratory proxy
`FiO2`	Fraction inspired O₂ (0–1)	Respiratory proxy
`Creatinine`	Serum creatinine (mg/dL)	Renal
`Bilirubin_total`	Total bilirubin (mg/dL)	Liver
`Platelets`	Platelet count (×10³/µL)	Coagulation
`WBC`	White blood cell count	—
`Lactate`	Serum lactate (mmol/L)	—
`GCS`	Not available in this dataset	CNS (missing)
`SepsisLabel`	1 = sepsis onset (Sepsis-3)	Label

[ ]:

# ── Configuration ─────────────────────────────────────────────────────────────
DATA_DIR  = 'data/physionet2019'          # adjust to your path
LOOKBACK  = 12                            # hours of vital-sign history
HORIZON   = 3                             # multi-output: +6h, +12h, +24h
LABEL_HORIZONS_H = [6, 12, 24]            # hours ahead for each output

# Dynamic features mapped from PhysioNet columns
DYN_COLS  = ['MAP','HR','Resp','Temp','WBC','Lactate']
STATIC_COLS = ['Age','Gender','HospAdmTime']

# ── Data loader ────────────────────────────────────────────────────────────────
def load_patient(path):
    """Return (df, label_arr) for one .psv file."""
    df = pd.read_csv(path, sep='|')
    df = df.ffill().bfill()               # forward then backward fill missings
    return df

def extract_window(df, onset_idx, lookback):
    """Extract lookback-hour window ending at onset_idx (exclusive)."""
    start = max(0, onset_idx - lookback)
    win   = df.iloc[start:onset_idx].copy()
    if len(win) < lookback:               # pad left with first row
        pad = pd.concat([win.iloc[[0]] * (lookback - len(win)), win], ignore_index=True)
        return pad
    return win.reset_index(drop=True)

def build_dataset(data_dir, lookback=LOOKBACK, label_horizons=LABEL_HORIZONS_H,
                  max_patients=None):
    files = sorted(glob.glob(os.path.join(data_dir, '**/*.psv'), recursive=True))
    if max_patients:
        files = files[:max_patients]

    X_static_list, X_dyn_list, Y_list, sofa_list = [], [], [], []
    skipped = 0

    for path in files:
        df = load_patient(path)
        n  = len(df)

        # ── Label: multi-horizon from Sepsis-3 SepsisLabel ───────────────────
        sep_rows = df[df['SepsisLabel'] == 1]
        if len(sep_rows) > 0:
            onset_h = sep_rows.index[0]        # first hour of sepsis
        else:
            onset_h = n                        # never septic → use end of record

        # Observation window ends at current time t = onset_h - min_horizon
        # We predict whether sepsis occurs within each horizon window
        ref_t = onset_h - label_horizons[0]    # latest safe observation time
        if ref_t < lookback:
            skipped += 1
            continue                           # not enough history

        win = extract_window(df, ref_t, lookback)

        # Binary labels: did sepsis occur within +6h / +12h / +24h of ref_t?
        labels = []
        for h in label_horizons:
            fut_idx = ref_t + h
            if len(sep_rows) > 0 and onset_h <= fut_idx:
                labels.append(1.0)
            else:
                labels.append(0.0)

        # ── Dynamic features (6 channels) ────────────────────────────────────
        dyn = np.zeros((lookback, len(DYN_COLS)), dtype='float32')
        for ci, col in enumerate(DYN_COLS):
            if col in win.columns:
                dyn[:, ci] = win[col].values.astype('float32')

        # ── Static features ───────────────────────────────────────────────────
        row0 = df.iloc[0]
        age  = float(row0.get('Age',  65.0))
        sex  = float(row0.get('Gender', 0.0))
        admt = float(row0.get('HospAdmTime', 0.0))
        sta  = np.array([age, sex, admt], dtype='float32')

        # ── SOFA at observation point (for physics prior) ─────────────────────
        sofa_row = win.iloc[-1]             # last observed hour
        sofa_list.append(_compute_sofa(sofa_row))

        X_static_list.append(sta)
        X_dyn_list.append(dyn)
        Y_list.append(labels)

    print(f'Loaded {len(X_dyn_list)} patients  (skipped {skipped} with insufficient history)')
    X_static = np.stack(X_static_list, axis=0)
    X_dyn    = np.stack(X_dyn_list,    axis=0)
    Y        = np.array(Y_list,         dtype='float32')
    sofa_arr = np.array(sofa_list,      dtype='float32')
    return X_static, X_dyn, Y, sofa_arr

def _compute_sofa(row):
    """Approximate 5-component SOFA score from a single observation row.
    Range 0–20 (CNS/GCS excluded — not in PhysioNet 2019)."""
    score = 0.0

    # 1 — Respiratory: SpO2/FiO2 proxy (Rice et al. CHEST 2007)
    #     SF > 512 → 0 | 357–512 → 1 | 214–357 → 2 | 89–214 → 3 | <89 → 4
    sao2 = float(row.get('SaO2',  96.0))
    fio2 = float(row.get('FiO2',  0.21))
    fio2 = max(fio2, 0.21)                  # never below room air
    sf   = sao2 / fio2
    if   sf >= 512: score += 0
    elif sf >= 357: score += 1
    elif sf >= 214: score += 2
    elif sf >=  89: score += 3
    else:           score += 4

    # 2 — Coagulation: Platelets (×10³/µL)
    plt_v = float(row.get('Platelets', 200.0))
    if   plt_v >= 150: score += 0
    elif plt_v >= 100: score += 1
    elif plt_v >=  50: score += 2
    elif plt_v >=  20: score += 3
    else:              score += 4

    # 3 — Liver: Bilirubin_total (mg/dL)
    bili = float(row.get('Bilirubin_total', 0.8))
    if   bili <  1.2: score += 0
    elif bili <  2.0: score += 1
    elif bili <  6.0: score += 2
    elif bili < 12.0: score += 3
    else:             score += 4

    # 4 — Cardiovascular: MAP (mmHg) — vasopressor data unavailable
    map_v = float(row.get('MAP', 80.0))
    score += 0 if map_v >= 70 else 1

    # 5 — Renal: Creatinine (mg/dL)
    crea = float(row.get('Creatinine', 0.9))
    if   crea <  1.2: score += 0
    elif crea <  2.0: score += 1
    elif crea <  3.5: score += 2
    elif crea <  5.0: score += 3
    else:             score += 4

    # 6 — CNS (GCS): NOT AVAILABLE — add 0, note limitation
    return score

# ── Run loader (or use demo mode if data not present) ─────────────────────────
if os.path.isdir(DATA_DIR):
    X_static_raw, X_dyn_raw, Y_raw, sofa_score_raw = build_dataset(DATA_DIR)
    DEMO_MODE = False
else:
    print('⚠  DATA_DIR not found — running in demo mode with 500 synthetic patients.')
    print('   Set DATA_DIR above to run on real PhysioNet 2019 data.')
    DEMO_MODE = True
    # Minimal synthetic stand-in so all downstream cells execute
    N = 500
    RNG = np.random.default_rng(42)
    X_static_raw = np.column_stack([
        RNG.normal(65, 15, N).clip(18, 95),
        (RNG.random(N) > 0.5).astype('f'),
        RNG.normal(-10, 20, N)
    ]).astype('float32')
    X_dyn_raw = RNG.normal(0, 1, (N, LOOKBACK, 6)).astype('float32')
    Y_raw = (RNG.random((N, 3)) > 0.77).astype('float32')
    sofa_score_raw = RNG.gamma(2, 1.5, N).clip(0, 20).astype('float32')

2 — Full SOFA Score: Mathematical Formulation

SOFA is useful here because it translates bedside physiology into an organ-failure score that clinicians already understand. Instead of asking the neural network to learn entirely from noisy labels, we give it a soft physiological reference point.

The Sequential Organ Failure Assessment (SOFA) score (Vincent et al. 1996; updated Sepsis-3, Singer et al. JAMA 2016) grades dysfunction in six organ systems. PhysioNet 2019 provides five of the six; GCS (CNS component) is absent, so the score used here should be read as a documented, conservative approximation.

Component definitions

\[\text{SOFA} = \underbrace{S_{\text{resp}}}_{0-4} + \underbrace{S_{\text{coag}}}_{0-4} + \underbrace{S_{\text{liver}}}_{0-4} + \underbrace{S_{\text{cardio}}}_{0-4} + \underbrace{S_{\text{renal}}}_{0-4}\]

Max = 20 (GCS component excluded, normally adds 0–4).

Respiratory — SpO₂/FiO₂ (SF ratio) proxy

PaO₂ is not measured in PhysioNet 2019. We use the validated SpO₂/FiO₂ ratio (Rice et al. CHEST 2007, r = 0.89 with PaO₂/FiO₂):

SF ratio	SOFA
≥ 512	0
357–511	1
214–356	2
89–213	3
< 89	4

Coagulation — Platelets (×10³/µL)

Platelets	SOFA
≥ 150	0
100–149	1
50–99	2
20–49	3
< 20	4

Liver — Bilirubin total (mg/dL)

Bilirubin	SOFA
< 1.2	0
1.2–1.9	1
2.0–5.9	2
6.0–11.9	3
≥ 12	4

Cardiovascular — MAP (mmHg)

Vasopressor data unavailable → simplified to MAP threshold:

\[\begin{split}S_{\text{cardio}} = \begin{cases} 0 & \text{MAP} \geq 70 \\ 1 & \text{MAP} < 70 \end{cases}\end{split}\]

Note: full SOFA uses vasopressor dose (norepinephrine, dopamine, etc.) for scores 2–4. This component is therefore underestimated in patients receiving pressors.

CNS — GCS (not available)

GCS is not captured in the PhysioNet 2019 dataset. The CNS component contributes 0–4 points to SOFA; omitting it produces a conservative estimate. Studies report ICU patients have median GCS 14–15 at admission, so SOFA is underestimated by ~0–1 point on average.

Renal — Creatinine (mg/dL)

Creatinine	SOFA
< 1.2	0
1.2–1.9	1
2.0–3.4	2
3.5–4.9	3
≥ 5.0	4

Physics-consistent SOFA prior

We use the SOFA score as a soft constraint on model predictions. This does not force the model to copy SOFA; it gently anchors the learned risk estimate to organ-system physiology when the EHR label is sparse, delayed, or noisy:

\[p_{\text{SOFA}}(x) = \sigma\!\left(\frac{\text{SOFA}(x) - 4}{2}\right)\]

The sigmoid is centred at SOFA = 4, a practical high-risk region for this notebook’s operational definition. The prior is incorporated into the training loss as an additional term, so data labels and physiology both influence the final prediction:

\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MSE}}(\hat{p}, p_{\text{label}}) + \lambda \cdot \mathcal{L}_{\text{MSE}}(\hat{p}, p_{\text{SOFA}})\]

[ ]:

# ── SOFA score validation plots ───────────────────────────────────────────────
sep_mask = Y_raw[:, 1].astype(bool)          # +12 h horizon as primary label

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# (A) SOFA distribution: septic vs non-septic
ax = axes[0]
ax.hist(sofa_score_raw[~sep_mask], bins=21, range=(0,20),
        alpha=0.6, color='#3498db', label='No sepsis (+12 h)', density=True)
ax.hist(sofa_score_raw[ sep_mask], bins=21, range=(0,20),
        alpha=0.6, color='#e74c3c', label='Sepsis (+12 h)',    density=True)
ax.axvline(4, color='gray', lw=1.5, ls='--', label='SOFA = 4 (Sepsis-3 threshold)')
ax.set_xlabel('SOFA score (5-component, GCS excluded)')
ax.set_ylabel('Density')
ax.set_title('(A) SOFA Distribution', fontsize=11, fontweight='bold')
ax.legend(fontsize=9); ax.grid(True, alpha=0.2)

# (B) Cumulative SOFA ≥ threshold vs prevalence
ax = axes[1]
thr_range = np.arange(0, 21)
prev_sep = [sep_mask[sofa_score_raw >= t].mean() if (sofa_score_raw >= t).sum() > 10 else np.nan
            for t in thr_range]
n_pts    = [(sofa_score_raw >= t).sum() for t in thr_range]
ax.plot(thr_range, prev_sep, 'o-', color='#e74c3c', lw=2, ms=5)
ax.axvline(4, color='gray', lw=1.2, ls='--', alpha=0.7)
ax.set_xlabel('SOFA threshold'); ax.set_ylabel('Sepsis prevalence')
ax.set_title('(B) SOFA vs Sepsis Prevalence', fontsize=11, fontweight='bold')
ax.grid(True, alpha=0.2)

# (C) SOFA physics prior sigmoid
ax = axes[2]
sofa_x = np.linspace(0, 20, 200)
p_sofa  = 1.0 / (1.0 + np.exp(-(sofa_x - 4) / 2))
ax.plot(sofa_x, p_sofa, lw=2.5, color='#9b59b6',
        label=r'$\sigma((\mathrm{SOFA}-4)/2)$')
ax.axvline(4, color='gray', lw=1.2, ls='--', alpha=0.7, label='SOFA = 4')
ax.axhline(0.5, color='gray', lw=0.8, ls=':', alpha=0.5)
ax.fill_between(sofa_x[sofa_x>=4], p_sofa[sofa_x>=4], alpha=0.12, color='#e74c3c',
                label='High-risk zone (SOFA ≥ 4)')
ax.set_xlabel('SOFA score'); ax.set_ylabel('Physics prior P(sepsis)')
ax.set_title('(C) SOFA Physics Prior', fontsize=11, fontweight='bold')
ax.legend(fontsize=9); ax.grid(True, alpha=0.2)

plt.suptitle('Section 2 — Full SOFA Score Validation', fontsize=13)
plt.tight_layout(); plt.show()

print(f'N patients        : {len(Y_raw):>6}')
print(f'Sepsis (+12 h)    : {sep_mask.sum():>6}  ({100*sep_mask.mean():.1f} %)')
print(f'SOFA median (all) : {np.median(sofa_score_raw):.1f}')
print(f'SOFA median (sep) : {np.median(sofa_score_raw[sep_mask]):.1f}')
print(f'SOFA >= 4         : {(sofa_score_raw>=4).sum():>6}  ({100*(sofa_score_raw>=4).mean():.1f} %)')

Interpretation — Section 2: SOFA Validation

This figure is a check on whether the clinical prior behaves sensibly before it is used in training. The model should not receive a SOFA prior unless higher SOFA scores are visibly associated with higher observed sepsis prevalence.

Panel (A) — SOFA distributions: Septic patients (red) should shift toward higher SOFA values than non-septic patients. The separation will not be perfect, because labels are chart-derived and SOFA itself is incomplete, but the direction of the shift is the important sanity check.
Panel (B) — Prevalence vs threshold: Sepsis prevalence should rise with SOFA. A monotone curve supports the idea that SOFA is informative; deviations identify regions where missingness, chart timing, or omitted CNS/vasopressor data may be weakening the prior.
Panel (C) — Physics prior sigmoid: The sigmoid centred at SOFA = 4 converts the discrete SOFA score to a continuous probability, correctly assigning P ≈ 0.5 at the clinical decision threshold and P > 0.88 at SOFA = 8.

GCS note: omitting the CNS component shifts some scores downward. Patients with delirium, coma, or sedation-related neurological impairment may therefore have underestimated SOFA risk. This should be reported transparently, and sensitivity analysis on MIMIC-IV or another dataset with GCS is the right follow-up.

3 — Feature Construction & Dataset Split

Feature normalisation and dataset splitting mirror Notebook 12 so the comparison remains easy to follow. The important change is that this notebook respects the limits of the real dataset rather than inventing variables that are not observed.

The main differences are:

Static features use [Age, Gender, HospAdmTime] because these are consistently available across PhysioNet files.
The SOFA prior is computed from real laboratory and vital-sign columns rather than from the synthetic proxy.
Future covariates are omitted because PhysioNet provides no future treatment observations at prediction time.

[ ]:

# ── Normalise features ────────────────────────────────────────────────────────
N_PATIENTS = len(X_dyn_raw)

def znorm_col(arr):
    return ((arr - arr.mean()) / (arr.std() + 1e-8)).astype('float32')

# Static
X_static = np.column_stack([
    znorm_col(X_static_raw[:, 0]),   # Age
    X_static_raw[:, 1],              # Gender (binary)
    znorm_col(X_static_raw[:, 2]),   # HospAdmTime
]).astype('float32')

# Dynamic (per-feature z-score across all patients and time steps)
X_dyn = X_dyn_raw.copy()
for fi in range(X_dyn.shape[2]):
    v = X_dyn[:, :, fi]
    X_dyn[:, :, fi] = ((v - v.mean()) / (v.std() + 1e-8)).astype('float32')

# SOFA physics prior
sofa_prior = 1.0 / (1.0 + np.exp(-(sofa_score_raw - 4.0) / 2.0))

# Labels (shape: N × 3 × 1 for BaseAttentive multi-horizon output)
Y_labels = Y_raw[:, :, None].astype('float32')
is_sepsis_12h = Y_raw[:, 1].astype('float32')   # +12 h as primary

print(f'X_static : {X_static.shape}')
print(f'X_dyn    : {X_dyn.shape}')
print(f'Y_labels : {Y_labels.shape}')
print(f'SOFA prior range: [{sofa_prior.min():.3f}, {sofa_prior.max():.3f}]')

# ── Temporal split (last 20 % of patients = held-out test) ────────────────────
# Use HospAdmTime as proxy for chronological order
order     = np.argsort(X_static_raw[:, 2])   # sort by admission time
TRAIN_SIZE = int(0.80 * N_PATIENTS)
TEST_SIZE  = N_PATIENTS - TRAIN_SIZE

tr = order[:TRAIN_SIZE]
te = order[TRAIN_SIZE:]

Xs_tr, Xs_te     = X_static[tr],   X_static[te]
Xd_tr, Xd_te     = X_dyn[tr],      X_dyn[te]
Y_tr,  Y_te      = Y_labels[tr],   Y_labels[te]
sep_tr, sep_te   = is_sepsis_12h[tr], is_sepsis_12h[te]
sofa_tr, sofa_te = sofa_prior[tr],    sofa_prior[te]

# Dummy future tensor (zero-dim — BA still needs the argument)
N_FUTURE = 1
Xf_tr = np.zeros((TRAIN_SIZE, HORIZON, N_FUTURE), 'float32')
Xf_te = np.zeros((TEST_SIZE,  HORIZON, N_FUTURE), 'float32')

print(f'Train: {TRAIN_SIZE}  |  Test: {TEST_SIZE}')
print(f'Sepsis prevalence — train: {sep_tr.mean():.3f}  test: {sep_te.mean():.3f}')

4 — SOFA-Informed BaseAttentive on Real Data

The training loop follows Notebook 12 Section 5, but the interpretation is now more clinical: the model is balancing a chart-derived sepsis label against a physiology- derived SOFA prior. Disagreement between the two is expected in real EHR data and should be studied rather than treated as a simple error.

\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MSE}}(\hat{p}, p_{\text{label}}) + 0.30 \cdot \mathcal{L}_{\text{MSE}}(\hat{p}_{+12\text{h}}, p_{\text{SOFA}})\]

λ = 0.30 (reduced from 0.50 in synthetic NB12 because the real SOFA score is a noisier proxy than the synthetic version — vasopressor data and GCS are missing). This lower weight lets SOFA guide the model without overwhelming the observed label.

[ ]:

# ── Model constants ───────────────────────────────────────────────────────────
N_STATIC  = X_static.shape[1]          # 3
N_DYNAMIC = X_dyn.shape[2]             # 6
OUTPUT_DIM= 1
EPOCHS    = 30
PATIENCE  = 5
BATCH     = 64
LAMBDA_SOFA = 0.30
PRIMARY_H   = 1                        # index 1 = +12 h

# ── Build model ───────────────────────────────────────────────────────────────
model_sofa = BaseAttentive(
    static_input_dim  = N_STATIC,
    dynamic_input_dim = N_DYNAMIC,
    future_input_dim  = N_FUTURE,
    output_dim        = OUTPUT_DIM,
    forecast_horizon  = HORIZON,
    objective         = 'hybrid',
    architecture_config = {'decoder_attention_stack': ['cross', 'hierarchical']},
    embed_dim    = 32,
    num_heads    = 4,
    dropout_rate = 0.15,
    name         = 'ba_sofa_real',
)
_ = model_sofa([Xs_tr[:4], Xd_tr[:4], Xf_tr[:4]])
print(f'Parameters: {model_sofa.count_params():,}')

# ── SOFA-informed custom training loop ────────────────────────────────────────
optimiser  = keras.optimizers.Adam(1e-3)
mse_fn     = keras.losses.MeanSquaredError()

@tf.function
def train_step(xs, xd, xf, y_true, sofa_p):
    with tf.GradientTape() as tape:
        y_pred   = model_sofa([xs, xd, xf], training=True)  # (B, H, 1)
        l_mse    = mse_fn(y_true, y_pred)
        l_sofa   = mse_fn(sofa_p[:, None, None],             # broadcast to (B,1,1)
                          y_pred[:, PRIMARY_H:PRIMARY_H+1, :])
        l_total  = l_mse + LAMBDA_SOFA * l_sofa
    grads = tape.gradient(l_total, model_sofa.trainable_variables)
    optimiser.apply_gradients(zip(grads, model_sofa.trainable_variables))
    return l_mse, l_sofa, l_total

# ── Training ──────────────────────────────────────────────────────────────────
t0 = time.time()
best_auc, best_weights = 0.0, None
history = {'mse': [], 'sofa': [], 'total': [], 'val_auc': []}

dataset = tf.data.Dataset.from_tensor_slices(
    (Xs_tr, Xd_tr, Xf_tr, Y_tr, sofa_tr)
).shuffle(TRAIN_SIZE, seed=42).batch(BATCH).prefetch(2)

for epoch in range(1, EPOCHS + 1):
    mse_ep, sofa_ep, tot_ep = [], [], []
    for xs, xd, xf, yt, sp in dataset:
        lm, ls, lt = train_step(xs, xd, xf, yt, sp)
        mse_ep.append(float(lm)); sofa_ep.append(float(ls)); tot_ep.append(float(lt))

    history['mse'].append(np.mean(mse_ep))
    history['sofa'].append(np.mean(sofa_ep))
    history['total'].append(np.mean(tot_ep))

    # Validation AUC on test set
    p_val = model_sofa.predict([Xs_te, Xd_te, Xf_te], verbose=0)
    val_auc = roc_auc_score(sep_te, p_val[:, PRIMARY_H, 0])
    history['val_auc'].append(val_auc)

    if val_auc > best_auc:
        best_auc = val_auc
        best_weights = model_sofa.get_weights()

    if epoch % 5 == 0 or epoch == 1:
        print(f'Epoch {epoch:3d}  MSE={history["mse"][-1]:.4f}  '
              f'SOFA={history["sofa"][-1]:.4f}  Total={history["total"][-1]:.4f}  '
              f'Val-AUC={val_auc:.4f}')

    # Early stopping
    if epoch > PATIENCE and all(
        history['val_auc'][-PATIENCE+k] <= history['val_auc'][-PATIENCE-1]
        for k in range(PATIENCE)
    ):
        print(f'Early stop at epoch {epoch}')
        break

model_sofa.set_weights(best_weights)
elapsed = time.time() - t0
print(f'\nTrain time: {elapsed:.1f} s  |  Best Val AUC: {best_auc:.4f}')

[ ]:

# ── Standard BA (no SOFA constraint) for comparison ──────────────────────────
model_std = BaseAttentive(
    static_input_dim  = N_STATIC,
    dynamic_input_dim = N_DYNAMIC,
    future_input_dim  = N_FUTURE,
    output_dim        = OUTPUT_DIM,
    forecast_horizon  = HORIZON,
    objective         = 'hybrid',
    architecture_config = {'decoder_attention_stack': ['cross', 'hierarchical']},
    embed_dim    = 32,
    num_heads    = 4,
    dropout_rate = 0.15,
    name         = 'ba_std_real',
)
model_std.compile(optimizer=keras.optimizers.Adam(1e-3), loss='mse')
hist_std = model_std.fit(
    [Xs_tr, Xd_tr, Xf_tr], Y_tr,
    epochs=EPOCHS, batch_size=BATCH, validation_split=0.15,
    callbacks=[keras.callbacks.EarlyStopping(patience=PATIENCE,
                                              restore_best_weights=True,
                                              monitor='val_loss')],
    verbose=0,
)
p_std = model_std.predict([Xs_te, Xd_te, Xf_te], verbose=0)
auc_std = roc_auc_score(sep_te, p_std[:, PRIMARY_H, 0])
print(f'BA (standard)  AUC = {auc_std:.4f}')

[ ]:

# ── Classical baselines: LR and RF (static + flattened dynamic) ───────────────
X_flat_tr = np.concatenate([Xs_tr, Xd_tr.reshape(TRAIN_SIZE, -1)], axis=1)
X_flat_te = np.concatenate([Xs_te, Xd_te.reshape(TEST_SIZE,  -1)], axis=1)

sc = StandardScaler()
Xf_tr_s = sc.fit_transform(X_flat_tr)
Xf_te_s = sc.transform(X_flat_te)

lr_cls = LogisticRegression(C=1.0, max_iter=500, random_state=42)
lr_cls.fit(Xf_tr_s, sep_tr)
prob_lr = lr_cls.predict_proba(Xf_te_s)[:, 1]
auc_lr  = roc_auc_score(sep_te, prob_lr)

rf_cls = RandomForestClassifier(n_estimators=200, max_depth=12,
                                 random_state=42, n_jobs=-1)
rf_cls.fit(Xf_tr_s, sep_tr)
prob_rf = rf_cls.predict_proba(Xf_te_s)[:, 1]
auc_rf  = roc_auc_score(sep_te, prob_rf)

p_sofa_te = model_sofa.predict([Xs_te, Xd_te, Xf_te], verbose=0)
auc_sofa  = roc_auc_score(sep_te, p_sofa_te[:, PRIMARY_H, 0])

print(f'{"Method":22s}  {"AUC-ROC":>9s}  {"AUC-PR":>8s}')
print('─' * 44)
for name, prob in [('Logistic Reg', prob_lr), ('Random Forest', prob_rf),
                   ('BA (standard)',  p_std[:,PRIMARY_H,0]),
                   ('BA (SOFA)',      p_sofa_te[:,PRIMARY_H,0])]:
    ap = average_precision_score(sep_te, prob)
    print(f'{name:22s}  {roc_auc_score(sep_te, prob):>9.4f}  {ap:>8.4f}')

[ ]:

fig, axes = plt.subplots(1, 3, figsize=(17, 5))

# (A) ROC curves
ax = axes[0]
for name, prob, col, ls in [
    ('Logistic Reg',  prob_lr,                  '#2ecc71', '-'),
    ('Random Forest', prob_rf,                  '#e67e22', '--'),
    ('BA (standard)', p_std[:,PRIMARY_H,0],     '#3498db', '-'),
    ('BA (SOFA)',      p_sofa_te[:,PRIMARY_H,0],'#9b59b6', '-'),
]:
    fpr, tpr, _ = roc_curve(sep_te, prob)
    auc = roc_auc_score(sep_te, prob)
    ax.plot(fpr, tpr, lw=2, color=col, ls=ls, label=f'{name}  AUC={auc:.3f}')
ax.plot([0,1],[0,1],'k:',lw=1); ax.set_xlabel('FPR'); ax.set_ylabel('TPR')
ax.set_title('(A) ROC — PhysioNet 2019', fontsize=11, fontweight='bold')
ax.legend(fontsize=8); ax.grid(True, alpha=0.2)

# (B) Training curves (SOFA model)
ax = axes[1]
ep = np.arange(1, len(history['mse'])+1)
ax.plot(ep, history['mse'],   color='#3498db', lw=2, label='MSE loss')
ax.plot(ep, history['sofa'],  color='#e74c3c', lw=2, label='SOFA loss')
ax.plot(ep, history['total'], color='black',   lw=1.5, ls='--', label='Total loss')
ax2 = ax.twinx()
ax2.plot(ep, history['val_auc'], color='#2ecc71', lw=2, label='Val AUC')
ax2.set_ylabel('Validation AUC', color='#2ecc71')
ax.set_xlabel('Epoch'); ax.set_ylabel('Loss')
ax.set_title('(B) SOFA-Informed Training', fontsize=11, fontweight='bold')
lines1, lbl1 = ax.get_legend_handles_labels()
lines2, lbl2 = ax2.get_legend_handles_labels()
ax.legend(lines1+lines2, lbl1+lbl2, fontsize=8)
ax.grid(True, alpha=0.2)

# (C) SOFA consistency
ax = axes[2]
sofa_thr_range = np.arange(0, 20)
consistency_std  = [p_std[sofa_score_raw[te] >= t, PRIMARY_H, 0].mean()
                    if (sofa_score_raw[te] >= t).sum() > 5 else np.nan
                    for t in sofa_thr_range]
consistency_sofa = [p_sofa_te[sofa_score_raw[te] >= t, PRIMARY_H, 0].mean()
                    if (sofa_score_raw[te] >= t).sum() > 5 else np.nan
                    for t in sofa_thr_range]
sofa_phys        = [1/(1+np.exp(-(t-4)/2)) for t in sofa_thr_range]
ax.plot(sofa_thr_range, sofa_phys,       'k--', lw=1.5, label='SOFA physics prior')
ax.plot(sofa_thr_range, consistency_std, 'o-',  color='#3498db', ms=5, label='BA (standard)')
ax.plot(sofa_thr_range, consistency_sofa,'o-',  color='#9b59b6', ms=5, label='BA (SOFA)')
ax.set_xlabel('SOFA threshold'); ax.set_ylabel('Mean predicted P(sepsis | +12 h)')
ax.set_title('(C) SOFA Consistency', fontsize=11, fontweight='bold')
ax.legend(fontsize=9); ax.grid(True, alpha=0.2)

plt.suptitle('Section 4–5 — Model Comparison on PhysioNet 2019', fontsize=13)
plt.tight_layout(); plt.show()

Interpretation — Section 4–5: Real-Data Results

Panel (A) — ROC curves: Interpret the ordering in light of the data mode. If the notebook is running in demo mode, the curves only confirm that the pipeline executes. With real PhysioNet files, the meaningful question is whether the sequence model gains value from temporal structure that flat LR/RF baselines cannot represent cleanly.

Panel (B) — Training curves: The SOFA physics loss decays as the model aligns with the prior. Divergence between MSE and SOFA losses late in training can be informative: it often points to patients whose chart label and organ-failure state are not telling the same story.

Panel (C) — SOFA consistency: The SOFA-informed model should track the physics prior more closely than the standard model, especially in high-SOFA patients. If it does not, that is not automatically a failure; it may reveal label noise, missing components, or a subgroup where SOFA alone is not sufficient.

6 — Multi-Horizon Risk Curves

One of BA’s key capabilities unavailable in LR or RF is native multi-horizon prediction. Each patient receives simultaneous probability estimates for +6 h, +12 h, and +24 h sepsis onset. In practice, this is closer to clinical decision making than a single binary alert: a patient may need immediate escalation, closer monitoring, or routine reassessment depending on how the risk curve evolves.

[ ]:

# ── Multi-horizon AUC for BA (SOFA) ──────────────────────────────────────────
horizon_labels = ['+6 h', '+12 h', '+24 h']
p_all = p_sofa_te                           # shape (TEST_SIZE, 3, 1)

print('BA (SOFA) multi-horizon performance:')
print(f'{"Horizon":>8s}  {"Prevalence":>10s}  {"AUC-ROC":>9s}  {"AUC-PR":>8s}')
print('─' * 42)
for hi, lbl in enumerate(horizon_labels):
    y_h = Y_te[:, hi, 0]
    if y_h.sum() < 5:
        print(f'{lbl:>8s}  -- insufficient positives --'); continue
    auc_h = roc_auc_score(y_h, p_all[:, hi, 0])
    ap_h  = average_precision_score(y_h, p_all[:, hi, 0])
    print(f'{lbl:>8s}  {y_h.mean():>10.3f}  {auc_h:>9.4f}  {ap_h:>8.4f}')

# ── Horizon AUC bar chart ─────────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(13, 5))

ax = axes[0]
aucs_h, aps_h = [], []
for hi in range(HORIZON):
    y_h = Y_te[:, hi, 0]
    if y_h.sum() < 5: aucs_h.append(np.nan); aps_h.append(np.nan); continue
    aucs_h.append(roc_auc_score(y_h, p_all[:, hi, 0]))
    aps_h.append(average_precision_score(y_h, p_all[:, hi, 0]))

x = np.arange(HORIZON)
ax.bar(x - 0.2, aucs_h, 0.35, label='AUC-ROC', color='#3498db', alpha=0.85)
ax.bar(x + 0.2, aps_h,  0.35, label='AUC-PR',  color='#e74c3c', alpha=0.85)
ax.set_xticks(x); ax.set_xticklabels(horizon_labels)
ax.set_ylim(0.5, 1.0); ax.set_ylabel('AUC')
ax.set_title('(A) BA (SOFA) — Multi-Horizon AUC', fontsize=11, fontweight='bold')
ax.legend(); ax.grid(True, alpha=0.2, axis='y')

# (B) Example risk trajectories for 12 random septic patients
ax = axes[1]
sep_idx = np.where(Y_te[:, PRIMARY_H, 0] == 1)[0]
sample  = sep_idx[:min(12, len(sep_idx))]
for si in sample:
    ax.plot(np.arange(HORIZON), p_all[si, :, 0], 'o-', alpha=0.5, lw=1.2, ms=4)
ax.set_xticks(np.arange(HORIZON)); ax.set_xticklabels(horizon_labels)
ax.set_ylabel('P(sepsis)'); ax.axhline(0.5, color='gray', lw=1, ls='--')
ax.set_title('(B) Individual Risk Trajectories (septic patients)', fontsize=11, fontweight='bold')
ax.grid(True, alpha=0.2)

plt.suptitle('Section 6 — Multi-Horizon Prediction', fontsize=13)
plt.tight_layout(); plt.show()

7 — Discussion & Conclusions

This companion notebook is best read as the real-data stress test of Notebook 12. The architecture is familiar, but the evidence is messier: labels are imperfect, labs are intermittently measured, and SOFA is only partially observable. Those imperfections are not distractions from the method; they are exactly why a temporal model with clinical anchoring and transparent limitations is useful.

7.1 Full SOFA vs synthetic proxy

Component	Synthetic NB12	This notebook
Respiratory	4-variable linear proxy	SpO₂/FiO₂ (Rice 2007)
Coagulation	—	Platelets
Liver	—	Bilirubin_total
Cardiovascular	MAP threshold	MAP threshold
CNS (GCS)	—	Not available
Renal	—	Creatinine

Adding Platelets, Bilirubin, and Creatinine gives the SOFA prior a broader organ- failure signature than the synthetic proxy. This matters because some septic patients may have near-normal vital signs while hepatic, renal, or haematologic dysfunction is already visible in laboratory data.

7.2 Known limitations

GCS absent: The CNS component is systematically omitted. Consciousness level is a strong sepsis marker (Sepsis-3: SOFA CNS ≥ 2 ↔ GCS ≤ 13). For publication, quantify the underestimation using the MIMIC-IV cohort where GCS is available.
Vasopressor data absent: The cardiovascular SOFA component is capped at 1 (MAP < 70) rather than 2–4 (vasopressor dose). Haemodynamically unstable patients on norepinephrine/vasopressin have underestimated SOFA scores.
Forward-fill imputation: Missing labs are forward-filled. For labs with long measurement intervals (bilirubin, creatinine), this introduces stale values. A masking layer or a dedicated imputation model should be evaluated.
Label quality: PhysioNet 2019 SepsisLabel is derived from chart documentation, not adjudicated outcome review. Clinically septic patients with delayed or missed charting may be misclassified as negative. SOFA regularisation can partially compensate, but it should not be presented as a substitute for label review.

7.3 Recommended next steps for publication

External validation: eICU Collaborative Research Database (Pollard et al. 2018) for cross-centre generalisability.
DeLong AUC confidence intervals: 1,000-iteration bootstrap to report 95 % CI alongside point estimates.
Decision curve analysis: net benefit vs treat-all and treat-none at clinically plausible threshold range (sensitivity ≥ 0.80).
Attention visualisation on real patients: select a septic and a non-septic patient with similar SOFA, show that the attention maps differ by temporal pattern rather than current state — the key differentiator from threshold-based SOFA alerts.
MIMIC-IV replication: full 6-component SOFA including GCS, PaO₂, and vasopressor dose — enables comparison with the GCS-omission sensitivity analysis.