Architecture Guide
==================

This guide explains how BaseAttentive works internally, what changed in
**v2.0.0**, and how to use the new registry / resolver / assembly system.
If you are migrating from v1.0.0, read the
:ref:`breaking changes <arch-breaking-changes>` section first.

.. contents:: On this page
   :local:
   :depth: 2

----

Overview
--------

BaseAttentive is an encoder-decoder neural network for sequence-to-sequence
time series forecasting.  It accepts three distinct feature streams:

1. **Static features** — time-invariant properties ``(batch, static_dim)``
2. **Dynamic features** — historical time series ``(batch, T, dynamic_dim)``
3. **Future features** — known future exogenous variables ``(batch, H, future_dim)``

.. code-block:: text

   ┌─────────────────────────────────────┐
   │           Inputs (3 types)          │
   ├─────────────────────────────────────┤
   │ static:   (batch, S)                │
   │ dynamic:  (batch, T, D)             │
   │ future:   (batch, H, F)             │
   └────────────────┬────────────────────┘
                    │
                    ▼
           ┌─────────────────────┐
           │   Encoder-Decoder   │
           └────────────┬────────┘
                        │
                ┌───────┴────────┐
                │                │
                ▼                ▼
           Point Forecast      With Quantiles
         (B, H, output_dim)  (B, H, Q, output_dim)

Conceptual flow:

1. **Select** — Variable Selection Networks (VSN) weight each input feature
2. **Project** — Transform features into a shared embedding space
3. **Encode** — Process temporal context (hybrid LSTM or pure transformer)
4. **Attend** — Apply the decoder attention stack (cross / hierarchical / memory)
5. **Pool** — Collapse the sequence representation into a fixed vector
6. **Forecast** — Generate point or probabilistic outputs

----

Encoder Architectures
---------------------

Hybrid Mode (``objective="hybrid"``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Multi-scale LSTM with attention.  Each LSTM processes a down-sampled version
of the sequence at scale ``s``, then the outputs are aggregated before
entering the decoder:

.. code-block:: python

   from base_attentive import BaseAttentive

   model = BaseAttentive(
       static_input_dim=4,
       dynamic_input_dim=8,
       future_input_dim=6,
       output_dim=1,
       forecast_horizon=24,
       objective="hybrid",
       scales=[1, 2, 4],          # keep every 1st, 2nd, and 4th time step
       multi_scale_agg="average", # how to merge the scale outputs
       embed_dim=32,
   )

``scales=[1, 2, 4]`` creates three parallel LSTMs.  At scale ``s``, every
``s``-th time step is kept, so the LSTM at scale 4 sees one quarter as many
time steps while still spanning the full history window.  This lets the
model capture both fine-grained and coarse temporal patterns simultaneously.
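
The sub-sampling rule described above can be sketched with plain NumPy
slicing.  This only illustrates how the sequence length shrinks per scale;
the starting offset and any padding BaseAttentive applies internally are
assumptions here.

.. code-block:: python

   import numpy as np

   # Dummy dynamic input: batch of 2, T=12 time steps, 8 features.
   x_dynamic = np.zeros((2, 12, 8), dtype="float32")

   scales = [1, 2, 4]
   # Assumed rule: keep every s-th step, starting at step 0.
   subsampled = {s: x_dynamic[:, ::s, :] for s in scales}

   for s, x_s in subsampled.items():
       print(f"scale={s}: shape={x_s.shape}")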

``multi_scale_agg`` choices:

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Value
     - Effect
   * - ``"last"``
     - Keep the final hidden state of each scale; concatenate then project
   * - ``"average"``
     - Average all hidden states across time, then merge
   * - ``"flatten"``
     - Flatten the full output sequence of each scale, then project
   * - ``"sum"``
     - Sum hidden states element-wise across time
   * - ``"concat"``
     - Concatenate all time-step outputs end-to-end
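
As a rough illustration of the time-axis reductions in the table, the
sketch below applies four of them to stand-in arrays.  ``reduce_over_time``
is a hypothetical helper, the final concatenation across scales is an
assumption, and ``"concat"`` is left out because its exact layout is not
specified here.

.. code-block:: python

   import numpy as np

   def reduce_over_time(outputs, mode):
       """Collapse one scale's output of shape (B, T, E) along the time axis."""
       if mode == "last":
           return outputs[:, -1, :]                      # (B, E)
       if mode == "average":
           return outputs.mean(axis=1)                   # (B, E)
       if mode == "sum":
           return outputs.sum(axis=1)                    # (B, E)
       if mode == "flatten":
           return outputs.reshape(outputs.shape[0], -1)  # (B, T * E)
       raise ValueError(f"unsupported mode: {mode!r}")

   # Stand-ins for the outputs of three LSTMs at scales 1, 2, 4 (T = 12, 6, 3).
   per_scale = [np.ones((2, t, 4), dtype="float32") for t in (12, 6, 3)]

   # Assumed merge step: concatenate per-scale vectors along the feature axis.
   merged = np.concatenate(
       [reduce_over_time(o, "average") for o in per_scale], axis=-1
   )
   print(merged.shape)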

Transformer Mode (``objective="transformer"``)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A pure self-attention encoder.  It parallelises better than the LSTM stack
and is recommended for shorter sequences (T < 500):

.. code-block:: python

   model = BaseAttentive(
       static_input_dim=4,
       dynamic_input_dim=8,
       future_input_dim=6,
       output_dim=1,
       forecast_horizon=24,
       objective="transformer",
       num_encoder_layers=4,
       num_heads=8,
       embed_dim=64,
   )

----

Decoder Attention Stack
-----------------------

After encoding, a configurable stack of attention mechanisms bridges the
encoded history with the future feature context.

Attention types:

.. list-table::
   :header-rows: 1
   :widths: 20 30 50

   * - Type
     - Purpose
     - Use case
   * - ``"cross"``
     - Bridge encoder outputs to future context
     - Default; works for all forecasting tasks
   * - ``"hierarchical"``
     - Multi-level temporal patterns in the decoder
     - Seasonal / structured data with nested cycles
   * - ``"memory"``
     - Retrieve patterns from a learned memory bank
     - Long-range dependencies, repeated anomalies

Controlling the stack with ``attention_levels``:

.. code-block:: python

   # All three levels
   model = BaseAttentive(..., attention_levels=None)

   # Single level by name
   model = BaseAttentive(..., attention_levels="cross")

   # Two levels by list
   model = BaseAttentive(..., attention_levels=["cross", "memory"])

   # Single level by integer (1=cross, 2=hierarchical, 3=memory)
   model = BaseAttentive(..., attention_levels=2)
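
The accepted forms above can all be reduced to a list of level names.  The
helper below is a hypothetical sketch of that normalisation, not the
library's own code; it uses only the integer mapping stated in the comment
above.

.. code-block:: python

   # Integer aliases as documented: 1=cross, 2=hierarchical, 3=memory.
   LEVEL_NAMES = ("cross", "hierarchical", "memory")

   def normalize_attention_levels(levels):
       """Hypothetical helper: map any accepted form to a list of names."""
       if levels is None:
           return list(LEVEL_NAMES)          # None selects all three levels
       if isinstance(levels, (str, int)):
           levels = [levels]                 # single value becomes a list
       names = []
       for level in levels:
           if isinstance(level, int):
               if not 1 <= level <= len(LEVEL_NAMES):
                   raise ValueError(f"level index out of range: {level}")
               level = LEVEL_NAMES[level - 1]
           if level not in LEVEL_NAMES:
               raise ValueError(f"unknown attention level: {level!r}")
           names.append(level)
       return names

   print(normalize_attention_levels(2))
   print(normalize_attention_levels(["cross", "memory"]))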

----

Operational Mode Shortcuts
--------------------------

The ``mode`` parameter applies a named configuration profile, wiring up
encoder type, attention stack, and decoder in one step:

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Value
     - Effect
   * - ``None`` (default)
     - Manual configuration — use ``objective``, ``architecture_config``, etc.
   * - ``"tft"`` / ``"tft_like"``
     - Temporal Fusion Transformer style: VSN + gated residuals + cross attention
   * - ``"pihal"`` / ``"pihal_like"``
     - Physics-Informed HAL style: memory-augmented + hierarchical stack

.. code-block:: python

   # TFT-like mode — no need to specify objective or attention_levels
   model = BaseAttentive(
       static_input_dim=4,
       dynamic_input_dim=8,
       future_input_dim=6,
       output_dim=1,
       forecast_horizon=24,
       mode="tft",
       embed_dim=32,
   )

----

Output Modes
------------

.. code-block:: python

   # Point forecast — shape (batch, H, output_dim)
   model = BaseAttentive(..., output_dim=2, forecast_horizon=24)

   # Quantile forecast — shape (batch, H, Q, output_dim)
   model = BaseAttentive(..., quantiles=[0.1, 0.5, 0.9])

   # Probabilistic (Gaussian mixture, for CRPSLoss)
   model = BaseAttentive(..., output_mode="gaussian_mixture")
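
With the quantile layout ``(batch, H, Q, output_dim)``, a loss has to
broadcast the targets across the ``Q`` axis.  The sketch below is a generic
NumPy pinball (quantile) loss written for that layout; ``pinball_loss`` is a
hypothetical helper for illustration, not a function shipped with
BaseAttentive.

.. code-block:: python

   import numpy as np

   def pinball_loss(y_true, y_pred, quantiles):
       """Mean pinball loss.

       y_true: (B, H, D), y_pred: (B, H, Q, D), quantiles: length-Q sequence.
       """
       q = np.asarray(quantiles, dtype=y_pred.dtype).reshape(1, 1, -1, 1)
       err = y_true[:, :, None, :] - y_pred  # targets broadcast over Q
       return float(np.mean(np.maximum(q * err, (q - 1.0) * err)))

   B, H, Q, D = 4, 24, 3, 1
   y_true = np.zeros((B, H, D), dtype="float32")
   y_pred = np.ones((B, H, Q, D), dtype="float32")

   loss = pinball_loss(y_true, y_pred, [0.1, 0.5, 0.9])
   print(loss)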

----

.. _arch-v2-system:

V2 Architecture: Registry / Resolver / Assembly
------------------------------------------------

Version 2.0.0 replaces the monolithic class hierarchy of v1.0.0 with a
**registry / resolver / assembly** system.  Every model component is now
registered under a string key and resolved at build time.  This makes the
model fully pluggable and backend-neutral.

Why this matters
~~~~~~~~~~~~~~~~

In v1.0.0 the encoder, attention heads, and forecast head were hard-coded
inside ``BaseAttentive``.  Customising them required subclassing internal
layers and overriding private methods — fragile and backend-specific.

In v2.0.0:

- Each component is a **builder function** stored in a registry.
- ``BaseAttentiveSpec`` / ``BaseAttentiveComponentSpec`` describe the model
  purely as data (no Keras imports required at spec-creation time).
- ``BaseAttentiveV2Assembly`` reads the spec, resolves each component from
  the registry, and wires everything together.
- Swapping a component is a one-line registry call — no subclassing.
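
The pattern behind these points can be sketched in a few lines of plain
Python.  ``MiniRegistry`` is a toy stand-in written for this guide; it is
not the library's ``ComponentRegistry`` and only mirrors the idea of
resolving builders from string keys held in a data-only spec.

.. code-block:: python

   from typing import Any, Callable, Dict

   class MiniRegistry:
       """Toy registry: maps '<category>.<name>' keys to builder callables."""

       def __init__(self) -> None:
           self._builders: Dict[str, Callable[..., Any]] = {}

       def register(self, key: str, builder: Callable[..., Any]) -> None:
           category, _, name = key.partition(".")
           if not category or not name:
               raise ValueError(f"expected '<category>.<name>', got {key!r}")
           self._builders[key] = builder

       def resolve(self, key: str) -> Callable[..., Any]:
           if key not in self._builders:
               raise KeyError(f"no component registered under {key!r}")
           return self._builders[key]

   registry = MiniRegistry()
   registry.register("pool.mean", lambda values: sum(values) / len(values))
   registry.register("pool.last", lambda values: values[-1])

   # The spec is plain data: it names components by key only.
   spec = {"sequence_pooling": "pool.last"}

   pool = registry.resolve(spec["sequence_pooling"])
   pooled = pool([1.0, 2.0, 3.0])
   print(pooled)

Swapping the pooling strategy then means changing one string in the spec,
which is the property the list above describes.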

The Two Registries
~~~~~~~~~~~~~~~~~~

``ComponentRegistry``
    Stores builder functions for individual layers
    (encoders, projections, attention heads, pooling, forecast heads).
    Key format: ``"<category>.<name>"``.

``ModelRegistry``
    Stores assembler functions that construct the full model from a spec.

Both registries are available as singletons:

.. code-block:: python

   from base_attentive.registry import (
       DEFAULT_COMPONENT_REGISTRY,
       DEFAULT_MODEL_REGISTRY,
   )

Registering a custom encoder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from base_attentive.registry import DEFAULT_COMPONENT_REGISTRY

   def wavenet_encoder_builder(*, context, units, hidden_units, **kw):
       """
       A WaveNet-style dilated causal encoder.
       context: BaseAttentiveSpec — gives access to embed_dim, dropout_rate, etc.
       """
       from my_layers import WaveNetBlock
       return WaveNetBlock(
           units=units,
           dilation_rates=[1, 2, 4, 8],
           dropout=context.dropout_rate,
       )

   DEFAULT_COMPONENT_REGISTRY.register(
       "encoder.wavenet",
       wavenet_encoder_builder,
       backend="generic",          # works across TF / Torch / JAX
       description="WaveNet dilated causal encoder.",
   )

Then use the key in a spec:

.. code-block:: python

   from base_attentive.config import BaseAttentiveSpec, BaseAttentiveComponentSpec

   spec = BaseAttentiveSpec(
       static_input_dim=4,
       dynamic_input_dim=8,
       future_input_dim=6,
       output_dim=1,
       forecast_horizon=24,
       embed_dim=64,
       components=BaseAttentiveComponentSpec(
           temporal_encoder="encoder.wavenet",   # <-- custom component
       ),
   )

   from base_attentive.assembly import BaseAttentiveV2Assembly
   assembler = BaseAttentiveV2Assembly()
   model = assembler.build(spec)

BaseAttentiveSpec
~~~~~~~~~~~~~~~~~

A frozen dataclass that fully describes a model without any framework
imports.  All fields have defaults.

.. code-block:: python

   from base_attentive.config import BaseAttentiveSpec, BaseAttentiveComponentSpec

   spec = BaseAttentiveSpec(
       # ── Input dimensions ────────────────────────────────────────────
       static_input_dim=4,
       dynamic_input_dim=8,
       future_input_dim=6,
       output_dim=1,
       forecast_horizon=24,

       # ── Model capacity ──────────────────────────────────────────────
       embed_dim=32,
       hidden_units=64,
       attention_heads=4,
       dropout_rate=0.1,
       activation="relu",
       layer_norm_epsilon=1e-6,

       # ── Backend / head ──────────────────────────────────────────────
       backend_name="tensorflow",   # or "torch" / "jax"
       head_type="point",           # or "quantile"
       quantiles=(),                # e.g. (0.1, 0.5, 0.9)

       # ── Component overrides ─────────────────────────────────────────
       components=BaseAttentiveComponentSpec(
           sequence_pooling="pool.last",      # override pooling
           temporal_encoder="encoder.wavenet",  # override encoder
       ),
   )

``BaseAttentiveComponentSpec`` accepts the following fields
(all optional; an omitted field falls back to the default key shown):

.. list-table::
   :header-rows: 1
   :widths: 35 65

   * - Field
     - Default registry key
   * - ``static_projection``
     - ``"projection.static"``
   * - ``dynamic_projection``
     - ``"projection.dynamic"``
   * - ``future_projection``
     - ``"projection.future"``
   * - ``hidden_projection``
     - ``"projection.hidden"``
   * - ``temporal_encoder``
     - ``"encoder.temporal_self_attention"``
   * - ``sequence_pooling``
     - ``"pool.mean"``
   * - ``feature_fusion``
     - ``"fusion.concat"``
   * - ``forecast_head``
     - ``"head.point_forecast"`` or ``"head.quantile_forecast"``

Default component keys (built-in, ``"generic"`` backend):

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Registry key
     - Purpose
   * - ``"projection.static"``
     - Static feature linear projection
   * - ``"projection.dynamic"``
     - Dynamic sequence projection
   * - ``"projection.future"``
     - Future covariate projection
   * - ``"projection.hidden"``
     - Post-fusion hidden projection
   * - ``"projection.dense"``
     - Generic dense projection (fallback)
   * - ``"encoder.temporal_self_attention"``
     - Temporal self-attention encoder
   * - ``"pool.mean"``
     - Sequence mean pooling
   * - ``"pool.last"``
     - Last-step pooling
   * - ``"fusion.concat"``
     - Feature concatenation
   * - ``"head.point_forecast"``
     - Point forecast head
   * - ``"head.quantile_forecast"``
     - Quantile forecast head
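
How overrides combine with these defaults can be sketched as follows.
``ComponentSpecSketch`` and ``resolve_keys`` are hypothetical stand-ins that
copy the field names and default keys from the tables above; representing an
omitted field as ``None`` is an assumption about the real class.

.. code-block:: python

   from dataclasses import asdict, dataclass
   from typing import Dict, Optional

   @dataclass(frozen=True)
   class ComponentSpecSketch:
       """Stand-in for BaseAttentiveComponentSpec (illustration only)."""
       static_projection: Optional[str] = None
       dynamic_projection: Optional[str] = None
       future_projection: Optional[str] = None
       hidden_projection: Optional[str] = None
       temporal_encoder: Optional[str] = None
       sequence_pooling: Optional[str] = None
       feature_fusion: Optional[str] = None
       forecast_head: Optional[str] = None

   def resolve_keys(spec: ComponentSpecSketch, head_type: str = "point") -> Dict[str, str]:
       """Fill every omitted field with its documented default key."""
       defaults = {
           "static_projection": "projection.static",
           "dynamic_projection": "projection.dynamic",
           "future_projection": "projection.future",
           "hidden_projection": "projection.hidden",
           "temporal_encoder": "encoder.temporal_self_attention",
           "sequence_pooling": "pool.mean",
           "feature_fusion": "fusion.concat",
           # The head default depends on head_type ("point" or "quantile").
           "forecast_head": f"head.{head_type}_forecast",
       }
       overrides = {k: v for k, v in asdict(spec).items() if v is not None}
       return {**defaults, **overrides}

   resolved = resolve_keys(
       ComponentSpecSketch(sequence_pooling="pool.last"), head_type="quantile"
   )
   for field_name, key in resolved.items():
       print(f"{field_name:20s} -> {key}")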

Inspecting the registry
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from base_attentive.registry import DEFAULT_COMPONENT_REGISTRY

   # List all registered keys
   for key in DEFAULT_COMPONENT_REGISTRY.list_keys():
       print(key)

   # Check if a key exists
   if DEFAULT_COMPONENT_REGISTRY.has("encoder.wavenet"):
       print("custom encoder registered")

   # Retrieve builder metadata
   info = DEFAULT_COMPONENT_REGISTRY.get_info("encoder.temporal_self_attention")
   print(info["description"])

Full v2 build-from-spec example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import numpy as np
   from base_attentive.config import BaseAttentiveSpec
   from base_attentive.assembly import BaseAttentiveV2Assembly

   spec = BaseAttentiveSpec(
       static_input_dim=4,
       dynamic_input_dim=8,
       future_input_dim=6,
       output_dim=1,
       forecast_horizon=24,
       embed_dim=32,
       hidden_units=64,
       attention_heads=4,
       backend_name="tensorflow",
       head_type="point",           # MSE and y below assume (B, H, output_dim)
   )

   model = BaseAttentiveV2Assembly().build(spec)
   model.compile(optimizer="adam", loss="mse")

   x_static  = np.random.randn(16, 4).astype("float32")
   x_dynamic = np.random.randn(16, 100, 8).astype("float32")
   x_future  = np.random.randn(16, 24, 6).astype("float32")
   y         = np.random.randn(16, 24, 1).astype("float32")

   model.fit([x_static, x_dynamic, x_future], y, epochs=2)

Using ``BaseAttentive`` (facade)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``BaseAttentive`` class is a convenience facade that builds the model
from keyword arguments without requiring you to construct a spec manually.
It delegates to the same registry/assembly system under the hood:

.. code-block:: python

   from base_attentive import BaseAttentive

   # This is equivalent to building through BaseAttentiveSpec + Assembly
   model = BaseAttentive(
       static_input_dim=4,
       dynamic_input_dim=8,
       future_input_dim=6,
       output_dim=1,
       forecast_horizon=24,
       embed_dim=32,
       num_heads=4,
       quantiles=[0.1, 0.5, 0.9],
   )

----

.. _arch-breaking-changes:

Breaking Changes in v2.0.0
--------------------------

v2.0.0 is a **major release**.  If you are upgrading from v1.0.0, the
following changes require action.

.. note::

   These changes are intentional.  The v1.0.0 API was tightly coupled to
   TensorFlow; v2.0.0 achieves full backend neutrality through these
   structural changes.

1. Keras 3 required
~~~~~~~~~~~~~~~~~~~

v1.0.0 used ``tensorflow.keras`` directly.  v2.0.0 uses
`Keras 3 <https://keras.io/>`_ (``import keras``) as the framework
abstraction layer.

**What breaks:** Any code that imports from ``tensorflow.keras`` or passes
``tf.Tensor`` objects to model inputs may need updating.

**Migration:**

.. code-block:: python

   # v1.0.0 — TensorFlow-coupled
   import tensorflow as tf
   model = BaseAttentive(...)
   x = tf.random.normal([32, 100, 8])

   # v2.0.0 — backend-neutral
   import numpy as np
   model = BaseAttentive(...)
   x = np.random.randn(32, 100, 8).astype("float32")
   # or use the active backend's tensor type directly
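
Keras 3 itself selects its backend from the ``KERAS_BACKEND`` environment
variable, which must be set before ``keras`` is first imported.  The
snippet below shows only that Keras-level mechanism; how it interacts with
the ``backend_name`` field of ``BaseAttentiveSpec`` is not covered here and
should not be inferred from it.

.. code-block:: python

   import os

   # Must run before the first `import keras`; Keras 3 reads it at import time.
   # setdefault keeps any value already exported in the shell.
   os.environ.setdefault("KERAS_BACKEND", "tensorflow")  # or "torch" / "jax"

   backend = os.environ["KERAS_BACKEND"]
   print(f"Keras backend requested: {backend}")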

2. Internal layer paths removed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In v1.0.0, internal layer classes were importable from
``base_attentive.layers.*`` and ``base_attentive.models.components.*``.
These paths no longer exist in v2.0.0.  All components are accessed
through the registry.

**What breaks:** Direct imports of internal layer classes.

**Migration:**

.. code-block:: python

   # v1.0.0 (breaks in v2.0.0)
   from base_attentive.layers import HierarchicalAttention

   # v2.0.0 (resolve the builder through the registry)
   from base_attentive.registry import DEFAULT_COMPONENT_REGISTRY
   builder = DEFAULT_COMPONENT_REGISTRY.get("attention.hierarchical")

3. ``architecture_config`` dict keys changed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Several ``architecture_config`` keys were renamed for clarity:

.. list-table::
   :header-rows: 1
   :widths: 35 35 30

   * - v1.0.0 key
     - v2.0.0 key
     - Notes
   * - ``"encoder_units"``
     - ``"embed_dim"``
     - Unified dimension name
   * - ``"decoder_heads"``
     - ``"num_heads"``
     - Consistent with Keras naming
   * - ``"use_attention"``
     - ``"attention_levels"``
     - Now accepts name, list, or int
   * - ``"temporal_mode"``
     - ``"objective"``
     - ``"hybrid"`` or ``"transformer"``

**Migration:**

.. code-block:: python

   # v1.0.0
   model = BaseAttentive(
       ...,
       architecture_config={"encoder_units": 64, "use_attention": True},
   )

   # v2.0.0
   model = BaseAttentive(
       ...,
       embed_dim=64,
       attention_levels=["cross"],
   )
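
For larger code bases, the renames in the table can be applied
mechanically.  ``migrate_architecture_config`` below is a hypothetical
helper, not part of BaseAttentive; it renames keys only and flags a boolean
``use_attention`` value for manual conversion, since the table lists names,
lists, and integers as the accepted v2.0.0 forms.

.. code-block:: python

   # Renames taken from the table above.
   RENAMED_KEYS = {
       "encoder_units": "embed_dim",
       "decoder_heads": "num_heads",
       "use_attention": "attention_levels",
       "temporal_mode": "objective",
   }

   def migrate_architecture_config(old_config):
       """Rename v1.0.0 keys; values are copied unchanged."""
       new_config = {}
       needs_review = []
       for key, value in old_config.items():
           new_key = RENAMED_KEYS.get(key, key)
           new_config[new_key] = value
           # A v1 boolean has no direct v2 equivalent; leave it to the caller.
           if new_key == "attention_levels" and isinstance(value, bool):
               needs_review.append(new_key)
       return new_config, needs_review

   new_config, needs_review = migrate_architecture_config(
       {"encoder_units": 64, "use_attention": True}
   )
   print(new_config)
   print("review manually:", needs_review)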

4. ``output_mode`` default changed
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In both versions, passing ``quantiles`` without ``output_mode`` produces a
quantile forecast.  v1.0.0 did this by defaulting ``output_mode`` to
``"quantile"`` whenever ``quantiles`` was set; v2.0.0 instead infers the
output mode from the combination of ``quantiles`` and ``output_mode``.
What breaks is the layout of the output tensor, because the quantile axis
moved from the last position to just before ``output_dim``:

.. list-table::
   :header-rows: 1
   :widths: 30 35 35

   * - Setting
     - v1.0.0 output shape
     - v2.0.0 output shape
   * - ``output_dim=2``, no quantiles
     - ``(B, H, 2)``
     - ``(B, H, 2)`` (unchanged)
   * - ``output_dim=2``, ``quantiles=[0.1,0.5,0.9]``
     - ``(B, H, 2, 3)``  ← Q last
     - ``(B, H, 3, 2)``  ← Q before output_dim

**Migration:** If you index the quantile axis, update from ``[..., i]``
(v1) to ``[:, :, i, :]`` (v2).
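
When porting post-processing code, a stored v1.0.0 prediction can be
rearranged into the v2.0.0 layout with NumPy.  The array below is synthetic,
and the two layouts are the ones given in the table above.

.. code-block:: python

   import numpy as np

   B, H, D, Q = 2, 24, 2, 3

   # v1.0.0 layout: (B, H, output_dim, Q), quantile axis last.
   pred_v1 = np.arange(B * H * D * Q, dtype="float32").reshape(B, H, D, Q)

   # v2.0.0 layout: (B, H, Q, output_dim), quantile axis moved to position 2.
   pred_v2 = np.moveaxis(pred_v1, -1, 2)

   i = 1  # index of the median when quantiles=[0.1, 0.5, 0.9]
   median_v1 = pred_v1[..., i]
   median_v2 = pred_v2[:, :, i, :]

   print(pred_v2.shape)
   print(np.array_equal(median_v1, median_v2))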

----

Data Flow Diagram (v2)
-----------------------

.. code-block:: text

   Static (B,S)  Dynamic (B,T,D)  Future (B,H,F)
        │               │                │
        │        ┌──────▼──────┐         │
        │        │ VSN / Dense │         │
        │        └──────┬──────┘         │
        │               │                │
   ┌────▼────┐   ┌──────▼──────┐  ┌──────▼──────┐
   │ Static  │   │  Temporal   │  │  Future     │
   │ Proj.   │   │  Encoder    │  │  Proj.      │
   │ (Dense) │   │ (LSTM/Attn) │  │  (Dense)    │
   └────┬────┘   └──────┬──────┘  └──────┬──────┘
        │               │                │
        └───────────────┴────────────────┘
                        │
             ┌──────────▼──────────┐
             │   Feature Fusion    │
             │   (concat + proj)   │
             └──────────┬──────────┘
                        │
             ┌──────────▼──────────┐
             │   Attention Stack   │
             │   (cross → hier     │
             │    → memory)        │
             └──────────┬──────────┘
                        │
             ┌──────────▼──────────┐
             │  Sequence Pooling   │
             │  (mean / last)      │
             └──────────┬──────────┘
                        │
             ┌──────────▼──────────┐
             │  Hidden Projection  │
             └──────────┬──────────┘
                        │
             ┌──────────┴──────────┐
             │                     │
       Point Forecast        Quantile Forecast
         (B, H, D)             (B, H, Q, D)

----

Configuration Hierarchy
-----------------------

Precedence (lowest → highest):

1. Built-in defaults (``DEFAULT_ARCHITECTURE``)
2. Explicit keyword arguments (``objective``, ``mode``, ``attention_levels``, …)
3. ``architecture_config`` dict (overrides all)

.. code-block:: python

   model = BaseAttentive(
       ...,
       objective="hybrid",            # step 2
       architecture_config={
           "encoder_type": "transformer",  # step 3 — wins over step 2
       },
   )
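
The precedence order amounts to a layered dictionary merge.  The sketch
below is a simplified model of it: ``resolve_config`` is a hypothetical
function, and it uses a single key name (``encoder_type``) at every layer,
whereas the real API exposes the same setting as the ``objective`` keyword.

.. code-block:: python

   def resolve_config(defaults, keyword_args, architecture_config=None):
       """Merge three layers; later layers override earlier ones."""
       resolved = dict(defaults)                   # 1. built-in defaults
       resolved.update(keyword_args)               # 2. explicit keyword arguments
       resolved.update(architecture_config or {})  # 3. architecture_config wins
       return resolved

   resolved = resolve_config(
       defaults={"encoder_type": "hybrid", "embed_dim": 32},
       keyword_args={"encoder_type": "hybrid", "embed_dim": 64},
       architecture_config={"encoder_type": "transformer"},
   )
   print(resolved)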

----

Performance Notes
-----------------

.. list-table::
   :header-rows: 1
   :widths: 15 20 22 43

   * - Mode
     - Encoder
     - Complexity
     - Notes
   * - Hybrid
     - Multi-scale LSTM
     - O(T·h²)
     - Recommended for T > 500
   * - Transformer
     - Self-attention
     - O(T²·h)
     - Recommended for T < 500
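
The two complexity terms in the table can be compared directly.  The sketch
below only evaluates the leading terms for an example hidden size; it
ignores constant factors and the fact that self-attention parallelises
across time steps while an LSTM does not, so it is no substitute for
profiling and does not by itself reproduce the T ≈ 500 guideline.

.. code-block:: python

   def recurrent_cost(T, h):
       """Leading term for the multi-scale LSTM encoder: T * h^2."""
       return T * h ** 2

   def attention_cost(T, h):
       """Leading term for the self-attention encoder: T^2 * h."""
       return T ** 2 * h

   h = 64  # example hidden size, chosen for illustration only

   # The ratio simplifies to T / h, so it grows linearly with sequence length.
   ratios = {T: attention_cost(T, h) / recurrent_cost(T, h) for T in (64, 500, 2000)}
   for T, ratio in ratios.items():
       print(f"T={T}: attention / recurrent = {ratio:.2f}")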

----

See Also
--------

- :doc:`configuration_guide` — Full parameter reference
- :doc:`api_reference` — Complete API docs
- :doc:`usage` — Extended usage patterns
- :doc:`components_reference` — Component library
- :doc:`release_notes/v2.0.0` — v2.0.0 stable release notes
- :doc:`release_notes/v1.0.0` — v1.0.0 release notes