GNN Model Guide #

Overview #

The GNN (Graph Neural Network) model in SYMFLUENCE is a data-driven approach that uses graph-based deep learning to predict streamflow across river networks. Unlike LSTM models, which are typically applied to each basin independently, a GNN explicitly models the spatial connectivity and upstream-downstream relationships within the river network.

Key Capabilities:

Spatial graph-based streamflow prediction
Explicit river network topology modeling
Multi-site simultaneous prediction
Information propagation along river network
Transfer learning across basins in same network
Scalable to large river networks
Regional-scale prediction

Typical Applications:

River network-wide streamflow prediction
Multi-site calibration and prediction
Ungauged basin prediction (network-informed)
Regional hydrological modeling
Data assimilation in river networks
Ensemble forecasting across networks
Climate change impact on river systems

Spatial Scales: River network (100s to 1000s of connected basins)

Temporal Resolution: Daily to hourly

GNN Architecture for Hydrology #

Graph Neural Networks Fundamentals #

What is a Graph?

In hydrology, a graph represents a river network:

Nodes:  Basins/catchments
Edges:  River connections (upstream → downstream)

Example River Network:

Basin A ──→ Basin B ──→ Basin D (outlet)
            ↗
Basin C ───┘

Graph Structure:

Nodes (V): Each basin with attributes (area, elevation, land cover, etc.)
Edges (E): Directed connections representing flow direction
Adjacency Matrix (A): Defines connectivity
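
As a minimal sketch of the structure above, the example network can be written as an edge list and turned into an adjacency matrix. This assumes the diagram is read as Basin C draining into Basin B; the variable and function names are illustrative and not part of SYMFLUENCE.

```python
# Example network from above: A -> B, C -> B, B -> D (outlet).
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("C", "B"), ("B", "D")]  # (upstream, downstream)

index = {name: i for i, name in enumerate(nodes)}

# adjacency[i][j] == 1 means basin i drains directly into basin j.
adjacency = [[0] * len(nodes) for _ in nodes]
for source, target in edges:
    adjacency[index[source]][index[target]] = 1


def upstream_of(name):
    """Direct upstream neighbours: rows with a 1 in this basin's column."""
    j = index[name]
    return [nodes[i] for i in range(len(nodes)) if adjacency[i][j] == 1]


for row_name, row in zip(nodes, adjacency):
    print(row_name, row)
print("Upstream of B:", upstream_of("B"))
```

Rows are sources and columns are targets, which is the same orientation as the adjacency CSV format described later in this guide.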

GNN Message Passing #

GNNs learn by passing information along graph edges:

Process:

Node features: Each basin has forcing data + static attributes
Message passing: Information flows from upstream to downstream
Aggregation: Each node receives messages from upstream neighbors
Update: Node updates its hidden state based on messages
Prediction: Final layer predicts streamflow

Hydrological Intuition:

Upstream precipitation affects downstream flow (captured via message passing)
Basin characteristics influence how water is routed (learned node embeddings)
River network topology constrains predictions (graph structure)
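
The message-passing process described above can be sketched in a few lines of plain Python. This is a toy illustration only: the one-dimensional node states and the two scalar weights are hypothetical, whereas a real GNN layer learns weight matrices during training, and nothing here reflects SYMFLUENCE's actual implementation.

```python
# Toy network: A -> B, C -> B, B -> D. Edges are (upstream, downstream).
edges = [("A", "B"), ("C", "B"), ("B", "D")]

# Hypothetical one-dimensional node states (e.g. a scaled forcing signal).
hidden = {"A": 1.0, "B": 0.5, "C": 2.0, "D": 0.0}

# Hypothetical fixed weights; a trained model would learn these.
W_SELF, W_UPSTREAM = 0.5, 0.25


def message_passing_step(hidden, edges):
    """One round: aggregate upstream states, then update every node."""
    # Aggregation: sum the states of each node's direct upstream neighbours.
    aggregated = {node: 0.0 for node in hidden}
    for source, target in edges:
        aggregated[target] += hidden[source]
    # Update: combine own state with the aggregated message, then apply ReLU.
    return {
        node: max(0.0, W_SELF * hidden[node] + W_UPSTREAM * aggregated[node])
        for node in hidden
    }


step1 = message_passing_step(hidden, edges)
step2 = message_passing_step(step1, edges)
print(step1)
print(step2)
```

After one step the outlet D has only received information from B. After the second step its state also reflects A and C, because their signal has passed through B.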

GNN vs LSTM #

Aspect	LSTM	GNN
Spatial structure	Independent basins	Explicit network topology
Training	One basin at a time	Entire network simultaneously
Ungauged basins	Requires transfer learning	Network-informed interpolation
Computational cost	Low (per basin)	Higher (entire network)
Data requirements	Moderate (per basin)	High (network + all basins)
Interpretability	Limited	Slightly better (graph structure)

Network Architecture #

SYMFLUENCE GNN configuration:

Input Layer (per node):
├─ Forcing time series (precipitation, temperature, ...)
├─ Static attributes (area, elevation, land cover, ...)
└─ Lagged streamflow (if available)

↓

GNN Layers (message passing):
├─ Layer 1: Graph Convolution + ReLU
├─ Layer 2: Graph Convolution + ReLU
└─ Layer N: Graph Convolution + ReLU

↓

Output Layer (per node):
└─ Streamflow prediction

Configuration in SYMFLUENCE #

Model Selection #

HYDROLOGICAL_MODEL: GNN

Key Configuration Parameters #

Network Architecture #

Parameter	Default	Description
GNN_HIDDEN_SIZE	128	Hidden layer dimension (64, 128, 256)
GNN_NUM_LAYERS	3	Number of graph convolution layers (2-5)
GNN_DROPOUT	0.2	Dropout rate for regularization
GNN_L2_REGULARIZATION	1e-6	L2 weight decay

Training Configuration #

Parameter	Default	Description
GNN_EPOCHS	300	Training epochs
GNN_BATCH_SIZE	64	Mini-batch size
GNN_LEARNING_RATE	0.001	Adam optimizer learning rate
GNN_LEARNING_PATIENCE	30	Early stopping patience

Model Options #

Parameter	Default	Description
GNN_LOAD	false	Load pre-trained model
GNN_PARAMS_TO_CALIBRATE	null	Hyperparameters to tune
GNN_PARAMETER_BOUNDS	null	Bounds for hyperparameter tuning
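
To sanity-check a set of values before launching a long training run, something like the following could be used. The `GNNSettings` class and its `problems` method are hypothetical helpers written for this guide, not part of the SYMFLUENCE API; the defaults and the 2-5 layer range are copied from the tables above.

```python
from dataclasses import dataclass


@dataclass
class GNNSettings:
    """Hypothetical container; field defaults copy the tables above."""
    hidden_size: int = 128            # GNN_HIDDEN_SIZE
    num_layers: int = 3               # GNN_NUM_LAYERS
    dropout: float = 0.2              # GNN_DROPOUT
    l2_regularization: float = 1e-6   # GNN_L2_REGULARIZATION
    epochs: int = 300                 # GNN_EPOCHS
    batch_size: int = 64              # GNN_BATCH_SIZE
    learning_rate: float = 0.001      # GNN_LEARNING_RATE
    learning_patience: int = 30       # GNN_LEARNING_PATIENCE

    def problems(self):
        """Return a list of human-readable issues (empty if none found)."""
        found = []
        if not 2 <= self.num_layers <= 5:
            found.append("GNN_NUM_LAYERS outside the suggested 2-5 range")
        if not 0.0 <= self.dropout < 1.0:
            found.append("GNN_DROPOUT must be in [0, 1)")
        if self.learning_rate <= 0:
            found.append("GNN_LEARNING_RATE must be positive")
        if self.learning_patience >= self.epochs:
            found.append("GNN_LEARNING_PATIENCE >= GNN_EPOCHS, so early stopping cannot trigger")
        return found


defaults = GNNSettings()
print(defaults.problems())

risky = GNNSettings(num_layers=8, dropout=1.2)
print(risky.problems())
```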

Input Data Requirements #

Network Structure #

River network topology file:

File: <domain>_river_network.json

{
  "nodes": [
    {
      "id": "basin_001",
      "area_km2": 250.5,
      "elevation_m": 1450,
      "latitude": 51.2,
      "longitude": -115.3,
      "land_cover": {"forest": 0.6, "grass": 0.3, "urban": 0.1},
      "soil_type": "loam"
    },
    {
      "id": "basin_002",
      "area_km2": 180.3,
      "elevation_m": 1200,
      "latitude": 51.1,
      "longitude": -115.1,
      "land_cover": {"forest": 0.4, "grass": 0.5, "urban": 0.1},
      "soil_type": "clay"
    }
  ],
  "edges": [
    {"source": "basin_001", "target": "basin_002"},
    {"source": "basin_002", "target": "basin_003"}
  ]
}

Note: The example above is abbreviated. Every basin ID used in edges must also appear in nodes, so a complete file would include a node entry for basin_003 as well.

Adjacency matrix (alternative format):

File: <domain>_adjacency.csv

,basin_001,basin_002,basin_003
basin_001,0,1,0
basin_002,0,0,1
basin_003,0,0,0
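
The two formats carry the same connectivity information. As a rough sketch, the edge list from the JSON example can be converted into the adjacency CSV layout shown above using only the standard library. The function name is illustrative, and basin_003 is listed explicitly because the edges refer to it.

```python
import csv
import io

# Node order and edges follow the JSON example; basin_003 is included here
# because the edge list refers to it.
node_ids = ["basin_001", "basin_002", "basin_003"]
edges = [
    {"source": "basin_001", "target": "basin_002"},
    {"source": "basin_002", "target": "basin_003"},
]


def edges_to_adjacency_csv(node_ids, edges):
    """Build CSV text with sources as rows and targets as columns."""
    position = {node_id: i for i, node_id in enumerate(node_ids)}
    matrix = [[0] * len(node_ids) for _ in node_ids]
    for edge in edges:
        matrix[position[edge["source"]]][position[edge["target"]]] = 1

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([""] + node_ids)          # header row, empty corner cell
    for node_id, row in zip(node_ids, matrix):
        writer.writerow([node_id] + row)
    return buffer.getvalue()


adjacency_csv = edges_to_adjacency_csv(node_ids, edges)
print(adjacency_csv)
```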

Training Data #

Forcing data (per basin):

File: forcing_<basin_id>.csv

Date,Precip_mm,Temp_C,Rad_W/m2,Humidity_pct
2015-01-01,5.2,2.3,120.5,65
2015-01-02,0.0,3.1,135.2,58
...

Streamflow observations (per gauged basin):

File: streamflow_<basin_id>.csv

Date,Flow_m3s
2015-01-01,45.3
2015-01-02,42.1
...

Note: Not all basins need observations. The GNN can predict streamflow for ungauged basins within the network.
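
One common way to train with partially observed networks is to compute the loss only where observations exist. The sketch below shows that idea with a masked mean squared error. Whether SYMFLUENCE uses exactly this loss is an assumption, and the numbers are made up for illustration.

```python
import math


def masked_mse(predicted, observed):
    """Mean squared error over entries where an observation exists."""
    squared_errors = [
        (p - o) ** 2
        for p, o in zip(predicted, observed)
        if o is not None and not math.isnan(o)
    ]
    if not squared_errors:
        raise ValueError("no observations available for the loss")
    return sum(squared_errors) / len(squared_errors)


# One time step, three basins; the second basin is ungauged (NaN observation).
predicted = [44.0, 30.0, 77.0]
observed = [45.0, float("nan"), 79.0]

loss = masked_mse(predicted, observed)
print(loss)  # (1.0 + 4.0) / 2 = 2.5
```

The ungauged basin still receives a prediction, but it contributes nothing to the loss, so its prediction is shaped only by what the model learns at gauged basins.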

Static Attributes #

File: basin_attributes.csv

basin_id,area_km2,elev_mean_m,slope_deg,forest_frac,soil_clay_frac
basin_001,250.5,1450,8.5,0.60,0.25
basin_002,180.3,1200,5.2,0.40,0.35
basin_003,420.1,980,3.1,0.30,0.40
...
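
Static attributes span very different numeric ranges (hundreds of km2 versus fractions), so they are usually standardized before being fed to a neural network. The following is a minimal sketch of per-column z-scoring using the example values above. It assumes standardization is the normalization in use, which this guide does not specify, and it guards against constant columns, a frequent source of NaNs.

```python
import math

# Values copied from the basin_attributes.csv example above.
attributes = {
    "area_km2": [250.5, 180.3, 420.1],
    "elev_mean_m": [1450.0, 1200.0, 980.0],
    "slope_deg": [8.5, 5.2, 3.1],
}


def standardize(values):
    """Z-score a column; constant columns map to zeros instead of NaN."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0.0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


normalized = {name: standardize(column) for name, column in attributes.items()}
for name, column in normalized.items():
    print(name, [round(v, 2) for v in column])
```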

Output Specifications #

During Training #

Training logs:

Epoch 1/300:  Train Loss: 0.621  Val Loss: 0.745  Avg NSE: 0.42
Epoch 2/300:  Train Loss: 0.489  Val Loss: 0.602  Avg NSE: 0.55
...
Epoch 95/300: Train Loss: 0.112  Val Loss: 0.156  Avg NSE: 0.81
Early stopping at epoch 125
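
The log above is consistent with the default patience of 30 if epoch 95 was the best epoch: training stops 30 epochs later, at epoch 125. A minimal sketch of that rule follows. The function is illustrative rather than SYMFLUENCE's actual training loop, and the validation curve is synthetic.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None if the patience is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return None, best_epoch


# Synthetic curve: improves through epoch 95, then plateaus at a worse value.
val_losses = [1.0 - 0.005 * e for e in range(1, 96)] + [0.6] * 205

stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=30)
print(stop_epoch, best_epoch)  # 125 95
```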

Model checkpoint:

File: <project_dir>/models/GNN/best_model.pt

After Training #

Network-wide predictions:

File: <network>_GNN_predictions.csv

Date,basin_001_obs,basin_001_pred,basin_002_obs,basin_002_pred,basin_003_obs,basin_003_pred
2015-01-01,45.3,43.8,32.1,31.5,78.4,76.9
2015-01-02,42.1,41.2,29.8,29.1,72.8,71.5
...

Per-basin performance:

File: GNN_basin_metrics.csv

basin_id,NSE,KGE,RMSE_m3s,MAE_m3s,Bias_pct,has_observations
basin_001,0.82,0.78,5.2,3.8,-2.1,true
basin_002,0.76,0.73,3.8,2.9,1.5,true
basin_003,0.68,0.65,8.1,6.2,-4.3,false  # Ungauged - validated with proxy
...
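
For reference, the NSE and KGE columns can be computed from paired observed and predicted series as sketched below. This uses the standard Nash-Sutcliffe definition and the original 2009 form of the Kling-Gupta efficiency; whether SYMFLUENCE uses that KGE variant is an assumption, and the sample series are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the observed mean."""
    obs_mean = mean(obs)
    numerator = sum((o - s) ** 2 for o, s in zip(obs, sim))
    denominator = sum((o - obs_mean) ** 2 for o in obs)
    return 1.0 - numerator / denominator


def kge(obs, sim):
    """Kling-Gupta efficiency from correlation, variability and bias ratios."""
    obs_mean, sim_mean = mean(obs), mean(sim)
    obs_std = math.sqrt(mean([(o - obs_mean) ** 2 for o in obs]))
    sim_std = math.sqrt(mean([(s - sim_mean) ** 2 for s in sim]))
    covariance = mean([(o - obs_mean) * (s - sim_mean) for o, s in zip(obs, sim)])
    r = covariance / (obs_std * sim_std)   # linear correlation
    alpha = sim_std / obs_std              # variability ratio
    beta = sim_mean / obs_mean             # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Hypothetical daily flows in m3/s.
obs = [45.3, 42.1, 50.7, 61.2, 55.0]
sim = [43.8, 41.2, 52.0, 58.9, 56.1]
print(round(nse(obs, sim), 3), round(kge(obs, sim), 3))
```

Both metrics require variation in the observed series (and, for KGE, in the simulated series); a constant series leads to division by zero.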

Model-Specific Workflows #

Basic GNN Workflow #

River network with multiple gauged basins:

# config.yaml
DOMAIN_NAME: river_network
HYDROLOGICAL_MODEL: GNN

# Define network domain
DOMAIN_DEFINITION_METHOD: river_network
RIVER_NETWORK_FILE: ./network_topology.json

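# Or use basin delineation instead (auto-creates network).
# Uncomment these two lines and remove the two lines above;
# a YAML file must not define DOMAIN_DEFINITION_METHOD twice.
# DOMAIN_DEFINITION_METHOD: merit_basins
# MERIT_BASIN_IDS: [10234, 10235, 10236, 10237, 10238]  # Connected basins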

# Forcing
FORCING_DATASET: ERA5
FORCING_START_YEAR: 2010
FORCING_END_YEAR: 2020

# GNN configuration
GNN_HIDDEN_SIZE: 128
GNN_NUM_LAYERS: 3
GNN_EPOCHS: 300

# Data split
CALIBRATION_PERIOD: "2010-01-01,2017-12-31"  # Training + validation
EVALUATION_PERIOD: "2018-01-01,2020-12-31"   # Testing

Run:

symfluence workflow run --config config.yaml

# Training uses all gauged basins simultaneously
# Predicts for ungauged basins within network

Large River Network Application #

For major river systems (e.g., Mississippi, Amazon):

# config.yaml
DOMAIN_DEFINITION_METHOD: merit_basins
MERIT_BASIN_IDS: [...]  # 500+ connected basins

HYDROLOGICAL_MODEL: GNN

# Larger network for big system
GNN_HIDDEN_SIZE: 256
GNN_NUM_LAYERS: 4

# More epochs for complex network
GNN_EPOCHS: 500

# GPU essential
USE_GPU: true

# Batch training
GNN_BATCH_SIZE: 128

Ungauged Basin Prediction #

Predict streamflow at ungauged locations:

# config.yaml
# Define network including ungauged basins
MERIT_BASIN_IDS: [101, 102, 103, 104, 105]  # 5 basins

# Observations available for basins 101, 103, 105
# Basins 102, 104 are ungauged

HYDROLOGICAL_MODEL: GNN

# GNN will use network structure to inform predictions
# at ungauged basins 102 and 104

Transfer Learning Across Networks #

Train on one network, apply to another:

# Step 1: Train on data-rich network
# config_source_network.yaml
DOMAIN_NAME: colorado_river
MERIT_BASIN_IDS: [...]  # Well-gauged network

HYDROLOGICAL_MODEL: GNN
GNN_EPOCHS: 300
# The trained model is saved automatically under models/GNN/

# Step 2: Apply to data-sparse network
# config_target_network.yaml
DOMAIN_NAME: snake_river
MERIT_BASIN_IDS: [...]  # Sparse gauge network

HYDROLOGICAL_MODEL: GNN

# Load pre-trained model
GNN_LOAD: true

# Fine-tune with available data
GNN_EPOCHS: 50

Hyperparameter Tuning #

Key Hyperparameters #

1. Hidden Size

GNN_HIDDEN_SIZE: [64, 128, 256, 512]

Larger = more capacity, more parameters
128-256 typical for medium networks (50-100 basins)
512+ for very large networks (500+ basins)

2. Number of Layers

GNN_NUM_LAYERS: [2, 3, 4, 5]

More layers = information propagates farther in network
Each layer extends a basin's view by one hop along the network
2-3 layers: local information (basins up to 2-3 hops away)
4-5 layers: basin receives info from basins up to 4-5 hops away
Diminishing returns after 4-5 layers
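
The relationship between layer count and reach can be checked directly on a network. The sketch below lists which upstream basins can influence a target basin after a given number of layers, assuming messages travel only from upstream to downstream as described earlier. The network and function name are hypothetical.

```python
def basins_within_hops(edges, target, hops):
    """Upstream basins whose information can reach `target` in `hops` layers."""
    upstream = {}
    for source, destination in edges:
        upstream.setdefault(destination, set()).add(source)

    reached, frontier = set(), {target}
    for _ in range(hops):
        # Step one hop further upstream from the current frontier.
        frontier = {u for node in frontier for u in upstream.get(node, ())} - reached
        reached |= frontier
    return reached


# Hypothetical chain with one tributary: b1 -> b2 -> b3 -> b4 -> b5, t1 -> b3.
edges = [("b1", "b2"), ("b2", "b3"), ("b3", "b4"), ("b4", "b5"), ("t1", "b3")]

for layers in (1, 2, 3, 4):
    print(layers, sorted(basins_within_hops(edges, "b5", layers)))
```

In this example the outlet b5 needs four layers before the headwater basin b1 can affect its prediction, which is one way to choose GNN_NUM_LAYERS for a given network depth.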

3. Dropout

GNN_DROPOUT: [0.0, 0.1, 0.2, 0.3, 0.5]

Higher = more regularization
0.2-0.3 typical
Increase if overfitting

4. Learning Rate

GNN_LEARNING_RATE: [0.0001, 0.0005, 0.001, 0.005]

0.001 is good starting point
Lower if training unstable
Higher for faster convergence

Automated Tuning #

# Use optimization framework
OPTIMIZATION_ALGORITHM: RandomSearch

# Define search space
GNN_HIDDEN_SIZE: [128, 256]
GNN_NUM_LAYERS: [3, 4]
GNN_DROPOUT: [0.2, 0.3]

# Optimize network-averaged metric
OPTIMIZATION_METRIC: KGE_network_avg
OPTIMIZATION_MAX_ITERATIONS: 15
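
To make the search concrete, here is a rough sketch of how random search could draw configurations from the search space above. It is not SYMFLUENCE's optimizer; the function name and the choice to skip duplicate draws are assumptions.

```python
import math
import random

search_space = {
    "GNN_HIDDEN_SIZE": [128, 256],
    "GNN_NUM_LAYERS": [3, 4],
    "GNN_DROPOUT": [0.2, 0.3],
}


def sample_configs(space, max_iterations, seed=0):
    """Draw distinct random configurations, capped at the size of the space."""
    rng = random.Random(seed)
    total = math.prod(len(options) for options in space.values())
    target = min(max_iterations, total)

    seen, configs = set(), []
    while len(configs) < target:
        candidate = {name: rng.choice(options) for name, options in space.items()}
        key = tuple(candidate.items())
        if key not in seen:  # skip configurations that were already drawn
            seen.add(key)
            configs.append(candidate)
    return configs


configs = sample_configs(search_space, max_iterations=15)
print(len(configs))
for config in configs:
    print(config)
```

Note that this search space contains only 2 x 2 x 2 = 8 distinct combinations, so 15 iterations are enough to cover it exhaustively. Random search pays off when the space is much larger than the iteration budget.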

Known Limitations #

Network Data Required:

- Needs river network topology
- All basins in network need forcing data
- Can’t apply to isolated single basins easily

Computational Cost:

- Training entire network is expensive
- GPU strongly recommended
- Scales poorly to 1000+ basin networks without optimization

Data Hungry:

- Needs many gauged basins for training (ideally 20+)
- Fewer gauges = degraded performance
- Small networks (<10 basins) may not benefit over LSTM

Black Box:

- Not meaningfully more interpretable than LSTM
- Hard to diagnose why predictions fail
- Graph structure helps slightly but still opaque

Extrapolation Issues:

- Same issues as LSTM for climate change
- Cannot extrapolate outside training distribution
- Ungauged basins still challenging if very different from gauged

Network Topology Sensitivity:

- Incorrect network structure = poor performance
- Errors in basin connectivity propagate
- Need accurate DEM and flow routing

Troubleshooting #

Common Issues #

Error: “PyTorch Geometric not found”

# Install PyTorch Geometric (GNN library)
pip install torch-geometric

# Or with conda:
conda install pyg -c pyg

Error: “River network file missing”

# Provide network topology
RIVER_NETWORK_FILE: ./network_topology.json

# Or use MERIT basins (auto-creates network)
DOMAIN_DEFINITION_METHOD: merit_basins
MERIT_BASIN_IDS: [...]

Error: “Graph connectivity error”

Check network topology:

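import json

with open('network_topology.json') as f:
    net = json.load(f)

# Verify all edges reference valid nodes
node_ids = {n['id'] for n in net['nodes']}
for edge in net['edges']:
    # The message names the offending ID so the topology file can be fixed
    assert edge['source'] in node_ids, f"unknown source: {edge['source']}"
    assert edge['target'] in node_ids, f"unknown target: {edge['target']}"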

Poor performance on ungauged basins

Increase number of gauged basins in training
Add more static attributes (land cover, soil, topography)
Use more GNN layers (information propagates farther)
Ensure network topology is correct

Overfitting (train >> val performance)

# Increase regularization
GNN_DROPOUT: 0.4
GNN_L2_REGULARIZATION: 1e-5

# Reduce capacity
GNN_HIDDEN_SIZE: 64
GNN_NUM_LAYERS: 2

Slow training

# Use GPU (essential for GNN)
USE_GPU: true

# Larger batches
GNN_BATCH_SIZE: 128

# Fewer layers
GNN_NUM_LAYERS: 2

NaN predictions

Check for missing forcing data across network
Verify normalization didn’t produce NaNs
Reduce learning rate
Check network topology for cycles or disconnected components
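
The last check can be automated. The sketch below detects cycles with Kahn's topological-sort algorithm and counts weakly connected components; a valid river network is expected to be acyclic, and usually forms a single component. Function names are illustrative. With the JSON topology format, edges can be built as `[(e['source'], e['target']) for e in net['edges']]`.

```python
from collections import defaultdict, deque


def has_cycle(node_ids, edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be sorted."""
    indegree = {node: 0 for node in node_ids}
    downstream = defaultdict(list)
    for source, target in edges:
        downstream[source].append(target)
        indegree[target] += 1

    queue = deque(node for node, degree in indegree.items() if degree == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for target in downstream[node]:
            indegree[target] -= 1
            if indegree[target] == 0:
                queue.append(target)
    return visited != len(node_ids)


def count_components(node_ids, edges):
    """Number of weakly connected components (edge direction ignored)."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)

    unseen, components = set(node_ids), 0
    while unseen:
        components += 1
        stack = [unseen.pop()]
        while stack:
            node = stack.pop()
            for other in neighbours[node]:
                if other in unseen:
                    unseen.remove(other)
                    stack.append(other)
    return components


nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("C", "B"), ("B", "D")]
print(has_cycle(nodes, edges), count_components(nodes, edges))
```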

Performance Tips #

Improving Accuracy #

More gauged basins: 20+ gauges >> 5 gauges
Rich static attributes: Land cover, soil, geology, climate indices
Accurate network topology: Verify with DEM-derived flow directions
Deeper networks: 4-5 layers for large networks
Multi-task learning: Predict flow + other variables (e.g., snow)

Speeding Up Training #

Use GPU (10-100x faster)
Smaller networks (reduce hidden size, layers)
Graph sampling (train on subgraphs for very large networks)
Early stopping (patience = 20-30)
Larger batches (if memory allows)

Deployment #

After training, GNN is fast for inference:

# Predict entire network in milliseconds
# Ideal for operational forecasting

Comparing with Other Models #

GNN vs LSTM:

Use GNN if: Many connected basins, network structure important
Use LSTM if: Single basin or independent basins

GNN vs Physics-Based:

GNN advantages: Fast, learns complex patterns, no per-basin parameter calibration (but requires training data)
Physics advantages: Interpretable, extrapolates better, works with less data

Recommendation:

# Multi-model ensemble
HYDROLOGICAL_MODEL: [SUMMA, LSTM, GNN]

# GNN excels at spatial patterns
# LSTM at temporal patterns
# SUMMA at physical processes

Additional Resources #

Graph Neural Networks for Hydrology:

Kratzert et al. (2023): “Graph Neural Networks for rainfall-runoff modeling”
Nearing et al. (2023): “Graph-based learning for river networks”
Shen et al. (2023): “Differentiable graph network for streamflow prediction”

PyTorch Geometric:

Documentation: https://pytorch-geometric.readthedocs.io
Tutorials: https://pytorch-geometric.readthedocs.io/en/latest/notes/introduction.html

Graph Theory for River Networks:

Rinaldo et al. (2006): “River networks as ecological corridors”
Rodríguez-Iturbe & Rinaldo (1997): “Fractal River Basins”

SYMFLUENCE-specific:

Configuration: GNN parameter reference
LSTM Model Guide: Comparison with LSTM
SUMMA Model Guide: Comparison with physics-based models
Troubleshooting: General troubleshooting

Datasets with River Networks:

MERIT-Basins: Global river basin network
NHDPlus: US river network
HydroSHEDS: Global drainage network

Example Notebooks:

# GNN examples
symfluence examples list | grep GNN

Advanced GNN Architectures:

Graph Attention Networks (GAT)
Graph Convolutional Networks (GCN)
Message Passing Neural Networks (MPNN)
Spatial-Temporal Graph Networks (STGNN)

Future Directions:

Physics-informed GNNs (combining data-driven + physics)
Hybrid models (GNN + process-based routing)
Uncertainty quantification with GNN ensembles