
Coral Reef Health Monitoring around Vanuatu’s Coastline

This notebook provides a step-by-step guide for identifying coral reefs and assessing their health over time using Sentinel-2 satellite imagery. Coral ecosystems are highly sensitive to environmental change, and satellite data offers a powerful way to monitor reefs at scale.

We will walk through a remote sensing workflow, including:

  1. Obtain Top-of-Atmosphere Sentinel-2 imagery: Buffer the coastline of a Vanuatu province and use the buffer as the geographic selection criterion for a STAC search.

  2. Obtain derived Sea Surface Temperature (SST) data from Sentinel-3: Use the same search criteria to obtain SST data and interpolate it onto the grid of the Sentinel-2 data.

  3. Cloud and land masking: Remove clouds, shadows, and non-water pixels to isolate the marine environment.

  4. Atmospheric correction: Convert raw Level-1C Sentinel-2 imagery to surface reflectance (Level-2A) using physics-based correction techniques.

  5. Index calculation: Derive the Normalized Blue-Green Index (NBGI), which is sensitive to benthic features like live coral, algae, and sand.

  6. Training a classifier: Use labeled coral reef polygons from a vector dataset to train a Random Forest classifier on the spectral and index data.

  7. Time series analysis: Apply the trained model across the time series to generate predictions, and track trends in reef extent over time.

!mamba install --channel conda-forge libgdal openjpeg gdal rasterio --yes
!mamba list libgdal-jp2openjpeg
!ls /srv/conda/envs/notebook/lib/gdalplugins
import os
os.environ["GDAL_DRIVER_PATH"] = "/srv/conda/envs/notebook/lib/gdalplugins"

from osgeo import gdal
gdal.AllRegister()

drivers = [gdal.GetDriver(i).GetDescription() for i in range(gdal.GetDriverCount())]
print("JP2OpenJPEG" in drivers)
print("JP2KAK" in drivers)
import joblib
import requests
import numpy as np
import pandas as pd
import geopandas as gpd
import pyproj
import xarray as xr
import rioxarray
import rasterio
from rasterio.features import geometry_mask
import shapely
from shapely.geometry import mapping, shape, MultiPolygon, Point, Polygon, box
from scipy.interpolate import griddata
import matplotlib.pyplot as plt
import hvplot.xarray
from cartopy.crs import PlateCarree
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    classification_report,
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import fsspec
import planetary_computer
from odc.stac import load
from pystac_client import Client
from geocube.api.core import make_geocube
from tqdm import tqdm

Define AOI — Vanuatu Coastal Buffer

# Load province boundaries of Vanuatu
provinces = gpd.read_file("./2016_phc_vut_pid_4326.geojson")
provinces

Buffer one province for demonstration. We’ll use a subset of a province for this lesson, since loading a time series consumes more memory than a single timestamp.

PROVINCE = "TAFEA"
DATERANGE_START = "2020-06-01"
DATERANGE_END = "2020-08-30"
# Reproject to meters
province_proj = provinces.to_crs(epsg=32759) # UTM projection for Vanuatu

# Get the boundary (coastline)
coastline_proj = province_proj.boundary

# Buffer outward
full_buffer = coastline_proj.buffer(4000) # 4000 meters

# Subtract the original geometry to get only the outer shell
external_only = full_buffer.difference(province_proj.geometry)

# Back to lat/lon
external_only_latlon = external_only.to_crs(epsg=4326)

# Filter to one province (e.g., TAFEA)
province_match = provinces[provinces["pname"] == PROVINCE]
geom = external_only_latlon.loc[province_match.index].iloc[0]

# Get the median-sized island (polygon) for this province
if isinstance(geom, MultiPolygon):
    # Sort polygons by area
    sorted_polys = sorted(geom.geoms, key=lambda p: p.area)
    n = len(sorted_polys)
    median_index = n // 2  # Integer division

    # If even number of polygons, pick the lower-middle one
    selected = sorted_polys[median_index - 1] if n % 2 == 0 else sorted_polys[median_index]
else:
    selected = geom  # If it's just a single Polygon
selected

Search STAC for Sentinel-2 L1C Imagery

# Connect to STAC API
stac = Client.open("https://earth-search.aws.element84.com/v1") 

# Search for Sentinel-2 L1C within the reef buffer
items_l1c = stac.search(
    collections=["sentinel-2-l1c"],
    intersects=selected,
    datetime=f"{DATERANGE_START}/{DATERANGE_END}",
    query={"eo:cloud_cover": {"lt": 20}}
).item_collection()

# Search for Sentinel-2 L2A within the reef buffer to obtain the SCL band for cloud masking
items_l2a = stac.search(
    collections=["sentinel-2-l2a"],
    intersects=selected,
    datetime=f"{DATERANGE_START}/{DATERANGE_END}",
    query={"eo:cloud_cover": {"lt": 20}}
).item_collection()
len(items_l1c), len(items_l2a)
items_l1c[0]
def get_key(item):
    dt = item.datetime.replace(microsecond=0)  # drop microseconds
    tile = item.properties.get("sentinel:tile_id")
    return (dt, tile)
    
# Build set of L2A (datetime, tile) keys
l2a_keys = {get_key(item) for item in items_l2a}

# Filter L1C items to only those that have matching keys in L2A
items_l1c_filtered = [item for item in items_l1c if get_key(item) in l2a_keys]

print(f"Filtered L1C items: {len(items_l1c_filtered)} of {len(items_l1c)}")
# Create a set of (datetime, tile) from L1C
l1c_keys = {get_key(item) for item in items_l1c_filtered}

# Filter L2A items using the same key
items_l2a_matched = [
    item for item in items_l2a
    if get_key(item) in l1c_keys
]

Load Imagery

ds_l1c = load(
    items_l1c_filtered,
    bands=["blue", "green", "red", "nir08", "rededge1", "swir16"], #"rededge2", "rededge3", "swir22"
    crs="EPSG:32759",
    resolution=10,
    #chunks={'x': 1024, 'y': 1024, 'bands': -1, 'time': -1}, #{"time": 1},
    bbox=selected.bounds #aoi.total_bounds  # constrain to buffered coastline
)

ds_l2a = load(
    items_l2a_matched,
    bands=["blue", "green", "red", "scl"],
    crs="EPSG:32759",
    resolution=10,
    #chunks={'x': 1024, 'y': 1024, 'bands': -1, 'time': -1}, #{"time": 1},
    bbox=selected.bounds #aoi.total_bounds  # constrain to buffered coastline
)
ds_l2a

Search STAC for derived Sea Surface Temperature data

Sea surface temperature is one of the most important indicators for coral reef health because corals are highly sensitive to even small changes in temperature. Corals have a narrow thermal tolerance, and even small deviations above normal can trigger physiological stress. Prolonged heat stress—typically when SST exceeds the long-term maximum summer average by 1 °C or more for several weeks—can cause mass coral bleaching, a process in which corals expel their symbiotic algae, leading to reduced growth, increased susceptibility to disease, and potentially mortality (NOAA Coral Reef Watch, n.d.; Hoegh-Guldberg, 1999). Elevated SST is widely recognized as the primary driver of recent global bleaching events, fundamentally altering reef community structure (Hughes et al., 2018). Consequently, consistent SST monitoring enables both early warning and targeted management responses to protect vulnerable reef systems (Maynard et al., 2008). In short, SST acts like a “fever thermometer” for reefs — it tells you when the system is overheating, can predict stress events, and provides a long-term health indicator.
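The “long-term maximum plus 1 °C for several weeks” rule is formalized in NOAA Coral Reef Watch’s Degree Heating Weeks (DHW) metric. A minimal sketch, using a synthetic daily SST series and a hypothetical maximum monthly mean (MMM) of 28 °C:

```python
import numpy as np

# Hypothetical daily SST series (deg C): 60 normal days, then 84 days at +1.5 C
mmm = 28.0  # assumed climatological maximum monthly mean for the site
sst_daily = np.concatenate([np.full(60, 27.5), np.full(84, 29.5)])

# HotSpot: positive anomaly above the MMM baseline
hotspot = np.clip(sst_daily - mmm, 0.0, None)

# Degree Heating Weeks: accumulate HotSpots of at least 1 C over the trailing
# 84 days (12 weeks), converting daily values to deg C-weeks by dividing by 7
recent = hotspot[-84:]
dhw = recent[recent >= 1.0].sum() / 7.0
print(dhw)  # 84 days x 1.5 C / 7 = 18.0 deg C-weeks
```

NOAA Coral Reef Watch treats roughly 4 °C-weeks as the threshold where significant bleaching becomes likely, and 8 °C-weeks as a risk of widespread bleaching and mortality.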


We’ll start by obtaining the dates we need SST data for. We’ll get those from the L1C time series.

dates_l1c = pd.to_datetime(ds_l1c.time.values).normalize()  # strip time-of-day to midnight
# Remove duplicates and sort
unique_dates_l1c = dates_l1c.drop_duplicates().sort_values()

# Convert to datetime strings for STAC search
dates_l1c = [d.strftime('%Y-%m-%d') for d in unique_dates_l1c]
dates_l1c

We also need a center lat/lon for our area of interest to filter for intersecting scenes.

print(selected.centroid)

With that, we’ll search for an intersecting SST image for each timestamp.

catalog = Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

sst_items = []
seen_days = set()

for d in dates_l1c:
    search_str = f"{d}T00:00:00Z/{d}T23:59:59Z"
    print(f"Searching for: {search_str}")
    search = catalog.search(
        collections=["sentinel-3-slstr-wst-l2-netcdf"],
        intersects={"type": "Point", "coordinates": [selected.centroid.x, selected.centroid.y]},
        datetime=search_str
    )
    items = list(search.items())
    print(f"Found {len(items)} items")
    for item in items:
        day = item.datetime.date()  # get the date part only
        print(item)
        if day not in seen_days:
            sst_items.append(item)
            seen_days.add(day)
            break  # stop after first item for that day

We should get 5 SST items back, one for each unique date in our L1C time series.

len(sst_items)

Check out the metadata for the first SST item.

sst_items[0]

Mask land and clouds in the Sentinel-2 data

Now we’ll come back to the Sentinel-2 data to mask out land and clouds. The coastal buffer minus the original island polygon (the offshore remainder) is what we will use to designate the valid region.

ds_l1c.rio.crs

The WKT above describes the EPSG:32759 coordinate reference system, which is:

WGS 84 / UTM zone 59S

  • Datum: WGS84

  • Projection: Transverse Mercator

  • Zone: 59 south (central meridian at 171°E)

  • Units: meters (Easting/Northing)

  • False easting: 500,000 m

  • False northing: 10,000,000 m (because it’s in the southern hemisphere)

Our Sentinel-2 data is in EPSG:32759, which means the x and y coordinates are in meters east/north from the zone’s origin.
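To see these conventions in action, here is a small sketch with pyproj (the lon/lat values are illustrative, chosen near the AOI centroid printed earlier):

```python
from pyproj import Transformer

# Forward-project an illustrative lon/lat near the AOI centroid into UTM 59S
to_utm = Transformer.from_crs("EPSG:4326", "EPSG:32759", always_xy=True)
easting, northing = to_utm.transform(169.82, -20.19)

# West of the 171 degree E central meridian, so easting < 500,000 m (false easting);
# southern hemisphere, so northing counts down from the 10,000,000 m false northing
print(f"{easting:.0f} m E, {northing:.0f} m N")
```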

# Reproject the coastal buffer to match the dataset CRS
buffer_gdf = gpd.GeoDataFrame(geometry=[selected], crs="EPSG:4326")
buffer_proj = buffer_gdf.to_crs(ds_l1c.rio.crs)
# Create binary mask: True = outside buffer, False = inside
mask_l1c = geometry_mask(
    geometries=[mapping(buffer_proj.iloc[0].geometry)],
    transform=ds_l1c.odc.transform,
    out_shape=(ds_l1c.sizes["y"], ds_l1c.sizes["x"]),
    invert=True  # We want True = inside the buffer
)

mask_l2a = geometry_mask(
    geometries=[mapping(buffer_proj.iloc[0].geometry)],
    transform=ds_l2a.odc.transform,
    out_shape=(ds_l2a.sizes["y"], ds_l2a.sizes["x"]),
    invert=True  # We want True = inside the buffer
)
# Convert masks to DataArrays
mask_xr_l1c = xr.DataArray(
    mask_l1c,
    dims=("y", "x"),
    coords={"y": ds_l1c.y, "x": ds_l1c.x}
)

mask_xr_l2a = xr.DataArray(
    mask_l2a,
    dims=("y", "x"),
    coords={"y": ds_l2a.y, "x": ds_l2a.x}
)
# mask the L1C and L2A data
ds_masked_l1c = ds_l1c.where(mask_xr_l1c)
ds_masked_l2a = ds_l2a.where(mask_xr_l2a)

Now we will mask clouds. The Sentinel-2 L2A comes with an extra “band” called SCL which stands for Scene Classification Layer. The values in this band denote various high level land cover types, such as water and vegetation, but more importantly, there are also classes for cloud types. We will select those to use for masking away clouds.
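For reference, the SCL codes follow the Sen2Cor class convention; the toy array below sketches the masking step (classes 8–10 are the cloud classes used next; class 3, cloud shadows, could reasonably be masked as well):

```python
import numpy as np

# Sen2Cor Scene Classification Layer (SCL) class codes
SCL_CLASSES = {
    0: "No data", 1: "Saturated or defective", 2: "Dark area pixels",
    3: "Cloud shadows", 4: "Vegetation", 5: "Not vegetated", 6: "Water",
    7: "Unclassified", 8: "Cloud medium probability",
    9: "Cloud high probability", 10: "Thin cirrus", 11: "Snow or ice",
}

# Toy SCL array: keep everything that is not a cloud class
scl = np.array([[6, 8, 4],
                [10, 6, 9]])
valid = ~np.isin(scl, [8, 9, 10])
print(valid)  # only the water/vegetation pixels remain True
```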

# Select cloud classes based on SCL band
cloud_classes = [8, 9, 10]

# Create a mask: True = valid (non-cloud), False = cloud
valid_mask = ~ds_l2a["scl"].isin(cloud_classes)

Mask the L1C data, as that is what we will use for downstream processing.

ds_masked = ds_masked_l1c[["blue", "green", "red", "nir08", "rededge1", "swir16"]].where(valid_mask)
ds_masked
# Choose the RGB bands and scale reflectance for display
ds_rgb = ds_masked_l2a[["red", "green", "blue"]] #.isel(time=0)
# Auto-determine a display scale from the 99th-percentile pixel brightness
# (this keeps extreme bright outliers like clouds or glint from dominating the stretch)
scale = ds_rgb.to_array().quantile(0.99).item()
ds_rgb = (ds_rgb / scale).clip(0, 1)  # Normalize values to 0-1 for display

# Stack and plot
rgb = xr.concat([ds_rgb["red"], ds_rgb["green"], ds_rgb["blue"]], dim="band")
rgb = rgb.assign_coords(band=["red", "green", "blue"])

rgb.hvplot.rgb(x="x", y="y", bands="band", width=600, height=600)

Top of Atmosphere (TOA) vs. Surface Reflectance & Atmospheric Correction

TOA reflectance refers to the reflectance values measured by a satellite sensor before any correction for atmospheric effects. It represents the radiance reaching the sensor at the top of the atmosphere, and includes contributions from:

  • Direct solar reflectance from the surface,

  • Scattering by atmospheric molecules (Rayleigh scattering),

  • Scattering by aerosols and haze,

  • Reflection from clouds and surrounding areas.

Sentinel-2 Level-1C products are provided in TOA reflectance.

Surface Reflectance is the fraction of incoming solar radiation reflected by the Earth’s surface, as it would appear if the atmosphere were not present. It represents the true reflectance of land, water, or vegetation and is more reliable for quantitative analysis.

Sentinel-2 Level-2A products are atmospherically corrected to surface reflectance.

Sentinel-2 Level-2A data is produced using the Sen2Cor processor, which performs atmospheric correction to derive surface reflectance from top-of-atmosphere (TOA) reflectance. It was designed primarily for land surfaces — not water. We’ll come back to this in a moment.

Atmospheric correction is the process of removing atmospheric effects from TOA reflectance to estimate surface reflectance. It compensates for:

  • Scattering (blue light by molecules, haze),

  • Absorption (by gases like water vapor, ozone, CO₂),

  • Aerosols (dust, smoke, sea salt).

There are two main types:

  • Empirical methods: e.g., Dark Object Subtraction (DOS), based on scene content.

  • Physics-based methods: e.g., Sen2Cor, 6S, or ACOLITE, which use radiative transfer models and ancillary data (e.g., aerosol optical thickness).

Water bodies have unique optical properties that make atmospheric correction far more complex:

  1. Low Reflectance (Dark Target Problem)

  • Water reflects very little sunlight, especially in the visible and near-infrared.

  • This makes it hard to distinguish water signal from atmospheric scattering, leading to unstable or noisy surface reflectance values.

  • Overcorrected water pixels may even appear as negative reflectance, which is physically meaningless.

  2. High Sensitivity to Atmospheric Effects

  • Even small amounts of aerosols, thin clouds, or Rayleigh scattering can dominate the signal received from water.

  • Sen2Cor is tuned for land reflectance properties, and may overcorrect these subtle signals in aquatic environments.

  3. Incorrect Assumptions

  • Sen2Cor assumes a Lambertian (diffuse) surface, which is true for land but not for water, which reflects sunlight specularly (like a mirror).

  • It also uses land-based visibility and elevation models that don’t apply well to open water.

  4. No Water-Specific Tuning

  • Sen2Cor doesn’t use water-leaving radiance models or bidirectional reflectance functions (BRDF) specific to aquatic systems.

  • It lacks aerosol correction schemes optimized for open water, unlike ocean color processors (e.g., ACOLITE or C2RCC).

What To Use Instead Over Water?

For more reliable atmospheric correction over water, use:

  • Dark Object Subtraction (DOS) – A simple, empirical method suitable for ocean scenes with few clouds and minimal aerosols.

  • ACOLITE – Optimized for coastal and inland waters.

Today, we’ll use DOS as it doesn’t require us to download any external software. It isn’t the most accurate method available, as we’ll explain next, but it is a lightweight and flexible approach that does well enough for this task.

Dark Object Subtraction

Dark Object Subtraction (DOS) is a simple and widely used atmospheric correction method that estimates and removes the effects of atmospheric scattering in satellite imagery. It is particularly useful for correcting Sentinel-2 Level-1C (Top-of-Atmosphere reflectance) imagery in oceanic or coastal environments.

The core idea behind DOS is that certain “dark objects” in the scene—such as deep, clear water or dense vegetation—should theoretically have near-zero reflectance in some bands (especially the blue and shortwave bands). Any non-zero signal observed in these dark areas is attributed to path radiance (light scattered by the atmosphere).

How It Works

  1. Identify dark pixels: Find the minimum (or low-percentile, e.g., 1%) reflectance values over dark surfaces, usually in the blue band (Band 2) or coastal aerosol band (Band 1).

  2. Estimate path radiance: Assume this value represents atmospheric scattering (haze, Rayleigh scattering).

  3. Subtract offset: Subtract this value from all pixels in the scene for each band to estimate surface reflectance:

    ρ_surface = ρ_TOA - ρ_min
  4. Clip negatives: Any resulting negative reflectance values are set to zero.
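Before wiring this into xarray below, the four steps can be sketched on a synthetic NumPy band (the 0.02 haze offset and the water reflectance values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic TOA band: dark water (~0.03 reflectance) plus a uniform 0.02 haze offset
toa = 0.02 + np.abs(rng.normal(0.03, 0.01, size=(100, 100)))

# Steps 1-2: a low percentile of the scene stands in for the path-radiance offset
dark_val = np.percentile(toa, 1)

# Steps 3-4: subtract the offset everywhere, then clip negatives to zero
surface = np.clip(toa - dark_val, 0.0, None)

print(f"dark value: {dark_val:.4f}, surface min: {surface.min():.4f}")
```

Because the estimated dark value is subtracted scene-wide, the ~1% of pixels darker than it are clipped to exactly zero, which is the expected behavior of DOS.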

Application Over the Ocean

  • Clear deep water serves as an ideal dark object.

  • Works best in clear-sky ocean conditions where water reflectance is minimal.

  • Helps reduce haze and Rayleigh scattering impacts in shorter wavelengths (e.g., blue, green).

  • Less effective in turbid or very shallow waters where bottom reflectance or suspended particles are present.

  • Simple, fast, and does not require external atmospheric data.

  • Useful for preprocessing imagery in remote areas with limited ancillary data.

Limitations

  • Assumes perfect dark targets exist in the scene.

  • May overcorrect in bright or turbid water conditions.

  • Ignores adjacency effects and variable atmospheric thickness.

  • Does not correct for absorption effects (e.g., water vapor, aerosols).

def dark_object_subtraction(band, percentile=1, max_subtract=0.05):
    """
    Apply DOS safely to float32 reflectance images.
    Caps subtraction to prevent overcorrection.
    Uses a chosen dark percentile for estimating the "dark object" value (default: 1st percentile).
    Uses a maximum reflectance value to subtract (prevents subtracting too much and distorting the data).
    """
    # Sample spatially
    # Full satellite images can be huge. 
    # Instead of scanning every pixel, we sample every 10th pixel in both x and y directions to speed up calculations.
    sample = band.isel(x=slice(None, None, 10), y=slice(None, None, 10))
    
    # Compute dark object reflectance
    # Finds the percentile reflectance in the sampled pixels.
    # This value is the estimated "dark object" (e.g., deep shadows or water) that should have near-zero reflectance.
    dark_val = sample.quantile(percentile / 100.0, skipna=True).compute()
    # min() ensures we don’t subtract more than max_subtract — this prevents overcorrection in very hazy scenes.
    # max() ensures we don’t subtract a negative number (no brightening).
    dark_val = min(max(dark_val.item(), 0.0), max_subtract)
    print("Dark value:", dark_val)

    # Apply DOS (clip only min to avoid upward clipping)
    # Subtracts the estimated haze/dark offset from every pixel.
    # .clip(min=0) ensures no negative reflectance values.
    return (band - dark_val).clip(min=0)


def correct_dataset(ds, bands=["blue", "green", "red", "nir08", "rededge1", "swir16"]):
    corrected = {}
    for b in bands:
        corrected[b] = dark_object_subtraction(ds[b])
    return xr.Dataset(corrected, coords=ds.coords)
# Sentinel-2 L1C data must be scaled from 0–10000 to 0–1 reflectance before applying dark object subtraction
ds_scaled = ds_masked / 10000.0
# Correct each timestep and concatenate
corrected_list = [
    correct_dataset(ds_scaled.isel(time=t), bands=["blue", "green", "red", "nir08", "rededge1", "swir16"])
    for t in range(ds_scaled.sizes['time'])
]

xarray tries to infer the time coordinate values from the datasets being concatenated. But since we created each slice by selecting individual timesteps (isel(time=t)), the slices can lose the original time coordinate values or have them replaced by default integer indexes (0, 1, 2, ...). The concatenated dataset may therefore end up with a generic integer index for time instead of the original timestamps, so we reassign the time coordinate after concatenating.

# Combine along time
ds_corrected = xr.concat(corrected_list, dim='time')
ds_corrected['time'] = ds_scaled['time']  # Reassign time
ds_corrected['time']

Visualize the L1C true color imagery before atmospheric correction

# Select and concatenate RGB bands
ds_rgb = ds_masked[["red", "green", "blue"]]
rgb = xr.concat([ds_rgb[b] for b in ["red", "green", "blue"]], dim="band")
rgb = rgb.assign_coords(band=["red", "green", "blue"])

# Sampled quantile scaling
scale = rgb.quantile(0.99, dim=("x", "y"), skipna=True).compute()
rgb_scaled = (rgb / scale).clip(0, 1)

# Plot
rgb_scaled.hvplot.rgb(x="x", y="y", bands="band", width=600, height=600)

Visualize the L1C true color imagery after atmospheric correction

# Select and concatenate RGB bands
ds_rgb = ds_corrected[["red", "green", "blue"]]
rgb = xr.concat([ds_rgb[b] for b in ["red", "green", "blue"]], dim="band")
rgb = rgb.assign_coords(band=["red", "green", "blue"])

# Sampled quantile scaling
scale = rgb.quantile(0.99, dim=("x", "y"), skipna=True).compute()
rgb_scaled = (rgb / scale).clip(0, 1)

# Plot
rgb_scaled.hvplot.rgb(x="x", y="y", bands="band", width=600, height=600)

Add the Sea Surface Temperature data to the corrected L1C imagery

# Check the length of the masked and atmospherically corrected L1C time series vs. the number of SST items
# Note the time series is one shorter: after masking, one timestamp was dropped, likely due to incomplete coverage
len(sst_items), len(ds_corrected.time.values)
# Original time series 
ds_l1c.time.values
# New time series
ds_corrected.time.values
# SST item corresponding to the dropped L1C timestamp
sst_items[2]
# Remove that SST item
sst_items_clean = [item for i, item in enumerate(sst_items) if i != 2]
sst_items_clean
# Instantiate a new (currently placeholder) band in the L1C dataset for the SST data
ds_corrected["sst"] = (("time", "y", "x"), np.full((len(ds_corrected.time), len(ds_corrected.y), len(ds_corrected.x)), np.nan))

Now we match the SST data to the L1C data by coordinates.

  • ds_corrected['x'] and ds_corrected['y'] are the Sentinel-2 easting/northing coordinate arrays (EPSG:32759).

  • meshgrid expands them into 2-D arrays so there is an (x, y) pair for every pixel in the Sentinel-2 scene.

  • The SST points are first reprojected from lat/lon (EPSG:4326) into the same UTM CRS, so the interpolation happens in projected space, not in raster grid space.

  • points=(sst_ds["x"].values, sst_ds["y"].values) defines the known SST measurement locations.

  • xi=(lon2d, lat2d) says: interpolate onto these exact pixel positions.

  • Each Sentinel-2 pixel gets the SST value of the nearest SST observation point (because method='nearest'), based purely on projected distance.

t_index = 0

for item in sst_items_clean:
    planetary_computer.sign_inplace(item)
    dataset = xr.open_dataset(fsspec.open(item.assets["l2p"].href).open())

    # Check projection info
    print(dataset.attrs)
    print(dataset.lon.attrs, dataset.lat.attrs)

    # SST DataFrame in WGS84
    data = (
        pd.DataFrame(
            {
                "longitude": dataset.lon.data.ravel(),
                "latitude": dataset.lat.data.ravel(),
                "sea_surface_temperature": dataset.sea_surface_temperature.load().data.ravel(),
            }
        )
        .dropna()
    )
    print(len(data))

    # Convert to GeoDataFrame in EPSG:4326
    gdf = gpd.GeoDataFrame(
        data,
        geometry=gpd.points_from_xy(data.longitude, data.latitude),
        crs="EPSG:4326"
    )

    # Reproject to EPSG:32759
    gdf = gdf.to_crs("EPSG:32759")

    # Create SST dataset in projected coords
    sst_ds = xr.Dataset(
        {
            "sst": ("points", gdf["sea_surface_temperature"].values)
        },
        coords={
            "x": ("points", gdf.geometry.x.values),
            "y": ("points", gdf.geometry.y.values)
        }
    )

    # Sentinel-2 pixel coordinate grids (already in EPSG:32759; despite the
    # names, lon2d/lat2d hold projected x/y values here)
    lon2d, lat2d = np.meshgrid(ds_corrected['x'].values, ds_corrected['y'].values)

    # Interpolate SST points to S2 grid
    sst_grid = griddata(
        points=(sst_ds["x"].values, sst_ds["y"].values),
        values=sst_ds["sst"].values,
        xi=(lon2d, lat2d),
        method='nearest'
    )

    # Add SST to Sentinel-2 dataset
    ds_corrected["sst"][t_index, :, :] = sst_grid
    t_index += 1

Check the value range for SST in each timestamp (the SLSTR L2P sea_surface_temperature values are in kelvin).

for i in range(len(sst_items_clean)):
    print(ds_corrected["sst"].isel(time=i).values.min(), ds_corrected["sst"].isel(time=i).values.max()) 

Compute Reef Health Indicators

The Blue-Green Index (BGI) is a spectral index used primarily in aquatic and coastal remote sensing to help discriminate benthic habitats (like coral reefs, seagrass, sand) and assess water properties (Bannari et al., 2022). It leverages the difference between the blue and green spectral bands.

Formula

BGI = Blue − Green
  • Calculated using surface reflectance values.

Interpretation of BGI values

Let D = Blue − Green (the index value):

  • Positive (D > 0): Blue reflectance > Green → more blue reflectance → indicates clear water, possibly healthy coral or bright sandy substrate.

  • Negative (D < 0): Green > Blue → often shallow water, vegetation, or sediment that boosts green. May suggest turbid water, algal blooms, or seagrass.

  • Zero (D = 0): equal blue/green reflectance.

How BGI helps with coral reefs

  • Differentiates benthic substrates: Coral, seagrass, algae, sand, and bare substrate often have distinct blue and green reflectance signatures, so BGI can help separate these classes.

  • Highlights shallow water features: Coral reefs usually occur in shallow, clear water where blue and green light penetrates well, making BGI sensitive to benthic composition.

  • Tracks changes over time: By analyzing BGI time series, you can detect changes in reef cover, algal blooms, sedimentation, or coral bleaching events affecting reflectance.

Limitations and considerations

  • Scale-sensitive: Sensitive to illumination, sensor gain, and shadows; not standardized across conditions. A difference of 0.1 at low reflectance (e.g., 0.15 vs. 0.05) is not the same “contrast” as 0.1 at high reflectance (0.80 vs. 0.70), but the raw difference D treats them identically.

  • Usually masked to water pixels (e.g., using land/water masks or cloud masks).

  • Water column effects: Water depth, turbidity, and dissolved materials affect blue and green light differently, potentially confounding BGI values.

  • Atmospheric correction needed: Accurate surface reflectance is essential for reliable BGI values, especially over water.

  • Not a direct health measure: BGI reflects substrate and water color but not coral health metrics like bleaching directly. It is an indirect indicator.

  • Supplement with other indices and data: Combine BGI with other indices (e.g., NDVI for algae, bathymetry data, or hyperspectral data) for better reef monitoring.

def compute_bgi(ds):
    return (ds["blue"] - ds["green"])
# Calculate BGI for every timestamp
bgi_list = [
    compute_bgi(ds_corrected.isel(time=t))
    for t in range(ds_corrected.sizes['time'])
]

# Combine along time
bgi = xr.concat(bgi_list, dim='time')
bgi['time'] = ds_corrected['time']  # Reassign the original timestamps
# Visualize corrected BGI
bgi.hvplot.image(x="x", y="y", cmap="viridis", width=600, height=400)
# Check output ranges
vals = bgi.values
vals_clean = vals[np.isfinite(vals)]

#print("Unique:", np.unique(vals_clean))
print("Min:", vals_clean.min())
print("Max:", vals_clean.max())
# Save to GeoTIFF
#bgi.rio.to_raster(f"bgi_{PROVINCE}_{DATERANGE_START}_{DATERANGE_END}.tif")

Because the Blue Green index has such limitations, we can normalize it to make it robust to lighting changes, atmospheric variation, and sensor inconsistencies — making it more reliable for time-series analysis or cross-scene comparisons. Importantly, coral bleaching often leads to whitening, reducing pigment (chlorophyll) and increasing reflectance in the visible spectrum (especially blue and green). Chlorophyll strongly absorbs blue (~450 nm) and red (~670 nm) light for photosynthesis, so healthy coral with chlorophyll content shows low reflectance in blue and red bands. When chlorophyll content decreases (e.g., due to stress or bleaching), the absorption in the blue and red wavelengths lessens, so more light is reflected in those bands. Green reflectance is usually high already because chlorophyll reflects green (~550 nm) more than blue or red, but with reduced chlorophyll, the overall structure and pigments may change, sometimes leading to subtle changes in green reflectance as well.

The normalized blue green index uses a normalized difference of the blue and green bands to detect coral bleaching, which manifests as a spectral signature change in shallow waters. This is a form of a Normalized Difference Index (like NDVI), designed to highlight spectral changes in coral reefs:

NBGI = (Blue − Green) / (Blue + Green)
  • Healthy coral reflects more in the green spectrum and absorbs more blue light.

  • Bleached coral loses pigmentation and structure, becoming brighter in blue, reducing the green–blue contrast.

Interpretation of NBGI values

  • Positive (0 < NBGI ≤ 1): Blue dominates relative to total signal → likely clearer/deeper water or substrates with strong blue reflectance (e.g., bleached coral).

  • Negative (−1 ≤ NBGI < 0): Green dominates → shallow water, benthic vegetation (seagrass, algae), or sediment.

  • Zero (NBGI = 0): equal contribution of blue and green.

  • Lower index values → likely healthy coral (stronger green reflectance relative to blue).

  • Higher index values → potential bleached coral (higher blue reflectance).

Advantages

  • Normalized: automatically adjusts for overall brightness—sun angle, water depth, sensor gain—so values are directly comparable across images and times.

  • Bounded: easy to interpret thresholds (e.g., NBGI < −0.2 for dense vegetation, NBGI > 0.2 for clear water).

  • Contrast-enhancing: emphasizes relative difference rather than absolute magnitude.

Considerations:

  • Only works in shallow, clear water — turbid water or shadows can affect accuracy.

  • Requires masking clouds, deep water, and land first.

  • You can couple it with red edge or NIR bands, or other indices like:

    • NDVI

    • Red-Edge Chlorophyll Index

    • Turbidity Index (Blue/SWIR)

def compute_nbgi(ds):
    return (ds["blue"] - ds["green"]) / (ds["blue"] + ds["green"])  # Simplified spectral ratio
# Calculate NBGI for every timestamp
nbgi_list = [
    compute_nbgi(ds_corrected.isel(time=t))
    for t in range(ds_corrected.sizes['time'])
]

# Combine along time
nbgi = xr.concat(nbgi_list, dim='time')
nbgi['time'] = ds_corrected['time']  # Reassign the original timestamps
# Check output ranges
vals = nbgi.values
vals_clean = vals[np.isfinite(vals)]

#print("Unique:", np.unique(vals_clean))
print("Min:", vals_clean.min())
print("Max:", vals_clean.max())
print("Proportion of 1.0 values:", (vals_clean == 1.0).sum() / len(vals_clean))
# Visualize the NBGI time series
nbgi.hvplot.image(x="x", y="y", cmap="viridis", width=600, height=400)
#nbgi.rio.to_raster(f"nbgi_{PROVINCE}_{DATERANGE_START}_{DATERANGE_END}.tif")

Ground truth data integration and segmentation

We will use a reef map provided by the Vanuatu Bureau of Statistics.

!gdown "https://drive.google.com/uc?id=1_QMwaxVw_IV3HJM4XGgU7mCYmRfWezK4"

Let’s also download a subset from the Allen Coral Atlas for the same area.

!gdown "https://drive.google.com/uc?id=1Yzl12h9N6bERnCU12kDg7s2OIr01WCwC"
#benthic_gdf = gpd.read_file('Vanuatu reefs IMARS.zip')
benthic_gdf = gpd.read_file('benthic.geojson')
benthic_gdf.columns
#benthic_gdf.REEF.unique()
benthic_gdf["class"].unique()
benthic_gdf["class_int"] = (benthic_gdf["class"] == "Coral/Algae").astype(int)
benthic_gdf.class_int.unique()

The reef map is a vector polygon dataset. We rasterize it to produce per-pixel class labels aligned with the MSI and SST grids.

width, height = nbgi.x.size, nbgi.y.size

benthic_gdf_rp = benthic_gdf.to_crs(epsg=nbgi.rio.crs.to_epsg())

# Define the resolution and bounds based on NBGI features
resolution = nbgi.rio.resolution()
bounds_test = nbgi.rio.bounds()

#unique_classes = benthic_gdf['REEF'].unique()

raster_bounds = box(*nbgi.rio.bounds())
benthic_gdf_select = benthic_gdf_rp[benthic_gdf_rp.intersects(raster_bounds)]

print(f"Before: {len(benthic_gdf_rp)} | After: {len(benthic_gdf_select)}")

# Rasterize the vector dataset to match the NBGI image
rasterized_labels_benthic = make_geocube(
    vector_data=benthic_gdf_select,
    #measurements=["REEF"], 
    measurements=["class_int"],
    like=nbgi,  # Align with the features dataset
)

print("rasterized_labels_benthic: ", rasterized_labels_benthic)

Live Coral vs. Algae/Sand

Live coral has a distinctive reflectance signature in the red-edge and NIR bands. In addition, a sudden rise in visible reflectance combined with a drop in red-edge/NIR reflectance may indicate bleaching.

Useful Bands:

  • Red Edge

  • Near Infrared

  • Short Wave Infrared (for benthic type separation)

Live coral is not just a bare rock or dead skeleton — it’s covered by zooxanthellae algae living in the coral tissue. These algae give coral its color and contribute to unique pigment absorption features:

  • Strong chlorophyll absorption in the blue and red bands.

  • Higher reflectance in the green (green gap between chlorophyll absorption peaks).

  • Sharp reflectance changes in the red-edge region.

  • Distinct absorption features in the SWIR due to water content in living tissue.

Dead coral, macroalgae, and sand each lack some of these features or have them in different magnitudes.

Red Edge (700–750 nm)

  • The red-edge region captures the abrupt transition between strong chlorophyll absorption in the red and high reflectance in the NIR.

  • Live coral’s pigment-rich tissue produces a less steep slope compared to terrestrial vegetation but still much steeper than bare sand.

  • Algae also show a red-edge shift, but spectral shape and slope differ — useful for discrimination.

  • Live coral → moderate red-edge slope, distinct curvature.

  • Macroalgae → steeper slope, more vegetation-like.

  • Sand → almost flat, no steep rise.
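To make the slope comparison concrete, here is a minimal sketch (the wavelengths are the approximate Sentinel-2 red and red edge 1 band centers; the reflectance values are invented for illustration):

```python
def red_edge_slope(red, rededge1, wl_red=665.0, wl_re1=704.0):
    """Approximate red-edge slope: reflectance change per nm between
    the red and red edge 1 bands."""
    return (rededge1 - red) / (wl_re1 - wl_red)

# Illustrative reflectances: macroalgae rise steeply, sand stays nearly flat
slopes = {
    "live_coral": red_edge_slope(0.04, 0.07),
    "macroalgae": red_edge_slope(0.03, 0.10),
    "sand": red_edge_slope(0.12, 0.125),
}
```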

Near Infrared (NIR, ~800–900 nm underwater)

  • In shallow water, NIR reflectance is very low because water absorbs NIR strongly.

  • However, differences in how little NIR is reflected can still be meaningful.

    • Live coral: reflects slightly more than sand (due to scattering from tissue and skeleton microstructure).

    • Sand: minimal NIR return, especially in deeper water.

    • Algae: slightly higher NIR than coral because of cell structure.

NIR underwater is weak, but the relative differences in shallow clear water still carry classification power.
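The depth dependence described above can be illustrated with a simple two-way Beer-Lambert attenuation model (a sketch; the Kd values below are invented, and real diffuse attenuation coefficients depend on water clarity):

```python
import math

def attenuated_reflectance(r_bottom, kd, depth_m):
    """Bottom reflectance seen through the water column decays roughly
    as exp(-2 * Kd * depth): light crosses the column twice."""
    return r_bottom * math.exp(-2.0 * kd * depth_m)

# NIR attenuates far faster than green for the same bottom reflectance
green_at_2m = attenuated_reflectance(0.15, kd=0.1, depth_m=2.0)  # mostly survives
nir_at_2m = attenuated_reflectance(0.15, kd=2.5, depth_m=2.0)    # nearly gone
```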

Short Wave Infrared (SWIR, >1000 nm)

  • SWIR is almost completely absorbed by water at even shallow depths, so it cannot be used to measure benthic reflectance directly in situ.

  • However:

    • In above-water remote sensing (like airborne or satellite with sunglint correction), SWIR is used for atmospheric and sunglint correction.

    • Good for very shallow / above-water cases to separate moisture-rich from dry substrates.

    • It can also help separate benthic types in very shallow or exposed conditions (intertidal coral flats) because:

      • Water content in coral tissue absorbs differently than in macroalgae.

      • Sand has distinct SWIR slopes.

Summary:

  • Red Edge + NIR: Distinguishes living photosynthetic tissue (live coral, algae) from non-living substrates (sand, dead coral). Helps separate algae and coral based on pigment density.

  • SWIR (if applicable): Helps reject false positives from wet sand or turbidity artifacts.

# Attach NBGI to the corrected dataset as a new variable
ds_corrected["nbgi"] = nbgi
features_stack = xr.concat([
    ds_corrected["rededge1"],
    #ds_corrected["rededge2"],
    #ds_corrected["rededge3"],
    ds_corrected["nir08"],
    ds_corrected["swir16"],
    #ds_corrected["swir22"],
    ds_corrected["sst"],
    ds_corrected["nbgi"]
], dim="band")

Flatten the features and newly rasterized labels for use with a random forest classifier. We will use the first timestamp as features for model training.

#features = features_stack.isel(time=0).stack(flattened_pixel=("y", "x")).fillna(0)
features = features_stack.isel(time=0).transpose("y", "x", "band").stack(flattened_pixel=("y", "x")).transpose("flattened_pixel", "band").fillna(-9999)
labels = rasterized_labels_benthic.stack(flattened_pixel=("y", "x")).fillna(0).astype(int)
labels = labels.to_array().squeeze()
features.data.shape
labels.data.shape

Data Splitting

Now that we have the arrays flattened, we can split the datasets into training and testing partitions. We will reserve 80 percent of the data for training, and 20 percent for testing.

X_train, X_test, y_train, y_test = train_test_split(
    features.data, labels, test_size=0.2, random_state=42, shuffle=True
)

Ensure all labels are in each partition.

np.unique(y_train), np.unique(y_test) 
len(X_train), len(y_train), len(X_test), len(y_test)

Now we will set up a small random forest classifier with 10 trees. We use a seed (random_state) to ensure reproducibility. Calling the .fit() method on the classifier will initiate training.

%%time
# Train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit(X_train, y_train)

Once the classifier is finished training, we can use it to make predictions on our test dataset.

# Test the classifier
y_pred = clf.predict(X_test)

It’s important to know how well our classifier performs relative to the true labels (y_test). For this, we can calculate the accuracy metric to measure agreement between the true and predicted labels.

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

We can also produce a classification report to check the precision, recall and F1 scores for each class.

print(classification_report(y_true=y_test, y_pred=y_pred))

We can also plot a confusion matrix to explore per-class performance.

ConfusionMatrixDisplay.from_predictions(
    y_true=y_test, y_pred=y_pred, normalize="true", values_format=".2f"
)

Save the model to file so that it can be loaded and reused again without needing to repeat training.

# save to file
joblib.dump(clf, f"rf_vanuatu_coral_{PROVINCE}_{DATERANGE_START}_{DATERANGE_END}.pkl")

If we want to plot a map of predicted reefs for the entire area of interest, we can generate predictions for the full province dataset at every timestamp, processing the pixels in batches to keep memory use bounded.

def predict_in_batches(model, X, batch_size=100000):
    results = []
    for i in range(0, X.shape[0], batch_size):
        batch = X[i:i+batch_size]
        pred = model.predict(batch)
        results.append(pred)
    return np.concatenate(results)
# Generate predictions for every timestamp
pred_list = [
    predict_in_batches(clf, features_stack.isel(time=t).transpose("y", "x", "band")
                .stack(flattened_pixel=("y", "x")).transpose("flattened_pixel", "band")
                .fillna(-9999).data, batch_size=100000).reshape((height, width))  # same fill value as training
    for t in range(ds_corrected.sizes['time'])
]

pred_list_xr = [xr.DataArray(data=predicted_map, coords=rasterized_labels_benthic.coords) for predicted_map
                in pred_list]

# Combine along time
predictions = xr.concat(pred_list_xr, dim='time')
predictions['time'] = features_stack['time']  # Reassign time
predictions.hvplot.image(height=600, rasterize=True, cmap="Set1")

Vectorize the predicted reef map and save it to a GeoJSON file.

# Convert the final timestep to int32 (rasterio.features.shapes expects a 2D integer array)
compatible_array = predictions.isel(time=-1).astype("int32")

# Extract polygons from the classified raster
polygons = list(
    rasterio.features.shapes(
        compatible_array.values,
        transform=compatible_array.rio.transform()
    )
)

# Convert to GeoDataFrame and filter for value == 1
prediction_gdf = gpd.GeoDataFrame(
    [{"geometry": shape(geom), "value": value} for geom, value in polygons if value == 1],
    crs="EPSG:32759"
)

# print unique values (should be [1])
print(prediction_gdf.value.unique())

# Save to GeoJSON
prediction_gdf.to_file(
    f"./predicted_coral_{PROVINCE}_{DATERANGE_START}_{DATERANGE_END}.geojson",
    driver="GeoJSON"
)

Time series analysis and change detection

# Simple difference between two time points
# Compare two specific time indices (e.g., first and last)
reef_change = predictions.isel(time=-1) - predictions.isel(time=0)
#reef_change.hvplot.image(height=600, rasterize=True, cmap="Set1")
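Beyond a pixelwise difference map, the binary predictions can be summarized as a reef-extent time series (a sketch assuming 0/1 labels and 10 m Sentinel-2 pixels; pixel_size_m is an assumed constant, not read from the raster metadata):

```python
import numpy as np

def reef_area_km2(pred_2d, pixel_size_m=10.0):
    """Total reef area of one binary (0/1) prediction map, in km^2."""
    return np.nansum(pred_2d) * (pixel_size_m ** 2) / 1e6

# Applied per timestep, e.g.:
# areas = [reef_area_km2(predictions.isel(time=t).values)
#          for t in range(predictions.sizes["time"])]
```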

Trend analysis

Now let’s compute the per-pixel linear trend over time from the time series prediction maps.

  • time_num is a numeric time axis (e.g., [0, 1, 2, ...]) corresponding to each time step in the data.

  • The inner function fit_linear_trend(y) takes a 1D time series (one pixel’s values across time) and:

    • Masks out NaNs with np.isfinite.

    • Fits a linear model (np.polyfit) only if there are at least 2 valid observations.

    • Returns the slope of the best-fit line, representing the rate of change over time.

    • Returns NaN if the fit fails or insufficient data exists.

  • xr.apply_ufunc applies fit_linear_trend to every pixel in the dataset using xarray’s vectorized and parallelized infrastructure, preserving spatial dimensions and returning a new DataArray of trend values.

This function is useful for detecting temporal changes (e.g., coral bleaching trends, coral reef recession or expansion) on a per-pixel basis.

def calc_trend(da):
    """Fit linear trend per pixel over time, robust to NaNs and low data."""

    time_num = np.arange(da.sizes['time'])

    def fit_linear_trend(y):
        y = np.array(y)
        valid_mask = np.isfinite(y)
        if valid_mask.sum() < 2:  # Not enough data points to fit
            return np.nan
        try:
            p = np.polyfit(time_num[valid_mask], y[valid_mask], deg=1)
            return p[0]  # slope
        except np.linalg.LinAlgError:
            return np.nan

    trend = xr.apply_ufunc(
        fit_linear_trend,
        da,
        input_core_dims=[['time']],
        vectorize=True,
        dask='parallelized',
        output_dtypes=[float]
    )
    return trend
reef_trend = calc_trend(predictions)
reef_trend.hvplot.image(height=600, rasterize=True, cmap="Set1")
# Select and concatenate RGB bands
ds_rgb = ds_corrected[["red", "green", "blue"]]
rgb = xr.concat([ds_rgb[b] for b in ["red", "green", "blue"]], dim="band")
rgb = rgb.assign_coords(band=["red", "green", "blue"])

# Sampled quantile scaling
scale = rgb.quantile(0.99, dim=("x", "y"), skipna=True).compute()
rgb_scaled = (rgb / scale).clip(0, 1)

# Plot
rgb_scaled.hvplot.rgb(x="x", y="y", bands="band", width=600, height=600)
sst_trend = calc_trend(ds_corrected["sst"])
sst_trend.hvplot.image(height=600, rasterize=True, cmap="Set1")
nbgi_trend = calc_trend(ds_corrected["nbgi"])
nbgi_trend.hvplot.image(height=600, rasterize=True, cmap="Set1")

Threshold the change to detect significant reef change

Track the loss of reflectance in the blue/green bands or an increase in overall brightness: bleached coral appears whiter and brighter due to the loss of its symbiotic algae.

significant_trend = xr.where(np.abs(nbgi_trend) > 0.01, 1, 0)
significant_trend.hvplot.image(height=600, rasterize=True, cmap="Set1")

References
  1. Hughes, T. P., Kerry, J. T., Álvarez-Noriega, M., Álvarez-Romero, J. G., Anderson, K. D., Baird, A. H., Babcock, R. C., Beger, M., Bellwood, D. R., Berkelmans, R., Bridge, T. C., Butler, I. R., Byrne, M., Cantin, N. E., Comeau, S., Connolly, S. R., Cumming, G. S., Dalton, S. J., Diaz-Pulido, G., … Wilson, S. K. (2017). Global warming and recurrent mass bleaching of corals. Nature, 543(7645), 373–377. 10.1038/nature21707
  2. Walker, S. J., Schlacher, T. A., & Thompson, L. M. C. (2008). Habitat modification in a dynamic environment: The influence of a small artificial groyne on macrofaunal assemblages of a sandy beach. Estuarine, Coastal and Shelf Science, 79(1), 24–34. 10.1016/j.ecss.2008.03.011