Data Science

UHI-Pipe

A satellite data extraction pipeline that works for any location on Earth.

You give it a bounding box, a time window, and a set of lat/lon coordinates. It pulls imagery from four satellite and geospatial sources, cleans and composites the data, computes spectral indices and land surface temperature, and hands you back a single ML-ready DataFrame. No API keys. No manual downloads. Works anywhere Sentinel-2 and Landsat-8 have coverage.

19 Spectral Indices

4 Data Sources

~44 Output Columns

How it works

Data Sources

Sentinel-2, Landsat-8, Copernicus DEM, Building Footprints via Microsoft Planetary Computer

Point Extraction

Download only pixels at your coordinates, apply cloud and shadow masking

Feature Engineering

Compute 19 spectral indices + LST with emissivity correction

Smart Caching

Parquet-based cache. Static sources extracted once, not per resolution

ML-Ready Output

One DataFrame, one row per point, ~44 columns, ready for classification

The pipeline connects to Microsoft Planetary Computer, a free STAC catalog that hosts petabytes of satellite data. It downloads only the pixels at your specific coordinates, applies cloud and quality masking, composites across your time window, and caches everything so re-runs are instant.

Sentinel-2 L2A

10–60m resolution

Sentinel-2 is the backbone of the pipeline. The pipeline reads 11 of the sensor's 13 spectral bands, ranging from visible light to shortwave infrared, giving us enough spectral resolution to distinguish vegetation from concrete, water from bare soil, and everything in between. The L2A product is already atmospherically corrected, so the reflectance values represent surface properties rather than atmospheric noise.

[Figure: False-color composite of Rio de Janeiro, Brazil, using bands B08 (NIR), B04 (Red), B03 (Green). Legend: Water, Dense Veg., Light Veg., Bare, Urban]

Step 1

Search STAC catalog for scenes within bounding box and time window (max 30% cloud cover)

Step 2

Apply SCL (Scene Classification Layer) mask: remove clouds (class 8, 9), shadows (3), snow (11), saturated (1), nodata (0)

Step 3

Resample all bands to target resolution using nearest-neighbor interpolation

Step 4

Compute median composite across all valid scenes in the time window

Step 5

Extract pixel values at each sample point using 5×5 neighborhood median (not just nearest pixel)

Step 6

Set raw DN = 0 to NaN before any computation
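Steps 2, 4, and 6 can be sketched for a single pixel observed across several scenes. This is a minimal illustration in plain Python, not the pipeline's implementation: the function name `composite_pixel` and the list-based inputs are assumptions, and only the SCL class numbers come from the steps above.

```python
import math
from statistics import median

# SCL classes masked in Step 2: nodata, saturated, shadow, cloud (8, 9), snow
MASKED_SCL = {0, 1, 3, 8, 9, 11}

def composite_pixel(dn_values, scl_values):
    """Median of the valid observations of one pixel across scenes.

    Returns (median_value, valid_count). The median is NaN when no scene
    survives masking. (Hypothetical helper, for illustration only.)
    """
    valid = [
        dn
        for dn, scl in zip(dn_values, scl_values)
        if scl not in MASKED_SCL and dn != 0  # DN == 0 is treated as missing
    ]
    if not valid:
        return math.nan, 0
    return median(valid), len(valid)

# One pixel in five scenes: scene 2 is cloudy, scene 4 is nodata
red_dn = [1200, 5400, 1180, 0, 1260]
scl = [4, 9, 5, 0, 4]
value, n_valid = composite_pixel(red_dn, scl)
```

Returning the valid count alongside the median makes it easy to spot points where the composite rests on very few observations.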

19 features extracted

NDVI, EVI, SAVI, GNDVI, LAI, NDWI, MNDWI, NDMI, NDBI, ISA, BU, NDISI, NDBaI, BSI, DBSI, LSE, Albedo, SWIR1_NIR, SWIR2_NIR

The 5×5 neighborhood median is important. Rather than extracting just the nearest pixel to each coordinate, the pipeline takes the median value in a 5×5 grid around the point. This reduces noise from GPS coordinate error and sub-pixel heterogeneity, at the cost of smoothing out features smaller than 5 pixels across.
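The effect is easy to see on a toy raster. The sketch below assumes the raster is a list of rows and that the window is clipped at the raster edge with NaN pixels skipped; both choices, and the name `neighborhood_median`, are assumptions rather than documented pipeline behavior.

```python
import math
from statistics import median

def neighborhood_median(grid, row, col, size=5):
    """Median of the valid values in a size x size window centred on (row, col).

    The window is clipped at the raster edge and NaN pixels are skipped.
    """
    half = size // 2
    values = []
    for r in range(max(0, row - half), min(len(grid), row + half + 1)):
        for c in range(max(0, col - half), min(len(grid[0]), col + half + 1)):
            v = grid[r][c]
            if not math.isnan(v):
                values.append(v)
    return median(values) if values else math.nan

# 7x7 raster of a uniform surface (0.30) with one bright outlier at the sample point
grid = [[0.30] * 7 for _ in range(7)]
grid[3][3] = 0.95

nearest = grid[3][3]                       # nearest-pixel extraction keeps the outlier
smoothed = neighborhood_median(grid, 3, 3)  # neighborhood median ignores it
```

Here the nearest pixel reports the outlier while the neighborhood median reports the surrounding surface, which is the trade-off described above.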

Spectral index library

The 19 features above come from these formulas. Each index is a mathematical combination of Sentinel-2 bands designed to isolate a specific surface property. Vegetation indices use the contrast between red absorption and near-infrared reflection. Built-up indices exploit how impervious surfaces reflect shortwave infrared. Water indices flip the vegetation logic.

NDVI

Normalized Difference Vegetation Index

\frac{B08 - B04}{B08 + B04}

B08 (NIR), B04 (Red)

EVI

Enhanced Vegetation Index

2.5 \cdot \frac{B08 - B04}{B08 + 6 \cdot B04 - 7.5 \cdot B02 + 1}

B08, B04, B02

SAVI

Soil-Adjusted Vegetation Index

\frac{(B08 - B04)(1 + L)}{B08 + B04 + L}

B08, B04; L is the soil-brightness correction factor (commonly 0.5)

GNDVI

Green NDVI

\frac{B08 - B03}{B08 + B03}

B08, B03 (Green)

LAI

Leaf Area Index

3.618 \cdot NDVI - 0.118

Derived from NDVI

NDWI

Normalized Difference Water Index

\frac{B03 - B08}{B03 + B08}

B03 (Green), B08 (NIR)

MNDWI

Modified Water Index

\frac{B03 - B11}{B03 + B11}

B03, B11 (SWIR)

NDMI

Normalized Difference Moisture Index

\frac{B08 - B11}{B08 + B11}

B08, B11

NDBI

Normalized Difference Built-up Index

\frac{B11 - B08}{B11 + B08}

B11 (SWIR), B08 (NIR)

ISA

Impervious Surface Area

\frac{B11 - B08}{B11 + B08} - \frac{B08 - B04}{B08 + B04}

B11, B08, B04

BU

Built-Up Index

NDBI - NDVI

Derived

NDISI

Normalized Difference Impervious Surface Index

\frac{B03 - (B04 + B08 + B11)/3}{B03 + (B04 + B08 + B11)/3}

B03, B04, B08, B11

NDBaI

Normalized Difference Bare Index

\frac{B11 - B12}{B11 + B12}

B11 (SWIR1), B12 (SWIR2)

BSI

Bare Soil Index

\frac{(B11 + B04) - (B08 + B02)}{(B11 + B04) + (B08 + B02)}

B11, B04, B08, B02

DBSI

Dry Bare Soil Index

\frac{B11 - B03}{B11 + B03} - NDVI

B11, B03, NDVI

LSE

Land Surface Emissivity

0.004 \cdot P_v + 0.986

Derived from NDVI, where P_v is the proportion of vegetation

Albedo

Broadband Surface Albedo

\begin{gathered}0.356 B02 + 0.130 B04 \\ + 0.373 B08 + 0.085 B11 \\ + 0.072 B12 - 0.0018\end{gathered}

B02, B04, B08, B11, B12

SWIR1_NIR

SWIR1-to-NIR Ratio

\frac{B11}{B08}

B11, B08

SWIR2_NIR

SWIR2-to-NIR Ratio

\frac{B12}{B08}

B12, B08
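Most of the formulas in this library are normalized differences of two bands, so a single helper covers them. The sketch below computes a handful of the indices for one pixel. It assumes band values are already scaled to reflectance in [0, 1], uses L = 0.5 for SAVI, and passes bands as a plain dict; none of these details are confirmed by the pipeline, and `sentinel_indices` is a hypothetical name.

```python
import math

def normalized_difference(a, b):
    """(a - b) / (a + b), or NaN when an input is missing or the sum is zero."""
    if math.isnan(a) or math.isnan(b) or a + b == 0:
        return math.nan
    return (a - b) / (a + b)

def sentinel_indices(bands, L=0.5):
    """Compute a subset of the indices from reflectance values (assumed 0..1)."""
    b02, b03, b04 = bands["B02"], bands["B03"], bands["B04"]
    b08, b11 = bands["B08"], bands["B11"]
    ndvi = normalized_difference(b08, b04)
    ndbi = normalized_difference(b11, b08)
    return {
        "NDVI": ndvi,
        "GNDVI": normalized_difference(b08, b03),
        "NDWI": normalized_difference(b03, b08),
        "MNDWI": normalized_difference(b03, b11),
        "NDMI": normalized_difference(b08, b11),
        "NDBI": ndbi,
        "BU": ndbi - ndvi,
        "SAVI": (b08 - b04) * (1 + L) / (b08 + b04 + L),
        "EVI": 2.5 * (b08 - b04) / (b08 + 6 * b04 - 7.5 * b02 + 1),
    }

# Illustrative reflectances, not measured values
vegetation = sentinel_indices(
    {"B02": 0.03, "B03": 0.05, "B04": 0.04, "B08": 0.40, "B11": 0.20}
)
urban = sentinel_indices(
    {"B02": 0.12, "B03": 0.14, "B04": 0.16, "B08": 0.20, "B11": 0.28}
)
```

With these inputs the vegetated pixel has a high NDVI and a negative NDBI, while the urban pixel shows the opposite pattern, which is the contrast the built-up indices rely on.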

Landsat-8 Collection 2

30m visible, 100m thermal

Landsat-8 matters here for one reason: thermal infrared. Sentinel-2 does not have a thermal band, so it cannot measure surface temperature directly. Landsat's Band 10 captures longwave infrared radiation emitted by the ground, which is how we derive land surface temperature. The visible and NIR bands also give us a second, independent NDVI measurement that feeds into the emissivity correction.

[Figure: Land surface temperature for Rio de Janeiro, Brazil, derived from Band 10 (TIRS) with mono-window correction. Legend: Cool, Moderate, Hot]

Step 1

Search for Collection 2 Level 2 scenes (max 50% cloud cover)

Step 2

Apply QA_PIXEL bitmask: remove cloud, cloud shadow, snow, fill

Step 3

Visible/NIR bands are surface reflectance (already atmospherically corrected)

Step 4

Thermal band (lwir11) is read as brightness temperature in Kelvin; the conversion to °C happens in the final LST step

Step 5

Median composite across time window

Step 6

Emissivity derived from Landsat NDVI using Sobrino et al. (2004) thresholds

Step 7

LST computed via mono-window correction using brightness temperature and emissivity
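The QA_PIXEL filter in Step 2 is a bitwise test. The sketch below uses the Collection 2 bit positions for fill (0), cloud (3), cloud shadow (4), and snow (5), and rejects only the four flags named in Step 2. Whether the pipeline also rejects dilated cloud or cirrus is not stated, so this sketch leaves those bits alone; `is_clear` is a hypothetical helper name.

```python
import math

# Bit positions in the Landsat Collection 2 QA_PIXEL band
FILL, CLOUD, CLOUD_SHADOW, SNOW = 0, 3, 4, 5

# Flags named in Step 2: cloud, cloud shadow, snow, fill
REJECT_MASK = (1 << FILL) | (1 << CLOUD) | (1 << CLOUD_SHADOW) | (1 << SNOW)

def is_clear(qa_pixel):
    """True when none of the rejected flag bits are set."""
    return (qa_pixel & REJECT_MASK) == 0

# Synthetic QA values, each with a single flag bit set (bit 6 = clear, bit 7 = water)
qa_values = [1 << 6, 1 << CLOUD, 1 << CLOUD_SHADOW, 1 << SNOW, 1 << FILL, 1 << 7]
thermal_k = [301.2, 265.0, 289.4, 270.1, 0.0, 295.8]

# Replace rejected observations with NaN before compositing
masked = [t if is_clear(q) else math.nan for t, q in zip(thermal_k, qa_values)]
```

Only the first and last observations survive here; the others become NaN and drop out of the median composite.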

3 features extracted

NDVI_Landsat, Emissivity, LST (°C)

Mono-Window Correction

T_s = \frac{BT}{1 + \left(\frac{\lambda \cdot BT}{\rho}\right) \ln(\varepsilon)} - 273.15

Where T_s is land surface temperature in °C, BT is brightness temperature in Kelvin from Landsat Band 10, λ = 10.895 μm, ρ = 14388 μm·K, and ε is emissivity derived from NDVI.

Without emissivity correction, the thermal band gives you brightness temperature, what the sensor sees at the top of the atmosphere. The mono-window method adjusts for how efficiently different surfaces emit radiation. Bare soil (ε ≈ 0.97) emits differently than full vegetation (ε ≈ 0.99). The pipeline derives emissivity from Landsat NDVI using Sobrino et al. (2004) thresholds, then corrects the brightness temperature to get actual ground temperature in °C.
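A worked example shows the size of the correction. The sketch below applies the mono-window formula with the λ and ρ values given above, and a three-class emissivity using the soil and vegetation values quoted in the text plus the LSE formula from the index library. The NDVI cut-offs of 0.2 and 0.5 and the definition of P_v are assumptions added for illustration; the source only says the thresholds follow Sobrino et al. (2004).

```python
import math

WAVELENGTH_UM = 10.895  # λ for Band 10, from the text
RHO_UM_K = 14388.0      # ρ, from the text

def emissivity_from_ndvi(ndvi, ndvi_soil=0.2, ndvi_veg=0.5):
    """Three-class emissivity. The 0.2 / 0.5 NDVI cut-offs are assumed values."""
    if ndvi < ndvi_soil:
        return 0.97  # bare soil
    if ndvi > ndvi_veg:
        return 0.99  # full vegetation
    # Mixed pixel: proportion of vegetation, then the LSE formula
    p_v = ((ndvi - ndvi_soil) / (ndvi_veg - ndvi_soil)) ** 2
    return 0.004 * p_v + 0.986

def land_surface_temperature_c(bt_kelvin, emissivity):
    """Mono-window correction: brightness temperature (K) to LST (°C)."""
    correction = 1 + (WAVELENGTH_UM * bt_kelvin / RHO_UM_K) * math.log(emissivity)
    return bt_kelvin / correction - 273.15

bt = 300.0  # brightness temperature in Kelvin (26.85 °C)
lst_soil = land_surface_temperature_c(bt, emissivity_from_ndvi(0.10))
lst_mixed = land_surface_temperature_c(bt, emissivity_from_ndvi(0.35))
lst_veg = land_surface_temperature_c(bt, emissivity_from_ndvi(0.70))
```

For the same brightness temperature, the bare-soil pixel ends up roughly 2 °C warmer than the uncorrected value and the vegetated pixel less than 1 °C warmer, because a lower emissivity implies a hotter surface for the same measured radiance.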

Copernicus GLO-30 DEM

30m static

Elevation matters because higher ground is cooler. Hilltop neighborhoods experience different heat dynamics than low-lying river valleys, even when the land cover is identical. The Copernicus GLO-30 DEM provides consistent global elevation data at 30-meter resolution. Unlike the satellite imagery sources, this is a static dataset. No date parameter, no cloud masking, no compositing. The pipeline downloads the tiles once, mosaics them, and extracts the elevation at each point.

[Figure: Digital elevation model of Rio de Janeiro, Brazil, from Copernicus GLO-30, in meters above sea level. Legend: Low, Mid, High]

Step 1

Query STAC catalog for GLO-30 DEM tiles covering the bounding box

Step 2

Mosaic tiles into a single elevation raster

Step 3

Reproject to WGS 84 and resample to target resolution

Step 4

Extract elevation at each sample point (nearest pixel, no neighborhood median needed)

Step 5

Cache result per city. DEM is static, so the cache key ignores date and resolution

1 feature extracted

elevation

Because the DEM is static, it gets special treatment in the caching layer. During a resolution sweep (7 resolutions across 3 cities), the pipeline extracts the DEM once per city, 3 extractions in total instead of 21. The cache key ignores resolution and date parameters for this source.
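One way to get this behavior is to build the cache file name from fewer fields for static sources. The sketch below is an assumption about how such a key could look, not the pipeline's actual naming scheme: the function `cache_key`, the file name pattern, the source labels, and the seven resolution values are all illustrative.

```python
from itertools import product

# Assumption: only the DEM is treated as static in this sketch
STATIC_SOURCES = {"dem"}

def cache_key(source, city, time_window, resolution):
    """Return a Parquet file name; static sources drop date and resolution."""
    if source in STATIC_SOURCES:
        return f"{city}_{source}.parquet"
    window = time_window.replace("/", "_")
    return f"{city}_{source}_{window}_{resolution}m.parquet"

cities = ["rio", "santiago", "freetown"]
resolutions = [10, 30, 50, 100, 250, 500, 1000]  # seven illustrative values
window = "2024-01-15/2024-02-15"

# Every (city, resolution) run asks for a key; duplicates collapse in the set
dem_files = {cache_key("dem", c, window, r) for c, r in product(cities, resolutions)}
s2_files = {cache_key("sentinel2", c, window, r) for c, r in product(cities, resolutions)}
```

The sweep requests 21 keys per source, but the DEM resolves to only 3 distinct files while Sentinel-2 resolves to 21.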

Building Footprints

Vector, 100m buffer

Satellite imagery tells you what the surface looks like from above, but it cannot tell you how tall the buildings are or how densely packed they are. Building footprint data fills that gap. The pipeline accepts either a pre-computed CSV with density values or a raw shapefile of building polygons. When given a shapefile, it computes building density within a configurable buffer (default 100m) around each sample point.

[Figure: Building density map of Rio de Janeiro, Brazil, from 3D-GloBFP footprints, density per 100m buffer. Legend: Sparse, Dense]

Step 1

Load building polygons from shapefile or pre-computed CSV

Step 2

Build spatial index (R-tree) over all footprints for fast lookup

Step 3

For each sample point, buffer by 100m and count intersecting footprints

Step 4

Compute building density as footprint count per buffer area

Step 5

Merge density values into the main DataFrame, filling NaN where coverage is missing
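Steps 3 to 5 can be illustrated with a simplified version that swaps in two shortcuts: footprints are reduced to centroid points, and the lookup is a brute-force distance check rather than an R-tree query over polygons. Coordinates are assumed to be in a projected CRS with meter units, and `building_density` is a hypothetical name.

```python
import math

def building_density(points, footprint_centroids, buffer_m=100.0):
    """Footprints per square meter of buffer around each point.

    Simplification: counts footprint centroids inside the buffer instead of
    intersecting polygons. Returns NaN for every point when no building data
    is available (footprint_centroids is None).
    """
    if footprint_centroids is None:
        return [math.nan] * len(points)
    area = math.pi * buffer_m ** 2
    densities = []
    for px, py in points:
        count = sum(
            1
            for fx, fy in footprint_centroids
            if math.hypot(fx - px, fy - py) <= buffer_m
        )
        densities.append(count / area)
    return densities

# Two sample points and four footprint centroids, in projected meters
points = [(0.0, 0.0), (1000.0, 1000.0)]
centroids = [(10.0, 10.0), (50.0, -60.0), (99.0, 0.0), (150.0, 0.0)]

densities = building_density(points, centroids)
no_coverage = building_density(points, None)
```

The first point has three centroids within 100 m and the second has none; a spatial index would give the same answer faster on city-scale footprint sets.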

Features extracted

building_density_100m, building_compactness, height_stats, sky_view_factor

Building data is the most uneven source in terms of global availability. 3D-GloBFP has good coverage for major cities but gaps elsewhere. When building data is unavailable, the pipeline runs without it. The corresponding columns return NaN, and the rest of the DataFrame is unaffected. This makes the building source optional rather than a hard dependency.

Usage

All four sources converge into one function call, but each can also run independently. The example below runs the Sentinel-2 source on its own.

from uhi_pipe import extract_sentinel
from uhi_pipe.features import sentinel_features

# my_points holds your sample coordinates (lat/lon) and is defined elsewhere

# Extract raw bands at your coordinates
raw = extract_sentinel(
    points=my_points,
    bbox=(-33.88, -71.05, -33.23, -70.04),
    time_window="2024-01-15/2024-02-15",
    resolution=100,
    max_cloud=30,
)
# → Latitude, Longitude, B01-B12 (raw DN values)

# Compute all 19 spectral indices
features = sentinel_features(raw)
# → Adds NDVI, EVI, SAVI, NDBI, Albedo, ... (19 columns)

Pre-configured locations

Three cities ship as built-in configurations for quick testing. These are the cities from the UHI-Explorer classification project, but the pipeline works with any bounding box and coordinates.

Rio de Janeiro

Tropical

Sample points: 28,488

Date range: Jan - Mar 2023

Bbox: -23.02, -43.52

Santiago

Mediterranean

Sample points: 21,662

Date range: Jan - Feb 2024

Bbox: -33.88, -71.05

Freetown

Tropical Coastal

Sample points: 14,105

Date range: Jan - Feb 2023

Bbox: 8.36, -13.32

Resolution sweep

The pipeline supports resolutions from 10m to 1000m, but the fused dataset has an effective floor of 30m. Landsat thermal is natively 100m (resampled to 30m), and the DEM is 30m. Sentinel-2 bands go down to 10m, but combining them with coarser sources introduces spatial mismatch at fine scales.


Assumptions and limitations

Minimum resolution

30m, limited by Landsat thermal band and DEM. Sentinel-2 bands go down to 10m, but fusing with Landsat at finer scales introduces spatial mismatch.

Temporal compositing

Median composite assumes surface conditions are stable across the time window. Works well for 1-2 month windows, breaks down for longer periods or rapid land-use change.

5×5 neighborhood median

Extracts the median pixel value in a 5×5 grid around each point, not just the nearest pixel. Reduces noise from GPS coordinate error and sub-pixel heterogeneity, but smooths out fine-grained features.

Cloud masking coverage

Persistently cloudy regions (tropics in wet season) may have few or no valid pixels after masking. The pipeline reports valid pixel counts so you know when data is thin.

Building data availability

3D-GloBFP coverage is incomplete globally. For cities without building footprints, the pipeline runs without the buildings source and returns NaN for building features.

Emissivity simplification

Mono-window LST uses NDVI-derived emissivity with three thresholds (bare soil, mixed, full vegetation). Real emissivity varies more than this, especially over water and rock.

Technologies

Python Package, PyPI, Sentinel-2, Geospatial ML
