Data Science

UHI-Pipe

A satellite data extraction pipeline that works for any location on Earth.

You give it a bounding box, a time window, and a set of lat/lon coordinates. It pulls imagery from four satellite and geospatial sources, cleans and composites the data, computes spectral indices and land surface temperature, and hands you back a single ML-ready DataFrame. No API keys. No manual downloads. Works anywhere Sentinel-2 and Landsat-8 have coverage.

19 spectral indices · 4 data sources · ~44 output columns

How it works

1. Data Sources: Sentinel-2, Landsat-8, Copernicus DEM, Building Footprints via Microsoft Planetary Computer
2. Point Extraction: Download only pixels at your coordinates, apply cloud and shadow masking
3. Feature Engineering: Compute 19 spectral indices + LST with emissivity correction
4. Smart Caching: Parquet-based cache. Static sources extracted once, not per resolution
5. ML-Ready Output: One DataFrame, one row per point, ~44 columns, ready for classification

The pipeline connects to Microsoft Planetary Computer, a free STAC catalog that hosts petabytes of satellite data. It downloads only the pixels at your specific coordinates, applies cloud and quality masking, composites across your time window, and caches everything so re-runs are instant.
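As a rough illustration of what such a catalog query looks like, the sketch below only builds the search parameters and makes no network call. The helper name `build_search` is hypothetical and is not part of uhi_pipe. The collection IDs are the ones Planetary Computer publishes for these datasets. STAC expects bounding boxes as (min lon, min lat, max lon, max lat), so the Santiago box from the usage example is reordered here on the assumption that the pipeline lists latitude first.

```python
# Collection IDs as published in the Planetary Computer STAC catalog.
COLLECTIONS = {
    "sentinel2": "sentinel-2-l2a",
    "landsat8": "landsat-c2-l2",
    "dem": "cop-dem-glo-30",
}


def build_search(source, bbox_lonlat, time_window=None, max_cloud=None):
    """Return keyword arguments for a STAC item search (hypothetical helper).

    bbox_lonlat is (min_lon, min_lat, max_lon, max_lat), the order STAC expects.
    """
    params = {"collections": [COLLECTIONS[source]], "bbox": list(bbox_lonlat)}
    if time_window is not None:
        # STAC intervals are "start/end" strings.
        params["datetime"] = time_window
    if max_cloud is not None:
        # Scene-level cloud filter; using "lt" rather than "lte" is an assumption.
        params["query"] = {"eo:cloud_cover": {"lt": max_cloud}}
    return params


santiago = (-71.05, -33.88, -70.04, -33.23)
s2_params = build_search("sentinel2", santiago, "2024-01-15/2024-02-15", max_cloud=30)
dem_params = build_search("dem", santiago)  # static source: no date, no cloud filter
```

With `pystac_client` and `planetary_computer` installed, these parameters could be passed to `Client.open("https://planetarycomputer.microsoft.com/api/stac/v1", modifier=planetary_computer.sign_inplace).search(**s2_params)`.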

Sentinel-2 L2A

10–60m resolution

Sentinel-2 is the backbone of the pipeline. It captures 11 spectral bands ranging from visible light to shortwave infrared, giving us enough spectral resolution to distinguish vegetation from concrete, water from bare soil, and everything in between. The L2A product is already atmospherically corrected, so the reflectance values represent surface properties rather than atmospheric noise.

[Figure: False-color composite of Rio de Janeiro, Brazil, using bands B08 (NIR), B04 (Red), B03 (Green). Legend: Water, Dense Veg., Light Veg., Bare, Urban]

Step 1

Search STAC catalog for scenes within bounding box and time window (max 30% cloud cover)

Step 2

Apply SCL (Scene Classification Layer) mask: remove clouds (class 8, 9), shadows (3), snow (11), saturated (1), nodata (0)

Step 3

Resample all bands to target resolution using nearest-neighbor interpolation

Step 4

Compute median composite across all valid scenes in the time window

Step 5

Extract pixel values at each sample point using 5×5 neighborhood median (not just nearest pixel)

Step 6

Set raw DN = 0 to NaN before any computation
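The masking and compositing steps above can be sketched for a single pixel as follows. This is a minimal pure-Python illustration, not the pipeline's implementation: the function name and the `(dn, scl)` input layout are assumptions, and a real run would operate on whole raster arrays.

```python
from statistics import median

# SCL classes dropped in Step 2: nodata (0), saturated (1), cloud shadow (3),
# cloud medium/high probability (8, 9), snow (11).
BAD_SCL = {0, 1, 3, 8, 9, 11}


def composite_pixel(observations):
    """Median of the valid values for one pixel across all scenes.

    observations: list of (dn_value, scl_class) pairs, one per scene.
    Returns NaN when no scene survives masking.
    """
    valid = [
        dn for dn, scl in observations
        if scl not in BAD_SCL and dn != 0  # raw DN = 0 is treated as missing
    ]
    return float(median(valid)) if valid else float("nan")


# Five scenes over one pixel: one is cloudy, one is nodata.
scenes = [(1200, 4), (9000, 9), (1260, 5), (0, 0), (1180, 4)]
value = composite_pixel(scenes)  # median of 1200, 1260, 1180
```

Because the median ignores masked scenes entirely, a single bright cloud observation does not pull the composite upward the way a mean would.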

19 features extracted

NDVI, EVI, SAVI, GNDVI, LAI, NDWI, MNDWI, NDMI, NDBI, ISA, BU, NDISI, NDBaI, BSI, DBSI, LSE, Albedo, SWIR1_NIR, SWIR2_NIR

The 5×5 neighborhood median is important. Rather than extracting just the nearest pixel to each coordinate, the pipeline takes the median value in a 5×5 grid around the point. This reduces noise from GPS coordinate error and sub-pixel heterogeneity, at the cost of smoothing out features smaller than 5 pixels across.
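A minimal sketch of the neighborhood median, assuming the raster is a plain list of rows and the point has already been converted to a row/column index. Clipping the window at raster edges and skipping NaN pixels are assumptions about edge handling, not documented pipeline behavior.

```python
import math
from statistics import median


def neighborhood_median(raster, row, col, size=5):
    """Median of the size x size window centered on (row, col)."""
    half = size // 2
    values = []
    # Clip the window to the raster bounds so edge points still work.
    for r in range(max(0, row - half), min(len(raster), row + half + 1)):
        for c in range(max(0, col - half), min(len(raster[0]), col + half + 1)):
            v = raster[r][c]
            if not math.isnan(v):  # masked pixels do not vote
                values.append(v)
    return median(values) if values else float("nan")


# A uniform 7x7 patch with one outlier pixel exactly at the sample point.
grid = [[0.30] * 7 for _ in range(7)]
grid[3][3] = 0.90

nearest = grid[3][3]                        # the outlier itself
smoothed = neighborhood_median(grid, 3, 3)  # the surrounding 0.30 dominates
```

The nearest-pixel lookup returns the outlier, while the 5×5 median returns the value of the surrounding surface, which is the trade-off the paragraph above describes.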

Spectral index library

The 19 features above come from these formulas. Each index is a mathematical combination of Sentinel-2 bands designed to isolate a specific surface property. Vegetation indices use the contrast between red absorption and near-infrared reflection. Built-up indices exploit how impervious surfaces reflect shortwave infrared. Water indices flip the vegetation logic.

NDVI: Normalized Difference Vegetation Index
$\frac{B08 - B04}{B08 + B04}$
Bands: B08 (NIR), B04 (Red)

EVI: Enhanced Vegetation Index
$2.5 \cdot \frac{B08 - B04}{B08 + 6 \cdot B04 - 7.5 \cdot B02 + 1}$
Bands: B08, B04, B02

SAVI: Soil-Adjusted Vegetation Index
$\frac{(B08 - B04)(1 + L)}{B08 + B04 + L}$
Bands: B08, B04

GNDVI: Green NDVI
$\frac{B08 - B03}{B08 + B03}$
Bands: B08, B03 (Green)

LAI: Leaf Area Index
$3.618 \cdot NDVI - 0.118$
Derived from NDVI

NDWI: Normalized Difference Water Index
$\frac{B03 - B08}{B03 + B08}$
Bands: B03 (Green), B08 (NIR)

MNDWI: Modified Water Index
$\frac{B03 - B11}{B03 + B11}$
Bands: B03, B11 (SWIR)

NDMI: Normalized Difference Moisture Index
$\frac{B08 - B11}{B08 + B11}$
Bands: B08, B11

NDBI: Normalized Difference Built-up Index
$\frac{B11 - B08}{B11 + B08}$
Bands: B11 (SWIR), B08 (NIR)

ISA: Impervious Surface Area
$\frac{B11 - B08}{B11 + B08} - \frac{B08 - B04}{B08 + B04}$
Bands: B11, B08, B04

BU: Built-Up Index
$NDBI - NDVI$
Derived

NDISI: Normalized Difference Impervious Surface Index
$\frac{B03 - (B04 + B08 + B11)/3}{B03 + (B04 + B08 + B11)/3}$
Bands: B03, B04, B08, B11

NDBaI: Normalized Difference Bare Index
$\frac{B11 - B12}{B11 + B12}$
Bands: B11 (SWIR1), B12 (SWIR2)

BSI: Bare Soil Index
$\frac{(B11 + B04) - (B08 + B02)}{(B11 + B04) + (B08 + B02)}$
Bands: B11, B04, B08, B02

DBSI: Dry Bare Soil Index
$\frac{B11 - B03}{B11 + B03} - NDVI$
Bands: B11, B03, NDVI

LSE: Land Surface Emissivity
$0.004 \cdot P_v + 0.986$
Derived from NDVI

Albedo: Broadband Surface Albedo
$0.356 \cdot B02 + 0.130 \cdot B04 + 0.373 \cdot B08 + 0.085 \cdot B11 + 0.072 \cdot B12 - 0.018$
Bands: B02, B04, B08, B11, B12

SWIR1: SWIR-to-NIR Ratio
$\frac{B11}{B08}$
Bands: B11, B08

SWIR2: SWIR2-to-NIR Ratio
$\frac{B12}{B08}$
Bands: B12, B08
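To make the formulas concrete, here is a small sketch that evaluates a handful of them for one pixel. The function names and the dict-of-bands input are illustrative assumptions, not the `sentinel_features` implementation, and the reflectance values are made up to resemble dense vegetation. Returning NaN on a zero denominator is also an assumption.

```python
import math


def safe_ratio(num, den):
    """Divide, returning NaN instead of raising on a zero denominator."""
    return num / den if den != 0 else float("nan")


def spectral_indices(b):
    """Compute a subset of the index library from a dict of band values."""
    ndvi = safe_ratio(b["B08"] - b["B04"], b["B08"] + b["B04"])
    ndbi = safe_ratio(b["B11"] - b["B08"], b["B11"] + b["B08"])
    return {
        "NDVI": ndvi,
        "NDWI": safe_ratio(b["B03"] - b["B08"], b["B03"] + b["B08"]),
        "NDBI": ndbi,
        "BU": ndbi - ndvi,                # derived index
        "LAI": 3.618 * ndvi - 0.118,      # derived from NDVI
        "SWIR1_NIR": safe_ratio(b["B11"], b["B08"]),
    }


# Illustrative vegetation-like pixel: low red, high NIR, moderate SWIR.
veg = spectral_indices({"B03": 0.05, "B04": 0.04, "B08": 0.40, "B11": 0.20})
```

For this pixel NDVI is strongly positive while NDWI and NDBI are negative, which matches the description above: water and built-up indices flip or offset the vegetation signal.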

Landsat-8 Collection 2

30m visible, 100m thermal

Landsat-8 matters here for one reason: thermal infrared. Sentinel-2 does not have a thermal band, so it cannot measure surface temperature directly. Landsat's Band 10 captures longwave infrared radiation emitted by the ground, which is how we derive land surface temperature. The visible and NIR bands also give us a second, independent NDVI measurement that feeds into the emissivity correction.

[Figure: Land surface temperature of Rio de Janeiro, Brazil, derived from Band 10 (TIRS) with mono-window correction. Legend: Cool, Moderate, Hot]

Step 1

Search for Collection 2 Level 2 scenes (max 50% cloud cover)

Step 2

Apply QA_PIXEL bitmask: remove cloud, cloud shadow, snow, fill

Step 3

Visible/NIR bands are surface reflectance (already atmospherically corrected)

Step 4

Thermal band (lwir11) is brightness temperature in Kelvin, converted to °C

Step 5

Median composite across time window

Step 6

Emissivity derived from Landsat NDVI using Sobrino et al. (2004) thresholds

Step 7

LST computed via mono-window correction using brightness temperature and emissivity

3 features extracted

NDVI_Landsat, Emissivity, LST (°C)

Mono-Window Correction

$$T_s = \frac{BT}{1 + \left(\frac{\lambda \cdot BT}{\rho}\right) \ln(\varepsilon)} - 273.15$$

Where BT is brightness temperature from Landsat Band 10, λ = 10.895 μm, ρ = 14388 μm·K, and ε is emissivity derived from NDVI.

Without emissivity correction, the thermal band gives you only brightness temperature: what the sensor sees at the top of the atmosphere. The mono-window method adjusts for how efficiently different surfaces emit radiation; bare soil (ε ≈ 0.97) emits less efficiently than full vegetation (ε ≈ 0.99). The pipeline derives emissivity from Landsat NDVI using the Sobrino et al. (2004) thresholds, then corrects the brightness temperature to get ground temperature in °C.
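A worked sketch of the correction, using the constants given above. The NDVI cut-offs of 0.2 and 0.5 and the squared scaling for $P_v$ are the values commonly quoted for the NDVI thresholds method and are assumptions here, since the text does not state which numbers the pipeline uses. The function names are hypothetical.

```python
import math

WAVELENGTH_UM = 10.895  # λ, Landsat Band 10 effective wavelength (from the text)
RHO_UM_K = 14388.0      # ρ in μm·K (from the text)


def emissivity_from_ndvi(ndvi, ndvi_soil=0.2, ndvi_veg=0.5):
    """Three-class emissivity; threshold values are assumed, not confirmed."""
    if ndvi < ndvi_soil:
        return 0.97  # bare soil
    if ndvi > ndvi_veg:
        return 0.99  # full vegetation
    # Mixed pixel: proportion of vegetation, then the LSE formula from the index library.
    pv = ((ndvi - ndvi_soil) / (ndvi_veg - ndvi_soil)) ** 2
    return 0.004 * pv + 0.986


def lst_celsius(bt_kelvin, emissivity):
    """Mono-window correction: brightness temperature in K to LST in °C."""
    scale = 1 + (WAVELENGTH_UM * bt_kelvin / RHO_UM_K) * math.log(emissivity)
    return bt_kelvin / scale - 273.15


bt = 300.0  # brightness temperature in Kelvin (26.85 °C uncorrected)
lst_soil = lst_celsius(bt, emissivity_from_ndvi(0.10))  # bare soil, ε = 0.97
lst_veg = lst_celsius(bt, emissivity_from_ndvi(0.70))   # vegetation, ε = 0.99
```

For the same 300 K brightness temperature, the bare-soil pixel comes out roughly 2 °C warmer than the uncorrected value and the vegetated pixel well under 1 °C warmer, because ln(ε) is more negative for the less efficient emitter.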

Copernicus GLO-30 DEM

30m static

Elevation matters because higher ground is cooler. Hilltop neighborhoods experience different heat dynamics than low-lying river valleys, even when the land cover is identical. The Copernicus GLO-30 DEM provides consistent global elevation data at 30-meter resolution. Unlike the satellite imagery sources, this is a static dataset. No date parameter, no cloud masking, no compositing. The pipeline downloads the tiles once, mosaics them, and extracts the elevation at each point.

[Figure: Digital elevation model of Rio de Janeiro, Brazil, from Copernicus GLO-30, in meters above sea level. Legend: Low, Mid, High]

Step 1

Query STAC catalog for GLO-30 DEM tiles covering the bounding box

Step 2

Mosaic tiles into a single elevation raster

Step 3

Reproject to WGS 84 and resample to target resolution

Step 4

Extract elevation at each sample point (nearest pixel, no neighborhood median needed)

Step 5

Cache result per city. DEM is static, so the cache key ignores date and resolution

1 feature extracted

elevation

Because the DEM is static, it gets special treatment in the caching layer. During a resolution sweep (for example, 7 resolutions across 3 cities), the pipeline extracts the DEM once per city, 3 extractions in total, instead of once per city and resolution, which would be 21. The cache key ignores resolution and date parameters for this source.
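One way to express a cache key that ignores date and resolution for static sources is sketched below. The key layout, the hashing scheme, and the `STATIC_SOURCES` set are assumptions for illustration; the pipeline's actual cache format is not shown in this page. The seven resolution values are likewise illustrative.

```python
import hashlib
import json

# Sources whose output does not depend on date or resolution (assumed set).
STATIC_SOURCES = {"dem"}


def cache_key(source, city, time_window=None, resolution=None):
    """Build a Parquet file name that is stable for identical requests."""
    parts = {"source": source, "city": city}
    if source not in STATIC_SOURCES:
        # Only dynamic sources vary with date and resolution.
        parts["time_window"] = time_window
        parts["resolution"] = resolution
    payload = json.dumps(parts, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return f"{source}_{city}_{digest}.parquet"


cities = ["rio", "santiago", "freetown"]
resolutions = [10, 30, 50, 100, 250, 500, 1000]  # illustrative sweep values

dem_keys = {cache_key("dem", c, "2023-01-01/2023-03-31", r)
            for c in cities for r in resolutions}
s2_keys = {cache_key("sentinel2", c, "2023-01-01/2023-03-31", r)
           for c in cities for r in resolutions}
```

Across the 21 sweep combinations the DEM collapses to 3 distinct cache files, while Sentinel-2 keeps 21, one per city and resolution.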

Building Footprints

Vector, 100m buffer

Satellite imagery tells you what the surface looks like from above, but it cannot tell you how tall the buildings are or how densely packed they are. Building footprint data fills that gap. The pipeline accepts either a pre-computed CSV with density values or a raw shapefile of building polygons. When given a shapefile, it computes building density within a configurable buffer (default 100m) around each sample point.

[Figure: Building density map of Rio de Janeiro, Brazil, from 3D-GloBFP footprints, density per 100m buffer. Legend: Sparse, Dense]

Step 1

Load building polygons from shapefile or pre-computed CSV

Step 2

Build spatial index (R-tree) over all footprints for fast lookup

Step 3

For each sample point, buffer by 100m and count intersecting footprints

Step 4

Compute building density as footprint count per buffer area

Step 5

Merge density values into the main DataFrame, filling NaN where coverage is missing
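The counting and density steps can be approximated in a few lines of standard-library Python. This sketch deliberately simplifies the method described above: it treats each footprint as its centroid and scans all of them by brute force, instead of intersecting buffered polygons through an R-tree. All names and coordinates are illustrative.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def building_density(point, centroids, buffer_m=100.0):
    """Footprints per square meter within buffer_m of the point."""
    lat, lon = point
    count = sum(
        1 for clat, clon in centroids
        if haversine_m(lat, lon, clat, clon) <= buffer_m
    )
    return count / (math.pi * buffer_m ** 2)


sample = (-22.9000, -43.2000)
footprints = [
    (-22.9005, -43.2000),  # about 56 m south: inside the buffer
    (-22.9000, -43.2005),  # about 51 m west: inside the buffer
    (-22.9020, -43.2000),  # about 222 m south: outside the buffer
]
density = building_density(sample, footprints)
```

A production version would replace the linear scan with a spatial index, which is what makes the lookup fast across tens of thousands of sample points.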

Features extracted

building_density_100m, building_compactness, height_stats, sky_view_factor

Building data is the most uneven source in terms of global availability. 3D-GloBFP has good coverage for major cities but gaps elsewhere. When building data is unavailable, the pipeline runs without it. The corresponding columns return NaN, and the rest of the DataFrame is unaffected. This makes the building source optional rather than a hard dependency.

Usage

All four sources converge into one function call, but each can also run independently. The example below runs the Sentinel-2 source on its own.

from uhi_pipe import extract_sentinel
from uhi_pipe.features import sentinel_features

# my_points holds your sample coordinates and must be defined beforehand.
# Extract raw bands at your coordinates
raw = extract_sentinel(
    points=my_points,
    bbox=(-33.88, -71.05, -33.23, -70.04),
    time_window="2024-01-15/2024-02-15",
    resolution=100,
    max_cloud=30,
)
# → Latitude, Longitude, B01-B12 (raw DN values)

# Compute all 19 spectral indices
features = sentinel_features(raw)
# → Adds NDVI, EVI, SAVI, NDBI, Albedo, ... (19 columns)

Pre-configured locations

Three cities ship as built-in configurations for quick testing. These are the cities from the UHI-Explorer classification project, but the pipeline works with any bounding box and coordinates.

Rio de Janeiro (Tropical)
Sample points: 28,488
Date range: Jan - Mar 2023
Bbox: -23.02, -43.52

Santiago (Mediterranean)
Sample points: 21,662
Date range: Jan - Feb 2024
Bbox: -33.88, -71.05

Freetown (Tropical Coastal)
Sample points: 14,105
Date range: Jan - Feb 2023
Bbox: 8.36, -13.32

Resolution sweep

The pipeline supports resolutions from 10m to 1000m, but the fused dataset has an effective floor of 30m: Landsat thermal is natively 100m (resampled to 30m), and the DEM is 30m. Sentinel-2 bands go down to 10m, but combining them with coarser sources introduces spatial mismatch at fine scales.


Assumptions and limitations

Minimum resolution

30m, limited by Landsat thermal band and DEM. Sentinel-2 bands go down to 10m, but fusing with Landsat at finer scales introduces spatial mismatch.

Temporal compositing

Median composite assumes surface conditions are stable across the time window. Works well for 1-2 month windows, breaks down for longer periods or rapid land-use change.

5×5 neighborhood median

Extracts the median pixel value in a 5×5 grid around each point, not just the nearest pixel. Reduces noise from GPS coordinate error and sub-pixel heterogeneity, but smooths out fine-grained features.

Cloud masking coverage

Persistently cloudy regions (tropics in wet season) may have few or no valid pixels after masking. The pipeline reports valid pixel counts so you know when data is thin.

Building data availability

3D-GloBFP coverage is incomplete globally. For cities without building footprints, the pipeline runs without the buildings source and returns NaN for building features.

Emissivity simplification

Mono-window LST uses NDVI-derived emissivity with three thresholds (bare soil, mixed, full vegetation). Real emissivity varies more than this, especially over water and rock.

Technologies

Python Package, Geospatial, Sentinel-2, PyPI
