Data Engineering · Snowflake · Galaxy Schema

Coffee Futures
Snowflake Warehouse

A production-style Snowflake data warehouse built around the coffee futures curve. 8 fact tables (1,217,266 rows total) joined through 19 dimension tables, stitched together from CFTC scraping, FMP and FRED APIs, and Open-Meteo. Designed so one SQL query answers a full thesis.

Fact tables

OHLCV, weather, macro, COT, FX, features

Dim tables

calendar, regions, indicators, regimes

Total rows

1.22M

largest: FACT_FX (942,510)

Data sources

FMP · FRED · Open-Meteo · CFTC · earth-engine

The BuildMickias Ambaye · November 2025

Abstract

Most quantitative research projects die in the ETL layer. You pull prices from one API, macro from a second, weather from a third, and by the time you have stitched them together the analysis is a notebook full of CSV mergers. This warehouse is the antidote: a Snowflake galaxy schema where every fact, from KC futures bars to CFTC positioning to Open-Meteo rainfall in Minas Gerais, joins through a single canonical date and country dimension. The point of the design is that a thesis-level question (“does abnormal rainfall in Vietnam predict KC's return when money managers are net long?”) becomes one SQL query, not a Python project.

Part I · Galaxy schema, not a star

1.The right name for it

A textbook star schema has exactly one fact table at the center surrounded by its dimensions. This warehouse has eight fact tables (prices, features, commodities, FX, macro, weather, and two cuts of CFTC positioning), and most of them share the same dimensions: DIM_DATE joins every fact, DIM_COUNTRY joins four of them, DIM_CONTRACT joins two. That structure has a proper name: a galaxy schema, also known as a fact constellation. Calling it a star schema is the common shorthand, but the constellation form is what actually shows up the moment a warehouse needs to model more than one process, and it is what shows up here.

2.Why this design

A coffee thesis touches at least six independent data streams: futures prices, positioning, weather in producing regions, macro indicators in producing and consuming economies, FX rates against the USD, and adjacent commodity prices. Each stream lives at a different frequency and a different grain. Self-merging this in pandas is feasible but unrepeatable; every analyst reinvents the same join.

The constellation makes those joins canonical. Every fact references the same surrogate keys: DATE_ID for time, COUNTRY_ID for location, CONTRACT_ID for the futures instrument, and so on. The dimensions hold the human-readable attributes and the slowly changing metadata. Any new fact stream can be added without touching the dimensions, and any new dimension attribute (a new harvest-season flag, a new region) can be added without touching the facts. That is the property a star schema gives you in the single-fact case; the constellation extends it across as many fact processes as the domain needs.

2.Surrogate keys, not natural keys

Every primary key in this warehouse is an integer surrogate, not the natural key. DATE_ID is days-since-1970, not the date string. CONTRACT_ID is 1, 2, 3, not "KC". This is the textbook Kimball choice and it pays off in two ways: joins are O(int compare) instead of O(string compare), and renames or schema changes upstream do not propagate through the warehouse. The catch is that human-readable queries always require a join to the dimension to surface the readable attribute, which is fine because that is exactly what the dimensions are for.

Interactive · Schema Explorer

Below: click any of the 25 tables on the left to see its columns, primary key, foreign keys, and which tables it joins to. Yellow dots are facts, green are dimensions. Use the join chips at the bottom of each detail card to follow a join out into the next table.

Studio · Schema Explorer

Click any table, see its columns and where it joins

Fact tables

Dimension tables

Fact table

FACT_PRICES

22,154 rows

Daily OHLCV bars for KC (arabica), RC (robusta), and the KC/USX cross.

Columns

FKDATE_ID→ DIM_DATEINTEGER
FKCONTRACT_ID→ DIM_CONTRACTINTEGER
OPENFLOAT
HIGHFLOAT
LOWFLOAT
CLOSEFLOAT
ADJCLOSEFLOAT
VOLUMEINTEGER
UNADJUSTEDVOLUMEINTEGER
VWAPFLOAT

Joins out (2)

Part II · The five data sources

3.Where each fact comes from

FACT_PRICES, FACT_FEATURES, FACT_COMMODITY, FACT_FX come from the Financial Modeling Prep REST API. One Python script per endpoint, output written long-format with a DATE_ID column already resolved.
FACT_MACRO comes from FRED for US series and the FMP economic-calendar endpoint for international series, sourced under DIM_MACRO_DATA_SOURCE so the provenance is queryable.
FACT_WEATHER_LONGis from Open-Meteo's historical archive API, called once per region in DIM_REGIONSat the region's lat/lon.
FACT_COT and FACT_COT_DATAcome from a custom scraper against the CFTC's historical Commitment-of-Traders Excel archive, filtered to the KC contract code.
DIM_REGIONS has a Google Earth Engine companion notebook for satellite-derived NDVI and soil-moisture extracts. That data is not in the warehouse yet but the lat/lon column is.

4.Calendar with harvest seasons baked in

DIM_DATE has 31 columns instead of the usual 7. Eight of them are IS_HARVEST_SEASON_<COUNTRY> flags for Brazil, Vietnam, Colombia, Ethiopia, Indonesia, Honduras, India, and Uganda. Pre-computing these on the calendar dimension means every fact query gets seasonality filtering for free, without joining to a separate seasonality table or recomputing month windows in every WHERE clause. Same trick for fiscal calendars, quarter-ends, and week-of-month, which are also pre-baked.

Part III · The thesis query

5.One join, one answer

The pay-off of the schema is the query that motivated the build. Joining a single futures contract's OHLC against its corresponding regional weather and the CFTC positioning report is the kind of analysis that takes a day in pandas. Here it is three JOINs and a WHERE clause. The full SQL is in the weather_join.sql tab below; the verbal version reads: pull KC daily closes during Brazil harvest season, attach the rainfall in Minas Gerais on that day, attach the Managed Money net position, join through the canonical date dimension so the harvest flag is already on each row.

That query is the working proof that the schema does what it was designed to do. Any extension, additional commodities, additional COT trader categories, additional weather regions, is a CSV upload that hits the same date and country dimensions and is queryable on the next refresh.

Public artifacts

Downloads

The full DDL is below, plus a handful of the smaller dimension CSVs to inspect. The large fact files (FACT_FX is 942K rows, FACT_WEATHER_LONG 104K) are kept out of the public bundle. The original ETL also reads a local .env with FMP and FRED API keys, which stays on disk and is not published here.

DDL

create_snowflake_tables.sql

Full schema: 24 tables, ~50 FK constraints

Dimensions

dim-tables-all.zip

All 10 dimension CSVs in one download (3 KB)

Code

Build, upload, query

Five tabs: the DDL excerpt, the thesis-style join, the upload pipeline that loads every CSV into Snowflake in dim-first / fact-second order, the date dimension builder with harvest-season flags, and the CFTC COT scraper. Everything below runs against a fresh COFFEE_FUTURES schema.

-- ============================================================================
-- COFFEE FUTURES DATA WAREHOUSE - SNOWFLAKE DDL  (excerpt)
-- ============================================================================
-- Author: Mickias Ambaye   Date: 2025-11-09
 
CREATE DATABASE IF NOT EXISTS COFFEE_FUTURES;
USE DATABASE COFFEE_FUTURES;
CREATE SCHEMA IF NOT EXISTS MAIN;
USE SCHEMA MAIN;
 
-- ─── DIMENSION TABLES (17) ──────────────────────────────────────────────────
 
CREATE OR REPLACE TABLE DIM_DATE (
    DATE_ID                        INTEGER,
    DATE                           DATE,
    YEAR                           INTEGER,
    QUARTER                        INTEGER,
    MONTH                          INTEGER,
    DAY_OF_WEEK                    INTEGER,
    IS_WEEKEND                     INTEGER,
    -- harvest-season flags baked into the calendar dimension
    IS_HARVEST_SEASON_BRAZIL       INTEGER,
    IS_HARVEST_SEASON_VIETNAM      INTEGER,
    IS_HARVEST_SEASON_COLOMBIA     INTEGER,
    IS_HARVEST_SEASON_ETHIOPIA     INTEGER,
    -- +19 more calendar columns
    PRIMARY KEY (DATE_ID)
);
 
CREATE OR REPLACE TABLE DIM_CONTRACT (
    CONTRACT_ID                    INTEGER,
    SYMBOL                         CHAR(5),
    NAME                           VARCHAR(15),
    COFFEE_TYPE_ID                 INTEGER,
    EXCHANGE                       CHAR(8),
    PRIMARY KEY (CONTRACT_ID)
);
 
ALTER TABLE DIM_CONTRACT ADD CONSTRAINT FK_DIM_CONTRACT_COFFEE_TYPE_ID
    FOREIGN KEY (COFFEE_TYPE_ID) REFERENCES DIM_COFFEE_TYPE(COFFEE_TYPE_ID);
 
-- ─── FACT TABLES (7) ────────────────────────────────────────────────────────
 
CREATE OR REPLACE TABLE FACT_PRICES (
    DATE_ID                        INTEGER,
    CONTRACT_ID                    INTEGER,
    OPEN                           FLOAT,
    HIGH                           FLOAT,
    LOW                            FLOAT,
    CLOSE                          FLOAT,
    ADJCLOSE                       FLOAT,
    VOLUME                         INTEGER,
    VWAP                           FLOAT
);
 
ALTER TABLE FACT_PRICES ADD CONSTRAINT FK_FACT_PRICES_DATE_ID
    FOREIGN KEY (DATE_ID) REFERENCES DIM_DATE(DATE_ID);
ALTER TABLE FACT_PRICES ADD CONSTRAINT FK_FACT_PRICES_CONTRACT_ID
    FOREIGN KEY (CONTRACT_ID) REFERENCES DIM_CONTRACT(CONTRACT_ID);
 
-- See the full DDL link below for all 24 tables and ~50 FK constraints.

Mickias Ambaye · 2026

Coffee FuturesSnowflake Warehouse