Data Engineering · Snowflake · Galaxy Schema
Coffee Futures
Snowflake Warehouse
A production-style Snowflake data warehouse built around the coffee futures curve. 8 fact tables (1,217,266 rows total) joined through 19 dimension tables, stitched together from CFTC scraping, FMP and FRED APIs, and Open-Meteo. Designed so one SQL query answers a full thesis.
Fact tables
8
OHLCV, weather, macro, COT, FX, features
Dim tables
19
calendar, regions, indicators, regimes
Total rows
1.22M
largest: FACT_FX (942,510)
Data sources
5
FMP · FRED · Open-Meteo · CFTC · earth-engine
Abstract
Most quantitative research projects die in the ETL layer. You pull prices from one API, macro from a second, weather from a third, and by the time you have stitched them together the analysis is a notebook full of CSV mergers. This warehouse is the antidote: a Snowflake galaxy schema where every fact, from KC futures bars to CFTC positioning to Open-Meteo rainfall in Minas Gerais, joins through a single canonical date and country dimension. The point of the design is that a thesis-level question (“does abnormal rainfall in Vietnam predict KC's return when money managers are net long?”) becomes one SQL query, not a Python project.
Part I · Galaxy schema, not a star
1.The right name for it
A textbook star schema has exactly one fact table at the center surrounded by its dimensions. This warehouse has eight fact tables (prices, features, commodities, FX, macro, weather, and two cuts of CFTC positioning), and most of them share the same dimensions: DIM_DATE joins every fact, DIM_COUNTRY joins four of them, DIM_CONTRACT joins two. That structure has a proper name: a galaxy schema, also known as a fact constellation. Calling it a star schema is the common shorthand, but the constellation form is what actually shows up the moment a warehouse needs to model more than one process, and it is what shows up here.
2.Why this design
A coffee thesis touches at least six independent data streams: futures prices, positioning, weather in producing regions, macro indicators in producing and consuming economies, FX rates against the USD, and adjacent commodity prices. Each stream lives at a different frequency and a different grain. Self-merging this in pandas is feasible but unrepeatable; every analyst reinvents the same join.
The constellation makes those joins canonical. Every fact references the same surrogate keys: DATE_ID for time, COUNTRY_ID for location, CONTRACT_ID for the futures instrument, and so on. The dimensions hold the human-readable attributes and the slowly changing metadata. Any new fact stream can be added without touching the dimensions, and any new dimension attribute (a new harvest-season flag, a new region) can be added without touching the facts. That is the property a star schema gives you in the single-fact case; the constellation extends it across as many fact processes as the domain needs.
2.Surrogate keys, not natural keys
Every primary key in this warehouse is an integer surrogate, not the natural key. DATE_ID is days-since-1970, not the date string. CONTRACT_ID is 1, 2, 3, not "KC". This is the textbook Kimball choice and it pays off in two ways: joins are O(int compare) instead of O(string compare), and renames or schema changes upstream do not propagate through the warehouse. The catch is that human-readable queries always require a join to the dimension to surface the readable attribute, which is fine because that is exactly what the dimensions are for.
Interactive · Schema Explorer
Below: click any of the 25 tables on the left to see its columns, primary key, foreign keys, and which tables it joins to. Yellow dots are facts, green are dimensions. Use the join chips at the bottom of each detail card to follow a join out into the next table.
Studio · Schema Explorer
Click any table, see its columns and where it joins
Fact tables
8Dimension tables
19Fact table
FACT_PRICES
22,154 rows
Daily OHLCV bars for KC (arabica), RC (robusta), and the KC/USX cross.
Columns
- FKDATE_ID→ DIM_DATEINTEGER
- FKCONTRACT_ID→ DIM_CONTRACTINTEGER
- OPENFLOAT
- HIGHFLOAT
- LOWFLOAT
- CLOSEFLOAT
- ADJCLOSEFLOAT
- VOLUMEINTEGER
- UNADJUSTEDVOLUMEINTEGER
- VWAPFLOAT
Joins out (2)
Part II · The five data sources
3.Where each fact comes from
- FACT_PRICES, FACT_FEATURES, FACT_COMMODITY, FACT_FX come from the Financial Modeling Prep REST API. One Python script per endpoint, output written long-format with a
DATE_IDcolumn already resolved. - FACT_MACRO comes from FRED for US series and the FMP economic-calendar endpoint for international series, sourced under
DIM_MACRO_DATA_SOURCEso the provenance is queryable. - FACT_WEATHER_LONGis from Open-Meteo's historical archive API, called once per region in
DIM_REGIONSat the region's lat/lon. - FACT_COT and FACT_COT_DATAcome from a custom scraper against the CFTC's historical Commitment-of-Traders Excel archive, filtered to the KC contract code.
- DIM_REGIONS has a Google Earth Engine companion notebook for satellite-derived NDVI and soil-moisture extracts. That data is not in the warehouse yet but the lat/lon column is.
4.Calendar with harvest seasons baked in
DIM_DATE has 31 columns instead of the usual 7. Eight of them are IS_HARVEST_SEASON_<COUNTRY> flags for Brazil, Vietnam, Colombia, Ethiopia, Indonesia, Honduras, India, and Uganda. Pre-computing these on the calendar dimension means every fact query gets seasonality filtering for free, without joining to a separate seasonality table or recomputing month windows in every WHERE clause. Same trick for fiscal calendars, quarter-ends, and week-of-month, which are also pre-baked.
Part III · The thesis query
5.One join, one answer
The pay-off of the schema is the query that motivated the build. Joining a single futures contract's OHLC against its corresponding regional weather and the CFTC positioning report is the kind of analysis that takes a day in pandas. Here it is three JOINs and a WHERE clause. The full SQL is in the weather_join.sql tab below; the verbal version reads: pull KC daily closes during Brazil harvest season, attach the rainfall in Minas Gerais on that day, attach the Managed Money net position, join through the canonical date dimension so the harvest flag is already on each row.
That query is the working proof that the schema does what it was designed to do. Any extension, additional commodities, additional COT trader categories, additional weather regions, is a CSV upload that hits the same date and country dimensions and is queryable on the next refresh.
Public artifacts
Downloads
The full DDL is below, plus a handful of the smaller dimension CSVs to inspect. The large fact files (FACT_FX is 942K rows, FACT_WEATHER_LONG 104K) are kept out of the public bundle. The original ETL also reads a local .env with FMP and FRED API keys, which stays on disk and is not published here.
Code
Build, upload, query
Five tabs: the DDL excerpt, the thesis-style join, the upload pipeline that loads every CSV into Snowflake in dim-first / fact-second order, the date dimension builder with harvest-season flags, and the CFTC COT scraper. Everything below runs against a fresh COFFEE_FUTURES schema.
-- ============================================================================-- COFFEE FUTURES DATA WAREHOUSE - SNOWFLAKE DDL (excerpt)-- ============================================================================-- Author: Mickias Ambaye Date: 2025-11-09 CREATE DATABASE IF NOT EXISTS COFFEE_FUTURES;USE DATABASE COFFEE_FUTURES;CREATE SCHEMA IF NOT EXISTS MAIN;USE SCHEMA MAIN; -- ─── DIMENSION TABLES (17) ────────────────────────────────────────────────── CREATE OR REPLACE TABLE DIM_DATE ( DATE_ID INTEGER, DATE DATE, YEAR INTEGER, QUARTER INTEGER, MONTH INTEGER, DAY_OF_WEEK INTEGER, IS_WEEKEND INTEGER, -- harvest-season flags baked into the calendar dimension IS_HARVEST_SEASON_BRAZIL INTEGER, IS_HARVEST_SEASON_VIETNAM INTEGER, IS_HARVEST_SEASON_COLOMBIA INTEGER, IS_HARVEST_SEASON_ETHIOPIA INTEGER, -- +19 more calendar columns PRIMARY KEY (DATE_ID)); CREATE OR REPLACE TABLE DIM_CONTRACT ( CONTRACT_ID INTEGER, SYMBOL CHAR(5), NAME VARCHAR(15), COFFEE_TYPE_ID INTEGER, EXCHANGE CHAR(8), PRIMARY KEY (CONTRACT_ID)); ALTER TABLE DIM_CONTRACT ADD CONSTRAINT FK_DIM_CONTRACT_COFFEE_TYPE_ID FOREIGN KEY (COFFEE_TYPE_ID) REFERENCES DIM_COFFEE_TYPE(COFFEE_TYPE_ID); -- ─── FACT TABLES (7) ──────────────────────────────────────────────────────── CREATE OR REPLACE TABLE FACT_PRICES ( DATE_ID INTEGER, CONTRACT_ID INTEGER, OPEN FLOAT, HIGH FLOAT, LOW FLOAT, CLOSE FLOAT, ADJCLOSE FLOAT, VOLUME INTEGER, VWAP FLOAT); ALTER TABLE FACT_PRICES ADD CONSTRAINT FK_FACT_PRICES_DATE_ID FOREIGN KEY (DATE_ID) REFERENCES DIM_DATE(DATE_ID);ALTER TABLE FACT_PRICES ADD CONSTRAINT FK_FACT_PRICES_CONTRACT_ID FOREIGN KEY (CONTRACT_ID) REFERENCES DIM_CONTRACT(CONTRACT_ID); -- See the full DDL link below for all 24 tables and ~50 FK constraints. Mickias Ambaye · 2026