Methodology

Data Pipeline

Discovery

Official data sources are cataloged with metadata: URL, format, language, update frequency, and reliability.

Extraction

Files are downloaded with hash-based change detection. Parsing is deterministic: Excel (openpyxl/xlrd), CSV (chardet encoding detection), HTML (BeautifulSoup).

Normalization

Tables are converted to one-observation-per-row format. Headers, merged cells, notes, and units are detected and separated. Time periods and geographic entities are standardized.

Validation

Multi-pass checks: required fields, value ranges, duplicates, percentage sanity (0-100%), year-over-year anomaly detection (>300% change), time series completeness.

Publication

Validated data is stored in a SQLite database, indexed for search, and exported as formatted spreadsheets (analyst and presentation variants).

Validation Framework

Check	Severity	Description
Required fields	Error	Indicator, time period, and geography must be present
Value ranges	Warning	Flags values exceeding 1 trillion
Duplicates	Warning	Detects observations with identical keys
Percentage sanity	Warning	Percentages outside -100% to 1000%
YoY anomalies	Warning	Year-over-year changes exceeding 300%
Completeness	Info	Indicators with fewer than 2 time periods

Data Model

The canonical data model supports:

Dataset

Source Asset

Source Table

Indicator

Time Period

Geography

Unit

Observation

Classification