Methodology
Data Pipeline
Discovery
Official data sources are cataloged with metadata: URL, format, language, update frequency, and reliability.
Extraction
Files are downloaded with hash-based change detection. Parsing is deterministic: Excel (openpyxl/xlrd), CSV (chardet encoding detection), HTML (BeautifulSoup).
Normalization
Tables are converted to one-observation-per-row format. Headers, merged cells, notes, and units are detected and separated. Time periods and geographic entities are standardized.
Validation
Multi-pass checks: required fields, value ranges, duplicates, percentage sanity (0-100%), year-over-year anomaly detection (>300% change), time series completeness.
Publication
Validated data is stored in a SQLite database, indexed for search, and exported as formatted spreadsheets (analyst and presentation variants).
Validation Framework
| Check | Severity | Description |
|---|---|---|
| Required fields | Error | Indicator, time period, and geography must be present |
| Value ranges | Warning | Flags values exceeding 1 trillion |
| Duplicates | Warning | Detects observations with identical keys |
| Percentage sanity | Warning | Percentages outside -100% to 1000% |
| YoY anomalies | Warning | Year-over-year changes exceeding 300% |
| Completeness | Info | Indicators with fewer than 2 time periods |
Data Model
The canonical data model supports: