Live StatsSchema demo, built from MONSTAT’s public releases.

Methodology

Data Pipeline

1. Discovery

Official data sources are cataloged with metadata: URL, format, language, update frequency, and reliability.
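A catalog entry of this kind could be sketched as a small record type. The class name, field names, example values, and the placeholder URL below are all assumptions made for illustration, not the project's actual schema.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class SourceEntry:
    """One cataloged data source (hypothetical field names)."""
    url: str
    file_format: str       # e.g. "xlsx", "csv", "html"
    language: str          # language code, e.g. "en"
    update_frequency: str  # e.g. "monthly", "annual"
    reliability: str       # e.g. "official", "provisional"

catalog = [
    SourceEntry(
        url="https://example.org/releases/table.xlsx",  # placeholder URL
        file_format="xlsx",
        language="en",
        update_frequency="annual",
        reliability="official",
    ),
]

def sources_by_format(entries, file_format):
    """Return catalog entries whose format matches, preserving order."""
    return [e for e in entries if e.file_format == file_format]

xlsx_sources = sources_by_format(catalog, "xlsx")
```

Keeping the entry immutable (`frozen=True`) makes it safe to use as a dictionary key or set member when deduplicating sources.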

2. Extraction

Files are downloaded with hash-based change detection, so unchanged files can be recognized across runs. Parsing is deterministic and format-specific: Excel via openpyxl or xlrd, CSV with chardet encoding detection, and HTML via BeautifulSoup.
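Hash-based change detection can be illustrated with the standard library alone. This is a minimal sketch, assuming SHA-256 as the hash and an in-memory dictionary as the store of previous digests; the source does not say which hash or store the pipeline uses, and the function names and URL are hypothetical.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of a downloaded file's bytes."""
    return hashlib.sha256(data).hexdigest()

def has_changed(url: str, data: bytes, seen: dict) -> bool:
    """Compare the new digest with the stored one, then update the store.

    `seen` maps URL -> last digest. Returns True when the file is new
    or its content differs from the previous download.
    """
    digest = content_hash(data)
    changed = seen.get(url) != digest
    seen[url] = digest
    return changed

seen_hashes = {}
url = "https://example.org/a.csv"  # placeholder URL
first = has_changed(url, b"year,value\n2020,1\n", seen_hashes)   # new file
second = has_changed(url, b"year,value\n2020,1\n", seen_hashes)  # same bytes
third = has_changed(url, b"year,value\n2020,2\n", seen_hashes)   # edited content
```

Only files for which `has_changed` returns True would need to be parsed again.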

3. Normalization

Tables are converted to one-observation-per-row format. Headers, merged cells, notes, and units are detected and separated. Time periods and geographic entities are standardized.
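The conversion to one observation per row is essentially a wide-to-long reshape. The sketch below assumes a simple layout (first column is the indicator name, remaining columns are time periods) and uses made-up values; real tables with merged cells and notes would need the detection steps described above first.

```python
def wide_to_long(header, rows):
    """Convert a wide table (one column per period) to one observation per row.

    `header` is ["indicator", <period>, <period>, ...]; each row follows it.
    Empty cells are skipped rather than emitted as observations.
    """
    periods = header[1:]
    observations = []
    for row in rows:
        indicator, values = row[0], row[1:]
        for period, value in zip(periods, values):
            if value is None or value == "":
                continue
            observations.append(
                {"indicator": indicator, "period": str(period), "value": float(value)}
            )
    return observations

# Illustrative input, not real data.
header = ["indicator", 2021, 2022]
rows = [["Indicator A", 100, 110], ["Indicator B", 4.5, ""]]
long_rows = wide_to_long(header, rows)
```

Each resulting dictionary carries its own indicator and period, so rows can be filtered, validated, and stored independently.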

4. Validation

Checks run in multiple passes: required fields, value ranges, duplicates, percentage sanity (flags values outside -100% to 1000%), year-over-year anomaly detection (changes exceeding 300%), and time series completeness.
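The year-over-year check can be sketched as follows. The 300% threshold comes from the text; comparing the absolute percent change, comparing only consecutive years, and skipping a zero previous value are assumptions of this sketch, and the function name is hypothetical.

```python
def yoy_anomalies(series, threshold_pct=300.0):
    """Return (year, pct_change) pairs whose change vs. the previous year exceeds the threshold.

    `series` maps year -> value. Only consecutive years are compared, and a
    zero previous value is skipped because the percent change is undefined.
    """
    flagged = []
    for year in sorted(series):
        prev = series.get(year - 1)
        if prev is None or prev == 0:
            continue
        pct = (series[year] - prev) / abs(prev) * 100.0
        if abs(pct) > threshold_pct:
            flagged.append((year, pct))
    return flagged

# Illustrative series: 2020 -> 2021 jumps from 12 to 60 (+400%); 2022 is missing.
flags = yoy_anomalies({2019: 10.0, 2020: 12.0, 2021: 60.0, 2023: 61.0})
```

Here only 2021 is flagged; 2023 is not compared because there is no 2022 value to compare against.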

5. Publication

Validated data is stored in a SQLite database, indexed for search, and exported as formatted spreadsheets (analyst and presentation variants).
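Storing and indexing observations in SQLite might look like the sketch below, which uses Python's built-in `sqlite3` module. The table name, column names, index, and sample rows are assumptions for illustration; the actual schema is not given in the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would be used in practice
conn.execute(
    """
    CREATE TABLE observation (
        indicator TEXT NOT NULL,
        period    TEXT NOT NULL,
        geography TEXT NOT NULL,
        value     REAL
    )
    """
)
# Index the column that lookups filter on.
conn.execute("CREATE INDEX idx_observation_indicator ON observation (indicator)")

rows = [
    ("Indicator A", "2021", "Region X", 100.0),  # illustrative values
    ("Indicator A", "2022", "Region X", 110.0),
]
conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?)", rows)
conn.commit()

result = conn.execute(
    "SELECT period, value FROM observation WHERE indicator = ? ORDER BY period",
    ("Indicator A",),
).fetchall()
```

Parameterized queries (`?` placeholders) keep indicator names with quotes or other special characters from breaking the SQL.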

Validation Framework

| Check | Severity | Description |
|---|---|---|
| Required fields | Error | Indicator, time period, and geography must be present |
| Value ranges | Warning | Flags values exceeding 1 trillion |
| Duplicates | Warning | Detects observations with identical keys |
| Percentage sanity | Warning | Percentages outside -100% to 1000% |
| YoY anomalies | Warning | Year-over-year changes exceeding 300% |
| Completeness | Info | Indicators with fewer than 2 time periods |
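A few of these checks can be sketched as a single pass that emits a severity with each finding. The thresholds follow the table; the dictionary keys, the check identifiers, and the use of a `unit` field equal to "%" to recognize percentages are assumptions of this sketch. The time-series checks (YoY anomalies, completeness) are left out for brevity.

```python
def validate(observations):
    """Run a subset of the checks and return (severity, check, message) tuples."""
    results = []
    seen_keys = set()
    for i, obs in enumerate(observations):
        # Required fields (Error): skip further checks on incomplete rows.
        missing = [f for f in ("indicator", "period", "geography") if not obs.get(f)]
        if missing:
            results.append(("error", "required_fields", f"row {i}: missing {missing}"))
            continue
        # Duplicates (Warning): identical (indicator, period, geography) keys.
        key = (obs["indicator"], obs["period"], obs["geography"])
        if key in seen_keys:
            results.append(("warning", "duplicates", f"row {i}: duplicate key {key}"))
        seen_keys.add(key)
        value = obs.get("value")
        if value is None:
            continue
        # Value ranges (Warning): values exceeding 1 trillion.
        if value > 1e12:
            results.append(("warning", "value_ranges", f"row {i}: value {value}"))
        # Percentage sanity (Warning): percentages outside -100% to 1000%.
        if obs.get("unit") == "%" and not (-100 <= value <= 1000):
            results.append(("warning", "percentage_sanity", f"row {i}: {value}%"))
    return results

# Illustrative observations, not real data.
sample = [
    {"indicator": "A", "period": "2021", "geography": "X", "value": 5.0, "unit": "%"},
    {"indicator": "A", "period": "2021", "geography": "X", "value": 5.0, "unit": "%"},
    {"indicator": "B", "period": "2021", "geography": "", "value": 1.0},
    {"indicator": "C", "period": "2021", "geography": "X", "value": 1500.0, "unit": "%"},
    {"indicator": "D", "period": "2021", "geography": "X", "value": 2e12},
]
findings = validate(sample)
```

Returning findings instead of raising exceptions lets warnings and info-level results be reported without blocking publication, while errors can still be filtered out and treated as blocking.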

Data Model

The canonical data model defines the following entities:

Dataset
Source Asset
Source Table
Indicator
Time Period
Geography
Unit
Observation
Classification
Category
Note
Validation Result
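To make the relationships between some of these entities concrete, here is a minimal sketch using dataclasses. Only the entity names come from the list above; every field, the `key` method, and the sample values are hypothetical, and entities such as Dataset, Source Asset, and Classification are omitted.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Indicator:
    code: str
    name: str

@dataclass(frozen=True)
class TimePeriod:
    label: str      # e.g. "2022" or "2022-Q1"
    frequency: str  # e.g. "annual", "quarterly"

@dataclass(frozen=True)
class Geography:
    code: str
    name: str

@dataclass(frozen=True)
class Unit:
    symbol: str     # e.g. "%", "persons"

@dataclass(frozen=True)
class Observation:
    """A single value for one indicator, period, and geography."""
    indicator: Indicator
    period: TimePeriod
    geography: Geography
    unit: Unit
    value: Optional[float]
    note: Optional[str] = None

    def key(self):
        """Identity used for duplicate detection (an assumption of this sketch)."""
        return (self.indicator.code, self.period.label, self.geography.code)

obs = Observation(
    indicator=Indicator("IND_A", "Indicator A"),
    period=TimePeriod("2022", "annual"),
    geography=Geography("XX", "Region X"),
    unit=Unit("%"),
    value=4.2,
)
```

In this arrangement an Observation references its Indicator, Time Period, Geography, and Unit rather than repeating their text, which mirrors the one-observation-per-row format produced by normalization.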