Stats

Dataset overview

This page summarizes the current public release and links directly to the dataset exports and summaries available on the site.

Key Metrics

16,001

Biographies reflected in the current structured dataset.

57.09%

Known-value coverage across 22 core fields.

86.48%

Entries with a resolved geocoded country.

100.00%

Biographies covered by occupation classification.

Highlights

Gender distribution

  • Male: 92.89% (14,863/16,001)
  • Female: 4.47% (715/16,001)
  • Unknown: 2.64% (423/16,001)
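
The percentages above follow directly from the listed counts. A minimal Python sketch, using only the figures in this list (the names `TOTAL`, `gender_counts`, and `share` are illustrative, not part of the site's code), recomputes each share and checks that the three groups account for every biography:

```python
# Counts copied from the gender distribution list above.
TOTAL = 16_001
gender_counts = {"Male": 14_863, "Female": 715, "Unknown": 423}


def share(count: int, total: int) -> str:
    """Format count/total as a percentage with two decimal places."""
    return f"{count / total * 100:.2f}%"


gender_shares = {label: share(n, TOTAL) for label, n in gender_counts.items()}

# The three groups should partition the dataset exactly.
assert sum(gender_counts.values()) == TOTAL

for label, value in gender_shares.items():
    print(f"{label}: {value}")
```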

Quality and validation

  • Quality issues flagged: 3.47% (556/16,001)
  • Needs validation: 82.48% (13,197/16,001)
  • Cross references: 4.54% (727/16,001)

Family and classification

  • Any family information present: 58.51% (9,362/16,001)
  • Father present: 51.42% (8,228/16,001)
  • Mother present: 42.12% (6,740/16,001)
  • Occupation coverage: 100.00% (16,001/16,001), split into OpenAI-classified rows 97.16% (15,546/16,001) and rows with unknown occupation text 2.84% (455/16,001)
  • Mean classification confidence: 0.8549; low-confidence rows among the OpenAI-classified rows: 11.03% (1,715/15,546)
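
Note that the low-confidence share uses a different denominator from the other figures: it is computed over the 15,546 OpenAI rows, not over all 16,001 biographies. The sketch below, which assumes nothing beyond the counts listed above (the variable and function names are illustrative), makes the two denominators explicit:

```python
# Counts copied from the occupation bullets above.
TOTAL = 16_001
openai_rows = 15_546
unknown_text_rows = 455
low_confidence_rows = 1_715


def pct(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to two decimals."""
    return round(count / total * 100, 2)


# Shares of the whole dataset.
openai_share = pct(openai_rows, TOTAL)
unknown_share = pct(unknown_text_rows, TOTAL)

# Share of the OpenAI rows only, as reported on this page.
low_conf_share = pct(low_confidence_rows, openai_rows)

# The same count expressed against the whole dataset, for comparison.
low_conf_of_all = pct(low_confidence_rows, TOTAL)

# The two occupation groups together cover every biography.
assert openai_rows + unknown_text_rows == TOTAL
```

Expressed against the full dataset, the same 1,715 rows come to about 10.72%, so the two figures should not be compared without checking which base was used.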

Visuals

Occupation classification

Category distribution

Current distribution across the published occupation categories.

Address geocoding

Geographic coverage map

Resolved address locations staged from the latest geocoding output.

Integration note

This page is backed by generated summaries in site/data/generated/ and does not read full raw datasets in the browser.
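
As a rough illustration of this pattern, the sketch below reduces full records to a small count table at build time, so the browser only has to load the summary. It is not the site's actual generator: the `gender` field, the function names, and the `gender_summary.json` file name are all hypothetical, and only the `site/data/generated/` directory comes from the note above.

```python
import json
from collections import Counter
from pathlib import Path


def build_gender_summary(rows):
    """Reduce full records to the small count table a stats page needs.

    Assumes each row is a dict with an optional "gender" key (hypothetical).
    """
    counts = Counter(row.get("gender") or "Unknown" for row in rows)
    total = sum(counts.values())
    return {
        "total": total,
        "groups": [
            {"label": label, "count": n, "percent": round(n / total * 100, 2)}
            for label, n in counts.most_common()
        ],
    }


def write_summary(summary, out_dir, name="gender_summary.json"):
    """Write the summary as JSON; the file name is a hypothetical example."""
    out_path = Path(out_dir) / name
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return out_path
```

A build step could then call `write_summary(build_gender_summary(rows), "site/data/generated")`, leaving the raw dataset out of the published page entirely.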