From source text to structure
Printed notices become analyzable records
Raw entries are converted into machine-readable fields that can be grouped, counted, mapped, and eventually browsed record by record.
Historical Biography Dataset
A public-facing home for a structured dataset derived from printed historical biographical entries, designed to support browsing, summary statistics, and downloadable research outputs.
This site turns a dense printed reference source into a structured dataset that can be summarized, compared, and gradually explored online.
The source material consists of compact biographical notices that bring together identity, titles, life events, family references, affiliations, and addresses in compressed prose. In raw form, those entries are readable, but they are difficult to search or analyze systematically at scale.
Here, the notices for 12,000 biographies are transformed into structured records. The current normalized output tracks 22 core fields, with room for names, professions, birth details, addresses, education, career paths, family relations, political affiliations, memberships, and personal notes.
What one record can contain
A single record can combine several layers of information at once:
Name, first names, title or profession.
Birth date, birth place, and address.
Education, job, career path, publications, or specialization.
Parents, spouse, children, or ancestors.
Political affiliation, memberships, hobbies, collections, and personal notes.
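The layers listed above can be pictured as one typed record. The sketch below is illustrative only: the class name, the field names, and the sample values are hypothetical and are not taken from the dataset's actual 22-field schema.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class BiographyRecord:
    """Illustrative record layout; field names are assumptions, not the real schema."""
    name: Optional[str] = None
    first_names: Optional[str] = None
    title_or_profession: Optional[str] = None
    birth_date: Optional[str] = None
    birth_place: Optional[str] = None
    address: Optional[str] = None
    education: Optional[str] = None
    career_path: Optional[str] = None
    spouse: Optional[str] = None
    children: List[str] = field(default_factory=list)
    political_affiliation: Optional[str] = None
    memberships: List[str] = field(default_factory=list)
    personal_notes: Optional[str] = None


def filled_fields(record: BiographyRecord) -> List[str]:
    """Return the names of fields holding a non-empty value, in declaration order."""
    return [f.name for f in fields(record) if getattr(record, f.name)]


# Made-up sample values, used only to show how sparse a record can be.
example = BiographyRecord(
    name="Doe",
    first_names="Jane",
    birth_place="Exampleville",
    memberships=["Example Society"],
)
print(filled_fields(example))
```

Keeping every field optional reflects the uneven coverage of the source: a notice may supply a name and a membership but no birth date, and the record should still be valid.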
Rich but uneven coverage
The current release tracks 22 core fields, reaches 56.91% overall completeness, and captures family information in 58.23% of biographies.
Geography and classification
74.90% of entries already resolve to a geocoded country, and occupation classification covers 83.34% of biographies in the current structured output.
The metrics below provide a compact view of the current structured release and show how much of the dataset is already available for comparison and interpretation.
Dataset size
12,000 biographies currently reflected in the structured dataset outputs.
Completeness
56.91% (150,241 of 264,000 tracked core slots filled).
Geocoded coverage
74.90% (8,988 entries resolved to a country).
Occupation coverage
83.34% (10,001 biographies classified so far).
Last refreshed: 2026-03-11 14:28
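The percentages above can be checked from the reported counts. The short sketch below assumes that completeness is the number of filled slots divided by biographies times core fields, which matches the 264,000 total given on this page; the helper name `percent` is hypothetical.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


BIOGRAPHIES = 12_000
CORE_FIELDS = 22

# Each biography contributes one slot per tracked core field.
total_slots = BIOGRAPHIES * CORE_FIELDS

completeness = percent(150_241, total_slots)  # filled core slots
geocoded = percent(8_988, BIOGRAPHIES)        # entries resolved to a country
occupation = percent(10_001, BIOGRAPHIES)     # biographies with a classified occupation

print(total_slots, completeness, geocoded, occupation)
```

Note that completeness is measured per slot while the two coverage figures are measured per biography, so they are not directly comparable.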
Planned additions
Current emphasis
The next additions will deepen interpretation rather than change the core structure: richer browse tools, fuller method documentation, and clearer public release guidance.
Developing space for searchable records and future record detail views.
Current summary metrics, classification outputs, and geographic views.
Pipeline, validation, and limitations behind the structured dataset.
Dataset files and supporting outputs available through the repository.