How this dataset was built
We turn pages from Degener's Wer Ist's? into structured biography records through a repeatable workflow: scan the source, extract the text, split it into fields, then check and standardize the output.
Each scanned page is converted to text with OCR that is tuned for historical print. That gives us a machine-readable starting point without changing the original pages.
The text is grouped into individual biographies, and each biography is then split into fields such as name, occupation, address, birth, education, and family.
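As a rough illustration of this grouping and splitting step, the sketch below assumes a simplified input where biographies are separated by blank lines and each field sits on its own "label: value" line. That layout, and the function names `split_entries` and `split_fields`, are hypothetical and chosen only for the example; the real pages are far less regular.

```python
from __future__ import annotations

# Field names follow the list in the text above. The "label: value" layout
# is an assumption for illustration, not the actual page format.
FIELDS = ("name", "occupation", "address", "birth", "education", "family")


def split_entries(ocr_text: str) -> list[str]:
    """Group OCR text into one block per biography (assumes blank-line separators)."""
    blocks = [block.strip() for block in ocr_text.split("\n\n")]
    return [block for block in blocks if block]


def split_fields(entry: str) -> dict[str, str | None]:
    """Split one biography block into the known fields; missing fields stay None."""
    record: dict[str, str | None] = {name: None for name in FIELDS}
    for line in entry.splitlines():
        label, sep, value = line.partition(":")
        key = label.strip().lower()
        # Only keep lines that have a separator and a recognised label.
        if sep and key in record:
            record[key] = value.strip()
    return record
```

Keeping every field present in the record, even when empty, means later steps can rely on a fixed set of keys.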
Extracted values are checked against validation rules and lists of known values. If the model is uncertain about a value, the record is flagged so a reviewer can confirm or correct it.
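A minimal sketch of how such checks could produce review flags is shown below. The occupation list, the confidence threshold, the year bounds, and the `birth_year` key are all placeholder assumptions for the example, not the values or field names used in the actual pipeline.

```python
from __future__ import annotations

# All constants below are hypothetical placeholders.
KNOWN_OCCUPATIONS = {"professor", "arzt", "schriftsteller"}
CONFIDENCE_THRESHOLD = 0.8
BIRTH_YEAR_RANGE = (1800, 1935)


def review_flags(record: dict, confidence: float) -> list[str]:
    """Return the reasons a record should go to a reviewer (empty list if none)."""
    flags: list[str] = []

    # Flag records where the extraction step reported low confidence.
    if confidence < CONFIDENCE_THRESHOLD:
        flags.append("low_confidence")

    # Known-list check: occupation must appear in the reference list.
    occupation = (record.get("occupation") or "").strip().lower()
    if occupation and occupation not in KNOWN_OCCUPATIONS:
        flags.append("unknown_occupation")

    # Rule check: birth year must fall inside a plausible range.
    birth_year = record.get("birth_year")
    low, high = BIRTH_YEAR_RANGE
    if birth_year is not None and not (low <= birth_year <= high):
        flags.append("birth_year_out_of_range")

    return flags
```

A record with an empty flag list passes through unchanged; any other record carries its flags along so the reviewer can see why it was stopped.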
Cleaned fields are standardized to a common structure. The same standardized files power the browse, metrics, and download pages.
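One way to picture the common structure is a fixed set of keys in a fixed order, with whitespace normalized and empty values stored as null. The sketch below assumes JSON Lines as the output format and uses the hypothetical names `SCHEMA`, `standardize`, and `to_json_line`; the actual file format is not stated here.

```python
from __future__ import annotations

import json

# Key order for every output record (assumed; mirrors the field list above).
SCHEMA = ("name", "occupation", "address", "birth", "education", "family")


def standardize(record: dict) -> dict:
    """Project a cleaned record onto the fixed schema."""
    out = {}
    for key in SCHEMA:
        value = record.get(key)
        if isinstance(value, str):
            # Collapse runs of whitespace; treat empty strings as missing.
            value = " ".join(value.split()) or None
        out[key] = value
    return out


def to_json_line(record: dict) -> str:
    """Serialize one standardized record as a single JSON line."""
    return json.dumps(standardize(record), ensure_ascii=False)
```

Because every record has the same keys in the same order, the browse, metrics, and download pages can all read the same files without per-page conversion.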
To keep the process transparent:
Current status
Additional validation reports and release packaging are added as part of each regular refresh.