Method

How this dataset was built

We turn pages from Degener's Wer Ist's? into structured biography records through a repeatable workflow: scan the source pages, extract the text, split it into fields, then check and standardize the output.

Processing pipeline

OCR

Read the scanned pages

Each scanned page is converted to text with OCR that is tuned for historical print. That gives us a machine-readable starting point without changing the original pages.

Extraction

Build one record per person

The text is grouped into one biography at a time, then split into fields like name, occupation, address, birth, education, and family.
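As a rough illustration of this step, the sketch below groups text into one block per person and fills a record with the fields listed above. It is a minimal sketch, not the project's actual extractor: the `BiographyRecord` class, the blank-line separation between entries, and the "label: value" line format are all assumptions made for the example, and the sample names are invented placeholders. Real entries in the source are dense, abbreviated running text and need far more elaborate parsing.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BiographyRecord:
    """One record per person; every field may be absent (hypothetical schema)."""
    name: Optional[str] = None
    occupation: Optional[str] = None
    address: Optional[str] = None
    birth: Optional[str] = None
    education: Optional[str] = None
    family: Optional[str] = None


FIELD_NAMES = {f.name for f in fields(BiographyRecord)}


def split_entries(text: str) -> list[str]:
    """Group text into one block per person (assumes blank-line separation)."""
    blocks = [block.strip() for block in text.split("\n\n")]
    return [block for block in blocks if block]


def parse_entry(block: str) -> BiographyRecord:
    """Read assumed 'label: value' lines; unknown labels are ignored."""
    values = {}
    for line in block.splitlines():
        label, sep, value = line.partition(":")
        key = label.strip().lower()
        if sep and key in FIELD_NAMES and value.strip():
            values[key] = value.strip()
    return BiographyRecord(**values)


# Invented placeholder text, not taken from the source volume.
sample = (
    "name: Muster, Max\noccupation: Professor\n\n"
    "name: Beispiel, Erika\nbirth: 1870"
)
records = [parse_entry(block) for block in split_entries(sample)]
```

Fields that are not found stay `None` rather than being filled with a guess, which keeps gaps visible to the later steps.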

Validation

Check and flag unclear fields

Values are checked against rules and known lists. If a value fails a check or the model is uncertain, the record is flagged so a reviewer can confirm or correct it.
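The check-and-flag idea can be sketched as a function that returns a list of reasons for review. Everything specific here is an assumption for illustration: the occupation list, the year pattern, the `confidence` score, and the 0.8 threshold are hypothetical and do not describe the project's actual rules.

```python
import re

# Hypothetical known list; the real list is not specified in this document.
KNOWN_OCCUPATIONS = {"professor", "arzt", "schriftsteller"}

# Assumed rule: a birth value should contain a four-digit year from 1700 to 1999.
YEAR_RE = re.compile(r"\b1[789]\d{2}\b")


def validate(record: dict, confidence: float, threshold: float = 0.8) -> list[str]:
    """Return review flags for one record; an empty list means nothing was flagged."""
    flags = []
    if not record.get("name"):
        flags.append("name: missing")
    birth = record.get("birth")
    if birth and not YEAR_RE.search(birth):
        flags.append("birth: no plausible year")
    occupation = record.get("occupation")
    if occupation and occupation.lower() not in KNOWN_OCCUPATIONS:
        flags.append("occupation: not in known list")
    if confidence < threshold:
        flags.append("low extraction confidence")
    return flags


clean = validate(
    {"name": "Muster, Max", "occupation": "Professor", "birth": "12. Mai 1865"},
    confidence=0.95,
)
unclear = validate(
    {"name": "", "occupation": "Zauberer", "birth": "unbekannt"},
    confidence=0.5,
)
```

Returning reasons instead of a single pass/fail value lets a reviewer see why a record was flagged before confirming or correcting it.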

Normalization

Publish final, usable outputs

Cleaned fields are standardized to a common structure. The same exported files power the browse, metrics, and download pages.
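One way to picture a common structure is a fixed set of keys in a fixed order, with empty values written as explicit nulls. The sketch below assumes a JSON Lines export and the field names listed earlier; the function names and the export format are hypothetical, not a description of the published files.

```python
import json

# Assumed field order for every exported record.
FIELDS = ("name", "occupation", "address", "birth", "education", "family")


def normalize(raw: dict) -> dict:
    """Return a record with every expected key, collapsed whitespace, and None for gaps."""
    out = {}
    for field in FIELDS:
        value = raw.get(field)
        if isinstance(value, str):
            value = " ".join(value.split())  # collapse runs of whitespace
        out[field] = value or None  # empty strings become explicit gaps
    return out


def export_jsonl(records: list[dict]) -> str:
    """Serialize records as one JSON object per line, keeping non-ASCII characters."""
    return "\n".join(
        json.dumps(normalize(record), ensure_ascii=False) for record in records
    )


# Invented placeholder record.
line = export_jsonl([{"name": "  Müller,   Max ", "address": ""}])
```

Because every record carries the same keys, any page that reads the export can rely on one layout, and a missing value appears as `null` instead of a dropped key.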

What this means in practice

To keep the process transparent:

You can rebuild the same release from the same generated inputs.
Missing or empty fields are not hidden; they are shown as known gaps.
All public pages (browse, metrics, and download) use the same exported dataset files.
Method and results remain connected to the underlying source records.
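To show how missing fields can be reported as known gaps rather than hidden, here is a minimal sketch that counts filled and missing values per field. The function name, the record layout, and the sample data are assumptions for illustration only.

```python
from collections import Counter


def gap_report(records: list[dict], field_names: tuple[str, ...]) -> dict:
    """Count filled and missing values per field across all records."""
    missing = Counter()
    for record in records:
        for field in field_names:
            if record.get(field) in (None, ""):
                missing[field] += 1
    total = len(records)
    return {
        field: {"filled": total - missing[field], "missing": missing[field]}
        for field in field_names
    }


# Invented placeholder records.
report = gap_report(
    [
        {"name": "Muster, Max", "birth": "1865"},
        {"name": "Beispiel, Erika", "birth": None},
    ],
    ("name", "birth", "address"),
)
```

A report like this can be published next to the data so that readers see how complete each field is.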

Current status

Core browsing, metrics, and release links are live.

Additional validation reports and release packaging are added as part of each regular refresh.
