Method

How this dataset was built

We turn pages from Degener's Wer Ist's? into structured biography records through a repeatable workflow: scan the source, extract the text, split it into fields, then check and standardize the outputs.

Processing pipeline

  1. OCR

    Read the scanned pages

    Each scanned page is converted to text with OCR that is tuned for historical print. That gives us a machine-readable starting point without changing the original pages.

  2. Extraction

    Build one record per person

    The text is grouped into one biography at a time, then split into fields like name, occupation, address, birth, education, and family.

  3. Validation

    Check and flag unclear fields

    Values are checked against rules and known lists. If a value fails a check or the model is uncertain about it, the record is flagged so a reviewer can confirm or correct it.

  4. Normalization

    Publish final, usable outputs

    Cleaned fields are standardized to a common structure. The same files power the browse, metrics, and download pages.
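
As an illustration of the validation step, here is a minimal Python sketch of checking fields against rules and a known list, then flagging records for review. It is not the project's actual code: the `Record` class, the `KNOWN_OCCUPATIONS` list, the flag names, and the birth-year range are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field

# Hypothetical known list; a real one would be far larger.
KNOWN_OCCUPATIONS = {"Arzt", "Professor", "Schriftsteller"}

# Hypothetical rule: a birth year is exactly four digits.
YEAR_RE = re.compile(r"\d{4}")

# Assumed plausible range for birth years; adjust to the source edition.
MIN_YEAR, MAX_YEAR = 1750, 1935


@dataclass
class Record:
    """One biography with a few illustrative fields."""
    name: str
    occupation: str = ""
    birth_year: str = ""
    flags: list[str] = field(default_factory=list)


def validate(record: Record) -> Record:
    """Append one flag per field that fails a rule or known-list check."""
    if not record.name.strip():
        record.flags.append("name:missing")
    # Empty fields are left alone here; only present values are checked.
    if record.occupation and record.occupation not in KNOWN_OCCUPATIONS:
        record.flags.append("occupation:unknown_value")
    if record.birth_year:
        well_formed = YEAR_RE.fullmatch(record.birth_year) is not None
        if not well_formed or not MIN_YEAR <= int(record.birth_year) <= MAX_YEAR:
            record.flags.append("birth_year:implausible")
    return record


def needs_review(record: Record) -> bool:
    """A record goes to a reviewer if any check raised a flag."""
    return bool(record.flags)
```

Storing flags on the record, rather than rejecting it, keeps uncertain entries in the dataset while making them easy to route to a reviewer.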

What this means in practice

To keep the process transparent:

  • You can rebuild the same release by rerunning the flow on the same generated inputs.
  • Missing or empty fields are not hidden; they are shown as known gaps.
  • All public pages (browse, metrics, and download) use the same exported dataset files.
  • Method and results remain connected to the underlying source records.
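
One way to show missing fields as known gaps is to count, per field, how many records have a value and how many do not. The sketch below assumes records are plain dictionaries keyed by the field names listed in the extraction step; the function name `field_coverage` and the output shape are hypothetical.

```python
# Field names taken from the extraction step described above.
FIELDS = ("name", "occupation", "address", "birth", "education", "family")


def field_coverage(records):
    """Return, per field, how many records are filled and how many are missing.

    A field counts as missing when the key is absent, the value is None,
    or the value is empty or whitespace only.
    """
    filled = {name: 0 for name in FIELDS}
    for record in records:
        for name in FIELDS:
            value = record.get(name)
            if value is not None and str(value).strip():
                filled[name] += 1
    total = len(records)
    return {
        name: {"filled": filled[name], "missing": total - filled[name]}
        for name in FIELDS
    }
```

Reporting the missing count alongside the filled count makes gaps visible on a metrics page instead of leaving them implicit.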

Current status

Core browsing, metrics, and release links are live.

Additional validation reports and release packaging are added as part of each regular refresh.