Reproducible LLM Workflows in Economic History: Degener's Wer Ist's?
Forthcoming in Jahrbuch für Wirtschaftsgeschichte
This paper develops a transparent and reproducible LLM-based workflow for turning dense historical print sources into structured data, and demonstrates it by digitizing Hermann A. L. Degener’s Wer Ist’s? (1911). The source contains roughly 16,000 biographies across about 1,900 pages and is difficult to process because of its dense typography, non-standard abbreviations, and frequent minor errors. The workflow is modular: each step is divided into small tasks, with human verification built in. The steps are image pre-processing and OCR, biography assembly with sanity checks, variable identification, field extraction, validation, and normalization that expands abbreviations while flagging uncertainty. The workflow is also transparent: prompts, logs, model responses, and outputs link each biography back to the original source.
Website | Paper | Dataset | Code
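
To make the normalization step more concrete, the following is a minimal Python sketch of expanding abbreviations while flagging uncertainty and keeping a link back to the source page. It is an illustration under stated assumptions, not the paper's implementation: the abbreviation table, the `NormalizedField` record, the `normalize` function, and the "short token ending in a period" heuristic are all hypothetical, and the example strings are not taken from the dataset.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical abbreviation table. These entries are common German
# abbreviations chosen for illustration, not the paper's actual list.
ABBREVIATIONS = {
    "Prof.": "Professor",
    "Dr.": "Doktor",
    "geb.": "geboren",
}


@dataclass
class NormalizedField:
    raw: str          # text as extracted from the page
    normalized: str   # text after abbreviation expansion
    page: int         # source page, kept so the value stays traceable
    uncertain: bool   # True if any token could not be resolved
    unknown_tokens: List[str] = field(default_factory=list)


def normalize(raw: str, page: int) -> NormalizedField:
    """Expand known abbreviations and flag unknown ones instead of guessing."""
    expanded, unknown = [], []
    for token in raw.split():
        if token in ABBREVIATIONS:
            expanded.append(ABBREVIATIONS[token])
        elif token.endswith(".") and len(token) <= 6:
            # Assumed heuristic: a short token ending in a period looks like
            # an abbreviation. Keep it unchanged and flag it for review.
            expanded.append(token)
            unknown.append(token)
        else:
            expanded.append(token)
    return NormalizedField(raw, " ".join(expanded), page, bool(unknown), unknown)


if __name__ == "__main__":
    result = normalize("Dr. phil. geb. Berlin", page=12)
    print(result.normalized, result.uncertain, result.unknown_tokens)
```

In this sketch an unresolved abbreviation is left untouched and reported in `unknown_tokens` rather than silently expanded, so a human reviewer can check the flagged record against the page it came from.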
