Reproducible LLM Workflows in Economic History: Degener's Wer Ist's?
Forthcoming in Jahrbuch für Wirtschaftsgeschichte
This project develops a transparent and reproducible workflow for using large language models to turn dense historical print sources into structured research data. The paper demonstrates the workflow by digitizing Hermann A. L. Degener’s Wer Ist’s? (1911) and discusses how historians can combine OCR, preprocessing, and LLM-assisted extraction without losing reproducibility. Code and materials are available in the repository linked above.
Abstract
This paper develops a transparent and reproducible LLM-based workflow for turning dense historical print sources into structured data, and demonstrates it by digitizing Hermann A. L. Degener’s Wer Ist’s? (1911). The source contains roughly 16,000 biographies across about 1,900 pages and is challenging because of its dense typography, non-standard abbreviations, and frequent minor errors. The workflow is modular: each step is divided into small tasks, and human verification is part of the process. The steps are image preprocessing and OCR, biography assembly with sanity checks, variable identification, field extraction, validation, and normalization that expands abbreviations while flagging uncertainty. The workflow is also transparent: prompts, logs, model responses, and outputs link each biography back to the original source.
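As an illustration of the normalization step described in the abstract, the following is a minimal Python sketch, not the paper's actual implementation. It expands abbreviations from a lookup table, flags tokens that look like unresolved abbreviations, and keeps the source page with each record so that the output can be traced back to the print original. The `Record` class, the `normalize` function, the three-entry abbreviation table, and the flagging heuristic are all assumptions made for this example.

```python
from dataclasses import dataclass, field

# Hypothetical abbreviation table for illustration only;
# the project's real list is not reproduced here.
ABBREVIATIONS = {"geb.": "geboren", "verh.": "verheiratet", "Prof.": "Professor"}


@dataclass
class Record:
    page: int                 # page of the printed source (provenance)
    raw_text: str             # OCR text assembled for one biography
    normalized: str = ""      # text after abbreviation expansion
    flags: list = field(default_factory=list)  # tokens needing human review


def normalize(record: Record) -> Record:
    """Expand known abbreviations and flag likely unknown ones."""
    out = []
    for token in record.raw_text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
            continue
        # Crude heuristic (an assumption): a short, non-numeric token
        # ending in a period may be an abbreviation the table lacks.
        if token.endswith(".") and len(token) <= 5 and not token[:-1].isdigit():
            record.flags.append(token)
        out.append(token)
    record.normalized = " ".join(out)
    return record


example = normalize(Record(page=12, raw_text="Prof. Dr. geb. 1860"))
```

In this sketch, unknown abbreviations are left unchanged and added to `flags` rather than guessed, so uncertain cases can be sent to a human reviewer instead of being resolved silently.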
