-
Ulrich Reffle, Christoph Ringlstetter: Unsupervised profiling of OCRed historical documents. Pattern Recognition 46(5): 1346-1357 (2013)
Hard to download.
Abstract
-
In search engines and digital libraries, more and more OCRed historical documents become available. Still, access to these texts is often not satisfactory due to two problems: first, the quality of optical character recognition (OCR) on historical texts is often surprisingly low; second, historical spelling variation represents a barrier for search even if texts are properly reconstructed. As one step towards a solution we introduce a method that automatically computes a two-channel profile from an OCRed historical text. The profile includes (1) “global” information on typical recognition errors found in the OCR output, typical patterns for historical spelling variation, vocabulary and word frequencies in the underlying text, and (2) “local” hypotheses on OCR-errors and historical orthography of particular tokens of the OCR output. We argue that availability of this kind of knowledge represents a key step for improving OCR and Information Retrieval (IR) on historical texts: profiles can be used, e.g., to automatically finetune postcorrection systems or adapt OCR engines to the given input document, and to define refined models for approximate search that are aware of the kind of language variation found in a specific document. Our evaluation results show a strong correlation between the true distribution of spelling variation patterns and recognition errors in the OCRed text and estimated ranks and scores automatically computed in profiles. As a specific application we show how to improve the output of a commercial OCR engine using profiles in a postcorrection system.
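To make the "two-channel" idea concrete, here is a toy sketch in Python. It is not the authors' algorithm, only an illustration under stated assumptions: the pattern lists `HIST_PATTERNS` and `OCR_PATTERNS`, the tiny lexicon, and all function and class names are hypothetical. Each OCR token is explained as a modern word passed first through at most one historical spelling pattern and then through at most one OCR error pattern. The "local" part of the profile stores those per-token hypotheses, and the "global" part accumulates pattern counts, split evenly across competing hypotheses.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical rewrite patterns, for illustration only.
HIST_PATTERNS = [("t", "th"), ("ei", "ey")]   # modern -> historical spelling
OCR_PATTERNS = [("u", "n"), ("m", "rn")]      # true character(s) -> OCR output


def apply_once(word, patterns):
    """Yield (variant, pattern) for every single application of a pattern."""
    for left, right in patterns:
        start = word.find(left)
        while start != -1:
            yield word[:start] + right + word[start + len(left):], (left, right)
            start = word.find(left, start + 1)


@dataclass
class Profile:
    hist_counts: Counter = field(default_factory=Counter)  # global channel 1
    ocr_counts: Counter = field(default_factory=Counter)   # global channel 2
    local: dict = field(default_factory=dict)              # token -> hypotheses


def interpretations(token, lexicon):
    """Return (modern_word, hist_pattern, ocr_pattern) triples that yield token.

    A pattern slot is None when that channel leaves the word unchanged.
    """
    found = []
    for modern in sorted(lexicon):
        hist_forms = [(modern, None)] + list(apply_once(modern, HIST_PATTERNS))
        for hist_form, hist_pat in hist_forms:
            ocr_forms = [(hist_form, None)] + list(apply_once(hist_form, OCR_PATTERNS))
            for ocr_form, ocr_pat in ocr_forms:
                if ocr_form == token:
                    found.append((modern, hist_pat, ocr_pat))
    return found


def build_profile(tokens, lexicon):
    """Collect local hypotheses per token and global pattern counts."""
    profile = Profile()
    for token in tokens:
        candidates = interpretations(token, lexicon)
        if not candidates:
            continue  # token cannot be explained by this toy model
        profile.local[token] = candidates
        weight = 1 / len(candidates)  # split evidence across hypotheses
        for _, hist_pat, ocr_pat in candidates:
            if hist_pat is not None:
                profile.hist_counts[hist_pat] += weight
            if ocr_pat is not None:
                profile.ocr_counts[ocr_pat] += weight
    return profile


if __name__ == "__main__":
    demo = build_profile(["theil", "nnd", "mnth", "und"], {"teil", "und", "mut"})
    print(demo.local)
    print(demo.hist_counts.most_common())
    print(demo.ocr_counts.most_common())
```

The even split of evidence across hypotheses is a simplification chosen for brevity. A real profiler would need a principled way to score competing interpretations, which this sketch does not attempt.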