Christian Reul, Uwe Springmann, Christoph Wick, Frank Puppe: Improving OCR Accuracy on Early Printed Books by Utilizing Cross Fold Training and Voting. DAS 2018: 423-428
Error correction using spelling suggestions selected by a modified auxiliary algorithm
Method
- In this article, the efforts made by the Vocalizer project development team to correct errors in texts generated by the Tesseract OCR engine are described. Vocalizer consists of a device that captures images from books and converts them into plain text with the aid of OCR (optical character recognition) software. It also post-processes the obtained text and converts its textual content into voice. The whole process is performed autonomously. In the post-processing step, a modified Needleman-Wunsch algorithm was applied to select among the suggestions made by the spellchecker PyEnchant. The results obtained were reasonable, which encourages further research.
- The Tesseract output is used as the preliminary recognition result.
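The selection step described above can be sketched as follows. These notes do not record how the paper modifies Needleman-Wunsch, so this is only the standard global-alignment score with assumed costs (match +1, mismatch -1, gap -1), not the authors' variant. The names `nw_score` and `pick_suggestion` are hypothetical, and the candidate list is hard-coded here; in practice it would presumably come from a spellchecker call such as PyEnchant's `enchant.Dict("en_US").suggest(word)`.

```python
from typing import Iterable, Optional


def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Standard Needleman-Wunsch global alignment score (assumed costs)."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


def pick_suggestion(ocr_word: str, suggestions: Iterable[str]) -> Optional[str]:
    """Return the suggestion that aligns best with the OCR token, or None."""
    best: Optional[str] = None
    best_score: Optional[int] = None
    for cand in suggestions:
        score = nw_score(ocr_word.lower(), cand.lower())
        # Strict '>' keeps the earliest candidate on ties (spellchecker order).
        if best_score is None or score > best_score:
            best, best_score = cand, score
    return best


if __name__ == "__main__":
    # Hypothetical OCR error: 'i' misread as 'l'.
    print(pick_suggestion("recognltion", ["recondition", "recognition", "cognition"]))
```

Keeping the earliest candidate on ties is a design choice that defers to the spellchecker's own ranking; a modified scoring scheme, such as cheaper substitutions for visually similar characters, would slot into the `diag` term.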
Experiments
Conclusion
Takeaways
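- Proposes a modified Needleman-Wunsch alignment algorithm for choosing among candidate corrections
- Proposes a spellchecker-based correction method
- Suited to English; probably not well suited to Chinese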