Post-correction of OCR Errors Using PyEnchant Spelling Suggestions Selected Through a Modified Needleman-Wunsch Algorithm

Christian Reul, Uwe Springmann, Christoph Wick, Frank Puppe: Improving OCR Accuracy on Early Printed Books by Utilizing Cross Fold Training and Voting. DAS 2018: 423-428

Method

  • This article describes the efforts of the Vocalizer project development team to correct errors in text generated by the Tesseract OCR engine. Vocalizer is a device that captures images from books and converts them into plain text with the aid of OCR (optical character recognition) software. It then post-processes the resulting text and converts it into speech. The whole process runs autonomously. In the post-processing step, a modified Needleman-Wunsch algorithm is applied to select among the suggestions made by the PyEnchant spellchecker. The authors describe the results as reasonable, which encourages further research.

  • The Tesseract output is used as the initial recognition result.

Experiments

Conclusion

Takeaways

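  • Proposes a modified Needleman-Wunsch algorithm.
  • Proposes a correction method built on a spellchecker (PyEnchant).
  • Suited to English; probably not well suited to Chinese.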

References
