Environment
- Tesseract 4.0.0-beta.3-249-g607e
- leptonica-1.76.0
- Linux (hostname removed) 4.18.0-16-generic #17-Ubuntu SMP Fri Feb 8 00:06:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:
I work at a SaaS firm which provides cloud storage services specializing in documents. As a part of our service, we try to create PDFs with searchable text layers from scanned documents. When processing PPMs which are created by ImageMagick from the original document, Leptonica mangles the image before it can be OCR'd properly by Tesseract. This results in a PDF unreadable by both human eyes and Tesseract. This only seems to happen for some specific documents.

How do I know it's Leptonica, specifically? I have executed Tesseract with the config values tessedit_write_images 1 and tessedit_pageseg_mode 0. From my understanding, the second option does not enable OCR at all while processing with Tesseract (which speeds up my test cases) and the first option outputs a .tif debug image which is apparently what Leptonica feeds to Tesseract after processing. That image is also mangled.

Sample data:
I have extracted a single page from a PDF. The process works on a page-by-page basis and most of the documents we work with contain highly sensitive information, so I had no other option but to do this. Regardless, it is good sample data.

The "pg_0009.ppm" file is the original input fed into Tesseract on the command line, which was converted from the original scanned document by ImageMagick. The "tessinput.tif" file is the image produced by the tessedit_write_images 1 option, which is supposed to be OCR'd by Tesseract. This particular page caused a segfault in Tesseract, something that doesn't usually happen, and I suspect it is because the text is overlapped so many times that the OCR engine has too much to handle.

Google Drive, since it's too large for an attachment: https://drive.google.com/file/d/1UCzXYu7iusep-bOD6EcKyBs2qXCqVdu5/view?usp=sharing

Expected Behavior:
Leptonica leaves the image mostly intact so that Tesseract can provide a proper text layer for the output PDF. Alternatively, a configuration option is available to bypass Leptonica.

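For anyone trying to reproduce this, the diagnostic run described under Current Behavior can be sketched roughly as follows. This is only a sketch under assumptions: the two config values are the ones quoted in the post, written in Tesseract's "name value" config-file format and passed as a trailing config file argument. The helper name build_debug_command and the file name debug.cfg are made up for illustration, and the exact invocation used in production may differ.

```python
import tempfile
from pathlib import Path

# Config values quoted in the post, one "name value" pair per line.
DEBUG_CONFIG = "tessedit_write_images 1\ntessedit_pageseg_mode 0\n"


def build_debug_command(image, output_base, config_path):
    """Write the debug config and return the tesseract argv (hypothetical helper)."""
    Path(config_path).write_text(DEBUG_CONFIG)
    # General CLI shape: tesseract imagename outputbase [configfile...]
    return ["tesseract", str(image), str(output_base), str(config_path)]


if __name__ == "__main__":
    workdir = Path(tempfile.mkdtemp())
    cmd = build_debug_command("pg_0009.ppm", workdir / "out", workdir / "debug.cfg")
    # Only print the command; run it by hand where tesseract and the sample page exist.
    print(" ".join(cmd))
```

Running the printed command next to the sample page should leave the debug image (named "tessinput.tif" in the post) available for comparison against "pg_0009.ppm".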
Any and all help is appreciated with this issue. Thanks for reading.

