Thanks for the report. Can you use another image format for input? The problem seems to be related to the PNM format: after converting your image to TIFF/JPG/PNG, the PDF output looks correct.
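In case it is useful, here is a minimal sketch of that workaround in Python. It assumes ImageMagick's `convert` and the `tesseract` binary are both on your PATH. The helper names `build_commands` and `run` are made up for illustration, and the file names are the ones from your report. It only converts the PPM to another format and then asks Tesseract for PDF output; it does not address the underlying PNM handling.

```python
import subprocess
from pathlib import Path


def build_commands(ppm_path, out_base, fmt="png"):
    """Return two command lines: ImageMagick conversion, then Tesseract.

    Hypothetical helper: it only builds the argument lists, it runs nothing.
    """
    src = Path(ppm_path)
    # Same name, different extension, e.g. pg_0009.ppm -> pg_0009.png
    converted = src.with_suffix("." + fmt)
    convert_cmd = ["convert", str(src), str(converted)]
    # The trailing "pdf" config makes Tesseract write <out_base>.pdf
    tesseract_cmd = ["tesseract", str(converted), str(out_base), "pdf"]
    return convert_cmd, tesseract_cmd


def run(ppm_path, out_base, fmt="png"):
    """Run the conversion and then Tesseract, stopping at the first failure."""
    for cmd in build_commands(ppm_path, out_base, fmt):
        subprocess.run(cmd, check=True)
```

Calling `run("pg_0009.ppm", "pg_0009")` would be expected to leave `pg_0009.png` and `pg_0009.pdf` next to the input. If ImageMagick is already producing the PPM from the original scan, writing PNG or TIFF directly at that step would avoid the extra conversion.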
Zdenko

On Thu, 28 Mar 2019 at 19:31, Lucas L. <[email protected]> wrote:

> Environment
>
> - Tesseract 4.0.0-beta.3-249-g607e
> - leptonica-1.76.0
> - Linux (hostname removed) 4.18.0-16-generic #17-Ubuntu SMP Fri
>   Feb 8 00:06:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>
> Current Behavior:
>
> I work at a SaaS firm which provides cloud storage services specializing
> in documents. As a part of our service, we try to create PDFs with
> searchable text layers from scanned documents. When processing PPMs which
> are created by ImageMagick from the original document, Leptonica mangles
> the image before it can be OCR'd properly by Tesseract. This results in a
> PDF unreadable by both human eyes and Tesseract. This only seems to happen
> for some specific documents.
>
> How do I know it's Leptonica, specifically?
>
> I have executed Tesseract with the config values tessedit_write_images 1
> and tessedit_pageseg_mode 0. From my understanding, the second option
> does not enable OCR at all while processing with Tesseract (which speeds up
> my test cases) and the first option outputs a .tif debug image which is
> apparently what Leptonica feeds to Tesseract after processing. That image
> is also mangled.
>
> Sample data
>
> I have extracted a single page from a PDF -- the process works on a
> page-by-page basis and most of the documents we work with contain highly
> sensitive information, so I had no other option but to do this. Regardless,
> it is good sample data. The "pg_0009.ppm" file is the original input fed
> into Tesseract on the command line which was converted from the original
> scanned document by ImageMagick. The "tessinput.tif" file is the image
> produced by the tessedit_write_images 1 option which is supposed to be
> OCR'd by Tesseract. This particular page caused a seg fault in Tesseract,
> something that doesn't usually happen, and I suspect it is because the text
> is overlapped so many times that the OCR engine has too much to handle.
>
> Google Drive since it's too large for an attachment:
> https://drive.google.com/file/d/1UCzXYu7iusep-bOD6EcKyBs2qXCqVdu5/view?usp=sharing
>
> Expected Behavior:
>
> Leptonica leaves the image mostly intact so that Tesseract can provide a
> proper text layer for the output PDF. Alternatively, a configuration option
> is available to bypass Leptonica.
>
> Any and all help is appreciated with this issue. Thanks for reading.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8wZt8A%3DRjHSK9GPNUE%3DqTaiRSygCsPmKciUT5v3%2BTYtAg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

