BTW: TIFF with LZW compression produced a smaller PDF than either PNG or JPEG for this specific image.
Zdenko

On Fri, 29 Mar 2019 at 16:39, Lucas L. <[email protected]> wrote:

> Thanks for replying. I will post an issue on Leptonica's board as well. I
> was not sure if it was an issue with Leptonica itself or merely a
> configuration/parameter issue in the way that Tesseract calls it.
>
> @zdenop, thanks so much for letting me know that the .TIF format works on
> your end. The service I am working on is supposed to make a first pass
> using a compressed image format, then try again with PPM only if it fails.
> It would appear that there is a code issue with my service and the first
> pass is failing when it shouldn't. I was also able to get the image to
> process correctly (with a nicely-read OCR layer, no less) by calling
> ImageMagick and then Tesseract from the command line. I just took ownership
> of this service so I do not know it by heart. Regardless, it is good that I
> was able to discover a possible issue related to PPM processing with
> Leptonica.
>
> On Thursday, March 28, 2019 at 1:31:36 PM UTC-5, Lucas L. wrote:
>>
>> Environment
>>
>> - Tesseract 4.0.0-beta.3-249-g607e
>> - leptonica-1.76.0
>> - Linux (hostname removed) 4.18.0-16-generic #17-Ubuntu SMP Fri
>> Feb 8 00:06:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>>
>> Current Behavior:
>>
>> I work at a SaaS firm which provides cloud storage services specializing
>> in documents. As a part of our service, we try to create PDFs with
>> searchable text layers from scanned documents. When processing PPMs which
>> are created by ImageMagick from the original document, Leptonica mangles
>> the image before it can be OCR'd properly by Tesseract. This results in a
>> PDF unreadable by both human eyes and Tesseract. This only seems to happen
>> for some specific documents.
>>
>> How do I know it's Leptonica, specifically?
>>
>> I have executed Tesseract with the config values tessedit_write_images 1
>> and tessedit_pageseg_mode 0. From my understanding, the second option
>> does not enable OCR at all while processing with Tesseract (which speeds up
>> my test cases) and the first option outputs a .tif debug image which is
>> apparently what Leptonica feeds to Tesseract after processing. That image
>> is also mangled.
>>
>> Sample data
>>
>> I have extracted a single page from a PDF -- the process works on a
>> page-by-page basis and most of the documents we work with contain highly
>> sensitive information, so I had no other option but to do this. Regardless,
>> it is good sample data. The "pg_0009.ppm" file is the original input fed
>> into Tesseract on the command line which was converted from the original
>> scanned document by ImageMagick. The "tessinput.tif" file is the image
>> produced by the tessedit_write_images 1 option which is supposed to be
>> OCR'd by Tesseract. This particular page caused a seg fault in Tesseract,
>> something that doesn't usually happen, and I suspect it is because the text
>> is overlapped so many times that the OCR engine has too much to handle.
>>
>> Google Drive since it's too large for an attachment:
>> https://drive.google.com/file/d/1UCzXYu7iusep-bOD6EcKyBs2qXCqVdu5/view?usp=sharing
>>
>> Expected Behavior:
>>
>> Leptonica leaves the image mostly intact so that Tesseract can provide a
>> proper text layer for the output PDF. Alternatively, a configuration option
>> is available to bypass Leptonica.
>>
>> Any and all help is appreciated with this issue. Thanks for reading.
>>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/1a15cfe1-54b0-4bec-a551-4627a79e8b9d%40googlegroups.com
> For more options, visit https://groups.google.com/d/optout.
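
For readers following the thread, the two-pass approach described in the quoted message (compressed format first, PPM only if that fails) could be sketched roughly as below. This is an illustration under assumptions, not the poster's actual service: the function names, the `run` parameter, and the output layout are hypothetical. It assumes ImageMagick's `convert` and the `tesseract` binary are on the PATH, and that passing `-c tessedit_write_images=1` together with `--psm 0` on the command line behaves like the config entries mentioned in the report.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Sequence

# A runner takes an argv list and returns the process exit code.
Runner = Callable[[Sequence[str]], int]


def default_runner(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def convert_cmd(src: Path, dst: Path) -> List[str]:
    """ImageMagick call that re-encodes src as an LZW-compressed TIFF."""
    return ["convert", str(src), "-compress", "lzw", str(dst)]


def tesseract_pdf_cmd(image: Path, out_base: Path) -> List[str]:
    """Tesseract call that writes <out_base>.pdf with a text layer."""
    return ["tesseract", str(image), str(out_base), "pdf"]


def tesseract_debug_cmd(image: Path, out_base: Path) -> List[str]:
    """Diagnostic call: dump the preprocessed image, OSD-only page mode.

    Assumption: these flags mirror the config values named in the report.
    """
    return ["tesseract", str(image), str(out_base),
            "-c", "tessedit_write_images=1", "--psm", "0"]


def ocr_with_fallback(ppm: Path, workdir: Path,
                      run: Runner = default_runner) -> Path:
    """Try a compressed TIFF first; fall back to the original PPM."""
    tif = workdir / (ppm.stem + ".tif")
    out_base = workdir / ppm.stem
    pdf = Path(str(out_base) + ".pdf")

    # First pass: compressed format.
    if run(convert_cmd(ppm, tif)) == 0 and \
            run(tesseract_pdf_cmd(tif, out_base)) == 0:
        return pdf

    # Second pass: only reached if the first pass failed.
    if run(tesseract_pdf_cmd(ppm, out_base)) == 0:
        return pdf

    raise RuntimeError(f"both OCR passes failed for {ppm}")
```

Injecting the runner keeps the fallback order testable without either tool installed; a real service would also want to capture stderr and treat a crash (such as the seg fault mentioned above) as a failed pass.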

