[tesseract-ocr] Re: Leptonica sometimes mangles images when using PDF output mode

Lucas L. Fri, 29 Mar 2019 12:29:31 -0700

Also, please see this issue regarding the use of the default page 
segmentation mode for PDFs: 
https://github.com/tesseract-ocr/tesseract/issues/1916


On Thursday, March 28, 2019 at 1:31:36 PM UTC-5, Lucas L. wrote:
>
> Environment
>    
>    - Tesseract 4.0.0-beta.3-249-g607e
>    - leptonica-1.76.0
>    - Linux (hostname removed) 4.18.0-16-generic #17-Ubuntu SMP Fri 
>    Feb 8 00:06:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>
> Current Behavior:
>
> I work at a SaaS firm which provides cloud storage services specializing 
> in documents. As a part of our service, we try to create PDFs with 
> searchable text layers from scanned documents. When processing PPMs which 
> are created by ImageMagick from the original document, Leptonica mangles 
> the image before it can be OCR'd properly by Tesseract. This results in a 
> PDF unreadable by both human eyes and Tesseract. This only seems to happen 
> for some specific documents.
>
> How do I know it's Leptonica, specifically?
>
> I have executed Tesseract with the config values tessedit_write_images 1 
> and tessedit_pageseg_mode 0. From my understanding, the second option 
> (page segmentation mode 0, orientation and script detection only) skips 
> OCR entirely, which speeds up my test cases, and the first option writes 
> a .tif debug image which is apparently what Leptonica feeds to Tesseract 
> after processing. That image is also mangled.
>
> Sample data
>
> I have extracted a single page from a PDF -- the process works on a 
> page-by-page basis and most of the documents we work with contain highly 
> sensitive information, so I had no other option but to do this. Regardless, 
> it is good sample data. The "pg_0009.ppm" file is the original input fed 
> into Tesseract on the command line which was converted from the original 
> scanned document by ImageMagick. The "tessinput.tif" file is the image 
> produced by the tessedit_write_images 1 option which is supposed to be 
> OCR'd by Tesseract. This particular page caused a seg fault in Tesseract, 
> something that doesn't usually happen, and I suspect it is because the text 
> is overlapped so many times that the OCR engine has too much to handle.
>
> The files are on Google Drive because they are too large to attach: 
> https://drive.google.com/file/d/1UCzXYu7iusep-bOD6EcKyBs2qXCqVdu5/view?usp=sharing
>
> Expected Behavior:
>
> Leptonica leaves the image mostly intact so that Tesseract can provide a 
> proper text layer for the output PDF. Alternatively, a configuration option 
> is available to bypass Leptonica.
>
> Any and all help is appreciated with this issue. Thanks for reading.
>
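
The debug run described in the quoted message (tessedit_write_images 1 with 
tessedit_pageseg_mode 0) can be sketched as a command-line invocation. This 
is a minimal sketch and not necessarily the exact command the original 
poster used: the helper name build_debug_command and the output base 
"debug_out" are hypothetical, and it assumes a tesseract binary on PATH 
that accepts the --psm and -c VAR=VALUE options.

```python
import shlex


def build_debug_command(image_path, output_base):
    """Build a tesseract command that dumps the preprocessed input image.

    Hypothetical helper for illustration; it only assembles the argument
    list and does not run anything.
    """
    return [
        "tesseract",
        image_path,       # e.g. the PPM produced by ImageMagick
        output_base,      # base name for tesseract's own output files
        "--psm", "0",     # page segmentation mode 0: OSD only, no OCR
        "-c", "tessedit_write_images=1",  # ask for the debug image (tessinput.tif)
    ]


if __name__ == "__main__":
    cmd = build_debug_command("pg_0009.ppm", "debug_out")
    # Print a shell-safe version of the command for copy and paste.
    print(shlex.join(cmd))
```

To execute it, the list can be passed to subprocess.run(cmd, check=True). 
Comparing the resulting tessinput.tif against the input PPM then shows 
whether the image is already damaged before recognition starts.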

