Environment
   
   - Tesseract 4.0.0-beta.3-249-g607e
   - leptonica-1.76.0
   - Linux (hostname removed) 4.18.0-16-generic #17-Ubuntu SMP Fri Feb 8 
     00:06:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

I work at a SaaS firm which provides cloud storage services specializing in 
documents. As a part of our service, we try to create PDFs with searchable 
text layers from scanned documents. When processing PPMs which are created 
by ImageMagick from the original document, Leptonica mangles the image 
before it can be OCR'd properly by Tesseract. This results in a PDF 
unreadable by both human eyes and Tesseract. This only seems to happen for 
some specific documents.

How do I know it's Leptonica, specifically?

I ran Tesseract with the config values tessedit_write_images 1 and 
tessedit_pageseg_mode 0. As I understand it, the second option skips OCR 
entirely (which speeds up my test cases), and the first writes a .tif debug 
image, which is apparently what Leptonica hands to Tesseract after 
processing. That image is also mangled.
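
For anyone who wants to reproduce the debug run, here is a minimal Python sketch of the invocation described above. It assumes the two config values are passed on the command line as --psm 0 and -c tessedit_write_images=1 rather than through a config file; the helper names build_debug_command and run_debug are made up for illustration. Page segmentation mode 0 may also need the OSD traineddata to be installed.

```python
import shutil
import subprocess
from pathlib import Path


def build_debug_command(image_path, output_base):
    """Build the argv for a Tesseract run that writes the debug image.

    Assumption: the config values from the post are passed as CLI options
    instead of through a config file.
    """
    return [
        "tesseract",
        str(image_path),
        str(output_base),
        "--psm", "0",                     # tessedit_pageseg_mode 0
        "-c", "tessedit_write_images=1",  # write the preprocessed debug image
    ]


def run_debug(image_path, output_base):
    """Run the command if Tesseract and the input file are both present."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    if not Path(image_path).is_file():
        raise FileNotFoundError(str(image_path))
    # check=False: a crash such as a seg fault shows up as a non-zero
    # return code on the result instead of raising an exception here.
    return subprocess.run(
        build_debug_command(image_path, output_base),
        capture_output=True,
        text=True,
        check=False,
    )
```

Calling run_debug("pg_0009.ppm", "out") and inspecting the returncode and stderr of the result shows whether the run completed or crashed.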

Sample data

I have extracted a single page from a PDF -- the process works on a 
page-by-page basis and most of the documents we work with contain highly 
sensitive information, so I had no other option but to do this. Regardless, 
it is good sample data. The "pg_0009.ppm" file is the input fed to 
Tesseract on the command line; ImageMagick converted it from the original 
scanned document. The "tessinput.tif" file is the image produced by the 
tessedit_write_images 1 option, which is what Tesseract is supposed to 
OCR. This particular page caused a seg fault in Tesseract, something that 
doesn't usually happen, and I suspect it is because the text is overlapped 
so many times that the OCR engine has too much to handle.

The files are on Google Drive since they are too large for an attachment: 
https://drive.google.com/file/d/1UCzXYu7iusep-bOD6EcKyBs2qXCqVdu5/view?usp=sharing

Expected Behavior:

Leptonica leaves the image mostly intact so that Tesseract can provide a 
proper text layer for the output PDF. Alternatively, a configuration option 
is available to bypass Leptonica.

Any and all help is appreciated with this issue. Thanks for reading.
