[tesseract-ocr] Leptonica sometimes mangles images when using PDF output mode

Lucas L. Thu, 28 Mar 2019 11:31:34 -0700

Environment
   
   - Tesseract 4.0.0-beta.3-249-g607e
   - leptonica-1.76.0
   - Linux (hostname removed) 4.18.0-16-generic #17-Ubuntu SMP Fri Feb
     8 00:06:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

I work at a SaaS firm which provides cloud storage services specializing in
documents. As a part of our service, we try to create PDFs with searchable
text layers from scanned documents. When processing PPMs which are created
by ImageMagick from the original document, Leptonica mangles the image
before it can be OCR'd properly by Tesseract. This results in a PDF
unreadable by both human eyes and Tesseract. This only seems to happen for
some specific documents.

How do I know it's Leptonica, specifically?

I have executed Tesseract with the config values tessedit_write_images 1
and tessedit_pageseg_mode 0. From my understanding, the second option does
not enable OCR at all while processing with Tesseract (which speeds up my
test cases) and the first option outputs a .tif debug image which is
apparently what Leptonica feeds to Tesseract after processing. That image
is also mangled.
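
For anyone who wants to reproduce the debug run, the invocation described above can be sketched roughly as follows. This is only a minimal Python sketch, not the exact command used in production: the helper names `build_debug_command` and `run_debug` and the output base name `out` are made up for illustration, and it assumes both config values are passed with Tesseract's `-c VAR=VALUE` option and that the debug image is written as `tessinput.tif` in the working directory.

```python
import shutil
import subprocess
from pathlib import Path


def build_debug_command(image_path, output_base="out"):
    """Build a tesseract command line that sets the two config values.

    Hypothetical helper: "out" is an arbitrary output base name.
    """
    return [
        "tesseract",
        str(image_path),
        output_base,
        "-c", "tessedit_write_images=1",   # write the debug image
        "-c", "tessedit_pageseg_mode=0",   # mode 0, used here to skip full OCR
    ]


def run_debug(image_path, workdir="."):
    """Run tesseract in workdir and return the assumed debug image path."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on PATH")
    # Resolve the input first, because the process runs with cwd=workdir.
    cmd = build_debug_command(Path(image_path).resolve())
    subprocess.run(cmd, cwd=workdir, check=True)
    # Assumption: the debug image lands in the working directory.
    return Path(workdir) / "tessinput.tif"
```

Calling `run_debug("pg_0009.ppm")` would then return the path of the debug image to compare against the input page.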

Sample data

I have extracted a single page from a PDF -- the process works on a
page-by-page basis and most of the documents we work with contain highly
sensitive information, so I had no other option but to do this. Regardless,
it is good sample data. The "pg_0009.ppm" file is the original input fed
into Tesseract on the command line which was converted from the original
scanned document by ImageMagick. The "tessinput.tif" file is the image
produced by the tessedit_write_images 1 option which is supposed to be
OCR'd by Tesseract. This particular page caused a seg fault in Tesseract,
something that doesn't usually happen, and I suspect it is because the text
is overlapped so many times that the OCR engine has too much to handle.

The sample files are on Google Drive since they are too large for an
attachment:
https://drive.google.com/file/d/1UCzXYu7iusep-bOD6EcKyBs2qXCqVdu5/view?usp=sharing

Expected Behavior:

Leptonica leaves the image mostly intact so that Tesseract can provide a
proper text layer for the output PDF. Alternatively, a configuration option
is available to bypass Leptonica.

Any and all help is appreciated with this issue. Thanks for reading.
