Re: [tesseract-ocr] Re: Leptonica sometimes mangles images when using PDF output mode

Lucas L. Fri, 29 Mar 2019 12:25:29 -0700

Well yes, that's because I changed it. It's a config file. Config files are 
designed to be changed.
I find your suggestion strange because specifying the page seg mode is 
exactly what I did in my config. Then you told me I shouldn't have changed 
my config.
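
For anyone comparing the two approaches discussed here, the page segmentation mode can also be passed on the command line instead of through the config file. The sketch below is an illustration only, not code from the service: `build_tesseract_cmd` is a hypothetical helper, and it assumes a Tesseract 4.x binary that accepts `--psm` and `-c VAR=VALUE`. It assembles the same call as the `tesseract -l eng ... pdf` command quoted below, with the mode made explicit.

```python
import shlex


def build_tesseract_cmd(image_path, output_base, lang="eng", psm=3, config_vars=None):
    """Return an argv list for a Tesseract PDF run with an explicit page seg mode.

    Hypothetical helper for illustration; the real service code is not shown
    in this thread.
    """
    cmd = ["tesseract", image_path, output_base, "-l", lang, "--psm", str(psm)]
    # Each extra config variable is passed as its own -c NAME=VALUE pair.
    for name, value in (config_vars or {}).items():
        cmd += ["-c", f"{name}={value}"]
    # The config file name ("pdf") goes last on the command line.
    cmd.append("pdf")
    return cmd


cmd = build_tesseract_cmd(
    "/path/pg_0010.tif",
    "/path/pg_0010",
    config_vars={"tessedit_write_images": "1"},
)
print(shlex.join(cmd))
# prints: tesseract /path/pg_0010.tif /path/pg_0010 -l eng --psm 3 -c tessedit_write_images=1 pdf
```

Returning a list rather than a single string keeps the arguments safe to hand to `subprocess.run` without a shell, so paths with spaces need no extra quoting.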


On Friday, March 29, 2019 at 12:04:17 PM UTC-5, shree wrote:
>
> https://github.com/tesseract-ocr/tesseract/blob/master/tessdata/configs/pdf
>
> This is different from your installed version.
>
> On Fri, 29 Mar 2019, 22:30 Shree Devi Kumar, <[email protected]> wrote:
>
>> The default page segmentation mode is different for command line and api. 
>> Specify it explicitly and test.
>>
>> On Fri, 29 Mar 2019, 22:12 Lucas L., <[email protected]> wrote:
>>
>>> OK, I am running up against another issue, and it's getting weirder. 
>>> Since Tesseract does not take PDFs as input, this service does the deed of 
>>> breaking a PDF into pages, and then converting each of those pages to an 
>>> image format (either lzw-compressed TIFF or uncompressed PPM if that 
>>> fails). Somehow, if I run ImageMagick and then Tesseract on these pages 
>>> individually from a command line using the same parameters in the service 
>>> code, it runs fine processing the TIFF. But when the service runs, I get: 
>>> Error in fopenWriteStream: stream not opened 
>>> Error in pixWrite: stream not opened
>>>
>>> And the output pdf has all of the pages and they are not mangled... 
>>> however they are shrunk into a tiny corner of the page. I have attached the 
>>> resulting file. I feel that it is obvious from the fact that it works when 
>>> I run it outside the service that it is a code issue... however I really am 
>>> not sure what it could be doing differently from my command line. The pages 
>>> come out looking great when I run tesseract on the individual pages 
>>> manually. The errors do not appear when I run the command lines manually.
>>>
>>> The command lines and params I am using:
>>>
>>> Convert the input PDF (which is scanned and has no OCR layer) to input 
>>> image:
>>> convert -depth 16 -density 300 -colorspace RGB -despeckle -flatten -compress lzw -background white -alpha off "/path/pg_0010.pdf" "/path/pg_0010.tif"
>>> Process the input image for OCR and output to PDF:
>>> tesseract -l eng "/path/pg_0010.tif" "/path/pg_0010" pdf
>>>
>>> Configuration parameters from /usr/share/tesseract-ocr/4.00/tessdata/configs:
>>> tessedit_create_pdf 1
>>> tessedit_pageseg_mode 3
>>> tessedit_write_images true
>>>
>>>
>>> On Thursday, March 28, 2019 at 1:31:36 PM UTC-5, Lucas L. wrote:
>>>>
>>>> Environment
>>>>    
>>>>    - Tesseract 4.0.0-beta.3-249-g607e
>>>>    - leptonica-1.76.0
>>>>    - Linux (hostname removed) 4.18.0-16-generic #17-Ubuntu SMP Fri 
>>>>    Feb 8 00:06:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>>>>
>>>> Current Behavior:
>>>>
>>>> I work at a SaaS firm which provides cloud storage services 
>>>> specializing in documents. As a part of our service, we try to create PDFs 
>>>> with searchable text layers from scanned documents. When processing PPMs 
>>>> which are created by ImageMagick from the original document, Leptonica 
>>>> mangles the image before it can be OCR'd properly by Tesseract. This 
>>>> results in a PDF unreadable by both human eyes and Tesseract. This only 
>>>> seems to happen for some specific documents.
>>>>
>>>> How do I know it's Leptonica, specifically?
>>>>
>>>> I have executed Tesseract with the config values tessedit_write_images 
>>>> 1 and tessedit_pageseg_mode 0. From my understanding, the second 
>>>> option does not enable OCR at all while processing with Tesseract (which 
>>>> speeds up my test cases) and the first option outputs a .tif debug image 
>>>> which is apparently what Leptonica feeds to Tesseract after processing. 
>>>> That image is also mangled.
>>>>
>>>> Sample data
>>>>
>>>> I have extracted a single page from a PDF -- the process works on a 
>>>> page-by-page basis and most of the documents we work with contain highly 
>>>> sensitive information, so I had no other option but to do this. 
>>>> Regardless, it is good sample data. The "pg_0009.ppm" file is the 
>>>> original input fed 
>>>> into Tesseract on the command line which was converted from the original 
>>>> scanned document by ImageMagick. The "tessinput.tif" file is the image 
>>>> produced by the tessedit_write_images 1 option which is supposed to be 
>>>> OCR'd by Tesseract. This particular page caused a seg fault in Tesseract, 
>>>> something that doesn't usually happen, and I suspect it is because the 
>>>> text is overlapped so many times that the OCR engine has too much to 
>>>> handle.
>>>>
>>>> Google Drive since it's too large for an attachment: 
>>>> https://drive.google.com/file/d/1UCzXYu7iusep-bOD6EcKyBs2qXCqVdu5/view?usp=sharing
>>>>
>>>> Expected Behavior:
>>>>
>>>> Leptonica leaves the image mostly intact so that Tesseract can provide 
>>>> a proper text layer for the output PDF. Alternatively, a configuration 
>>>> option is available to bypass Leptonica.
>>>>
>>>> Any and all help is appreciated with this issue. Thanks for reading.
>>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/4ac58e80-fd54-49f4-b479-3a33f5ca5388%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
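
One possible explanation for the `fopenWriteStream` / `pixWrite` errors quoted above, offered as an assumption rather than a confirmed diagnosis: with `tessedit_write_images` enabled, Tesseract writes its debug image (`tessinput.tif`) relative to the current working directory, and a service process may start in a directory it cannot write to. A minimal way to test that theory is to launch the same command with a writable working directory. `run_in_scratch_dir` below is a hypothetical helper, not part of the service.

```python
import subprocess
import tempfile


def run_in_scratch_dir(cmd):
    """Run cmd with a fresh, writable temp directory as its working directory.

    Returns (returncode, stderr). Hypothetical helper for illustration only.
    Anything the child writes with a relative path (such as a debug image)
    lands in the scratch directory and is deleted when the call returns.
    """
    with tempfile.TemporaryDirectory() as scratch:
        result = subprocess.run(cmd, cwd=scratch, capture_output=True, text=True)
        return result.returncode, result.stderr


# Example call (not executed here); input and output paths should be absolute,
# because the working directory is no longer the caller's:
# rc, err = run_in_scratch_dir(
#     ["tesseract", "-l", "eng", "/path/pg_0010.tif", "/path/pg_0010", "pdf"]
# )
```

If the "stream not opened" lines disappear from `stderr` when the command runs this way, the working directory is the likely culprit; if they persist, the cause lies elsewhere.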

