Re: [tesseract-ocr] Re: Leptonica sometimes mangles images when using PDF output mode

Lucas L. Fri, 29 Mar 2019 13:05:04 -0700

OK, I appreciate the suggestion and clarification, but the aptitude package 
manager doesn't seem to offer a later version than the one I have now. 
I suppose I should build it from source, but your own page for installing 
from source suggests trying aptitude first. This is what aptitude reports:
tesseract-ocr is already the newest version (4.00~git2844-607e8fd8-2).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.


Also, how could there be a permissions issue when the PDF is created, just 
not sized correctly? I would expect the PDF to not be created at all if 
that were the case.
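
One way to test the permissions theory would be to check what the service 
process can actually write to. The sketch below is only an assumption about 
the cause: it supposes that the tessinput.tif debug image from 
tessedit_write_images is written relative to the process's working 
directory, which may differ between the service and an interactive shell. 
The function name check_output_dir and the 50 MB threshold are hypothetical, 
chosen for illustration.

```python
import os
import shutil
import tempfile


def check_output_dir(path, min_free_bytes=50 * 1024 * 1024):
    """Return a list of problems that could stop a process writing into path.

    Hypothetical helper for illustration; not part of Tesseract or Leptonica.
    """
    if not os.path.isdir(path):
        return [f"{path} is not a directory"]

    problems = []
    try:
        # Create a real file rather than trusting os.access(), which can be
        # misleading on read-only mounts or with ACLs.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        problems.append(f"cannot create files in {path}: {exc}")

    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        problems.append(f"only {free} bytes free in {path}")
    return problems


if __name__ == "__main__":
    # Run this from inside the service, as the service user, so it sees the
    # same working directory and privileges that tesseract would inherit.
    cwd = os.getcwd()
    print("working directory:", cwd)
    for line in check_output_dir(cwd) or ["no problems found"]:
        print(line)
```

If this reports a problem only when run from the service, that would be 
consistent with the PDF being written to its explicit output path while a 
side file fails to open, but it would not by itself explain the shrunken 
pages.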

On Friday, March 29, 2019 at 2:52:28 PM UTC-5, zdenop wrote:
>
> First of all: upgrade to the latest tesseract code. A lot of fixes were 
> implemented in the meantime.
>
> Next: "Error in fopenWriteStream" indicates a problem with writing the 
> file. Check privileges, disk space, etc. Then try another output format 
> (jpeg, png) to see if it helps.
>
> Zdenko
>
>
> On Fri, 29 Mar 2019 at 17:42, Lucas L. <[email protected]> wrote:
>
>> OK, I am running up against another issue, and it's getting weirder. 
>> Since Tesseract does not take PDFs as input, this service does the deed of 
>> breaking a PDF into pages, and then converting each of those pages to an 
>> image format (either lzw-compressed TIFF or uncompressed PPM if that 
>> fails). Somehow, if I run ImageMagick and then Tesseract on these pages 
>> individually from a command line using the same parameters in the service 
>> code, it runs fine processing the TIFF. But when the service runs, I get: 
>> Error in fopenWriteStream: stream not opened 
>> Error in pixWrite: stream not opened
>>
>> And the output pdf has all of the pages and they are not mangled... 
>> however they are shrunk into a tiny corner of the page. I have attached the 
>> resulting file. I feel that it is obvious from the fact that it works when 
>> I run it outside the service that it is a code issue... however I really am 
>> not sure what it could be doing differently from my command line. The pages 
>> come out looking great when I run tesseract on the individual pages 
>> manually. The errors do not appear when I run the command lines manually.
>>
>> The command lines and params I am using:
>>
>> Convert the input PDF (which is scanned and has no OCR layer) to input 
>> image:
>> convert -depth 16 -density 300 -colorspace RGB -despeckle -flatten -compress 
>> lzw -background white -alpha off "/path/pg_0010.pdf" "/path/pg_0010.tif"
>> Process the input image for OCR and output to PDF:
>> tesseract -l eng "/path/pg_0010.tif" "/path/pg_0010" pdf
>>
>> Configuration parameters from /usr/share/tesseract-ocr/4.00/tessdata/configs
>> tessedit_create_pdf 1
>> tessedit_pageseg_mode 3
>> tessedit_write_images true
>>
>>
>> On Thursday, March 28, 2019 at 1:31:36 PM UTC-5, Lucas L. wrote:
>>>
>>> Environment
>>>    
>>>    - Tesseract 4.0.0-beta.3-249-g607e
>>>    - leptonica-1.76.0
>>>    - Linux (hostname removed) 4.18.0-16-generic #17-Ubuntu SMP Fri 
>>>    Feb 8 00:06:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> Current Behavior:
>>>
>>> I work at a SaaS firm which provides cloud storage services specializing 
>>> in documents. As a part of our service, we try to create PDFs with 
>>> searchable text layers from scanned documents. When processing PPMs which 
>>> are created by ImageMagick from the original document, Leptonica mangles 
>>> the image before it can be OCR'd properly by Tesseract. This results in a 
>>> PDF unreadable by both human eyes and Tesseract. This only seems to happen 
>>> for some specific documents.
>>>
>>> How do I know it's Leptonica, specifically?
>>>
>>> I have executed Tesseract with the config values tessedit_write_images 1 
>>> and tessedit_pageseg_mode 0. From my understanding, the second option 
>>> does not enable OCR at all while processing with Tesseract (which speeds up 
>>> my test cases) and the first option outputs a .tif debug image which is 
>>> apparently what Leptonica feeds to Tesseract after processing. That image 
>>> is also mangled.
>>>
>>> Sample data
>>>
>>> I have extracted a single page from a PDF -- the process works on a 
>>> page-by-page basis and most of the documents we work with contain highly 
>>> sensitive information, so I had no other option but to do this. Regardless, 
>>> it is good sample data. The "pg_0009.ppm" file is the original input fed 
>>> into Tesseract on the command line which was converted from the original 
>>> scanned document by ImageMagick. The "tessinput.tif" file is the image 
>>> produced by the tessedit_write_images 1 option which is supposed to be 
>>> OCR'd by Tesseract. This particular page caused a seg fault in Tesseract, 
>>> something that doesn't usually happen, and I suspect it is because the text 
>>> is overlapped so many times that the OCR engine has too much to handle.
>>>
>>> Google Drive since it's too large for an attachment: 
>>> https://drive.google.com/file/d/1UCzXYu7iusep-bOD6EcKyBs2qXCqVdu5/view?usp=sharing
>>>
>>> Expected Behavior:
>>>
>>> Leptonica leaves the image mostly intact so that Tesseract can provide a 
>>> proper text layer for the output PDF. Alternatively, a configuration option 
>>> is available to bypass Leptonica.
>>>
>>> Any and all help is appreciated with this issue. Thanks for reading.
>>>
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/19bd0eea-13cc-4cf2-87a7-7f113307fa72%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
