Dear Zdenko,

Thanks for the quick reply!

Does that mean that in general, i.e. apart from this bug, I can assume by 
construction that the image will remain unmodified and only a text layer will 
be added?

Cheers,
Frank


On Friday, September 19, 2014 2:54:52 PM UTC+2, zdenop wrote:
>
> This is a known issue - try the current code from the git repository. It 
> should be fixed there.
>
> Zdenko
>
> On Fri, Sep 19, 2014 at 2:38 PM, Frank Siegert <[email protected]> wrote:
>
>> Dear all,
>>
>> I have been testing tesseract to embed OCR in scanned PDF documents, and 
>> it works phenomenally well in recognizing the text.
>>
>> Now I noticed one slightly disturbing issue just by chance when comparing 
>> the original input image and the PDF file: A number of straight lines that 
>> are present in the input image have disappeared completely in the PDF (some 
>> of them are horizontal rules, others are lines in a logo). Since I wanted to 
>> use tesseract to produce completely unmodified documents with only the OCR 
>> text layer added, this would be a problem for me. I have uploaded a test 
>> image for this to http://cern.ch/fsiegert/tmp/tesseract-test.tif and 
>> here is the command I used on it:
>>
>>> $ tesseract -l deu tesseract-test.tif tesseract-test pdf
>>> Tesseract Open Source OCR Engine v3.03 with Leptonica
>>> OSD: Weak margin (6.96) for 162 blob text block, but using orientation anyway: 1
>>> $ tesseract --version
>>> tesseract 3.03
>>>  leptonica-1.71
>>>   libgif 5.1.0 : libjpeg 8d : libpng 1.6.12 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.1
>>
>>
>> This results in http://cern.ch/fsiegert/tmp/tesseract-test.pdf, which is 
>> missing the straight horizontal lines and the ones in the logo. Is this 
>> line-removal done on purpose and can it be disabled?
>>
>> Cheers,
>> Frank
>>
>> PS: I have removed much more text from the document for privacy reasons, 
>> but the same happens when the document is complete with text.
>>
>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/9d3455ba-6c17-4c10-bc09-e5ee5b911ad0%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

