Well, yes and no ;-)
"Yes": there should be no change to the image. "No": you should expect
that the PDF renderer may (re)compress the input image. See the comments
on issue 1285 [1] for more details.

[1] https://code.google.com/p/tesseract-ocr/issues/detail?id=1285
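
One way to check whether the pixels survived is to decode both the input
image and the image embedded in the PDF, then compare them. The following is
only a minimal sketch, assuming both images have already been decoded into
rows of 8-bit grayscale values of the same size. The function names, the
threshold and the tolerance are illustrative and not part of tesseract.

```python
from typing import List, Sequence

Raster = Sequence[Sequence[int]]  # rows of 8-bit grayscale pixels (assumed format)


def count_changed_pixels(original: Raster, rendered: Raster,
                         tolerance: int = 0) -> int:
    """Count pixels whose values differ by more than `tolerance`.

    tolerance=0 tests for an exact match; a larger value lets small
    differences from lossy recompression pass.
    """
    if len(original) != len(rendered):
        raise ValueError("images must have the same height")
    changed = 0
    for row_a, row_b in zip(original, rendered):
        if len(row_a) != len(row_b):
            raise ValueError("images must have the same width")
        changed += sum(1 for a, b in zip(row_a, row_b)
                       if abs(a - b) > tolerance)
    return changed


def missing_dark_rows(original: Raster, rendered: Raster,
                      dark: int = 128) -> List[int]:
    """Return indices of rows that contain dark pixels in `original`
    but none in `rendered`, e.g. a horizontal rule that vanished."""
    if len(original) != len(rendered):
        raise ValueError("images must have the same height")
    missing = []
    for y, (row_a, row_b) in enumerate(zip(original, rendered)):
        had_dark = any(p < dark for p in row_a)
        has_dark = any(p < dark for p in row_b)
        if had_dark and not has_dark:
            missing.append(y)
    return missing
```

A changed-pixel count of zero with tolerance=0 means the image is
bit-identical. A non-empty result from missing_dark_rows points at rows
where a line was dropped rather than merely recompressed.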

Zdenko

On Fri, Sep 19, 2014 at 3:14 PM, Frank Siegert <[email protected]> wrote:

> Dear Zdenko,
>
> Thanks for the quick reply!
>
> Does that mean that in general, i.e. apart from this bug, I can assume by
> construction that the image will remain unmodified and only a text layer
> is added?
>
> Cheers,
> Frank
>
>
> On Friday, September 19, 2014 2:54:52 PM UTC+2, zdenop wrote:
>>
>> This is a known issue - try the current code from the git repository. It
>> should be fixed there.
>>
>> Zdenko
>>
>> On Fri, Sep 19, 2014 at 2:38 PM, Frank Siegert <[email protected]>
>> wrote:
>>
>>> Dear all,
>>>
>>> I have been testing tesseract to embed OCR in scanned PDF documents, and
>>> it works phenomenally well in recognizing the text.
>>>
>>> Now I noticed one slightly disturbing issue just by chance when
>>> comparing the original input image and the PDF file: A number of straight
>>> lines that are present in the input image have disappeared completely in
>>> the PDF (some of them are horizontal rules, others are lines in a logo).
>>> Since I wanted to use tesseract to produce completely unmodified documents
>>> with only the OCR text layer added, this would be a problem for me. I have
>>> uploaded a test image for this to
>>> http://cern.ch/fsiegert/tmp/tesseract-test.tif and here is the command I
>>> used on it:
>>>
>>>> $ tesseract -l deu tesseract-test.tif tesseract-test pdf
>>>> Tesseract Open Source OCR Engine v3.03 with Leptonica
>>>> OSD: Weak margin (6.96) for 162 blob text block, but using orientation anyway: 1
>>>> $ tesseract --version
>>>> tesseract 3.03
>>>>  leptonica-1.71
>>>>   libgif 5.1.0 : libjpeg 8d : libpng 1.6.12 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.1
>>>
>>>
>>> This results in http://cern.ch/fsiegert/tmp/tesseract-test.pdf, which
>>> is missing the straight horizontal lines and the ones in the logo. Is this
>>> line-removal done on purpose and can it be disabled?
>>>
>>> Cheers,
>>> Frank
>>>
>>> PS: I have removed much more text from the document for privacy reasons,
>>> but the same happens when the document is complete with text.
>>>
>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at http://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit https://groups.google.com/d/
>>> msgid/tesseract-ocr/9d3455ba-6c17-4c10-bc09-e5ee5b911ad0%
>>> 40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/9d3455ba-6c17-4c10-bc09-e5ee5b911ad0%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
