Well, yes and no ;-) "Yes": there should be no change to the image. But "no": you should expect that the PDF renderer may (re)compress the input image. See the comments on issue 1285 [1] for more details.

[1] https://code.google.com/p/tesseract-ocr/issues/detail?id=1285

Zdenko

On Fri, Sep 19, 2014 at 3:14 PM, Frank Siegert <[email protected]> wrote:

> Dear Zdenko,
>
> Thanks for the quick reply!
>
> Does that mean in general, i.e. except for this bug, that I can by
> construction assume the image will remain unmodified and only a text layer
> added?
>
> Cheers,
> Frank
>
> On Friday, September 19, 2014 2:54:52 PM UTC+2, zdenop wrote:
>>
>> This is a known issue; try the current code from the git repository. It
>> should be fixed.
>>
>> Zdenko
>>
>> On Fri, Sep 19, 2014 at 2:38 PM, Frank Siegert <[email protected]> wrote:
>>
>>> Dear all,
>>>
>>> I have been testing tesseract to embed OCR in scanned PDF documents, and
>>> it works phenomenally well in recognizing the text.
>>>
>>> Now I noticed one slightly disturbing issue just by chance when
>>> comparing the original input image and the PDF file: A number of straight
>>> lines that are present in the input image have disappeared completely in
>>> the PDF (some of them are horizontal rules, others are lines in a logo).
>>> Since I wanted to use tesseract to produce completely unmodified documents
>>> with only the OCR text layer added, this would be a problem for me. I have
>>> uploaded a test image for this to
>>> http://cern.ch/fsiegert/tmp/tesseract-test.tif and here is the command I
>>> used on it:
>>>
>>>> $ tesseract -l deu tesseract-test.tif tesseract-test pdf
>>>> Tesseract Open Source OCR Engine v3.03 with Leptonica
>>>> OSD: Weak margin (6.96) for 162 blob text block, but using orientation anyway: 1
>>>> $ tesseract --version
>>>> tesseract 3.03
>>>> leptonica-1.71
>>>> libgif 5.1.0 : libjpeg 8d : libpng 1.6.12 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.1
>>>
>>> This results in http://cern.ch/fsiegert/tmp/tesseract-test.pdf, which
>>> is missing the straight horizontal lines and the ones in the logo. Is this
>>> line-removal done on purpose and can it be disabled?
>>>
>>> Cheers,
>>> Frank
>>>
>>> PS: I have removed much more text from the document for privacy reasons,
>>> but the same happens when the document is complete with text.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8wwxCfjOwo729LhT_wtOUJbx7DmqVfvcMkF27bO5dFjQQ%40mail.gmail.com. For more options, visit https://groups.google.com/d/optout.

