Dear Zdenko,

Thanks for the quick reply!
Does that mean that in general, i.e. apart from this bug, I can assume by construction that the image will remain unmodified and only a text layer will be added?

Cheers,
Frank

On Friday, September 19, 2014 2:54:52 PM UTC+2, zdenop wrote:
> This is a known issue - try the current code from the git repository. It
> should be fixed.
>
> Zdenko
>
> On Fri, Sep 19, 2014 at 2:38 PM, Frank Siegert <[email protected]> wrote:
>> Dear all,
>>
>> I have been testing tesseract to embed OCR in scanned PDF documents, and
>> it works phenomenally well in recognizing the text.
>>
>> Now I noticed one slightly disturbing issue just by chance when comparing
>> the original input image and the PDF file: a number of straight lines that
>> are present in the input image have disappeared completely in the PDF (some
>> of them are horizontal rules, others are lines in a logo). Since I wanted to
>> use tesseract to produce completely unmodified documents with only the OCR
>> text layer added, this would be a problem for me. I have uploaded a test
>> image for this to http://cern.ch/fsiegert/tmp/tesseract-test.tif and
>> here is the command I used on it:
>>
>>> $ tesseract -l deu tesseract-test.tif tesseract-test pdf
>>> Tesseract Open Source OCR Engine v3.03 with Leptonica
>>> OSD: Weak margin (6.96) for 162 blob text block, but using orientation anyway: 1
>>> $ tesseract --version
>>> tesseract 3.03
>>> leptonica-1.71
>>> libgif 5.1.0 : libjpeg 8d : libpng 1.6.12 : libtiff 4.0.3 : zlib 1.2.8 : libwebp 0.4.1
>>
>> This results in http://cern.ch/fsiegert/tmp/tesseract-test.pdf, which is
>> missing the straight horizontal lines and the ones in the logo. Is this
>> line removal done on purpose, and can it be disabled?
>>
>> Cheers,
>> Frank
>>
>> PS: I have removed much of the text from the document for privacy reasons,
>> but the same happens when the document is complete with text.

