Command line:
> # the convert command (part of ImageMagick) creates a clean, losslessly compressed image 1.png
> # if you already have a PNG with characters and digits in it, you do not need the following command:
> convert -density 300x300 -depth 8 1.pdf 1.png
>
> # Tesseract is then called and creates a mixed-mode PDF with the filename "1.png.pdf"
> # this output shows coding artefacts between the characters and digits if you enlarge the view
> # I can supply you with images (on request)
> tesseract -l eng 1.png 1.png pdf
>
> On Monday, 28 July 2014 09:52:50 UTC+2, Tom wrote:
>
> Using the PDF-OCR option I noticed that the Tesseract-generated mixed-mode
> PDFs (original image PDF plus OCR-ed text) show coding artefacts which were
> not present in the input image files (I use ImageMagick convert to render
> one image (PNG or BMP) per PDF input page).
>
> So I propose to change Tesseract's PDF-OCR mode:
>
> - do not use lossy compression
> - use lossless compression (PNG)
>
> when rendering the final mixed-mode PDF output files.
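
To check whether the page image in the output PDF really is stored with a lossy codec, one rough approach is to list the stream filters named in the file. In the PDF format, /DCTDecode means JPEG (lossy), while /FlateDecode is deflate compression (lossless). The helper below is only a sketch: the function name pdf_stream_filters is made up for this example and is not part of Tesseract or ImageMagick. It also assumes that the object dictionaries in the PDF are written as plain text (not packed into compressed object streams) and that your grep supports the -a and -o options.

```shell
# Hypothetical helper (not part of Tesseract or ImageMagick):
# print each stream filter name found in a PDF, with a count per name.
# Assumption: the PDF's object dictionaries are stored uncompressed.
pdf_stream_filters() {
    # -a: treat the binary file as text; -o: print only the matching part.
    # LC_ALL=C avoids locale problems with arbitrary binary bytes.
    LC_ALL=C grep -a -o -E '/Filter[[:space:]]*\[?[[:space:]]*/[A-Za-z0-9]+' "$1" \
        | sed -E 's@.*/@@' \
        | sort | uniq -c
}

# Usage (file name taken from the tesseract command above):
# pdf_stream_filters 1.png.pdf
```

If DCTDecode shows up in the list, at least one stream in the file is JPEG-encoded, which would fit the artefacts described above. If only FlateDecode appears, the artefacts probably have another cause.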

