Hi,
I noticed that when I use Tesseract to create a searchable PDF (I use
pdfsandwich for this), some characters are not displayed and are replaced
by blank spaces instead. However, if I OCR the same file with Tesseract
only to obtain plain text (I use OCRfeeder), everything is
recognized AND displayed properly. It seems as if Tesseract has trouble
exporting some characters specifically to PDF, even though it is obviously
capable of recognizing them. This happens with quotation marks and ligatures
("Th", "ff", etc.), but also, for example, with some special Czech
characters such as "ě", "č", or "š" (even when the option "-l ces" is
set). Does anybody have an idea what could be wrong?
I've been trying to find out whether anyone else has had the same issue, but
I could not find any relevant forum thread (yet).
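
A minimal sketch of how the comparison could be reproduced without the
wrappers (pdfsandwich and OCRfeeder), in case it helps narrow things down.
It assumes that the tesseract command-line tool and pdftotext (from
poppler-utils) are on the PATH. The file names and the helper functions
build_commands and missing_chars are made up for illustration; they are not
part of any of these tools.

```python
import subprocess
import sys
from collections import Counter


def build_commands(image, lang="ces"):
    """Return the three command lines used for the comparison.

    Output base names (ocr_plain, ocr_pdf) are arbitrary placeholders.
    """
    return [
        # Plain-text OCR: tesseract writes ocr_plain.txt by default.
        ["tesseract", image, "ocr_plain", "-l", lang],
        # Searchable PDF: the "pdf" config makes tesseract write ocr_pdf.pdf.
        ["tesseract", image, "ocr_pdf", "-l", lang, "pdf"],
        # Extract the text from the PDF so it can be compared.
        ["pdftotext", "ocr_pdf.pdf", "ocr_pdf.txt"],
    ]


def missing_chars(plain_text, pdf_text):
    """Count non-whitespace characters present in plain_text but absent
    (or less frequent) in pdf_text."""
    plain = Counter(c for c in plain_text if not c.isspace())
    pdf = Counter(c for c in pdf_text if not c.isspace())
    # Counter subtraction keeps only positive differences.
    return dict(plain - pdf)


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: compare_ocr.py IMAGE")
    for cmd in build_commands(sys.argv[1]):
        subprocess.run(cmd, check=True)
    with open("ocr_plain.txt", encoding="utf-8") as f:
        plain = f.read()
    with open("ocr_pdf.txt", encoding="utf-8") as f:
        pdf = f.read()
    # Characters listed here were recognized in the plain-text run
    # but could not be extracted from the PDF.
    for char, count in sorted(missing_chars(plain, pdf).items()):
        print(f"U+{ord(char):04X} {char!r}: missing {count} time(s)")
```

If the characters turn up in ocr_pdf.txt but still look blank in a PDF
viewer, that would point to a display or font issue rather than to
recognition; if they are missing from ocr_pdf.txt as well, the PDF export
itself would be the more likely suspect.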
Any advice would be much appreciated!
Jan