[tesseract-ocr] tesseract export into txt vs. into pdf (issues with some characters)

Jan Fri, 13 Mar 2015 05:30:01 -0700

Hi,

I noticed that when I use tesseract to create a searchable PDF (I use 
pdfsandwich for this), some characters are not displayed and are replaced 
by blank spaces instead. However, if I OCR the same file with tesseract 
only to obtain plain text (I use OCRFeeder), everything is 
recognized AND displayed properly. It seems as if tesseract has issues 
exporting some characters specifically to PDF, even though it is obviously 
capable of recognizing them. This happens with quotation marks and ligatures 
("Th", "ff", etc.), but also, for example, with some special Czech 
characters such as "ě", "č", or "š" (even when the option "-l ces" is 
set). Does anybody have an idea what might be wrong?
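
To pin down exactly which characters go missing, the two outputs can be 
compared character by character. The following is only a rough sketch, not 
part of tesseract or pdfsandwich: the function name missing_chars is 
hypothetical, and it assumes the plain-text output and the text extracted 
from the PDF text layer (for example with a tool such as pdftotext) have 
already been read into two strings.

```python
from collections import Counter


def missing_chars(txt_text: str, pdf_text: str) -> dict:
    """Return characters that occur more often in txt_text than in pdf_text.

    Hypothetical helper for illustration: txt_text is assumed to be the
    plain-text OCR output, pdf_text the text extracted from the PDF layer.
    Whitespace is ignored, since dropped characters show up as blanks.
    """
    txt_counts = Counter(c for c in txt_text if not c.isspace())
    pdf_counts = Counter(c for c in pdf_text if not c.isspace())
    # Counter subtraction keeps only characters with a positive difference.
    return dict(txt_counts - pdf_counts)


if __name__ == "__main__":
    # Made-up sample strings standing in for the two real outputs.
    print(missing_chars("Thě čaj", "Th  aj"))
```

The result maps each character to the number of occurrences that the PDF 
text is missing, which makes it easy to see whether only non-ASCII 
characters and ligatures are affected.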


I've been trying to find out whether anyone has had the same issue, but I 
have not found any relevant forum thread (yet).

Any advice would be much appreciated!
Jan
