I use tesseract for text recognition on images with a lot of
impurities such as noise, poor contrast against the background, and
so on. I pre-process the images as well as I can to extract the
pixels that matter for recognition.
Of course this is never perfect, so some pixels from the noise or the
background still interfere with the recognition. I have tested
several OCR engines, and tesseract turns out to be one of the most
sensitive to noise: normal English text comes back as a string of
accented letters and special characters, which is quite
disappointing.
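
For context, the pre-processing I do is roughly along these lines (a
sketch in Python with OpenCV; the concrete filters, thresholds and
file names here are only placeholders, not my exact pipeline):

import cv2

# Rough sketch of the clean-up applied before handing the image to
# tesseract; parameters and file names are placeholders.
def preprocess(path):
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # median filter to knock out small speckle noise
    denoised = cv2.medianBlur(gray, 3)
    # Otsu thresholding to separate text pixels from the background
    _, binary = cv2.threshold(denoised, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

cv2.imwrite("cleaned.png", preprocess("noisy_input.png"))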

To improve my results with tesseract I trained a new language with a
very limited character set (in fact only a-zA-Z0-9). This improved
the results significantly. By also adding the recurring errors to the
DangAmbigs file and my expected words to the user-words file, I was
even able to get nearly everything right.
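
For what it is worth, restricting the character set can also be
expressed without retraining, at least on builds that expose the
tessedit_char_whitelist variable (the -c switch and the pytesseract
wrapper below are assumptions on my side, not something I have
verified against my current setup):

import pytesseract
from PIL import Image

# Limit recognition to a-z, A-Z and 0-9 via a config variable instead
# of a separately trained language; behaviour may differ by version.
WHITELIST = ("abcdefghijklmnopqrstuvwxyz"
             "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")

text = pytesseract.image_to_string(
    Image.open("cleaned.png"),
    config="-c tessedit_char_whitelist=" + WHITELIST)
print(text)

Either way, though, the character set stays limited.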

My question is whether there is another way to gain some resistance
to noise and similar impurities without settling for such a limited
character set. I would really like to be able to add some more
characters to my language.