I use tesseract for text recognition on images with a lot of impurities, such as noise and poor contrast against the background. I pre-process the images as well as I can to extract the pixels that matter for recognition. This is of course not perfect, so some pixels from the noise or the background still interfere with the recognition. I tested several OCR engines, and tesseract turned out to be one of the most sensitive to noise: normal English text is recognized as a string of accented letters and special characters, which is of course quite disappointing.
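For reference, here is a rough sketch of the kind of pre-processing I mean, written with ImageMagick's convert purely as an illustration; the exact tools, filters and parameters in my real pipeline differ:

  # grayscale, upscale, median filter against speckle noise,
  # stretch the contrast, then binarize before handing the image to tesseract
  convert input.png -colorspace Gray -resize 200% -median 3 \
          -contrast-stretch 2%x2% -threshold 60% cleaned.png
  tesseract cleaned.png output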
To improve my results with tesseract I trained a new language with a very limited character set (in fact only a-zA-Z0-9). This improved the results significantly. By adding the recurring errors to the DangAmbigs file and the user-words file I was even able to get nearly everything right.

My question is whether there is another way to gain some resistance against noise and similar impurities without settling for such a limited character set. I would really like to be able to add some more characters to my language.
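P.S. For illustration, the restriction I trained corresponds roughly to what the tessedit_char_whitelist config variable expresses at run time (a sketch; the config file name is a placeholder, and a run-time whitelist is not necessarily equivalent to retraining a language):

  # restrict.config -- placeholder file name
  tessedit_char_whitelist abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789

  # nobatch is one of the standard configs shipped with tesseract
  tesseract input.tif output nobatch restrict.config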

