I am working with the various international Latin training sets and am finding that most of them contain plenty of letters that are not legal in the language in question. For example, the German set includes "Latin small letter s with caron" (Unicode U+0161, "š"; the caron is a small v-shaped mark drawn above the s). Any idea why this is, and any suggestion for the best way to get cleaned versions of the training sets?
What compounds my problem is that TesseractExtractResults() (UTF-8 with coordinates and confidence) appears to ignore the blacklist variable, and I cannot use GetUTF8Text() instead because I need the coordinates of each letter. Unless I can either clean up the training sets or get Tesseract to honor the blacklist, my only recourse is to test for each such character and map it to my best guess as to what it really is. That code is not only time-consuming to write but also error-prone: only Tesseract can properly determine what the character is once the unwanted letter has been ruled out by the blacklist. Help!
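For reference, the workaround I am trying to avoid would look roughly like the sketch below. It is only an illustration under assumptions: the tuple layout (character, bounding box, confidence) is a stand-in for whatever TesseractExtractResults() actually returns, the names ALLOWED, fallback_for and flag_illegal are hypothetical, and the allowed alphabet is my own guess at what counts as legal German text. It shows the weakness as well: stripping the diacritic is just a guess, not a real second-best recognition result.

```python
import unicodedata

# Assumption: the set of characters treated as legal in German text.
# Umlauts and sharp s are written as escapes: a/o/u-umlaut, sharp s, A/O/U-umlaut.
GERMAN_LETTERS = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "\u00e4\u00f6\u00fc\u00df\u00c4\u00d6\u00dc"
)
ALLOWED = GERMAN_LETTERS | set("0123456789") | set(" .,;:!?'\"()-")


def fallback_for(ch):
    """Guess a replacement by dropping diacritics; return None if no guess is possible."""
    # NFD splits a precomposed letter into base letter + combining marks.
    base = unicodedata.normalize("NFD", ch)[0]
    return base if base in ALLOWED else None


def flag_illegal(results):
    """Take (char, box, confidence) tuples and mark characters outside ALLOWED.

    Returns (char_or_guess, box, confidence, was_replaced) tuples, so the
    coordinates of every letter are preserved.
    """
    out = []
    for ch, box, conf in results:
        if ch in ALLOWED:
            out.append((ch, box, conf, False))
        else:
            out.append((fallback_for(ch), box, conf, True))
    return out


if __name__ == "__main__":
    # Hypothetical per-character results: (char, (left, top, right, bottom), confidence)
    sample = [("S", (0, 0, 10, 12), 90), ("\u0161", (11, 0, 20, 12), 55)]
    for item in flag_illegal(sample):
        print(item)
```

Every entry flagged as replaced would still need a manual check, which is exactly the part I would rather leave to Tesseract.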

