Irrelevant letters in training sets

patrickq Thu, 18 Nov 2010 06:31:35 -0800

I am working with the various international Latin-script training sets
and am finding that most of them contain plenty of letters that do not
occur in the language at all. For example, the German set includes
"Latin small letter s with caron" (Unicode U+0161; the caron is a small
v-shaped mark drawn above the s). Any idea why, and any suggestions for
the best way to get cleaned versions of the training sets?
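
To see which letters a set actually contains, one option is to list the
entries of its unicharset file and compare them with the alphabet of the
language. The following is a minimal sketch, not a definitive tool. It
assumes the unicharset layout in which the first line holds the entry
count and every following line starts with the character, followed by
whitespace-separated properties. The function name, the German letter
set, and the sample data (including its property values) are made up for
illustration.

```python
import string

# Assumed alphabet for German: ASCII letters plus umlauts and sharp s.
GERMAN_LETTERS = set(string.ascii_letters + "\u00e4\u00f6\u00fc\u00df\u00c4\u00d6\u00dc")


def unexpected_letters(unicharset_text, allowed_letters):
    """Return alphabetic unicharset entries that fall outside allowed_letters."""
    unexpected = []
    lines = unicharset_text.splitlines()
    for line in lines[1:]:  # the first line is the entry count
        fields = line.split()
        if not fields:
            continue
        entry = fields[0]
        # Only letters are checked; digits and punctuation are left alone.
        if entry.isalpha() and any(ch not in allowed_letters for ch in entry):
            unexpected.append(entry)
    return unexpected


# Hypothetical excerpt; the property values are placeholders.
sample = "5\nNULL 0\na 3\n\u00df 3\n\u0161 3\n. 10\n"
print(unexpected_letters(sample, GERMAN_LETTERS))
```

The output of such a scan is only a list of candidates for a blacklist or
for removal; it does not change the training set itself.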


What compounds my problem is that TesseractExtractResults() (UTF-8 text
with coordinates and confidence) appears to ignore the blacklist
variable, and I cannot use GetUTF8Text() instead because I need the
coordinates of each letter.

Without a way to either clean up the training sets or get Tesseract to
honor the blacklist, my only recourse is to test for each such character
and map it to my best guess at what it really is. That code is not only
time-consuming to write but also error-prone: only Tesseract can
properly determine what the character is once the unwanted letter has
been ruled out by the blacklist.
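
For reference, the manual fallback described above might look roughly
like the sketch below. It assumes the per-letter results have already
been collected into (character, box, confidence) tuples; that tuple
layout, the function names, and the German letter set are hypothetical
and not part of the Tesseract API. The guess is made by stripping
diacritics, which is only a heuristic and cannot recover what the
recognizer would have chosen had the letter been excluded.

```python
import string
import unicodedata

# Assumed alphabet for German: ASCII letters plus umlauts and sharp s.
GERMAN_LETTERS = set(string.ascii_letters + "\u00e4\u00f6\u00fc\u00df\u00c4\u00d6\u00dc")


def best_guess(ch, allowed):
    """Map a disallowed letter to its base letter if that base is allowed."""
    if ch in allowed:
        return ch
    # Decompose, e.g. U+0161 becomes "s" followed by a combining caron.
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    if base and all(c in allowed for c in base):
        return base
    return ch  # no safe guess available; leave the character unchanged


def clean_results(results, allowed):
    """Replace disallowed letters while keeping boxes and confidences."""
    return [(best_guess(ch, allowed), box, conf) for ch, box, conf in results]


# Hypothetical per-letter results: (character, (x0, y0, x1, y1), confidence).
results = [
    ("S", (0, 0, 10, 12), 0.93),
    ("\u0161", (10, 0, 18, 12), 0.61),
    ("\u00e4", (18, 0, 26, 12), 0.88),
]
cleaned = clean_results(results, GERMAN_LETTERS)
```

Letters that are valid in the language, such as the umlauts, pass through
untouched because they are checked against the allowed set before any
decomposition takes place.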

Help!!!

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
