2010/9/19 Zdenko Podobný <[email protected]>:
> Hi Stane,
>
> why it doesn't look healthy? ;-)
> There is one easy way how to find if it correct or not: to test it ;-)
>
> BTW: when I searched for mistakes in former wiki (now corrections are
> included in http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3)
> I recognized that:
> a) unicharset_extractor put NULL to type of script (maybe I did something
> wrong, maybe google did not submit relevant code yet)
Probably the latter. There are, for example, function prototypes for a
whole other OCR engine (called 'Cube', IIRC), for which there's no
matching code.

> b) in unicharset.cpp there is code that works with these scripts: Latin,
> Common, Greek, Cyrillic, Han, NULL

There are more than that. For one, Fraktur is considered a script of its
own.

> c) if you extract unicharset files from some languages (e.g.
> "combine_tessdata -e jpn.traineddata jpn.unicharset" - Japaneses language
> file is from svn revision 309) you can find there also another scripts:
> Hiragana and Katakana
>

Yes, those are mentioned in part of the code. What /seems/ to be there is
an image-based script detection mechanism (the usual mechanism is to guess
the script based on the types of mistakes) but I haven't seen it used.

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

