Re: Training japanese for 3.0

Jimmy O'Regan Sun, 19 Sep 2010 07:01:53 -0700

2010/9/19 Zdenko Podobný <[email protected]>:
> Hi Stane,
>
> why doesn't it look healthy? ;-)
> There is one easy way to find out whether it is correct: test it ;-)
>
> BTW: when I searched for mistakes in the former wiki (the corrections are
> now included in http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3)
> I noticed that:
> a) unicharset_extractor puts NULL as the script type (maybe I did something
> wrong, maybe Google has not released the relevant code yet)


Probably the latter. There are, for example, function prototypes for a
whole other OCR engine (called 'Cube', IIRC), for which there's no
matching code.

> b) in unicharset.cpp there is code that works with these scripts: Latin,
> Common, Greek, Cyrillic, Han, NULL

There are more than that. For one, Fraktur is considered a script of its own.

> c) if you extract the unicharset files from some languages (e.g.
> "combine_tessdata -e jpn.traineddata jpn.unicharset" - the Japanese language
> file is from svn revision 309) you can also find other scripts there:
> Hiragana and Katakana
>

Yes, those are mentioned in part of the code. What /seems/ to be there
is an image-based script detection mechanism (the usual mechanism is
to guess the script based on the types of mistakes) but I haven't seen
it used.
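
To see which scripts an extracted unicharset actually names, something
like the following rough sketch may help. It rests on an assumption, not
on anything confirmed by the code: that the first line of the file is the
entry count and that each later line has space-separated fields with the
script name in the third column. The function name count_scripts and the
sample data are made up for illustration; they are not part of Tesseract.

```python
from collections import Counter


def count_scripts(lines):
    """Tally the script column of a unicharset given as a list of lines.

    Assumed layout (not verified against the Tesseract sources):
      line 1:  number of entries
      others:  <unichar> <properties> <script> <other_case_id>
    """
    counts = Counter()
    for line in lines[1:]:          # skip the entry-count line
        fields = line.split()
        if len(fields) < 3:         # ignore blank or malformed lines
            continue
        counts[fields[2]] += 1      # third field is assumed to be the script
    return counts


# Hand-made sample in the assumed layout, not taken from a real jpn.unicharset.
sample = [
    "5",
    "NULL 0 NULL 0",
    "a 3 Latin 1",
    "あ 1 Hiragana 2",
    "ア 1 Katakana 3",
    "日 1 Han 4",
]

for script, n in sorted(count_scripts(sample).items()):
    print(script, n)
```

For a real file, read it as UTF-8 and pass the lines in, e.g.
count_scripts(open("jpn.unicharset", encoding="utf-8").read().splitlines()).
If unicharset_extractor really writes NULL for every entry, the tally
would show a single NULL bucket.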

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

