On Thu, Jun 7, 2012 at 12:29 PM, Nick White <[email protected]> wrote:
> On Thu, Jun 07, 2012 at 08:22:27AM +0200, zdenko podobny wrote:
> > I start to put my notes[1] what I found (just for me ;-) ) - at the moment
> > there is not a lot of information and maybe there are some things that
> > I misunderstood ;-) .
> >
> > [1] http://www.sk-spell.sk.cx/first-notes-for-tesseract-ocr-302-traning
>
> Thanks so much for posting your notes Zdenko, they're very handy
> indeed, incomplete and incorrect though they may be ;)
>
> I am suffering from some of the same problems as you with the output
> from unicharset_extractor. In particular, glyph_metrics is always:
> 0,255,0,255,0,32767,0,32767,0,32767
> and script is always NULL.
>
> I'm training Ancient Greek, so it seems pretty clear that script
> should be Greek. But does anybody know what the script field is used
> for? Not setting it doesn't seem to cause any problems. Anybody have
> any clues as to why it wouldn't be set automatically? Are there any
> known problems to setting it manually once the unicharset has been
> generated? I'll look into these more in the code when I can, but any
> experience from others would be most useful.
>
> As for the glyph_metrics, it seems more worrying that it doesn't
> seem to be filled out at all. Has anybody else had any luck with it?
> And any idea why?
>
> Any thoughts or ideas would be most welcome!
>

Well, I got an "order" for which I need to run training, so I hope I will
publish some more experiences with 3.02 training. But there is no deadline,
so it could take a long time ;-)

Regarding the missing information, Ray Smith is IMHO the only one who could
explain it ;-)

Anyway, my quick check revealed that this missing information is the same in
all languages (e.g. "i" has the same script and glyph_metrics; the only
difference is in the "link" between "i" and "I", because of their different
positions in the output from unicharset_extractor....). I believe this
information could be reconstructed by some (Python) script.
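
Just to illustrate the idea for the script field, here is a minimal sketch. It
assumes the script can be guessed from the first word of the character's
Unicode name, which is only a heuristic. The function name guess_script, the
small lookup table and the "Common"/"NULL" fallbacks are my own hypothetical
choices; the sketch does not read or write a real unicharset file and it does
not touch glyph_metrics.

```python
import unicodedata

# Hypothetical lookup: first word of the Unicode character name -> script name.
# Extend it with whatever scripts the training text actually contains.
KNOWN_SCRIPTS = {
    "LATIN": "Latin",
    "GREEK": "Greek",
    "CYRILLIC": "Cyrillic",
}


def guess_script(ch):
    """Guess a script name for a single character (heuristic, not authoritative)."""
    try:
        name = unicodedata.name(ch)
    except ValueError:
        # Characters without a Unicode name (e.g. control characters).
        return "NULL"
    first_word = name.split()[0]
    # Digits, punctuation and unlisted scripts fall back to "Common" (assumption).
    return KNOWN_SCRIPTS.get(first_word, "Common")


if __name__ == "__main__":
    for ch in ["i", "I", "\u03b1", "\u1f00", "1"]:
        print(repr(ch), guess_script(ch))
```

Both "i" and "I" come out as Latin here, and polytonic Greek letters come out
as Greek, so something like this could be a starting point for filling the
field after unicharset_extractor has run.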
But I am not sure whether this would help to improve accuracy (it needs to be
tested).

--
Zdenko

