Re: unicharset script and metrics questions

zdenko podobny Thu, 07 Jun 2012 06:10:37 -0700

On Thu, Jun 7, 2012 at 12:29 PM, Nick White <[email protected]> wrote:


> On Thu, Jun 07, 2012 at 08:22:27AM +0200, zdenko podobny wrote:
> > I start to put my notes[1] what I found (just for me ;-) ) - at the
> > moment there is not a lot of information and maybe there are some
> > things that I misunderstood ;-) .
> >
> > [1] http://www.sk-spell.sk.cx/first-notes-for-tesseract-ocr-302-traning
>
> Thanks so much for posting your notes Zdenko, they're very handy
> indeed, incomplete and incorrect though they may be ;)
>
> I am suffering from some of the same problems as you with the output
> from unicharset_extractor. In particular, glyph_metrics is always:
> 0,255,0,255,0,32767,0,32767,0,32767
> and script is always NULL.
>
> I'm training Ancient Greek, so it seems pretty clear that script
> should be Greek. But does anybody know what the script field is used
> for? Not setting it doesn't seem to cause any problems. Anybody have
> any clues as to why it wouldn't be set automatically? Are there any
> known problems to setting it manually once the unicharset has been
> generated? I'll look into these more in the code when I can, but any
> experience from others would be most useful.
>
> As for the glyph_metrics, it seems more worrying that it doesn't
> seem to be filled out at all. Has anybody else had any luck with it?
> And any idea why?
>
> Any thoughts or ideas would be most welcome!
>
>
Well, I got an "order" for which I need to run training, so I hope I will
publish some more experiences with 3.02 training. But there is no deadline,
so it could take a long time ;-)

Regarding the missing information, Ray Smith is IMHO the only one who could
explain it ;-)

Anyway, my quick check revealed that this missing information is the same
in all languages (e.g. "i" has the same script and glyph_metrics; the only
difference is in the "link" between "i" and "I", because they end up at
different positions in the output from unicharset_extractor). I believe
this information could be reconstructed by some (Python) script. But I am
not sure if this helps to improve accuracy (it needs to be tested).
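
A minimal sketch of what such a script could look like for the script field
only. It is an illustration, not something taken from the Tesseract sources:
it assumes each unicharset entry is a space-separated line with the script
name in the fourth field, and it guesses the script from the Unicode
character name. The names guess_script, fill_script and the prefix table are
made up for this example, and the sample line in the usage note is
hypothetical.

```python
import unicodedata

# Assumed mapping from the first word of a Unicode character name to a
# script label. The table is illustrative and deliberately incomplete.
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latin",
    "GREEK": "Greek",
    "CYRILLIC": "Cyrillic",
}


def guess_script(unichar):
    """Guess a script label from the Unicode name of the first letter."""
    for ch in unichar:
        if not ch.isalpha():
            continue  # skip digits, punctuation and combining marks
        name = unicodedata.name(ch, "")
        prefix = name.split(" ", 1)[0]
        return NAME_PREFIX_TO_SCRIPT.get(prefix, "Common")
    # No letter found at all: treat as script-neutral.
    return "Common"


def fill_script(line, script_column=3):
    """Replace a NULL script field in one unicharset entry line.

    script_column is an assumption about the file layout; lines that are
    too short (such as the count on the first line) are left unchanged.
    """
    fields = line.split(" ")
    if len(fields) > script_column and fields[script_column] == "NULL":
        fields[script_column] = guess_script(fields[0])
    return " ".join(fields)
```

For example, fill_script('i 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 5 0 1 i')
would put Latin in place of NULL and leave the other fields alone. The
glyph_metrics values are not touched here, since they would have to be
measured from the training images.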

-- 
Zdenko

