Thanks Zdenko and Nick.  Yes, I'm using the latest tesseract code 
(revision 729) because the 3.01 version doesn't work well for me: I get 
"Couldn't find matching blob" for just one of my characters, for some 
reason.  After following your instructions, I was able to get everything 
running without crashes or errors.  However, the training doesn't seem 
to have worked, because it's not recognizing anything properly.

I noticed your and Nick's comments on unicharset.

Zdenko, does your note on unicharset_extractor mean that the current 
codeline doesn't work properly?
You mentioned a script to correct the information; is there any place that 
documents how I can fix the file so that it works properly?

Nick, have you been able to train either 3.01 or 3.02/current codeline to 
recognize a new language properly?

thanks for your help,

Steve


On Thursday, June 7, 2012 6:10:01 AM UTC-7, zdpo wrote:
>
>
>
> On Thu, Jun 7, 2012 at 12:29 PM, Nick White wrote:
>
>> On Thu, Jun 07, 2012 at 08:22:27AM +0200, zdenko podobny wrote:
>> > I have started to put up my notes[1] on what I found (just for me ;-) ).
>> > At the moment there is not a lot of information, and maybe there are
>> > some things that I misunderstood ;-) .
>> >
>> > [1] http://www.sk-spell.sk.cx/first-notes-for-tesseract-ocr-302-traning
>>
>> Thanks so much for posting your notes Zdenko, they're very handy
>> indeed, incomplete and incorrect though they may be ;)
>>
>> I am suffering from some of the same problems as you with the output
>> from unicharset_extractor. In particular, glyph_metrics is always:
>> 0,255,0,255,0,32767,0,32767,0,32767
>> and script is always NULL.
>>
>> I'm training Ancient Greek, so it seems pretty clear that script
>> should be Greek. But does anybody know what the script field is used
>> for? Not setting it doesn't seem to cause any problems. Anybody have
>> any clues as to why it wouldn't be set automatically? Are there any
>> known problems to setting it manually once the unicharset has been
>> generated? I'll look into these more in the code when I can, but any
>> experience from others would be most useful.
>>
>> As for the glyph_metrics, it seems more worrying that it doesn't
>> seem to be filled out at all. Has anybody else had any luck with it?
>> And any idea why?
>>
>> Any thoughts or ideas would be most welcome!
>>
>>
> Well, I got an "order" for which I need to run training, so I hope I will 
> publish some more experiences with 3.02 training. But there is no deadline, 
> so it could take a long time ;-)
>
> Regarding the missing information, Ray Smith is IMHO the only one who could 
> explain it ;-)
>
> Anyway, my quick check revealed that the missing information is the same 
> in all languages (e.g. "i" has the same script and glyph_metrics; the only 
> difference is in the "link" between "i" and "I", because of their different 
> positions in the unicharset_extractor output). I believe this information 
> could be reconstructed by some (python) script. But I am not sure if this 
> helps to improve accuracy (needs to be tested).
>
> -- 
> Zdenko
>
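
The placeholder values described above (glyph_metrics always
0,255,0,255,0,32767,0,32767,0,32767 and script always NULL) can be spotted
mechanically before deciding what to reconstruct. The following is only a
rough sketch, not part of tesseract: it assumes each unicharset entry is a
whitespace-separated line that starts with the character, that glyph_metrics
is the first field made of ten comma-separated integers, and that the field
right after it is the script. The function names and the sample lines are
made up for illustration.

```python
# Placeholder metrics reported in the thread for every entry.
DEFAULT_METRICS = "0,255,0,255,0,32767,0,32767,0,32767"


def find_metrics_index(fields):
    """Return the index of the first field that looks like glyph_metrics
    (ten comma-separated integers), or None if no field matches."""
    for i, field in enumerate(fields):
        parts = field.split(",")
        if len(parts) == 10 and all(p.lstrip("-").isdigit() for p in parts):
            return i
    return None


def check_line(line):
    """Classify one unicharset line.

    Returns None for lines without a metrics field (e.g. the count header).
    Otherwise returns a dict with the character, whether the metrics equal
    the placeholder, and whether the next field (ASSUMED to be the script)
    is NULL.
    """
    fields = line.split()
    idx = find_metrics_index(fields)
    if idx is None:
        return None
    script = fields[idx + 1] if idx + 1 < len(fields) else None
    return {
        "char": fields[0],
        "default_metrics": fields[idx] == DEFAULT_METRICS,
        "script_null": script == "NULL",
    }


def report(lines):
    """Return the characters whose metrics are placeholders or whose
    script is NULL."""
    flagged = []
    for line in lines:
        info = check_line(line)
        if info and (info["default_metrics"] or info["script_null"]):
            flagged.append(info["char"])
    return flagged


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real unicharset file.
    sample = [
        "2",
        "i 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 5",
        "a 3 60,70,180,200,100,120,0,10,110,130 Latin 7",
    ]
    print(report(sample))
```

Running report() over the lines of a generated unicharset would list the
entries that still carry the defaults. Whether filling them in improves
accuracy remains untested, as noted above.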
