I'm just going to go through your numbered points here.

On Fri, Jul 04, 2014 at 10:02:43AM -0700, Albrecht Hilker wrote:
> 1.)
> The column "other_case" should contain the ID of the other-case letter.
> For the lowercase letters they point correctly to the uppercase letters.
> But the uppercase letters they all have a value of -1 which is wrong.
> Here should be the corresponding ID of the lowercase letter.

The set_unicharset_properties tool sets this correctly.
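
If you want to check a unicharset for this problem yourself, here is 
a rough Python sketch. It assumes you have already read the file 
into two lists indexed by ID, one of characters and one of 
other_case values; the function name and the list layout are my own 
invention, not anything from the tesseract tools.

```python
def missing_other_case(unichars, other_case):
    """Return (id, unichar, expected_id) for each entry whose other_case
    does not point at its opposite-case letter, although that letter is
    present in the set.

    unichars   -- list of characters, where the list index is the ID
    other_case -- list of other_case IDs, same indexing (hypothetical layout)
    """
    index = {u: i for i, u in enumerate(unichars)}
    problems = []
    for i, u in enumerate(unichars):
        swapped = u.swapcase()
        # Skip characters without a case pair, or whose pair is not in the set.
        if swapped == u or swapped not in index:
            continue
        if other_case[i] != index[swapped]:
            problems.append((i, u, index[swapped]))
    return problems
```

Run against a set where the lowercase letters point at the uppercase 
ones but not the other way round, it should list every uppercase 
letter together with the ID it ought to have.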

> 2.)
> The script name is always NULL.
> It should be LATIN or COMMON

The set_unicharset_properties tool sets this correctly.

> 3.)
> All the min / max values are completely missing.
> They are 0, 255 or 32767.
> 10 missing columns!

Yes. They are missing, and as you rightly point out, that sucks.

> 4.)
> The last column "normed_form" is missing.
> With the '#' a comment is starting.
> But when reading this unicharset the '#' is misinterpreted as the
> "normed_form".
> Here should be mostly the same letter as in the first column.

Good spot that the unicharset_extractor's '#' is misinterpreted as 
the normed_form. That is definitely a bug. The 
set_unicharset_properties tool does set this correctly, though.
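
To illustrate the kind of fix a reader would need, here is a small 
Python sketch that drops a trailing comment before splitting a line 
into fields. It is not how tesseract parses the file. It assumes the 
comment is a lone '#' token followed by at least one more token, and 
it never treats the first token as a comment, since '#' can be the 
character itself. The sample lines in the usage note are made up.

```python
def strip_trailing_comment(line):
    """Split a line on whitespace and drop a trailing '# ...' comment.

    Assumption: a comment starts at a lone '#' token that is not the
    first token and has at least one token after it. A '#' in the
    final position is kept, because it may be a real field value.
    """
    tokens = line.split()
    # Start at 1 to protect the first column; stop before the last token
    # so that a final '#' is not mistaken for an empty comment.
    for i in range(1, len(tokens) - 1):
        if tokens[i] == "#":
            return tokens[:i]
    return tokens
```

With a made-up line such as "a 3 NULL 5 # a [61 ]" this returns only 
the four fields before the '#', so a missing last column shows up as 
missing rather than being filled with '#'.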
 
As far as I'm aware there's no good reason for unicharset_extractor 
to be separate from set_unicharset_properties, though I haven't 
looked at the code of either in depth yet.

> Here you see a unicharset extracted from a traineddata file with all columns
> filled correctly:

You can also see a bunch of unicharset files in training/langdata. 
At the moment it seems like they're generated by 
unicharset_extractor, run through set_unicharset_properties, and 
then the min/max metrics are filled in somehow, maybe by some tool, 
maybe by hand.

I'll ask on the dev list in a moment if there's such a tool, and if 
it can be released (some of the training tools like this were 
originally written for internal use by Google and do funky things 
like depend on map-reduce, so have to be rewritten for us plebs ;))

Nick
