I'm just going to go through your numbered points here.

On Fri, Jul 04, 2014 at 10:02:43AM -0700, Albrecht Hilker wrote:
> 1.)
> The column "other_case" should contain the ID of the other-case letter.
> For the lowercase letters they point correctly to the uppercase letters.
> But the uppercase letters they all have a value of -1 which is wrong.
> Here should be the corresponding ID of the lowercase letter.

The set_unicharset_properties tool sets this correctly.

> 2.)
> The script name is always NULL.
> It should be LATIN or COMMON

The set_unicharset_properties tool sets this correctly.

> 3.)
> All the min / max values are completely missing.
> They are 0, 255 or 32767.
> 10 missing columns!

Yes. They are missing, and as you rightly point out, that sucks.

> 4.)
> The last column "normed_form" is missing.
> With the '#' a comment is starting.
> But when reading this unicharset the '#' is misinterpreted as the
> "normed_form".
> Here should be mostly the same letter as in the first column.

Good spot that the unicharset_extractor's '#' is misinterpreted as the
normed_form. That is definitely a bug. The set_unicharset_properties
tool does set this correctly, though. As far as I'm aware there's no
good reason for unicharset_extractor to be separate from
set_unicharset_properties, though I haven't looked at the code of
either in depth yet.

> Here you see a unicharset extracted from a traineddata file with all columns
> filled correctly:

You can also see a bunch of unicharset files in training/langdata; at
the moment it seems like they're generated by unicharset_extractor, run
through set_unicharset_properties, and then the metrics are set
somehow, maybe by some tool, maybe by hand. I'll ask on the dev list in
a moment if there's such a tool, and if it can be released (some of the
training tools like this were originally written for internal use by
Google and do funky things like depend on map-reduce, so have to be
rewritten for us plebs ;))

Nick
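
A minimal sketch of the "other_case" problem from point 1, for anyone who wants to check a unicharset by hand. It is not code from Tesseract: the Entry class and the two helper functions are hypothetical names, and the entries are a simplified in-memory view (one unichar plus its other_case ID), not the real file format with all its columns. Pointing characters that have no case counterpart at their own ID is also an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    # Hypothetical, simplified view of one unicharset row.
    unichar: str
    other_case: int = -1  # ID (list index) of the other-case letter


def broken_case_links(entries: List[Entry]) -> List[int]:
    """Return IDs whose other_case link is out of range or not mirrored back."""
    bad = []
    for i, entry in enumerate(entries):
        j = entry.other_case
        if not (0 <= j < len(entries)) or entries[j].other_case != i:
            bad.append(i)
    return bad


def fill_other_case(entries: List[Entry]) -> List[Entry]:
    """Point each letter at its other-case partner, if that partner is in the set."""
    index = {entry.unichar: i for i, entry in enumerate(entries)}
    for i, entry in enumerate(entries):
        # Assumption: characters without a partner in the set refer to themselves.
        entry.other_case = index.get(entry.unichar.swapcase(), i)
    return entries


if __name__ == "__main__":
    # Mirrors the reported symptom: 'a' points at 'A', but 'A' still holds -1.
    rows = [Entry("A"), Entry("a", other_case=0), Entry(".", other_case=2)]
    print(broken_case_links(rows))                   # [0, 1]
    print(broken_case_links(fill_other_case(rows)))  # []
```

The check flags both halves of a one-way pair, so a lowercase letter whose uppercase partner still holds -1 shows up alongside that partner.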

