I have more thoughts on the unicharset metrics discussion.

> So this example says that
> the character "1" has a min_bottom value of 59 and
> the character "9" has a min_bottom value of 18.
>
> Weird ? ? ?
> Both numbers are aligned to the baseline!

I am guessing now (I'll take a look at the code later), but I presume "baseline-normalized" isn't supposed to mean baseline = 0.

> Wouldn't it be more intelligent to define the min_bottom for "9"
> with a higher value to distinguish it from a lowercase "g" ??

Comparing the lines for 9 and g is useful:

9 8 0,66,200,255,89,156,0,39,104,173 Common 64 2 64 9 # 9 [39 ]0
g 3 0,43,188,212,88,176,0,32,100,210 Latin 93 0 54 g # g [67 ]a

So the min_bottom for both is 0, that's true. But don't forget that in some fonts 9 does dip significantly below the baseline. And the max_bottom is quite different, and is probably more useful for the differentiation here. It says the bottom of g hardly ever rises above 43, whereas the bottom of 9 can quite happily rise up to 66 (which looks like it roughly corresponds to the baseline, given how many other characters are about there). From that we can guess that 128 is the x-height, and 64 is roughly the baseline.

More anon.

Nick
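For anyone who wants to compare entries like these side by side, here is a minimal Python sketch. It assumes the third whitespace-separated field holds ten comma-separated metrics in the order min_bottom, max_bottom, min_top, max_top, min_width, max_width, min_bearing, max_bearing, min_advance, max_advance. The helper name parse_metrics is my own and is not part of Tesseract.

```python
# Assumed field order for the ten glyph metrics in a unicharset line.
FIELDS = (
    "min_bottom", "max_bottom", "min_top", "max_top",
    "min_width", "max_width", "min_bearing", "max_bearing",
    "min_advance", "max_advance",
)

# The two entries quoted in this message.
NINE = "9 8 0,66,200,255,89,156,0,39,104,173 Common 64 2 64 9 # 9 [39 ]0"
G = "g 3 0,43,188,212,88,176,0,32,100,210 Latin 93 0 54 g # g [67 ]a"


def parse_metrics(line):
    """Return (character, metrics dict) for one unicharset entry.

    Hypothetical helper: it only handles lines whose third field
    carries exactly ten integer metrics, as in the entries above.
    """
    parts = line.split()
    char, raw = parts[0], parts[2]
    values = [int(v) for v in raw.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d metrics, got %d" % (len(FIELDS), len(values)))
    return char, dict(zip(FIELDS, values))


if __name__ == "__main__":
    for entry in (NINE, G):
        char, m = parse_metrics(entry)
        # Print the bottom and top ranges, which are what matter here.
        print(char, "bottom:", m["min_bottom"], "-", m["max_bottom"],
              "top:", m["min_top"], "-", m["max_top"])
```

Read this way, the two characters share a min_bottom of 0 but differ in max_bottom (66 for 9, 43 for g), which is the difference discussed above.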

