Re: [tesseract-ocr] Missing detailed documentation about Unicharset files

Nick White Thu, 10 Jul 2014 10:28:18 -0700

I have some more thoughts on the unicharset metrics discussion.

> So this example says that
> the character "1" has a min_bottom value of 59 and
> the character "9" has a min_bottom value of 18.
> 
> Weird ? ? ?
> Both numbers are aligned to the baseline!


I am guessing now (I'll take a look at the code later), but I 
presume "baseline-normalized" isn't supposed to mean baseline = 0.

> Wouldn't it be more intelligent to define the min_bottom for "9" 
> with a higher value to distinguish it from a lowercase "g" ??

Comparing the lines for 9 and g is useful:
9 8 0,66,200,255,89,156,0,39,104,173 Common 64 2 64 9   # 9 [39 ]0
g 3 0,43,188,212,88,176,0,32,100,210 Latin 93 0 54 g    # g [67 ]a

So it's true that the min_bottom for both is 0. But don't forget that 
in some fonts 9 does dip significantly below the baseline. The 
max_bottom is quite different, and is probably more useful for the 
differentiation here. It says the bottom of g hardly ever rises above 
43, whereas the bottom of 9 can quite happily sit as high as 66. That 
66 looks like it roughly corresponds to the baseline, given how many 
other characters have a bottom value about there. From that we can 
guess that 64 is roughly the baseline, and that 128 is the x-height.
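
To make the comparison concrete, here is a minimal Python sketch that 
pulls the metrics out of the two lines above. It assumes the 
comma-separated field is ordered min_bottom, max_bottom, min_top, 
max_top, min_width, max_width, min_bearing, max_bearing, min_advance, 
max_advance; that ordering is my reading of the format, not something 
confirmed in this thread. The helper name parse_unicharset_line is 
made up for illustration and is not part of Tesseract.

```python
# Assumed order of the comma-separated metrics field (not confirmed here).
METRIC_NAMES = [
    "min_bottom", "max_bottom", "min_top", "max_top",
    "min_width", "max_width", "min_bearing", "max_bearing",
    "min_advance", "max_advance",
]


def parse_unicharset_line(line):
    """Hypothetical helper: read the character and metrics of one entry."""
    # The first three whitespace-separated fields are taken to be the
    # character, its properties value, and the metrics list.
    char, properties, metrics = line.split()[:3]
    entry = dict(zip(METRIC_NAMES, (int(v) for v in metrics.split(","))))
    entry["char"] = char
    entry["properties"] = properties
    return entry


nine = parse_unicharset_line(
    "9 8 0,66,200,255,89,156,0,39,104,173 Common 64 2 64 9   # 9 [39 ]0")
g = parse_unicharset_line(
    "g 3 0,43,188,212,88,176,0,32,100,210 Latin 93 0 54 g    # g [67 ]a")

# min_bottom is identical, so only max_bottom separates the two here.
for entry in (nine, g):
    print(entry["char"], entry["min_bottom"], entry["max_bottom"])
```

Under that assumed field order, both entries share min_bottom but 
differ in max_bottom, which is the point made above.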

More anon.

Nick
