[tesseract-ocr] Tesseract font matching

vroscigno Fri, 18 Sep 2015 06:38:45 -0700


I am using Tesseract 3.04 on Windows to analyze scanned paper forms which 
often contain non-contiguous text labels of various size, position, and 
font style. I am attempting to deduce simple typeface characteristics such 
as serif vs sans-serif, fixed vs variable pitch, italics, bold, etc., in an 
effort to loosely classify identified text labels.

I started out by using LTRResultIterator::WordFontAttributes for recognized
words, but then learned that the returned font properties are from the
*best* matching font, not from an accumulation of actual character
attributes for the recognized word.

As an example of this, I have observed cases where sequences of ARIAL
(sans-serif, variable-pitch) characters are measured and determined to be
fixed-pitch (for example: "BOOK"), and the best matching font is a COURIER
variant (fixed-pitch, serif). In this case, none of the characters have
serifs, but the determined pitch (fixed) seems to carry significant weight
when matching fonts.
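
To illustrate what I mean by accumulating attributes, here is a minimal
sketch of one possible workaround: collect the boolean flags that
WordFontAttributes reports for every word in a label, then take a
character-weighted majority vote, so that a single word such as "BOOK"
cannot flip the whole label. The names WordFontFlags and VoteLabelFlags are
hypothetical (they are not part of Tesseract), and the sketch assumes that a
label contains several words and that the flags have already been copied out
of the result iterator.

```cpp
#include <vector>

// Hypothetical container (not a Tesseract type) for the flags reported by
// LTRResultIterator::WordFontAttributes for a single recognized word.
struct WordFontFlags {
  bool is_bold;
  bool is_italic;
  bool is_monospace;
  bool is_serif;
  int char_count;  // number of characters in the word, used as vote weight
};

// Hypothetical helper: character-weighted majority vote over all words of
// one label. A flag is set in the result only when the words carrying it
// account for strictly more than half of the label's characters.
WordFontFlags VoteLabelFlags(const std::vector<WordFontFlags>& words) {
  int total = 0, bold = 0, italic = 0, mono = 0, serif = 0;
  for (const WordFontFlags& w : words) {
    total += w.char_count;
    if (w.is_bold) bold += w.char_count;
    if (w.is_italic) italic += w.char_count;
    if (w.is_monospace) mono += w.char_count;
    if (w.is_serif) serif += w.char_count;
  }
  WordFontFlags label;
  label.is_bold = 2 * bold > total;
  label.is_italic = 2 * italic > total;
  label.is_monospace = 2 * mono > total;
  label.is_serif = 2 * serif > total;
  label.char_count = total;  // total characters that took part in the vote
  return label;
}
```

This does not correct the per-word result. It only smooths the result
across a label, so it would not help with single-word labels.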

I intend to study the font classification logic a bit to be sure I
understand it.

I also suspect that the Adaptive Classifier may propagate this effect to
'downstream' results. (True/False? Opinions?)

I thought about exploring the following:

1. disabling fixed-pitch character classification and handling

2. disabling the adaptive classifier or limiting its influence

Does anyone have any suggestions or opinions?

Thanks,

Vince Roscigno
