Hello,
I would like to run tesseract with only a specific set of fonts and
find out which font actually matched. There is only ever one font in
a given image, but in principle it could be any one of many different
fonts. We can typically narrow the candidates down to a subset
beforehand.

I can see only three ways to combine fonts:

1. Train all fonts into one big tesseract file. This has the advantage
that all fonts are always present, but at the expense of accuracy and
speed. Furthermore, we don't know which font actually matched and it
is not possible to add more fonts later on without redoing the entire
training.

2. Figure out the tesseract file format and find a way to combine the
trained data of several fonts (if this is at all possible). This would make it
possible to create any combination of fonts from the individual fonts.
Thus, the number of active fonts would be limited so as to improve
accuracy and speed. The only disadvantage is that we wouldn't know
which font matched.

3. Call tesseract separately for each font. This has the advantage of
full flexibility, plus we would know which font matched. However, it
goes through the entire processing chain again and again at the cost
of speed. Furthermore, it is not clear to us how best to combine the
individual results. Should we pick the result with the best
confidence? Or is the rating a better indicator? Since all passes are
done on the same image, the outline length should be identical. To
improve speed, is there a way to keep the intermediate results
(binarization, connected components, baseline finding, extraction of
features, etc.) for later passes?
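
To make option 3 concrete, here is a minimal Python sketch of the
selection step, assuming we simply keep the pass with the highest mean
confidence. The names FontResult, best_font_result and recognize are
hypothetical and not part of tesseract; recognize stands for whatever
wrapper runs one tesseract pass with a single font's trained data and
returns the text plus a confidence value where higher is better.
Whether confidence or rating is the right criterion is exactly the
open question above.

```python
from typing import Callable, Iterable, NamedTuple, Optional, Tuple


class FontResult(NamedTuple):
    font: str          # name of the font whose trained data was used
    text: str          # recognized text from that pass
    confidence: float  # assumed: mean confidence, higher is better


def best_font_result(
    fonts: Iterable[str],
    recognize: Callable[[str], Tuple[str, float]],
) -> Optional[FontResult]:
    """Run one pass per font and keep the highest-confidence result.

    `recognize` is a hypothetical callable supplied by the caller; it
    should run tesseract on the (fixed) image with one font only.
    Returns None if `fonts` is empty. Ties keep the earlier font.
    """
    best: Optional[FontResult] = None
    for font in fonts:
        text, confidence = recognize(font)
        candidate = FontResult(font, text, confidence)
        if best is None or candidate.confidence > best.confidence:
            best = candidate
    return best
```

The returned tuple carries the font name, so this approach also tells
us which font matched. It does nothing about the repeated
preprocessing cost, since each call to recognize is a full pass.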

Is there another way I may have missed? Does tesseract already offer
some way to apply more than one font?

Thanks in advance for any help you can provide.

Best regards,
Marcus
