> That is, my aim is to speed up Tesseract using the fact that my input will
> definitely not contain a certain set of characters.
> 
> E.g. if I can create a database with only numbers for various fonts, then
> during the conversion process Tesseract will only have to match against the
> small set of numbers.
> 
> Am I right in this assumption?

I'm not sure, to be honest. I would guess that it makes the
recognition of each character significantly quicker, but that most
of the time is spent in the initial startup of Tesseract, which
would explain why you've not seen a big speedup. But as I say, I'm
not positive; by all means do more testing or dig into the code a
bit and let us know what you find.
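
If you want to test this, something like the following might be a
starting point. It is only a rough sketch: it assumes your tesseract
build accepts "-c configvar=value" on the command line and honours
the tessedit_char_whitelist variable, and the helper names
build_command and time_command are made up for illustration.

```python
import subprocess
import time


def build_command(image_path, output_base, whitelist=None):
    """Return the argv list for a single tesseract run."""
    cmd = ["tesseract", image_path, output_base]
    if whitelist:
        # Assumption: this build supports -c and tessedit_char_whitelist,
        # which limits the characters Tesseract is allowed to output.
        cmd += ["-c", "tessedit_char_whitelist=" + whitelist]
    return cmd


def time_command(cmd, repeats=3):
    """Run cmd `repeats` times; return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        best = min(best, time.perf_counter() - start)
    return best
```

Timing build_command("page.tif", "out") against
build_command("page.tif", "out", "0123456789") on the same image
("page.tif" is a placeholder), once for a tiny image and once for a
large one, should give a rough idea of how much of the run is fixed
startup cost and how much scales with the number of characters.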

> Out of curiosity, are you aware why v3 box files are unavailable?

Basically because they were automatically generated. Arguably they
should still be released, because e.g. subsetting of the sort you're
talking about, or adding a few new characters, would be easier. But
they aren't. The good news is that with 3.03 (to be released soon)
the automatic generation tools will be included. You can see the
thread in which I loudly complained about this (and got pretty
reasonable answers) at:
https://groups.google.com/forum/#!topic/tesseract-dev/4lxGjCGLBSw
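
To illustrate why having the box files would make subsetting easier,
here is a minimal sketch of filtering one down to digits. It assumes
the Tesseract 3.x box format of six whitespace-separated fields per
line (symbol, left, bottom, right, top, page); subset_box_lines is a
hypothetical helper, not part of Tesseract, and you would still need
the matching images and a full training run afterwards.

```python
def subset_box_lines(lines, keep="0123456789"):
    """Return only the box-file lines whose symbol is one of `keep`."""
    allowed = set(keep)
    kept = []
    for line in lines:
        fields = line.split()
        # Assumed layout: symbol left bottom right top page
        if len(fields) == 6 and fields[0] in allowed:
            kept.append(line)
    return kept
```

Given lines read from a box file, the function returns the same lines
with every non-digit entry dropped, leaving the coordinates untouched.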

I'll ask soon about making available the text files and font
list to feed in to the automatic generation tool(s), thanks for
reminding me.

Nick
