> That is, my aim is to speed up Tesseract using the fact that my input will
> definitely not contain a certain set of characters.
>
> E.g. if I can create a database with only numbers for various fonts, during
> the conversion process Tesseract will only have to match against the small
> set of numbers.
>
> Am I right in this assumption?

I'm not sure, to be honest. I would guess that it will make each character recognition significantly quicker, but that the majority of the time is spent in the initial startup of Tesseract, which would explain why you've not seen a big speedup. But as I say, I'm not positive; by all means do more testing or dig into the code a bit and let us know what you find.

> Out of curiosity, are you aware why v3 box files are unavailable?

Basically because they were automatically generated. Arguably they should still be released, because e.g. subsetting of the sort you're talking about, or adding a few new characters, would be easier. But they aren't. The good news is that with 3.03 (to be released soon) the automatic generation tools will be included. You can see the thread in which I loudly complained about this (and got pretty reasonable answers) at:
https://groups.google.com/forum/#!topic/tesseract-dev/4lxGjCGLBSw

I'll ask soon about making available the text files and font list to feed in to the automatic generation tool(s); thanks for reminding me.

Nick
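
For the "do more testing" suggestion, here is a minimal timing sketch in Python, not a definitive benchmark. It assumes the `tesseract` command-line tool is on your PATH and is called as `tesseract IMAGE OUTBASE -l LANG [CONFIG...]`. The helper names `time_command` and `tesseract_cmd`, the image file names, and the `digits_only` language name are all hypothetical placeholders. The idea is to time the same language on a tiny image and on a full page: if the two timings are close, startup dominates and a smaller character set is unlikely to help much.

```python
import subprocess
import time


def time_command(cmd, runs=3):
    """Return the mean wall-clock seconds needed to run cmd, over `runs` runs."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the tool's output; only the elapsed time matters here.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        total += time.perf_counter() - start
    return total / runs


def tesseract_cmd(image, outbase, lang="eng", configs=()):
    """Build the argument list for one tesseract CLI invocation."""
    return ["tesseract", image, outbase, "-l", lang, *configs]


# Example use (hypothetical file and language names, adjust to your setup):
#
#   for lang in ("eng", "digits_only"):
#       small = time_command(tesseract_cmd("one_digit.png", "out", lang))
#       large = time_command(tesseract_cmd("full_page.png", "out", lang))
#       print(lang, "small:", small, "large:", large)
```

If the small-image and full-page timings differ a lot, comparing the full-page numbers between the two languages gives a rough idea of how much the reduced character set saves during recognition itself.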

