Thanks Nick! For my use, I need to identify a certain set of characters only.
I assume that if I somehow reduce the database that Tesseract uses to match an input character during identification, it will speed up the process. That is, my aim is to speed up Tesseract using the fact that my input will definitely not contain a certain set of characters. E.g., if I can create a database with only numbers for various fonts, then during the conversion process Tesseract will only have to match against the small set of numbers. Am I right in this assumption?

I know that we have the option to define a whitelist (or blacklist). However, my initial analysis showed no improvement in speed on defining a whitelist (it was a quick analysis, so I need to revisit it).

Does defining a whitelist make Tesseract load ONLY those characters, hence speeding up the conversion process? Or does it load the whole DB, convert the whole thing to text, and then remove characters based on the whitelist? The former would serve my purpose.

Out of curiosity, are you aware why v3 box files are unavailable?

Thanks! :)

On Wednesday, January 29, 2014 1:16:32 AM UTC-8, Nick White wrote:
>
> Hi there,
>
> > I require to create a new training file that consists of a subset of the
> > characters of the original training data.
> >
> > E.g. A training file that contains only numbers
>
> Do you want to do this because the English training data is too big
> for your uses? If not, you can just use the digits config file:
>
> https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_recognize_only_digits?
>
> If that is the case though it's rather trickier.
>
> > I believe for this I would require the original box files used to create the
> > current 21MB English training data file.
> >
> > Would it be possible to have access to these files? It would be a big help.
>
> The easiest way would certainly be to use the original box files.
> Unfortunately they aren't available for v3, and nor are they likely
> to be.
>
> So you'd have to create your own training, which is some work (and
> may well end up being less good than the official english training
> using the 'digits' config). You can read how to do that at
> https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>
> Nick
>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

