Hi there,

> I need to create a new training file that consists of a subset of the
> characters of the original training data.
>
> E.g. a training file that contains only numbers

Do you want to do this because the English training data is too big for
your purposes? If not, you can just use the digits config file:
https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_recognize_only_digits?

If size is the problem, though, it's rather trickier.

> I believe for this I would require the original box files used to create the
> current 21MB English training data file.
>
> Would it be possible to have access to these files? It would be a big help.

The easiest way would certainly be to use the original box files.
Unfortunately they aren't available for v3, nor are they likely to be.
So you'd have to create your own training data, which is some work (and may
well end up less accurate than the official English training data used with
the 'digits' config). You can read how to do that at
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Nick

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
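
For reference, the digits-config route mentioned in the reply can be sketched roughly as follows. This is only an illustration, not an official recipe: the helper names (`build_digits_command`, `recognize_digits`) and any file names are hypothetical, and it assumes the `tesseract` binary is on PATH and that the `digits` config file from the standard `tessdata/configs` directory is installed.

```python
import subprocess


def build_digits_command(image_path, output_base, lang="eng"):
    """Build a Tesseract 3 command line for digits-only recognition.

    Arguments after the output base are config file names; 'digits' is
    the config assumed to be installed under tessdata/configs.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "digits"]


def recognize_digits(image_path, output_base, lang="eng"):
    """Run Tesseract and return the recognized text.

    Assumes the 'tesseract' binary is on PATH. Tesseract writes its
    result to '<output_base>.txt', which is read back here.
    """
    subprocess.check_call(build_digits_command(image_path, output_base, lang))
    with open(output_base + ".txt") as f:
        return f.read()
```

Calling `recognize_digits("scan.png", "scan_out")` would run `tesseract scan.png scan_out -l eng digits`. Note that this keeps using the full English training data and only restricts which characters are output, so it does not help if the goal is a smaller training file.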

