Hi there,

> I require to create a new training file that consists of a subset of the
> characters of the original training data.
> 
> E.g. A training file that contains only numbers

Do you want to do this because the English training data is too big
for your uses? If not, you can just use the digits config file:
https://code.google.com/p/tesseract-ocr/wiki/FAQ#How_do_I_recognize_only_digits?
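
For what it's worth, here is a rough sketch of calling that from a script. It assumes the tesseract binary is on your PATH and that your install ships the stock 'digits' config; the helper names and file names are made up for illustration.

```python
import subprocess


def digits_command(image_path, output_base, lang="eng"):
    """Build a tesseract command line that applies the 'digits' config.

    Config file names go last, after the output base and any options.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "digits"]


def recognize_digits(image_path, output_base):
    """Run tesseract; it writes the result to <output_base>.txt."""
    # check=True raises CalledProcessError if tesseract exits non-zero.
    subprocess.run(digits_command(image_path, output_base), check=True)
```

Something like recognize_digits("page.png", "page") would then leave the recognized text in page.txt, still using the full English training data but restricted to the characters the config allows.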

If size is the problem, though, it's rather trickier.

> I believe for this I would require the original box files used to create the
> current 21MB English training data file.
> 
> Would it be possible to have access to these files? It would be a big help.

The easiest way would certainly be to use the original box files.
Unfortunately they aren't available for v3, nor are they likely
to be.

So you'd have to create your own training data, which is some work
(and may well end up worse than the official English training data
used with the 'digits' config). You can read how to do that at
https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

Nick

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
