Hi All,

I'm using tesseract quite successfully in my code. I have a
preprocessing step that locates the characters I need to recognise,
and then I feed them into tesseract one at a time using the
PSM_SINGLE_CHAR mode.

This works great with the default eng.traineddata.

I'm also constraining the tessedit_char_whitelist to just have numbers
and upper case letters as that is the only thing I have in my
character set.

I want to reduce the size of my app and the traineddata is by far the
largest chunk of data at the moment.

What I've tried to do is retrain tesseract so that it only has the
characters I need in the training data. I've done this successfully,
but when I use my newly created eng.traineddata the accuracy is much
worse than if I use the default eng.traineddata.

Any ideas why this should be? I thought that, if anything, accuracy
would improve once I'd removed all the unnecessary characters from the
data.

I'm doing my training by taking the box files and stripping out all
the characters I don't need and then running through the training
instructions.
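
To make the stripping step concrete, here is a minimal sketch of that kind
of box-file filtering. It is an illustration rather than my exact script:
the function name filter_box_lines and the KEEP set are made up for the
example, and it assumes the usual Tesseract 3.x box format of one glyph
per line followed by its coordinates and page number.

```python
import string

# Assumed keep-set: digits and upper-case letters, matching the whitelist.
KEEP = set(string.digits + string.ascii_uppercase)


def filter_box_lines(lines, keep=KEEP):
    """Return only the box-file lines whose glyph is in `keep`.

    Assumes each line looks like:
        <glyph> <left> <bottom> <right> <top> <page>
    """
    kept = []
    for line in lines:
        fields = line.split()
        # Skip blank or malformed lines rather than guessing at them.
        if len(fields) < 5:
            continue
        if fields[0] in keep:
            kept.append(line.rstrip("\n"))
    return kept


# Hypothetical sample lines, not taken from a real box file.
sample = [
    "A 10 20 30 40 0",
    "b 35 20 50 40 0",
    "7 55 20 70 40 0",
    ", 72 18 76 24 0",
]
filtered = filter_box_lines(sample)
```

The filtered lines would then be written back to the .box file before
running the rest of the training steps.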

I'm using tesseract 3.01.

Any thoughts?

Cheers
Chris.
