Hi All, I'm using tesseract quite successfully in my code. I have a preprocessing step that locates the characters I need to recognise, and then I feed them into tesseract using the PSM_SINGLE_CHAR mode.
This works great with the default eng.traineddata. I'm also constraining tessedit_char_whitelist to just digits and upper-case letters, as those are the only characters in my character set.

I want to reduce the size of my app, and the traineddata is by far the largest chunk of data at the moment. What I've tried to do is retrain tesseract so that the training data contains only the characters I need. I've done this successfully, but when I use my newly created eng.traineddata the accuracy is much worse than with the default eng.traineddata.

Any ideas why this should be? I thought that, if anything, accuracy would improve once I'd removed all the unnecessary characters from the data.

I'm doing my training by taking the box files, stripping out all the characters I don't need, and then running through the training instructions. I'm using tesseract 3.01.

Any thoughts?

Cheers
Chris.
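P.S. For reference, the box-file stripping step mentioned above could look roughly like the sketch below. This is only an illustration, not the exact script used: the function name `filter_box_lines` and the sample coordinates are made up, and it assumes each box-file line has the form `<symbol> <left> <bottom> <right> <top> <page>` with the symbol as the first whitespace-separated field.

```python
import string

# Characters to keep: digits and upper-case letters (the same set as the whitelist).
KEEP = set(string.digits + string.ascii_uppercase)


def filter_box_lines(lines, keep=KEEP):
    """Return only the box-file lines whose symbol is in `keep`.

    Assumption: each line looks like "<symbol> <left> <bottom> <right> <top> <page>".
    Blank or malformed lines are dropped.
    """
    kept = []
    for line in lines:
        fields = line.split()
        # The symbol is the first field; require at least symbol + 4 coordinates.
        if len(fields) >= 5 and fields[0] in keep:
            kept.append(line.rstrip("\n"))
    return kept


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real box file.
    sample = [
        "A 10 20 30 40 0",
        "b 35 20 50 40 0",
        "7 55 20 70 40 0",
        ", 72 18 76 24 0",
    ]
    for line in filter_box_lines(sample):
        print(line)
```

Only the box entries are filtered here; the rest of the training steps are run unchanged on the reduced box files.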

