Hi, I'm glad the standard trained data is working well, particularly given that (presumably) it's against a non-white background; people have found that to be a big obstacle in the past.
I would advise against completely retraining it. As the training source files are not available, it would take quite a bit of work just to get the training up to the quality of the built-in one.

The sort of errors you describe could reasonably be addressed by editing the lang.unicharambigs file. Unpack the training you're using with:

    combine_tessdata -u /path/to/lang.traineddata lang.

Add the ambigs rules you need, then recombine it with:

    combine_tessdata lang.

and copy the resulting lang.traineddata to the tessdata directory. (The trailing dot in "lang." is part of the file prefix, not punctuation.) More information on the unicharambigs file is given in the training guide on the wiki.

You could also consider looking at the configuration variables to do things like give higher penalties for unexpected punctuation (which may help with things like / vs l), but I think that would take a while and not be as effective for you. Grep for '_VAR' in the source tree if you want to try it anyway.

Best of luck, and let us know how you get on.

Nick
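For what it's worth, the unpack / edit / recombine steps could be scripted roughly as below. This is only a sketch: the helper function names, the "eng" language code, the paths and the sample "/ vs l" rule are all made up for illustration. The rule line also assumes the tab-separated "v1" ambigs layout from the training guide (source length, source characters, target length, target characters, type), so check it against the header and existing lines of the file you actually unpack.

```shell
#!/bin/sh
# Sketch only: helper names, paths and the sample rule are illustrative
# assumptions, not part of tesseract itself.

# Unpack <traineddata> into component files named <lang>.* in the current
# directory. The trailing dot is part of the output prefix.
#   $1 = path to the .traineddata file, $2 = language code
unpack_traineddata() {
    combine_tessdata -u "$1" "${2}."
}

# Append one rule to an ambigs file, assuming tab-separated fields:
#   $1 = ambigs file
#   $2 = number of source characters, $3 = source characters (space-separated)
#   $4 = number of target characters, $5 = target characters (space-separated)
#   $6 = type (assumed: 1 = always substitute, 0 = only a hint)
add_ambig_rule() {
    printf '%s\t%s\t%s\t%s\t%s\n' "$2" "$3" "$4" "$5" "$6" >> "$1"
}

# Recombine the <lang>.* files in the current directory into
# <lang>.traineddata.
#   $1 = language code
repack_traineddata() {
    combine_tessdata "${1}."
}

# Example usage (not run here; needs the tesseract training tools installed):
#   unpack_traineddata /path/to/eng.traineddata eng
#   add_ambig_rule eng.unicharambigs 1 / 1 l 0
#   repack_traineddata eng
#   cp eng.traineddata /path/to/tessdata/
```

Nothing runs until you call the functions, so you can try add_ambig_rule on a scratch file and inspect the result before touching the real unicharambigs.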

