Thanks again, Zdenko and Nick, for your responses. I was able to get somewhat better results by modifying my training image.
I'm trying to create a training set for basically the English letters A-Z, but with subscripts and superscripts of various letters or numbers. I want to map each combination to a separate character. So for example, A^1 would be "A", A^2 would be "a", etc. When I run tesseract, I run it against an image of a single character, so I use SINGLE_CHAR mode.

I was able to create training data for this on 3.02 revision 729, but when I actually used tesseract to identify letters, it wasn't producing any good results, even when I ran it against the training image itself.

I tried a second time, this time removing the subscripts and superscripts from the training image and using just the English letters from A to Z. This worked much better, and I was able to get some results. However, I'm getting a weird result where every single "D" is being recognized as "I".

Are there any switches or options that would produce some sort of output that I can use to figure out why this misidentification is occurring? I tried using the "segdemo inter" option, but it looks like the interactive mode was removed from 3.02.

Thanks for your help,
Steve

On Friday, June 8, 2012 2:02:20 AM UTC-7, Nick White wrote:
> Hi Steve,
>
> I'll cut up your email to reply to bits.
>
> > Zdenko, does your note on unicharset_extractor mean that the current
> > codeline doesn't work properly? You mentioned a script to correct the
> > information, is there any place that documents how I can fix the file
> > so that it works properly?
> >
> > Nick, have you been able to train either 3.01 or 3.02/current codeline
> > to recognize a new language properly?
>
> Yes, the training I'm doing is with the 3.02 trunk code, and is
> working very well now. As Zdenko says, we're just keen to make it as
> good as possible, hence looking into unicharset oddities. My
> training is already above my expectations. So don't be put off!
>
> On Thu, Jun 07, 2012 at 01:02:59PM -0700, steve8918 wrote:
> > Thanks Zdenko and Nick. Yes, I'm using the latest code of tesseract
> > (revision 729) because the 3.01 version doesn't appear to work well
> > for me, I'm getting "Couldn't find matching blob" for only one of my
> > characters for some reason.
>
> I get that for various of my training images, for no obvious reason.
> It doesn't seem to have a major impact on the training for me
> though, so I wouldn't worry too much about it.
>
> > After following your instructions, I was able to get everything
> > working without crashing or errors. However, the training didn't
> > seem to work, because it's not recognizing anything properly.
>
> That is surprising. Can you give more information about what
> (mis)recognition is happening?
>
> Nick
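
For anyone trying to reproduce the single-character run described in this message, here is a rough Python sketch. It assumes the Tesseract 3.x command-line form (`-l` for the language and `-psm 10` for single-character mode). The language name `sup`, the image names, and the helper functions are hypothetical placeholders and not part of tesseract itself. The tally helper only counts which letters get confused with which; it does not explain why.

```python
import subprocess
from collections import Counter

# Page segmentation mode 10: treat the image as a single character.
PSM_SINGLE_CHAR = 10


def build_command(image_path, output_base, lang):
    """Build a Tesseract 3.x command line for one single-character image."""
    # Tesseract 3.x spells the option "-psm"; 4.x and later use "--psm".
    return ["tesseract", image_path, output_base,
            "-l", lang, "-psm", str(PSM_SINGLE_CHAR)]


def recognize(image_path, output_base, lang):
    """Run tesseract (must be on PATH) and return the recognized text."""
    subprocess.run(build_command(image_path, output_base, lang), check=True)
    # Tesseract writes its text result to <output_base>.txt.
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read().strip()


def tally_confusions(pairs):
    """Count (expected, recognized) pairs where the two labels differ."""
    return Counter((exp, got) for exp, got in pairs if exp != got)
```

Feeding `tally_confusions` a list of `(expected, recognized)` pairs collected from `recognize` gives a quick view of systematic errors such as the "D" read as "I" case, which may help narrow down which training samples to inspect.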

