TDG,

Your command lines for training and recognition seem OK. But I'd suggest paying more attention to the training images.

First, do not overblur the images. If some characters still look disjointed, that's OK: you will be able to correct their boxes later, when editing the generated .box file. Characters should not lose their features, as happens with the holes in some 4's in your image. Characters in a training image must not touch one another, and rows have to be horizontal, or at least all have the same slope. Your image breaks these rules, so it is better to prepare your training images by manual editing. You would need to compile an image (or a set of images) containing 20-40 samples of each character, sufficiently spaced and properly mixed. By the latter I mean that there should be no long sequences of +'s, which can fool Tesseract's baseline-finding algorithm.
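To make "properly mixed" concrete, here is a rough Python sketch of how one might generate the text to print for such a training image. It is only an illustration: the function names, the alphabet "0123456789+", the run limit of 2, the 40 characters per line and the single-space separator are all my assumptions, not anything Tesseract requires. It emits the same number of samples for every character and never lets one character repeat more than `max_run` times in a row.

```python
import random
from collections import Counter


def _try_sequence(alphabet, samples_per_char, max_run, rng):
    """One randomized attempt; returns None if it runs into a dead end."""
    remaining = Counter({ch: samples_per_char for ch in alphabet})
    out = []
    while remaining:
        tail = out[-max_run:]
        # A character is blocked if the last max_run picks were all that character.
        blocked = tail[0] if len(tail) == max_run and len(set(tail)) == 1 else None
        choices = [ch for ch in remaining if ch != blocked]
        if not choices:
            return None  # only the blocked character is left
        # Weight by remaining count so no character piles up at the end.
        ch = rng.choices(choices, weights=[remaining[c] for c in choices])[0]
        out.append(ch)
        remaining[ch] -= 1
        if remaining[ch] == 0:
            del remaining[ch]
    return out


def mixed_training_text(alphabet, samples_per_char=30, max_run=2,
                        chars_per_line=40, sep=" ", seed=0):
    """Return lines of text with evenly mixed samples of each character.

    All defaults are illustrative assumptions; tune them to your printer.
    """
    if len(set(alphabet)) < 2 or samples_per_char < 1 or max_run < 1:
        raise ValueError("need >= 2 distinct characters and positive limits")
    rng = random.Random(seed)  # fixed seed: the same text on every run
    seq = None
    while seq is None:  # retry on the (rare) dead end
        seq = _try_sequence(alphabet, samples_per_char, max_run, rng)
    return [sep.join(seq[i:i + chars_per_line])
            for i in range(0, len(seq), chars_per_line)]


if __name__ == "__main__":
    for line in mixed_training_text("0123456789+"):
        print(line)
```

The run limit is checked on the whole sequence, so it also holds across line breaks. You would still have to print the text on the real device, scan it, and clean up the image by hand.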
Second, IMO training to get two classes for + is a bad idea. There is not that much variety in the shape of +, and Tesseract is able to cluster it properly. Actually, using + as a whitespace placeholder is itself a bad idea, if you have control over this. This character does not align with the baseline, x-height or ascender lines, yet it appears in large numbers and in long sequences, so it can make Tesseract build incorrect baselines. I think that, besides image skew, this can be a reason for your unrecognized pieces. (BTW you can use my recent baseline debugging tutorial.) So I'd suggest using as a placeholder some well-aligned character with a shape very distinct from the digits and preferably of the same width as the digits.

You've asked lots of questions, but this is what I'd start working with.

HTH

Warm regards,
Dmitri Silaev
www.CustomOCR.com

On Wed, Jun 20, 2012 at 9:00 AM, TDG <[email protected]> wrote:
> Hi everyone,
> I am enjoying tesseract and it is really helping me in a pioneering project
> to do with verifiable elections.
> I am having some problems and would appreciate any hints or directions.
> If you don't have time to read the below, a key thing I would like to do is
> somehow have Tesseract NOT employ scale invariance since my input characters
> are not going to vary in size.
> Here is the rest of the action.
>
> My OCR task involves recognising strings like this printed with a dot
> matrix quality inkjet ticket printing device.
>
> ++4++3++5++6++1++7++2++++++++++++++++++++++++1++++++++++++++++++++
> +++++1++2++3+++++++++++++++++++++++++++++++++++1++++++++++++++++++
>
> There needs to be a field separator of some sort and I have been using
> plus.
> It could be anything but not spaces since these get collapsed and position
> of the integers is important in our application.
> Tesseract sometimes performs very well and often not so well in my
> situation.
> I cope with this by running it in a loop and adjusting a variable in input
> image treatment, the bottom levels threshold value.
> However sometimes it does things I just can't quite understand, such as
> ignoring several characters at the left end of the string and recognising
> the rest.
> This seems to happen if the image is not totally straight.
>
> I have followed a number of guides online. I am using Tesseract 3.01 binary
> on win7 64, and ImageMagick.
> I have bootstrapped training for several generations.
> Here is what I do
>
> Train:
> tesseract.exe excella.jet.exp0.tif excella.jet.exp0 nobatch box.train.stderr
> unicharset_extractor.exe excella.jet.exp0.box
> echo jet 0 0 0 0 0 > font_properties
> mftraining -F font_properties -U unicharset excella.jet.exp0.tr
> cntraining.exe excella.jet.exp0.tr
> move /Y inttemp jet.inttemp
> move /Y Microfeat jet.Microfeat
> move /Y normproto jet.normproto
> move /Y pffmtable jet.pffmtable
> move /Y unicharset jet.unicharset
> combine_tessdata jet.
> move /Y jet.traineddata "c:\Program Files (x86)\Tesseract-OCR\tessdata\jet.traineddata"
>
> Recognition:
> convert file_name.jpg -units PixelsPerInch -density 300x300 tmp1.png
> convert tmp1.png -crop 2766x148+30+450 tmp2.png
> convert tmp2.png -blur 0x1.3 tmp3.png
> convert tmp3.png -units PixelsPerInch -resize 2766 -density 300x300 -level 60%,90%,30 tmp.tif
> tesseract tmp.tif tmp -l jet
>
> Attached:
>
> FRONT200GRAY8_16_6_12_1.JPG is a raw image
> recognise.jpg (uploaded as JPG as Google does not like TIF, scaled to width
> 1900 from 2766) is a treated image which I get a perfect result from
> recognise.box shows the recognition result if you use something like
> owlboxer and you scale and convert the above
> excella31.jet.exp0.tif (ditto, scaled down to 1900 width from 2474 for
> Google size limit) is the training image
>
> "31" in excella31.jet.exp0 is the version, which is not shown above.
>
> Here are some other observations and questions:
> 1. Tesseract seems very sensitive to the amount of blur I give to images.
> Adjusting the kernel by a tenth of a pixel makes a difference. It seems
> that judging it by eye is not accurate given my example above. Is this
> expected?
> 2. Tesseract seems very sensitive to the levels I set. Varying the bottom
> threshold 5% makes a lot of difference. In each of the above I vary it by
> 1% only. How important is it that I set the training image treatment since
> I am guessing what this needs by eye. Obviously it needs blurring to
> coalesce the dots in the font, but the other changes are guesstimates.
> 3. I couldn't get it to reliably train on so many variants of the "+" symbol
> so I let it see "+" as "+" but also "C" to allow two more reasonable feature
> sets to cluster. This works well. Is there a way to actually know that it
> can't handle too much variance apart from it then failing to accurately
> classify certain characters?
> 4. Sometimes it just doesn't start recognising digits until it is several
> characters in from the left, especially when the image is slightly crooked.
> Should I write something to straighten it - I notice it can read sideways
> already without problems. Perhaps it is the slight angle that is the
> problem?
> 5. Can it be set NOT to encode for scale invariance? My input data
> characters are not going to vary in scale. It often finds little specks to
> be false positive matches.
> 6. The scanner is 200dpi, I can't change this since it is a special
> ticket-scanning device. The training data are 300dpi and I scale up the
> input image. Is my image treatment via JPG and PNG to TIF and rescaling
> likely to be a source of problems?
> 7. I read that Tesseract is designed to recognise words. So would it be
> better that I train it with groups like "++1" "+10" etc since three
> plusses is the field width for each datum I am scanning for?
>
> Thanks in advance for any advice you have,
> As before this is an excellent product. I'd be stumped without it.
> Best,
> TDG
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en

