Dmitri,

This is very helpful. I have made some modest progress anyway via two changes, but I will also do as you suggest. The two things I have just done are:
1. added a margin around the image to be recognised
2. explicitly set tessedit_pageseg_mode to 3
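For anyone following along, the two changes above might look roughly like this. It is only a sketch: `build_commands` is a hypothetical helper, the 20-pixel white margin and the file names are assumptions, and it assumes the Tesseract build accepts the `-psm` switch (the 3.x spelling; later versions use `--psm`). It only builds the command lines and does not run them.

```python
def build_commands(src, padded="padded.tif", out_base="tmp",
                   lang="jet", margin_px=20, psm=3):
    """Return (pad_cmd, ocr_cmd) as argument lists; nothing is executed.

    margin_px and the file names are illustrative assumptions, not values
    taken from the thread.
    """
    # ImageMagick: surround the image with a white border of margin_px pixels.
    pad_cmd = ["convert", src, "-bordercolor", "white",
               "-border", f"{margin_px}x{margin_px}", padded]
    # Tesseract 3.x: -psm sets the page segmentation mode
    # (the tessedit_pageseg_mode variable).
    ocr_cmd = ["tesseract", padded, out_base, "-l", lang, "-psm", str(psm)]
    return pad_cmd, ocr_cmd


if __name__ == "__main__":
    for cmd in build_commands("tmp.tif"):
        print(" ".join(cmd))
```

The lists can be passed to `subprocess.run` as they are if you want to execute them from a script.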
We also notice that it is very good at some integer sequences, like ++4++2++++++++1++3++, and never gets others, like ++4+++++1++2++3+++++. We think it has learned the "word" "4++2". Mind you, this is not what the makebox result shows for this particular image. In fact, when I use makebox to work out what is going on, it almost always gets just single letters. Is there a way to know that it has decided on certain characters because of the order in which they appear?

I think your ideas about the training set will improve it, and I shouldn't be training with the separator in there. In production the separator is needed, but it doesn't have to be "+". I am now thinking it might be "-", which looks like none of the digits.

Thanks again for your time,
T

On Saturday, June 23, 2012 8:14:45 PM UTC+10, Dmitri Silaev wrote:
> TDG,
>
> Your command lines for training and recognition seem OK, but I'd suggest paying more attention to the training images. First, do not overblur images. If some characters still look disjointed, that's OK; you will be able to correct their boxes later, when editing the generated .box file. Characters should not lose their features, as happens with the holes in some 4's in your image. Characters in a training image must not touch one another, and rows have to be horizontal, or at least have the same slope. Your image does not obey these rules, so it is better to prepare your training images by manual editing. You would need to compile an image (or a set of images) containing 20-40 samples of each character, sufficiently spaced and properly mixed. By the latter I mean that there should be no long sequences of +'s, which can fool Tesseract's baseline-finding algorithm.
>
> Second, IMO training to get two classes for + is a bad idea. There's not much variety in shape for +, and Tesseract is able to cluster it properly. Actually, using + as a whitespace placeholder is a bad idea, if you have control over this.
> This character does not align with the baseline, x-height, or ascender lines, yet it appears in large numbers and in long sequences, so it can make Tess build incorrect baselines. I think that, besides image skew, this can be a reason for your unrecognized pieces. (BTW you can use my recent baseline debugging tutorial.) So I'd suggest using as a placeholder some well-aligned character with a very distinctive shape compared to digits, and preferably of the same width as the digits.
>
> You've asked lots of questions, but this is what I'd start working with.
>
> HTH
>
> Warm regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> > Hi everyone,
> >
> > I am enjoying Tesseract, and it is really helping me in a pioneering project to do with verifiable elections. I am having some problems and would appreciate any hints or directions. If you don't have time to read the below, a key thing I would like to do is somehow have Tesseract NOT employ scale invariance, since my input characters are not going to vary in size. Here is the rest of the action.
> >
> > My OCR task involves recognising strings like this, printed with a dot-matrix-quality inkjet ticket printing device:
> >
> > ++4++3++5++6++1++7++2++++++++++++++++++++++++1++++++++++++++++++++
> > +++++1++2++3+++++++++++++++++++++++++++++++++++1++++++++++++++++++
> >
> > There needs to be a field separator of some sort, and I have been using plus. It could be anything but spaces, since these get collapsed and the position of the integers is important in our application.
> >
> > Tesseract sometimes performs very well and often not so well in my situation. I cope with this by running it in a loop and adjusting a variable in the input image treatment, the bottom levels threshold value. However, sometimes it does things I just can't quite understand, such as ignoring several characters at the left end of the string and recognising the rest.
> > This seems to happen if the image is not totally straight.
> >
> > I have followed a number of guides online. I am using the Tesseract 3.01 binary on Win7 64, and ImageMagick. I have bootstrapped training for several generations. Here is what I do.
> >
> > Train:
> > tesseract.exe excella.jet.exp0.tif excella.jet.exp0 nobatch box.train.stderr
> > unicharset_extractor.exe excella.jet.exp0.box
> > echo jet 0 0 0 0 0 > font_properties
> > mftraining -F font_properties -U unicharset excella.jet.exp0.tr
> > cntraining.exe excella.jet.exp0.tr
> > move /Y inttemp jet.inttemp
> > move /Y Microfeat jet.Microfeat
> > move /Y normproto jet.normproto
> > move /Y pffmtable jet.pffmtable
> > move /Y unicharset jet.unicharset
> > combine_tessdata jet.
> > move /Y jet.traineddata "c:\Program Files (x86)\Tesseract-OCR\tessdata\jet.traineddata"
> >
> > Recognition:
> > convert file_name.jpg -units PixelsPerInch -density 300x300 tmp1.png
> > convert tmp1.png -crop 2766x148+30+450 tmp2.png
> > convert tmp2.png -blur 0x1.3 tmp3.png
> > convert tmp3.png -units PixelsPerInch -resize 2766 -density 300x300 -level 60%,90%,30 tmp.tif
> > tesseract tmp.tif tmp -l jet
> >
> > Attached:
> >
> > FRONT200GRAY8_16_6_12_1.JPG is a raw image
> > recognise.jpg (uploaded as JPG as Google does not like TIF, scaled to width 1900 from 2766) is a treated image from which I get a perfect result
> > recognise.box shows the recognition result if you use something like owlboxer and you scale and convert the above
> > excella31.jet.exp0.tif (ditto, scaled down to 1900 width from 2474 for Google size limit) is the training image
> >
> > "31" in excella31.jet.exp0 is the version, which is not shown above.
> >
> > Here are some other observations and questions:
> > 1. Tesseract seems very sensitive to the amount of blur I give to images. Adjusting the kernel by a tenth of a pixel makes a difference.
> > It seems that judging it by eye is not accurate, given my example above. Is this expected?
> > 2. Tesseract seems very sensitive to the levels I set. Varying the bottom threshold by 5% makes a lot of difference. In each of the above I vary it by 1% only. How important is it that I get the training image treatment right, since I am guessing what it needs by eye? Obviously it needs blurring to coalesce the dots in the font, but the other changes are guesstimates.
> > 3. I couldn't get it to train reliably on so many variants of the "+" symbol, so I let it see "+" as "+" but also as "C", to allow two more reasonable feature sets to cluster. This works well. Is there a way to actually know that it can't handle too much variance, apart from it then failing to accurately classify certain characters?
> > 4. Sometimes it just doesn't start recognising digits until it is several characters in from the left, especially when the image is slightly crooked. Should I write something to straighten it? I notice it can read sideways already without problems. Perhaps it is the slight angle that is the problem?
> > 5. Can it be set NOT to encode for scale invariance? My input data characters are not going to vary in scale. It often finds little specks to be false positive matches.
> > 6. The scanner is 200dpi; I can't change this since it is a special ticket-scanning device. The training data are 300dpi and I scale up the input image. Is my image treatment via JPG and PNG to TIF and rescaling likely to be a source of problems?
> > 7. I read that Tesseract is designed to recognise words. So would it be better that I train it with groups like "++1", "+10", etc., since three plusses is the field width for each datum I am scanning for?
> >
> > Thanks in advance for any advice you have.
> > As before, this is an excellent product. I'd be stumped without it.
> >
> > Best,
> > TDG

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
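A note on the retry loop and the fixed field width described in the thread: since every datum occupies three characters and position matters, the recognised string can be checked mechanically before it is accepted. The sketch below is one possible way to do that, not what TDG actually runs. `parse_fields` and `first_valid` are hypothetical names, the rule that digits are right-aligned in each field is an assumption drawn from the "++1" and "+10" examples, and `recognise` stands in for whatever function runs the image treatment and Tesseract at a given threshold.

```python
def parse_fields(line, width=3, sep="+"):
    """Split a recognised line into fixed-width fields.

    Returns {field_position: integer} for the non-empty fields.
    Raises ValueError if the line is not a whole number of fields, or if
    a field is anything other than separators followed by digits.
    """
    if len(line) % width != 0:
        raise ValueError(f"length {len(line)} is not a multiple of {width}")
    values = {}
    for pos in range(len(line) // width):
        field = line[pos * width:(pos + 1) * width]
        digits = field.lstrip(sep)  # assumed layout: separators, then digits
        if digits == "":
            continue                # empty field, separators only
        if not digits.isdigit():
            raise ValueError(f"bad field {field!r} at position {pos}")
        values[pos] = int(digits)
    return values


def first_valid(recognise, thresholds, width=3, sep="+"):
    """Try each threshold in turn; return (threshold, fields) for the first
    result that parses cleanly, or None if none does."""
    for threshold in thresholds:
        try:
            return threshold, parse_fields(recognise(threshold), width, sep)
        except ValueError:
            continue
    return None


if __name__ == "__main__":
    print(parse_fields("+++++1++2++3"))
```

A result that passes this check can still be wrong (a misread digit is still a digit), so this only filters out the dropped-character and stray-speck cases. The same check works with "-" or any other single-character separator by passing `sep`.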

