Dmitri,
This is very helpful.
I have made some modest progress in the meantime via two changes, but I
will also do as you suggest.
The things I have done just now are:
1. added a margin around the image to be recognised
2. explicitly set tessedit_pageseg_mode to 3
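
A minimal sketch of how these two changes could be scripted. The 20-pixel
white margin, the tmp_margin.tif file name and the use of the -psm
command-line switch (rather than a config file) are assumptions made for
illustration, not necessarily what was run here. -bordercolor and -border
are standard ImageMagick options, and -psm is the switch accepted by
Tesseract 3.x (newer releases spell it --psm).

```python
def margin_command(src, dst, margin_px=20):
    """Build an ImageMagick call that pads `src` with a white border.

    margin_px is a hypothetical value; tune it for the actual scans.
    """
    border = "{0}x{0}".format(margin_px)
    return ["convert", src, "-bordercolor", "white", "-border", border, dst]


def recognise_command(image, output_base, lang="jet", psm=3):
    """Build a Tesseract 3.x call with an explicit page segmentation mode.

    Assumption: the mode is passed with -psm on the command line.
    """
    return ["tesseract", image, output_base, "-l", lang, "-psm", str(psm)]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # even where ImageMagick and Tesseract are not installed.
    print(" ".join(margin_command("tmp.tif", "tmp_margin.tif")))
    print(" ".join(recognise_command("tmp_margin.tif", "tmp")))
```

The lists can be passed to subprocess.check_call() unchanged once the
values look right.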

We also notice that it is very good at some integer sequences like 
++4++2++++++++1++3++
and never gets 
++4+++++1++2++3+++++

We think it has learned the "word" "4++2". Mind you, this is not what the
makebox result is for this particular image.
In fact, when I use makebox to work out what is going on, it almost always
gets just single letters.
Is there a way to know whether it has decided on certain characters because
of the order in which they appear?
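
One rough way to look for this, sketched below under assumptions, is to
compare the per-character makebox output with the normal text output for the
same image and list where they diverge. The sketch assumes each box file line
has the form "glyph left bottom right top page" and that both outputs have
already been read into Python; the function names are hypothetical. It only
shows where the two results differ, not why.

```python
import difflib


def glyphs_from_box(box_lines):
    """Join the glyph column of box-file lines into one string.

    Assumes lines look like "4 12 0 17 5 0"; shorter lines are skipped.
    """
    glyphs = []
    for line in box_lines:
        parts = line.split()
        if len(parts) < 5:
            continue
        glyphs.append(parts[0])
    return "".join(glyphs)


def divergences(box_text, plain_text):
    """Return (operation, box_part, plain_part) for every differing span."""
    matcher = difflib.SequenceMatcher(None, box_text, plain_text, autojunk=False)
    return [
        (tag, box_text[i1:i2], plain_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    box = glyphs_from_box(["+ 0 0 5 5 0", "+ 6 0 11 5 0", "4 12 0 17 5 0"])
    print(divergences(box, "++4"))
```

An empty list means the two outputs agree character for character.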

I think your ideas about the training set will improve it, and I shouldn't
be training with the separator in there.
In production the separator is still needed, but it doesn't have to be "+".
I am now thinking it might be "-", which looks like none of the digits.
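
Since only the position of each integer matters, the separator can be
swapped without touching the downstream logic. A minimal sketch, assuming
the three-character field width mentioned in my original post and
right-aligned digits; parse_fields and its parameters are hypothetical names
used for illustration only.

```python
def parse_fields(text, width=3, sep="+"):
    """Split one recognised line into fixed-width fields.

    Returns one entry per field: an int, None for a field holding only
    separators, or the raw chunk (str) if it contains anything unexpected.
    A trailing chunk shorter than `width` is dropped when it holds only
    separators.
    """
    values = []
    for start in range(0, len(text), width):
        chunk = text[start:start + width]
        digits = chunk.strip(sep)
        if digits == "":
            if len(chunk) == width:
                values.append(None)
            continue
        values.append(int(digits) if digits.isdecimal() else chunk)
    return values


if __name__ == "__main__":
    line = "".join(["++4", "++2", "+++", "+++", "++1", "++3", "++"])
    print(parse_fields(line))             # [4, 2, None, None, 1, 3]
    print(parse_fields("--4-10---", sep="-"))
```

Returning the raw chunk for anything unexpected keeps misreads visible
instead of silently dropping them.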

Thanks again for your time,
T



On Saturday, June 23, 2012 8:14:45 PM UTC+10, Dmitri Silaev wrote:
>
> TDG, 
>
> Your command lines for training and recognition seem OK. But I'd
> suggest paying more attention to the training images. First, do not
> overblur the images. If some characters still look disjointed, that's
> OK: you will be able to correct their boxes later, when editing the
> generated .box file. Characters should not lose their features, as
> happens with the holes in some 4's in your image. Characters in
> a training image must not touch one another, and rows have to be
> horizontal, or at least all have the same slope. None of these rules
> is obeyed in your image. Therefore it is better to prepare your training
> images by manual editing. You would need to compile an image (or a set 
> of images) containing 20-40 samples of each character, sufficiently 
> spaced and mixed properly. By the latter I mean that there should be 
> no long sequences of +'s which can fool Tesseract's baseline finding 
> algo. 
>
> Second, IMO training to get two classes for + is a bad idea. There's
> not much variety in shape for +, and Tesseract is able to cluster it
> properly. Actually, using + as a whitespace placeholder is a bad idea
> if you have control over this. This character does not align with the
> baseline, x-height or ascender lines, yet it appears in large numbers
> and in long sequences, so it can make Tess build incorrect baselines. I
> think that, besides image skew, this can be a reason for your 
> unrecognized pieces. (BTW you can use my recent baseline debugging 
> tutorial.) So I'd suggest using as a placeholder some well-aligned
> character with a very distinctive shape compared to the digits and
> preferably of the same width as the digits.
>
> You've asked lots of questions but this is what I'd start working with. 
>
> HTH 
>
> Warm regards, 
> Dmitri Silaev 
> www.CustomOCR.com 
>
> > Hi everyone, 
> > I am enjoying tesseract and it is really helping me in a pioneering
> > project to do with verifiable elections.
> > I am having some problems and would appreciate any hints or directions.
> > If you don't have time to read the below, a key thing I would like to
> > do is somehow have Tesseract NOT employ scale invariance, since my
> > input characters are not going to vary in size.
> > Here is the rest of the action. 
> > 
> > My OCR task involves recognising strings like this, printed with a
> > dot-matrix-quality inkjet ticket printing device.
> > 
> > ++4++3++5++6++1++7++2++++++++++++++++++++++++1++++++++++++++++++++ 
> > +++++1++2++3+++++++++++++++++++++++++++++++++++1++++++++++++++++++ 
> > 
> > There needs to be a field separator of some sort and I have been using
> > plus.
> > It could be anything but not spaces, since these get collapsed and the
> > position of the integers is important in our application.
> > Tesseract sometimes performs very well and often not so well in my
> > situation.
> > I cope with this by running it in a loop and adjusting a variable in
> > the input image treatment, the bottom levels threshold value.
> > However sometimes it does things I just can't quite understand, such as
> > ignoring several characters at the left end of the string and
> > recognising the rest.
> > This seems to happen if the image is not totally straight.
> > 
> > I have followed a number of guides online.  I am using the Tesseract
> > 3.01 binary on win7 64, and image magick.
> > I have bootstrapped training for several generations. 
> > Here is what I do 
> > 
> > Train: 
> > tesseract.exe excella.jet.exp0.tif excella.jet.exp0 nobatch box.train.stderr
> > unicharset_extractor.exe excella.jet.exp0.box
> > echo jet 0 0 0 0 0 > font_properties
> > mftraining -F font_properties -U unicharset excella.jet.exp0.tr
> > cntraining.exe excella.jet.exp0.tr
> > move /Y inttemp jet.inttemp
> > move /Y Microfeat jet.Microfeat
> > move /Y normproto jet.normproto
> > move /Y pffmtable jet.pffmtable
> > move /Y unicharset jet.unicharset
> > combine_tessdata jet.
> > move /Y jet.traineddata "c:\Program Files (x86)\Tesseract-OCR\tessdata\jet.traineddata"
> > 
> > Recognition: 
> > convert file_name.jpg -units PixelsPerInch -density 300x300 tmp1.png 
> > convert tmp1.png -crop 2766x148+30+450 tmp2.png 
> > convert tmp2.png -blur 0x1.3 tmp3.png 
> > convert tmp3.png -units PixelsPerInch -resize 2766 -density 300x300 -level 60%,90%,30 tmp.tif
> > tesseract tmp.tif tmp -l jet 
> > 
> > Attached: 
> > 
> > FRONT200GRAY8_16_6_12_1.JPG is a raw image 
> > recognise.jpg (uploaded as JPG as Google does not like TIF, scaled to
> > width 1900 from 2766) is a treated image which I get a perfect result
> > from
> > recognise.box shows the recognition result if you use something like
> > owlboxer and you scale and convert the above
> > excella31.jet.exp0.tif (ditto, scaled down to 1900 width from 2474 for
> > Google size limit) is the training image
> > 
> > "31" in excella31.jet.exp0 is the version, which is not shown above. 
> > 
> > Here are some other observations and questions: 
> > 1. Tesseract seems very sensitive to the amount of blur I give to
> > images.  Adjusting the kernel by a tenth of a pixel makes a difference.
> > It seems that judging it by eye is not accurate given my example above.
> > Is this expected?
> > 2. Tesseract seems very sensitive to the levels I set.  Varying the
> > bottom threshold 5% makes a lot of difference.  In each of the above I
> > vary it by 1% only.  How important is it that I set the training image
> > treatment, since I am guessing what this needs by eye?  Obviously it
> > needs blurring to coalesce the dots in the font, but the other changes
> > are guesstimates.
> > 3. I couldn't get it to reliably train on so many variants of the "+"
> > symbol, so I let it see "+" as "+" but also "C" to allow two more
> > reasonable feature sets to cluster.  This works well.  Is there a way
> > to actually know that it can't handle too much variance, apart from it
> > then failing to accurately classify certain characters?
> > 4. Sometimes it just doesn't start recognising digits until it is
> > several characters in from the left, especially when the image is
> > slightly crooked.  Should I write something to straighten it?  I notice
> > it can read sideways already without problems.  Perhaps it is the
> > slight angle that is the problem?
> > 5. Can it be set NOT to encode for scale invariance?  My input data
> > characters are not going to vary in scale.  It often finds little
> > specks to be false positive matches.
> > 6. The scanner is 200dpi; I can't change this since it is a special
> > ticket-scanning device.  The training data are 300dpi and I scale up
> > the input image.  Is my image treatment via JPG and PNG to TIF and
> > rescaling likely to be a source of problems?
> > 7. I read that Tesseract is designed to recognise words.  So would it
> > be better that I train it with groups like "++1"  "+10"  etc., since
> > three plusses is the field width for each datum I am scanning for?
> > 
> > Thanks in advance for any advice you have, 
> > As before this is an excellent product.  I'd be stumped without it. 
> > Best, 
> > TDG 
> > 
> > 
> > 
> > 
>
>
