Dmitri, 
Alas you were right! :-(
We implemented "-" and found it was worse.  For your entertainment, this is 
what we saw:
1. we could train it OK with "-"
2. sometimes it would recognise "-"
3. other times it joined "---" together as one identified item
4. other times Tes totally refused to recognise anything at all.

So we are back to "+".  I propose to do this work to improve the results:
1. Deploy an algorithm that combines two or more bad / incomplete results 
into one good one (with a few guesses)
2. Pursue more aggressively a better font from the hardware device that 
prints what we must OCR
3. Consider writing a scripted training loop in which Tes' training set is 
varied along with the blur and contrast, so that we "evolve" the best 
settings for training set size and variation, and for the contrast and blur 
used in scan pre-treatment.  Run training many times in this loop.
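
For item 1, here is a minimal sketch of what I have in mind, not a finished 
design.  It assumes every usable reading of a line has the same known width 
(positions matter to us), and that a character we could not read is marked 
"?".  The function name, the "?" marker and the rule of leaving ties 
unresolved are all my own assumptions.

```python
from collections import Counter


def combine_readings(readings, width, unknown="?"):
    """Position-wise majority vote over OCR readings of one fixed-width line."""
    # Keep only readings of the expected width; shorter or longer ones
    # cannot be aligned position by position without extra guesswork.
    usable = [r for r in readings if len(r) == width]
    if not usable:
        return unknown * width

    combined = []
    for column in zip(*usable):
        # Count the votes for this position, ignoring unread characters.
        votes = Counter(ch for ch in column if ch != unknown)
        if not votes:
            combined.append(unknown)
            continue
        (best, best_n), *rest = votes.most_common(2)
        # A tie between the top two candidates is left as unknown
        # rather than guessed.
        if rest and rest[0][1] == best_n:
            combined.append(unknown)
        else:
            combined.append(best)
    return "".join(combined)


if __name__ == "__main__":
    runs = ["++4++2++", "++4++7++", "++4++2+", "+?4++2++"]
    print(combine_readings(runs, width=8))
```

The readings would come from the runs we already do at different threshold 
values.  Any position still marked "?" afterwards is where the "few guesses" 
would have to come in.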

Do you know whether Tes can be set to turn off scale invariance, since it 
will never have to recognise characters of differing sizes?

I might have to differ with you on letting Tes learn different characters 
for the same input character.  This works very well, and without it Tes 
cannot be taught that many variants of a number are in fact all the same 
number!

Kind regards and thank you for your interest and help.
Craig.

On Sunday, July 1, 2012 8:52:51 PM UTC+10, Dmitri Silaev wrote:
>
> TDG, 
>
> A bit of correction. Using "-" instead of "+" as a separator is even 
> worse. You need a separator that aligns well with the digit characters 
> at top and bottom. At the same time it should not resemble any of the 
> digits. Even somewhat degraded separators should look different from 
> digits. This could be "A" or "G", depending on how they look in your 
> font. This can't be "I" (resembles "1") or "O" (resembles "0"). It 
> should also preferably have a shape that minimizes inter-character 
> merges when binarized. 
>
> Tesseract does nothing to "learn words" when training. It only learns 
> character shapes. It can remember recognized words within the document 
> during recognition. Then it can use them at the second (adaptive) 
> pass. This helps only when particular words occur repeatedly in the 
> document. 
>
> Warm regards, 
> Dmitri Silaev 
> www.CustomOCR.com 
>
>
> On Mon, Jun 25, 2012 at 4:27 AM, TDG  wrote: 
> > Dmitri, 
> > This is very helpful. 
> > I have had some modest progress anyway via two changes, but I will also 
> > do as you suggest. 
> > The things I have done just now are: 
> > 1. added a margin around the image to be recognised 
> > 2. explicitly set tessedit_pageseg_mode to 3 
> > 
> > We also notice that it is very good at some integer sequences like 
> > ++4++2++++++++1++3++ 
> > and never gets 
> > ++4+++++1++2++3+++++ 
> > 
> > We think it has learned the "word" "4++2".  Mind you, this is not what 
> > the makebox result is for this particular image. 
> > In fact, when I use makebox to work out what is going on, it usually 
> > gets just single letters. 
> > Is there a way to know that it has decided on certain characters 
> > because of the order in which they appear? 
> > 
> > I think your ideas about the training set will improve it and I 
> > shouldn't be training with the separator in there. 
> > In production the separator is needed, but it doesn't have to be "+".  
> > I am now thinking it might be "-", which looks like no numbers at all. 
> > 
> > Thanks again for your time, 
> > T 
> > 
> > 
> > 
> > 
> > On Saturday, June 23, 2012 8:14:45 PM UTC+10, Dmitri Silaev wrote: 
> >> 
> >> TDG, 
> >> 
> >> Your command lines for training and recognition seem OK. But I'd 
> >> suggest paying more attention to the training images. First, do not 
> >> overblur images. If some characters still look disjointed, that's OK; 
> >> you will be able to correct their boxes later, when editing the 
> >> generated .box file. Characters should not lose their features, 
> >> as happens with the holes in some 4's in your image. Characters in 
> >> a training image must not touch one another, and rows have to be 
> >> horizontal, or at least have the same slope. None of these rules is 
> >> obeyed in your image. Therefore it is better to prepare your training 
> >> images by manual editing. You would need to compile an image (or a set 
> >> of images) containing 20-40 samples of each character, sufficiently 
> >> spaced and mixed properly. By the latter I mean that there should be 
> >> no long sequences of +'s, which can fool Tesseract's baseline-finding 
> >> algo. 
> >> 
> >> Second, IMO training to get two classes for + is a bad idea. There's 
> >> not much variety in shape for +, and Tesseract is able to cluster it 
> >> properly. Actually, using + as a whitespace placeholder is a bad idea, 
> >> if you have control over this. This character does not align with the 
> >> baseline and x-height or ascender lines, yet it appears in large 
> >> numbers and long sequences, so it can make Tess build incorrect 
> >> baselines. I think that, besides image skew, this can be a reason for 
> >> your unrecognized pieces. (BTW you can use my recent baseline debugging 
> >> tutorial.) So I'd suggest using as a placeholder some well-aligned 
> >> character with a very distinctive shape compared to the digits and 
> >> preferably of the same width as the digits. 
> >> 
> >> You've asked lots of questions but this is what I'd start working with. 
> >> 
> >> HTH 
> >> 
> >> Warm regards, 
> >> Dmitri Silaev 
> >> www.CustomOCR.com 
> >> 
> >> > Hi everyone, 
> >> > I am enjoying tesseract and it is really helping me in a pioneering 
> >> > project to do with verifiable elections. 
> >> > I am having some problems and would appreciate any hints or 
> >> > directions. 
> >> > If you don't have time to read the below, a key thing I would like 
> >> > to do is somehow have Tesseract NOT employ scale invariance, since 
> >> > my input characters are not going to vary in size. 
> >> > Here is the rest of the action. 
> >> > 
> >> > My OCR task involves recognising strings like this, printed with a 
> >> > dot-matrix-quality inkjet ticket printing device. 
> >> > 
> >> > ++4++3++5++6++1++7++2++++++++++++++++++++++++1++++++++++++++++++++ 
> >> > +++++1++2++3+++++++++++++++++++++++++++++++++++1++++++++++++++++++ 
> >> > 
> >> > There needs to be a field separator of some sort and I have been 
> >> > using plus. 
> >> > It could be anything but spaces, since these get collapsed and the 
> >> > position of the integers is important in our application. 
> >> > Tesseract sometimes performs very well and often not so well in my 
> >> > situation. 
> >> > I cope with this by running it in a loop and adjusting a variable in 
> >> > the input image treatment, the bottom levels threshold value. 
> >> > However, sometimes it does things I just can't quite understand, 
> >> > such as ignoring several characters at the left end of the string 
> >> > and recognising the rest. 
> >> > This seems to happen if the image is not totally straight. 
> >> > 
> >> > I have followed a number of guides online.  I am using the Tesseract 
> >> > 3.01 binary on Win7 64, and ImageMagick. 
> >> > I have bootstrapped training for several generations. 
> >> > Here is what I do: 
> >> > 
> >> > Train: 
> >> > tesseract.exe excella.jet.exp0.tif excella.jet.exp0 nobatch box.train.stderr 
> >> > unicharset_extractor.exe excella.jet.exp0.box 
> >> > echo jet 0 0 0 0 0 > font_properties 
> >> > mftraining -F font_properties -U unicharset excella.jet.exp0.tr 
> >> > cntraining.exe excella.jet.exp0.tr 
> >> > move /Y inttemp jet.inttemp 
> >> > move /Y Microfeat jet.Microfeat 
> >> > move /Y normproto jet.normproto 
> >> > move /Y pffmtable jet.pffmtable 
> >> > move /Y unicharset jet.unicharset 
> >> > combine_tessdata jet. 
> >> > move /Y jet.traineddata "c:\Program Files (x86)\Tesseract-OCR\tessdata\jet.traineddata" 
> >> > 
> >> > Recognition: 
> >> > convert file_name.jpg -units PixelsPerInch -density 300x300 tmp1.png 
> >> > convert tmp1.png -crop 2766x148+30+450 tmp2.png 
> >> > convert tmp2.png -blur 0x1.3 tmp3.png 
> >> > convert tmp3.png -units PixelsPerInch -resize 2766 -density 300x300 -level 60%,90%,30 tmp.tif 
> >> > tesseract tmp.tif tmp -l jet 
> >> > 
> >> > Attached: 
> >> > 
> >> > FRONT200GRAY8_16_6_12_1.JPG is a raw image 
> >> > recognise.jpg (uploaded as JPG as Google does not like TIF, scaled 
> >> > to width 1900 from 2766) is a treated image which I get a perfect 
> >> > result from 
> >> > recognise.box shows the recognition result if you use something like 
> >> > owlboxer and you scale and convert the above 
> >> > excella31.jet.exp0.tif (ditto, scaled down to 1900 width from 2474 
> >> > for Google size limit) is the training image 
> >> > 
> >> > "31" in excella31.jet.exp0 is the version, which is not shown above. 
> >> > 
> >> > Here are some other observations and questions: 
> >> > 1. Tesseract seems very sensitive to the amount of blur I give to 
> >> > images.  Adjusting the kernel by a tenth of a pixel makes a 
> >> > difference.  It seems that judging it by eye is not accurate given 
> >> > my example above.  Is this expected? 
> >> > 2. Tesseract seems very sensitive to the levels I set.  Varying the 
> >> > bottom threshold 5% makes a lot of difference.  In each of the above 
> >> > I vary it by 1% only.  How important is it how I set the training 
> >> > image treatment, since I am guessing what this needs by eye?  
> >> > Obviously it needs blurring to coalesce the dots in the font, but 
> >> > the other changes are guesstimates. 
> >> > 3. I couldn't get it to reliably train on so many variants of the 
> >> > "+" symbol, so I let it see "+" as "+" but also "C" to allow two 
> >> > more reasonable feature sets to cluster.  This works well.  Is there 
> >> > a way to actually know that it can't handle too much variance, apart 
> >> > from it then failing to accurately classify certain characters? 
> >> > 4. Sometimes it just doesn't start recognising digits until it is 
> >> > several characters in from the left, especially when the image is 
> >> > slightly crooked.  Should I write something to straighten it?  I 
> >> > notice it can read sideways already without problems.  Perhaps it is 
> >> > the slight angle that is the problem? 
> >> > 5. Can it be set NOT to encode for scale invariance?  My input data 
> >> > characters are not going to vary in scale.  It often finds little 
> >> > specks to be false positive matches. 
> >> > 6. The scanner is 200dpi; I can't change this since it is a special 
> >> > ticket-scanning device.  The training data are 300dpi and I scale up 
> >> > the input image.  Is my image treatment via JPG and PNG to TIF and 
> >> > rescaling likely to be a source of problems? 
> >> > 7. I read that Tesseract is designed to recognise words.  So would 
> >> > it be better that I train it with groups like "++1"  "+10"  etc., 
> >> > since three plusses is the field width for each datum I am scanning 
> >> > for? 
> >> > 
> >> > Thanks in advance for any advice you have, 
> >> > As before this is an excellent product.  I'd be stumped without it. 
> >> > Best, 
> >> > TDG 
> >> > 
> >> > 
> >> > 
> >> > 
> >> 
> > -- 
>
>
