TDG,

A small correction: using "-" instead of "+" as a separator is even
worse. You need a separator that aligns with the top and bottom of the
digit characters. At the same time it should not resemble any digit;
even a somewhat degraded separator should still look different from
the digits. This could be "A" or "G", depending on how they look in
your font. It can't be "I" (resembles "1") or "O" (resembles "0").
It should also have a shape that minimizes merging with neighboring
characters when the image is binarized.
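
A minimal sketch of how a fixed-width line with such a separator could be
built and read back by position. It assumes Python, a field width of three
characters (taken from the "++1" / "+10" examples quoted below), and "A" as
the separator. The function names and the look-alike table are hypothetical;
the table lists only the two pairs named above.

```python
FIELD_WIDTH = 3  # assumed: three characters per field, as in "++1" / "+10"
DIGIT_LOOKALIKES = {"I": "1", "O": "0"}  # only the pairs mentioned above


def check_separator(sep):
    """Reject separators that are digits or known digit look-alikes."""
    if len(sep) != 1:
        raise ValueError("separator must be a single character")
    if sep.isdigit():
        raise ValueError("separator must not be a digit")
    if sep in DIGIT_LOOKALIKES:
        raise ValueError(f"{sep!r} resembles {DIGIT_LOOKALIKES[sep]!r}")
    return sep


def encode_fields(values, sep="A"):
    """Right-align each value in a fixed-width field padded with sep.

    None stands for an empty field.
    """
    check_separator(sep)
    parts = []
    for value in values:
        text = "" if value is None else str(value)
        if len(text) > FIELD_WIDTH:
            raise ValueError(f"{value!r} does not fit in {FIELD_WIDTH} characters")
        parts.append(text.rjust(FIELD_WIDTH, sep))
    return "".join(parts)


def decode_fields(line, sep="A"):
    """Split a recognized line into fields by position, not by content."""
    check_separator(sep)
    if len(line) % FIELD_WIDTH:
        raise ValueError("line length is not a multiple of the field width")
    values = []
    for start in range(0, len(line), FIELD_WIDTH):
        digits = line[start:start + FIELD_WIDTH].lstrip(sep)
        values.append(int(digits) if digits else None)
    return values
```

The check only covers resemblance to digits. It cannot tell whether a
character aligns with the digits in a given font, so "-" and "+" pass it
even though they are poor choices for the reasons given above.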

Tesseract does nothing to "learn words" during training; it only learns
character shapes. During recognition it can remember words it has
recognized within the document and use them in the second (adaptive)
pass. This helps only when particular words occur repeatedly in the
document.
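
One way to check whether the adaptive pass is what changes a result is to
recognize the same image twice, once with adaptation switched off, and
compare the outputs. The sketch below is in Python and only prepares the two
command lines and compares two result strings. It assumes that your build
exposes the classify_enable_learning variable and accepts a config file path
as a trailing argument; please verify both against your version. The helper
names and output base names are hypothetical.

```python
from pathlib import Path


def write_no_adapt_config(path):
    """Write a config file that asks Tesseract to skip adaptive learning.

    Assumption: classify_enable_learning is available in the build in use.
    """
    Path(path).write_text("classify_enable_learning 0\n")
    return str(path)


def build_runs(image, lang, config_path):
    """Return two command lines: the default run and the non-adaptive run."""
    adaptive = ["tesseract", image, "out_adaptive", "-l", lang]
    static = ["tesseract", image, "out_static", "-l", lang, config_path]
    return adaptive, static


def differing_positions(first, second):
    """List the character positions at which two recognized lines disagree."""
    length = max(len(first), len(second))
    return [i for i in range(length) if first[i:i + 1] != second[i:i + 1]]
```

The command lists can be passed to subprocess.run. If the two outputs
differ only at certain positions, adaptation is a likely cause there; if
they are identical, the cause should be looked for elsewhere.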

Warm regards,
Dmitri Silaev
www.CustomOCR.com


On Mon, Jun 25, 2012 at 4:27 AM, TDG <[email protected]> wrote:
> Dmitri,
> This is very helpful.
> I have had some modest progress anyway via two changes but I will also do as
> you suggest.
> The things I have done just now are
> 1. added margin around the image to be recognised
> 2. explicitly set the tessedit_pageseg_mode to 3
>
> We also notice that it is very good at some integer sequences like
> ++4++2++++++++1++3++
> and never gets
> ++4+++++1++2++3+++++
>
> We think it has learned the "word" "4++2". Mind you, this is not what the
> makebox result shows for this particular image.
> In fact, when I use makebox to work out what is going on, it almost always
> gets just single letters.
> Is there a way to know that it has decided on certain characters because of
> the order in which they appear?
>
> I think your ideas about the training set will improve it and I shouldn't be
> training with the separator in there.
> In production the separator is needed, but it doesn't have to be "+". I am now
> thinking it might be "-", which looks like no digit at all.
>
> Thanks again for your time,
> T
>
>
>
>
> On Saturday, June 23, 2012 8:14:45 PM UTC+10, Dmitri Silaev wrote:
>>
>> TDG,
>>
>> Your command lines for training and recognition seem OK. But I'd
>> suggest paying more attention to the training images. First, do not
>> overblur the images. If some characters still look disjointed, that's OK;
>> you will be able to correct their boxes later, when editing the
>> generated .box file. Characters should not lose their features,
>> as happens with the holes in some 4's in your image. Characters in
>> a training image must not touch one another, and rows have to be
>> horizontal, or at least all have the same slope. Your image obeys
>> none of these rules. Therefore it is better to prepare your training
>> images by manual editing. You would need to compile an image (or a set
>> of images) containing 20-40 samples of each character, sufficiently
>> spaced and properly mixed. By the latter I mean that there should be
>> no long sequences of +'s, which can fool Tesseract's baseline-finding
>> algorithm.
>>
>> Second, IMO training to get two classes for + is a bad idea. There's
>> not much variety in shape for +, and Tesseract is able to cluster it
>> properly. Actually, using + as a whitespace placeholder is a bad idea,
>> if you have control over this. This character does not align with the
>> baseline, x-height, or ascender lines, yet it appears in large numbers
>> and in long sequences, so it can make Tess build incorrect baselines. I
>> think that, besides image skew, this can be a reason for your
>> unrecognized pieces. (BTW you can use my recent baseline debugging
>> tutorial.) So I'd suggest using as a placeholder some well-aligned
>> character with a very distinctive shape compared to digits and
>> preferably of the same width as the digits.
>>
>> You've asked lots of questions but this is what I'd start working with.
>>
>> HTH
>>
>> Warm regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>> > Hi everyone,
>> > I am enjoying tesseract and it is really helping me in a pioneering
>> > project
>> > to do with verifiable elections.
>> > I am having some problems and would appreciate any hints or directions.
>> > If you don't have time to read the below, a key thing I would like to do is
>> > somehow have Tesseract NOT employ scale invariance, since my input characters
>> > are not going to vary in size.
>> > Here is the rest of the action.
>> >
>> > My OCR task involves recognising strings like this, printed with a
>> > dot-matrix-quality inkjet ticket printing device.
>> >
>> > ++4++3++5++6++1++7++2++++++++++++++++++++++++1++++++++++++++++++++
>> > +++++1++2++3+++++++++++++++++++++++++++++++++++1++++++++++++++++++
>> >
>> > There needs to be a field separator of some sort and I have been using
>> > plus.
>> > It could be anything but not spaces since these get collapsed and
>> > position
>> > of the integers is important in our application.
>> > Tesseract sometimes performs very well and often not so well in my
>> > situation.
>> > I cope with this by running it in a loop and adjusting a variable in
>> > input
>> > image treatment, the bottom levels threshold value.
>> > However sometimes it does things I just can't quite understand, such as
>> > ignoring several characters at the left end of the string and
>> > recognising
>> > the rest.
>> > This seems to happen if the image is not totally straight.
>> >
>> > I have followed a number of guides online.  I am using Tesseract 3.01
>> > binary
>> > on win7 64, and image magick.
>> > I have bootstrapped training for several generations.
>> > Here is what I do
>> >
>> > Train:
>> > tesseract.exe excella.jet.exp0.tif excella.jet.exp0 nobatch box.train.stderr
>> > unicharset_extractor.exe excella.jet.exp0.box
>> > echo jet 0 0 0 0 0 > font_properties
>> > mftraining -F font_properties -U unicharset excella.jet.exp0.tr
>> > cntraining.exe excella.jet.exp0.tr
>> > move /Y inttemp jet.inttemp
>> > move /Y Microfeat jet.Microfeat
>> > move /Y normproto jet.normproto
>> > move /Y pffmtable jet.pffmtable
>> > move /Y unicharset jet.unicharset
>> > combine_tessdata jet.
>> > move /Y jet.traineddata "c:\Program Files (x86)\Tesseract-OCR\tessdata\jet.traineddata"
>> >
>> > Recognition:
>> > convert file_name.jpg -units PixelsPerInch -density 300x300 tmp1.png
>> > convert tmp1.png -crop 2766x148+30+450 tmp2.png
>> > convert tmp2.png -blur 0x1.3 tmp3.png
>> > convert tmp3.png -units PixelsPerInch -resize 2766 -density 300x300 -level 60%,90%,30 tmp.tif
>> > tesseract tmp.tif tmp -l jet
>> >
>> > Attached:
>> >
>> > FRONT200GRAY8_16_6_12_1.JPG is a raw image
>> > recognise.jpg (uploaded as JPG as Google does not like TIF, scaled to
>> > width
>> > 1900 from 2766) is a treated image which I get a perfect result from
>> > recognise.box shows the recognition result if you use something like
>> > owlboxer and you scale and convert the above
>> > excella31.jet.exp0.tif (ditto, scaled down to 1900 width from 2474 for
>> > Google size limit)  is the training image
>> >
>> > "31" in excella31.jet.exp0 is the version, which is not shown above.
>> >
>> > Here are some other observations and questions:
>> > 1. Tesseract seems very sensitive to the amount of blur I give to
>> > images.
>> > Adjusting the kernel by a tenth of a pixel makes a difference.  It seems
>> > that judging it by eye is not accurate given my example above.  Is this
>> > expected?
>> > 2. Tesseract seems very sensitive to the levels I set.  Varying the
>> > bottom
>> > threshold 5% makes a lot of difference.  In each of the above I vary it
>> > by
>> > 1% only.  How important is it that I get the training image treatment
>> > right, since I am guessing what it needs by eye?  Obviously it needs
>> > blurring to coalesce the dots in the font, but the other changes are
>> > guesstimates.
>> > 3. I couldn't get it to reliably train on so many variants of the "+"
>> > symbol, so I let it see "+" as "+" but also as "C" to allow two more
>> > reasonable feature sets to cluster.  This works well.  Is there a way to
>> > actually know that it can't handle too much variance, apart from it then
>> > failing to accurately classify certain characters?
>> > 4. Sometimes it just doesn't start recognising digits until it is several
>> > characters in from the left, especially when the image is slightly
>> > crooked.  Should I write something to straighten it?  I notice it can read
>> > sideways already without problems.  Perhaps it is the slight angle that is
>> > the problem?
>> > 5. Can it be set NOT to encode for scale invariance?  My input data
>> > characters are not going to vary in scale.  It often finds little specks
>> > to be false positive matches.
>> > 6. The scanner is 200dpi, I can't change this since it is a special
>> > ticket-scanning device.  The training data are 300dpi and I scale up the
>> > input image.  Is my image treatment via JPG and PNG to TIF and rescaling
>> > likely to be a source of problems?
>> > 7. I read that Tesseract is designed to recognise words.  So would it be
>> > better to train it with groups like "++1", "+10", etc., since three
>> > characters is the field width for each datum I am scanning for?
>> >
>> > Thanks in advance for any advice you have,
>> > As before this is an excellent product.  I'd be stumped without it.
>> > Best,
>> > TDG
>> >
>> >
>> >
>> >
>>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
