Well, to keep my answer brief, read the following papers (these links
are not easy to find; they are listed at
http://code.google.com/p/tesseract-ocr/wiki/Documentation):

http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseracticdar2007.pdf,
chapter 6
http://tesseract-ocr.googlecode.com/svn/trunk/doc/MOCRadaptingtesseract2.pdf,
chapter 5

They show the transition from 2007 to 2009, and work in this area is
still ongoing.

Warm regards,
Dmitri Silaev





On Thu, Apr 7, 2011 at 11:16 AM, Amrit <[email protected]> wrote:
> Thanks.
> I will look through your suggestions and try some upscaling and
> binarization options.
> It would be great if you could point me to some details about word
> decoding and how it happens. (I believe the decoded individual chars
> are parsed through some dictionary to give the appropriate word as a
> result. Correct me if I am mistaken.)
> Again, thanks for your help.
> Regards,
> Amrit.
>
>
> On Thu, Apr 7, 2011 at 1:53 AM, Dmitri Silaev <[email protected]> wrote:
>>
>> - Try to upscale the original images by a factor of 2 or 3. It might
>> improve the accuracy
>>
>> - Binarization. Tesseract's default Otsu isn't well suited here.
>> There are a number of methods; I won't suggest any one in particular:
>> you'll need to experiment with them. If you can always expect
>> fixed-pitch fonts, that can help, because you can detect font cells
>> and run binarization over each of them.
>>
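The per-cell idea above can be sketched roughly as follows. This is a toy in plain Python for clarity (in practice you would use numpy or OpenCV), and the cell width is an assumed parameter you would detect yourself, not something Tesseract provides:

```python
def otsu_threshold(pixels):
    """Classic Otsu: pick the threshold that maximizes between-class
    variance over an 8-bit grayscale histogram."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))
    sum_bg = 0.0
    weight_bg = 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        weight_bg += hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize_per_cell(row, cell_width):
    """Threshold each fixed-pitch cell separately instead of using one
    global threshold for the whole image, as Tesseract's Otsu does.
    Works on a single scanline here to keep the sketch short; a real
    implementation would threshold 2-D cells and handle empty or
    uniform cells specially."""
    out = []
    for start in range(0, len(row), cell_width):
        cell = row[start:start + cell_width]
        t = otsu_threshold(cell)
        out.extend(1 if p > t else 0 for p in cell)
    return out
```

The point of going per-cell is that a local threshold adapts to uneven lighting and ink density across the label, which a single global Otsu threshold cannot.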
>> - Handwritten addresses. IMHO Tesseract won't help you much here. A
>> long time ago one person (search for Keith Beaumont) tried to make
>> use of it, but AFAIK he achieved only moderate success. I don't know
>> whether he is continuing that work, though.
>>
>> - Various fonts. Training for the most dissimilar of them is
>> inevitable.
>>
>> - DAWGs. Sorry if you are already aware, but this is the initial
>> reading: http://en.wikipedia.org/wiki/Directed_acyclic_word_graph.
>> Don't be bothered by the details of how the dictionary works inside
>> Tesseract. It can be obscure, and its current state is certainly
>> provisional. All you have to do is build your dictionary and compile
>> the DAWGs. Again, most likely you already know, but how to do this is
>> described here:
>>
>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Dictionary_Data_(Optional).
>> You may also benefit from
>>
>> http://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary?
>> (I don't know whether this advice is still relevant - I currently
>> stay away from Tesseract's dictionary facility.)
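For intuition only: for membership tests a DAWG behaves like the toy trie below, except that identical suffixes are merged into shared nodes to save space. This is a sketch, not Tesseract's implementation; the actual DAWG files are built with the wordlist2dawg tool described on the training wiki page linked above:

```python
def build_trie(words):
    """Toy prefix tree of nested dicts; a real DAWG would additionally
    merge identical suffixes into shared nodes."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def contains(trie, word):
    """Walk the tree one character at a time, the way dictionary lookup
    proceeds alongside character classification."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node
```

This also shows why a dictionary helps during decoding: after "SOUTH" the tree only permits characters that continue some known word, which prunes implausible character choices.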
>>
>> Warm regards,
>> Dmitri Silaev
>>
>>
>>
>>
>>
>> On Thu, Apr 7, 2011 at 9:51 AM, Amrit <[email protected]> wrote:
>> > Thanks Dmitri, I appreciate your help.
>> > For some reason my response is not getting posted to the group. Not
>> > sure if you saw my earlier post. I am listing some of the points
>> > again.
>> > As for the image, I am sending along the original as well. The one
>> > sent earlier was a preprocessed one, with the last line from the
>> > address label extracted and grayscaled. (I am only interested in
>> > getting this accurately.)
>> > I do not have a choice for the resolution, as I am working with a
>> > set of already-taken images. Furthermore, my set of test images
>> > ranges from typed fonts (varied) to handwritten address labels, so
>> > individual font training is going to be a laborious process which I
>> > would like to avoid.
>> > My initial impression was that I could use the character decoding
>> > results and pass them to a language model to get the correct
>> > results, something similar to the process on the speech recognition
>> > side, where I have prior experience. I looked up the code under
>> > language_model but was not able to clearly understand its purpose
>> > and use. Also, I am unclear as to how exactly tesseract is actually
>> > doing the word decoding: is it based directly on the individual
>> > character sequence, or is some parsing done over a
>> > language_model/grammar to give out correct word results?
>> > e.g.
>> > image ground truth : SOUTHBURY, CT 0688
>> > tesseract output   : SOUTHBURY~ CT DLUBB
>> > I was wondering if there is a way by which I can direct this
>> > tesseract result to find the appropriate match in a given
>> > constrained list of possible outputs.
>> > If I have a language model containing the following:
>> > SOUTHBURY, CT 0688
>> > XYZ, CT 0688
>> > ....
>> > then, based on tesseract's correct decoding of the city name, I
>> > will be able to force the output to choice 1.
>> > Please do let me know if this is a possibility. Also, you had
>> > mentioned that I could use a dictionary for the city name; can you
>> > please give some more details? (I have already tried creating the
>> > custom dawg files, but it didn't seem to work.)
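The constrained-list matching described in this message can be done outside Tesseract entirely, as a post-processing step. A minimal sketch using Python's standard difflib (this is not anything Tesseract provides; the candidate list and cutoff are assumptions for the example):

```python
import difflib

# The constrained list of possible outputs (the "language model").
candidates = ["SOUTHBURY, CT 0688", "XYZ, CT 0688"]

# Raw Tesseract output for the address line.
ocr_output = "SOUTHBURY~ CT DLUBB"

# Pick the candidate most similar to the OCR result; the correctly
# decoded city name dominates the similarity score, so the garbled
# state/zip part does not prevent a match.
best = difflib.get_close_matches(ocr_output, candidates, n=1, cutoff=0.5)
print(best[0])  # -> SOUTHBURY, CT 0688
```

For larger candidate lists you would replace this with a proper edit-distance search or a lattice rescoring over character alternatives, but the principle is the same.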
>> >
>> > Regards,
>> > Amrit.
>> >
>> >
>> > On Wed, Apr 6, 2011 at 11:53 PM, Dmitri Silaev <[email protected]>
>> > wrote:
>> >>
>> >> Is it possible for you to get images in higher resolution? For
>> >> Tesseract, this resolution might be insufficient to achieve decent
>> >> accuracy.
>> >>
>> >> You do need to train for this specific font, as Tesseract's
>> >> "default" eng data is just a collection of some common computer
>> >> fonts, and yours is not one of them.
>> >>
>> >> For town/city names you can indeed use the dictionary approach,
>> >> but for the state and zip I would rather use the one I described
>> >> above. So the whole thing will require some programming, but as I
>> >> suppose, currently you are just evaluating the executable.
>> >>
>> >> Warm regards,
>> >> Dmitri Silaev
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Thu, Apr 7, 2011 at 8:29 AM, Amrit <[email protected]>
>> >> wrote:
>> >> > Thanks, sending it again.
>> >> > On Wed, Apr 6, 2011 at 11:24 PM, Dmitri Silaev
>> >> > <[email protected]>
>> >> > wrote:
>> >> >>
>> >> >> Just to let you know,
>> >> >> I can't see the images yet...
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Thu, Apr 7, 2011 at 8:17 AM, Amrit <[email protected]>
>> >> >> wrote:
>> >> >> > Hi Dmitri/Partik,
>> >> >> > Thanks for your reply. I am sending along the preprocessed
>> >> >> > test image which I had mentioned in my response.
>> >> >> > tesseract output - SOUTHBURY~ CT DLUBB
>> >> >> >
>> >> >> > Regards,
>> >> >> > Amrit.
>> >> >> >
>> >> >> > On Wed, Apr 6, 2011 at 12:05 AM, Dmitri Silaev
>> >> >> > <[email protected]>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> I agree not to use the dictionary at all. IMO the best you
>> >> >> >> can do is:
>> >> >> >> - use appropriate whitelists for each character position
>> >> >> >> - obtain a set of char choices for every char position
>> >> >> >> - restrict the choice sets using other semantic information
>> >> >> >> you may have
>> >> >> >>
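A minimal sketch of the per-position whitelist idea for the zip field. The letter-to-digit confusion map here is invented purely for illustration (you would derive one from your own error data), and Tesseract itself exposes whitelisting through its tessedit_char_whitelist variable rather than through anything like this:

```python
import string

# Every position in a zip field must be a digit.
ZIP_WHITELIST = set(string.digits)

# Hypothetical OCR confusion map -- illustrative only, not from
# Tesseract; build yours from the misreads you actually observe.
CONFUSIONS = {"D": "0", "O": "0", "B": "8", "L": "1", "U": "0", "S": "5"}

def constrain_zip(raw):
    """Force each character of an OCR'd zip field into the digit
    whitelist, mapping known confusions and flagging anything that
    cannot be resolved."""
    out = []
    for ch in raw:
        if ch in ZIP_WHITELIST:
            out.append(ch)
        elif ch in CONFUSIONS:
            out.append(CONFUSIONS[ch])
        else:
            out.append("?")  # unresolved: fall back to other evidence
    return "".join(out)
```

With this particular made-up map, constrain_zip("DLUBB") yields "01088"; in practice you would combine the constrained field with the char-choice sets mentioned above rather than trust a single substitution.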
>> >> >> >> Warm regards,
>> >> >> >> Dmitri Silaev
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> On Wed, Apr 6, 2011 at 6:00 AM, Amrit
>> >> >> >> <[email protected]>
>> >> >> >> wrote:
>> >> >> >> > Hi All,
>> >> >> >> >        I am trying to evaluate tesseract for decoding US
>> >> >> >> > postal addresses from a set of images (English text with
>> >> >> >> > varying fonts). I want to extract the city, state, zipcode
>> >> >> >> > combination from the image. In doing so, out-of-the-box
>> >> >> >> > tesseract 3.01 performance is average, and I would like to
>> >> >> >> > increase the accuracy of the system by providing a custom
>> >> >> >> > grammar/wordlist (language model).
>> >> >> >> >        Any idea as to how to accomplish this? (My custom
>> >> >> >> > grammar/language model will only contain city and state
>> >> >> >> > names and zip code numbers.)
>> >> >> >> >
>> >> >> >> > I have tried to create a custom dawg by following the
>> >> >> >> > 'Training Tesseract 3' wiki page, but this doesn't seem to
>> >> >> >> > work at all. Is there any way I can do this without
>> >> >> >> > training on a subset of my test images?
>> >> >> >> >
>> >> >> >> > Regards,
>> >> >> >> > Amrit.
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > You received this message because you are subscribed to the
>> >> >> >> > Google
>> >> >> >> > Groups "tesseract-ocr" group.
>> >> >> >> > To post to this group, send email to
>> >> >> >> > [email protected].
>> >> >> >> > To unsubscribe from this group, send email to
>> >> >> >> > [email protected].
>> >> >> >> > For more options, visit this group at
>> >> >> >> > http://groups.google.com/group/tesseract-ocr?hl=en.
>> >> >> >> >
>> >> >> >> >
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>
