- Try upscaling the original images by a factor of 2 or 3. It might improve accuracy.
- Binarization. Tesseract's default Otsu isn't well suited here. There are a number of methods; I won't suggest any one: you'll need to experiment with them. If you can always expect fixed-pitch fonts, this can help, because you can detect font cells and run binarization over each of them.
- Handwritten addresses. IMHO Tesseract won't help you much here. A long time ago one person (search for Keith Beaumont) tried to make use of it for this, but AFAIK he achieved only moderate success. I don't know if he continues his work on it, though.
- Various fonts. Training for the most dissimilar of them is inevitable.
- DAWGs. Sorry if you are already aware, but this is the initial reading: http://en.wikipedia.org/wiki/Directed_acyclic_word_graph. Don't be bothered by the details of dictionary work inside Tesseract. It can be obscure, and its current state is certainly provisional. All you have to do is build your dictionary and compile the DAWGs. Again, most likely you already know this, but how to do it is described here: http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Dictionary_Data_(Optional). You may also benefit from http://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary? (I don't know if that advice is relevant at the moment - I currently stay away from Tesseract's dictionary facility.)

Warm regards,
Dmitri Silaev


On Thu, Apr 7, 2011 at 9:51 AM, Amrit <[email protected]> wrote:
> Thanks Dmitri, I appreciate your help.
> For some reason my response is not getting posted to the group. Not sure if
> you saw my earlier post. I am listing some of the points again.
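To make the upscaling and binarization advice above concrete, here is a minimal pure-Python sketch of both steps (2x nearest-neighbor upscaling and Otsu thresholding) operating on a grayscale image represented as a list of rows. This is only an illustration of the logic; for real images you would use an image library such as OpenCV or Leptonica, and the toy image below is an assumption for demonstration:

```python
def upscale_2x(img):
    """Nearest-neighbor 2x upscale of a grayscale image (list of rows)."""
    out = []
    for row in img:
        doubled = [p for p in row for _ in (0, 1)]  # duplicate each column
        out.append(doubled)
        out.append(list(doubled))                   # duplicate each row
    return out

def otsu_threshold(img):
    """Return the threshold that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for row in img:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = sum_bg = 0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(img, t):
    """Map pixels above the threshold to white (255), the rest to black."""
    return [[255 if p > t else 0 for p in row] for row in img]

# Toy 2x4 "image": dark text pixels (10) on a light background (200).
img = [[10, 10, 200, 200],
       [10, 10, 200, 200]]
big = upscale_2x(img)            # now 4x8
t = otsu_threshold(big)
binary = binarize(big, t)
print(t, binary[0])
```

For fixed-pitch fonts, the same thresholding could be run per detected font cell instead of over the whole image, as suggested above.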
> As for the image, I am sending along the original as well. The one sent
> earlier was a preprocessed one with the last line of the address label
> extracted and grayscaled. (I am only interested in getting this line
> accurately.) I do not have a choice of resolution, as I am working with a
> set of already-taken images. Furthermore, my set of test images ranges
> from typed fonts (varied) to handwritten address labels, so individual
> font training is going to be a laborious process which I would like to
> avoid.
> My initial impression was that I could take the character decoding results
> and pass them to a language model to get the correct results, similar to
> the process on the speech recognition side, where I have prior experience.
> I looked up the code under language_model but was not able to clearly
> understand its purpose and use. Also, I am unclear as to how exactly
> Tesseract does the word decoding: is it based directly on the individual
> character sequence, or is some parsing done over a language_model/grammar
> to give out correct word results?
> e.g.
> image ground truth: SOUTHBURY, CT 0688
> tesseract output: SOUTHBURY~ CT DLUBB
> I was wondering if there is a way by which I can direct this Tesseract
> result to find the appropriate match in a given constrained list of
> possible outputs.
> If I have a language model containing the following:
> SOUTHBURY, CT 0688
> XYZ, CT 0688
> ....
> then based on Tesseract's correct decoding of the city name I will be
> able to force the output to choice 1.
> Please do let me know if this is a possibility. Also, you had mentioned
> that I could use a dictionary for the city name; can you please give some
> more details? (I have already tried creating the custom dawg files, but
> that didn't seem to work.)
>
> Regards,
> Amrit.
>
>
> On Wed, Apr 6, 2011 at 11:53 PM, Dmitri Silaev <[email protected]>
> wrote:
>>
>> Is it possible for you to get images in higher res?
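The constrained-list matching Amrit asks about can be done outside Tesseract entirely, with plain string similarity against the candidate list. A sketch using Python's standard difflib (the candidate list and the 0.5 cutoff are illustrative assumptions, not anything Tesseract provides):

```python
import difflib

# The constrained "language model": every valid last line we could see.
candidates = [
    "SOUTHBURY, CT 0688",
    "XYZ, CT 0688",
]

def best_match(ocr_line, candidates, cutoff=0.5):
    """Pick the candidate closest to the raw OCR output by string
    similarity; return None if nothing is similar enough."""
    scored = [(difflib.SequenceMatcher(None, ocr_line, c).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= cutoff else None

# The noisy output from the thread snaps to the right candidate because
# the correctly decoded city name dominates the similarity score.
print(best_match("SOUTHBURY~ CT DLUBB", candidates))
```

This "force feed to choice 1" behavior is exactly what was described: the correctly recognized city carries enough signal to select the full correct line even though the zip was badly misread.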
For Tesseract this
>> resolution might be insufficient to achieve decent accuracy.
>>
>> You do need to train for this specific font, as the "default"
>> Tesseract eng font is just a collection of some well-known computer
>> fonts, and yours is not one of them.
>>
>> For town/city names you can indeed use a dictionary approach, but for
>> the state and zip I'd rather use the one I described above. So the
>> whole thing will require some programming; as I suppose, you are
>> currently just evaluating the executable.
>>
>> Warm regards,
>> Dmitri Silaev
>>
>> On Thu, Apr 7, 2011 at 8:29 AM, Amrit <[email protected]> wrote:
>> > Thanks, sending it again.
>> > On Wed, Apr 6, 2011 at 11:24 PM, Dmitri Silaev <[email protected]>
>> > wrote:
>> >>
>> >> To let you know,
>> >> can't see images yet...
>> >>
>> >> On Thu, Apr 7, 2011 at 8:17 AM, Amrit <[email protected]>
>> >> wrote:
>> >> > Hi Dmitri/Partik,
>> >> > Thanks for your reply. I am sending along the preprocessed test
>> >> > image which I had mentioned in my response.
>> >> > tesseract output - SOUTHBURY~ CT DLUBB
>> >> >
>> >> > Regards,
>> >> > Amrit.
>> >> >
>> >> > On Wed, Apr 6, 2011 at 12:05 AM, Dmitri Silaev
>> >> > <[email protected]>
>> >> > wrote:
>> >> >>
>> >> >> Agreed - don't use the dictionary at all.
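Dmitri's suggestion to handle the state and zip without a dictionary can be sketched as simple post-processing: force zip characters to digits through a letter-to-digit confusion map, and snap the state to the nearest valid two-letter code. The confusion map and the trimmed state list below are assumptions for illustration; in practice you would build the map from your own error statistics and use the full USPS state list:

```python
import difflib

# Assumed OCR letter->digit confusions; extend from your own error data.
CONFUSIONS = {"O": "0", "Q": "0", "D": "0", "I": "1", "L": "1",
              "Z": "2", "S": "5", "G": "6", "B": "8"}

# Trimmed list of valid state codes, for illustration only.
STATES = ["CT", "NY", "NJ", "MA"]

def fix_zip(raw):
    """Force each character of an OCR'd zip to a digit where a known
    confusion exists; leave already-correct characters untouched."""
    return "".join(CONFUSIONS.get(c, c) for c in raw.upper())

def fix_state(raw):
    """Snap an OCR'd state code to the closest valid two-letter code,
    or return it unchanged if nothing is close enough."""
    hits = difflib.get_close_matches(raw.upper(), STATES, n=1, cutoff=0.5)
    return hits[0] if hits else raw

print(fix_zip("O68B3"))   # letters O and B forced to digits
print(fix_state("C7"))    # digit 7 snapped back to T via similarity
```

This is the "semantic information" route: the zip field is known to be all digits and the state field is known to come from a small closed set, so neither needs Tesseract's dictionary machinery.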
IMO the best you can do is:
>> >> >> - use appropriate whitelists for each character position
>> >> >> - obtain a set of char choices for every char position
>> >> >> - restrict the choice sets by using other semantic information
>> >> >>   you may have
>> >> >>
>> >> >> Warm regards,
>> >> >> Dmitri Silaev
>> >> >>
>> >> >> On Wed, Apr 6, 2011 at 6:00 AM, Amrit <[email protected]>
>> >> >> wrote:
>> >> >> > Hi All,
>> >> >> > I am trying to evaluate Tesseract for decoding US postal
>> >> >> > addresses from a set of images (English text with varying
>> >> >> > fonts). I want to extract the city, state, zipcode combination
>> >> >> > from the image. In doing so, out-of-the-box Tesseract 3.01
>> >> >> > performance is average, and I would like to increase the
>> >> >> > accuracy of the system by providing a custom grammar/wordlist
>> >> >> > (language model).
>> >> >> > Any idea how to accomplish this? (My custom grammar/language
>> >> >> > model will only contain city, state and zipcode numbers.)
>> >> >> >
>> >> >> > I have tried to create a custom dawg by following the
>> >> >> > 'Training Tesseract 3' wiki page, but this doesn't seem to
>> >> >> > work at all. Is there any way I can do this without training
>> >> >> > on a subset of my test images?
>> >> >> >
>> >> >> > Regards,
>> >> >> > Amrit.
>> >> >> >
>> >> >> > --
>> >> >> > You received this message because you are subscribed to the
>> >> >> > Google Groups "tesseract-ocr" group.
>> >> >> > To post to this group, send email to
>> >> >> > [email protected].
>> >> >> > To unsubscribe from this group, send email to
>> >> >> > [email protected].
>> >> >> > For more options, visit this group at
>> >> >> > http://groups.google.com/group/tesseract-ocr?hl=en.
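The three-step recipe Dmitri gives earlier in the thread (per-position whitelists, per-position choice sets, semantic restriction) can be sketched as a small search over classifier alternatives. The choice sets and valid-code list below are hypothetical examples, standing in for the alternatives Tesseract's classifier would report through its API:

```python
from itertools import product

# Hypothetical per-position choice sets, as a classifier might report
# for a two-character state field that is actually "CT".
choices = [["C", "G", "O"],   # alternatives for position 1
           ["T", "I", "7"]]   # alternatives for position 2

# Semantic restriction: the field must be a valid state code.
valid_states = {"CT", "CA", "CO", "GA"}

def restrict(choices, valid):
    """Keep only combinations of per-position choices that form a
    valid code; the per-position lists act as positional whitelists."""
    return sorted("".join(c) for c in product(*choices)
                  if "".join(c) in valid)

print(restrict(choices, valid_states))
```

Of the nine possible combinations, only one survives the semantic restriction, which is the effect of combining positional whitelists with a closed vocabulary. In real use, each position's list would be ordered by classifier confidence, and ties would be broken by taking the highest-confidence surviving combination.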

