Thanks Patrik, Dmitri,

@Patrik: Thanks for the heads-up on the usage of grammar. I am looking into the 'scanbizcard' app on my Android and will share the updates with you on my test images.

Overall, I would like to understand how word recognition happens in tesseract if the grammar is not used at all. I was looking into the language_model code but could not figure out how exactly a word is formed from the individual character decodings. Any inputs?

I come from an automatic speech recognition background, so please bear with me. What I was looking for was a typical language model implementation with which I could constrain the output of tesseract.

An example of my current test case is an address as follows:

image ground truth : SOUTHBURY, CT 0688
tesseract output   : SOUTHBURY~ CT DLUBB

As you can see, tesseract fails only in decoding the number (mainly due to the image font and spacing). Is there a way to force tesseract to consider only a given list of viable results, thus avoiding such errors entirely? That is, the decoding of SOUTHBURY would be constrained to a list like

SOUTHBURY CT 0688
XYZ CT 0689
...

resulting in the correct output.

*I was not able to attach the image here; I will send it over email.

Any suggestions are much appreciated.

Regards,
Amrit.

On Apr 6, 12:05 am, Dmitri Silaev <[email protected]> wrote:
> Agree not to use dictionary at all. IMO the best you can do is:
> - use appropriate whitelists for each character position
> - obtain a set of char choices for every char position
> - restrict choice sets by using other semantic information you may have
>
> Warm regards,
> Dmitri Silaev
>
> On Wed, Apr 6, 2011 at 6:00 AM, Amrit <[email protected]> wrote:
> > Hi All,
> > I am trying to evaluate tesseract to decode US postal address
> > from a set of images (english text with varying font). I want to extract
> > the city, state, zipcode combination from the image. In doing so, out of
> > the box tesseract 3.01 performance is average and I would like to
> > increase the accuracy of the system by providing a custom grammar/
> > wordlist (language model).
> >
> > Any idea as to how to accomplish this? (My custom grammar/
> > language model will only contain City, State and ZipCode numbers.)
> >
> > I have tried to create custom dawg by following on the lines of
> > 'training tesseract 3' wiki page, but this doesn't seem to work at
> > all. Is there any way I can do this without training a subset of my
> > test images?
> >
> > Regards,
> > Amrit.
> >
> > --
> > You received this message because you are subscribed to the Google Groups
> > "tesseract-ocr" group.
> > To post to this group, send email to [email protected].
> > To unsubscribe from this group, send email to
> > [email protected].
> > For more options, visit this group
> > at http://groups.google.com/group/tesseract-ocr?hl=en.
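
To illustrate the "list of viable results" idea from the message above, here is a minimal sketch that works purely as post-processing on the text tesseract returns. It does not touch tesseract's internals or its language model. The function name `pick_candidate`, the normalization step, and the 0.5 cutoff are all assumptions made for this illustration, and the similarity measure is Python's standard `difflib.SequenceMatcher`, not anything tesseract provides.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Upper-case the text, turn punctuation into spaces, collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.upper())
    return " ".join(cleaned.split())


def pick_candidate(ocr_text, candidates, cutoff=0.5):
    """Return the candidate most similar to the OCR output, or None.

    `cutoff` is a hypothetical threshold: if no candidate reaches it,
    the OCR output is treated as not matching the list at all.
    """
    target = normalize(ocr_text)
    best_score, best = 0.0, None
    for candidate in candidates:
        score = SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best_score, best = score, candidate
    return best if best_score >= cutoff else None


# Candidate list and OCR output taken from the example in the message.
viable = ["SOUTHBURY CT 0688", "XYZ CT 0689"]
print(pick_candidate("SOUTHBURY~ CT DLUBB", viable))
```

Whole-string similarity like this only helps when the city and state are read well enough to single out one entry. Two candidates that differ only in the zip code would need the per-position character choices Dmitri mentions.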

