Well, to keep my answer brief, read the following papers (the links are somewhat non-obviously located at http://code.google.com/p/tesseract-ocr/wiki/Documentation):
http://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseracticdar2007.pdf, chapter 6
http://tesseract-ocr.googlecode.com/svn/trunk/doc/MOCRadaptingtesseract2.pdf, chapter 5

One can see the transition from 2007 to 2009, and work in this area is still ongoing.

Warm regards,
Dmitri Silaev

On Thu, Apr 7, 2011 at 11:16 AM, Amrit <[email protected]> wrote:
> Thanks. I will look through your suggestions and try some upscaling and
> binarization options.
> If you could point me to some details about word decoding and how it
> happens, that would be great. (I believe the decoded individual characters
> are parsed through some dictionary to give an appropriate word as the
> result. Correct me if I am mistaken.)
> Again, thanks for your help.
> Regards,
> Amrit.
>
> On Thu, Apr 7, 2011 at 1:53 AM, Dmitri Silaev <[email protected]> wrote:
>>
>> - Try upscaling the original images by a factor of 2 or 3. It might
>> improve the accuracy.
>>
>> - Binarization. Tesseract's default Otsu isn't suited here. There are a
>> number of methods; I won't suggest any one: you'll need to play with them.
>> If you can always expect fixed-pitch fonts, this can help, because you
>> can detect font cells and run binarization over them.
>>
>> - Handwritten addresses. IMHO Tesseract won't help you much here. A long
>> time ago one person (search for Keith Beaumont) tried to make use of it,
>> but AFAIK he achieved only moderate success. I don't know whether he is
>> continuing that work, though.
>>
>> - Various fonts. Training for the most dissimilar of them is inevitable.
>>
>> - DAWGs. Sorry if you are already aware, but this is the initial
>> reading: http://en.wikipedia.org/wiki/Directed_acyclic_word_graph.
>> Don't be bothered by the details of how the dictionary works inside
>> Tesseract. It can be obscure, and its current state is certainly
>> provisional. All you have to do is build your dictionary and compile
>> the DAWGs.
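The upscaling and binarization steps suggested above can be sketched as follows. This is a dependency-free illustration: nearest-neighbour upscaling and a baseline global Otsu threshold. In practice you would likely use a proper resampler (e.g. cubic interpolation) and, as noted above, an adaptive method rather than global Otsu:

```python
import numpy as np

def upscale(img, factor=2):
    # Nearest-neighbour upscale by an integer factor; real code would use
    # an interpolating resize, but this keeps the sketch dependency-free.
    return np.kron(img, np.ones((factor, factor), dtype=img.dtype))

def otsu_threshold(img):
    # Classic Otsu: pick the threshold maximizing between-class variance.
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum = np.cumsum(hist)
    cum_mean = np.cumsum(hist * np.arange(256))
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = cum[t - 1], total - cum[t - 1]
        if w0 == 0 or w1 == 0:
            continue
        m0 = cum_mean[t - 1] / w0
        m1 = (cum_mean[255] - cum_mean[t - 1]) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(img):
    # Global thresholding; for fixed-pitch text one could instead detect
    # font cells and run this per cell, as suggested above.
    return (img >= otsu_threshold(img)).astype(np.uint8) * 255
```

For fixed-pitch fonts, running `binarize` over each detected character cell instead of the whole image is exactly the per-cell idea described in the binarization bullet.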
>> Again, most likely you already know this, but how to do it is described
>> here:
>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3#Dictionary_Data_(Optional).
>> You may also benefit from
>> http://code.google.com/p/tesseract-ocr/wiki/FAQ#How_to_increase_the_trust_in/strength_of_the_dictionary?
>> (I don't know whether that advice is relevant at the moment - I currently
>> stay away from Tesseract's dictionary facility.)
>>
>> Warm regards,
>> Dmitri Silaev
>>
>> On Thu, Apr 7, 2011 at 9:51 AM, Amrit <[email protected]> wrote:
>> > Thanks Dmitri, I appreciate your help.
>> > For some reason my response is not getting posted to the group. Not
>> > sure if you saw my earlier post, so I am listing some of the points
>> > again.
>> > As for the image, I am sending along the original as well. The one
>> > sent earlier was preprocessed: the last line of the address label was
>> > extracted and grayscaled. (This is the only part I need to get
>> > accurately.)
>> > I do not have a choice of resolution, as I am working with a set of
>> > already-taken images. Furthermore, my test images range from typed
>> > fonts (varied) to handwritten address labels, so training for each
>> > individual font is going to be a laborious process which I would like
>> > to avoid.
>> > My initial impression was that I could take the character decoding
>> > results and pass them to a language model to get the correct results,
>> > similar to the process on the speech recognition side, where I have
>> > prior experience. I looked at the code under language_model but was
>> > not able to clearly understand its purpose and use. Also, I am unclear
>> > as to how exactly Tesseract does the word decoding: is it based
>> > directly on the individual character sequence, or is some parsing done
>> > over a language model/grammar to give correct word results?
>> > e.g.
>> > image ground truth: SOUTHBURY, CT 0688
>> > tesseract output:   SOUTHBURY~ CT DLUBB
>> > I was wondering if there is a way I can direct this Tesseract result
>> > to the appropriate match in a given constrained list of possible
>> > outputs. If I have a language model containing the following:
>> > SOUTHBURY, CT 0688
>> > XYZ, CT 0688
>> > ....
>> > then, based on Tesseract's correct decoding of the city name, I will
>> > be able to force the output to choice 1.
>> > Please let me know if this is a possibility. Also, you mentioned that
>> > I could use a dictionary for the city name; could you please give some
>> > more details? (I have already tried creating the custom DAWG files,
>> > but that didn't seem to work.)
>> >
>> > Regards,
>> > Amrit.
>> >
>> > On Wed, Apr 6, 2011 at 11:53 PM, Dmitri Silaev <[email protected]>
>> > wrote:
>> >>
>> >> Is it possible for you to get images in higher resolution? For
>> >> Tesseract this resolution might be insufficient to achieve decent
>> >> accuracy.
>> >>
>> >> You do need to train for this specific font, as the "default"
>> >> Tesseract eng font is just a collection of some well-known computer
>> >> fonts, and yours is not one of them.
>> >>
>> >> For town/city names you can indeed use the dictionary approach, but
>> >> for the state and ZIP I would rather use the one I described above.
>> >> The whole thing will require some programming, but I suppose that
>> >> currently you are just evaluating the executable.
>> >>
>> >> Warm regards,
>> >> Dmitri Silaev
>> >>
>> >> On Thu, Apr 7, 2011 at 8:29 AM, Amrit <[email protected]> wrote:
>> >> > Thanks, sending it again.
>> >> > On Wed, Apr 6, 2011 at 11:24 PM, Dmitri Silaev
>> >> > <[email protected]> wrote:
>> >> >>
>> >> >> To let you know,
>> >> >> can't see the images yet...
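One way to "direct the result to the appropriate match in a constrained list", as asked above, is plain fuzzy matching over the candidate lines after OCR. A minimal sketch using Python's standard difflib; the candidate list and cutoff are illustrative:

```python
import difflib

# Hypothetical constrained list of valid "city, state ZIP" lines.
CANDIDATES = [
    "SOUTHBURY, CT 0688",
    "XYZ, CT 0688",
]

def best_match(ocr_text, candidates, cutoff=0.5):
    # Score every candidate by normalized string similarity and keep the
    # best one, provided it clears the cutoff; otherwise reject the line.
    scored = [(difflib.SequenceMatcher(None, ocr_text, c).ratio(), c)
              for c in candidates]
    score, match = max(scored)
    return match if score >= cutoff else None

print(best_match("SOUTHBURY~ CT DLUBB", CANDIDATES))
```

Matching at the whole-line level lets a confidently decoded city name pull the garbled state/ZIP along with it, which is essentially the force-feeding described above.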
>> >> >> On Thu, Apr 7, 2011 at 8:17 AM, Amrit <[email protected]> wrote:
>> >> >> > Hi Dmitri/Partik,
>> >> >> > Thanks for your reply. I am sending along the preprocessed test
>> >> >> > image which I had mentioned in my response.
>> >> >> > tesseract output - SOUTHBURY~ CT DLUBB
>> >> >> >
>> >> >> > Regards,
>> >> >> > Amrit.
>> >> >> >
>> >> >> > On Wed, Apr 6, 2011 at 12:05 AM, Dmitri Silaev
>> >> >> > <[email protected]> wrote:
>> >> >> >>
>> >> >> >> Agreed: don't use the dictionary at all. IMO the best you can
>> >> >> >> do is:
>> >> >> >> - use appropriate whitelists for each character position
>> >> >> >> - obtain a set of character choices for every character position
>> >> >> >> - restrict the choice sets by using any other semantic
>> >> >> >> information you may have
>> >> >> >>
>> >> >> >> Warm regards,
>> >> >> >> Dmitri Silaev
>> >> >> >>
>> >> >> >> On Wed, Apr 6, 2011 at 6:00 AM, Amrit <[email protected]>
>> >> >> >> wrote:
>> >> >> >> > Hi All,
>> >> >> >> > I am trying to evaluate Tesseract for decoding US postal
>> >> >> >> > addresses from a set of images (English text with varying
>> >> >> >> > fonts). I want to extract the city, state, and ZIP code
>> >> >> >> > combination from the image. Out of the box, Tesseract 3.01
>> >> >> >> > performance is average, and I would like to increase the
>> >> >> >> > accuracy of the system by providing a custom grammar/wordlist
>> >> >> >> > (language model).
>> >> >> >> > Any idea how to accomplish this? (My custom grammar/language
>> >> >> >> > model will only contain city, state, and ZIP code numbers.)
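The whitelist-per-position advice above (restrict each character position's choice set, then pick the best surviving alternative) can be sketched like this. The `choices` data and confidences here are invented for illustration; in real code they would come from Tesseract's per-symbol choice iterator, or, for the executable, one would set the `tessedit_char_whitelist` config variable before recognition:

```python
# Hypothetical (character, confidence) alternatives per position, roughly
# modeled on the "DLUBB" vs "0688" confusion from the thread.
choices = [
    [("D", 0.40), ("0", 0.35)],
    [("L", 0.50), ("6", 0.30)],
    [("U", 0.60), ("8", 0.30)],
    [("B", 0.50), ("8", 0.40)],
    [("B", 0.50), ("8", 0.40)],
]

def constrain(choices, whitelist):
    # Keep only whitelisted alternatives at each position, then take the
    # most confident survivor; fall back to the raw best alternative if
    # the whitelist filters out everything at a position.
    out = []
    for alts in choices:
        allowed = [a for a in alts if a[0] in whitelist]
        pool = allowed or alts
        out.append(max(pool, key=lambda a: a[1])[0])
    return "".join(out)

print(constrain(choices, set("0123456789")))  # ZIP positions: digits only
```

Knowing that a ZIP field is all digits is exactly the kind of "other semantic information" that turns letter-shaped misreads back into digits without any dictionary at all.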
>> >> >> >> >
>> >> >> >> > I have tried to create a custom DAWG by following the
>> >> >> >> > 'Training Tesseract 3' wiki page, but this doesn't seem to
>> >> >> >> > work at all. Is there any way I can do this without training
>> >> >> >> > on a subset of my test images?
>> >> >> >> >
>> >> >> >> > Regards,
>> >> >> >> > Amrit.
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > You received this message because you are subscribed to the
>> >> >> >> > Google Groups "tesseract-ocr" group.
>> >> >> >> > To post to this group, send email to
>> >> >> >> > [email protected].
>> >> >> >> > To unsubscribe from this group, send email to
>> >> >> >> > [email protected].
>> >> >> >> > For more options, visit this group at
>> >> >> >> > http://groups.google.com/group/tesseract-ocr?hl=en.

