That's a tough thing to preprocess. Take a look at this recent thread on this list: "question about training tesseract".
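On the post-processing question quoted below: rather than dropping a whole token when it contains a stray mark, one option is to salvage the word inside it. The following is only a rough sketch under stated assumptions, not anything Tesseract provides. The names `STRAY`, `WORD`, `clean_token` and `clean_line` are hypothetical, and it assumes that curly quotes and guillemets never belong in the cleaned text and that only English letters with ASCII apostrophes count as words.

```python
import re
import string

# Assumption: these characters are always OCR debris in this kind of image.
STRAY = re.compile(r"[‘’“”«»]")
# A "word" here is letters with optional internal ASCII apostrophes (e.g. That's).
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def clean_token(token):
    """Return the list of words salvaged from one whitespace-delimited token."""
    words = []
    # Treat stray quote marks as separators instead of rejecting the token.
    for piece in STRAY.sub(" ", token).split():
        # Trim ASCII punctuation from the edges only; internal marks stay.
        core = piece.strip(string.punctuation)
        # Keep it only if what remains looks like a word. Single letters
        # other than "a" and "I" are treated as noise.
        if WORD.fullmatch(core) and (len(core) > 1 or core in ("a", "A", "I")):
            words.append(core)
    return words


def clean_line(line):
    """Clean every token on a line and join the surviving words."""
    return " ".join(w for tok in line.split() for w in clean_token(tok))


if __name__ == "__main__":
    print(clean_line("That's‘his. ‘ '"))  # prints: That's his
```

This keeps "does", "That's his" and "their" from the three problem tokens mentioned in the quoted message. It will still let through some garbage that happens to look like a word, and it cannot repair misrecognised letters such as "mqthers", so it complements image preprocessing rather than replacing it.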
Nick

On Tue, Jul 01, 2014 at 11:48:07PM -0700, Meenal Goyal wrote:
> Hi Nick,
>
> I have read that post earlier and also tried to preprocess the image. This is
> the input image http://imgur.com/yCxOvQS,GD38rCa which after preprocessing
> gives this http://imgur.com/JzrDkug . I wanted to know if there is some way to
> improve in post-processing phase. Right now I am using regex matching to filter
> the noise but it doesn't work in all cases. For eg:
> "does‘?", "That's‘his." , "their’" are some words which may not be considered
> fully as noise but they get filtered out after regex matching.
>
> Also, Is there any way to retrain tesseract for improving results in such
> cases? Any feedback mechanism which can help improve?
>
> On Tuesday, July 1, 2014 8:52:35 PM UTC+5:30, Nick White wrote:
> > Hi Meenal,
> >
> > On Tue, Jul 01, 2014 at 02:04:36AM -0700, Meenal Goyal wrote:
> > > When I try to ocr an image, it also produces some noise apart from the
> > > meaningful words. An example output for an image is:
> > >
> > > All women become
> > >
> > > like their’ mqthers. _ ' 1"’ '
> > >
> > > - —T at-{rs their tragedy. ” "R"-‘»“T‘*'-.
> > > ‘ .
> > >
> > > /
> > >
> > > N man does“
> > >
> > > That's‘his. ‘ '
> > >
> > > os'cAR»w;L'15E ‘ 9
> > >
> > > So, I wanted something which removes the noise in the text or at least
> > > reduce it and produce correct output.
> >
> > I see. The best plan would be to preprocess the image to clean it
> > up, so that Tesseract isn't seeing all that noise in the first
> > place. Check out this wiki page:
> > https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality
> >
> > If you want to send a specific example image to the mailing list, we
> > can try to offer more specific advice.
> >
> > Nick

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/20140702140925.GB2081%40manta.lan.
For more options, visit https://groups.google.com/d/optout.

