Hi Nick,

I had read that post earlier and have already tried preprocessing the image. 
This is the input image http://imgur.com/yCxOvQS,GD38rCa which after 
preprocessing gives this http://imgur.com/JzrDkug . I wanted to know if 
there is some way to improve the results in the post-processing phase. Right 
now I am using regex matching to filter out the noise, but it doesn't work in 
all cases. For example, "does‘?", "That's‘his." and "their’" are words that 
should not be treated entirely as noise, yet they get filtered out by the 
regex matching. 
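
A gentler filter might repair such tokens instead of dropping them. The 
following is only a rough Python sketch, not something Tesseract provides: the 
names clean_token, looks_like_word and clean_line are hypothetical, and the 
quote rules and the 0.6 letter ratio are assumptions guessed from the sample 
output above.

```python
import re

# Assumption: an opening curly quote between two word characters is a missed
# space. A closing quote in that position may be a real apostrophe, so it stays.
MISSED_SPACE = re.compile(r"(?<=\w)[‘“](?=\w)")
# Assumption: a curly quote that is not followed by a word character is noise.
STRAY_QUOTE = re.compile(r"[‘’“”](?!\w)")
LETTER_RUN = re.compile(r"[A-Za-z]{2,}")
SINGLE_LETTER_WORDS = {"a", "A", "I"}


def clean_token(token):
    """Repair a token: turn a mid-word opening quote into a space, drop stray quotes."""
    token = MISSED_SPACE.sub(" ", token)
    return STRAY_QUOTE.sub("", token)


def looks_like_word(token, min_letter_ratio=0.6):
    """Heuristic: keep tokens that are mostly letters and contain a letter run."""
    if token in SINGLE_LETTER_WORDS:
        return True
    if not LETTER_RUN.search(token):
        return False
    letters = sum(ch.isalpha() for ch in token)
    return letters / len(token) >= min_letter_ratio


def clean_line(line):
    """Clean every token in a line and keep only the ones that look like words."""
    kept = []
    for raw in line.split():
        # clean_token may introduce a space, so split its result again.
        for token in clean_token(raw).split():
            if looks_like_word(token):
                kept.append(token)
    return " ".join(kept)


for sample in ["does‘?", "That's‘his.", "their’"]:
    print(repr(sample), "->", repr(clean_line(sample)))
```

With these rules the three words above survive with only the stray quote 
removed, while tokens with no run of letters are dropped. It is still a 
heuristic: a garbled token such as at-{rs passes the letter-ratio check, and 
a legitimate closing quote at the end of a word would be stripped.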

Also, is there any way to retrain Tesseract to improve the results in such 
cases? Is there any feedback mechanism that can help improve it?

On Tuesday, July 1, 2014 8:52:35 PM UTC+5:30, Nick White wrote:
>
> Hi Meenal, 
>
> On Tue, Jul 01, 2014 at 02:04:36AM -0700, Meenal Goyal wrote: 
> > When I try to ocr an image, it also produces some noise apart from the 
> > meaningful words. An example output for an image is: 
> > 
> > All women become 
> > 
> > like their’ mqthers. _ ' 1"’ ' 
> > 
> > - —T at-{rs their tragedy. ” "R"-‘»“T‘*'-. 
> > ‘ . 
> > 
> > / 
> > 
> >   
> > 
> > N man does“ 
> > 
> > That's‘his. ‘ ' 
> > 
> > os'cAR»w;L'15E ‘ 9 
> > 
> > So, I wanted something which removes the noise in the text or at least 
> > reduce it and produce correct output. 
>
> I see. The best plan would be to preprocess the image to clean it 
> up, so that Tesseract isn't seeing all that noise in the first 
> place. Check out this wiki page: 
> https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality 
>
> If you want to send a specific example image to the mailing list, we 
> can try to offer more specific advice. 
>
> Nick 
>
