Re: [tesseract-ocr] retrieve words not matching the dictionary

Meenal Goyal Thu, 03 Jul 2014 00:45:27 -0700

Hi Nick,

The "question about training tesseract" thread only suggests pre-processing 
steps, including binarisation, and I have already tried them. I wanted to 
know whether anything can be done to improve the output at a later stage, 
something like adding words to the dictionary used by Tesseract.

I have tried listing words in eng.user-words, but it was not much use. Can 
you suggest anything of this sort that would train Tesseract over time and 
help improve the output?
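
To make the kind of post-processing discussed in this thread concrete, here is 
a rough Python sketch. It is only an illustration under stated assumptions, not 
something Tesseract provides: the function name, the regex and the tiny word 
set are all hypothetical. Instead of dropping a whole token when it contains a 
stray character, it pulls out the letter runs (keeping an ASCII apostrophe 
inside a word) and then separates the pieces that match a word list from those 
that do not.

```python
import re

# Runs of ASCII letters, optionally joined by ASCII apostrophes (e.g. "That's").
# Curly quotes, digits and other symbols act as separators, so stray marks
# attached to a word are cut off rather than causing the word to be dropped.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")


def split_known_unknown(ocr_text, dictionary):
    """Return (known, unknown) word lists extracted from OCR output.

    `dictionary` is assumed to be a set of lower-case words; in practice it
    could be loaded from any word list file (hypothetical here).
    """
    known, unknown = [], []
    for candidate in WORD_RE.findall(ocr_text):
        if candidate.lower() in dictionary:
            known.append(candidate)
        else:
            unknown.append(candidate)
    return known, unknown


if __name__ == "__main__":
    # Small illustrative word set, not a real dictionary.
    words = {"all", "women", "become", "like", "their", "that's", "his", "does"}
    sample = "like their’ mqthers. That's‘his. does‘?"
    known, unknown = split_known_unknown(sample, words)
    print("known:  ", known)
    print("unknown:", unknown)
```

With this approach "their’" and "does‘?" survive as "their" and "does", and the 
unknown list holds the words that do not match the dictionary, which can then 
be reviewed or corrected separately. Very short fragments left over from pure 
noise will also land in the unknown list, so a minimum length check may be 
worth adding.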

Thanks,
Meenal

On Wednesday, July 2, 2014 7:40:56 PM UTC+5:30, Nick White wrote:
>
> That's a tough thing to preprocess. Take a look at this recent 
> thread on this list: "question about training tesseract". 
>
> Nick 
>
> On Tue, Jul 01, 2014 at 11:48:07PM -0700, Meenal Goyal wrote: 
> > Hi Nick, 
> > 
> > I have read that post earlier and also tried to preprocess the image. This is 
> > the input image http://imgur.com/yCxOvQS,GD38rCa which after preprocessing 
> > gives this http://imgur.com/JzrDkug . I wanted to know if there is some way to 
> > improve the output in the post-processing phase. Right now I am using regex 
> > matching to filter the noise, but it doesn't work in all cases. For example, 
> > "does‘?",  "That's‘his." ,  "their’" are some words which should not be treated 
> > entirely as noise, but they get filtered out after regex matching. 
> > 
> > Also, is there any way to retrain Tesseract to improve results in such 
> > cases? Is there any feedback mechanism which can help it improve? 
> > 
> > On Tuesday, July 1, 2014 8:52:35 PM UTC+5:30, Nick White wrote: 
> > 
> >     Hi Meenal, 
> > 
> >     On Tue, Jul 01, 2014 at 02:04:36AM -0700, Meenal Goyal wrote: 
> >     > When I try to ocr an image, it also produces some noise apart from the 
> >     > meaningful words. An example output for an image is: 
> >     > 
> >     > All women become 
> >     > 
> >     > like their’ mqthers. _ ' 1"’ ' 
> >     > 
> >     > - —T at-{rs their tragedy. ” "R"-‘»“T‘*'-. 
> >     > ‘ . 
> >     > 
> >     > / 
> >     > 
> >     >   
> >     > 
> >     > N man does“ 
> >     > 
> >     > That's‘his. ‘ ' 
> >     > 
> >     > os'cAR»w;L'15E ‘ 9 
> >     > 
> >     > So, I wanted something which removes the noise in the text, or at least 
> >     > reduces it, and produces correct output. 
> > 
> >     I see. The best plan would be to preprocess the image to clean it 
> >     up, so that Tesseract isn't seeing all that noise in the first 
> >     place. Check out this wiki page: 
> >     https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality 
> > 
> >     If you want to send a specific example image to the mailing list, we 
> >     can try to offer more specific advice. 
> > 
> >     Nick 
> > 
> > -- 
> > You received this message because you are subscribed to the Google Groups 
> > "tesseract-ocr" group. 
> > To unsubscribe from this group and stop receiving emails from it, send an email 
> > to [email protected]. 
> > To post to this group, send email to [email protected]. 
> > Visit this group at http://groups.google.com/group/tesseract-ocr. 
> > To view this discussion on the web visit 
> > https://groups.google.com/d/msgid/tesseract-ocr/bcaac70d-0459-4783-9b4b-86934eb003b7%40googlegroups.com. 
> > For more options, visit https://groups.google.com/d/optout. 
>

