That's a tough thing to preprocess. Take a look at this recent 
thread on this list: "question about training tesseract".
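
On the post-processing question quoted below: instead of dropping a whole
token as soon as it contains a stray character, one option is to keep the
word-like part of the token and discard only the junk around it. What follows
is a minimal sketch in Python using only the standard `re` module. The names
`clean_token` and `clean_line`, the 0.5 letter-ratio threshold, and the
two-letter minimum are illustrative assumptions, not part of Tesseract, and
would need tuning on real output.

```python
import re

# A "word": runs of ASCII letters, optionally joined by a straight apostrophe
# (as in "That's"), optionally followed by one sentence punctuation mark.
# Assumption: English text; extend the character classes for other scripts.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*[.,?!]?")

# Single-letter words that are allowed to survive on their own.
SHORT_WORDS = {"a", "A", "I"}


def clean_token(token, min_letter_ratio=0.5, min_letters=2):
    """Return the word-like pieces of one OCR token, or [] if it looks like noise."""
    pieces = WORD.findall(token)
    letters = sum(ch.isalpha() for ch in token)
    # Reject tokens with no word-like piece, or that are mostly symbols.
    if not pieces or letters / len(token) < min_letter_ratio:
        return []
    kept = []
    for piece in pieces:
        n_letters = sum(ch.isalpha() for ch in piece)
        if n_letters >= min_letters or piece.rstrip(".,?!") in SHORT_WORDS:
            kept.append(piece)
    return kept


def clean_line(line):
    """Clean every whitespace-separated token in a line and rejoin the survivors."""
    kept = []
    for token in line.split():
        kept.extend(clean_token(token))
    return " ".join(kept)


# The three words from the quoted message that a whole-token filter throws away:
for sample in ["does‘?", "That's‘his.", "their’"]:
    print(sample, "->", clean_token(sample))
# does‘? -> ['does']
# That's‘his. -> ["That's", 'his.']
# their’ -> ['their']
```

This only trims junk; it does not correct misrecognised letters (for
instance "mqthers" stays as it is), and short noise fragments made of letters
can still slip through.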

Nick

On Tue, Jul 01, 2014 at 11:48:07PM -0700, Meenal Goyal wrote:
> Hi Nick,
> 
> I read that post earlier and also tried preprocessing the image. This is
> the input image http://imgur.com/yCxOvQS,GD38rCa which after preprocessing
> gives this: http://imgur.com/JzrDkug . I wanted to know whether there is some
> way to improve the results in the post-processing phase. Right now I am using
> regex matching to filter the noise, but it doesn't work in all cases. For
> example, "does‘?", "That's‘his.", and "their’" are words that should not be
> treated entirely as noise, yet they get filtered out by the regex matching.
> 
> Also, is there any way to retrain Tesseract to improve results in such
> cases? Is there any feedback mechanism that can help?
> 
> On Tuesday, July 1, 2014 8:52:35 PM UTC+5:30, Nick White wrote:
> 
>     Hi Meenal,
> 
>     On Tue, Jul 01, 2014 at 02:04:36AM -0700, Meenal Goyal wrote:
>     > When I try to OCR an image, it produces some noise in addition to the
>     > meaningful words. An example output for an image is:
>     >
>     > All women become
>     >
>     > like their’ mqthers. _ ' 1"’ '
>     >
>     > - —T at-{rs their tragedy. ” "R"-‘»“T‘*'-.
>     > ‘ .
>     >
>     > /
>     >
>     >  
>     >
>     > N man does“
>     >
>     > That's‘his. ‘ '
>     >
>     > os'cAR»w;L'15E ‘ 9
>     >
>     > So I wanted something that removes the noise in the text, or at least
>     > reduces it, and produces correct output.
> 
>     I see. The best plan would be to preprocess the image to clean it
>     up, so that Tesseract isn't seeing all that noise in the first
>     place. Check out this wiki page:
>     https://code.google.com/p/tesseract-ocr/wiki/ImproveQuality
> 
>     If you want to send a specific example image to the mailing list, we
>     can try to offer more specific advice.
> 
>     Nick
> 
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/bcaac70d-0459-4783-9b4b-86934eb003b7%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

