I would also like to find a way to avoid such cases. I am running into similar problems myself: extra whitespace, lower/upper case mismatches, and wrongly detected characters.
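For the http:II case in the quoted message, here is a rough sketch of two workarounds that might help. It assumes Python; the names `URL_CHARS`, `whitelist_config` and `fix_url_scheme` are made up for illustration. The only Tesseract-specific piece is the `tessedit_char_whitelist` variable, which is the whitelist mechanism mentioned in the quoted message. The character set is just a guess at what a lowercase URL needs, so adjust it for your images. The second function is a plain post-processing step and does not change how Tesseract itself recognises anything.

```python
import re
import string

# Assumed character set for lowercase URLs: no capital "I", so the
# engine cannot pick it in place of "/". Extend as your data requires.
URL_CHARS = string.ascii_lowercase + string.digits + ":/.-_"


def whitelist_config(chars: str = URL_CHARS) -> str:
    """Build a '-c tessedit_char_whitelist=...' option string.

    The result can be appended to a tesseract command line or, if you
    use pytesseract, passed as the `config` argument of image_to_string.
    """
    return "-c tessedit_char_whitelist=" + chars


# After the scheme and colon, accept any two of the glyphs that are
# commonly confused with "/" (I, l, 1, |, /, \).
_SCHEME = re.compile(r"\b(https?|ftp)\s*:\s*[Il1|/\\]{2}", re.IGNORECASE)


def fix_url_scheme(text: str) -> str:
    """Rewrite misread scheme separators such as 'http:II' to 'http://'."""
    return _SCHEME.sub(lambda m: m.group(1).lower() + "://", text)
```

The whitelist keeps capital I out of the result entirely, while `fix_url_scheme` only repairs the two characters directly after the colon, so other misreads later in the string (such as the trailing "l" in ".coml") would still need separate handling.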
On Tuesday, May 31, 2016 at 11:40:28 PM UTC+5:30, Diederik Hattingh wrote:
> I have a case where my tesseract isn't detecting URLs as expected. (More details in my SO question <http://stackoverflow.com/questions/37533524/tweak-tesseract-for-better-detection-of-urls-in-image>.)
>
> The http:// part is being recognised as http:II. If I specify a white list of characters that doesn't include capital I tesseract recognizes the string correctly.
>
> Is it possible for me to specify a priority of characters to recognize?
>
> Any other ideas on how to tweak the parameters to increase my accuracy?
>
> <http://i.stack.imgur.com/jO1u9.png>
>
> Is incorrectly read as "http:II11111111111111111111111111111111111 1111111111111111111.coml"

