Hi Alex,

You might consider a template matching toolkit like OpenCV [1]. I haven't used it with words, but I suspect it would work well in this kind of situation. OpenCV can also be used to remove basic shapes, such as circles, but having a list of the words you want is a huge advantage.
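To make the idea concrete, here is a rough sketch of what template matching does, written in plain Python so it runs without any dependencies. It is not OpenCV code: the helper names (`ncc`, `match_template`) and the toy 3x3 "glyph" are made up for illustration. In OpenCV the equivalent call would be `cv2.matchTemplate` with the `cv2.TM_CCOEFF_NORMED` method, and the templates would presumably be small images of each known label rendered in a font close to the one in the figure (that part is an assumption on my side).

```python
from math import sqrt


def ncc(window, template):
    """Normalized cross-correlation of two equal-size 2D grids (lists of rows)."""
    w = [v for row in window for v in row]
    t = [v for row in template for v in row]
    mean_w = sum(w) / len(w)
    mean_t = sum(t) / len(t)
    num = sum((a - mean_w) * (b - mean_t) for a, b in zip(w, t))
    den = sqrt(sum((a - mean_w) ** 2 for a in w) * sum((b - mean_t) ** 2 for b in t))
    # A flat window (or flat template) has zero variance; treat it as "no match".
    return num / den if den else 0.0


def match_template(image, template):
    """Slide the template over the image; return (best_score, (row, col))."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    best_score, best_pos = -1.0, (0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            window = [row[c:c + tw] for row in image[r:r + th]]
            score = ncc(window, template)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_score, best_pos


# Hypothetical toy data: a 3x3 "T" glyph standing in for a rendered word.
template = [
    [1, 1, 1],
    [0, 1, 0],
    [0, 1, 0],
]

# A blank 6x8 "page" with one stray pixel and the glyph pasted at row 2, col 4.
image = [[0] * 8 for _ in range(6)]
image[0][0] = 1
for dr, glyph_row in enumerate(template):
    for dc, value in enumerate(glyph_row):
        image[2 + dr][4 + dc] = value

score, (row, col) = match_template(image, template)
print("best match at row", row, "col", col, "score", round(score, 3))
```

With a list of known words, the same search would be repeated once per word template, keeping only hits whose score clears some threshold that would need tuning on real images.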
art

---
1. http://docs.opencv.org/

From: [email protected] [mailto:[email protected]] On Behalf Of Alexander Pico
Sent: Monday, April 27, 2015 2:34 PM
To: [email protected]
Subject: [tesseract-ocr] Extracting molecular labels from biological pathway images

I am trying to identify the molecules from pathway images. This should be relatively simple from clear, high-res images like the one attached, but my attempts with Tesseract so far are pretty dismal... It found 9 of 25 molecules. I even have the luxury of knowing in advance all the words I'd like to extract and tried supplying these as eng.user-words, but there was no improvement. I suspect I need to find the magic combination of parameter settings or perhaps image pre-processing. Any suggestions?

Thanks!
- Alex

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/ff5a2873-8392-4771-b314-3f2f146b0027%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

