You might consider looking at some of the papers on text detection in natural images and using the techniques from the later stages in the pipeline. These are similar to what Dmitri outlined, but reviewing what others have done might give you ideas on additional ways to filter and group connected components (e.g. aspect ratio, inter-CC spacing, etc.). This Microsoft Research paper <http://research.microsoft.com/pubs/230169/2010%20CVPR%20TextDetection.pdf> describes their pipeline. Obviously you don't need all the front-end edge detection stuff.
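To make the filtering and grouping idea concrete, here is a rough sketch of my own, not the pipeline from the paper. It assumes you already have connected-component bounding boxes as (x, y, width, height) tuples from whatever labeling step you use. The function names and every threshold are made up for illustration and would need tuning on your images.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) of one connected component


def filter_boxes(boxes: List[Box],
                 min_aspect: float = 0.1, max_aspect: float = 2.0,
                 min_height: int = 6, max_height: int = 60) -> List[Box]:
    """Keep components whose size and width/height ratio look like a character.

    All thresholds are illustrative guesses, not values from any paper.
    """
    kept = []
    for x, y, w, h in boxes:
        if h < min_height or h > max_height:
            continue  # too small (noise) or too large (shapes, arrows)
        if min_aspect <= w / h <= max_aspect:
            kept.append((x, y, w, h))
    return kept


def group_into_words(boxes: List[Box],
                     max_gap_ratio: float = 0.6,
                     min_v_overlap: float = 0.5) -> List[List[Box]]:
    """Greedily chain boxes left to right into horizontal groups.

    A box joins a group when the horizontal gap to the group's last box is
    small relative to character height and the two overlap vertically.
    """
    groups: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[0]):
        x, y, w, h = box
        for group in groups:
            lx, ly, lw, lh = group[-1]
            gap = x - (lx + lw)  # inter-CC spacing
            overlap = min(y + h, ly + lh) - max(y, ly)
            if gap <= max_gap_ratio * max(h, lh) and overlap >= min_v_overlap * min(h, lh):
                group.append(box)
                break
        else:
            groups.append([box])  # no existing group fits: start a new one
    return groups


if __name__ == "__main__":
    sample = [(10, 10, 8, 12), (20, 10, 8, 12), (30, 11, 8, 12),
              (100, 10, 8, 12), (10, 50, 200, 2)]
    for group in group_into_words(filter_boxes(sample)):
        print(group)
```

The grouping only handles horizontal text, which may be enough if your diagrams are like the two you showed. Each group's bounding box could then be cropped and passed to the OCR engine on its own.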
Extracting information from diagrams in publications is an increasingly popular topic. Your task sounds pretty similar to what's described in this paper in Bioinformatics:
http://bioinformatics.oxfordjournals.org/content/28/5/739.short

Google Scholar has a list of related papers <https://scholar.google.com/scholar?safe=off&rlz=1C1CHFX_enUS491US491&espv=2&biw=1731&bih=839&bav=on.2,or.r_cp.&dpr=1.1&um=1&ie=UTF-8&lr&q=related:lfxgAQ17t0ROwM:scholar.google.com/>.

If the two diagrams that you present are representative, your task may be easier since the text is always horizontal.

On Sunday, May 3, 2015 at 1:46:34 AM UTC-4, Alexander Pico wrote:
>
> So far, here are some numbers for those who are interested...
>
> I took 4,000 pathway images (more complicated and diverse than the simple
> case above) and applied both Adobe Acrobat's OCR and Tesseract with custom
> user-words:
> * Adobe found 2,366 unique human gene identifiers
> * Tesseract found 2,199 unique human gene identifiers
>
> And the sets were not completely overlapping, resulting in a combined
> total of 3,187 unique identifiers. That's less than 1 per image, and of
> course the results were heavily skewed. Adobe best performance was 44 hits
> from a single pathway, but it failed to find a single hit on 1,600
> pathways. Tesseract's best was 31, but failed on 1,201 pathways.
>

What's the denominator, i.e. how many identifiers were there to find?

Is there a one-to-one correspondence between "pathway" and "image"? I'm guessing yes, but want to check that the change in terminology isn't significant.

Tom

