Thank you both for the helpful replies. I will certainly look into OpenCV; that's the second independent recommendation I've received for that tool for this particular problem. I have also started to dive into preprocessing with ImageMagick. Your blog post was VERY helpful. Unfortunately, my full set of pathway images is quite diverse and I can't handle them case by case, so there will only be a few things I can reliably apply across all cases.
So far, here are some numbers for those who are interested. I took 4,000 pathway images (more complicated and diverse than the simple case above) and applied both Adobe Acrobat's OCR and Tesseract with custom user-words:

* Adobe found 2,366 unique human gene identifiers
* Tesseract found 2,199 unique human gene identifiers

The two sets did not completely overlap, giving a combined total of 3,187 unique identifiers. That's less than 1 per image, and of course the results were heavily skewed. Adobe's best performance was 44 hits from a single pathway, but it failed to find a single hit on 1,600 pathways. Tesseract's best was 31, but it failed on 1,201 pathways.
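For anyone who wants to run the same kind of comparison, here is a minimal sketch of the set arithmetic, assuming each tool's hits have already been collected into a set of identifiers. The function names and the gene symbols are placeholders for illustration only, not my actual pipeline or results.

```python
def compare_hits(adobe_hits, tesseract_hits):
    """Summarize how two collections of identifiers overlap."""
    a, t = set(adobe_hits), set(tesseract_hits)
    return {
        "adobe": len(a),
        "tesseract": len(t),
        "shared": len(a & t),    # found by both tools
        "combined": len(a | t),  # found by at least one tool
    }


def implied_overlap(n_a, n_b, n_union):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return n_a + n_b - n_union


# Placeholder identifiers (hypothetical, not real output from either tool).
adobe = {"TP53", "EGFR", "AKT1"}
tesseract = {"EGFR", "AKT1", "MTOR", "KRAS"}
summary = compare_hits(adobe, tesseract)

# Overlap implied by the counts reported above.
overlap = implied_overlap(2366, 2199, 3187)
print(summary, overlap)
```

With the counts reported above, inclusion-exclusion gives 2,366 + 2,199 - 3,187 = 1,378 identifiers found by both tools.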

