Thank you both for the helpful replies. I will certainly look into OpenCV. 
That's the second independent recommendation I've received for that tool for 
this particular problem. I have also started to dive into preprocessing with 
ImageMagick. Your blog post was VERY helpful. Unfortunately, my ultimate 
set of pathway images is quite diverse, and I can't handle it case by case, 
so there will only be a few things I can reliably apply across all cases.

So far, here are some numbers for those who are interested...

I took 4,000 pathway images (more complicated and diverse than the simple 
case above) and applied both Adobe Acrobat's OCR and Tesseract with custom 
user-words:
* Adobe found 2,366 unique human gene identifiers
* Tesseract found 2,199 unique human gene identifiers

The two sets only partially overlapped, giving a combined total 
of 3,187 unique identifiers. That's less than 1 per image on average, and of 
course the results were heavily skewed. Adobe's best performance was 44 hits 
from a single pathway, but it failed to find a single hit on 1,600 pathways. 
Tesseract's best was 31, but it failed on 1,201 pathways.
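
For anyone who wants to check the arithmetic, here is a minimal sketch that 
derives the overlap between the two tools from the counts above using 
inclusion-exclusion. The variable and function names are my own for 
illustration, and only the six input numbers come from the runs described 
here. By this calculation roughly 1,378 identifiers were found by both tools, 
988 by Adobe only, and 821 by Tesseract only.

```python
# Counts reported in this post; all names below are illustrative only.
ADOBE_UNIQUE = 2366        # unique identifiers found by Adobe Acrobat OCR
TESSERACT_UNIQUE = 2199    # unique identifiers found by Tesseract
COMBINED_UNIQUE = 3187     # size of the union of the two sets
TOTAL_IMAGES = 4000        # pathway images processed
ADOBE_ZERO_HIT = 1600      # images where Adobe found nothing
TESSERACT_ZERO_HIT = 1201  # images where Tesseract found nothing


def overlap_summary(a, b, union):
    """Apply inclusion-exclusion: |A & B| = |A| + |B| - |A | B|."""
    shared = a + b - union
    return {"shared": shared, "only_a": a - shared, "only_b": b - shared}


summary = overlap_summary(ADOBE_UNIQUE, TESSERACT_UNIQUE, COMBINED_UNIQUE)
per_image = COMBINED_UNIQUE / TOTAL_IMAGES
adobe_fail_rate = ADOBE_ZERO_HIT / TOTAL_IMAGES
tesseract_fail_rate = TESSERACT_ZERO_HIT / TOTAL_IMAGES

print(summary)
print(f"unique identifiers per image: {per_image:.2f}")
print(f"zero-hit rate, Adobe: {adobe_fail_rate:.1%}")
print(f"zero-hit rate, Tesseract: {tesseract_fail_rate:.1%}")
```

Note that the per-image figure divides unique identifiers by image count, so 
it understates the raw number of hits if the same gene appears in several 
pathways.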
