You might consider looking at some of the papers on text detection in 
natural images and using the techniques from the later stages of the 
pipeline.  These are similar to what Dmitri outlined, but reviewing what 
others have done might give you ideas for additional ways to filter and 
group connected components (e.g. aspect ratio, inter-CC spacing, etc.). 
This Microsoft Research paper 
<http://research.microsoft.com/pubs/230169/2010%20CVPR%20TextDetection.pdf> 
describes their pipeline. Obviously you don't need all of the front-end 
edge detection stuff.
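
As a rough illustration of that kind of filtering and grouping, here is a 
minimal Python sketch. It assumes you already have bounding boxes for the 
connected components as (x, y, width, height) tuples. The function names 
and every threshold are made up for illustration; they are not taken from 
the paper or from Tesseract and would need tuning on your images.

```python
from typing import List, Tuple

# Bounding box of one connected component: (x, y, width, height).
Box = Tuple[int, int, int, int]


def plausible_char(box: Box,
                   min_aspect: float = 0.1,
                   max_aspect: float = 2.5,
                   min_height: int = 6,
                   max_height: int = 60) -> bool:
    """Keep components whose size and aspect ratio look like a character.

    All thresholds are illustrative assumptions, not published values.
    """
    _, _, w, h = box
    if h < min_height or h > max_height:
        return False
    return min_aspect <= w / h <= max_aspect


def group_horizontal(boxes: List[Box],
                     max_gap_ratio: float = 1.0,
                     min_v_overlap: float = 0.5) -> List[List[Box]]:
    """Greedily chain boxes into horizontal runs (candidate words/labels).

    A box joins a run when the gap to the run's last box is small relative
    to the character height and the two boxes overlap vertically.
    """
    groups: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[0]):  # left to right
        x, y, w, h = box
        for group in groups:
            lx, ly, lw, lh = group[-1]
            gap = x - (lx + lw)  # inter-CC spacing
            overlap = min(y + h, ly + lh) - max(y, ly)
            if (gap <= max_gap_ratio * max(h, lh)
                    and overlap >= min_v_overlap * min(h, lh)):
                group.append(box)
                break
        else:
            groups.append([box])  # no run accepted it: start a new one
    return groups
```

Since the text in your diagrams appears to be horizontal, a single 
left-to-right pass like this may be enough; rotated text would need an 
orientation step first.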

Extracting information from diagrams in publications is an increasingly 
popular topic.  Your task sounds pretty similar to what's described in this 
paper in Bioinformatics: 
<http://bioinformatics.oxfordjournals.org/content/28/5/739.short>. 
Google Scholar has a list of related papers 
<https://scholar.google.com/scholar?safe=off&rlz=1C1CHFX_enUS491US491&espv=2&biw=1731&bih=839&bav=on.2,or.r_cp.&dpr=1.1&um=1&ie=UTF-8&lr&q=related:lfxgAQ17t0ROwM:scholar.google.com/>.

If the two diagrams that you present are representative, your task may be 
easier since the text is always horizontal.

On Sunday, May 3, 2015 at 1:46:34 AM UTC-4, Alexander Pico wrote:
>
>
> So far, here are some numbers for those who are interested...
>
> I took 4,000 pathway images (more complicated and diverse than the simple 
> case above) and applied both Adobe Acrobat's OCR and Tesseract with custom 
> user-words:
> * Adobe found 2,366 unique human gene identifiers
> * Tesseract found 2,199 unique human gene identifiers
>
> And the sets were not completely overlapping, resulting in a combined 
> total of 3,187 unique identifiers.  That's less than 1 per image, and of 
> course the results were heavily skewed. Adobe's best performance was 44 hits 
> from a single pathway, but it failed to find a single hit on 1,600 
> pathways. Tesseract's best was 31, but it failed on 1,201 pathways.
>

What's the denominator, i.e. how many identifiers were there to find?  Is 
there a one-to-one correspondence between "pathway" and "image"? I'm 
guessing yes, but I want to check that the change in terminology isn't 
significant.
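
For what it's worth, the overlap between the two engines follows from the 
quoted counts by inclusion-exclusion. The short Python sketch below works 
that out. The recall() helper is a hypothetical name of mine, and it is 
deliberately not called here because the number of identifiers actually 
present is exactly what's missing.

```python
# Counts quoted above (unique human gene identifiers).
adobe = 2366
tesseract = 2199
combined = 3187   # union of the two sets
images = 4000

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
overlap = adobe + tesseract - combined   # 1378 found by both
only_adobe = adobe - overlap             # 988 found by Adobe alone
only_tesseract = tesseract - overlap     # 821 found by Tesseract alone

# Agreement between the engines and unique identifiers per image.
jaccard = overlap / combined
per_image = combined / images


def recall(found: int, total_present: int) -> float:
    """Fraction of the identifiers actually present that were found.

    total_present is the unknown denominator; it has to come from a
    ground-truth annotation of the images.
    """
    if total_present <= 0:
        raise ValueError("total_present must be positive")
    return found / total_present
```

So roughly 43% of the combined set was found by both engines, which is why 
the true total matters for judging either one.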

Tom
