[tesseract-ocr] Re: Extracting molecular labels from biological pathway images

Tom Morris Wed, 06 May 2015 12:12:04 -0700

P.S. The technology out of the Krauthammer Lab 
<http://krauthammerlab.med.yale.edu/publications> at Yale, which backs the Yale 
Image Finder 
<http://krauthammerlab.med.yale.edu/imagefinder/Home,$Form.direct?formids=query%2CIf%2CIf_0&submitmode=&submitname=&If=T&If_0=F&query=pi3k>, 
seems directly applicable.


http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2732221/
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3265968/pdf/nihms241943.pdf
http://cda.ornl.gov/publications_2011/Publication%2028596_S.%20Xu.pdf
  

On Wednesday, May 6, 2015 at 2:41:13 PM UTC-4, Tom Morris wrote:
>
> You might consider looking at some of the papers on text detection in 
> natural images and using the techniques from the later stages of the 
> pipeline.  These are similar to what Dmitri outlined, but reviewing what 
> others have done might give you ideas on additional ways to filter and 
> group connected components (e.g. aspect ratio, inter-CC spacing). 
> This Microsoft Research paper 
> <http://research.microsoft.com/pubs/230169/2010%20CVPR%20TextDetection.pdf> 
> describes their pipeline. Obviously you don't need all the front-end edge 
> detection stuff.
>
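A minimal Python sketch of the kind of connected-component (CC) filtering and grouping mentioned above. It assumes the CCs have already been reduced to bounding boxes (x, y, width, height) and that the text is horizontal. The function names and every threshold are illustrative assumptions; they are not taken from the Microsoft Research paper or from Tesseract.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) of one connected component


def plausible_char(box: Box, min_h: int = 6, max_h: int = 60,
                   min_aspect: float = 0.1, max_aspect: float = 3.0) -> bool:
    """Keep CCs whose height and width/height ratio could belong to a character.

    All limits are guesses that would need tuning on real pathway images.
    """
    _, _, w, h = box
    if w <= 0 or h < min_h or h > max_h:
        return False
    return min_aspect <= w / h <= max_aspect


def group_horizontal(boxes: List[Box], max_gap_ratio: float = 0.6) -> List[List[Box]]:
    """Greedily chain boxes left to right into candidate labels.

    A box joins a group when it vertically overlaps the group's last box and
    the horizontal gap between them (the inter-CC spacing) is small relative
    to the character height.
    """
    groups: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[0]):
        x, y, w, h = box
        for group in groups:
            lx, ly, lw, lh = group[-1]
            gap = x - (lx + lw)
            v_overlap = min(y + h, ly + lh) - max(y, ly)
            if gap <= max_gap_ratio * max(h, lh) and v_overlap > 0.5 * min(h, lh):
                group.append(box)
                break
        else:
            groups.append([box])
    return groups


def bounding_box(group: List[Box]) -> Box:
    """Smallest box covering every CC in the group, e.g. to crop for OCR."""
    x0 = min(b[0] for b in group)
    y0 = min(b[1] for b in group)
    x1 = max(b[0] + b[2] for b in group)
    y1 = max(b[1] + b[3] for b in group)
    return (x0, y0, x1 - x0, y1 - y0)


# Three adjacent character-sized boxes, one distant box, and one long thin line.
sample = [(10, 10, 8, 12), (20, 10, 8, 12), (30, 10, 8, 12),
          (200, 100, 8, 12), (100, 50, 300, 2)]
candidates = [b for b in sample if plausible_char(b)]
labels = [bounding_box(g) for g in group_horizontal(candidates)]
```

Each resulting box could then be cropped and passed to the OCR engine on its own, which is one way to keep arrows and node outlines out of the recognition step.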
> Extracting information from diagrams in publications is an increasingly 
> popular topic.  Your task sounds pretty similar to what's described in this 
> paper in Bioinformatics: 
> <http://bioinformatics.oxfordjournals.org/content/28/5/739.short>
> Google Scholar has a list of related papers 
> <https://scholar.google.com/scholar?safe=off&rlz=1C1CHFX_enUS491US491&espv=2&biw=1731&bih=839&bav=on.2,or.r_cp.&dpr=1.1&um=1&ie=UTF-8&lr&q=related:lfxgAQ17t0ROwM:scholar.google.com/>.
>
> If the two diagrams that you present are representative, your task may be 
> easier since the text is always horizontal.
>
> On Sunday, May 3, 2015 at 1:46:34 AM UTC-4, Alexander Pico wrote:
>>
>>
>> So far, here are some numbers for those who are interested...
>>
>> I took 4,000 pathway images (more complicated and diverse than the simple 
>> case above) and applied both Adobe Acrobat's OCR and Tesseract with custom 
>> user-words:
>> * Adobe found 2,366 unique human gene identifiers
>> * Tesseract found 2,199 unique human gene identifiers
>>
>> And the sets were not completely overlapping, resulting in a combined 
>> total of 3,187 unique identifiers.  That's less than 1 per image, and of 
>> course the results were heavily skewed. Adobe's best performance was 44 hits 
>> from a single pathway, but it failed to find a single hit on 1,600 
>> pathways. Tesseract's best was 31, but it failed on 1,201 pathways.
>>
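The overlap between the two result sets follows from the quoted counts by inclusion-exclusion. A small Python sketch of that arithmetic, assuming the three figures are sizes of sets of unique identifiers and that the combined total is their union (the function name is made up for illustration):

```python
def overlap_stats(size_a: int, size_b: int, size_union: int) -> dict:
    """Derive overlap figures for two sets from their sizes and the union size."""
    both = size_a + size_b - size_union  # inclusion-exclusion
    return {
        "both": both,
        "only_a": size_a - both,
        "only_b": size_b - both,
        "jaccard": both / size_union,  # shared fraction of the union
    }


# Counts quoted above: Adobe, Tesseract, combined total.
stats = overlap_stats(2366, 2199, 3187)
per_image = 3187 / 4000  # unique identifiers per image
```

With these inputs, 1,378 identifiers were found by both tools, 988 only by Adobe and 821 only by Tesseract, and the combined total works out to roughly 0.8 unique identifiers per image.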
>
> What's the denominator, i.e. how many identifiers were there to find?  Is 
> there a one-to-one correspondence between "pathway" and "image"? I'm 
> guessing yes, but want to check that the change in terminology isn't 
> significant.
>
> Tom
>
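To illustrate why the denominator matters: with a hand-annotated list of the identifiers actually present in an image, the raw hit counts can be turned into recall and precision. A rough Python sketch follows; the function name and the gene symbols are a hypothetical toy example, not data from the thread.

```python
from typing import Set, Tuple


def recall_precision(found: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Recall = correct hits / identifiers present; precision = correct hits / hits."""
    correct = len(found & truth)
    recall = correct / len(truth) if truth else 0.0
    precision = correct / len(found) if found else 0.0
    return recall, precision


# Hypothetical ground truth for one image and a hypothetical OCR result.
truth = {"PIK3CA", "AKT1", "MTOR", "PTEN"}
found = {"AKT1", "MTOR", "AKTI"}  # "AKTI" stands in for a misread label
recall, precision = recall_precision(found, truth)
```

Here two of the four annotated identifiers are recovered, so recall is 0.5, while one of the three reported hits is wrong, so precision is 2/3.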

