P.S. The technology from the Krauthammer Lab <http://krauthammerlab.med.yale.edu/publications> at Yale, which backs the Yale Image Finder <http://krauthammerlab.med.yale.edu/imagefinder/Home,$Form.direct?formids=query%2CIf%2CIf_0&submitmode=&submitname=&If=T&If_0=F&query=pi3k>, seems directly applicable.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2732221/
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3265968/pdf/nihms241943.pdf
http://cda.ornl.gov/publications_2011/Publication%2028596_S.%20Xu.pdf

On Wednesday, May 6, 2015 at 2:41:13 PM UTC-4, Tom Morris wrote:
>
> You might consider looking at some of the papers on text detection in
> natural images and using the techniques from the later stages in the
> pipeline. These are similar to what Dmitri outlined, but reviewing what
> others have done might give you ideas on additional ways to filter and
> group connected components (e.g. aspect ratio, inter-CC spacing, etc).
> This Microsoft Research paper
> <http://research.microsoft.com/pubs/230169/2010%20CVPR%20TextDetection.pdf>
> describes their pipeline. Obviously you don't need all the front-end edge
> detection stuff.
>
> Extracting information from diagrams in publications is an increasingly
> popular topic. Your task sounds pretty similar to what's described in this
> paper in Bioinformatics:
> http://bioinformatics.oxfordjournals.org/content/28/5/739.short
> Google Scholar has a list of related papers
> <https://scholar.google.com/scholar?safe=off&rlz=1C1CHFX_enUS491US491&espv=2&biw=1731&bih=839&bav=on.2,or.r_cp.&dpr=1.1&um=1&ie=UTF-8&lr&q=related:lfxgAQ17t0ROwM:scholar.google.com/>.
>
> If the two diagrams that you present are representative, your task may be
> easier since the text is always horizontal.
>
> On Sunday, May 3, 2015 at 1:46:34 AM UTC-4, Alexander Pico wrote:
>>
>> So far, here are some numbers for those who are interested...
>>
>> I took 4,000 pathway images (more complicated and diverse than the simple
>> case above) and applied both Adobe Acrobat's OCR and Tesseract with custom
>> user-words:
>> * Adobe found 2,366 unique human gene identifiers
>> * Tesseract found 2,199 unique human gene identifiers
>>
>> And the sets were not completely overlapping, resulting in a combined
>> total of 3,187 unique identifiers. That's less than 1 per image, and of
>> course the results were heavily skewed. Adobe's best performance was 44
>> hits from a single pathway, but it failed to find a single hit on 1,600
>> pathways. Tesseract's best was 31, but it failed on 1,201 pathways.
>>
>
> What's the denominator, i.e. how many identifiers were there to find? Is
> there a one-to-one correspondence between "pathway" and "image"? I'm
> guessing yes, but want to check that the change in terminology isn't
> significant.
>
> Tom
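
To make the "filter and group connected components (e.g. aspect ratio, inter-CC spacing)" idea above a bit more concrete, here is a minimal Python sketch. It is not taken from any of the linked papers. It assumes you already have one bounding box per connected component and that text is horizontal, as in the two sample diagrams. The `Box` type, both function names, and every threshold are hypothetical and would need tuning on real pathway images.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Bounding box of one connected component, in pixels (hypothetical type)."""
    x: int  # left edge
    y: int  # top edge
    w: int  # width
    h: int  # height


def plausible_glyph(box: Box, min_aspect: float = 0.1, max_aspect: float = 2.0,
                    min_height: int = 6, max_height: int = 60) -> bool:
    """Keep components whose size and width/height ratio could be a character.

    All thresholds are illustrative guesses, not tuned values.
    """
    if box.w <= 0 or box.h <= 0:
        return False
    aspect = box.w / box.h
    return min_aspect <= aspect <= max_aspect and min_height <= box.h <= max_height


def group_into_words(boxes: List[Box], max_gap_ratio: float = 0.6,
                     min_v_overlap: float = 0.5) -> List[List[Box]]:
    """Chain glyph-like boxes, left to right, into horizontal runs.

    A box joins an existing run when the horizontal gap to the run's last box
    is small relative to glyph height (inter-CC spacing) and the two boxes
    overlap vertically (same text line). Otherwise it starts a new run.
    """
    groups: List[List[Box]] = []
    for box in sorted(filter(plausible_glyph, boxes), key=lambda b: (b.x, b.y)):
        for group in groups:
            last = group[-1]
            gap = box.x - (last.x + last.w)
            overlap = min(last.y + last.h, box.y + box.h) - max(last.y, box.y)
            if (gap <= max_gap_ratio * max(last.h, box.h)
                    and overlap >= min_v_overlap * min(last.h, box.h)):
                group.append(box)
                break
        else:
            # No existing run accepted this box, so it starts a new one.
            groups.append([box])
    return groups
```

Each resulting run could then be cropped with some padding and passed to the OCR engine as a single text line, so that arrows, node outlines and other non-text components never reach it.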

