Re: cTakes Annotation Comparison

2014-12-19 Thread David Kincaid
Thanks for this, Bruce! Very interesting work. It confirms what I've seen in the small, non-systematic tests I've done. Did you happen to capture the number of false positives yet (annotations made by cTAKES that are not in the human-adjudicated standard)? I've seen a lot of dictionary

RE: cTakes Annotation Comparison

2014-12-19 Thread Savova, Guergana
We are doing a similar kind of evaluation and will report the results. Before we released the fast lookup, we did a systematic evaluation across three gold standard sets. We did not see the trend that Bruce reported below. The P, R and F1 results from the old dictionary lookup and the fast one
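
For readers skimming the thread, these are the standard definitions behind the P, R and F1 numbers, sketched as a small Java helper (TP/FP/FN counts are assumed inputs; the example counts are made up, not from the thread):

    // Precision, recall and F1 from annotation-match counts:
    // TP = spans where cTAKES and the gold standard agree,
    // FP = cTAKES-only annotations, FN = gold-only annotations.
    public class Metrics {
        static double[] prf1(int tp, int fp, int fn) {
            double p = tp / (double) (tp + fp);  // precision
            double r = tp / (double) (tp + fn);  // recall
            double f1 = 2 * p * r / (p + r);     // harmonic mean of P and R
            return new double[] { p, r, f1 };
        }

        public static void main(String[] args) {
            double[] m = prf1(2500, 300, 400);   // illustrative counts only
            System.out.printf("P=%.3f R=%.3f F1=%.3f%n", m[0], m[1], m[2]);
        }
    }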

RE: intro video and ctakes youtube : Youtube Apache cTakes Channel Direct Link

2014-12-19 Thread John Green
Great article. I'm not a fan of the email solution, simply because of size problems. Given how low the rate of new video uploads is likely to be, it seems a common drop box solution may be the best fit for our case. Maybe someone very central to the project could volunteer at this point

Re: cTakes Annotation Comparison

2014-12-19 Thread Kim Ebert
Guergana, I'm curious about the number of records in your gold standard sets, and whether your gold standard set was run through a long-running cTAKES process. I know at some point we fixed a bug in the old dictionary lookup that caused the permutations to become corrupted over time. Typically

RE: cTakes Annotation Comparison

2014-12-19 Thread Chen, Pei
Also check out the stats that Sean ran before releasing the new component: http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx From the evaluation and experience, the new lookup algorithm should be a huge improvement in terms of both speed and

Re: cTakes Annotation Comparison

2014-12-19 Thread Miller, Timothy
Thanks, Kim. This sounds interesting, though I don't totally understand it. Are you saying that extraction performance for a given note depends on where the note was in the processing queue? If so, that's pretty bad! If you (or anyone else who understands this issue) has a concrete example I

Re: cTakes Annotation Comparison

2014-12-19 Thread Kim Ebert
Hi Tim, Here is an untested example, but it should show the concept. Document 1: Sarah had Induced Abortion Illegally. Document 2: John had a previous history of Abuse Health Service. The following CUIs would be the matches if everything went well: Illegally Induced Abortion, C000804 Health
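
A toy sketch of the failure mode Kim is describing (nothing here is actual cTAKES code): if the permutation table used to match multi-word terms is shared, mutable state, then a code path that reorders it in place corrupts every later lookup, making results depend on processing order:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class PermutationDriftDemo {
        // Token orderings used to match multi-word terms; shared mutable state.
        private static final List<int[]> PERMUTATIONS =
                new ArrayList<>(Arrays.asList(new int[]{0, 1, 2}, new int[]{2, 0, 1}));

        static void lookup(String[] tokens) {
            for (int[] order : PERMUTATIONS) {
                StringBuilder candidate = new StringBuilder();
                for (int i : order) {
                    candidate.append(tokens[i]).append(' ');
                }
                System.out.println("candidate term: " + candidate.toString().trim());
            }
            // Buggy path: reorders a shared permutation and never restores it,
            // silently corrupting lookups for every later document.
            PERMUTATIONS.get(1)[0] = 0;
        }

        public static void main(String[] args) {
            lookup(new String[]{"Induced", "Abortion", "Illegally"}); // "Illegally Induced Abortion" found
            lookup(new String[]{"Abuse", "Health", "Service"});       // sees a corrupted ordering
        }
    }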

RE: cTakes Annotation Comparison

2014-12-19 Thread Savova, Guergana
Several thoughts: 1. The ShARe corpus annotates only mentions of type Diseases/Disorders, and only Anatomical Sites associated with a Disease/Disorder. This is by design. cTAKES annotates all mentions of types Diseases/Disorders, Signs/Symptoms, Procedures, Medications and Anatomical Sites.

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
One quick mention: the cTAKES dictionaries are built with UMLS 2011AB. If the human annotations were not done using the same UMLS version, then there WILL be differences in CUIs and semantic groups. I don't have time to go into it with details, examples, etc. Just be aware that every 6 months
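
One way a comparison tool can control for this (a sketch, assuming a retired-CUI mapping has been built, e.g. from the UMLS MRCUI files; none of these names come from the thread): normalize both sides' CUIs through the remap before calling a disagreement a miss:

    import java.util.Map;

    public class VersionAwareCompare {
        // retired-or-merged CUI -> its current CUI in the newer release
        private final Map<String, String> remap;

        public VersionAwareCompare(Map<String, String> remap) {
            this.remap = remap;
        }

        boolean sameConcept(String goldCui, String systemCui) {
            String g = remap.getOrDefault(goldCui, goldCui);
            String s = remap.getOrDefault(systemCui, systemCui);
            return g.equals(s); // agree once version drift is normalized away
        }
    }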

Re: cTakes Annotation Comparison

2014-12-19 Thread Kim Ebert
Sean, I don't think that would be an issue, since both the rare word lookup and the first word lookup use UMLS 2011AB. Or is the rare word lookup using a different dictionary? I would expect roughly similar results between the two when it comes to differences between UMLS versions.

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
I'm bringing it up in case the human annotations were done using a different version.

Re: cTakes Annotation Comparison

2014-12-19 Thread Kim Ebert
Pei, I don't think bugs/issues should be part of determining whether one algorithm is superior to the other. Obviously, the bugs are worth mentioning, but if the fast lookup method has worse precision and recall but better performance than the slower but more accurate first word lookup algorithm,

Re: cTakes Annotation Comparison

2014-12-19 Thread Bruce Tietjen
Rather than spam the mailing list with the list of filenames for the files in the set we used, I would be happy to send it to anyone interested privately.

Re: cTakes Annotation Comparison

2014-12-19 Thread Bruce Tietjen
Correction -- So far, I did steps 1 and 2 of Sean's email.

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
Hi Bruce, I'm not sure how there would be fewer matches with the overlap processor. It should return all of the matches from the non-overlap processor plus those from the overlap. Decreasing from 215 to 211 is strange. Have you done any manual spot checks on this? It is really bizarre that
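
The invariant Sean states is easy to test directly; a small sketch, assuming annotations compare sensibly via equals/hashCode:

    import java.util.HashSet;
    import java.util.Set;

    public class SupersetCheck {
        // Overlap results should be a superset of non-overlap results;
        // a non-empty return pinpoints exactly which matches were lost.
        static <T> Set<T> missingFromOverlap(Set<T> nonOverlap, Set<T> overlap) {
            Set<T> missing = new HashSet<>(nonOverlap);
            missing.removeAll(overlap);  // empty if the invariant holds
            return missing;
        }
    }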

Re: cTakes Annotation Comparison

2014-12-19 Thread Bruce Tietjen
My original results were from a newly downloaded cTAKES 3.2.1 with the separately downloaded resources copied in. There were no changes to any of the configuration files. As for this last run, I modified UMLSLookupAnnotator.xml and AggregatePlaintextFastUMLSProcessor.xml. I've attached

RE: cTakes Annotation Comparison

2014-12-19 Thread Finan, Sean
Hi Bruce, "Correction -- So far, I did steps 1 and 2 of Sean's email." No problem. Aside from recreating the database, those two steps have the greatest impact. But before you change anything else, please do some manual spot checks. I have never seen a case where the lookup would be so

Re: cTakes Annotation Comparison

2014-12-19 Thread Bruce Tietjen
I'll do that -- there is always a possibility of bugs in the analysis tool.

Re: cTakes Annotation Comparison

2014-12-19 Thread Bruce Tietjen
My apologies to Sean and everyone. I am happy to report that I found a bug in our analysis tools that was missing the last entry of every FSArray. With the bug fixed, the results look MUCH better. UMLSProcessor found 31,598 annotations; FastUMLSProcessor found 30,716 annotations
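
Bruce doesn't say exactly how the last entry was dropped, but one plausible shape of the bug (an assumption, not his code) is the classic off-by-one when walking a UIMA FSArray:

    import org.apache.uima.jcas.cas.FSArray;
    import org.apache.uima.jcas.tcas.Annotation;

    public class FsArrayWalk {
        static void printAll(FSArray array) {
            // Buggy version drops the final element:
            //   for (int i = 0; i < array.size() - 1; i++) { ... }
            for (int i = 0; i < array.size(); i++) {  // fixed: visit every entry
                Annotation a = (Annotation) array.get(i);
                System.out.println(a.getCoveredText());
            }
        }
    }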

RE: cTakes Annotation Comparison --- (^:

2014-12-19 Thread Finan, Sean
Apologies accepted. I'm really glad that you found the problem. So what you are saying is (just to be very, very clear to everybody reading this thread): FastUMLSProcessor found 2,795 matches (2,842 including overlaps) while UMLSProcessor found 2,632 matches (2,735 including overlaps) --- So

Re: cTakes Annotation Comparison

2014-12-19 Thread Kim Ebert
Bruce, I think we all feel a lot better now. I think the tool will be helpful moving forward. I've updated the git repo with the fix in case anyone is interested.

Re: cTakes Annotation Comparison

2014-12-19 Thread Bruce Tietjen
When I only include SignSymptomMention and DiseaseDisorderMention in the analysis (which excludes annotations not included in the gold standard), the matched annotations remain the same while the total annotations found in those categories drop to the following: Total Annotations found:
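
A sketch of that restriction as it might look in code (type and package names are from cTAKES/uimaFIT as commonly documented; treat the details as assumptions):

    import org.apache.ctakes.typesystem.type.textsem.DiseaseDisorderMention;
    import org.apache.ctakes.typesystem.type.textsem.SignSymptomMention;
    import org.apache.uima.fit.util.JCasUtil;
    import org.apache.uima.jcas.JCas;

    public class MentionFilter {
        // Count only the mention types the gold standard annotates;
        // procedures, medications, anatomical sites etc. are excluded.
        static int countComparableMentions(JCas jcas) {
            return JCasUtil.select(jcas, SignSymptomMention.class).size()
                 + JCasUtil.select(jcas, DiseaseDisorderMention.class).size();
        }
    }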