Thanks for this, Bruce! Very interesting work. It confirms what I've seen
in my small tests that I've done in a non-systematic way. Did you happen to
capture the number of false positives yet (annotations made by cTAKES that
are not in the human adjudicated standard)? I've seen a lot of dictionary
We are doing a similar kind of evaluation and will report the results.
Before we released the Fast lookup, we did a systematic evaluation across three
gold standard sets. We did not see the trend that Bruce reported below. The P,
R and F1 results from the old dictionary lookup and the fast one
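For concreteness, the P, R, and F1 comparison being discussed can be sketched as a simple set comparison of annotations. This is a toy illustration assuming exact (start, end, CUI) matching; the names and example values are mine, not from the actual evaluation:

```python
# Minimal sketch of precision, recall, and F1 between a system's annotations
# and a gold standard, assuming exact (start, end, CUI) matching.
# All names and example values are illustrative.

def prf1(system, gold):
    """system, gold: sets of (start, end, cui) tuples."""
    tp = len(system & gold)   # true positives: annotations in both
    fp = len(system - gold)   # false positives: system-only annotations
    fn = len(gold - system)   # false negatives: gold-only annotations
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

gold = {(0, 7, "CUI-1"), (12, 20, "CUI-2")}
system = {(0, 7, "CUI-1"), (30, 35, "CUI-3")}
print(prf1(system, gold))  # one TP, one FP, one FN -> (0.5, 0.5, 0.5)
```

A "false positive" in the sense Tim asked about is exactly the `system - gold` set here.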
Great article. I'm not a fan of the email solution, simply because of size
problems. Given how small the rate of new video uploads is likely to be, it
seems a common drop box may be the best solution for our case. Maybe
someone very central to the project could volunteer at this point
Guergana,
I'm curious about the number of records that are in your gold standard
sets, or if your gold standard set was run through a long running cTAKES
process. I know at some point we fixed a bug in the old dictionary
lookup that caused the permutations to become corrupted over time.
Typically
Also check out stats that Sean ran before releasing the new component on:
http://svn.apache.org/repos/asf/ctakes/trunk/ctakes-dictionary-lookup-fast/doc/DictionaryLookupStats.docx
From the evaluation and experience, the new lookup algorithm should be a huge
improvement in terms of both speed and
Thanks Kim,
This sounds interesting though I don't totally understand it. Are you saying
that extraction performance for a given note depends on which order the note
was in the processing queue? If so that's pretty bad! If you (or anyone else
who understands this issue) has a concrete example I
Hi Tim,
Here is an untested example, but it should show the concept.
Document 1:
Sarah had Induced Abortion Illegally.
Document 2:
John had a previous history of Abuse Health Service.
The following CUIs would be the matches if everything went well.
Illegally Induced Abortion, C000804
Health
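To make the order-dependence concrete, here is a toy sketch (my own illustration, not the actual cTAKES lookup code) of how a permutation table that is shared across documents and mutated during matching makes results depend on processing order: a later document can get a spurious match it would never get on a fresh run.

```python
# Toy illustration of the order-dependence bug discussed above: the lookup
# builds one permutation table, shares it across documents, and corrupts
# entries while matching. Not actual cTAKES code.
import itertools

def build_permutations(terms):
    """Index every word-order permutation of each term under its first word."""
    table = {}
    for phrase, cui in terms.items():
        words = phrase.lower().split()
        for perm in itertools.permutations(words):
            table.setdefault(perm[0], []).append([list(perm), cui])
    return table

class BuggyLookup:
    def __init__(self, table):
        self.table = table          # shared across all documents

    def annotate(self, text):
        tokens = text.lower().split()
        hits = []
        for i, tok in enumerate(tokens):
            for perm, cui in self.table.get(tok, []):
                if tokens[i:i + len(perm)] == perm:
                    hits.append(cui)
                    perm.pop()      # BUG: destructively edits shared state
        return hits

lookup = BuggyLookup(build_permutations({"Illegally Induced Abortion": "CUI-A"}))
lookup.annotate("Sarah had induced abortion illegally")   # correct match
lookup.annotate("Pat had induced abortion illegally")     # matches a corrupted perm
late = lookup.annotate("Labor was induced")               # spurious match
fresh = BuggyLookup(build_permutations({"Illegally Induced Abortion": "CUI-A"}))
clean = fresh.annotate("Labor was induced")               # no match on a fresh run
```

With a fresh lookup, "Labor was induced" matches nothing; after two earlier documents have each shortened the stored permutation, the bare word "induced" fires the concept. That is the sense in which extraction for a given note can depend on its position in the queue.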
Several thoughts:
1. The ShARE corpus annotates only mentions of type Diseases/Disorders and only
Anatomical Sites associated with a Disease/Disorder. This is by design. cTAKES
annotates all mentions of types Diseases/Disorders, Signs/Symptoms, Procedures,
Medications and Anatomical Sites.
One quick mention:
The cTAKES dictionaries are built with UMLS 2011AB. If the Human annotations
were not done using the same UMLS version then there WILL be differences in CUI
and Semantic group. I don't have time to go into the details, examples,
etc.; just be aware that every 6 months
Sean,
I don't think that would be an issue since both the rare word lookup and
the first word lookup are using UMLS 2011AB. Or is the rare word lookup
using a different dictionary?
I would expect roughly similar results between the two when it comes to
differences between UMLS versions.
I’m bringing it up in case the Human Annotations were done using a different
version.
From: Kim Ebert [mailto:kim.eb...@perfectsearchcorp.com]
Sent: Friday, December 19, 2014 1:40 PM
To: dev@ctakes.apache.org
Subject: Re: cTakes Annotation Comparison
Sean,
I don't think that would be an issue
Pei,
I don't think bugs/issues should be part of determining if one algorithm
vs the other is superior. Obviously, it is worth mentioning the bugs,
but if the fast lookup method has worse precision and recall but better
performance, vs the slower but more accurate first word lookup
algorithm,
Rather than spam the mailing list with the list of filenames for the files
in the set we used, I would be happy to send it to anyone interested
privately.
IMAT Solutions http://imatsolutions.com
Bruce Tietjen
Senior Software Engineer
Mobile: 801.634.1547
Correction -- So far, I did steps 1 and 2 of Sean's email.
On Fri, Dec 19, 2014 at 1:22 PM, Bruce Tietjen
bruce.tiet...@perfectsearchcorp.com
Hi Bruce,
I'm not sure how there would be fewer matches with the overlap processor.
There should be all of the matches from the non-overlap processor plus those
from the overlap. Decreasing from 215 to 211 is strange. Have you done any
manual spot checks on this? It is really bizarre that
My original results were using a newly downloaded cTAKES 3.2.1 with the
separately downloaded resources copied in. There were no changes to any of
the configuration files.
As far as this last run, I modified the UMLSLookupAnnotator.xml and
AggregatePlaintextFastUMLSProcessor.xml. I've attached
Hi Bruce,
Correction -- So far, I did steps 1 and 2 of Sean's email.
No problem. Aside from recreating the database, those two steps have the
greatest impact. But before you change anything else, please do some manual
spot checks. I have never seen a case where the lookup would be so
I'll do that -- there is always a possibility of bugs in the analysis tool.
On Fri, Dec 19, 2014 at 1:39 PM, Finan, Sean
My apologies to Sean and everyone,
I am happy to report that I found a bug in our analysis tools that was
missing the last FSArray entry for any FSArray list.
With the bug fixed, the results look MUCH better.
UMLSProcessor found 31,598 annotations
FastUMLSProcessor found 30,716 annotations
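For anyone curious what "missing the last FSArray entry" typically looks like in practice, it is usually an off-by-one over the array bound. A toy sketch (my own, not the actual analysis tool's code):

```python
# Toy sketch of an off-by-one that silently drops the last entry of an
# array, the kind of bug described above. Not the actual analysis tool.

def collect_buggy(fs_array):
    # BUG: the range stops one short, so the final entry is never read
    return [fs_array[i] for i in range(len(fs_array) - 1)]

def collect_fixed(fs_array):
    return [fs_array[i] for i in range(len(fs_array))]

entries = ["first", "second", "last"]
collect_buggy(entries)   # drops "last"
collect_fixed(entries)   # keeps all three
```

Because every multi-element FSArray loses exactly one entry, the totals come out uniformly low rather than obviously broken, which is why it survived until a systematic comparison.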
Apologies accepted. I'm really glad that you found the problem.
So what you are saying is (just to be very very clear to everybody reading this
thread):
FastUMLSProcessor found 2795 matches (2,842 including overlaps)
While
UMLSProcessor found 2632 matches (2,735 including overlaps)
--- So
Bruce,
I think we all feel a lot better now. I think the tool will be helpful
moving forward.
I've updated the git repo with the fix in case anyone is interested.
IMAT Solutions http://imatsolutions.com
Kim Ebert
Software Engineer
Office: 801.669.7342
kim.eb...@imatsolutions.com
When I only include SignSymptomMention and DiseaseDisorderMention in the
analysis (which excludes annotations not included in the gold standard),
the matched annotations remain the same while the total annotations found
in those categories drop to the following:
Total Annotations found:
22 matches
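The type restriction described here amounts to filtering the system's annotations down to the types the gold standard covers before comparing. A minimal sketch (the tuple layout and example values are illustrative; the two kept type names are the cTAKES mention classes named above):

```python
# Sketch: restrict the comparison to the annotation types present in the
# gold standard. Tuple layout and example values are illustrative.

GOLD_TYPES = {"SignSymptomMention", "DiseaseDisorderMention"}

def filter_to_gold_types(annotations):
    """annotations: iterable of (type_name, start, end, cui) tuples."""
    return [a for a in annotations if a[0] in GOLD_TYPES]

found = [
    ("SignSymptomMention", 0, 5, "CUI-1"),
    ("MedicationMention", 10, 17, "CUI-2"),      # excluded from comparison
    ("DiseaseDisorderMention", 30, 38, "CUI-3"),
]
filter_to_gold_types(found)   # keeps 2 of the 3
```

Filtering this way changes the total system count (hence the drop described above) without touching the matched set, since matches outside the gold types could never align with gold annotations in the first place.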