Re: UmlsOverlapLookupAnnotator + BsvRareWordDictionary: # tokens skipped varies? [EXTERNAL]

2018-03-07 Thread Kean Kaufmann
Sean -- Thanks as always!

> It shouldn't be a problem, but is there an eol character after the " II"
> in your bsv?  It shouldn't be necessary, but who knows.

Double checked:  yeah, no.  Yes, there's an eol; no, that doesn't seem to
be it.

> Can you create a Jira item with this information?

https://issues.apache.org/jira/browse/CTAKES-498

Let me know if there's any other info you need.






Re: UmlsOverlapLookupAnnotator + BsvRareWordDictionary: # tokens skipped varies? [EXTERNAL]

2018-03-07 Thread Finan, Sean
Hi Kean,

It does sound like you are getting some odd results.  I will need to look into 
the code, but I won't have time to do so for a few days.  My initial thoughts 
are below.

>If I add an entry with a comma in it:
>then "chronic kidney disease), stage II" gets picked up, no matter what.
Well ... Commas do get special treatment so that items in a list do not spawn 
false overlapping terms.  In other words, "A, B, C" is more likely to be 3 
terms "A" "B" "C" than it is to be 2 terms "A B" "C".  So, a dictionary entry 
that explicitly contains a comma provides a hint to the system that "A,B" is 
actually one term.

>If I remove the other BSV entry, "chronic kidney disease III",
It shouldn't be a problem, but is there an eol character after the " II" in 
your bsv?  It shouldn't be necessary, but who knows.

> then "Decubitus ulcer - grade II" gets annotated as a
> DiseaseDisorderMention with C1720518, as hoped.
> But only "chronic kidney disease" is identified, as before... "stage II"
> gets left out.
That is very strange.  I have no idea why adding an entry would change the 
behavior.  I will have to look at the code and run your examples.  By the way, 
thank you for the explicit examples!

>Is this expected behavior?
No, and thanks for letting me know about it.  Can you create a Jira item with 
this information?

Thanks,
Sean



Re: UmlsOverlapLookupAnnotator + BsvRareWordDictionary: # tokens skipped varies?

2018-03-07 Thread Kean Kaufmann
P.S. Extra config bit:  I also removed "CD" from the exclusionTags in the
UmlsOverlapLookupAnnotator.




UmlsOverlapLookupAnnotator + BsvRareWordDictionary: # tokens skipped varies?

2018-03-07 Thread Kean Kaufmann
Hi Sean,

I'm perplexed. It seems as if the number of tokens that the
UmlsOverlapLookupAnnotator will skip varies with the content of the
RareWordDictionary.

Here's my setup.  I think I've included enough information to replicate my
perplexity, if you have time/inclination to do that; let me know if I've
left anything out.

I have a custom dictionary built from UMLS sources including SNOMEDCT_US:

    sql> select cui,text from cui_terms where text='chronic kidney disease' or
         cui in (2316786,2316787);
    CUI      TEXT
    -------  --------------------------------
    1561643  chronic kidney disease
    2316787  stage 3 chronic kidney disease
    2316787  chronic kidney disease stage 3
    2316787  chronic kidney disease , stage 3
    2316787  ckd stage 3
    2316786  chronic kidney disease stage 2
    2316786  chronic kidney disease , stage 2
    2316786  stage 2 chronic kidney disease
    2316786  ckd stage 2
    Fetched 9 rows.
    sql>


My documents contain acronym expansions and Roman numerals for stages, like
this:

    Problem List:
    CKD (chronic kidney disease), stage II
    Decubitus ulcer - grade II


So I create a BSV RareWordDictionary to capture the Roman numerals.
I don't want to have to guess at all the possible punctuation variations,
so I try to make my entries as general as safely possible,
using the UmlsOverlapLookupAnnotator with consecutiveSkips set to 2.

C2316786|chronic kidney disease II
C2316787|chronic kidney disease III
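
For anyone unfamiliar with the format: each line above is a CUI and a term separated by a vertical bar. The Python sketch below reads the two-column form used in this message. It is only an illustration, not the cTAKES BsvRareWordDictionary reader; parse_bsv is a hypothetical helper, and lowercasing and whitespace tokenization are assumptions.

```python
def parse_bsv(content):
    """Parse two-column 'CUI|text' lines into (cui, tokens) pairs."""
    entries = []
    for line in content.splitlines():
        line = line.strip()
        # Skip blank lines and anything without a bar separator.
        if not line or "|" not in line:
            continue
        cui, text = line.split("|", 1)
        entries.append((cui.strip(), text.strip().lower().split()))
    return entries


with_eol = "C2316786|chronic kidney disease II\nC2316787|chronic kidney disease III\n"
without_eol = with_eol.rstrip("\n")

for cui, tokens in parse_bsv(with_eol):
    print(cui, tokens)
```

In this toy parser a missing end-of-line after the last entry makes no difference, since splitlines() and strip() discard it. Whether the real reader behaves the same way is not something the sketch can show.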

I add dictionary and dictionaryConceptPair entries for my BSV file to
cTakesHsql.xml as shown in the example/ directory, using
SemanticCleanupTermConsumer as rareWordConsumer.

Success! Now "chronic kidney disease), stage II" gets annotated as a
DiseaseDisorderMention with CUI C2316786.

But a few things confuse me.

*1. Removing an entry*

If I remove the other BSV entry, "chronic kidney disease III",
"chronic kidney disease), stage II" isn't identified anymore:
suddenly it only annotates "chronic kidney disease", with C1561643.

*2. Adding an entry*

My documents also have staging language for ulcers, e.g. "Decubitus ulcer -
grade II".

If I add an entry for this to my BSV dictionary, so now I have:

C2316786|chronic kidney disease II
C2316787|chronic kidney disease III
C1720518|decubitus ulcer II

and annotate this text:

    Problem List:
    CKD (chronic kidney disease), stage II
    Decubitus ulcer - grade II


then "Decubitus ulcer - grade II" gets annotated as a
DiseaseDisorderMention with C1720518, as hoped.
But only "chronic kidney disease" is identified, as before... "stage II"
gets left out.

*3. Adding a comma*

If I add an entry with a comma in it:

C2316786|chronic kidney disease , II

then "chronic kidney disease), stage II" gets picked up, no matter what.

Without the comma entry, it's skipping three consecutive tokens... but
sometimes it seems willing to do that, and sometimes it doesn't.
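
To make the token counting concrete, here is a toy model of overlap matching in Python. It is not the cTAKES implementation. The tokenizer, the function names, and the matching rule (at most max_skips consecutive unmatched tokens between neighbouring matched tokens) are all assumptions made for illustration.

```python
import re


def tokenize(text):
    # Assumed tokenization: lowercase words, each punctuation mark its own token.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def overlap_match(term, text, max_skips):
    """True if the term's tokens occur in order in the text with at most
    max_skips consecutive unmatched tokens between neighbouring matches."""
    term_toks, text_toks = tokenize(term), tokenize(text)
    if not term_toks:
        return False

    def extend(next_term, pos):
        # next_term: index of the next term token to find.
        # pos: index in the text just after the last matched token.
        if next_term == len(term_toks):
            return True
        last = min(pos + max_skips + 1, len(text_toks))
        return any(
            text_toks[j] == term_toks[next_term] and extend(next_term + 1, j + 1)
            for j in range(pos, last)
        )

    return any(
        tok == term_toks[0] and extend(1, i + 1)
        for i, tok in enumerate(text_toks)
    )


ckd = "CKD (chronic kidney disease), stage II"
ulcer = "Decubitus ulcer - grade II"

print(overlap_match("chronic kidney disease II", ckd, 2))    # False: ")" "," "stage" is 3 skips
print(overlap_match("chronic kidney disease II", ckd, 3))    # True
print(overlap_match("chronic kidney disease , II", ckd, 2))  # True: never more than 1 skip in a row
print(overlap_match("decubitus ulcer II", ulcer, 2))         # True: "-" "grade" is 2 skips
```

Under this toy model the ulcer entry and the comma entry both fit within consecutiveSkips = 2, while the comma-free CKD entry needs three skips and should never match. The sketch therefore reproduces the comma and ulcer results but does not explain why the comma-free entry is sometimes matched anyway.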

Is this expected behavior?
If so, can you help me understand what to expect?
At this point I hesitate to add anything to the BSV dictionary!

Many thanks,
Kean