Thanks, James.
I am leaning toward supplementing the UMLS DB as you suggest rather than
changing it, if I can make that work. I did originally try adding entries to
dictionary1.csv, while using AggregatePlainTextProcessor.xml, but saw no change
in the annotations. I guess that dictionary1 is in fact not being used in
APTP.xml, and "hyperlipidemia", "knee", "pain", et al get annotated due to some
other term list / dictionary. Time to wade through the contents of that
ctakes-dictionary-lookup-res\src\... tree.
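One quick way to check which files a given descriptor actually points at, before wading through the whole tree, is to pull the fileUrl entries out of it. The following is a minimal sketch, not part of cTAKES: the function name resource_files is made up for illustration, and SAMPLE is a trimmed copy of the resource section of the CSV descriptor attached below. If dictionary1.csv does not show up for the descriptor your pipeline really loads, that would be consistent with the edits to it having no effect.

```python
import xml.etree.ElementTree as ET

# Namespace used by UIMA descriptors (taken from the attached descriptor).
UIMA_NS = "{http://uima.apache.org/resourceSpecifier}"

# Trimmed copy of the resource section of the attached descriptor.
SAMPLE = """<taeDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>LookupDescriptorFile</name>
        <fileResourceSpecifier>
          <fileUrl>file:org/apache/ctakes/dictionary/lookup/LookupDesc_csv_sample.xml</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
      <externalResource>
        <name>DictionaryFileResource</name>
        <fileResourceSpecifier>
          <fileUrl>file:org/apache/ctakes/dictionary/lookup/dictionary1.csv</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
      <externalResource>
        <name>RxnormIndex</name>
        <configurableDataResourceSpecifier><url/></configurableDataResourceSpecifier>
      </externalResource>
    </externalResources>
  </resourceManagerConfiguration>
</taeDescription>"""


def resource_files(descriptor_xml):
    """Map each external resource name to the file URL it points at.

    Resources without a fileResourceSpecifier (e.g. Lucene indexes) are skipped.
    """
    root = ET.fromstring(descriptor_xml)
    found = {}
    for res in root.iter(UIMA_NS + "externalResource"):
        name = res.findtext(UIMA_NS + "name")
        url = res.findtext(UIMA_NS + "fileResourceSpecifier/" + UIMA_NS + "fileUrl")
        if url is not None:
            found[name] = url
    return found


if __name__ == "__main__":
    for name, url in resource_files(SAMPLE).items():
        print(name, "->", url)
```

To use it on a real descriptor, read the file's contents as bytes (open(path, "rb").read()) and pass them to resource_files in place of SAMPLE.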
-----Original Message-----
From: Masanz, James J. [mailto:[email protected]]
Sent: Fri, 03 Jan, 2014 16:45
To: '[email protected]'
Subject: [External] RE: How to augment/modify UMLS resources?
The separately downloadable UMLS dictionary formatted for cTAKES [1], not
counting medication names (RxNorm), is in a database [2]. So you could add to
that database whatever terms you want.
The RxNorm dictionary is in a Lucene index (though there is a related Jira
ticket open, so maybe it will end up in the same database), so adding to the
currently used medications list would probably best be done programmatically
using the Lucene API (someone with more Lucene end-user experience, please
chime in).
cTAKES provides a way to look up terms in a flatfile dictionary that you would
provide. See the files that end with .csv within
ctakes-dictionary-lookup-res\src\main\resources\org\apache\ctakes\dictionary\lookup
The flatfile is not used directly in conjunction with the database file of
terms from UMLS – to use the two together, you would have one annotator
configured to use that flatfile for the dictionary, and have a second annotator
configured to use the database file.
Some things to be aware of if you went that route:
- each note would be processed by both, and if you had terms in your flatfile
that duplicated what was in the database, you would end up with double
annotations
- each note would be processed in effect twice (not by the entire pipeline,
thankfully), so it would be slower than just using one.
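If double annotations do show up, one option is to collapse them after both lookups have run. The sketch below only illustrates the idea and is not cTAKES code: plain tuples stand in for annotations, the Mention type, the drop_duplicates function, and the concept codes are all made up for the example, and it assumes that two annotations count as duplicates when they cover the same span with the same code.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    """Stand-in for an annotation: character span, concept code, origin."""
    begin: int
    end: int
    code: str
    source: str  # "db" for the database lookup, "csv" for the flatfile lookup


def drop_duplicates(mentions):
    """Keep one mention per (begin, end, code); the first one seen wins."""
    seen = set()
    kept = []
    # sorted() is stable, so among equal keys the input order is preserved.
    for m in sorted(mentions, key=lambda m: (m.begin, m.end, m.code)):
        key = (m.begin, m.end, m.code)
        if key not in seen:
            seen.add(key)
            kept.append(m)
    return kept


# Text: "knee pain". Both lookups find "pain"; only the flatfile has "knee pain".
from_db = [Mention(0, 4, "C_KNEE", "db"), Mention(5, 9, "C_PAIN", "db")]
from_csv = [Mention(5, 9, "C_PAIN", "csv"), Mention(0, 9, "C_KNEE_PAIN", "csv")]

merged = drop_duplicates(from_db + from_csv)

if __name__ == "__main__":
    for m in merged:
        print(m)
```

Listing the database results first means the database copy is the one kept when both lookups agree. Keeping the flatfile free of terms already in the database avoids the problem at the source.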
As for something being annotated that you don't want annotated: within the
LookupDesc*xml file being used, there can be an excludeList, which you could
use so that "men" is no longer annotated. See LookupDesc_DrugNER.xml for an
example of using excludeList.
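The effect of such an exclusion list can be pictured with a small sketch. This is not the cTAKES mechanism and does not reproduce the excludeList XML syntax (see LookupDesc_DrugNER.xml for that); it only shows the intended outcome. The function name apply_exclude_list is hypothetical, and matching case-insensitively is an assumption made for the example.

```python
def apply_exclude_list(mentions, text, exclude):
    """Drop (begin, end) spans whose covered text is on the exclusion list.

    Matching is case-insensitive here; that is an assumption of this sketch.
    """
    blocked = {term.lower() for term in exclude}
    return [(b, e) for (b, e) in mentions if text[b:e].lower() not in blocked]


text = "Chinese men with knee pain"
# Spans a dictionary lookup might produce: "men", "knee", "pain".
mentions = [(8, 11), (17, 21), (22, 26)]

kept = apply_exclude_list(mentions, text, exclude=["men", "ER"])

if __name__ == "__main__":
    print([text[b:e] for (b, e) in kept])
```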
Any improvements or even written steps on any of the above would be a great
contribution.
-- James
[1] http://sourceforge.net/projects/ctakesresources/files/
[2] the relative path to the hsql db is
resources\org\apache\ctakes\dictionary\lookup\umls2011ab
From: [email protected]
[mailto:[email protected]] On Behalf Of
Lee, Richard A. [USA]
Sent: Thursday, January 02, 2014 5:01 PM
To: [email protected]
Subject: How to augment/modify UMLS resources?
Howdy, all. I’ve got a lot of experience with various commercial extraction
tools, but I’m new to cTAKES and UIMA, so please bear with me.
I am able to use my UMLS credentials to process documents, and the results are
good. But there are a few things I wish to change in the medfacts.types.Concept
and AnatomicalSiteMention areas, for starters. For example, while it annotates
“orbicularis oculi” as a concept, it does not annotate “musculus orbicularis
oculi”, “septum orbital”, or “oculi medialis”. It annotates “ulceration”,
“perforation”, and “corneal perforation” but not “corneal ulceration”. It
annotates “men” (as in “Chinese men”) as a “problem”. It annotates “ER” (i.e.,
Emergency Room) as an AnatomicalSiteReference.
So, the question becomes, how do I address these? Do I need to somehow
re-generate (with changes) the UMLS data files, probably using Luke or some
such? That seems a bit crude. Is there a clean way to supplement those data
files instead to achieve the desired changes?
Thanks in advance.
------------------------------------------------------------------------------------------------------------
Richard A Lee || Lead Associate / Senior Ontologist || [email protected] ||
571-482-7809
<?xml version="1.0" encoding="UTF-8"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<taeDescription xmlns="http://uima.apache.org/resourceSpecifier">
<frameworkImplementation>org.apache.uima.java</frameworkImplementation>
<primitive>true</primitive>
<annotatorImplementationName>org.apache.ctakes.dictionary.lookup.ae.DictionaryLookupAnnotator</annotatorImplementationName>
<analysisEngineMetaData>
<name>DictionaryLookupAnnotatorCSV</name>
<description>Dictionaries - some in lucene indexes and some in CSV files</description>
<version/>
<vendor/>
<configurationParameters>
<configurationParameter>
<name>maxListSize</name>
<description>Specifies the maximum number of items to be returned from a Lucene query.</description>
<type>Integer</type>
<multiValued>false</multiValued>
<mandatory>false</mandatory>
</configurationParameter>
</configurationParameters>
<configurationParameterSettings>
<nameValuePair>
<name>maxListSize</name>
<value>
<integer>2147483647</integer>
</value>
</nameValuePair>
</configurationParameterSettings>
<typeSystemDescription>
<imports>
</imports>
</typeSystemDescription>
<typePriorities/>
<fsIndexCollection/>
<capabilities>
<capability>
<inputs>
<type allAnnotatorFeatures="true">org.apache.ctakes.typesystem.type.syntax.BaseToken</type>
<type allAnnotatorFeatures="true">org.apache.ctakes.typesystem.type.textspan.LookupWindowAnnotation</type>
</inputs>
<outputs>
<type allAnnotatorFeatures="true">org.apache.ctakes.typesystem.type.textsem.IdentifiedAnnotation</type>
</outputs>
<languagesSupported/>
</capability>
</capabilities>
<operationalProperties>
<modifiesCas>true</modifiesCas>
<multipleDeploymentAllowed>true</multipleDeploymentAllowed>
<outputsNewCASes>false</outputsNewCASes>
</operationalProperties>
</analysisEngineMetaData>
<externalResourceDependencies>
<externalResourceDependency>
<key>LookupDescriptor</key>
<description/>
<interfaceName>org.apache.ctakes.core.resource.FileResource</interfaceName>
<optional>false</optional>
</externalResourceDependency>
<externalResourceDependency>
<key>DictionaryFile</key>
<description/>
<interfaceName>org.apache.ctakes.core.resource.FileResource</interfaceName>
<optional>false</optional>
</externalResourceDependency>
<externalResourceDependency>
<key>RxnormIndexReader</key>
<description/>
<interfaceName>org.apache.ctakes.core.resource.LuceneIndexReaderResource</interfaceName>
<optional>false</optional>
</externalResourceDependency>
<externalResourceDependency>
<key>OrangeBookIndexReader</key>
<description/>
<interfaceName>org.apache.ctakes.core.resource.LuceneIndexReaderResource</interfaceName>
<optional>false</optional>
</externalResourceDependency>
</externalResourceDependencies>
<resourceManagerConfiguration>
<externalResources>
<externalResource>
<name>LookupDescriptorFile</name>
<description/>
<fileResourceSpecifier>
<fileUrl>file:org/apache/ctakes/dictionary/lookup/LookupDesc_csv_sample.xml</fileUrl>
</fileResourceSpecifier>
<implementationName>org.apache.ctakes.core.resource.FileResourceImpl</implementationName>
</externalResource>
<externalResource>
<name>DictionaryFileResource</name>
<description/>
<fileResourceSpecifier>
<fileUrl>file:org/apache/ctakes/dictionary/lookup/dictionary1.csv</fileUrl>
</fileResourceSpecifier>
<implementationName>org.apache.ctakes.core.resource.FileResourceImpl</implementationName>
</externalResource>
<externalResource>
<name>RxnormIndex</name>
<description/>
<configurableDataResourceSpecifier>
<url/>
<resourceMetaData>
<name/>
<configurationParameters>
<configurationParameter>
<name>UseMemoryIndex</name>
<type>Boolean</type>
<multiValued>false</multiValued>
<mandatory>true</mandatory>
</configurationParameter>
<configurationParameter>
<name>IndexDirectory</name>
<type>String</type>
<multiValued>false</multiValued>
<mandatory>true</mandatory>
</configurationParameter>
</configurationParameters>
<configurationParameterSettings>
<nameValuePair>
<name>UseMemoryIndex</name>
<value>
<boolean>true</boolean>
</value>
</nameValuePair>
<nameValuePair>
<name>IndexDirectory</name>
<value>
<string>org/apache/ctakes/dictionary/lookup/drug_index</string>
</value>
</nameValuePair>
</configurationParameterSettings>
</resourceMetaData>
</configurableDataResourceSpecifier>
<implementationName>org.apache.ctakes.core.resource.LuceneIndexReaderResourceImpl</implementationName>
</externalResource>
<externalResource>
<name>OrangeBookIndex</name>
<description/>
<configurableDataResourceSpecifier>
<url/>
<resourceMetaData>
<name/>
<configurationParameters>
<configurationParameter>
<name>UseMemoryIndex</name>
<type>Boolean</type>
<multiValued>false</multiValued>
<mandatory>true</mandatory>
</configurationParameter>
<configurationParameter>
<name>IndexDirectory</name>
<type>String</type>
<multiValued>false</multiValued>
<mandatory>true</mandatory>
</configurationParameter>
</configurationParameters>
<configurationParameterSettings>
<nameValuePair>
<name>UseMemoryIndex</name>
<value>
<boolean>true</boolean>
</value>
</nameValuePair>
<nameValuePair>
<name>IndexDirectory</name>
<value>
<string>org/apache/ctakes/dictionary/lookup/OrangeBook</string>
</value>
</nameValuePair>
</configurationParameterSettings>
</resourceMetaData>
</configurableDataResourceSpecifier>
<implementationName>org.apache.ctakes.core.resource.LuceneIndexReaderResourceImpl</implementationName>
</externalResource>
</externalResources>
<externalResourceBindings>
<externalResourceBinding>
<key>LookupDescriptor</key>
<resourceName>LookupDescriptorFile</resourceName>
</externalResourceBinding>
<externalResourceBinding>
<key>DictionaryFile</key>
<resourceName>DictionaryFileResource</resourceName>
</externalResourceBinding>
<externalResourceBinding>
<key>RxnormIndexReader</key>
<resourceName>RxnormIndex</resourceName>
</externalResourceBinding>
<externalResourceBinding>
<key>OrangeBookIndexReader</key>
<resourceName>OrangeBookIndex</resourceName>
</externalResourceBinding>
</externalResourceBindings>
</resourceManagerConfiguration>
</taeDescription>