Dictionary Matching using Concept Mapper for single word entry.

Khirod Kant Naik Mon, 20 Jul 2015 06:02:30 -0700

Hi everyone,

I am unable to match text from the dictionary when the enclosing span
contains only a single token.


For example, I am trying to match the word "education" from my dictionary,
using a sentence as the enclosing span. If the sentence contains only a
single token, ConceptMapper does not match it against the dictionary.

Here is what I have tried:

When the sentence is "Education <**something else**>", ConceptMapper
matches "education".
When the sentence is just "Education", ConceptMapper does not pick it up
from the dictionary.

So my question is: does ConceptMapper require more than one TokenAnnotation
within the specified SpanFeatureStructure?
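
To make the setup concrete, a dictionary entry of the kind described above might look like the sketch below. The `<synonym>`, `<token>` and `<variant>` elements follow the ConceptMapper dictionary format, and the attribute names mirror the AttributeList in the descriptor (canonical, group, class). The attribute values are placeholders, not taken from the actual dict/segment.heading.xml.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical excerpt; attribute values are illustrative assumptions. -->
<synonym>
  <token canonical="education" group="heading" class="education">
    <variant base="education"/>
  </token>
</synonym>
```

With caseMatch set to ignoreall, the variant "education" should be comparable to the token "Education" regardless of case.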

P.S.: This is the descriptor I am using:

<?xml version="1.0" encoding="UTF-8"?>
> <taeDescription xmlns="http://uima.apache.org/resourceSpecifier">
>   <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
>   <primitive>true</primitive>
>   <annotatorImplementationName>org.apache.uima.conceptMapper.ConceptMapper</annotatorImplementationName>
>   <analysisEngineMetaData>
>     <name>Segment Heading Annotator</name>
>     <description/>
>     <version>1</version>
>     <vendor/>
>     <configurationParameters>
>       <configurationParameter>
>         <name>caseMatch</name>
>         <description>this parameter specifies the case folding mode:
>                     ignoreall - fold everything to lowercase for
>                     matching; insensitive - fold only tokens with initial
>                     caps to lowercase; digitfold - fold all (and only)
>                     tokens with a digit; sensitive - perform no case
>                     folding</description>
>         <type>String</type>
>         <multiValued>false</multiValued>
>         <mandatory>true</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>Stemmer</name>
>         <description>Name of stemmer class to use before matching. MUST
>                     have a zero-parameter constructor! If not specified,
>                     no stemming will be performed.</description>
>         <type>String</type>
>         <multiValued>false</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>ResultingAnnotationName</name>
>         <description>Name of the annotation type created by this TAE,
>                     must match the typeSystemDescription
>                     entry</description>
>         <type>String</type>
>         <multiValued>false</multiValued>
>         <mandatory>true</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>ResultingEnclosingSpanName</name>
>         <description>Name of the feature in the resultingAnnotation to
>                     contain the span that encloses it (i.e. its
>                     sentence)</description>
>         <type>String</type>
>         <multiValued>false</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>AttributeList</name>
>         <description>List of attribute names for XML dictionary entry
>                     record - must correspond to FeatureList</description>
>         <type>String</type>
>         <multiValued>true</multiValued>
>         <mandatory>true</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>FeatureList</name>
>         <description>List of feature names for CAS annotation - must
>                     correspond to AttributeList</description>
>         <type>String</type>
>         <multiValued>true</multiValued>
>         <mandatory>true</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>TokenAnnotation</name>
>         <description/>
>         <type>String</type>
>         <multiValued>false</multiValued>
>         <mandatory>true</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>TokenClassFeatureName</name>
>         <description>Name of feature used when doing lookups against
>                     IncludedTokenClasses and
>                     ExcludedTokenClasses</description>
>         <type>String</type>
>         <multiValued>false</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>TokenTextFeatureName</name>
>         <description/>
>         <type>String</type>
>         <multiValued>false</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>SpanFeatureStructure</name>
>         <description>Type of annotation which corresponds to spans of
>                     data for processing (e.g. a Sentence)</description>
>         <type>String</type>
>         <multiValued>false</multiValued>
>         <mandatory>true</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>OrderIndependentLookup</name>
>         <description>True if should ignore element order during lookup
>                     (i.e., "top box" would equal "box top"). Default is
>                     False.</description>
>         <type>Boolean</type>
>         <multiValued>false</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>TokenTypeFeatureName</name>
>         <description>Name of feature used when doing lookups against
>                     IncludedTokenTypes and ExcludedTokenTypes</description>
>         <type>String</type>
>         <multiValued>false</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>IncludedTokenTypes</name>
>         <description>Type of tokens to include in lookups (if not
>                     supplied, then all types are included except those
>                     specifically mentioned in
>                     ExcludedTokenTypes)</description>
>         <type>Integer</type>
>         <multiValued>true</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>ExcludedTokenTypes</name>
>         <description/>
>         <type>Integer</type>
>         <multiValued>true</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>ExcludedTokenClasses</name>
>         <description>Class of tokens to exclude from lookups (if not
>                     supplied, then all classes are excluded except those
>                     specifically mentioned in IncludedTokenClasses,
>                     unless IncludedTokenClasses is not supplied, in
>                     which case none are excluded)</description>
>         <type>String</type>
>         <multiValued>true</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>IncludedTokenClasses</name>
>         <description>Class of tokens to include in lookups (if not
>                     supplied, then all classes are included except those
>                     specifically mentioned in
>                     ExcludedTokenClasses)</description>
>         <type>String</type>
>         <multiValued>true</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>TokenClassWriteBackFeatureNames</name>
>         <description>names of features that should be written back to a
>                     token, such as a POS tag</description>
>         <type>String</type>
>         <multiValued>true</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>ResultingAnnotationMatchedTextFeature</name>
>         <type>String</type>
>         <multiValued>false</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>PrintDictionary</name>
>         <type>Boolean</type>
>         <multiValued>false</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>SearchStrategy</name>
>         <description>Can be either "SkipAnyMatch",
>                     "SkipAnyMatchAllowOverlap" or
>                     "ContiguousMatch"&#13;&#13;ContiguousMatch: longest
>                     match of contiguous tokens within enclosing
>                     span(taking into account included/excluded items).
>                     DEFAULT strategy &#13;SkipAnyMatch: longest match of
>                     not-necessarily contiguous tokens within enclosing
>                     span (taking into account included/excluded items).
>                     Subsequent lookups begin in span after complete
>                     match. IMPLIES order-independent lookup
>                     &#13;SkipAnyMatchAllowOverlap: longest match of
>                     not-necessarily contiguous tokens within enclosing
>                     span (taking into account included/excluded items).
>                     Subsequent lookups begin in span after next token.
>                     IMPLIES order-independent lookup</description>
>         <type>String</type>
>         <multiValued>false</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>StopWords</name>
>         <type>String</type>
>         <multiValued>true</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>FindAllMatches</name>
>         <type>Boolean</type>
>         <multiValued>false</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>MatchedTokensFeatureName</name>
>         <type>String</type>
>         <multiValued>false</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>ReplaceCommaWithAND</name>
>         <type>Boolean</type>
>         <multiValued>false</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>TokenizerDescriptorPath</name>
>         <type>String</type>
>         <multiValued>false</multiValued>
>         <mandatory>true</mandatory>
>       </configurationParameter>
>       <configurationParameter>
>         <name>LanguageID</name>
>         <type>String</type>
>         <multiValued>false</multiValued>
>         <mandatory>false</mandatory>
>       </configurationParameter>
>     </configurationParameters>
>     <configurationParameterSettings>
>       <nameValuePair>
>         <name>caseMatch</name>
>         <value>
>           <string>ignoreall</string>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>AttributeList</name>
>         <value>
>           <array>
>             <string>canonical</string>
>             <string>group</string>
>             <string>class</string>
>           </array>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>FeatureList</name>
>         <value>
>           <array>
>             <string>DictCanon</string>
>             <string>group</string>
>             <string>segmentClass</string>
>           </array>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>TokenAnnotation</name>
>         <value>
>           <string>com.naukri.parse.type.TokenAnnotation</string>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>ResultingAnnotationName</name>
>         <value>
>           <string>com.naukri.parse.type.resume.SegmentHeading</string>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>SpanFeatureStructure</name>
>         <value>
>           <string>com.naukri.parse.type.Sentence</string>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>OrderIndependentLookup</name>
>         <value>
>           <boolean>false</boolean>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>TokenClassWriteBackFeatureNames</name>
>         <value>
>           <array/>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>IncludedTokenClasses</name>
>         <value>
>           <array/>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>PrintDictionary</name>
>         <value>
>           <boolean>false</boolean>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>FindAllMatches</name>
>         <value>
>           <boolean>true</boolean>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>StopWords</name>
>         <value>
>           <array/>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>ReplaceCommaWithAND</name>
>         <value>
>           <boolean>false</boolean>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>TokenizerDescriptorPath</name>
>         <value>
>           <string>desc/tokenizerAE.xml</string>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>ResultingEnclosingSpanName</name>
>         <value>
>           <string>enclosingSpan</string>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>MatchedTokensFeatureName</name>
>         <value>
>           <string>matchedTokens</string>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>ResultingAnnotationMatchedTextFeature</name>
>         <value>
>           <string>matchedText</string>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>SearchStrategy</name>
>         <value>
>           <string>ContiguousMatch</string>
>         </value>
>       </nameValuePair>
>       <nameValuePair>
>         <name>LanguageID</name>
>         <value>
>           <string>en</string>
>         </value>
>       </nameValuePair>
>     </configurationParameterSettings>
>     <typeSystemDescription>
>       <imports>
>         <import location="typesystem.xml"/>
>       </imports>
>     </typeSystemDescription>
>     <typePriorities>
>       <priorityList>
>         <type>com.naukri.parse.type.TokenAnnotation</type>
>       </priorityList>
>     </typePriorities>
>     <fsIndexCollection/>
>     <capabilities>
>       <capability>
>         <inputs>
>           <type allAnnotatorFeatures="true">com.naukri.parse.type.TokenAnnotation</type>
>         </inputs>
>         <outputs>
>           <type allAnnotatorFeatures="true">com.naukri.parse.type.DictTerm</type>
>           <type allAnnotatorFeatures="true">com.naukri.parse.type.TokenAnnotation</type>
>           <type allAnnotatorFeatures="true">uima.tcas.DocumentAnnotation</type>
>         </outputs>
>         <languagesSupported/>
>       </capability>
>     </capabilities>
>     <operationalProperties>
>       <modifiesCas>true</modifiesCas>
>       <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
>       <outputsNewCASes>false</outputsNewCASes>
>     </operationalProperties>
>   </analysisEngineMetaData>
>   <externalResourceDependencies>
>     <externalResourceDependency>
>       <key>DictionaryFile</key>
>       <description>dictionary file loader.</description>
>       <interfaceName>org.apache.uima.conceptMapper.support.dictionaryResource.DictionaryResource</interfaceName>
>       <optional>false</optional>
>     </externalResourceDependency>
>   </externalResourceDependencies>
>   <resourceManagerConfiguration>
>     <externalResources>
>       <externalResource>
>         <name>segment_heading_dict</name>
>         <description>A file containing the dictionary. Modify this URL to
>                     use a different dictionary.</description>
>         <fileResourceSpecifier>
>           <fileUrl>dict/segment.heading.xml</fileUrl>
>         </fileResourceSpecifier>
>         <implementationName>org.apache.uima.conceptMapper.support.dictionaryResource.DictionaryResource_impl</implementationName>
>       </externalResource>
>     </externalResources>
>     <externalResourceBindings>
>       <externalResourceBinding>
>         <key>DictionaryFile</key>
>         <resourceName>segment_heading_dict</resourceName>
>       </externalResourceBinding>
>     </externalResourceBindings>
>   </resourceManagerConfiguration>
> </taeDescription>
>
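
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: if ConceptMapper collects the tokens of each span with a UIMA subiterator, then a token whose begin and end offsets equal those of its Sentence is only returned when the Sentence type sorts before the token type in the annotation index, and that ordering is governed by type priorities. The descriptor above lists only TokenAnnotation in its priority list. A sketch of a priority list that places the span type first, using the type names from the descriptor:

```xml
<typePriorities>
  <priorityList>
    <!-- Span type first, so a token with identical offsets sorts after it. -->
    <type>com.naukri.parse.type.Sentence</type>
    <type>com.naukri.parse.type.TokenAnnotation</type>
  </priorityList>
</typePriorities>
```

If the single-token sentence still does not match with this ordering, the cause lies elsewhere.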
