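I modified the Regex Annotator in the Sandbox to read its regular expressions 
from external files. See the attached class and AE descriptor. 

Essentially, you can define configurationParameters in the AE whose values are 
the keys of externalResourceDependencies, which are in turn bound to 
externalResources (the files).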



-----Original Message-----
From: Spico Florin [mailto:[email protected]] 
Sent: Wednesday, January 11, 2012 10:19 AM
To: [email protected]
Subject: Question about UimaRegexAnnotator

Hello!
   In our project we would like to analyze our corpus data (news) with more than 
900 regular expressions for identifying the same entity. We also have many 
other entities that are identified by a similarly large number of regular 
expressions. We would like to use UIMARegexAnnotator. I've seen that you can 
add your regex in the <rule regEx=""> tag definition.
 Here are my questions/concerns:
  1. Can we add all 900 regular expressions in this tag definition?
  2. Is there any support for referencing, from the rule definition, a 
file filled with regexes?

Looking forward to your answers. Thank you.


Regards,
  Florin
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>org.spin.scrubber.uima.annotator.RegexAnnotator</annotatorImplementationName>
  <analysisEngineMetaData>
    <name>Regex Annotator</name>
    <description>Matches regular expressions in document text.</description>
    <version/>
    <vendor/>
    <configurationParameters>
      <configurationParameter>
        <name>Filenames</name>
        <description>list of external resource dependency keys that need to be initialized for this annotator</description>
        <type>String</type>
        <multiValued>true</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
      <configurationParameter>
        <name>CaseSensitiveFile</name>
        <description>Boolean flags, one per entry in Filenames and in the same order, that determine whether each file is interpreted as case sensitive.</description>
        <type>Boolean</type>
        <multiValued>true</multiValued>
        <mandatory>true</mandatory>
      </configurationParameter>
    </configurationParameters>
    <configurationParameterSettings>
      <nameValuePair>
        <name>Filenames</name>
        <value>
          <array>
            <string>PatternFile</string>
            <string>NameFile</string>
            <string>HospitalNameFile</string>
            <string>PrivateFile</string>
          </array>
        </value>
      </nameValuePair>
      <nameValuePair>
        <name>CaseSensitiveFile</name>
        <value>
          <array>
            <boolean>true</boolean>
            <boolean>false</boolean>
            <boolean>false</boolean>
            <boolean>false</boolean>
          </array>
        </value>
      </nameValuePair>
    </configurationParameterSettings>
    <typeSystemDescription>
      <imports>
        <import location="../type/OntologyMatchTypeSystem.xml"/>
      </imports>
    </typeSystemDescription>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>uima.tcas.Annotation</type>
          <type>org.spin.scrubber.uima.type.OntologyMatch</type>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <externalResourceDependencies>
    <externalResourceDependency>
      <key>PatternFile</key>
      <description>A required external file containing regular expressions to match. 
	      File format is as follows: 
			  - Lines starting with // or whitespace are ignored
			  - Lines starting with # are the regex name 
			  - Lines starting with % indicate an annotation type. 
			  - All other lines are regular expressions.</description>
      <optional>false</optional>
    </externalResourceDependency>
    <externalResourceDependency>
      <key>NameFile</key>
      <description>An optional external file containing names to match. 
	      File format is as follows: 
			  - Lines starting with // or whitespace are ignored
			  - Lines starting with # are the name 
			  - Lines starting with % indicate an annotation type. 
			  - All other lines consist of strings to match.</description>
      <optional>true</optional>
    </externalResourceDependency>
    <externalResourceDependency>
      <key>HospitalNameFile</key>
      <description>An optional external file containing names to match. 
	      File format is as follows: 
			  - Lines starting with // or whitespace are ignored
			  - Lines starting with # are the name 
			  - Lines starting with % indicate an annotation type. 
			  - All other lines consist of strings to match.</description>
      <optional>true</optional>
    </externalResourceDependency>
    <externalResourceDependency>
      <key>PrivateFile</key>
      <description>An optional external file containing names to match. 
	      File format is as follows: 
			  - Lines starting with // or whitespace are ignored
			  - Lines starting with # are the name 
			  - Lines starting with % indicate an annotation type. 
			  - All other lines consist of strings to match.</description>
      <optional>true</optional>
    </externalResourceDependency>
  </externalResourceDependencies>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>regex</name>
        <description/>
        <fileResourceSpecifier>
          <fileUrl>file:conf/regex_patterns.txt</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
      <externalResource>
        <name>name</name>
        <description/>
        <fileResourceSpecifier>
          <fileUrl>file:conf/names.txt</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
      <externalResource>
        <name>hospital</name>
        <description/>
        <fileResourceSpecifier>
          <fileUrl>file:conf/hospital_names.txt</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
      <externalResource>
        <name>private</name>
        <description/>
        <fileResourceSpecifier>
          <fileUrl>file:conf/private_dict.txt</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>PatternFile</key>
        <resourceName>regex</resourceName>
      </externalResourceBinding>
      <externalResourceBinding>
        <key>NameFile</key>
        <resourceName>name</resourceName>
      </externalResourceBinding>
      <externalResourceBinding>
        <key>HospitalNameFile</key>
        <resourceName>hospital</resourceName>
      </externalResourceBinding>
      <externalResourceBinding>
        <key>PrivateFile</key>
        <resourceName>private</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>
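
The file format described for the PatternFile dependency in the descriptor above could be parsed roughly as follows. This is only a minimal sketch using the standard library, not the attached class, which is not reproduced here. The names PatternFileParser and NamedPattern, the sample type org.example.ZipCode, the skipping of empty lines, and the rule that a "#" or "%" line stays in effect until the next one are all assumptions made for illustration. The caseSensitive argument stands in for the matching entry of the CaseSensitiveFile parameter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PatternFileParser {

    /** One compiled rule plus the name and annotation type in effect when it was read. */
    public static final class NamedPattern {
        public final String name;            // from the most recent "#" line, or null
        public final String annotationType;  // from the most recent "%" line, or null
        public final Pattern pattern;

        NamedPattern(String name, String annotationType, Pattern pattern) {
            this.name = name;
            this.annotationType = annotationType;
            this.pattern = pattern;
        }
    }

    /** Parses the lines of a pattern file; caseSensitive = false compiles with CASE_INSENSITIVE. */
    public static List<NamedPattern> parse(List<String> lines, boolean caseSensitive) {
        int flags = caseSensitive ? 0 : Pattern.CASE_INSENSITIVE;
        List<NamedPattern> rules = new ArrayList<>();
        String name = null;
        String type = null;
        for (String line : lines) {
            // Lines starting with "//" or whitespace are ignored (empty lines too, by assumption).
            if (line.isEmpty() || line.startsWith("//")
                    || Character.isWhitespace(line.charAt(0))) {
                continue;
            }
            if (line.startsWith("#")) {
                name = line.substring(1).trim();   // regex name
            } else if (line.startsWith("%")) {
                type = line.substring(1).trim();   // annotation type
            } else {
                // Every other line is a regular expression.
                rules.add(new NamedPattern(name, type, Pattern.compile(line, flags)));
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        // Hypothetical file content; a real annotator would read it from the bound resource.
        List<String> sample = Arrays.asList(
                "// sample pattern file",
                "#zip",
                "%org.example.ZipCode",
                "\\d{5}",
                "#state",
                "[a-z]{2}");
        for (NamedPattern rule : parse(sample, false)) {
            System.out.println(rule.name + " " + rule.annotationType + " " + rule.pattern.pattern());
        }
    }
}
```

Inside a real annotator, the lines would presumably come from the stream that UIMA returns for each key listed in Filenames, with one parse call per file and its corresponding case-sensitivity flag.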
