My group would like to contribute the following UIMA component, the Common 
Feature Extractor (CFE), to the UIMA sandbox as open source, assuming there is 
interest from the community:

 

CFE enables configuration-driven extraction of feature values from UIMA 
annotations contained in a CAS. The extracted information can be used for 
statistical analysis, evaluation of performance metrics, regression testing, 
and machine-learning-related processing. 

 

CFE provides a flexible yet powerful language, FESL (Feature Extraction 
Specification Language), for working with the UIMA CAS to collect and classify 
the resulting data. FESL is a declarative XML-based language that expresses 
semantic rules for feature extraction. The rules guide the feature extraction 
in a completely generalized way, while CFE provides methods for subsequent 
processing that format the output of the extraction as needed for downstream 
use. The destination of the output (CAS, external file, database, etc.) is 
defined by the particular application in which CFE is used. CFE can be 
implemented as either a TAE or a CAS Consumer, depending on the needs of the 
particular application.

 

FESL rules provide a flexible and powerful way of defining multi-parameter 
criteria for the specific information to be extracted from the CAS. Such 
criteria can be customized by:

  1. the type of the UIMA annotation object that contains the feature of 
interest
  2. a surrounding (enclosing) annotation type and the relative location of the 
object within the enclosure, which limits the extraction to the boundaries of a 
certain UIMA type
  3. the "path" to the feature from the annotation object
  4. the type and value of the feature itself
  5. the values of any public Java get-style methods (methods that accept no 
parameters and return a value) implemented by the underlying class of the 
feature
  6. the location of the object or the feature on a specific path (for cases 
where annotations must be selected or bypassed when they are features of other 
UIMA annotation types)
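
To make item 5 more concrete, here is a minimal sketch of how get-style 
methods can be discovered and called with standard Java reflection. It is not 
CFE code: the class name GetStyleValues and the method collect are 
hypothetical, and the sketch assumes that "get-style" means a public, 
non-static method whose name starts with "get", takes no parameters, and 
returns a value.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.AbstractMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of CFE or UIMA.
final class GetStyleValues {
    private GetStyleValues() {}

    // Calls every public get-style method of the target and returns the
    // results keyed by method name.
    static Map<String, Object> collect(Object target) {
        Map<String, Object> values = new TreeMap<>();
        for (Method m : target.getClass().getMethods()) {
            // Assumption: name starts with "get", no parameters, non-void, non-static.
            boolean getStyle = m.getName().startsWith("get")
                    && m.getParameterCount() == 0
                    && m.getReturnType() != void.class
                    && !Modifier.isStatic(m.getModifiers());
            if (!getStyle) {
                continue;
            }
            try {
                values.put(m.getName(), m.invoke(target));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot call " + m.getName(), e);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Any object works; a map entry is used here only as a stand-in.
        System.out.println(collect(new AbstractMap.SimpleEntry<>("pos", "NN")));
    }
}
```

The returned map also contains getClass, since every Java object inherits it; 
a real extractor would presumably filter such methods out.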
 

The feature values can be evaluated by conditional expressions stated in FESL. 
In particular, feature values can be tested for whether they:

  1. are of a certain type
  2. belong to a specific set of values (vocabulary)
  3. belong to a range of numeric values (inclusively or non-inclusively)
  4. match certain bits of a bit mask (integer values only)
  5. match a Java regular expression pattern
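
For readers who think in code, the five checks above correspond roughly to the 
plain Java predicates below. This is only an illustration of the semantics, 
not FESL syntax and not CFE's implementation. The class and method names are 
hypothetical, and two points are assumptions: that "matching" a bit mask means 
all bits of the mask are set in the value, and that a regular expression must 
match the whole value rather than a substring.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of CFE or UIMA.
final class FeatureConditions {
    private FeatureConditions() {}

    // 1. The value is of a certain type.
    static boolean isOfType(Object value, Class<?> type) {
        return type.isInstance(value);
    }

    // 2. The value belongs to a vocabulary.
    static boolean inVocabulary(String value, Set<String> vocabulary) {
        return vocabulary.contains(value);
    }

    // 3. The value lies in a numeric range, with or without the end points.
    static boolean inRange(double value, double low, double high, boolean inclusive) {
        return inclusive
                ? value >= low && value <= high
                : value > low && value < high;
    }

    // 4. Assumption: every bit set in the mask is also set in the value.
    static boolean matchesBits(int value, int mask) {
        return (value & mask) == mask;
    }

    // 5. Assumption: the pattern must match the entire value.
    static boolean matchesPattern(String value, Pattern pattern) {
        return pattern.matcher(value).matches();
    }

    public static void main(String[] args) {
        Set<String> nounTags = new HashSet<>(Arrays.asList("NN", "NNS"));
        System.out.println(inVocabulary("NN", nounTags));
        System.out.println(inRange(10, 1, 10, false));
        System.out.println(matchesPattern("NNS", Pattern.compile("NN.*")));
    }
}
```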
 

These expressions can be specified in disjunctive normal form, which gives a 
powerful and flexible way of defining fairly complex criteria for extracting a 
required annotation and/or its value.
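
As a reminder of what disjunctive normal form means in practice, the sketch 
below evaluates an OR of AND-groups over a single value: the rule holds if at 
least one group has all of its conditions satisfied. It is a generic Java 
illustration, not FESL syntax or CFE code, and the names Dnf and evaluate are 
hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration, not part of CFE or UIMA.
final class Dnf {
    private Dnf() {}

    // Outer list: alternatives joined by OR.
    // Inner list: conditions joined by AND.
    static <T> boolean evaluate(List<List<Predicate<T>>> clauses, T value) {
        return clauses.stream()
                .anyMatch(clause -> clause.stream().allMatch(p -> p.test(value)));
    }

    public static void main(String[] args) {
        Predicate<Integer> positive = v -> v > 0;
        Predicate<Integer> belowTen = v -> v < 10;
        Predicate<Integer> isHundred = v -> v == 100;

        // (positive AND belowTen) OR (isHundred)
        List<List<Predicate<Integer>>> rule = Arrays.asList(
                Arrays.asList(positive, belowTen),
                Arrays.asList(isHundred));

        System.out.println(evaluate(rule, 5));
        System.out.println(evaluate(rule, 50));
        System.out.println(evaluate(rule, 100));
    }
}
```

Note that with this encoding an empty list of clauses never matches, while an 
empty clause always matches; a real rule language may treat these edge cases 
differently.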

 

FESL itself is defined in XSD format and integrated with EMF for syntax 
validation and automated code generation. 

 

CFE has been successfully used in several internal projects for evaluation of 
performance metrics and machine learning.

 

CFE is described in more detail in the paper "CFE - a system for testing, 
evaluation and machine learning of UIMA based applications", by I. Sominsky, 
A. Coden, and M. Tanenblatt, which will be presented at the UIMA for NLP 
workshop as part of the LREC 2008 conference in Marrakech, Morocco. 



Igor Sominsky

[EMAIL PROTECTED]
