The vocabulary can either be embedded in full in the configuration file or
referenced by a URI. Any UIMA annotation feature, or the result of any
get-style method (getCoveredText, for instance), can be tested for
membership in the list and, depending on the outcome, included in or
excluded from the extraction.
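To make the vocabulary check concrete, here is a minimal sketch in plain Java. It is not CFE code: the class VocabularyFilter and its methods are hypothetical names, matching is assumed to be exact (case-sensitive), and loading the list from a URI is left out.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper, not part of CFE: keeps or drops values depending on
// whether they occur in a vocabulary (for example, a list of last names).
final class VocabularyFilter {
    private final Set<String> vocabulary;
    private final boolean include; // true: keep members, false: keep non-members

    VocabularyFilter(Collection<String> entries, boolean include) {
        this.vocabulary = new HashSet<>(entries);
        this.include = include;
    }

    // Exact string match is an assumption made for this sketch.
    boolean accepts(String value) {
        return vocabulary.contains(value) == include;
    }

    List<String> filter(List<String> values) {
        return values.stream().filter(this::accepts).collect(Collectors.toList());
    }
}
```

In a real annotator the strings passed to accepts would be feature values or covered text taken from the annotations; here they are just plain strings.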
I am not sure I understand your second question correctly, but let me try
to answer it. CFE implements the extraction process in two steps. In the
first step, an annotation that represents a certain concept is located. It
can be a single-word annotation (uima.tt.TokenAnnotation, for instance) or a
custom annotation type that holds a group of words in its properties (in an
FSArray, for instance). In either case, your concept must be represented by
a single annotation. In the second step, annotations that lie in a certain
context (defined by the configuration file) of your concept annotation are
located. For example, the configuration file could specify that features be
extracted from the 5 annotations to the left of the annotation that
represents the concept (let's say a particular word). The annotations
located in the second step are the ones the features are extracted from. I
hope I got your question right.
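The two steps can be sketched roughly as follows. This is an illustration in plain Java rather than CFE or UIMA code: Ann is a stand-in for an annotation, the list is assumed to be in document order, and the names ContextWindow, locateConcept and leftContext are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ContextWindow {
    // Minimal stand-in for an annotation: a type name and its covered text.
    static final class Ann {
        final String type;
        final String text;

        Ann(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // Step 1: index of the first annotation of the given type whose covered
    // text equals the concept, or -1 if there is none.
    static int locateConcept(List<Ann> anns, String type, String conceptText) {
        for (int i = 0; i < anns.size(); i++) {
            Ann a = anns.get(i);
            if (a.type.equals(type) && a.text.equals(conceptText)) {
                return i;
            }
        }
        return -1;
    }

    // Step 2: up to `window` annotations immediately to the left of the
    // concept annotation. Features would be extracted from these.
    static List<Ann> leftContext(List<Ann> anns, int conceptIndex, int window) {
        if (conceptIndex < 0) {
            return Collections.emptyList();
        }
        int from = Math.max(0, conceptIndex - window);
        return new ArrayList<>(anns.subList(from, conceptIndex));
    }
}
```

With a window of 5, a concept found at index 6 yields the annotations at indices 1 through 5. A concept closer to the start of the document simply yields fewer.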
Igor
----- Original Message -----
From: "Thilo Goetz" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Thursday, May 15, 2008 1:07 PM
Subject: Re: proposal for a new testing and evaluation component
Cool, we absolutely need this! I was actually about to
write something like this myself, but now I think I can
wait a little longer :-)
I have quite a few questions on this, here are just some
of them:
Can you integrate external resources in the process? For
example, I might have a list of last names, and a feature
might be if a token occurs in that list or not.
I'd like to apply this to learning for individual words
or word windows. Is that possible with/supported by your
tool?
--Thilo
Igor Sominsky wrote:
My group would like to offer the following UIMA component, Common Feature
Extractor (CFE), as an open source offering into the UIMA sandbox,
assuming there is interest from the community:
CFE enables configuration-driven extraction of feature values from UIMA
annotations contained in a CAS. The extracted information can be used for
statistical analysis, evaluation of performance metrics, regression testing
and machine-learning-related processing. CFE provides a flexible yet
powerful language, FESL (Feature Extraction Specification Language), for
working with the UIMA CAS to enable the collection and classification of
the resulting data. FESL is a declarative XML-based language that expresses
semantic rules for feature extraction. The rules guide the feature
extraction in a completely generalized way, while CFE provides methods for
subsequent processing that format the output of the extraction as needed
for downstream use. The destination for the output (CAS, external file,
database, etc.) is defined by the particular application in which CFE is
used. CFE could be implemented as either a TAE or a CAS Consumer, depending
on the needs of the particular application.
FESL rules allow a flexible and powerful way of defining multi-parameter
criteria for the specific information to be extracted from the CAS. Such
criteria can be customized by:
1. the type of the UIMA annotation object that contains the feature of
interest
2. a surrounding (enclosing) annotation type and the relative location
of the object within the enclosure, which limits the extraction to the
boundaries of a certain UIMA type
3. the "path" to the feature from the annotation object
4. the type and value of the feature itself
5. the values of any public Java get-style methods (methods that accept no
parameters and return a value) implemented by the underlying class of the
feature
6. the location of the object or the feature on a specific path (in
cases where annotations must be selected or bypassed if they are
features of other UIMA annotation types)
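As an illustration of the get-style methods in item 5 and the idea of a path in item 3, the sketch below uses standard Java reflection to follow a chain of public, parameterless, value-returning methods. The class GetterPath and its dot-separated path syntax are assumptions made for this example. They are not FESL syntax or CFE code.

```java
import java.lang.reflect.Method;

// Hypothetical helper: follows a dot-separated chain of public,
// parameterless, value-returning methods starting from an object.
final class GetterPath {
    static Object evaluate(Object start, String path) {
        Object current = start;
        for (String name : path.split("\\.")) {
            if (current == null) {
                return null; // the chain ended early
            }
            try {
                // getMethod with no parameter types finds a public no-arg method.
                Method m = current.getClass().getMethod(name);
                if (m.getReturnType() == void.class) {
                    throw new IllegalArgumentException(name + " returns no value");
                }
                current = m.invoke(current);
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Cannot call " + name, e);
            }
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("hello", "length"));                 // 5
        System.out.println(evaluate("hello", "getClass.getSimpleName")); // String
    }
}
```

Applied to a UIMA annotation, the same mechanism could call a method such as getCoveredText. The example uses String only so that it runs without UIMA on the classpath.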
The feature values can be evaluated by conditional expressions stated in
FESL. In particular, the feature values can be tested for whether they:
1. are of a certain type
2. belong to a specific set of values (vocabulary)
3. belong to a range of numeric values (inclusive or non-inclusive)
4. match certain bits of a bit mask (integer values only)
5. match a Java regular expression pattern
These expressions can be specified in disjunctive normal form, which gives a
powerful and flexible way of defining fairly complex criteria for the
extraction of a required annotation and/or its value.
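The five kinds of test and their combination in disjunctive normal form (an OR of AND-clauses) can be sketched with java.util.function.Predicate. This shows the idea only and is not how FESL expresses or evaluates conditions. The class Conditions and all its method names are hypothetical, and the bit mask test assumes that "match" means every bit of the mask is set in the value.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Pattern;

final class Conditions {
    static Predicate<Object> ofType(Class<?> type) {
        return type::isInstance;
    }

    static Predicate<Object> inVocabulary(Set<String> vocabulary) {
        return v -> v instanceof String && vocabulary.contains(v);
    }

    static Predicate<Object> inRange(double low, double high, boolean inclusive) {
        return v -> {
            if (!(v instanceof Number)) {
                return false;
            }
            double d = ((Number) v).doubleValue();
            return inclusive ? (d >= low && d <= high) : (d > low && d < high);
        };
    }

    // Assumption: the value matches when all bits of the mask are set in it.
    static Predicate<Object> matchesBits(int mask) {
        return v -> v instanceof Integer && (((Integer) v) & mask) == mask;
    }

    static Predicate<Object> matchesRegex(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return v -> v instanceof CharSequence
                && pattern.matcher((CharSequence) v).matches();
    }

    // Disjunctive normal form: true if at least one clause has all of its
    // conditions satisfied.
    static Predicate<Object> dnf(List<List<Predicate<Object>>> clauses) {
        return v -> clauses.stream()
                .anyMatch(clause -> clause.stream().allMatch(c -> c.test(v)));
    }
}
```

For example, a rule of the form "(is a String AND matches [A-Z][a-z]+) OR (is an Integer AND lies in 1..10 inclusive)" becomes a list of two clauses, each holding two conditions.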
FESL itself is defined in XSD format and integrated with EMF for
syntax validation and automated code generation. CFE has been
successfully used in several internal projects for evaluation of
performance metrics and machine learning.
CFE is described in more detail in the paper "CFE - a system for
testing, evaluation and machine learning of UIMA based applications", by
I. Sominsky, A. Coden and M. Tanenblatt, which will be presented at the
UIMA for NLP workshop as part of the LREC 2008 conference in Marrakech,
Morocco.
Igor Sominsky
[EMAIL PROTECTED]