Igor Sominsky wrote:
The vocabulary can be either fully embedded in the configuration file
or referenced by a URI. Any UIMA annotation feature, or the result of a
get-style method (getCoveredText, for instance), can be tested for
membership in the list, and so be included in or excluded from
extraction.
I am not sure if I understand your second question correctly, but let me
try to answer it. CFE implements the extraction process in two steps. In
the first step, an annotation that represents a certain concept is
located. It can be a single-word annotation (uima.tt.TokenAnnotation, for
instance) or a custom-type annotation that holds a group of words
in its properties (in an FSArray, for instance). In either case, your
concept must be represented by a single annotation. In the second step,
annotations that lie in a certain context (defined by the configuration
file) of your concept annotation are located. For example, the
configuration file could specify that features be extracted from the 5
annotations to the left of the annotation that represents the concept
(let's say a particular word). The annotations located in the second
step are the ones the features are extracted from. I hope I got your
question right.
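To make the two steps concrete, here is a minimal sketch in plain Java. It is not CFE code and does not use the UIMA API: the Ann class, the "Token" type name, and all method names are hypothetical stand-ins, and the real tool drives this from the configuration file rather than from method arguments.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the two-step extraction. Ann is a hypothetical
// stand-in for a UIMA annotation; nothing here is CFE or UIMA API.
final class ContextWindowSketch {

    static final class Ann {
        final String type;
        final String text;
        final int begin;
        final int end;

        Ann(String type, String text, int begin, int end) {
            this.type = type;
            this.text = text;
            this.begin = begin;
            this.end = end;
        }
    }

    // Builds one "Token" annotation per word of single-space-separated text.
    static List<Ann> tokens(String text) {
        List<Ann> out = new ArrayList<>();
        int pos = 0;
        for (String w : text.split(" ")) {
            out.add(new Ann("Token", w, pos, pos + w.length()));
            pos += w.length() + 1; // skip the single space separator
        }
        return out;
    }

    // Step 1: locate the annotations that represent the concept
    // (here: a given type with a given covered text).
    static List<Integer> findConcepts(List<Ann> anns, String type, String text) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < anns.size(); i++) {
            Ann a = anns.get(i);
            if (a.type.equals(type) && a.text.equals(text)) {
                hits.add(i);
            }
        }
        return hits;
    }

    // Step 2: collect up to `width` annotations immediately to the left
    // of the concept annotation; features would be extracted from these.
    static List<Ann> leftContext(List<Ann> anns, int conceptIndex, int width) {
        int from = Math.max(0, conceptIndex - width);
        return new ArrayList<>(anns.subList(from, conceptIndex));
    }

    // Convenience: the covered text of each context annotation.
    static List<String> leftContextTexts(List<Ann> anns, int conceptIndex, int width) {
        List<String> out = new ArrayList<>();
        for (Ann a : leftContext(anns, conceptIndex, width)) {
            out.add(a.text);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Ann> anns = tokens("the quick brown fox jumps over the lazy dog");
        for (int i : findConcepts(anns, "Token", "lazy")) {
            System.out.println(leftContextTexts(anns, i, 5));
        }
    }
}
```

With "lazy" as the concept word, the context in this sketch is the five tokens before it: brown, fox, jumps, over, the.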
Perfectly, thank you. You answered the question that I was
trying to ask.
This would be our missing link to Apache Mahout. At the moment,
their input formats are still moving targets afaict. Once they've
settled down, we can generate input data for Mahout and use
their machine learning algorithms.
--Thilo
Igor
----- Original Message ----- From: "Thilo Goetz" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Thursday, May 15, 2008 1:07 PM
Subject: Re: proposal for a new testing and evaluation component
Cool, we absolutely need this! I was actually about to
write something like this myself, but now I think I can
wait a little longer :-)
I have quite a few questions on this, here are just some
of them:
Can you integrate external resources in the process? For
example, I might have a list of last names, and a feature
might be if a token occurs in that list or not.
I'd like to apply this to learning for individual words
or word windows. Is that possible with/supported by your
tool?
--Thilo
Igor Sominsky wrote:
My group would like to offer the following UIMA component, Common
Feature Extractor (CFE), as an open source offering into the UIMA
sandbox, assuming there is interest from the community:
CFE enables configuration-driven extraction of feature values from
UIMA annotations contained in a CAS. The extracted information can be
used for statistical analysis, performance metrics evaluation,
regression testing and machine learning related processing. CFE
provides a flexible yet powerful language, FESL (Feature Extraction
Specification Language), for working with the UIMA CAS to enable the
collection and classification of resultant data. FESL is a
declarative XML-based language that expresses semantic rules for the
feature extraction. The rules guide the feature extraction in a
completely generalized way, and CFE provides methods for subsequent
processing to format the output of the extraction as needed for
downstream use. The destination for the output is defined by the
particular application in which CFE is used (CAS, external file,
database, etc.). CFE could be implemented as either a TAE or a CAS
Consumer, depending on the needs of the particular application.
FESL rules provide a flexible and powerful way of defining
multi-parameter criteria for the specific information to be extracted
from the CAS. Such criteria can be customized by:
1. the type of the UIMA annotation object that contains the feature
of interest
2. a surrounding (enclosing) annotation type and the relative
location of the object within the enclosure, which limits the
extraction to the boundaries of a certain UIMA type
3. the "path" to the feature from the annotation object
4. the type and value of the feature itself
5. values of any public Java get-style methods (methods that
accept no parameters and return a value) implemented by the
underlying class of the feature
6. the location of the object or the feature on a specific path (in
cases when it is required to select/bypass annotations if they are
features of other UIMA annotation types)
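As an illustration of the get-style methods mentioned in the list, one way to read a value from a method named in a configuration is standard Java reflection. This is only a sketch of the idea, not how CFE is implemented: the class and method names below are made up, and java.lang.String stands in for an annotation class (with a UIMA annotation the method name might be getCoveredText).

```java
import java.lang.reflect.Method;

// Sketch only: GetterSketch and callGetter are hypothetical names,
// not part of CFE or UIMA.
final class GetterSketch {

    // Calls a public method that accepts no parameters and returns a value,
    // looked up by name on the target object's class.
    static Object callGetter(Object target, String methodName) {
        try {
            Method m = target.getClass().getMethod(methodName);
            if (m.getReturnType() == void.class) {
                throw new IllegalArgumentException(methodName + " returns no value");
            }
            return m.invoke(target);
        } catch (ReflectiveOperationException e) {
            // No such public no-argument method, or the call itself failed.
            throw new IllegalArgumentException("cannot call " + methodName, e);
        }
    }

    public static void main(String[] args) {
        // String stands in for an annotation class in this sketch.
        System.out.println(callGetter("hello", "length"));
        System.out.println(callGetter(" CFE ", "trim"));
    }
}
```

The returned value could then be fed into the same checks as an ordinary feature value.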
The feature values can be evaluated by conditional expressions
stated in FESL. In particular, the feature values can be tested for
whether they:
1. are of a certain type
2. belong to a specific set of values (vocabulary)
3. belong to a range of numeric values (inclusively or
non-inclusively)
4. match certain bits of a bit mask (integer values only)
5. match a Java regular expression pattern
These expressions can be specified in disjunctive normal form, which
gives a powerful and flexible way of defining fairly complex criteria
for the extraction of a required annotation and/or its value.
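The five kinds of checks and their combination in disjunctive normal form can be pictured as predicates in plain Java. This is a sketch under assumptions, not FESL syntax or CFE code: all names are hypothetical, and "match certain bits" is interpreted here as "all bits of the mask are set", which may differ from what FESL actually does.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Sketch only: hypothetical names, not FESL or CFE API.
final class DnfSketch {

    // 1. value is of a certain type
    static Predicate<Object> ofType(Class<?> c) {
        return c::isInstance;
    }

    // 2. value belongs to a vocabulary
    static Predicate<Object> inVocabulary(String... words) {
        Set<String> vocab = new HashSet<>(Arrays.asList(words));
        return v -> v instanceof String && vocab.contains(v);
    }

    // 3. value lies in a numeric range, inclusively or non-inclusively
    static Predicate<Object> inRange(double lo, double hi, boolean inclusive) {
        return v -> {
            if (!(v instanceof Number)) {
                return false;
            }
            double d = ((Number) v).doubleValue();
            return inclusive ? (d >= lo && d <= hi) : (d > lo && d < hi);
        };
    }

    // 4. integer value has all bits of the mask set (assumed interpretation)
    static Predicate<Object> hasBits(int mask) {
        return v -> v instanceof Integer && (((Integer) v) & mask) == mask;
    }

    // 5. string value matches a Java regular expression in full
    static Predicate<Object> matches(String regex) {
        Pattern p = Pattern.compile(regex);
        return v -> v instanceof String && p.matcher((String) v).matches();
    }

    // Disjunctive normal form: an OR over clauses, each clause an AND of checks.
    static Predicate<Object> dnf(List<List<Predicate<Object>>> clauses) {
        return v -> clauses.stream()
                .anyMatch(clause -> clause.stream().allMatch(p -> p.test(v)));
    }

    public static void main(String[] args) {
        // (is a String AND looks capitalized) OR (is an Integer AND in [1, 10])
        Predicate<Object> rule = dnf(Arrays.asList(
                Arrays.asList(ofType(String.class), matches("[A-Z][a-z]+")),
                Arrays.asList(ofType(Integer.class), inRange(1, 10, true))));
        System.out.println(rule.test("Smith"));
        System.out.println(rule.test(42));
    }
}
```

Each inner list is one conjunction; a value passes the rule if it satisfies every check in at least one inner list.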
FESL itself is defined in XSD format and integrated with EMF for
syntax validation and automated code generation. CFE has been
successfully used in several internal projects for evaluation of
performance metrics and machine learning.
CFE is described in more detail in the paper "CFE - a system for
testing, evaluation and machine learning of UIMA based applications",
by I. Sominsky, A. Coden, M. Tanenblatt, which will be presented at
the UIMA for NLP workshop as part of the LREC 2008 conference in
Marrakech, Morocco.
Igor Sominsky
[EMAIL PROTECTED]