Re: proposal for a new testing and evaluation component

Thilo Goetz Thu, 15 May 2008 11:09:32 -0700

Igor Sominsky wrote:

The vocabulary can be either fully embedded into the configuration file or referenced by a URI. Any UIMA annotation feature, or the result of a get-like method (getCoveredText, for instance), can be checked for membership in the list, so it can be included in or excluded from the extraction.
I am not sure if I understand your second question correctly, but let me try to answer it. CFE implements the extraction process in 2 steps. In the first step, an annotation that represents a certain concept is located. It can be a single-word annotation (uima.tt.TokenAnnotation, for instance) or a custom type annotation that contains the group of words in its properties (an FSArray, for instance). But in any case your concept must be represented by a single annotation. In the second step, annotations that are in a certain context (defined by the configuration file) of your concept annotation are located. For example, the configuration file could specify extracting features from the 5 annotations to the left of the annotation that represents the concept (let's say a particular word). The annotations located in the second step are the annotations the features are extracted from. I hope I got your question right.
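
The two steps described above can be sketched in a few lines of Python. This is only an illustration of the idea, not CFE or the UIMA API: the `Annotation` class, `tokenize`, `extract_left_context` and the `LAST_NAMES` vocabulary are all hypothetical names made up for the sketch, and the window of 5 annotations to the left comes from the example in the message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-in for a UIMA annotation: a typed span of text."""
    type_name: str
    begin: int
    end: int
    covered_text: str


def tokenize(text):
    """Split on whitespace and record offsets, one Annotation per token."""
    annotations = []
    pos = 0
    for word in text.split():
        begin = text.index(word, pos)
        end = begin + len(word)
        annotations.append(Annotation("Token", begin, end, word))
        pos = end
    return annotations


def extract_left_context(annotations, is_concept, window):
    """Step 1: locate each concept annotation.
    Step 2: collect up to `window` annotations to its left and read a
    feature from each of them (here simply the covered text)."""
    results = []
    for i, ann in enumerate(annotations):
        if is_concept(ann):
            context = annotations[max(0, i - window):i]
            results.append((ann.covered_text, [c.covered_text for c in context]))
    return results


# Made-up vocabulary, standing in for a list embedded in or referenced by
# the configuration file.
LAST_NAMES = {"Goetz", "Sominsky"}

tokens = tokenize("The proposal was sent by Igor Sominsky to the list")
rows = extract_left_context(tokens, lambda a: a.covered_text in LAST_NAMES, window=5)
print(rows)
```

In this sketch the concept is "a token that occurs in the last-name list", and the features are the covered texts of the tokens in the window. In CFE both choices would presumably come from the configuration file rather than from code.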


Perfectly, thank you.  You answered the question that I was
trying to ask.

This would be our missing link to Apache Mahout.  At the moment,
their input formats are still moving targets afaict.  Once they've
settled down, we can generate input data for Mahout and use
their machine learning algorithms.

--Thilo

Igor

----- Original Message -----
From: "Thilo Goetz" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Thursday, May 15, 2008 1:07 PM
Subject: Re: proposal for a new testing and evaluation component

Cool, we absolutely need this!  I was actually about to
write something like this myself, but now I think I can
wait a little longer :-)

I have quite a few questions on this, here are just some
of them:

Can you integrate external resources in the process?  For
example, I might have a list of last names, and a feature
might be if a token occurs in that list or not.

I'd like to apply this to learning for individual words
or word windows.  Is that possible with/supported by your
tool?

--Thilo

Igor Sominsky wrote:
My group would like to offer the following UIMA component, Common Feature Extractor (CFE), as an open source offering into the UIMA sandbox, assuming there is interest from the community:
CFE enables configuration-driven extraction of feature values from UIMA annotations contained in a CAS. The extracted information can be used for statistical analysis, performance metrics evaluation, regression testing and machine learning related processing. CFE provides a flexible, yet powerful language, FESL (Feature Extraction Specification Language), for working with the UIMA CAS to enable the collection and classification of resultant data. FESL is a declarative XML-based language that expresses semantic rules for the feature extraction. The rules guide the feature extraction in a completely generalized way, while CFE provides methods for subsequent processing to format the output of the extraction as needed for downstream use. The destination for the output is defined by the particular application where CFE is used (CAS, external file, database, etc.). CFE could be implemented as either a TAE or a CAS Consumer, depending on the needs of a particular application.
FESL rules allow a flexible and powerful way of defining multi-parameter criteria for the specific information to be extracted from the CAS. Such criteria can be customized by:
  1.. the type of the UIMA annotation object that contains the feature of interest
  2.. a surrounding (enclosing) annotation type and the relative location of the object within the enclosure, which limits the extraction to the boundaries of a certain UIMA type
  3.. the "path" to the feature from the annotation object
  4.. the type and value of the feature itself
  5.. values of any public Java get-style methods (methods that accept no parameters and return a value) implemented by the underlying class of the feature
  6.. the location of the object or the feature on a specific path (in cases when it is required to select/bypass annotations if they are features of other UIMA annotation types)
The feature values can be evaluated by conditional expressions stated in FESL. In particular, the feature values can be tested for whether they:
  1.. are of a certain type
  2.. belong to a specific set of values (vocabulary)
  3.. belong to a range of numeric values (inclusively or non-inclusively)
  4.. match certain bits of a bit mask (integer values only)
  5.. match a Java regular expression pattern
These expressions can be specified in disjunctive normal form, which gives a powerful and flexible way of defining fairly complex criteria for the extraction of a required annotation and/or its value.
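
To make the five kinds of condition and their combination in disjunctive normal form concrete, here is a minimal Python sketch. It is not FESL syntax and not CFE code: every function name below is hypothetical, the regular expressions use Python's `re` module rather than Java's, a whole-string match is assumed, and "match certain bits" is assumed to mean that every bit of the mask is set in the value.

```python
import re


def is_type(expected):
    """Condition 1: the value is of a certain type."""
    return lambda v: isinstance(v, expected)


def in_vocabulary(words):
    """Condition 2: the value belongs to a specific set of values."""
    vocab = frozenset(words)
    return lambda v: v in vocab


def in_range(low, high, inclusive=True):
    """Condition 3: the value lies in a numeric range."""
    if inclusive:
        return lambda v: low <= v <= high
    return lambda v: low < v < high


def matches_bits(mask):
    """Condition 4 (assumed meaning): all bits of `mask` are set in the value."""
    return lambda v: (v & mask) == mask


def matches_pattern(pattern):
    """Condition 5: the whole value matches a regular expression."""
    compiled = re.compile(pattern)
    return lambda v: compiled.fullmatch(v) is not None


def dnf(*conjunctions):
    """Disjunctive normal form: an OR over groups, each group an AND of
    predicates. Evaluation short-circuits, so a type check placed first in
    a group protects the predicates that follow it."""
    return lambda v: any(all(p(v) for p in conj) for conj in conjunctions)


accept = dnf(
    [is_type(str), in_vocabulary({"Goetz", "Sominsky"})],
    [is_type(str), matches_pattern(r"[A-Z][a-z]+sky")],
    [is_type(int), in_range(1, 10), matches_bits(0b10)],
)

for value in ("Goetz", "Kowalsky", "list", 6, 5, 12):
    print(value, accept(value))
```

Each inner list is one conjunction, and a value is accepted as soon as any conjunction holds in full. That OR-of-ANDs shape is what allows fairly complex criteria to be built from simple tests.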
FESL itself is defined in XSD format and integrated with EMF for syntax validation and automated code generation. CFE has been successfully used in several internal projects for evaluation of performance metrics and machine learning.
CFE is described in more detail in the paper "CFE - a system for testing, evaluation and machine learning of UIMA based applications", by I. Sominsky, A. Coden, M. Tanenblatt, which will be presented at the UIMA for NLP workshop as part of the LREC 2008 conference in Marrakech, Morocco.
Igor Sominsky
[EMAIL PROTECTED]
