Does this provide functionality similar to GATE's JAPE regular expression
language, i.e. could I use CFE to create new UIMA annotations as the result
of regular expressions over other UIMA annotations?

If not, does anything like this exist for UIMA right now or is anything in
the works?

Thanks,
Andrew Borthwick

On Thu, May 15, 2008 at 11:08 AM, Thilo Goetz <[EMAIL PROTECTED]> wrote:

> Igor Sominsky wrote:
>
>> The vocabulary can be either fully embedded in the configuration file or
>> referenced by a URI. Any UIMA annotation feature, or the result of a get-like
>> method (getCoveredText, for instance), can be tested for membership in
>> the list and then included in or excluded from extraction accordingly.
>>
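The vocabulary check described above could be sketched roughly as follows. This is only an illustration in Python, chosen for brevity; CFE itself is a Java/UIMA component driven by a configuration file, not by code like this. The `Token` class, the `load_vocabulary` and `in_vocabulary` helpers, the sample name list, and the case-insensitive comparison are all assumptions made for the example, not CFE or UIMA API.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for a UIMA token annotation."""
    begin: int
    end: int
    text: str  # plays the role of getCoveredText()


def load_vocabulary(lines):
    """Build a vocabulary set from an iterable of entries.

    Lower-casing is an assumption of this sketch, not documented CFE behaviour.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def in_vocabulary(token, vocabulary):
    """Return True if the token's covered text is a vocabulary entry."""
    return token.text.lower() in vocabulary


# A small embedded list; a real list might instead be read from a file or URI.
last_names = load_vocabulary(["Goetz", "Sominsky", "Borthwick"])
tokens = [Token(0, 5, "Thilo"), Token(6, 11, "Goetz")]

# One boolean feature per token: does it occur in the last-name list?
flags = [in_vocabulary(t, last_names) for t in tokens]
```
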
>> I am not sure if I understand your second question correctly, but let me
>> try to answer it. CFE implements the extraction process in 2 steps. In the
>> first step, an annotation that represents a certain concept is located. It
>> can be a single-word annotation (uima.tt.TokenAnnotation, for instance) or a
>> custom-type annotation that holds a group of words in its properties
>> (an FSArray, for instance). But in any case your concept must be represented by
>> a single annotation. In the second step, annotations that are in a certain
>> context (defined by a configuration file) of your concept annotation are
>> located. For example, the configuration file could specify extracting
>> features from the 5 annotations to the left of the annotation that represents
>> the concept (let's say a particular word). The annotations located in the
>> second step are the ones the features are extracted from. I
>> hope I got your question right.
>>
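The two steps described above (locate the concept annotation, then collect the annotations in its context) could be sketched like this. It is a simplified Python illustration, not CFE code: the `Annotation` class and `left_context` function are hypothetical, and in CFE the window would come from the configuration file rather than from a function argument.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical stand-in for a UIMA annotation."""
    type_name: str
    begin: int
    end: int
    text: str


def left_context(annotations, is_concept, window=5):
    """Pair each concept annotation with up to `window` annotations to its left."""
    ordered = sorted(annotations, key=lambda a: (a.begin, a.end))
    results = []
    for i, ann in enumerate(ordered):
        if is_concept(ann):                          # step 1: locate the concept
            start = max(0, i - window)
            results.append((ann, ordered[start:i]))  # step 2: its left context
    return results


# Build one token annotation per word, with character offsets.
words = "we can generate input data for Mahout".split()
annotations = []
offset = 0
for w in words:
    annotations.append(Annotation("Token", offset, offset + len(w), w))
    offset += len(w) + 1

# The concept here is a particular word; features would be read from its context.
hits = left_context(annotations, lambda a: a.text == "Mahout", window=5)
context_texts = [a.text for a in hits[0][1]]
```
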
>
> Perfectly, thank you.  You answered the question that I was
> trying to ask.
>
> This would be our missing link to Apache Mahout.  At the moment,
> their input formats are still moving targets afaict.  Once they've
> settled down, we can generate input data for Mahout and use
> their machine learning algorithms.
>
> --Thilo
>
>
>
>> Igor
>>
>> ----- Original Message ----- From: "Thilo Goetz" <[EMAIL PROTECTED]>
>> To: <[email protected]>
>> Sent: Thursday, May 15, 2008 1:07 PM
>> Subject: Re: proposal for a new testing and evaluation component
>>
>>
>>> Cool, we absolutely need this!  I was actually about to
>>> write something like this myself, but now I think I can
>>> wait a little longer :-)
>>>
>>> I have quite a few questions on this, here are just some
>>> of them:
>>>
>>> Can you integrate external resources in the process?  For
>>> example, I might have a list of last names, and a feature
>>> might be if a token occurs in that list or not.
>>>
>>> I'd like to apply this to learning for individual words
>>> or word windows.  Is that possible with/supported by your
>>> tool?
>>>
>>> --Thilo
>>>
>>> Igor Sominsky wrote:
>>>
>>>> My group would like to offer the following UIMA component, Common
>>>> Feature Extractor (CFE), as an open source offering into the UIMA sandbox,
>>>> assuming there is interest from the community:
>>>>
>>>>  CFE enables configuration-driven feature value extraction from UIMA
>>>> annotations contained in a CAS. The extracted information can be used for
>>>> statistical analysis, performance metrics evaluation, regression testing,
>>>> and machine-learning-related processing. CFE provides a flexible yet powerful
>>>> language, FESL (Feature Extraction Specification Language), for working with
>>>> the UIMA CAS to enable the collection and classification of resultant data.
>>>> FESL is a declarative XML-based language that expresses semantic rules for
>>>> the feature extraction. The rules guide the feature extraction in a
>>>> completely generalized way, and CFE provides methods for subsequent
>>>> processing to format the output of the extraction as needed for downstream
>>>> use. The destination for the output (CAS, external file, database, etc.) is
>>>> defined by the particular application where CFE is used. CFE could be
>>>> implemented as either a TAE or a CAS Consumer, depending on the needs of the
>>>> particular application.
>>>>
>>>>  FESL rules allow a flexible and powerful way of defining multi-parameter
>>>> criteria for the specific information to be extracted from the CAS. Such
>>>> criteria can be customized by:
>>>>
>>>>  1. the type of the UIMA annotation object that contains the feature of
>>>> interest
>>>>  2. a surrounding (enclosing) annotation type and the relative location
>>>> of the object within the enclosure, which limits the extraction to the
>>>> boundaries of a certain UIMA type
>>>>  3. the "path" to the feature from the annotation object
>>>>  4. the type and value of the feature itself
>>>>  5. values of any public Java get-style methods (methods that accept no
>>>> parameters and return a value) implemented by the underlying class of the
>>>> feature
>>>>  6. the location of the object or the feature on a specific path (in
>>>> cases when it is required to select/bypass annotations if they are features
>>>> of other UIMA annotation types)
>>>>
>>>>  The feature values can be evaluated by conditional expressions stated
>>>> in FESL. In particular, the feature values can be tested for whether they:
>>>>
>>>>  1. are of a certain type
>>>>  2. belong to a specific set of values (vocabulary)
>>>>  3. belong to a range of numeric values (inclusively or
>>>> non-inclusively)
>>>>  4. match certain bits of a bit mask (integer values only)
>>>>  5. match a Java regular expression pattern
>>>>
>>>>  These expressions can be specified in disjunctive normal form, which
>>>> gives a powerful and flexible way of defining fairly complex criteria for
>>>> the extraction of a required annotation and/or its value.
>>>>
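To make the five kinds of value checks and the disjunctive normal form mentioned above concrete, here is a minimal sketch. It is written in Python and does not reflect FESL syntax, which is XML and is not shown in this thread. Every function name below is hypothetical, and Python's `re` module stands in for Java regular expressions (a full match is assumed).

```python
import re


def of_type(t):
    """Condition: the value is of a certain type."""
    return lambda v: isinstance(v, t)


def in_set(values):
    """Condition: the value belongs to a specific set (vocabulary)."""
    vocab = frozenset(values)
    return lambda v: v in vocab


def in_range(lo, hi, inclusive=True):
    """Condition: the value lies in a numeric range, inclusively or not."""
    if inclusive:
        return lambda v: lo <= v <= hi
    return lambda v: lo < v < hi


def matches_mask(mask, expected):
    """Condition: the bits selected by `mask` equal `expected` (integers only)."""
    return lambda v: (v & mask) == expected


def matches_regex(pattern):
    """Condition: the whole value matches a regular expression."""
    compiled = re.compile(pattern)
    return lambda v: compiled.fullmatch(v) is not None


def evaluate_dnf(clauses, value):
    """DNF: OR over clauses, AND over the conditions inside each clause."""
    return any(all(cond(value) for cond in clause) for clause in clauses)


# (a capitalized string) OR (an even integer between 1 and 10)
rule = [
    [of_type(str), matches_regex(r"[A-Z][a-z]+")],
    [of_type(int), in_range(1, 10), matches_mask(0b1, 0b0)],
]
```

Putting the type check first in each clause keeps the later conditions from being applied to a value of the wrong type, because `all` stops at the first failing condition.
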
>>>>  FESL itself is defined in XSD format and integrated with EMF for
>>>> syntax validation and automated code generation. CFE has been successfully
>>>> used in several internal projects for evaluation of performance metrics and
>>>> machine learning.
>>>>
>>>>  CFE is described in more detail in the paper "CFE - a system for
>>>> testing, evaluation and machine learning of UIMA based applications", by I.
>>>> Sominsky, A. Coden, M. Tanenblatt, which will be presented at the UIMA for
>>>> NLP workshop as part of the LREC 2008 conference in Marrakech, Morocco.
>>>>
>>>> Igor Sominsky
>>>>
>>>> [EMAIL PROTECTED]
>>>>
>>>
>


-- 
Andrew Borthwick, Ph.D. | SPOCK Networks
Spock is Hiring!
www.spock.com/jobs
P.S. We pay a $5,000 referral fee for anyone we hire
