New features for the Apache UIMA Regular Expression Annotator

Nicolas Hernandez Fri, 29 Jul 2011 02:42:58 -0700

Hi everyone,

I tested the Apache UIMA Regular Expression Annotator to see how well
it can express recognition rules, using named entity recognition as
the test task.
Given that it only works on the characters of the text, I mainly ran
into two limitations. I'd like to know what you think of them, and
whether future versions of the annotator could address them.


Roughly speaking, my problems started when I tried to handle several
concepts and when my rules reached a high level of complexity.

First, since the regex variables are themselves regexes, I used them
as a dictionary of elements for my rules, e.g.

   <variable name="weekdays"
      value="Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"/>

The elements are also regexes, which has some advantages, e.g.

   <variable name="weekdays"
      value="[mM]onday|[tT]uesday|[wW]ednesday|[tT]hursday|[fF]riday|[sS]aturday|[sS]unday"/>

The major drawback appears when the dictionary has several hundred or
several thousand lexical entries: it becomes tedious to keep the
dictionary up to date, or even to open and edit the file.
It would be great if the variable values could also be defined in
external files (one entry per line).
This would also make it possible to define some variables once and
reuse them in as many distinct rule files as needed (which also makes
the rules easier to keep up to date).
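
Until something like this exists in the annotator, a small script run
before deployment could generate the variable from such a file. The
following is only a sketch of that idea, not a feature of the
annotator: the function names (load_entries, build_variable) and the
file layout (one entry per line, '#' for comments) are my own
assumptions.

```python
from pathlib import Path
from xml.sax.saxutils import quoteattr


def load_entries(path):
    """Read one entry per line, skipping blank lines and '#' comments."""
    entries = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.append(line)
    return entries


def build_variable(name, entries):
    """Render a <variable> element whose value is the alternation of all entries.

    Entries are left unescaped on purpose, so each one may itself be a regex.
    """
    # Longest entries first, so that a short entry such as "Sun" cannot
    # pre-empt a longer one such as "Sunday" in the alternation.
    ordered = sorted(set(entries), key=lambda e: (-len(e), e))
    value = "|".join(ordered)
    # quoteattr escapes XML special characters and adds the surrounding quotes.
    return "<variable name=%s value=%s/>" % (quoteattr(name), quoteattr(value))
```

The generated line would then be pasted (or templated) into each rule
file that needs it, so the dictionary itself lives in a single place.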

Second, it is possible to set a priority order between the rules of a
single concept, but not between concepts. In practice, distinct
concepts may have similar rules (e.g. a person entity and a location
entity), and you may want to set a priority between them so that the
ambiguity does not have to be handled outside the annotator.
Currently, the only way to avoid this situation is to define the
recognition rules for the person and the location entities in the same
concept, which is not conceptually acceptable.
Offering a way to set priorities between concepts raises the question
of how to do so when the concepts are defined in distinct files.
I agree that the ambiguity problem could instead be handled by
downstream annotators.
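
To make the intended behaviour concrete, here is a rough sketch of
concept-level priority applied as a post-processing step over the
matches. It does not use any UIMA API: the Match tuple, the resolve
function and the priority list are all hypothetical names chosen for
the illustration.

```python
from collections import namedtuple

# A match produced by some rule: concept name plus character offsets.
Match = namedtuple("Match", "concept begin end text")


def resolve(matches, priority):
    """Among overlapping matches, keep the one whose concept ranks first.

    `priority` lists concept names from highest to lowest priority.
    """
    rank = {concept: i for i, concept in enumerate(priority)}
    # Best concept first; ties broken by longer span, then by earlier start.
    ordered = sorted(
        matches,
        key=lambda m: (rank[m.concept], -(m.end - m.begin), m.begin),
    )
    kept = []
    for m in ordered:
        # Keep a match only if it overlaps none of the matches already kept.
        if all(m.end <= k.begin or m.begin >= k.end for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.begin)
```

With the priority list ["Person", "Location"], a span matched by both
concepts would be kept only as a Person, while a Location match
elsewhere in the text would be left untouched. A single shared
priority list like this is also one possible answer to the
distinct-files question, since it can live outside the concept files.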

Regards

/Nicolas
