Hi everyone,

I tested the Apache UIMA Regular Expression Annotator to see how well it can express recognition rules, using named entity recognition as the task. Granted that it only works at the level of text characters, I ran into two main limitations. I'd like to know what you think about them, and whether you think future versions of the annotator could address them.
Roughly speaking, my problems started when I tried to handle several concepts and when my rules reached a high level of complexity.

First, since regex variables are themselves regexes, I used them as a dictionary of elements for my rules (e.g. <variable name="weekdays" value="Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"/>). Because the elements are regexes too, this has some advantages (e.g. <variable name="weekdays" value="[mM]onday|[tT]uesday|[wW]ednesday|[tT]hursday|[fF]riday|[sS]aturday|[sS]unday"/>). The major drawback appears when the dictionary has several hundred or several thousand lexical entries: it becomes tedious to keep the dictionary up to date, or even just to handle and edit the file. It would be great if variable values could also be defined in external files (one entry per line). That would also make it possible to define a variable once and reuse it in as many distinct rule files as you want, which makes the rules easier to keep up to date as well.

Second, it is possible to set a priority order between the rules of a single concept, but not between concepts. In practice, distinct concepts may have similar rules (e.g. a person entity and a location entity), and you may wish to set a priority between them so that the ambiguity does not have to be handled outside the annotator. Currently, the only way to avoid this is to define the recognition rules for the person and location entities in the same concept, which is not conceptually acceptable. Offering a way to set priorities between concepts raises the question of how to do so when the concepts are defined in distinct files. I agree that the ambiguity problem could instead be handled in downstream annotators.

Regards
/Nicolas
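
P.S. To make the first suggestion concrete, here is a minimal sketch of the workaround I have in mind: a small script that reads a dictionary file (one entry per line) and generates the <variable .../> element to paste into a rule file. This is only an illustration and not part of the annotator. The function names (load_entries, build_alternation, variable_element) and the '#' comment convention are my own assumptions, and the escaping follows Python's regex syntax, which may not match Java's in every corner case.

```python
import re
from pathlib import Path
from xml.sax.saxutils import quoteattr


def load_entries(path):
    """Read one dictionary entry per line, skipping blank lines and '#' comments."""
    entries = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.append(line)
    return entries


def build_alternation(entries, literal=True):
    """Join the entries into a single regex alternation.

    With literal=True every entry is escaped so that it only matches itself;
    with literal=False the entries are kept as regex fragments.
    Duplicates are dropped and longer entries come first, so that a long
    entry is tried before a shorter one that is its prefix.
    """
    unique = sorted(set(entries), key=lambda e: (-len(e), e))
    parts = [re.escape(e) if literal else e for e in unique]
    return "|".join(parts)


def variable_element(name, entries, literal=True):
    """Render a <variable .../> element like the ones used in the rule file."""
    value = build_alternation(entries, literal)
    # quoteattr adds the surrounding quotes and escapes XML special characters.
    return f"<variable name={quoteattr(name)} value={quoteattr(value)}/>"
```

With a file weekdays.txt holding one weekday per line, variable_element("weekdays", load_entries("weekdays.txt")) would produce the element to copy into each rule file. Having this built into the annotator would of course remove the copy step.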
