Re: proposal for a new testing and evaluation component

Christian Mauceri Sat, 24 May 2008 01:20:33 -0700

Hi,

Indeed, any new, encapsulated tool in UIMA is interesting for the community. But what makes UIMA unique is not whether these tools exist; it is that UIMA makes their seamless integration within a single framework possible, no matter what languages they are written in. So I do not see why the existence of a regexp tool in UIMA should be a criterion in deciding whether to adopt UIMA. If people need a regexp tool in a UIMA application, they just have to pick one and write the thin layer of code to encapsulate it. You can use mine <http://code.google.com/p/digital-philology/source/>, available under the Apache License, but there are many others around.

Once again, the big advantages of UIMA are interoperability, language independence and theory neutrality. You can, for instance, sketch a module in Perl, integrate it into your processing pipeline and, when you are happy with it, seamlessly rewrite it in another language.

Finally, UIMA can deal with objects other than text, images for instance, and for this very reason it must be kept at a high level; it is up to the community to provide modules. You can, for instance, use GATE <http://gate.ac.uk/sale/tao/index.html#x1-38600016> in UIMA, but you can also integrate OpenNLP: here <http://uima.lti.cs.cmu.edu:8080/UCR/pages/static/osnlp/OpenNLPReadme.html> are example UIMA wrappers for the OpenNLP tools.
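To give an idea of how thin the wrapping layer around a regexp tool can be, here is a minimal sketch in plain Java. It deliberately does not use the UIMA API: Span and RegexWrapper are hypothetical names, with Span standing in for a UIMA annotation type and annotate() for the work an annotator would do on the document text. In a real annotator the spans would be added to the CAS instead of being returned in a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a UIMA annotation: a typed span over the text.
final class Span {
    final String type;
    final int begin;
    final int end;

    Span(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

// Hypothetical wrapper: runs one regular expression over the document text
// and reports every match as a Span of the configured type.
final class RegexWrapper {
    private final String type;
    private final Pattern pattern;

    RegexWrapper(String type, String regex) {
        this.type = type;
        this.pattern = Pattern.compile(regex);
    }

    List<Span> annotate(String text) {
        List<Span> spans = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            // Character offsets of the match become the annotation offsets.
            spans.add(new Span(type, m.start(), m.end()));
        }
        return spans;
    }
}
```

For example, new RegexWrapper("Money", "\\$\\d+").annotate(text) would yield one "Money" span per dollar amount in the text.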


Regards.

Andrew Borthwick wrote:

Frank,

Your ANTLR-based approach sounds interesting.  I'd like to see the paper and
I'd be interested in seeing some demo code too.

This whole area of needing JAPE-like functionality for UIMA is a critical
issue, as far as I'm concerned.  The lack of support for writing regexes
over annotations was one of two key reasons that my company decided to go
with GATE over UIMA a year ago (the other was that GATE has the ability to
read in an html document and convert the html into GATE annotations, which
is a key feature for working with web documents).

Although we like UIMA's infrastructure and array of ML toolkits, we would
need to see some kind of solid regex functionality before we could consider
starting to develop in UIMA.

Regards,
Andrew Borthwick

On Wed, May 21, 2008 at 9:51 AM, Frank Schilder <[EMAIL PROTECTED]> wrote:

If not, does anything like this exist for UIMA right now or is anything
in the works?

I know of several proprietary ones, but nothing open source.  It
would be nice to have something like Jape in UIMA.

Well, I wrote an annotator that uses JAPE.

We have been using ANTLR (www.antlr.org) for writing grammars that detect,
for example, temporal and monetary expressions. The integration of an ANTLR
lexer and parser into UIMA was fairly straightforward. We based our
integration on a posting that explains the interfacing of StAX with ANTLR
http://www.antlr.org/wiki/display/ANTLR3/Interfacing+StAX+to+ANTLR

ANTLR grammars are written in EBNF and can be compiled into different
programming languages (e.g. Java, C, C#). The ANTLR grammar can also
contain Java code, if you want to manipulate other objects (e.g. adding
annotations to the CAS) while parsing the input.

You can write an ANTLR grammar, add Java code to it and compile everything
into a Java class. This Java class can then be used by your AE in UIMA.

We experimented with lexers and parsers in ANTLR:

1) a lexer in ANTLR can be set to be a scanner that scans an input string
for expressions defined within EBNF
2) a parser expects a stream of ANTLR tokens. A stream of ANTLR tokens can
be constructed from UIMA annotations (see integration of StAX events into
ANTLR). Such a grammar can detect more complex structures consisting of
basic (UIMA) annotations.
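The idea in point 2, matching patterns over a stream of annotations rather than over characters, can be sketched without ANTLR. The example below is not the ANTLR integration described here; it swaps in a simpler technique: each annotation type is encoded as one character, and java.util.regex is run over the encoded string. Ann, AnnotationPattern and the symbol map are hypothetical names, and Ann is a stand-in for a UIMA annotation, not the real API. The annotations are assumed to be sorted by begin offset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a UIMA annotation: a typed span over the text.
final class Ann {
    final String type;
    final int begin;
    final int end;

    Ann(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

final class AnnotationPattern {
    // Encode each annotation as one character, so that position i in the
    // encoded string corresponds to annotation i in the list.
    static String encode(List<Ann> anns, Map<String, Character> symbols) {
        StringBuilder sb = new StringBuilder();
        for (Ann a : anns) {
            Character c = symbols.get(a.type);
            sb.append(c != null ? c.charValue() : '_'); // '_' marks unmapped types
        }
        return sb.toString();
    }

    // Run a regex over the encoded sequence; each non-empty match becomes a
    // new annotation covering the first to the last matched annotation.
    static List<Ann> match(List<Ann> anns, Map<String, Character> symbols,
                           String regex, String resultType) {
        List<Ann> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(encode(anns, symbols));
        while (m.find()) {
            if (m.end() == m.start()) {
                continue; // an empty match covers no annotation
            }
            Ann first = anns.get(m.start());
            Ann last = anns.get(m.end() - 1);
            out.add(new Ann(resultType, first.begin, last.end));
        }
        return out;
    }
}
```

With the symbols Number -> 'N' and Currency -> 'C', the pattern "NC" would mark every Number annotation immediately followed by a Currency annotation as a single, larger annotation. A parser generated from a grammar can express nested structure that a flat regex like this cannot.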

The grammar formalism used by ANTLR is LL(*), which is more flexible than
LL(k). We found the grammars we wrote are much faster than the JAPE
grammars we also used within UIMA. You're more constrained by the LL(*)
formalism in writing rules, but ANTLRworks is a useful GUI development
environment that alerts you to ambiguous rules.
http://www.antlr.org/works/index.html

BTW: This work will also be discussed as part of our paper at the LREC UIMA
workshop next week.
http://watchtower.coling.uni-jena.de/~coling/uimaws_lrec2008/

Frank

There are some limits:
- it's impossible to create (in JAPE) an annotation that references
another annotation, which is easy to do in UIMA (pseudo code):
Lemma lemma = new Lemma(cas);
Token token = new Token(cas);
token.setLemma(lemma);
- the annotator is packaged as a PEAR that includes ALL the GATE jars...
- if the annotator is deployed in a web context, only the precompiled
grammars work: I think it's a class loading problem: the PEAR
is loaded by one class loader, while the UIMA framework is deployed inside
a web context that is under another class loader... and so on...
- performance: the reverse mapping from GATE to UIMA is slow: updating
the existing annotations means scanning all the annotations in the CAS and
checking each feature to see whether it has changed (well, if the grammar
doesn't update anything, the updates could be skipped)

I would like to open-source the annotator, but at the moment I don't have
permission to do that.

But the best option would be to have a JAPE clone, or something better,
that uses UIMA directly.
I want to take a look at the BSFAnnotator to see whether it could be
useful.

cheers,
Roberto

--
Roberto Franchini
CELI s.r.l. (http://www.celi.it) - C.so Moncalieri 21 - 10131 Torino - ITALY

Tel +39-011-6600814 - Fax +39-011-6600687
jabber: [EMAIL PROTECTED] - skype: ro.franchini
