Coming late to the conversation... just offering some Lucene perspective.
On Dec 4, 2008, at 1:36 PM, Niels Ott wrote:
What Lucene cannot do - or at least not without a lot of hacking - is
aggregate analyses the way UIMA can using the CAS. Usually your
knowledge grows during a UIMA-based NLP pipeline: you add a token
annotation, a lemma annotation, a POS annotation, and so on. In Lucene,
you have the classical pipeline: the output replaces the input. (Yes,
by subclassing Lucene's "Token" class, one can work around the issue,
but it is not elegant at all.)
You might find the TeeTokenFilter and SinkTokenizer interesting for
mapping/aggregating tokens/extractions out to other fields in Lucene.
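To make the tee/sink idea concrete, here is a minimal plain-Java sketch of the pattern - not the actual Lucene TeeTokenFilter/SinkTokenizer classes, whose API differs - where a filter passes tokens through unchanged while teeing copies of selected ones into a sink that could later feed a second field (the all-uppercase heuristic is just a stand-in for a real extraction rule):

```java
import java.util.ArrayList;
import java.util.List;

// A sink collects tokens teed off the main stream so they can be
// replayed into another field later.
class Sink {
    final List<String> tokens = new ArrayList<>();
}

// A tee filter passes every token downstream, but also copies the
// "interesting" ones into the sink. Here, any all-uppercase token
// (a crude stand-in for, say, a named-entity extraction) is teed.
class TeeFilter {
    private final Sink sink;

    TeeFilter(Sink sink) {
        this.sink = sink;
    }

    String next(String token) {
        if (token.equals(token.toUpperCase())) {
            sink.tokens.add(token);
        }
        return token; // the main stream is unchanged
    }
}
```

Running "NASA", "launched", "Apollo" through the filter leaves the main stream intact while the sink ends up holding just "NASA", ready to be indexed into a separate entities field.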
Also, Lucene is getting more flexible in terms of indexing and
searching. You can attach payloads (i.e., byte arrays) to terms, which
provides some crude annotation storage, and https://issues.apache.org/jira/browse/LUCENE-1422
and a couple of other issues are the start of more flexibility for
adding attributes that can then be indexed. We're still working on the
search side of it, but I think you will see more in the way of
flexible indexing in the coming months, which should be a nice win for
UIMA + Lucene users.
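As a sketch of the "crude annotation storage" idea, here is one way a POS tag could be packed into a one-byte payload per term and decoded again at search time. The tagset and encoding are made up for illustration; the real Lucene payload API is not shown here:

```java
// Hypothetical per-term payload codec: a POS tag is stored as a
// single byte indexing into a small fixed tagset, keeping the
// payload compact. This only illustrates the encoding idea, not
// Lucene's actual payload classes.
class PosPayload {
    static final String[] TAGS = {"NN", "VB", "JJ", "DT"};

    // Encode a tag as a one-byte payload (its index in TAGS).
    static byte[] encode(String tag) {
        for (int i = 0; i < TAGS.length; i++) {
            if (TAGS[i].equals(tag)) {
                return new byte[]{(byte) i};
            }
        }
        throw new IllegalArgumentException("unknown tag: " + tag);
    }

    // Decode the payload back into the tag string.
    static String decode(byte[] payload) {
        return TAGS[payload[0]];
    }
}
```

At query time, a payload-aware scorer could then read those bytes back and, for example, boost matches whose term carries a particular tag.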
What makes Lucene + UIMA interesting for me is a simple fact: I can do
all the NLP I want and be as flexible as I need in UIMA. Then I can
feed the outcome (or rather: a small part of it) into a Lucene index.
In my particular case, I'm not using a CAS Consumer, but I can imagine
other people would appreciate it in their application scenarios.
To conclude: Lucene and UIMA aren't competitors; in some cases,
having one feed the other is exactly what you want.
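The feeding step above can be sketched in a few lines: the NLP side produces many annotations, but only a chosen subset becomes fields of the index document. The Annotation class, field names, and KEEP list are all hypothetical, standing in for real UIMA types and a real Lucene Document:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical annotation: a type (e.g. "lemma") plus covered text.
// Not a real UIMA type; just enough to show the hand-off.
class Annotation {
    final String type;
    final String text;

    Annotation(String type, String text) {
        this.type = type;
        this.text = text;
    }
}

// Carry only a small part of the analysis into the index: keep the
// listed annotation types and fold each into one field of a
// field-name -> field-value document map.
class IndexFeeder {
    static final List<String> KEEP = List.of("lemma", "entity");

    static Map<String, String> toDocument(List<Annotation> annotations) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (Annotation a : annotations) {
            if (KEEP.contains(a.type)) {
                // Concatenate values of the same type into one field.
                doc.merge(a.type, a.text, (old, v) -> old + " " + v);
            }
        }
        return doc;
    }
}
```

Token-level annotations are dropped here on purpose: the index only sees what the application actually wants to search on.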
Couldn't agree more!
Cheers,
Grant