Coming late to the conversation... Just offering some Lucene perspective

On Dec 4, 2008, at 1:36 PM, Niels Ott wrote:

What Lucene cannot do - or at least not without a lot of hacking - is
aggregate analyses the way UIMA can using the CAS. Usually your knowledge
grows during a UIMA-based NLP pipeline: you add a token annotation,
a lemma annotation, a POS annotation and so on...  In Lucene, you have
the classical pipeline: the output replaces the input. (Yes, by
subclassing Lucene's "Token" class, one can work around the issue, but
it is not elegant at all.)


You might find the TeeTokenFilter and SinkTokenizer interesting for mapping/aggregating tokens/extractions out to other fields in Lucene.
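
To make the tee/sink idea concrete, here is a rough plain-Java sketch of the pattern. It deliberately does not use the Lucene API: TeeSinkSketch, TeeIterator and drain are hypothetical names made up for illustration, and the "capitalized token" rule is just a stand-in for whatever extraction you care about. The main token stream passes through unchanged while selected tokens are copied into a sink that could later feed a second field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Plain-Java illustration of the tee/sink pattern.
// None of these types are Lucene classes; all names are hypothetical.
public class TeeSinkSketch {

    /** Passes every token through unchanged and copies matching tokens into a sink. */
    static final class TeeIterator implements Iterator<String> {
        private final Iterator<String> input;
        private final Predicate<String> keep;
        private final List<String> sink;

        TeeIterator(Iterator<String> input, Predicate<String> keep, List<String> sink) {
            this.input = input;
            this.keep = keep;
            this.sink = sink;
        }

        @Override
        public boolean hasNext() {
            return input.hasNext();
        }

        @Override
        public String next() {
            String token = input.next();
            if (keep.test(token)) {
                sink.add(token); // side channel: the main stream is not altered
            }
            return token;
        }
    }

    /** Consumes the stream and returns everything it produced, in order. */
    static List<String> drain(Iterator<String> tokens) {
        List<String> out = new ArrayList<>();
        while (tokens.hasNext()) {
            out.add(tokens.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Grant", "replied", "to", "Niels", "about", "Lucene");
        List<String> names = new ArrayList<>(); // would feed a second, hypothetical "names" field

        Iterator<String> tee = new TeeIterator(
                tokens.iterator(),
                t -> Character.isUpperCase(t.charAt(0)),
                names);

        List<String> body = drain(tee); // would feed the main "body" field
        System.out.println("body  = " + body);
        System.out.println("names = " + names);
    }
}
```

The point of the design is that the text is analyzed once, and the sink is filled as a side effect of consuming the main stream, so the second field costs no second pass.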

Also, Lucene is getting more flexible in terms of indexing and searching. You can attach payloads (i.e. byte arrays) to terms, which can provide some crude annotation storage. https://issues.apache.org/jira/browse/LUCENE-1422 and a couple of other issues are the start of more flexibility to add attributes that can then be indexed. We're still working on the search side of it, but I think you will see more in the way of flexible indexing in the coming months, which should be a nice win for UIMA + Lucene users.
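
As a rough sketch of what "crude annotation storage" in a byte array could look like, the plain-Java snippet below packs a POS tag and a confidence score into a payload-sized byte array and reads them back. The layout (4 bytes of float followed by a UTF-8 tag) and the names PayloadSketch, encode, decodeTag and decodeConfidence are assumptions for illustration only; this is not a format Lucene defines, and attaching the bytes to a term is left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical byte layout for a per-term annotation:
// [4 bytes: confidence as a big-endian float][remaining bytes: POS tag in UTF-8]
public class PayloadSketch {

    /** Packs a POS tag and a confidence score into one byte array. */
    static byte[] encode(String posTag, float confidence) {
        byte[] tag = posTag.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + tag.length)
                .putFloat(confidence)
                .put(tag)
                .array();
    }

    /** Reads the POS tag back out of the byte array. */
    static String decodeTag(byte[] payload) {
        return new String(payload, 4, payload.length - 4, StandardCharsets.UTF_8);
    }

    /** Reads the confidence score back out of the byte array. */
    static float decodeConfidence(byte[] payload) {
        return ByteBuffer.wrap(payload).getFloat();
    }

    public static void main(String[] args) {
        byte[] payload = encode("NN", 0.75f);
        System.out.println(payload.length + " bytes, tag=" + decodeTag(payload)
                + ", confidence=" + decodeConfidence(payload));
    }
}
```

Anything richer than a tag and a score quickly needs a real schema, which is why this is only "crude" storage compared to what the CAS offers.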



What makes Lucene + UIMA interesting for me is a simple fact: I can do
all the NLP I want and be as flexible as I need in UIMA. Then I can feed
the outcome (or rather: a small part of it) into a Lucene index.

In my special case, I'm not using a CAS Consumer, but I can imagine
other people would appreciate it in their application scenarios.

To conclude: Lucene and UIMA aren't competitors, but in some cases having one feeding the other is what you want.

Couldn't agree more!

Cheers,
Grant
