Re: Lucene cas consumer

Niels Ott Thu, 04 Dec 2008 10:43:59 -0800

Hi all,

I'm using both Lucene and UIMA in one project.


Lucene is primarily an information retrieval API. It provides a
framework and default implementations for analyzing several languages.
Analyzing means tokenization, stop-word removal, etc. Furthermore, it brings
the key functionality to build an inverted index and to search it.

Lucene can be extended easily. For example, one can implement an analyzer
that does lemmatization or that looks up synonyms in WordNet and adds them
to the index.

What Lucene cannot do - or at least not without a lot of hacking - is
aggregate analyses the way UIMA can using the CAS. Usually your knowledge
grows during a UIMA-based NLP pipeline: you add a token annotation,
a lemma annotation, a POS annotation and so on... In Lucene, you have
the classical pipeline: the output replaces the input. (Yes, by
subclassing Lucene's "Token" class, one can work around the issue, but
it is not elegant at all.)
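
To make the difference concrete, here is a minimal sketch in plain Python.
It uses neither the Lucene nor the UIMA API; the function names, the dict
layout standing in for a CAS, and the crude "strip a trailing s" lemmatizer
are all assumptions made up for illustration.

```python
# Toy illustration only: nothing here is real Lucene or UIMA API.

def replacing_pipeline(text):
    """Classical chain: the output of each stage replaces its input."""
    tokens = text.split()                      # tokenizer
    tokens = [t.lower() for t in tokens]       # lowercasing: original case is lost
    tokens = [t.rstrip("s") for t in tokens]   # crude stand-in for a lemmatizer
    return tokens


def accumulating_pipeline(text):
    """CAS-like analysis: each stage adds a layer next to the earlier ones."""
    cas = {"text": text, "annotations": []}    # hypothetical stand-in for a CAS
    offset = 0
    for word in text.split():
        begin = text.index(word, offset)       # character offsets into the text
        end = begin + len(word)
        offset = end
        cas["annotations"].append({"type": "Token", "begin": begin, "end": end})
        cas["annotations"].append(
            {"type": "Lemma", "begin": begin, "end": end,
             "value": word.lower().rstrip("s")}
        )
    return cas
```

In the first function only the final token list survives; in the second,
the original text, the token spans and the lemmas are all still available
to later stages.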

What makes Lucene + UIMA interesting for me is a simple fact: I can do
all the NLP I want and be as flexible as I need in UIMA. Then I can feed
the outcome (or rather: a small part of it) into a Lucene index.
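
As a rough sketch of that idea (again plain Python, not the Lucene or UIMA
API): suppose each analyzed document is a list of annotation dicts, and
only the lemma layer is pushed into a toy inverted index. The function
name, the annotation layout and the sample data are assumptions for
illustration only.

```python
from collections import defaultdict


def index_lemmas(documents):
    """Build a toy inverted index from the Lemma annotations only.

    `documents` maps a document id to a list of annotation dicts
    (a hypothetical stand-in for the contents of one CAS per document).
    """
    index = defaultdict(set)
    for doc_id, annotations in documents.items():
        for ann in annotations:
            if ann["type"] == "Lemma":         # all other layers are not indexed
                index[ann["value"]].add(doc_id)
    return index


# Made-up sample data: two analyzed documents.
docs = {
    "d1": [{"type": "Lemma", "value": "dog"}, {"type": "POS", "value": "NN"}],
    "d2": [{"type": "Lemma", "value": "dog"}, {"type": "Lemma", "value": "cat"}],
}
index = index_lemmas(docs)
```

Looking up `index["dog"]` then yields the ids of both documents, while the
POS annotation never reaches the index.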

In my special case, I'm not using a CAS Consumer, but I can imagine
other people would appreciate it in their application scenarios.

To conclude: Lucene and UIMA aren't competitors, but in some cases
having one feeding the other is what you want.


Best,

   Niels


Greg Holmberg wrote:

Roberto--

It does seem like there should be a close relationship between the
two groups.

I don't know much about Lucene--can you educate me?  For example,
have you given any thought to what to do with UIMA annotations?  From
what little I've read about Lucene, they seem to have a thing called
a document analyzer, but they don't mean the same thing we mean by
analysis in the NLP community.  They appear to mean something more
like "tokenizer".  So I haven't yet found a place to put UIMA
annotations, say for example, named entities or parts of speech.  I'm
wondering if Lucene needs a major feature enhancement before it's
truly useful with UIMA?

What are your thoughts on how to integrate the two?  What
functionality is possible?

Greg Holmberg


-------------- Original message ---------------------- From: "Roberto
Franchini" <[EMAIL PROTECTED]>

Hi, I'm going to write a Lucene CAS consumer. The purpose is to
create a Lucene document, or more than one, for each CAS. Last year
(2007) the Jena University lab (JULIE Lab? is that right?) delivered
such a component, named LUCAS. Then it disappeared. LUCAS seems a
good piece of software. The Technische Universität Darmstadt
developed one too: http://www.ukp.tu-darmstadt.de/projects/dkpro/.
(I will write to them).

Is anybody interested in sharing knowledge and/or code to build
that component? I think that Lucene and UIMA can be very good
friends :)

Roberto

PS: I apologize for my bad English.

--
Roberto Franchini
http://www.celi.it http://www.blogmeter.it http://www.memesphere.it
Tel +39-011-6600814 jabber:[EMAIL PROTECTED] skype:ro.franchini



--
Niels Ott - Computational Linguist (B.A.) - http://www.drni.de/niels/
          - My PGP key is available from your favorite key server.

Those who sit in a glass house should always have Sidolin at hand!
