Re: UIMA and Lucene

Thilo Goetz Thu, 30 Nov 2006 05:57:10 -0800

James Montgomery wrote:

Hello all,


I'm working on a project with an engineering firm to develop a search tool
that can find relevant engineering documents and also provide information
about relationships between documents (for instance, they mention the same
part). We are currently leaning most strongly towards a combination of
Lucene for search and UIMA for document analysis.


Good ;-)

I see on the Incubator
Wiki (http://wiki.apache.org/incubator/UimaProposal) that better
integration or communication between these two products is being considered.

Here are some questions about this and UIMA:

- Would others recommend the use of Lucene to search analysis results
produced by UIMA components?

It depends on what your requirements are. Lucene is certainly a good
choice for a search engine. You may also want to consider Solr
(http://incubator.apache.org/solr/), which uses Lucene internally. One
constraint is that Lucene, like most text search engines, does not
support span search. What I mean by that is that you cannot, for
example, index the internal structure of an XML document. So suppose
you have a UIMA analysis pipeline that discovers book descriptions, and
inside those book descriptions, the author, title, ISBN or what have you
of that book. Then you might want to pose queries like: show me all
instances of books where the author is "Smith" and the title contains
the word "Lucene". There is no obvious way to model this kind of search
in Lucene.
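To make the book example concrete, here is a minimal sketch in plain Python. It is a toy model only and uses neither the UIMA nor the Lucene API; the Annotation class, the helper names, and the sample offsets are all hypothetical. It shows why a document-level match can report a hit that a query scoped to a single "book" span would reject.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-in for an analysis result: a typed text span."""
    type: str   # e.g. "book", "author", "title"
    begin: int  # start offset in the document
    end: int    # end offset in the document
    text: str   # covered text


def covers(outer, inner):
    """True if inner lies entirely within outer."""
    return outer.begin <= inner.begin and inner.end <= outer.end


def books_matching(annotations, author, title_word):
    """Span-scoped query: author and title word must sit in the SAME book."""
    hits = []
    for book in (a for a in annotations if a.type == "book"):
        inside = [a for a in annotations
                  if a.type != "book" and covers(book, a)]
        has_author = any(a.type == "author" and author.lower() in a.text.lower()
                         for a in inside)
        has_title = any(a.type == "title"
                        and title_word.lower() in a.text.lower().split()
                        for a in inside)
        if has_author and has_title:
            hits.append(book)
    return hits


def document_level_match(annotations, author, title_word):
    """Document-scoped query: ignores which book each annotation belongs to."""
    has_author = any(a.type == "author" and author.lower() in a.text.lower()
                     for a in annotations)
    has_title = any(a.type == "title"
                    and title_word.lower() in a.text.lower().split()
                    for a in annotations)
    return has_author and has_title


# One document describing two books (offsets are made up for illustration).
ANNOTATIONS = [
    Annotation("book", 0, 40, "John Smith: Cooking Basics"),
    Annotation("author", 0, 10, "John Smith"),
    Annotation("title", 12, 40, "Cooking Basics"),
    Annotation("book", 50, 100, "Amy Jones: Lucene in Action"),
    Annotation("author", 50, 60, "Amy Jones"),
    Annotation("title", 62, 100, "Lucene in Action"),
]

# No single book has author "Smith" AND a title containing "Lucene" ...
span_hits = books_matching(ANNOTATIONS, "Smith", "Lucene")
# ... but the document as a whole contains both, so this is a false positive.
doc_hit = document_level_match(ANNOTATIONS, "Smith", "Lucene")
```

Here span_hits is empty while doc_hit is true, which is the mismatch described above: without span information the search engine can only answer at the level of the whole document.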

What Lucene does support are fields. Fields are global to the entire
document. So if your application does not really require span support
and you can model your UIMA data as fields, Lucene is a good choice.
For example, if your application can discover product names, you can
create a "product" field in Lucene and for each document index the
product names you found under that field. This will allow you to search
specifically for documents containing product names.
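The field idea can be sketched with a toy in-memory inverted index in plain Python. This is not Lucene's API; the FieldIndex class, its methods, and the sample documents are hypothetical and only illustrate how a per-field lookup separates product-name hits from ordinary full-text hits.

```python
from collections import defaultdict


class FieldIndex:
    """Toy inverted index keyed by (field, term); fields apply to whole documents."""

    def __init__(self):
        self._postings = defaultdict(set)  # (field, term) -> set of doc ids

    def add(self, doc_id, field, value):
        """Index every whitespace-separated term of value under the given field."""
        for term in value.lower().split():
            self._postings[(field, term)].add(doc_id)

    def search(self, field, term):
        """Return the sorted ids of documents with term in that field."""
        return sorted(self._postings.get((field, term.lower()), set()))


index = FieldIndex()

# Full text goes into a "contents" field for every document.
index.add("doc1", "contents", "The Acme pump failed during testing")
index.add("doc2", "contents", "We reached the acme of the design phase")

# Only doc1 had "Acme" recognized as a product name by the (assumed) analysis step.
index.add("doc1", "product", "Acme")

product_hits = index.search("product", "acme")    # documents naming the product
fulltext_hits = index.search("contents", "acme")  # any document with the word
```

Searching the "product" field returns only doc1, while the plain text search returns both documents. The result is still a list of whole documents, which is the limitation of fields compared with spans.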

- What other search engines and search engine SDKs would others recommend,
perhaps as being better suited to integration with UIMA?

There is a search engine that comes with the pre-Apache UIMA SDK you can
download from IBM: http://www.alphaworks.ibm.com/tech/uima
It supports span search, and there are plans to make a version available
that works with Apache UIMA in the future. I'm not exactly sure what
the license conditions are; does anyone else know?

- Although UIMA has only just entered the Apache Incubator, how soon might
efforts be made to provide an interface between Lucene and UIMA?

Most of the current UIMA developers will be concentrating on getting the
first release on Apache out the door, hopefully early next year. After
that is done, we hope to have time to look at the Lucene integration.
On the other hand, you are not the only one interested in such an
integration, and there's always the possibility that somebody else will
step up and do it.

  - Should this question be directed to the developer list?


No, the user's list is fine.  All developers read the user's list.

- What sites would others recommend for open source UIMA analysis components
for different document formats?

There is a general UIMA component repository here:
http://uima.lti.cs.cmu.edu/

I'm not sure what you mean by "for different document formats". Formats
such as HTML or PDF? I'm not sure anybody has done any open source
document format parsing for UIMA yet. It should not be too difficult to
wrap existing technology, such as http://www.pdfbox.org/, for use in UIMA.


HTH,
Thilo
