James Montgomery wrote:
Hello all,

I'm working on a project with an engineering firm to develop a search tool
that can find relevant engineering documents and also provide information
about relationships between documents (for instance, they mention the same
part). We are currently leaning most strongly towards a combination of
Lucene for search and UIMA for document analysis.

Good ;-)

I see on the Incubator
Wiki (http://wiki.apache.org/incubator/UimaProposal) that better
integration or communication between these two products is being considered.
Here are some questions about this and UIMA:

- Would others recommend the use of Lucene to search analysis results
produced by UIMA components?

It depends on what your requirements are. Lucene is certainly a good choice for a search engine. You may also want to consider Solr (http://incubator.apache.org/solr/), which uses Lucene internally.

One constraint is that Lucene, like most text search engines, does not support span search. What I mean by that is that you cannot, for example, index the internal structure of an XML document. Suppose you have a UIMA analysis pipeline that discovers book descriptions and, inside those book descriptions, the author, title, ISBN or what have you of each book. Then you might want to post queries like: show me all instances of books where the author is "Smith" and the title contains the word "Lucene". There is no obvious way to model this kind of search in Lucene.
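To make the book example concrete, here is a minimal Python sketch of what a span-aware query has to check. It is not UIMA or Lucene code; the Annotation class, the annotate helper, and the sample text are all made up for illustration. The point is only that the author and title have to be found inside the same book span, not merely somewhere in the same document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A typed span over the document text (hypothetical, for illustration only)."""
    kind: str
    begin: int
    end: int


def annotate(text, kind, phrase):
    """Build an annotation covering the first occurrence of phrase in text."""
    begin = text.index(phrase)
    return Annotation(kind, begin, begin + len(phrase))


def covered(text, ann):
    """Return the text covered by an annotation."""
    return text[ann.begin:ann.end]


def contained_in(inner, outer):
    """True if the inner span lies entirely within the outer span."""
    return outer.begin <= inner.begin and inner.end <= outer.end


def find_books(text, annotations, author, title_word):
    """Return Book spans that contain both a matching Author and a matching Title."""
    hits = []
    for book in (a for a in annotations if a.kind == "Book"):
        parts = [a for a in annotations
                 if a.kind != "Book" and contained_in(a, book)]
        has_author = any(a.kind == "Author" and covered(text, a) == author
                         for a in parts)
        has_title = any(a.kind == "Title" and title_word in covered(text, a).split()
                        for a in parts)
        if has_author and has_title:
            hits.append(book)
    return hits


# Made-up sample document describing two books.
text = "Lucene in Action by Jones. Cooking Basics by Smith."
annotations = [
    annotate(text, "Book", "Lucene in Action by Jones"),
    annotate(text, "Title", "Lucene in Action"),
    annotate(text, "Author", "Jones"),
    annotate(text, "Book", "Cooking Basics by Smith"),
    annotate(text, "Title", "Cooking Basics"),
    annotate(text, "Author", "Smith"),
]

smith_lucene = find_books(text, annotations, "Smith", "Lucene")
jones_lucene = find_books(text, annotations, "Jones", "Lucene")
```

In this sample, "Smith" and "Lucene" both occur in the document but in different books, so the span-aware query finds nothing for that pair, while a query that only looks at the document as a whole would report a match.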

What Lucene does support are fields. Fields are global to the entire document. So if your application does not really require span support and you can model your UIMA data as fields, Lucene is a good choice. For example, if your application can discover product names, you can create a "product" field in Lucene and for each document index the product names you found under that field. This will allow you to search specifically for documents that mention a given product name.
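The field model can be sketched in a few lines of Python. This is a toy in-memory index, not the Lucene API; the FieldIndex class, the document ids, and the product names are hypothetical. It only illustrates the idea that values are indexed under a named field per document and then searched by field.

```python
from collections import defaultdict


class FieldIndex:
    """Toy document-level field index (illustration only, not Lucene)."""

    def __init__(self):
        # Maps (field name, lowercased value) to the set of matching document ids.
        self._postings = defaultdict(set)

    def add(self, doc_id, field, values):
        """Index each value under the given field for one document."""
        for value in values:
            self._postings[(field, value.lower())].add(doc_id)

    def search(self, field, value):
        """Return the sorted ids of documents that have this value in this field."""
        return sorted(self._postings.get((field, value.lower()), set()))


index = FieldIndex()
# Pretend an analysis step found these product names in each document.
index.add("doc1", "product", ["Valve X200", "Pump P7"])
index.add("doc2", "product", ["Pump P7"])
index.add("doc3", "product", [])

pump_docs = index.search("product", "pump p7")
valve_docs = index.search("product", "Valve X200")
```

Because the field belongs to the whole document, a hit tells you that the document mentions the product, not where in the document it does so.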

- What other search engines and search engine SDKs would others recommend,
perhaps as being better suited to integration with UIMA?

There is a search engine that comes with the pre-Apache UIMA SDK, which you can download from IBM: http://www.alphaworks.ibm.com/tech/uima. It supports span search, and there are plans to make a version available that works with Apache UIMA in the future. I'm not exactly sure what the license conditions are. Does anyone else know?

- Although UIMA has only just entered the Apache Incubator, how soon might
efforts be made to provide an interface between Lucene and UIMA?

Most of the current UIMA developers will be concentrating on getting the first release on Apache out the door, hopefully early next year. After that is done, we hope to have time to look at the Lucene integration. On the other hand, you are not the only one interested in such an integration, and there's always the possibility that somebody else will step up and do it.

  - Should this question be directed to the developer list?

No, the user's list is fine.  All developers read the user's list.

- What sites would others recommend for open source UIMA analysis components
for different document formats?

There is a general UIMA component repository here: http://uima.lti.cs.cmu.edu/

I'm not sure what you mean by "for different document formats". Formats such as HTML or PDF? I'm not sure anybody has done any open source document format parsing for UIMA yet. It should not be too difficult to wrap existing technology, such as http://www.pdfbox.org/, for use in UIMA.

HTH,
Thilo

