Hi all,

I'm involved in a project to develop a search tool with an engineering bent for a collaborative client of my university. (FYI, the collaborative "team" is very small: me and an engineer from the company.) We're going to use UIMA to analyse documents and Solr/Lucene to store the results for later searching.

Rather than reinvent the wheel, I'd like to use some existing crawler implementation(s) to feed my CollectionReaders (besides, it's the analysis that interests me, not so much the development work). I think I may need three different crawlers (or one very flexible one) to cover the three different areas where documents will be found:

- the intranet-based web
- network-attached storage, home drives, etc.
- emails (in particular their attachments) stored on an Exchange server
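For context, the handoff I have in mind is roughly this: a crawler drops fetched documents into a folder, and a reader iterates over them with the same hasNext()/getNext() contract that a UIMA CollectionReader implements. A minimal self-contained sketch (class and folder names here are just illustrative, and a real CollectionReader would populate a CAS rather than return a String):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.stream.Stream;

// Hypothetical sketch: iterate over documents a crawler has written into
// a drop folder, mirroring the hasNext()/getNext() style of a UIMA
// CollectionReader without depending on the UIMA jars.
public class DropFolderReader implements Iterator<String>, AutoCloseable {
    private final Stream<Path> files;
    private final Iterator<Path> it;

    public DropFolderReader(Path dropFolder) throws IOException {
        // Walk the folder once; each regular file is one "document".
        this.files = Files.walk(dropFolder).filter(Files::isRegularFile);
        this.it = files.iterator();
    }

    @Override
    public boolean hasNext() {
        return it.hasNext();
    }

    // In real UIMA code this would be getNext(CAS) and would set the
    // document text on the CAS; here we just return the raw content.
    @Override
    public String next() {
        try {
            return Files.readString(it.next());
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void close() {
        files.close();
    }

    public static void main(String[] args) throws IOException {
        // Simulate a crawler output folder with one fetched page.
        Path tmp = Files.createTempDirectory("crawl");
        Files.writeString(tmp.resolve("doc1.txt"), "hello intranet");
        try (DropFolderReader reader = new DropFolderReader(tmp)) {
            while (reader.hasNext()) {
                System.out.println(reader.next());
            }
        }
    }
}
```

The point being that any crawler able to write its results to disk (or expose them via an iterator like this) should plug into my pipeline fairly cheaply.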
Preferably I'd like to minimise the amount of work required to incorporate them into the search tool. I've looked at things like Nutch (http://lucene.apache.org/nutch/), but it appears to be too heavily web-oriented; I welcome being corrected on that point, though :-)

What have others used in similar situations? What might others recommend even if they haven't used them yet?

Thanks,
James.
