Hi James,
I wouldn't hold my breath for useful open source crawlers. In particular,
I think it's unlikely that you'll find an Exchange crawler (and please let
me know if you do ;-). If this is for research purposes, you may be able
to get a free academic license for one of the commercial products. Whether
it's easy to hook them up to Solr is yet another question.
--Thilo
James Montgomery wrote:
Hi all,
I'm involved in a project to develop a search tool with an engineering
bent for a collaborative client of my university. (FYI, the collaborative
"team" is very small: me and an engineer from the company.) We're going to
use UIMA to analyse documents and Solr/Lucene to store the results for later
searching. Rather than reinvent the wheel, I'd like to use some existing
crawler implementation(s) to feed my CollectionReaders (also, it's the
analysis that interests me, not so much the development work; a rough sketch
of the kind of reader I mean follows the list below). I think I may need
three different crawlers (or one very flexible one) to cover the three
different areas where documents will be found:
- intranet-based web
- network attached storage, home drives, etc.
- emails (in particular their attachments) stored on an Exchange server
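To make "feed my CollectionReaders" concrete, here is a rough sketch
(placeholder code, not something we've built) of the kind of reader the
crawlers would supply. It just walks a directory that a crawler has already
populated and drops each file's text into a CAS; the class name, the
"crawlOutputDir" parameter and the UTF-8 plain-text assumption are all made
up for illustration, and real documents (Office files, mail attachments)
would need text extraction first, e.g. with Apache Tika.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.collection.CollectionReader_ImplBase;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.util.Progress;
import org.apache.uima.util.ProgressImpl;

// Placeholder reader: hands each file from a crawler's output directory to UIMA.
public class CrawlOutputReader extends CollectionReader_ImplBase {

  private Deque<File> files;  // documents the crawler left behind
  private int processed;
  private int total;

  public void initialize() throws ResourceInitializationException {
    // "crawlOutputDir" is a made-up configuration parameter that would be
    // declared in the reader's descriptor and point at the crawler's output.
    File dir = new File((String) getConfigParameterValue("crawlOutputDir"));
    files = new ArrayDeque<File>();
    File[] listed = dir.listFiles();
    if (listed != null) {
      for (File f : listed) {
        if (f.isFile()) {
          files.add(f);
        }
      }
    }
    total = files.size();
  }

  public boolean hasNext() {
    return !files.isEmpty();
  }

  public void getNext(CAS aCAS) throws IOException, CollectionException {
    File f = files.pop();
    // Sketch assumption: UTF-8 plain text. Binary formats (PDF, Office docs,
    // mail attachments) would need text extraction before this point.
    String text = new String(Files.readAllBytes(f.toPath()), "UTF-8");
    aCAS.setDocumentText(text);
    processed++;
  }

  public Progress[] getProgress() {
    return new Progress[] { new ProgressImpl(processed, total, Progress.ENTITIES) };
  }

  public void close() throws IOException {
    // nothing to release in this sketch
  }
}

The point being that any crawler whose output I can enumerate as files (or
records of some kind) should be easy to plug in at this seam.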
Ideally I'd like to minimise the amount of work required to incorporate
them into the search tool. I've looked at things like Nutch (
http://lucene.apache.org/nutch/) but it appears to be too heavily
web-oriented; I welcome being corrected on that point though :-)
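For completeness, here's how I currently picture the Solr end of the
pipeline: a CAS consumer that pushes the analysed text to Solr, sketched
with a recent SolrJ HttpSolrClient. Again, this is only an illustration
under assumptions: the class name, the "solrUrl" parameter and the field
names are placeholders, and a real consumer would index the UIMA
annotations and source metadata rather than just the raw text.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CasConsumer_ImplBase;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.resource.ResourceProcessException;

import java.util.UUID;

// Placeholder consumer: sends each analysed document's text to a Solr core.
public class SolrCasConsumer extends CasConsumer_ImplBase {

  private SolrClient solr;

  public void initialize() throws ResourceInitializationException {
    // "solrUrl" is a made-up descriptor parameter,
    // e.g. http://localhost:8983/solr/engineering-docs
    String url = (String) getConfigParameterValue("solrUrl");
    solr = new HttpSolrClient.Builder(url).build();
  }

  public void processCas(CAS aCAS) throws ResourceProcessException {
    SolrInputDocument doc = new SolrInputDocument();
    // A real consumer would take the id and other fields from source metadata
    // (URL, file path, message id) carried in the CAS, plus the annotations
    // produced by the analysis engines; this just indexes the raw text.
    doc.addField("id", UUID.randomUUID().toString());
    doc.addField("text", aCAS.getDocumentText());
    try {
      solr.add(doc);
    } catch (Exception e) {
      throw new ResourceProcessException(e);
    }
  }

  public void destroy() {
    try {
      solr.commit();
      solr.close();
    } catch (Exception e) {
      // best effort on shutdown
    }
    super.destroy();
  }
}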
What have others used in similar situations?
What might others recommend even if they haven't used them yet?
Thanks,
James.