Hi James,
I wouldn't hold my breath for useful open source crawlers. In particular,
I think it's unlikely that you'll find an Exchange crawler (and please let
me know if you do ;-). If this is for research purposes, you may be able
to get a free academic license for one of the commercial products. Whether
it's easy to hook them up to Solr is yet another question.
--Thilo
James Montgomery wrote:
Hi all,
I'm involved in a project to develop a search tool with an engineering
bent for a collaborative client of my university. (FYI, the collaborative
"team" is very small: me and an engineer from the company.) We're going to
use UIMA to analyse documents and Solr/Lucene to store the results for later
searching. Rather than reinvent the wheel, I'd like to use some existing
crawler implementation(s) to feed my CollectionReaders (also, it's the
analysis that interests me, not so much the development work; a rough sketch
of the kind of reader I mean follows the list below). I think I may need
three different crawlers (or one very flexible one) to cover the three
different areas where documents will be found:
- intranet-based web
- network attached storage, home drives, etc.
- emails (in particular their attachments) stored on an Exchange server
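To make "feed my CollectionReaders" concrete, here is a rough sketch
(placeholder code, not something we've built) of the kind of reader the
crawlers would supply. It just walks a directory that a crawler has already
populated and drops each file's text into a CAS; the class name, the
"crawlOutputDir" parameter and the UTF-8 plain-text assumption are all made
up for illustration, and real documents (Office files, mail attachments)
would need text extraction first, e.g. with Apache Tika.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayDeque;
import java.util.Deque;

import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CollectionException;
import org.apache.uima.collection.CollectionReader_ImplBase;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.util.Progress;
import org.apache.uima.util.ProgressImpl;

// Placeholder reader: hands each file from a crawler's output directory to UIMA.
public class CrawlOutputReader extends CollectionReader_ImplBase {

  private Deque<File> files;  // documents the crawler left behind
  private int processed;
  private int total;

  public void initialize() throws ResourceInitializationException {
    // "crawlOutputDir" is a made-up configuration parameter that would be
    // declared in the reader's descriptor and point at the crawler's output.
    File dir = new File((String) getConfigParameterValue("crawlOutputDir"));
    files = new ArrayDeque<File>();
    File[] listed = dir.listFiles();
    if (listed != null) {
      for (File f : listed) {
        if (f.isFile()) {
          files.add(f);
        }
      }
    }
    total = files.size();
  }

  public boolean hasNext() {
    return !files.isEmpty();
  }

  public void getNext(CAS aCAS) throws IOException, CollectionException {
    File f = files.pop();
    // Sketch assumption: UTF-8 plain text. Binary formats (PDF, Office docs,
    // mail attachments) would need text extraction before this point.
    String text = new String(Files.readAllBytes(f.toPath()), "UTF-8");
    aCAS.setDocumentText(text);
    processed++;
  }

  public Progress[] getProgress() {
    return new Progress[] { new ProgressImpl(processed, total, Progress.ENTITIES) };
  }

  public void close() throws IOException {
    // nothing to release in this sketch
  }
}

The point being that any crawler whose output I can enumerate as files (or
records of some kind) should be easy to plug in at this seam.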
Ideally I'd like to minimise the amount of work required to incorporate
them into the search tool. I've looked at things like Nutch (
http://lucene.apache.org/nutch/) but it appears to be too heavily
web-oriented; I welcome being corrected on that point though :-)
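For completeness, here's how I currently picture the Solr end of the
pipeline: a CAS consumer that pushes the analysed text to Solr, sketched
with a recent SolrJ HttpSolrClient. Again, this is only an illustration
under assumptions: the class name, the "solrUrl" parameter and the field
names are placeholders, and a real consumer would index the UIMA
annotations and source metadata rather than just the raw text.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CasConsumer_ImplBase;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.resource.ResourceProcessException;

import java.util.UUID;

// Placeholder consumer: sends each analysed document's text to a Solr core.
public class SolrCasConsumer extends CasConsumer_ImplBase {

  private SolrClient solr;

  public void initialize() throws ResourceInitializationException {
    // "solrUrl" is a made-up descriptor parameter,
    // e.g. http://localhost:8983/solr/engineering-docs
    String url = (String) getConfigParameterValue("solrUrl");
    solr = new HttpSolrClient.Builder(url).build();
  }

  public void processCas(CAS aCAS) throws ResourceProcessException {
    SolrInputDocument doc = new SolrInputDocument();
    // A real consumer would take the id and other fields from source metadata
    // (URL, file path, message id) carried in the CAS, plus the annotations
    // produced by the analysis engines; this just indexes the raw text.
    doc.addField("id", UUID.randomUUID().toString());
    doc.addField("text", aCAS.getDocumentText());
    try {
      solr.add(doc);
    } catch (Exception e) {
      throw new ResourceProcessException(e);
    }
  }

  public void destroy() {
    try {
      solr.commit();
      solr.close();
    } catch (Exception e) {
      // best effort on shutdown
    }
    super.destroy();
  }
}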
What have others used in similar situations?
What might others recommend even if they haven't used them yet?
Thanks,
James.