Hi all,

I'm involved in a project to develop a search tool with an engineering bent for a collaborative client of my university. (FYI, the collaborative "team" is very small: me and an engineer from the company.) We're going to use UIMA to analyse documents and Solr/Lucene to store the results for later searching.

Rather than reinvent the wheel, I'd like to use some existing crawler implementation(s) to feed my CollectionReaders (besides, it's the analysis that interests me, not so much the development work). I think I may need three different crawlers (or one very flexible one) to cover the three different areas where documents will be found:

- the intranet-based web
- network-attached storage, home drives, etc.
- emails (in particular their attachments) stored on an Exchange server
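For context, the handoff I have in mind is roughly this: a crawler drops fetched documents into a folder, and a reader iterates over them with the same hasNext()/getNext() contract that a UIMA CollectionReader implements. A minimal self-contained sketch (class and folder names here are just illustrative, and a real CollectionReader would populate a CAS rather than return a String):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.stream.Stream;

// Hypothetical sketch: iterate over documents a crawler has written into
// a drop folder, mirroring the hasNext()/getNext() style of a UIMA
// CollectionReader without depending on the UIMA jars.
public class DropFolderReader implements Iterator<String>, AutoCloseable {
    private final Stream<Path> files;
    private final Iterator<Path> it;

    public DropFolderReader(Path dropFolder) throws IOException {
        // Walk the folder once; each regular file is one "document".
        this.files = Files.walk(dropFolder).filter(Files::isRegularFile);
        this.it = files.iterator();
    }

    @Override
    public boolean hasNext() {
        return it.hasNext();
    }

    // In real UIMA code this would be getNext(CAS) and would set the
    // document text on the CAS; here we just return the raw content.
    @Override
    public String next() {
        try {
            return Files.readString(it.next());
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void close() {
        files.close();
    }

    public static void main(String[] args) throws IOException {
        // Simulate a crawler output folder with one fetched page.
        Path tmp = Files.createTempDirectory("crawl");
        Files.writeString(tmp.resolve("doc1.txt"), "hello intranet");
        try (DropFolderReader reader = new DropFolderReader(tmp)) {
            while (reader.hasNext()) {
                System.out.println(reader.next());
            }
        }
    }
}
```

The point being that any crawler able to write its results to disk (or expose them via an iterator like this) should plug into my pipeline fairly cheaply.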
Preferably I'd like to minimise the amount of work required to incorporate them into the search tool. I've looked at things like Nutch (http://lucene.apache.org/nutch/), but it appears to be too heavily web-oriented; I welcome being corrected on that point, though :-)

What have others used in similar situations? What might others recommend even if they haven't used them yet?

Thanks,
James.
