First: please pardon the cross-post to solr-user, for reference. I hope to continue this thread on solr-dev, so please reply there.

1) more documentation (and possibly some locking configuration options) on
how you can use Solr to access an index generated by the Nutch crawler (I
think Thorsten has already done this) or by Compass, or any other system
that builds a Lucene index.

Thorsten Scherler? Is this code available anywhere? It sounds very interesting to me. Maybe someone could elaborate on the differences between the indexes created by Nutch/Solr/Compass/etc., or point me toward an answer?

2) "contrib" code that runs as it's own process to crawl documents and
send them to a Solr server. (mybe it parses them, or maybe it relies on
the next item...)

Do you know FAST? It uses a step-by-step approach (a "pipeline") in which all of these tasks are performed, and much of it is tuned through an easy web tool.

The point I'm trying to make is that contrib code is nice, but a "complete package" with these possibilities could broaden Solr's appeal somewhat.

3) Stock "update" plugins that can each read a raw inputstreams of a some
widely used file format (PDF, RDF, HTML, XML of any schema) and have
configuration options telling them them what fields in the schema each
part of their document type should go in.

Exactly, this sounds more like it. But if similar inputstreams can be handled by Nutch, what's the point in using Solr at all? The HTTP APIs? In other words, both Nutch and Solr seem to have functionality that enterprises would want, but neither gives you the "total solution".

Don't get me wrong: I don't want to bloat the products, even though it would be nice to have a crossover solution that is easy to set up.

The architecture could look something like this:

Connector -> Parser -> DocProc -> (via schema) -> Index

Possible connectors: JDBC, filesystem, crawler, manual feed
Possible parsers: PDF, whatever

Connectors, parsers AND document processors would all be plugins. The DocProcs would typically be adjusted to each enterprise's needs, so that they fit with its schema.xml.
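To make the idea concrete, the plugin architecture above could be sketched roughly like this in Java. All of the interface and class names here (Connector, Parser, DocProc, PipelineSketch) are hypothetical illustrations of the proposed design, not existing Solr or Nutch APIs:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plugin points for the Connector -> Parser -> DocProc -> Index chain.
interface Connector {
    List<byte[]> fetch();                      // pull raw content (JDBC, filesystem, crawler, manual feed)
}

interface Parser {
    String parse(byte[] raw);                  // extract text from one format (PDF, HTML, ...)
}

interface DocProc {
    Map<String, String> process(String text);  // map extracted text onto schema.xml fields
}

public class PipelineSketch {
    // Run every fetched item through the parser and document processor,
    // producing field maps that could then be posted to Solr's update handler.
    static List<Map<String, String>> run(Connector c, Parser p, DocProc d) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (byte[] raw : c.fetch()) {
            docs.add(d.process(p.parse(raw)));
        }
        return docs;
    }

    public static void main(String[] args) {
        // A "manual feed" connector with a trivial parser and DocProc, for illustration.
        Connector feed = () -> List.of("Hello Solr".getBytes());
        Parser identity = raw -> new String(raw);
        DocProc toSchema = text -> {
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("id", "doc-1");
            fields.put("text", text);          // would correspond to a field in schema.xml
            return fields;
        };
        System.out.println(run(feed, identity, toSchema));
    }
}
```

The point of the sketch is that the pipeline itself stays generic; an enterprise would only swap in its own DocProc to match its schema.xml.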

Problem is: I haven't worked enough with Solr, Nutch, Lucene, etc. to really know all the possibilities and limitations. But I do believe that the outlined architecture would be flexible and would meet many needs. So the question is:

What is Solr missing? Could parts of Nutch be used in Solr to achieve this? How? Have I misunderstood completely? :)

Eivind
