On 9/27/06, Ken Krugler <[EMAIL PROTECTED]> wrote:
I've spent an afternoon looking at how we could use Solr as our
distributed searchers for Nutch. Currently the Nutch search serving
code isn't getting much love, so somehow leveraging Solr would seem
like a win.
The three attributes of Solr that are most interesting to me in this
context are:
1. Live update support.
2. More complex query processing.
3. Caching (though not as critical)
Things I can live with that I noticed being described as issues on
the Federated Search page:
* No sorting support - just simple merged ranking.
I think it wouldn't be too much trouble to support all forms of
sorting that Solr currently supports. This can be done in the same
manner as the current Lucene MultiSearcher.
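
As a rough illustration of the merge step on the master, here is a minimal sketch. It assumes each search server returns its hits already sorted by descending score. The names Hit, Cursor and mergeTopN are hypothetical, and this is not the Lucene MultiSearcher code, only the same general idea of merging pre-sorted per-server lists through a priority queue.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ShardMerge {
    /** One hit as returned by a search server (hypothetical type). */
    public static final class Hit {
        public final int shard;
        public final String id;
        public final float score;

        public Hit(int shard, String id, float score) {
            this.shard = shard;
            this.id = id;
            this.score = score;
        }
    }

    /** Tracks the read position inside one server's sorted hit list. */
    private static final class Cursor {
        final List<Hit> hits;
        int pos;

        Cursor(List<Hit> hits) {
            this.hits = hits;
        }

        Hit current() {
            return hits.get(pos);
        }
    }

    /** Merges per-server lists (each sorted by descending score) into the global top n. */
    public static List<Hit> mergeTopN(List<List<Hit>> perShard, int n) {
        // Highest score first; ties go to the lower shard number so the order is deterministic.
        Comparator<Cursor> order = (a, b) -> {
            int byScore = Float.compare(b.current().score, a.current().score);
            return byScore != 0 ? byScore : Integer.compare(a.current().shard, b.current().shard);
        };
        PriorityQueue<Cursor> queue = new PriorityQueue<>(order);
        for (List<Hit> hits : perShard) {
            if (!hits.isEmpty()) {
                queue.add(new Cursor(hits));
            }
        }
        List<Hit> merged = new ArrayList<>();
        while (merged.size() < n && !queue.isEmpty()) {
            Cursor best = queue.poll();
            merged.add(best.current());
            best.pos++;
            if (best.pos < best.hits.size()) {
                queue.add(best); // re-insert with its next hit as the new head
            }
        }
        return merged;
    }
}
```

Sorting on something other than score would mean shipping the sort values back with each hit and swapping the comparator; the merge loop itself would stay the same.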
* No IDF skew compensation - we can mix documents sufficiently.
Yeah, I wasn't going to tackle that on the first pass. But it is
doable (again, the Lucene MultiSearcher shows how). I'd want to make
it optional in any case, because the performance gains are often not
worth it.
* No automatic doc->server mapping - we can calculate our own stable hash for this.
* No consistency via retry.
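
For the doc->server mapping, a stable hash could look something like the sketch below. This is only an illustration under assumptions: the class name ShardRouter, the method shardFor and the use of a string document key are all hypothetical, and CRC32 is picked simply because its value is defined independently of the JVM.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ShardRouter {
    /** Maps a document key to a server index in [0, numShards). Hypothetical helper. */
    public static int shardFor(String docKey, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        CRC32 crc = new CRC32();
        crc.update(docKey.getBytes(StandardCharsets.UTF_8));
        // getValue() returns the checksum as a non-negative long, so the remainder is never negative.
        return (int) (crc.getValue() % numShards);
    }

    public static void main(String[] args) {
        String[] keys = {"http://example.com/a", "http://example.com/b"};
        for (String key : keys) {
            System.out.println(key + " -> server " + shardFor(key, 4));
        }
    }
}
```

One caveat with a plain modulus: changing the number of servers remaps most documents, so this only stays stable while the server count is fixed.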
To that end, I did a quick exploration of how to use Hadoop RPC to
"talk" to the guts of Solr. This assumes that:
I'm not into Nutch or Hadoop that much yet, so I'd be really
interested in what you find out there.
1. Query processing happens at the search server level, versus at the
master, as it is currently with Nutch.
2. There's a way to request summaries by document id via a subsequent
(post-merge) call from the master.
#2 is the biggie I think (if by "document id" you mean internal Lucene docid).
Not having internal document ids change between calls is the biggest problem.
<and a bunch of other issues that I haven't noted>.
The immediate problem I ran into is that the notion of Solr running
inside of a servlet container currently penetrates deep into the
bowels of the code. Even below the core level, calls are being made
to extract query parameters from a URL.
That's wrapped up in SolrQueryParams, which has a non-servlet version though.
The unit tests use this to run outside of a container.
-Yonik