On 9/27/06, Ken Krugler <[EMAIL PROTECTED]> wrote:
I've spent an afternoon looking at how we could use Solr as our
distributed searchers for Nutch. Currently the Nutch search serving
code isn't getting much love, so somehow leveraging Solr would seem
like a win.

The three attributes of Solr that are most interesting to me in this
context are:

1. Live update support.

2. More complex query processing.

3. Caching (though not as critical).

Things I can live with that I noticed being described as issues on
the Federated Search page:

  * No sorting support - just simple merged ranking.

I think it wouldn't be too much trouble to support all forms of
sorting that Solr currently supports.  This can be done in the same
manner as the current Lucene MultiSearcher.

  * No IDF skew compensation - we can mix documents sufficiently.

Yeah, I wasn't going to tackle that on the first pass.  But it is
doable (again, the Lucene MultiSearcher shows how).  I'd want to make
it optional in any case, because the performance gains are often not
worth it.

Note that Nutch doesn't try to solve this, because of concerns that the extra round-trips required to normalize IDFs across remote searchers would be too slow. RMI is faster than Hadoop RPC, so I guess it's less of an issue there.
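The extra round-trip amounts to collecting per-term document frequencies from every shard and recomputing idf against the global counts before scoring. A rough Python sketch using the classic Lucene idf formula (the shard_stats shape is my own invention):

```python
import math

def global_idf(term, shard_stats):
    # shard_stats: list of (doc_freq, num_docs) pairs, one per shard,
    # gathered for this term in an extra round-trip before scoring --
    # the step Nutch skips to avoid the latency.
    df = sum(doc_freq for doc_freq, _ in shard_stats)
    n = sum(num_docs for _, num_docs in shard_stats)
    # Classic Lucene idf: 1 + ln(N / (df + 1)).
    return 1.0 + math.log(n / (df + 1))
```

Each searcher would then score with this shared idf instead of its local one, which is why it costs a round-trip and why making it optional seems right.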

  * No automatic doc->server mapping - we can calc our own stable hash for this.
  * No consistency via retry.
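The stable hash in the first point just has to be deterministic across processes and machines, so Java's hashCode (or Python's randomized hash) won't do. A sketch hashing the document URL with a fixed digest (md5 chosen arbitrarily):

```python
import hashlib

def server_for_doc(url, num_servers):
    # Stable doc->server mapping: the same URL always lands on the same
    # searcher, regardless of process, JVM, or hash seed, because the
    # digest is fully determined by the bytes of the URL.
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers
```

Note the usual caveat with modulo mapping: changing num_servers reshuffles nearly every document, so growing the cluster means re-partitioning.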

To that end, I did a quick exploration of how to use Hadoop RPC to
"talk" to the guts of Solr. This assumes that:

I'm not that far into Nutch or Hadoop yet, so I'd be really
interested in what you find out there.

1. Query processing happens at the search server level, versus at the
master, as it is currently with Nutch.

2. There's a way to request summaries by document id via a subsequent
(post-merge) call from the master.

#2 is the biggie, I think (if by "document id" you mean internal Lucene docid).
Keeping internal document ids from changing between calls is the biggest problem.

Well, you have to handle potential summarizer problems in any case - for example, if a remote searcher goes away, or gets so bogged down that it times out, but you've got a hit from that server that still needs a summary. This is the case we ran into during load testing.

Though that wouldn't be as serious as getting a completely wrong summary, if the remote index updated between when the search request happened and the summary was requested.

A munge count might be enough, and pretty simple.
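Something like: the searcher stamps every result set with a generation number that gets bumped on each index update, and refuses a summary request carrying a stale generation rather than returning a possibly-wrong summary. A toy Python sketch (class and method names are mine, and the "index" is just a dict):

```python
class Searcher:
    # The "munge count" idea: a generation counter bumped on every index
    # update. Hits carry the generation they were produced under; a later
    # summary request for a stale generation is refused, so the master
    # re-runs the search instead of showing a summary for the wrong doc.
    def __init__(self):
        self.generation = 0
        self.docs = {}  # stand-in for the index: docid -> text

    def update_index(self, docs):
        self.docs = docs
        self.generation += 1  # internal docids may have shifted

    def search(self, query):
        hits = [doc for doc, text in self.docs.items() if query in text]
        return self.generation, hits

    def summary(self, generation, docid):
        if generation != self.generation:
            return None  # stale: index changed since the search; retry
        return self.docs[docid][:40]
```

The master treats a None summary the same as a timed-out searcher: drop or re-fetch the hit, which folds this into the failure handling it needs anyway.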

<and a bunch of other issues that I haven't noted>.

The immediate problem I ran into is that the notion of Solr running
inside of a servlet container currently penetrates deep into the
bowels of the code. Even below the core level, calls are being made
to extract query parameters from a URL.

That's wrapped up in SolrQueryParams, which has a non-servlet version though.
The unit tests use this to run outside of a container.

That's part of it, but from what I remember there were other issues with servlet-esque objects getting passed down deep. I'll have to take another look, as my afternoon of poking was a few weeks ago.

Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
