[Moving this to solr-dev from solr-user]

On 9/27/06, Vish D. <[EMAIL PROTECTED]> wrote:
I just noticed the link in the first reply from Yonik about
FederatedSearch. I can see that a lot of thought went into it. I guess the
question to ask would be: any progress on it, Yonik? :)

No code, but great progress at shooting holes in various strategies ;-)

I'm currently thinking about doing federated search at a higher level,
with slightly modified standard request handlers, and another
top-level request handler that can combine requests.  The biggest
downside: no custom query handlers.
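To make the combining idea concrete, here is a minimal sketch of the merge step such a top-level handler might perform, assuming each shard hands back (docId, score) pairs. The names (MergeShards, Hit, merge) are illustrative, not Solr APIs, and this is the "simple merged ranking" case with no sorting or IDF compensation:

```java
import java.util.*;

/** Illustrative sketch: merge per-shard ranked results into one global top-n list. */
public class MergeShards {
    /** A (documentId, score) pair, as a shard might return it. */
    static final class Hit {
        final String docId;
        final float score;
        Hit(String docId, float score) { this.docId = docId; this.score = score; }
    }

    /** Pour all shard hits into a max-heap on score and pull out the top n. */
    static List<Hit> merge(List<List<Hit>> shardResults, int n) {
        PriorityQueue<Hit> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        for (List<Hit> shard : shardResults) heap.addAll(shard);
        List<Hit> merged = new ArrayList<>();
        for (int i = 0; i < n && !heap.isEmpty(); i++) merged.add(heap.poll());
        return merged;
    }

    public static void main(String[] args) {
        List<Hit> a = Arrays.asList(new Hit("a1", 0.9f), new Hit("a2", 0.4f));
        List<Hit> b = Arrays.asList(new Hit("b1", 0.7f));
        for (Hit h : merge(Arrays.asList(a, b), 2)) System.out.println(h.docId);
    }
}
```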

The other option: do federated search like a Lucene MultiSearcher...
(a federated version of the SolrIndexSearcher).  The downside is that
existing interfaces would not be usable... we can't be shipping tons
of BitDocSets across the network.  Things like highlighting, federated
search, etc., would need to be pushed down into this interface.  New
interfaces mean lots of changes to request handler code.  The upside
would be that custom request handlers would still work and be
automatically parallelized.
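The "automatically parallelized" upside could look something like the following sketch: fan a query out to every sub-searcher on a thread pool and collect the results. The Searchable interface here is an illustrative stand-in, not the Lucene or Solr API:

```java
import java.util.*;
import java.util.concurrent.*;

/** Illustrative sketch: run the same query against all sub-searchers in parallel. */
public class ParallelFanOut {
    /** Hypothetical stand-in for a per-shard searcher. */
    interface Searchable {
        List<String> search(String query) throws Exception;
    }

    static List<String> searchAll(List<Searchable> searchers, String query) {
        ExecutorService pool = Executors.newFixedThreadPool(searchers.size());
        try {
            List<Callable<List<String>>> tasks = new ArrayList<>();
            for (Searchable s : searchers) tasks.add(() -> s.search(query));
            List<String> results = new ArrayList<>();
            // invokeAll blocks until every shard has answered.
            for (Future<List<String>> f : pool.invokeAll(tasks))
                results.addAll(f.get());
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Searchable a = q -> Arrays.asList(q + "@shard1");
        Searchable b = q -> Arrays.asList(q + "@shard2");
        System.out.println(searchAll(Arrays.asList(a, b), "foo"));
    }
}
```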

Anyone have any thoughts on this stuff?
http://wiki.apache.org/solr/FederatedSearch

Quick impression - given the scope of what's being described on this page, it feels like a "boil the ocean" problem.

I've spent an afternoon looking at how we could use Solr as the distributed search servers for Nutch. Currently the Nutch search-serving code isn't getting much love, so somehow leveraging Solr would seem like a win.

The three attributes of Solr that are most interesting to me in this context are:

1. Live update support.

2. More complex query processing.

3. Caching (though not as critical).

Things I can live with that I noticed being described as issues on the Federated Search page:

 * No sorting support - just simple merged ranking.
 * No IDF skew compensation - we can mix documents sufficiently.
 * No automatic doc->server mapping - we can calc our own stable hash for this.
 * No consistency via retry.
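The stable hash for the doc->server mapping is simple enough to sketch. Assuming the document's unique key is a string, Java's String.hashCode is defined by the language spec, so the same key always maps to the same server on every JVM (the class and method names here are illustrative):

```java
/** Illustrative sketch: a stable doc->server mapping from the document's unique key. */
public class DocServerHash {
    /** Map a document key to one of numServers search servers. */
    static int serverFor(String docKey, int numServers) {
        // Mask to non-negative rather than Math.abs, which overflows on Integer.MIN_VALUE.
        return (docKey.hashCode() & 0x7fffffff) % numServers;
    }

    public static void main(String[] args) {
        // Stable: the same key always lands on the same server.
        System.out.println(serverFor("http://example.com/page", 4));
    }
}
```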

To that end, I did a quick exploration of how to use Hadoop RPC to "talk" to the guts of Solr. This assumes that:

1. Query processing happens at the search server level, versus at the master, as it is currently with Nutch.

2. There's a way to request summaries by document id via a subsequent (post-merge) call from the master.

<and a bunch of other issues that I haven't noted>.
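The two assumptions above amount to a two-phase protocol: the master asks each search server for scored ids, merges them, and then fetches summaries for only the winning ids. Here is a toy, in-process sketch of that flow; in a real setup each SearchProtocol instance would sit behind Hadoop RPC, and all the names (SearchProtocol, ScoredId, InMemoryServer) are illustrative, not Solr or Nutch APIs:

```java
import java.util.*;

/** Illustrative sketch of a two-phase federated search: ids first, summaries after the merge. */
public class TwoPhaseSearch {
    static final class ScoredId {
        final String docId; final float score;
        ScoredId(String docId, float score) { this.docId = docId; this.score = score; }
    }

    /** What each search server would expose to the master (over RPC in practice). */
    interface SearchProtocol {
        List<ScoredId> search(String query, int n);            // phase 1: ids and scores only
        Map<String, String> getSummaries(List<String> docIds); // phase 2: post-merge fetch
    }

    /** Toy stand-in for a remote search server. */
    static final class InMemoryServer implements SearchProtocol {
        private final Map<String, String> docs;  // docId -> summary text
        InMemoryServer(Map<String, String> docs) { this.docs = docs; }

        public List<ScoredId> search(String query, int n) {
            List<ScoredId> hits = new ArrayList<>();
            for (Map.Entry<String, String> e : docs.entrySet())
                if (e.getValue().contains(query))
                    hits.add(new ScoredId(e.getKey(), e.getValue().length())); // toy scoring
            hits.sort((a, b) -> Float.compare(b.score, a.score));
            return hits.subList(0, Math.min(n, hits.size()));
        }

        public Map<String, String> getSummaries(List<String> docIds) {
            Map<String, String> out = new HashMap<>();
            for (String id : docIds)
                if (docs.containsKey(id)) out.put(id, docs.get(id)); // only ids this server holds
            return out;
        }
    }

    /** The master: query every server, merge by score, then fetch summaries by id. */
    static Map<String, String> federatedSearch(List<SearchProtocol> servers, String q, int n) {
        List<ScoredId> all = new ArrayList<>();
        for (SearchProtocol s : servers) all.addAll(s.search(q, n));
        all.sort((a, b) -> Float.compare(b.score, a.score));
        List<String> winners = new ArrayList<>();
        for (int i = 0; i < Math.min(n, all.size()); i++) winners.add(all.get(i).docId);
        Map<String, String> summaries = new LinkedHashMap<>();
        for (SearchProtocol s : servers) summaries.putAll(s.getSummaries(winners));
        return summaries;
    }
}
```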

The immediate problem I ran into is that the notion of Solr running inside of a servlet container currently penetrates deep into the bowels of the code. Even below the core level, calls are being made to extract query parameters from a URL.

So step 1, if I were going to try to do this in a clean manner, would be to define a servlet-side/Solr-core API layer. Then it would be relatively easy to at least do a first cut of hooking up the Solr core to a Nutch master via Hadoop RPC.
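The shape of that API layer might be something like the following sketch: the core consumes a transport-neutral request object, and each front end (servlet, Hadoop RPC) adapts its own input into it, so no URL parsing happens below the core boundary. CoreRequest and getParam are hypothetical names, not the actual Solr API:

```java
import java.util.*;

/** Illustrative sketch: a transport-neutral request object for the core to consume. */
public class CoreRequest {
    private final Map<String, String[]> params;

    public CoreRequest(Map<String, String[]> params) { this.params = params; }

    /** First value for a parameter, or a default -- no servlet or URL code in the core. */
    public String getParam(String name, String def) {
        String[] v = params.get(name);
        return (v == null || v.length == 0) ? def : v[0];
    }

    public static void main(String[] args) {
        // A servlet front end would build this map from HttpServletRequest;
        // an RPC front end would deserialize it from the wire.
        Map<String, String[]> p = new HashMap<>();
        p.put("q", new String[] {"federated search"});
        CoreRequest req = new CoreRequest(p);
        System.out.println(req.getParam("q", "*:*"));
        System.out.println(req.getParam("rows", "10"));
    }
}
```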

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
