[Moving this to solr-dev from solr-user]
On 9/27/06, Vish D. <[EMAIL PROTECTED]> wrote:
I just noticed that link on the first reply from Yonik about
FederatedSearch. I see that a lot of thought went into it. I guess the
question to ask would be, any progress on it, Yonik? :)
No code, but great progress at shooting holes in various strategies ;-)
I'm currently thinking about doing federated search at a higher level,
with slightly modified standard request handlers, and another
top-level request handler that can combine requests. The biggest
downside: no custom query handlers.
The other option: do federated search like a Lucene MultiSearcher...
(a federated version of the SolrIndexSearcher). The downside is that
existing interfaces would not be usable... we can't be shipping tons
of BitDocSets across the network. Things like highlighting, federated
search, etc., would need to be pushed down into this interface. New
interfaces means lots of changes to request handler code. Upside
would be that custom request handlers would still work and be
automatically parallelized.
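To make the first option concrete, here's a minimal sketch of what the "top-level request handler that can combine requests" might do: each shard returns its own score-sorted top-N, and the combiner interleaves them into one global ranking. All names (ScoredDoc, mergeTopN) are illustrative, not existing Solr classes:

```java
import java.util.*;

// Hypothetical sketch: merge score-sorted per-shard results into one
// global top-N list, as a combining request handler might.
public class ShardMerger {
    // Minimal stand-in for a hit returned by one shard.
    static class ScoredDoc {
        final String id;
        final float score;
        ScoredDoc(String id, float score) { this.id = id; this.score = score; }
    }

    // K-way merge by descending score, keeping only the global top-N.
    static List<ScoredDoc> mergeTopN(List<List<ScoredDoc>> shardResults, int n) {
        PriorityQueue<ScoredDoc> heap =
            new PriorityQueue<>((a, b) -> Float.compare(b.score, a.score));
        for (List<ScoredDoc> shard : shardResults) heap.addAll(shard);
        List<ScoredDoc> merged = new ArrayList<>();
        while (!heap.isEmpty() && merged.size() < n) merged.add(heap.poll());
        return merged;
    }

    public static void main(String[] args) {
        List<ScoredDoc> shard1 = List.of(new ScoredDoc("a", 0.9f), new ScoredDoc("b", 0.4f));
        List<ScoredDoc> shard2 = List.of(new ScoredDoc("c", 0.7f), new ScoredDoc("d", 0.1f));
        for (ScoredDoc d : mergeTopN(List.of(shard1, shard2), 3))
            System.out.println(d.id + " " + d.score);
    }
}
```

Note this only does simple merged ranking on raw scores, which is exactly where the IDF-skew and sorting issues from the wiki page bite.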
Anyone have any thoughts on this stuff?
http://wiki.apache.org/solr/FederatedSearch
Quick impression - given the scope of what's being described on this
page, it feels like a "boil the ocean" problem.
I've spent an afternoon looking at how we could use Solr as our
distributed searchers for Nutch. Currently the Nutch search serving
code isn't getting much love, so somehow leveraging Solr would seem
like a win.
The three attributes of Solr that are most interesting to me in this
context are:
1. Live update support.
2. More complex query processing.
3. Caching (though not as critical)
Things I can live with that I noticed being described as issues on
the Federated Search page:
* No sorting support - just simple merged ranking.
* No IDF skew compensation - we can mix documents sufficiently.
* No automatic doc->server mapping - we can calc our own stable hash for this.
* No consistency via retry.
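For the stable-hash point above, something as simple as this would do: a deterministic hash of the document's unique key picks the server, so any node can compute where a document lives without a lookup table. This is just one way to do it (an FNV-1a-style hash), not anything Solr or Nutch ships:

```java
// Hypothetical sketch of a stable doc->server mapping: the same doc id
// always hashes to the same server, with no central mapping table.
public class StableDocHash {
    // FNV-1a-style hash of the doc's unique key.
    static int serverFor(String docId, int numServers) {
        int h = 0x811c9dc5;
        for (int i = 0; i < docId.length(); i++) {
            h ^= docId.charAt(i);
            h *= 0x01000193;
        }
        return Math.floorMod(h, numServers); // floorMod avoids negative buckets
    }

    public static void main(String[] args) {
        // Stable: the same id lands on the same server every time.
        System.out.println(serverFor("http://example.com/page1", 4));
        System.out.println(serverFor("http://example.com/page1", 4));
    }
}
```

The obvious caveat is that a plain modulus reshuffles nearly everything when numServers changes; consistent hashing would fix that, but for a fixed set of search servers this is enough.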
To that end, I did a quick exploration of how to use Hadoop RPC to
"talk" to the guts of Solr. This assumes that:
1. Query processing happens at the search server level, versus at the
master, as it is currently with Nutch.
2. There's a way to request summaries by document id via a subsequent
(post-merge) call from the master.
<and a bunch of other issues that I haven't noted>.
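The two assumptions above amount to a two-phase protocol: the master first asks every shard for ids and scores, merges, then fetches summaries only for the winning ids in a second call. A rough sketch of the shapes involved, with entirely made-up interface names (this is not the Hadoop RPC API or any real Solr/Nutch interface):

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical two-phase search sketch: phase 1 returns ids+scores,
// the master merges, phase 2 fetches summaries for the merged winners.
public class TwoPhaseSearch {
    // What each shard would expose over RPC in this sketch.
    interface Shard {
        Map<String, Float> search(String query, int topN);   // phase 1: id -> score
        Map<String, String> summaries(Set<String> ids);      // phase 2: id -> summary
    }

    static Map<String, String> run(List<Shard> shards, String query, int topN) {
        // Phase 1: collect and merge scores from all shards.
        Map<String, Float> merged = new HashMap<>();
        for (Shard s : shards) merged.putAll(s.search(query, topN));
        List<String> winners = merged.entrySet().stream()
            .sorted((a, b) -> Float.compare(b.getValue(), a.getValue()))
            .limit(topN).map(Map.Entry::getKey)
            .collect(Collectors.toList());
        // Phase 2: fetch summaries only for the merged winners.
        // (Simplified: a real master would ask only the owning shard per id.)
        Map<String, String> result = new LinkedHashMap<>();
        for (Shard s : shards)
            s.summaries(new HashSet<>(winners)).forEach(result::putIfAbsent);
        return result;
    }

    public static void main(String[] args) {
        Shard fake = new Shard() {
            public Map<String, Float> search(String q, int n) {
                return Map.of("d1", 0.8f, "d2", 0.3f);
            }
            public Map<String, String> summaries(Set<String> ids) {
                Map<String, String> m = new HashMap<>();
                for (String id : ids) m.put(id, "summary of " + id);
                return m;
            }
        };
        // Only d1 survives the merge, so only d1's summary is fetched.
        System.out.println(run(List.of(fake), "anything", 1));
    }
}
```

The payoff is in phase 2: only topN summaries cross the wire instead of topN-per-shard, which matters once you have more than a handful of search servers.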
The immediate problem I ran into is that the notion of Solr running
inside of a servlet container currently penetrates deep into the
bowels of the code. Even below the core level, calls are being made
to extract query parameters from a URL.
So step 1, if I were going to try to do this in a clean manner, would
be to define a servlet side/Solr core API layer. Then it would be
relatively easy to at least do the first cut of hooking up the Solr
core to a Nutch master via Hadoop RPC.
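The shape of that layer could be something like the following: the core consumes a plain request abstraction, and the servlet layer (or a Hadoop RPC endpoint) adapts its own transport onto it. All names here are illustrative sketches, not actual Solr interfaces:

```java
import java.util.*;

// Hypothetical sketch of a servlet-side/core API layer: the core sees
// only this abstraction, with no servlet types in sight.
public class CoreApiSketch {
    interface CoreRequest {
        String getParam(String name);
    }

    // Transport-neutral implementation backed by a plain map, as an RPC
    // endpoint might build from a deserialized message. A servlet adapter
    // would instead wrap HttpServletRequest.getParameter().
    static class MapRequest implements CoreRequest {
        private final Map<String, String> params;
        MapRequest(Map<String, String> params) { this.params = params; }
        public String getParam(String name) { return params.get(name); }
    }

    // A core entry point that depends only on the abstraction.
    static String handleQuery(CoreRequest req) {
        String q = req.getParam("q");
        return q == null ? "missing query" : "searching for: " + q;
    }

    public static void main(String[] args) {
        System.out.println(handleQuery(new MapRequest(Map.of("q", "nutch"))));
    }
}
```

Once the core only ever sees CoreRequest, the "extract query parameters from a URL" calls below the core level go away, and wiring in a second transport is just writing another small adapter.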
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"