[Moving this to solr-dev from solr-user]
On 9/27/06, Vish D. <[EMAIL PROTECTED]> wrote:
I just noticed that link on the first reply from Yonik about
FederatedSearch. I see that a lot of thought went into it. I guess the
question to ask would be, any progress on it, Yonik? :)
No code, but great progress at shooting holes in various strategies ;-)
I'm currently thinking about doing federated search at a higher level,
with slightly modified standard request handlers, and another
top-level request handler that can combine requests. The biggest
downside: no custom query handlers.
The other option: do federated search like a Lucene MultiSearcher...
(a federated version of the SolrIndexSearcher). The downside is that
existing interfaces would not be usable... we can't be shipping tons
of BitDocSets across the network. Things like highlighting, federated
search, etc., would need to be pushed down into this interface. New
interfaces means lots of changes to request handler code. Upside
would be that custom request handlers would still work and be
automatically parallelized.
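To make the first option concrete, here's a minimal sketch of what the "top-level request handler that can combine requests" might do: each shard returns its own score-sorted top-N, and the combiner interleaves them into one global ranking. All names (ScoredDoc, mergeTopN) are illustrative, not existing Solr classes:

```java
import java.util.*;

// Hypothetical sketch: merge score-sorted per-shard results into one
// global top-N list, as a combining request handler might.
public class ShardMerger {
    // Minimal stand-in for a hit returned by one shard.
    static class ScoredDoc {
        final String id;
        final float score;
        ScoredDoc(String id, float score) { this.id = id; this.score = score; }
    }

    // K-way merge by descending score, keeping only the global top-N.
    static List<ScoredDoc> mergeTopN(List<List<ScoredDoc>> shardResults, int n) {
        PriorityQueue<ScoredDoc> heap =
            new PriorityQueue<>((a, b) -> Float.compare(b.score, a.score));
        for (List<ScoredDoc> shard : shardResults) heap.addAll(shard);
        List<ScoredDoc> merged = new ArrayList<>();
        while (!heap.isEmpty() && merged.size() < n) merged.add(heap.poll());
        return merged;
    }

    public static void main(String[] args) {
        List<ScoredDoc> shard1 = List.of(new ScoredDoc("a", 0.9f), new ScoredDoc("b", 0.4f));
        List<ScoredDoc> shard2 = List.of(new ScoredDoc("c", 0.7f), new ScoredDoc("d", 0.1f));
        for (ScoredDoc d : mergeTopN(List.of(shard1, shard2), 3))
            System.out.println(d.id + " " + d.score);
    }
}
```

Note this only does simple merged ranking on raw scores, which is exactly where the IDF-skew and sorting issues from the wiki page bite.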
Anyone have any thoughts on this stuff?
http://wiki.apache.org/solr/FederatedSearch
Quick impression - given the scope of what's being described on this
page, it feels like a "boil the ocean" problem.
I've spent an afternoon looking at how we could use Solr as our
distributed searchers for Nutch. Currently the Nutch search serving
code isn't getting much love, so somehow leveraging Solr would seem
like a win.
The three attributes of Solr that are most interesting to me in this
context are:
1. Live update support.
2. More complex query processing.
3. Caching (though not as critical)
Things I can live with that I noticed being described as issues on
the Federated Search page:
* No sorting support - just simple merged ranking.
* No IDF skew compensation - we can mix documents sufficiently.
* No automatic doc->server mapping - we can calc our own stable hash for this.
* No consistency via retry.
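For the stable-hash point above, something as simple as this would do: a deterministic hash of the document's unique key picks the server, so any node can compute where a document lives without a lookup table. This is just one way to do it (an FNV-1a-style hash), not anything Solr or Nutch ships:

```java
// Hypothetical sketch of a stable doc->server mapping: the same doc id
// always hashes to the same server, with no central mapping table.
public class StableDocHash {
    // FNV-1a-style hash of the doc's unique key.
    static int serverFor(String docId, int numServers) {
        int h = 0x811c9dc5;
        for (int i = 0; i < docId.length(); i++) {
            h ^= docId.charAt(i);
            h *= 0x01000193;
        }
        return Math.floorMod(h, numServers); // floorMod avoids negative buckets
    }

    public static void main(String[] args) {
        // Stable: the same id lands on the same server every time.
        System.out.println(serverFor("http://example.com/page1", 4));
        System.out.println(serverFor("http://example.com/page1", 4));
    }
}
```

The obvious caveat is that a plain modulus reshuffles nearly everything when numServers changes; consistent hashing would fix that, but for a fixed set of search servers this is enough.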
To that end, I did a quick exploration of how to use Hadoop RPC to
"talk" to the guts of Solr. This assumes that:
1. Query processing happens at the search server level, versus at the
master, as it is currently with Nutch.
2. There's a way to request summaries by document id via a subsequent
(post-merge) call from the master.
<and a bunch of other issues that I haven't noted>.
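The two assumptions above amount to a two-phase protocol: the master first asks every shard for ids and scores, merges, then fetches summaries only for the winning ids in a second call. A rough sketch of the shapes involved, with entirely made-up interface names (this is not the Hadoop RPC API or any real Solr/Nutch interface):

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical two-phase search sketch: phase 1 returns ids+scores,
// the master merges, phase 2 fetches summaries for the merged winners.
public class TwoPhaseSearch {
    // What each shard would expose over RPC in this sketch.
    interface Shard {
        Map<String, Float> search(String query, int topN);   // phase 1: id -> score
        Map<String, String> summaries(Set<String> ids);      // phase 2: id -> summary
    }

    static Map<String, String> run(List<Shard> shards, String query, int topN) {
        // Phase 1: collect and merge scores from all shards.
        Map<String, Float> merged = new HashMap<>();
        for (Shard s : shards) merged.putAll(s.search(query, topN));
        List<String> winners = merged.entrySet().stream()
            .sorted((a, b) -> Float.compare(b.getValue(), a.getValue()))
            .limit(topN).map(Map.Entry::getKey)
            .collect(Collectors.toList());
        // Phase 2: fetch summaries only for the merged winners.
        // (Simplified: a real master would ask only the owning shard per id.)
        Map<String, String> result = new LinkedHashMap<>();
        for (Shard s : shards)
            s.summaries(new HashSet<>(winners)).forEach(result::putIfAbsent);
        return result;
    }

    public static void main(String[] args) {
        Shard fake = new Shard() {
            public Map<String, Float> search(String q, int n) {
                return Map.of("d1", 0.8f, "d2", 0.3f);
            }
            public Map<String, String> summaries(Set<String> ids) {
                Map<String, String> m = new HashMap<>();
                for (String id : ids) m.put(id, "summary of " + id);
                return m;
            }
        };
        // Only d1 survives the merge, so only d1's summary is fetched.
        System.out.println(run(List.of(fake), "anything", 1));
    }
}
```

The payoff is in phase 2: only topN summaries cross the wire instead of topN-per-shard, which matters once you have more than a handful of search servers.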
The immediate problem I ran into is that the notion of Solr running
inside of a servlet container currently penetrates deep into the
bowels of the code. Even below the core level, calls are being made
to extract query parameters from a URL.
So step 1, if I were going to try to do this in a clean manner, would
be to define a servlet side/Solr core API layer. Then it would be
relatively easy to at least do the first cut of hooking up the Solr
core to a Nutch master via Hadoop RPC.
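The shape of that layer could be something like the following: the core consumes a plain request abstraction, and the servlet layer (or a Hadoop RPC endpoint) adapts its own transport onto it. All names here are illustrative sketches, not actual Solr interfaces:

```java
import java.util.*;

// Hypothetical sketch of a servlet-side/core API layer: the core sees
// only this abstraction, with no servlet types in sight.
public class CoreApiSketch {
    interface CoreRequest {
        String getParam(String name);
    }

    // Transport-neutral implementation backed by a plain map, as an RPC
    // endpoint might build from a deserialized message. A servlet adapter
    // would instead wrap HttpServletRequest.getParameter().
    static class MapRequest implements CoreRequest {
        private final Map<String, String> params;
        MapRequest(Map<String, String> params) { this.params = params; }
        public String getParam(String name) { return params.get(name); }
    }

    // A core entry point that depends only on the abstraction.
    static String handleQuery(CoreRequest req) {
        String q = req.getParam("q");
        return q == null ? "missing query" : "searching for: " + q;
    }

    public static void main(String[] args) {
        System.out.println(handleQuery(new MapRequest(Map.of("q", "nutch"))));
    }
}
```

Once the core only ever sees CoreRequest, the "extract query parameters from a URL" calls below the core level go away, and wiring in a second transport is just writing another small adapter.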
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"