On 21-Sep-07, at 2:34 PM, Yonik Seeley wrote:
> On 9/21/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
>> On 21-Sep-07, at 11:08 AM, Yonik Seeley wrote:
>>> I wanted to take a step back for a second and think about if HTTP
>>> was really the right choice for the transport for distributed
>>> search. I think the high-level approach in SOLR-303 is the right
>>> way to go about it, but I'm unsure if HTTP is the right transport.
>> I don't know anything about RMI, but is it possible to do 100's of
>> simultaneous asynchronous requests cheaply?
> Good question... probably only important for really big clusters
> (like yours), but it would be nice.
> Even if we go HTTP, I'm not sure it will be async at first - does
> HTTPClient even support async?
I don't think so. In fact, I need to make a small amendment to my
original claim: the distribution code actually uses our internal rpc
(which is pure python), but the other end is a python client that
connects with Solr via http (persistent, localhost connection). I
wrote it this way because it was easier, as our internal rpc library
already has functionality for spitting out requests to 100's of
clients and collecting the results asynchronously. I figured that
connecting to Solr directly via http would be cheaper, but perhaps it
wouldn't be.
Both the rpc and http levels use connection-pooled persistent
connections.
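
For anyone curious, the general pattern (pooled persistent connections
plus a fan-out that collects results as they complete) would look
something like the sketch below in Java. This is just an illustration,
not our actual code (which is python); the class name, shard URLs, and
pool sizes are all made up:

    import java.util.*;
    import java.util.concurrent.*;
    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
    import org.apache.commons.httpclient.methods.GetMethod;

    // Illustrative only: fan a query out to many shards over pooled,
    // persistent HTTP connections and collect responses as they finish.
    public class ShardFanout {
        public static void main(String[] args) throws Exception {
            List<String> shardUrls = Arrays.asList(
                "http://shard1:8983/solr/select?q=foo",   // made-up hosts
                "http://shard2:8983/solr/select?q=foo");

            // One pooled HttpClient shared by all workers; HTTP/1.1
            // keep-alive makes the pooled connections persistent.
            MultiThreadedHttpConnectionManager mgr =
                new MultiThreadedHttpConnectionManager();
            mgr.getParams().setMaxTotalConnections(200);
            mgr.getParams().setDefaultMaxConnectionsPerHost(4);
            final HttpClient client = new HttpClient(mgr);

            // "Async" here is still thread-per-request: workers block on
            // I/O while the caller collects results in completion order.
            ExecutorService pool =
                Executors.newFixedThreadPool(shardUrls.size());
            CompletionService<String> done =
                new ExecutorCompletionService<String>(pool);
            for (final String url : shardUrls) {
                done.submit(new Callable<String>() {
                    public String call() throws Exception {
                        GetMethod get = new GetMethod(url);
                        try {
                            client.executeMethod(get);
                            return get.getResponseBodyAsString();
                        } finally {
                            get.releaseConnection(); // back to pool, stays open
                        }
                    }
                });
            }
            for (int i = 0; i < shardUrls.size(); i++) {
                String response = done.take().get();
                // merge 'response' into the combined result set here
            }
            pool.shutdown();
        }
    }

Note that each in-flight request still occupies a worker thread, which
brings us back to the async question.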
> I assume when you say async that you mean getting rid of the
> thread-per-connection via NIO. Some protocols do "async" by handing
> off the request to another thread to wait on the response and then do
> a callback to the original thread - this is async with respect to the
> original calling thread, but still requires a thread-per-connection.
Right; this helps but doesn't scale too far.
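
To spell out the distinction, here is a minimal sketch (hypothetical
names, with a stand-in for the actual HTTP call) of the hand-off style:
the caller never blocks, but one pool thread is still parked per
outstanding connection, so it scales with threads rather than with
NIO-style selectors:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Minimal sketch (hypothetical names) of hand-off "async": the
    // caller gets a callback instead of blocking, but one pool thread
    // is still tied up per outstanding connection.
    public class HandoffAsync {
        interface Callback { void onResponse(String body); }

        private final ExecutorService pool = Executors.newFixedThreadPool(100);

        public void asyncRequest(final String url, final Callback cb) {
            pool.execute(new Runnable() {
                public void run() {
                    String body = doBlockingGet(url); // thread waits here
                    cb.onResponse(body);              // "callback" on completion
                }
            });
        }

        private String doBlockingGet(String url) {
            // stand-in for a blocking HTTP round trip (see earlier sketch)
            return "...";
        }
    }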
>>> Of course HTTP has some issues too - you effectively need a separate
>>> connection per outstanding request. Pipelining won't work well
>>> because things need to come back in-order. I'm not sure if RMI has
>>> this limitation as well.
>> FWIW, our distributed search uses http over 120+ shards... and is
>> written in python.
> That would be an awesome test case if you were able to use what Solr
> is going to provide out-of-the-box. Any unusual requirements?
The biggest point of customization is that we run two Solrs in a
single webapp, one for querying and one for highlighting. The
highlighter Solr uses a set of custom parameters to determine the
docs to use (I imagine the current patch does something like this as
well). Splitting the content from the rest of the stored fields is a
huge win. There is also a lot of custom deduplication and caching
logic, but this could be done as a post-processing step.
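
To make the doc-selection part concrete, you can picture the
highlighter Solr receiving requests along these lines (the path and the
docIds parameter are invented for illustration; hl and hl.fl are
standard Solr highlighting parameters):

    GET /solr-highlight/select?q=foo+bar&docIds=1042,1187,2230&hl=true&hl.fl=content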
In case anyone is thinking of building something this huge, I'll
mention that it is a bad idea to have a single node try to manage so
many shards. It is preferable to go hierarchical (which could be
accomplished relatively easily if a query distributor could itself
query other query-distributor nodes).
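
A rough sketch of what I mean: if a distributor exposes the same query
interface as a shard, the tiers compose naturally. All names here are
hypothetical, not anything from SOLR-303:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: a distributor implements the same interface
    // as a shard, so distributors can be stacked into tiers.
    interface Searchable {
        List<String> query(String q);
    }

    class Shard implements Searchable {
        public List<String> query(String q) {
            // would issue the HTTP request to one Solr instance
            return new ArrayList<String>();
        }
    }

    class Distributor implements Searchable {
        private final List<Searchable> children; // shards OR distributors

        Distributor(List<Searchable> children) { this.children = children; }

        public List<String> query(String q) {
            List<String> merged = new ArrayList<String>();
            for (Searchable child : children) { // in parallel, in practice
                merged.addAll(child.query(q));
            }
            // re-sort and trim the merged results here
            return merged;
        }
    }

A top-level distributor would then see, say, a dozen mid-tier
distributors instead of 120+ shards.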
-Mike