On 21-Sep-07, at 2:34 PM, Yonik Seeley wrote:

On 9/21/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
On 21-Sep-07, at 11:08 AM, Yonik Seeley wrote:

I wanted to take a step back for a second and think about if HTTP was
really the right choice for the transport for distributed search.

I think the high-level approach in SOLR-303 is the right way to go
about it, but I'm unsure if HTTP is the right transport.

I don't know anything about RMI, but is it possible to do 100's of
simultaneous asynchronous requests cheaply?

Good question... probably only important for really big clusters (like
yours), but it would be nice.

Even if we go HTTP, I'm not sure it will be async at first - does
HTTPClient even support async?

I don't think so. In fact, I need to make a small amendment to my original claim: the distribution code actually uses our internal RPC (which is pure Python), but the other end is a Python client that connects to Solr via HTTP (a persistent, localhost connection). I wrote it this way because it was easier: our internal RPC library already has functionality for spitting out requests to hundreds of clients and collecting the results asynchronously. I figured that connecting to Solr directly via HTTP would be cheaper, but perhaps it wouldn't be.

Both the rpc and http levels use connection-pooled persistent connections.
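For concreteness, here is a minimal sketch of the fan-out/collect pattern described above, using Python's asyncio. The shard names and the simulated request are hypothetical stand-ins, not our actual RPC library:

```python
import asyncio

# Hypothetical shard addresses -- stand-ins for real Solr hosts.
SHARDS = [f"shard{i}.example.com" for i in range(8)]

async def query_shard(shard, q):
    # In a real client this would issue an HTTP (or RPC) request over a
    # pooled persistent connection; here we just simulate a response.
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return {"shard": shard, "hits": hash((shard, q)) % 100}

async def distribute(q):
    # Fan the query out to every shard concurrently and gather results;
    # one event loop multiplexes all outstanding requests, so no
    # thread-per-connection is needed even for hundreds of shards.
    return await asyncio.gather(*(query_shard(s, q) for s in SHARDS))

results = asyncio.run(distribute("solr"))
```

The key property is that the number of in-flight requests is bounded by memory, not by thread count.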

I assume when you say async that you mean getting rid of the
thread-per-connection via NIO.  Some protocols do "async" by handing
off the request to another thread to wait on the response and then do
a callback to the original thread - this is async with respect to the
original calling thread, but still requires a thread-per-connection.

Right; this helps but doesn't scale too far.
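To illustrate the distinction: the following hedged sketch shows the handoff-plus-callback style in Python. The caller is not blocked, but each outstanding request still occupies a pool thread for its full duration, which is exactly why it doesn't scale far:

```python
from concurrent.futures import ThreadPoolExecutor

def blocking_request(shard):
    # Stand-in for a blocking HTTP call; while it runs, one worker
    # thread is tied up for the entire duration of the request.
    return f"response from {shard}"

results = []
# "Async" via handoff: the submitting thread returns immediately and a
# callback fires when the response arrives -- but every in-flight
# request still consumes a thread. This is thread-per-connection
# dressed up as async, not NIO-style multiplexing.
with ThreadPoolExecutor(max_workers=4) as pool:
    for shard in ["shard1", "shard2", "shard3"]:
        fut = pool.submit(blocking_request, shard)
        fut.add_done_callback(lambda f: results.append(f.result()))
```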

Of course HTTP has some issues too - you effectively need a separate
connection per outstanding request.  Pipelining won't work well
because things need to come back in-order.  I'm not sure if RMI has
this limitation as well.
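The in-order constraint is head-of-line blocking, which a toy model makes concrete. The timings below are illustrative only: with pipelining on one connection, a slow first request delays every response queued behind it, while separate connections complete independently:

```python
def pipelined(latencies):
    # Responses on a pipelined connection must return in request order:
    # response i ships no earlier than its own completion AND no earlier
    # than every response before it.
    done, t = [], 0.0
    for lat in latencies:
        t = max(t, lat)
        done.append(t)
    return done

def separate_connections(latencies):
    # One connection per outstanding request: each completes on its own.
    return list(latencies)

slow_first = pipelined([5.0, 1.0, 1.0])
independent = separate_connections([5.0, 1.0, 1.0])
```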

FWIW, our distributed search uses http over 120+ shards... and is
written in python.

That would be an awesome test case if you were able to use what Solr
is going to provide out-of-the-box.  Any unusual requirements?

The biggest point of customization is that we run two Solr instances in a single webapp, one for querying and one for highlighting. The highlighter instance uses a set of custom parameters to determine which docs to highlight (I imagine the current patch does something similar). Splitting the document content from the rest of the stored fields is a huge win. There is also a lot of custom deduplication and caching logic, but that could be done as a post-processing step.
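A rough sketch of that two-phase split, with entirely hypothetical data and function names: the query phase touches only ids and scores, and the large stored content is fetched in a second phase for just the docs being displayed:

```python
# Toy in-memory stand-ins for the two services described above.
QUERY_INDEX = {"Solr": [("doc1", 0.9), ("doc2", 0.7)]}
CONTENT_STORE = {"doc1": "Solr is a search server", "doc2": "Apache Solr"}

def query_phase(q, rows=10):
    # Cheap: returns (id, score) pairs; no large stored fields touched.
    return QUERY_INDEX.get(q, [])[:rows]

def highlight_phase(q, doc_ids):
    # Expensive content lives in a separate store and is read only for
    # the handful of docs chosen by the query phase. Uppercasing stands
    # in for real snippet highlighting.
    return {d: CONTENT_STORE[d].replace(q, q.upper())
            for d in doc_ids if d in CONTENT_STORE}

hits = query_phase("Solr")
snippets = highlight_phase("Solr", [d for d, _ in hits])
```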

In case anyone is thinking of building something this big, I'll mention that it is a bad idea to have a single node try to manage so many shards. It is preferable to go hierarchical (which could be accomplished relatively easily if the query distributor could itself query other query-distributor nodes).
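The hierarchical idea can be sketched as a two-level merge; all names and the scored-hit format here are hypothetical. Each mid-level distributor fans out to its own small group of shards and pre-merges, so the top node only merges a handful of result lists rather than talking to every shard directly:

```python
def query_shard(shard, q):
    # Fake scored hits: (shard, query, score) tuples, two per shard.
    return [(shard, q, score) for score in (3, 1)]

def mid_distributor(shards, q):
    # Merge results from this distributor's own shard group only.
    hits = [h for s in shards for h in query_shard(s, q)]
    return sorted(hits, key=lambda h: -h[2])

def top_distributor(groups, q):
    # The top node queries a few mid-level nodes, not all the shards,
    # and merges their already-merged lists.
    merged = [h for g in groups for h in mid_distributor(g, q)]
    return sorted(merged, key=lambda h: -h[2])

groups = [[f"s{i}{j}" for j in range(3)] for i in range(4)]  # 4 groups x 3 shards
top = top_distributor(groups, "solr")
```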

-Mike
