On 21-Sep-07, at 2:34 PM, Yonik Seeley wrote:
> On 9/21/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
>> On 21-Sep-07, at 11:08 AM, Yonik Seeley wrote:
>>> I wanted to take a step back for a second and think about if HTTP
>>> was really the right choice for the transport for distributed
>>> search. I think the high-level approach in SOLR-303 is the right
>>> way to go about it, but I'm unsure if HTTP is the right transport.
>> I don't know anything about RMI, but is it possible to do 100's of
>> simultaneous asynchronous requests cheaply?
> Good question... probably only important for really big clusters
> (like yours), but it would be nice.
> Even if we go HTTP, I'm not sure it will be async at first - does
> HTTPClient even support async?
I don't think so. In fact, I need to make a small amendment to my
original claim: the distribution code actually uses our internal rpc
(which is pure python), but the other end is a python client that
connects with Solr via http (persistent, localhost connection). I
wrote it this way because it was easier, as our internal rpc library
already has functionality for spitting out requests to 100's of
clients and collecting the results asynchronously. I figured that
connecting to Solr directly via http would be cheaper, but perhaps it
wouldn't be.
Both the rpc and http levels use connection-pooled persistent
connections.
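
For anyone curious, the general pattern (pooled persistent connections
plus a fan-out that collects results as they complete) would look
something like the sketch below in Java. This is just an illustration,
not our actual code (which is python); the class name, shard URLs, and
pool sizes are all made up:

    import java.util.*;
    import java.util.concurrent.*;
    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
    import org.apache.commons.httpclient.methods.GetMethod;

    // Illustrative only: fan a query out to many shards over pooled,
    // persistent HTTP connections and collect responses as they finish.
    public class ShardFanout {
        public static void main(String[] args) throws Exception {
            List<String> shardUrls = Arrays.asList(
                "http://shard1:8983/solr/select?q=foo",   // made-up hosts
                "http://shard2:8983/solr/select?q=foo");

            // One pooled HttpClient shared by all workers; HTTP/1.1
            // keep-alive makes the pooled connections persistent.
            MultiThreadedHttpConnectionManager mgr =
                new MultiThreadedHttpConnectionManager();
            mgr.getParams().setMaxTotalConnections(200);
            mgr.getParams().setDefaultMaxConnectionsPerHost(4);
            final HttpClient client = new HttpClient(mgr);

            // "Async" here is still thread-per-request: workers block on
            // I/O while the caller collects results in completion order.
            ExecutorService pool =
                Executors.newFixedThreadPool(shardUrls.size());
            CompletionService<String> done =
                new ExecutorCompletionService<String>(pool);
            for (final String url : shardUrls) {
                done.submit(new Callable<String>() {
                    public String call() throws Exception {
                        GetMethod get = new GetMethod(url);
                        try {
                            client.executeMethod(get);
                            return get.getResponseBodyAsString();
                        } finally {
                            get.releaseConnection(); // back to pool, stays open
                        }
                    }
                });
            }
            for (int i = 0; i < shardUrls.size(); i++) {
                String response = done.take().get();
                // merge 'response' into the combined result set here
            }
            pool.shutdown();
        }
    }

Note that each in-flight request still occupies a worker thread, which
brings us back to the async question.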
> I assume when you say async that you mean getting rid of the
> thread-per-connection via NIO. Some protocols do "async" by handing
> off the request to another thread to wait on the response and then do
> a callback to the original thread - this is async with respect to the
> original calling thread, but still requires a thread-per-connection.
Right; this helps but doesn't scale too far.
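
To spell out the distinction, here is a minimal sketch (hypothetical
names, with a stand-in for the actual HTTP call) of the hand-off style:
the caller never blocks, but one pool thread is still parked per
outstanding connection, so it scales with threads rather than with
NIO-style selectors:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Minimal sketch (hypothetical names) of hand-off "async": the
    // caller gets a callback instead of blocking, but one pool thread
    // is still tied up per outstanding connection.
    public class HandoffAsync {
        interface Callback { void onResponse(String body); }

        private final ExecutorService pool = Executors.newFixedThreadPool(100);

        public void asyncRequest(final String url, final Callback cb) {
            pool.execute(new Runnable() {
                public void run() {
                    String body = doBlockingGet(url); // thread waits here
                    cb.onResponse(body);              // "callback" on completion
                }
            });
        }

        private String doBlockingGet(String url) {
            // stand-in for a blocking HTTP round trip (see earlier sketch)
            return "...";
        }
    }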
>>> Of course HTTP has some issues too - you effectively need a separate
>>> connection per outstanding request. Pipelining won't work well
>>> because things need to come back in-order. I'm not sure if RMI has
>>> this limitation as well.
>> FWIW, our distributed search uses http over 120+ shards... and is
>> written in python.
> That would be an awesome test case if you were able to use what Solr
> is going to provide out-of-the-box. Any unusual requirements?
The biggest point of customization is that we run two Solrs in a
single webapp, one for querying and one for highlighting. The
highlighter Solr uses a set of custom parameters to determine the
docs to use (I imagine the current patch does something like this as
well). Splitting the content from the rest of the stored fields is a
huge win. There is also a lot of custom deduplication and caching
logic, but this could be done as a post-processing step.
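
To make the doc-selection part concrete, you can picture the
highlighter Solr receiving requests along these lines (the path and the
docIds parameter are invented for illustration; hl and hl.fl are
standard Solr highlighting parameters):

    GET /solr-highlight/select?q=foo+bar&docIds=1042,1187,2230&hl=true&hl.fl=content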
In case anyone is thinking of building something this huge, I'll
mention that it is a bad idea to have a single node try to manage so
many shards. It is preferable to go hierarchical (which could be
accomplished relatively easily if a query distributor could itself
query other query-distributor nodes).
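
A rough sketch of what I mean: if a distributor exposes the same query
interface as a shard, the tiers compose naturally. All names here are
hypothetical, not anything from SOLR-303:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: a distributor implements the same interface
    // as a shard, so distributors can be stacked into tiers.
    interface Searchable {
        List<String> query(String q);
    }

    class Shard implements Searchable {
        public List<String> query(String q) {
            // would issue the HTTP request to one Solr instance
            return new ArrayList<String>();
        }
    }

    class Distributor implements Searchable {
        private final List<Searchable> children; // shards OR distributors

        Distributor(List<Searchable> children) { this.children = children; }

        public List<String> query(String q) {
            List<String> merged = new ArrayList<String>();
            for (Searchable child : children) { // in parallel, in practice
                merged.addAll(child.query(q));
            }
            // re-sort and trim the merged results here
            return merged;
        }
    }

A top-level distributor would then see, say, a dozen mid-tier
distributors instead of 120+ shards.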
-Mike