Re: HTTP or RMI, Jini, JavaSpaces for distributed search

eks dev Sat, 22 Sep 2007 03:30:21 -0700

Yonik,
why not use Hadoop's IPC/RPC? We use one of its (pre)historic versions
with some small mods and it works like a charm. Nutch uses it as well. I would
say it is pretty clean and easy to use.


We started with RMI, but gave it up because of huge latency, in the range of
50ms per hop. For our use case (response times well under 100ms) that is far
too much. We spent a few weeks fighting with RMI, and I am not sure whether the
problem was RMI itself or us failing to make it work.

Another option may be to have a peek at the Apache MINA project, but then you
will have to figure out your own serialization (hmm, if you care about speed,
you will have to do that anyhow).



----- Original Message ----
From: Yonik Seeley <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Friday, 21 September, 2007 8:08:05 PM
Subject: HTTP or RMI, Jini, JavaSpaces for distributed search

I wanted to take a step back for a second and think about if HTTP was
really the right choice for the transport for distributed search.

I think the high-level approach in SOLR-303 is the right way to go
about it, but I'm unsure if HTTP is the right transport.

Pro HTTP:
  - using HTTP allows one to use an http load-balancer to distribute
load across multiple copies of the same shard by assigning a VIP
(virtual IP) to each shard.
  - because you do pretty much everything by hand, you know that there
isn't some hidden limitation that will jump out and bite you later.

Cons HTTP:
 - you end up doing everything by hand... connection handling, request
serialization, response parsing, etc...
 - goes through normal servlet channels... every sub-request will be
logged to the access logs, slowing things down.
 - more network bandwidth used unless we come up with a new
BinaryResponseWriter and Parser

Currently, SOLR-303 uses and parses the XML response format, which has
some serious downsides:
- response size limits scalability and how deep in responses you can go...
  If you want to retrieve documents 5000 through 5009, even though the
user only requested 10 documents, the top-level searcher needs to get
the top 5009 documents from *each* shard... and that can quickly
exhaust the network bandwidth of the NIC.  XML parsing on the order of
nShards*5009 entries won't be any picnic either.
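
To make the arithmetic above concrete, here is a minimal Java sketch of the
merge step. It is not the SOLR-303 code: the class, the ScoredDoc type and the
method names are all hypothetical, and it assumes a zero-based start offset, so
"documents 5000 through 5009" corresponds to start=4999, rows=10.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ShardMergeSketch {

    /** Hypothetical (id, score) pair; not a Solr class. */
    static final class ScoredDoc {
        final String id;
        final float score;
        ScoredDoc(String id, float score) { this.id = id; this.score = score; }
    }

    /** Entries each shard must return so that the merged page is correct. */
    static int perShardFetch(int start, int rows) {
        return start + rows;
    }

    /** Total entries the top-level searcher receives and has to parse. */
    static long totalEntries(int nShards, int start, int rows) {
        return (long) nShards * perShardFetch(start, rows);
    }

    /** Merge the per-shard lists by descending score and cut out the requested page. */
    static List<ScoredDoc> mergePage(List<List<ScoredDoc>> shardResults, int start, int rows) {
        List<ScoredDoc> all = new ArrayList<>();
        for (List<ScoredDoc> shard : shardResults) {
            all.addAll(shard);
        }
        all.sort(Comparator.comparingDouble((ScoredDoc d) -> d.score).reversed());
        int from = Math.min(start, all.size());
        int to = Math.min(start + rows, all.size());
        return new ArrayList<>(all.subList(from, to));
    }

    public static void main(String[] args) {
        // Assumed example: 4 shards, user asks for documents 5000 through 5009.
        System.out.println("per shard: " + perShardFetch(4999, 10));
        System.out.println("total:     " + totalEntries(4, 4999, 10));
    }
}
```

The sketch sorts everything for simplicity; a real merger would more likely use
a bounded priority queue, but the amount of data pulled from each shard is the
same either way.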

I'm thinking the load-balancing of HTTP is overrated also, because
it's inflexible.  Adding another shard requires adding another VIP in
the load-balancer, and changing which servers have which shards or
adding new copies of a shard also requires load-balancer
configuration.  Everything points to Solr being able to do the
load-balancing itself in the future, and there wouldn't seem to be
much benefit to using a load-balancer w/ VIPS for each shard vs having
Solr do it.

So even if we stuck with HTTP, Solr would need
 - a binary protocol to minimize network bandwidth use
 - load balancing across shard copies itself
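
As a rough illustration of the second point, load balancing across the copies
of one shard can be as simple as a round-robin picker. This is only a sketch
under assumptions: the class name and the replica address strings are made up,
and it ignores health checks and failover entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ShardReplicaPicker {
    // Addresses of the servers holding copies of one shard (hypothetical values).
    private final List<String> replicas;
    // Shared counter so concurrent requests spread across the copies.
    private final AtomicInteger next = new AtomicInteger();

    ShardReplicaPicker(List<String> replicas) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("need at least one replica");
        }
        this.replicas = new ArrayList<>(replicas);
    }

    /** Return the next replica in round-robin order. */
    String pick() {
        // floorMod keeps the index non-negative even after the counter overflows.
        int i = Math.floorMod(next.getAndIncrement(), replicas.size());
        return replicas.get(i);
    }

    public static void main(String[] args) {
        ShardReplicaPicker shard1 =
            new ShardReplicaPicker(Arrays.asList("host-a:8983", "host-b:8983"));
        for (int n = 0; n < 4; n++) {
            System.out.println(shard1.pick());
        }
    }
}
```

Adding a shard copy then means changing the list handed to the picker, not
reconfiguring a VIP on an external load-balancer.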

Given that, would it make sense to just go with RMI instead?
And perhaps leverage some other higher level services (Jini? JavaSpaces?)

I'd like to hear from people with more experience with RMI & friends,
and what the potential downsides are to using these technologies.

-Yonik




