On 7/3/2011 3:49 PM, Will Oberman wrote:
Why not send the value itself instead of a placeholder? Now it takes
2x writes on a random node to do a single update (write placeholder,
write update) and N*x writes from the client (write value, write
placeholder to N-1), where N is the replication factor. Seems like extra
network and IO instead of less...
Sending the value to each node 1.) is unnecessary and 2.) will only cause a
large burst of network traffic. Think about a large data value,
such as a document. Just let C* do its thing. The extra messages are
tiny and don't significantly increase latency since they are all sent
asynchronously.
Of course, I still think this sounds like reimplementing Cassandra
internals in a Cassandra client (just guessing, I'm not a Cassandra dev).
I don't see how. Maybe you should take a peek at the source.
On Jul 3, 2011, at 5:20 PM, AJ <a...@dude.podzone.net> wrote:
Yang,
How would you deal with the problem when the 1st node responds with
success but then crashes before it has fully forwarded the write to any
of the other replicas? Then, after switching to the next primary, a read
would return stale data.
Here's a quick-n-dirty way: Send the value to the primary replica
and send placeholder values to the other replicas. The placeholder
value is something like, "PENDING_UPDATE". The placeholder values
are sent with timestamps 1 less than the timestamp for the actual
value that went to the primary. Later, when the changes propagate,
the actual values will overwrite the placeholders. In the event of a
crash before the placeholder gets overwritten, the next read will
return the placeholder and so tell the client. The client will report
to the user that the key/column is unavailable. The downside is you've overwritten your
data and maybe would like to know what the old data was! But, maybe
there's another way using other columns or with MVCC. The client
would want a success from the primary and the secondary replicas to
be certain of future read consistency in case the primary goes down
immediately as I said above. The ability to set an "update_pending"
flag on any column value would probably make this work. But, I'll
think more on this later.
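
A rough sketch of the placeholder idea described above, as a toy in-memory model rather than anything that talks to a real cluster. The names `Replica`, `client_write`, `propagate` and `client_read` are hypothetical, and conflict resolution is simplified to "highest timestamp wins":

```python
PENDING = "PENDING_UPDATE"  # placeholder value from the description above


class Replica:
    """Toy replica: one cell per key, resolved by highest timestamp wins."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        current = self.store.get(key)
        # Only accept the write if it is newer than what we already hold.
        if current is None or ts > current[0]:
            self.store[key] = (ts, value)

    def read(self, key):
        return self.store.get(key)


def client_write(primary, others, key, value, ts):
    """Send the real value to the primary and placeholders (ts - 1) elsewhere."""
    primary.write(key, value, ts)
    for replica in others:
        replica.write(key, PENDING, ts - 1)


def propagate(primary, others, key):
    """Stand-in for normal replication: the real value reaches the others."""
    ts, value = primary.store[key]
    for replica in others:
        replica.write(key, value, ts)  # ts beats ts - 1, so it overwrites


def client_read(replica, key):
    """Return the value, or fail loudly if only the placeholder is visible."""
    cell = replica.read(key)
    if cell is None:
        return None
    if cell[1] == PENDING:
        raise LookupError(f"{key!r} has an update pending on {replica.name}")
    return cell[1]
```

If the primary dies before `propagate` runs, `client_read` against any other replica raises instead of silently returning the old value, which is the behaviour described above. It also shows the stated downside: the old value is gone from those replicas.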
aj
On 7/2/2011 10:55 AM, Yang wrote:
there is a JIRA completed in 0.7.x that "Prefers" a certain node in
snitch, so this does roughly what you want MOST of the time
but the problem is that it does not GUARANTEE that the same node
will always be read. I recently read into the HBase vs Cassandra
comparison thread that started after Facebook dropped Cassandra for
their messaging system, and understood some of the differences. what
you want is essentially what HBase does. the fundamental difference
there is really due to the gossip protocol: it's a probabilistic, or
eventually consistent failure detector while HBase/Google Bigtable
use Zookeeper/Chubby to provide a strong failure detector (a
distributed lock). so in HBase, if a tablet server goes down, it
really goes down, it can not re-grab the tablet from the new tablet
server without going through a start up protocol (notifying the
master, which would notify the clients etc), in other words it is
guaranteed that one tablet is served by only one tablet server at
any given time. in comparison the above JIRA only TRIES to serve
that key from one particular replica. HBase can have that guarantee
because the group membership is maintained by the strong failure
detector.
just for hacking curiosity, a strong failure detector + Cassandra
replicas is not impossible (actually seems not difficult), although
the performance is not clear. what would such a strong failure
detector bring to Cassandra besides this ONE-ONE strong
consistency? that is an interesting question I think.
considering that HBase has been deployed on big clusters, it is
probably OK with the performance of the strong Zookeeper failure
detector. then a further question was: why did Dynamo originally
choose to use the probabilistic failure detector? yes Dynamo's main
theme is "eventually consistent", so the Phi-detector is **enough**,
but if a strong detector buys us more with little cost, wouldn't
that be great?
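
To make "probabilistic" concrete, here is a minimal sketch of an accrual-style (phi) detector. It is an illustration only, not Cassandra's actual code: the class name `AccrualDetector` is hypothetical, and it assumes heartbeat gaps are exponentially distributed, so that P(no heartbeat for t seconds) = exp(-t / mean) and phi = -log10 of that probability.

```python
import math
from collections import deque


class AccrualDetector:
    """Toy accrual failure detector for a single peer (illustrative only)."""

    def __init__(self, window=1000):
        self.intervals = deque(maxlen=window)  # recent heartbeat gaps
        self.last = None                       # time of the last heartbeat

    def heartbeat(self, now):
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now):
        """Suspicion level: grows the longer the peer stays silent."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last
        # -log10(exp(-elapsed / mean)) == elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))
```

The output is a suspicion score that each node compares against its own threshold, so two nodes can disagree for a while about whether a third is down. A lock-based detector in the Zookeeper/Chubby style gives a single yes/no answer that everyone shares, which is the difference being discussed here.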
On Fri, Jul 1, 2011 at 6:53 PM, AJ <a...@dude.podzone.net> wrote:
Is this possible?
All reads and writes for a given key will always go to the same
node from a client. It seems the only thing needed is to allow
the clients to compute which node is the closest replica for the
given key using the same algorithm C* uses. When the first
replica receives the write request, it will write to itself
which should complete before any of the other replicas and then
return. The loads should still stay balanced if using random
partitioner. If the first replica becomes unavailable (however
that is defined), then the clients can send to the next replica
in the ring and switch from ONE writes/reads to QUORUM
write/reads temporarily until the first replica becomes
available again. QUORUM is required since there could be some
replicas that were not updated after the first replica went down.
Will this work? The goal is to have strong consistency with a
read/write consistency level as low as possible and,
secondarily, to gain a network performance boost.
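
The routing rule proposed above could be sketched roughly as follows. This is a simplified model, not C*'s real placement code: `Ring`, `token` and `route` are hypothetical names, tokens are plain MD5 integers, and replicas are simply the next `rf` nodes clockwise on the ring.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Hash a key onto the ring (MD5, loosely in the spirit of random partitioning)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class Ring:
    """Toy token ring: replicas for a key are the next `rf` nodes clockwise."""

    def __init__(self, node_tokens, rf):
        # node_tokens: dict of node name -> token owned by that node
        self.entries = sorted((t, n) for n, t in node_tokens.items())
        self.tokens = [t for t, _ in self.entries]
        self.rf = rf

    def replicas(self, key):
        n = len(self.entries)
        # First node whose token is >= the key's token, wrapping around.
        i = bisect.bisect_left(self.tokens, token(key)) % n
        return [self.entries[(i + k) % n][1] for k in range(self.rf)]


def route(ring, key, is_up):
    """Pick a target node and consistency level for one request.

    Primary up: talk to it at ONE. Primary down: fall back to the next
    live replica and use QUORUM, since some replicas may have missed writes.
    """
    replicas = ring.replicas(key)
    if is_up(replicas[0]):
        return replicas[0], "ONE"
    for node in replicas[1:]:
        if is_up(node):
            return node, "QUORUM"
    raise RuntimeError(f"no live replica for {key!r}")
```

Note that `is_up` is the weak point: each client decides for itself whether the first replica is available, so two clients can disagree and route the same key to different nodes at ONE.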