I'm using Cassandra as a tool, like a black box with a certain contract to the world. Without modifying the "core", C* will send the updates to all replicas anyway, so your plan would still cause the extra write (for the placeholder). I wasn't assuming a modification to how C* fundamentally works.
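For concreteness, here is roughly what that client-side plan looks like to me, as a sketch in Python. The node/client objects and their insert/get calls are hypothetical stand-ins for whatever client library is actually in use, not a real Cassandra API; only the timestamp trick matters.

import time

PLACEHOLDER = "PENDING_UPDATE"

def write_with_placeholders(primary, secondaries, key, column, value):
    # Microsecond timestamp for the real value sent to the primary replica...
    ts = int(time.time() * 1000000)
    primary.insert(key, column, value, timestamp=ts)
    # ...and a timestamp exactly 1 less for the placeholders, so the real
    # value wins once C* propagates it to the other replicas on its own.
    for node in secondaries:
        node.insert(key, column, PLACEHOLDER, timestamp=ts - 1)

def read_column(client, key, column):
    stored = client.get(key, column)
    if stored == PLACEHOLDER:
        # The primary crashed before forwarding the real value; report the
        # key/column as unavailable instead of returning stale data.
        raise RuntimeError("%s/%s is unavailable (update pending)" % (key, column))
    return stored

The per-placeholder writes in that loop are exactly the overhead I mean: C* would have forwarded the real value to those replicas on its own anyway.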
Sounds like you are hacking at (or at least looking at) the source, so all the power to you if/when you try these kinds of changes.

will

On Sun, Jul 3, 2011 at 8:45 PM, AJ <a...@dude.podzone.net> wrote:
> On 7/3/2011 6:32 PM, William Oberman wrote:
>
> Was just going off of: "Send the value to the primary replica and send placeholder values to the other replicas". Sounded like you wanted to write the value to one, and write the placeholder to N-1, to me.
>
> Yes, that is what I was suggesting. The point of the placeholders is to handle the crash case that I talked about... "like" a WAL does.
>
> But, C* will propagate the value to N-1 eventually anyways, 'cause that's just what it does anyways :-)
>
> will
>
> On Sun, Jul 3, 2011 at 7:47 PM, AJ <a...@dude.podzone.net> wrote:
>> On 7/3/2011 3:49 PM, Will Oberman wrote:
>>
>> Why not send the value itself instead of a placeholder? Now it takes 2x writes on a random node to do a single update (write placeholder, write update) and N*x writes from the client (write value, write placeholder to N-1), where N is the replication factor. Seems like extra network and IO instead of less...
>>
>> To send the value to each node is 1.) unnecessary and 2.) will only cause a large burst of network traffic. Think about if it's a large data value, such as a document. Just let C* do its thing. The extra messages are tiny and don't significantly increase latency since they are all sent asynchronously.
>>
>> Of course, I still think this sounds like reimplementing Cassandra internals in a Cassandra client (just guessing, I'm not a Cassandra dev).
>>
>> I don't see how. Maybe you should take a peek at the source.
>>
>> On Jul 3, 2011, at 5:20 PM, AJ <a...@dude.podzone.net> wrote:
>>
>> Yang,
>>
>> How would you deal with the problem where the 1st node responds success but then crashes before completely forwarding any replicas? Then, after switching to the next primary, a read would return stale data.
>>
>> Here's a quick-n-dirty way: Send the value to the primary replica and send placeholder values to the other replicas. The placeholder value is something like "PENDING_UPDATE". The placeholder values are sent with timestamps 1 less than the timestamp for the actual value that went to the primary. Later, when the changes propagate, the actual values will overwrite the placeholders. In the event of a crash before the placeholder gets overwritten, the next read will tell the client so, and the client will report to the user that the key/column is unavailable. The downside is you've overwritten your data and maybe would like to know what the old data was! But maybe there's another way, using other columns or with MVCC. The client would want a success from the primary and the secondary replicas to be certain of future read consistency in case the primary goes down immediately, as I said above. The ability to set an "update_pending" flag on any column value would probably make this work. But I'll think more on this later.
>>
>> aj
>>
>> On 7/2/2011 10:55 AM, Yang wrote:
>>
>> There is a JIRA completed in 0.7.x that "prefers" a certain node in the snitch, so this does roughly what you want MOST of the time,
>>
>> but the problem is that it does not GUARANTEE that the same node will always be read.
>> I recently read into the HBase vs Cassandra comparison thread that started after Facebook dropped Cassandra for their messaging system, and understood some of the differences. What you want is essentially what HBase does. The fundamental difference there is really due to the gossip protocol: it's a probabilistic, or eventually consistent, failure detector, while HBase/Google Bigtable use Zookeeper/Chubby to provide a strong failure detector (a distributed lock). So in HBase, if a tablet server goes down, it really goes down; it cannot re-grab the tablet from the new tablet server without going through a start-up protocol (notifying the master, which would notify the clients, etc.). In other words, it is guaranteed that one tablet is served by only one tablet server at any given time. In comparison, the above JIRA only TRIES to serve that key from one particular replica. HBase can have that guarantee because the group membership is maintained by the strong failure detector.
>>
>> Just for hacking curiosity: a strong failure detector + Cassandra replicas is not impossible (actually it seems not difficult), although the performance is not clear. What would such a strong failure detector bring to Cassandra besides this ONE-ONE strong consistency? That is an interesting question, I think.
>>
>> Considering that HBase has been deployed on big clusters, it is probably OK with the performance of the strong Zookeeper failure detector. Then a further question is: why did Dynamo originally choose to use the probabilistic failure detector? Yes, Dynamo's main theme is "eventually consistent", so the Phi-detector is **enough**, but if a strong detector buys us more with little cost, wouldn't that be great?
>>
>> On Fri, Jul 1, 2011 at 6:53 PM, AJ <a...@dude.podzone.net> wrote:
>>> Is this possible?
>>>
>>> All reads and writes for a given key will always go to the same node from a client. It seems the only thing needed is to allow the clients to compute which node is the closest replica for the given key, using the same algorithm C* uses. When the first replica receives the write request, it will write to itself, which should complete before any of the other replicas, and then return. The loads should still stay balanced if using the random partitioner. If the first replica becomes unavailable (however that is defined), then the clients can send to the next replica in the ring and switch from ONE writes/reads to QUORUM writes/reads temporarily, until the first replica becomes available again. QUORUM is required since there could be some replicas that were not updated after the first replica went down.
>>>
>>> Will this work? The goal is to have strong consistency with a read/write consistency level as low as possible, with a network performance boost as a secondary goal.

--
Will Oberman
Civic Science, Inc.
3030 Penn Avenue, First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com
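As an aside on the "strong failure detector" idea quoted above: the usual ZooKeeper building block for strong membership is an ephemeral node, which ZooKeeper deletes itself when the owning session dies. A minimal sketch using the kazoo Python client; the paths, the one-lease-per-token-range idea, and the node IDs are illustrative assumptions, not anything Cassandra or HBase actually does in this form.

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError, NoNodeError

zk = KazooClient(hosts="127.0.0.1:2181")   # example address
zk.start()

LEASE = "/leases/range-42"   # hypothetical: one lease znode per token range

def try_acquire_lease(node_id):
    # An ephemeral znode disappears when its owner's session dies, so
    # "holds the lease" and "is alive" become the same fact: that is the
    # strong, lock-like failure detector being discussed.
    try:
        zk.create(LEASE, node_id.encode(), ephemeral=True, makepath=True)
        return True
    except NodeExistsError:
        return False   # some other replica is already serving this range

def current_leaseholder():
    # Clients ask ZooKeeper which replica is allowed to serve this range now.
    try:
        data, _stat = zk.get(LEASE)
        return data.decode()
    except NoNodeError:
        return None    # holder just died; a surviving replica should acquire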
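And for the original question at the bottom of the thread, the "let the client compute the closest replica" part could look roughly like this, assuming RandomPartitioner-style MD5 tokens and SimpleStrategy placement (replicas are the next distinct nodes clockwise on the ring). The ring and addresses below are made up; a real client would pull the token map from the cluster.

import hashlib
from bisect import bisect_left

RING = {               # made-up token -> node map
    0: "10.0.0.1",
    2**125: "10.0.0.2",
    2**126: "10.0.0.3",
}

def token_for(key):
    # RandomPartitioner-style token: MD5 of the key folded into 0..2**127.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**127)

def replicas_for(key, rf=3):
    tokens = sorted(RING)
    # Primary replica: first node whose token is >= the key's token,
    # wrapping around the ring; the next rf-1 nodes clockwise are the rest.
    start = bisect_left(tokens, token_for(key)) % len(tokens)
    return [RING[tokens[(start + i) % len(tokens)]] for i in range(min(rf, len(tokens)))]

# ONE reads/writes go to replicas_for(key)[0]; if that node is unreachable,
# fall back to the next replica and QUORUM until the preferred one returns,
# as described in the original question.
primary = replicas_for("some_row_key")[0]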