Was just going off of: "Send the value to the primary replica and send placeholder values to the other replicas". Sounded to me like you wanted to write the value to one, and write the placeholder to the other N-1. But, C* will propagate the value to the N-1 eventually anyway, 'cause that's just what it does :-)
will

On Sun, Jul 3, 2011 at 7:47 PM, AJ <a...@dude.podzone.net> wrote:

> On 7/3/2011 3:49 PM, Will Oberman wrote:
>
>> Why not send the value itself instead of a placeholder? Now it takes 2x writes on a random node to do a single update (write placeholder, write update) and N writes from the client (write value, write placeholder to N-1), where N is the replication factor. Seems like extra network and IO instead of less...
>
> To send the value to each node is 1) unnecessary and 2) will only cause a large burst of network traffic. Think about if it's a large data value, such as a document. Just let C* do its thing. The extra messages are tiny and don't significantly increase latency since they are all sent asynchronously.
>
>> Of course, I still think this sounds like reimplementing Cassandra internals in a Cassandra client (just guessing, I'm not a Cassandra dev).
>
> I don't see how. Maybe you should take a peek at the source.
>
>> On Jul 3, 2011, at 5:20 PM, AJ <a...@dude.podzone.net> wrote:
>>
>>> Yang,
>>>
>>> How would you deal with the problem where the 1st node responds with success but then crashes before completely forwarding any replicas? Then, after switching to the next primary, a read would return stale data.
>>>
>>> Here's a quick-n-dirty way: send the value to the primary replica and send placeholder values to the other replicas. The placeholder value is something like "PENDING_UPDATE". The placeholder values are sent with timestamps 1 less than the timestamp of the actual value that went to the primary. Later, when the changes propagate, the actual values will overwrite the placeholders. In the event of a crash before a placeholder gets overwritten, the next read will tell the client so, and the client will report to the user that the key/column is unavailable. The downside is that you've overwritten your data and maybe would like to know what the old data was! But maybe there's another way, using other columns or MVCC. The client would want a success from the primary and the secondary replicas to be certain of future read consistency in case the primary goes down immediately, as I said above. The ability to set an "update_pending" flag on any column value would probably make this work. But I'll think more on this later.
>>>
>>> aj
>>>
>>> On 7/2/2011 10:55 AM, Yang wrote:
>>>
>>>> There is a JIRA completed in 0.7.x that "prefers" a certain node in the snitch, so this does roughly what you want MOST of the time.
>>>>
>>>> But the problem is that it does not GUARANTEE that the same node will always be read. I recently read the HBase vs. Cassandra comparison thread that started after Facebook dropped Cassandra for their messaging system, and understood some of the differences. What you want is essentially what HBase does. The fundamental difference there is really due to the gossip protocol: it's a probabilistic, or eventually consistent, failure detector, while HBase/Google Bigtable use ZooKeeper/Chubby to provide a strong failure detector (a distributed lock). So in HBase, if a tablet server goes down, it really goes down; it cannot re-grab the tablet from the new tablet server without going through a startup protocol (notifying the master, which would notify the clients, etc.). In other words, it is guaranteed that one tablet is served by only one tablet server at any given time. In comparison, the above JIRA only TRIES to serve that key from one particular replica.
>>>> HBase can have that guarantee because the group membership is maintained by the strong failure detector.
>>>>
>>>> Just for hacking curiosity, a strong failure detector + Cassandra replicas is not impossible (actually it seems not difficult), although the performance is not clear. What would such a strong failure detector bring to Cassandra besides this ONE-ONE strong consistency? That is an interesting question, I think.
>>>>
>>>> Considering that HBase has been deployed on big clusters, the performance of the strong ZooKeeper failure detector is probably OK. Then a further question: why did Dynamo originally choose the probabilistic failure detector? Yes, Dynamo's main theme is "eventually consistent", so the Phi detector is **enough**, but if a strong detector buys us more at little cost, wouldn't that be great?
>>>>
>>>> On Fri, Jul 1, 2011 at 6:53 PM, AJ <a...@dude.podzone.net> wrote:
>>>>
>>>>> Is this possible?
>>>>>
>>>>> All reads and writes for a given key will always go to the same node from a client. It seems the only thing needed is to allow the clients to compute which node is the closest replica for the given key, using the same algorithm C* uses. When the first replica receives the write request, it will write to itself, which should complete before any of the other replicas do, and then return. The loads should still stay balanced if using the random partitioner. If the first replica becomes unavailable (however that is defined), then the clients can send to the next replica in the ring and switch from ONE writes/reads to QUORUM writes/reads temporarily, until the first replica becomes available again. QUORUM is required since there could be some replicas that were not updated after the first replica went down.
>>>>>
>>>>> Will this work? The goal is to have strong consistency with a read/write consistency level as low as possible, with a network performance boost as a secondary goal.

--
Will Oberman
Civic Science, Inc.
3030 Penn Avenue, First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com
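For concreteness, AJ's original proposal (the innermost quote above) is essentially client-side token-aware routing. A minimal Python sketch of computing the replica list the way C* does with the RandomPartitioner and SimpleStrategy, plus the ONE-to-QUORUM fallback; the RING layout and the client.write(node, ..., consistency=...) call are assumptions invented for illustration:

    import hashlib
    from bisect import bisect_left

    # Assumed ring for illustration: sorted (token, node) pairs. RandomPartitioner
    # tokens live in [0, 2**127); SimpleStrategy places replicas on the next RF
    # distinct nodes walking clockwise from the key's token.
    RING = [
        (0, "10.0.0.1"),
        (2**127 // 3, "10.0.0.2"),
        (2 * 2**127 // 3, "10.0.0.3"),
    ]
    RF = 2

    def key_token(key):
        # RandomPartitioner derives the token from an MD5 hash of the key.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % 2**127

    def replicas_for(key):
        tokens = [t for t, _ in RING]
        i = bisect_left(tokens, key_token(key)) % len(RING)  # successor, wrapping
        return [RING[(i + j) % len(RING)][1] for j in range(RF)]

    def write(client, key, value):
        prefs = replicas_for(key)
        try:
            client.write(prefs[0], key, value, consistency="ONE")
        except ConnectionError:
            # Primary unreachable: fall back to the next replica at QUORUM, since
            # some replicas may have missed writes while the primary was dying.
            client.write(prefs[1], key, value, consistency="QUORUM")

Reads would use the same replicas_for() ordering, which is what keeps a given key's traffic pinned to one node and load balanced across the ring overall.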
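AJ's "quick-n-dirty" placeholder scheme could be sketched roughly like this. The per-replica helpers (get_replicas, write_column, read_column_one) are hypothetical names made up for illustration; real Cassandra clients route writes through a coordinator and don't expose direct per-replica writes, so treat this as a sketch of the idea, not a working recipe:

    import time

    # Hypothetical helpers, named for illustration only:
    #   get_replicas(key)    -> ordered replica list, primary first
    #   write_column(node, key, column, value, timestamp) -> write to one node
    #   read_column_one(key, column) -> read at CL.ONE from the preferred replica
    from hypothetical_client import get_replicas, write_column, read_column_one

    PLACEHOLDER = "PENDING_UPDATE"

    def placeholder_write(key, column, value):
        replicas = get_replicas(key)
        primary, others = replicas[0], replicas[1:]
        ts = int(time.time() * 1e6)  # microseconds, the usual C* convention

        # The real value goes to the primary with timestamp ts.
        write_column(primary, key, column, value, ts)

        # Tiny placeholders go to the other N-1 replicas with timestamp ts - 1,
        # so the real value wins once C* propagates it (last-write-wins).
        for node in others:
            write_column(node, key, column, PLACEHOLDER, ts - 1)

    def safe_read(key, column):
        value = read_column_one(key, column)
        if value == PLACEHOLDER:
            # The primary died before propagating; report "unavailable" rather
            # than returning stale data.
            raise RuntimeError("%s/%s: update pending, unavailable" % (key, column))
        return value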
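And for Yang's strong-failure-detector musing: ZooKeeper provides roughly this through ephemeral znodes and watches. A minimal sketch using the kazoo Python client; the /replicas znode layout and addresses are assumptions for illustration:

    from kazoo.client import KazooClient

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()

    # Each server registers an ephemeral znode; if its ZooKeeper session dies,
    # the znode disappears, so liveness is decided by ZooKeeper, not by gossip.
    zk.create("/replicas/10.0.0.1", b"alive", ephemeral=True, makepath=True)

    def on_change(event):
        # Fires once when the watched znode is created, deleted, or changed.
        print("replica membership changed:", event)

    # Check a peer and leave a watch in one call. Because membership is agreed
    # on through ZooKeeper, only one server can "own" a key range at a time,
    # which is the guarantee HBase gets for its tablets.
    if zk.exists("/replicas/10.0.0.2", watch=on_change) is None:
        print("10.0.0.2 is down; its range can be reassigned safely")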