> This means that nodes in cassandra cluster contain data that has been
> sharded onto several nodes as well as this sharded data may be
> replicated further across several nodes? So cassandra storage
> utilizes both sharded as well as replication for load balancing? Is
> this correct?

Yes, sort of (though depending on your definition of sharding, that description might be slightly misleading).

In short, replicas (copies of data) are placed on a number of nodes (determined by the replication factor, or RF). The nodes participate in a ring, where each node has a token associated with it. The row key (see http://wiki.apache.org/cassandra/DataModel), together with the so-called 'replication strategy', determines which nodes in the ring should hold replicas of the data.

I just realized that I couldn't find a wiki page or a section of the Riptano docs that explains the DHT ring and its implications from a top-down perspective. Am I missing something, or is this something that should be written, anyone?

Probably the best resource on this that I have found is:

http://www.riptano.com/docs/0.6/operations/clustering

If you are interested in the reasoning behind it, I strongly recommend the Amazon Dynamo paper:

http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf

Cassandra does not implement exactly what is described there, but it is strongly inspired by it.

-- 
/ Peter Schuller
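
P.S. To make the ring idea a bit more concrete, here is a rough Python sketch of how a row key could map to replica nodes. This is not Cassandra's code. It assumes an MD5-based hash into a token range of [0, 2**127) and a simple rack-unaware placement that picks the first node whose token is at or after the key's token and then walks the ring clockwise for the remaining replicas. The node names, the tokens, and the function names (key_token, replicas_for) are all made up for illustration; real replication strategies can place replicas differently.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # assumed token range for this sketch


def key_token(row_key: str) -> int:
    """Map a row key to a position on the ring (assumption: MD5-based hash)."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % RING_SIZE


def replicas_for(token: int, ring: dict, rf: int) -> list:
    """Return the nodes holding replicas for a token.

    ring maps each node's token to its (hypothetical) node name.
    The first replica is the first node whose token is >= the given token,
    wrapping around to the lowest token if there is none. The remaining
    replicas are the next nodes clockwise on the ring.
    """
    tokens = sorted(ring)
    start = bisect.bisect_left(tokens, token) % len(tokens)
    count = min(rf, len(tokens))  # cannot have more replicas than nodes
    return [ring[tokens[(start + i) % len(tokens)]] for i in range(count)]


# Hypothetical four-node ring with evenly spaced tokens.
RING = {
    0: "node-a",
    2 ** 125: "node-b",
    2 ** 126: "node-c",
    3 * 2 ** 125: "node-d",
}

if __name__ == "__main__":
    token = key_token("user:42")
    print(token, replicas_for(token, RING, rf=3))
```

With RF=3 on this four-node ring, every row lives on three of the four nodes, so each node holds a slice of the data (the "sharding" part) and each slice exists in several copies (the replication part).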
