There are two numbers to look at: N, the number of hosts in the ring (cluster), and R, the number of replicas for each data item. R is configurable per keyspace. Typically for large clusters N >> R, so each node holds only a fraction (roughly R/N) of the total data rather than a full copy. For very small clusters it makes sense for R to be close to N; in that case Cassandra is useful because the database doesn't have a single point of failure, but not so much because of the size of the data. For large clusters it rarely makes sense to have R = N; usually N >> R.
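To make the arithmetic concrete, here is a minimal sketch of the per-node storage estimate under even data distribution (the function name and figures are illustrative, not from the thread):

```python
def per_node_storage(total_data_tb, n_hosts, replicas):
    """Rough per-node storage under even distribution:
    the total data is stored `replicas` times, spread over `n_hosts` nodes.
    (Illustrative helper; ignores compaction overhead and imbalance.)"""
    return total_data_tb * replicas / n_hosts

# 5 TB of data, 10 nodes, replication factor 3:
print(per_node_storage(5, 10, 3))  # -> 1.5 (TB per node)

# Small cluster with R = N: every node holds a full copy.
print(per_node_storage(5, 5, 5))   # -> 5.0 (TB per node)
```

So in the large-cluster case each node only needs disk for about R/N of the total, which is what makes scaling on commodity hardware work.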
On Thu, Dec 9, 2010 at 12:28 PM, Jonathan Colby <jonathan.co...@gmail.com> wrote:

> I have a very basic question which I have been unable to find in the
> online documentation on Cassandra.
>
> It seems like every node in a Cassandra cluster contains all the data
> ever stored in the cluster (i.e., all nodes are identical). I don't
> understand how you can scale this on commodity servers with merely
> internal hard disks. In other words, if I want to store 5 TB of
> data, does each node need a hard disk capacity of 5 TB?
>
> With HBase, memcached and other NoSQL solutions it is clearer how
> data is split up in the cluster and replicated for fault tolerance.
> Again, please excuse the rather basic question.

-- /Ran