Thanks Ran. This helps a little, but unfortunately it's still a bit fuzzy for me. So is it not true that each node contains all the data in the cluster? I haven't come across any information on how clustered data is coordinated in Cassandra. How does my query get directed to the right node?
On Thu, Dec 9, 2010 at 11:35 AM, Ran Tavory <ran...@gmail.com> wrote:
> There are two numbers to look at: N, the number of hosts in the ring
> (cluster), and R, the number of replicas for each data item. R is configurable
> per column family.
> Typically, for large clusters N >> R. For very small clusters it makes sense
> for R to be close to N, in which case Cassandra is useful so that the database
> doesn't have a single point of failure, but not so much because of the
> size of the data. But for large clusters it rarely makes sense to have N=R;
> usually N >> R.
>
> On Thu, Dec 9, 2010 at 12:28 PM, Jonathan Colby <jonathan.co...@gmail.com>
> wrote:
>>
>> I have a very basic question whose answer I have been unable to find in the
>> online documentation on Cassandra.
>>
>> It seems like every node in a Cassandra cluster contains all the data
>> ever stored in the cluster (i.e., all nodes are identical). I don't
>> understand how you can scale this on commodity servers with merely
>> internal hard disks. In other words, if I want to store 5 TB of
>> data, does each node need a hard disk capacity of 5 TB?
>>
>> With HBase, memcached and other NoSQL solutions it is clearer how
>> data is split up in the cluster and replicated for fault tolerance.
>> Again, please excuse the rather basic question.
>
> --
> /Ran
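
To make the N versus R point in the quoted reply concrete, here is a minimal sketch of a hash ring in Python. It is not Cassandra's actual implementation: the names `token`, `Ring`, `replicas_for` and `per_node_tb` are hypothetical, MD5 is used only as a convenient hash, and real partitioners and replication strategies are more involved. The assumption is that each key hashes to a position on the ring, the first host at or after that position holds it, and the next R-1 hosts clockwise hold the other replicas. Under that assumption any node can work out which hosts own a key and forward the request, and each node stores roughly (data size x R) / N rather than the whole data set.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Hash a string onto the ring (MD5 here, purely for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical ring of N hosts where each key is stored on R of them."""

    def __init__(self, hosts, replicas):
        self.replicas = replicas
        # Each host owns one position on the ring; keep them sorted by token.
        self.entries = sorted((token(h), h) for h in hosts)
        self.tokens = [t for t, _ in self.entries]

    def replicas_for(self, key):
        """Return the R hosts responsible for `key`.

        The first host at or after the key's token is the primary; the
        next R-1 hosts clockwise hold the remaining replicas.
        """
        n = len(self.entries)
        start = bisect.bisect_left(self.tokens, token(key)) % n
        return [self.entries[(start + i) % n][1] for i in range(self.replicas)]


def per_node_tb(total_tb, n_hosts, replicas):
    """Approximate storage per node, assuming keys spread evenly."""
    return total_tb * replicas / n_hosts


ring = Ring([f"node{i}" for i in range(10)], replicas=3)
owners = ring.replicas_for("user:42")
print(owners)                  # three distinct hosts out of the ten
print(per_node_tb(5, 10, 3))   # 1.5 (TB per node for N=10, R=3)
print(per_node_tb(5, 3, 3))    # 5.0 (only when N == R does each node hold everything)
```

With the 5 TB example from the original question, N=10 and R=3 gives about 1.5 TB per node; only in the N=R case does every node need the full 5 TB.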