There are two numbers to look at: N, the number of hosts in the ring
(cluster), and R, the number of replicas for each data item. R is configurable
per column family.
Typically for large clusters N >> R. For very small clusters it makes sense
for R to be close to N; in that case Cassandra is useful because the database
doesn't have a single point of failure, but not so much because of the
size of the data. For large clusters it rarely makes sense to have N = R;
usually N >> R.
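To make that concrete for the 5 TB question below, here's a back-of-the-envelope sketch (the N and R values are assumed, not from the original thread): each item is stored on R of the N nodes, so the cluster holds R copies total, spread across all N nodes.

```python
# Per-node storage when each item is replicated to R of N nodes,
# rather than every node holding the full data set.
total_data_tb = 5.0   # total logical data size (from the question below)
N = 10                # hosts in the ring (assumed example value)
R = 3                 # replicas per data item (assumed example value)

# The cluster stores R copies of the data in total, distributed
# roughly evenly across the N nodes.
per_node_tb = total_data_tb * R / N
print(per_node_tb)  # 1.5 TB per node, not 5 TB
```

So each node only needs disk for its share of the replicated data, which is why N >> R scales on commodity hardware.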

On Thu, Dec 9, 2010 at 12:28 PM, Jonathan Colby <jonathan.co...@gmail.com>wrote:

> I have a very basic question which I have been unable to find in
> online documentation on cassandra.
>
> It seems like every node in a cassandra cluster contains all the data
> ever stored in the cluster (i.e., all nodes are identical).  I don't
> understand how you can scale this on commodity servers with merely
> internal hard disks.   In other words, if I want to store 5 TB of
> data, does that each node need a hard disk capacity of 5 TB??
>
> With HBase, memcached and other nosql solutions it is more clear how
> data is spilt up in the cluster and replicated for fault tolerance.
> Again, please excuse the rather basic question.
>



-- 
/Ran
