It's a difficult questions to answer in the abstract. Some thoughts...

Scaling by adding one node at  time is not optimal. The best case scenario is 
to double the number of nodes, as this means existing nodes only have to stream 
their data to a new node. Obviously this is not always possible. When adding 
less nodes existing nodes keeping a balanced ring may mean streams data to 
other nodes and accepting data from nodes.  

In general try to keep the data volume < 50% full, so there is lots of free 
space to do moves. 

In general nodes with a few 100GB's of data are easiest to manage. 

The pending column in nodetool tpstats will let you know how many read or write 
requests are waiting to be serviced. If this is consistently above 
concurrent_reads or concurrent_writes it means there is a queue > 1 for each 
thread. This will add to request latency, once maximum throughput is reached 
additional requests will queue. See the SEDA paper.  

Sometime in the 0.7 dev client connection pooling was added to better manage 
those resources. See cassandra.yaml for info.  

The o.a.c.db.StorageProxy JMX MBean provides a latency trackers for total 
request time including wait times. And o.a.c.db.ColumnFamiles... provides 
latency trackers for the local node operations to do the read or write.

if your data set grows quickly watch the disk space etc. If you do a lot of 
requests but you data grows slowly watch the throughout and latency numbers. 

 
Hope that helps.   
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 16 Jun 2011, at 18:28, Schuilenga, Jan Taeke wrote:

> Which variables (for instance: throughput, CPU, I/O, connections) are leading 
> in deciding to add a node to a Cassandra setup which is put under strain. We 
> are trying to proove scalibility, but when is the time there to add a node 
> and have the optimum scalibilty result.
> 

Reply via email to