It's a difficult questions to answer in the abstract. Some thoughts... Scaling by adding one node at time is not optimal. The best case scenario is to double the number of nodes, as this means existing nodes only have to stream their data to a new node. Obviously this is not always possible. When adding less nodes existing nodes keeping a balanced ring may mean streams data to other nodes and accepting data from nodes.
In general try to keep the data volume < 50% full, so there is lots of free space to do moves. In general nodes with a few 100GB's of data are easiest to manage. The pending column in nodetool tpstats will let you know how many read or write requests are waiting to be serviced. If this is consistently above concurrent_reads or concurrent_writes it means there is a queue > 1 for each thread. This will add to request latency, once maximum throughput is reached additional requests will queue. See the SEDA paper. Sometime in the 0.7 dev client connection pooling was added to better manage those resources. See cassandra.yaml for info. The o.a.c.db.StorageProxy JMX MBean provides a latency trackers for total request time including wait times. And o.a.c.db.ColumnFamiles... provides latency trackers for the local node operations to do the read or write. if your data set grows quickly watch the disk space etc. If you do a lot of requests but you data grows slowly watch the throughout and latency numbers. Hope that helps. ----------------- Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 16 Jun 2011, at 18:28, Schuilenga, Jan Taeke wrote: > Which variables (for instance: throughput, CPU, I/O, connections) are leading > in deciding to add a node to a Cassandra setup which is put under strain. We > are trying to proove scalibility, but when is the time there to add a node > and have the optimum scalibilty result. >