Good point. One thing I'm wondering about Cassandra is what happens when there is a massive failure, for example if 1/3 of the nodes go down or become unreachable. This could happen in EC2 if an AZ has a failure, or in a datacenter if a whole rack or UPS goes dark. I'm not so concerned about the time while the nodes are down; if I understand replication, consistency levels, the ring, and so on, I can architect things so that what must continue running does continue.
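For concreteness, here's roughly what I mean by architecting for that. This is just a sketch, not working config: the keyspace, table, and hosts are made up, it assumes Ec2Snitch (so each AZ shows up as a rack within a "us-east" datacenter), and it assumes a reasonably recent CQL driver:

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(['10.0.1.10', '10.0.2.10', '10.0.3.10'])
session = cluster.connect()

# RF=3 in the one "datacenter" (the EC2 region); NetworkTopologyStrategy
# spreads the three replicas over distinct racks (AZs) where possible.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS myks
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3}
""")

# QUORUM = 2 of 3 replicas, so reads and writes keep working with one AZ
# down, and a QUORUM read always overlaps the latest QUORUM write, so a
# single stale replica can't win by itself.
read = SimpleStatement(
    "SELECT value FROM myks.mytable WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
row = session.execute(read, ['some-key']).one()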
What I'm concerned about is when these nodes all come back up or reconnect. I have a hard time figuring out what exactly happens other than the fact that hinted handoffs get processed. Are the restarted nodes handling reads during that time? If so, they could serve up massive amounts of stale data, no? Do they then all start a repair, or is this something that needs to be run manually? If many do a repair at the same time, do I effectively end up with a down cluster due to the repair load? If no node was lost, is a repair required or are the hinted handoffs sufficient? Is there a manual or wiki section that discusses some of this and I just missed it?

On 1/21/2012 2:25 PM, Peter Schuller wrote:
>> Thanks for the responses! We'll definitely go for powerful servers to
>> reduce the total count. Beyond a dozen servers there really doesn't seem
>> to be much point in trying to increase count anymore for
> Just be aware that if "big" servers imply *lots* of data (especially
> in relation to memory size), it's not necessarily the best trade-off.
> Consider the time it takes to do repairs, streaming, node start-up,
> etc.
>
> If it's only about CPU resources then bigger nodes probably make more
> sense if the h/w is cost effective.
>
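Following up on my own repair-load question above: my rough plan, if simultaneous repairs really are a problem, would be to stagger them rather than letting every recovered node repair at once. A minimal sketch (host list and keyspace are made up, nodetool is assumed to be on the PATH, and it assumes a Cassandra version whose repair supports -pr so each node only repairs its primary range):

import subprocess

NODES = ['10.0.1.10', '10.0.2.10', '10.0.3.10']
KEYSPACE = 'myks'

for host in NODES:
    print('repairing primary range on %s ...' % host)
    # check_call blocks until this node's repair finishes, so only one
    # repair session is loading the cluster at any given time.
    subprocess.check_call(['nodetool', '-h', host, 'repair', '-pr', KEYSPACE])

Whether something like that is actually necessary after hinted handoff, or overkill, is exactly what I'm trying to find out.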