This will be a very selective response, not at all as exhaustive as it should be to truly cover what you bring up. Sorry, but here are some random tidbits.
> On the cassandra user list, I noticed a thread on a user that literally
> wrote his cluster to death. Correct me if I'm wrong, but based on that
> thread, it seemed like if one node runs out of space, it kills the
> cluster. I hope that is not the case.

A single node running out of disk doesn't kill a cluster as such. However, if you have a cluster of nodes where each is roughly the same size (and assuming even balancing of data) and you are so close to being full everywhere that one of them actually becomes full, you are in dangerous territory if you're worried about uptime.

I'd say there are two main types of issues here:

(1) Monitoring and predicting disk space usage with Cassandra is a bit more difficult than in a traditional system, because of the way usage varies over time and the way writes can have impacts on disk space that last for some time beyond the write itself. I.e., delayed effects.

(2) The fact that operational things like moving nodes and repair operations need disk space means it can be more difficult to get out of a bad situation.

> If I have a cluster of 3 computers that is filling up in disk space and
> I add a bunch of machines, how does cassandra deal with that situation?
> Does cassandra know not to keep writing to the initial three machines
> because the disk is near full and write to the new machines only? At
> some point my machines are going to be full of data and I don't want to
> write any more data to those machines until I add more storage to those
> machines. Is this possible to do?

Cassandra selects replica placement based on its replication strategy and ring layout (nodes and their tokens). There is no "fallback" to putting data elsewhere because nodes are full (doing so would likely introduce tremendous complexity, I think).

If a node runs catastrophically out of disk, it will probably stop working and effectively be offline in the cluster. Not necessarily though; it could be that e.g. compactions are not completing but writes still work, as do reads, and over time reads become less performant due to lack of compaction. Alternatively, if disks are completely full and memtables cannot be flushed, I believe you'd expect it to essentially go down.

I would say that running out of disk space is not something which is very gracefully handled right now, although there is some code to mitigate it. For example, during compaction there is some logic to check for disk space availability and try to compact smaller files first to possibly make room for larger compactions, but that does not address the overall issue.

> I read somewhere that it is a bad
> idea to use more than 50% of the drive utilization because of the
> compaction requirements.

For a single column family, a major compaction (i.e., one involving all of the CF's data) can currently double disk space usage if data is not overwritten or removed. In addition, things like nodetool repair and nodetool cleanup need disk space too. Here's where I'm not really up on all the details; it would be nice to arrive at a figure for the absolute worst-case disk space expansion that is possible.
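Purely as an illustration of why peak usage matters more than steady-state usage, here is a back-of-the-envelope sketch. The 2x-per-major-compaction figure and the 10% repair overhead are assumptions for the sake of the example, not measured or authoritative values:

    # Back-of-the-envelope headroom estimate. Assumptions (mine, not gospel):
    #  - a major compaction of one CF can temporarily need ~2x that CF's size
    #  - repair/cleanup adds some additional transient overhead on top

    def estimated_peak_gb(cf_sizes_gb, repair_overhead=0.1):
        total = sum(cf_sizes_gb)
        largest = max(cf_sizes_gb)
        # Worst case: the largest CF exists twice while it is being compacted,
        # plus some fraction of the total for repair/cleanup activity.
        return total + largest + repair_overhead * total

    disk_gb = 1000.0
    cfs = [400.0, 150.0, 50.0]   # hypothetical per-CF on-disk sizes
    peak = estimated_peak_gb(cfs)
    print("estimated peak: %.0f GB of %.0f GB" % (peak, disk_gb))
    # -> roughly 1060 GB on a 1000 GB volume, i.e. already too tight even
    #    though "only" 600 GB of it is live data

In a real cluster you would of course base this on your actual workload and observed behavior; the point is just that peak usage can be substantially above the steady-state size.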
> I was planning on using ssd for the sstables
> and standard hard drives for the compaction and writelogs. Is this
> possible to do?

It doesn't make sense for compaction, because compaction involves replacing sstables with new ones. You could write new sstables to separate storage and then copy them into the sstable storage location, but that doesn't really do anything except increase the total amount of I/O you're doing.

> I didn't see any options for specifying where the write
> logs are written and compaction is done.

There is a commitlog_directory option. Alternatively, the directory might be a symlink. Compaction is not done in a separate location (see above).

> Also, is it possible to add
> more drives to a running machine and have cassandra utilize a mounted
> directory with free space?

Sort of, but not really the way you want it. Short version is, "pretend the answer is no". The real answer is somewhere between yes and no (if someone feels like doing a write-up...).

I would suggest volume growth and file system resize if that is something you plan on doing, assuming you have a setup where such operations are sufficiently trusted. Probably RAID + LVM. But beware of LVM and its implications on correctness (e.g., write barriers). Err... basically, no, I don't recommend planning to fiddle with that. The potential for problems is probably too high. KISS.

> Cassandra should know to stop writing to the node once the directory
> with the sstables is near full, but it doesn't seem like it does
> anything like that.

It's kind of non-trivial to handle this gracefully in a way that satisfies both the "I don't want to be stuck" requirement and the "I don't want to be artificially down because of conservative estimates made by Cassandra, when I know my workload will survive with this disk space" requirement. But disk space management is definitely an area of improvement.

> My second set of questions relate to how Cassandra deals with recovering
> from node failure. How long does it take for Cassandra to fully recover
> from a node failure? Does it slow down the other nodes while it is
> recovering?

Anti-compaction and streaming are done to move data from nodes that have it (that are in the replica set). This implies CPU, I/O and network load on the source node, so it does have an impact. See http://wiki.apache.org/cassandra/Streaming among others.

(Here's where I'm not sure, someone please confirm/deny.) In 0.6, I believe this required disk space on the originating node. In 0.7, I *think* the need for disk space on the source node is removed.

> When a node goes down that hasn't had a chance to replicate
> the commit logs, then I'm assuming that data is lost forever. That
> seems like a pretty bad thing. How long does it take for commit logs
> to replicate?

Commit logs don't replicate. Commit logs are only used for internal node consistency. Data is replicated as part of (1) RPC relay directly as a result of traffic (the normal case), (2) read repair, and (3) anti-entropy (nodetool repair). See 'consistency' on http://wiki.apache.org/cassandra/Operations

You will have to select an appropriate consistency level on reads and writes, and configure your nodes appropriately (for example batch vs. periodic commitlog sync), depending on your requirements.
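To make the consistency level point a bit more concrete: you choose the level per read and per write from the client. Below is a rough sketch of what that looks like from Python using pycassa; the keyspace, column family and host names are made up, and the exact API differs a bit between client versions, so treat it as illustrative rather than authoritative:

    import pycassa
    from pycassa.cassandra.ttypes import ConsistencyLevel

    # Connect to a couple of nodes in the cluster (hypothetical names).
    pool = pycassa.ConnectionPool('MyKeyspace', ['node1:9160', 'node2:9160'])
    users = pycassa.ColumnFamily(pool, 'Users')

    # Write at QUORUM: a majority of replicas acknowledge the write before
    # the call returns, so losing a single node afterwards does not lose data.
    users.insert('jsmith', {'email': 'jsmith@example.com'},
                 write_consistency_level=ConsistencyLevel.QUORUM)

    # Read at QUORUM: a majority of replicas are consulted, which together
    # with QUORUM writes gives you read-your-writes behavior.
    row = users.get('jsmith', read_consistency_level=ConsistencyLevel.QUORUM)

Whether QUORUM (or ONE, or ALL) is appropriate depends entirely on your durability and availability requirements, which is the point of the operations page linked above.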
> If a node goes down, after we fix it and get it back up, do we need to
> do anything other than the nodetool add/remove commands?

If it only went down temporarily, just start it up again, as long as it hasn't been down longer than GCGraceSeconds (google it). If it's actually being replaced - see the operations wiki page linked above.

> How long does
> it take to get back to normal once it's back online. Is there any way
> to see the status of recovery?

If it's a node replacement, then whatever time it takes to bootstrap that node into the ring, which means streaming data from other nodes. So it directly depends on the amount of data. That said, streaming is pretty efficient and should hopefully saturate either your network or your disks.

> 1) Ability to specify which directory compactions occur in. (It's very,
> very painful to lose 50% of my ssd storage because of where compaction
> is occurring)

Doesn't quite make sense (see above), but I sympathize with the need to know disk space requirements.

> 2) Ability to add additional directories for storage of sstables. (I
> should be able to add a hard drive and have cassandra start using it for
> storage)

See above - sort of supported, but pretend it's not. In the future (see https://issues.apache.org/jira/browse/CASSANDRA-1608) it may again become useful to point Cassandra at multiple sstable directories, if limited-size sstables become a reality. That should also help in general with disk space requirements for large CFs.

> 4) Ability to specify a primary storage directory for main requests
> and secondary storage directory for replicas. Primary storage could be
> on a fast raid or ssd. Incoming requests could prefer primary storage
> that uses ssds. This will enable us to get the speed of ssd, but the
> cheap replication capabilities of standard disk.

That assumes that traffic goes only to the primary replica unless it goes down, which is only the case under specific circumstances. Also, it doesn't help you if, when a node goes down, all traffic goes to the remaining node(s) and kills them completely because they have to serve the same traffic off of slower storage.

> 5) Utilize all replicas for servicing requests. Not sure if this is
> already done, but if I have 5 replicas, are all 5 being used to service
> read requests?

In general yes, but it's possible to affect that. You also have to be mindful of how the consistency level of reads affects the number of reads that have to be done, in addition to read repair (if enabled).

-- 
/ Peter Schuller