On EC2, it is common and recommended (http://developer.amazonwebservices.com/connect/entry.jspa?categoryID=100&externalID=1663) to use XFS's freeze/thaw functionality to create near-online snapshots of an EBS volume for MySQL backups.
To quote that article: "Besides being a stable, modern, high performance, journaling file system, XFS supports file system freeze/thaw which is extremely useful for ensuring a consistent state during EBS snapshots."

I've implemented this with MySQL before, and it worked extremely well (miles beyond mysqldump or mysqlhotcopy). On a given node, you sacrifice a short period of availability (less than 0.5 seconds) to get a full, consistent snapshot of your EBS volume that can be sent off to S3 in the background, after the filesystem has unlocked and disk activity has resumed.

Has anybody tried implementing this with a Cassandra cluster? What issues did you run into? How did it compare with using Cassandra's "nodetool snapshot"?

I think I could do this on a running node with a 0.5 second timeout. The XFS docs state "Any process attempting to write to the frozen filesystem will block waiting for the filesystem to be unfrozen." Having writes block on a node for <0.5s sounds like something Cassandra would handle fine.

The Cassandra docs state "You can get an eventually consistent backup by flushing all nodes and snapshotting; no individual node's backup is guaranteed to be consistent but if you restore from that snapshot then clients will get eventually consistent behavior as usual." This led me to believe that as long as I snapshot each node in the cluster within a reasonable window (say 2 hours), I'd be able to bring the entire cluster back with a guarantee that it is consistent up to the point where the snapshot window began.

I realize one of Cassandra's design goals is redundancy and high availability. I'm not worried about our entire infrastructure collapsing and having to restore backups because of massive node failure. I want backups so that I can recover if a bad logic bug in our app (e.g. messing up the timestamps) or in Cassandra itself deletes or corrupts data in our Cassandra cluster.

-Ben
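
P.S. To make the question concrete, the per-node sequence described above (flush, freeze, start the EBS snapshot, thaw) could be sketched roughly as follows. This is only an illustration under assumptions: the mount point and volume ID are placeholders, the `aws ec2 create-snapshot` call stands in for whatever EBS tooling is actually in use, and the `run` parameter is a hypothetical hook added so the sequence can be dry-run without touching a real node.

```python
import subprocess


def freeze_and_snapshot(mount_point, volume_id, run=subprocess.check_call):
    """Flush Cassandra, freeze XFS, start an EBS snapshot, then thaw.

    `run` takes an argv list; it defaults to subprocess.check_call and is
    injectable so the sequence can be dry-run or tested.
    """
    # Flush memtables to SSTables so the on-disk state is as current as possible.
    run(["nodetool", "flush"])
    # Block writes to the filesystem; writers wait until it is thawed.
    run(["xfs_freeze", "-f", mount_point])
    try:
        # Only the snapshot initiation needs to happen while frozen;
        # the copy to S3 continues in the background afterwards.
        run(["aws", "ec2", "create-snapshot", "--volume-id", volume_id])
    finally:
        # Always thaw, even if the snapshot call failed.
        run(["xfs_freeze", "-u", mount_point])


if __name__ == "__main__":
    # Dry run with placeholder values: print the commands instead of running them.
    freeze_and_snapshot("/var/lib/cassandra", "vol-00000000",
                        run=lambda argv: print(" ".join(argv)))
```

The try/finally is the part that matters most here: if the snapshot call fails while the filesystem is frozen, the thaw still runs, so writes on the node are not left blocked indefinitely.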