I have not used tablesnap, but it appears that it does not necessarily depend on taking a Cassandra snapshot. The example in its documentation shows the source folder as /var/lib/cassandra/data/GiantKeyspace, which is the root of the "GiantKeyspace" keyspace. Snapshots, however, operate at the column-family level and are stored in a subdirectory structure under each column family. For example, if GiantKeyspace has two column families, cf1 and cf2, the snapshots would be located in /var/lib/cassandra/data/GiantKeyspace/cf1/snapshots/snapshot_id/ and /var/lib/cassandra/data/GiantKeyspace/cf2/snapshots/snapshot_id/, where snapshot_id is a unique identifier for that snapshot. Unless tablesnap detects changes in subdirectories, I don't see how you could tell tablesnap the name of the actual snapshot folder before the snapshot is taken. I think tablesnap's premise is that, since a snapshot is simply a hard link to an existing sstable file and sstables are immutable, it can operate on the original sstables directly, with no need to take a snapshot at all.
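To see the hard-link behaviour this premise relies on, here is a sketch using throwaway files in a temp directory (not real sstables; the file name and "mysnap" tag are just illustrative):

```shell
#!/usr/bin/env bash
# Demonstrate the hard-link property snapshots rely on, with throwaway files.
tmp=$(mktemp -d)
echo "sstable bytes" > "$tmp/Data.db"          # stand-in for a live sstable
mkdir -p "$tmp/snapshots/mysnap"
ln "$tmp/Data.db" "$tmp/snapshots/mysnap/Data.db"   # what 'nodetool snapshot' does

# Both names point at the same inode, so the "copy" is free; and because
# sstables are immutable, the link never diverges from the original.
rm "$tmp/Data.db"                              # e.g. compaction removes the original
cat "$tmp/snapshots/mysnap/Data.db"            # still prints: sstable bytes
```

This is also why a snapshot survives compaction: deleting the original name only drops one link, and the snapshot's link keeps the data alive.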
However, Cassandra also performs compactions, which combine sstables into new sstables for the purpose of "de-fragging" row data to optimize lookups. The pre-compaction sstables are marked for deletion and removed during the next GC. What this means to me is that you should use snapshots to preserve the point-in-time state of the data, so there is a small problem to overcome if you use snapshots together with tablesnap.

Ideally, to create a completely consistent point-in-time backup, you would stop client access to the cluster (nodetool disablethrift), execute a flush to write memtables to disk (nodetool flush), then execute the snapshot. In reality, if you can execute the snapshot on all servers within a "short period of time", for some value of 'short', your data will be relatively consistent. If you ever needed to perform a restore from these snapshots, Cassandra's internal read-repair feature would fix up any inconsistencies.

I use DataStax OpsCenter to take snapshots and then a homebrew Python script to upload them to S3. OpsCenter sends the snapshot command to all servers nearly simultaneously, so the snapshots are executed almost in parallel. (This feature might only be available in the Enterprise version.) Alternatively, you could use a simple bash script to execute the nodetool snapshot command via ssh on each server sequentially, or use a multi-window ssh client (e.g. csshX for OS X, https://code.google.com/p/csshx/ ) to execute it in true parallel fashion.

-- Ray //o-o\\

On Sat, Dec 7, 2013 at 4:09 AM, Jason Wee <peich...@gmail.com> wrote:

> Hmm... cassandra fundamental key features like fault tolerant, durable and
> replication. Just out of curiousity, why would you want to do backup?
>
> /Jason
>
>
> On Sat, Dec 7, 2013 at 3:31 AM, Robert Coli <rc...@eventbrite.com> wrote:
>
>> On Fri, Dec 6, 2013 at 6:41 AM, Amalrik Maia <amal...@s1mbi0se.com.br> wrote:
>>
>>> hey guys, I'm trying to take backups of a multi-node cassandra and save
>>> them on S3.
>>> My idea is simply doing ssh to each server and use nodetool to create
>>> the snapshots then push then to S3.
>>
>> https://github.com/synack/tablesnap
>>
>>> So is this approach recommended? my concerns are about inconsistencies
>>> that this approach can lead, since the snapshots are taken one by one and
>>> not in parallel.
>>> Should i worry about it or cassandra finds a way to deal with
>>> inconsistencies when doing a restore?
>>
>> The backup is as consistent as your cluster is at any given moment, which
>> is "not necessarily". Manual repair brings you closer to consistency, but
>> only on data present when the repair started.
>>
>> =Rob
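P.S. For the sequential bash-over-ssh approach I mentioned, a minimal sketch might look like the following. The host names, keyspace, and snapshot tag are hypothetical placeholders, and the ssh line is commented out so the script is safe to dry-run:

```shell
#!/usr/bin/env bash
# Hypothetical node list and keyspace -- substitute your own.
HOSTS="cass1 cass2 cass3"
KEYSPACE="GiantKeyspace"
TAG="backup_20131207"   # one shared tag so every node names its snapshot dir the same

# $1 = snapshot tag, $2 = keyspace.
# Flush memtables first, then snapshot, so recent writes are captured on disk.
build_cmd() {
  echo "nodetool flush $2 && nodetool snapshot -t $1 $2"
}

for h in $HOSTS; do
  echo "snapshotting $KEYSPACE on $h"
  # ssh "$h" "$(build_cmd "$TAG" "$KEYSPACE")"   # uncomment to run for real
done
```

Using one shared tag means the resulting snapshots live under the same snapshots/<tag>/ directory on every node, which makes the later upload-to-S3 step much easier to script.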