At my current job I had to roll my own backup system. Hopefully I can get it OSS'd at some point. Here is a (now slightly outdated) presentation:
https://docs.google.com/presentation/d/13Aps-IlQPYAa_V34ocR0E8Q4C8W2YZ6Jn5_BYGrjqFk/edit#slide=id.p

If you are struggling with the disk I/O cost of the sstable backups/copies, note that since sstables are immutable once written, an incremental approach to your backups only needs to track a list of the current files and upload the ones that are new compared to a previous successful backup. Your "manifest" of files for a node will need a reference to the previous backup, and you'll want to "reset" with a full backup each month.

I stole that idea from https://github.com/tbarbugli/cassandra_snapshotter. I would have used that tool, but we had more complex node access modes (kubernetes, ssh through jumphosts, etc.) and needed lots of other features it didn't support.
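To make that concrete, here's a minimal sketch of the diff step (illustrative only; the manifest layout and helper names are made up, not what my tool actually uses):

import json
from pathlib import Path

def load_manifest(path):
    # A manifest: {"files": [...everything present at backup time...],
    #              "previous": "<parent manifest name, or None for a FULL>"}
    with open(path) as f:
        return json.load(f)

def files_to_upload(data_dir, previous_manifest):
    # sstables are immutable once written, so anything already listed in
    # the previous successful backup's manifest can be skipped.
    already_present = set(previous_manifest["files"]) if previous_manifest else set()
    current = sorted(str(p.relative_to(data_dir))
                     for p in Path(data_dir).rglob("*") if p.is_file())
    new_files = [f for f in current if f not in already_present]
    # Upload only new_files, but record the full `current` listing in this
    # backup's manifest together with a reference to the previous one.
    return new_files, current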
In AWS I use aws profiles to throttle the transfers, and I parallelize across nodes. The basic unit of a successful backup is a single node, but you'll obviously want to track success across all the nodes. Note that in rack-based topologies you really only need one whole successful rack (in one DC) if your RF >= the number of racks.

Beware of doing simultaneous flushes/snapshots across the whole cluster at once; that can amount to a self-inflicted DDoS. You might want to do a "jittered" (randomized) preflush of the cluster before doing the snapshotting. Unfortunately, the nature of a distributed system is that snapshotting all the nodes at precisely the same time is a hard problem.

I also have not used Cassandra's built-in incremental backup feature, which can enable more precise point-in-time backups (aside from the unflushed data in the commitlogs).

A note on incrementals with occasional FULLs: monthly FULL backups might take more than a day or two, especially when throttled. My incrementals originally looked up previous manifests by "most recent" only, so the long-running FULL backups were excluded from the "chain" of incremental backups. I now implement a fuzzy lookup for the incrementals that prioritizes any FULL from the last 5 days over any more recent incremental. This also lets you purge old backups you no longer need more safely, using the monthly FULLs as reset points.
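A sketch of that fuzzy lookup, assuming each manifest records its kind and completion timestamp (field names are illustrative):

from datetime import datetime, timedelta, timezone

def pick_parent_manifest(manifests, full_window_days=5):
    # Any completed FULL inside the window beats a more recent incremental,
    # so a multi-day FULL isn't silently dropped from the chain.
    cutoff = datetime.now(timezone.utc) - timedelta(days=full_window_days)
    recent_fulls = [m for m in manifests
                    if m["kind"] == "FULL" and m["completed_at"] >= cutoff]
    candidates = recent_fulls or manifests
    return max(candidates, key=lambda m: m["completed_at"], default=None)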
On Mon, Apr 1, 2019 at 1:08 PM Alain RODRIGUEZ <arodr...@gmail.com> wrote:

> Hello Manish,
>
> I think any disk works, as long as it is big enough. It's also better if
> it's a reliable system (some kind of redundant RAID, NAS, or storage like
> GCS or S3...). We are not looking for speed mostly during a backup, but
> resiliency, and not harming the source cluster, I would say.
> Then how fast you write to the backup storage system will probably more
> often be limited by what you can read from the source cluster.
> The backups have to be taken from running nodes, thus it's easy to
> overload the disk (reads), the network (exporting backup data to the final
> destination), and even the CPU (as/if the machine handles the transfer).
>
>> What are the best practices while designing a backup storage system for
>> a big Cassandra cluster?
>
> What is nice to have (not to say mandatory) is a system of incremental
> backups. You should not take all the data from the nodes every time, or
> you'll either harm the cluster regularly OR spend days transferring the
> data (once the amount of data grows big enough).
> I'm not speaking about Cassandra incremental snapshots, but of using
> something like AWS snapshots, or copying this behaviour programmatically
> to take (copy, link?) old SSTables from previous backups when they exist.
> This will greatly unload the cluster's work and the resources needed, as
> soon enough a substantial amount of the data should be coming from the
> backup data source itself. The problem with incremental snapshots is that
> when restoring, you have to restore multiple pieces, making it harder and
> involving a lot of compaction work.
> The "caching" technique mentioned above gives the best of the 2 worlds:
> - You will always back up from the nodes only the sstables you don't have
> already in your backup storage system,
> - You will always restore easily, as each backup is a full backup.
>
> It's not really a "hands-on" write-up, but it should let you know about
> existing ways to do backups and the tradeoffs; I wrote this a year ago:
> http://thelastpickle.com/blog/2018/04/03/cassandra-backup-and-restore-aws-ebs.html
>
> It's a complex topic, I hope some of this is helpful to you.
>
> C*heers,
> -----------------------
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> On Thu, Mar 28, 2019 at 11:24 AM, manish khandelwal <
> manishkhandelwa...@gmail.com> wrote:
>
>> Hi
>>
>> I would like to know if there is any guideline for selecting a storage
>> device (disk type) for Cassandra backups.
>>
>> As per my current observation, a NearLine (NL) disk on SAN slows down
>> significantly while copying backup files (taking a full backup) from all
>> nodes simultaneously. Will using SSD disks on SAN help us in this regard?
>> Apart from using SSD disks, what are the alternative approaches to make
>> my backup process fast?
>> What are the best practices while designing a backup storage system for
>> a big Cassandra cluster?
>>
>> Regards
>> Manish
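Coming back to the "caching" technique Alain describes: a rough sketch of the idea, where every backup records a complete file list but only unseen sstables leave the node. store_has and upload are hypothetical placeholders for whatever object-store client you use, and content-hash keys are just one option (keying by the immutable sstable file name works too):

import hashlib
from pathlib import Path

def sha256_file(path, chunk=1 << 20):
    # Hash in chunks so multi-GB sstables don't have to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def backup_node(data_dir, store_has, upload):
    # Upload only files the store doesn't already hold, but record every
    # file in this backup's manifest so a restore never chases a chain.
    manifest = []
    for path in sorted(Path(data_dir).rglob("*")):
        if not path.is_file():
            continue
        key = f"sstables/{sha256_file(path)}/{path.name}"
        if not store_has(key):
            upload(path, key)
        manifest.append(key)
    return manifest  # a complete listing: each backup restores as a full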