Another approach to avoiding the I/O hit of full backups would be to rotate which node (or small subset of nodes) does a full backup on a given day, so that over the course of a month or two every node has had a full backup. Of course, this assumes you have incremental capability for the other backup days/dates.
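As a rough illustration of that rotation idea (the node names, the 30-day cycle, and the function itself are all assumptions for the sketch, not anything from an existing tool), each node could hash its ID into a slot in the cycle and do a full backup only on its slot's day:

```python
import zlib
from datetime import date, timedelta

def backup_type(node_id: str, today: date, cycle_days: int = 30) -> str:
    """Return "full" on this node's one day in the rotation, else "incremental".

    crc32 is used instead of the built-in hash() so the slot is stable
    across processes and restarts.
    """
    slot = zlib.crc32(node_id.encode()) % cycle_days
    return "full" if today.toordinal() % cycle_days == slot else "incremental"
```

Because 30 consecutive day ordinals cover every residue mod 30 exactly once, each node gets exactly one full backup per cycle, and the full-backup load is spread evenly across the cluster rather than landing on one day.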
On Mon, Apr 1, 2019 at 1:30 PM Carl Mueller <carl.muel...@smartthings.com> wrote:

> At my current job I had to roll my own backup system. Hopefully I can get
> it OSS'd at some point. Here is a (now slightly outdated) presentation:
>
> https://docs.google.com/presentation/d/13Aps-IlQPYAa_V34ocR0E8Q4C8W2YZ6Jn5_BYGrjqFk/edit#slide=id.p
>
> If you are struggling with the disk I/O cost of the sstable
> backups/copies, note that since sstables are append-only, if you adopt an
> incremental approach to your backups, you only need to track a list of the
> current files and upload the files that are new compared to a previous
> successful backup. Your "manifest" of files for a node will need to have
> references to the previous backup, and you'll want to "reset" with a full
> backup each month.
>
> I stole that idea from https://github.com/tbarbugli/cassandra_snapshotter.
> I would have used that, but we had more complex node access modes
> (kubernetes, ssh through jumphosts, etc.) and lots of other features needed
> that weren't supported.
>
> In AWS I use aws profiles to throttle the transfers, and parallelize
> across nodes. The basic unit of a successful backup is a single node, but
> you'll obviously want to track success across all nodes.
>
> Note that in rack-based topologies you really only need one whole
> successful rack if your RF is > # racks, and one DC.
>
> Beware doing simultaneous flushes/snapshots across the cluster at once;
> that might be the equivalent of a DDoS. You might want to do a "jittered",
> randomized preflush of the cluster before doing the snapshotting.
>
> Unfortunately, the nature of a distributed system is that snapshotting all
> the nodes at precisely the same time is a hard problem.
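The manifest diff Carl describes (sstables are immutable, so an incremental backup only uploads files absent from the previous successful backup's manifest) might look roughly like this; the manifest field names are illustrative, not his actual format:

```python
# Sketch of a manifest-based incremental backup plan (assumed structure,
# not taken from any existing tool).
def files_to_upload(current_files, previous_manifest):
    """Only sstables not present in the previous backup need uploading."""
    prev = set(previous_manifest.get("files", []))
    return sorted(f for f in current_files if f not in prev)

def new_manifest(backup_id, current_files, previous_manifest):
    return {
        "backup_id": backup_id,
        # Full listing: a restore can be driven from this one manifest.
        "files": sorted(current_files),
        # Chain reference back to the parent backup, as described above.
        "parent": previous_manifest.get("backup_id"),
        "uploaded": files_to_upload(current_files, previous_manifest),
    }
```

Keeping the full file listing in every manifest is what makes the monthly full-backup "reset" safe: any manifest plus the files it references is enough to restore, without walking the whole incremental chain.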
> I also do not / have not used the built-in incremental backup feature of
> cassandra, which can enable more precise point-in-time backups (aside from
> the unflushed data in the commitlogs).
>
> A note on incrementals with occasional FULLs: FULL backups run monthly
> might take more than a day or two, especially when throttled. My
> incrementals originally looked up previous manifests using only "most
> recent", but then the long-running FULL backups were excluded from the
> "chain" of incremental backups. So I now implement a fuzzy lookup for the
> incrementals that prioritizes any FULL in the last 5 days over any more
> recent incremental. Thus you can purge old backups you don't need more
> safely, using the monthly full backups as a reset point.
>
> On Mon, Apr 1, 2019 at 1:08 PM Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>
>> Hello Manish,
>>
>> I think any disk works, as long as it is big enough. It's also better if
>> it's a reliable system (some kind of redundant RAID, NAS, storage like GCS
>> or S3...). During a backup we are not looking for speed so much as
>> resiliency and not harming the source cluster, I would say.
>> How fast you write to the backup storage system will more often be
>> limited by what you can read from the source cluster.
>> The backups have to be taken from running nodes, thus it's easy to
>> overload the disk (reads), the network (exporting backup data to its final
>> destination), and even the CPU (if the machine handles the transfer).
>>
>>> What are the best practices while designing backup storage system for a
>>> big Cassandra cluster?
>>
>> What is nice to have (not to say mandatory) is a system of incremental
>> backups. You should not take all the data from the nodes every time, or
>> you'll either harm the cluster regularly OR spend days transferring the
>> data (once the amount of data grows big enough).
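Carl's fuzzy parent lookup (prefer any FULL finished within the last 5 days over a more recent incremental, so a long-running FULL is not dropped from the chain) could be sketched like this; the record shape and function name are assumptions, not his implementation:

```python
from datetime import datetime, timedelta

def pick_parent(backups, now, full_window_days=5):
    """Choose the parent backup for the next incremental.

    backups: list of dicts like {"type": "FULL" or "INCR",
                                 "finished": datetime}.
    A FULL completed within the window wins even if an incremental
    finished more recently; otherwise fall back to the newest backup.
    """
    window_start = now - timedelta(days=full_window_days)
    recent_fulls = [b for b in backups
                    if b["type"] == "FULL" and b["finished"] >= window_start]
    if recent_fulls:
        return max(recent_fulls, key=lambda b: b["finished"])
    return max(backups, key=lambda b: b["finished"], default=None)
```

With a plain "most recent" lookup, an incremental that finished after a multi-day FULL would become the parent, orphaning the FULL; anchoring the chain to the recent FULL instead is what makes it a safe purge/reset point.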
>> I'm not speaking about Cassandra incremental snapshots, but about using
>> something like AWS snapshots, or copying this behaviour programmatically.
>> Taking (copying, linking?) old SSTables from previous backups when they
>> exist will greatly unload the cluster's work and the resources needed,
>> since soon enough a substantial amount of the data should be coming from
>> the backup data source itself. The problem with incremental snapshots is
>> that when restoring, you have to restore multiple pieces, making it harder
>> and involving a lot of compaction work.
>> The "caching" technique mentioned above gives the best of both worlds:
>> - You will always back up from the nodes only the sstables you don't have
>> already in your backup storage system,
>> - You will always restore easily, as each backup is a full backup.
>>
>> It's not really a "hands-on" writing, but this should let you know about
>> existing ways to do backups and the tradeoffs. I wrote this a year ago:
>> http://thelastpickle.com/blog/2018/04/03/cassandra-backup-and-restore-aws-ebs.html
>>
>> It's a complex topic; I hope some of this is helpful to you.
>>
>> C*heers,
>> -----------------------
>> Alain Rodriguez - al...@thelastpickle.com
>> France / Spain
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>>
>> On Thu, Mar 28, 2019 at 11:24, manish khandelwal
>> <manishkhandelwa...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I would like to know is there any guideline for selecting a storage
>>> device (disk type) for Cassandra backups.
>>>
>>> As per my current observation, NearLine (NL) disk on SAN slows down
>>> significantly while copying backup files (taking a full backup) from all
>>> nodes simultaneously. Will using SSD disk on SAN help us in this regard?
>>>
>>> Apart from using SSD disks, what are the alternative approaches to make
>>> my backup process fast?
>>> What are the best practices while designing a backup storage system for
>>> a big Cassandra cluster?
>>>
>>> Regards
>>>
>>> Manish
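Alain's "caching" technique, where every backup is logically a full backup but only the sstables missing from the backup store are actually copied, could be sketched roughly like this (the planning function and field names are illustrative assumptions, not code from any of the tools mentioned):

```python
# Sketch of a backup plan giving "the best of both worlds": each backup
# records the full sstable set, but only new files are transferred; the
# rest are referenced (or hard-linked) from earlier backups in storage.
def plan_backup(node_sstables, store_index):
    """node_sstables: sstable names currently on the node.
    store_index: set of sstable names already in backup storage."""
    return {
        "copy": sorted(f for f in node_sstables if f not in store_index),
        "reference": sorted(f for f in node_sstables if f in store_index),
        # A restore only needs this listing plus the files it names.
        "full_listing": sorted(node_sstables),
    }
```

This relies on sstable immutability, as in Carl's manifest approach: a file already present in the store never needs re-reading from the node, so the cluster-side I/O shrinks as the store fills, while every restore remains a single full backup.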