Hello Manish,

I think any disk works, as long as it is big enough. It's also better if
it's a reliable system (some kind of redundant RAID, a NAS, or object
storage like GCS or S3...). During a backup we are not looking for speed so
much as resiliency and, above all, not harming the source cluster.
In practice, how fast you can write to the backup storage system will more
often be limited by what you can read from the source cluster.
The backups have to be taken from running nodes, so it's easy to overload
the disks (reads), the network (exporting backup data to its final
destination), and even the CPU (if the same machine handles the transfer).
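As a rough illustration of that point (my own sketch, not something from the
blog post linked below): you can cap how fast snapshot files are read and
shipped off a node, so the backup never starves live traffic. The paths, the
rate limit and the upload_chunk placeholder are all assumptions, in Python:

import time
from pathlib import Path

# Hypothetical values: tune to what the source node can spare.
MAX_BYTES_PER_SEC = 50 * 1024 * 1024   # cap reads at ~50 MB/s per node
CHUNK_SIZE = 4 * 1024 * 1024           # read 4 MB at a time

def upload_chunk(relative_path: str, offset: int, data: bytes) -> None:
    """Placeholder: send one chunk to the backup store (S3, GCS, NAS...)."""
    pass

def throttled_upload(snapshot_dir: str) -> None:
    """Stream every file of a snapshot, pausing to respect the rate cap."""
    window_start = time.monotonic()
    sent_in_window = 0
    for path in Path(snapshot_dir).rglob("*"):
        if not path.is_file():
            continue
        rel = str(path.relative_to(snapshot_dir))
        with path.open("rb") as f:
            offset = 0
            while True:
                data = f.read(CHUNK_SIZE)
                if not data:
                    break
                upload_chunk(rel, offset, data)
                offset += len(data)
                sent_in_window += len(data)
                # Simple throttle: sleep once we are ahead of the byte
                # budget allowed for the elapsed time.
                elapsed = time.monotonic() - window_start
                expected = elapsed * MAX_BYTES_PER_SEC
                if sent_in_window > expected:
                    time.sleep((sent_in_window - expected) / MAX_BYTES_PER_SEC)

if __name__ == "__main__":
    # Point this at a snapshot taken with `nodetool snapshot` (hypothetical path).
    throttled_upload("/var/lib/cassandra/data/ks/table-uuid/snapshots/backup_tag")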

> What are the best practices while designing backup storage system for big
> Cassandra cluster?


What is nice to have (not to say mandatory) is a system of incremental
backups. You should not pull all of the data off the nodes every time, or
you'll either harm the cluster regularly OR spend days transferring the data
(once the dataset grows big enough).
I'm not speaking about Cassandra's built-in incremental backups, but about
using something like AWS EBS snapshots, or reproducing that behaviour
programmatically: taking (copying, or linking) old SSTables from previous
backups when they already exist. This greatly reduces the load on the
cluster and the resources needed, since soon enough a substantial share of
the data can come from the backup storage itself. The problem with purely
incremental backups is that when restoring, you have to restore multiple
pieces, which makes it harder and involves a lot of compaction work.
The "caching" technic mentioned above gives the best of the 2 worlds:
- You will always backup from the nodes only the sstables you don’t have
already in your backup storage system,
- You will always restore easily as each backup is a full backup.

It's not really a "hands-on" write-up, but it should give you an overview of
the existing ways to do backups and their tradeoffs. I wrote this a year ago:
http://thelastpickle.com/blog/2018/04/03/cassandra-backup-and-restore-aws-ebs.html

It's a complex topic, I hope some of this is helpful to you.

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


On Thu, 28 Mar 2019 at 11:24, manish khandelwal <
manishkhandelwa...@gmail.com> wrote:

> Hi
>
>
>
> I would like to know is there any guideline for selecting storage device
> (disk type) for Cassandra backups.
>
>
>
> As per my current observation, NearLine (NL) disks on SAN slow down
> significantly while copying backup files (taking a full backup) from all nodes
> simultaneously. Will using SSD disks on SAN help us in this regard?
>
> Apart from using SSD disks, what are the alternative approaches to make my
> backup process faster?
>
> What are the best practices while designing backup storage system for big
> Cassandra cluster?
>
>
> Regards
>
> Manish
>
