My 2 cents,

> As I mentioned earlier, we're not currently using snapshots - it's only the
> backups that are bothering me right now.


I believe the backups folder is just the new name for what was previously
called the snapshots folder. But I may be completely wrong; I haven't played
that much with snapshots in recent versions yet.

Anyway, some operations in Apache Cassandra can trigger a snapshot:

- Repair (when running sequential repairs rather than using the parallel option)
- Truncating a table (by default)
- Dropping a table (by default)
- Maybe others I can't think of...?
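To see whether any of these operations left snapshots behind, you can list
them per node; a minimal check (the output format varies a bit by version):

    nodetool listsnapshots
    # shows snapshot name, keyspace, table, size on disk and "true size"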

If you want to clean space but still keep a backup you can run:

"nodetool clearsnapshots"
"nodetool snapshot <whatever>"

This way, and for a while, the data won't take up extra space: old files will
be cleaned and the new snapshot files will only be hard links, as detailed
above. Then you might want to work out a proper backup policy, probably one
that gets data out of the production server (a lot of people use S3 or
similar services). Or just do that from time to time, meaning you only ever
keep one backup and disk space behaviour will be hard to predict; a rough
sketch of that routine follows.
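For illustration only, the "from time to time" routine might look like the
sketch below; the snapshot tag is made up, the data path is just the common
default, and you would adapt both to your setup:

    #!/bin/sh
    # Rough sketch, not a complete backup policy.
    # Drop all existing snapshots (newer versions may require --all):
    nodetool clearsnapshot
    # Take a fresh snapshot with a (hypothetical) dated tag:
    nodetool snapshot -t "backup_$(date +%F)"
    # Optionally clear the incremental backups folders once the snapshot exists
    # (adjust the data path for your install):
    find /var/lib/cassandra/data -type d -name backups \
        -exec sh -c 'rm -f "$1"/*' _ {} \;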

C*heers,
-----------------------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2017-01-12 6:42 GMT+01:00 Prasenjit Sarkar <prasenjit.sar...@datos.io>:

> Hi Kunal,
>
> Razi's post does give a very lucid description of how Cassandra manages
> the hard links inside the backups directory.
>
> Where it needs clarification is the following:
> --> incremental backups are a system-wide setting, so it's an all-or-nothing
> approach
>
> --> as multiple people have stated, incremental backups do not create hard
> links to compacted sstables; however, the un-compacted sstables that
> accumulate there can bloat the size of your backups
>
> --> again as stated, it is general industry practice to place backups in a
> secondary storage location separate from the main production site. So it is
> best to move them to secondary storage before applying rm to the backups
> folder (see the sketch below)
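>
> For illustration, a minimal sketch of that "copy out, then rm" step for one
> table's backups folder; the bucket name and paths are made up, and this
> assumes the AWS CLI is installed:
>
>     # ship the incremental backups off the node first...
>     aws s3 sync /var/lib/cassandra/data/my_keyspace/my_table-some-uuid/backups/ \
>         s3://my-backup-bucket/node1/my_keyspace/my_table/
>     # ...and only then reclaim the local space
>     rm -f /var/lib/cassandra/data/my_keyspace/my_table-some-uuid/backups/*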
>
> In my experience with production clusters, managing the backups folder
> across multiple nodes can be painful if the objective is to ever recover
> data. With the usual disclaimers, it is better to rely on third-party
> vendors to do this than on scripts/tablesnap.
>
> Regards
> Prasenjit
>
>
> On Wed, Jan 11, 2017 at 7:49 AM, Khaja, Raziuddin (NIH/NLM/NCBI) [C] <
> raziuddin.kh...@nih.gov> wrote:
>
>> Hello Kunal,
>>
>>
>>
>> Caveat: I am not a super-expert on Cassandra, but explaining to others
>> helps one eventually become an expert, so if my explanation is wrong, I
>> hope others will correct me. :)
>>
>>
>>
>> The active sstables/data files are all the files located in the
>> directory for the table.
>>
>> You can safely remove all files under the backups/ directory and the
>> directory itself.
>>
>> Removing any files that are current hard links inside backups/ won't cause
>> any issues, and I will explain why.
>>
>>
>>
>> Have you looked at your cassandra.yaml file and checked the setting for
>> incremental_backups?  If it is set to true and you don't want to make new
>> backups, you can set it to false, so that after you clean up, you will not
>> have to clean up the backups again.
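>>
>> For example, a quick way to check the current value (default config path
>> shown; yours may differ):
>>
>>     grep -n 'incremental_backups' /etc/cassandra/cassandra.yaml
>>     # expect a line like:  incremental_backups: true
>>
>> Changing it to false and restarting the node stops new hard links from
>> accumulating. I believe recent versions also offer nodetool disablebackup /
>> enablebackup to toggle this at runtime, but verify that on your version.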
>>
>>
>>
>> Explanation:
>>
>> Let's look at the definition of incremental backups again: “Cassandra
>> creates a hard link to each SSTable flushed or streamed locally in
>> a backups subdirectory of the keyspace data.”
>>
>>
>>
>> Suppose we have a directory path: my_keyspace/my_table-some-uuid/backups/
>>
>> In the rest of the discussion, when I refer to “table directory”, I
>> explicitly mean the directory: my_keyspace/my_table-some-uuid/
>>
>> When I refer to backups/ directory, I explicitly mean:
>> my_keyspace/my_table-some-uuid/backups/
>>
>>
>>
>> Suppose that you have an sstable-A that was either flushed from a
>> memtable or streamed from another node.
>>
>> At this point, you have a hardlink to sstable-A in your table directory,
>> and a hardlink to sstable-A in your backups/ directory.
>>
>> Suppose that you have another sstable-B that was also either flushed from
>> a memtable or streamed from another node.
>>
>> At this point, you have a hardlink to sstable-B in your table directory,
>> and a hardlink to sstable-B in your backups/ directory.
>>
>>
>>
>> Next, suppose compaction were to occur, where say sstable-A and sstable-B
>> would be compacted to produce sstable-C, representing all the data from A
>> and B.
>>
>> Now, sstable-C will live in your main table directory, and the hardlinks
>> to sstable-A and sstable-B will be deleted in the main table directory, but
>> sstable-A and sstable-B will continue to exist in backups/.
>>
>> At this point, in your main table directory, you will have a hardlink to
>> sstable-C. In your backups/ directory you will have hardlinks to sstable-A,
>> and sstable-B.
>>
>>
>>
>> Thus, your main table directory is not cluttered with old un-compacted
>> sstables, and only has the sstables along with other files that are
>> actively being used.
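>>
>> If you want to convince yourself of the hard-link behaviour, a tiny
>> filesystem experiment (plain files standing in for sstables, all names made
>> up) mimics the sequence above:
>>
>>     mkdir -p table_dir/backups
>>     echo "data A" > table_dir/sstable-A            # "flush" creates the file...
>>     ln table_dir/sstable-A table_dir/backups/sstable-A  # ...and its backup hard link
>>     echo "data C" > table_dir/sstable-C            # "compaction" writes the new sstable
>>     rm table_dir/sstable-A                         # and removes the old one, only here
>>     cat table_dir/backups/sstable-A                # still prints "data A"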
>>
>>
>>
>> To drive the point home, …
>>
>> Suppose that you have another sstable-D that was either flushed from a
>> memtable or streamed from another node.
>>
>> At this point, in your main table directory, you will have sstable-C and
>> sstable-D. In your backups/ directory you will have hardlinks to sstable-A,
>> sstable-B, and sstable-D.
>>
>>
>>
>> Next, suppose compaction were to occur where say sstable-C and sstable-D
>> would be compacted to produce sstable-E, representing all the data from C
>> and D.
>>
>> Now, sstable-E will live in your main table directory, and the hardlinks
>> to sstable-C and sstable-D will be deleted in the main table directory, but
>> sstable-D will continue to exist in backups/ (sstable-C, being a compaction
>> product, was never hard-linked there).
>>
>> At this point, in your main table directory, you will have a hardlink to
>> sstable-E. In your backups/ directory you will have hardlinks to sstable-A,
>> sstable-B and sstable-D.
>>
>>
>>
>> As you can see, the backups/ directory quickly accumulates all the
>> un-compacted sstables, progressively using up more and more space.
>>
>> Also, note that the backups/ directory does not contain sstables
>> generated by compaction, such as sstable-C and sstable-E.
>>
>> It is safe to delete the entire backups/ directory because all the data
>> is represented in the compacted sstable-E.
>>
>> I hope this explanation was clear and gives you confidence in using rm to
>> delete the directory for backups/.
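>>
>> As a final sanity check before running rm, you can look at hard-link
>> counts (the second column of ls -l): a backups/ file whose data also still
>> lives in the table directory shows a count of 2, while one that survives
>> only in backups/ shows 1. Paths here are the made-up ones from above:
>>
>>     ls -l my_keyspace/my_table-some-uuid/backups/
>>     rm -r my_keyspace/my_table-some-uuid/backups/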
>>
>>
>>
>> Best regards,
>>
>> -Razi
>>
>>
>>
>>
>>
>>
>>
>> *From: *Kunal Gangakhedkar <kgangakhed...@gmail.com>
>> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Date: *Wednesday, January 11, 2017 at 6:47 AM
>>
>> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Subject: *Re: Backups eating up disk space
>>
>>
>>
>> Thanks for the reply, Razi.
>>
>>
>>
>> As I mentioned earlier, we're not currently using snapshots - it's only
>> the backups that are bothering me right now.
>>
>>
>>
>> So my next question is pertaining to this statement of yours:
>>
>>
>>
>> As far as I am aware, using *rm* is perfectly safe to delete the
>> directories for snapshots/backups as long as you are careful not to delete
>> your actively used sstable files and directories.
>>
>>
>>
>> How do I find out which are the actively used sstables?
>>
>> If by that you mean the main data files, does that mean I can safely
>> remove all files ONLY under the "backups/" directory?
>>
>> Or can removing files that are current hard links inside backups
>> potentially cause any issues?
>>
>>
>>
>> Thanks,
>>
>> Kunal
>>
>>
>>
>> On 11 January 2017 at 01:06, Khaja, Raziuddin (NIH/NLM/NCBI) [C] <
>> raziuddin.kh...@nih.gov> wrote:
>>
>> Hello Kunal,
>>
>>
>>
>> I would take a look at the following configuration options in
>> cassandra.yaml
>>
>>
>>
>> *Common automatic backup settings*
>>
>> *incremental_backups:*
>>
>> http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__incremental_backups
>>
>>
>>
>> (Default: false) Backs up data updated since the last snapshot was taken.
>> When enabled, Cassandra creates a hard link to each SSTable flushed or
>> streamed locally in a backups subdirectory of the keyspace data. Removing
>> these links is the operator's responsibility.
>>
>>
>>
>> *snapshot_before_compaction*:
>>
>> http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__snapshot_before_compaction
>>
>>
>>
>> (Default: false) Enables or disables taking a snapshot before each
>> compaction. A snapshot is useful to back up data when there is a data
>> format change. Be careful using this option: Cassandra does not clean up
>> older snapshots automatically.
>>
>>
>>
>>
>>
>> *Advanced automatic backup setting*
>>
>> *auto_snapshot*:
>>
>> http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__auto_snapshot
>>
>>
>>
>> (Default: true) Enables or disables whether Cassandra takes a snapshot of
>> the data before truncating a keyspace or dropping a table. To prevent data
>> loss, Datastax strongly advises using the default setting. If you
>> set auto_snapshot to false, you lose data on truncation or drop.
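>>
>> Putting the three settings together, the relevant cassandra.yaml lines
>> would look something like this (the values shown are the defaults quoted
>> above):
>>
>>     incremental_backups: false
>>     snapshot_before_compaction: false
>>     auto_snapshot: true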
>>
>>
>>
>>
>>
>> *nodetool* also provides methods to manage snapshots.
>> http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsNodetool.html
>>
>> See the specific commands:
>>
>>    - nodetool clearsnapshot
>>    
>> <http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsClearSnapShot.html>
>>    Removes one or more snapshots.
>>    - nodetool listsnapshots
>>    
>> <http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsListSnapShots.html>
>>    Lists snapshot names, size on disk, and true size.
>>    - nodetool snapshot
>>    
>> <http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsSnapShot.html>
>>    Take a snapshot of one or more keyspaces, or of a table, to back up
>>    data.
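>>
>> A typical round trip with these commands might look like this (the
>> keyspace and tag names are made up):
>>
>>     nodetool snapshot -t before_cleanup my_keyspace   # take a named snapshot
>>     nodetool listsnapshots                            # verify it and check sizes
>>     nodetool clearsnapshot -t before_cleanup          # remove it when done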
>>
>>
>>
>> As far as I am aware, using *rm* is perfectly safe to delete the
>> directories for snapshots/backups as long as you are careful not to delete
>> your actively used sstable files and directories.  I think the *nodetool
>> clearsnapshot* command is provided so that you don’t accidentally delete
>> actively used files.  Last I used *clearsnapshot* (a very long time
>> ago), I thought it left behind the directory, but this could have been
>> fixed in newer versions (so you might want to check that).
>>
>>
>>
>> HTH
>>
>> -Razi
>>
>>
>>
>>
>>
>> *From: *Jonathan Haddad <j...@jonhaddad.com>
>> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Date: *Tuesday, January 10, 2017 at 12:26 PM
>> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Subject: *Re: Backups eating up disk space
>>
>>
>>
>> If you remove the files from the backup directory, you would not have
>> data loss in the case of a node going down.  They're hard links to the same
>> files that are in your data directory, created when an sstable is written
>> to disk.  At that time, they take up (almost) no space, so they aren't a
>> big deal; but when the sstable gets compacted, they stick around, so they
>> end up preventing that space from being freed.
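>>
>> One way to see this for yourself: hard links to the same file share an
>> inode, so identical inode numbers in the data directory and in backups/
>> mean the same bytes on disk (paths are illustrative):
>>
>>     ls -i my_keyspace/my_table-some-uuid/*-Data.db
>>     ls -i my_keyspace/my_table-some-uuid/backups/*-Data.db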
>>
>>
>>
>> Usually you use incremental backups as a means of moving the sstables off
>> the node to a backup location.  If you're not doing anything with them,
>> they're just wasting space and you should disable incremental backups.
>>
>>
>>
>> Some people take snapshots then rely on incremental backups.  Others use
>> the tablesnap utility which does sort of the same thing.
>>
>>
>>
>> On Tue, Jan 10, 2017 at 9:18 AM Kunal Gangakhedkar <
>> kgangakhed...@gmail.com> wrote:
>>
>> Thanks for quick reply, Jon.
>>
>>
>>
>> But what about the case of a node or the cluster going down? Would there be
>> data loss if I remove these files manually?
>>
>>
>>
>> How is it typically managed in production setups?
>>
>> What are the best-practices for the same?
>>
>> Do people take snapshots on each node before removing the backups?
>>
>>
>>
>> This is my first production deployment - so, still trying to learn.
>>
>>
>>
>> Thanks,
>>
>> Kunal
>>
>>
>>
>> On 10 January 2017 at 21:36, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>
>> You can just delete them off the filesystem (rm)
>>
>>
>>
>> On Tue, Jan 10, 2017 at 8:02 AM Kunal Gangakhedkar <
>> kgangakhed...@gmail.com> wrote:
>>
>> Hi all,
>>
>>
>>
>> We have a 3-node cassandra cluster with incremental backup set to true.
>>
>> Each node has 1TB data volume that stores cassandra data.
>>
>>
>>
>> The load in the output of 'nodetool status' comes up at around 260GB on
>> each node.
>>
>> All our keyspaces use replication factor = 3.
>>
>>
>>
>> However, the df output shows the data volumes consuming around 850GB of
>> space.
>>
>> I checked the keyspace directory structures - most of the space goes in
>> <CASS_DATA_VOL>/data/<KEYSPACE>/<CF>/backups.
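>>
>> A quick du shows it, something like:
>>
>>     du -sh <CASS_DATA_VOL>/data/*/*/backups | sort -h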
>>
>>
>>
>> We have never manually run snapshots.
>>
>>
>>
>> What is the typical procedure to clear the backups?
>>
>> Can it be done without taking the node offline?
>>
>>
>>
>> Thanks,
>>
>> Kunal
>>
>>
>>
>>
>>
>
>
