Hello Kunal,

Caveat: I am not a super-expert on Cassandra, but it helps to explain to 
others, in order to eventually become an expert, so if my explanation is wrong, 
I would hope others would correct me. ☺

The active sstables/data files are all the files located in the directory 
for the table.
You can safely remove all files under the backups/ directory and the directory 
itself.
Removing any files inside backups/ that are currently hard links to live 
sstables won’t cause any issues either, and I will explain why.

Have you looked at your cassandra.yaml file and checked the setting for 
incremental_backups?  If it is set to true and you don’t want to make new 
backups, you can set it to false, so that once you have cleaned up you will not 
have to clean up the backups again.
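
A small sketch of that check, in case it helps. It is a deliberately naive 
line scan rather than a real YAML parser, the function name and the sample 
text are made up for illustration, and it falls back to false because that is 
the documented default when the key is absent:

```python
def incremental_backups_enabled(yaml_text):
    """Return True if the given cassandra.yaml text sets incremental_backups to true.

    Naive line scan (assumption: the key sits on one line as 'key: value').
    Returns False when the key is missing, matching the documented default.
    """
    for line in yaml_text.splitlines():
        stripped = line.split("#", 1)[0].strip()  # ignore comments
        if stripped.startswith("incremental_backups:"):
            value = stripped.split(":", 1)[1].strip().lower()
            return value == "true"
    return False


# Hypothetical excerpt, not taken from a real node.
sample = """
incremental_backups: true
snapshot_before_compaction: false
"""
print(incremental_backups_enabled(sample))
```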

Explanation:
Let’s look at the definition of incremental backups again: “Cassandra 
creates a hard link to each SSTable flushed or streamed locally in a backups 
subdirectory of the keyspace data.”

Suppose we have a directory path: my_keyspace/my_table-some-uuid/backups/
In the rest of the discussion, when I refer to “table directory”, I explicitly 
mean the directory: my_keyspace/my_table-some-uuid/
When I refer to backups/ directory, I explicitly mean: 
my_keyspace/my_table-some-uuid/backups/

Suppose that you have an sstable-A that was either flushed from a memtable or 
streamed from another node.
At this point, you have a hardlink to sstable-A in your table directory, and a 
hardlink to sstable-A in your backups/ directory.
Suppose that you have another sstable-B that was also either flushed from a 
memtable or streamed from another node.
At this point, you have a hardlink to sstable-B in your table directory, and a 
hardlink to sstable-B in your backups/ directory.
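
If hard links are unfamiliar, here is a rough sketch of what a flush looks 
like at the filesystem level. It only imitates the behaviour described above 
in a temporary directory; flush_sstable is a made-up helper, and a real 
sstable consists of several component files rather than one:

```python
import os
import tempfile


def flush_sstable(table_dir, name, payload):
    """Write one file into table_dir and hard-link it into table_dir/backups."""
    backups_dir = os.path.join(table_dir, "backups")
    os.makedirs(backups_dir, exist_ok=True)
    live_path = os.path.join(table_dir, name)
    with open(live_path, "wb") as f:
        f.write(payload)
    backup_path = os.path.join(backups_dir, name)
    os.link(live_path, backup_path)  # a second name for the same file, not a copy
    return live_path, backup_path


with tempfile.TemporaryDirectory() as root:
    table_dir = os.path.join(root, "my_keyspace", "my_table-some-uuid")
    os.makedirs(table_dir)
    live, backup = flush_sstable(table_dir, "sstable-A", b"rows from A")
    print(os.path.samefile(live, backup))  # do both names refer to one file?
    print(os.stat(live).st_nlink)          # how many names that file has
```

Because both names point at the same data, the link in backups/ costs almost 
no extra space for as long as the sstable is still live in the table directory.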

Next, suppose compaction occurs and sstable-A and sstable-B are compacted to 
produce sstable-C, representing all the data from A and B.
Now, sstable-C will live in your main table directory, and the hardlinks to 
sstable-A and sstable-B will be deleted in the main table directory, but 
sstable-A and sstable-B will continue to exist in backups/.
At this point, in your main table directory, you will have a hardlink to 
sstable-C. In your backups/ directory you will have hardlinks to sstable-A, and 
sstable-B.

Thus, your main table directory is not cluttered with old un-compacted 
sstables, and only has the sstables along with other files that are actively 
being used.

To drive the point home, …
Suppose that you have another sstable-D that was either flushed from a memtable 
or streamed from another node.
At this point, in your main table directory, you will have sstable-C and 
sstable-D. In your backups/ directory you will have hardlinks to sstable-A, 
sstable-B, and sstable-D.

Next, suppose compaction occurs and sstable-C and sstable-D are compacted to 
produce sstable-E, representing all the data from C and D.
Now, sstable-E will live in your main table directory, and the hardlinks to 
sstable-C and sstable-D will be deleted in the main table directory, but 
sstable-D will continue to exist in backups/.
At this point, in your main table directory, you will have a hardlink to 
sstable-E. In your backups/ directory you will have hardlinks to sstable-A, 
sstable-B and sstable-D.
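
The whole sequence can be replayed with a toy model in which each directory 
is just a set of names. The class and method names are invented for this 
illustration; nothing here touches Cassandra or the disk:

```python
class TableDirs:
    """Toy model: the table directory and backups/ as sets of sstable names."""

    def __init__(self):
        self.table = set()
        self.backups = set()

    def flush_or_stream(self, name):
        # A flushed or streamed sstable gets a link in both directories.
        self.table.add(name)
        self.backups.add(name)

    def compact(self, inputs, output):
        # Inputs are removed from the table directory only; the compaction
        # output is not linked into backups/.
        self.table -= set(inputs)
        self.table.add(output)


dirs = TableDirs()
dirs.flush_or_stream("sstable-A")
dirs.flush_or_stream("sstable-B")
dirs.compact(["sstable-A", "sstable-B"], "sstable-C")
dirs.flush_or_stream("sstable-D")
dirs.compact(["sstable-C", "sstable-D"], "sstable-E")

print(sorted(dirs.table))    # ['sstable-E']
print(sorted(dirs.backups))  # ['sstable-A', 'sstable-B', 'sstable-D']
```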

As you can see, the backups/ directory quickly accumulates all of the 
un-compacted sstables and progressively uses up more and more space.
Also, note that the backups/ directory does not contain sstables generated by 
compaction, such as sstable-C and sstable-E.
It is safe to delete the entire backups/ directory because all the data is 
represented in the compacted sstable-E.
I hope this explanation was clear and gives you confidence in using rm to 
delete the directory for backups/.
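
For what it is worth, here is a sketch of measuring and then removing 
backups/ from a script instead of by hand. Both function names are made up, 
the data_root/keyspace/table/backups layout is an assumption based on the 
paths discussed in this thread, and the demo runs against a throwaway 
temporary directory, so please try it on a copy before pointing anything like 
it at real data:

```python
import os
import shutil
import tempfile


def backups_bytes(data_root):
    """Sum file sizes under <data_root>/<keyspace>/<table>/backups/."""
    total = 0
    for dirpath, _dirs, files in os.walk(data_root):
        parts = os.path.relpath(dirpath, data_root).split(os.sep)
        if len(parts) >= 3 and parts[2] == "backups":
            total += sum(os.path.getsize(os.path.join(dirpath, f)) for f in files)
    return total


def remove_backups(table_dir):
    """Delete only the backups/ subdirectory of one table directory."""
    shutil.rmtree(os.path.join(table_dir, "backups"), ignore_errors=True)


# Demo on a temporary layout, never on a real data directory.
with tempfile.TemporaryDirectory() as data_root:
    table_dir = os.path.join(data_root, "my_keyspace", "my_table-some-uuid")
    backups_dir = os.path.join(table_dir, "backups")
    os.makedirs(backups_dir)
    live = os.path.join(table_dir, "sstable-E")
    with open(live, "wb") as f:
        f.write(b"compacted data")
    # Stand-in for an old sstable that only backups/ still references.
    with open(os.path.join(backups_dir, "sstable-A"), "wb") as f:
        f.write(b"old data")

    print(backups_bytes(data_root))  # bytes held under backups/ before cleanup
    remove_backups(table_dir)
    print(os.path.exists(live))      # is the live sstable still present?
    print(backups_bytes(data_root))  # bytes held under backups/ after cleanup
```

Note that the byte count is an upper bound on what you get back: a link whose 
sstable is still live in the table directory is counted too, even though 
deleting it frees nothing.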

Best regards,
-Razi



From: Kunal Gangakhedkar <kgangakhed...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Wednesday, January 11, 2017 at 6:47 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Backups eating up disk space

Thanks for the reply, Razi.

As I mentioned earlier, we're not currently using snapshots - it's only the 
backups that are bothering me right now.

So my next question is pertaining to this statement of yours:

As far as I am aware, using rm is perfectly safe to delete the directories for 
snapshots/backups as long as you are careful not to delete your actively used 
sstable files and directories.

How do I find out which are the actively used sstables?
If by that you mean the main data files, does that mean I can safely remove all 
files ONLY under the "backups/" directory?
Or can removing any files that are currently hard links inside backups 
potentially cause issues?

Thanks,
Kunal

On 11 January 2017 at 01:06, Khaja, Raziuddin (NIH/NLM/NCBI) [C] 
<raziuddin.kh...@nih.gov> wrote:
Hello Kunal,

I would take a look at the following configuration options in cassandra.yaml:

Common automatic backup settings
incremental_backups:
http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__incremental_backups

(Default: false) Backs up data updated since the last snapshot was taken. When 
enabled, Cassandra creates a hard link to each SSTable flushed or streamed 
locally in a backups subdirectory of the keyspace data. Removing these links is 
the operator's responsibility.

snapshot_before_compaction:
http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__snapshot_before_compaction

(Default: false) Enables or disables taking a snapshot before each compaction. 
A snapshot is useful to back up data when there is a data format change. Be 
careful using this option: Cassandra does not clean up older snapshots 
automatically.


Advanced automatic backup setting
auto_snapshot:
http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/configuration/configCassandra_yaml.html#configCassandra_yaml__auto_snapshot

(Default: true) Enables or disables whether Cassandra takes a snapshot of the 
data before truncating a keyspace or dropping a table. To prevent data loss, 
DataStax strongly advises using the default setting. If you set auto_snapshot 
to false, you lose data on truncation or drop.


nodetool also provides methods to manage snapshots. 
http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsNodetool.html
See the specific commands:

  *   nodetool clearsnapshot
      <http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsClearSnapShot.html>
      Removes one or more snapshots.
  *   nodetool listsnapshots
      <http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsListSnapShots.html>
      Lists snapshot names, size on disk, and true size.
  *   nodetool snapshot
      <http://docs.datastax.com/en/archived/cassandra/3.x/cassandra/tools/toolsSnapShot.html>
      Take a snapshot of one or more keyspaces, or of a table, to backup data.

As far as I am aware, using rm is perfectly safe to delete the directories for 
snapshots/backups as long as you are careful not to delete your actively used 
sstable files and directories.  I think the nodetool clearsnapshot command is 
provided so that you don’t accidentally delete actively used files.  The last 
time I used clearsnapshot (a very long time ago), I thought it left the 
directory behind, but this could have been fixed in newer versions (so you 
might want to check that).

HTH
-Razi


From: Jonathan Haddad <j...@jonhaddad.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, January 10, 2017 at 12:26 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Backups eating up disk space

If you remove the files from the backup directory, you would not have data loss 
in the case of a node going down.  They're hard links to the same files that 
are in your data directory, and are created when an sstable is written to disk. 
At that time, they take up (almost) no space, so they aren't a big deal, but 
when the sstable gets compacted, they stick around, so the space is not freed 
up.

Usually you use incremental backups as a means of moving the sstables off the 
node to a backup location.  If you're not doing anything with them, they're 
just wasting space and you should disable incremental backups.

Some people take snapshots then rely on incremental backups.  Others use the 
tablesnap utility which does sort of the same thing.

On Tue, Jan 10, 2017 at 9:18 AM Kunal Gangakhedkar 
<kgangakhed...@gmail.com> wrote:
Thanks for the quick reply, Jon.

But what about the case of a node or the cluster going down? Would there be 
data loss if I remove these files manually?

How is it typically managed in production setups?
What are the best-practices for the same?
Do people take snapshots on each node before removing the backups?

This is my first production deployment - so, still trying to learn.

Thanks,
Kunal

On 10 January 2017 at 21:36, Jonathan Haddad 
<j...@jonhaddad.com> wrote:
You can just delete them off the filesystem (rm)

On Tue, Jan 10, 2017 at 8:02 AM Kunal Gangakhedkar 
<kgangakhed...@gmail.com> wrote:
Hi all,

We have a 3-node Cassandra cluster with incremental backup set to true.
Each node has a 1TB data volume that stores Cassandra data.

The load in the output of 'nodetool status' comes to around 260GB on each node.
All our keyspaces use replication factor = 3.

However, the df output shows the data volumes consuming around 850GB of space.
I checked the keyspace directory structures - most of the space goes to 
<CASS_DATA_VOL>/data/<KEYSPACE>/<CF>/backups.

We have never manually run snapshots.

What is the typical procedure to clear the backups?
Can it be done without taking the node offline?

Thanks,
Kunal

