RE: Using TTL for data purge

SEAN_R_DURITY Tue, 29 Dec 2015 09:51:40 -0800

If you know how long the records should last, TTL is a good way to go. Remember 
that neither TTL nor deletes purge data immediately. Each leaves behind a 
special record called a tombstone to mark the data as deleted (with TTL, the 
expired cell is treated as a tombstone). Only when a compaction runs after 
gc_grace_seconds has passed (a table setting, default 10 days) is the data 
removed and the disk space regained.
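
To make the timing concrete, here is a rough sketch of the arithmetic in 
Python. The function name and the sample dates are made up for illustration, 
and it assumes the default gc_grace_seconds of 10 days. It only gives the 
earliest point at which a compaction could drop the data. The actual removal 
waits until a compaction includes the sstable holding it.

```python
from datetime import datetime, timedelta, timezone

# Default gc_grace_seconds (10 days); an assumption, check your table settings.
DEFAULT_GC_GRACE_SECONDS = 864000


def earliest_purge_time(written_at, ttl_seconds,
                        gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """Return the earliest time a compaction could drop a TTL'd cell."""
    # The cell expires (and starts acting as a tombstone) once the TTL elapses.
    expires_at = written_at + timedelta(seconds=ttl_seconds)
    # The tombstone only becomes eligible for removal after gc_grace_seconds.
    return expires_at + timedelta(seconds=gc_grace_seconds)


# Hypothetical example: a record written on 22 Dec 2015 with a 90-day TTL.
written = datetime(2015, 12, 22, tzinfo=timezone.utc)
ttl_90_days = 90 * 24 * 60 * 60
print(earliest_purge_time(written, ttl_90_days))
```

So with a 90-day TTL, the disk space comes back no sooner than about 100 days 
after the last write.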


If the data is relatively volatile and read speeds are important, you might 
look at leveled compaction, though it can keep your nodes a bit busier than 
size-tiered. (An issue with size-tiered, over time, is that the tombstoned data 
in the larger and older sstables may rarely, if ever, get compacted out.)
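
If you want to try leveled compaction, the change is a table-level setting. 
Below is a small Python sketch that just builds the CQL statement. The 
keyspace and table names are hypothetical placeholders, and you would run the 
resulting statement through cqlsh or your driver. Test on a non-production 
table first, since switching strategies triggers a round of recompaction.

```python
def leveled_compaction_cql(keyspace, table):
    """Build a CQL statement that switches a table to leveled compaction."""
    # LeveledCompactionStrategy is the class name for leveled compaction;
    # SizeTieredCompactionStrategy is the size-tiered default.
    return (
        "ALTER TABLE {}.{} "
        "WITH compaction = {{'class': 'LeveledCompactionStrategy'}};"
    ).format(keyspace, table)


# 'app' and 'user_info' are hypothetical names used only for illustration.
print(leveled_compaction_cql("app", "user_info"))
```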


Sean Durity – Lead Cassandra Admin
From: jaalex.tech [mailto:jaalex.t...@gmail.com]
Sent: Tuesday, December 22, 2015 4:36 AM
To: user@cassandra.apache.org
Subject: Using TTL for data purge

Hi,

I'm looking for suggestions/caveats on using TTL as a substitute for a manual 
data purge job.

We have a few tables that hold user information - this could be guest or 
registered users, and there could be between 500K and 1M records created per 
day per table. Currently, these tables have a secondary-indexed updated_date 
column, which is populated on each update. However, we have been getting 
timeouts when running queries on updated_date when the number of records is 
high, so I don't think this would be a reliable option in the long term when 
we need to purge records that have not been used for the last X days.

In this scenario, is it advisable to include a high enough TTL (i.e., the 
amount of time we want these to last, which could be 3 to 6 months) when 
inserting/updating records?

There could be cases where the TTL may get reset after a couple of days/weeks, 
when the user visits the site again.

The tables have a fixed number of columns, except for one which has a 
clustering key and may have at most 10 entries per partition key.

I need to know the overhead of having so many rows with TTL hanging around for 
a relatively long duration (weeks/months), and the impact it could have on 
performance/storage. If this is not a recommended approach, what would be an 
alternative design for a manual purge job that does not use secondary indices?

We are using Cassandra 2.0.x.

Thanks,
Joseph


