Re: Time series data model and tombstones
Maybe trace your queries to see what's happening in detail.

On 28.01.2017 at 21:32, "John Sanda" wrote:

> Thanks for the response. This version of the code is using STCS.
> gc_grace_seconds was set to one day and then I changed it to zero since
> RF = 1. I understand that expired data will still generate tombstones and
> that STCS is not the best. More recent versions of the code use DTCS, and
> we'll be switching over to TWCS shortly. The suggestions raised are
> excellent ones, but I tend to think of them as optimizations that might
> not address my issue, which I think may be 1) a problem with my data
> model, 2) a problem with the queries used, or 3) some misunderstanding of
> how Cassandra performs range scans.
>
> I am doing append-only writes. There is no out-of-order data. There are
> no deletes, just TTLs. Data is stored on disk in descending order, and
> queries access recent data and never query past the TTL of seven days.
> Given this, I would not expect to be reading tombstones, certainly not
> the large numbers that I am seeing.
>
> --
> - John
Re: Time series data model and tombstones
Thanks for the response. This version of the code is using STCS. gc_grace_seconds was set to one day and then I changed it to zero since RF = 1. I understand that expired data will still generate tombstones and that STCS is not the best. More recent versions of the code use DTCS, and we'll be switching over to TWCS shortly. The suggestions raised are excellent ones, but I tend to think of them as optimizations that might not address my issue, which I think may be 1) a problem with my data model, 2) a problem with the queries used, or 3) some misunderstanding of how Cassandra performs range scans.

I am doing append-only writes. There is no out-of-order data. There are no deletes, just TTLs. Data is stored on disk in descending order, and queries access recent data and never query past the TTL of seven days. Given this, I would not expect to be reading tombstones, certainly not the large numbers that I am seeing.

On Sat, Jan 28, 2017 at 12:15 PM, Jonathan Haddad wrote:

> Since you didn't specify a compaction strategy I'm guessing you're using
> STCS. Your TTL'ed data is becoming a tombstone. TWCS is a better strategy
> for this type of workload.

--
- John
Re: Time series data model and tombstones
Since you didn't specify a compaction strategy I'm guessing you're using STCS. Your TTL'ed data is becoming a tombstone. TWCS is a better strategy for this type of workload.

On Sat, Jan 28, 2017 at 8:30 AM John Sanda wrote:

> I have a time series data model that is basically:
>
> CREATE TABLE metrics (
>     id text,
>     time timeuuid,
>     value double,
>     PRIMARY KEY (id, time)
> ) WITH CLUSTERING ORDER BY (time DESC);
>
> I do append-only writes, no deletes, and use a TTL of seven days. Data
> points are written every second. The UI queries data for the past hour,
> two hours, day, or week. The UI refreshes and executes queries every 30
> seconds. In one test environment I am seeing lots of tombstone threshold
> warnings and Cassandra has even OOME'd. Since I am storing data in
> descending order and always query for recent data, I do not understand why
> I am running into this problem.
>
> I know that it is recommended to do some date partitioning in part to
> ensure partitions do not grow too large. I already have some changes in
> place to partition by day. Before I make those changes I want to
> understand why I am scanning so many tombstones so that I can be more
> confident that the date partitioning changes will help.
>
> Thanks
>
> - John
Re: Time series data model and tombstones
When the data expires (after the TTL of 7 days), it is transformed into tombstones at the next compaction, and those tombstones stay there for gc_grace_seconds. After that, the tombstones are removed completely at the next compaction, if there is one ...

So, doing some math: supposing you have left gc_grace_seconds at its default value of 10 days, you'll have tombstones for 10 days' worth of data before they are eventually removed.

What is your compaction strategy?

I strongly suggest:

1) Setting the TTL directly as a table property (ALTER TABLE) instead of setting it at query level (INSERT INTO ... USING TTL). When the TTL is set at table level, Cassandra can perform some optimizations and drop some SSTables entirely without even bothering to compact them.
2) Using TimeWindowCompactionStrategy and tuning it properly to accommodate your workload.

On Sat, Jan 28, 2017 at 5:30 PM, John Sanda wrote:

> I have a time series data model that is basically:
>
> CREATE TABLE metrics (
>     id text,
>     time timeuuid,
>     value double,
>     PRIMARY KEY (id, time)
> ) WITH CLUSTERING ORDER BY (time DESC);
>
> I do append-only writes, no deletes, and use a TTL of seven days. Data
> points are written every second. The UI queries data for the past hour,
> two hours, day, or week. The UI refreshes and executes queries every 30
> seconds. In one test environment I am seeing lots of tombstone threshold
> warnings and Cassandra has even OOME'd. Since I am storing data in
> descending order and always query for recent data, I do not understand why
> I am running into this problem.
>
> I know that it is recommended to do some date partitioning in part to
> ensure partitions do not grow too large. I already have some changes in
> place to partition by day. Before I make those changes I want to
> understand why I am scanning so many tombstones so that I can be more
> confident that the date partitioning changes will help.
>
> Thanks
>
> - John
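The gc_grace_seconds arithmetic in the reply above can be put into numbers. This is only a rough sketch: it assumes a steady one point per second per id (the rate given in the thread), the 10-day default mentioned in the reply, and the worst case in which no expired cell is purged before gc_grace_seconds has passed. The function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def expired_cells_retained(writes_per_second: float, gc_grace_seconds: int) -> int:
    """Upper bound on expired cells one series still holds while they
    wait out gc_grace_seconds, assuming a steady write rate."""
    return int(writes_per_second * gc_grace_seconds)


# Assumed inputs: one write per second per id, 7-day TTL (both from the
# thread) and the default gc_grace_seconds of 10 days.
live_cells = 1 * 7 * SECONDS_PER_DAY
retained = expired_cells_retained(1, 10 * SECONDS_PER_DAY)

print(live_cells)  # 604800
print(retained)    # 864000
```

Under these assumptions a single id could carry more expired cells than live ones. Whether a particular query actually reads them depends on the range it scans, which this sketch does not model.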
Time series data model and tombstones
I have a time series data model that is basically:

    CREATE TABLE metrics (
        id text,
        time timeuuid,
        value double,
        PRIMARY KEY (id, time)
    ) WITH CLUSTERING ORDER BY (time DESC);

I do append-only writes, no deletes, and use a TTL of seven days. Data points are written every second. The UI queries data for the past hour, two hours, day, or week. The UI refreshes and executes queries every 30 seconds. In one test environment I am seeing lots of tombstone threshold warnings and Cassandra has even OOME'd. Since I am storing data in descending order and always query for recent data, I do not understand why I am running into this problem.

I know that it is recommended to do some date partitioning in part to ensure partitions do not grow too large. I already have some changes in place to partition by day. Before I make those changes I want to understand why I am scanning so many tombstones so that I can be more confident that the date partitioning changes will help.

Thanks

- John
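For readers following the date-partitioning idea mentioned above, here is a minimal sketch of how a client might work out which day partitions a UI query touches. It assumes a hypothetical layout in which the partition key becomes (id, day) with day stored as a UTC `YYYY-MM-DD` label; the thread does not show the actual schema change, so the helper name and the label format are illustrative only.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def day_buckets(start: datetime, end: datetime) -> list[str]:
    """Return the UTC day labels covering [start, end], newest first.

    Both arguments are expected to be timezone-aware datetimes.
    """
    if end < start:
        raise ValueError("end must not precede start")
    first = start.astimezone(timezone.utc).date()
    day = end.astimezone(timezone.utc).date()
    buckets = []
    while day >= first:
        buckets.append(day.isoformat())
        day -= timedelta(days=1)
    return buckets


# A "past hour" query issued shortly after midnight UTC spans two days.
now = datetime(2017, 1, 28, 0, 30, tzinfo=timezone.utc)
print(day_buckets(now - timedelta(hours=1), now))
# ['2017-01-28', '2017-01-27']
```

With such a layout, a one-hour query would read at most two partitions per id and a one-week query at most eight, with one query issued per bucket.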