Re: Time series data model and tombstones

2017-01-28 Thread Benjamin Roth
Maybe trace your queries to see what's happening in detail.
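
For example, in cqlsh (the id value and time bound below are placeholders
for one of your real queries):

TRACING ON;
SELECT time, value FROM metrics
WHERE id = 'some-metric-id'
  AND time > maxTimeuuid('2017-01-27 12:00+0000');

Among other things, the trace will show how many live cells and how many
tombstone cells each read touched.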


Re: Time series data model and tombstones

2017-01-28 Thread John Sanda
Thanks for the response. This version of the code is using STCS.
gc_grace_seconds was set to one day, and then I changed it to zero since
RF = 1. I understand that expired data will still generate tombstones and
that STCS is not the best fit. More recent versions of the code use DTCS,
and we'll be switching over to TWCS shortly. The suggestions raised are
excellent ones, but I tend to think of them as optimizations that might
not address my issue, which I think may be 1) a problem with my data
model, 2) a problem with the queries used, or 3) some misunderstanding on
my part of how Cassandra performs range scans.
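
For reference, assuming the change was made at the table level, it would
have been something like:

ALTER TABLE metrics WITH gc_grace_seconds = 0;

(With RF = 1 there are no other replicas that could resurrect expired
data, which is what makes a zero grace period defensible here.)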

I am doing append-only writes. There is no out-of-order data. There are
no deletes, just TTLs. Data is stored on disk in descending order, and
queries access recent data and never reach past the TTL of seven days.
Given this, I would not expect to be reading tombstones, certainly not in
the large numbers that I am seeing.

-- 

- John


Re: Time series data model and tombstones

2017-01-28 Thread Jonathan Haddad
Since you didn't specify a compaction strategy, I'm guessing you're using
STCS. Your TTL'ed data is becoming tombstones. TWCS is a better strategy
for this type of workload.
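
Assuming a Cassandra version that ships TWCS, the switch is a single
ALTER; the one-day window below is only a starting point to tune:

ALTER TABLE metrics WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1'
};

With a seven-day TTL that keeps each day's data in its own window, and an
entire window's SSTables can be dropped once everything in them has
expired.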


Re: Time series data model and tombstones

2017-01-28 Thread DuyHai Doan
When the data expires (after the TTL of 7 days), it is transformed into
tombstones at the next compaction, and the tombstones will then stay
around for gc_grace_seconds. After that, the tombstones will be
completely removed at the next compaction, if there is one ...

So, doing some math: supposing you have left gc_grace_seconds at its
default value of 10 days, you'll be carrying tombstones for 10 days'
worth of data before they eventually get removed ...

What is your compaction strategy? I strongly suggest:

1) Setting the TTL directly as a table property (ALTER TABLE) instead of
setting it at the query level (INSERT INTO ... USING TTL). When the TTL
is set at the table level, Cassandra can perform an optimization and drop
entire SSTables without even bothering to compact them

2) Using TimeWindowCompactionStrategy and tuning it properly to
accommodate your workload (a sketch of item 1 follows below)
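
For item 1, the statement is a one-liner (604800 seconds = 7 days); the
TWCS switch itself is the same ALTER TABLE ... WITH compaction shown
earlier in the thread:

ALTER TABLE metrics WITH default_time_to_live = 604800;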



Time series data model and tombstones

2017-01-28 Thread John Sanda
I have a time series data model that is basically:

CREATE TABLE metrics (
    id text,
    time timeuuid,
    value double,
    PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY (time DESC);

I do append-only writes, no deletes, and use a TTL of seven days. Data
points are written every second. The UI queries data for the past hour,
two hours, day, or week, and it refreshes and re-executes its queries
every 30 seconds. In one test environment I am seeing lots of tombstone
threshold warnings and Cassandra has even OOME'd. Since I am storing data
in descending order and always query for recent data, I do not understand
why I am running into this problem.

I know that it is recommended to do some date partitioning, in part to
ensure partitions do not grow too large, and I already have some changes
in place to partition by day. Before I make those changes I want to
understand why I am scanning so many tombstones, so that I can be more
confident that the date partitioning changes will help.
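
The partition-by-day change I have in mind is roughly the following shape
(the table name and the date column are illustrative, not the actual
schema):

CREATE TABLE metrics_by_day (
    id text,
    day date,
    time timeuuid,
    value double,
    PRIMARY KEY ((id, day), time)
) WITH CLUSTERING ORDER BY (time DESC);

Queries would then touch only the partitions for the days in the
requested window, so a query for the past hour never reads partitions
that are mostly expired data.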

Thanks

- John