Cassandra Start script not working

2016-06-17 Thread Ali
Hi,

I downloaded Cassandra 3.6/3.7, and on my OS X machine with Java 1.8.0_45 it
fails to start whenever I launch it with
"bin/cassandra -f"

It gives me the error:

"Error: Could not find or load main class -ea"

Any help is appreciated.

Thanks
Ali


Re: [External] Re: Understanding when Cassandra drops expired time series data

2016-06-17 Thread jerome
Also, just out of curiosity: if the TTL is stored internally in the cell, does 
that mean Cassandra checks the timestamp every time it reads the data to 
determine whether it's still valid?
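
As a side note, the per-cell expiration metadata is visible from cqlsh. A 
minimal check against the metrics table discussed later in this thread 
(assuming that schema; the metric_id and time values below are made up) might 
look like:

SELECT value,
       TTL(value)       AS ttl_remaining_seconds,
       WRITETIME(value) AS write_timestamp_micros
FROM metrics
WHERE metric_id = 'cpu.load' AND time > '2016-06-16'
LIMIT 3;

TTL() counts down as each cell approaches its expiration point; once it passes, 
the cell simply stops coming back in reads.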


Thanks,

Jerome


From: jerome 
Sent: Friday, June 17, 2016 4:32:47 PM
To: user@cassandra.apache.org
Subject: Re: [External] Re: Understanding when Cassandra drops expired time 
series data


Got it. Thanks again! But in versions of Cassandra after 2.0.15, is 
getFullyExpiredSSTables more in line with what people would consider fully 
expired (i.e., any SSTable whose newest value is past the default TTL for the 
table)?


From: Jeff Jirsa 
Sent: Friday, June 17, 2016 4:21:02 PM
To: user@cassandra.apache.org
Subject: Re: [External] Re: Understanding when Cassandra drops expired time 
series data

Correcting myself - https://issues.apache.org/jira/browse/CASSANDRA-9882 made 
the check for fully expired sstables run no more than once every 10 minutes (it 
still happens on flush as described, just not on EVERY flush). It went into 
2.0.17 / 2.1.9.


-  Jeff

From: Jeff Jirsa 
Reply-To: 
Date: Friday, June 17, 2016 at 1:05 PM
To: "user@cassandra.apache.org" 
Subject: [External] Re: Understanding when Cassandra drops expired time series 
data

Any time the memtable is flushed to disk, the compaction strategy will look for 
sstables to compact. If there exist fully expired sstables (again, where 
getFullyExpiredSSTables agrees that they’re fully expired, which may not be when 
YOU think they’re fully expired), they’ll be included/dropped.



From: jerome 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, June 17, 2016 at 1:02 PM
To: "user@cassandra.apache.org" 
Subject: Re: Understanding when Cassandra drops expired time series data


Hi Jeff,



Thanks for the information! That helps clear up a lot of things for me. We've 
had issues with SSTables not being dropped despite being past their TTLs (no 
read repairs or deletes), and I think issue 9572 explains why (definitely going 
to look into updating soon). Just to clarify, in the current version of 
Cassandra, when do fully expired SSTables get dropped? Is it when a minor 
compaction runs, or is it separate from minor compactions? Also, thanks for the 
link to the slides - great stuff!



Jerome


From: Jeff Jirsa 
Sent: Friday, June 17, 2016 3:25:07 PM
To: user@cassandra.apache.org
Subject: Re: Understanding when Cassandra drops expired time series data

First, DTCS in 2.0.15 has some weird behaviors - 
https://issues.apache.org/jira/browse/CASSANDRA-9572.

That said, some other general notes:

Data deleted by TTL isn’t the same as issuing a delete – each expiring cell 
internally has a ttl/timestamp at which it will be converted into a tombstone. 
There is no tombstone added to the memtable, or flushed to disk – it just 
treats the expired cells as tombstones once they’re past that timestamp.
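
As an illustration of the per-cell nature of this, here is a minimal sketch 
against the metrics table defined later in this thread (assuming that schema; 
the metric_id and time values are made up):

-- with no TTL given, the table's default_time_to_live (86400) applies
INSERT INTO metrics (metric_id, time, value)
VALUES ('cpu.load', '2016-06-17 12:00:00', 0.42);

-- an explicit TTL overrides the table default for just these cells
INSERT INTO metrics (metric_id, time, value)
VALUES ('cpu.load', '2016-06-17 12:01:00', 0.43)
USING TTL 3600;

Each cell carries its own ttl/expiration point; nothing additional is written 
when it expires - the cell is simply read as a tombstone, and later purged by 
compaction, once that point has passed.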

Cassandra’s getFullyExpiredSSTables() will consider an sstable fully expired if 
(and only if) all cells within that sstable are expired (current time > max 
timestamp) AND the sstable’s timestamps don’t overlap with those of other 
sstables that aren’t fully expired. Björn talks about this in 
https://issues.apache.org/jira/browse/CASSANDRA-8243 - the intent here is that 
explicit deletes (which do create tombstones) won’t be GC’d from an otherwise 
fully expired sstable if they’re covering data in a more recent sstable – 
without this check, we could accidentally bring dead data back to life. In an 
append-only time series workload this would be unusual, but not impossible.

Unfortunately, read repairs (foreground/blocking, if you write with CL < ALL 
and read with CL > ONE) will cause cells written with old timestamps to be 
written into the newly flushed sstables, which creates sstables with wide gaps 
between minTimestamp and maxTimestamp (you could have a read repair pull data 
that is 23 hours old into a new sstable, and now that one sstable spans 23 
hours, and isn’t fully expired until the oldest data is 47 hours old). There’s 
an open ticket 

Re: [External] Re: Understanding when Cassandra drops expired time series data

2016-06-17 Thread jerome
Got it. Thanks again! But in versions of Cassandra after 2.0.15, is 
getFullyExpiredSSTables more in line with what people would consider fully 
expired (i.e., any SSTable whose newest value is past the default TTL for the 
table)?


From: Jeff Jirsa 
Sent: Friday, June 17, 2016 4:21:02 PM
To: user@cassandra.apache.org
Subject: Re: [External] Re: Understanding when Cassandra drops expired time 
series data

Correcting myself - https://issues.apache.org/jira/browse/CASSANDRA-9882 made 
the check for fully expired sstables run no more than once every 10 minutes (it 
still happens on flush as described, just not on EVERY flush). It went into 
2.0.17 / 2.1.9.


-  Jeff

From: Jeff Jirsa 
Reply-To: 
Date: Friday, June 17, 2016 at 1:05 PM
To: "user@cassandra.apache.org" 
Subject: [External] Re: Understanding when Cassandra drops expired time series 
data

Any time the memtable is flushed to disk, the compaction strategy will look for 
sstables to compact. If there exist fully expired sstables (again, where 
getFullyExpiredSSTables agrees that they’re fully expired, which may not be when 
YOU think they’re fully expired), they’ll be included/dropped.



From: jerome 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, June 17, 2016 at 1:02 PM
To: "user@cassandra.apache.org" 
Subject: Re: Understanding when Cassandra drops expired time series data


Hi Jeff,



Thanks for the information! That helps clear up a lot of things for me. We've 
had issues with SSTables not being dropped despite being past their TTLs (no 
read repairs or deletes), and I think issue 9572 explains why (definitely going 
to look into updating soon). Just to clarify, in the current version of 
Cassandra, when do fully expired SSTables get dropped? Is it when a minor 
compaction runs, or is it separate from minor compactions? Also, thanks for the 
link to the slides - great stuff!



Jerome


From: Jeff Jirsa 
Sent: Friday, June 17, 2016 3:25:07 PM
To: user@cassandra.apache.org
Subject: Re: Understanding when Cassandra drops expired time series data

First, DTCS in 2.0.15 has some weird behaviors - 
https://issues.apache.org/jira/browse/CASSANDRA-9572.

That said, some other general notes:

Data deleted by TTL isn’t the same as issuing a delete – each expiring cell 
internally has a ttl/timestamp at which it will be converted into a tombstone. 
There is no tombstone added to the memtable, or flushed to disk – it just 
treats the expired cells as tombstones once they’re past that timestamp.

Cassandra’s getFullyExpiredSSTables() will consider an sstable fully expired if 
(and only if) all cells within that sstable are expired (current time > max 
timestamp) AND the sstable’s timestamps don’t overlap with those of other 
sstables that aren’t fully expired. Björn talks about this in 
https://issues.apache.org/jira/browse/CASSANDRA-8243 - the intent here is that 
explicit deletes (which do create tombstones) won’t be GC’d from an otherwise 
fully expired sstable if they’re covering data in a more recent sstable – 
without this check, we could accidentally bring dead data back to life. In an 
append-only time series workload this would be unusual, but not impossible.

Unfortunately, read repairs (foreground/blocking, if you write with CL < ALL 
and read with CL > ONE) will cause cells written with old timestamps to be 
written into the newly flushed sstables, which creates sstables with wide gaps 
between minTimestamp and maxTimestamp (you could have a read repair pull data 
that is 23 hours old into a new sstable, and now that one sstable spans 23 
hours, and isn’t fully expired until the oldest data is 47 hours old). There’s 
an open ticket 
(https://issues.apache.org/jira/browse/CASSANDRA-10496) meant to make this 
behavior ‘better’ in the future by splitting those old read-repaired cells from 
the newly flushed sstables.
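
One per-table knob that often comes up for time series tables in this situation 
is turning off the probabilistic background read repair (note this does not 
disable the blocking digest-mismatch repairs described above). A hedged example 
against the metrics table from this thread (these table options exist in the 
2.x/3.x versions discussed here):

ALTER TABLE metrics
  WITH read_repair_chance = 0
  AND dclocal_read_repair_chance = 0;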

I gave a talk on a lot of this 

Re: [External] Re: Understanding when Cassandra drops expired time series data

2016-06-17 Thread Jeff Jirsa
Correcting myself - https://issues.apache.org/jira/browse/CASSANDRA-9882 made 
the check for fully expired sstables run no more than once every 10 minutes (it 
still happens on flush as described, just not on EVERY flush). It went into 
2.0.17 / 2.1.9.

 

-  Jeff

 

From: Jeff Jirsa 
Reply-To: 
Date: Friday, June 17, 2016 at 1:05 PM
To: "user@cassandra.apache.org" 
Subject: [External] Re: Understanding when Cassandra drops expired time series 
data

 

Any time the memtable is flushed to disk, the compaction strategy will look for 
sstables to compact. If there exist fully expired sstables (again, where 
getFullyExpiredSSTables agrees that they’re fully expired, which may not be when 
YOU think they’re fully expired), they’ll be included/dropped.

 

 

 

From: jerome 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, June 17, 2016 at 1:02 PM
To: "user@cassandra.apache.org" 
Subject: Re: Understanding when Cassandra drops expired time series data

 

Hi Jeff,

 

Thanks for the information! That helps clear up a lot of things for me. We've 
had issues with SSTables not being dropped despite being past their TTLs (no 
read repairs or deletes), and I think issue 9572 explains why (definitely going 
to look into updating soon). Just to clarify, in the current version of 
Cassandra, when do fully expired SSTables get dropped? Is it when a minor 
compaction runs, or is it separate from minor compactions? Also, thanks for the 
link to the slides - great stuff!

 

Jerome

From: Jeff Jirsa 
Sent: Friday, June 17, 2016 3:25:07 PM
To: user@cassandra.apache.org
Subject: Re: Understanding when Cassandra drops expired time series data 

 

First, DTCS in 2.0.15 has some weird behaviors - 
https://issues.apache.org/jira/browse/CASSANDRA-9572.

 

That said, some other general notes:


Data deleted by TTL isn’t the same as issuing a delete – each expiring cell 
internally has a ttl/timestamp at which it will be converted into a tombstone. 
There is no tombstone added to the memtable, or flushed to disk – it just 
treats the expired cells as tombstones once they’re past that timestamp. 





Cassandra’s getFullyExpiredSSTables() will consider an sstable fully expired if 
(and only if) all cells within that sstable are expired (current time > max 
timestamp) AND the sstable’s timestamps don’t overlap with those of other 
sstables that aren’t fully expired. Björn talks about this in 
https://issues.apache.org/jira/browse/CASSANDRA-8243 - the intent here is that 
explicit deletes (which do create tombstones) won’t be GC’d from an otherwise 
fully expired sstable if they’re covering data in a more recent sstable – 
without this check, we could accidentally bring dead data back to life. In an 
append-only time series workload this would be unusual, but not impossible.

Unfortunately, read repairs (foreground/blocking, if you write with CL < ALL 
and read with CL > ONE) will cause cells written with old timestamps to be 
written into the newly flushed sstables, which creates sstables with wide gaps 
between minTimestamp and maxTimestamp (you could have a read repair pull data 
that is 23 hours old into a new sstable, and now that one sstable spans 23 
hours, and isn’t fully expired until the oldest data is 47 hours old). There’s 
an open ticket (https://issues.apache.org/jira/browse/CASSANDRA-10496 ) meant 
to make this behavior ‘better’ in the future by splitting those old 
read-repaired cells from the newly flushed sstables. 





I gave a talk on a lot of this behavior last year at Summit 
(http://www.slideshare.net/JeffJirsa1/cassandra-summit-2015-real-world-dtcs-for-operators) 
- if you’re running time series in production on DTCS, it’s worth a glance.

 

 

 

From: jerome 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, June 17, 2016 at 11:52 AM
To: "user@cassandra.apache.org" 
Subject: Understanding when Cassandra drops expired time series data

 

Hello! Recently I have been trying to familiarize myself with Cassandra but 
don't quite understand when data is removed from disk after it has been 
deleted. The use case I'm particularly interested in is expiring time series data
with DTCS. As an example, I created the following table:
CREATE TABLE metrics (
  metric_id text,
  time timestamp,
  value double,
  

Re: Understanding when Cassandra drops expired time series data

2016-06-17 Thread Jeff Jirsa
Any time the memtable is flushed to disk, the compaction strategy will look for 
sstables to compact. If there exist fully expired sstables (again, where 
getFullyExpiredSSTables agrees that they’re fully expired, which may not be when 
YOU think they’re fully expired), they’ll be included/dropped.

 

 

 

From: jerome 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, June 17, 2016 at 1:02 PM
To: "user@cassandra.apache.org" 
Subject: Re: Understanding when Cassandra drops expired time series data

 

Hi Jeff,

 

Thanks for the information! That helps clear up a lot of things for me. We've 
had issues with SSTables not being dropped despite being past their TTLs (no 
read repairs or deletes), and I think issue 9572 explains why (definitely going 
to look into updating soon). Just to clarify, in the current version of 
Cassandra, when do fully expired SSTables get dropped? Is it when a minor 
compaction runs, or is it separate from minor compactions? Also, thanks for the 
link to the slides - great stuff!

 

Jerome

From: Jeff Jirsa 
Sent: Friday, June 17, 2016 3:25:07 PM
To: user@cassandra.apache.org
Subject: Re: Understanding when Cassandra drops expired time series data 

 

First, DTCS in 2.0.15 has some weird behaviors - 
https://issues.apache.org/jira/browse/CASSANDRA-9572.

 

That said, some other general notes:


Data deleted by TTL isn’t the same as issuing a delete – each expiring cell 
internally has a ttl/timestamp at which it will be converted into a tombstone. 
There is no tombstone added to the memtable, or flushed to disk – it just 
treats the expired cells as tombstones once they’re past that timestamp. 





Cassandra’s getFullyExpiredSSTables() will consider an sstable fully expired if 
(and only if) all cells within that sstable are expired (current time > max 
timestamp) AND the sstable’s timestamps don’t overlap with those of other 
sstables that aren’t fully expired. Björn talks about this in 
https://issues.apache.org/jira/browse/CASSANDRA-8243 - the intent here is that 
explicit deletes (which do create tombstones) won’t be GC’d from an otherwise 
fully expired sstable if they’re covering data in a more recent sstable – 
without this check, we could accidentally bring dead data back to life. In an 
append-only time series workload this would be unusual, but not impossible.

Unfortunately, read repairs (foreground/blocking, if you write with CL < ALL 
and read with CL > ONE) will cause cells written with old timestamps to be 
written into the newly flushed sstables, which creates sstables with wide gaps 
between minTimestamp and maxTimestamp (you could have a read repair pull data 
that is 23 hours old into a new sstable, and now that one sstable spans 23 
hours, and isn’t fully expired until the oldest data is 47 hours old). There’s 
an open ticket (https://issues.apache.org/jira/browse/CASSANDRA-10496 ) meant 
to make this behavior ‘better’ in the future by splitting those old 
read-repaired cells from the newly flushed sstables. 





I gave a talk on a lot of this behavior last year at Summit 
(http://www.slideshare.net/JeffJirsa1/cassandra-summit-2015-real-world-dtcs-for-operators) 
- if you’re running time series in production on DTCS, it’s worth a glance.

 

 

 

From: jerome 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, June 17, 2016 at 11:52 AM
To: "user@cassandra.apache.org" 
Subject: Understanding when Cassandra drops expired time series data

 

Hello! Recently I have been trying to familiarize myself with Cassandra but 
don't quite understand when data is removed from disk after it has been 
deleted. The use case I'm particularly interested in is expiring time series data
with DTCS. As an example, I created the following table:
CREATE TABLE metrics (
  metric_id text,
  time timestamp,
  value double,
  PRIMARY KEY (metric_id, time),
) WITH CLUSTERING ORDER BY (time DESC) AND 
 default_time_to_live = 86400 AND
 gc_grace_seconds = 3600 AND
 compaction = {
  'class': 'DateTieredCompactionStrategy',
  'timestamp_resolution':'MICROSECONDS',
  'base_time_seconds':'3600',
  'max_sstable_age_days':'365',
  'min_threshold':'4'
 };
I understand that Cassandra will create a tombstone for all rows inserted into 
this table 24 hours after they are inserted (86400 seconds). These tombstones 
will 

Re: Understanding when Cassandra drops expired time series data

2016-06-17 Thread Jeff Jirsa
The getFullyExpiredSSTables behavior and read repair mixing data apply to TWCS 
as well, though TWCS does a major compaction per window to try to reduce the 
interwoven graph of sstable expiration blockers that can happen in DTCS windows 
(and the way DTCS windowing works, using minTimestamp to choose a bucket instead 
of maxTimestamp, likely makes it “worse” in the sense that you’ll have more new 
data in old windows with DTCS than with TWCS).
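
For anyone who wants to try TWCS on a table like the metrics example in this 
thread, the switch is a compaction change. A hedged sketch (assumes a Cassandra 
build that ships TimeWindowCompactionStrategy; the window settings below are 
illustrative, not recommendations):

ALTER TABLE metrics WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  -- one compaction window per hour, mirroring base_time_seconds = 3600
  'compaction_window_unit': 'HOURS',
  'compaction_window_size': '1'
};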

 

Something like https://issues.apache.org/jira/browse/CASSANDRA-10496 may be the 
right fix eventually. 

 

(Alternatively, if https://issues.apache.org/jira/browse/CASSANDRA-9779 gets 
implemented, ‘APPEND ONLY’ tables could be more aggressive in 
getFullyExpiredSSTables)

 

 

From: "Jason J. W. Williams" 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, June 17, 2016 at 12:29 PM
To: "user@cassandra.apache.org" 
Subject: Re: Understanding when Cassandra drops expired time series data

 

Hey Jeff, 

 

Do most of those behaviors apply to TWCS too?

 

-J

 

On Fri, Jun 17, 2016 at 1:25 PM, Jeff Jirsa  wrote:

First, DTCS in 2.0.15 has some weird behaviors - 
https://issues.apache.org/jira/browse/CASSANDRA-9572.

 

That said, some other general notes:


Data deleted by TTL isn’t the same as issuing a delete – each expiring cell 
internally has a ttl/timestamp at which it will be converted into a tombstone. 
There is no tombstone added to the memtable, or flushed to disk – it just 
treats the expired cells as tombstones once they’re past that timestamp. 





Cassandra’s getFullyExpiredSSTables() will consider an sstable fully expired if 
(and only if) all cells within that sstable are expired (current time > max 
timestamp) AND the sstable’s timestamps don’t overlap with those of other 
sstables that aren’t fully expired. Björn talks about this in 
https://issues.apache.org/jira/browse/CASSANDRA-8243 - the intent here is that 
explicit deletes (which do create tombstones) won’t be GC’d from an otherwise 
fully expired sstable if they’re covering data in a more recent sstable – 
without this check, we could accidentally bring dead data back to life. In an 
append-only time series workload this would be unusual, but not impossible.

Unfortunately, read repairs (foreground/blocking, if you write with CL < ALL 
and read with CL > ONE) will cause cells written with old timestamps to be 
written into the newly flushed sstables, which creates sstables with wide gaps 
between minTimestamp and maxTimestamp (you could have a read repair pull data 
that is 23 hours old into a new sstable, and now that one sstable spans 23 
hours, and isn’t fully expired until the oldest data is 47 hours old). There’s 
an open ticket (https://issues.apache.org/jira/browse/CASSANDRA-10496 ) meant 
to make this behavior ‘better’ in the future by splitting those old 
read-repaired cells from the newly flushed sstables. 





I gave a talk on a lot of this behavior last year at Summit 
(http://www.slideshare.net/JeffJirsa1/cassandra-summit-2015-real-world-dtcs-for-operators) 
- if you’re running time series in production on DTCS, it’s worth a glance.

 

 

 

From: jerome 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, June 17, 2016 at 11:52 AM
To: "user@cassandra.apache.org" 
Subject: Understanding when Cassandra drops expired time series data

 

Hello! Recently I have been trying to familiarize myself with Cassandra but 
don't quite understand when data is removed from disk after it has been 
deleted. The use case I'm particularly interested in is expiring time series data
with DTCS. As an example, I created the following table:
CREATE TABLE metrics (
  metric_id text,
  time timestamp,
  value double,
  PRIMARY KEY (metric_id, time),
) WITH CLUSTERING ORDER BY (time DESC) AND 
 default_time_to_live = 86400 AND
 gc_grace_seconds = 3600 AND
 compaction = {
  'class': 'DateTieredCompactionStrategy',
  'timestamp_resolution':'MICROSECONDS',
  'base_time_seconds':'3600',
  'max_sstable_age_days':'365',
  'min_threshold':'4'
 };
I understand that Cassandra will create a tombstone for all rows inserted into 
this table 24 hours after they are inserted (86400 seconds). These tombstones 
will first be written to an in-memory Memtable and then flushed to disk as an 
SSTable when the Memtable reaches a 

Re: Understanding when Cassandra drops expired time series data

2016-06-17 Thread jerome
Hi Jeff,


Thanks for the information! That helps clear up a lot of things for me. We've 
had issues with SSTables not being dropped despite being past their TTLs (no 
read repairs or deletes), and I think issue 9572 explains why (definitely going 
to look into updating soon). Just to clarify, in the current version of 
Cassandra, when do fully expired SSTables get dropped? Is it when a minor 
compaction runs, or is it separate from minor compactions? Also, thanks for the 
link to the slides - great stuff!


Jerome


From: Jeff Jirsa 
Sent: Friday, June 17, 2016 3:25:07 PM
To: user@cassandra.apache.org
Subject: Re: Understanding when Cassandra drops expired time series data

First, DTCS in 2.0.15 has some weird behaviors - 
https://issues.apache.org/jira/browse/CASSANDRA-9572.

That said, some other general notes:

Data deleted by TTL isn’t the same as issuing a delete – each expiring cell 
internally has a ttl/timestamp at which it will be converted into a tombstone. 
There is no tombstone added to the memtable, or flushed to disk – it just 
treats the expired cells as tombstones once they’re past that timestamp.

Cassandra’s getFullyExpiredSSTables() will consider an sstable fully expired if 
(and only if) all cells within that sstable are expired (current time > max 
timestamp) AND the sstable’s timestamps don’t overlap with those of other 
sstables that aren’t fully expired. Björn talks about this in 
https://issues.apache.org/jira/browse/CASSANDRA-8243 - the intent here is that 
explicit deletes (which do create tombstones) won’t be GC’d from an otherwise 
fully expired sstable if they’re covering data in a more recent sstable – 
without this check, we could accidentally bring dead data back to life. In an 
append-only time series workload this would be unusual, but not impossible.

Unfortunately, read repairs (foreground/blocking, if you write with CL < ALL 
and read with CL > ONE) will cause cells written with old timestamps to be 
written into the newly flushed sstables, which creates sstables with wide gaps 
between minTimestamp and maxTimestamp (you could have a read repair pull data 
that is 23 hours old into a new sstable, and now that one sstable spans 23 
hours, and isn’t fully expired until the oldest data is 47 hours old). There’s 
an open ticket (https://issues.apache.org/jira/browse/CASSANDRA-10496 ) meant 
to make this behavior ‘better’ in the future by splitting those old 
read-repaired cells from the newly flushed sstables.

I gave a talk on a lot of this behavior last year at Summit 
(http://www.slideshare.net/JeffJirsa1/cassandra-summit-2015-real-world-dtcs-for-operators) 
- if you’re running time series in production on DTCS, it’s worth a glance.



From: jerome 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, June 17, 2016 at 11:52 AM
To: "user@cassandra.apache.org" 
Subject: Understanding when Cassandra drops expired time series data


Hello! Recently I have been trying to familiarize myself with Cassandra but 
don't quite understand when data is removed from disk after it has been 
deleted. The use case I'm particularly interested in is expiring time series data
with DTCS. As an example, I created the following table:

CREATE TABLE metrics (
  metric_id text,
  time timestamp,
  value double,
  PRIMARY KEY (metric_id, time),
) WITH CLUSTERING ORDER BY (time DESC) AND
 default_time_to_live = 86400 AND
 gc_grace_seconds = 3600 AND
 compaction = {
  'class': 'DateTieredCompactionStrategy',
  'timestamp_resolution':'MICROSECONDS',
  'base_time_seconds':'3600',
  'max_sstable_age_days':'365',
  'min_threshold':'4'
 };

I understand that Cassandra will create a tombstone for all rows inserted into 
this table 24 hours after they are inserted (86400 seconds). These tombstones 
will first be written to an in-memory Memtable and then flushed to disk as an 
SSTable when the Memtable reaches a certain size. My question is when will the 
data that is now expired be removed from disk? Is it the next time the SSTable 
which contains the data gets compacted? So, with DTCS and min_threshold set to 
four, we would wait until at least three other SSTables are in the same time 
window as the expired data, and then those SSTables will be compacted into a 
SSTable without the expired data. Is it only during this compaction that the 
data will be removed? It seems to me that this would require Cassandra to 
maintain some metadata on which rows have been deleted since the newer 
tombstones would likely not be in the older SSTables that are being compacted. 
Also, I'm aware that Cassandra can drop entire SSTables if they contain only 
expired data but I'm unsure of what qualifies as expired data (is it just 
SSTables whose maximum timestamp is past the default TTL for the table?) and 
when such SSTables are dropped.

Alternatively, 

Re: Understanding when Cassandra drops expired time series data

2016-06-17 Thread Jason J. W. Williams
Hey Jeff,

Do most of those behaviors apply to TWCS too?

-J

On Fri, Jun 17, 2016 at 1:25 PM, Jeff Jirsa 
wrote:

> First, DTCS in 2.0.15 has some weird behaviors -
> https://issues.apache.org/jira/browse/CASSANDRA-9572 .
>
>
>
> That said, some other general notes:
>
>
> Data deleted by TTL isn’t the same as issuing a delete – each expiring
> cell internally has a ttl/timestamp at which it will be converted into a
> tombstone. There is no tombstone added to the memtable, or flushed to disk
> – it just treats the expired cells as tombstones once they’re past that
> timestamp.
>
>
>
>
> Cassandra’s getFullyExpiredSSTables() will consider a table fully expired
> if (and only if) all cells within that table are expired (current time >
> max timestamp ) AND the sstable timestamps don’t overlap with others that
> aren’t fully expired. Björn talks about this in
> https://issues.apache.org/jira/browse/CASSANDRA-8243 - the intent here is
> so that explicit deletes (which do create tombstones) won’t be GC’d  from
> an otherwise fully expired sstable if they’re covering data in a more
> recent sstable – without this check, we could accidentally bring dead data
> back to life. In an append only time series workload this would be unusual,
> but not impossible.
>
> Unfortunately, read repairs (foreground/blocking, if you write with CL <
> ALL and read with CL > ONE) will cause cells written with old timestamps to
> be written into the newly flushed sstables, which creates sstables with
> wide gaps between minTimestamp and maxTimestamp (you could have a read
> repair pull data that is 23 hours old into a new sstable, and now that one
> sstable spans 23 hours, and isn’t fully expired until the oldest data is 47
> hours old). There’s an open ticket (
> https://issues.apache.org/jira/browse/CASSANDRA-10496 ) meant to make
> this behavior ‘better’ in the future by splitting those old read-repaired
> cells from the newly flushed sstables.
>
>
>
>
> I gave a talk on a lot of this behavior last year at Summit
> (http://www.slideshare.net/JeffJirsa1/cassandra-summit-2015-real-world-dtcs-for-operators)
> - if you’re running time series in production on DTCS, it’s worth a glance.
>
>
>
>
>
>
>
> *From: *jerome 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Friday, June 17, 2016 at 11:52 AM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Understanding when Cassandra drops expired time series data
>
>
>
> Hello! Recently I have been trying to familiarize myself with Cassandra
> but don't quite understand when data is removed from disk after it has been
> deleted. The use case I'm particularly interested in is expiring time series
> data with DTCS. As an example, I created the following table:
>
> CREATE TABLE metrics (
>
>   metric_id text,
>
>   time timestamp,
>
>   value double,
>
>   PRIMARY KEY (metric_id, time),
>
> ) WITH CLUSTERING ORDER BY (time DESC) AND
>
>  default_time_to_live = 86400 AND
>
>  gc_grace_seconds = 3600 AND
>
>  compaction = {
>
>   'class': 'DateTieredCompactionStrategy',
>
>   'timestamp_resolution':'MICROSECONDS',
>
>   'base_time_seconds':'3600',
>
>   'max_sstable_age_days':'365',
>
>   'min_threshold':'4'
>
>  };
>
> I understand that Cassandra will create a tombstone for all rows inserted
> into this table 24 hours after they are inserted (86400 seconds). These
> tombstones will first be written to an in-memory Memtable and then flushed
> to disk as an SSTable when the Memtable reaches a certain size. My question
> is when will the data that is now expired be removed from disk? Is it the
> next time the SSTable which contains the data gets compacted? So, with DTCS
> and min_threshold set to four, we would wait until at least three other
> SSTables are in the same time window as the expired data, and then those
> SSTables will be compacted into a SSTable without the expired data. Is it
> only during this compaction that the data will be removed? It seems to me
> that this would require Cassandra to maintain some metadata on which rows
> have been deleted since the newer tombstones would likely not be in the
> older SSTables that are being compacted. Also, I'm aware that Cassandra can
> drop entire SSTables if they contain only expired data but I'm unsure of
> what qualifies as expired data (is it just SSTables whose maximum
> timestamp is past the default TTL for the table?) and when such SSTables
> are dropped.
>
> Alternatively, do the SSTables which contain the tombstones have to be
> compacted with the SSTables which contain the expired data for the data to
> be removed? It seems to me that this could result in Cassandra holding the
> expired data long after it has expired since it's waiting for the new
> tombstones to be compacted with the older expired data.
>
> Finally, I was also unsure when the tombstones themselves are removed. I
> 

Re: Understanding when Cassandra drops expired time series data

2016-06-17 Thread Jeff Jirsa
First, DTCS in 2.0.15 has some weird behaviors - 
https://issues.apache.org/jira/browse/CASSANDRA-9572.

 

That said, some other general notes:


Data deleted by TTL isn’t the same as issuing a delete – each expiring cell 
internally has a ttl/timestamp at which it will be converted into a tombstone. 
There is no tombstone added to the memtable, or flushed to disk – it just 
treats the expired cells as tombstones once they’re past that timestamp. 



    

Cassandra’s getFullyExpiredSSTables() will consider an sstable fully expired if 
(and only if) all cells within that sstable are expired (current time > max 
timestamp) AND the sstable’s timestamps don’t overlap with those of other 
sstables that aren’t fully expired. Björn talks about this in 
https://issues.apache.org/jira/browse/CASSANDRA-8243 - the intent here is that 
explicit deletes (which do create tombstones) won’t be GC’d from an otherwise 
fully expired sstable if they’re covering data in a more recent sstable – 
without this check, we could accidentally bring dead data back to life. In an 
append-only time series workload this would be unusual, but not impossible.

Unfortunately, read repairs (foreground/blocking, if you write with CL < ALL 
and read with CL > ONE) will cause cells written with old timestamps to be 
written into the newly flushed sstables, which creates sstables with wide gaps 
between minTimestamp and maxTimestamp (you could have a read repair pull data 
that is 23 hours old into a new sstable, and now that one sstable spans 23 
hours, and isn’t fully expired until the oldest data is 47 hours old). There’s 
an open ticket (https://issues.apache.org/jira/browse/CASSANDRA-10496 ) meant 
to make this behavior ‘better’ in the future by splitting those old 
read-repaired cells from the newly flushed sstables. 



    

I gave a talk on a lot of this behavior last year at Summit 
(http://www.slideshare.net/JeffJirsa1/cassandra-summit-2015-real-world-dtcs-for-operators) 
- if you’re running time series in production on DTCS, it’s worth a glance.

 

 

 

From: jerome 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, June 17, 2016 at 11:52 AM
To: "user@cassandra.apache.org" 
Subject: Understanding when Cassandra drops expired time series data

 

Hello! Recently I have been trying to familiarize myself with Cassandra but 
don't quite understand when data is removed from disk after it has been 
deleted. The use case I'm particularly interested in is expiring time series data
with DTCS. As an example, I created the following table:
CREATE TABLE metrics (
  metric_id text,
  time timestamp,
  value double,
  PRIMARY KEY (metric_id, time),
) WITH CLUSTERING ORDER BY (time DESC) AND 
 default_time_to_live = 86400 AND
 gc_grace_seconds = 3600 AND
 compaction = {
  'class': 'DateTieredCompactionStrategy',
  'timestamp_resolution':'MICROSECONDS',
  'base_time_seconds':'3600',
  'max_sstable_age_days':'365',
  'min_threshold':'4'
 };
I understand that Cassandra will create a tombstone for all rows inserted into 
this table 24 hours after they are inserted (86400 seconds). These tombstones 
will first be written to an in-memory Memtable and then flushed to disk as an 
SSTable when the Memtable reaches a certain size. My question is when will the 
data that is now expired be removed from disk? Is it the next time the SSTable 
which contains the data gets compacted? So, with DTCS and min_threshold set to 
four, we would wait until at least three other SSTables are in the same time 
window as the expired data, and then those SSTables will be compacted into a 
SSTable without the expired data. Is it only during this compaction that the 
data will be removed? It seems to me that this would require Cassandra to 
maintain some metadata on which rows have been deleted since the newer 
tombstones would likely not be in the older SSTables that are being compacted. 
Also, I'm aware that Cassandra can drop entire SSTables if they contain only 
expired data but I'm unsure of what qualifies as expired data (is it just 
SSTables whose maximum timestamp is past the default TTL for the table?) and 
when such SSTables are dropped.

Alternatively, do the SSTables which contain the tombstones have to be 
compacted with the SSTables which contain the expired data for the data to be 
removed? It seems to me that this could result in Cassandra holding the expired 
data long after it has expired since 

Understanding when Cassandra drops expired time series data

2016-06-17 Thread jerome
Hello! Recently I have been trying to familiarize myself with Cassandra but 
don't quite understand when data is removed from disk after it has been 
deleted. The use case I'm particularly interested in is expiring time series data
with DTCS. As an example, I created the following table:

CREATE TABLE metrics (
  metric_id text,
  time timestamp,
  value double,
  PRIMARY KEY (metric_id, time),
) WITH CLUSTERING ORDER BY (time DESC) AND
 default_time_to_live = 86400 AND
 gc_grace_seconds = 3600 AND
 compaction = {
  'class': 'DateTieredCompactionStrategy',
  'timestamp_resolution':'MICROSECONDS',
  'base_time_seconds':'3600',
  'max_sstable_age_days':'365',
  'min_threshold':'4'
 };


I understand that Cassandra will create a tombstone for all rows inserted into 
this table 24 hours after they are inserted (86400 seconds). These tombstones 
will first be written to an in-memory Memtable and then flushed to disk as an 
SSTable when the Memtable reaches a certain size. My question is when will the 
data that is now expired be removed from disk? Is it the next time the SSTable 
which contains the data gets compacted? So, with DTCS and min_threshold set to 
four, we would wait until at least three other SSTables are in the same time 
window as the expired data, and then those SSTables will be compacted into a 
SSTable without the expired data. Is it only during this compaction that the 
data will be removed? It seems to me that this would require Cassandra to 
maintain some metadata on which rows have been deleted since the newer 
tombstones would likely not be in the older SSTables that are being compacted. 
Also, I'm aware that Cassandra can drop entire SSTables if they contain only 
expired data but I'm unsure of what qualifies as expired data (is it just 
SSTables whose maximum timestamp is past the default TTL for the table?) and 
when such SSTables are dropped.

Alternatively, do the SSTables which contain the tombstones have to be 
compacted with the SSTables which contain the expired data for the data to be 
removed? It seems to me that this could result in Cassandra holding the expired 
data long after it has expired since it's waiting for the new tombstones to be 
compacted with the older expired data.

Finally, I was also unsure when the tombstones themselves are removed. I know 
Cassandra does not delete them until after gc_grace_seconds, but it can't delete 
the tombstones until it's sure the expired data has been deleted, right? 
Otherwise it would see the expired data as being valid. Consequently, it seems 
to me that the question of when tombstones are deleted is intimately tied to 
the questions above.

Thanks in advance! If it helps I've been experimenting with version 2.0.15 
myself.
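
One practical handle on the "when does it actually get purged" question is the 
set of tombstone-related compaction sub-options, which can be tuned per table. A 
hedged example (option names as documented for the 2.0/2.1 era; the values are 
illustrative, not recommendations, and ALTER TABLE replaces the whole compaction 
map, so the existing options are restated):

ALTER TABLE metrics WITH compaction = {
  'class': 'DateTieredCompactionStrategy',
  'timestamp_resolution': 'MICROSECONDS',
  'base_time_seconds': '3600',
  'max_sstable_age_days': '365',
  'min_threshold': '4',
  -- consider a single-sstable tombstone compaction once an sstable's
  -- droppable-tombstone ratio exceeds 0.2
  'tombstone_threshold': '0.2',
  -- but do not re-check the same sstable more than once a day
  'tombstone_compaction_interval': '86400'
};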


RE: StreamCoordinator.ConnectionsPerHost set to 1

2016-06-17 Thread Anubhav Kale
Thanks Paulo. I made some changes along those lines and am seeing a good 
improvement. I will discuss further (with a possible patch) on 
https://issues.apache.org/jira/browse/CASSANDRA-4663 (that ticket is for 
bootstrap, so maybe we can repurpose it for rebuilds or create a separate one).

From: Paulo Motta [mailto:pauloricard...@gmail.com]
Sent: Thursday, June 16, 2016 3:06 PM
To: user@cassandra.apache.org
Subject: Re: StreamCoordinator.ConnectionsPerHost set to 1

Increasing the number of threads alone won't help, because you need to add 
connectionsPerHost-awareness to StreamPlan.requestRanges (otherwise only a 
single connection per host is created), similar to what was done to 
StreamPlan.transferFiles by CASSANDRA-3668, but maybe a bit trickier. There's an 
open ticket to support that: CASSANDRA-4663.
There's also another discussion on improving rebuild parallelism on 
CASSANDRA-12015.

2016-06-16 14:43 GMT-03:00 Anubhav Kale:
Hello,

I noticed that StreamCoordinator.ConnectionsPerHost is always set to 1 
(Cassandra 2.1.13). If I am reading the code correctly, this means there will 
always be just one socket (well, technically 2, one for each direction) between 
nodes when rebuilding, and thus the data will always be streamed serially.

Have folks experimented with increasing this? It appears that some parallelism 
here might help rebuilds in a significant way assuming we aren’t hitting 
bandwidth caps (it’s a pain for us at the moment to rebuild nodes holding 
500GB).

I’m going to try to patch our cluster with a change to test this out, but 
wanted to hear from experts as well.

Thanks !



Re: how to force cassandra-stress to actually generate enough data

2016-06-17 Thread Giampaolo Trapasso
I do not know if it can really help in your situation, but from the NGCC notes 
I discovered the existence of GatlingCql 
(https://github.com/gatling-cql/GatlingCql) as an alternative to 
cassandra-stress. In particular, you can tweak the data generation part a bit.

giampaolo


2016-06-16 10:33 GMT+02:00 Peter Kovgan:

> Thank you, guys.
>
> I will try all proposals.
>
> The limitation, mentioned by Benedict, is huge.
>
> But anyway, there is something to do around…..
>
>
>
> *From:* Peter Kovgan
> *Sent:* Wednesday, June 15, 2016 3:25 PM
> *To:* 'user@cassandra.apache.org'
> *Subject:* how to force cassandra-stress to actually generate enough data
>
>
>
> Hi,
>
>
>
> cassandra-stress is not really helping to populate the disk sufficiently.
>
>
>
> I tried several table structures, providing
>
> cluster: UNIFORM(1..100)  on clustering parts of the PK.
>
>
>
> The partition part of the PK yields about 660 000 partitions.
>
>
>
> The hope was to create enough cells in a row to make the row really WIDE.
>
>
>
> No matter what I tried, and no matter how long it runs, I see at most 2-3
> SSTables per node and at most 300 MB of data per node.
>
>
>
> (I have 6 nodes and a very active stress run with 400 threads)
>
>
>
> It looks like it is impossible to make the row really wide and the disk
> really full.
>
>
>
> Is it intentional?
>
>
>
> I mean, if there was an intention to avoid really wide rows, why is there
> no hint about this in the docs?
>
>
>
> Do you have similar experience, and do you know how to resolve this?
>
>
>
> Thanks.
>
>
>
>
>
>
>
>
>
>
>