Re: Cassandra collection tombstones

2019-01-28 Thread Jeff Jirsa
The issue in 14861 doesn’t manifest itself in the data file (so you won’t see 
it in the sstable json), it’s in the min/max clustering of the metadata used in 
the read path. 

-- 
Jeff Jirsa


> On Jan 28, 2019, at 7:08 AM, Ahmed Eljami  wrote:
> 
> Hi Alain,
> 
> Just to confirm: the range tombstones we are talking about here are not 
> related to this jira: https://issues.apache.org/jira/browse/CASSANDRA-14861 ?
> 
> Thanks a lot.


Re: Cassandra collection tombstones

2019-01-28 Thread Ahmed Eljami
Hi Alain,

Just to confirm: the range tombstones we are talking about here are not
related to this jira: https://issues.apache.org/jira/browse/CASSANDRA-14861 ?

Thanks a lot.


Re: Cassandra collection tombstones

2019-01-28 Thread Alain RODRIGUEZ
Hello,

@Chris, I mostly agree with you. I will try to make clear what I had in
mind, as it was obviously not well expressed.


> it doesn't matter if the tombstone is overlapped, it still needs to be kept
> for the gc_grace before purging, or it can result in data resurrection.


Yes, I agree. I do not recommend lowering the gc_grace without giving it a
thought. I was saying that this was not part of the ratio calculation as
you explained.



> sstablemetadata cannot reliably or safely know the table parameters that
> are not kept in the sstable, so to get an accurate value you have to provide
> a -g or --gc-grace-seconds parameter. I am not sure where the "always
> wrong" comes in, as the quantity of data that's being shadowed is not what
> it's tracking (although it would be more meaningful for single-sstable
> compactions if it did), just when tombstones can be purged.



What I tried to say is that the "estimated droppable tombstone" ratio does
not account for overlaps or gc_grace_seconds. Thus, when the ratio shows
0.7, you have no guarantee after running compactions that this number will
be any lower, and you can be almost sure it will not reach 0. Tombstones
will stay around. In that sense, I said a bit too strongly that this value
is always "wrong".

I was not saying it's easy or even possible to get accurate information. I
was rather warning users that, in practice, the gap between what this
"estimated" ratio suggests and the number of tombstones actually dropped
is often large.

@Ayub

> Firstly I am not seeing any difference when using gc_grace_seconds with
> sstablemetadata.

As we both said (or tried to say), this is expected. Yet, during
compactions, the tombstones will become eligible for eviction (definitive
removal) sooner. You can test it (on a test cluster): the tombstone should
go away with a compaction, but only after 'gc_grace_seconds'.
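The eligibility rule can be sketched in a few lines of Python. This is a toy
`is_purgeable` helper, not Cassandra's actual code; the epoch used below is
the one from the sstablemetadata output later in the thread:

```python
import time

def is_purgeable(local_delete_time, gc_grace_seconds, now=None):
    """A tombstone becomes eligible for purge during compaction only once
    gc_grace_seconds have elapsed since its local deletion time (epoch s)."""
    if now is None:
        now = time.time()
    return local_delete_time + gc_grace_seconds <= now

# Tombstone dropped at epoch 1548577440, gc_grace_seconds = 86400 (one day):
print(is_purgeable(1548577440, 86400, now=1548577440 + 86401))  # True
print(is_purgeable(1548577440, 86400, now=1548577440 + 100))    # False
```

Note this is only the time-based gate; overlaps (discussed below) can still
keep a purgeable tombstone alive.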

To work with this in prod, you should first be sure that it won't harm the
cluster...
About gc_grace_seconds, remember that:
- gc_grace_seconds must stay greater than the repair interval (full repair,
whole cluster) if you are performing deletes. Which I think might be your
case if you're inserting collections on top of existing ones (instead of
updating specific keys in them). This, as Chris said, can lead to
inconsistencies. TTLs are OK (no repair needed - no more than for other
columns without TTLs).
- gc_grace_seconds impacts the hints TTL; Radovan wrote about this exact
topic here:
http://thelastpickle.com/blog/2018/03/21/hinted-handoff-gc-grace-demystified.html

This is a risky path to follow if you are not sure about the impacts.

> Second question: for this test table I see the original record which I
> inserted got its tombstone removed by autocompaction, which ran today as
> its gc_grace_seconds is set to one day. But I see some tables whose
> gc_grace_seconds is set to 3 days; when I do sstabledump on them I see
> tombstone entries in the json output, and sstablemetadata also shows them
> as estimated tombstone records. I see autocompaction is running on the
> sstables of this table and I also manually ran it using jmx shell, but
> they are still there... any reason why they are not getting deleted?


As we also mentioned earlier, there are some conditions that can prevent a
tombstone from actually being dropped, even if it's part of a compaction
after gc_grace_seconds.
In particular, if the tombstone is 'covering'/'shadowing' previously
existing data that still exists in some other sstable(s) not part of the
compaction, then Cassandra cannot safely remove the tombstone, as the
latest still-existing value for those cells would reappear. That's what we
call 'overlaps', and it can prevent tombstone eviction.
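A toy model of that overlap check (all names here are hypothetical; real
Cassandra compares per-sstable key ranges and timestamps, this sketch only
checks partition membership):

```python
def can_drop_tombstone(tombstone_partition, compacting_sstables, all_sstables):
    """Sketch: a tombstone is only safe to drop if no sstable OUTSIDE the
    compaction could still hold shadowed data for the same partition."""
    outside = [s for s in all_sstables if s not in compacting_sstables]
    return not any(tombstone_partition in s["partitions"] for s in outside)

sstables = [
    {"name": "mc-1", "partitions": {"3", "5"}},   # holds old shadowed data
    {"name": "mc-2", "partitions": {"5"}},        # holds the tombstone
]
# Compacting only mc-2: partition "5" overlaps with mc-1, tombstone kept.
print(can_drop_tombstone("5", [sstables[1]], sstables))  # False
# Compacting both sstables: no overlap remains, tombstone can be purged.
print(can_drop_tombstone("5", sstables, sstables))       # True
```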

There are some possibilities:
- Major compaction: to make it short, we almost never recommend that,
unless you know what you're doing, because it will lead to one big sstable
that would no longer be automatically compacted for a long while, making
things quickly worse after a short-term improvement.
- Compact using JMX, but selecting all the needed sstables. Depending on
your use case / compaction strategy, this can quickly lead either to the
first point above (as at some point all the sstables might be involved) or
to a complex strategy to handle compactions.
- Use a different compaction strategy to ease the tombstone removal.
- If the tombstones are not an issue in the read path (thus not creating
latency), nor for disk space, then maybe just ignore them. The tombstones
are there by design (
https://jsravn.com/2015/05/13/cassandra-tombstones-collections/#lists) when
inserting collection objects.

But if you're not performing deletes, and the tombstones appear well
before the TTL time, I would say it's an issue with the insert of the
'map', and I would rather suggest focusing on changing the queries/model to
fix the massive tombstone creation in the first place; that will probably
make things way nicer/easier.
The person in the post above used a json as 
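Using the thread's ks.nmtest table, the difference between a full-collection
write and per-key updates can be illustrated with the CQL below (held as
Python strings here, as a sketch; these were not run against a cluster):

```python
# Full overwrite: replaces the whole map, so Cassandra first writes a
# range tombstone covering the old collection, then the new values.
overwrite = """
UPDATE ks.nmtest SET order_details = {'key': 'value'}
WHERE reservation_id = '5' AND order_id = '5';
"""

# Per-key update: only upserts the given entry; no range tombstone written.
append = """
UPDATE ks.nmtest SET order_details['key'] = 'value'
WHERE reservation_id = '5' AND order_id = '5';
"""

# Deleting a single key writes one cell tombstone, but still not a range
# tombstone over the whole collection.
delete_key = """
DELETE order_details['key'] FROM ks.nmtest
WHERE reservation_id = '5' AND order_id = '5';
"""
```

Moving from the first form to the second is the model change suggested
above.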

Re: Cassandra collection tombstones

2019-01-27 Thread Ayub M
Thanks Alain/Chris.

Firstly I am not seeing any difference when using gc_grace_seconds with
sstablemetadata.

CREATE TABLE ks.nmtest (
reservation_id text,
order_id text,
c1 int,
order_details map<text, text>,
PRIMARY KEY (reservation_id, order_id)
) WITH CLUSTERING ORDER BY (order_id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 86400
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';

[root@ip-xxx-xxx-xxx-xxx nmtest-e1302500201d11e983bb693c02c04c62]#
sstabledump mc-11-big-Data.db
WARN  08:27:32,793 memtable_cleanup_threshold has been deprecated and
should be removed from cassandra.yaml
[
  {
"partition" : {
  "key" : [ "4" ],
  "position" : 0
},
"rows" : [
  {
"type" : "row",
"position" : 40,
"clustering" : [ "4" ],
"cells" : [
  { "name" : "order_details", "path" : [ "key1" ], "value" :
"value1", "tstamp" : "2019-01-27T08:26:49.633240Z" }
]
  }
]
  },
  {
"partition" : {
  "key" : [ "5" ],
  "position" : 41
},
"rows" : [
  {
"type" : "row",
"position" : 82,
"clustering" : [ "5" ],
"liveness_info" : { "tstamp" : "2019-01-27T08:23:29.782506Z" },
"cells" : [
  { "name" : "c1", "value" : 5 },
  { "name" : "order_details", "deletion_info" : { "marked_deleted"
: "2019-01-27T08:23:29.782505Z", "local_delete_time" :
"2019-01-27T08:23:29Z" } },
  { "name" : "order_details", "path" : [ "key" ], "value" : "value"
}
]
  }
]
  }

Partition 5 is a newly inserted record; no matter what gc_grace_seconds
value I pass, it still shows this record as an estimated tombstone.

[root@ip-xxx-xxx-xxx-xxx nmtest-e1302500201d11e983bb693c02c04c62]#
sstablemetadata mc-11-big-Data.db | grep "Estimated tombstone drop times"
-A3
Estimated tombstone drop times:
1548577440: 1
Count   Row Size   Cell Count

[root@ip-xxx-xxx-xxx-xxx nmtest-e1302500201d11e983bb693c02c04c62]#
sstablemetadata  --gc_grace_seconds 86400 mc-11-big-Data.db | grep
"Estimated tombstone drop times" -A4
Estimated tombstone drop times:
1548577440: 1
Count   Row Size   Cell Count

Second question: for this test table I see the original record which I
inserted got its tombstone removed by autocompaction, which ran today as
its gc_grace_seconds is set to one day. But I see some tables whose
gc_grace_seconds is set to 3 days; when I do sstabledump on them I see
tombstone entries in the json output, and sstablemetadata also shows them
as estimated tombstone records. I see autocompaction is running on the
sstables of this table and I also manually ran it using jmx shell, but
they are still there... any reason why they are not getting deleted?

sstablemetadata  --gc_grace_seconds 259200 mc-732-big-Data.db | grep
"Estimated tombstone drop times" -A10

WARN  07:28:03,086 memtable_cleanup_threshold has been deprecated and
should be removed from cassandra.yaml

Estimated tombstone drop times:
1537475340: 7
1537476150: 14
1537476360: 7
1537476660: 7


One record from the json file having old tombstone markers:

  {
    "partition" : {
      "key" : [ "2945132807" ],
      "position" : 9036596
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 9037781,
        "clustering" : [ "2018-08-15 00:00:00.000Z", "233359" ],
        "liveness_info" : { "tstamp" : "2018-09-26T14:52:54.255395Z" },
        "cells" : [
          ...
          { "name" : "col1", "deletion_info" : { "marked_deleted" : "2018-09-26T14:52:54.255394Z", "local_delete_time" : "2018-09-26T14:52:54Z" } },
          { "name" : "col1", "path" : [ "zczxc" ], "value" : "ZXczx" },
          { "name" : "col1", "path" : [ "ZXCxzc" ], "value" : "ZCzxc" },
          { "name" : "col2", "deletion_info" : { "marked_deleted" : "2018-09-26T14:52:54.255394Z", "local_delete_time" : "2018-09-26T14:52:54Z" } },
          { "name" : "col2", "path" : [ "zcxxc" ], "value" : false },
          { "name" : "col2", "path" : [ "hjhkjh" ], "value" : false },
          { "name" : "col2", "path" : [ "LEGACY" ], "value" : true },
          { "name" : "col2", "path" : [ "NON_SITE_SPECIFIC" ], "value" : true },
          { "name" : "issuance_data", "deletion_info" : { "marked_deleted" : "2018-09-26T14:52:54.255394Z", "local_delete_time" :

Re: Cassandra collection tombstones

2019-01-25 Thread Chris Lohfink
>  The "estimated droppable tombstone" value is actually always wrong. Because 
> it's an estimate that does not consider overlaps (and I'm not sure about the 
> fact it considers the gc_grace_seconds either).

It considers the time the tombstone was created and the gc_grace_seconds, it 
doesn't matter if the tombstone is overlapped it still need to be kept for the 
gc_grace before purging or it can result in data resurrection. sstablemetadata 
cannot reliably or safely know the table parameters that are not kept in the 
sstable so to get an accurate value you have to provide a -g or 
--gc-grace-seconds parameter. I am not sure where the "always wrong" comes in 
as the quantity of data thats being shadowed is not what its tracking (although 
it would be more meaningful for single sstable compactions if it did), just 
when tombstones can be purged.
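A sketch of how such a droppable estimate can be derived from the "Estimated
tombstone drop times" histogram and a gc_grace value. This is illustrative
only (the function name and shape are assumptions); Cassandra's real
computation differs in detail:

```python
def droppable_ratio(drop_time_histogram, total_cells, gc_grace_seconds, now):
    """Count only tombstones whose drop time is past gc_grace, then divide
    by the total cell count to get an estimated droppable ratio."""
    droppable = sum(count for drop_time, count in drop_time_histogram.items()
                    if drop_time + gc_grace_seconds <= now)
    return droppable / total_cells if total_cells else 0.0

# The thread's first example: one tombstone among two cells gives the
# reported 0.5 once gc_grace has elapsed, and 0.0 before that.
hist = {1548384720: 1}
print(droppable_ratio(hist, 2, 0, now=1548384721))      # 0.5
print(droppable_ratio(hist, 2, 86400, now=1548384721))  # 0.0
```

This also shows why passing -g/--gc-grace-seconds matters: the histogram
itself does not change, only which buckets count as droppable "now".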

Chris


> On Jan 25, 2019, at 8:11 AM, Alain RODRIGUEZ  wrote:
> 
> Hello, 
> 
> I think you might be inserting on top of an existing collection; implicitly, 
> Cassandra creates a range tombstone. Cassandra does not update/delete data in 
> place, it always inserts (data or tombstone). Then eventually compaction 
> merges the data and evicts the tombstones. Thus, when overwriting an entire 
> collection, Cassandra performs a delete first under the hood.
> 
> I wrote about this, in this post about 2 years ago, in the middle of this 
> (long) article: 
> http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html 
> 
> 
> Here is the part that might be of interest in your case:
> 
> "Note: When using collections, range tombstones will be generated by INSERT 
> and UPDATE operations every time you are using an entire collection, and not 
> updating parts of it. Inserting a collection over an existing collection, 
> rather than appending it or updating only an item in it, leads to range 
> tombstones insert followed by the insert of the new values for the 
> collection. This DELETE operation is hidden leading to some weird and 
> frustrating tombstones issues."
> 
> and
> 
> "From the mailing list I found out that James Ravn posted about this topic 
> using list example, but it is true for all the collections, so I won’t go 
> through more details, I just wanted to point this out as it can be 
> surprising, see: 
> http://www.jsravn.com/2015/05/13/cassandra-tombstones-collections.html#lists 
> "
> 
> Thus to specifically answer your questions:
> 
>  Does this tombstone ever get removed?
> 
> Yes, after gc_grace_seconds (table option) has passed AND if the data that is 
> shadowed by the tombstone is also part of the same compaction (all the 
> previous shards need to be there, if I remember correctly). So yes, but 
> eventually; not immediately nor any time soon (10+ days by default). 
>  
>  Also when I run sstablemetadata on the only sstable, it shows "Estimated 
> droppable tombstones" as 0.5. Similarly, it shows one record with epoch time 
> as insert time: "Estimated tombstone drop times: 1548384720: 1". Does it 
> mean that when I do sstablemetadata on a table having collections, the 
> estimated droppable tombstone ratio and drop times values are not true and 
> dependable values, due to collection/list range tombstones?
> 
> I do not remember this precisely, but you can check the code; it's worth 
> having a look. The "estimated droppable tombstone" value is actually always 
> wrong, because it's an estimate that does not consider overlaps (and I'm not 
> sure it considers gc_grace_seconds either), but also because the calculation 
> does not count a certain type of tombstones, and the weight of range 
> tombstones compared to tombstone cells makes the count quite inaccurate: 
> http://thelastpickle.com/blog/2018/07/05/undetectable-tombstones-in-apache-cassandra.html
> 
> I think this evolved since I looked at it and I might not remember well, but 
> this value is definitely not accurate. 
> 
> If you're re-inserting a collection for a given existing partition often, 
> there is probably plenty of tombstones sitting around though, that's almost 
> guaranteed.
> 
> Does tombstone_threshold of compaction depend on the sstablemetadata 
> threshold value? If so, then for tables having collections, this is not a 
> true threshold, right?
> 
> Yes, I believe the tombstone threshold actually uses the "estimated droppable 
> tombstone" value to choose whether or not to trigger a 
> "single-SSTable"/"tombstone" compaction. Yet, in your case, this will not 
> clean the tombstones in the first 10 days at least (gc_grace_seconds default 
> value). Compactions do not keep triggering because there is a minimum 
> interval defined between 2 tombstone compactions of an SSTable (1 day by 
> default). This 

Re: Cassandra collection tombstones

2019-01-25 Thread Alain RODRIGUEZ
Hello,

I think you might be inserting on top of an existing collection;
implicitly, Cassandra creates a range tombstone. Cassandra does not
update/delete data in place, it always inserts (data or tombstone). Then
eventually compaction merges the data and evicts the tombstones. Thus, when
overwriting an entire collection, Cassandra performs a delete first under
the hood.

I wrote about this, in this post about 2 years ago, in the middle of this
(long) article:
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html

Here is the part that might be of interest in your case:

"Note: When using collections, range tombstones will be generated by INSERT
and UPDATE operations every time you are using an entire collection, and
not updating parts of it. Inserting a collection over an existing
collection, rather than appending it or updating only an item in it, leads
to range tombstones insert followed by the insert of the new values for the
collection. This DELETE operation is hidden leading to some weird and
frustrating tombstones issues."

and

"From the mailing list I found out that James Ravn posted about this topic
using list example, but it is true for all the collections, so I won’t go
through more details, I just wanted to point this out as it can be
surprising, see:
http://www.jsravn.com/2015/05/13/cassandra-tombstones-collections.html#lists
"

Thus to specifically answer your questions:

> Does this tombstone ever get removed?


Yes, after gc_grace_seconds (table option) has passed AND if the data that
is shadowed by the tombstone is also part of the same compaction (all the
previous shards need to be there, if I remember correctly). So yes, but
eventually; not immediately nor any time soon (10+ days by default).


> Also when I run sstablemetadata on the only sstable, it shows "Estimated
> droppable tombstones" as 0.5. Similarly, it shows one record with epoch
> time as insert time: "Estimated tombstone drop times: 1548384720: 1".
> Does it mean that when I do sstablemetadata on a table having collections,
> the estimated droppable tombstone ratio and drop times values are not true
> and dependable values, due to collection/list range tombstones?


I do not remember this precisely, but you can check the code; it's worth
having a look. The "estimated droppable tombstone" value is actually always
wrong, because it's an estimate that does not consider overlaps (and I'm
not sure it considers gc_grace_seconds either), but also because the
calculation does not count a certain type of tombstones, and the weight of
range tombstones compared to tombstone cells makes the count quite
inaccurate:
http://thelastpickle.com/blog/2018/07/05/undetectable-tombstones-in-apache-cassandra.html

I think this evolved since I looked at it and I might not remember well,
but this value is definitely not accurate.

If you're re-inserting a collection for a given existing partition often,
there is probably plenty of tombstones sitting around though, that's almost
guaranteed.

> Does tombstone_threshold of compaction depend on the sstablemetadata
> threshold value? If so, then for tables having collections, this is not a
> true threshold, right?

Yes, I believe the tombstone threshold actually uses the "estimated
droppable tombstone" value to choose whether or not to trigger a
"single-SSTable"/"tombstone" compaction. Yet, in your case, this will not
clean the tombstones in the first 10 days at least (gc_grace_seconds
default value). Compactions do not keep triggering because there is a
minimum interval defined between 2 tombstone compactions of an SSTable (1
day by default). This setting is most probably keeping you away from a
useless compaction loop; I would not try to change it. Collection or not,
it does not change how the compaction strategy operates.
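The two gates described above can be sketched as follows. The option names
tombstone_threshold and tombstone_compaction_interval are real compaction
sub-properties; the decision logic here is a simplified sketch, not
Cassandra's actual code:

```python
def should_run_tombstone_compaction(droppable_ratio, tombstone_threshold,
                                    last_tombstone_compaction, now,
                                    tombstone_compaction_interval=86400):
    """A single-SSTable 'tombstone' compaction requires BOTH: the estimated
    droppable ratio exceeds tombstone_threshold, AND the sstable was not
    tombstone-compacted within the interval (1 day by default), which is
    what prevents a useless compaction loop."""
    if droppable_ratio <= tombstone_threshold:
        return False
    return now - last_tombstone_compaction >= tombstone_compaction_interval

print(should_run_tombstone_compaction(0.5, 0.2,
                                      last_tombstone_compaction=0,
                                      now=90000))   # True
print(should_run_tombstone_compaction(0.5, 0.2,
                                      last_tombstone_compaction=50000,
                                      now=90000))   # False: within 1 day
```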

I faced this in the past. Operationally you can have things working, but
it's hard and really pointless (it was in my case at least). I would
definitely recommend changing the model to update parts of the map and
never rewrite a map.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

On Fri, Jan 25, 2019 at 05:29, Ayub M  wrote:

> I have created a table with a collection, inserted a record, and took an
> sstabledump of it, and I see there is a range tombstone for it in the
> sstable. Does this tombstone ever get removed? Also, when I run
> sstablemetadata on the only sstable, it shows "Estimated droppable
> tombstones" as 0.5. Similarly, it shows one record with epoch time as
> insert time: "Estimated tombstone drop times: 1548384720: 1". Does it
> mean that when I do sstablemetadata on a table having collections, the
> estimated droppable tombstone ratio and drop times values are not true and
> dependable values, due to collection/list range tombstones?
> CREATE TABLE ks.nmtest (
> reservation_id text,
> order_id text,
> 

Cassandra collection tombstones

2019-01-24 Thread Ayub M
I have created a table with a collection, inserted a record, and took an
sstabledump of it, and I see there is a range tombstone for it in the
sstable. Does this tombstone ever get removed? Also, when I run
sstablemetadata on the only sstable, it shows "Estimated droppable
tombstones" as 0.5. Similarly, it shows one record with epoch time as
insert time: "Estimated tombstone drop times: 1548384720: 1". Does it
mean that when I do sstablemetadata on a table having collections, the
estimated droppable tombstone ratio and drop times values are not true and
dependable values, due to collection/list range tombstones?
CREATE TABLE ks.nmtest (
reservation_id text,
order_id text,
c1 int,
order_details map<text, text>,
PRIMARY KEY (reservation_id, order_id)
) WITH CLUSTERING ORDER BY (order_id ASC)

user@cqlsh:ks> insert into nmtest (reservation_id , order_id , c1,
order_details ) values('3','3',3,{'key':'value'});
user@cqlsh:ks> select * from nmtest ;
 reservation_id | order_id | c1 | order_details
+--++--
  3 |3 |  3 | {'key': 'value'}
(1 rows)

[root@localhost nmtest-e1302500201d11e983bb693c02c04c62]# sstabledump
mc-5-big-Data.db
WARN  02:52:19,596 memtable_cleanup_threshold has been deprecated and
should be removed from cassandra.yaml
[
  {
"partition" : {
  "key" : [ "3" ],
  "position" : 0
},
"rows" : [
  {
"type" : "row",
"position" : 41,
"clustering" : [ "3" ],
"liveness_info" : { "tstamp" : "2019-01-25T02:51:13.574409Z" },
"cells" : [
  { "name" : "c1", "value" : 3 },
  { "name" : "order_details", "deletion_info" : { "marked_deleted"
: "2019-01-25T02:51:13.574408Z", "local_delete_time" :
"2019-01-25T02:51:13Z" } },
  { "name" : "order_details", "path" : [ "key" ], "value" : "value"
}
]
  }
]
  }
SSTable: /data/data/ks/nmtest-e1302500201d11e983bb693c02c04c62/mc-5-big
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Bloom Filter FP chance: 0.01
Minimum timestamp: 1548384673574408
Maximum timestamp: 1548384673574409
SSTable min local deletion time: 1548384673
SSTable max local deletion time: 2147483647
Compressor: org.apache.cassandra.io.compress.LZ4Compressor
Compression ratio: 1.0714285714285714
TTL min: 0
TTL max: 0
First token: -155496620801056360 (key=3)
Last token: -155496620801056360 (key=3)
minClustringValues: [3]
maxClustringValues: [3]
Estimated droppable tombstones: 0.5
SSTable Level: 0
Repaired at: 0
Replay positions covered: {CommitLogPosition(segmentId=1548382769966,
position=6243201)=CommitLogPosition(segmentId=1548382769966,
position=6433666)}
totalColumnsSet: 2
totalRows: 1
Estimated tombstone drop times:
1548384720: 1
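For reference, the histogram key is a Unix epoch. Converting it (a quick
check, nothing Cassandra-specific) shows it lands within a minute of the
cell's local_delete_time from the sstabledump above, consistent with the
histogram's coarse bucketing:

```python
from datetime import datetime, timezone

# Histogram bucket from sstablemetadata and local deletion time from
# sstabledump ("2019-01-25T02:51:13Z"), both Unix epochs in seconds:
drop_bucket = 1548384720
local_delete_time = 1548384673

print(datetime.fromtimestamp(drop_bucket, tz=timezone.utc).isoformat())
# 2019-01-25T02:52:00+00:00
print(drop_bucket - local_delete_time)  # 47 seconds apart
```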

Does tombstone_threshold of compaction depend on the sstablemetadata
threshold value? If so, then for tables having collections, this is not a
true threshold, right?