Hello,

I think you might be inserting on top of an existing collection; when you
do, Cassandra implicitly creates a range tombstone. Cassandra never
updates/deletes data in place, it always inserts (data or a tombstone).
Eventually, compaction merges the data and evicts the tombstones. Thus,
when overwriting an entire collection, Cassandra performs a delete first
under the hood.
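For example, with a map column like the order_details one in your table
below, both of the following statements rewrite the entire collection, so
each of them writes a range tombstone before the new values:

```sql
-- Full-collection INSERT: a range tombstone shadows any previous
-- content of the map before the new entries are written.
INSERT INTO ks.nmtest (reservation_id, order_id, c1, order_details)
VALUES ('3', '3', 3, {'key': 'value'});

-- Full-collection UPDATE: same thing, the whole map is replaced.
UPDATE ks.nmtest SET order_details = {'key': 'value'}
WHERE reservation_id = '3' AND order_id = '3';
```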

I wrote about this about 2 years ago, in the middle of this (long)
article:
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html

Here is the part that might be of interest in your case:

"Note: When using collections, range tombstones will be generated by INSERT
and UPDATE operations every time you are using an entire collection, and
not updating parts of it. Inserting a collection over an existing
collection, rather than appending it or updating only an item in it, leads
to range tombstones insert followed by the insert of the new values for the
collection. This DELETE operation is hidden leading to some weird and
frustrating tombstones issues."

and

"From the mailing list I found out that James Ravn posted about this topic
using list example, but it is true for all the collections, so I won’t go
through more details, I just wanted to point this out as it can be
surprising, see:
http://www.jsravn.com/2015/05/13/cassandra-tombstones-collections.html#lists
"

Thus, to specifically answer your questions:

> Does this tombstone ever get removed?
Yes, after gc_grace_seconds (a table option) has elapsed AND if all the
data shadowed by the tombstone is part of the same compaction (all the
overlapping SSTables need to be involved, if I remember correctly). So
yes, eventually, but not immediately nor any time soon (10+ days by
default).
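To illustrate, gc_grace_seconds is a per-table setting (864000 seconds =
10 days by default). You could lower it on this table, but be careful: it
also protects you from deleted data reappearing if a node misses the
delete and is not repaired in time, so I would not touch it lightly:

```sql
-- Default value shown; lowering it below your repair interval is risky.
ALTER TABLE ks.nmtest WITH gc_grace_seconds = 864000;
```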


> Also when I run sstablemetadata on the only sstable, it shows "Estimated
> droppable tombstones" as 0.5", Similarly it shows one record with epoch
> time as insert time for - "Estimated tombstone drop times: 1548384720: 1".
> Does it mean that when I do sstablemetadata on a table having collections,
> the estimated droppable tombstone ratio and drop times values are not true
> and dependable values due to collection/list range tombstones?


I do not remember this precisely, but it's worth checking the code. The
"Estimated droppable tombstones" value is actually always off: it is an
estimate that does not consider overlaps between SSTables (and I'm not
sure it considers gc_grace_seconds either). Also, the calculation does not
count a certain type of tombstone, and the weight of range tombstones
compared to tombstone cells makes the count quite inaccurate:
http://thelastpickle.com/blog/2018/07/05/undetectable-tombstones-in-apache-cassandra.html

Things may have evolved since I last looked at this, and I might not
remember it all well, but this value is definitely not accurate.

That said, if you are often re-inserting a collection for a given
existing partition, there are almost certainly plenty of tombstones
sitting around.

> Does tombstone_threshold of compaction depend on the sstablemetadata
> threshold value? If so then for tables having collections, this is not a
> true threshold right?
>

Yes, I believe the tombstone threshold uses the "Estimated droppable
tombstones" value to decide whether or not to trigger a
"single-SSTable"/"tombstone" compaction. Yet, in your case, this will not
clean the tombstones for at least the first 10 days (the gc_grace_seconds
default). Compactions do not keep re-triggering because there is a minimum
interval between two tombstone compactions of the same SSTable (1 day by
default). This setting most probably keeps you away from a useless
compaction loop, so I would not try to change it. Collections or not, the
compaction strategy operates the same way.
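For reference, these are the compaction sub-properties involved, shown
here with what I recall are the default values for STCS (again, I would
leave them alone in your case):

```sql
ALTER TABLE ks.nmtest WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    -- Trigger a single-SSTable compaction when the estimated droppable
    -- tombstone ratio exceeds this value:
    'tombstone_threshold': 0.2,
    -- Minimum interval between two tombstone compactions of the same
    -- SSTable (1 day):
    'tombstone_compaction_interval': 86400
};
```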

I faced this in the past. Operationally you can make things work, but
it's hard and really pointless (it was in my case, at least). I would
definitely recommend changing the model to update parts of the map and
never rewrite the whole map.
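Concretely, that means using element updates or the append syntax on the
map instead of re-inserting it whole; neither of these generates a range
tombstone:

```sql
-- Update (or add) a single entry of the map:
UPDATE ks.nmtest SET order_details['key'] = 'new-value'
WHERE reservation_id = '3' AND order_id = '3';

-- Append entries to the existing map:
UPDATE ks.nmtest SET order_details = order_details + {'other': 'value'}
WHERE reservation_id = '3' AND order_id = '3';
```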

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

On Fri, Jan 25, 2019 at 05:29, Ayub M <hia...@gmail.com> wrote:

> I have created a table with a collection. Inserted a record and took
> sstabledump of it and seeing there is range tombstone for it in the
> sstable. Does this tombstone ever get removed? Also when I run
> sstablemetadata on the only sstable, it shows "Estimated droppable
> tombstones" as 0.5", Similarly it shows one record with epoch time as
> insert time for - "Estimated tombstone drop times: 1548384720: 1". Does it
> mean that when I do sstablemetadata on a table having collections, the
> estimated droppable tombstone ratio and drop times values are not true and
> dependable values due to collection/list range tombstones?
> CREATE TABLE ks.nmtest (
>     reservation_id text,
>     order_id text,
>     c1 int,
>     order_details map<text, text>,
>     PRIMARY KEY (reservation_id, order_id)
> ) WITH CLUSTERING ORDER BY (order_id ASC)
>
> user@cqlsh:ks> insert into nmtest (reservation_id , order_id , c1,
> order_details ) values('3','3',3,{'key':'value'});
> user@cqlsh:ks> select * from nmtest ;
>  reservation_id | order_id | c1 | order_details
> ----------------+----------+----+------------------
>               3 |        3 |  3 | {'key': 'value'}
> (1 rows)
>
> [root@localhost nmtest-e1302500201d11e983bb693c02c04c62]# sstabledump
> mc-5-big-Data.db
> WARN  02:52:19,596 memtable_cleanup_threshold has been deprecated and
> should be removed from cassandra.yaml
> [
>   {
>     "partition" : {
>       "key" : [ "3" ],
>       "position" : 0
>     },
>     "rows" : [
>       {
>         "type" : "row",
>         "position" : 41,
>         "clustering" : [ "3" ],
>         "liveness_info" : { "tstamp" : "2019-01-25T02:51:13.574409Z" },
>         "cells" : [
>           { "name" : "c1", "value" : 3 },
>           { "name" : "order_details", "deletion_info" : { "marked_deleted"
> : "2019-01-25T02:51:13.574408Z", "local_delete_time" :
> "2019-01-25T02:51:13Z" } },
>           { "name" : "order_details", "path" : [ "key" ], "value" :
> "value" }
>         ]
>       }
>     ]
>   }
> SSTable: /data/data/ks/nmtest-e1302500201d11e983bb693c02c04c62/mc-5-big
> Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
> Bloom Filter FP chance: 0.010000
> Minimum timestamp: 1548384673574408
> Maximum timestamp: 1548384673574409
> SSTable min local deletion time: 1548384673
> SSTable max local deletion time: 2147483647
> Compressor: org.apache.cassandra.io.compress.LZ4Compressor
> Compression ratio: 1.0714285714285714
> TTL min: 0
> TTL max: 0
> First token: -155496620801056360 (key=3)
> Last token: -155496620801056360 (key=3)
> minClustringValues: [3]
> maxClustringValues: [3]
> Estimated droppable tombstones: 0.5
> SSTable Level: 0
> Repaired at: 0
> Replay positions covered: {CommitLogPosition(segmentId=1548382769966,
> position=6243201)=CommitLogPosition(segmentId=1548382769966,
> position=6433666)}
> totalColumnsSet: 2
> totalRows: 1
> Estimated tombstone drop times:
> 1548384720:         1
>
> Does tombstone_threshold of compaction depend on the sstablemetadata
> threshold value? If so then for tables having collections, this is not a
> true threshold right?
>
