In 2.2 and earlier, CQL row deletes appear as range tombstones.

All range tombstones for a CQL partition get read into a list (I think it’s
literally RangeTombstoneList.java, but I’m not at a computer to check) at the
start of a read command and held in memory to reconcile the read.

Because they’re not “normal” tombstones in 2.1/2.2/etc, a few weird things are 
true:

- They’re read anytime you touch the partition, even if you have a data model 
where the read command uses clustering order to avoid reading them
- They’re not counted in the tombstone overwhelming exception, so if you have a
million row deletes but set your tombstone threshold to 10000, the reads will
succeed anyway, but you’ll have massive GC because the range tombstone list
object is going to be very large
- There’s a timing quirk in 2.1/2.2 where the read timeout timer doesn’t apply
to the range tombstones - I’ve seen heaps where a read command spent minutes
reading tombstones from a slow disk (+ pausing for GC).

That combo makes row deletes especially painful in 2.1.
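
To make the 2.x behavior above concrete, here is a toy Python model. It is not
Cassandra's read path: LegacyPartition, legacy_read, and
TombstoneOverwhelmingError are made-up names, and the numbers are scaled down.
It only mirrors the two properties described above: every range tombstone in
the partition is loaded no matter which slice is requested, and only cell
tombstones count toward the failure threshold.

```python
from dataclasses import dataclass, field


class TombstoneOverwhelmingError(Exception):
    """Hypothetical stand-in for the tombstone overwhelming exception."""


@dataclass
class LegacyPartition:
    """Toy pre-3.0 partition; not one of Cassandra's real classes."""
    live_rows: list = field(default_factory=list)         # clustering keys with live data
    range_tombstones: list = field(default_factory=list)  # one (start, end) per row delete
    cell_tombstones: list = field(default_factory=list)   # clustering keys of cell deletes


def legacy_read(partition, start, end, failure_threshold):
    # Assumption taken from the description: all range tombstones for the
    # partition are materialized up front, whatever slice is requested.
    loaded = list(partition.range_tombstones)
    # Assumption taken from the description: only cell tombstones inside the
    # slice are counted against the threshold; range tombstones are not.
    counted = sum(1 for key in partition.cell_tombstones if start <= key < end)
    if counted > failure_threshold:
        raise TombstoneOverwhelmingError(counted)
    rows = [key for key in partition.live_rows if start <= key < end]
    return rows, len(loaded), counted


# 100,000 row deletes at the old end of the partition, 10 live rows at the new end.
partition = LegacyPartition(
    live_rows=list(range(100_000, 100_010)),
    range_tombstones=[(key, key) for key in range(100_000)],
)
rows, loaded, counted = legacy_read(partition, 100_000, 100_010, failure_threshold=10_000)
print(len(rows), loaded, counted)  # prints: 10 100000 0
```

The read only asks for the 10 live rows, yet it holds all 100,000 range
tombstones in memory and never trips the threshold of 10,000.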

In 3.0, they’re just point deletes - they’re only read when they’re in the
middle of data you have to read (so smart clustering and ORDER BY can avoid
reading them), they’re not all materialized into memory, and they’re counted
properly (at least in 3.11).
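
A toy Python sketch of that 3.0+ behavior, under the same caveat that it is not
Cassandra's storage engine: modern_read and TombstoneOverwhelmingError are
hypothetical names, and a partition is reduced to a sorted list of
(clustering key, is_tombstone) pairs. It only shows that a slice read touches
the row deletes that fall inside the slice, and that those are counted against
the threshold.

```python
import bisect


class TombstoneOverwhelmingError(Exception):
    """Hypothetical stand-in for the tombstone overwhelming exception."""


def modern_read(entries, start, end, failure_threshold):
    """entries: list of (clustering_key, is_tombstone), sorted by clustering key."""
    keys = [key for key, _ in entries]
    # Seek straight to the requested slice; entries outside it are never visited.
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    rows, tombstones = [], 0
    for key, is_tombstone in entries[lo:hi]:
        if is_tombstone:
            # Row deletes inside the slice count toward the threshold.
            tombstones += 1
            if tombstones > failure_threshold:
                raise TombstoneOverwhelmingError(tombstones)
        else:
            rows.append(key)
    return rows, tombstones


# 100,000 deleted rows followed by 10 live rows, in clustering order.
entries = [(key, True) for key in range(100_000)]
entries += [(key, False) for key in range(100_000, 100_010)]

# A slice that starts past the deleted rows touches no tombstones at all.
rows, touched = modern_read(entries, 100_000, 100_010, failure_threshold=10_000)
print(len(rows), touched)  # prints: 10 0

# A scan over the whole partition hits the threshold and fails.
try:
    modern_read(entries, 0, 100_010, failure_threshold=10_000)
    wide_scan_failed = False
except TombstoneOverwhelmingError:
    wide_scan_failed = True
print(wide_scan_failed)  # prints: True
```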

TLDR: it’s less likely you’ll be surprised by the painful behavior of row 
deletes in 3.0+ than you may have been in 2.2 and older


> On May 24, 2020, at 12:13 PM, Tobias Eriksson <tobias.eriks...@qvantel.com> 
> wrote:
> 
> 
> Hi Jeff
> Could you elaborate on the statement that you made :
> “CQL Row level tombstones don’t matter in cassandra 3+ - they’re just point 
> deletes after the storage engine rewrite.”
> Are you saying that a row-level delete is not like other tombstones? If so,
> how are they different?
> I tried to google but did not get any good results
> -Tobias
>  
>  
> From: Jeff Jirsa <jji...@gmail.com>
> Reply to: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Saturday, 23 May 2020 at 19:23
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Cassandra Delete vs Update
>  
> 
> Using cassandra as a queue is possible if you really really understand the 
> data model, but most people will do it wrong the first few times 
>  
> Cap your partition size. The times I’ve seen this done were near 10mb 
> partitions and used a special hook into internals to track partition size via 
> index offsets so they knew when to switch to the next partition.
> Don’t delete records, delete partitions. 
> Maybe use CAS to know when to flip to the next partition.
> Maybe use CAS to track your consumed offset within a partition 
> CQL Row level tombstones don’t matter in cassandra 3+ - they’re just point 
> deletes after the storage engine rewrite. 
>  
> You’re still probably better off running Kafka in the spare CPU and memory
> you’d use for this. Understand it’s nontrivial to set up, but it’s also
> nontrivial to do this properly.
>  
>  
> 
> 
> On May 23, 2020, at 9:26 AM, Laxmikant Upadhyay <laxmikant....@gmail.com> 
> wrote:
> 
> Thank you so much for the quick response. I completely agree with Jeff and
> Gabor that it is an anti-pattern to build a queue in Cassandra. But the plan
> is to reuse the existing Cassandra infrastructure without any additional cost
> (like Kafka).
> So even if the data is partitioned properly (max 10 MB per date), will it
> still be an issue if I read the partition only once a day? Even if I update
> the status and don't delete the row?
> 
> On Sat, May 23, 2020, 4:36 PM Gábor Auth <auth.ga...@gmail.com> wrote:
> Hi,
>  
> On Sat, May 23, 2020 at 4:09 PM Laxmikant Upadhyay <laxmikant....@gmail.com> 
> wrote:
> I think that we should avoid tombstones, especially row-level ones, so we
> should go with option-1. Kindly suggest on the above or any other better
> approach?
>  
> Why don't you use a queue implementation, like ActiveMQ, Kafka, or
> something similar? Cassandra is not suitable for this at all; it is an
> anti-pattern in the Cassandra world.
>  
> --
> Bye,
> Auth Gábor (https://iotguru.cloud)
