Re: Cassandra Delete vs Update

2020-05-24 Thread Jeff Jirsa

In 2.2 and earlier, CQL row deletes appear as range tombstones.

All range tombstones for a CQL partition get read into a list (I think it’s 
literally RangeTombstoneList.java, but I’m not at a computer to check) at the 
start of a read command and held in memory to reconcile the read.

Because they’re not “normal” tombstones in 2.1/2.2/etc, a few weird things are 
true:

- They’re read any time you touch the partition, even if you have a data model 
where the read command uses clustering order to avoid reading them.
- They’re not counted toward the TombstoneOverwhelmingException, so if you have 
a million row deletes but set your tombstone threshold to 1, the reads will 
succeed anyway, but you’ll have massive GC because the range tombstone list 
object is going to be very large.
- There’s a timing quirk in 2.1/2.2 where the read timeout timer doesn’t apply 
to the range tombstones - I’ve seen heaps where a read command spent minutes 
reading tombstones from a slow disk (+ pausing for GC).

That combo makes row deletes especially painful in 2.1.

In 3.0, they’re just point deletes - they’re only read when they’re in the 
middle of data you have to read (so smart clustering and ORDER BY can avoid 
reading them), they’re not all materialized into memory, and they’re counted 
properly (at least in 3.11).

TL;DR: it’s less likely you’ll be surprised by the painful behavior of row 
deletes in 3.0+ than you may have been in 2.2 and older.
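
A minimal CQL sketch of the "smart clustering" point, assuming a hypothetical 
table ks.events_by_bucket (the table, its columns, and the literal values are 
illustrative and not from this thread). Consumed rows are deleted at the front 
of the partition, and the next read starts after the last consumed clustering 
key, so on 3.0+ the read should not need to cover the deleted range:

```cql
-- Hypothetical table: one partition per bucket, rows ordered by event_id.
CREATE TABLE ks.events_by_bucket (
    bucket   text,
    event_id timeuuid,
    payload  text,
    PRIMARY KEY (bucket, event_id)
) WITH CLUSTERING ORDER BY (event_id ASC);

-- Row delete for an already-consumed row (writes a row tombstone).
DELETE FROM ks.events_by_bucket
WHERE bucket = '2020-05-24'
  AND event_id = 50554d6e-29bb-11e5-b345-feff819cdc9f;

-- Read that starts strictly after the last consumed row, so the slice
-- begins past the tombstones at the front of the partition.
SELECT event_id, payload
FROM ks.events_by_bucket
WHERE bucket = '2020-05-24'
  AND event_id > 50554d6e-29bb-11e5-b345-feff819cdc9f
LIMIT 100;
```

Per the description above, the same pattern on 2.1/2.2 would still load every 
range tombstone in the partition before serving the read.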




Re: Cassandra Delete vs Update

2020-05-24 Thread Tobias Eriksson
Hi Jeff,
Could you elaborate on the statement that you made:
“CQL Row level tombstones don’t matter in cassandra 3+ - they’re just point 
deletes after the storage engine rewrite.”
Are you saying that a row-level delete is not like other tombstones? If so, how 
are they different?
I tried to google it but did not get any good results.
-Tobias




Re: Cassandra Delete vs Update

2020-05-23 Thread Laxmikant Upadhyay
Thanks all for your answers.
Thanks Jeff for the clarification. The only thing I could not get is how CAS (I
assume you are talking about compare-and-set) will help track the offset
consumed within a partition. But I got a good idea of what you are trying to
explain. Capping partition size and deleting by partition are two important
points to remember. Thanks for your help.
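
One way the CAS idea could look in CQL, sketched with lightweight transactions 
(the IF clause). The table ks.consumer_offsets and its columns are hypothetical 
and not from this thread. Each consumer advances the offset only if it still 
holds the value the consumer last saw, so two consumers cannot both claim the 
same position:

```cql
-- Hypothetical table: one row per queue partition, holding the last
-- sequence number that has been fully consumed.
CREATE TABLE ks.consumer_offsets (
    queue_name   text,
    bucket       int,
    consumed_seq bigint,
    PRIMARY KEY ((queue_name, bucket))
);

-- Create the offset row once; IF NOT EXISTS makes the insert conditional.
INSERT INTO ks.consumer_offsets (queue_name, bucket, consumed_seq)
VALUES ('orders', 7, 0)
IF NOT EXISTS;

-- Compare-and-set: move the offset from 41 to 42 only if it is still 41.
UPDATE ks.consumer_offsets
SET consumed_seq = 42
WHERE queue_name = 'orders' AND bucket = 7
IF consumed_seq = 41;
```

The result of a conditional statement includes an [applied] column. When it is 
false, another consumer moved the offset first, and the caller should re-read 
the current value instead of processing the same record again.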



Re: Cassandra Delete vs Update

2020-05-23 Thread Jeff Jirsa

Using Cassandra as a queue is possible if you really really understand the data 
model, but most people will do it wrong the first few times.

- Cap your partition size. The times I’ve seen this done were near 10 MB 
partitions and used a special hook into internals to track partition size via 
index offsets so they knew when to switch to the next partition.
- Don’t delete records, delete partitions.
- Maybe use CAS to know when to flip to the next partition.
- Maybe use CAS to track your consumed offset within a partition.
- CQL row-level tombstones don’t matter in Cassandra 3+ - they’re just point 
deletes after the storage engine rewrite.

You’re still probably better off running Kafka in the spare CPU and memory 
you’d use for this. Understand it’s nontrivial to set up, but it’s also 
nontrivial to do this properly.
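
A minimal CQL sketch of the layout described above, under stated assumptions: 
the tables ks.queue_messages and ks.queue_state, their columns, and the bucket 
numbers are hypothetical. The size-tracking hook into internals is not shown; 
the sketch assumes the application decides by some other means when a bucket is 
full.

```cql
-- Hypothetical message table: each (queue_name, bucket) pair is one
-- bounded partition; seq orders the messages inside it.
CREATE TABLE ks.queue_messages (
    queue_name text,
    bucket     int,
    seq        bigint,
    payload    text,
    PRIMARY KEY ((queue_name, bucket), seq)
);

-- Hypothetical state table: which bucket producers currently write to.
CREATE TABLE ks.queue_state (
    queue_name   text PRIMARY KEY,
    write_bucket int
);

-- Flip to the next partition with CAS: only one producer wins the
-- transition from bucket 7 to bucket 8.
UPDATE ks.queue_state
SET write_bucket = 8
WHERE queue_name = 'orders'
IF write_bucket = 7;

-- Once bucket 7 is fully consumed, drop the whole partition with one
-- partition-level delete instead of one delete per row.
DELETE FROM ks.queue_messages
WHERE queue_name = 'orders' AND bucket = 7;
```

Deleting by full partition key writes a single partition-level tombstone, which 
is the reason for preferring it over per-record deletes here.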





Re: Cassandra Delete vs Update

2020-05-23 Thread Gábor Auth
Hi,

On Sat, May 23, 2020 at 6:26 PM Laxmikant Upadhyay 
wrote:

> Thank you so much for the quick response. I completely agree with Jeff and
> Gabor that it is an anti-pattern to build a queue in Cassandra. But the plan
> is to reuse the existing Cassandra infrastructure without any additional cost
> (like Kafka).
> So even if the data is partitioned properly (max 10 MB per date), will it
> still be an issue if I read the partition only once a day? Even if I update
> the status and don't delete the row?
>

Both options generate unnecessary records; there is no big difference
between them. But if the load isn't too high (and 10 MB per day isn't much),
it doesn't matter.

I also have a lot of little tables (oh, column families) that ideally wouldn't
be in Cassandra, but since they have a very minimal load, I don't give a
shit... :)

-- 
Bye,
Auth Gábor (https://iotguru.cloud)


Re: Cassandra Delete vs Update

2020-05-23 Thread Laxmikant Upadhyay
Thank you so much for the quick response. I completely agree with Jeff and
Gabor that it is an anti-pattern to build a queue in Cassandra. But the plan
is to reuse the existing Cassandra infrastructure without any additional cost
(like Kafka).
So even if the data is partitioned properly (max 10 MB per date), will it
still be an issue if I read the partition only once a day? Even if I update
the status and don't delete the row?



Re: Cassandra Delete vs Update

2020-05-23 Thread Gábor Auth
Hi,

On Sat, May 23, 2020 at 4:09 PM Laxmikant Upadhyay 
wrote:

> I think that we should avoid tombstones, especially row-level ones, so we
> should go with option-1. Kindly suggest on the above or any better approach.
>

Why don't you use a queue implementation, like ActiveMQ, Kafka, or something
similar? Cassandra is not suitable for this at all; it is an anti-pattern in
the Cassandra world.

-- 
Bye,
Auth Gábor (https://iotguru.cloud)


Re: Cassandra Delete vs Update

2020-05-23 Thread Jeff Jirsa


You’re building a queue.

Just use Kafka.




Re: Cassandra Delete vs Update

2020-05-23 Thread Aakash Pandhi
Laxmikant, 
You mentioned that you need to filter records based on status='pending' in 
option-1. I don't see that filtering is done in that option. You are setting 
status as 'processed' when partition key is matched for table. For delete 
(option-2) it will completely remove whole partition for records_by_date table 
if that's what you want. 
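
For reference, a short CQL sketch of delete scope against the schema from the 
question (the date and uuid literals are illustrative). With PRIMARY KEY (date, 
id), date is the partition key and id is the clustering column, so the scope of 
a DELETE depends on how much of the key the WHERE clause restricts. The DELETE 
in option-2 restricts both columns, so it targets a single row:

```cql
-- Row delete: both date and id are restricted, so only this one row in
-- the date partition is removed (a row-level tombstone).
DELETE FROM ks.records_by_date
WHERE date = '2020-05-23'
  AND id = 123e4567-e89b-12d3-a456-426614174000;

-- Partition delete: only the partition key is restricted, so every row
-- for that date is removed (a single partition-level tombstone).
DELETE FROM ks.records_by_date
WHERE date = '2020-05-23';
```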
Regards,
Aakash Pandhi
 

  

Cassandra Delete vs Update

2020-05-23 Thread Laxmikant Upadhyay
Hi All,
I have a question regarding Cassandra data modelling. I have created two
tables:

1. CREATE TABLE ks.records_by_id ( id uuid PRIMARY KEY,  status text,
details text);
2. CREATE TABLE ks.records_by_date ( date date, id uuid,  status text,
PRIMARY KEY(date, id));

I need to fetch records by date and then process each of them. Which of the
following options will be better when the record is processed?

*Option-1 : *
BEGIN BATCH
UPDATE ks.records_by_id SET status = 'processed' WHERE id = ;
UPDATE ks.records_by_date SET status = 'processed' WHERE id =  and
date='date1';
APPLY BATCH ;

*Option-2*
BEGIN BATCH
UPDATE ks.records_by_id SET status = 'processed' WHERE id = ;
DELETE FROM ks.records_by_date WHERE id =  and date='date1';
APPLY BATCH ;

Option-1 will not create tombstones, but I need to filter the records based
on status='pending' at the application layer for each date. Option-2 will
create tombstones (however, the number of tombstones will be limited in a
partition) but it will not require application-side filtering.

I think that we should avoid tombstones, especially row-level ones, so we
should go with option-1. Kindly suggest on the above or any better approach.
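
To make the trade-off concrete, a minimal sketch of the daily read both options 
share (the date literal is illustrative). Because status is a regular column in 
ks.records_by_date, restricting it on the server side would need ALLOW 
FILTERING or a secondary index, which is why option-1 filters in the 
application:

```cql
-- Read one date partition. With option-1 the result still contains rows
-- whose status is 'processed', and the application skips them. With
-- option-2 those rows were deleted, so only unprocessed rows come back.
SELECT id, status
FROM ks.records_by_date
WHERE date = '2020-05-23';
```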

-- 

regards,
Laxmikant Upadhyay