Cassandra 0.7 C++ Thrift delete problem

2010-12-11 Thread Jaydeep Chovatia
Hi,

I am facing some issues with the delete operation in Cassandra 0.7.0 using the C++
Thrift API. Please find the details here:
C++ Thrift Machine: Linux 64-bit


 1.  I am using the remove Thrift API to delete either a column or a column family.
The problem is that when I call remove, it hangs forever (never returns).
Please find my code details here:

cassClient->set_keyspace(keySpace);

ColumnPath clnPath;
clnPath.column_family = columnFamily;
if (columnOrSuperColumnName.size() > 0) {
    if (isSuperColumn) {
        clnPath.__isset.super_column = true;
        clnPath.super_column = columnOrSuperColumnName;
    } else {
        clnPath.__isset.column = true;
        clnPath.column = columnOrSuperColumnName;
    }
}
//DIAGC(mssRME,  1,  About to run remove...);
cassClient->remove(key, clnPath, delTimeStamp, consisLevel);
//DIAGC(mssRME,  1,  Success! remove...);



 2.  I have tried the batch_mutate Thrift API. With this, the operation goes
smoothly, but I end up getting the deleted column/column family/key back when I
perform a get operation after the delete. It looks like the delete operation did not work.
Please find sample code here:

map<string, vector<Column> >::iterator iter = SCNameAndCols.begin();
//SuperColumn name as key - vector containing columns in each super column
std::vector<Mutation> mv;
while (iter != SCNameAndCols.end()) {  //Iterate through each SuperColumn one after another
    SuperColumn sc;
    ColumnOrSuperColumn cs;

    Mutation mu;
    mu.__isset.deletion = true;

    mu.deletion.timestamp = delTimeStamp;  //This is the latest timestamp in milliseconds
    mu.deletion.__isset.predicate = true;
    mu.deletion.super_column = (*iter).first;
    mu.deletion.predicate.__isset.column_names = true;
    for (int i = 0; i < (*iter).second.size(); i++) {
        mu.deletion.predicate.column_names.push_back((*iter).second[i].name);
    }

    mv.push_back(mu);
    iter++;
}  //End of while loop
std::map<std::string, std::vector<Mutation> > innMap;
innMap[CFName] = mv;

std::map<std::string, std::map<std::string, std::vector<Mutation> > > outMap;
outMap[Key] = innMap;

//DIAGC(mssRME,  1,  About to run batch_mutate...);
cassClient->batch_mutate(outMap, consisLevel);
//DIAGC(mssRME,  1,  Success! batch_mutate...);
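
One thing worth double-checking (an assumption on my part): Cassandra compares the
delete timestamp against the write timestamps numerically, and the usual client
convention is microseconds since the epoch. If the original columns were written
with microsecond timestamps, a millisecond-scale delTimeStamp would be lower and
the deletion would silently lose. A minimal sketch of a microsecond timestamp:

#include <sys/time.h>
#include <stdint.h>

// Current time in microseconds since the epoch (the usual Cassandra client convention).
static int64_t nowMicros() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return static_cast<int64_t>(tv.tv_sec) * 1000000 + tv.tv_usec;
}

// e.g. mu.deletion.timestamp = nowMicros();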


Any help would be appreciated.

Thank you,
Jaydeep



Cassandra LongType data insertion problem

2011-01-04 Thread Jaydeep Chovatia
Hi,

I have configured a Cassandra column family (standard CF) of LongType. If I try
to insert data (using batch_mutate) into this column family then it shows me the
following error: "A long is exactly 8 bytes". I have tried assigning a column
name of 8 bytes, 7 bytes, etc. but it shows the same error.

Please find my sample program details:
Platform: Linux
Language: C++, Cassandra Thrift interface

Column c1;
c1.name = "12345678";
c1.value = SString(len).AsPtr();
c1.timestamp = curTime;
columns.push_back(c1);
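
For reference, a minimal sketch of building such a name; this assumes the LongType
comparator expects the raw 8-byte big-endian encoding of the value rather than its
ASCII text (the validator rejects anything that is not exactly 8 bytes):

#include <string>
#include <stdint.h>

// Encode a 64-bit value as the 8-byte big-endian string expected for a LongType name.
static std::string encodeLongBE(int64_t v) {
    std::string out(8, '\0');
    for (int i = 7; i >= 0; --i) {
        out[i] = static_cast<char>(v & 0xFF);
        v >>= 8;
    }
    return out;
}

// c1.name = encodeLongBE(12345678);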

Any help on this would be appreciated.

Thank you,
Jaydeep


CQL performance inserting multiple cluster keys under same partition key

2014-08-26 Thread Jaydeep Chovatia
Hi,

I have a question about inserting multiple clustering keys under the same partition
key.

Ex:

CREATE TABLE Employee (
  deptId int,
  empId int,
  name   varchar,
  address varchar,
  salary int,
  PRIMARY KEY(deptId, empId)
);

BEGIN UNLOGGED BATCH
  INSERT INTO Employee (deptId, empId, name, address, salary) VALUES (1,
10, 'testNameA', 'testAddressA', 2);
  INSERT INTO Employee (deptId, empId, name, address, salary) VALUES (1,
20, 'testNameB', 'testAddressB', 3);
APPLY BATCH;

Here we are inserting two clustering keys (10 and 20) under the same partition key
(1).
Q1) Is this batch transaction atomic and isolated? If yes, then is there any
performance overhead with this syntax?
Q2) Can this CQL syntax be considered equivalent to Thrift
batch_mutate?

-jaydeep


Re: CQL performance inserting multiple cluster keys under same partition key

2014-08-26 Thread Jaydeep Chovatia
But if we look at the Thrift-world batch_mutate, then it used to perform all
mutations within a partition key atomically without using CAS, i.e. no extra
penalty.
Does this mean CQL degrades in performance compared to Thrift if we want
to do multiple updates to a partition key atomically?


On Tue, Aug 26, 2014 at 11:51 AM, Vivek Mishra mishra.v...@gmail.com
wrote:

 AFAIK, it is not. With CAS it should be.
 On 26/08/2014 10:21 pm, Jaydeep Chovatia chovatia.jayd...@gmail.com
 wrote:

 Hi,

 I have question on inserting multiple cluster keys under same partition
 key.

 Ex:

 CREATE TABLE Employee (
   deptId int,
   empId int,
   name   varchar,
   address varchar,
   salary int,
   PRIMARY KEY(deptId, empId)
 );

 BEGIN UNLOGGED BATCH
   INSERT INTO Employee (deptId, empId, name, address, salary) VALUES (1,
 10, 'testNameA', 'testAddressA', 2);
   INSERT INTO Employee (deptId, empId, name, address, salary) VALUES (1,
 20, 'testNameB', 'testAddressB', 3);
 APPLY BATCH;

 Here we are inserting two cluster keys (10 and 20) under same partition
 key (1).
 Q1) Is this batch transaction atomic and isolated? If yes then is there
 any performance overhead with this syntax?
 Q2) Is this CQL syntax can be considered equivalent of Thrift
 batch_mutate?

 -jaydeep




Re: CQL performance inserting multiple cluster keys under same partition key

2014-08-27 Thread Jaydeep Chovatia
This clarifies my doubt.
Thank you, Sylvain, for your help.


On Tue, Aug 26, 2014 at 11:59 PM, Sylvain Lebresne sylv...@datastax.com
wrote:

 On Tue, Aug 26, 2014 at 6:50 PM, Jaydeep Chovatia 
 chovatia.jayd...@gmail.com wrote:

 Hi,

 I have question on inserting multiple cluster keys under same partition
 key.

 Ex:

 CREATE TABLE Employee (
   deptId int,
   empId int,
   name   varchar,
   address varchar,
   salary int,
   PRIMARY KEY(deptId, empId)
 );

 BEGIN UNLOGGED BATCH
   INSERT INTO Employee (deptId, empId, name, address, salary) VALUES (1,
 10, 'testNameA', 'testAddressA', 2);
   INSERT INTO Employee (deptId, empId, name, address, salary) VALUES (1,
 20, 'testNameB', 'testAddressB', 3);
 APPLY BATCH;

 Here we are inserting two cluster keys (10 and 20) under same partition
 key (1).
 Q1) Is this batch transaction atomic and isolated? If yes then is there
 any performance overhead with this syntax?


 As long as the updates are under the same partition key (and I insist, only
 in that condition), logged (the one without the UNLOGGED keyword) and
 unlogged batches behave *exactly* the same way. So yes, in that case the
 batch is atomic and isolated (though on the isolation, you may want to be
 aware that while technically isolated, the usual timestamp rules still
 apply and so you might not get the behavior you think if 2 batches have the
 same timestamp: see CASSANDRA-6123
 https://issues.apache.org/jira/browse/CASSANDRA-6123). There is also
 no performance overhead (assuming you meant over logged batches).

 Q2) Is this CQL syntax can be considered equivalent of Thrift
 batch_mutate?


 It is equivalent, both (the CQL syntax and Thrift batch_mutate) resolve
 to the same operation internally.

 --
 Sylvain



Re: Replacing a dead node by deleting it and auto_bootstrap'ing a new node (Cassandra 2.0)

2014-12-04 Thread Jaydeep Chovatia
As per my knowledge, if you have NOT explicitly specified
-Dcassandra.replace_address=old_node_ipaddress, then new (random) tokens
would get assigned to the bootstrapping node instead of the tokens of the dead node.

-jaydeep

On Thu, Dec 4, 2014 at 6:50 AM, Omri Bahumi om...@everything.me wrote:

 Hi,

 I was wondering, how would auto_bootstrap behave in this scenario:

 1. I had a cluster with 3 nodes (RF=2)
 2. One node died, I deleted it with nodetool removenode (+ force)
 3. A new node launched with auto_bootstrap: true

 The question is: will the right vnodes go to the new node as if it
 was bootstrapped with -Dcassandra.replace_address=old_node_ipaddress
 ?

 Thanks,
 Omri.



Re: Replacing a dead node by deleting it and auto_bootstrap'ing a new node (Cassandra 2.0)

2014-12-05 Thread Jaydeep Chovatia
I think Cassandra gives us control over what we want to do:
a) If we want to replace a dead node then we should specify
-Dcassandra.replace_address=old_node_ipaddress (see the sketch below).
b) If we are adding new nodes (no replacement) then do not specify the above
option and tokens will get assigned randomly.
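
For reference, a hypothetical cassandra-env.sh fragment for option (a); the IP
address is only a placeholder for the dead node's address:

# Option (a): take over the dead node's tokens and stream its data (placeholder IP)
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.12"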

I can think of a scenario in which your dead node has tons of data and you
are hopeful about its recovery, so you do not always want to replace the dead
node. You might temporarily just add a new node to meet the capacity
until the dead node is fully recovered.

-jaydeep

On Thu, Dec 4, 2014 at 11:30 PM, Omri Bahumi om...@everything.me wrote:

 I guess Cassandra is aware that it has some replicas not meeting the
 replication factor. Wouldn't it be nice if a bootstrapping node would
 get those?
 Could make things much simpler in the Ops view.

 What do you think?

 On Fri, Dec 5, 2014 at 8:31 AM, Jaydeep Chovatia
 chovatia.jayd...@gmail.com wrote:
  as per my knowledge if you have externally NOT specified
  -Dcassandra.replace_address=old_node_ipaddress then new tokens
 (randomly)
  would get assigned to bootstrapping node instead of tokens of dead node.
 
  -jaydeep
 
  On Thu, Dec 4, 2014 at 6:50 AM, Omri Bahumi om...@everything.me wrote:
 
  Hi,
 
  I was wondering, how would auto_bootstrap behave in this scenario:
 
  1. I had a cluster with 3 nodes (RF=2)
  2. One node died, I deleted it with nodetool removenode (+ force)
  3. A new node launched with auto_bootstrap: true
 
  The question is: will the right vnodes go to the new node as if it
  was bootstrapped with -Dcassandra.replace_address=old_node_ipaddress
  ?
 
  Thanks,
  Omri.
 
 



Re: Write timeout under load but Read is fine

2015-03-05 Thread Jaydeep Chovatia
I have tried increasing the timeout to 1 but it did not help. I have also verified
that there are no lost network packets.

Jaydeep

On Wed, Mar 4, 2015 at 12:19 PM, Jan cne...@yahoo.com wrote:

 HI Jaydeep;


- look at the i/o  on all three nodes
- Increase the write_request_timeout_in_ms: 1
- check the time-outs if any on the client inserting the Writes
- check the Network for  dropped/lost packets


 hope this helps
 Jan/



   On Wednesday, March 4, 2015 12:26 PM, Jaydeep Chovatia 
 chovatia.jayd...@gmail.com wrote:


 Hi,

 In my test program when I increase load then I keep getting few write
 timeout from Cassandra say every 10~15 mins. My read:write ratio is
 50:50. My reads are fine but only writes time out.

 Here is my Cassandra details:
 Version: 2.0.11
 Ring of 3 nodes with RF=3
 Node configuration: 24 core + 64GB RAM + 2TB

 write_request_timeout_in_ms: 5000, rest of Cassandra.yaml configuration
 is default

 I've also checked IO on Cassandra nodes and looks very low (around 5%).
 I've also checked Cassandra log file and do not see any GC happening. Also
 CPU on Cassandra is low (around 20%). I have 20GB data on each node.

 My test program creates connection to all three Cassandra nodes and sends
 read+write request randomly.

 Any idea what should I look for?

 Jaydeep





-- 
Jaydeep


Re: Write timeout under load but Read is fine

2015-03-06 Thread Jaydeep Chovatia
I am using QUORUM
CQL
No SSDs (anyway, my IOPS is quite low so I don't think it matters)
No compaction is running when I receive the timeout


On Fri, Mar 6, 2015 at 12:35 AM, Carlos Rolo r...@pythian.com wrote:

 What is the consistency level you are using?
 Are you using Thrift or CQL?
 Are you using SSDs?
 Check if compactions are running when you get the timeouts.

 Regards,

 Carlos Juzarte Rolo
 Cassandra Consultant

 Pythian - Love your data

 rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
 http://linkedin.com/in/carlosjuzarterolo*
 Tel: 1649
 www.pythian.com

 On Thu, Mar 5, 2015 at 9:51 PM, Jan cne...@yahoo.com wrote:

 Hello Jaydeep;

 Run *cassandra-stress* with R/W options enabled  for about the same time
 and check if you have dropped packets.
 It would eliminate the client as the source of the error  also give you
 a replicable tool to base subsequent tests/ findings.

 Jan/




   On Thursday, March 5, 2015 12:19 PM, Jaydeep Chovatia 
 chovatia.jayd...@gmail.com wrote:


 I have tried increasing timeout to 1 but no help. Also verified that
 there is no network lost packets.

 Jaydeep

 On Wed, Mar 4, 2015 at 12:19 PM, Jan cne...@yahoo.com wrote:

 HI Jaydeep;


- look at the i/o  on all three nodes
- Increase the write_request_timeout_in_ms: 1
- check the time-outs if any on the client inserting the Writes
- check the Network for  dropped/lost packets


 hope this helps
 Jan/



   On Wednesday, March 4, 2015 12:26 PM, Jaydeep Chovatia 
 chovatia.jayd...@gmail.com wrote:


 Hi,

 In my test program when I increase load then I keep getting few write
 timeout from Cassandra say every 10~15 mins. My read:write ratio is
 50:50. My reads are fine but only writes time out.

 Here is my Cassandra details:
 Version: 2.0.11
 Ring of 3 nodes with RF=3
 Node configuration: 24 core + 64GB RAM + 2TB

 write_request_timeout_in_ms: 5000, rest of Cassandra.yaml configuration
 is default

 I've also checked IO on Cassandra nodes and looks very low (around 5%).
 I've also checked Cassandra log file and do not see any GC happening. Also
 CPU on Cassandra is low (around 20%). I have 20GB data on each node.

 My test program creates connection to all three Cassandra nodes and sends
 read+write request randomly.

 Any idea what should I look for?

 Jaydeep





 --
 Jaydeep




 --






-- 
Jaydeep


Write timeout under load but Read is fine

2015-03-04 Thread Jaydeep Chovatia
Hi,

In my test program, when I increase the load I keep getting a few write
timeouts from Cassandra, say every 10~15 mins. My read:write ratio is
50:50. My reads are fine but only the writes time out.

Here is my Cassandra details:
Version: 2.0.11
Ring of 3 nodes with RF=3
Node configuration: 24 core + 64GB RAM + 2TB

write_request_timeout_in_ms: 5000, rest of Cassandra.yaml configuration
is default

I've also checked IO on the Cassandra nodes and it looks very low (around 5%).
I've also checked the Cassandra log file and do not see any GC pauses. Also,
CPU on Cassandra is low (around 20%). I have 20GB of data on each node.

My test program creates connections to all three Cassandra nodes and sends
read+write requests randomly.

Any idea what should I look for?

Jaydeep


One node taking more resources than others in the ring

2015-02-23 Thread Jaydeep Chovatia
Hi,

I have a three node cluster with RF=1 (only one datacenter) with the following
size:

Datacenter: DC1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  AddressLoad   Tokens  Owns   Host ID
Rack
UN  IP1  4.02 GB1   33.3%  ID1  RAC1
UN  IP2  4.05 GB1   33.3%  ID2  RAC2
UN  IP3  4.05 GB1   33.3%  ID3  RAC3

I have created different tables and my test application reads/writes with
CL=QUORUM. Under load I found that one of my nodes is taking more
resources (double the CPU) than the other two. I have also verified that there
is no other process causing this problem.
My hardware configuration is the same on all nodes: Linux + 64-bit + 24
cores + 64GB + 1TB.
My Cassandra version is 2.0 and JDK 1.7.

Jaydeep


Re: Cassandra WriteTimeoutException

2015-07-28 Thread Jaydeep Chovatia
Are you using light weight transactions anywhere?

On Wed, Jul 15, 2015 at 7:40 AM, Michael Shuler mich...@pbandjelly.org
wrote:

 On 07/15/2015 02:28 AM, Amlan Roy wrote:

 Hi,

 I get the following error intermittently while writing to Cassandra.
 I am using version 2.1.7. Not sure how to fix the actual issue
 without increasing the timeout in cassandra.yaml.


 snip

 Post your data model, query, and maybe some cluster config basics for
 better help. Increasing the timeout is never a great answer..

 --
 Kind regards,
 Michael



Re: High read latency

2015-09-27 Thread Jaydeep Chovatia
A read requires on avg. 6 SSTables and my read latency is 42 ms, so on avg. we
can say Cassandra is taking 7 ms to process data from one SSTable *which is
entirely in memory*. I think there is something wrong here. If we go with
this math then we can say Cassandra latency would always be > 7 ms for most
of the use-cases, which doesn't seem right to me at least.

On Sat, Sep 26, 2015 at 6:19 AM, Eric Stevens <migh...@gmail.com> wrote:

> Since you have most of your reads hitting 5-8 SSTables, it's probably
> related to that increasing your latency.  That makes this look like your
> write workload is either overwrite-heavy or append-heavy.  Data for a
> single partition key is being written to repeatedly over long time periods,
> and this will definitely impact read performance.
>
> You can enable tracing in cqlsh and run your select to see where the time
> is going.
>
> On Fri, Sep 25, 2015 at 3:07 PM, Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Please find histogram attached.
>>
>> On Fri, Sep 25, 2015 at 12:20 PM, Ryan Svihla <r...@foundev.pro> wrote:
>>
>>> if everything is in ram there could be a number of issues unrelated to
>>> Cassandra and there could be hardware limitations or contention problems.
>>> Otherwise cell count can really deeply impact reads, all ram or not, and
>>> some of this is because of the nature of GC and some of it is the age of
>>> the sstable format (which is due to be revamped in 3.0). Also partition
>>> size can matter just because of physics, if one of those is a 1gb
>>> partition, the network interface can only move that back across the wire so
>>> quickly not to mention the GC issues you’d run into.
>>>
>>> Anyway this is why I asked for the histograms, I wanted to get cell
>>> count and partition size. I’ve seen otherwise very stout hardware get slow
>>> on reads of large results because either a bottleneck was hit somewhere, or
>>> the CPU got slammed with GC, or other processes running on the machine were
>>> contending with Cassandra.
>>>
>>>
>>> On Sep 25, 2015, at 12:45 PM, Jaydeep Chovatia <
>>> chovatia.jayd...@gmail.com> wrote:
>>>
>>> I understand that but everything is in RAM (my data dir is tmpfs) and my
>>> row is not that wide approx. less than 5MB in size. So my question is if
>>> everything is in RAM then why does it take 43ms latency?
>>>
>>> On Fri, Sep 25, 2015 at 7:54 AM, Ryan Svihla <r...@foundev.pro> wrote:
>>>
>>>> if you run:
>>>>
>>>> nodetool cfhistograms <keyspace> <table>
>>>>
>>>> On the given table and that will tell you how wide your rows are
>>>> getting. At some point you can get wide enough rows that just the physics
>>>> of retrieving them all take some time.
>>>>
>>>>
>>>> On Sep 25, 2015, at 9:21 AM, sai krishnam raju potturi <
>>>> pskraj...@gmail.com> wrote:
>>>>
>>>> Jaydeep; since your primary key involves a clustering column, you may
>>>> be having pretty wide rows. The read would be sequential. The latency could
>>>> be acceptable, if the read were to involve really wide rows.
>>>>
>>>> If your primary key was like ((a,b)) without the clustering column,
>>>> it's like reading a key value pair, and 40ms latency may have been a
>>>> concern.
>>>>
>>>> Bottom Line : The latency depends on how wide the row is.
>>>>
>>>> On Tue, Sep 22, 2015 at 1:27 PM, sai krishnam raju potturi <
>>>> pskraj...@gmail.com> wrote:
>>>>
>>>>> thanks for the information. Posting the query too would be of help.
>>>>>
>>>>> On Tue, Sep 22, 2015 at 11:56 AM, Jaydeep Chovatia <
>>>>> chovatia.jayd...@gmail.com> wrote:
>>>>>
>>>>>> Please find required details here:
>>>>>>
>>>>>> -  Number of req/s
>>>>>>
>>>>>> 2k reads/s
>>>>>>
>>>>>> -  Schema details
>>>>>>
>>>>>> create table test {
>>>>>>
>>>>>> a timeuuid,
>>>>>>
>>>>>> b bigint,
>>>>>>
>>>>>> c int,
>>>>>>
>>>>>> d int static,
>>>>>>
>>>>>> e int static,
>>>>>>
>>>>>> f int static,
>>>>>>
>>&g

Re: High read latency

2015-09-22 Thread Jaydeep Chovatia
select * from test where a = ? and b = ?

On Tue, Sep 22, 2015 at 10:27 AM, sai krishnam raju potturi <
pskraj...@gmail.com> wrote:

> thanks for the information. Posting the query too would be of help.
>
> On Tue, Sep 22, 2015 at 11:56 AM, Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Please find required details here:
>>
>> -  Number of req/s
>>
>> 2k reads/s
>>
>> -  Schema details
>>
>> create table test {
>>
>> a timeuuid,
>>
>> b bigint,
>>
>> c int,
>>
>> d int static,
>>
>> e int static,
>>
>> f int static,
>>
>> g int static,
>>
>> h int,
>>
>> i text,
>>
>> j text,
>>
>> k text,
>>
>> l text,
>>
>> m set
>>
>> n bigint
>>
>> o bigint
>>
>> p bigint
>>
>> q bigint
>>
>> r int
>>
>> s text
>>
>> t bigint
>>
>> u text
>>
>> v text
>>
>> w text
>>
>> x bigint
>>
>> y bigint
>>
>> z bigint,
>>
>> primary key ((a, b), c)
>>
>> };
>>
>> -  JVM settings about the heap
>>
>> Default settings
>>
>> -  Execution time of the GC
>>
>> Avg. 400ms. I do not see long pauses of GC anywhere in the log file.
>>
>> On Tue, Sep 22, 2015 at 5:34 AM, Leleu Eric <eric.le...@worldline.com>
>> wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>>
>>>
>>> Before speaking about tuning, can you provide some additional
>>> information ?
>>>
>>>
>>>
>>> -  Number of req/s
>>>
>>> -  Schema details
>>>
>>> -  JVM settings about the heap
>>>
>>> -  Execution time of the GC
>>>
>>>
>>>
>>> 43ms for a read latency may be acceptable according to the number of
>>> request per second.
>>>
>>>
>>>
>>>
>>>
>>> Eric
>>>
>>>
>>>
>>> *De :* Jaydeep Chovatia [mailto:chovatia.jayd...@gmail.com]
>>> *Envoyé :* mardi 22 septembre 2015 00:07
>>> *À :* user@cassandra.apache.org
>>> *Objet :* High read latency
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> My application issues more read requests than write, I do see that under
>>> load cfstats for one of the table is quite high around 43ms
>>>
>>>
>>>
>>> Local read count: 114479357
>>>
>>> Local read latency: 43.442 ms
>>>
>>> Local write count: 22288868
>>>
>>> Local write latency: 0.609 ms
>>>
>>>
>>>
>>>
>>>
>>> Here is my node configuration:
>>>
>>> RF=3, Read/Write with QUORUM, 64GB RAM, 48 CPU core. I have only 5 GB of
>>> data on each node (and for experiment purpose I stored data in tmpfs)
>>>
>>>
>>>
>>> I've tried increasing concurrent_read count upto 512 but no help in read
>>> latency. CPU/Memory/IO looks fine on system.
>>>
>>>
>>>
>>> Any idea what should I tune?
>>>
>>>
>>>
>>> Jaydeep
>>>
>>> --
>>>
>>> This e-mail and the documents attached are confidential and intended
>>> solely for the addressee; it may also be privileged. If you receive this
>>> e-mail in error, please notify the sender immediately and destroy it. As
>>> its integrity cannot be secured on the Internet, the Worldline liability
>>> cannot be triggered for the message content. Although the sender endeavours
>>> to maintain a computer virus-free network, the sender does not warrant that
>>> this transmission is virus-free and will not be liable for any damages
>>> resulting from any virus transmitted.
>>>
>>
>>
>


Re: High read latency

2015-09-22 Thread Jaydeep Chovatia
Please find required details here:

-  Number of req/s

2k reads/s

-  Schema details

create table test {

a timeuuid,

b bigint,

c int,

d int static,

e int static,

f int static,

g int static,

h int,

i text,

j text,

k text,

l text,

m set

n bigint

o bigint

p bigint

q bigint

r int

s text

t bigint

u text

v text

w text

x bigint

y bigint

z bigint,

primary key ((a, b), c)

};

-  JVM settings about the heap

Default settings

-  Execution time of the GC

Avg. 400ms. I do not see long pauses of GC anywhere in the log file.

On Tue, Sep 22, 2015 at 5:34 AM, Leleu Eric <eric.le...@worldline.com>
wrote:

> Hi,
>
>
>
>
>
> Before speaking about tuning, can you provide some additional information ?
>
>
>
> -  Number of req/s
>
> -  Schema details
>
> -  JVM settings about the heap
>
> -  Execution time of the GC
>
>
>
> 43ms for a read latency may be acceptable according to the number of
> request per second.
>
>
>
>
>
> Eric
>
>
>
> *De :* Jaydeep Chovatia [mailto:chovatia.jayd...@gmail.com]
> *Envoyé :* mardi 22 septembre 2015 00:07
> *À :* user@cassandra.apache.org
> *Objet :* High read latency
>
>
>
> Hi,
>
>
>
> My application issues more read requests than write, I do see that under
> load cfstats for one of the table is quite high around 43ms
>
>
>
> Local read count: 114479357
>
> Local read latency: 43.442 ms
>
> Local write count: 22288868
>
> Local write latency: 0.609 ms
>
>
>
>
>
> Here is my node configuration:
>
> RF=3, Read/Write with QUORUM, 64GB RAM, 48 CPU core. I have only 5 GB of
> data on each node (and for experiment purpose I stored data in tmpfs)
>
>
>
> I've tried increasing concurrent_read count upto 512 but no help in read
> latency. CPU/Memory/IO looks fine on system.
>
>
>
> Any idea what should I tune?
>
>
>
> Jaydeep
>
> --
>
> This e-mail and the documents attached are confidential and intended
> solely for the addressee; it may also be privileged. If you receive this
> e-mail in error, please notify the sender immediately and destroy it. As
> its integrity cannot be secured on the Internet, the Worldline liability
> cannot be triggered for the message content. Although the sender endeavours
> to maintain a computer virus-free network, the sender does not warrant that
> this transmission is virus-free and will not be liable for any damages
> resulting from any virus transmitted.
>


Re: High read latency

2015-09-25 Thread Jaydeep Chovatia
Please find histogram attached.

On Fri, Sep 25, 2015 at 12:20 PM, Ryan Svihla <r...@foundev.pro> wrote:

> if everything is in ram there could be a number of issues unrelated to
> Cassandra and there could be hardware limitations or contention problems.
> Otherwise cell count can really deeply impact reads, all ram or not, and
> some of this is because of the nature of GC and some of it is the age of
> the sstable format (which is due to be revamped in 3.0). Also partition
> size can matter just because of physics, if one of those is a 1gb
> partition, the network interface can only move that back across the wire so
> quickly not to mention the GC issues you’d run into.
>
> Anyway this is why I asked for the histograms, I wanted to get cell count
> and partition size. I’ve seen otherwise very stout hardware get slow on
> reads of large results because either a bottleneck was hit somewhere, or
> the CPU got slammed with GC, or other processes running on the machine were
> contending with Cassandra.
>
>
> On Sep 25, 2015, at 12:45 PM, Jaydeep Chovatia <chovatia.jayd...@gmail.com>
> wrote:
>
> I understand that but everything is in RAM (my data dir is tmpfs) and my
> row is not that wide approx. less than 5MB in size. So my question is if
> everything is in RAM then why does it take 43ms latency?
>
> On Fri, Sep 25, 2015 at 7:54 AM, Ryan Svihla <r...@foundev.pro> wrote:
>
>> if you run:
>>
>> nodetool cfhistograms <keyspace> <table>
>>
>> On the given table and that will tell you how wide your rows are getting.
>> At some point you can get wide enough rows that just the physics of
>> retrieving them all take some time.
>>
>>
>> On Sep 25, 2015, at 9:21 AM, sai krishnam raju potturi <
>> pskraj...@gmail.com> wrote:
>>
>> Jaydeep; since your primary key involves a clustering column, you may be
>> having pretty wide rows. The read would be sequential. The latency could be
>> acceptable, if the read were to involve really wide rows.
>>
>> If your primary key was like ((a,b)) without the clustering column, it's
>> like reading a key value pair, and 40ms latency may have been a concern.
>>
>> Bottom Line : The latency depends on how wide the row is.
>>
>> On Tue, Sep 22, 2015 at 1:27 PM, sai krishnam raju potturi <
>> pskraj...@gmail.com> wrote:
>>
>>> thanks for the information. Posting the query too would be of help.
>>>
>>> On Tue, Sep 22, 2015 at 11:56 AM, Jaydeep Chovatia <
>>> chovatia.jayd...@gmail.com> wrote:
>>>
>>>> Please find required details here:
>>>>
>>>> -  Number of req/s
>>>>
>>>> 2k reads/s
>>>>
>>>> -  Schema details
>>>>
>>>> create table test {
>>>>
>>>> a timeuuid,
>>>>
>>>> b bigint,
>>>>
>>>> c int,
>>>>
>>>> d int static,
>>>>
>>>> e int static,
>>>>
>>>> f int static,
>>>>
>>>> g int static,
>>>>
>>>> h int,
>>>>
>>>> i text,
>>>>
>>>> j text,
>>>>
>>>> k text,
>>>>
>>>> l text,
>>>>
>>>> m set
>>>>
>>>> n bigint
>>>>
>>>> o bigint
>>>>
>>>> p bigint
>>>>
>>>> q bigint
>>>>
>>>> r int
>>>>
>>>> s text
>>>>
>>>> t bigint
>>>>
>>>> u text
>>>>
>>>> v text
>>>>
>>>> w text
>>>>
>>>> x bigint
>>>>
>>>> y bigint
>>>>
>>>> z bigint,
>>>>
>>>> primary key ((a, b), c)
>>>>
>>>> };
>>>>
>>>> -  JVM settings about the heap
>>>>
>>>> Default settings
>>>>
>>>> -  Execution time of the GC
>>>>
>>>> Avg. 400ms. I do not see long pauses of GC anywhere in the log file.
>>>>
>>>> On Tue, Sep 22, 2015 at 5:34 AM, Leleu Eric <eric.le...@worldline.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Before speaking about tuning, can you provide some additional
>>>>> information ?
>>>>>
>>>>>
>>>>&

Re: High read latency

2015-09-25 Thread Jaydeep Chovatia
I understand that, but everything is in RAM (my data dir is tmpfs) and my
row is not that wide, approx. less than 5MB in size. So my question is: if
everything is in RAM, then why does it take 43 ms?

On Fri, Sep 25, 2015 at 7:54 AM, Ryan Svihla <r...@foundev.pro> wrote:

> if you run:
>
> nodetool cfhistograms <keyspace> <table>
>
> On the given table and that will tell you how wide your rows are getting.
> At some point you can get wide enough rows that just the physics of
> retrieving them all take some time.
>
>
> On Sep 25, 2015, at 9:21 AM, sai krishnam raju potturi <
> pskraj...@gmail.com> wrote:
>
> Jaydeep; since your primary key involves a clustering column, you may be
> having pretty wide rows. The read would be sequential. The latency could be
> acceptable, if the read were to involve really wide rows.
>
> If your primary key was like ((a,b)) without the clustering column, it's
> like reading a key value pair, and 40ms latency may have been a concern.
>
> Bottom Line : The latency depends on how wide the row is.
>
> On Tue, Sep 22, 2015 at 1:27 PM, sai krishnam raju potturi <
> pskraj...@gmail.com> wrote:
>
>> thanks for the information. Posting the query too would be of help.
>>
>> On Tue, Sep 22, 2015 at 11:56 AM, Jaydeep Chovatia <
>> chovatia.jayd...@gmail.com> wrote:
>>
>>> Please find required details here:
>>>
>>> -  Number of req/s
>>>
>>> 2k reads/s
>>>
>>> -  Schema details
>>>
>>> create table test {
>>>
>>> a timeuuid,
>>>
>>> b bigint,
>>>
>>> c int,
>>>
>>> d int static,
>>>
>>> e int static,
>>>
>>> f int static,
>>>
>>> g int static,
>>>
>>> h int,
>>>
>>> i text,
>>>
>>> j text,
>>>
>>> k text,
>>>
>>> l text,
>>>
>>> m set
>>>
>>> n bigint
>>>
>>> o bigint
>>>
>>> p bigint
>>>
>>> q bigint
>>>
>>> r int
>>>
>>> s text
>>>
>>> t bigint
>>>
>>> u text
>>>
>>> v text
>>>
>>> w text
>>>
>>> x bigint
>>>
>>> y bigint
>>>
>>> z bigint,
>>>
>>> primary key ((a, b), c)
>>>
>>> };
>>>
>>> -  JVM settings about the heap
>>>
>>> Default settings
>>>
>>> -  Execution time of the GC
>>>
>>> Avg. 400ms. I do not see long pauses of GC anywhere in the log file.
>>>
>>> On Tue, Sep 22, 2015 at 5:34 AM, Leleu Eric <eric.le...@worldline.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Before speaking about tuning, can you provide some additional
>>>> information ?
>>>>
>>>>
>>>>
>>>> -  Number of req/s
>>>>
>>>> -  Schema details
>>>>
>>>> -  JVM settings about the heap
>>>>
>>>> -  Execution time of the GC
>>>>
>>>>
>>>>
>>>> 43ms for a read latency may be acceptable according to the number of
>>>> request per second.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Eric
>>>>
>>>>
>>>>
>>>> *De :* Jaydeep Chovatia [mailto:chovatia.jayd...@gmail.com]
>>>> *Envoyé :* mardi 22 septembre 2015 00:07
>>>> *À :* user@cassandra.apache.org
>>>> *Objet :* High read latency
>>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> My application issues more read requests than write, I do see that
>>>> under load cfstats for one of the table is quite high around 43ms
>>>>
>>>>
>>>>
>>>> Local read count: 114479357
>>>>
>>>> Local read latency: 43.442 ms
>>>>
>>>> Local write count: 22288868
>>>>
>>>> Local write latency: 0.609 ms
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Here is my node configuration:
>>>>
>>>> RF=3, Read/Write with QUORUM, 64GB RAM, 48 CPU core. I have o

High read latency

2015-09-21 Thread Jaydeep Chovatia
Hi,

My application issues more read requests than writes. I see that under load
the cfstats read latency for one of the tables is quite high, around 43 ms:

Local read count: 114479357
Local read latency: 43.442 ms
Local write count: 22288868
Local write latency: 0.609 ms


Here is my node configuration:
RF=3, Read/Write with QUORUM, 64GB RAM, 48 CPU cores. I have only 5 GB of
data on each node (and for experiment purposes I stored the data in tmpfs).

I've tried increasing the concurrent_reads count up to 512 but it did not help
read latency. CPU/Memory/IO look fine on the system.
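
For reference, the cassandra.yaml fragment this refers to (a sketch only; 512 is
just the experimental value mentioned above, and the shipped default is 32):

# cassandra.yaml
concurrent_reads: 512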

Any idea what should I tune?

Jaydeep


Re: Debugging write timeouts on Cassandra 2.2.5

2016-02-17 Thread Jaydeep Chovatia
Are you guys using light weight transactions in your write path?

On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat  wrote:

> Are your commitlog and data on the same disk ? If yes, you should put
> commitlogs on a separate disk which don't have a lot of IO.
>
> Others IO may have great impact impact on your commitlog writing and
> it may even block.
>
> An example of impact IO may have, even for Async writes:
>
> https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic
>
> 2016-02-11 0:31 GMT+01:00 Mike Heffner :
> > Jeff,
> >
> > We have both commitlog and data on a 4TB EBS with 10k IOPS.
> >
> > Mike
> >
> > On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa 
> > wrote:
> >>
> >> What disk size are you using?
> >>
> >>
> >>
> >> From: Mike Heffner
> >> Reply-To: "user@cassandra.apache.org"
> >> Date: Wednesday, February 10, 2016 at 2:24 PM
> >> To: "user@cassandra.apache.org"
> >> Cc: Peter Norton
> >> Subject: Re: Debugging write timeouts on Cassandra 2.2.5
> >>
> >> Paulo,
> >>
> >> Thanks for the suggestion, we ran some tests against CMS and saw the
> same
> >> timeouts. On that note though, we are going to try doubling the instance
> >> sizes and testing with double the heap (even though current usage is
> low).
> >>
> >> Mike
> >>
> >> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta 
> >> wrote:
> >>>
> >>> Are you using the same GC settings as the staging 2.0 cluster? If not,
> >>> could you try using the default GC settings (CMS) and see if that
> changes
> >>> anything? This is just a wild guess, but there were reports before of
> >>> G1-caused instabilities with small heap sizes (< 16GB - see
> CASSANDRA-10403
> >>> for more context). Please ignore if you already tried reverting back
> to CMS.
> >>>
> >>> 2016-02-10 16:51 GMT-03:00 Mike Heffner :
> 
>  Hi all,
> 
>  We've recently embarked on a project to update our Cassandra
>  infrastructure running on EC2. We are long time users of 2.0.x and are
>  testing out a move to version 2.2.5 running on VPC with EBS. Our test
> setup
>  is a 3 node, RF=3 cluster supporting a small write load (mirror of our
>  staging load).
> 
>  We are writing at QUORUM and while p95's look good compared to our
>  staging 2.0.x cluster, we are seeing frequent write operations that
> time out
>  at the max write_request_timeout_in_ms (10 seconds). CPU across the
> cluster
>  is < 10% and EBS write load is < 100 IOPS. Cassandra is running with
> the
>  Oracle JDK 8u60 and we're using G1GC and any GC pauses are less than
> 500ms.
> 
>  We run on c4.2xl instances with GP2 EBS attached storage for data and
>  commitlog directories. The nodes are using EC2 enhanced networking
> and have
>  the latest Intel network driver module. We are running on HVM
> instances
>  using Ubuntu 14.04.2.
> 
>  Our schema is 5 tables, all with COMPACT STORAGE. Each table is
> similar
>  to the definition here:
>  https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a
> 
>  This is our cassandra.yaml:
> 
> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml
> 
>  Like I mentioned we use 8u60 with G1GC and have used many of the GC
>  settings in Al Tobey's tuning guide. This is our upstart config with
> JVM and
>  other CPU settings:
> https://gist.github.com/mheffner/dc44613620b25c4fa46d
> 
>  We've used several of the sysctl settings from Al's guide as well:
>  https://gist.github.com/mheffner/ea40d58f58a517028152
> 
>  Our client application is able to write using either Thrift batches
>  using Asytanax driver or CQL async INSERT's using the Datastax Java
> driver.
> 
>  For testing against Thrift (our legacy infra uses this) we write
> batches
>  of anywhere from 6 to 1500 rows at a time. Our p99 for batch
> execution is
>  around 45ms but our maximum (p100) sits less than 150ms except when it
>  periodically spikes to the full 10seconds.
> 
>  Testing the same write path using CQL writes instead demonstrates
>  similar behavior. Low p99s except for periodic full timeouts. We
> enabled
>  tracing for several operations but were unable to get a trace that
> completed
>  successfully -- Cassandra started logging many messages as:
> 
>  INFO  [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages
>  were dropped in last 5000 ms: 52499 for internal timeout and 0 for
> cross
>  node timeout
> 
>  And all the traces contained rows with a "null" source_elapsed row:
> 
> https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out
> 
> 
>  We've exhausted as many configuration option permutations that we can
>  think of. This 

Re: High Bloom filter false ratio

2016-02-19 Thread Jaydeep Chovatia
To me the following three look on the higher side:
SSTable count: 1289

In order to reduce the SSTable count, see if compaction is running or not (if using
STCS). Is it possible to change this to LCS? (A sketch follows below.)
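
For reference, a minimal CQL sketch of switching the table (user_stay_points, from
the cfstats below) to LCS; the keyspace is omitted and the 160 MB target size is
just an illustrative assumption, not something from this thread:

ALTER TABLE user_stay_points
  WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': '160'};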


Number of keys (estimate): 345137664 (345M partition keys)

I don't have any suggestion about reducing this unless you partition your
data.


Bloom filter space used, bytes: 493777336 (almost 500 MB is huge)

If the number of keys is reduced then this will automatically reduce the bloom
filter size, I believe.



Jaydeep

On Thu, Feb 18, 2016 at 7:52 PM, Anishek Agarwal <anis...@gmail.com> wrote:

> Hey all,
>
> @Jaydeep here is the cfstats output from one node.
>
> Read Count: 1721134722
>
> Read Latency: 0.04268825050756254 ms.
>
> Write Count: 56743880
>
> Write Latency: 0.014650376727851532 ms.
>
> Pending Tasks: 0
>
> Table: user_stay_points
>
> SSTable count: 1289
>
> Space used (live), bytes: 122141272262
>
> Space used (total), bytes: 224227850870
>
> Off heap memory used (total), bytes: 653827528
>
> SSTable Compression Ratio: 0.4959736121441446
>
> Number of keys (estimate): 345137664
>
> Memtable cell count: 339034
>
> Memtable data size, bytes: 106558314
>
> Memtable switch count: 3266
>
> Local read count: 1721134803
>
> Local read latency: 0.048 ms
>
> Local write count: 56743898
>
> Local write latency: 0.018 ms
>
> Pending tasks: 0
>
> Bloom filter false positives: 40664437
>
> Bloom filter false ratio: 0.69058
>
> Bloom filter space used, bytes: 493777336
>
> Bloom filter off heap memory used, bytes: 493767024
>
> Index summary off heap memory used, bytes: 91677192
>
> Compression metadata off heap memory used, bytes: 68383312
>
> Compacted partition minimum bytes: 104
>
> Compacted partition maximum bytes: 1629722
>
> Compacted partition mean bytes: 1773
>
> Average live cells per slice (last five minutes): 0.0
>
> Average tombstones per slice (last five minutes): 0.0
>
>
> @Tyler Hobbs
>
> we are using cassandra 2.0.15 so
> https://issues.apache.org/jira/browse/CASSANDRA-8525  shouldnt occur.
> Other problems looks like will be fixed in 3.0 .. we will mostly try and
> slot in an upgrade to 3.x version towards second quarter of this year.
>
>
> @Daemon
>
> Latencies seem to have higher ratios, attached is the graph.
>
>
> I am mostly trying to look at Bloom filters, because the way we do reads,
> we read data with non existent partition keys and it seems to be taking
> long to respond, like for 720 queries it takes 2 seconds, with all 721
> queries not returning anything. the 720 queries are done in sequence of
> 180 queries each with 180 of them running in parallel.
>
>
> thanks
>
> anishek
>
>
>
> On Fri, Feb 19, 2016 at 3:09 AM, Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> How many partition keys exists for the table which shows this problem (or
>> provide nodetool cfstats for that table)?
>>
>> On Thu, Feb 18, 2016 at 11:38 AM, daemeon reiydelle <daeme...@gmail.com>
>> wrote:
>>
>>> The bloom filter buckets the values in a small number of buckets. I have
>>> been surprised by how many cases I see with large cardinality where a few
>>> values populate a given bloom leaf, resulting in high false positives, and
>>> a surprising impact on latencies!
>>>
>>> Are you seeing 2:1 ranges between mean and worse case latencies
>>> (allowing for gc times)?
>>>
>>> Daemeon Reiydelle
>>> On Feb 18, 2016 8:57 AM, "Tyler Hobbs" <ty...@datastax.com> wrote:
>>>
>>>> You can try slightly lowering the bloom_filter_fp_chance on your table.
>>>>
>>>> Otherwise, it's possible that you're repeatedly querying one or two
>>>> partitions that always trigger a bloom filter false positive.  You could
>>>> try manually tracing a few queries on this table (for non-existent
>>>> partitions) to see if the bloom filter rejects them.
>>>>
>>>> Depending on your Cassandra version, your false positive ratio could be
>>>> inaccurate: https://issues.apache.org/jira/browse/CASSANDRA-8525
>>>>
>>>> There are also a couple of recent improvements to bloom filters:
>>>> * https://issues.apache.org/jira/browse/CASSANDRA-8413
>>>> * https://issues.apache.org/jira/browse/CASSANDRA-9167
>>>>
>>>>
>>>> On Thu, Feb 18, 2016 at 1:35 AM, Anishek Agarwal <anis...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> We have a table with compo

Re: Regarding cassandra-stress results

2016-03-14 Thread Jaydeep Chovatia
The unit is milliseconds (ms).

On Mon, Mar 14, 2016 at 11:38 AM, Rajath Subramanyam 
wrote:

> Hello Cassandra Community,
>
> When cassandra-stress tool dumps the output at the end of the benchmarking
> run, what is the unit of latency statistics ?
>
> latency mean  : 0.7 [READ:0.7, WRITE:0.7]
> latency median: 0.6 [READ:0.6, WRITE:0.6]
> latency 95th percentile   : 0.8 [READ:0.8, WRITE:0.8]
> latency 99th percentile   : 1.2 [READ:1.2, WRITE:1.2]
> latency 99.9th percentile : 8.8 [READ:8.9, WRITE:9.0]
> latency max   : 448.7 [READ:162.3, WRITE:448.7]
>
> Thanks in advance.
>
> - Rajath
> 
> Rajath Subramanyam
>
>


How to get current value of commitlog_segment_size_in_mb?

2016-07-07 Thread Jaydeep Chovatia
Hi,

In my project I need to read the current value of
"commitlog_segment_size_in_mb"; I am looking for a CQL query to do this. Any
idea if this information gets stored in any of the Cassandra system tables?
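
For reference: as far as I know, the releases current at the time have no system
table that exposes cassandra.yaml values, so this has to come from the yaml file or
JMX. Much newer releases (Cassandra 4.0+) expose settings through a virtual table;
a sketch, to be verified against your version:

SELECT name, value FROM system_views.settings
WHERE name = 'commitlog_segment_size_in_mb';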

Thanks,
Jaydeep


Inconsistent results with Quorum at different times

2016-09-16 Thread Jaydeep Chovatia
Hi,

We have a three node (N1, N2, N3) cluster (RF=3) with data in SSTables as
follows:

N1:
SSTable: Partition key K1 is marked as tombstone at time T2

N2:
SSTable: Partition key K1 is marked as tombstone at time T2

N3:
SSTable: Partition key K1 is valid and has data D1 with lower time-stamp T1
(T1 < T2)


Now when I read using QUORUM, sometimes it returns data D1 and sometimes it
returns empty results. After tracing I found that when N1 and N2 are chosen
we get empty data, and when (N1/N2) and N3 are chosen D1 is returned.

My point is that when we read with QUORUM our results have to be consistent;
here the same query gives different results at different times.

Isn't this a big problem with Cassandra @QUORUM (with tombstone)?


Thanks,
Jaydeep


Lightweight tx is good enough to handle counter?

2016-09-23 Thread Jaydeep Chovatia
We have a following table:

create table mytable (
    id int,
    count int static,
    rec_id int,
    primary key (id, rec_id)
);

The count in the table represents how many records (rec_id clustering
columns) exist. So when we add a new record we do it the following way:

BEGIN UNLOGGED BATCH
  insert into mytable (id, rec_id) values (<id>, <rec_id>);
  update mytable set count = <count> + 1 where id = <id> if count = <count>; //light-weight transaction
APPLY BATCH;

Then we do the following read query at QUORUM:

select count, rec_id from mytable where id = <id>;

Here we expect count to exactly match the number of rows (number of rec_id
clustering values) returned. But under stress we have observed that they
sometimes do not match.

Is this expected?

Thanks,
Jaydeep


Re: Lightweight tx is good enough to handle counter?

2016-09-23 Thread Jaydeep Chovatia
OK, but I am trying to understand a scenario under which this mismatch
can occur with a lightweight tx.

On Fri, Sep 23, 2016 at 11:14 AM, DuyHai Doan <doanduy...@gmail.com> wrote:

> Lightweight transaction is not available for counters, for the simple
> reason that counters are not idempotent
>
> On Fri, Sep 23, 2016 at 8:10 PM, Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> We have a following table:
>>
>> create table mytable {
>>
>> id int,
>> count int static,
>> rec_id int,
>> primary key (id, rec_id)
>>
>> };
>>
>> The count in the table represents how many records (rec_id clustering
>> columns) exists. So when we add new a new record we do it following way:
>>
>> UNLOGGED BATCH
>> insert into mytable (id, rec_id) values (, );
>> update mytable set count =  + 1 where id =  if count =
>> ; //light-weight transaction
>> APPLY BATCH
>>
>> Then we do following read query as QUORUM:
>>
>> select count, rec_id from mytable where id = ;
>>
>> Here we expect count to exactly match number of rows (number of
>> clustering rec_id) returned. But under a stress we have observed that they
>> do not match sometimes.
>>
>> Is this expected?
>>
>> Thanks,
>> Jaydeep
>>
>
>


Re: Lightweight tx is good enough to handle counter?

2016-09-23 Thread Jaydeep Chovatia
Since SERIAL consistency is not supported for batch updates, I used QUORUM
for the operation.

On Fri, Sep 23, 2016 at 11:23 AM, DuyHai Doan <doanduy...@gmail.com> wrote:

> What is the consistency level used for the batch query ?
>
>
> On Fri, Sep 23, 2016 at 8:19 PM, Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Ok.  But I am trying to understand a scenario under which this mis-match
>> can occur with light-weight tx.
>>
>> On Fri, Sep 23, 2016 at 11:14 AM, DuyHai Doan <doanduy...@gmail.com>
>> wrote:
>>
>>> Lightweight transaction is not available for counters, for the simple
>>> reason that counters are not idempotent
>>>
>>> On Fri, Sep 23, 2016 at 8:10 PM, Jaydeep Chovatia <
>>> chovatia.jayd...@gmail.com> wrote:
>>>
>>>> We have a following table:
>>>>
>>>> create table mytable {
>>>>
>>>> id int,
>>>> count int static,
>>>> rec_id int,
>>>> primary key (id, rec_id)
>>>>
>>>> };
>>>>
>>>> The count in the table represents how many records (rec_id clustering
>>>> columns) exists. So when we add new a new record we do it following way:
>>>>
>>>> UNLOGGED BATCH
>>>> insert into mytable (id, rec_id) values (, );
>>>> update mytable set count =  + 1 where id =  if count =
>>>> ; //light-weight transaction
>>>> APPLY BATCH
>>>>
>>>> Then we do following read query as QUORUM:
>>>>
>>>> select count, rec_id from mytable where id = ;
>>>>
>>>> Here we expect count to exactly match number of rows (number of
>>>> clustering rec_id) returned. But under a stress we have observed that they
>>>> do not match sometimes.
>>>>
>>>> Is this expected?
>>>>
>>>> Thanks,
>>>> Jaydeep
>>>>
>>>
>>>
>>
>


Cassandra 3.0.14 transport completely blocked

2022-03-22 Thread Jaydeep Chovatia
Hi,

I have been using Cassandra 3.0.14 in production for a long time. Recently
I have found a bug in it: all of a sudden, the transport thread-pool
hangs.

*Observation:*
If I do *nodetool tpstats*, then it shows *"Native-Transport-Requests"* with
blocked "Active" tasks. I stopped all traffic and sent a very
light load, but my requests are still getting denied, and blocked active
transport tasks keep accumulating.

*Fix:*
If I restart my cluster, then everything works fine, which means there
might be some deadlock, etc. in the system.


Is anyone aware of this issue? I know there have been quite a lot of fixes
on top of 3.0.14, is there any specific fix that addresses this particular
issue?

Any help would be appreciated.

Yours Sincerely,
Jaydeep


Re: Cassandra 3.0.14 transport completely blocked

2022-03-22 Thread Jaydeep Chovatia
Thanks, Scott, for the prompt response! We will apply this patch and see
how it goes.
Also, in the near future, we will consider upgrading to 3.0.26 and
eventually to 4.0.
Thanks a lot!

On Tue, Mar 22, 2022 at 9:45 PM C. Scott Andreas 
wrote:

> Hi Jaydeep, thanks for reaching out.
>
> The most notable deadlock identified and resolved in the last few years is
> https://issues.apache.org/jira/browse/CASSANDRA-15367: Memtable memory
> allocations may deadlock (fixed in Apache Cassandra 3.0.21).
>
> Mentioning for completeness - since the release of Cassandra 3.0.14
> several years ago, many critical bugs whose consequences include data loss
> have been resolved. I'd strongly recommend upgrading to 3.0.26 - and
> ideally to 4.0 after you've confirmed behavior is as expected on 3.0.26.
>
> – Scott
>
> On Mar 22, 2022, at 9:30 PM, Jaydeep Chovatia 
> wrote:
>
>
> Hi,
>
> I have been using Cassandra 3.0.14 in production for a long time. Recently
> I have found a bug in that, all of a sudden the transport thread-pool
> hangs.
>
> *Observation:*
> If I do *nodetool tpstats*, then it shows *"Native-Transport-Requests"*
> is blocking "Active" tasks. I stopped the complete traffic, and sent a very
> light load, but still my requests are getting denied, and active transport
> blocked tasks keep happening.
>
> *Fix:*
> If I restart my cluster, then everything works fine, which means there
> might be some deadlock, etc. in the system.
>
>
> Is anyone aware of this issue? I know there have been quite a lot of fixes
> on top of 3.0.14, is there any specific fix that addresses this particular
> issue?
>
> Any help would be appreciated.
>
> Yours Sincerely,
> Jaydeep
>
>
>
>
>


Re: Cassandra 3.0.14 transport completely blocked

2022-03-23 Thread Jaydeep Chovatia
Thank you all. I will try different options and will let you know which one
worked for my case.

On Wed, Mar 23, 2022 at 3:25 AM Bowen Song  wrote:

> I remember we had the same issue back in the Cassandra 2.x days, and
> restarting the affected node only makes the issue go away temporarily. The
> issue we had was "fixed" by adding
> "-Dcassandra.max_queued_native_transport_requests=4096" to the JVM options.
> I dug that option out from our old Ansible playbook. Now, after so many
> years, I've long forgotten what does that option do.
>
> Please seriously consider upgrade your Cassandra cluster to the least
> version. I can't tell which exact version fixed this bug, but we had
> removed this from our servers many years ago after several rounds of
> upgrades, and we have not had the NTR pool blocking issue coming back.
> On 23/03/2022 04:29, Jaydeep Chovatia wrote:
>
> Hi,
>
> I have been using Cassandra 3.0.14 in production for a long time. Recently
> I have found a bug in that, all of a sudden the transport thread-pool
> hangs.
>
> *Observation:*
> If I do *nodetool tpstats*, then it shows *"Native-Transport-Requests"*
> is blocking "Active" tasks. I stopped the complete traffic, and sent a very
> light load, but still my requests are getting denied, and active transport
> blocked tasks keep happening.
>
> *Fix:*
> If I restart my cluster, then everything works fine, which means there
> might be some deadlock, etc. in the system.
>
>
> Is anyone aware of this issue? I know there have been quite a lot of fixes
> on top of 3.0.14, is there any specific fix that addresses this particular
> issue?
>
> Any help would be appreciated.
>
> Yours Sincerely,
> Jaydeep
>
>


Re: Cassandra 4.0.6 token mismatch issue in production environment

2023-10-23 Thread Jaydeep Chovatia
Sounds good. Thanks a lot for all your help!

Jaydeep

On Mon, Oct 23, 2023 at 3:30 PM Jeff Jirsa  wrote:

> Not aware of any that survive node restart, though in the past, there were
> races around starting an expansion while one node was partitioned/down (and
> missing the initial gossip / UP). A heap dump could have told us a bit more
> conclusively, but it's hard to guess for now.
>
>
>
> On Mon, Oct 23, 2023 at 3:22 PM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> The issue was persisting on a few nodes despite no changes to the
>> topology. Even node restarting did not help. Only after we evacuated those
>> nodes, the issue got resolved.
>>
>> Do you think of a possible situation under which this could happen?
>>
>> Jaydeep
>>
>> On Sat, Oct 21, 2023 at 10:25 AM Jaydeep Chovatia <
>> chovatia.jayd...@gmail.com> wrote:
>>
>>> Thanks, Jeff!
>>> I will keep this thread updated on our findings.
>>>
>>> Jaydeep
>>>
>>> On Sat, Oct 21, 2023 at 9:37 AM Jeff Jirsa  wrote:
>>>
>>>> That code path was added to protect against invalid gossip states
>>>>
>>>> For this logger to be issued, the coordinator receiving the query must
>>>> identify a set of replicas holding the data to serve the read, and one of
>>>> the selected replicas must disagree that it’s a replica based on its view
>>>> of the token ring
>>>>
>>>> This probably means that at least one node in your cluster has an
>>>> invalid view of the ring - if you issue a “nodetool ring” from every host
>>>> and compare them, you’ll probably notice one or more is wrong
>>>>
>>>> It’s also possible this happens for a few seconds during adding /
>>>> moving / removing hosts
>>>>
>>>> If you weren’t changing the topology of the cluster, it’s  likely the
>>>> case that bouncing the cluster fixes it
>>>>
>>>> (Im unsure of the defaults and not able to look it up, but cassandra
>>>> can log or log and drop the read - you probably want to drop the read log,
>>>> which is the right solution so it doesn’t accidentally return a missing /
>>>> empty result set as a valid query result, instead it’ll force it to read
>>>> from other replicas or time out)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Oct 20, 2023, at 10:57 PM, Jaydeep Chovatia <
>>>> chovatia.jayd...@gmail.com> wrote:
>>>>
>>>> 
>>>>
>>>> Hi,
>>>>
>>>> I am using Cassandra 4.0.6 in production, and receiving the following 
>>>> error. This indicates that Cassandra nodes have mismatch in token-owership.
>>>>
>>>> Has anyone seen this issue before?
>>>>
>>>> Received a read request from /XX.XX.XXX.XXX:Y for a range that is not 
>>>> owned by the current replica Read(keyspace.table columns=*/[c1] rowFilter= 
>>>> limits=LIMIT 100 key=7BE78B90-AD66-406B-AA05-6A062F72F542:0 
>>>> filter=slice(slices=ALL, reversed=false), nowInSec=1697751757).
>>>>
>>>> Jaydeep
>>>>
>>>>


Re: Cassandra 4.0.6 token mismatch issue in production environment

2023-10-23 Thread Jaydeep Chovatia
The issue persisted on a few nodes despite no changes to the topology.
Even restarting the nodes did not help. Only after we evacuated those nodes
did the issue get resolved.

Can you think of a possible situation under which this could happen?

Jaydeep

On Sat, Oct 21, 2023 at 10:25 AM Jaydeep Chovatia <
chovatia.jayd...@gmail.com> wrote:

> Thanks, Jeff!
> I will keep this thread updated on our findings.
>
> Jaydeep
>
> On Sat, Oct 21, 2023 at 9:37 AM Jeff Jirsa  wrote:
>
>> That code path was added to protect against invalid gossip states
>>
>> For this logger to be issued, the coordinator receiving the query must
>> identify a set of replicas holding the data to serve the read, and one of
>> the selected replicas must disagree that it’s a replica based on its view
>> of the token ring
>>
>> This probably means that at least one node in your cluster has an invalid
>> view of the ring - if you issue a “nodetool ring” from every host and
>> compare them, you’ll probably notice one or more is wrong
>>
>> It’s also possible this happens for a few seconds during adding / moving
>> / removing hosts
>>
>> If you weren’t changing the topology of the cluster, it’s likely the
>> case that bouncing the cluster fixes it
>>
>> (I’m unsure of the defaults and not able to look it up, but Cassandra can
>> log, or log and drop, the read - you probably want it to drop the read,
>> which is the right solution so it doesn’t accidentally return a missing /
>> empty result set as a valid query result; instead it’ll force the
>> coordinator to read from other replicas or time out)
>>
>>
>>
>>
>>
>> On Oct 20, 2023, at 10:57 PM, Jaydeep Chovatia <
>> chovatia.jayd...@gmail.com> wrote:
>>
>> 
>>
>> Hi,
>>
>> I am using Cassandra 4.0.6 in production, and receiving the following error.
>> This indicates that Cassandra nodes have a mismatch in token ownership.
>>
>> Has anyone seen this issue before?
>>
>> Received a read request from /XX.XX.XXX.XXX:Y for a range that is not 
>> owned by the current replica Read(keyspace.table columns=*/[c1] rowFilter= 
>> limits=LIMIT 100 key=7BE78B90-AD66-406B-AA05-6A062F72F542:0 
>> filter=slice(slices=ALL, reversed=false), nowInSec=1697751757).
>>
>> Jaydeep
>>
>>


Cassandra 4.0.6 token mismatch issue in production environment

2023-10-20 Thread Jaydeep Chovatia
Hi,

I am using Cassandra 4.0.6 in production, and receiving the following
error. This indicates that Cassandra nodes have a mismatch in
token ownership.

Has anyone seen this issue before?

Received a read request from /XX.XX.XXX.XXX:Y for a range that is
not owned by the current replica Read(keyspace.table columns=*/[c1]
rowFilter= limits=LIMIT 100 key=7BE78B90-AD66-406B-AA05-6A062F72F542:0
filter=slice(slices=ALL, reversed=false), nowInSec=1697751757).
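
As suggested in the replies on this thread, one quick way to find a node with
a stale view of the ring is to compare token ownership across all hosts. Below
is a rough helper sketch (the host list, JMX port, and column handling are
assumptions: it expects "nodetool" on the PATH, JMX reachable on port 7199, and
the default "nodetool ring" layout with the address first and the token last)
that groups hosts by their normalized ring view:

import java.io.IOException;
import java.util.*;
import java.util.stream.Collectors;

public class RingViewDiff {
    // Normalize "nodetool ring" output for one host into sorted "address token" pairs,
    // dropping the Load/Owns columns because those legitimately differ per host.
    static String normalizedRing(String host) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("nodetool", "-h", host, "-p", "7199", "ring")
                .redirectErrorStream(true).start();
        String out = new String(p.getInputStream().readAllBytes());
        p.waitFor();
        return out.lines()
                .map(String::trim)
                .filter(l -> l.matches("^\\d{1,3}(\\.\\d{1,3}){3}\\s.*"))   // keep per-node rows (IPv4 assumed)
                .map(l -> { String[] f = l.split("\\s+"); return f[0] + " " + f[f.length - 1]; })
                .sorted()
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) throws Exception {
        // Pass the cluster's hosts as arguments; the defaults here are placeholders.
        List<String> hosts = args.length > 0 ? List.of(args) : List.of("10.0.0.1", "10.0.0.2");
        Map<String, List<String>> hostsByView = new HashMap<>();
        for (String h : hosts)
            hostsByView.computeIfAbsent(normalizedRing(h), k -> new ArrayList<>()).add(h);
        System.out.println(hostsByView.size() == 1
                ? "All queried hosts agree on the ring."
                : "Ring views disagree; hosts grouped by view: " + hostsByView.values());
    }
}

Any host that ends up in a group of its own is the one whose view disagrees and
is worth investigating (or bouncing) first.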

Jaydeep


Re: Cassandra 4.0.6 token mismatch issue in production environment

2023-10-21 Thread Jaydeep Chovatia
Thanks, Jeff!
I will keep this thread updated on our findings.

Jaydeep

On Sat, Oct 21, 2023 at 9:37 AM Jeff Jirsa  wrote:

> That code path was added to protect against invalid gossip states
>
> For this log message to be issued, the coordinator receiving the query must
> identify a set of replicas holding the data to serve the read, and one of
> the selected replicas must disagree that it’s a replica based on its view
> of the token ring
>
> This probably means that at least one node in your cluster has an invalid
> view of the ring - if you issue a “nodetool ring” from every host and
> compare them, you’ll probably notice one or more is wrong
>
> It’s also possible this happens for a few seconds during adding / moving /
> removing hosts
>
> If you weren’t changing the topology of the cluster, it’s likely the case
> that bouncing the cluster fixes it
>
> (I’m unsure of the defaults and not able to look it up, but Cassandra can
> log, or log and drop, the read - you probably want it to drop the read,
> which is the right solution so it doesn’t accidentally return a missing /
> empty result set as a valid query result; instead it’ll force the
> coordinator to read from other replicas or time out)
>
>
>
>
>
> On Oct 20, 2023, at 10:57 PM, Jaydeep Chovatia 
> wrote:
>
> 
>
> Hi,
>
> I am using Cassandra 4.0.6 in production, and receiving the following error.
> This indicates that Cassandra nodes have a mismatch in token ownership.
>
> Has anyone seen this issue before?
>
> Received a read request from /XX.XX.XXX.XXX:Y for a range that is not 
> owned by the current replica Read(keyspace.table columns=*/[c1] rowFilter= 
> limits=LIMIT 100 key=7BE78B90-AD66-406B-AA05-6A062F72F542:0 
> filter=slice(slices=ALL, reversed=false), nowInSec=1697751757).
>
> Jaydeep
>
>


Re: Cassandra 3.0 upgrade

2022-06-13 Thread Jaydeep Chovatia
Thanks Jeff and Scott for valuable feedback!
One more question: do we have to upgrade the dtest repo if we go to 3.0.27,
or will the one we currently have working with 3.0.14 continue to work fine?

Jaydeep

On Mon, Jun 13, 2022 at 10:25 PM C. Scott Andreas 
wrote:

> Thank you for reaching out, and for planning the upgrade!
>
> Upgrading from 3.0.14 to 3.0.27 would be best, followed by upgrading to
> 4.0.4.
>
> 3.0.14 contains a number of serious bugs that are resolved in more recent
> 3.0.x releases (3.0.19+ are generally good/safe). Upgrading to 3.0.27 will
> put you on a great 3.0.x build. If all looks good from there, you should
> have an easy upgrade to 4.0.4.
>
> I would not recommend passing through intermediate 3.0.x releases on the
> way to 3.0.27; doing so is not necessary.
>
> Cheers,
>
> - Scott
>
>
> On Jun 13, 2022, at 10:17 PM, Runtian Liu  wrote:
>
> 
>
> Hi,
>
> I am running Cassandra version 3.0.14 at scale on thousands of nodes. I am
> planning to do a minor version upgrade from 3.0.14 to 3.0.26 in a safe
> manner. My eventual goal is to upgrade from 3.0.26 to a major release 4.0.
>
> As you know, there are multiple minor releases between 3.0.14 and 3.0.26,
> so I am planning to upgrade in 2-3 batches say 1) 3.0.14 → 3.0.16 2) 3.0.16
> to 3.0.20 3) 3.0.20 → 3.0.26.
>
> Do you have suggestions or anything that I need to be aware of? Is there
> any minor release between 3.0.14 and 3.0.26 that is not safe, etc.?
>
> Best regards.
>
>


Re: Cassandra 3.0 upgrade

2022-06-14 Thread Jaydeep Chovatia
Yeah, we have a fork of Cassandra with custom patches, and a fork of dtest
with some additional custom tests, so we will have to upgrade dtest as
well.
Is there any specific tag of dtest we should use, or is the latest trunk
fine to test against 3.0.27?
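
For what it's worth, here is a rough sketch of the in-JVM dtest style that
Scott mentions below, in case it helps when refreshing the fork. The class and
method names come from org.apache.cassandra.distributed on recent 3.0.x/4.0
branches, so treat the exact API as an assumption; it can differ slightly per
branch:

import org.apache.cassandra.distributed.Cluster;
import org.apache.cassandra.distributed.api.ConsistencyLevel;
import org.junit.Assert;
import org.junit.Test;

public class SimpleReadWriteInJvmTest
{
    @Test
    public void insertThenRead() throws Throwable
    {
        // Spins up two C* instances inside this JVM; no external processes or ccm needed.
        try (Cluster cluster = Cluster.build(2).start())
        {
            cluster.schemaChange("CREATE KEYSPACE ks WITH replication = "
                                 + "{'class': 'SimpleStrategy', 'replication_factor': 2}");
            cluster.schemaChange("CREATE TABLE ks.tbl (pk int PRIMARY KEY, v int)");

            cluster.coordinator(1).execute("INSERT INTO ks.tbl (pk, v) VALUES (1, 42)",
                                           ConsistencyLevel.ALL);
            Object[][] rows = cluster.coordinator(1).execute("SELECT v FROM ks.tbl WHERE pk = 1",
                                                             ConsistencyLevel.ALL);
            Assert.assertEquals(42, rows[0][0]);
        }
    }
}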

Jaydeep

On Mon, Jun 13, 2022 at 10:51 PM C. Scott Andreas 
wrote:

> If you have a fork of Cassandra with custom patches and build/execute the
> dtest suite as part of qualification, you’d want to upgrade that as well.
>
> Note that in more recent 3.0.x releases, the project also introduced
> in-JVM dtests. This is a new suite that serves a similar purpose to the
> Python dtests, but which are much more reliable and allow for spawning of
> multiple C* instances in a JVM for testing. I’d recommend adding this to
> your build/CI process if so.
>
> If you don’t have a fork and deploy stock builds of Cassandra, you
> probably don’t need to worry about the dtest repo.
>
> - Scott
>
> On Jun 13, 2022, at 10:36 PM, Jaydeep Chovatia 
> wrote:
>
> 
> Thanks Jeff and Scott for valuable feedback!
> One more question, do we have to upgrade the dTest repo if we go to
> 3.0.27, or the one we have currently already working with 3.0.14 should
> continue to work fine?
>
> Jaydeep
>
> On Mon, Jun 13, 2022 at 10:25 PM C. Scott Andreas 
> wrote:
>
>> Thank you for reaching out, and for planning the upgrade!
>>
>> Upgrading from 3.0.14 to 3.0.27 would be best, followed by upgrading to
>> 4.0.4.
>>
>> 3.0.14 contains a number of serious bugs that are resolved in more recent
>> 3.0.x releases (3.0.19+ are generally good/safe). Upgrading to 3.0.27 will
>> put you on a great 3.0.x build. If all looks good from there, you should
>> have an easy upgrade to 4.0.4.
>>
>> I would not recommend passing through intermediate 3.0.x releases on the
>> way to 3.0.27; doing so is not necessary.
>>
>> Cheers,
>>
>> - Scott
>>
>>
>> On Jun 13, 2022, at 10:17 PM, Runtian Liu  wrote:
>>
>> 
>>
>> Hi,
>>
>> I am running Cassandra version 3.0.14 at scale on thousands of nodes. I
>> am planning to do a minor version upgrade from 3.0.14 to 3.0.26 in a safe
>> manner. My eventual goal is to upgrade from 3.0.26 to a major release 4.0.
>>
>> As you know, there are multiple minor releases between 3.0.14 and 3.0.26,
>> so I am planning to upgrade in 2-3 batches say 1) 3.0.14 → 3.0.16 2) 3.0.16
>> to 3.0.20 3) 3.0.20 → 3.0.26.
>>
>> Do you have suggestions or anything that I need to be aware of? Is
>> there any minor release between 3.0.14 and 3.0.26 that is not safe, etc.?
>>
>> Best regards.
>>
>>


Re: Is cleanup is required if cluster topology changes

2023-05-05 Thread Jaydeep Chovatia
Thanks all for your valuable inputs. We will try some of the suggested
methods in this thread, and see how it goes. We will keep you updated on
our progress.
Thanks a lot once again!

Jaydeep

On Fri, May 5, 2023 at 8:55 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> Depending on the number of vnodes per server, the probability and severity
> (i.e. the size of the affected token ranges) of an availability degradation
> due to a server failure during node replacement may be small. You also have
> the choice of increasing the RF if that's still not acceptable.
>
> Also, reducing number of vnodes per server can limit the number of servers
> affected by replacing a single server, therefore reducing the amount of
> time required to run "nodetool cleanup" if it is run sequentially.
>
> Finally, you may choose to run "nodetool cleanup" concurrently on multiple
> nodes to reduce the amount of time required to complete it.
>
>
> On 05/05/2023 16:26, Runtian Liu wrote:
>
> We are doing the "adding a node then decommissioning a node" approach to
> achieve better availability. Replacing a node needs to shut down one node
> first; if another node is down during the node replacement period, we will
> see an availability drop because most of our use cases are local_quorum with
> replication factor 3.
>
> On Fri, May 5, 2023 at 5:59 AM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> Have you thought of using "-Dcassandra.replace_address_first_boot=..."
>> (or "-Dcassandra.replace_address=..." if you are using an older version)?
>> This will not result in a topology change, which means "nodetool cleanup"
>> is not needed after the operation is completed.
>> On 05/05/2023 05:24, Jaydeep Chovatia wrote:
>>
>> Thanks, Jeff!
>> But in our environment we replace nodes quite often for various
>> optimization purposes, etc. say, almost 1 node per day (node *addition*
>> followed by node *decommission*, which of course changes the topology),
>> and we have a cluster of size 100 nodes with 300GB per node. If we have to
>> run cleanup on 100 nodes after every replacement, then it could take
>> forever.
>> What is the recommendation until we get this fixed in Cassandra itself as
>> part of compaction (w/o externally triggering *cleanup*)?
>>
>> Jaydeep
>>
>> On Thu, May 4, 2023 at 8:14 PM Jeff Jirsa  wrote:
>>
>>> Cleanup is fast and cheap and basically a no-op if you haven’t changed
>>> the ring
>>>
>>> After cassandra has transactional cluster metadata to make ring changes
>>> strongly consistent, cassandra should do this in every compaction. But
>>> until then it’s left for operators to run when they’re sure the state of
>>> the ring is correct .
>>>
>>>
>>>
>>> On May 4, 2023, at 7:41 PM, Jaydeep Chovatia 
>>> wrote:
>>>
>>> 
>>> Isn't this considered a kind of *bug* in Cassandra because as we know
>>> *cleanup* is a lengthy and unreliable operation, so relying on the
>>> *cleanup* means higher chances of data resurrection?
>>> Do you think we should discard the unowned token-ranges as part of the
>>> regular compaction itself? What are the pitfalls of doing this as part of
>>> compaction itself?
>>>
>>> Jaydeep
>>>
>>> On Thu, May 4, 2023 at 7:25 PM guo Maxwell  wrote:
>>>
>>>> Compaction will just merge duplicate data and remove deleted data on
>>>> this node. If you add or remove one node from the cluster, I think cleanup
>>>> is needed. If cleanup failed, I think we should look into the reason.
>>>>
>>>> Runtian Liu  wrote on Fri, May 5, 2023 at 06:37:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Is cleanup the sole method to remove data that does not belong to a
>>>>> specific node? In a cluster, where nodes are added or decommissioned from
>>>>> time to time, failure to run cleanup may lead to data resurrection issues,
>>>>> as deleted data may remain on the node that lost ownership of certain
>>>>> partitions. Or is it true that normal compactions can also handle data
>>>>> removal for nodes that no longer have ownership of certain data?
>>>>>
>>>>> Thanks,
>>>>> Runtian
>>>>>
>>>>
>>>>
>>>> --
>>>> you are the apple of my eye !
>>>>
>>>


Re: Is cleanup is required if cluster topology changes

2023-05-09 Thread Jaydeep Chovatia
Another request to the community to see if this is feasible or not:
Can we not wait for CEP-21, and instead do the necessary cleanup as part of
regular compaction itself to avoid running *cleanup* manually? For now, we
can control it through a flag, which is *false* by default. Whoever wants to
do the cleanup as part of compaction will turn this flag on. Once we have
CEP-21 addressed, we can remove this flag and enable this behaviour always.
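
To make the proposal concrete, below is a purely conceptual sketch - not
Cassandra's actual compaction code, and the flag name is made up - of what
"cleanup during compaction" would mean: partitions whose token falls outside
the node's owned ranges are simply not written to the output SSTable, which is
what "nodetool cleanup" does today:

import java.util.List;

public class CleanupInCompactionSketch {
    // Toy stand-ins for tokens, ranges, and partitions; real Cassandra types differ.
    record Range(long left, long right) {                // half-open (left, right] token range
        boolean contains(long token) { return token > left && token <= right; }
    }
    record Partition(long token, String data) {}

    static boolean owned(long token, List<Range> ownedRanges) {
        return ownedRanges.stream().anyMatch(r -> r.contains(token));
    }

    public static void main(String[] args) {
        // Hypothetical flag from the proposal, false by default; run with
        // -Dpurge_unowned_during_compaction=true to see the unowned partition dropped.
        boolean purgeUnowned = Boolean.getBoolean("purge_unowned_during_compaction");
        List<Range> ownedRanges = List.of(new Range(0, 100), new Range(500, 600)); // this node's ranges
        List<Partition> merged = List.of(new Partition(42, "owned"), new Partition(300, "unowned"));

        for (Partition p : merged) {
            if (purgeUnowned && !owned(p.token(), ownedRanges))
                continue;                                  // behaves like "nodetool cleanup"
            System.out.println("write to new sstable: " + p); // stand-in for the sstable writer
        }
    }
}
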
Thoughts?

Jaydeep

On Tue, May 9, 2023 at 3:58 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> Because an operator will need to check and ensure the schema is consistent
> across the cluster before running "nodetool cleanup". At the moment, it's
> the operator's responsibility to ensure bad things don't happen.
> On 09/05/2023 06:20, Jaydeep Chovatia wrote:
>
> One clarification question Jeff.
> AFAIK, the *nodetool cleanup* also internally goes through the same
> compaction path as the regular compaction. Then why do we have to wait for
> CEP-21 to clean up unowned data in the regular compaction path? Wouldn't it
> be as simple as regular compaction just invoke the code of *nodetool
> cleanup*?
> In other words, without CEP-21, why is *nodetool cleanup* a safer
> operation but doing the same in the regular compaction isn't?
>
> Jaydeep
>
> On Fri, May 5, 2023 at 11:58 AM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Thanks, Jeff, for the detailed steps and summary.
>> We will keep the community (this thread) up to date on how it plays out
>> in our fleet.
>>
>> Jaydeep
>>
>> On Fri, May 5, 2023 at 9:10 AM Jeff Jirsa  wrote:
>>
>>> Lots of caveats on these suggestions, let me try to hit most of them.
>>>
>>> Cleanup in parallel is good and fine and common. Limit number of threads
>>> in cleanup if you're using lots of vnodes, so each node runs one at a time
>>> and not all nodes use all your cores at the same time.
>>> If a host is fully offline, you can ALSO use replace address first boot.
>>> It'll stream data right to that host with the same token assignments you
>>> had before, and no cleanup is needed then. Strictly speaking, to avoid
>>> resurrection here, you'd want to run repair on the replicas of the down
>>> host (for vnodes, probably the whole cluster), but your current process
>>> doesn’t guarantee that either (decom + bootstrap may resurrect, strictly
>>> speaking).
>>> Dropping vnodes will reduce the replicas that have to be cleaned up, but
>>> also potentially increase your imbalance on each replacement.
>>>
>>> Cassandra should still do this on its own, and I think once CEP-21 is
>>> committed, this should be one of the first enhancement tickets.
>>>
>>> Until then, LeveledCompactionStrategy really does make cleanup fast and
>>> cheap, at the cost of higher IO the rest of the time. If you can tolerate
>>> that higher IO, you'll probably appreciate LCS anyway (faster reads, faster
>>> data deletion than STCS). It's a lot of IO compared to STCS though.
>>>
>>>
>>>
>>> On Fri, May 5, 2023 at 9:02 AM Jaydeep Chovatia <
>>> chovatia.jayd...@gmail.com> wrote:
>>>
>>>> Thanks all for your valuable inputs. We will try some of the suggested
>>>> methods in this thread, and see how it goes. We will keep you updated on
>>>> our progress.
>>>> Thanks a lot once again!
>>>>
>>>> Jaydeep
>>>>
>>>> On Fri, May 5, 2023 at 8:55 AM Bowen Song via user <
>>>> user@cassandra.apache.org> wrote:
>>>>
>>>>> Depending on the number of vnodes per server, the probability and
>>>>> severity (i.e. the size of the affected token ranges) of an availability
>>>>> degradation due to a server failure during node replacement may be small.
>>>>> You also have the choice of increasing the RF if that's still not
>>>>> acceptable.
>>>>>
>>>>> Also, reducing number of vnodes per server can limit the number of
>>>>> servers affected by replacing a single server, therefore reducing the
>>>>> amount of time required to run "nodetool cleanup" if it is run 
>>>>> sequentially.
>>>>>
>>>>> Finally, you may choose to run "nodetool cleanup" concurrently on
>>>>> multiple nodes to reduce the amount of time required to complete it.
>>>>>
>>>>>
>>>>> On 05/05/2023 16:26, Runtian Liu wrote:
>>>>>

Re: Is cleanup is required if cluster topology changes

2023-05-08 Thread Jaydeep Chovatia
One clarification question Jeff.
AFAIK, the *nodetool cleanup* also internally goes through the same
compaction path as the regular compaction. Then why do we have to wait for
CEP-21 to clean up unowned data in the regular compaction path? Wouldn't it
be as simple as regular compaction just invoking the code of *nodetool
cleanup*?
In other words, without CEP-21, why is *nodetool cleanup* a safer operation
but doing the same in the regular compaction isn't?

Jaydeep

On Fri, May 5, 2023 at 11:58 AM Jaydeep Chovatia 
wrote:

> Thanks, Jeff, for the detailed steps and summary.
> We will keep the community (this thread) up to date on how it plays out in
> our fleet.
>
> Jaydeep
>
> On Fri, May 5, 2023 at 9:10 AM Jeff Jirsa  wrote:
>
>> Lots of caveats on these suggestions, let me try to hit most of them.
>>
>> Cleanup in parallel is good and fine and common. Limit number of threads
>> in cleanup if you're using lots of vnodes, so each node runs one at a time
>> and not all nodes use all your cores at the same time.
>> If a host is fully offline, you can ALSO use replace address first boot.
>> It'll stream data right to that host with the same token assignments you
>> had before, and no cleanup is needed then. Strictly speaking, to avoid
>> resurrection here, you'd want to run repair on the replicas of the down
>> host (for vnodes, probably the whole cluster), but your current process
>> doesn’t guarantee that either (decom + bootstrap may resurrect, strictly
>> speaking).
>> Dropping vnodes will reduce the replicas that have to be cleaned up, but
>> also potentially increase your imbalance on each replacement.
>>
>> Cassandra should still do this on its own, and I think once CEP-21 is
>> committed, this should be one of the first enhancement tickets.
>>
>> Until then, LeveledCompactionStrategy really does make cleanup fast and
>> cheap, at the cost of higher IO the rest of the time. If you can tolerate
>> that higher IO, you'll probably appreciate LCS anyway (faster reads, faster
>> data deletion than STCS). It's a lot of IO compared to STCS though.
>>
>>
>>
>> On Fri, May 5, 2023 at 9:02 AM Jaydeep Chovatia <
>> chovatia.jayd...@gmail.com> wrote:
>>
>>> Thanks all for your valuable inputs. We will try some of the suggested
>>> methods in this thread, and see how it goes. We will keep you updated on
>>> our progress.
>>> Thanks a lot once again!
>>>
>>> Jaydeep
>>>
>>> On Fri, May 5, 2023 at 8:55 AM Bowen Song via user <
>>> user@cassandra.apache.org> wrote:
>>>
>>>> Depending on the number of vnodes per server, the probability and
>>>> severity (i.e. the size of the affected token ranges) of an availability
>>>> degradation due to a server failure during node replacement may be small.
>>>> You also have the choice of increasing the RF if that's still not
>>>> acceptable.
>>>>
>>>> Also, reducing number of vnodes per server can limit the number of
>>>> servers affected by replacing a single server, therefore reducing the
>>>> amount of time required to run "nodetool cleanup" if it is run 
>>>> sequentially.
>>>>
>>>> Finally, you may choose to run "nodetool cleanup" concurrently on
>>>> multiple nodes to reduce the amount of time required to complete it.
>>>>
>>>>
>>>> On 05/05/2023 16:26, Runtian Liu wrote:
>>>>
>>>> We are doing the "adding a node then decommissioning a node" approach to
>>>> achieve better availability. Replacing a node needs to shut down one node
>>>> first; if another node is down during the node replacement period, we will
>>>> see an availability drop because most of our use cases are local_quorum with
>>>> replication factor 3.
>>>>
>>>> On Fri, May 5, 2023 at 5:59 AM Bowen Song via user <
>>>> user@cassandra.apache.org> wrote:
>>>>
>>>>> Have you thought of using "-Dcassandra.replace_address_first_boot=..."
>>>>> (or "-Dcassandra.replace_address=..." if you are using an older version)?
>>>>> This will not result in a topology change, which means "nodetool cleanup"
>>>>> is not needed after the operation is completed.
>>>>> On 05/05/2023 05:24, Jaydeep Chovatia wrote:
>>>>>
>>>>> Thanks, Jeff!
>>>>> But in our environment we replace nodes quite often for various
>>>>> optimization purposes, etc. say, almost 1

Re: Replacing node without shutting down the old node

2023-05-16 Thread Jaydeep Chovatia
Hi Jeff,

Do you think this is a good workaround to have in Cassandra itself until we
have CEP-21 available and cleanup running as part of compaction in Cassandra
itself?
It can work as follows in Cassandra:
Step 1: Add a new flag in bootstrap, say
*-Dcopy_tokens_from=<src_ip_address>*. If set, then the newly joining node
will copy the tokens from *src_ip_address* and add "-1" to each of them.
Step 2: Continue with the remaining bootstrap as is.
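
To illustrate Step 1, here is a small sketch of the token derivation only. The
flag itself is hypothetical and does not exist in Cassandra today; the same
tokens can be produced by hand from "nodetool ring" on the source node and set
as initial_token in the new node's cassandra.yaml before it bootstraps, which
is the manual way to pin the tokens:

import java.util.List;
import java.util.stream.Collectors;

public class DerivedInitialTokens {
    public static void main(String[] args) {
        // Tokens of the node being replaced, e.g. copied from "nodetool ring" on that node;
        // the values below are made up for illustration.
        List<Long> sourceTokens = List.of(-9121886405275791842L, 42L, 7301456912859800123L);

        String initialTokens = sourceTokens.stream()
                .map(t -> Long.toString(t - 1))            // the "-1" offset from Step 1
                .collect(Collectors.joining(","));

        // Today this value would be set by hand as initial_token in the new node's
        // cassandra.yaml before it bootstraps, and the old node decommissioned afterwards.
        System.out.println("initial_token: " + initialTokens);
    }
}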

Thoughts?

Jaydeep

On Tue, May 16, 2023 at 10:23 AM Runtian Liu  wrote:

> cool, thank you. This looks like a very good setup for us and cleanup
> should be very fast for this case.
>
> On Tue, May 16, 2023 at 5:53 AM Jeff Jirsa  wrote:
>
>>
>> In-line
>>
>> On May 15, 2023, at 5:26 PM, Runtian Liu  wrote:
>>
>> 
>> Hi Jeff,
>>
>> I tried the setup with vnode 16 and NetworkTopologyStrategy replication
>> strategy with replication factor 3 with 3 racks in one cluster. When using
>> the new node token as the old node token - 1
>>
>>
>> I had said +1 but you’re right that it’s actually -1 , sorry about that.
>> You want the new node to be lower than the existing host. The lower token
>> will take most of the data.
>>
>> I see the new node is streaming from the old node only. And the decom
>> phase of the old node is extremely fast. Does this mean the new node will
>> only take data ownership from the old node?
>>
>>
>> With exactly three racks, yes. With more racks or fewer racks, no.
>>
>> I also did some cleanups after replacing node with old token - 1 and the
>> cleanup sstable count was not increasing. Looks like adding a node with
>> old_token - 1 and decom the old node will not generate stale data on the
>> rest of the cluster. Do you know if there are any edge cases in this
>> replacement process that can generate any stale data on other nodes of the
>> cluster with the setup I mentioned?
>>
>>
>> Should do exactly what you want. I’d still run cleanup but it should be a
>> no-op.
>>
>>
>> Thanks,
>> Runtian
>>
>> On Mon, May 8, 2023 at 9:59 PM Runtian Liu  wrote:
>>
>>> I thought the joining node would not participate in quorum? How are we
>>> counting things like how many replicas ACK a write when we are adding a new
>>> node for expansion? The token ownership won't change until the new node is
>>> fully joined right?
>>>
>>> On Mon, May 8, 2023 at 8:58 PM Jeff Jirsa  wrote:
>>>
 You can't have two nodes with the same token (in the current metadata
 implementation) - it causes problems counting things like how many replicas
 ACK a write, and what happens if the one you're replacing ACKs a write but
 the joining host doesn't? It's harder than it seems to maintain consistency
 guarantees in that model, because you have 2 nodes where either may end up
 becoming the sole true owner of the token, and you have to handle both
 cases where one of them fails.

 An easier option is to add it with new token set to old token +1 (as an
 expansion), then decom the leaving node (shrink). That'll minimize
 streaming when you decommission that node.



 On Mon, May 8, 2023 at 7:19 PM Runtian Liu  wrote:

> Hi all,
>
> Sometimes we want to replace a node for various reasons, we can
> replace a node by shutting down the old node and letting the new node
> stream data from other replicas, but this approach may have availability
> issues or data consistency issues if one more node in the same cluster 
> went
> down. Why Cassandra doesn't support replacing a node without shutting down
> the old one? Can we treat the new node as normal node addition while 
> having
> exactly the same token ranges as the node to be replaced. After the new
> node's joining process is complete, we just need to cut off the old node.
> With this, we don't lose any availability and the token range is not moved
> so no clean up is needed. Is there any downside of doing this?
>
> Thanks,
> Runtian
>



Re: Is cleanup is required if cluster topology changes

2023-05-05 Thread Jaydeep Chovatia
Thanks, Jeff, for the detailed steps and summary.
We will keep the community (this thread) up to date on how it plays out in
our fleet.

Jaydeep

On Fri, May 5, 2023 at 9:10 AM Jeff Jirsa  wrote:

> Lots of caveats on these suggestions, let me try to hit most of them.
>
> Cleanup in parallel is good and fine and common. Limit number of threads
> in cleanup if you're using lots of vnodes, so each node runs one at a time
> and not all nodes use all your cores at the same time.
> If a host is fully offline, you can ALSO use replace address first boot.
> It'll stream data right to that host with the same token assignments you
> had before, and no cleanup is needed then. Strictly speaking, to avoid
> resurrection here, you'd want to run repair on the replicas of the down
> host (for vnodes, probably the whole cluster), but your current process
> doesn’t guarantee that either (decom + bootstrap may resurrect, strictly
> speaking).
> Dropping vnodes will reduce the replicas that have to be cleaned up, but
> also potentially increase your imbalance on each replacement.
>
> Cassandra should still do this on its own, and I think once CEP-21 is
> committed, this should be one of the first enhancement tickets.
>
> Until then, LeveledCompactionStrategy really does make cleanup fast and
> cheap, at the cost of higher IO the rest of the time. If you can tolerate
> that higher IO, you'll probably appreciate LCS anyway (faster reads, faster
> data deletion than STCS). It's a lot of IO compared to STCS though.
>
>
>
> On Fri, May 5, 2023 at 9:02 AM Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Thanks all for your valuable inputs. We will try some of the suggested
>> methods in this thread, and see how it goes. We will keep you updated on
>> our progress.
>> Thanks a lot once again!
>>
>> Jaydeep
>>
>> On Fri, May 5, 2023 at 8:55 AM Bowen Song via user <
>> user@cassandra.apache.org> wrote:
>>
>>> Depending on the number of vnodes per server, the probability and
>>> severity (i.e. the size of the affected token ranges) of an availability
>>> degradation due to a server failure during node replacement may be small.
>>> You also have the choice of increasing the RF if that's still not
>>> acceptable.
>>>
>>> Also, reducing number of vnodes per server can limit the number of
>>> servers affected by replacing a single server, therefore reducing the
>>> amount of time required to run "nodetool cleanup" if it is run sequentially.
>>>
>>> Finally, you may choose to run "nodetool cleanup" concurrently on
>>> multiple nodes to reduce the amount of time required to complete it.
>>>
>>>
>>> On 05/05/2023 16:26, Runtian Liu wrote:
>>>
>>> We are doing the "adding a node then decommissioning a node" approach to
>>> achieve better availability. Replacing a node needs to shut down one node
>>> first; if another node is down during the node replacement period, we will
>>> see an availability drop because most of our use cases are local_quorum with
>>> replication factor 3.
>>>
>>> On Fri, May 5, 2023 at 5:59 AM Bowen Song via user <
>>> user@cassandra.apache.org> wrote:
>>>
>>>> Have you thought of using "-Dcassandra.replace_address_first_boot=..."
>>>> (or "-Dcassandra.replace_address=..." if you are using an older version)?
>>>> This will not result in a topology change, which means "nodetool cleanup"
>>>> is not needed after the operation is completed.
>>>> On 05/05/2023 05:24, Jaydeep Chovatia wrote:
>>>>
>>>> Thanks, Jeff!
>>>> But in our environment we replace nodes quite often for various
>>>> optimization purposes, etc. say, almost 1 node per day (node *addition*
>>>> followed by node *decommission*, which of course changes the
>>>> topology), and we have a cluster of size 100 nodes with 300GB per node. If
>>>> we have to run cleanup on 100 nodes after every replacement, then it could
>>>> take forever.
>>>> What is the recommendation until we get this fixed in Cassandra itself
>>>> as part of compaction (w/o externally triggering *cleanup*)?
>>>>
>>>> Jaydeep
>>>>
>>>> On Thu, May 4, 2023 at 8:14 PM Jeff Jirsa  wrote:
>>>>
>>>>> Cleanup is fast and cheap and basically a no-op if you haven’t changed
>>>>> the ring
>>>>>
>>>>> After cassandra has transactional cluster metadata to make ring

Re: Is cleanup is required if cluster topology changes

2023-05-04 Thread Jaydeep Chovatia
Thanks, Jeff!
But in our environment we replace nodes quite often for various
optimization purposes - say, almost 1 node per day (node *addition*
followed by node *decommission*, which of course changes the topology), and
we have a cluster of size 100 nodes with 300GB per node. If we have to run
cleanup on 100 nodes after every replacement, then it could take forever.
What is the recommendation until we get this fixed in Cassandra itself as
part of compaction (w/o externally triggering *cleanup*)?

Jaydeep

On Thu, May 4, 2023 at 8:14 PM Jeff Jirsa  wrote:

> Cleanup is fast and cheap and basically a no-op if you haven’t changed the
> ring
>
> After cassandra has transactional cluster metadata to make ring changes
> strongly consistent, cassandra should do this in every compaction. But
> until then it’s left for operators to run when they’re sure the state of
> the ring is correct .
>
>
>
> On May 4, 2023, at 7:41 PM, Jaydeep Chovatia 
> wrote:
>
> 
> Isn't this considered a kind of *bug* in Cassandra because as we know
> *cleanup* is a lengthy and unreliable operation, so relying on the
> *cleanup* means higher chances of data resurrection?
> Do you think we should discard the unowned token-ranges as part of the
> regular compaction itself? What are the pitfalls of doing this as part of
> compaction itself?
>
> Jaydeep
>
> On Thu, May 4, 2023 at 7:25 PM guo Maxwell  wrote:
>
>> Compaction will just merge duplicate data and remove deleted data on this
>> node. If you add or remove one node from the cluster, I think cleanup is
>> needed. If cleanup failed, I think we should look into the reason.
>>
>> Runtian Liu  wrote on Fri, May 5, 2023 at 06:37:
>>
>>> Hi all,
>>>
>>> Is cleanup the sole method to remove data that does not belong to a
>>> specific node? In a cluster, where nodes are added or decommissioned from
>>> time to time, failure to run cleanup may lead to data resurrection issues,
>>> as deleted data may remain on the node that lost ownership of certain
>>> partitions. Or is it true that normal compactions can also handle data
>>> removal for nodes that no longer have ownership of certain data?
>>>
>>> Thanks,
>>> Runtian
>>>
>>
>>
>> --
>> you are the apple of my eye !
>>
>


Re: Is cleanup is required if cluster topology changes

2023-05-04 Thread Jaydeep Chovatia
We use STCS, and our experience with *cleanup* is that it takes a long time
to run in a 100-node cluster. We would like to replace one node every day
for various purposes in our fleet.

If we run *cleanup* after each node replacement, then it might take, say,
15 days to complete, and that hinders our node replacement frequency.

Do you see any other options?

Jaydeep

On Thu, May 4, 2023 at 9:47 PM Jeff Jirsa  wrote:

> You should 100% trigger cleanup each time or you’ll almost certainly
> resurrect data sooner or later
>
> If you’re using leveled compaction it’s especially cheap. Stcs and twcs
> are worse, but if you’re really scaling that often, I’d be considering lcs
> and running cleanup just before or just after each scaling
>
> On May 4, 2023, at 9:25 PM, Jaydeep Chovatia 
> wrote:
>
> 
> Thanks, Jeff!
> But in our environment we replace nodes quite often for various
> optimization purposes, etc. say, almost 1 node per day (node *addition*
> followed by node *decommission*, which of course changes the topology),
> and we have a cluster of size 100 nodes with 300GB per node. If we have to
> run cleanup on 100 nodes after every replacement, then it could take
> forever.
> What is the recommendation until we get this fixed in Cassandra itself as
> part of compaction (w/o externally triggering *cleanup*)?
>
> Jaydeep
>
> On Thu, May 4, 2023 at 8:14 PM Jeff Jirsa  wrote:
>
>> Cleanup is fast and cheap and basically a no-op if you haven’t changed
>> the ring
>>
>> After cassandra has transactional cluster metadata to make ring changes
>> strongly consistent, cassandra should do this in every compaction. But
>> until then it’s left for operators to run when they’re sure the state of
>> the ring is correct .
>>
>>
>>
>> On May 4, 2023, at 7:41 PM, Jaydeep Chovatia 
>> wrote:
>>
>> 
>> Isn't this considered a kind of *bug* in Cassandra because as we know
>> *cleanup* is a lengthy and unreliable operation, so relying on the
>> *cleanup* means higher chances of data resurrection?
>> Do you think we should discard the unowned token-ranges as part of the
>> regular compaction itself? What are the pitfalls of doing this as part of
>> compaction itself?
>>
>> Jaydeep
>>
>> On Thu, May 4, 2023 at 7:25 PM guo Maxwell  wrote:
>>
>>> Compaction will just merge duplicate data and remove deleted data on
>>> this node. If you add or remove one node from the cluster, I think cleanup
>>> is needed. If cleanup failed, I think we should look into the reason.
>>>
>>> Runtian Liu  wrote on Fri, May 5, 2023 at 06:37:
>>>
>>>> Hi all,
>>>>
>>>> Is cleanup the sole method to remove data that does not belong to a
>>>> specific node? In a cluster, where nodes are added or decommissioned from
>>>> time to time, failure to run cleanup may lead to data resurrection issues,
>>>> as deleted data may remain on the node that lost ownership of certain
>>>> partitions. Or is it true that normal compactions can also handle data
>>>> removal for nodes that no longer have ownership of certain data?
>>>>
>>>> Thanks,
>>>> Runtian
>>>>
>>>
>>>
>>> --
>>> you are the apple of my eye !
>>>
>>


Re: Is cleanup is required if cluster topology changes

2023-05-04 Thread Jaydeep Chovatia
Isn't this considered a kind of *bug* in Cassandra? As we know, *cleanup* is
a lengthy and unreliable operation, so relying on *cleanup* means higher
chances of data resurrection.
Do you think we should discard the unowned token-ranges as part of the
regular compaction itself? What are the pitfalls of doing this as part of
compaction itself?

Jaydeep

On Thu, May 4, 2023 at 7:25 PM guo Maxwell  wrote:

> Compaction will just merge duplicate data and remove deleted data on this
> node. If you add or remove one node from the cluster, I think cleanup is
> needed. If cleanup failed, I think we should look into the reason.
>
> Runtian Liu  wrote on Fri, May 5, 2023 at 06:37:
>
>> Hi all,
>>
>> Is cleanup the sole method to remove data that does not belong to a
>> specific node? In a cluster, where nodes are added or decommissioned from
>> time to time, failure to run cleanup may lead to data resurrection issues,
>> as deleted data may remain on the node that lost ownership of certain
>> partitions. Or is it true that normal compactions can also handle data
>> removal for nodes that no longer have ownership of certain data?
>>
>> Thanks,
>> Runtian
>>
>
>
> --
> you are the apple of my eye !
>


Race condition in QueryProcessor::prepare API

2024-01-19 Thread Jaydeep Chovatia
Hi,

Today, in our production, we came across the following scenario:

   1. We have 100 nodes of the Cassandra cluster on 4.0.6, and our client
   uses PreparedStatement, say, "*SELECT * FROM T1 WHERE PK=?*"
   2. We applied a schema change to add a *regular* column, "*ALTER TABLE
   T1 ADD COLUMN c1 UUID"*
   3. Around 20 (out of 100) of the Cassandra nodes started throwing the
   error: "*Successfully prepared, but could not find prepared statement
   for* "  (Code path: QueryEvents.java
   <https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/cql3/QueryEvents.java#L225C80-L225C145>
   )
   4. The error continued for around 2 hours, and only restarting those 20
   nodes resolved the issue.

Our current hypothesis is that there is a race condition in the
QueryProcessor::prepare
<https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L575>
API: one thread keeps evicting prepared statements while another keeps
re-adding them, so the cycle never ends. A similar hypothesis was mentioned
in a ticket in 2022: https://issues.apache.org/jira/browse/CASSANDRA-17401
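
To illustrate the suspected interleaving, here is a toy model. This is not
Cassandra's code; the map stands in for the prepared-statement cache and all
names are made up. It only shows how an eviction path racing with the prepare
path keeps reproducing the "could not find prepared statement" symptom:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PreparedCacheRaceSketch {
    static final ConcurrentHashMap<String, String> prepared = new ConcurrentHashMap<>();

    public static void main(String[] args) throws Exception {
        String id = "md5-of-select-query";                 // stand-in for the statement hash

        Runnable preparePath = () -> {                     // "prepare then look up", as a client would
            prepared.put(id, "SELECT * FROM T1 WHERE PK=?");
            if (prepared.get(id) == null)
                System.out.println("Successfully prepared, but could not find prepared statement");
        };
        Runnable evictPath = () -> prepared.remove(id);    // analogue of the eviction firing repeatedly

        ExecutorService pool = Executors.newFixedThreadPool(2);
        for (int i = 0; i < 100_000; i++) {
            pool.submit(preparePath);
            pool.submit(evictPath);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}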

Has anyone ever experienced this? Are there any quick pointers on what
could have gone wrong?

Thanks in advance!

Jaydeep


Re: Race condition in QueryProcessor::prepare API

2024-01-20 Thread Jaydeep Chovatia
I think this is a regression introduced as part of CASSANDRA-17248
<https://issues.apache.org/jira/browse/CASSANDRA-17248>; the following code
was introduced in QueryProcessor.java
<https://github.com/apache/cassandra/commit/242f7f9b18db77bce36c9bba00b2acda4ff3209e#r137491766>
starting with C* 3.0.26.

// Make sure the missing one is going to be eventually re-prepared
evictPrepared(hashWithKeyspace);
evictPrepared(hashWithoutKeyspace);

The above-mentioned code was not present in C* 3.0.25
<https://github.com/apache/cassandra/blob/cassandra-3.0.25/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L391>.

Alex, could you please take a look at it?

Jaydeep

On Fri, Jan 19, 2024 at 7:17 PM Jaydeep Chovatia 
wrote:

> Hi,
>
> Today, in our production, we came across the following scenario:
>
>1. We have 100 nodes of the Cassandra cluster on 4.0.6, and our client
>uses PreparedStatement, say, "*SELECT * FROM T1 WHERE PK=?*"
>2. We applied a schema change to add a *regular* column, "*ALTER TABLE
>T1 ADD COLUMN c1 UUID"*
>3. Around 20 (out of 100) of the Cassandra nodes started throwing the
>error: "*Successfully prepared, but could not find prepared statement
>for* "  (Code path: QueryEvents.java
>
> <https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/cql3/QueryEvents.java#L225C80-L225C145>
>)
>4. The error continued for around 2 hours, and only restarting those
>20 nodes resolved the issue.
>
> Our current hypothesis is that there is a race condition in the
> QueryProcessor::prepare
> <https://github.com/apache/cassandra/blob/cassandra-4.0/src/java/org/apache/cassandra/cql3/QueryProcessor.java#L575>
> API in that one thread is evicting the prepared statements while the other
> is adding, which is never-ending. A similar hypothesis has been mentioned
> in a ticket in 2022: https://issues.apache.org/jira/browse/CASSANDRA-17401
>
> Has anyone ever experienced this? Are there any quick pointers on what
> could have gone wrong?
>
> Thanks in advance!
>
> Jaydeep
>