Mutation dropped and Read-Repair performance issue

2020-12-19 Thread sunil pawar
Hi All,

We are seeing failures in the ReadRepairStage with DigestMismatch errors,
at a rate of 300+ per day per node.
At the same time, nodes are getting overloaded for a few seconds at a time
due to long GC pauses (around 7-8 seconds). We are not running repair as a
regular maintenance activity, because a node goes down, again due to long
GC pauses, whenever we run repair on its tables. We can run repair with
the --in-local-dc option for all tables except one. Below is the
configuration of the cluster:

   1. 15 node cluster.
   2. RF=3
   3. Xmx and Xms 31GB.
   4. G1GC algorithm is in use.
   5. Version 3.11.2
   6. Load on each node roughly around 500GB
   7. One table carries most of the load compared to the other tables.

Please suggest any configuration-level changes we can make to avoid the
above problems. Is getting too many digest mismatch messages a sign that a
node is doing more read and write work than it would without them, and
could that be the cause of the node getting overloaded at that particular
moment?

-- 
Thanks,
S.R.
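Before changing configuration, it helps to put numbers on both symptoms. A
minimal sketch of the checks, assuming a stock package install (the log path
is an assumption and varies by distribution):

    # Thread pool stats for this node: the ReadRepairStage row and the
    # dropped-message counts at the bottom of the output.
    nodetool tpstats

    # GC pauses long enough for Cassandra itself to complain about:
    grep GCInspector /var/log/cassandra/system.log | tail -20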


Re: Mutation dropped

2013-02-23 Thread Víctor Hugo Oliveira Molinar
Aaron, what did you mean by "RF 3 / CL QUORUM is a more real world scenario"?
If there are only 2 nodes, where will the third replica be placed?
Won't increasing the CL decrease read/write performance and in turn
increase the TimedOutExceptions in the case mentioned?
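For context on the quorum arithmetic: QUORUM is derived from the replication
factor, not from the node count, so the required replica count does not grow
with the cluster. A quick sketch:

    # quorum = floor(RF / 2) + 1
    RF=3
    echo $(( RF / 2 + 1 ))    # prints 2: QUORUM waits for 2 of the 3 replicas

With RF=3 a QUORUM read or write therefore needs 2 replica responses; on a
2-node cluster there is simply no third node to hold the third replica, which
is the point of the question.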



Re: Mutation dropped

2013-02-22 Thread aaron morton
If you are running repair, using QUORUM, and there are no dropped writes you
should not be getting DigestMismatch during reads.

If everything else looks good, but the request latency is higher than the CF
latency I would check that client load is evenly distributed. Then start
looking to see if the request throughput is at its maximum for the cluster.

Cheers
  
-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com
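A rough way to check the even-distribution point, assuming SSH access and
placeholder hostnames:

    # Compare completed ReadStage / MutationStage counts across nodes;
    # a large skew suggests clients are favouring some coordinators.
    for h in node1 node2 node3; do
      echo "== $h =="
      ssh "$h" nodetool tpstats | egrep 'ReadStage|MutationStage'
    done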


Re: Mutation dropped

2013-02-21 Thread aaron morton
 What does rpc_timeout control? Only the reads/writes? 
Yes. 

 like data stream,
streaming_socket_timeout_in_ms in the yaml

 merkle tree request? 
Either no time out or a number of days, cannot remember which right now. 

 What is the side effect if it's set to a really small number, say 20ms?
You will probably get a lot more requests that fail with a TimedOutException. 

rpc_timeout needs to be longer than the time it takes a node to process the
message, and the time it takes the coordinator to do its thing. You can look
at cfhistograms and proxyhistograms to get a better idea of how long a request
takes in your system.
  
Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com
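For reference, these knobs all live in cassandra.yaml. A sketch of inspecting
them - the names are version-dependent (later releases split rpc_timeout_in_ms
into per-operation settings such as read_request_timeout_in_ms and
write_request_timeout_in_ms), and the path is an assumption:

    grep -E 'rpc_timeout_in_ms|request_timeout_in_ms|streaming_socket_timeout_in_ms' \
        /etc/cassandra/cassandra.yaml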



Re: Mutation dropped

2013-02-21 Thread Wei Zhu
Thanks Aaron for the great information as always. I just checked cfhistograms
and only a handful of read latencies are bigger than 100ms, but for
proxyhistograms there are 10 times as many greater than 100ms. We are using
QUORUM for reads with RF=3, and I understand the coordinator needs to get the
digest from other nodes and read repair on the mismatch etc. But is it normal
to see the latency from proxyhistograms go beyond 100ms? Is there any way to
improve that?
We are tracking the metrics from the client side and we see the 95th percentile
response time averaging at 40ms, which is a bit high. Our 50th percentile was
great, under 3ms.

Any suggestion is very much appreciated.

Thanks.
-Wei
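For anyone reproducing the comparison above, both views come straight from
nodetool; the keyspace and column family names here are placeholders:

    # Local per-CF read/write latency, excluding coordinator overhead:
    nodetool cfhistograms my_keyspace my_cf

    # Coordinator-level (request) latency, including the network hops
    # and digest reads that QUORUM adds:
    nodetool proxyhistograms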


Re: Mutation dropped

2013-02-20 Thread Wei Zhu
What does rpc_timeout control? Only the reads/writes? How about other
inter-node communication, like data streams and merkle tree requests? What is a
reasonable value for rpc_timeout? The default value of 10 seconds seems way too
long. What is the side effect if it's set to a really small number, say 20ms?

Thanks.
-Wei


Re: Mutation dropped

2013-02-19 Thread aaron morton
 Does the rpc_timeout not control the client timeout ?
No, it is how long a node will wait for a response from other nodes before
raising a TimedOutException if fewer than CL nodes have responded.
Set the client-side socket timeout using your preferred client.

 Is there any param which is configurable to control the replication timeout 
 between nodes ?
There is no such thing.
rpc_timeout is roughly like that, but it's not right to think about it that 
way. 
i.e. if a message to a replica times out and CL nodes have already responded 
then we are happy to call the request complete. 

Cheers

 
-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com



RE: Mutation dropped

2013-02-18 Thread Kanwar Sangha
Thanks Aaron.

Does the rpc_timeout not control the client timeout? Is there any param which
is configurable to control the replication timeout between nodes? Or is the
same param used to control that, since the other node is also like a client?



Re: Mutation dropped

2013-02-17 Thread aaron morton
You are hitting the maximum throughput on the cluster. 

The messages are dropped because the node fails to start processing them before 
rpc_timeout. 

However the request is still a success because the client requested CL was 
achieved. 

Testing with RF 2 and CL 1 really just tests the disks on one local machine.
Both nodes replicate each row, and writes are sent to each replica, so the only
thing the client is waiting on is the local node to write to its commit log.

Testing with (and running in prod) RF 3 and CL QUORUM is a more real world
scenario.

Cheers
 
-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com
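To benchmark the more realistic setup, the load generator can be pointed at an
RF=3 keyspace and run at QUORUM. A sketch using the bundled cassandra-stress
tool (modern syntax; the node address is a placeholder):

    cassandra-stress write n=1000000 cl=QUORUM \
        -schema "replication(factor=3)" \
        -node 10.0.0.1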



Mutation dropped

2013-02-14 Thread Kanwar Sangha
Hi - I am doing a load test using YCSB across 2 nodes in a cluster and seeing a
lot of mutation dropped messages. I understand that this is due to the replica
not being written to the other node? RF = 2, CL = 1.

From the wiki:
For MUTATION messages this means that the mutation was not applied to all
replicas it was sent to. The inconsistency will be repaired by Read Repair or
Anti Entropy Repair.

Thanks,
Kanwar
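The counters behind those log messages can be read per node; a minimal sketch,
with the log path as an assumption:

    # Cumulative dropped message counts by type (MUTATION, READ, ...)
    # are printed at the end of the tpstats output:
    nodetool tpstats

    # The periodic log line that reports the drops:
    grep -i dropped /var/log/cassandra/system.log | tail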



RE: Mutation dropped

2013-02-14 Thread Kanwar Sangha
Hi - Is there a parameter which can be tuned to prevent the mutations from
being dropped? Is this logic correct?

Node A and B with RF=2, CL=1. Load balanced between the two.

--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.x.x.x   746.78 GB  256     100.0%            dbc9e539-f735-4b0b-8067-b97a85522a1a  rack1
UN  10.x.x.x   880.77 GB  256     100.0%            95d59054-be99-455f-90d1-f43981d3d778  rack1

Once we hit a very high TPS (around 50k/sec of inserts), the nodes start
falling behind and we see the mutation dropped messages. But there are no
failures on the client. Does that mean the other node is not able to persist
the replicated data? Is there some timeout associated with replicated data
persistence?

Thanks,
Kanwar

RE: Mutation Dropped Messages

2012-03-06 Thread Tiwari, Dushyant
1. One node is running at 8G, the rest on 10G - same config.

2. Nodetool ring:

Status  State   Load        Owns    Token
                                    162563731948587347959549934419333022646
Up      Normal  107.79 MB   25.00%  34957844353235424160784456632419943350
Up      Normal  116.44 MB   25.00%  77493140218352732093706282561390969782
Up      Normal  27.01 MB    12.68%  99065646426277998282363457251162269147
Up      Normal  35.9 MB     12.32%  120028436083470040026628108490361996214
Up      Normal  512.55 KB   25.00%  162563731948587347959549934419333022646

RF: 2 and CL: QUORUM - writes at a rate of 1750 rows/s; every row has 5 cols,
2 of them indexed.

Thanks,
Dushyant


Re: Mutation Dropped Messages

2012-03-06 Thread aaron morton
 1.   One node is running at 8G rest on 10G – same config
Make them all the same. 

 2.   Nodetool –
Even though the token ranges are not balanced, the load looks a little odd.
Have you moved tokens? Did you do a cleanup?

You'll need to look at the node that is dropping messages (not sure which one
that is).

What is happening in the log? Is it having GC problems?
What is happening with the io and CPU load on the machine?

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
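A sketch of those checks, run on the node that is dropping messages (the PID
placeholder must be replaced with the Cassandra process id):

    # Disk and CPU load while the drops are happening:
    iostat -x 5

    # GC behaviour of the Cassandra JVM, sampled every second, 10 samples:
    jstat -gcutil <cassandra-pid> 1000 10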


Mutation Dropped Messages

2012-03-05 Thread Tiwari, Dushyant
Hi All,

While benchmarking Cassandra I found Mutation Dropped messages in the logs.
Now I know this is a good old question. It will be really great if someone can
provide a checklist to recover when such a thing happens. I am looking for
answers to the following questions:


1.   Which parameters to tune in the config files? - Especially looking for 
heavy writes

2.   What is the difference between TimedOutException and silently dropping 
mutation messages while operating on a CL of QUORUM.


Regards,
Dushyant



Re: Mutation Dropped Messages

2012-03-05 Thread aaron morton
 1.   Which parameters to tune in the config files? – Especially looking 
 for heavy writes
The node is overloaded. It may be because there are not enough nodes, or the
node is under temporary stress such as GC or repair.
If you have spare IO / CPU capacity you could increase concurrent_writes to
increase throughput on the write stage. You then need to ensure the commit log
and, to a lesser degree, the data volumes can keep up.

 2.   What is the difference between TimedOutException and silently
 dropping mutation messages while operating on a CL of QUORUM.
TimedOutException means CL nodes did not respond to the coordinator before
rpc_timeout. Dropping messages happens when a message is removed from the queue
in a thread pool after rpc_timeout has occurred. It is a feature of the
architecture, and correct behaviour under stress.
Inconsistencies created by dropped messages are repaired via reads at high CL,
HH (in 1.x), Read Repair or Anti Entropy.

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
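concurrent_writes is set in cassandra.yaml; a sketch of checking it (the path
is an assumption, and the old rule of thumb was roughly 8 x number of cores):

    grep concurrent_writes /etc/cassandra/cassandra.yaml
    # e.g. for a write-heavy 8-core box one might raise it to:
    #   concurrent_writes: 64
    # then restart the node for the change to take effect.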



Re: Mutation Dropped Messages

2012-03-05 Thread aaron morton
 I increased the size of the cluster and also the concurrent_writes parameter.
 Still there is a node which keeps on dropping mutation messages.
Ensure all the nodes have the same spec, and the nodes have the same config. In
a virtual environment consider moving the node.

 Is this due to some improper load balancing?
What does nodetool ring say, and what sort of queries (and RF and CL) are you
sending?

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 6/03/2012, at 3:58 AM, Tiwari, Dushyant wrote:

 Hey Aaron,
  
 I increased the size of the cluster and also the concurrent_writes parameter.
 Still there is a node which keeps on dropping mutation messages; the other
 nodes are not. I am using the Hector API and have done nothing for load
 balancing so far, just provided the host:port of the nodes in the
 CassandraHostConfigurator. Is this due to some improper load balancing? Also,
 the physical host where the node runs carries a relatively heavier load than
 the other nodes' hosts. What can I do to improve?
 PS: The node is the seed of the cluster.
  
 Thanks,
 Dushyant
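A sketch of verifying the same-spec / same-config advice across the cluster
(hostnames and the config path are placeholders):

    # Same config: the yaml should hash identically on every node.
    for h in node1 node2 node3 node4 node5; do
      printf '%s: ' "$h"
      ssh "$h" md5sum /etc/cassandra/cassandra.yaml
    done

    # Same heap: compare the -Xmx flag each node actually started with.
    ssh node1 'ps -ef | grep [c]assandra | tr " " "\n" | grep Xmx'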
  