Re: need help tuning dropped mutation messages

2017-07-06 Thread Subroto Barua
c* version: 3.0.11
cross_node_timeout: true
range_request_timeout_in_ms: 1
write_request_timeout_in_ms: 2000
counter_write_request_timeout_in_ms: 5000
cas_contention_timeout_in_ms: 1000
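
For reference, a minimal cassandra.yaml sketch of the two settings that, to my understanding, govern the "cross node timeout" drops reported below (the values are illustrative, not recommendations):

# When true, a replica measures a message's age from the time the coordinator
# created it (requires NTP-synchronized clocks), so time spent on the wire and
# in queues counts against the timeout.
cross_node_timeout: true
# MUTATION messages older than this are dropped by the replica instead of applied.
write_request_timeout_in_ms: 2000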

On Thursday, July 6, 2017, 11:43:44 AM PDT, Subroto Barua 
 wrote:

I am seeing these errors:
MessagingService.java: 1013 -- MUTATION messages dropped in last 5000 ms: 0 for 
internal timeout and 4 for cross node timeout
Writes at LOCAL_QUORUM consistency are failing on both a 3-node cluster and an 
18-node cluster.

need help tuning dropped mutation messages

2017-07-06 Thread Subroto Barua
I am seeing these errors:
MessagingService.java: 1013 -- MUTATION messages dropped in last 5000 ms: 0 for 
internal timeout and 4 for cross node timeout
Writes at LOCAL_QUORUM consistency are failing on both a 3-node cluster and an 
18-node cluster.

Dropped Mutation Messages in two DCs at different sites

2017-01-03 Thread Benyi Wang
I need to batch load a lot of data every day into a keyspace that spans two DCs:
one DC is on the west coast and the other is on the east coast.

I assume that the network delay between two DCs at different sites will
cause a lot of dropped mutation messages if I write too fast to the LOCAL DC
using LOCAL_QUORUM.

I ran this test: the test cluster has two DCs on one network at the same
site, but the remote DC has a lower-spec configuration than the local one.
When I used LOCAL_QUORUM and wrote fast enough, I observed a lot of dropped
mutation messages in the remote DC. So I expect the same thing will happen
when the two DCs are at different sites.

To my understanding, the coordinator in the LOCAL DC sends write requests to
all replicas, including the remote ones, and returns SUCCESS to the client
once a quorum of the replicas in the LOCAL DC has responded. Because of the
network delay, the remote side processes the requests with a lag, while new
requests keep arriving at the remote side at the rate of the LOCAL DC.
Eventually the queued requests exceed the timeout, and the mutation messages
are dropped.
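
A minimal sketch of the client side of that flow, assuming the DataStax Java driver 3.x (the contact point, DC, keyspace, and table names are made up): the call returns once the LOCAL-DC quorum acks, while the remote replicas receive the same mutation asynchronously.

import com.datastax.driver.core.*;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

public class LocalQuorumWrite {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.1")                        // illustrative
                .withLoadBalancingPolicy(DCAwareRoundRobinPolicy.builder()
                        .withLocalDc("DC_WEST").build())            // coordinate in the local DC
                .build();
             Session session = cluster.connect("my_keyspace")) {

            Statement insert = new SimpleStatement(
                    "INSERT INTO events (id, payload) VALUES (?, ?)", 42L, "hello")
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);

            // Returns as soon as a quorum of LOCAL-DC replicas ack; the remote DC
            // still receives the mutation, but a drop there never reaches the client.
            session.execute(insert);
        }
    }
}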

But I am not sure my analysis is correct, because it does not account for
there being more connections than in a single-DC setup, or for whether limited
network bandwidth slows down processing in the LOCAL DC.

If my analysis is correct, the solution could be either to slow down the batch
load or to configure the remote side with a longer timeout. My question is
how I can design tests to find out how slow the batch load needs to be to
avoid dropped mutation messages at the remote site.
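
One way to run such a test, sketched with the DataStax Java driver and Guava's RateLimiter (the rate, query, and names are illustrative assumptions): run the loader at a fixed rate, compare the remote DC's dropped MUTATION counters (for example via nodetool tpstats) before and after, and step the rate up until they start climbing.

import com.datastax.driver.core.*;
import com.google.common.util.concurrent.RateLimiter;

public class ThrottledLoadTest {
    public static void main(String[] args) {
        double writesPerSecond = Double.parseDouble(args[0]);       // vary this between runs
        RateLimiter limiter = RateLimiter.create(writesPerSecond);

        try (Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {
            PreparedStatement ps = session.prepare(
                    "INSERT INTO events (id, payload) VALUES (?, ?)");
            for (long i = 0; i < 1_000_000; i++) {
                limiter.acquire();                                  // block until the next permit
                session.execute(ps.bind(i, "row-" + i)
                        .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM));
            }
        }
        // The highest rate at which the remote DC's dropped MUTATION count stays
        // flat is an upper bound for the production batch load.
    }
}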

If my analysis is wrong, could you explain what actually happens in this
situation?

Thanks.


Re: Dropped mutation messages

2015-06-13 Thread Robert Wille

Internode messages which are received by a node, but do not get to be 
processed within rpc_timeout, are dropped rather than processed, as the 
coordinator node will no longer be waiting for a response. If the coordinator 
node does not receive Consistency Level responses before the rpc_timeout, it 
will return a TimedOutException to the client.

I understand that, but that’s where this makes no sense. I’m running with RF=1, 
and CL=QUORUM, which means each update goes to one node, and I need one 
response for a success. I have many thousands of dropped mutation messages, but 
no TimedOutExceptions thrown back to the client. If I have GC problems, or 
other issues that are making my cluster unresponsive, I can deal with that. But 
having writes that fail and no error is clearly not acceptable. How is it 
possible to be getting errors and not be informed about them?

Thanks

Robert



Re: Dropped mutation messages

2015-06-13 Thread Anuj Wadehra
You said RF=1... I missed that, so I'm not sure eventual consistency is creating the issue.


Thanks

Anuj Wadehra


Sent from Yahoo Mail on Android

From: Anuj Wadehra anujw_2...@yahoo.co.in
Date: Sat, 13 Jun, 2015 at 11:31 pm
Subject: Re: Dropped mutation messages

I think the dropped messages are the asynchronous ones required to maintain 
eventual consistency. The client may not be complaining because the data gets 
committed to one node synchronously, but is dropped when sent to the other nodes asynchronously.


We resolved a similar issue in our cluster by increasing memtable_flush_writers 
from 1 to 3 (we were writing to multiple CFs simultaneously).


We also fixed GC issues and reduced total_memtable_size_in_mb to ensure that 
most memtables are flushed early under heavy write load.


Thanks

Anuj Wadehra


Sent from Yahoo Mail on Android

From: Robert Wille rwi...@fold3.com
Date: Sat, 13 Jun, 2015 at 8:29 pm
Subject: Re: Dropped mutation messages


Internode messages which are received by a node, but do not get to be 
processed within rpc_timeout, are dropped rather than processed, as the 
coordinator node will no longer be waiting for a response. If the coordinator 
node does not receive Consistency Level responses before the rpc_timeout, it 
will return a TimedOutException to the client. 


I understand that, but that’s where this makes no sense. I’m running with RF=1, 
and CL=QUORUM, which means each update goes to one node, and I need one 
response for a success. I have many thousands of dropped mutation messages, but 
no TimedOutExceptions thrown back to the client. If I have GC problems, or 
other issues that are making my cluster unresponsive, I can deal with that. But 
having writes that fail and no error is clearly not acceptable. How is it 
possible to be getting errors and not be informed about them?


Thanks


Robert




Re: Dropped mutation messages

2015-06-13 Thread Anuj Wadehra
I think the dropped messages are the asynchronous ones required to maintain 
eventual consistency. The client may not be complaining because the data gets 
committed to one node synchronously, but is dropped when sent to the other nodes asynchronously.


We resolved a similar issue in our cluster by increasing memtable_flush_writers 
from 1 to 3 (we were writing to multiple CFs simultaneously).


We also fixed GC issues and reduced total_memtable_size_in_mb to ensure that 
most memtables are flushed early under heavy write load.
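
For reference, a minimal cassandra.yaml sketch of the knobs described above (the values are illustrative, not recommendations; in the 1.2/2.0 cassandra.yaml the total memtable budget is spelled memtable_total_space_in_mb):

# More flush writer threads help when several column families flush at once.
memtable_flush_writers: 3
# A smaller memtable budget forces earlier flushes under heavy write load.
memtable_total_space_in_mb: 1024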


Thanks

Anuj Wadehra


Sent from Yahoo Mail on Android

From: Robert Wille rwi...@fold3.com
Date: Sat, 13 Jun, 2015 at 8:29 pm
Subject: Re: Dropped mutation messages


Internode messages which are received by a node, but do not get to be 
processed within rpc_timeout, are dropped rather than processed, as the 
coordinator node will no longer be waiting for a response. If the coordinator 
node does not receive Consistency Level responses before the rpc_timeout, it 
will return a TimedOutException to the client. 


I understand that, but that’s where this makes no sense. I’m running with RF=1, 
and CL=QUORUM, which means each update goes to one node, and I need one 
response for a success. I have many thousands of dropped mutation messages, but 
no TimedOutExceptions thrown back to the client. If I have GC problems, or 
other issues that are making my cluster unresponsive, I can deal with that. But 
having writes that fail and no error is clearly not acceptable. How is it 
possible to be getting errors and not be informed about them?


Thanks


Robert




Dropped mutation messages

2015-06-12 Thread Robert Wille
I am preparing to migrate a large amount of data to Cassandra. In order to test 
my migration code, I’ve been doing some dry runs to a test cluster. My test 
cluster is 2.0.15, 3 nodes, RF=1 and CL=QUORUM. I know RF=1 and CL=QUORUM is a 
weird combination, but my production cluster that will eventually receive this 
data is RF=3. I am running with RF=1 so it's faster while I work out the kinks 
in the migration.

There are a few things that have puzzled me after writing several tens of 
millions of records to my test cluster.

My main concern is that I have a few tens of thousands of dropped mutation 
messages. I’m overloading my cluster. I never have more than about 10% CPU 
utilization (even my I/O wait is negligible). A curious thing about that is 
that the driver hasn’t thrown any exceptions, even though mutations have been 
dropped. I’ve seen dropped mutation messages on my production cluster, but like 
this, I’ve never gotten errors back from the client. I had always assumed that 
one node dropped mutation messages, but the other two did not, and so quorum 
was satisfied. With RF=1, I don’t understand how mutation messages are being 
dropped and the client doesn’t tell me about it. Does this mean my cluster is 
missing data, and I have no idea?
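
One way to verify whether replica-side drops ever surface at the client is to catch write timeouts explicitly around the migration writes; a minimal sketch, assuming the DataStax Java driver (the statement and names are illustrative):

import com.datastax.driver.core.*;
import com.datastax.driver.core.exceptions.WriteTimeoutException;

public class MigrationWrite {
    static void insertRecord(Session session, PreparedStatement ps, long id, String payload) {
        try {
            session.execute(ps.bind(id, payload)
                    .setConsistencyLevel(ConsistencyLevel.QUORUM));
        } catch (WriteTimeoutException e) {
            // With RF=1 and CL=QUORUM a write needs exactly one replica ack, so a
            // dropped mutation should, in theory, show up here as a timeout.
            System.err.printf("write timed out for id=%d: %s%n", id, e.getMessage());
        }
    }
}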

Each node has a couple dozen all-time blocked FlushWriters. Is that bad?

I have around 100 dropped counter mutations, which is very weird because I 
don’t write any counters. I have counters in my schema for tracking view 
counts, but the migration code doesn’t write them. How could I get dropped 
counter mutation messages when I don’t modify them?

Any insights would be appreciated. Thanks in advance.

Robert



Re: Dropped mutation messages

2015-06-12 Thread Robert Wille
I meant to say I’m *not* overloading my cluster.

On Jun 12, 2015, at 6:52 PM, Robert Wille rwi...@fold3.com wrote:

 I am preparing to migrate a large amount of data to Cassandra. In order to 
 test my migration code, I’ve been doing some dry runs to a test cluster. My 
 test cluster is 2.0.15, 3 nodes, RF=1 and CL=QUORUM. I know RF=1 and 
 CL=QUORUM is a weird combination, but my production cluster that will 
 eventually receive this data is RF=3. I am running with RF=1 so it's faster 
 while I work out the kinks in the migration.
 
 There are a few things that have puzzled me after writing several tens of 
 millions of records to my test cluster.
 
 My main concern is that I have a few tens of thousands of dropped mutation 
 messages. I’m overloading my cluster. I never have more than about 10% CPU 
 utilization (even my I/O wait is negligible). A curious thing about that is 
 that the driver hasn’t thrown any exceptions, even though mutations have been 
 dropped. I’ve seen dropped mutation messages on my production cluster, but 
 like this, I’ve never gotten errors back from the client. I had always 
 assumed that one node dropped mutation messages, but the other two did not, 
 and so quorum was satisfied. With RF=1, I don’t understand how mutation 
 messages are being dropped and the client doesn’t tell me about it. Does this 
 mean my cluster is missing data, and I have no idea?
 
 Each node has a couple dozen all-time blocked FlushWriters. Is that bad?
 
 I have around 100 dropped counter mutations, which is very weird because I 
 don’t write any counters. I have counters in my schema for tracking view 
 counts, but the migration code doesn’t write them. How could I get dropped 
 counter mutation messages when I don’t modify them?
 
 Any insights would be appreciated. Thanks in advance.
 
 Robert
 



Re: Dropped mutation messages

2013-06-20 Thread aaron morton
 What should be the path to investigate this?
Dropped messages are a symptom of other problems. 

Look for the GCInspector logging lots of ParNew, or the IO system being 
overloaded, or large (1000s) read or write batches from the client (a small 
batching sketch follows below). 
Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com
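
On the last point about large client batches, a minimal sketch of capping batch size, assuming the DataStax Java driver 2.x (the chunk size and statements are illustrative):

import com.datastax.driver.core.*;
import java.util.List;

public class ChunkedBatchWriter {
    static final int MAX_STATEMENTS_PER_BATCH = 50;                // illustrative cap

    static void writeAll(Session session, List<BoundStatement> statements) {
        BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
        for (BoundStatement stmt : statements) {
            batch.add(stmt);
            if (batch.size() >= MAX_STATEMENTS_PER_BATCH) {
                session.execute(batch);                             // flush this chunk
                batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
            }
        }
        if (batch.size() > 0) {
            session.execute(batch);                                 // flush the remainder
        }
    }
}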

On 20/06/2013, at 12:40 AM, Shahab Yunus shahab.yu...@gmail.com wrote:

 Hello Arthur,
 
 What do you mean by "The queries need to be lightened"?
 
 Thanks,
 Shahb
 
 
 On Tue, Jun 18, 2013 at 8:47 PM, Arthur Zubarev arthur.zuba...@aol.com 
 wrote:
 Cem hi,
  
 as per http://wiki.apache.org/cassandra/FAQ#dropped_messages
  
 Internode messages which are received by a node, but do not get to be 
 processed within rpc_timeout, are dropped rather than processed, as the 
 coordinator node will no longer be waiting for a response. If the coordinator 
 node does not receive Consistency Level responses before the rpc_timeout, it 
 will return a TimedOutException to the client. If the coordinator receives 
 Consistency Level responses it will return success to the client.
 
 For MUTATION messages this means that the mutation was not applied to all 
 replicas it was sent to. The inconsistency will be repaired by Read Repair or 
 Anti Entropy Repair.
 
 For READ messages this means a read request may not have completed.
 
 Load shedding is part of the Cassandra architecture, if this is a persistent 
 issue it is generally a sign of an overloaded node or cluster.
 
 By the way, I am on C* 1.2.4 too, in dev mode. After having my node filled 
 with 400 GB I started getting RPC timeouts on large data retrievals, so in 
 short, you may need to revise how you query.
 
 The queries need to be lightened
 
 /Arthur
 
  
 From: cem
 Sent: Tuesday, June 18, 2013 1:12 PM
 To: user@cassandra.apache.org
 Subject: Dropped mutation messages
  
 Hi All,
  
 I have a cluster of 5 nodes with C* 1.2.4.
  
 Each node has 4 disks 1 TB each.
  
 I see  a lot of dropped messages after it stores 400 GB  per disk. (1.6 TB 
 per node).
  
 The recommendation was 500 GB max per node before 1.2.  Datastax says that we 
 can store terabytes of data per node with 1.2.
 http://www.datastax.com/docs/1.2/cluster_architecture/cluster_planning
  
 Do I need to enable anything to leverage from 1.2? Do you have any other 
 advice?
  
 What should be the path to investigate this?
  
 Thanks in advance!
  
 Best Regards,
 Cem.
  
  
 



Re: Dropped mutation messages

2013-06-19 Thread Shahab Yunus
Hello Arthur,

What do you mean by "The queries need to be lightened"?

Thanks,
Shahb


On Tue, Jun 18, 2013 at 8:47 PM, Arthur Zubarev arthur.zuba...@aol.com wrote:

   Cem hi,

 as per http://wiki.apache.org/cassandra/FAQ#dropped_messages


 Internode messages which are received by a node, but do not get to be
 processed within rpc_timeout, are dropped rather than processed, as the
 coordinator node will no longer be waiting for a response. If the
 coordinator node does not receive Consistency Level responses before the
 rpc_timeout, it will return a TimedOutException to the client. If the
 coordinator receives Consistency Level responses it will return success to
 the client.

 For MUTATION messages this means that the mutation was not applied to all
 replicas it was sent to. The inconsistency will be repaired by Read Repair
 or Anti Entropy Repair.

 For READ messages this means a read request may not have completed.

 Load shedding is part of the Cassandra architecture, if this is a
 persistent issue it is generally a sign of an overloaded node or cluster.

 By the way, I am on C* 1.2.4 too, in dev mode. After having my node filled
 with 400 GB I started getting RPC timeouts on large data retrievals, so in
 short, you may need to revise how you query.

 The queries need to be lightened

 /Arthur

 From: cem cayiro...@gmail.com
 Sent: Tuesday, June 18, 2013 1:12 PM
 To: user@cassandra.apache.org
 Subject: Dropped mutation messages

  Hi All,

 I have a cluster of 5 nodes with C* 1.2.4.

 Each node has 4 disks 1 TB each.

 I see  a lot of dropped messages after it stores 400 GB  per disk. (1.6 TB
 per node).

 The recommendation was 500 GB max per node before 1.2.  Datastax says that
 we can store terabytes of data per node with 1.2.
 http://www.datastax.com/docs/1.2/cluster_architecture/cluster_planning

 Do I need to enable anything to leverage from 1.2? Do you have any other
 advice?

 What should be the path to investigate this?

 Thanks in advance!

 Best Regards,
 Cem.





Dropped mutation messages

2013-06-18 Thread cem
Hi All,

I have a cluster of 5 nodes with C* 1.2.4.

Each node has 4 disks 1 TB each.

I see  a lot of dropped messages after it stores 400 GB  per disk. (1.6 TB
per node).

The recommendation was 500 GB max per node before 1.2.  Datastax says that
we can store terabytes of data per node with 1.2.
http://www.datastax.com/docs/1.2/cluster_architecture/cluster_planning

Do I need to enable anything to leverage from 1.2? Do you have any other
advice?

What should be the path to investigate this?

Thanks in advance!

Best Regards,
Cem.


Re: Dropped mutation messages

2013-06-18 Thread Arthur Zubarev
Cem hi,

as per http://wiki.apache.org/cassandra/FAQ#dropped_messages

Internode messages which are received by a node, but do not get to be 
processed within rpc_timeout, are dropped rather than processed, as the 
coordinator node will no longer be waiting for a response. If the coordinator 
node does not receive Consistency Level responses before the rpc_timeout, it 
will return a TimedOutException to the client. If the coordinator receives 
Consistency Level responses it will return success to the client.

For MUTATION messages this means that the mutation was not applied to all 
replicas it was sent to. The inconsistency will be repaired by Read Repair or 
Anti Entropy Repair.

For READ messages this means a read request may not have completed.

Load shedding is part of the Cassandra architecture, if this is a persistent 
issue it is generally a sign of an overloaded node or cluster.

By the way, I am on C* 1.2.4 too, in dev mode. After having my node filled with 
400 GB I started getting RPC timeouts on large data retrievals, so in short, 
you may need to revise how you query.

The queries need to be lightened 

/Arthur


From: cem 
Sent: Tuesday, June 18, 2013 1:12 PM
To: user@cassandra.apache.org 
Subject: Dropped mutation messages

Hi All, 

I have a cluster of 5 nodes with C* 1.2.4.

Each node has 4 disks 1 TB each.

I see  a lot of dropped messages after it stores 400 GB  per disk. (1.6 TB per 
node).

The recommendation was 500 GB max per node before 1.2.  Datastax says that we 
can store terabytes of data per node with 1.2.
http://www.datastax.com/docs/1.2/cluster_architecture/cluster_planning

Do I need to enable anything to leverage from 1.2? Do you have any other advice?


What should be the path to investigate this?

Thanks in advance! 

Best Regards,
Cem.