Re: need help tuning dropped mutation messages
C* version: 3.0.11

    cross_node_timeout: true
    range_request_timeout_in_ms: 1
    write_request_timeout_in_ms: 2000
    counter_write_request_timeout_in_ms: 5000
    cas_contention_timeout_in_ms: 1000

On Thursday, July 6, 2017, 11:43:44 AM PDT, Subroto Barua wrote:
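For context, these settings live in cassandra.yaml. In an annotated sketch (comments are editorial; defaults cited are the stock 3.0 values), the three request timeouts match the defaults, while cross_node_timeout is non-default (it defaults to false) and range_request_timeout_in_ms is far below the 10000 ms default:

    # cassandra.yaml (values as posted above, Cassandra 3.0)
    cross_node_timeout: true                   # count in-flight time from the sender's
                                               # timestamp; requires NTP-synced clocks
    range_request_timeout_in_ms: 1             # as posted; stock default is 10000
    write_request_timeout_in_ms: 2000          # stock default
    counter_write_request_timeout_in_ms: 5000  # stock default
    cas_contention_timeout_in_ms: 1000         # stock default

With cross_node_timeout: true, clock skew between nodes is by itself enough to produce "cross node timeout" drops, so verifying NTP on all nodes is a reasonable first check.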
need help tuning dropped mutation messages
I am seeing these errors:

    MessagingService.java:1013 -- MUTATION messages dropped in last 5000 ms: 0 for internal timeout and 4 for cross node timeout

Writes at consistency level LOCAL_QUORUM are failing on both a 3-node cluster and an 18-node cluster.
Dropped Mutation Messages in two DCs at different sites
I need to batch load a lot of data every day into a keyspace that spans two DCs, one on the west coast and one on the east coast. I assume the network delay between the two sites will cause a lot of dropped mutation messages if I write too fast into the local DC at LOCAL_QUORUM.

I ran this test: the test cluster has two DCs on one network at the same site, but the remote DC has lower-spec hardware than the local one. When I wrote at LOCAL_QUORUM fast enough, I observed a lot of dropped mutation messages in the remote DC, so I expect the same thing to happen when the two DCs are at different sites.

To my understanding, the coordinator in the local DC sends write requests to all replicas, including the remote ones, and returns SUCCESS to the client once a quorum of the local replicas respond. Because of the network delay, the remote side processes requests late, while new requests keep arriving at the local DC's pace. Eventually the requests sitting in the remote queue exceed the timeout, and the mutation messages are dropped.

I am not sure this analysis is correct, though, because it doesn't account for there being more connections than in a single-DC setup, or for whether network bandwidth slows down processing in the local DC. If the analysis is correct, the fix would be either to slow down the batch load or to configure the remote side with a longer timeout.

My question is how to design tests that find out how slow the batch load has to be to avoid dropped mutations at the remote site (a throttled-client sketch follows below). If my analysis is wrong, could you explain what actually happens in this situation? Thanks.
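One way to run that test is to cap the loader at a fixed write rate, watch the remote DC's dropped-MUTATION counters (nodetool tpstats) between runs, and binary-search the rate. A minimal sketch using the DataStax Java driver 3.x and Guava's RateLimiter; the contact point, DC name, keyspace, table, and rate are all hypothetical placeholders:

    import com.datastax.driver.core.*;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.google.common.util.concurrent.RateLimiter;

    public class ThrottledLoader {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")              // hypothetical local-DC node
                    .withLoadBalancingPolicy(DCAwareRoundRobinPolicy.builder()
                            .withLocalDc("dc_west")           // hypothetical local DC name
                            .build())
                    .build();
            Session session = cluster.connect("test_ks");     // hypothetical keyspace

            // Cap the client-side write rate; vary this between runs and watch the
            // remote DC's dropped-mutation counters to find the safe ceiling.
            RateLimiter limiter = RateLimiter.create(5000.0); // writes/sec, hypothetical

            for (int i = 0; i < 1_000_000; i++) {
                limiter.acquire();
                SimpleStatement stmt = new SimpleStatement(
                        "INSERT INTO events (id, payload) VALUES (?, ?)", i, "row-" + i);
                stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
                session.execute(stmt);
            }
            cluster.close();
        }
    }

Since the coordinator only waits for the local quorum, the client sees no backpressure from the remote DC, so an explicit rate cap is the only knob on the client side.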
Re: Dropped mutation messages
Internode messages which are received by a node, but do not get to be processed within rpc_timeout, are dropped rather than processed, as the coordinator node will no longer be waiting for a response. If the coordinator node does not receive Consistency Level responses before the rpc_timeout it will return a TimedOutException to the client.

I understand that, but that's where this makes no sense. I'm running with RF=1 and CL=QUORUM, which means each update goes to one node, and I need one response for a success. I have many thousands of dropped mutation messages, but no TimedOutExceptions thrown back to the client.

If I have GC problems, or other issues that are making my cluster unresponsive, I can deal with that. But having writes that fail with no error is clearly not acceptable. How is it possible to be getting errors and not be informed about them?

Thanks
Robert
Re: Dropped mutation messages
You said RF=1... I missed that, so I'm not sure eventual consistency is creating the issues.

Thanks
Anuj Wadehra
Re: Dropped mutation messages
I think the messages dropped are the asynchronous ones required to maintain eventual consistency. The client may not be complaining because the data gets committed to one node synchronously, but dropped when sent to the other nodes asynchronously.

We resolved a similar issue in our cluster by increasing memtable_flush_writers from 1 to 3 (we were writing to multiple CFs simultaneously). We also fixed GC issues and reduced total_memtable_size_in_mb to ensure that most memtables are flushed early under heavy write load (a config sketch follows below).

Thanks
Anuj Wadehra
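A sketch of the corresponding cassandra.yaml settings. Note that the option the poster calls total_memtable_size_in_mb most likely corresponds to memtable_total_space_in_mb in the Cassandra 2.0-era yaml; the values shown are hypothetical:

    # cassandra.yaml -- tuning described above (Cassandra 2.0-era option names)
    memtable_flush_writers: 3         # raised from 1; helps when several CFs flush at once
    memtable_total_space_in_mb: 1024  # hypothetical value; a lower cap makes memtables
                                      # flush earlier, smoothing heavy write loads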
Dropped mutation messages
I am preparing to migrate a large amount of data to Cassandra. In order to test my migration code, I've been doing some dry runs against a test cluster: 2.0.15, 3 nodes, RF=1 and CL=QUORUM. I know RF=1 and CL=QUORUM is a weird combination, but the production cluster that will eventually receive this data is RF=3; I'm running with RF=1 so it's faster while I work out the kinks in the migration.

A few things have puzzled me after writing several tens of millions of records to the test cluster.

My main concern is that I have a few tens of thousands of dropped mutation messages. I'm overloading my cluster. I never see more than about 10% CPU utilization (even my I/O wait is negligible). The curious thing is that the driver hasn't thrown any exceptions, even though mutations have been dropped. I've seen dropped mutation messages on my production cluster too, and likewise never got errors back at the client; I had always assumed that one node dropped mutation messages while the other two did not, so quorum was still satisfied. With RF=1, I don't understand how mutation messages can be dropped without the client being told. Does this mean my cluster is missing data, and I have no idea?

Each node has a couple dozen all-time blocked FlushWriters. Is that bad?

I also have around 100 dropped counter mutations, which is very weird because I don't write any counters. I have counters in my schema for tracking view counts, but the migration code doesn't write them. How could I get dropped counter mutation messages when I don't modify counters?

Any insights would be appreciated. Thanks in advance.

Robert
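Both symptoms mentioned here (dropped-message counts and blocked FlushWriters) are visible per node with nodetool; a quick check, assuming shell access to each node:

    # Thread-pool stats: look at the "All time blocked" column for FlushWriter,
    # and the dropped-message table (MUTATION, COUNTER_MUTATION, READ, ...) at the end.
    nodetool tpstats

Comparing the counts across the three nodes also shows whether one node is the outlier or the whole cluster is shedding load.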
Re: Dropped mutation messages
I meant to say I'm *not* overloading my cluster.
Re: Dropped mutation messages
What should be the path to investigate this?

Dropped messages are a symptom of other problems. Look for the GCInspector logging lots of ParNew collections, for the IO system being overloaded, or for large (1000s of rows) read or write batches from the client (some diagnostic one-liners follow below).

Cheers

Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com
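A couple of quick checks along those lines, assuming a package install with logs under /var/log/cassandra (paths vary by install):

    # GC pressure: long or frequent ParNew/CMS pauses reported by GCInspector
    grep GCInspector /var/log/cassandra/system.log | tail -n 50

    # IO saturation: watch await and %util on the data and commitlog disks
    iostat -x 5

Sustained high %util or multi-hundred-millisecond GC pauses line up with the load-shedding behaviour described in the FAQ excerpt quoted below.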
Re: Dropped mutation messages
Hello Arthur,

What do you mean by "The queries need to be lightened"?

Thanks,
Shahab
Dropped mutation messages
Hi All,

I have a cluster of 5 nodes running C* 1.2.4. Each node has 4 disks of 1 TB each. I see a lot of dropped messages after a node stores about 400 GB per disk (1.6 TB per node). The recommendation was 500 GB max per node before 1.2, but Datastax says that we can store terabytes of data per node with 1.2: http://www.datastax.com/docs/1.2/cluster_architecture/cluster_planning

Do I need to enable anything to take advantage of 1.2? Do you have any other advice? What should be the path to investigate this?

Thanks in advance!

Best Regards,
Cem.
Re: Dropped mutation messages
Cem hi,

as per http://wiki.apache.org/cassandra/FAQ#dropped_messages

Internode messages which are received by a node, but do not get to be processed within rpc_timeout, are dropped rather than processed, as the coordinator node will no longer be waiting for a response.

If the coordinator node does not receive Consistency Level responses before the rpc_timeout it will return a TimedOutException to the client. If the coordinator receives Consistency Level responses it will return success to the client.

For MUTATION messages this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti-Entropy Repair. For READ messages this means a read request may not have completed.

Load shedding is part of the Cassandra architecture; if this is a persistent issue it is generally a sign of an overloaded node or cluster.

By the way, I am on C* 1.2.4 too in dev mode. After having my node filled with 400 GB I started getting RPC timeouts on large data retrievals, so in short, you may need to revise how you query. The queries need to be lightened.

/Arthur
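The TimedOutException the FAQ mentions is surfaced by client drivers as a catchable error, so a loader can at least log or retry instead of silently losing writes. A minimal sketch using the DataStax Java driver's exception names (3.x era, so an anachronism for this 1.2-vintage thread; the table and values are hypothetical):

    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.exceptions.WriteTimeoutException;

    public final class SafeWriter {
        // Execute one insert, surfacing (rather than swallowing) write timeouts.
        static void insertWithTimeoutHandling(Session session, int id, String payload) {
            SimpleStatement stmt = new SimpleStatement(
                    "INSERT INTO events (id, payload) VALUES (?, ?)", id, payload);
            stmt.setConsistencyLevel(ConsistencyLevel.QUORUM);
            try {
                session.execute(stmt);
            } catch (WriteTimeoutException e) {
                // Fewer than CL replicas acknowledged in time. The write may still
                // have been applied on some replicas, so retry only if idempotent.
                System.err.println("write timed out: " + e.getMessage());
            }
        }
    }

Note the asymmetry the FAQ describes: the client only sees this exception when fewer than CL replicas respond in time; drops on the remaining replicas never reach the client and are repaired later by read repair or anti-entropy repair.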