Re: dropped mutations cross node
Thanks. I made a lot of configuration changes trying to fix the problem, but nothing worked (the last one was disabling hints), and after a few days the problem went away on its own! The source of the cross-node dropped mutations changed every half hour, and it was not always the new nodes. There is no difference between the new nodes and the old ones in configuration or node spec.
Re: dropped mutations cross node
Sorry for the late reply. Do you still need assistance with this issue? If the source of the dropped mutations and high latency is the newer nodes, that indicates to me that you have an issue with the commitlog disks. Are the newer nodes identical in hardware configuration to the pre-existing nodes? Are there any differences in configuration you could point out? Cheers!
dropped mutations cross node
Hi,

I've extended a cluster by 10%, and since then, every hour on some of the nodes (which change randomly each time), "dropped mutations cross node" messages appear in the logs (each time 1 or 2 drops, sometimes several thousand, with cross-node latency from 3000 ms to 9 ms or even 90 seconds!), and the insert rate has decreased by about 50%.

- Token ownership looks OK (STDEV.P of the ownership percentage even decreased with the cluster extension).
- CPU usage on the nodes is less than 30 percent and well balanced.
- Disk usage is less than 10% watching through iostat, and there are no pending compactions on any node.
- There are no other logs besides the dropped reports (although a few GCs of about 200-300 ms every 5 minutes).
- No sign of memory problems looking at jvisualvm.
- Honestly, I do not monitor the network equipment (switches), but the network has not changed since the cluster extension, and there is no increase in packet-discard counters on the node side.

So to emphasize: there are mutation drops for which I cannot detect the root cause. Is there any workaround or monitoring metric that I have missed?

Cluster info:
Cassandra 3.11.2
RF 3
30 nodes
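The ownership-balance check described above (STDEV.P of per-node ownership percentages) can be sketched in a few lines. The node names and percentages below are made up for illustration; in practice you would read them from `nodetool status`:

```python
import math

def ownership_stdev(ownership):
    """Population standard deviation (STDEV.P) of per-node ownership percentages."""
    values = list(ownership.values())
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

# Hypothetical ownership percentages for a 4-node sample, before and after
# extending the cluster.
before = {"node1": 30.0, "node2": 28.0, "node3": 22.0, "node4": 20.0}
after = {"node1": 26.0, "node2": 25.0, "node3": 25.0, "node4": 24.0}

# A lower stdev after the extension means ownership became *more* balanced,
# so uneven token distribution is unlikely to explain the dropped mutations.
print(ownership_stdev(before) > ownership_stdev(after))  # True
```

A decreasing stdev, as the poster observed, rules out token imbalance as the cause but says nothing about disks, GC, or the network.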
Re: Dropped mutations
What do the READ and _TRACE dropped messages mean? There is no tracing enabled on any node in the cluster, so what are these _TRACE dropped messages?

INFO [ScheduledTasks:1] 2019-07-25 21:17:13,878 MessagingService.java:1281 - READ messages were dropped in last 5000 ms: 1 internal and 0 cross node. Mean internal dropped latency: 5960 ms and Mean cross-node dropped latency: 0 ms
INFO [ScheduledTasks:1] 2019-07-25 20:38:43,788 MessagingService.java:1281 - _TRACE messages were dropped in last 5000 ms: 5035 internal and 0 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 0 ms

--
Regards,
Ayub
Re: Dropped mutations
Thanks Jeff. Does "internal" mean local-node operations (in this case, the mutation response from the local node), and does "cross node" mean the time it took to get a response back from the other nodes, depending on the consistency level chosen?

--
Regards,
Ayub
Re: Dropped mutations
Hello Jeff,

Could you help me visualize these terms?
1. Internal mutations
2. Cross-node mutations
3. Mean internal dropped latency
4. Cross-node dropped latency

Thanks,
Rajsekhar
Re: Dropped mutations
This means your database is seeing commands that have already timed out by the time it goes to execute them, so it ignores them and gives up instead of working on work items that have already expired.

The first log line shows 5 second latencies, the second line 6s and 8s latencies, which sounds like either really bad disks or really bad JVM GC pauses.
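Jeff's description of load shedding can be sketched as a tiny decision function. The 2000 ms value mirrors Cassandra's default write_request_timeout_in_ms (check your own cassandra.yaml), and the timestamps are illustrative:

```python
WRITE_REQUEST_TIMEOUT_MS = 2000  # cassandra.yaml: write_request_timeout_in_ms (default)

def should_drop(created_at_ms, now_ms, timeout_ms=WRITE_REQUEST_TIMEOUT_MS):
    """True if the mutation has waited longer than its timeout by the time it
    is dequeued, so the replica sheds it instead of executing it (the
    coordinator/client has already given up on this request)."""
    return (now_ms - created_at_ms) > timeout_ms

# A mutation dequeued 4966 ms after it was created (the mean cross-node
# dropped latency in the log lines above) is shed rather than applied.
print(should_drop(created_at_ms=0, now_ms=4966))  # True
print(should_drop(created_at_ms=0, now_ms=1500))  # False
```

This is why the "mean dropped latency" in the log is always larger than the configured timeout: it is the average age of the messages that were already too old to be worth executing.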
Dropped mutations
Hello, how do I read dropped-mutation error messages - what's "internal" and "cross node"? For mutations it fails on cross-node, and for read_repair/read it fails on internal. What does it mean?

INFO [ScheduledTasks:1] 2019-07-21 11:44:46,150 MessagingService.java:1281 - MUTATION messages were dropped in last 5000 ms: 0 internal and 65 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 4966 ms
INFO [ScheduledTasks:1] 2019-07-19 05:01:10,620 MessagingService.java:1281 - READ_REPAIR messages were dropped in last 5000 ms: 9 internal and 8 cross node. Mean internal dropped latency: 6013 ms and Mean cross-node dropped latency: 8164 ms

--
Regards,
Ayub
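To pull these counters out of logs programmatically, here is a hedged sketch that parses the MessagingService line format shown above (Cassandra 3.x); the regex and field names are ours, not a Cassandra API:

```python
import re

# Matches lines like:
#   "MUTATION messages were dropped in last 5000 ms: 0 internal and 65 cross
#    node. Mean internal dropped latency: 0 ms and Mean cross-node dropped
#    latency: 4966 ms"
PATTERN = re.compile(
    r"(?P<verb>\w+) messages were dropped in last (?P<window>\d+) ms: "
    r"(?P<internal>\d+) internal and (?P<cross>\d+) cross node\. "
    r"Mean internal dropped latency: (?P<int_lat>\d+) ms and "
    r"Mean cross-node dropped latency: (?P<cross_lat>\d+) ms"
)

def parse_dropped(line):
    """Return the dropped-message counters from one log line, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {k: (v if k == "verb" else int(v)) for k, v in m.groupdict().items()}

line = ("INFO [ScheduledTasks:1] 2019-07-21 11:44:46,150 MessagingService.java:1281 - "
        "MUTATION messages were dropped in last 5000 ms: 0 internal and 65 cross node. "
        "Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 4966 ms")
print(parse_dropped(line))
# {'verb': 'MUTATION', 'window': 5000, 'internal': 0, 'cross': 65, 'int_lat': 0, 'cross_lat': 4966}
```

Feeding every log line through `parse_dropped` and graphing `cross` vs `internal` per verb makes the "which side is slow" question in this thread easy to answer at a glance.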
Re: Problem with dropped mutations
Dropped mutations are load shedding - something's not happy. Are you seeing GC pauses? What heap size and version? What memtable settings?

--
Jeff Jirsa
Re: Problem with dropped mutations
Yes, there are timeouts sometimes, but more on the read side. And yes, there are certain data modeling problems which will soon be addressed, but we need to keep things steady until we get there.

I guess many write timeouts go unnoticed because the consistency level != ALL.

The network looks to be working fine.

Hannu
RE: Problem with dropped mutations
Are you also seeing time-outs on certain Cassandra operations? If yes, you may have to tweak the *request_timeout parameters in order to get rid of dropped-mutation messages if the application data model is not up to the mark.

You can also check that the network isn't dropping packets (ifconfig -a) and that storage (dstat) isn't reporting too-slow disks.

Cheers/Asad
Re: Problem with dropped mutations
Hannu,

Dropped mutations are often a sign of load-shedding due to an overloaded node or cluster. Are you seeing resource saturation like high CPU usage (because the write path is usually CPU-bound) on any of the nodes in your cluster?

Some potential contributing factors that might be causing you to drop mutations are long garbage collection (GC) pauses or large partitions. Do the drops coincide with an increase in requests, a code change, or compaction activity?

--
*Joshua Galbraith* | Senior Software Engineer | New Relic
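One pragmatic way to correlate drops with the events Joshua mentions is to diff the Dropped counters from successive `nodetool tpstats` runs and note what else changed in the window. The snapshot text below is a simplified, made-up excerpt of that output, and the parser only handles this two-column tail, not the full thread-pool table:

```python
def parse_dropped_section(tpstats_text):
    """Extract {message_type: dropped_count} from the Dropped section of
    (simplified) nodetool tpstats output."""
    counts = {}
    for line in tpstats_text.splitlines():
        parts = line.split()
        # Dropped-counter lines are "<MESSAGE_TYPE> <count>"; skip headers.
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0]] = int(parts[1])
    return counts

snapshot_1 = """Message type           Dropped
MUTATION               120
READ                   0
READ_REPAIR            3"""

snapshot_2 = """Message type           Dropped
MUTATION               185
READ                   0
READ_REPAIR            3"""

# Per-type drops that occurred between the two samples.
delta = {k: parse_dropped_section(snapshot_2)[k] - v
         for k, v in parse_dropped_section(snapshot_1).items()}
print(delta)  # {'MUTATION': 65, 'READ': 0, 'READ_REPAIR': 0}
```

Taking a snapshot every few minutes and lining the deltas up against GC logs, compaction activity, and deploy times usually narrows the trigger down quickly.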
Problem with dropped mutations
Hello,

We have a cluster with a somewhat heavy load and we are seeing dropped mutations (the amount varies, and not all nodes have them).

Is there some clear trigger that causes these? What would be the best pragmatic approach to start debugging them? We have already added more memory, which seemed to help somewhat but not completely.

Cheers,
Hannu

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org
Re: Dropped Mutations
Thanks a lot Hitesh! I'll try to re-tune the heap to a lower level.

Shalom Sagges
DBA
T: +972-74-700-4035
Re: Dropped Mutations
Hi,

I'd recommend tuning your heap size further (preferably lower), as a large heap can lead to long garbage-collection pauses, also known as stop-the-world events. A pause occurs when a region of memory is full and the JVM needs to make space to continue. During a pause all operations are suspended. Because a pause affects networking, the node can appear as down to other nodes in the cluster. Additionally, any SELECT and INSERT statements will wait, which increases read and write latencies. Any pause of more than a second, or multiple pauses within a second that add up to a large fraction of that second, should be avoided. The basic cause of the problem is that the rate at which data is stored in memory outpaces the rate at which data can be removed.

MUTATION: if a write message is processed after its timeout (write_request_timeout_in_ms), Cassandra either sent a failure to the client or it met its requested consistency level, and it will rely on hinted handoff and read repairs to complete the mutation if it succeeded.

Another possible cause of the issue could be your HDDs, as they could be a bottleneck too.

*MAX_HEAP_SIZE*
The recommended maximum heap size depends on which GC is used:
Hardware setup | Recommended MAX_HEAP_SIZE
Older computers | Typically 8 GB
CMS for newer computers (8+ cores) with up to 256 GB RAM | No more than 14 GB

Thanks,
Hitesh dua
hiteshd...@gmail.com
Dropped Mutations
Hi All,

I have a 44-node cluster (22 nodes in each DC). Each node has 24 cores, 130 GB RAM, and 3 TB HDDs.
Version: 2.0.14 (soon to be upgraded)
~10K writes per second per node
Heap size: 8 GB max, 2.4 GB newgen

I deployed Reaper and GC started to increase rapidly. I'm not sure if it's because there was a lot of inconsistency in the data, but I decided to increase the heap to 16 GB and newgen to 6 GB. I also increased the max tenuring threshold from 1 to 5.

I tested on a canary node and everything was fine, but when I changed the entire DC, I suddenly saw a lot of dropped mutations in the logs on most of the nodes. (Reaper was not yet running on the cluster, but a manual repair was running.)

Can the heap increase cause lots of dropped mutations?
When is a mutation considered dropped? Is it during flush? During the write to the commit log or memtable?

Thanks!
Re: Dropped Mutations
Dropped mutations aren't data loss. Data loss implies the data was already there and is now gone, whereas for a dropped mutation the data was never there in the first place. A dropped mutation just results in an inconsistency, or potentially no data if all mutations are dropped, and C* will tell you this; it's up to your client to respond accordingly (e.g. re-write the data if it's an idempotent query and your desired CL failed to be achieved).
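The client-side handling described here (re-writing an idempotent query when the desired CL fails) might look like the following sketch. `execute_write` and `WriteTimeout` are hypothetical stand-ins for your driver's statement-execution call and its timeout error; real drivers raise their own exception types:

```python
import time

class WriteTimeout(Exception):
    """Stand-in for a driver's 'consistency level not achieved' error."""

def write_with_retry(execute_write, statement, retries=3, backoff_s=0.1):
    """Retry an *idempotent* write when the desired CL fails.

    Only safe for idempotent statements: a dropped mutation may or may not
    have been applied on some replicas, so re-writing must be harmless.
    """
    for attempt in range(retries + 1):
        try:
            return execute_write(statement)
        except WriteTimeout:
            if attempt == retries:
                raise
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff

# Demo with a fake driver call that fails twice before succeeding.
attempts = []
def flaky(stmt):
    attempts.append(stmt)
    if len(attempts) < 3:
        raise WriteTimeout("CL not achieved")
    return "applied"

print(write_with_retry(flaky, "INSERT ...", backoff_s=0))  # applied
```

For non-idempotent writes (counters, list appends), blind retries can double-apply, which is exactly why the distinction in the message above matters.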
Dropped Mutations
Hello,

Could the following be interpreted to mean that 'dropped mutations' can, in some cases, mean data loss?

http://cassandra.apache.org/doc/latest/faq/index.html#why-message-dropped
"For writes, this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by read repair, hints or a manual repair. *The write operation may also have timeouted as a result*."

Thanks,
N
Re: Increase in dropped mutations after major upgrade from 1.2.18 to 2.0.10
On Mon, Nov 10, 2014 at 12:46 PM, Duncan Sands wrote:
> are the clocks on all your nodes synchronized with each other?

Yes, the servers are synchronized via NTP. Cheers!

--
*Paulo Motta*
Chaordic | *Platform*
www.chaordic.com.br <http://www.chaordic.com.br/>
+55 48 3232.3200
Re: Increase in dropped mutations after major upgrade from 1.2.18 to 2.0.10
Hi Paulo,

are the clocks on all your nodes synchronized with each other?

Ciao, Duncan.
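Assuming you can collect each node's NTP offset (from ntpq/chrony output or a monitoring agent), the clock-sync check Duncan suggests reduces to a skew calculation. The node names and offsets below are invented:

```python
def max_skew_ms(offsets):
    """Largest pairwise clock difference implied by per-node NTP offsets (ms)."""
    return max(offsets.values()) - min(offsets.values())

# Hypothetical per-node offsets from the NTP reference, in milliseconds.
offsets = {"node1": 0.8, "node2": -1.2, "node3": 0.3, "node4": 2.1}

# Skew comfortably under the request timeouts (which are in seconds) is fine;
# large skew makes the cross-node dropped-latency arithmetic lie, because that
# latency is computed from the sender's timestamp against the receiver's clock.
print(max_skew_ms(offsets) < 50)  # True
```

A few milliseconds of skew is normal for NTP-synced hosts; skew approaching the timeout values would itself manufacture "dropped" messages.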
Increase in dropped mutations after major upgrade from 1.2.18 to 2.0.10
Hey,

We've seen a considerable increase in the number of dropped mutations after a major upgrade from 1.2.18 to 2.0.10. I initially thought it was due to the extra load incurred by upgradesstables, but the dropped mutations continue even after all sstables are upgraded.

Additional info: overall (read, write and range) latency improved with the upgrade, which is great, but I don't understand why dropped mutations have increased. I/O and CPU load are pretty much the same; the number of completed tasks is the only metric that increased together with dropped mutations.

I also noticed that the number of "all time blocked" FlushWriter operations is about 5% of completed operations. I don't know if this is related, but in case it helps...

Does anyone have a clue what this could be, or what we should monitor to find out? Any help or JIRA pointers would be kindly appreciated.

Cheers,

--
*Paulo Motta*
Chaordic | *Platform*
www.chaordic.com.br <http://www.chaordic.com.br/>
+55 48 3232.3200
Re: dropped mutations, UnavailableException, and long GC
1. Why 24 GB of heap? Do you really need a heap that large? A bigger heap can lead to longer GC cycles, but 15 min looks too long.
2. Do you have the row cache enabled?
3. How many column families do you have?
4. Enable GC logs and monitor what GC is doing to get an idea of why it is taking so long. You can add the following (in cassandra-env.sh) to enable GC logging:

# GC logging options -- uncomment to enable
# JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
# JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps"
# JVM_OPTS="$JVM_OPTS -XX:+PrintClassHistogram"
# JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
# JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
# JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"

5. Move to Cassandra 0.7.2 if possible. It has the following nice feature: "added flush_largest_memtables_at and reduce_cache_sizes_at options to cassandra.yaml as an escape valve for memory pressure"

Thanks,
Naren
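With -XX:+PrintGCApplicationStoppedTime enabled as suggested above, long stop-the-world pauses (like the multi-minute CMS collections Jeffrey reports) can be pulled out of gc.log with a small parser. The sample lines mimic the HotSpot message format but are not copied from a real log:

```python
import re

# Matches HotSpot's -XX:+PrintGCApplicationStoppedTime output, e.g.
#   "Total time for which application threads were stopped: 0.0421 seconds"
STOPPED = re.compile(r"Total time for which application threads were stopped: "
                     r"(?P<secs>[\d.]+) seconds")

def long_pauses(log_lines, threshold_s=1.0):
    """Return the durations (seconds) of stop-the-world pauses >= threshold."""
    pauses = []
    for line in log_lines:
        m = STOPPED.search(line)
        if m and float(m.group("secs")) >= threshold_s:
            pauses.append(float(m.group("secs")))
    return pauses

sample = [
    "2011-02-24T14:00:01.123: Total time for which application threads were stopped: 0.0421 seconds",
    "2011-02-24T14:05:33.456: Total time for which application threads were stopped: 14.8231 seconds",
]
print(long_pauses(sample))  # [14.8231]
```

Any pause longer than the RPC timeout both makes peers mark the node as dead and guarantees a burst of expired (and therefore dropped) mutations when the node wakes up, which matches both symptoms in this thread.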
dropped mutations, UnavailableException, and long GC
Hey all,

Our setup is 5 machines running Cassandra 0.7.0 with 24 GB of heap and 1.5 TB of disk each, collocated in a DC. We're doing bulk imports from each of the nodes with RF = 2 and write consistency ANY (write perf is very important). The behavior we're seeing is this:

- Nodes often see each other as dead even though none of the nodes actually go down. I suspect this may be due to long GCs. It seems like increasing the RPC timeout could help, but I'm not convinced this is the root of the problem. Note that in this case writes return with the UnavailableException.
- As mentioned, long GCs. We see the ParNew GC doing a lot of smaller collections (a few hundred MB) which are very fast (a few hundred ms), but every once in a while the ConcurrentMarkSweep will take a LONG time (up to 15 min!) to collect upwards of 15 GB at once.
- On some nodes, we see a lot of pending MutationStages build up (e.g. 500K), which leads to the messages "Dropped X MUTATION messages in the last 5000ms," presumably meaning that Cassandra has decided not to write one of the replicas of the data. This is not a HUGE deal, but is less than ideal.
- The end result is that a bunch of writes end up failing with UnavailableExceptions, so not all of our data is getting into Cassandra.

So my question is: what is the best way to avoid this behavior? Our memtable thresholds are fairly low (256 MB), so there should be plenty of heap space to work with. We may experiment with write consistency ONE or ALL to see if the perf hit is not too bad, but I wanted to get some opinions on why this might be happening. Thanks!

-Jeffrey