Cassandra compaction appears to stall, node becomes partially unresponsive
Hi there,

Within our Cassandra cluster, we're observing, on occasion, one or two nodes at a time becoming partially unresponsive. We're running 2.1.7 across the entire cluster. nodetool still reports the node as healthy, and it does respond to some local queries; however, the CPU is pegged at 100%.

One common thread (heh) each time this happens is that there always seems to be one or more compaction threads running (via nodetool tpstats), and some appear to be stuck (active count doesn't change, pending count doesn't decrease). A request for compactionstats hangs with no response.

Each time we've seen this, the only thing that appears to resolve the issue is a restart of the Cassandra process; the restart does not appear to be clean, and requires one or more attempts (or a kill -9 on occasion). There does not seem to be any pattern to which machines are affected; the nodes thus far have been different instances on different physical machines and on different racks.

Has anyone seen this before? Alternatively, when this happens again, what data can we collect that would help with the debugging process (in addition to tpstats)?

Thanks in advance,
Bryan
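[A capture sketch for the next occurrence. This is not from the thread: it assumes nodetool and jstack are on PATH and that the PID is supplied by the operator; the output directory is illustrative. Every external call is bounded with a timeout, since tpstats and compactionstats can themselves hang on a wedged node.]

```python
import os
import subprocess

def run(cmd, timeout=30):
    """Run a command with a hard timeout; a wedged node can hang nodetool."""
    try:
        p = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return p.stdout + p.stderr
    except (subprocess.TimeoutExpired, FileNotFoundError) as exc:
        return "capture failed: %r\n" % (exc,)

def collect(outdir, pid=None, dumps=3):
    """Grab tpstats/compactionstats, plus repeated thread dumps if a PID is given."""
    os.makedirs(outdir, exist_ok=True)
    for name, cmd in [("tpstats", ["nodetool", "tpstats"]),
                      ("compactionstats", ["nodetool", "compactionstats"])]:
        with open(os.path.join(outdir, name + ".txt"), "w") as f:
            f.write(run(cmd))
    # Several thread dumps taken a few seconds apart show whether the
    # "stuck" compaction threads are pinned on one frame or still moving.
    if pid is not None:
        for i in range(dumps):
            with open(os.path.join(outdir, "jstack.%d.txt" % i), "w") as f:
                f.write(run(["jstack", str(pid)]))
    return outdir

collect("/tmp/cassandra-stall")
```

Comparing consecutive jstack dumps for the CompactionExecutor threads is usually the fastest way to tell a livelock from slow progress.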
Re: Cassandra compaction appears to stall, node becomes partially unresponsive
Hi Bryan,

How's GC behaving on these boxes?

On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng br...@blockcypher.com wrote:
> [original message quoted in full; snipped]

--
Aiman Parvaiz
Lead Systems Architect
ai...@flipagram.com
cell: 213-300-6377
http://flipagram.com/apz
Re: Cassandra compaction appears to stall, node becomes partially unresponsive
Robert, thanks for these references! We're not using DTCS, so 9056 and 8243 seem out, but I'll take a look at 9577 (also looked at the referenced thread on this list, which seems to have some interesting data).

On Wed, Jul 22, 2015 at 5:33 PM, Robert Coli rc...@eventbrite.com wrote:
> [Robert's reply quoted in full; snipped]
Re: Cassandra compaction appears to stall, node becomes partially unresponsive
Hi Aiman,

We previously had issues with GC, but since upgrading to 2.1.7 things seem a lot healthier. We collect GC statistics through collectd via the garbage collector MBean; ParNew GCs report sub-500ms collection time on average (I believe accumulated per minute?) and CMS peaks at about 300ms collection time when it runs.

On Wed, Jul 22, 2015 at 3:22 PM, Aiman Parvaiz ai...@flipagram.com wrote:
> Hi Bryan
> How's GC behaving on these boxes?
> [earlier quoted messages snipped]
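[An aside on the "accumulated per minute" uncertainty above. The GarbageCollector MBeans (java.lang:type=GarbageCollector,name=ParNew and name=ConcurrentMarkSweep) expose CollectionTime as a cumulative millisecond counter, so a per-minute figure like collectd reports is the delta between successive samples. A small sketch of that conversion, with illustrative sample values:]

```python
def gc_rate(samples):
    """samples: list of (epoch_seconds, cumulative_collection_time_ms).

    Returns the GC time accumulated per minute between each pair of
    consecutive samples, which is what a collectd-style rate graph shows.
    """
    rates = []
    for (t0, m0), (t1, m1) in zip(samples, samples[1:]):
        rates.append((m1 - m0) * 60.0 / (t1 - t0))
    return rates

# 400 ms of ParNew time accumulated across a one-minute sampling interval:
print(gc_rate([(0, 1000), (60, 1400)]))  # [400.0]
```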
Re: Cassandra compaction appears to stall, node becomes partially unresponsive
Aiman,

Your post made me look back at our data a bit. The most recent occurrence of this incident was not preceded by any abnormal GC activity; however, the previous occurrence (which took place a few days ago) did correspond to a massive, order-of-magnitude increase in both ParNew and CMS collection times, which lasted ~17 hours. Was there something in particular that links GC to these stalls?

At this point, we cannot identify any particular reason for either that GC spike or the subsequent apparent compaction stall, although it did not seem to have any effect on our usage of the cluster.

On Wed, Jul 22, 2015 at 3:35 PM, Bryan Cheng br...@blockcypher.com wrote:
> [earlier messages in the thread snipped]
Re: Cassandra compaction appears to stall, node becomes partially unresponsive
I faced something similar in the past, and the reason for nodes becoming unresponsive intermittently was long GC pauses. That's why I wanted to bring this to your attention, in case GC pause is a potential cause.

Sent from my iPhone

On Jul 22, 2015, at 4:32 PM, Bryan Cheng br...@blockcypher.com wrote:
> [earlier messages in the thread snipped]
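[To check the long-pause theory, GC logging (the -Xloggc / -XX:+PrintGCDetails settings in cassandra-env.sh) gives per-event pause times. The exact log format depends on JVM flags; the "real=N.NN secs" tail below is the PrintGCDetails style, so treat the regex as an assumption to adapt. A quick filter for pauses over a threshold:]

```python
import re

# Matches the wall-clock pause in PrintGCDetails-style lines, e.g.
# "[Times: user=0.12 sys=0.01, real=4.21 secs]" (format is flag-dependent).
PAUSE_RE = re.compile(r"real=(\d+\.\d+) secs")

def long_pauses(lines, threshold_secs=1.0):
    """Return wall-clock GC pauses (seconds) at or above the threshold."""
    hits = []
    for line in lines:
        m = PAUSE_RE.search(line)
        if m and float(m.group(1)) >= threshold_secs:
            hits.append(float(m.group(1)))
    return hits

log = [
    "2015-07-22T16:01:02: [GC (Allocation Failure) ... real=0.03 secs]",
    "2015-07-22T16:14:40: [Full GC (CMS) ... real=4.21 secs]",
]
print(long_pauses(log))  # [4.21]
```

Multi-second pauses clustered around the stall window would point at GC; their absence would push suspicion back toward compaction itself.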
Re: Cassandra compaction appears to stall, node becomes partially unresponsive
On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng br...@blockcypher.com wrote:
> nodetool still reports the node as being healthy, and it does respond to some local queries; however, the CPU is pegged at 100%. One common thread (heh) each time this happens is that there always seems to be one or more compaction threads running (via nodetool tpstats), and some appear to be stuck (active count doesn't change, pending count doesn't decrease). A request for compactionstats hangs with no response.

I've heard other reports of compaction appearing to stall in 2.1.7... wondering if you're affected by any of these:

https://issues.apache.org/jira/browse/CASSANDRA-9577
or https://issues.apache.org/jira/browse/CASSANDRA-9056
or https://issues.apache.org/jira/browse/CASSANDRA-8243

(these should not be in 2.1.7)

=Rob