Cassandra compaction appears to stall, node becomes partially unresponsive

2015-07-22 Thread Bryan Cheng
Hi there,

Within our Cassandra cluster, we're observing, on occasion, one or two
nodes at a time becoming partially unresponsive.

We're running 2.1.7 across the entire cluster.

nodetool still reports the node as being healthy, and it does respond to
some local queries; however, the CPU is pegged at 100%. One common thread
(heh) each time this happens is that there always seems to be one or more
compaction threads running (via nodetool tpstats), and some appear to be
stuck (active count doesn't change, pending count doesn't decrease). A
request for compactionstats hangs with no response.

Each time we've seen this, the only thing that appears to resolve the issue
is a restart of the Cassandra process; the restart does not appear to be
clean, and requires one or more attempts (or a -9 on occasion).

There does not seem to be any pattern to which machines are affected; the
nodes thus far have been different instances on different physical machines
and on different racks.

Has anyone seen this before? Alternatively, when this happens again, what
data can we collect that would help with the debugging process (in addition
to tpstats)?
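
For what it's worth, here is roughly what we plan to capture the next time a
node wedges, before restarting it. This is only a sketch: the PID lookup and
the output path are placeholders, and it assumes nodetool, pgrep, and jstack
are available on the box.

import subprocess, time

def run(cmd, timeout=30):
    # Run a command, but don't hang forever if (e.g.) compactionstats stalls.
    try:
        return subprocess.check_output(cmd, stderr=subprocess.STDOUT,
                                       timeout=timeout).decode()
    except subprocess.TimeoutExpired:
        return "TIMED OUT: " + " ".join(cmd)
    except subprocess.CalledProcessError as e:
        return "FAILED (%d):\n%s" % (e.returncode, e.output.decode())

# Placeholder PID discovery; adjust to however you locate the Cassandra process.
pid = subprocess.check_output(["pgrep", "-f", "CassandraDaemon"]).decode().split()[0]
stamp = time.strftime("%Y%m%dT%H%M%S")

snapshot = [
    ("tpstats", run(["nodetool", "tpstats"])),
    ("compactionstats", run(["nodetool", "compactionstats"])),  # may time out when wedged
    ("jstack", run(["jstack", pid])),                  # thread dump of the stuck compactors
    ("top-threads", run(["top", "-b", "-H", "-n", "1", "-p", pid])),  # per-thread CPU usage
]

with open("/tmp/cassandra-stall-%s.txt" % stamp, "w") as f:
    for name, text in snapshot:
        f.write("==== %s ====\n%s\n\n" % (name, text))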

Thanks in advance,

Bryan


Re: Cassandra compaction appears to stall, node becomes partially unresponsive

2015-07-22 Thread Aiman Parvaiz
Hi Bryan
How's GC behaving on these boxes?

On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng br...@blockcypher.com wrote:

 Hi there,

 Within our Cassandra cluster, we're observing, on occasion, one or two
 nodes at a time becoming partially unresponsive.

 We're running 2.1.7 across the entire cluster.

 nodetool still reports the node as being healthy, and it does respond to
 some local queries; however, the CPU is pegged at 100%. One common thread
 (heh) each time this happens is that there always seems to be one or more
 compaction threads running (via nodetool tpstats), and some appear to be
 stuck (active count doesn't change, pending count doesn't decrease). A
 request for compactionstats hangs with no response.

 Each time we've seen this, the only thing that appears to resolve the
 issue is a restart of the Cassandra process; the restart does not appear to
 be clean, and requires one or more attempts (or a -9 on occasion).

 There does not seem to be any pattern to what machines are affected; the
 nodes thus far have been different instances on different physical machines
 and on different racks.

 Has anyone seen this before? Alternatively, when this happens again, what
 data can we collect that would help with the debugging process (in addition
 to tpstats)?

 Thanks in advance,

 Bryan




-- 
*Aiman Parvaiz*
Lead Systems Architect
ai...@flipagram.com
cell: 213-300-6377
http://flipagram.com/apz


Re: Cassandra compaction appears to stall, node becomes partially unresponsive

2015-07-22 Thread Bryan Cheng
Robert, thanks for these references! We're not using DTCS, so 9056 and 8243
seem out, but I'll take a look at 9577 (I also looked at the referenced
thread on this list, which seems to have some interesting data).

On Wed, Jul 22, 2015 at 5:33 PM, Robert Coli rc...@eventbrite.com wrote:


 I've heard other reports of compaction appearing to stall in 2.1.7...
 wondering if you're affected by any of these...

 https://issues.apache.org/jira/browse/CASSANDRA-9577
 or
 https://issues.apache.org/jira/browse/CASSANDRA-9056 or
 https://issues.apache.org/jira/browse/CASSANDRA-8243 (these should not be
 in 2.1.7)

 =Rob




Re: Cassandra compaction appears to stall, node becomes partially unresponsive

2015-07-22 Thread Bryan Cheng
Hi Aiman,

We previously had issues with GC, but since upgrading to 2.1.7 things seem
a lot healthier.

We collect GC statistics through collectd via the garbage collector MBean;
ParNew GCs report sub-500ms collection time on average (I believe
accumulated per minute?), and CMS peaks at about 300ms collection time when
it runs.
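
(In case it matters how that number is derived: we sample the cumulative
CollectionTime attribute of the java.lang:type=GarbageCollector MBeans through
collectd's GenericJMX plugin and take deltas, roughly like the sketch below.
read_collection_time_ms() is just a stand-in for however the counter is
actually scraped.)

import time

def read_collection_time_ms(collector):
    # Stand-in: return the cumulative CollectionTime (ms) reported by the
    # java.lang:type=GarbageCollector,name=<collector> MBean.
    raise NotImplementedError

def gc_ms_per_minute(collector, interval_s=60):
    before = read_collection_time_ms(collector)
    time.sleep(interval_s)
    after = read_collection_time_ms(collector)
    # CollectionTime only ever increases, so the delta is the time spent in
    # this collector over the sampling interval, normalized to one minute.
    return (after - before) * 60.0 / interval_s

# gc_ms_per_minute("ParNew") staying under ~500 ms and
# gc_ms_per_minute("ConcurrentMarkSweep") peaking around ~300 ms is what we
# have been treating as healthy.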

On Wed, Jul 22, 2015 at 3:22 PM, Aiman Parvaiz ai...@flipagram.com wrote:

 Hi Bryan
 How's GC behaving on these boxes?




Re: Cassandra compaction appears to stall, node becomes partially unresponsive

2015-07-22 Thread Bryan Cheng
Aiman,

Your post made me look back at our data a bit. The most recent occurrence
of this incident was not preceded by any abnormal GC activity; however, the
previous occurrence (which took place a few days ago) did correspond to a
massive, order-of-magnitude increase in both ParNew and CMS collection
times which lasted ~17 hours.

Was there something in particular that links GC to these stalls? At this
point we cannot identify any cause for either that GC spike or the
subsequent apparent compaction stall, although it did not seem to have any
effect on our usage of the cluster.

On Wed, Jul 22, 2015 at 3:35 PM, Bryan Cheng br...@blockcypher.com wrote:

 Hi Aiman,

 We previously had issues with GC, but since upgrading to 2.1.7 things seem
 a lot healthier.

 We collect GC statistics through collectd via the garbage collector mbean,
 ParNew GCs report sub-500ms collection time on average (I believe
 accumulated per minute?) and CMS peaks at about 300ms collection time when
 it runs.






Re: Cassandra compaction appears to stall, node becomes partially unresponsive

2015-07-22 Thread Aiman Parvaiz
I faced something similar in the past, and the reason for the nodes becoming
intermittently unresponsive was long GC pauses. That's why I wanted to bring
this to your attention, in case a GC pause is a potential cause.
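
If you have GC logging enabled on those boxes (-XX:+PrintGCApplicationStoppedTime
in particular), something like this rough sketch can pull the long stop-the-world
pauses out of the GC log around the time of an incident. The log path is only an
example; point it at wherever your JVM writes its GC log.

import re
import sys

THRESHOLD_SECONDS = 1.0
PAUSE = re.compile(r"Total time for which application threads were stopped: "
                   r"([\d.]+) seconds")

path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/cassandra/gc.log"
with open(path) as log:
    for line in log:
        m = PAUSE.search(line)
        if m and float(m.group(1)) >= THRESHOLD_SECONDS:
            print(line.rstrip())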

Sent from my iPhone

 On Jul 22, 2015, at 4:32 PM, Bryan Cheng br...@blockcypher.com wrote:
 
 Aiman,
 
 Your post made me look back at our data a bit. The most recent occurrence of 
 this incident was not preceded by any abnormal GC activity; however, the 
 previous occurrence (which took place a few days ago) did correspond to a 
 massive, order-of-magnitude increase in both ParNew and CMS collection times 
 which lasted ~17 hours.
 
 Was there something in particular that links GC to these stalls? At this 
 point in time, we cannot identify any particular reason for either that GC 
 spike or the subsequent apparent compaction stall, although it did not seem 
 to have any effect on our usage of the cluster.
 


Re: Cassandra compaction appears to stall, node becomes partially unresponsive

2015-07-22 Thread Robert Coli
On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng br...@blockcypher.com wrote:

 nodetool still reports the node as being healthy, and it does respond to
 some local queries; however, the CPU is pegged at 100%. One common thread
 (heh) each time this happens is that there always seems to be one or more
 compaction threads running (via nodetool tpstats), and some appear to be
 stuck (active count doesn't change, pending count doesn't decrease). A
 request for compactionstats hangs with no response.


I've heard other reports of compaction appearing to stall in 2.1.7...
wondering if you're affected by any of these...

https://issues.apache.org/jira/browse/CASSANDRA-9577
or
https://issues.apache.org/jira/browse/CASSANDRA-9056 or
https://issues.apache.org/jira/browse/CASSANDRA-8243 (these should not be
in 2.1.7)

=Rob