[jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load

2016-04-07 Thread Russell Bradberry (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231130#comment-15231130
 ] 

Russell Bradberry edited comment on CASSANDRA-11363 at 4/7/16 9:42 PM:
---

[~pauloricardomg] CASSANDRA-11529 makes sense, because CASSANDRA-9303 was 
backported to 2.1.12 in DSE 4.8.4; that is why we see the issue in that version 
rather than only in 2.1.13.


was (Author: devdazed):
[~pauloricardomg] 11529 makes sense because CASSANDRA-9303 was backported to 
2.1.12 in DSE 4.8.4. Hence why we see it in that version vs only in 2.1.13.

> Blocked NTR When Connecting Causing Excessive Load
> --
>
> Key: CASSANDRA-11363
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11363
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination
>Reporter: Russell Bradberry
>Priority: Critical
> Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack
>
>
> When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the 
> machine load increases to very high levels (> 120 on an 8-core machine) and 
> native transport requests get blocked in tpstats.
> I was able to reproduce this in both CMS and G1GC as well as on JVM 7 and 8.
> The issue does not seem to affect the nodes running 2.1.9.
> The issue seems to coincide with the number of connections OR the number of 
> total requests being processed at a given time (as the latter increases with 
> the former in our system).
> Currently there are between 600 and 800 client connections on each machine, and 
> each machine is handling roughly 2000-3000 client requests per second.
> Disabling the binary protocol fixes the issue for this node but isn't a 
> viable option cluster-wide.
> Here is the output from tpstats:
> {code}
> Pool Name                  Active Pending Completed Blocked All time blocked
> MutationStage              0 88387821 0 0
> ReadStage                  0 0 355860 0 0
> RequestResponseStage       0 72532457 0 0
> ReadRepairStage            0 0150 0 0
> CounterMutationStage       32 104 897560 0 0
> MiscStage                  0 0 0 0 0
> HintedHandoff              0 0 65 0 0
> GossipStage                0 0 2338 0 0
> CacheCleanupExecutor       0 0 0 0 0
> InternalResponseStage      0 0 0 0 0
> CommitLogArchiver          0 0 0 0 0
> CompactionExecutor         2 190474 0 0
> ValidationExecutor         0 0 0 0 0
> MigrationStage             0 0 10 0 0
> AntiEntropyStage           0 0 0 0 0
> PendingRangeCalculator     0 0310 0 0
> Sampler                    0 0 0 0 0
> MemtableFlushWriter        110 94 0 0
> MemtablePostFlush          134257 0 0
> MemtableReclaimMemory      0 0 94 0 0
> Native-Transport-Requests  128 156 38795716 278451
> Message type   Dropped
> READ 0
> RANGE_SLICE  0
> _TRACE   0
> MUTATION 0
> COUNTER_MUTATION 0
> BINARY   0
> REQUEST_RESPONSE 0
> PAGED_RANGE  0
> READ_REPAIR  0
> {code}
> Attached is the jstack output for both CMS and G1GC.
> Flight recordings are here:
> https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr
> https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr
> It is interesting to note that while the flight recording was taking place, 
> the load on the machine went back to healthy levels, and when the flight 
> recording finished the load went back above 100.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load

2016-04-07 Thread Paulo Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231123#comment-15231123
 ] 

Paulo Motta edited comment on CASSANDRA-11363 at 4/7/16 9:41 PM:
-

Yes, a cluster with more vnodes will be more affected by this.


was (Author: pauloricardomg):
yes, cluster with more vnodes will be more affected by this.


[jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load

2016-04-07 Thread Paulo Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231113#comment-15231113
 ] 

Paulo Motta edited comment on CASSANDRA-11363 at 4/7/16 9:35 PM:
-

[~devdazed] Ok, then it's very likely that you're being hit by CASSANDRA-11529. I 
created another ticket in case this one turns out to be a different issue.

[~zznate] [~CRolo] can you double check you're not hitting CASSANDRA-11529?


was (Author: pauloricardomg):
Ok, then it's very likely that you're hit by CASSANDRA-11529. I created another 
ticket in case this one is a different issue.

[~zznate] [~CRolo] can you double check you're not hitting CASSANDRA-11529?


[jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load

2016-04-07 Thread Nate McCall (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231063#comment-15231063
 ] 

Nate McCall edited comment on CASSANDRA-11363 at 4/7/16 9:06 PM:
-

Yeah, that is quite different from what [~arodrime] and I have seen recently. 
Nodes in our case were otherwise well within utilization thresholds.

[~devdazed] Your issue looks like it would be addressed by [~cnlwsu]'s 
reference to CASSANDRA-10200 (lean on DSE folks for a patch). 


was (Author: zznate):
Yeah, that is quite different that what [~arodrime] and I have seen recently. 
Nodes in our case were otherwise well within utilization thresholds.

[~devdazed] Your issue looks like it would be addressed by [~cnlwsu]'s 
reference to CASSANDRA-10200 (lean on them for a patch). 


[jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load

2016-04-07 Thread Russell Bradberry (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231045#comment-15231045
 ] 

Russell Bradberry edited comment on CASSANDRA-11363 at 4/7/16 8:56 PM:
---

We may have two separate issues here. In mine, the issue is 100% CPU 
utilization and ultra-high load when using batches. According to the JFR, all of 
the hot threads are spinning in the following call chain (see the grep sketch 
after the stack below):
{code}
org.apache.cassandra.locator.NetworkTopologyStrategy.hasSufficientReplicas(String, Map, Multimap)
-> org.apache.cassandra.locator.NetworkTopologyStrategy.hasSufficientReplicas(Map, Multimap)
-> org.apache.cassandra.locator.NetworkTopologyStrategy.calculateNaturalEndpoints(Token, TokenMetadata)
{code}
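
As a quick cross-check, the attached jstack dumps can be searched for the same 
frames; a minimal sketch, assuming the attachments are plain-text jstack captures 
as named on this ticket:

{code}
# Count threads referencing the hot frame in each attached dump.
grep -c 'NetworkTopologyStrategy.calculateNaturalEndpoints' \
    cassandra-102-cms.stack cassandra-102-g1gc.stack

# Show a few callers above the hot frame for context in one of the dumps.
grep -B 5 'NetworkTopologyStrategy.calculateNaturalEndpoints' cassandra-102-cms.stack | head -40
{code}

If most of the Native-Transport worker threads show this path, that supports the 
batch/replica-calculation explanation discussed elsewhere in this thread.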


was (Author: devdazed):
we may have two separate issues here, in mine, the issue is 100% CPU 
utilization and ultra high load when using batches.  According to the jfr all 
of hot threads are putting their resources in 
{code}
org.apache.cassandra.locator.NetworkTopologyStrategy.hasSufficientReplicas(String, Map, Multimap)
-> org.apache.cassandra.locator.NetworkTopologyStrategy.hasSufficientReplicas(Map, Multimap)
-> org.apache.cassandra.locator.NetworkTopologyStrategy.calculateNaturalEndpoints(Token, TokenMetadata)
{code}


[jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load

2016-04-07 Thread Paulo Motta (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231025#comment-15231025
 ] 

Paulo Motta edited comment on CASSANDRA-11363 at 4/7/16 8:44 PM:
-

That's right, that was just a wild guess. I focused my quick investigation on the 
2.1.12 and 2.1.13 commits, but some deeper investigation is probably needed.

It's probably a good idea to backport CASSANDRA-10044 to versions < 2.1.12, set 
up a coordinator-only node, and check where the problem started happening to 
narrow down the scope.
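
A coordinator-only node can be obtained by starting Cassandra without joining the 
ring; a minimal sketch, assuming the standard cassandra.join_ring startup property 
and the stock bin/cassandra script (verify against your packaging):

{code}
# Start a node that can coordinate client requests but does not join the ring
# (owns no token ranges), useful for bisecting coordinator-side regressions.
cassandra -f -Dcassandra.join_ring=false
{code}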


was (Author: pauloricardomg):
that's right, that was just a wild guess. I focused my quick investigation on 
2.1.12 and 2.1.13 commits, but some deeper investigation is probably needed.

probably it's a good idea to backport CASSANDRA-10044 to versions <= 2.1.12 and 
setup a coordinator-only node and check where it started happening to narrow 
down the scope.


[jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load

2016-04-07 Thread Chris Lohfink (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230670#comment-15230670
 ] 

Chris Lohfink edited comment on CASSANDRA-11363 at 4/7/16 5:54 PM:
---

fixed by CASSANDRA-10200 ?


was (Author: cnlwsu):
caused by CASSANDRA-10200 ?


[jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load

2016-04-07 Thread Alain RODRIGUEZ (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230325#comment-15230325
 ] 

Alain RODRIGUEZ edited comment on CASSANDRA-11363 at 4/7/16 2:37 PM:
-

I also observed in C* 2.1.12 that a certain percentage of the 
Native-Transport-Requests are blocked, yet there is no major CPU or resource 
issue on my side, so it might not be related.

For what it is worth, here is something I observed about 
Native-Transport-Requests: increasing the 'native_transport_max_threads' value 
helps mitigate this, as expected, but the blocked Native-Transport-Requests 
count is still non-zero (a single-node variant of this check is sketched after 
the output below).

{noformat}
[alain~]$ knife ssh "role:cassandra" "nodetool tpstats | grep Native-Transport-Requests" | grep -e server1 -e server2 -e server3 -e server4 | sort | awk 'BEGIN { printf "%50s %10s","Server |"," Blocked ratio:\n" } { printf "%50s %10f%\n", $1, (($7/$5)*100) }'
   Server |  Blocked ratio:
   server3   0.044902%
   server4   0.030127%
   server2   0.045759%
   server1   0.082763%
{noformat}
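
For a single node, without the knife/ssh wrapper, the same ratio can be computed 
directly; a minimal sketch, assuming the 2.1-era 'nodetool tpstats' column order 
(Active, Pending, Completed, Blocked, All time blocked):

{noformat}
# Blocked ratio for Native-Transport-Requests = "All time blocked" / "Completed".
nodetool tpstats | awk '/^Native-Transport-Requests/ { printf "Blocked ratio: %f%%\n", ($6/$4)*100 }'
{noformat}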

I waited long enough between the change and the result capture: many days, 
probably a few weeks. As all the nodes are in the same datacenter, under a 
(fairly) balanced load, the comparison is probably relevant.

Here are the results for those nodes in our use case (an illustrative 
cassandra.yaml fragment follows the table).

||Server||native_transport_max_threads||Percentage of blocked threads||
|server1|128|0.082763%|
|server2|384|0.044902%|
|server3|512|0.045759%|
|server4|1024|0.030127%|
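
For reference, this setting lives in cassandra.yaml; the value below is purely 
illustrative (not a recommendation), and in 2.1 a change is typically applied 
with a rolling restart:

{noformat}
# cassandra.yaml -- illustrative value only
# native_transport_max_threads defaults to 128 when left commented out
native_transport_max_threads: 512
{noformat}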

Also, judging from mailing list threads, it looks like it is quite common to have 
some Native-Transport-Requests blocked; this is probably unavoidable depending on 
the network and the use case (spiky workloads?). 


was (Author: arodrime):
I also observed in C*2.1.12 that a certain percentage of the 
Native-Transport-Requests are blocked, yet no major CPU or resources issue on 
my side though, it might then not be related.

For what it is worth here is something I observed about 
Native-Transport-Requests: increasing the 'native_transport_max_threads' value 
help mitigating this, as expected, but Native-Transport-Requests number is 
still a non zero value.

{noformat}
[alain@bastion-d3-prod ~]$ knife ssh "role:cassandra" "nodetool tpstats | grep Native-Transport-Requests" | grep -e server1 -e server2 -e server3 -e server4 | sort | awk 'BEGIN { printf "%50s %10s","Server |"," Blocked ratio:\n" } { printf "%50s %10f%\n", $1, (($7/$5)*100) }'
  Server |  Blocked ratio:
   ip-172-17-42-105.us-west-2.compute.internal   0.044902%
   ip-172-17-42-107.us-west-2.compute.internal   0.030127%
   ip-172-17-42-114.us-west-2.compute.internal   0.045759%
   ip-172-17-42-116.us-west-2.compute.internal   0.082763%
{noformat}

I waited long enough between the change and the result capture, many days, 
probably a few weeks. As all the nodes are in the same datacenter, under a 
(fairly) balanced load, this is probably relevant.

Here are the result for those nodes, in our use case.

||Server||native_transport_max_threads||Percentage of blocked threads||
|server1|128|0.082763%|
|server2|384|0.044902%|
|server3|512|0.045759%|
|server4|1024|0.030127%|

Also from the mailing list outputs, it looks like it is quite common to have 
some Native-Transport-Requests blocked, probably unavoidable depending on the 
network and use cases (spiky workloads?). 
