[jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231130#comment-15231130 ] Russell Bradberry edited comment on CASSANDRA-11363 at 4/7/16 9:42 PM: --- [~pauloricardomg] CASSANDRA-11529 makes sense because CASSANDRA-9303 was backported to 2.1.12 in DSE 4.8.4. Hence why we see it in that version vs only in 2.1.13. was (Author: devdazed): [~pauloricardomg] 11529 makes sense because CASSANDRA-9303 was backported to 2.1.12 in DSE 4.8.4. Hence why we see it in that version vs only in 2.1.13. > Blocked NTR When Connecting Causing Excessive Load > -- > > Key: CASSANDRA-11363 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11363 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Russell Bradberry >Priority: Critical > Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack > > > When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the > machine load increases to very high levels (> 120 on an 8 core machine) and > native transport requests get blocked in tpstats. > I was able to reproduce this in both CMS and G1GC as well as on JVM 7 and 8. > The issue does not seem to affect the nodes running 2.1.9. > The issue seems to coincide with the number of connections OR the number of > total requests being processed at a given time (as the latter increases with > the former in our system) > Currently there is between 600 and 800 client connections on each machine and > each machine is handling roughly 2000-3000 client requests per second. > Disabling the binary protocol fixes the issue for this node but isn't a > viable option cluster-wide. > Here is the output from tpstats: > {code} > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 88387821 0 > 0 > ReadStage 0 0 355860 0 > 0 > RequestResponseStage 0 72532457 0 > 0 > ReadRepairStage 0 0150 0 > 0 > CounterMutationStage 32 104 897560 0 > 0 > MiscStage 0 0 0 0 > 0 > HintedHandoff 0 0 65 0 > 0 > GossipStage 0 0 2338 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > CommitLogArchiver 0 0 0 0 > 0 > CompactionExecutor2 190474 0 > 0 > ValidationExecutor0 0 0 0 > 0 > MigrationStage0 0 10 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > PendingRangeCalculator0 0310 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 110 94 0 > 0 > MemtablePostFlush 134257 0 > 0 > MemtableReclaimMemory 0 0 94 0 > 0 > Native-Transport-Requests 128 156 38795716 > 278451 > Message type Dropped > READ 0 > RANGE_SLICE 0 > _TRACE 0 > MUTATION 0 > COUNTER_MUTATION 0 > BINARY 0 > REQUEST_RESPONSE 0 > PAGED_RANGE 0 > READ_REPAIR 0 > {code} > Attached is the jstack output for both CMS and G1GC. > Flight recordings are here: > https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr > https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr > It is interesting to note that while the flight recording was taking place, > the load on the machine went back to healthy, and when the flight recording > finished the load went back to > 100. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231123#comment-15231123 ] Paulo Motta edited comment on CASSANDRA-11363 at 4/7/16 9:41 PM: - yes, cluster with more vnodes will be more affected by that. was (Author: pauloricardomg): yes, cluster with more vnodes will be more affected by this. > Blocked NTR When Connecting Causing Excessive Load > -- > > Key: CASSANDRA-11363 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11363 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Russell Bradberry >Priority: Critical > Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack > > > When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the > machine load increases to very high levels (> 120 on an 8 core machine) and > native transport requests get blocked in tpstats. > I was able to reproduce this in both CMS and G1GC as well as on JVM 7 and 8. > The issue does not seem to affect the nodes running 2.1.9. > The issue seems to coincide with the number of connections OR the number of > total requests being processed at a given time (as the latter increases with > the former in our system) > Currently there is between 600 and 800 client connections on each machine and > each machine is handling roughly 2000-3000 client requests per second. > Disabling the binary protocol fixes the issue for this node but isn't a > viable option cluster-wide. > Here is the output from tpstats: > {code} > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 88387821 0 > 0 > ReadStage 0 0 355860 0 > 0 > RequestResponseStage 0 72532457 0 > 0 > ReadRepairStage 0 0150 0 > 0 > CounterMutationStage 32 104 897560 0 > 0 > MiscStage 0 0 0 0 > 0 > HintedHandoff 0 0 65 0 > 0 > GossipStage 0 0 2338 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > CommitLogArchiver 0 0 0 0 > 0 > CompactionExecutor2 190474 0 > 0 > ValidationExecutor0 0 0 0 > 0 > MigrationStage0 0 10 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > PendingRangeCalculator0 0310 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 110 94 0 > 0 > MemtablePostFlush 134257 0 > 0 > MemtableReclaimMemory 0 0 94 0 > 0 > Native-Transport-Requests 128 156 38795716 > 278451 > Message type Dropped > READ 0 > RANGE_SLICE 0 > _TRACE 0 > MUTATION 0 > COUNTER_MUTATION 0 > BINARY 0 > REQUEST_RESPONSE 0 > PAGED_RANGE 0 > READ_REPAIR 0 > {code} > Attached is the jstack output for both CMS and G1GC. > Flight recordings are here: > https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr > https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr > It is interesting to note that while the flight recording was taking place, > the load on the machine went back to healthy, and when the flight recording > finished the load went back to > 100. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231113#comment-15231113 ] Paulo Motta edited comment on CASSANDRA-11363 at 4/7/16 9:35 PM: - [~devdazed] Ok, then it's very likely that you're hit by CASSANDRA-11529. I created another ticket in case this one is a different issue. [~zznate] [~CRolo] can you double check you're not hitting CASSANDRA-11529? was (Author: pauloricardomg): Ok, then it's very likely that you're hit by CASSANDRA-11529. I created another ticket in case this one is a different issue. [~zznate] [~CRolo] can you double check you're not hitting CASSANDRA-11529? > Blocked NTR When Connecting Causing Excessive Load > -- > > Key: CASSANDRA-11363 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11363 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Russell Bradberry >Priority: Critical > Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack > > > When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the > machine load increases to very high levels (> 120 on an 8 core machine) and > native transport requests get blocked in tpstats. > I was able to reproduce this in both CMS and G1GC as well as on JVM 7 and 8. > The issue does not seem to affect the nodes running 2.1.9. > The issue seems to coincide with the number of connections OR the number of > total requests being processed at a given time (as the latter increases with > the former in our system) > Currently there is between 600 and 800 client connections on each machine and > each machine is handling roughly 2000-3000 client requests per second. > Disabling the binary protocol fixes the issue for this node but isn't a > viable option cluster-wide. > Here is the output from tpstats: > {code} > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 88387821 0 > 0 > ReadStage 0 0 355860 0 > 0 > RequestResponseStage 0 72532457 0 > 0 > ReadRepairStage 0 0150 0 > 0 > CounterMutationStage 32 104 897560 0 > 0 > MiscStage 0 0 0 0 > 0 > HintedHandoff 0 0 65 0 > 0 > GossipStage 0 0 2338 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > CommitLogArchiver 0 0 0 0 > 0 > CompactionExecutor2 190474 0 > 0 > ValidationExecutor0 0 0 0 > 0 > MigrationStage0 0 10 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > PendingRangeCalculator0 0310 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 110 94 0 > 0 > MemtablePostFlush 134257 0 > 0 > MemtableReclaimMemory 0 0 94 0 > 0 > Native-Transport-Requests 128 156 38795716 > 278451 > Message type Dropped > READ 0 > RANGE_SLICE 0 > _TRACE 0 > MUTATION 0 > COUNTER_MUTATION 0 > BINARY 0 > REQUEST_RESPONSE 0 > PAGED_RANGE 0 > READ_REPAIR 0 > {code} > Attached is the jstack output for both CMS and G1GC. > Flight recordings are here: > https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr > https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr > It is interesting to note that while the flight recording was taking place, > the load on the machine went back to healthy, and when the flight recording > finished the load went back to > 100. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231063#comment-15231063 ] Nate McCall edited comment on CASSANDRA-11363 at 4/7/16 9:06 PM: - Yeah, that is quite different that what [~arodrime] and I have seen recently. Nodes in our case were otherwise well within utilization thresholds. [~devdazed] Your issue looks like it would be addressed by [~cnlwsu]'s reference to CASSANDRA-10200 (lean on DSE folks for a patch). was (Author: zznate): Yeah, that is quite different that what [~arodrime] and I have seen recently. Nodes in our case were otherwise well within utilization thresholds. [~devdazed] Your issue looks like it would be addressed by [~cnlwsu]'s reference to CASSANDRA-10200 (lean on them for a patch). > Blocked NTR When Connecting Causing Excessive Load > -- > > Key: CASSANDRA-11363 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11363 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Russell Bradberry >Priority: Critical > Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack > > > When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the > machine load increases to very high levels (> 120 on an 8 core machine) and > native transport requests get blocked in tpstats. > I was able to reproduce this in both CMS and G1GC as well as on JVM 7 and 8. > The issue does not seem to affect the nodes running 2.1.9. > The issue seems to coincide with the number of connections OR the number of > total requests being processed at a given time (as the latter increases with > the former in our system) > Currently there is between 600 and 800 client connections on each machine and > each machine is handling roughly 2000-3000 client requests per second. > Disabling the binary protocol fixes the issue for this node but isn't a > viable option cluster-wide. > Here is the output from tpstats: > {code} > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 88387821 0 > 0 > ReadStage 0 0 355860 0 > 0 > RequestResponseStage 0 72532457 0 > 0 > ReadRepairStage 0 0150 0 > 0 > CounterMutationStage 32 104 897560 0 > 0 > MiscStage 0 0 0 0 > 0 > HintedHandoff 0 0 65 0 > 0 > GossipStage 0 0 2338 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > CommitLogArchiver 0 0 0 0 > 0 > CompactionExecutor2 190474 0 > 0 > ValidationExecutor0 0 0 0 > 0 > MigrationStage0 0 10 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > PendingRangeCalculator0 0310 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 110 94 0 > 0 > MemtablePostFlush 134257 0 > 0 > MemtableReclaimMemory 0 0 94 0 > 0 > Native-Transport-Requests 128 156 38795716 > 278451 > Message type Dropped > READ 0 > RANGE_SLICE 0 > _TRACE 0 > MUTATION 0 > COUNTER_MUTATION 0 > BINARY 0 > REQUEST_RESPONSE 0 > PAGED_RANGE 0 > READ_REPAIR 0 > {code} > Attached is the jstack output for both CMS and G1GC. > Flight recordings are here: > https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr > https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr > It is interesting to note that while the flight recording was taking place, > the load on the machine went back to healthy, and when the flight recording > finished the load went back to >
[jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231045#comment-15231045 ] Russell Bradberry edited comment on CASSANDRA-11363 at 4/7/16 8:56 PM: --- we may have two separate issues here, in mine, the issue is 100% CPU utilization and ultra high load when using batches. According to the jfr all of hot threads are spinning on {code} org.apache.cassandra.locator.NetworkTopologyStrategy.hasSufficientReplicas(String, Map, Multimap) -> org.apache.cassandra.locator.NetworkTopologyStrategy.hasSufficientReplicas(Map, Multimap) -> org.apache.cassandra.locator.NetworkTopologyStrategy.calculateNaturalEndpoints(Token, TokenMetadata) {code} was (Author: devdazed): we may have two separate issues here, in mine, the issue is 100% CPU utilization and ultra high load when using batches. According to the jfr all of hot threads are putting their resources in {code} org.apache.cassandra.locator.NetworkTopologyStrategy.hasSufficientReplicas(String, Map, Multimap) -> org.apache.cassandra.locator.NetworkTopologyStrategy.hasSufficientReplicas(Map, Multimap) -> org.apache.cassandra.locator.NetworkTopologyStrategy.calculateNaturalEndpoints(Token, TokenMetadata) {code} > Blocked NTR When Connecting Causing Excessive Load > -- > > Key: CASSANDRA-11363 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11363 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Russell Bradberry >Priority: Critical > Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack > > > When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the > machine load increases to very high levels (> 120 on an 8 core machine) and > native transport requests get blocked in tpstats. > I was able to reproduce this in both CMS and G1GC as well as on JVM 7 and 8. > The issue does not seem to affect the nodes running 2.1.9. > The issue seems to coincide with the number of connections OR the number of > total requests being processed at a given time (as the latter increases with > the former in our system) > Currently there is between 600 and 800 client connections on each machine and > each machine is handling roughly 2000-3000 client requests per second. > Disabling the binary protocol fixes the issue for this node but isn't a > viable option cluster-wide. > Here is the output from tpstats: > {code} > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 88387821 0 > 0 > ReadStage 0 0 355860 0 > 0 > RequestResponseStage 0 72532457 0 > 0 > ReadRepairStage 0 0150 0 > 0 > CounterMutationStage 32 104 897560 0 > 0 > MiscStage 0 0 0 0 > 0 > HintedHandoff 0 0 65 0 > 0 > GossipStage 0 0 2338 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > CommitLogArchiver 0 0 0 0 > 0 > CompactionExecutor2 190474 0 > 0 > ValidationExecutor0 0 0 0 > 0 > MigrationStage0 0 10 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > PendingRangeCalculator0 0310 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 110 94 0 > 0 > MemtablePostFlush 134257 0 > 0 > MemtableReclaimMemory 0 0 94 0 > 0 > Native-Transport-Requests 128 156 38795716 > 278451 > Message type Dropped > READ 0 > RANGE_SLICE 0 > _TRACE 0 > MUTATION 0 > COUNTER_MUTATION 0 > BINARY 0 > REQUEST_RESPONSE 0 >
[jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231025#comment-15231025 ] Paulo Motta edited comment on CASSANDRA-11363 at 4/7/16 8:44 PM: - that's right, that was just a wild guess. I focused my quick investigation on 2.1.12 and 2.1.13 commits, but some deeper investigation is probably needed. probably it's a good idea to backport CASSANDRA-10044 to versions < 2.1.12 and setup a coordinator-only node and check where it started happening to narrow down the scope. was (Author: pauloricardomg): that's right, that was just a wild guess. I focused my quick investigation on 2.1.12 and 2.1.13 commits, but some deeper investigation is probably needed. probably it's a good idea to backport CASSANDRA-10044 to versions <= 2.1.12 and setup a coordinator-only node and check where it started happening to narrow down the scope. > Blocked NTR When Connecting Causing Excessive Load > -- > > Key: CASSANDRA-11363 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11363 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Russell Bradberry >Priority: Critical > Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack > > > When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the > machine load increases to very high levels (> 120 on an 8 core machine) and > native transport requests get blocked in tpstats. > I was able to reproduce this in both CMS and G1GC as well as on JVM 7 and 8. > The issue does not seem to affect the nodes running 2.1.9. > The issue seems to coincide with the number of connections OR the number of > total requests being processed at a given time (as the latter increases with > the former in our system) > Currently there is between 600 and 800 client connections on each machine and > each machine is handling roughly 2000-3000 client requests per second. > Disabling the binary protocol fixes the issue for this node but isn't a > viable option cluster-wide. > Here is the output from tpstats: > {code} > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 88387821 0 > 0 > ReadStage 0 0 355860 0 > 0 > RequestResponseStage 0 72532457 0 > 0 > ReadRepairStage 0 0150 0 > 0 > CounterMutationStage 32 104 897560 0 > 0 > MiscStage 0 0 0 0 > 0 > HintedHandoff 0 0 65 0 > 0 > GossipStage 0 0 2338 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > CommitLogArchiver 0 0 0 0 > 0 > CompactionExecutor2 190474 0 > 0 > ValidationExecutor0 0 0 0 > 0 > MigrationStage0 0 10 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > PendingRangeCalculator0 0310 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 110 94 0 > 0 > MemtablePostFlush 134257 0 > 0 > MemtableReclaimMemory 0 0 94 0 > 0 > Native-Transport-Requests 128 156 38795716 > 278451 > Message type Dropped > READ 0 > RANGE_SLICE 0 > _TRACE 0 > MUTATION 0 > COUNTER_MUTATION 0 > BINARY 0 > REQUEST_RESPONSE 0 > PAGED_RANGE 0 > READ_REPAIR 0 > {code} > Attached is the jstack output for both CMS and G1GC. > Flight recordings are here: > https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr > https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr > It is interesting to note that while the flight recording was taking place, > the
[jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230670#comment-15230670 ] Chris Lohfink edited comment on CASSANDRA-11363 at 4/7/16 5:54 PM: --- fixed by CASSANDRA-10200 ? was (Author: cnlwsu): caused by CASSANDRA-10200 ? > Blocked NTR When Connecting Causing Excessive Load > -- > > Key: CASSANDRA-11363 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11363 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Russell Bradberry >Priority: Critical > Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack > > > When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the > machine load increases to very high levels (> 120 on an 8 core machine) and > native transport requests get blocked in tpstats. > I was able to reproduce this in both CMS and G1GC as well as on JVM 7 and 8. > The issue does not seem to affect the nodes running 2.1.9. > The issue seems to coincide with the number of connections OR the number of > total requests being processed at a given time (as the latter increases with > the former in our system) > Currently there is between 600 and 800 client connections on each machine and > each machine is handling roughly 2000-3000 client requests per second. > Disabling the binary protocol fixes the issue for this node but isn't a > viable option cluster-wide. > Here is the output from tpstats: > {code} > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 88387821 0 > 0 > ReadStage 0 0 355860 0 > 0 > RequestResponseStage 0 72532457 0 > 0 > ReadRepairStage 0 0150 0 > 0 > CounterMutationStage 32 104 897560 0 > 0 > MiscStage 0 0 0 0 > 0 > HintedHandoff 0 0 65 0 > 0 > GossipStage 0 0 2338 0 > 0 > CacheCleanupExecutor 0 0 0 0 > 0 > InternalResponseStage 0 0 0 0 > 0 > CommitLogArchiver 0 0 0 0 > 0 > CompactionExecutor2 190474 0 > 0 > ValidationExecutor0 0 0 0 > 0 > MigrationStage0 0 10 0 > 0 > AntiEntropyStage 0 0 0 0 > 0 > PendingRangeCalculator0 0310 0 > 0 > Sampler 0 0 0 0 > 0 > MemtableFlushWriter 110 94 0 > 0 > MemtablePostFlush 134257 0 > 0 > MemtableReclaimMemory 0 0 94 0 > 0 > Native-Transport-Requests 128 156 38795716 > 278451 > Message type Dropped > READ 0 > RANGE_SLICE 0 > _TRACE 0 > MUTATION 0 > COUNTER_MUTATION 0 > BINARY 0 > REQUEST_RESPONSE 0 > PAGED_RANGE 0 > READ_REPAIR 0 > {code} > Attached is the jstack output for both CMS and G1GC. > Flight recordings are here: > https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr > https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr > It is interesting to note that while the flight recording was taking place, > the load on the machine went back to healthy, and when the flight recording > finished the load went back to > 100. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230325#comment-15230325 ] Alain RODRIGUEZ edited comment on CASSANDRA-11363 at 4/7/16 2:37 PM: - I also observed in C*2.1.12 that a certain percentage of the Native-Transport-Requests are blocked, yet no major CPU or resources issue on my side though, it might then not be related. For what it is worth here is something I observed about Native-Transport-Requests: increasing the 'native_transport_max_threads' value help mitigating this, as expected, but Native-Transport-Requests number is still a non zero value. {noformat} [alain~]$ knife ssh "role:cassandra" "nodetool tpstats | grep Native-Transport-Requests" | grep -e server1 -e server2 -e server3 -e server4 | sort | awk 'BEGIN { printf "%50s %10s","Server |"," Blocked ratio:\n" } { printf "%50s %10f%\n", $1, (($7/$5)*100) }' Server | Blocked ratio: server3 0.044902% server4 0.030127% server2 0.045759% server1 0.082763% {noformat} I waited long enough between the change and the result capture, many days, probably a few weeks. As all the nodes are in the same datacenter, under a (fairly) balanced load, this is probably relevant. Here are the result for those nodes, in our use case. ||Server||native_transport_max_threads||Percentage of blocked threads|| |server1|128|0.082763%| |server2|384|0.044902%| |server3|512|0.045759%| |server4|1024|0.030127%| Also from the mailing list outputs, it looks like it is quite common to have some Native-Transport-Requests blocked, probably unavoidable depending on the network and use cases (spiky workloads?). was (Author: arodrime): I also observed in C*2.1.12 that a certain percentage of the Native-Transport-Requests are blocked, yet no major CPU or resources issue on my side though, it might then not be related. For what it is worth here is something I observed about Native-Transport-Requests: increasing the 'native_transport_max_threads' value help mitigating this, as expected, but Native-Transport-Requests number is still a non zero value. {noformat} [alain@bastion-d3-prod ~]$ knife ssh "role:cassandra" "nodetool tpstats | grep Native-Transport-Requests" | grep -e server1 -e server2 -e server3 -e server4 | sort | awk 'BEGIN { printf "%50s %10s","Server |"," Blocked ratio:\n" } { printf "%50s %10f%\n", $1, (($7/$5)*100) }' Server | Blocked ratio: ip-172-17-42-105.us-west-2.compute.internal 0.044902% ip-172-17-42-107.us-west-2.compute.internal 0.030127% ip-172-17-42-114.us-west-2.compute.internal 0.045759% ip-172-17-42-116.us-west-2.compute.internal 0.082763% {noformat} I waited long enough between the change and the result capture, many days, probably a few weeks. As all the nodes are in the same datacenter, under a (fairly) balanced load, this is probably relevant. Here are the result for those nodes, in our use case. ||Server||native_transport_max_threads||Percentage of blocked threads|| |server1|128|0.082763%| |server2|384|0.044902%| |server3|512|0.045759%| |server4|1024|0.030127%| Also from the mailing list outputs, it looks like it is quite common to have some Native-Transport-Requests blocked, probably unavoidable depending on the network and use cases (spiky workloads?). > Blocked NTR When Connecting Causing Excessive Load > -- > > Key: CASSANDRA-11363 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11363 > Project: Cassandra > Issue Type: Bug > Components: Coordination >Reporter: Russell Bradberry > Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack > > > When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the > machine load increases to very high levels (> 120 on an 8 core machine) and > native transport requests get blocked in tpstats. > I was able to reproduce this in both CMS and G1GC as well as on JVM 7 and 8. > The issue does not seem to affect the nodes running 2.1.9. > The issue seems to coincide with the number of connections OR the number of > total requests being processed at a given time (as the latter increases with > the former in our system) > Currently there is between 600 and 800 client connections on each machine and > each machine is handling roughly 2000-3000 client requests per second. > Disabling the binary protocol fixes the issue for this node but isn't a > viable option cluster-wide. > Here is the output from tpstats: > {code} > Pool NameActive Pending Completed Blocked All > time blocked > MutationStage 0 88387821 0 > 0 > ReadStage 0