[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334604#comment-15334604 ]

T Jake Luciani commented on CASSANDRA-11363:

The Native Transport Request pool is the only thread pool that has a bounded limit (128). The NTR pool uses the SEPExecutor, which effectively blocks until the queue has room. However, if I'm reading it correctly, the SEPWorker goes into a spin loop in some scenarios when there is no work, so perhaps we are hitting some edge case when the tasks are blocked. [~pauloricardomg] perhaps try setting this queue to something small, like 4, to force blocking?

> Blocked NTR When Connecting Causing Excessive Load
> --------------------------------------------------
>
>                 Key: CASSANDRA-11363
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11363
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Russell Bradberry
>            Assignee: Paulo Motta
>            Priority: Critical
>         Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack
>
> When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the machine load increases to very high levels (> 120 on an 8 core machine) and native transport requests get blocked in tpstats.
> I was able to reproduce this in both CMS and G1GC as well as on JVM 7 and 8.
> The issue does not seem to affect the nodes running 2.1.9.
> The issue seems to coincide with the number of connections OR the number of total requests being processed at a given time (as the latter increases with the former in our system).
> Currently there are between 600 and 800 client connections on each machine and each machine is handling roughly 2000-3000 client requests per second.
> Disabling the binary protocol fixes the issue for this node but isn't a viable option cluster-wide.
> Here is the output from tpstats:
> {code}
> Pool Name                   Active  Pending  Completed  Blocked  All time blocked
> MutationStage               0 88387821 0 0
> ReadStage                   0 0 355860 0 0
> RequestResponseStage        0 72532457 0 0
> ReadRepairStage             0 0150 0 0
> CounterMutationStage        32 104 897560 0 0
> MiscStage                   0 0 0 0 0
> HintedHandoff               0 0 65 0 0
> GossipStage                 0 0 2338 0 0
> CacheCleanupExecutor        0 0 0 0 0
> InternalResponseStage       0 0 0 0 0
> CommitLogArchiver           0 0 0 0 0
> CompactionExecutor          2 190474 0 0
> ValidationExecutor          0 0 0 0 0
> MigrationStage              0 0 10 0 0
> AntiEntropyStage            0 0 0 0 0
> PendingRangeCalculator      0 0310 0 0
> Sampler                     0 0 0 0 0
> MemtableFlushWriter         110 94 0 0
> MemtablePostFlush           134257 0 0
> MemtableReclaimMemory       0 0 94 0 0
> Native-Transport-Requests   128 156 38795716 278451
>
> Message type        Dropped
> READ                0
> RANGE_SLICE         0
> _TRACE              0
> MUTATION            0
> COUNTER_MUTATION    0
> BINARY              0
> REQUEST_RESPONSE    0
> PAGED_RANGE         0
> READ_REPAIR         0
> {code}
> Attached is the jstack output for both CMS and G1GC.
> Flight recordings are here:
> https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr
> https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr
> It is interesting to note that while the flight recording was taking place, the load on the machine went back to healthy, and when the flight recording finished the load went back to > 100.
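As a minimal sketch of the queueing behaviour described in the comment above (an illustration with a plain {{BlockingQueue}}, not the actual SEPExecutor code; the bound of 128 mirrors the NTR limit mentioned, and lowering it, e.g. to 4, makes submitters block sooner):

{code}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BoundedSubmitQueue
{
    private final BlockingQueue<Runnable> tasks;

    public BoundedSubmitQueue(int maxQueued)
    {
        this.tasks = new ArrayBlockingQueue<>(maxQueued); // e.g. 128, or 4 to force blocking sooner
    }

    // Producer side: blocks the submitting thread while the queue is at capacity.
    public void submit(Runnable task) throws InterruptedException
    {
        tasks.put(task);
    }

    // Worker side: blocks while there is no work (the spin-loop concern above is about
    // how SEPWorker handles this idle case, which this sketch does not model).
    public Runnable take() throws InterruptedException
    {
        return tasks.take();
    }
}
{code}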
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15271519#comment-15271519 ]

Paulo Motta commented on CASSANDRA-11363:

Tried reproducing this in a variety of workloads without success, in the following environment:
* 3 x m3.xlarge nodes (4 vcpu, 15GB RAM, 2 x 40)
* 8GB heap, CMS
* 100M keys (24GB per node)
* C* 3.0.3 with default settings, and a variation with native_transport_max_threads=10
* no vnodes

The workloads were based on a variation of https://raw.githubusercontent.com/mesosphere/cassandra-mesos/master/driver-extensions/cluster-loadtest/cqlstress-example.yaml with 100M keys and RF=3, executed from 2 m3.xlarge stress nodes for 1, 2 and 6 hours with 10, 20 and 30 threads, without exhausting the cluster's CPU/IO capacity. I tried several combinations of the following workloads:
* read-only
* write-only
* range-only
* triggering repairs during the execution
* unthrottling compaction

I recorded and analyzed flight recordings during the tests but didn't find anything suspicious. No blocked native transport threads were observed in any of the above scenarios, so this might indicate that this condition is not a widespread bug like CASSANDRA-11529, but probably some edge-case combination of workload, environment and bad scheduling that happens in production and is harder to reproduce with synthetic workloads.

A thread dump taken when this condition happens would probably help us find the bottleneck or contention, so I created CASSANDRA-11713 to add the ability to log a thread dump when the thread pool queue is full. If someone could install that patch and enable it in production to capture a thread dump when the blockage happens, that would probably help us elucidate what's going on here.
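As a rough illustration of the CASSANDRA-11713 idea (only a sketch, not the actual patch: it assumes a hypothetical {{maybeDump}} hook called from the executor's submit path, and prints to stderr instead of using the node's logger):

{code}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public final class ThreadDumpOnFullQueue
{
    private ThreadDumpOnFullQueue() {}

    // Called when a bounded executor queue is observed to be at capacity;
    // dumps all thread stacks so the blocked/contended paths can be inspected later.
    public static void maybeDump(int queueSize, int queueCapacity)
    {
        if (queueSize < queueCapacity)
            return;

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder("Thread pool queue is full, dumping threads:\n");
        for (ThreadInfo info : threads.dumpAllThreads(true, true))
            sb.append(info); // ThreadInfo.toString() includes a (truncated) stack trace per thread
        System.err.println(sb);
    }
}
{code}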
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262397#comment-15262397 ]

Nate McCall commented on CASSANDRA-11363:

[~pauloricardomg] Thanks for continuing to dig into this. However, it looks like you have RF=1 in that stress script:
{noformat}
keyspace_definition: |
  CREATE KEYSPACE stresscql WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
{noformat}
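For comparison, a variant of that profile with RF=3 (as used in the run described in the comment above) would presumably only need the replication clause changed, e.g.:
{noformat}
keyspace_definition: |
  CREATE KEYSPACE stresscql WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
{noformat}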
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260641#comment-15260641 ]

Paulo Motta commented on CASSANDRA-11363:

I wasn't able to reproduce this condition so far in a 2.1 [cstar_perf|http://cstar.datastax.com/] cluster with the following spec: 1 stress node, 3 Cassandra nodes, each node 2x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (12 cores total), 64G, 3 Samsung SSD 845DC EVO 240GB, mdadm RAID 0.

The test consisted of the following sequence of stress steps, each followed by {{nodetool tpstats}}:
* {{user profile=https://raw.githubusercontent.com/mesosphere/cassandra-mesos/master/driver-extensions/cluster-loadtest/cqlstress-example.yaml ops\(insert=1\) n=1M -rate threads=300}}
* {{user profile=https://raw.githubusercontent.com/mesosphere/cassandra-mesos/master/driver-extensions/cluster-loadtest/cqlstress-example.yaml ops\(simple1=1\) n=1M -rate threads=300}}
* {{user profile=https://raw.githubusercontent.com/mesosphere/cassandra-mesos/master/driver-extensions/cluster-loadtest/cqlstress-example.yaml ops\(range1=1\) n=1M -rate threads=300}}

At the end of 5 runs, the total number of blocked NTR threads was negligible (0 for all runs, except one with 0.004% blocked). I will try running a larger mixed workload, ramping up the number of stress threads, and also try it on 3.0. Meanwhile, some JFR files, reproduction steps, or at least a more detailed description of the environment/workload needed to reproduce this would be greatly appreciated.
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234872#comment-15234872 ]

Carlos Rolo commented on CASSANDRA-11363:

I will try to get it today. I might be able to get those for the 3.0.3 cluster; the 2.1.13 one might prove more difficult. I will update this ticket later.
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15232837#comment-15232837 ]

Paulo Motta commented on CASSANDRA-11363:

[~CRolo] [~arodrime] if you could record and attach JFR files of servers with high numbers of blocked NTR threads, that would be of great help in investigating this issue. (I will also have a look at the already attached files, but they could be affected by CASSANDRA-11529.) Thanks in advance!
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231924#comment-15231924 ]

Carlos Rolo commented on CASSANDRA-11363:

On my systems: 3.0.3 -> logged batches, 2.1.13 -> unlogged batches. But on 2.1.13 the batches were taken out at some point to see if things would improve, and not much improvement was seen, so it might not be related to batches. That test (with batches disabled) was also small, so I would not count it as a solid test.
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231802#comment-15231802 ]

Alain RODRIGUEZ commented on CASSANDRA-11363:

In my description above we were using physical nodes (initial tokens set, num_tokens commented out, no vnodes enabled), fwiw.
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231171#comment-15231171 ]

Nate McCall commented on CASSANDRA-11363:

CASSANDRA-11529 makes sense for [~devdazed], but the numbers from [~arodrime] are on a *non-vnode* cluster. Keep in mind:
- we actually get this metric to go down a bit by increasing native transport threads
- we are not CPU bound or spiking (noticeably at least)

If this were an inefficiency in a hot code path per CASSANDRA-11529, I would expect increasing parallelism to exacerbate the issue. Good thoughts though - thanks for digging in!
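For reference, the knob being discussed is {{native_transport_max_threads}} in cassandra.yaml (the same setting varied in the repro attempts above; the default in these versions is 128). An illustrative, non-prescriptive example of raising it:
{noformat}
# cassandra.yaml -- illustrative value only
native_transport_max_threads: 256
{noformat}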
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231130#comment-15231130 ]

Russell Bradberry commented on CASSANDRA-11363:

[~pauloricardomg] CASSANDRA-11529 makes sense because CASSANDRA-9303 was backported to 2.1.12 in DSE 4.8.4, which would explain why we see it in that version rather than only in 2.1.13.
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231123#comment-15231123 ]

Paulo Motta commented on CASSANDRA-11363:

Yes, a cluster with more vnodes will be more affected by this.
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231115#comment-15231115 ]

Russell Bradberry commented on CASSANDRA-11363:

Any chance the number of vnodes in a cluster affects how bad this issue is?
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231113#comment-15231113 ]

Paulo Motta commented on CASSANDRA-11363:

OK, then it's very likely that you're being hit by CASSANDRA-11529. I created another ticket in case this one is a different issue. [~zznate] [~CRolo] can you double check you're not hitting CASSANDRA-11529?
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231108#comment-15231108 ]

Russell Bradberry commented on CASSANDRA-11363:

[~pauloricardomg] yes, we are using unlogged batches that cross partitions.
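For context, a cross-partition unlogged batch of the kind mentioned above looks roughly like this from the client side (illustrative DataStax Java driver code; the contact point, keyspace, table and columns are made up for the example):

{code}
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class CrossPartitionUnloggedBatch
{
    public static void main(String[] args)
    {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_ks"))
        {
            // Two inserts with different partition keys: the coordinator has to
            // resolve replicas for each partition in the batch separately.
            BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
            batch.add(new SimpleStatement("INSERT INTO events (id, ts, payload) VALUES (1, now(), 'a')"));
            batch.add(new SimpleStatement("INSERT INTO events (id, ts, payload) VALUES (2, now(), 'b')"));
            session.execute(batch);
        }
    }
}
{code}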
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231095#comment-15231095 ]

Paulo Motta commented on CASSANDRA-11363:

[~devdazed] are you using unlogged batches by any chance?
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231066#comment-15231066 ]

Russell Bradberry commented on CASSANDRA-11363:

Well, it's possible that this is the same issue, just very much exacerbated by the batches.
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231063#comment-15231063 ]

Nate McCall commented on CASSANDRA-11363:

Yeah, that is quite different from what [~arodrime] and I have seen recently. Nodes in our case were otherwise well within utilization thresholds.

[~devdazed] Your issue looks like it would be addressed by [~cnlwsu]'s reference to CASSANDRA-10200 (lean on them for a patch).
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231045#comment-15231045 ]

Russell Bradberry commented on CASSANDRA-11363:

We may have two separate issues here. In mine, the issue is 100% CPU utilization and ultra-high load when using batches. According to the JFR, all of the hot threads are spending their time in:
{code}
org.apache.cassandra.locator.NetworkTopologyStrategy.hasSufficientReplicas(String, Map, Multimap) ->
org.apache.cassandra.locator.NetworkTopologyStrategy.hasSufficientReplicas(Map, Multimap) ->
org.apache.cassandra.locator.NetworkTopologyStrategy.calculateNaturalEndpoints(Token, TokenMetadata)
{code}
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15231025#comment-15231025 ]

Paulo Motta commented on CASSANDRA-11363:

That's right, that was just a wild guess. I focused my quick investigation on the 2.1.12 and 2.1.13 commits, but some deeper investigation is probably needed. It's probably a good idea to backport CASSANDRA-10044 to versions <= 2.1.12, set up a coordinator-only node, and check where it started happening, to narrow down the scope.
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230947#comment-15230947 ] Nate McCall commented on CASSANDRA-11363: - [~pauloricardomg] unfortunately, this may have been latent for some time. CASSANDRA-10044 re-introduced the tpstats counters that had been missing (which is most likely why this has not been noticed until recently).
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230926#comment-15230926 ] Russell Bradberry commented on CASSANDRA-11363: --- Unfortunately we are on DSE, so I can't run the revert
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230924#comment-15230924 ] Paulo Motta commented on CASSANDRA-11363: - I went through the 2.1.12 changes and didn't find anything suspicious. On 2.1.13 and 3.0.3, though, we changed the {{ServerConnection}} query state map from a {{NonBlockingHashMap}} to a {{ConcurrentHashMap}} in CASSANDRA-10938, which might be misbehaving for some reason. Is anyone willing to try the revert patch below on 2.1.13 or 3.0.3 and check if that changes anything?
{noformat}
diff --git a/src/java/org/apache/cassandra/transport/ServerConnection.java b/src/java/org/apache/cassandra/transport/ServerConnection.java
index ce4d164..5991b33 100644
--- a/src/java/org/apache/cassandra/transport/ServerConnection.java
+++ b/src/java/org/apache/cassandra/transport/ServerConnection.java
@@ -17,7 +17,6 @@
  */
 package org.apache.cassandra.transport;
 
-import java.util.concurrent.ConcurrentHashMap;
 import java.util.concurrent.ConcurrentMap;
 
 import io.netty.channel.Channel;
@@ -29,6 +28,8 @@ import org.apache.cassandra.config.DatabaseDescriptor;
 import org.apache.cassandra.service.ClientState;
 import org.apache.cassandra.service.QueryState;
 
+import org.cliffc.high_scale_lib.NonBlockingHashMap;
+
 public class ServerConnection extends Connection
 {
     private enum State { UNINITIALIZED, AUTHENTICATION, READY }
@@ -37,7 +38,7 @@ public class ServerConnection extends Connection
     private final ClientState clientState;
     private volatile State state;
 
-    private final ConcurrentMap<Integer, QueryState> queryStates = new ConcurrentHashMap<>();
+    private final ConcurrentMap<Integer, QueryState> queryStates = new NonBlockingHashMap<>();
 
     public ServerConnection(Channel channel, int version, Connection.Tracker tracker)
     {
{noformat}
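For anyone willing to try it, a plausible way to apply and build the patch (not prescribed in this ticket, just the usual source-build route for these versions; the patch file name is hypothetical) would be something like:
{noformat}
# Assumes a source checkout of the matching release tag.
cd cassandra
git apply revert-queryStates-map.patch
ant jar   # rebuilds the Cassandra jar under build/
{noformat}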
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230670#comment-15230670 ] Chris Lohfink commented on CASSANDRA-11363: --- caused by CASSANDRA-10200 ?
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230546#comment-15230546 ] Nate McCall commented on CASSANDRA-11363: - Raised this to critical. You have four long-time users with multiple large deployments who are seeing a quantifiable percentage of client errors on non-resource constrained clusters across multiple versions.
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15230325#comment-15230325 ] Alain RODRIGUEZ commented on CASSANDRA-11363: - I also observed in C* 2.1.12 that a certain percentage of the Native-Transport-Requests are blocked, but with no major CPU or resource issue on my side, so it might not be related. For what it is worth, here is something I observed about Native-Transport-Requests: increasing the 'native_transport_max_threads' value helps mitigate this, as expected, but the number of blocked Native-Transport-Requests is still non-zero.
{noformat}
[alain@bastion-d3-prod ~]$ knife ssh "role:cassandra" "nodetool tpstats | grep Native-Transport-Requests" | grep -e server1 -e server2 -e server3 -e server4 | sort | awk 'BEGIN { printf "%50s %10s","Server |"," Blocked ratio:\n" } { printf "%50s %10f%\n", $1, (($7/$5)*100) }'
                                          Server |  Blocked ratio:
      ip-172-17-42-105.us-west-2.compute.internal   0.044902%
      ip-172-17-42-107.us-west-2.compute.internal   0.030127%
      ip-172-17-42-114.us-west-2.compute.internal   0.045759%
      ip-172-17-42-116.us-west-2.compute.internal   0.082763%
{noformat}
I waited long enough between the change and the result capture, many days, probably a few weeks. As all the nodes are in the same datacenter, under a (fairly) balanced load, this is probably relevant. Here are the results for those nodes, in our use case.
||Server||native_transport_max_threads||Percentage of blocked threads||
|server1|128|0.082763%|
|server2|384|0.044902%|
|server3|512|0.045759%|
|server4|1024|0.030127%|
Also, from the mailing list outputs it looks like it is quite common to have some Native-Transport-Requests blocked, probably unavoidable depending on the network and use cases (spiky workloads?).
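For anyone wanting to reproduce the mitigation above, the knob lives in cassandra.yaml; the value shown here is just an example taken from the table above, not a recommendation:
{code}
# cassandra.yaml (excerpt) - example value only
# Upper bound on concurrent native (CQL binary) transport requests; requests
# beyond this limit wait and show up as "blocked" for Native-Transport-Requests
# in `nodetool tpstats`.
native_transport_max_threads: 384
{code}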
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15199757#comment-15199757 ] Russell Bradberry commented on CASSANDRA-11363: --- After some digging, we found the issue to be introduced in Cassandra 2.1.12. We are running DSE, so the issue manifested in DSE 4.8.4.
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197912#comment-15197912 ] Carlos Rolo commented on CASSANDRA-11363: - Also reproducible in 3.0.3. I have 2 clusters showing exactly the same problem, one running 2.1.13 (5 nodes, 4 cores, 32GB RAM, CMS, Java 7) and another running 3.0.3 (3 nodes, 4 cores, 16GB RAM, CMS, Java 8). Both show the problem described: "The issue seems to coincide with the number of connections OR the number of total requests being processed at a given time (as the latter increases with the former in our system)". It is normal for OS load to shoot to 33+ with around 40 connected clients.
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15199929#comment-15199929 ] Russell Bradberry commented on CASSANDRA-11363: --- For troubleshooting we set up a coordinator-only node and pointed one app server at it. This resulted in roughly 90 connections to the node. We witnessed many timeouts of requests from the app server's perspective. We downgraded the coordinator-only node to 2.1.9 and upgraded point-release by point-release (DSE point releases) until we saw the same behavior in DSE 4.8.4 (Cassandra 2.1.12). I'm not certain this has to do with connection count anymore. We have several different types of workloads going on, but we found that only the workloads that use batches were timing out. Additionally, this is only happening when the node is used as a coordinator. We are not seeing this issue when we disable the binary protocol, effectively making the node no longer a coordinator.
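For reference, toggling the binary protocol on a single node as described above can be done at runtime with nodetool; these are the standard commands, nothing specific to this ticket:
{noformat}
nodetool disablebinary                              # stop serving CQL clients; the node no longer coordinates native-protocol requests
nodetool enablebinary                               # re-enable the native transport afterwards
nodetool tpstats | grep Native-Transport-Requests   # check Active / Pending / All time blocked counts
{noformat}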
[jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
[ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15199944#comment-15199944 ] Carlos Rolo commented on CASSANDRA-11363: - I can confirm that in both my clusters batches are in use.