[jira] [Comment Edited] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off

2020-04-23 Thread Benjamin Lerer (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090689#comment-17090689
 ] 

Benjamin Lerer edited comment on CASSANDRA-14747 at 4/23/20, 3:39 PM:
--

This ticket has not been updated in almost 5 months, and it is not clear to me 
what the expected outcome for it is.
[~jolynch] [~vinaykumarcse] What is the status of this ticket? What still needs 
to be done?


was (Author: blerer):
This ticket has not been updated since almost 5 months and it is not clear to 
me what the expected output for it is.
[~jolynch] [~vinaykumarcse] what is the status of that ticket?

> Evaluate 200 node, compression=none, encryption=none, coalescing=off 
> -
>
> Key: CASSANDRA-14747
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14747
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Legacy/Testing
>Reporter: Joey Lynch
>Assignee: Joey Lynch
>Priority: Normal
> Fix For: 4.0-beta
>
> Attachments: 3.0.17-QPS.png, 4.0.1-QPS.png, 
> 4.0.11-after-jolynch-tweaks.svg, 4.0.12-after-unconditional-flush.svg, 
> 4.0.15-after-sndbuf-fix.svg, 4.0.7-before-my-changes.svg, 
> 4.0_errors_showing_heap_pressure.txt, 
> 4.0_heap_histogram_showing_many_MessageOuts.txt, 
> i-0ed2acd2dfacab7c1-after-looping-fixes.svg, 
> trunk_14503_v2_cpuflamegraph.svg, trunk_vs_3.0.17_latency_under_load.png, 
> ttop_NettyOutbound-Thread_spinning.txt, 
> useast1c-i-0e1ddfe8b2f769060-mutation-flame.svg, 
> useast1e-i-08635fa1631601538_flamegraph_96node.svg, 
> useast1e-i-08635fa1631601538_ttop_netty_outbound_threads_96nodes, 
> useast1e-i-08635fa1631601538_uninlinedcpuflamegraph.0_96node_60sec_profile.svg
>
>
> Tracks evaluating a 200 node cluster with all internode settings off (no 
> compression, no encryption, no coalescing).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off

2018-10-04 Thread Jason Brown (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638181#comment-16638181
 ] 

Jason Brown edited comment on CASSANDRA-14747 at 10/4/18 12:45 PM:
---

Excellent find, [~jolynch].

It looks like we added the ability to set the send/recv buffer size in 
CASSANDRA-3378 (which apparently I reviewed, 5.5 years ago). In 3.11 
we [set the 
SO_SNDBUF|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L444]
 if the operator provided a value in the yaml, but we did not set a default 
value. However, it does appear I added a hard-coded default in 4.0 with 
CASSANDRA-8457. As it's been nearly two years since I wrote that part of the 
patch, I have no recollection of why I added a default. Removing it is trivial 
and has huge benefits, as [~jolynch] has proven. I'm working on combining the 
findings [~jolynch] and I have discovered over the last few weeks and should 
have a patch ready in a few days (which will probably be part of 
CASSANDRA-14503, as most of this work was based on that work-in-progress).
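
For context, a minimal hypothetical sketch of the 3.11-style behavior being restored: apply {{SO_SNDBUF}} to the Netty bootstrap only when the operator explicitly configured a value. The method and parameter names below are illustrative, not the actual Cassandra code:

{noformat}
// Illustrative sketch only, not the real Cassandra code: set SO_SNDBUF on the
// Netty bootstrap only if the operator configured a value; otherwise leave the
// kernel's auto-tuning in charge of the send buffer.
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelOption;

final class SendBufferSketch
{
    // "configuredSendBufferBytes" stands in for the yaml-provided value (<= 0 means unset).
    static void applySendBufferIfConfigured(Bootstrap bootstrap, int configuredSendBufferBytes)
    {
        if (configuredSendBufferBytes > 0)
            bootstrap.option(ChannelOption.SO_SNDBUF, configuredSendBufferBytes);
        // No hard-coded default: the OS grows the buffer to match the
        // connection's bandwidth-delay product.
    }
}
{noformat}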



was (Author: jasobrown):
Excellent find, [~jolynch].

Looks like we added the ability to set the send/recv buffer size in 
CASSANDRA-3378 (which apparently I reviewed, 5.5 years ago). Looks like in 3.11 
we [set the 
SO_SNDBUF|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L444]
 if the operator provided a value in the yaml, but we did not set a default 
value. However, it does appear I added a hard-coded default in 4.0 with 
CASSANDRA-8457. As it's been nearly two years since I wrote that part of the 
patch, I have no recollection of why I added a default. Removing it is trivial 
and has huge benefits, as  has proven. I'm working on combining the findings 
[~jolynch] and I have discovered over the last weeks and should have a patch 
ready in a few days (which will probably be part CASSANDRA-14503, as most of 
this work was based on that work-in-progress).


> Evaluate 200 node, compression=none, encryption=none, coalescing=off 
> -
>
> Key: CASSANDRA-14747
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14747
> Project: Cassandra
>  Issue Type: Sub-task
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Major
> Attachments: 3.0.17-QPS.png, 4.0.1-QPS.png, 
> 4.0.11-after-jolynch-tweaks.svg, 4.0.12-after-unconditional-flush.svg, 
> 4.0.15-after-sndbuf-fix.svg, 4.0.7-before-my-changes.svg, 
> 4.0_errors_showing_heap_pressure.txt, 
> 4.0_heap_histogram_showing_many_MessageOuts.txt, 
> i-0ed2acd2dfacab7c1-after-looping-fixes.svg, 
> trunk_vs_3.0.17_latency_under_load.png, 
> ttop_NettyOutbound-Thread_spinning.txt, 
> useast1c-i-0e1ddfe8b2f769060-mutation-flame.svg, 
> useast1e-i-08635fa1631601538_flamegraph_96node.svg, 
> useast1e-i-08635fa1631601538_ttop_netty_outbound_threads_96nodes, 
> useast1e-i-08635fa1631601538_uninlinedcpuflamegraph.0_96node_60sec_profile.svg
>
>
> Tracks evaluating a 200 node cluster with all internode settings off (no 
> compression, no encryption, no coalescing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off

2018-10-04 Thread Dinesh Joshi (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637835#comment-16637835
 ] 

Dinesh Joshi edited comment on CASSANDRA-14747 at 10/4/18 6:13 AM:
---

[~jolynch] this is pretty cool! I think it would make sense to set all tunable 
knobs to default and see the impact. Then we can start tuning the parameters to 
arrive at sensible defaults. We should also document the findings.


was (Author: djoshi3):
[~jolynch] this is pretty cool! I think it would make sense to set all tunable 
knobs to default and see the impact.

> Evaluate 200 node, compression=none, encryption=none, coalescing=off 
> -
>
> Key: CASSANDRA-14747
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14747
> Project: Cassandra
>  Issue Type: Sub-task
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Major
> Attachments: 3.0.17-QPS.png, 4.0.1-QPS.png, 
> 4.0.11-after-jolynch-tweaks.svg, 4.0.12-after-unconditional-flush.svg, 
> 4.0.15-after-sndbuf-fix.svg, 4.0.7-before-my-changes.svg, 
> 4.0_errors_showing_heap_pressure.txt, 
> 4.0_heap_histogram_showing_many_MessageOuts.txt, 
> i-0ed2acd2dfacab7c1-after-looping-fixes.svg, 
> trunk_vs_3.0.17_latency_under_load.png, 
> ttop_NettyOutbound-Thread_spinning.txt, 
> useast1c-i-0e1ddfe8b2f769060-mutation-flame.svg, 
> useast1e-i-08635fa1631601538_flamegraph_96node.svg, 
> useast1e-i-08635fa1631601538_ttop_netty_outbound_threads_96nodes, 
> useast1e-i-08635fa1631601538_uninlinedcpuflamegraph.0_96node_60sec_profile.svg
>
>
> Tracks evaluating a 200 node cluster with all internode settings off (no 
> compression, no encryption, no coalescing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off

2018-10-03 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637652#comment-16637652
 ] 

Joseph Lynch edited comment on CASSANDRA-14747 at 10/4/18 12:38 AM:


[~jasobrown] Ok, I think we found the problem! In the new Netty code we 
explicitly set the {{SO_SNDBUF}} [of the outbound 
socket|https://github.com/apache/cassandra/blob/47a10649dadbdea6960836a7c0fe6d271a476204/src/java/org/apache/cassandra/net/async/NettyFactory.java#L332]
 to 64KB. This works great if you have a low-latency connection, but for long 
fat networks it is a serious issue, as you restrict your bandwidth 
significantly due to a high [bandwidth-delay 
product|https://en.wikipedia.org/wiki/Bandwidth-delay_product]. In the tests 
we've been running, we are trying to push a semi-reasonable amount of 
traffic (about 8mbps) to peers that are about 80ms away (us-east-1 to eu-west-1 
is usually about [80ms|https://www.cloudping.co/]). With a 64KB window size we 
just don't have enough throughput, even though the actual link has very high 
bandwidth. As we can see using {{iperf}}, setting a 64KB buffer cripples 
throughput:
{noformat}
# On the eu-west-1 node X
$ iperf -s -p 8080

Server listening on TCP port 8080
TCP window size: 12.0 MByte (default)

[  4] local X port 8080 connected with Y port 26964
[ ID] Interval   Transfer Bandwidth
[  4]  0.0-10.5 sec   506 MBytes   404 Mbits/sec
[  5] local X port 8080 connected with Y port 27050
[  5]  0.0-10.5 sec  8.50 MBytes  6.81 Mbits/sec

# On the us-east-1 node Y about 80ms away

$ iperf -N -c X -p 8080

Client connecting to X, TCP port 8080
TCP window size: 12.0 MByte (default)

[  3] local Y port 26964 connected with X port 8080
[ ID] Interval   Transfer Bandwidth
[  3]  0.0-10.1 sec   506 MBytes   421 Mbits/sec

$ iperf -N -w 64K -c X -p 8080

Client connecting to X, TCP port 8080
TCP window size:  128 KByte (WARNING: requested 64.0 KByte)

[  3] local Y port 27050 connected with X port 8080
[ ID] Interval   Transfer Bandwidth
[  3]  0.0-10.1 sec  8.50 MBytes  7.03 Mbits/sec
{noformat}

So instead of Cassandra getting the full link's bandwidth of ~500mbps, we're only 
able to get ~7mbps. This is lower than the 8mbps we need to push, so the 
us-east-1 -> eu-west-1 queues effectively grow without bound until we start 
dropping messages.
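
As a quick sanity check, the bandwidth-delay product math alone predicts roughly this ceiling. A back-of-the-envelope sketch (illustrative only; it assumes a fixed 64KB window and 80ms RTT and ignores TCP slow start and loss):

{noformat}
// Throughput ceiling imposed by a fixed TCP window: max throughput = window / RTT.
public class BdpCeiling
{
    public static void main(String[] args)
    {
        double windowBytes = 64 * 1024; // 64KB SO_SNDBUF / effective window
        double rttSeconds = 0.080;      // ~80ms us-east-1 <-> eu-west-1
        double bitsPerSecond = windowBytes * 8 / rttSeconds;
        // Prints roughly 6.5 Mbit/s, in line with the ~7mbps iperf measured above.
        System.out.printf("ceiling = %.1f Mbit/s%n", bitsPerSecond / 1_000_000);
    }
}
{noformat}

Pushing more than that per connection requires either a larger window or letting the kernel auto-tune it.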

I applied a 
[patch|https://gist.github.com/jolynch/966e0e52f34eff7a7b8ac8d5a9cb4b5d#file-fix-the-problem-diff]
 which does not set {{SO_SNDBUF}} unless explicitly asked to and *everything is 
completely wonderful* now. Some ways that things are wonderful:

1. The CPU usage is now on par with 3.0.x, and most of that CPU time is spent 
in compaction (both in garbage creation and actual CPU time):

{noformat}
$ sjk  ttop -p $(pgrep -f Cassandra) -n 20 -o CPU
2018-10-03T23:56:40.889+ Process summary 
  process cpu=321.33%
  application cpu=301.46% (user=185.93% sys=115.52%)
  other: cpu=19.88% 
  thread count: 274
  GC time=5.27% (young=5.27%, old=0.00%)
  heap allocation rate 478mb/s
  safe point rate: 0.4 (events/s) avg. safe point pause: 135.64ms
  safe point sync time: 0.08% processing time: 5.38% (wallclock time)
[000135] user=49.03% sys=11.84% alloc=  142mb/s - CompactionExecutor:1
[000136] user=44.60% sys=13.81% alloc=  133mb/s - CompactionExecutor:2
[000198] user= 0.00% sys=41.46% alloc=  4833b/s - NonPeriodicTasks:1
[10] user= 9.56% sys= 0.67% alloc=   57mb/s - spectator-gauge-polling-0
[29] user= 7.45% sys= 2.13% alloc= 5772kb/s - PerDiskMemtableFlushWriter_0:1
[36] user= 0.00% sys= 8.98% alloc=  2598b/s - PERIODIC-COMMIT-LOG-SYNCER
[000115] user= 5.74% sys= 2.22% alloc=   12mb/s - 
MessagingService-NettyInbound-Thread-3-1
[000118] user= 4.03% sys= 3.75% alloc= 2915kb/s - 
MessagingService-NettyOutbound-Thread-4-3
[000117] user= 3.12% sys= 2.79% alloc= 2110kb/s - 
MessagingService-NettyOutbound-Thread-4-2
[000144] user= 4.03% sys= 0.92% alloc= 7205kb/s - MutationStage-1
[000146] user= 4.13% sys= 0.77% alloc= 6837kb/s - Native-Transport-Requests-2
[000147] user= 3.12% sys= 1.49% alloc= 6054kb/s - MutationStage-3
[000150] user= 3.22% sys= 1.21% alloc= 6630kb/s - MutationStage-4
[000116] user= 2.72% sys= 1.61% alloc= 1412kb/s - 
MessagingService-NettyOutbound-Thread-4-1
[000132] user= 2.21% sys= 2.04% alloc=   11mb/s - 
MessagingService-NettyInbound-Thread-3-2
[000151] user= 2.92% sys= 1.30% alloc= 5462kb/s - Native-Transport-Requests-5
[000134] user= 2.11% sys= 1.71% alloc= 6212kb/s - 
MessagingService-NettyInbound-Thread-3-4

[jira] [Comment Edited] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off

2018-10-01 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634851#comment-16634851
 ] 

Joseph Lynch edited comment on CASSANDRA-14747 at 10/2/18 2:29 AM:
---

Ah yeah, I see that's a problem. I worked around it by making a new callback just 
for that case. While I was testing it out I also tested flushing 
unconditionally 
([diff|https://gist.github.com/jolynch/966e0e52f34eff7a7b8ac8d5a9cb4b5d#file-some-more-tweaks-diff-L22])
 and CPU usage dropped by about half, and the flamegraph looks _excellent_.

I've attached the flamegraph as [^4.0.12-after-unconditional-flush.svg], where 
we can see that after the unconditional flush we are spending less than 7% CPU 
now (compared to roughly 40%)! I think that with 198 other nodes we were 
spending a lot of time waiting with unflushed data in the channel, because there 
are 195 other queues that get serviced before you get serviced again and fill 
up the channel.

We're not done yet, as we still have dropped messages (vs 3.0, which has very few 
if any dropped), but this is much better. 
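
To make the change concrete, here is a minimal hypothetical sketch of the two write paths (illustrative only; the actual change is in the gist linked above). With a conditional flush, data can sit in the channel's outbound buffer while the event loop services the other peers; flushing unconditionally pushes every write to the socket right away:

{noformat}
// Illustrative sketch only -- assumes a Netty ChannelHandlerContext; not the actual patch.
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelHandlerContext;

final class FlushSketch
{
    // Conditional flush: writes can sit unflushed while other peers' queues are serviced.
    static ChannelFuture writeMaybeFlush(ChannelHandlerContext ctx, Object msg, boolean backlogEmpty)
    {
        ChannelFuture future = ctx.write(msg);
        if (backlogEmpty)
            ctx.flush();
        return future;
    }

    // Unconditional flush: every message is handed to the socket immediately.
    static ChannelFuture writeAlwaysFlush(ChannelHandlerContext ctx, Object msg)
    {
        return ctx.writeAndFlush(msg);
    }
}
{noformat}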


was (Author: jolynch):
Ah yea I see that's a problem. I worked around it by making a new callback just 
for that case. While I was testing it out I also tested flushing 
unconditionally 
([diff|https://gist.github.com/jolynch/966e0e52f34eff7a7b8ac8d5a9cb4b5d#file-some-more-tweaks-diff-L22])
 and CPU usage dropped by about half and the flamegraph looks _excellent_.

I've attached the flamegraph as [^4.0.12-after-unconditional-flush.svg], where 
we can see that after the unconditional flush we are spending less than 7% CPU 
usage now! (compared to like 70%). I think that with 198 other nodes we were 
spending a lot of time waiting with data in the channel that's unflushed 
because well there are 195 other queues that get to be serviced before you get 
serviced again and fill up the channel.

We're not done yet as we still have dropped messages (vs 3.0 which has very few 
if any dropped), but this is much better. 

> Evaluate 200 node, compression=none, encryption=none, coalescing=off 
> -
>
> Key: CASSANDRA-14747
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14747
> Project: Cassandra
>  Issue Type: Sub-task
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Major
> Attachments: 3.0.17-QPS.png, 4.0.1-QPS.png, 
> 4.0.11-after-jolynch-tweaks.svg, 4.0.12-after-unconditional-flush.svg, 
> 4.0.7-before-my-changes.svg, 4.0_errors_showing_heap_pressure.txt, 
> 4.0_heap_histogram_showing_many_MessageOuts.txt, 
> i-0ed2acd2dfacab7c1-after-looping-fixes.svg, 
> ttop_NettyOutbound-Thread_spinning.txt, 
> useast1c-i-0e1ddfe8b2f769060-mutation-flame.svg, 
> useast1e-i-08635fa1631601538_flamegraph_96node.svg, 
> useast1e-i-08635fa1631601538_ttop_netty_outbound_threads_96nodes, 
> useast1e-i-08635fa1631601538_uninlinedcpuflamegraph.0_96node_60sec_profile.svg
>
>
> Tracks evaluating a 200 node cluster with all internode settings off (no 
> compression, no encryption, no coalescing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off

2018-09-18 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16619974#comment-16619974
 ] 

Joseph Lynch edited comment on CASSANDRA-14747 at 9/19/18 2:08 AM:
---

Things went much better today: after the queue fixes we no longer ran out of 
memory, but the {{MessagingService-NettyOutbound-Thread}}s would be pinned at 
100% CPU. We (Jason, Jordan, myself, etc.) tracked it down to various 
unfortunate looping behaviors in the {{OutboundMessagingConnection}} class. 
We're following up with fixes to these queueing problems. I've attached 
flame graphs and ttop outputs showing what's going on on the latest version of 
the {{jasobrown/14503-collab}} branch.

We think a few things are going on here:
 # When the outbound queues get backed up we enter various long (sometimes 
infinite) loops. We're working on stopping those.
 # Since we're multiplexing multiple nodes onto one outbound thread, we can 
have multi-tenant queues where one slow consumer hurts other nodes as well. 
We're working on a fix for this (see the sketch after this list).
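
The sketch below is a hypothetical illustration of the second point, not the actual fix: when one event-loop thread drains several peers' queues, capping the work done per peer per pass keeps a backed-up peer from starving the others.

{noformat}
// Illustrative only; class names and the per-pass bound are made up for this example.
import java.util.List;
import java.util.Queue;

final class FairDrainSketch
{
    static final int MAX_MESSAGES_PER_PEER_PER_PASS = 64; // illustrative bound

    static void drainOnce(List<Queue<Object>> peerQueues)
    {
        for (Queue<Object> queue : peerQueues)
        {
            int sent = 0;
            Object message;
            // Stop after the cap even if this peer still has a backlog,
            // so the next peer's queue gets serviced promptly.
            while (sent < MAX_MESSAGES_PER_PEER_PER_PASS && (message = queue.poll()) != null)
            {
                writeToChannel(message);
                sent++;
            }
        }
    }

    static void writeToChannel(Object message) { /* hand off to the Netty channel */ }
}
{noformat}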


was (Author: jolynch):
Things went much better today, after the queue fixes we no longer ran out of 
memory, but the {{MessagingService-NettyOutbound-Thread}} s would be pinned at 
100% cpu. We (Jason, Jordan, myself, etc) tracked it down to various 
unfortunate looping behaviors in the {{OutboundMessagingConnection}} class. 
We're following up with various fixes to these queueing problems. I've attached 
flame graphs and ttop outputs showing what's going on on the latest version of 
{{jasobrown/14503-collab}} branch.

> Evaluate 200 node, compression=none, encryption=none, coalescing=off 
> -
>
> Key: CASSANDRA-14747
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14747
> Project: Cassandra
>  Issue Type: Sub-task
>Reporter: Joseph Lynch
>Assignee: Joseph Lynch
>Priority: Major
> Attachments: 3.0.17-QPS.png, 4.0.1-QPS.png, 
> 4.0_errors_showing_heap_pressure.txt, 
> 4.0_heap_histogram_showing_many_MessageOuts.txt, 
> i-0ed2acd2dfacab7c1-after-looping-fixes.svg, 
> ttop_NettyOutbound-Thread_spinning.txt, 
> useast1c-i-0e1ddfe8b2f769060-mutation-flame.svg
>
>
> Tracks evaluating a 200 node cluster with all internode settings off (no 
> compression, no encryption, no coalescing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-14747) Evaluate 200 node, compression=none, encryption=none, coalescing=off

2018-09-12 Thread Joseph Lynch (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612856#comment-16612856
 ] 

Joseph Lynch edited comment on CASSANDRA-14747 at 9/12/18 11:52 PM:


*Setup:*
 * Cassandra: 192 (2*96) i3.xlarge AWS instances (4 CPU cores, 30GB RAM) 
running Cassandra trunk f25a765b vs. the same footprint running 3.0.17
 * Two datacenters with 100ms latency between them
 * No compression, encryption, or coalescing turned on

*Test #1:*

ndbench sent 30k QPS at the coordinator level to one datacenter (RF=3*2=6, so 
180k global replica QPS) of 40kb and then 4kb single-partition BATCH mutations 
at LOCAL_QUORUM. This represents about 300 QPS per coordinator in the first 
datacenter, or 75 per core.

*Result:*

We quickly overwhelmed the 4.0 cluster, which started having high latencies and 
throwing errors, while the 3.0 cluster remained healthy. 4.0 nodes were running 
out of heap within minutes during the 40kb test and within a few minutes during 
the 4kb test. I've attached flamegraphs showing 4.0 spending half its time 
garbage collecting, and logs indicating large on-heap usage.

On the bright side, the thread count was _way down_: the 3.0 cluster had 1.2k 
threads and the 4.0 cluster only had 220 threads (and almost all of that 
reduction came from the messaging threads). The startup time was also super 
fast (as in less than one second to handshake the entire cluster, vs 3.0 which 
took minutes). We didn't feel that proceeding with the test made sense, given 
the instability, until follow-ups could be committed. We used heap dumps and 
{{jmap}} to determine that the issue was the outgoing message queue retaining 
large numbers of mutations on heap rather than dropping them.

*Follow Ups:*
 The outgoing queue holding mutations on heap appears to be the problem. 
Specifically, the 3.x code would police the internode queues and ensure they did 
not get too large at enqueue and dequeue time (expiring messages and turning 
them into hints as needed); the 4.0 code removed the enqueue policing 
complexity in the hope that we wouldn't need it. It appears it is necessary. 
[~jasobrown] is including fixes to the queue policing in CASSANDRA-14503 and 
CASSANDRA-13630, and we will re-execute this test once those are merged to 
ensure that they fix the issue with large-volume mutations.
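
As a rough, hypothetical illustration of the enqueue/dequeue-time policing described above (names and thresholds are made up; this is not the CASSANDRA-14503 or CASSANDRA-13630 code), the idea is to expire or shed messages when they are enqueued or dequeued instead of letting them accumulate on heap:

{noformat}
// Hypothetical sketch of queue policing; illustrative names and bounds only.
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

final class PolicedOutboundQueueSketch
{
    private static final int MAX_QUEUED_MESSAGES = 10_000; // illustrative bound

    private final ConcurrentLinkedQueue<QueuedMessage> queue = new ConcurrentLinkedQueue<>();
    private final AtomicInteger size = new AtomicInteger();

    boolean enqueue(QueuedMessage message, long nowMillis)
    {
        // Police at enqueue time: expired or over-capacity messages never sit on heap.
        if (message.expiresAtMillis <= nowMillis || size.get() >= MAX_QUEUED_MESSAGES)
        {
            hintOrDrop(message);
            return false;
        }
        queue.add(message);
        size.incrementAndGet();
        return true;
    }

    QueuedMessage dequeue(long nowMillis)
    {
        QueuedMessage message;
        while ((message = queue.poll()) != null)
        {
            size.decrementAndGet();
            if (message.expiresAtMillis > nowMillis)
                return message;
            hintOrDrop(message); // police at dequeue time as well
        }
        return null;
    }

    private void hintOrDrop(QueuedMessage message) { /* write a hint or drop, per message type */ }

    static final class QueuedMessage
    {
        final long expiresAtMillis;
        QueuedMessage(long expiresAtMillis) { this.expiresAtMillis = expiresAtMillis; }
    }
}
{noformat}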


was (Author: jolynch):
*Setup:*
 * Cassandra: 192 (2*96) node i3.xlarge AWS instance (4 cpu cores, 30GB ram) 
running cassandra trunk f25a765b vs the same footprint running 3.0.17
 * Two datacenters with 100ms latency between them
 * No compression, encryption, or coalescing turned on

*Test #1:*

ndbench sent 30k QPS at a coordinator level to one datacenter (RF=3*2 = 6 so 
180k global replica QPS) of 40kb and then 4kb single partition BATCH mutations. 
This represents about 300 QPS per coordinator in the first datacenter or 75 per 
core.

*Result:*

We quickly overwhelmed the 4.0 cluster which started having high latencies and 
throwing errors while the 3.0 cluster remained healthy. 4.0 nodes were running 
out of heap within minutes during the 40kb test and a few minutes with the 4kb 
test. I've attached flamegraphs showing 4.0 spending half its time garbage 
collecting and logs indicating large on heap usage.

On the bright side the thread count was _way down:_ the 3.0 cluster had 1.2k 
threads and the 4.0 cluster only had 220 threads (and almost all of that 
reduction was the messaging thread reduction). Also the startup time was super 
fast (as in less than one second to handshake the entire cluster, vs 3.0 which 
took minutes. We didn't feel that proceeding with the test made sense given the 
instability until follow ups could be committed. We used heap dumps and 
{{jmap}} to determine the issue was the outgoing message queue retaining large 
numbers of mutations on heap rather than dropping them.

*Follow Ups:*
 The outgoing queue holding mutations on heap appears to be the problem. 
Specifically the 3.x code would police the internode queues and ensure they did 
not get too large at enqueue and dequeue time (expiring messages and turning 
them into hints as needed), the 4.0 code took out the enqueue policing 
complexity in the hope that we wouldn't need it. It appears it is necessary. 
[~jasobrown] is including fixes to the queue policing in CASSANDRA-14503 and 
CASSANDRA-13630 and we will re-execute this test once those are merged to 
ensure that they fix the issue with large volume mutations.

> Evaluate 200 node, compression=none, encryption=none, coalescing=off 
> -
>
> Key: CASSANDRA-14747
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14747
> Project: Cassandra
>  Issue Type: Sub-task
>Reporter: Joseph Lynch
>Priority: Major
> Attachments: 
