[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062158#comment-16062158 ]

Jason Brown edited comment on CASSANDRA-8457 at 6/24/17 10:06 PM:
------------------------------------------------------------------

Uploaded results of the stress run (using the basic/naive CQL schema) from June 2017. It accounts for all the variants (compress/coalesce/jdk-TLS/openssl-TLS). See the README in the tarball for more details.

was (Author: jasobrown): results of 'naive' stress run, from June 2017

> nio MessagingService
> --------------------
>     Key: CASSANDRA-8457
>     URL: https://issues.apache.org/jira/browse/CASSANDRA-8457
>     Project: Cassandra
>     Issue Type: New Feature
>     Reporter: Jonathan Ellis
>     Assignee: Jason Brown
>     Priority: Minor
>     Labels: netty, performance
>     Fix For: 4.x
>     Attachments: 8457-load.tgz
>
> Thread-per-peer (actually two each, incoming and outbound) is a big
> contributor to context switching, especially for larger clusters. Let's look
> at switching to nio, possibly via Netty.
[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966390#comment-15966390 ]

Jason Brown edited comment on CASSANDRA-8457 at 4/12/17 7:58 PM:
-----------------------------------------------------------------

[~aweisberg] So, I've kicked this idea around a bit (haven't coded anything up yet), but I thought about using something like {{ExpiringMap}} for the {{QueuedMessage}}s, and when a message times out, either updating a (possibly new) field in the {{QueuedMessage}} or failing the {{ChannelPromise}}. Failing the {{ChannelPromise}} is a bit racy, as the netty event loop may be in the process of actually serializing and sending the message, and failing the promise (from another thread) while we're actually doing the work might not be great. Meanwhile, updating some new {{volatile}} field in the {{QueuedMessage}} requires yet another volatile access on the send path (better than a synchronized call for sure, but ...). {{ExpiringMap}}, of course, creates more garbage ({{CacheableObject}}), and has its own internal {{ConcurrentHashMap}}, with more volatiles and such. Not sure if this is a great idea, but it's the current state of my thinking. Thoughts?
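For illustration, a minimal sketch of the volatile-field variant weighed above, assuming a hypothetical {{expired}} flag on {{QueuedMessage}} that an expiration timer (e.g. an {{ExpiringMap}} eviction callback) sets and the netty event loop reads just before serializing; the names here are assumptions, not identifiers from the branch:

{code}
// Hypothetical sketch only: the 'expired' flag and method names below are
// illustrative assumptions, not identifiers from the actual branch.
final class QueuedMessage
{
    final Object message;                     // stand-in for the real MessageOut<?>
    final long enqueuedAtNanos = System.nanoTime();
    private volatile boolean expired;         // written by a timer thread, read on the event loop

    QueuedMessage(Object message) { this.message = message; }

    // called from an expiration timer, e.g. an ExpiringMap eviction callback
    void markExpired() { expired = true; }

    // called on the netty event loop just before serializing; the one volatile
    // read per message is exactly the extra send-path cost discussed above
    boolean isExpired(long timeoutNanos)
    {
        return expired || System.nanoTime() - enqueuedAtNanos > timeoutNanos;
    }
}
{code}

The event loop can then drop expired messages itself before serializing, which sidesteps the race of failing the {{ChannelPromise}} from another thread mid-send, at the cost of that one extra volatile read per message.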
[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868091#comment-15868091 ]

Ariel Weisberg edited comment on CASSANDRA-8457 at 2/15/17 4:53 PM:
--------------------------------------------------------------------

{quote}
Thus, I propose to bring back the SwappingByteBufDataOutputStreamPlus that I had in an earlier commit. To recap, the basic idea is to provide a DataOutputPlus that has a backing ByteBuffer that is written to, and when it is filled, it is written to the netty context and flushed, then a new buffer is allocated for more writes - kinda similar to a BufferedOutputStream, but replacing the backing buffer when full. Bringing this idea back is also what underpins one of the major performance things I wanted to address: buffering up smaller messages into one buffer to avoid going back to the netty allocator for every tiny buffer we might need - think Mutation acks.
{quote}

What thread is going to be writing to the output stream to serialize the messages? If it's a Netty thread, you can't block inside a serialization method waiting for the bytes to drain to a socket that is not keeping up, and you also can't wind out of the serialization method and continue it later. If it's an application thread, then it's no longer asynchronous, and a slow connection can block the application and prevent it from doing, say, a quorum write to just the fast nodes. You would also need to lock during serialization, or queue concurrently sent messages behind the one currently being written.

With large messages we aren't really fully eliminating the issue, only making it a factor better. At the other end you still need to materialize a buffer containing the message, plus the object graph you are going to materialize. This is different from how things worked previously, where we had a dedicated thread that would read fixed-size buffers and then materialize just the object graph from that. To really solve this we need to be able to avoid buffering the entire message on both the sending and receiving sides. The buffering is worse because we are allocating contiguous memory, not just doubling the space impact. We could make it incrementally better by using chains of fixed-size buffers so there is less external fragmentation and allocator overhead. That's still committing additional memory compared to pre-8457, but at least it's being committed in a more reasonable way.

I think the most elegant solution is to use a lightweight thread implementation. What we will probably be boxed into doing is making the serialization of result data and other large message portions able to yield. This will bound the memory committed to large messages to the largest atomic portion we have to serialize (Cell?). Something like an output stream being able to say "shouldYield": if you continue to write, it will continue to buffer and not fail, but it will use memory. Then serializers can implement a return value for serialize which indicates whether there is more to serialize. You would check shouldYield after each Cell or some other unit of work when serializing. Most of these large things being serialized are iterators, which could be stashed away. The trick will be that most serialization is stateless, and objects are serialized concurrently, so you can't safely store the serialization state in the object being serialized.

We also need to solve incremental non-blocking deserialization on the receive side, and I don't know how to do that yet. It's even trickier because you don't control how the message is fragmented, so you can't insert the yield points trivially.
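To make the yielding idea concrete, here is a rough sketch of what such a contract could look like, assuming hypothetical {{YieldingOutput}} and {{ResumableSerializer}} interfaces (not an API from the patch). The key point from above is that the resume state lives in the iterator being drained, never in the object being serialized, since objects are serialized concurrently:

{code}
// Sketch under assumed interfaces; Object stands in for Cell throughout.
import java.io.DataOutput;
import java.io.IOException;
import java.util.Iterator;

interface YieldingOutput extends DataOutput
{
    // true when the downstream channel is not draining and the serializer
    // should suspend; writing past this point still succeeds but buffers
    boolean shouldYield();
}

interface ResumableSerializer<T>
{
    // returns true if there is more to serialize; the caller re-invokes
    // later with the same iterator, which carries the serialization state
    boolean serialize(Iterator<T> remaining, YieldingOutput out) throws IOException;
}

final class CellsSerializer implements ResumableSerializer<Object>
{
    public boolean serialize(Iterator<Object> remaining, YieldingOutput out) throws IOException
    {
        while (remaining.hasNext())
        {
            out.writeUTF(remaining.next().toString()); // stand-in for real Cell serialization
            if (out.shouldYield())
                return remaining.hasNext();            // suspend after a unit of work
        }
        return false;                                  // fully serialized
    }
}
{code}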
[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15865754#comment-15865754 ]

Jason Brown edited comment on CASSANDRA-8457 at 2/14/17 1:14 PM:
-----------------------------------------------------------------

rebased, and made changes wrt CASSANDRA-13090

was (Author: jasobrown): rebased, and pulled in CASSANDRA-13090
[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15863758#comment-15863758 ]

Jason Brown edited comment on CASSANDRA-8457 at 2/13/17 9:02 PM:
-----------------------------------------------------------------

OK, so I've been performance load testing the snot out of this code for the last several weeks, with help from netty committers, flight recorder, and flame graphs. As a result, I've made some major and some minor tweaks, and now I'm slightly faster than trunk, with slightly better throughput. I have some optimizations in my back pocket that will increase performance even more, but as [~slebresne] and I have agreed before, we'll save those for follow-up tickets.

trunk
{code}
id, type, total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb
  4 threadCount, total, 233344, 3889, 3889, 3889, 1.0, 1.0, 1.2, 1.3, 1.5, 68.2, 60.0, 0.01549, 0, 9, 538, 538, 4, 5381
  8 threadCount, total, 544637, 9076, 9076, 9076, 0.8, 0.8, 1.0, 1.1, 1.4, 73.8, 60.0, 0.00978, 0, 20, 1267, 1267, 5, 11848
 16 threadCount, total, 1126627, 18774, 18774, 18774, 0.8, 0.8, 0.9, 1.0, 5.5, 78.2, 60.0, 0.01882, 0, 40, 2665, 2665, 6, 23666
 24 threadCount, total, 1562460, 26036, 26036, 26036, 0.9, 0.8, 1.0, 1.1, 9.1, 81.3, 60.0, 0.00837, 0, 55, 3543, 3543, 9, 32619
 36 threadCount, total, 2098097, 34962, 34962, 34962, 1.0, 0.9, 1.1, 1.3, 60.9, 83.0, 60.0, 0.00793, 0, 73, 4665, 4665, 7, 43144
 54 threadCount, total, 2741814, 45686, 45686, 45686, 1.1, 1.0, 1.4, 1.7, 62.2, 131.7, 60.0, 0.01321, 0, 93, 5748, 5748, 7, 55097
 81 threadCount, total, 3851131, 64166, 64166, 64166, 1.2, 1.0, 1.6, 2.6, 62.3, 151.7, 60.0, 0.01152, 0, 159, 8190, 8521, 14, 106805
121 threadCount, total, 4798169, 79947, 79947, 79947, 1.5, 1.1, 2.0, 3.0, 63.5, 117.8, 60.0, 0.05689, 0, 165, 9323, 9439, 5, 97536
181 threadCount, total, 5647043, 94088, 94088, 94088, 1.9, 1.4, 2.6, 4.9, 68.5, 169.2, 60.0, 0.01639, 0, 195, 10106, 11011, 11, 126422
271 threadCount, total, 6450510, 107461, 107461, 107461, 2.5, 1.8, 3.7, 12.0, 75.4, 155.8, 60.0, 0.01542, 0, 228, 10304, 12789, 9, 143857
406 threadCount, total, 6700764, 111635, 111635, 111635, 3.6, 2.5, 5.3, 55.8, 75.6, 196.5, 60.0, 0.01800, 0, 243, 9995, 13170, 7, 144166
609 threadCount, total, 7119535, 118477, 118477, 118477, 5.1, 3.5, 7.9, 62.8, 85.1, 170.0, 60.1, 0.01775, 0, 250, 10149, 13781, 7, 148118
913 threadCount, total, 7093347, 117981, 117981, 117981, 7.7, 4.9, 15.7, 71.3, 101.1, 173.4, 60.1, 0.02780, 0, 246, 10327, 13859, 8, 155896
{code}

8457
{code}
id, type, total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb
  4 threadCount, total, 161668, 2694, 2694, 2694, 1.4, 1.4, 1.6, 1.7, 3.2, 68.2, 60.0, 0.01264, 0, 6, 363, 363, 4, 3631
  8 threadCount, total, 498139, 8301, 8301, 8301, 0.9, 0.9, 1.1, 1.3, 1.8, 73.5, 60.0, 0.00446, 0, 19, 1164, 1164, 6, 11266
 16 threadCount, total, 765437, 12756, 12756, 12756, 1.2, 1.2, 1.4, 1.5, 5.7, 74.8, 60.0, 0.01251, 0, 29, 1819, 1819, 5, 17238
 24 threadCount, total, 1122768, 18710, 18710, 18710, 1.2, 1.2, 1.4, 1.5, 8.5, 127.7, 60.0, 0.00871, 0, 42, 2538, 2538, 5, 25054
 36 threadCount, total, 1649658, 27489, 27489, 27489, 1.3, 1.2, 1.4, 1.6, 60.1, 77.7, 60.0, 0.00627, 0, 57, 3652, 3652, 7, 33743
 54 threadCount, total, 2258999, 37641, 37641, 37641, 1.4, 1.3, 1.6, 1.8, 62.5, 81.7, 60.0, 0.00771, 0, 79, 4908, 4908, 6, 46789
 81 threadCount, total, 3255005, 54220, 54220, 54220, 1.5, 1.2, 1.7, 2.2, 63.8, 133.4, 60.0, 0.02030, 0, 117, 6953, 7008, 9, 72208
121 threadCount, [remainder truncated in the archive]
{code}
[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15527775#comment-15527775 ]

Jason Brown edited comment on CASSANDRA-8457 at 9/27/16 11:31 PM:
------------------------------------------------------------------

TL;DR - I've addressed everything except for the interaction between {{ClientHandshakeHandler}} and {{InternodeMessagingConnection}} (both are now renamed). I've noticed the odd rub there, as well, for a while, and I'll take some time to reconsider it.

re: "talking points"
- Backward compatibility - bit the bullet, and just yanked the old code
- streaming - [~slebresne] and I talked offline, and CASSANDRA-12229 will address the streaming parts, and will be worked on/reviewed concurrently. Both tickets will be committed together to avoid breaking streaming.

re: comments section 1
- Netty openssl - when I implemented this back in February, there was no mechanism to use {{KeyFactoryManager}} with the OpenSSL implementation. Fortunately, this has changed since I last checked in, so I've deleted the extra {{keyfile}} and friends entries from the yaml/{{Config}}.
- "old code" - deleted now
- package javadoc - I absolutely want this :), I just want things to be more solid code-wise before diving into that work.
- naming - names are now more consistent using In/Out (or Inbound/Outbound), and use of client/server is removed.

re: comments section 2
- {{getSocketThreads()}} - I've removed this for now; it will be resolved with CASSANDRA-12229
- {{MessagingService}} renames - done
- {{MessagingService#createConnection()}} - In the previous implementation, {{OutboundTcpConnectionPool}} only blocked on creating the threads for its wrapped {{OutboundTcpConnection}} instances (gossip, large, and small messages). No sockets were actually opened until a message was actually sent to that peer ({{OutboundTcpConnection#connect()}}). Since we do not spawn a separate thread for each connection type (even though we will have separate sockets), I don't think it's necessary to block in {{MessagingService#createConnection()}}, or, more correctly now, {{MessagingService.NettySender#getMessagingConnection()}}. See the sketch after this list.
- "Seems {{NettySender.listen()}} always starts a non-secure connection" - You are correct; however, it looks like we've always been doing it that way (for better or worse). I've gone ahead and made the change (it's a one-liner, plus a couple extra lines for error checking).
- {{ClientConnect#connectComplete}} - I've renamed the function to be more accurate ({{connectCallback}}).
- {{CoalescingMessageOutHandler}} - done

Other issues resolved, as well. Branch has been pushed (with several commits at the top) and tests running.
[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271701#comment-14271701 ]

Ariel Weisberg edited comment on CASSANDRA-8457 at 1/9/15 7:24 PM:
-------------------------------------------------------------------

Took a stab at writing an adaptive approach to coalescing based on a moving average. Numbers look good for the workloads tested.

Code: https://github.com/aweisberg/cassandra/compare/6be33289f34782e12229a7621022bb5ce66b2f1b...e48133c4d5acbaa6563ea48a0ca118c278b2f6f7

The impact of coalescing on individual messages appears to introduce quite a bit of latency. Without it I see an average latency of 25 microseconds between when a message is submitted and when it is written to the socket, but with coalescing that delay is 350-400 microseconds, even though I am only requesting a wait of at most 200 microseconds.

Testing in AWS, 14 servers, 6 clients. Using a fixed coalescing window, at low concurrency there is a drop in performance from 6746 to 3929; with adaptive coalescing I got 6758. At medium concurrency (5 threads per client, 6 clients) I got 31097 with coalescing disabled and 31120 with coalescing. At high concurrency (500 threads per client, 6 clients) I got 479532 with coalescing and 166010 without. This is with a maximum coalescing window of 200 milliseconds.

I added debug output to log when coalescing starts and stops, and it's interesting. At the beginning of the benchmark things flap, but they don't flap madly. After a few minutes it settles. I also notice a strange thing where CPU utilization at the start of a benchmark is 500% or so and then after a while it climbs, as if something somewhere is warming up or balancing. I recall seeing this in GCE as well.

I had one of the OutboundTcpConnections (first to get the permit) log a trace of all outgoing message times and threw that into a histogram for informational purposes: 50% of messages are sent within 100 microseconds of each other and 92% are sent within one millisecond. This is without any coalescing.

{noformat}
[HdrHistogram percentile table elided: garbled in the archive; the 50th percentile sits near 100 microseconds and the ~92nd near one millisecond, as summarized above]
{noformat}
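As an illustration of the adaptive approach (a sketch of the idea, not the code in the linked branch), the sender can track a moving average of the gaps between recent messages and only wait when messages already arrive close together, so a low-rate connection never pays the coalescing latency:

{code}
// Sketch of moving-average-driven coalescing; constants and names are
// illustrative assumptions. Intended to be called from the single sender
// thread for a connection, so no synchronization is shown.
final class AdaptiveCoalescer
{
    private static final long MAX_WINDOW_NANOS = 200_000;   // 200 microsecond cap
    private static final double ALPHA = 0.1;                 // smoothing factor

    private long lastSendNanos = System.nanoTime();
    private double avgGapNanos = Double.MAX_VALUE;            // start pessimistic: no coalescing

    // called once per message on the sending path; returns how long to wait
    // for more messages before flushing (0 = send immediately)
    long coalesceWaitNanos()
    {
        long now = System.nanoTime();
        long gap = now - lastSendNanos;
        lastSendNanos = now;
        avgGapNanos = avgGapNanos == Double.MAX_VALUE
                    ? gap
                    : ALPHA * gap + (1 - ALPHA) * avgGapNanos;
        // only worth waiting if another message is likely to arrive within the cap
        return avgGapNanos < MAX_WINDOW_NANOS
             ? Math.min((long) avgGapNanos * 2, MAX_WINDOW_NANOS)
             : 0;
    }
}
{code}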
[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268221#comment-14268221 ]

Ariel Weisberg edited comment on CASSANDRA-8457 at 1/7/15 9:11 PM:
-------------------------------------------------------------------

In GCE coalescing provided a 12% increase in throughput in this specific message-heavy, high-concurrency workload. The penalty is that at low concurrency there is an immediate loss of performance with any coalescing, and a larger window has a greater impact at low concurrency, so there is tension there: the larger the window, the better the performance bump at high concurrency.

RightScale does not offer instances with enhanced networking. To find out whether coalescing provides real benefits in EC2/Xen, or milder GCE-like benefits, I will have to get my hands on some.

I wanted to account for the impact of coalescing at low concurrency. Low concurrency is not a recipe for great performance, but it is part of the out-of-the-box experience and people do compare different databases at low concurrency. Testing with 3 client threads, each running on a dedicated client instance (3 threads total). This is in GCE, with TCP no delay on and coalescing:

||Coalesce window microseconds||Throughput||
|0|2191|
|6|1910|
|12|1873|
|25|1867|
|50|1779|
|100|1667|
|150|1566|
|200|1491|

I also tried disabling coalescing when using HSHA and it didn't seem to make a difference, which is surprising considering the impact of 25 microseconds of coalescing intra-cluster.

I also experimented with some other things: binding interrupts to cores 0 and 8 and tasksetting C* off of those cores. I didn't see a big impact. I did see a small positive impact using 3 clients and 8 servers, which means the measurements with 2 clients might be a little suspect. With 3 clients and 200 microseconds of coalescing it peaked at 165k in GCE. I also found out that banned CPUs in irqbalance is broken and has no effect, and this has been the case for some time.
[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266402#comment-14266402 ]

Ariel Weisberg edited comment on CASSANDRA-8457 at 1/6/15 5:05 PM:
-------------------------------------------------------------------

I think I stumbled onto what is going on, based on Benedict's suggestion to disable TCP no delay. It looks like there is a small message performance issue. This is something I have seen before in EC2, where you can only send a surprisingly small number of messages in/out of a VM. I don't have the numbers from when I microbenchmarked it, but it is something like 450k messages with TCP no delay and a million or so without. Adding more sockets helps, but it doesn't even double the number of messages in/out. Throwing more cores at the problem doesn't help; you just end up with underutilized cores, which matches the mysterious levels of starvation I was seeing in C* even though I was exposing sufficient concurrency.

14 server nodes, 6 client nodes, 500 threads per client. Server started with row_cache_size_in_mb: 2000, key_cache_size_in_mb: 500, rpc_max_threads: 1024, rpc_min_threads: 16, native_transport_max_threads: 1024, 8-gig old gen, 2-gig new gen. Client running CL=ALL and the same schema I have been using throughout this ticket.

With no delay off, first set of runs: 390264, 387958, 392322. After replacing 10 instances: 366579, 365818, 378221. With no delay on: 162987.

Modified trunk to fix a bug batching messages and add a configurable window for coalescing multiple messages into a socket; see https://github.com/aweisberg/cassandra/compare/f733996...49c6609 (a sketch of the coalescing loop follows below).

||Coalesce window microseconds||Throughput||
|250|502614|
|200|496206|
|150|487195|
|100|423415|
|50|326648|
|25|308175|
|12|292894|
|6|268456|
|0|153688|

I did not expect to get mileage out of coalescing at the application level, but it works extremely well. CPU utilization is still low at 1800%. There seems to be less correlation between CPU utilization and throughput as I vary the coalescing window, even though throughput changes dramatically. I do see that core 0 is looking pretty saturated and is only 10% idle. That might be the next, or actual, bottleneck.

What role this optimization plays at different cluster sizes is an important question. There has to be a tipping point where coalescing stops working because not enough packets go to each endpoint at the same time. With vnodes it wouldn't be unusual to be communicating with a large number of other hosts, right? It also takes a significant amount of additional latency to get the mileage at high levels of throughput, but at lower concurrency there is no benefit and it will probably show up as decreased throughput. That makes it tough to crank up as a default: either it is adaptive or most people don't get the benefit. At high levels of throughput it is a clear latency win; latency is much lower for individual requests on average. Making this a config option is viable as a starting point, possibly with a separate option for local/remote DC coalescing. Ideally we could make it adapt to the workload.

I am going to chase down what impact coalescing has at lower levels of concurrency so we can quantify the cost of turning it on. I'm also going to try and get to the bottom of all interrupts going to core 0. Maybe it is the real problem and coalescing is just a band-aid to get more throughput.
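For reference, a minimal sketch of the fixed coalescing window mentioned above, using a hypothetical {{CoalescingSender}}: after dequeuing one message, wait up to the window for more to arrive and flush the batch with a single write. The actual change is in the branch linked above:

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class CoalescingSender implements Runnable
{
    private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
    private final long windowNanos;

    CoalescingSender(long windowMicros)
    {
        this.windowNanos = TimeUnit.MICROSECONDS.toNanos(windowMicros);
    }

    void enqueue(byte[] serializedMessage) { queue.add(serializedMessage); }

    public void run()
    {
        List<byte[]> batch = new ArrayList<>();
        while (!Thread.currentThread().isInterrupted())
        {
            try
            {
                batch.add(queue.take());                  // block for the first message
                long deadline = System.nanoTime() + windowNanos;
                long remaining;
                while ((remaining = deadline - System.nanoTime()) > 0)
                {
                    byte[] next = queue.poll(remaining, TimeUnit.NANOSECONDS);
                    if (next == null)
                        break;                            // window expired
                    batch.add(next);
                }
                writeAndFlush(batch);                     // one socket write for the whole batch
                batch.clear();
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void writeAndFlush(List<byte[]> batch) { /* socket write elided */ }
}
{code}

A window of 0 degenerates to one write per message, which matches the |0|153688| row above, while larger windows amortize the syscall and packet overhead across more messages.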
[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266422#comment-14266422 ]

Ariel Weisberg edited comment on CASSANDRA-8457 at 1/6/15 5:31 PM:
-------------------------------------------------------------------

https://forums.aws.amazon.com/thread.jspa?messageID=459260

This looks like a Xen/EC2 issue. I'll bet bare metal never has this issue because it has multiple interrupt vectors for NICs. The only workaround is using multiple elastic network interfaces and doing something to make that practical for intra-cluster communication.

Apparently the RightScale instances I am using don't have enhanced networking. To get enhanced networking I need to use VPC. I am not sure if starting instances with enhanced networking is possible via RightScale, but I am going to find out. I don't know if enhanced networking addresses the interrupt vector issue. Will do more digging.
[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256326#comment-14256326 ]

Ariel Weisberg edited comment on CASSANDRA-8457 at 12/22/14 11:48 PM:
----------------------------------------------------------------------

Repeating these benchmarks, trunk can be slower as often as, or more often than, it is faster. It's one of those peak performance vs. average and worst case sort of decisions.

This is odd. Some measurements running trunk show a soft interrupt hotspot on one CPU. I don't see why that should happen. Interrupt hotspots are usually related to not enough TCP streams, and there should be several in this workload. It kind of looks like someone somewhere pinned interrupts to core 0.

{noformat}
11:20:04 PM  CPU   %usr  %nice   %sys %iowait   %irq  %soft %steal %guest  %idle
11:20:05 PM  all  11.19   0.00   9.82    0.00   0.00   0.98   1.27   0.00  76.74
11:20:05 PM    0  18.97   0.00  12.07    0.00   0.00  32.76   1.72   0.00  34.48
[per-CPU rows for cores 1-31 elided: garbled in the archive; all show %soft at or near 0.00]
{noformat}

Modified mpstat output is similar. perf stat doesn't have access to performance counters on these instances; I'll have to see if I can get instances that do. I have a flight recording of each, but not a flight recording of great performance on trunk.
[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244258#comment-14244258 ]

Ryan McGuire edited comment on CASSANDRA-8457 at 12/12/14 3:19 PM:
-------------------------------------------------------------------

[~benedict] modifying cstar_perf to run multiple instances per node is a larger task, and I'm wondering how useful that will be (it seems like a lot of resource contention / non-real-world variables). Assuming we had the alternative of 100% automated EC2 cluster bootstrapping/teardown, how often would we want to run these larger tests for it to be worth it?