[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService

2017-06-24 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062158#comment-16062158
 ] 

Jason Brown edited comment on CASSANDRA-8457 at 6/24/17 10:06 PM:
--

Uploaded results of a stress run (using the basic/naive CQL schema) from June 
2017. It accounts for all the variants (compress/coalesce/jdk-TLS/openssl-TLS). 
See the README in the tarball for more details.


was (Author: jasobrown):
results of 'naive' stress run, from June 2017

> nio MessagingService
> 
>
> Key: CASSANDRA-8457
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8457
> Project: Cassandra
>  Issue Type: New Feature
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: netty, performance
> Fix For: 4.x
>
> Attachments: 8457-load.tgz
>
>
> Thread-per-peer (actually two each incoming and outbound) is a big 
> contributor to context switching, especially for larger clusters.  Let's look 
> at switching to nio, possibly via Netty.






[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService

2017-04-12 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966390#comment-15966390
 ] 

Jason Brown edited comment on CASSANDRA-8457 at 4/12/17 7:58 PM:
-

[~aweisberg] So, I've kicked this idea around a bit (haven't coded anything up 
yet), but I thought about using something like {{ExpiringMap}} for the 
{{QueuedMessage}}s, and when a message times out, either update a (possibly 
new) field in the {{QueuedMessage}} or fail the {{ChannelPromise}}. Failing 
the {{ChannelPromise}} is a bit racy, as the netty event loop may be in the 
process of actually serializing and sending the message, and failing the 
promise (from another thread) might not be great while we're actually doing the 
work. Meanwhile, updating some new {{volatile}} field in the {{QueuedMessage}} 
requires yet another volatile update on the send path (better than a 
synchronized call for sure, but ...).

{{ExpiringMap}}, of course, creates more garbage ({{CacheableObject}}), and has 
its own internal {{ConcurrentHashMap}}, with more volatiles and such. Not sure 
if this is a great idea, but it's the current state of my thinking. Thoughts?
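
As a rough illustration of the volatile-flag variant (purely a sketch with 
made-up names; {{ExpiringMessageQueue}} is not a class from this branch, it 
just shows a sweeper marking timed-out messages so the send path only pays a 
volatile read):

{code}
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the patch's QueuedMessage: expiration is recorded
// in a volatile flag so the send path only pays for a volatile read.
final class QueuedMessage
{
    final long enqueuedNanos = System.nanoTime();
    final long timeoutNanos;
    volatile boolean expired;        // written by the sweeper, read on the send path

    QueuedMessage(long timeoutMillis)
    {
        this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    }

    boolean isExpired(long nowNanos)
    {
        return expired || nowNanos - enqueuedNanos > timeoutNanos;
    }
}

// Hypothetical queue + sweeper playing the role ExpiringMap would: a background
// task marks timed-out messages instead of failing the promise out from under
// the event loop.
final class ExpiringMessageQueue
{
    private final ConcurrentLinkedQueue<QueuedMessage> queue = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService sweeper = Executors.newSingleThreadScheduledExecutor();

    ExpiringMessageQueue()
    {
        sweeper.scheduleWithFixedDelay(this::sweep, 100, 100, TimeUnit.MILLISECONDS);
    }

    void offer(QueuedMessage m)
    {
        queue.offer(m);
    }

    // Send path (event loop): skip anything already marked expired.
    QueuedMessage pollUnexpired()
    {
        long now = System.nanoTime();
        QueuedMessage m;
        while ((m = queue.poll()) != null && m.isExpired(now)) { /* drop expired message */ }
        return m;
    }

    private void sweep()
    {
        long now = System.nanoTime();
        for (QueuedMessage m : queue)
            if (m.isExpired(now))
                m.expired = true;    // single volatile write; the promise is left alone
    }

    void shutdown()
    {
        sweeper.shutdownNow();
    }
}
{code}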


was (Author: jasobrown):
[~aweisberg] So, I've kicked this idea around a bit (haven't coded anything up 
yet), but I thought about using something like {{ExpiringMap}} for the 
{{QueuedMessages}}s, and when the message times out, either update a (possibly 
new) field in the {{QueuedMessage}} or failing the {{ChannelPromise}}. Failing 
the {{ChannelPromise}} is a bit racy as the netty event loop may be in the 
process of actually serializing and sending the message, and failing the 
promise (from another thread) might not be great as we're actually doing the 
work. Meanwhile, updating some new {{volatile}} field in the {{QueuedMessage}} 
requires yet another volatile update on the send path (better than a 
synchronized call for sure, but ...).

{{ExpiringMap}}, of course, creates more garbage ({{CacheableObject}}), and has 
it's own internal {{ConcurrentHashMap}}, with more volatiles and such. Not sure 
if this is great idea, but it's the current state of my thinking? Thoughts?

> nio MessagingService
> 
>
> Key: CASSANDRA-8457
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8457
> Project: Cassandra
>  Issue Type: New Feature
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: netty, performance
> Fix For: 4.x
>
>
> Thread-per-peer (actually two each incoming and outbound) is a big 
> contributor to context switching, especially for larger clusters.  Let's look 
> at switching to nio, possibly via Netty.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService

2017-02-15 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868091#comment-15868091
 ] 

Ariel Weisberg edited comment on CASSANDRA-8457 at 2/15/17 4:53 PM:


{quote}
Thus, I propose to bring back the SwappingByteBufDataOutputStreamPlus that I 
had in an earlier commit. To recap, the basic idea is provide a DataOutputPlus 
that has a backing ByteBuffer that is written to, and when it is filled, it is 
written to the netty context and flushed, then allocate a new buffer for more 
writes - kinda similar to a BufferedOutputStream, but replacing the backing 
buffer when full. Bringing this idea back is also what underpins one of the 
major performance things I wanted to address: buffering up smaller messages 
into one buffer to avoid going back to the netty allocator for every tiny 
buffer we might need - think Mutation acks.
{quote}
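
As a rough illustration of the buffer-swapping idea quoted above (a 
hypothetical sketch in plain {{java.io}}, with a {{Consumer<ByteBuffer>}} 
standing in for the Netty channel context; this is not code from the branch):

{code}
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.function.Consumer;

// Simplified sketch of the buffer-swapping idea: writes accumulate in a backing
// ByteBuffer, and when it fills up the whole buffer is handed downstream (the
// Netty context in the real proposal) and replaced with a fresh one.
final class SwappingBufferOutputStream extends OutputStream
{
    private final int bufferSize;
    private final Consumer<ByteBuffer> downstream;   // stands in for writing/flushing to the channel
    private ByteBuffer current;

    SwappingBufferOutputStream(int bufferSize, Consumer<ByteBuffer> downstream)
    {
        this.bufferSize = bufferSize;
        this.downstream = downstream;
        this.current = ByteBuffer.allocate(bufferSize);
    }

    @Override
    public void write(int b)
    {
        if (!current.hasRemaining())
            swap();
        current.put((byte) b);
    }

    @Override
    public void flush()
    {
        if (current.position() > 0)
            swap();
    }

    // Ship the filled buffer downstream and start a new one; small messages
    // written back-to-back share a buffer instead of each allocating their own.
    private void swap()
    {
        current.flip();
        downstream.accept(current);
        current = ByteBuffer.allocate(bufferSize);
    }
}
{code}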

What thread is going to be writing to the output stream to serialize the 
messages? If it's a Netty thread you can't block inside a serialization method 
waiting for the bytes to drain to the socket that is not keeping up. You also 
can't wind out of the serialization method and continue it later.

If it's an application thread then it's no longer asynchronous and a slow 
connection can block the application and prevent it from doing say a quorum 
write to just the fast nodes. You would also need to lock during serialization 
or queue concurrently sent messages behind the one currently being written.

With large messages we aren't really fully eliminating the issue, only making it 
a factor better. At the other end you still need to materialize a buffer 
containing the message + the object graph you are going to materialize. This is 
different from how things worked previously where we had a dedicated thread 
that would read fixed size buffers and then materialize just the object graph 
from that. 

To really solve this we need to be able to avoid buffering the entire message 
at both sending and receiving side. The buffering is worse because we are 
allocating contiguous memory and not just doubling the space impact. We could 
make it incrementally better by using chains of fixed size buffers so there is 
less external fragmentation and allocator overhead. That's still committing 
additional memory compared to pre-8457, but at least it's being committed in a 
more reasonable way.

I think the most elegant solution is to use a lightweight thread 
implementation. What we will probably be boxed into doing is making the 
serialization of result data and other large message portions able to yield. 
This will bound the memory committed to large messages to the largest atomic 
portion we have to serialize (Cell?).

Something like an output stream being able to say "shouldYield". If you 
continue to write it will continue to buffer and not fail, but use memory. Then 
serializers can implement a return value for serialize which indicates whether 
there is more to serialize. You would check shouldYield after each Cell or some 
unit of work when serializing. Most of these large things being serialized are 
iterators which could be stashed away. The trick will be that most 
serialization is stateless, and objects are serialized concurrently, so you 
can't safely store the serialization state in the object being serialized.

We also need to solve incremental non-blocking deserialization on the receive 
side, and I don't know how to do that yet. That's even trickier because you 
don't control how the message is fragmented, so you can't insert the yield 
points trivially.
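
As a rough illustration of the shape such a yielding API could take (purely 
hypothetical interfaces, not from any branch on this ticket):

{code}
import java.io.DataOutput;
import java.io.IOException;
import java.util.Iterator;

// Sketch of the "shouldYield" idea: the output can hint that the serializer
// should stop after the current atomic unit (e.g. a Cell), and the serializer
// reports whether it has more to write so it can be resumed later.
interface YieldingOutput extends DataOutput
{
    /** True when enough has been buffered that the caller should yield. */
    boolean shouldYield();
}

interface ResumableSerializer<T>
{
    /**
     * Serialize as much of {@code value} as is reasonable, checking
     * {@code out.shouldYield()} between atomic units.
     *
     * @return true if there is more to serialize on a later call
     */
    boolean serialize(T value, YieldingOutput out) throws IOException;
}

// Example: serializing an iterator of fixed-size cells, stashing the iterator
// as the resumption state (the tricky part called out above, since real
// serializers are stateless and objects may be serialized concurrently).
final class CellChunkSerializer implements ResumableSerializer<Iterator<long[]>>
{
    @Override
    public boolean serialize(Iterator<long[]> cells, YieldingOutput out) throws IOException
    {
        while (cells.hasNext())
        {
            for (long v : cells.next())
                out.writeLong(v);           // one atomic unit per cell
            if (out.shouldYield())
                return cells.hasNext();     // resume later from the stashed iterator
        }
        return false;                       // fully serialized
    }
}
{code}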


was (Author: aweisberg):
{quote}
Thus, I propose to bring back the SwappingByteBufDataOutputStreamPlus that I 
had in an earlier commit. To recap, the basic idea is provide a DataOutputPlus 
that has a backing ByteBuffer that is written to, and when it is filled, it is 
written to the netty context and flushed, then allocate a new buffer for more 
writes - kinda similar to a BufferedOutputStream, but replacing the backing 
buffer when full. Bringing this idea back is also what underpins one of the 
major performance things I wanted to address: buffering up smaller messages 
into one buffer to avoid going back to the netty allocator for every tiny 
buffer we might need - think Mutation acks.
{quote}

What thread is going to be writing to the output stream to serialize the 
messages? If it's a Netty thread you can't block inside a serialization method 
waiting for the bytes to drain to the socket that is not keeping up. You also 
can't wind out of the serialization method and continue it later.

If it's an application thread then it's no longer asynchronous and a slow 
connection can block the application and prevent it from doing say a quorum 
write to just the fast nodes. You would also need to lock during serialization 
or queue concurrently sent messages behind the one currently being written.

With 

[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService

2017-02-15 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868091#comment-15868091
 ] 

Ariel Weisberg edited comment on CASSANDRA-8457 at 2/15/17 4:42 PM:


{quote}
Thus, I propose to bring back the SwappingByteBufDataOutputStreamPlus that I 
had in an earlier commit. To recap, the basic idea is provide a DataOutputPlus 
that has a backing ByteBuffer that is written to, and when it is filled, it is 
written to the netty context and flushed, then allocate a new buffer for more 
writes - kinda similar to a BufferedOutputStream, but replacing the backing 
buffer when full. Bringing this idea back is also what underpins one of the 
major performance things I wanted to address: buffering up smaller messages 
into one buffer to avoid going back to the netty allocator for every tiny 
buffer we might need - think Mutation acks.
{quote}

What thread is going to be writing to the output stream to serialize the 
messages? If it's a Netty thread you can't block inside a serialization method 
waiting for the bytes to drain to the socket that is not keeping up. You also 
can't wind out of the serialization method and continue it later.

If it's an application thread then it's no longer asynchronous and a slow 
connection can block the application and prevent it from doing say a quorum 
write to just the fast nodes. You would also need to lock during serialization 
or queue concurrently sent messages behind the one currently being written.

With large messages we aren't really fully eliminating the issue, only making it 
a factor better. At the other end you still need to materialize a buffer 
containing the message + the object graph you are going to materialize. This is 
different from how things worked previously where we had a dedicated thread 
that would read fixed size buffers and then materialize just the object graph 
from that. 

To really solve this we need to be able to avoid buffering the entire message 
at both sending and receiving side. The buffering is worse because we are 
allocating contiguous memory and not just doubling the space impact. We could 
make it incrementally better by using chains of fixed size buffers so there is 
less external fragmentation and allocator overhead. That's still committing 
additional memory compared to pre-8457, but at least it's being committed in a 
more reasonable way.

I think the most elegant solution is to use a lightweight thread 
implementation. What we will probably be boxed into doing is making the 
serialization of result data and other large message portions able to yield. 
This will bound the memory committed to large messages to the largest atomic 
portion we have to serialize (Cell?).

Something like an output stream being able to say "shouldYield". If you 
continue to write it will continue to buffer and not fail, but use memory. Then 
serializers can implement a return value for serialize which indicates whether 
there is more to serialize. You would check shouldYield after each Cell or some 
unit of work when serializing. Most of these large things being serialized are 
iterators. The trick will be that most serialization is stateless, and objects 
are serialized concurrently, so you can't safely store the serialization state 
in the object being serialized.

We also need to solve incremental non-blocking deserialization on the receive 
side, and I don't know how to do that yet. That's even trickier because you 
don't control how the message is fragmented, so you can't insert the yield 
points trivially.


was (Author: aweisberg):
{quote}
Thus, I propose to bring back the SwappingByteBufDataOutputStreamPlus that I 
had in an earlier commit. To recap, the basic idea is provide a DataOutputPlus 
that has a backing ByteBuffer that is written to, and when it is filled, it is 
written to the netty context and flushed, then allocate a new buffer for more 
writes - kinda similar to a BufferedOutputStream, but replacing the backing 
buffer when full. Bringing this idea back is also what underpins one of the 
major performance things I wanted to address: buffering up smaller messages 
into one buffer to avoid going back to the netty allocator for every tiny 
buffer we might need - think Mutation acks.
{quote}

What thread is going to be writing to the output stream to serialize the 
messages? If it's a Netty thread you can't block inside a serialization method 
waiting for the bytes to drain to the socket that is not keeping up. You also 
can't wind out of the serialization method and continue it later.

If it's an application thread then it's no longer asynchronous and a slow 
connection can block the application and prevent it from doing say a quorum 
write to just the fast nodes. You would also need to lock during serialization 
or queue concurrently sent messages behind the one currently being written.

With large messages we aren't really 

[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService

2017-02-15 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868091#comment-15868091
 ] 

Ariel Weisberg edited comment on CASSANDRA-8457 at 2/15/17 4:05 PM:


{quote}
Thus, I propose to bring back the SwappingByteBufDataOutputStreamPlus that I 
had in an earlier commit. To recap, the basic idea is provide a DataOutputPlus 
that has a backing ByteBuffer that is written to, and when it is filled, it is 
written to the netty context and flushed, then allocate a new buffer for more 
writes - kinda similar to a BufferedOutputStream, but replacing the backing 
buffer when full. Bringing this idea back is also what underpins one of the 
major performance things I wanted to address: buffering up smaller messages 
into one buffer to avoid going back to the netty allocator for every tiny 
buffer we might need - think Mutation acks.
{quote}

What thread is going to be writing to the output stream to serialize the 
messages? If it's a Netty thread you can't block inside a serialization method 
waiting for the bytes to drain to the socket that is not keeping up. You also 
can't wind out of the serialization method and continue it later.

If it's an application thread then it's no longer asynchronous and a slow 
connection can block the application and prevent it from doing say a quorum 
write to just the fast nodes. You would also need to lock during serialization 
or queue concurrently sent messages behind the one currently being written.

With large messages we aren't really fully eliminating the issue, only making it 
a factor better. At the other end you still need to materialize a buffer 
containing the message + the object graph you are going to materialize. This is 
different from how things worked previously where we had a dedicated thread 
that would read fixed size buffers and then materialize just the object graph 
from that. 

To really solve this we need to be able to avoid buffering the entire message 
at both sending and receiving side. The buffering is worse because we are 
allocating contiguous memory and not just doubling the space impact. We could 
make it incrementally better by using chains of fixed size buffers so there is 
less external fragmentation and allocator overhead. That's still committing 
additional memory compared to pre-8457, but at least it's being committed in a 
more reasonable way.

I think the most elegant solution is to use a lightweight thread 
implementation. What we will probably be boxed into doing is making the 
serialization of result data and other large message portions able to yield. 
This will bound the memory committed to large messages to the largest atomic 
portion we have to serialize (Cell?).

Something like an output stream being able to say "shouldYield". If you 
continue to write it will continue to buffer and not fail, but use memory. Then 
serializers can implement a return value for serialize which indicates whether 
there is more to serialize. You would check shouldYield after each Cell or some 
unit of work when serializing. Most of these large things being serialized are 
iterators. The trick will be that most serialization is stateless, and objects 
are serialized concurrently, so you can't safely store the serialization state 
in the object being serialized.

We also need to solve incremental non-blocking deserialization on the receive 
side, and I don't know how to do that yet. That's even trickier because you 
don't control how the message is fragmented, so you can't insert the yield 
points trivially.


was (Author: aweisberg):
{quote}
Thus, I propose to bring back the SwappingByteBufDataOutputStreamPlus that I 
had in an earlier commit. To recap, the basic idea is provide a DataOutputPlus 
that has a backing ByteBuffer that is written to, and when it is filled, it is 
written to the netty context and flushed, then allocate a new buffer for more 
writes - kinda similar to a BufferedOutputStream, but replacing the backing 
buffer when full. Bringing this idea back is also what underpins one of the 
major performance things I wanted to address: buffering up smaller messages 
into one buffer to avoid going back to the netty allocator for every tiny 
buffer we might need - think Mutation acks.
{quote}

What thread is going to be writing to the output stream to serialize the 
messages? If it's a Netty thread you can't block inside a serialization method 
waiting for the bytes to drain to the socket that is not keeping up. You also 
can't wind out of the serialization method and continue it later.

If it's an application thread then it's no longer asynchronous and a slow 
connection can block the application and prevent it from doing say a quorum 
write to just the fast nodes. You would also need to lock during serialization 
or queue concurrently sent messages behind the one currently being written.

With large messages we aren't really 

[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService

2017-02-14 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865754#comment-15865754
 ] 

Jason Brown edited comment on CASSANDRA-8457 at 2/14/17 1:14 PM:
-

rebased, and made changes wrt CASSANDRA-13090


was (Author: jasobrown):
rebased, and pulled in CASSANDRA-13090

> nio MessagingService
> 
>
> Key: CASSANDRA-8457
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8457
> Project: Cassandra
>  Issue Type: New Feature
>Reporter: Jonathan Ellis
>Assignee: Jason Brown
>Priority: Minor
>  Labels: netty, performance
> Fix For: 4.x
>
>
> Thread-per-peer (actually two each incoming and outbound) is a big 
> contributor to context switching, especially for larger clusters.  Let's look 
> at switching to nio, possibly via Netty.





[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService

2017-02-13 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863758#comment-15863758
 ] 

Jason Brown edited comment on CASSANDRA-8457 at 2/13/17 9:02 PM:
-

OK, so I've been performance load testing the snot out of this code for the last 
several weeks, with help from netty committers, flight recorder, and flame 
graphs. As a result, I've made some major and some minor tweaks, and now I'm 
slightly faster than trunk with slightly better throughput. I have some 
optimizations in my back pocket that will increase it even more, but as 
[~slebresne] and I have agreed before, we'll save those for follow-up tickets.

trunk
{code}
 id, type   total ops,op/s,pk/s,   row/s,mean, 
med, .95, .99,.999, max,   time,   stderr, errors,  gc: #,  max 
ms,  sum ms,  sdv ms,  mb
  4 threadCount, total,233344,3889,3889,3889, 1.0, 
1.0, 1.2, 1.3, 1.5,68.2,   60.0,  0.01549,  0,  9, 
538, 538,   4,5381
  8 threadCount, total,544637,9076,9076,9076, 0.8, 
0.8, 1.0, 1.1, 1.4,73.8,   60.0,  0.00978,  0, 20,
1267,1267,   5,   11848
 16 threadCount, total,   1126627,   18774,   18774,   18774, 0.8, 
0.8, 0.9, 1.0, 5.5,78.2,   60.0,  0.01882,  0, 40,
2665,2665,   6,   23666
 24 threadCount, total,   1562460,   26036,   26036,   26036, 0.9, 
0.8, 1.0, 1.1, 9.1,81.3,   60.0,  0.00837,  0, 55,
3543,3543,   9,   32619
 36 threadCount, total,   2098097,   34962,   34962,   34962, 1.0, 
0.9, 1.1, 1.3,60.9,83.0,   60.0,  0.00793,  0, 73,
4665,4665,   7,   43144
 54 threadCount, total,   2741814,   45686,   45686,   45686, 1.1, 
1.0, 1.4, 1.7,62.2,   131.7,   60.0,  0.01321,  0, 93,
5748,5748,   7,   55097
 81 threadCount, total,   3851131,   64166,   64166,   64166, 1.2, 
1.0, 1.6, 2.6,62.3,   151.7,   60.0,  0.01152,  0,159,
8190,8521,  14,  106805
121 threadCount, total,   4798169,   79947,   79947,   79947, 1.5, 
1.1, 2.0, 3.0,63.5,   117.8,   60.0,  0.05689,  0,165,
9323,9439,   5,   97536
181 threadCount, total,   5647043,   94088,   94088,   94088, 1.9, 
1.4, 2.6, 4.9,68.5,   169.2,   60.0,  0.01639,  0,195,   
10106,   11011,  11,  126422
271 threadCount, total,   6450510,  107461,  107461,  107461, 2.5, 
1.8, 3.7,12.0,75.4,   155.8,   60.0,  0.01542,  0,228,   
10304,   12789,   9,  143857
406 threadCount, total,   6700764,  111635,  111635,  111635, 3.6, 
2.5, 5.3,55.8,75.6,   196.5,   60.0,  0.01800,  0,243,
9995,   13170,   7,  144166
609 threadCount, total,   7119535,  118477,  118477,  118477, 5.1, 
3.5, 7.9,62.8,85.1,   170.0,   60.1,  0.01775,  0,250,   
10149,   13781,   7,  148118
913 threadCount, total,   7093347,  117981,  117981,  117981, 7.7, 
4.9,15.7,71.3,   101.1,   173.4,   60.1,  0.02780,  0,246,   
10327,   13859,   8,  155896
{code}

8457
{code}
 id, type   total ops,op/s,pk/s,   row/s,mean, 
med, .95, .99,.999, max,   time,   stderr, errors,  gc: #,  max 
ms,  sum ms,  sdv ms,  mb
  4 threadCount, total,161668,2694,2694,2694, 1.4, 
1.4, 1.6, 1.7, 3.2,68.2,   60.0,  0.01264,  0,  6, 
363, 363,   4,3631
  8 threadCount, total,498139,8301,8301,8301, 0.9, 
0.9, 1.1, 1.3, 1.8,73.5,   60.0,  0.00446,  0, 19,
1164,1164,   6,   11266
 16 threadCount, total,765437,   12756,   12756,   12756, 1.2, 
1.2, 1.4, 1.5, 5.7,74.8,   60.0,  0.01251,  0, 29,
1819,1819,   5,   17238
 24 threadCount, total,   1122768,   18710,   18710,   18710, 1.2, 
1.2, 1.4, 1.5, 8.5,   127.7,   60.0,  0.00871,  0, 42,
2538,2538,   5,   25054
 36 threadCount, total,   1649658,   27489,   27489,   27489, 1.3, 
1.2, 1.4, 1.6,60.1,77.7,   60.0,  0.00627,  0, 57,
3652,3652,   7,   33743
 54 threadCount, total,   2258999,   37641,   37641,   37641, 1.4, 
1.3, 1.6, 1.8,62.5,81.7,   60.0,  0.00771,  0, 79,
4908,4908,   6,   46789
 81 threadCount, total,   3255005,   54220,   54220,   54220, 1.5, 
1.2, 1.7, 2.2,63.8,   133.4,   60.0,  0.02030,  0,117,
6953,7008,   9,   72208
121 

[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService

2016-09-27 Thread Jason Brown (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15527775#comment-15527775
 ] 

Jason Brown edited comment on CASSANDRA-8457 at 9/27/16 11:31 PM:
--

TL;DR - I've addressed everything except for the interaction between 
{{ClientHandshakeHandler}} and {{InternodeMessagingConnection}} (both are now 
renamed). I've noticed the odd rub there, as well, for a while, and I'll take 
some time to reconsider it.

re: "talking points"
- Backward compatibility - bit the bullet, and just yanked the old code
- streaming - [~slebresne] and I talked offline, and CASSANDRA-12229 will 
address the streaming parts, and will be worked on/reviewed concurrently. Both 
tickets will be committed together to avoid breaking streaming.

re: comments section 1
- Netty openssl - when I implemented this back in February, there was no 
mechanism to use {{KeyFactoryManager}} with the OpenSSL implementation. 
Fortunately, this has changed since I last checked in, so I've deleted the 
extra {{keyfile}} and friends entries from the yaml/{{Config}}.
- "old code" - deleted now
- package javadoc - I absolutely want this :), I just want things to be more 
solid code-wise before diving into that work.
- naming - names are now more consistent using In/Out (or Inbound/Outbound), 
and use of client/server is removed.

re: comments section 2
- {{getSocketThreads()}} - I've removed this for now, and will be resolved with 
CASSANDRA-12229
- {{MessagingService}} renames - done
- {{MessagingService#createConnection()}} - In the previous implementation, 
{{OutboundTcpConnectionPool}} only blocked on creating the threads for its 
wrapped {{OutboundTcpConnection}} instances (gossip, large, and small 
messages). No sockets were actually opened until a message was actually sent to 
that peer ({{OutboundTcpConnection#connect()}}). Since we do not spawn a separate 
thread for each connection type (even though we will have separate sockets), I 
don't think it's necessary to block {{MessagingService#createConnection()}}, or 
more correctly now, {{MessagingService.NettySender#getMessagingConnection()}}.
- "Seems {{NettySender.listen()}} always starts a non-secure connection" - You 
are correct; however, looks like we've always been doing it that way (for 
better or worse). I've gone ahead and made the change (it's a one liner, plus a 
couple extra for error checking).
- {{ClientConnect#connectComplete}} - I've renamed the function to be more 
accurate ({{connectCallback}}).
- {{CoalescingMessageOutHandler}} - done

Other issues are resolved as well. The branch has been pushed (with several 
commits at the top) and tests are running.



was (Author: jasobrown):
TL;DR - I've addressed everything except for the interaction between 
{{ClientHandshakeHandler}} and {{Interno -deMessagingConnection}} (both are 
noew renamed). I've noticed the odd rub there, as well, for a while, and I'll 
take some time to reconsider it.

re: "talking points"
- Backward compatibility - bit the bullet, and just yanked the old code
- streaming - [~slebresne] and I talked offline, and CASSANDRA-12229 will 
address the streaming parts, and will be worked on/reviewed concurrently. Both 
tickets will be committed together to avoid breaking streaming.

re: comments section 1
- Netty openssl - when I implemented this back in February, there was no 
mechanism to use {{KeyFactoryManager}} with the OpenSSL implementaion. 
Fortunately, this has changed since I last checked in, so I've deleted the 
extra {{keyfile}} and friends entries from the yaml/{{Config}}.
- "old code" - deleted now
- package javadoc - I absolutely want this :), I just want things to be more 
solid code-wise before diving into that work.
- naming - names are now more consistent using In/Out (or Inbound/Outbound), 
and use of client/server is removed.

re: comments section 2
- {{getSocketThreads()}} - I've removed this for now, and will be resolved with 
CASSANDRA-12229
- {{MessagingService}} renames - done
- {{MessagingService#createConnection()}} In the previous implementation, 
{{OutboundTcpConnectionPool}} only blocked on creating the threads for it's 
wrapped {{OutboundTcpConnection}} instances (gossip, large, and small 
messages). No sockets were actually opened until a message was actually sent to 
that peer {{OutboundTcpConnection#connect()}}. Since we do not spawn a separate 
thread for each connection type (even though we will have separate sockets), I 
don't think it's necessary to block {{MessagingService#createConnection()}}, or 
more correctly now, {{MessagingService.NettySender#getMessagingConnection()}}.
- "Seems {{NettySender.listen()}} always starts a non-secure connection" - You 
are correct; however, looks like we've always been doing it that way (for 
better or worse). I've gone ahead and made the change (it's a one liner, plus a 
couple extra for error checking).
- 

[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService

2015-01-09 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271701#comment-14271701
 ] 

Ariel Weisberg edited comment on CASSANDRA-8457 at 1/9/15 7:08 PM:
---

Took a stab at writing an adaptive approach to coalescing based on a moving 
average. Numbers look good for the workloads tested.
Code 
https://github.com/aweisberg/cassandra/compare/6be33289f34782e12229a7621022bb5ce66b2f1b...e48133c4d5acbaa6563ea48a0ca118c278b2f6f7
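
The linked branch is the actual implementation; purely as an illustration of 
the general technique (hypothetical names, not the code from that branch), a 
moving-average-driven window could look something like:

{code}
import java.util.concurrent.TimeUnit;

// Rough sketch of a moving-average coalescing window: track the average gap
// between outgoing messages and only wait for more messages when the average
// gap is small enough that waiting is likely to pay off.
// Not thread-safe; intended to be used from the single flushing thread.
final class AdaptiveCoalescingWindow
{
    private final long maxWindowNanos;
    private final double alpha;              // EWMA smoothing factor, e.g. 0.2
    private double avgGapNanos = Double.MAX_VALUE;
    private boolean hasPrevious;
    private long lastMessageNanos;

    AdaptiveCoalescingWindow(long maxWindowMicros, double alpha)
    {
        this.maxWindowNanos = TimeUnit.MICROSECONDS.toNanos(maxWindowMicros);
        this.alpha = alpha;
    }

    // Called for every outgoing message to update the moving average of inter-arrival gaps.
    void onMessage(long nowNanos)
    {
        if (hasPrevious)
        {
            long gap = nowNanos - lastMessageNanos;
            avgGapNanos = avgGapNanos == Double.MAX_VALUE
                        ? gap
                        : alpha * gap + (1 - alpha) * avgGapNanos;
        }
        lastMessageNanos = nowNanos;
        hasPrevious = true;
    }

    // How long to wait for more messages before writing to the socket: zero when
    // traffic is sparse (coalescing would only add latency), otherwise a few
    // average gaps, capped at the configured maximum window.
    long coalesceWaitNanos()
    {
        if (avgGapNanos > maxWindowNanos)
            return 0;
        return Math.min(maxWindowNanos, (long) (4 * avgGapNanos));
    }
}
{code}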

Testing in AWS, 14 servers 6 clients.

Using a fixed coalescing window at low concurrency, there is a drop in 
performance from 6746 to 3929. With adaptive coalescing I got 6758.

At medium concurrency (5 threads per client, 6 clients) I got 31097 with 
coalescing disabled and 31120 with coalescing.

At high concurrency (500 threads per client, 6 clients) I got 479532 with 
coalescing and 166010 without. This is with a maximum coalescing window of 200 
milliseconds.

I added debug output to log when coalescing starts and stops and it's 
interesting. At the beginning of the benchmark things flap, but they don't flap 
madly. After a few minutes it settles. I also notice a strange thing where CPU 
utilization at the start of a benchmark is 500% or so and then after a while it 
climbs. Like something somewhere is warming up or balancing. I recall seeing 
this in GCE as well.

I had one of the OutboundTcpConnections (first to get the permit) log a trace 
of all outgoing message times. I threw that into a histogram for informational 
purposes. 50% of messages are sent within 100 microseconds of each other and 
92% are sent within one millisecond. This is without any coalescing.

{noformat}
   Value Percentile TotalCount 1/(1-Percentile)

   0.000 0.   5554   1.00
   5.703 0.1000 124565   1.11
  13.263 0.2000 249128   1.25
  24.143 0.3000 373630   1.43
  40.607 0.4000 498108   1.67
  94.015 0.5000 622664   2.00
 158.463 0.5500 684867   2.22
 244.351 0.6000 747137   2.50
 305.407 0.6500 809631   2.86
 362.239 0.7000 871641   3.33
 428.031 0.7500 933978   4.00
 467.711 0.7750 965085   4.44
 520.703 0.8000 996254   5.00
 595.967 0.82501027359   5.71
 672.767 0.85001058457   6.67
 743.935 0.87501089573   8.00
 780.799 0.88751105290   8.89
 821.247 0.90001120774  10.00
 868.351 0.91251136261  11.43
 928.767 0.92501151889  13.33
1006.079 0.93751167421  16.00
1049.599 0.943750001175260  17.78
1095.679 0.95001183041  20.00
1143.807 0.956250001190779  22.86
1198.079 0.96251198542  26.67
1264.639 0.968750001206301  32.00
1305.599 0.971875001210228  35.56
1354.751 0.97501214090  40.00
1407.999 0.978125001217975  45.71
1470.463 0.981250001221854  53.33
1542.143 0.984375001225759  64.00
1586.175 0.985937501227720  71.11
1634.303 0.98751229643  80.00
1688.575 0.989062501231596  91.43
1756.159 0.990625001233523 106.67
1839.103 0.992187501235464 128.00
1887.231 0.992968751236430 142.22
1944.575 0.993750001237409 160.00
2007.039 0.994531251238384 182.86
2084.863 0.995312501239358 213.33
2174.975 0.996093751240326 256.00
2230.271 0.9964843750001240818 284.44
2293.759 0.996875001241292 320.00
2369.535 0.9972656250001241785 365.71
2455.551 0.997656251242271 426.67
2578.431 0.9980468750001242752 512.00
2656.255 0.9982421875001242999 568.89
2740.223 0.998437501243244 640.00
2834.431 0.9986328125001243482 731.43
2957.311 0.9988281250001243725 853.33
3131.391 0.99902343750012439691024.00
3235.839 0.99912109375012440911137.78
3336.191 0.9992187512442121280.00
3471.359 0.99931640625012443321462.86
3641.343 0.99941406250012444551706.67
3837.951 0.99951171875012445762048.00
4001.791 0.99956054687512446362275.56
4136.959 0.99960937500012446972560.00
4399.103 

[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService

2015-01-09 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271701#comment-14271701
 ] 

Ariel Weisberg edited comment on CASSANDRA-8457 at 1/9/15 7:24 PM:
---

Took a stab at writing an adaptive approach to coalescing based on a moving 
average. Numbers look good for the workloads tested.
Code 
https://github.com/aweisberg/cassandra/compare/6be33289f34782e12229a7621022bb5ce66b2f1b...e48133c4d5acbaa6563ea48a0ca118c278b2f6f7

Coalescing appears to introduce quite a bit of latency for individual messages. 
Without it, I see an average latency of 25 microseconds between when a message 
is submitted and when it is written to the socket, but with coalescing that 
delay is 350-400 microseconds even though I am only requesting a wait of 200 
microseconds at most.

Testing in AWS, 14 servers 6 clients.

Using a fixed coalescing window at low concurrency, there is a drop in 
performance from 6746 to 3929. With adaptive coalescing I got 6758.

At medium concurrency (5 threads per client, 6 clients) I got 31097 with 
coalescing disabled and 31120 with coalescing.

At high concurrency (500 threads per client, 6 clients) I got 479532 with 
coalescing and 166010 without. This is with a maximum coalescing window of 200 
milliseconds.

I added debug output to log when coalescing starts and stops and it's 
interesting. At the beginning of the benchmark things flap, but they don't flap 
madly. After a few minutes it settles. I also notice a strange thing where CPU 
utilization at the start of a benchmark is 500% or so and then after a while it 
climbs. Like something somewhere is warming up or balancing. I recall seeing 
this in GCE as well.

I had one of the OutboundTcpConnections (first to get the permit) log a trace 
of all outgoing message times. I threw that into a histogram for informational 
purposes. 50% of messages are sent within 100 microseconds of each other and 
92% are sent within one millisecond. This is without any coalescing.

{noformat}
   Value Percentile TotalCount 1/(1-Percentile)

   0.000 0.   5554   1.00
   5.703 0.1000 124565   1.11
  13.263 0.2000 249128   1.25
  24.143 0.3000 373630   1.43
  40.607 0.4000 498108   1.67
  94.015 0.5000 622664   2.00
 158.463 0.5500 684867   2.22
 244.351 0.6000 747137   2.50
 305.407 0.6500 809631   2.86
 362.239 0.7000 871641   3.33
 428.031 0.7500 933978   4.00
 467.711 0.7750 965085   4.44
 520.703 0.8000 996254   5.00
 595.967 0.82501027359   5.71
 672.767 0.85001058457   6.67
 743.935 0.87501089573   8.00
 780.799 0.88751105290   8.89
 821.247 0.90001120774  10.00
 868.351 0.91251136261  11.43
 928.767 0.92501151889  13.33
1006.079 0.93751167421  16.00
1049.599 0.943750001175260  17.78
1095.679 0.95001183041  20.00
1143.807 0.956250001190779  22.86
1198.079 0.96251198542  26.67
1264.639 0.968750001206301  32.00
1305.599 0.971875001210228  35.56
1354.751 0.97501214090  40.00
1407.999 0.978125001217975  45.71
1470.463 0.981250001221854  53.33
1542.143 0.984375001225759  64.00
1586.175 0.985937501227720  71.11
1634.303 0.98751229643  80.00
1688.575 0.989062501231596  91.43
1756.159 0.990625001233523 106.67
1839.103 0.992187501235464 128.00
1887.231 0.992968751236430 142.22
1944.575 0.993750001237409 160.00
2007.039 0.994531251238384 182.86
2084.863 0.995312501239358 213.33
2174.975 0.996093751240326 256.00
2230.271 0.9964843750001240818 284.44
2293.759 0.996875001241292 320.00
2369.535 0.9972656250001241785 365.71
2455.551 0.997656251242271 426.67
2578.431 0.9980468750001242752 512.00
2656.255 0.9982421875001242999 568.89
2740.223 0.998437501243244 640.00
2834.431 0.9986328125001243482 731.43
2957.311 0.9988281250001243725 853.33
3131.391 0.99902343750012439691024.00
3235.839 0.99912109375012440911137.78
  

[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService

2015-01-07 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268221#comment-14268221
 ] 

Ariel Weisberg edited comment on CASSANDRA-8457 at 1/7/15 9:11 PM:
---

In GCE coalescing provided a 12% increase in throughput in this specific 
message heavy high concurrency workload. The penalty is that at low concurrency 
there is an immediate loss of performance with any coalescing and a large 
window has a greater impact at low concurrency so there is tension there. The 
larger the window the better the performance bump.

RightScale does not offer instances with enhanced networking. To find out 
whether coalescing provides real benefits in EC2/Xen, or milder GCE-like 
benefits, I will have to get my hands on some.

I wanted to account for the impact of coalescing at low concurrency. Low 
concurrency is not a recipe for great performance, but it is part of the out of 
the box experience and people do compare different databases at low concurrency.

Testing with 3 client threads each running on a dedicated client instance (3 
threads total). This is in GCE.

With TCP no delay on and coalescing
||Coalesce window microseconds|Throughput||
|0| 2191|
|6| 1910|
|12| 1873|
|25| 1867|
|50| 1779|
|100| 1667|
|150| 1566|
|200| 1491|

I also tried disabling coalescing when using HSHA and it didn't seem to make a 
difference. Surprising considering the impact of 25 microseconds of coalescing 
intra-cluster.

I also experimented with some other things, such as binding interrupts to cores 
0 and 8 and tasksetting C* off of those cores. I didn't see a big impact.

I did see a small positive impact using 3 clients and 8 servers, which means the 
measurements with 2 clients might be a little suspect. With 3 clients and 200 
microseconds of coalescing it peaked at 165k in GCE.

I also found out that the banned-CPUs option in irqbalance is broken and has no 
effect, and this has been the case for some time.



was (Author: aweisberg):
I wanted to account for the impact of coalescing at low concurrency. Low 
concurrency is not a recipe for great performance, but it is part of the out of 
the box experience and people do compare different databases at low concurrency.

In GCE coalescing provided a 12% increase in throughput in this specific 
message heavy high concurrency workload. The penalty is that at low concurrency 
there is an immediate loss of performance with any coalescing and a large 
window has a greater impact at low concurrency so there is tension there. The 
larger the window the better the performance bump.

Testing with 3 client threads each running on a dedicated client instance (3 
threads total). This is in GCE.

With TCP no delay on and coalescing
||Coalesce window microseconds|Throughput||
|0| 2191|
|6| 1910|
|12| 1873|
|25| 1867|
|50| 1779|
|100| 1667|
|150| 1566|
|200| 1491|

I also tried disabling coalescing when using HSHA and it didn't seem to make a 
difference. Surprising considering the impact of 25 microseconds of coalescing 
intra-cluster.

I also experimented with some other things. Binding interrupts cores 0 and 8 
and task setting C* off of those cores. I didn't see a big impact.

I did see a small positive impact using 3 clients 8 servers which means the 
measurements with 2 clients might be a little suspect. With 3 clients and 200 
microseconds of coalescing it peaked at 165k in GCE.

I also found out that banned CPUs in irqbalance is broken and has no effect and 
this has been the case for some time.

Right scale does not offer instances with enhanced networking. To find out 
whether coalescing provides real benefits in EC2/Xen or milder GCE like 
benefits I will have to get my hands on some.


 nio MessagingService
 

 Key: CASSANDRA-8457
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8457
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter: Jonathan Ellis
Assignee: Ariel Weisberg
  Labels: performance
 Fix For: 3.0


 Thread-per-peer (actually two each incoming and outbound) is a big 
 contributor to context switching, especially for larger clusters.  Let's look 
 at switching to nio, possibly via Netty.





[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService

2015-01-06 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266402#comment-14266402
 ] 

Ariel Weisberg edited comment on CASSANDRA-8457 at 1/6/15 5:05 PM:
---

I think I stumbled onto what is going on based on Benedict's suggestion to 
disable TCP no delay. It looks like there is a small message performance issue. 

This is something I have seen before in EC2 where you can only send a 
surprisingly small number of messages in/out of a VM. I don't have the numbers 
from when I micro benchmarked it, but it is something like 450k messages with 
TCP no delay and a million or so without. Adding more sockets helps, but it 
doesn't even double the number of messages in/out. Throwing more cores at the 
problem doesn't help; you just end up with underutilized cores, which matches 
the mysterious levels of starvation I was seeing in C* even though I was 
exposing sufficient concurrency.

14 server nodes. 6 client nodes. 500 threads per client. Server started with
row_cache_size_in_mb : 2000,
key_cache_size_in_mb:500,
rpc_max_threads : 1024,
rpc_min_threads : 16,
native_transport_max_threads : 1024
8-gig old gen, 2 gig new gen.

Client running CL=ALL and the same schema I have been using throughout this 
ticket.

With no delay off
First set of runs
390264
387958
392322
After replacing 10 instances
366579
365818
378221

No delay on 
162987

Modified trunk to fix a bug in message batching and to add a configurable window 
for coalescing multiple messages into a socket; see 
https://github.com/aweisberg/cassandra/compare/f733996...49c6609
||Coalesce window microseconds|Throughput||
|250| 502614|
|200| 496206|
|150| 487195|
|100| 423415|
|50| 326648|
|25| 308175|
|12| 292894|
|6| 268456|
|0| 153688|
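
The linked compare view is the actual change; as a generic sketch of what a 
fixed coalescing window on the outbound path does (hypothetical code, not from 
that branch):

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative fixed-window coalescing loop: after the first message arrives,
// keep draining for up to `windowMicros` so several messages share one socket
// write (and ideally one TCP packet) instead of each being written immediately.
final class CoalescingSender
{
    static <M> List<M> nextBatch(BlockingQueue<M> queue, long windowMicros) throws InterruptedException
    {
        List<M> batch = new ArrayList<>();
        batch.add(queue.take());                           // block for the first message
        long deadline = System.nanoTime() + TimeUnit.MICROSECONDS.toNanos(windowMicros);
        long remaining;
        while ((remaining = deadline - System.nanoTime()) > 0)
        {
            M extra = queue.poll(remaining, TimeUnit.NANOSECONDS);
            if (extra == null)
                break;                                     // window elapsed
            batch.add(extra);
        }
        queue.drainTo(batch);                              // grab anything already waiting
        return batch;                                      // caller serializes the batch into one write
    }
}
{code}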

I did not expect to get mileage out of coalescing at the application level but 
it works extremely well. CPU utilization is still low at 1800%. There seems to 
be less correlation between CPU utilization and throughput as I vary the 
coalescing window and throughput changes dramatically. I do see that core 0 is 
looking pretty saturated and is only 10% idle. That might be the next or actual 
bottleneck.

What role this optimization plays at different cluster sizes is an important 
question. There has to be a tipping point where coalescing stops working 
because not enough packets go to each endpoint at the same time. With vnodes 
it wouldn't be unusual to be communicating with a large number of other hosts, 
right?

It also takes a significant amount of additional latency to get the mileage at 
high levels of throughput, but at lower concurrency there is no benefit and it 
will probably show up as decreased throughput. It makes it tough to crank it up 
as a default. Either it is adaptive or most people don't get the benefit.

At high levels of throughput it is a clear latency win. Latency is much lower 
for individual requests on average. Making this a config option is viable as a 
starting point. Possibly a separate option for local/remote DC coalescing. 
Ideally we could make it adapt to the workload.

I am going to chase down what impact coalescing has at lower levels of 
concurrency so we can quantify the cost of turning it on. I'm also going to try 
and get to the bottom of all interrupts going to core 0. Maybe it is the real 
problem and coalescing is just a band aid to get more throughput.


was (Author: aweisberg):
I think I stumbled onto what is going on based on Benedict's suggestion to 
disable TCP no delay. It looks like there is a small message performance issue. 

This is something I have seen before in EC2 where you can only send a 
surprisingly small number of messages in/out of a VM. I don't have the numbers 
from when I micro benchmarked it, but it is something like 450k messages with 
TCP no delay and a million or so without. Adding more sockets helps but it 
doesn't  even double the number of messages in/out. Throwing more cores at the 
problem doesn't help you just end up with under utilized cores which matches 
the mysterious levels of starvation I was seeing in C* even though I was 
exposing sufficient concurrency.

14 servers nodes. 6 client nodes. 500 threads per client. Server started with
row_cache_size_in_mb : 2000,
key_cache_size_in_mb:500,
rpc_max_threads : 1024,
rpc_min_threads : 16,
native_transport_max_threads : 1024
8-gig old gen, 2 gig new gen.

Client running CL=ALL and the same schema I have been using throughout this 
ticket.

With no delay off
First set of runs
390264
387958
392322
After replacing 10 instances
366579
365818
378221

No delay on 
162987

Modified trunk to fix a bug batching messages and add a configurable window for 
coalescing multiple messages into a socket see 
https://github.com/aweisberg/cassandra/compare/f733996...49c6609
||Coalesce window 

[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService

2015-01-06 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266422#comment-14266422
 ] 

Ariel Weisberg edited comment on CASSANDRA-8457 at 1/6/15 5:31 PM:
---

https://forums.aws.amazon.com/thread.jspa?messageID=459260
This looks like a Xen/EC2 issue. I'll bet bare metal never has this issue 
because it has multiple interrupt vectors for NICs. The only workaround is 
using multiple elastic network interfaces and doing something to make that 
practical for intra-cluster communication.

Apparently the RightScale instances I am using don't have enhanced networking. 
To get enhanced networking I need to use VPC. I am not sure if starting 
instances with enhanced networking is possible via RightScale, but I am going 
to find out. I don't know if enhanced networking addresses the interrupt vector 
issue. Will do more digging.


was (Author: aweisberg):
https://forums.aws.amazon.com/thread.jspa?messageID=459260
This looks like a Xen/EC2 issue. I'll bet bare metal never has this issue 
because it has multiple interrupt vectors for NICs. The only work around is 
using multiple elastic network interfaces and doing something to make that 
practical for intra-cluster communication.

 nio MessagingService
 

 Key: CASSANDRA-8457
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8457
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter: Jonathan Ellis
Assignee: Ariel Weisberg
  Labels: performance
 Fix For: 3.0


 Thread-per-peer (actually two each incoming and outbound) is a big 
 contributor to context switching, especially for larger clusters.  Let's look 
 at switching to nio, possibly via Netty.





[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService

2014-12-22 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256326#comment-14256326
 ] 

Ariel Weisberg edited comment on CASSANDRA-8457 at 12/22/14 11:48 PM:
--

Repeating these benchmarks, trunk can be slower as often as, or more often than, 
it is faster. It's one of those peak performance vs. average and worst case 
sorts of decisions.

This is odd. Some measurements running trunk. Soft interrupt hotspot on one 
CPU. Don't see why that should happen. Interrupt hotspots are usually related 
to not enough TCP streams and there should be several in this workload. Kind of 
looks like someone somewhere pinned interrupts to core 0.
{noformat}
11:20:04 PM  CPU%usr   %nice%sys %iowait%irq   %soft  %steal  
%guest   %idle
11:20:05 PM  all   11.190.009.820.000.000.981.27
0.00   76.74
11:20:05 PM0   18.970.00   12.070.000.00   32.761.72
0.00   34.48
11:20:05 PM16.120.00   10.200.000.000.004.08
0.00   79.59
11:20:05 PM27.550.00   11.320.000.000.003.77
0.00   77.36
11:20:05 PM3   11.110.009.260.000.000.001.85
0.00   77.78
11:20:05 PM4   17.860.005.360.000.000.000.00
0.00   76.79
11:20:05 PM5   10.710.008.930.000.000.000.00
0.00   80.36
11:20:05 PM6   10.170.00   10.170.000.000.001.69
0.00   77.97
11:20:05 PM7   10.000.00   11.670.000.000.001.67
0.00   76.67
11:20:05 PM8   13.560.006.780.000.000.001.69
0.00   77.97
11:20:05 PM9   12.500.00   10.940.000.000.001.56
0.00   75.00
11:20:05 PM   10   15.150.009.090.000.000.001.52
0.00   74.24
11:20:05 PM   117.940.00   12.700.000.000.001.59
0.00   77.78
11:20:05 PM   129.520.009.520.000.000.001.59
0.00   79.37
11:20:05 PM   139.090.00   12.120.000.000.000.00
0.00   78.79
11:20:05 PM   14   11.760.00   10.290.000.000.001.47
0.00   76.47
11:20:05 PM   159.230.009.230.000.000.001.54
0.00   80.00
11:20:05 PM   16   10.290.00   14.710.000.000.001.47
0.00   73.53
11:20:05 PM   17   10.450.00   11.940.000.000.000.00
0.00   77.61
11:20:05 PM   18   11.940.008.960.000.000.001.49
0.00   77.61
11:20:05 PM   19   12.680.00   12.680.000.000.001.41
0.00   73.24
11:20:05 PM   20   13.240.008.820.000.000.000.00
0.00   77.94
11:20:05 PM   215.800.00   15.940.000.000.000.00
0.00   78.26
11:20:05 PM   22   13.890.009.720.000.000.001.39
0.00   75.00
11:20:05 PM   239.380.009.380.000.000.000.00
0.00   81.25
11:20:05 PM   247.810.009.380.000.000.001.56
0.00   81.25
11:20:05 PM   258.960.008.960.000.000.001.49
0.00   80.60
11:20:05 PM   26   11.590.00   10.140.000.000.000.00
0.00   78.26
11:20:05 PM   27   13.040.007.250.000.000.001.45
0.00   78.26
11:20:05 PM   287.810.007.810.000.000.000.00
0.00   84.38
11:20:05 PM   29   11.760.008.820.000.000.000.00
0.00   79.41
11:20:05 PM   30   11.430.00   10.000.000.000.001.43
0.00   77.14
11:20:05 PM   31   12.500.004.690.000.000.000.00
0.00   82.81
{noformat}  
 
Modified mpstat output is similar. perf stat doesn't have access to performance 
counters in these instances. I'll have to see if I can get instances that do 
that. I have a flight recording of each, but not a flight recording of great 
performance on trunk.


was (Author: aweisberg):
Repeating these benchmarks, trunk can be slower as or more often then it is 
faster. It's one of those peak performance vs. average and worst case sort of 
decisions.

This is odd. Some measurements running trunk. Soft interrupt hotspot on one 
CPU. Don't see why that should happen. Interrupt hotspots are usually related 
to not enough TCP streams and there should be several in this workload. Kind of 
looks like someone somewhere pinned interrupts to core 0.
{noformat}
11:20:04 PM  CPU%usr   %nice%sys %iowait%irq   %soft  %steal  
%guest   %idle
11:20:05 PM  all   11.190.009.820.000.000.981.27
0.00   76.74
11:20:05 PM0   18.970.00   12.070.000.00   32.761.72
0.00   34.48
11:20:05 PM16.120.00   10.200.000.000.004.08
0.00   79.59
11:20:05 

[jira] [Comment Edited] (CASSANDRA-8457) nio MessagingService

2014-12-12 Thread Ryan McGuire (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14244258#comment-14244258
 ] 

Ryan McGuire edited comment on CASSANDRA-8457 at 12/12/14 3:19 PM:
---

[~benedict] modifying cstar_perf to run multiple instances per node is a larger 
task, and I'm wondering how useful that will be (seems like a lot of resource 
contention / non-real-world variables.) Assuming we had the alternative of 100% 
automated EC2 cluster bootstrapping/teardown, how often would we want to run 
these larger tests for it to be worth it? 


was (Author: enigmacurry):
@Benedict modifying cstar_perf to run multiple instances per node is a larger 
task, and I'm wondering how useful that will be (seems like a lot of resource 
contention / non-real-world variables.) Assuming we had the alternative of 100% 
automated EC2 cluster bootstrapping/teardown, how often would we want to run 
these larger tests for it to be worth it? 

 nio MessagingService
 

 Key: CASSANDRA-8457
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8457
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter: Jonathan Ellis
Assignee: Ariel Weisberg
  Labels: performance
 Fix For: 3.0


 Thread-per-peer (actually two each incoming and outbound) is a big 
 contributor to context switching, especially for larger clusters.  Let's look 
 at switching to nio, possibly via Netty.


