Re: Proton Performance Pictures (1 of 2)

2014-09-04 Thread Alan Conway
On Thu, 2014-09-04 at 02:35 +0200, Leon Mlakar wrote:
 On 04/09/14 01:34, Alan Conway wrote:
  On Thu, 2014-09-04 at 00:38 +0200, Leon Mlakar wrote:
  On 04/09/14 00:25, Alan Conway wrote:
  On Wed, 2014-09-03 at 12:16 -0400, Michael Goulish wrote:
  OK -- I just had a quick talk with Ted, and this makes sense
  to me now:
 
  count *receives* per second.
 
  I had it turned around and was worried about *sends* per second,
  and then got confused by issues of fanout.
 
  If you only count *receives* per second, and assume no discards,
  it seems to me that you can indeed make a fair speed comparison
  between
 
   sender -- receiver
 
   sender -- intermediary -- receiver
 
  and
 
   sender -- intermediary -- {receiver_1 ... receiver_n}
 
  and even
 
   sender -- {arbitrary network of intermediaries} -- {receiver_1 
  ... receiver_n}
 
  phew.
 
 
  So I will do it that way.
  That's right for throughput, but don't forget latency. A well behaved
  intermediary should have little effect on throughput but will inevitably
  add latency.
 
  Measuring latency between hosts is a pain. You can time-stamp messages
  at the origin host but clock differences can give you bogus numbers if
  you compare that to the time on a different host when the messages
  arrive. One trick is to have the messages arrive back at the same host
  where you time-stamped them (even if they pass through other hosts in
  between) but that isn't always what you really want to measure. Maybe
  there's something to be done with NTP, I've never dug into that. Have
  fun!
 
  To get a reasonably good estimate of the time difference between sender
 and receiver, one could exchange several timestamped messages, w/o
  intermediary, in both directions and get both sides to agree on the
  difference between them. Do that before the test, and then repeat the
  exchange at the end of the test to check for the drift. This of course
  assumes stable network latencies during these exchanges and is usable
  only in test environments. Exchanging several messages instead of just
 one should help eliminate sporadic instabilities.
 
  As I understand it that's pretty much what NTP does.
  http://en.wikipedia.org/wiki/Network_Time_Protocol says that NTP can
  achieve better than one millisecond accuracy in local area networks
  under ideal conditions. That doesn't sound good enough to measure
  sub-millisecond latencies. I doubt that a home-grown attempt at timing
  message exchanges will do better than NTP :( NTP may deserve further
  investigation, however; Wikipedia probably makes some very broad
  assumptions about what your ideal network conditions are, and it's possible
  that it can be tuned better than that.
 
  I can easily get sub-millisecond round-trip latencies out of Qpid with a
  short message burst:
  qpidd --auth=no --tcp-nodelay
  qpid-cpp-benchmark --connection-options '{tcp-nodelay:true}' -q1 -m100
  send-tp recv-tp l-min   l-max   l-avg   total-tp
  38816   30370   0.21    1.18    0.70    3943
 
  Sadly if I tell qpidd to use AMQP 1.0 (and therefore proton), things
  degenerate very badly from a latency perspective.
  qpid-cpp-benchmark --connection-options 
  '{tcp-nodelay:true,protocol:amqp1.0}' -q1 -m100
  send-tp recv-tp l-min   l-max   l-avg   total-tp
  26086   19552   3.13    6.65    5.28    913
  
  However this may not be proton's fault; the problem could be in qpidd's
  AMQP1.0 adapter layer. I'm glad to see that we're starting to measure
  these things for proton and dispatch, that will surely lead to
  improvement.
 
 Yes, you are correct, that's basically what NTP does ... and neither 
 will work well with sub-millisecond ranges. I didn't realize that this 
 is what you were after.

It depends a lot on what kind of system you are measuring, but fine-tuning
the qpid/proton/dispatch tool-set involves some (hopefully!) pretty low
latencies. Even in relatively complicated cases (my past pain was
clustered qpidd for low-latency applications) you may be measuring
3-4 ms, but even then 1 ms error bars are too big.

 
 There is a beast called 
 http://en.wikipedia.org/wiki/Precision_Time_Protocol, though. A year ago 
 we took a brief look into this but concluded that millisecond accuracy 
 was good enough and that it was not worth the effort.
 
Thanks that's interesting!

 And of course, it is also possible to attach a GPS receiver to both the 
 sending and receiving hosts. With decent drivers this should provide at 
 least microsecond accuracy.

That is interesting also! But complicated. This is why I end up just
sending the message back to the host of origin and dividing by 2, even
if it's not really the right answer. I usually only care if it's better
or worse than the previous build anyway :)
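A minimal sketch of that round-trip trick, in C (illustrative, not the
actual psend/precv code; the send/echo steps are left as placeholder
comments). Only one clock is involved, so clock skew cancels out; the
divide-by-two is what assumes the two directions cost the same.

    #include <stdio.h>
    #include <time.h>

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);  /* monotonic: unaffected by wall-clock jumps */
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void) {
        double sent = now_sec();
        /* ... send one message, then block until the echo arrives back on
         * this host, possibly having passed through intermediaries ... */
        double returned = now_sec();
        printf("round trip %.6f s, one-way estimate %.6f s\n",
               returned - sent, (returned - sent) / 2.0);
        return 0;
    }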

Cheers,
Alan.




Proton Performance Pictures (1 of 2)

2014-09-03 Thread Michael Goulish

[ resend :  I am attaching only 1 image here, so hopefully the apache 
mail gadget will not become upset.  Next one in next email. ]



Attached, please find two cool pictures of the valgrind/callgrind data
I got with a test run of the psend and precv clients I mentioned before.


( Sorry, I keep saying 'clients'.  These are pure Peer-to-Peer. )
( Hey -- if we ever sell this technology to maritime transport companies,
could we call it Pier-to-Pier ? )



This was from a run of 100,000 messages, using a credit strategy of
200, 100, 100:
i.e. start at 200, and every time you get down to 100, add 100.

That point is where I seem to find the best performance on my
system: 123,500 messages per second received.  ( i.e. 247,000
transfers per second ) using about 180% CPU ( i.e. 90% each of
2 processors. )
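A minimal sketch of that credit strategy in terms of the proton-c
link-credit calls (illustrative, not the actual precv source): grant 200
credits when the receiving link opens, and whenever outstanding credit
falls to 100, add another 100.

    #include <proton/link.h>

    #define CREDIT_INITIAL 200
    #define CREDIT_LOW     100
    #define CREDIT_BATCH   100

    /* Call once when the receiving link becomes active. */
    static void grant_initial_credit(pn_link_t *receiver) {
        pn_link_flow(receiver, CREDIT_INITIAL);
    }

    /* Call after each received message has been processed. */
    static void maybe_replenish(pn_link_t *receiver) {
        if (pn_link_credit(receiver) <= CREDIT_LOW)
            pn_link_flow(receiver, CREDIT_BATCH);
    }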

By the way, I actually got repeatably better performance (maybe
1.5% better  (which resulted in the 123,500 number)) by using processors
1 and 3 on my laptop, rather than 1 and 2.   Looking at /proc/cpuinfo,
I see that processors 1 and 3 have different core IDs.  OK, whatever.
( And it's an Intel system... )
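For anyone who wants to reproduce the processor pairing, a small sketch of
pinning the calling process to one CPU with sched_setaffinity(2); taskset -c
does the same from the shell. The CPU numbers, like the 1.5% figure, are
specific to this machine.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling process to a single CPU, e.g. psend on 1, precv on 3. */
    static int pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = self */
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }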


I think there are no shockers here:
  psend spends its time in pn_post_transfer_frame  (44%)
  precv spends its time in pn_dispatch_frame   (67%)


The code is at https://github.com/mick-goulish/proton_c_clients.git

I will put all this performance info in there too, shortly.


Re: Proton Performance Pictures (1 of 2)

2014-09-03 Thread Gordon Sim

On 09/03/2014 08:51 AM, Michael Goulish wrote:

That point is where I seem to find the best performance on my
system: 123,500 messages per second received.  ( i.e. 247,000
transfers per second ) using about 180% CPU ( i.e. 90% each of
2 processors. )


If you are sending direct between the sender and receiver process (i.e. 
no intermediary process), then why are you doubling the number of 
messages sent to get 'transfers per second'? One transfer is the sending 
of a message from one process to another, which in this case is the same 
as messages sent or received.


Re: Proton Performance Pictures (1 of 2)

2014-09-03 Thread Michael Goulish




- Original Message -
 On 09/03/2014 08:51 AM, Michael Goulish wrote:
  That point is where I seem to find the best performance on my
  system: 123,500 messages per second received.  ( i.e. 247,000
  transfers per second ) using about 180% CPU ( i.e. 90% each of
  2 processors. )
 
 If you are sending direct between the sender and receiver process (i.e.
 no intermediary process), then why are you doubling the number of
 messages sent to get 'transfers per second'? One transfer is the sending
 of a message from one process to another, which in this case is the same
 as messages sent or received.
 

Yes, this is interesting.

I need a way to make a fair comparison between something like this setup 
(simple peer-to-peer) and the Dispatch Router numbers I was getting
earlier.


For the router, the analogous topology is

writer -- router -- reader

in which case I counted each message twice.



But it does not seem right to count a single message in
   writer -- router -- reader 
as 2 transfers, while counting a single message in
   writer -- reader
as only 1 transfer.

Because -- from the application point of view, those two topologies 
are doing the same work.



Also I think that I *need* to count    writer -- router -- reader
as 2, because in *this* case:


 writer -- router -- reader_1
                 \
                  \-- reader_2


...I need to count that as 3.



? Thoughts ?




Re: Proton Performance Pictures (1 of 2)

2014-09-03 Thread Ted Ross


On 09/03/2014 11:35 AM, Michael Goulish wrote:
 
 
 
 
 - Original Message -
 On 09/03/2014 08:51 AM, Michael Goulish wrote:
 That point is where I seem to find the best performance on my
 system: 123,500 messages per second received.  ( i.e. 247,000
 transfers per second ) using about 180% CPU ( i.e. 90% each of
 2 processors. )

 If you are sending direct between the sender and receiver process (i.e.
 no intermediary process), then why are you doubling the number of
 messages sent to get 'transfers per second'? One transfer is the sending
 of a message from one process to another, which in this case is the same
 as messages sent or received.

 
 Yes, this is interesting.
 
 I need a way to make a fair comparison between something like this setup 
 (simple peer-to-peer) and the Dispatch Router numbers I was getting
 earlier.
 
 
 For the router, the analogous topology is
 
 writer -- router -- reader
 
 in which case I counted each message twice.
 
 
 
 But it does not seem right to count a single message in
writer -- router -- reader 
 as 2 transfers, while counting a single message in
writer -- reader
 as only 1 transfer.
 
 Because -- from the application point of view, those two topologies 
 are doing the same work.

You should probably be using throughput and not transfers in this case.

 
 
 
 Also I think that I *need* to count    writer -- router -- reader
 as 2, because in *this* case:
 
 
  writer -- router -- reader_1
                  \
                   \-- reader_2
 
 
 ...I need to count that as 3 .
 
 
 
 ? Thoughts ?
 
 


Re: Proton Performance Pictures (1 of 2)

2014-09-03 Thread Michael Goulish

OK -- I just had a quick talk with Ted, and this makes sense
to me now:

  count *receives* per second.

I had it turned around and was worried about *sends* per second,
and then got confused by issues of fanout.

If you only count *receives* per second, and assume no discards,
it seems to me that you can indeed make a fair speed comparison 
between

   sender -- receiver

   sender -- intermediary -- receiver

and

   sender -- intermediary -- {receiver_1 ... receiver_n}

and even

   sender -- {arbitrary network of intermediaries} -- {receiver_1 ... 
receiver_n}

phew.


So I will do it that way.

This is from the application perspective, asking how fast your 
messaging system is.  It doesn't care about how fancy the intermediation 
is; it only cares about results.  This seems like the right way to judge that.
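In code terms the rule is simply: start a clock, count deliveries on the
receiving side only, and divide. A minimal sketch (the names are
illustrative, not taken from the actual clients); with fanout, each
receiver counts its own receives and the rates are summed.

    #include <inttypes.h>
    #include <stdio.h>
    #include <time.h>

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Call once per received message; report when the expected total arrives. */
    static void on_receive(uint64_t *received, uint64_t expected, double start) {
        if (++*received == expected) {
            double elapsed = now_sec() - start;
            printf("%" PRIu64 " receives in %.3f s = %.0f receives/sec\n",
                   *received, elapsed, (double)*received / elapsed);
        }
    }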









- Original Message -


On 09/03/2014 11:35 AM, Michael Goulish wrote:
 
 
 
 
 - Original Message -
 On 09/03/2014 08:51 AM, Michael Goulish wrote:
 That point is where I seem to find the best performance on my
 system: 123,500 messages per second received.  ( i.e. 247,000
 transfers per second ) using about 180% CPU ( i.e. 90% each of
 2 processors. )

 If you are sending direct between the sender and receiver process (i.e.
 no intermediary process), then why are you doubling the number of
 messages sent to get 'transfers per second'? One transfer is the sending
 of a message from one process to another, which in this case is the same
 as messages sent or received.

 
 Yes, this is interesting.
 
 I need a way to make a fair comparison between something like this setup 
 (simple peer-to-peer) and the Dispatch Router numbers I was getting
 earlier.
 
 
 For the router, the analogous topology is
 
 writer -- router -- reader
 
 in which case I counted each message twice.
 
 
 
 But it does not seem right to count a single message in
writer -- router -- reader 
 as 2 transfers, while counting a single message in
writer -- reader
 as only 1 transfer.
 
 Because -- from the application point of view, those two topologies 
 are doing the same work.

You should probably be using throughput and not transfers in this case.

 
 
 
 Also I think that I *need* to count    writer -- router -- reader
 as 2, because in *this* case:
 
 
  writer -- router -- reader_1
                  \
                   \-- reader_2
 
 
 ...I need to count that as 3 .
 
 
 
 ? Thoughts ?
 
 


Re: Proton Performance Pictures (1 of 2)

2014-09-03 Thread Alan Conway
On Wed, 2014-09-03 at 12:16 -0400, Michael Goulish wrote:
 OK -- I just had a quick talk with Ted, and this makes sense
 to me now:
 
   count *receives* per second.
 
 I had it turned around and was worried about *sends* per second,
 and then got confused by issues of fanout.
 
 If you only count *receives* per second, and assume no discards,
 it seems to me that you can indeed make a fair speed comparison 
 between
 
sender -- receiver
 
sender -- intermediary -- receiver
 
 and
 
sender -- intermediary -- {receiver_1 ... receiver_n}
 
 and even
 
sender -- {arbitrary network of intermediaries} -- {receiver_1 ... 
 receiver_n}
 
 phew.
 
 
 So I will do it that way.

That's right for throughput, but don't forget latency. A well behaved
intermediary should have little effect on throughput but will inevitably
add latency.

Measuring latency between hosts is a pain. You can time-stamp messages
at the origin host but clock differences can give you bogus numbers if
you compare that to the time on a different host when the messages
arrive. One trick is to have the messages arrive back at the same host
where you time-stamped them (even if they pass through other hosts in
between) but that isn't always what you really want to measure. Maybe
there's something to be done with NTP, I've never dug into that. Have
fun!





Re: Proton Performance Pictures (1 of 2)

2014-09-03 Thread Leon Mlakar

On 04/09/14 00:25, Alan Conway wrote:

On Wed, 2014-09-03 at 12:16 -0400, Michael Goulish wrote:

OK -- I just had a quick talk with Ted, and this makes sense
to me now:

   count *receives* per second.

I had it turned around and was worried about *sends* per second,
and then got confused by issues of fanout.

If you only count *receives* per second, and assume no discards,
it seems to me that you can indeed make a fair speed comparison
between

sender -- receiver

sender -- intermediary -- receiver

and

sender -- intermediary -- {receiver_1 ... receiver_n}

and even

sender -- {arbitrary network of intermediaries} -- {receiver_1 ... 
receiver_n}

phew.


So I will do it that way.

That's right for throughput, but don't forget latency. A well behaved
intermediary should have little effect on throughput but will inevitably
add latency.

Measuring latency between hosts is a pain. You can time-stamp messages
at the origin host but clock differences can give you bogus numbers if
you compare that to the time on a different host when the messages
arrive. One trick is to have the messages arrive back at the same host
where you time-stamped them (even if they pass through other hosts in
between) but that isn't always what you really want to measure. Maybe
there's something to be done with NTP, I've never dug into that. Have
fun!

To get a reasonably good estimate of the time difference between sender 
and receiver, one could exchange several timestamped messages, w/o 
intermediary, in both directions and get both sides to agree on the 
difference between them. Do that before the test, and then repeat the 
exchange at the end of the test to check for the drift. This of course 
assumes stable network latencies during these exchanges and is usable 
only in test environments. Exchanging several messages instead of just 
one should help eliminate sporadic instabilities.
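For reference, a sketch of the arithmetic that agreement boils down to
(the same four-timestamp formula NTP uses; the names are illustrative):
each exchange records t1/t4 on the sender's clock and t2/t3 on the
receiver's clock, and the offset is taken from the exchange with the
smallest round-trip delay, since that one was least disturbed by the
network.

    /* t1 = sender sends, t2 = receiver gets it, t3 = receiver replies,
     * t4 = sender gets the reply.  t1/t4 on the sender's clock,
     * t2/t3 on the receiver's clock. */
    typedef struct {
        double t1, t2, t3, t4;
    } exchange_t;

    /* Estimated receiver-minus-sender clock offset, in seconds. */
    static double estimate_offset(const exchange_t *x, int n) {
        double best_delay = 1e300, best_offset = 0.0;
        for (int i = 0; i < n; i++) {
            double delay  = (x[i].t4 - x[i].t1) - (x[i].t3 - x[i].t2);
            double offset = ((x[i].t2 - x[i].t1) + (x[i].t3 - x[i].t4)) / 2.0;
            if (delay < best_delay) { best_delay = delay; best_offset = offset; }
        }
        return best_offset;
    }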


Leon



Re: Proton Performance Pictures (1 of 2)

2014-09-03 Thread Alan Conway
On Thu, 2014-09-04 at 00:38 +0200, Leon Mlakar wrote:
 On 04/09/14 00:25, Alan Conway wrote:
  On Wed, 2014-09-03 at 12:16 -0400, Michael Goulish wrote:
  OK -- I just had a quick talk with Ted, and this makes sense
  to me now:
 
 count *receives* per second.
 
  I had it turned around and was worried about *sends* per second,
  and then got confused by issues of fanout.
 
  If you only count *receives* per second, and assume no discards,
  it seems to me that you can indeed make a fair speed comparison
  between
 
  sender -- receiver
 
  sender -- intermediary -- receiver
 
  and
 
  sender -- intermediary -- {receiver_1 ... receiver_n}
 
  and even
 
  sender -- {arbitrary network of intermediaries} -- {receiver_1 ... 
  receiver_n}
 
  phew.
 
 
  So I will do it that way.
  That's right for throughput, but don't forget latency. A well behaved
  intermediary should have little effect on throughput but will inevitably
  add latency.
 
  Measuring latency between hosts is a pain. You can time-stamp messages
  at the origin host but clock differences can give you bogus numbers if
  you compare that to the time on a different host when the messages
  arrive. One trick is to have the messages arrive back at the same host
  where you time-stamped them (even if they pass through other hosts in
  between) but that isn't always what you really want to measure. Maybe
  there's something to be done with NTP, I've never dug into that. Have
  fun!
 
 To get a reasonably good estimate of the time difference between sender 
 and receiver, one could exchange several timestamped messages, w/o 
 intermediary, in both directions and get both sides to agree on the 
 difference between them. Do that before the test, and then repeat the 
 exchange at the end of the test to check for the drift. This of course 
 assumes stable network latencies during these exchanges and is usable 
 only in test environments. Exchanging several messages instead of just 
 one should help eliminate sporadic instabilities.
 

As I understand it that's pretty much what NTP does.
http://en.wikipedia.org/wiki/Network_Time_Protocol says that NTP can
achieve better than one millisecond accuracy in local area networks
under ideal conditions. That doesn't sound good enough to measure
sub-millisecond latencies. I doubt that a home-grown attempt at timing
message exchanges will do better than NTP :( NTP may deserve further
investigation, however; Wikipedia probably makes some very broad
assumptions about what your ideal network conditions are, and it's possible
that it can be tuned better than that.

I can easily get sub-millisecond round-trip latencies out of Qpid with a
short message burst:
qpidd --auth=no --tcp-nodelay
qpid-cpp-benchmark --connection-options '{tcp-nodelay:true}' -q1 -m100
send-tp recv-tp l-min   l-max   l-avg   total-tp
38816   30370   0.21    1.18    0.70    3943

Sadly if I tell qpidd to use AMQP 1.0 (and therefore proton), things
degenerate very badly from a latency perspective.
qpid-cpp-benchmark --connection-options '{tcp-nodelay:true,protocol:amqp1.0}' 
-q1 -m100
send-tp recv-tp l-min   l-max   l-avg   total-tp
26086   19552   3.13    6.65    5.28    913

However this may not be proton's fault; the problem could be in qpidd's
AMQP1.0 adapter layer. I'm glad to see that we're starting to measure
these things for proton and dispatch, that will surely lead to
improvement.



Re: Proton Performance Pictures (1 of 2)

2014-09-03 Thread Leon Mlakar

On 04/09/14 01:34, Alan Conway wrote:

On Thu, 2014-09-04 at 00:38 +0200, Leon Mlakar wrote:

On 04/09/14 00:25, Alan Conway wrote:

On Wed, 2014-09-03 at 12:16 -0400, Michael Goulish wrote:

OK -- I just had a quick talk with Ted, and this makes sense
to me now:

count *receives* per second.

I had it turned around and was worried about *sends* per second,
and then got confused by issues of fanout.

If you only count *receives* per second, and assume no discards,
it seems to me that you can indeed make a fair speed comparison
between

 sender -- receiver

 sender -- intermediary -- receiver

and

 sender -- intermediary -- {receiver_1 ... receiver_n}

and even

 sender -- {arbitrary network of intermediaries} -- {receiver_1 ... 
receiver_n}

phew.


So I will do it that way.

That's right for throughput, but don't forget latency. A well behaved
intermediary should have little effect on throughput but will inevitably
add latency.

Measuring latency between hosts is a pain. You can time-stamp messages
at the origin host but clock differences can give you bogus numbers if
you compare that to the time on a different host when the messages
arrive. One trick is to have the messages arrive back at the same host
where you time-stamped them (even if they pass through other hosts in
between) but that isn't always what you really want to measure. Maybe
there's something to be done with NTP, I've never dug into that. Have
fun!


To get a reasonably good estimate of the time difference between sender
and receiver, one could exchange several timestamped messages, w/o
intermediary, in both directions and get both sides to agree on the
difference between them. Do that before the test, and then repeat the
exchange at the end of the test to check for the drift. This of course
assumes stable network latencies during these exchanges and is usable
only in test environments. Exchanging several messages instead of just
one should help eliminate sporadic instabilities.


As I understand it that's pretty much what NTP does.
http://en.wikipedia.org/wiki/Network_Time_Protocol says that NTP can
achieve better than one millisecond accuracy in local area networks
under ideal conditions. That doesn't sound good enough to measure
sub-millisecond latencies. I doubt that a home-grown attempt at timing
message exchanges will do better than NTP :( NTP may deserve further
investigation, however; Wikipedia probably makes some very broad
assumptions about what your ideal network conditions are, and it's possible
that it can be tuned better than that.

I can easily get sub-millisecond round-trip latencies out of Qpid with a
short message burst:
qpidd --auth=no --tcp-nodelay
qpid-cpp-benchmark --connection-options '{tcp-nodelay:true}' -q1 -m100
send-tp recv-tp l-min   l-max   l-avg   total-tp
38816   30370   0.21    1.18    0.70    3943

Sadly if I tell qpidd to use AMQP 1.0 (and therefore proton), things
degenerate very badly from a latency perspective.
qpid-cpp-benchmark --connection-options '{tcp-nodelay:true,protocol:amqp1.0}' 
-q1 -m100
send-tp recv-tp l-min   l-max   l-avg   total-tp
26086   19552   3.13    6.65    5.28    913

However this may not be proton's fault; the problem could be in qpidd's
AMQP1.0 adapter layer. I'm glad to see that we're starting to measure
these things for proton and dispatch, that will surely lead to
improvement.

Yes, you are correct, that's basically what NTP does ... and neither 
will work well with sub-millisecond ranges. I didn't realize that this 
is what you were after.


There is a beast called 
http://en.wikipedia.org/wiki/Precision_Time_Protocol, though. A year ago 
we took a brief look into this but concluded that millisecond accuracy 
was good enough and that it was not worth the effort.


And of course, it is also possible to attach a GPS receiver to both the 
sending and receiving hosts. With decent drivers this should provide at 
least microsecond accuracy.


Leon