Re: Proton Performance Pictures (1 of 2)
On Thu, 2014-09-04 at 02:35 +0200, Leon Mlakar wrote:
On 04/09/14 01:34, Alan Conway wrote:
On Thu, 2014-09-04 at 00:38 +0200, Leon Mlakar wrote:
On 04/09/14 00:25, Alan Conway wrote:
On Wed, 2014-09-03 at 12:16 -0400, Michael Goulish wrote:

OK -- I just had a quick talk with Ted, and this makes sense to me now: count *receives* per second. I had it turned around and was worried about *sends* per second, and then got confused by issues of fanout. If you only count *receives* per second, and assume no discards, it seems to me that you can indeed make a fair speed comparison between

    sender -- receiver
    sender -- intermediary -- receiver

and

    sender -- intermediary -- {receiver_1 ... receiver_n}

and even

    sender -- {arbitrary network of intermediaries} -- {receiver_1 ... receiver_n}

phew. So I will do it that way.

That's right for throughput, but don't forget latency. A well-behaved intermediary should have little effect on throughput but will inevitably add latency.

Measuring latency between hosts is a pain. You can time-stamp messages at the origin host but clock differences can give you bogus numbers if you compare that to the time on a different host when the messages arrive. One trick is to have the messages arrive back at the same host where you time-stamped them (even if they pass thru other hosts in between) but that isn't always what you really want to measure. Maybe there's something to be done with NTP, I've never dug into that. Have fun!

To get a reasonably good estimate of the time difference between sender and receiver, one could exchange several timestamped messages, w/o intermediary, in both directions and get both sides to agree on the difference between them. Do that before the test, and then repeat the exchange at the end of the test to check for the drift. This of course assumes stable network latencies during these exchanges and is usable only in test environments. Exchanging several messages instead of just one should help eliminate sporadic instabilities.

As I understand it that's pretty much what NTP does. http://en.wikipedia.org/wiki/Network_Time_Protocol says that NTP can achieve better than one millisecond accuracy in local area networks under ideal conditions. That doesn't sound good enough to measure sub-millisecond latencies. I doubt that a home-grown attempt at timing message exchanges will do better than NTP :( NTP may deserve further investigation however; Wikipedia probably makes some very broad assumptions about what your ideal network conditions are, it's possible that it can be tuned better than that.

I can easily get sub-millisecond round-trip latencies out of Qpid with a short message burst:

    qpidd --auth=no --tcp-nodelay
    qpid-cpp-benchmark --connection-options '{tcp-nodelay:true}' -q1 -m100
    send-tp  recv-tp  l-min  l-max  l-avg  total-tp
    38816    30370    0.21   1.18   0.70   3943

Sadly if I tell qpidd to use AMQP 1.0 (and therefore proton), things degenerate very badly from a latency perspective.

    qpid-cpp-benchmark --connection-options '{tcp-nodelay:true,protocol:amqp1.0}' -q1 -m100
    send-tp  recv-tp  l-min  l-max  l-avg  total-tp
    26086    19552    3.13   6.65   5.28   913

However this may not be proton's fault; the problem could be in qpidd's AMQP 1.0 adapter layer. I'm glad to see that we're starting to measure these things for proton and dispatch, that will surely lead to improvement.

Yes, you are correct, that's basically what NTP does ... and neither will work well with sub-millisecond ranges. I didn't realize that this is what you are after.

It depends a lot on what kind of system you are measuring, but fine-tuning the qpid/proton/dispatch tool-set involves some (hopefully!) pretty low latencies. Even in relatively complicated cases (my past pain is clustered qpidd for low-latency applications) you may be measuring 3-4 ms, but still 1 ms error bars are too big.

There is a beast called http://en.wikipedia.org/wiki/Precision_Time_Protocol, though. A year ago we took a brief look into this but concluded that millisecond accuracy was good enough and that it was not worth the effort.

Thanks, that's interesting!

And of course, it is also possible to attach a GPS receiver to both the sending and receiving hosts. With decent drivers this should provide at least microsecond accuracy.

That is interesting also! But complicated. This is why I end up just sending the message back to the host of origin and dividing by 2, even if it's not really the right answer. I usually only care if it's better or worse than the previous build anyway :)

Cheers,
Alan.
Proton Performance Pictures (1 of 2)
[ resend : I am attaching only 1 image here, so hopefully the apache mail gadget will not become upset. Next one in next email. ]

Attached, please find two cool pictures of the valgrind/callgrind data I got with a test run of the psend and precv clients I mentioned before. ( Sorry, I keep saying 'clients'. These are pure Peer-to-Peer. ) ( Hey -- if we ever sell this technology to maritime transport companies, could we call it Pier-to-Pier ? )

This was from a run of 100,000 messages, using credit strategy of 200, 100, 100. i.e. start at 200, every time you get down to 100, add 100. That point is where I seem to find the best performance on my system: 123,500 messages per second received. ( i.e. 247,000 transfers per second ) using about 180% CPU ( i.e. 90% each of 2 processors. )

By the way, I actually got repeatably better performance (maybe 1.5% better (which resulted in the 123,500 number)) by using processors 1 and 3 on my laptop, rather than 1 and 2. Looking at /proc/cpuinfo, I see that processors 1 and 3 have different core IDs. OK, whatever. ( And it's an Intel system... )

I think there are no shockers here:

    psend uses its time in pn_post_transfer_frame (44%)
    precv uses its time in pn_dispatch_frame (67%)

The code is at https://github.com/mick-goulish/proton_c_clients.git
I will put all this performance info in there too, shortly.
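For anyone wanting to experiment with the same credit strategy, here is a minimal sketch of the "start at 200, top up by 100 whenever credit falls to 100" scheme against the proton-c engine API. The constants and helper names below are illustrative, not taken from the psend/precv sources:

    /* Sketch of the credit strategy described above: grant 200 credits up
       front, then add 100 more each time outstanding credit drops to 100.
       Constant and function names are illustrative, not from psend/precv. */
    #include <proton/link.h>

    #define CREDIT_INITIAL 200
    #define CREDIT_LOW     100
    #define CREDIT_BATCH   100

    static void grant_initial_credit(pn_link_t *receiver)
    {
        pn_link_flow(receiver, CREDIT_INITIAL);
    }

    /* Call after each delivery has been processed. */
    static void maybe_replenish(pn_link_t *receiver)
    {
        if (pn_link_credit(receiver) <= CREDIT_LOW)
            pn_link_flow(receiver, CREDIT_BATCH);
    }

Hunting for the sweet spot is then just a matter of varying the three constants between runs.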
Re: Proton Performance Pictures (1 of 2)
On 09/03/2014 08:51 AM, Michael Goulish wrote: That point is where I seem to find the best performance on my system: 123,500 messages per second received. ( i.e. 247,000 transfers per second ) using about 180% CPU ( i.e. 90% each of 2 processors. ) If you are sending direct between the sender and receiver process (i.e. no intermediary process), then why are you doubling the number of messages sent to get 'transfers per second'? One transfer is the sending of a message from one process to another, which in this case is the same as messages sent or received.
Re: Proton Performance Pictures (1 of 2)
- Original Message -
On 09/03/2014 08:51 AM, Michael Goulish wrote:

That point is where I seem to find the best performance on my system: 123,500 messages per second received. ( i.e. 247,000 transfers per second ) using about 180% CPU ( i.e. 90% each of 2 processors. )

If you are sending direct between the sender and receiver process (i.e. no intermediary process), then why are you doubling the number of messages sent to get 'transfers per second'? One transfer is the sending of a message from one process to another, which in this case is the same as messages sent or received.

Yes, this is interesting. I need a way to make a fair comparison between something like this setup (simple peer-to-peer) and the Dispatch Router numbers I was getting earlier. For the router, the analogous topology is

    writer -- router -- reader

in which case I counted each message twice. But it does not seem right to count a single message in writer -- router -- reader as 2 transfers, while counting a single message in writer -- reader as only 1 transfer. Because -- from the application point of view, those two topologies are doing the same work.

Also I think that I *need* to count writer -- router -- reader as 2, because in *this* case:

    writer -- router -- reader_1
                    \-- reader_2

...I need to count that as 3. Thoughts?
Re: Proton Performance Pictures (1 of 2)
On 09/03/2014 11:35 AM, Michael Goulish wrote:
- Original Message -
On 09/03/2014 08:51 AM, Michael Goulish wrote:

That point is where I seem to find the best performance on my system: 123,500 messages per second received. ( i.e. 247,000 transfers per second ) using about 180% CPU ( i.e. 90% each of 2 processors. )

If you are sending direct between the sender and receiver process (i.e. no intermediary process), then why are you doubling the number of messages sent to get 'transfers per second'? One transfer is the sending of a message from one process to another, which in this case is the same as messages sent or received.

Yes, this is interesting. I need a way to make a fair comparison between something like this setup (simple peer-to-peer) and the Dispatch Router numbers I was getting earlier. For the router, the analogous topology is

    writer -- router -- reader

in which case I counted each message twice. But it does not seem right to count a single message in writer -- router -- reader as 2 transfers, while counting a single message in writer -- reader as only 1 transfer. Because -- from the application point of view, those two topologies are doing the same work.

You should probably be using throughput and not transfers in this case.

Also I think that I *need* to count writer -- router -- reader as 2, because in *this* case:

    writer -- router -- reader_1
                    \-- reader_2

...I need to count that as 3. Thoughts?
Re: Proton Performance Pictures (1 of 2)
OK -- I just had a quick talk with Ted, and this makes sense to me now: count *receives* per second. I had it turned around and was worried about *sends* per second, and then got confused by issues of fanout. If you only count *receives* per second, and assume no discards, it seems to me that you can indeed make a fair speed comparison between

    sender -- receiver
    sender -- intermediary -- receiver

and

    sender -- intermediary -- {receiver_1 ... receiver_n}

and even

    sender -- {arbitrary network of intermediaries} -- {receiver_1 ... receiver_n}

phew. So I will do it that way.

This is from the application perspective, asking how fast your messaging system is. It doesn't care about how fancy the intermediation is, it only cares about results. This seems like the right way to judge that.

- Original Message -
On 09/03/2014 11:35 AM, Michael Goulish wrote:
- Original Message -
On 09/03/2014 08:51 AM, Michael Goulish wrote:

That point is where I seem to find the best performance on my system: 123,500 messages per second received. ( i.e. 247,000 transfers per second ) using about 180% CPU ( i.e. 90% each of 2 processors. )

If you are sending direct between the sender and receiver process (i.e. no intermediary process), then why are you doubling the number of messages sent to get 'transfers per second'? One transfer is the sending of a message from one process to another, which in this case is the same as messages sent or received.

Yes, this is interesting. I need a way to make a fair comparison between something like this setup (simple peer-to-peer) and the Dispatch Router numbers I was getting earlier. For the router, the analogous topology is

    writer -- router -- reader

in which case I counted each message twice. But it does not seem right to count a single message in writer -- router -- reader as 2 transfers, while counting a single message in writer -- reader as only 1 transfer. Because -- from the application point of view, those two topologies are doing the same work.

You should probably be using throughput and not transfers in this case.

Also I think that I *need* to count writer -- router -- reader as 2, because in *this* case:

    writer -- router -- reader_1
                    \-- reader_2

...I need to count that as 3. Thoughts?
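As a concrete illustration of the metric agreed on above (count receives per second, whatever the topology), here is a trivial sketch; the figures in main() are made up for illustration only:

    /* "Receives per second" as the topology-neutral throughput metric.
       The numbers below are invented examples, not measured results. */
    #include <stdio.h>

    static double receives_per_sec(unsigned long total_receives, double elapsed_sec)
    {
        return (double)total_receives / elapsed_sec;
    }

    int main(void)
    {
        /* writer -- reader: 100,000 sends -> 100,000 receives */
        printf("p2p:    %.0f recv/s\n", receives_per_sec(100000UL, 1.0));

        /* writer -- router -- {reader_1, reader_2}: the same 100,000 sends
           fan out to 200,000 receives, so fanout is credited automatically */
        printf("fanout: %.0f recv/s\n", receives_per_sec(200000UL, 1.0));
        return 0;
    }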
Re: Proton Performance Pictures (1 of 2)
On Wed, 2014-09-03 at 12:16 -0400, Michael Goulish wrote:

OK -- I just had a quick talk with Ted, and this makes sense to me now: count *receives* per second. I had it turned around and was worried about *sends* per second, and then got confused by issues of fanout. If you only count *receives* per second, and assume no discards, it seems to me that you can indeed make a fair speed comparison between

    sender -- receiver
    sender -- intermediary -- receiver

and

    sender -- intermediary -- {receiver_1 ... receiver_n}

and even

    sender -- {arbitrary network of intermediaries} -- {receiver_1 ... receiver_n}

phew. So I will do it that way.

That's right for throughput, but don't forget latency. A well-behaved intermediary should have little effect on throughput but will inevitably add latency.

Measuring latency between hosts is a pain. You can time-stamp messages at the origin host but clock differences can give you bogus numbers if you compare that to the time on a different host when the messages arrive. One trick is to have the messages arrive back at the same host where you time-stamped them (even if they pass thru other hosts in between) but that isn't always what you really want to measure. Maybe there's something to be done with NTP, I've never dug into that. Have fun!
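A minimal sketch of the round-trip trick Alan describes (stamp the message on the origin host, let it come back, divide by two), assuming a Linux-style monotonic clock; the helper names are invented for illustration:

    /* Stamp outgoing messages on the origin host and, when a message finds
       its way back to that host, estimate one-way latency as half the round
       trip. Only one clock is involved, so clock skew does not matter. */
    #include <stdint.h>
    #include <time.h>

    static uint64_t now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
    }

    /* On send: embed now_ns() in the message body or an annotation.
       When the message returns to the origin host: */
    static double one_way_latency_ms(uint64_t sent_ns)
    {
        return (double)(now_ns() - sent_ns) / 2.0 / 1.0e6;
    }

As noted later in the thread, halving the round trip is not strictly the right answer, but it is at least consistent from build to build.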
Re: Proton Performance Pictures (1 of 2)
On 04/09/14 00:25, Alan Conway wrote:
On Wed, 2014-09-03 at 12:16 -0400, Michael Goulish wrote:

OK -- I just had a quick talk with Ted, and this makes sense to me now: count *receives* per second. I had it turned around and was worried about *sends* per second, and then got confused by issues of fanout. If you only count *receives* per second, and assume no discards, it seems to me that you can indeed make a fair speed comparison between

    sender -- receiver
    sender -- intermediary -- receiver

and

    sender -- intermediary -- {receiver_1 ... receiver_n}

and even

    sender -- {arbitrary network of intermediaries} -- {receiver_1 ... receiver_n}

phew. So I will do it that way.

That's right for throughput, but don't forget latency. A well-behaved intermediary should have little effect on throughput but will inevitably add latency.

Measuring latency between hosts is a pain. You can time-stamp messages at the origin host but clock differences can give you bogus numbers if you compare that to the time on a different host when the messages arrive. One trick is to have the messages arrive back at the same host where you time-stamped them (even if they pass thru other hosts in between) but that isn't always what you really want to measure. Maybe there's something to be done with NTP, I've never dug into that. Have fun!

To get a reasonably good estimate of the time difference between sender and receiver, one could exchange several timestamped messages, w/o intermediary, in both directions and get both sides to agree on the difference between them. Do that before the test, and then repeat the exchange at the end of the test to check for the drift. This of course assumes stable network latencies during these exchanges and is usable only in test environments. Exchanging several messages instead of just one should help eliminate sporadic instabilities.

Leon
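What Leon sketches here is essentially the classic NTP offset calculation. A rough illustration, with t0/t3 taken on the sender's clock and t1/t2 on the receiver's (the function names are invented):

    /* Clock-offset estimate from one timestamped round trip, NTP-style:
         t0 = sender send time       (sender's clock)
         t1 = receiver receive time  (receiver's clock)
         t2 = receiver reply time    (receiver's clock)
         t3 = sender receive time    (sender's clock)
       Repeating the exchange and averaging smooths out sporadic jitter. */
    static double clock_offset(double t0, double t1, double t2, double t3)
    {
        return ((t1 - t0) + (t2 - t3)) / 2.0;   /* receiver minus sender */
    }

    static double round_trip_delay(double t0, double t1, double t2, double t3)
    {
        return (t3 - t0) - (t2 - t1);           /* time spent on the network */
    }

As Alan points out in his reply, even this only gets to roughly NTP-level accuracy, so it does not help much below a millisecond.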
Re: Proton Performance Pictures (1 of 2)
On Thu, 2014-09-04 at 00:38 +0200, Leon Mlakar wrote:
On 04/09/14 00:25, Alan Conway wrote:
On Wed, 2014-09-03 at 12:16 -0400, Michael Goulish wrote:

OK -- I just had a quick talk with Ted, and this makes sense to me now: count *receives* per second. I had it turned around and was worried about *sends* per second, and then got confused by issues of fanout. If you only count *receives* per second, and assume no discards, it seems to me that you can indeed make a fair speed comparison between

    sender -- receiver
    sender -- intermediary -- receiver

and

    sender -- intermediary -- {receiver_1 ... receiver_n}

and even

    sender -- {arbitrary network of intermediaries} -- {receiver_1 ... receiver_n}

phew. So I will do it that way.

That's right for throughput, but don't forget latency. A well-behaved intermediary should have little effect on throughput but will inevitably add latency.

Measuring latency between hosts is a pain. You can time-stamp messages at the origin host but clock differences can give you bogus numbers if you compare that to the time on a different host when the messages arrive. One trick is to have the messages arrive back at the same host where you time-stamped them (even if they pass thru other hosts in between) but that isn't always what you really want to measure. Maybe there's something to be done with NTP, I've never dug into that. Have fun!

To get a reasonably good estimate of the time difference between sender and receiver, one could exchange several timestamped messages, w/o intermediary, in both directions and get both sides to agree on the difference between them. Do that before the test, and then repeat the exchange at the end of the test to check for the drift. This of course assumes stable network latencies during these exchanges and is usable only in test environments. Exchanging several messages instead of just one should help eliminate sporadic instabilities.

As I understand it that's pretty much what NTP does. http://en.wikipedia.org/wiki/Network_Time_Protocol says that NTP can achieve better than one millisecond accuracy in local area networks under ideal conditions. That doesn't sound good enough to measure sub-millisecond latencies. I doubt that a home-grown attempt at timing message exchanges will do better than NTP :( NTP may deserve further investigation however; Wikipedia probably makes some very broad assumptions about what your ideal network conditions are, it's possible that it can be tuned better than that.

I can easily get sub-millisecond round-trip latencies out of Qpid with a short message burst:

    qpidd --auth=no --tcp-nodelay
    qpid-cpp-benchmark --connection-options '{tcp-nodelay:true}' -q1 -m100
    send-tp  recv-tp  l-min  l-max  l-avg  total-tp
    38816    30370    0.21   1.18   0.70   3943

Sadly if I tell qpidd to use AMQP 1.0 (and therefore proton), things degenerate very badly from a latency perspective.

    qpid-cpp-benchmark --connection-options '{tcp-nodelay:true,protocol:amqp1.0}' -q1 -m100
    send-tp  recv-tp  l-min  l-max  l-avg  total-tp
    26086    19552    3.13   6.65   5.28   913

However this may not be proton's fault; the problem could be in qpidd's AMQP 1.0 adapter layer. I'm glad to see that we're starting to measure these things for proton and dispatch, that will surely lead to improvement.
Re: Proton Performance Pictures (1 of 2)
On 04/09/14 01:34, Alan Conway wrote:
On Thu, 2014-09-04 at 00:38 +0200, Leon Mlakar wrote:
On 04/09/14 00:25, Alan Conway wrote:
On Wed, 2014-09-03 at 12:16 -0400, Michael Goulish wrote:

OK -- I just had a quick talk with Ted, and this makes sense to me now: count *receives* per second. I had it turned around and was worried about *sends* per second, and then got confused by issues of fanout. If you only count *receives* per second, and assume no discards, it seems to me that you can indeed make a fair speed comparison between

    sender -- receiver
    sender -- intermediary -- receiver

and

    sender -- intermediary -- {receiver_1 ... receiver_n}

and even

    sender -- {arbitrary network of intermediaries} -- {receiver_1 ... receiver_n}

phew. So I will do it that way.

That's right for throughput, but don't forget latency. A well-behaved intermediary should have little effect on throughput but will inevitably add latency.

Measuring latency between hosts is a pain. You can time-stamp messages at the origin host but clock differences can give you bogus numbers if you compare that to the time on a different host when the messages arrive. One trick is to have the messages arrive back at the same host where you time-stamped them (even if they pass thru other hosts in between) but that isn't always what you really want to measure. Maybe there's something to be done with NTP, I've never dug into that. Have fun!

To get a reasonably good estimate of the time difference between sender and receiver, one could exchange several timestamped messages, w/o intermediary, in both directions and get both sides to agree on the difference between them. Do that before the test, and then repeat the exchange at the end of the test to check for the drift. This of course assumes stable network latencies during these exchanges and is usable only in test environments. Exchanging several messages instead of just one should help eliminate sporadic instabilities.

As I understand it that's pretty much what NTP does. http://en.wikipedia.org/wiki/Network_Time_Protocol says that NTP can achieve better than one millisecond accuracy in local area networks under ideal conditions. That doesn't sound good enough to measure sub-millisecond latencies. I doubt that a home-grown attempt at timing message exchanges will do better than NTP :( NTP may deserve further investigation however; Wikipedia probably makes some very broad assumptions about what your ideal network conditions are, it's possible that it can be tuned better than that.

I can easily get sub-millisecond round-trip latencies out of Qpid with a short message burst:

    qpidd --auth=no --tcp-nodelay
    qpid-cpp-benchmark --connection-options '{tcp-nodelay:true}' -q1 -m100
    send-tp  recv-tp  l-min  l-max  l-avg  total-tp
    38816    30370    0.21   1.18   0.70   3943

Sadly if I tell qpidd to use AMQP 1.0 (and therefore proton), things degenerate very badly from a latency perspective.

    qpid-cpp-benchmark --connection-options '{tcp-nodelay:true,protocol:amqp1.0}' -q1 -m100
    send-tp  recv-tp  l-min  l-max  l-avg  total-tp
    26086    19552    3.13   6.65   5.28   913

However this may not be proton's fault; the problem could be in qpidd's AMQP 1.0 adapter layer. I'm glad to see that we're starting to measure these things for proton and dispatch, that will surely lead to improvement.

Yes, you are correct, that's basically what NTP does ... and neither will work well with sub-millisecond ranges. I didn't realize that this is what you are after.

There is a beast called http://en.wikipedia.org/wiki/Precision_Time_Protocol, though.
A year ago we took a brief look into this but concluded that millisecond accuracy was good enough and that it was not worth the effort.

And of course, it is also possible to attach a GPS receiver to both the sending and receiving hosts. With decent drivers this should provide at least microsecond accuracy.

Leon