Hi,

I added some more logging to check whether the consumer is being slowed down. In stream_engine::out_event, I check whether the call to tcp_write would block and count the blocking calls. To make sure that I don't miss any blocked send calls, I also added output to tcp.cpp. In addition, I increased in_batch_size and out_batch_size to reduce the number of syscalls. With that, I ran my test setup and observed the problem once more.
The good connection consistently sends ~100 Mbps:

Mon Jul 11 09:02:28 2016 1468227748668193 147 Sent 12769753 of 12769753 bytes in 1.00001s (1000010us) with 303 calls 0 blocked
Mon Jul 11 09:02:29 2016 1468227749668202 147 Sent 12700046 of 12700046 bytes in 1.00001s (1000009us) with 295 calls 0 blocked
Mon Jul 11 09:02:30 2016 1468227750668249 147 Sent 12792417 of 12792417 bytes in 1.00005s (1000047us) with 295 calls 0 blocked
Mon Jul 11 09:02:31 2016 1468227751668461 147 Sent 12787390 of 12787390 bytes in 1.00021s (1000212us) with 345 calls 0 blocked
Mon Jul 11 09:02:32 2016 1468227752668635 147 Sent 12808874 of 12808874 bytes in 1.00017s (1000174us) with 308 calls 0 blocked
Mon Jul 11 09:02:33 2016 1468227753668824 147 Sent 12832523 of 12832523 bytes in 1.00019s (1000189us) with 274 calls 0 blocked
Mon Jul 11 09:02:34 2016 1468227754668855 147 Sent 12861635 of 12861635 bytes in 1.00003s (1000031us) with 234 calls 0 blocked

The second connection on the same PUB socket slows down after a couple of minutes:

Mon Jul 11 09:02:45 2016 1468227765157113 159 Sent 9568256 of 9568256 bytes in 1.01988s (1019875us) with 146 calls 0 blocked
Mon Jul 11 09:02:46 2016 1468227766180625 159 Sent 9175040 of 9175040 bytes in 1.02351s (1023512us) with 140 calls 0 blocked
Mon Jul 11 09:02:47 2016 1468227767181102 159 Sent 8060928 of 8060928 bytes in 1.00048s (1000477us) with 123 calls 0 blocked
Mon Jul 11 09:02:48 2016 1468227768200968 159 Sent 7471104 of 7471104 bytes in 1.01987s (1019866us) with 114 calls 0 blocked
Mon Jul 11 09:02:49 2016 1468227769226228 159 Sent 7864320 of 7864320 bytes in 1.02526s (1025260us) with 120 calls 0 blocked
Mon Jul 11 09:02:50 2016 1468227770226817 159 Sent 8323072 of 8323072 bytes in 1.00059s (1000589us) with 127 calls 0 blocked
Mon Jul 11 09:02:51 2016 1468227771246216 159 Sent 7143424 of 7143424 bytes in 1.0194s (1019399us) with 109 calls 0 blocked
Mon Jul 11 09:02:52 2016 1468227772270349 159 Sent 6553600 of 6553600 bytes in 1.02413s (1024133us) with 100 calls 0 blocked

and then catches up after some time. Since there are no blocked send calls, I am quite sure that this is not a slow-consumer problem. It looks like a scheduling problem in ZeroMQ (or maybe in the OS scheduler?).

Best wishes,
Jens

--
Jens Auer | CGI | Software-Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
[email protected]

________________________________________
From: zeromq-dev [[email protected]] on behalf of Jens Auer [[email protected]]
Sent: Sunday, 10 July 2016 10:59
To: 'ZeroMQ development list'
Subject: [zeromq-dev] Pub socket slows down for one client

Hi,

I am having a problem in my system that looks like a slow consumer, but I don't think it is. All my processes are running at ~20% CPU at most, and there is plenty of margin on the system and on the network. However, I see that sometimes one of the PUB connections slows down for a rather long time (on the order of 5-40s), then catches up and runs for hours without any further problems. I added some logging to stream_engine::in_event and out_event, but I am not sure how to interpret the observations.

I analyzed one case of slowdown. Here, I have one process receiving a constant stream of ~32000000 bytes on a ZMQ_STREAM raw tcp socket.
The process basically splits this into individual streams using an id in the data and then publishes them with the id as the subscription topic. Two subscribers are connected to this PUB socket. Sometimes, one of these SUBs gets slowed down.

I have logging at the application<->ZeroMQ boundaries to count the number of messages received and sent, plus logging at the ZeroMQ<->system boundary to count the number of bytes received/sent and the number of syscalls. I see the following pattern appearing in the logs:

- No data is lost. Everything that is sent is received.
- The number of messages received and sent in the application is as expected.
- One of the connections on the PUB socket does not send the same data as the other one. This looks like a slow receiver, but the receiver is running at 20% CPU, the system is far from loaded, and the network is a 10 Gbit network with all hosts connected to a single switch. TCP windows are sized to a 6MB default value for an expected latency of 160ms.
- Everything runs fine until I start to simulate latency with netem.

Looking at the log files, I see the expected output for the fast output stream. The log prints a time stamp, the number of bytes sent since the last log output, the time between outputs, the number of frames (each 1142 bytes), and the number of calls to out_event resp. send.
Time              Bytes     Delta time  Frames  Calls to send
1467902214174220  12775617  1.00005s    1645    11187
1467902215174480  12724076  1.00026s    1615    11142
1467902216174610  12913520  1.00012s    1657    11308
1467902217174610  12913979  1.00001s    1661    11308
1467902218174670  12619185  1.00005s    1636    11050
1467902219174780  12699323  1.00011s    1608    11120
1467902220174780  12941494  1s          1666    11332

The problematic stream slows down, then catches up, and after sending the buffered frames everything is fine:

1467902253096660  7487488   1.00045s    914     6556
1467902254113250  7929856   1.01659s    968     6944
1467902255137710  7675904   1.02446s    937     6721
1467902256137830  7979008   1.00013s    974     6987
1467902257154080  8888320   1.01624s    1085    7783
1467902258180360  8863744   1.02628s    1082    7762
1467902259181080  9691136   1.00072s    1183    8486
1467902260197550  11091968  1.01648s    1354    9713
1467902261222090  11558912  1.02454s    1411    10122
1467902262222300  12984320  1.00021s    1585    11370
1467902263240030  15409152  1.01773s    1881    13493
1467902264263870  16228352  1.02384s    1981    14210
1467902265264570  18800640  1.0007s     2295    16463
1467902266281750  22364160  1.01719s    2730    19583
1467902267305500  24100864  1.02375s    2942    21104
1467902268306090  29376512  1.00059s    3586    25724
1467902269324240  35127296  1.01815s    4288    30759
1467902270347700  35618816  1.02347s    4348    31190
1467902271347770  39297024  1.00006s    4797    34411
1467902272482910  44589056  1.13515s    5443    39045
1467902273487890  42328064  1.00498s    5167    37065
1467902274510060  35741696  1.02217s    4363    31297
1467902275510560  38371328  1.0005s     4684    33600
1467902276530410  43376640  1.01985s    5295    37983
1467902277594050  35787401  1.06365s    4382    31337
1467902278594850  12757268  1.00079s    1582    11171
1467902279595210  12757337  1.00037s    1577    11171

What puzzles me is that the number of calls to out_event resp. send goes down for a considerable time, and then catches up. Does ZeroMQ dynamically schedule attempts to send data? If so, what is the algorithm here? Can it be influenced?
One thing I forgot to mention is that this only happens when I use netem to simulate the expected latency of 80ms (RTT = 160ms). The TCP settings are tuned, and I consistently achieve the expected performance with iperf etc. I have the feeling that there is some interaction between the TCP layer and ZeroMQ that I am missing.

Best wishes,
Jens

_______________________________________________
zeromq-dev mailing list
[email protected]
http://lists.zeromq.org/mailman/listinfo/zeromq-dev
