Hi all,

I have come across what I believe is a nasty bug.

I run libzmq 4.1.6 and pyzmq 16.0.2. This happens on both CentOS 6 and CentOS 7.

The application is a Celery worker that runs 16 worker threads. Each worker thread instantiates a 0MQ-based client, gets data and then closes this client. The 0MQ-based client creates its own 0MQ context and terminates it on exit. Nothing is shared between the threads or clients; every client processes only one request and is then fully terminated.

The client itself is a REQ socket which uses CURVE authentication to authenticate with a ROUTER socket on the server side. The REQ socket has linger=0. Almost always, the REQ socket issues a request, gets back a response, closes the socket and destroys its context, and all is good. Once every one or two days, though, the REQ socket times out while waiting for the response from the ROUTER server. It then closes the socket successfully but hangs indefinitely when it goes on to destroy the context.
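For reference, the per-request lifecycle described above looks roughly like the sketch below. This is not the actual application code: the function name request_once, the endpoint and the key arguments are placeholders, and the CURVE options are only applied when keys are passed in.

```python
import zmq


def request_once(endpoint, payload, timeout_ms=20000,
                 server_key=None, public_key=None, secret_key=None):
    """Send one request on a private context; return the reply or None on timeout."""
    ctx = zmq.Context()                  # one context per client, nothing shared
    sock = ctx.socket(zmq.REQ)
    sock.setsockopt(zmq.LINGER, 0)       # discard unsent messages on close
    if server_key is not None:
        # CURVE client side: the server's public key plus our own key pair.
        # These must be set before connect().
        sock.setsockopt(zmq.CURVE_SERVERKEY, server_key)
        sock.setsockopt(zmq.CURVE_PUBLICKEY, public_key)
        sock.setsockopt(zmq.CURVE_SECRETKEY, secret_key)
    try:
        sock.connect(endpoint)
        sock.send(payload)
        # Wait up to timeout_ms for the reply instead of blocking in recv().
        if sock.poll(timeout_ms, zmq.POLLIN):
            return sock.recv()
        return None
    finally:
        sock.close()
        ctx.term()                       # the call that occasionally never returns
```

In the failing case the function would reach the finally block after the poll times out, sock.close() would return, and ctx.term() would block.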

This runs in a data center on a 1Gb/s LAN, so responses usually finish in under a second; the timeout is 20s. My theory is that the socket gets into a weird state, and that is why it times out and then blocks the context termination.

I ran tcpdump and it turns out that the REQ client successfully authenticates with the ROUTER server but then goes completely silent for those 20-odd seconds.

Here is a tcpdump capture of a stuck REQ client - https://pastebin.com/HxWAp6SQ. Here is a tcpdump capture of a normal communication - https://pastebin.com/qCi1jTp0. This is a full backtrace (after SIGABRT signal to the stuck application) - https://pastebin.com/jHdZS4VU

Here are the limits for the process:

[root@auhwbesap001 tomask]# cat /proc/311/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             31141                31141                processes
Max open files            8196                 8196                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       31141                31141                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

The application doesn't seem to exceed any of the limits; it usually hovers between 100 and 200 open file handles.
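In case it helps anyone reproduce the check, the open-handle count can be compared against the limit from inside the process along these lines. This is only a sketch that assumes Linux (it reads /proc/self/fd), and the helper name open_fd_usage is made up for illustration.

```python
import os
import resource


def open_fd_usage():
    """Return (open descriptors, soft limit, hard limit) for the current process."""
    # Each entry in /proc/self/fd is one open file descriptor (Linux only).
    n_open = len(os.listdir("/proc/self/fd"))
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return n_open, soft, hard
```

Logging this from a worker thread just before the context is destroyed would show whether the process is anywhere near the "Max open files" value at the moment of a hang.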

I tried swapping the REQ socket for a DEALER socket but that didn't help; the context eventually hung as well.

I also tried setting ZMQ_BLOCKY to 0 and/or ZMQ_HANDSHAKE_IVL to 100ms, but the context still eventually hung.
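For clarity, this is roughly how those two options are applied through pyzmq. It is a sketch rather than the exact application code: the helper name make_client_socket is made up, and whether zmq.BLOCKY and zmq.HANDSHAKE_IVL are honoured depends on the libzmq build that pyzmq is linked against.

```python
import zmq


def make_client_socket():
    """Create a private context and REQ socket with the options mentioned above."""
    ctx = zmq.Context()
    # ZMQ_BLOCKY is a context option; set it before creating any sockets.
    ctx.set(zmq.BLOCKY, 0)
    sock = ctx.socket(zmq.REQ)
    # ZMQ_HANDSHAKE_IVL is a socket option, in milliseconds.
    sock.setsockopt(zmq.HANDSHAKE_IVL, 100)
    sock.setsockopt(zmq.LINGER, 0)
    return ctx, sock
```

The caller is still expected to call sock.close() and then ctx.term() when done, which is where the hang shows up.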

I looked into the C++ code of libzmq but would need some guidance to troubleshoot this, as I am primarily a Python programmer.

I think we had a similar issue back in 2014 - https://lists.zeromq.org/pipermail/zeromq-dev/2014-September/026752.html. From memory, the tcpdump capture then also showed the client/REQ going silent after the successful initial CURVE authentication, but at that time the server/ROUTER application was crashing with an assertion.

I am happy to do any more debugging.

Thanks in advance for any help/pointers.
--
<http://www.repositpower.com/>

*Tomas Krajca *
Software architect
m.  02 6162 0277
e.   to...@repositpower.com
<https://twitter.com/RepositPower>
<https://www.facebook.com/Reposit-Power-1423585874607903/>
<https://www.linkedin.com/company/reposit-power>
