Hi all, I'm using ZMQ 4.2.3 in a sort of large project. This project has automatic integration tests running every night that start/stop processes, send debug commands to a daemon processing data. I have built a debug interface to some of the C++ processes I have using ZMQ REQ/REP pattern. Through this interface I just send text commands like "show statistics" to retrieve some debugging counters from the live, running daemon I want to debug. To send the commands I built a small utility (named "debug shell") that opens a REQ socket, sends the text message, waits for the reply and prints the received output on stdout. Note that both REQ/REP sockets use 127.0.0.1 and TCP transport.
Sporadically the integration tests fail because the daemon stops replying to the debugging text commands on its REP socket. When this happens that daemon enters into some weird state where until I restart it, it will NEVER reply anymore on its REP socket. So the bug happens sporadically but when it manifests, it remains till the daemon is restarted. So far I could do some basic debugging (against a daemon built in release mode unfortunately!!) and I found that: 1) adding a ZMQ monitor socket on the REQ socket inside the "debug shell" I discovered that I get ZMQ_EVENT_CONNECTED notification on the REQ socket: so IMHO the ZMQ background threads have completed their TCP handshake. 2) I run a "strace -p" on the thread of the daemon that does zmq_poll() (to wait for ZMQ_POLLIN events) on the REP socket. When the debug shell is down I see a regular pattern of: poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout) poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout) poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout) poll([{fd=35, events=POLLIN}], 1, 250) = 0 (Timeout) ... indeed that thread does the zmq_poll() with a timeout of 250ms, finds no message and retunr.s I found that when I start the debug shell the following system calls happen: ... poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout) poll([{fd=35, events=POLLIN}], 1, 250) = 1 ([{fd=35, revents=POLLIN}]) poll([{fd=35, events=POLLIN}], 1, 0) = 1 ([{fd=35, revents=POLLIN}]) read(35, "\1\0\0\0\0\0\0\0", 8) = 8 poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout) poll([{fd=35, events=POLLIN}], 1, 162) = 1 ([{fd=35, revents=POLLIN}]) poll([{fd=35, events=POLLIN}], 1, 0) = 1 ([{fd=35, revents=POLLIN}]) read(35, "\1\0\0\0\0\0\0\0", 8) = 8 write(11, "\1\0\0\0\0\0\0\0", 8) = 8 poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout) poll([{fd=35, events=POLLIN}], 1, 161) = 0 (Timeout) poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout) ... I believe this is the ZMQ background thread handshake going on. The problem is that then I send a text command from the debug shell and I see nothing happens, just the endless flow of: poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout) poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout) poll([{fd=35, events=POLLIN}], 1, 0) = 0 (Timeout) poll([{fd=35, events=POLLIN}], 1, 250) = 0 (Timeout) ... However note that the zmq_send() on the REQ socket on the debug shell returns success. If I restart the daemon process and then repeat this test and I send a message from the debug shell I see this pattern: poll([{fd=73, events=POLLIN}], 1, 0) = 0 (Timeout) poll([{fd=73, events=POLLIN}], 1, 0) = 0 (Timeout) poll([{fd=73, events=POLLIN}], 1, 250) = 1 ([{fd=73, revents=POLLIN}]) poll([{fd=73, events=POLLIN}], 1, 0) = 1 ([{fd=73, revents=POLLIN}]) read(73, "\1\0\0\0\0\0\0\0", 8) = 8 poll([{fd=73, events=POLLIN}], 1, 0) = 0 (Timeout) open("/proc/2933/status", O_RDONLY) = 70 fstat(70, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fcd87392000 read(70, "Name:\thmstatemachine\nUmask:\t0000"..., 1024) = 1024 close(70) = 0 munmap(0x7fcd87392000, 4096) = 0 poll([{fd=17, events=POLLIN}], 1, 0) = 0 (Timeout) write(18, "\1\0\0\0\0\0\0\0", 8) = 8 poll([{fd=17, events=POLLIN}], 1, -1) = 1 ([{fd=17, revents=POLLIN}]) read(17, "\1\0\0\0\0\0\0\0", 8) = 8 poll([{fd=17, events=POLLIN}], 1, 0) = 0 (Timeout) poll([{fd=73, events=POLLIN}], 1, 0) = 0 (Timeout) write(13, "\1\0\0\0\0\0\0\0", 8) = 8 poll([{fd=73, events=POLLIN}], 1, 0) = 0 (Timeout) poll([{fd=73, events=POLLIN}], 1, 0) = 0 (Timeout) poll([{fd=73, events=POLLIN}], 1, 250) = 0 (Timeout) poll([{fd=73, events=POLLIN}], 1, 0) = 0 (Timeout) and the daemon process is able to reply to the text command successfully. Can you suggest me some way to debug this problem? How can I know if the ZMQ background threads of the daemon and the debug shell have successfully connected? (I'm using a monitor socket to understand that but maybe there-s a better way) How can I know if the problem lies on the debug shell (REQ socket) or on the daemon (REP socket)? (everything seems to indicate the problem is in the daemon: only restarting it I get back to a working condition!!) Thanks, Francesco _______________________________________________ zeromq-dev mailing list zeromq-dev@lists.zeromq.org https://lists.zeromq.org/mailman/listinfo/zeromq-dev