Hi Luca,

Having a single/shared context didn't help. As soon as the REQ client timed out, 0MQ seemed to get confused and started leaking file handles. It ended up with hundreds of those [eventfd] file descriptors open.

I am not sure if it's an issue with the reaper. My feeling is that the core issue is the REQ client going silent after successfully establishing the CURVE authentication. I have no idea whether 0MQ hits some system limit or whether there is a bug of some sort, but that's the odd thing for me: a successful CURVE handshake/authentication and then silence.

For now, I've got a cron job that restarts stuck workers, so it's not that urgent/critical. I've got some time to do a bit more digging or testing, but I don't quite know where to start.
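One low-effort check might be to snapshot /proc/<pid>/fd of a stuck worker periodically and see exactly which descriptor types pile up before the hang. A rough sketch (fd_summary is just a made-up helper name, nothing 0MQ-specific):

import collections
import os
import sys

def fd_summary(pid):
    """Tally /proc/<pid>/fd by link target, so leaked eventfd handles
    show up as 'anon_inode:[eventfd]' with a count next to them."""
    counts = collections.Counter()
    fd_dir = "/proc/{0}/fd".format(pid)
    for fd in os.listdir(fd_dir):
        try:
            counts[os.readlink(os.path.join(fd_dir, fd))] += 1
        except OSError:
            continue  # descriptor closed while we were iterating
    return counts

if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for target, count in fd_summary(pid).most_common(10):
        print("{0:5d}  {1}".format(count, target))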

Thanks,
Tomas

Date: Thu, 11 May 2017 11:38:35 +0100
From: Luca Boccassi <luca.bocca...@gmail.com>
To: ZeroMQ development list <zeromq-dev@lists.zeromq.org>
Subject: Re: [zeromq-dev] Destroying 0MQ context gets indefinitely
        stuck/hangs despite linger=0
Message-ID: <1494499115.4886.3.ca...@gmail.com>
Content-Type: text/plain; charset="utf-8"

On Wed, 2017-05-10 at 15:21 +1000, Tomas Krajca wrote:
Hi Luca and thanks for your reply.

  > Note that these are two well-known anti-patterns. The context is
  > intended to be shared and be unique in an application, and live for
  > as long as the process does, and the sockets are meant to be long
  > lived as well.
  >
  > I would recommend refactoring and, at the very least, using a single
  > context for the duration of your application.
  >

I always thought that having separate contexts was safer. I will
refactor the application to use one context for all the clients/sockets
and see if it makes any difference.
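For what it's worth, the shape of the refactor I have in mind is roughly this (fetch_data, the endpoint and the CURVE setup are placeholders, not the real application code):

import zmq

# One process-wide context, short-lived REQ sockets per task.
CTX = zmq.Context.instance()

def fetch_data(endpoint, request, timeout_ms=20000):
    sock = CTX.socket(zmq.REQ)
    sock.setsockopt(zmq.LINGER, 0)
    sock.setsockopt(zmq.RCVTIMEO, timeout_ms)
    try:
        sock.connect(endpoint)
        sock.send(request)
        return sock.recv()   # raises zmq.Again if the server stays silent
    finally:
        sock.close()         # the shared context is never terminated here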

I wonder if that's going to eliminate the initial problem though. If
the sockets really do somehow get stuck/into an inconsistent state,
then I imagine they will just "leak" and stay in that context forever,
possibly preventing the app from terminating properly.
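If that does happen, one shutdown-time fallback might be pyzmq's Context.destroy(), which closes every socket created from the context with the given linger before terminating, though if the hang is below that level it presumably won't help either. A minimal sketch, assuming the shared context from above:

import zmq

ctx = zmq.Context.instance()
# ... application runs, possibly leaking sockets into ctx ...
ctx.destroy(linger=0)   # close all of ctx's sockets with linger=0, then terminate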

There could be an unknown race with the reaper; a single shared context
should help in that case.

The client usually is long lived, for as long as the app lives, but in
this particular app it's a bit more special in that the separate tasks
just use the clients to fetch some data in a standardized way, do their
computation and exit. These tasks are periodically spawned by celery.

Message: 1
Date: Mon, 08 May 2017 11:58:42 +0100
From: Luca Boccassi <luca.bocca...@gmail.com>
To: ZeroMQ development list <zeromq-dev@lists.zeromq.org>
Cc: "develop...@repositpower.com" <develop...@repositpower.com>
Subject: Re: [zeromq-dev] Destroying 0MQ context gets indefinitely
        stuck/hangs despite linger=0
Message-ID: <1494241122.11089.5.ca...@gmail.com>
Content-Type: text/plain; charset="utf-8"

On Mon, 2017-05-08 at 11:08 +1000, Tomas Krajca wrote:
Hi all,

I have come across a weird/bad bug, I believe.

I run libzmq 4.1.6 and pyzmq 16.0.2. This happens on both CentOS 6 and
CentOS 7.

The application is a celery worker that runs 16 worker threads. Each
worker thread instantiates a 0MQ-based client, gets data and then
closes this client. The 0MQ-based client creates its own 0MQ context
and terminates it on exit. Nothing is shared between the threads or
clients; every client processes only one request and then it's fully
terminated.

The client itself is a REQ socket which uses CURVE authentication to
authenticate with a ROUTER socket on the server side. The REQ socket
has linger=0. Almost always, the REQ socket issues a request, gets back
the response, closes the socket, destroys its context, and all is good.
Once every one or two days though, the REQ socket times out when
waiting for the response from the ROUTER server; it then successfully
closes the socket but hangs indefinitely when it goes on to destroy the
context.
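Boiled down, the pattern is roughly the following (the endpoint, keys and payload here are placeholders, not the actual application code):

import zmq

def one_shot_request(endpoint, server_public, client_public, client_secret,
                     payload, timeout_ms=20000):
    ctx = zmq.Context()                        # private, per-client context
    sock = ctx.socket(zmq.REQ)
    sock.setsockopt(zmq.LINGER, 0)
    sock.setsockopt(zmq.RCVTIMEO, timeout_ms)
    sock.setsockopt(zmq.CURVE_SERVERKEY, server_public)
    sock.setsockopt(zmq.CURVE_PUBLICKEY, client_public)
    sock.setsockopt(zmq.CURVE_SECRETKEY, client_secret)
    sock.connect(endpoint)
    try:
        sock.send(payload)
        return sock.recv()                     # raises zmq.Again after 20s
    finally:
        sock.close()                           # always returns
        ctx.term()                             # occasionally never returns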

Note that these are two well-known anti-patterns. The context is
intended to be shared and be unique in an application, and live for as
long as the process does, and the sockets are meant to be long lived
as well.

I would recommend refactoring and, at the very least, using a single
context for the duration of your application.

This runs in a data center on a 1Gb/s LAN, so the responses usually
finish in under a second; the timeout is 20s. My theory is that the
socket gets into a weird state and that's why it times out and blocks
the context termination.

I ran a tcpdump and it turns out that the REQ client successfully
authenticates with the ROUTER server but then goes completely silent
for those 20-odd seconds.

Here is a tcpdump capture of a stuck REQ client -
https://pastebin.com/HxWAp6SQ. Here is a tcpdump capture of a normal
communication - https://pastebin.com/qCi1jTp0. This is a full backtrace
(after a SIGABRT signal to the stuck application) -
https://pastebin.com/jHdZS4VU

Here is ulimit:

[root@auhwbesap001 tomask]# cat /proc/311/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             31141                31141                processes
Max open files            8196                 8196                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       31141                31141                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us


The application doesn't seem to go over any of the limits; it usually
hovers between 100 and 200 open file handles.
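For anyone trying to reproduce this, a quick sanity check of the descriptor count against the soft limit (fd_headroom is just a made-up helper name):

import os

def fd_headroom(pid):
    """Compare the number of currently open descriptors with the
    process's 'Max open files' soft limit from /proc/<pid>/limits."""
    open_fds = len(os.listdir("/proc/{0}/fd".format(pid)))
    soft_limit = None
    with open("/proc/{0}/limits".format(pid)) as limits:
        for line in limits:
            if line.startswith("Max open files"):
                soft_limit = int(line.split()[3])   # soft-limit column
                break
    return open_fds, soft_limit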

I tried to swap the REQ socket for a DEALER socket but that didn't
help; the context eventually hung as well.

I also tried to set ZMQ_BLOCKY to 0 and/or ZMQ_HANDSHAKE_IVL to 100ms
but the context still eventually hung.
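For reference, this is roughly how those options would be set from pyzmq; if I read the changelogs right, ZMQ_BLOCKY only became a context option in libzmq 4.2, so on 4.1.6 it may not do anything:

import zmq

ctx = zmq.Context()
try:
    ctx.set(zmq.BLOCKY, 0)        # don't block in term() waiting for pending messages
except (AttributeError, zmq.ZMQError):
    pass                          # constant/option not supported by this pyzmq/libzmq

sock = ctx.socket(zmq.REQ)
sock.setsockopt(zmq.HANDSHAKE_IVL, 100)   # give up on the handshake after 100 ms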

I looked into the C++ code of libzmq but would need some guidance to
troubleshoot this, as I am primarily a Python programmer.

I think we had a similar issue back in 2014 -
https://lists.zeromq.org/pipermail/zeromq-dev/2014-September/026752.html.
From memory, the tcpdump capture also showed the client/REQ going
silent after the successful initial CURVE authentication, but at that
time the server/ROUTER application was crashing with an assertion.

I am happy to do any more debugging.

Thanks in advance for any help/pointers.



--
Tomas Krajca
Software architect
m. 02 6162 0277
e. to...@repositpower.com