I've done some testing of the GitHub head version with the heartbeat code, and it didn't work for me: file descriptor counts still increase monotonically. However, I'm not sure I set up my test scenario properly. I had the following setup.
1 ZMQ_ROUTER socket with
  ZMQ_HEARTBEAT_IVL = 3000 (3 seconds)
  ZMQ_HEARTBEAT_TTL = 30000 (30 seconds)
  ZMQ_HEARTBEAT_TIMEOUT = 30000 (30 seconds)

However, for logistical reasons, the clients connecting to this ZMQ socket were on ZMQ 4.1.2. Was this a valid test scenario? It would take me a couple of days to set up the AMIs to test Router (4.2.0) <-> client (4.2.0).

Another question: if I switch a router socket into ZMQ_ROUTER_RAW mode, send it a disconnect frame (peer identity followed by an empty frame), then switch RAW mode off again, would I be doing something completely unsupported, or is it worth a try? My tests take a very long time and a lot of work to set up right now, so I'm reluctant to try something if it's probably a waste of time.

Thanks,
-- Marcin

On Wed, Jun 24, 2015 at 1:01 PM, Pieter Hintjens <[email protected]> wrote:
> The underlying sockets should indeed error out. Presumably the code
> isn't handling this properly.
>
>
> On Wed, Jun 24, 2015 at 8:16 PM, Marcin Romaszewicz <[email protected]> wrote:
> > Thanks, this probably would solve our problem, however, I'm reluctant to
> > deploy the bleeding edge from your git repo into our production systems,
> > even if it does work on my test cluster.
> >
> > When I detect that a peer is dead with my own heartbeats, why is it that
> > attempting to send data to the dead peer doesn't force some kind of
> > connection cleanup or reset? The underlying os sockets should error out
> > eventually.
> >
> > On Wed, Jun 24, 2015 at 10:52 AM, Pieter Hintjens <[email protected]> wrote:
> >>
> >> For what it's worth, we just merged a pull request that adds
> >> connection heartbeating. It could be fun to see if this solves your
> >> problem. (In theory it should...)
> >>
> >> https://github.com/zeromq/libzmq/pull/1448
> >>
> >>
> >> On Wed, Jun 24, 2015 at 6:48 PM, Marcin Romaszewicz <[email protected]> wrote:
> >> > Yes, you can easily reproduce this by pulling a network cable or shutting
> >> > the host down before it can do any sort of TCP connection cleanup. I'm
> >> > seeing it in AWS when instances get terminated, because they're given so
> >> > little time to respond to TERM that connections aren't cleaned up.
> >> >
> >> > The iptables approach which Francis mentioned should work as well.
> >> >
> >> > I'll see if I can come up with a simple example of reproducing this. It
> >> > might be even possible to repro this on a single machine simply by
> >> > suspending a peer.
> >> >
> >> > -- Marcin
> >> >
> >> > On Wed, Jun 24, 2015 at 2:47 AM, Pieter Hintjens <[email protected]> wrote:
> >> >>
> >> >> Do you think there's any way to reproduce this in the lab, e.g.
> >> >> killing a peer before it can shut down TCP properly?
> >> >>
> >> >> On Tue, Jun 23, 2015 at 10:08 PM, Marcin Romaszewicz <[email protected]> wrote:
> >> >> > Hi All,
> >> >> >
> >> >> > I've got an issue with ZMQ_ROUTER sockets which I'm having a hard time
> >> >> > working around, and I'd love some advice, but I suspect the answer is that
> >> >> > what I want to do isn't possible.
> >> >> >
> >> >> > Say I have a router socket listening on a port, and I have peers connecting
> >> >> > and disconnecting randomly over TCP. These peers have random identities for
> >> >> > all intents and purposes.
> >> >> >
> >> >> > Most of the time, a peer will disconnect "cleanly", meaning the TCP
> >> >> > connection is terminated via FIN or RST packets, ZMQ cleans up the file
> >> >> > descriptor.
> >> >> >
> >> >> > However, some of the time, my peer will die silently, effectively due to
> >> >> > network outage or power outage or something.
> >> >> >
> >> >> > In these cases, the router socket keeps the file descriptor around forever.
> >> >> > I know that the peer is dead because all my peers heartbeat to each other,
> >> >> > and the heartbeats have gone away. I thought that trying to send some data
> >> >> > to a dead peer would tear down that connection, since the underlying TCP
> >> >> > socket would eventually start erroring, but it doesn't, zmq must be dropping
> >> >> > my packet before sending it to the underlying socket.
> >> >> >
> >> >> > The socket monitor tells me that someone has connected to the router socket
> >> >> > on its bound port with a specific file descriptor, but I've got so many
> >> >> > of these coming in that I can't associate a specific file descriptor with a
> >> >> > specific peer.
> >> >> >
> >> >> > TCP keep-alives don't work all that well in raising errors in a dead
> >> >> > connection.
> >> >> >
> >> >> > What I know on the app side due to my heartbeats is that peer XYZ is dead.
> >> >> > I'd like to tell the router socket to close the underlying file descriptor.
> >> >> > What I know via the monitor is that I have a bunch of file descriptors open,
> >> >> > but I can't map them to peers. If I could, I'd just call os.close() on that
> >> >> > file descriptor and hopefully ZMQ would handle this gracefully.
> >> >> >
> >> >> > Eventually, in a few hours of uptime, my process hits the os file descriptor
> >> >> > limit, and stops receiving new connections on the zeromq level. I can have
> >> >> > the process quit when it detects this, but that forces all the functioning
> >> >> > peers to reconnect and re-do some work, so I'd like to avoid it.
> >> >> >
> >> >> > I scanned the previous discussions about it, and there has been mention of
> >> >> > exposing this somehow, but I don't see anything along these lines in the
> >> >> > latest API. (looking at 4.1.2 release).
> >> >> >
> >> >> > Any suggestions on how I could work around this?
> >> >> >
> >> >> > I'm thinking of extending the socket monitor to have a new event type, like
> >> >> > ZMQ_PEER_CONNECT/DISCONNECT which passes back the peer ID and file
> >> >> > descriptor, but I've not gone through the zmq code enough yet to know how
> >> >> > much work this would be.
> >> >> >
> >> >> > Thanks in advance,
> >> >> > -- Marcin
> >> >> >
> >> >> > _______________________________________________
> >> >> > zeromq-dev mailing list
> >> >> > [email protected]
> >> >> > http://lists.zeromq.org/mailman/listinfo/zeromq-dev
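
As an aside, the application-level liveness tracking described in the original message (peers heartbeat to each other, and a peer whose heartbeats stop is considered dead) could be sketched roughly as below. This is only an illustration under assumptions: the class name PeerTracker, its methods, the injectable clock, and the 30-second window are all hypothetical, and none of it is part of ZeroMQ or of the actual code discussed in this thread.

```python
import time


class PeerTracker:
    """Hypothetical helper: remember when each peer identity was last heard from."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self.timeout_s = timeout_s  # assumed liveness window, not taken from the thread
        self._clock = clock         # injectable so the logic can be exercised without sleeping
        self._last_seen = {}        # peer identity (bytes) -> time of last heartbeat

    def heartbeat(self, identity):
        """Record that a heartbeat (or any other message) arrived from this peer."""
        self._last_seen[identity] = self._clock()

    def reap_dead(self):
        """Return identities silent for longer than timeout_s, and forget them."""
        now = self._clock()
        dead = [ident for ident, seen in self._last_seen.items()
                if now - seen > self.timeout_s]
        for ident in dead:
            del self._last_seen[ident]
        return dead


if __name__ == "__main__":
    fake_now = [0.0]  # fake clock so the example runs instantly
    tracker = PeerTracker(timeout_s=30.0, clock=lambda: fake_now[0])
    tracker.heartbeat(b"peer-a")   # heard at t=0
    fake_now[0] = 20.0
    tracker.heartbeat(b"peer-b")   # heard at t=20
    fake_now[0] = 40.0
    # At t=40, peer-a has been silent for 40 s and peer-b for 20 s.
    print(tracker.reap_dead())
```

Knowing which identity is dead is the easy half. The open question in this thread is the other half: getting the ROUTER socket to release the file descriptor that belongs to that identity.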
