The underlying sockets should indeed error out. Presumably the code isn't handling this properly.
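
A minimal sketch of the application-level heartbeat bookkeeping described in the quoted message below. It is plain Python with no ZeroMQ calls; `PeerLiveness`, `heartbeat`, `reap`, and the `timeout` value are hypothetical names for illustration, not part of libzmq or its bindings. It only tells the application which peer identities have gone silent. It does not close the router's file descriptor for that peer, which is the gap this thread is about.

```python
import time


class PeerLiveness:
    """Track when each peer identity was last heard from (hypothetical helper)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence before a peer counts as dead
        self.clock = clock       # injectable clock, so the logic can be tested
        self.last_seen = {}      # peer identity (bytes) -> time of last heartbeat

    def heartbeat(self, peer_id):
        """Record that a heartbeat (or any message) arrived from peer_id."""
        self.last_seen[peer_id] = self.clock()

    def reap(self):
        """Forget and return the peers that have been silent longer than timeout."""
        now = self.clock()
        dead = [p for p, t in self.last_seen.items() if now - t > self.timeout]
        for p in dead:
            del self.last_seen[p]
        return dead
```

Calling `heartbeat(identity)` for each message received on the router socket, and `reap()` on a timer, would yield the identities the application should treat as dead.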

On Wed, Jun 24, 2015 at 8:16 PM, Marcin Romaszewicz <[email protected]> wrote:
> Thanks, this probably would solve our problem, however, I'm reluctant to
> deploy the bleeding edge from your git repo into our production systems,
> even if it does work on my test cluster.
>
> When I detect that a peer is dead with my own heartbeats, why is it that
> attempting to send data to the dead peer doesn't force some kind of
> connection cleanup or reset? The underlying os sockets should error out
> eventually.
>
> On Wed, Jun 24, 2015 at 10:52 AM, Pieter Hintjens <[email protected]> wrote:
>>
>> For what it's worth, we just merged a pull request that adds
>> connection heartbeating. It could be fun to see if this solves your
>> problem. (In theory it should...)
>>
>> https://github.com/zeromq/libzmq/pull/1448
>>
>>
>> On Wed, Jun 24, 2015 at 6:48 PM, Marcin Romaszewicz <[email protected]>
>> wrote:
>> > Yes, you can easily reproduce this by pulling a network cable or shutting
>> > the host down before it can do any sort of TCP connection cleanup. I'm
>> > seeing it in AWS when instances get terminated, because they're given so
>> > little time to respond to TERM that connections aren't cleaned up.
>> >
>> > The iptables approach which Francis mentioned should work as well.
>> >
>> > I'll see if I can come up with a simple example of reproducing this. It
>> > might be even possible to repro this on a single machine simply by
>> > suspending a peer.
>> >
>> > -- Marcin
>> >
>> > On Wed, Jun 24, 2015 at 2:47 AM, Pieter Hintjens <[email protected]> wrote:
>> >>
>> >> Do you think there's any way to reproduce this in the lab, e.g.
>> >> killing a peer before it can shut down TCP properly?
>> >>
>> >> On Tue, Jun 23, 2015 at 10:08 PM, Marcin Romaszewicz <[email protected]>
>> >> wrote:
>> >> > Hi All,
>> >> >
>> >> > I've got an issue with ZMQ_ROUTER sockets which I'm having a hard time
>> >> > working around, and I'd love some advice, but I suspect the answer is that
>> >> > what I want to do isn't possible.
>> >> >
>> >> > Say I have a router socket listening on a port, and I have peers connecting
>> >> > and disconnecting randomly over TCP. These peers have random identities for
>> >> > all intents and purposes.
>> >> >
>> >> > Most of the time, a peer will disconnect "cleanly", meaning the TCP
>> >> > connection is terminated via FIN or RST packets, and ZMQ cleans up the file
>> >> > descriptor.
>> >> >
>> >> > However, some of the time, my peer will die silently, effectively due to
>> >> > network outage or power outage or something.
>> >> >
>> >> > In these cases, the router socket keeps the file descriptor around forever.
>> >> > I know that the peer is dead because all my peers heartbeat to each other,
>> >> > and the heartbeats have gone away. I thought that trying to send some data
>> >> > to a dead peer would tear down that connection, since the underlying TCP
>> >> > socket would eventually start erroring, but it doesn't; zmq must be dropping
>> >> > my packet before sending it to the underlying socket.
>> >> >
>> >> > The socket monitor tells me that someone has connected to the router socket
>> >> > on its bound port with a specific file descriptor, but I've got so many
>> >> > of these coming in that I can't associate a specific file descriptor with a
>> >> > specific peer.
>> >> >
>> >> > TCP keep-alives don't work all that well in raising errors in a dead
>> >> > connection.
>> >> >
>> >> > What I know on the app side due to my heartbeats is that peer XYZ is dead.
>> >> > I'd like to tell the router socket to close the underlying file descriptor.
>> >> > What I know via the monitor is that I have a bunch of file descriptors open,
>> >> > but I can't map them to peers. If I could, I'd just call os.close() on that
>> >> > file descriptor and hopefully ZMQ would handle this gracefully.
>> >> >
>> >> > Eventually, in a few hours of uptime, my process hits the os file descriptor
>> >> > limit, and stops receiving new connections on the zeromq level. I can have
>> >> > the process quit when it detects this, but that forces all the functioning
>> >> > peers to reconnect and re-do some work, so I'd like to avoid it.
>> >> >
>> >> > I scanned the previous discussions about it, and there has been mention of
>> >> > exposing this somehow, but I don't see anything along these lines in the
>> >> > latest API. (looking at 4.1.2 release).
>> >> >
>> >> > Any suggestions on how I could work around this?
>> >> >
>> >> > I'm thinking of extending the socket monitor to have a new event type, like
>> >> > ZMQ_PEER_CONNECT/DISCONNECT which passes back the peer ID and file
>> >> > descriptor, but I've not gone through the zmq code enough yet to know how
>> >> > much work this would be.
>> >> >
>> >> > Thanks in advance,
>> >> > -- Marcin
_______________________________________________
zeromq-dev mailing list
[email protected]
http://lists.zeromq.org/mailman/listinfo/zeromq-dev
