Do you think there's any way to reproduce this in the lab, e.g. killing a peer before it can shut down TCP properly?
On Tue, Jun 23, 2015 at 10:08 PM, Marcin Romaszewicz <[email protected]> wrote:
> Hi All,
>
> I've got an issue with ZMQ_ROUTER sockets which I'm having a hard time
> working around, and I'd love some advice, but I suspect the answer is that
> what I want to do isn't possible.
>
> Say I have a router socket listening on a port, and I have peers connecting
> and disconnecting randomly over TCP. These peers have random identities for
> all intents and purposes.
>
> Most of the time, a peer will disconnect "cleanly", meaning the TCP
> connection is terminated via FIN or RST packets, and ZMQ cleans up the file
> descriptor.
>
> However, some of the time, my peer will die silently, effectively due to
> network outage or power outage or something.
>
> In these cases, the router socket keeps the file descriptor around forever.
> I know that the peer is dead because all my peers heartbeat to each other,
> and the heartbeats have gone away. I thought that trying to send some data
> to a dead peer would tear down that connection, since the underlying TCP
> socket would eventually start erroring, but it doesn't; zmq must be dropping
> my packet before sending it to the underlying socket.
>
> The socket monitor tells me that someone has connected to the router socket
> on its bound port with a specific file descriptor, but I've got so many
> of these coming in that I can't associate a specific file descriptor with a
> specific peer.
>
> TCP keep-alives don't work all that well in raising errors in a dead
> connection.
>
> What I know on the app side due to my heartbeats is that peer XYZ is dead.
> I'd like to tell the router socket to close the underlying file descriptor.
> What I know via the monitor is that I have a bunch of file descriptors open,
> but I can't map them to peers. If I could, I'd just call os.close() on that
> file descriptor and hopefully ZMQ would handle this gracefully.
>
> Eventually, in a few hours of uptime, my process hits the OS file descriptor
> limit, and stops receiving new connections on the zeromq level. I can have
> the process quit when it detects this, but that forces all the functioning
> peers to reconnect and re-do some work, so I'd like to avoid it.
>
> I scanned the previous discussions about it, and there has been mention of
> exposing this somehow, but I don't see anything along these lines in the
> latest API (looking at the 4.1.2 release).
>
> Any suggestions on how I could work around this?
>
> I'm thinking of extending the socket monitor to have a new event type, like
> ZMQ_PEER_CONNECT/DISCONNECT which passes back the peer ID and file
> descriptor, but I've not gone through the zmq code enough yet to know how
> much work this would be.
>
> Thanks in advance,
> -- Marcin
>
>
>
> _______________________________________________
> zeromq-dev mailing list
> [email protected]
> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
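
For anyone following the thread, here is a rough sketch of the app-side bookkeeping described above: remembering when each peer identity last sent a heartbeat and listing the ones that have gone quiet. The class name, method names, and timeout value are all made up for illustration and are not part of ZeroMQ. It only tells you which identities look dead; it does nothing about the file descriptor the router socket is still holding, which is the open question in this thread.

```python
import time


class HeartbeatTracker:
    """Hypothetical helper: remembers the last heartbeat time per peer identity."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s   # silence longer than this marks a peer dead
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = {}         # peer identity (bytes) -> last heartbeat time

    def beat(self, peer_id):
        """Record a heartbeat received from peer_id."""
        self._last_seen[peer_id] = self._clock()

    def dead_peers(self):
        """Return the identities whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            peer for peer, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )

    def forget(self, peer_id):
        """Stop tracking a peer, e.g. once the application has given up on it."""
        self._last_seen.pop(peer_id, None)
```

In a receive loop this would be fed the identity frame of each incoming heartbeat via beat(), with dead_peers() polled periodically. Mapping those identities back to file descriptors would still need something like the monitor extension proposed above.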
