Using srcfd is problematic: ZeroMQ handles reconnection itself, so the srcfd might change.
To solve your problem I would change the design: keep sending heartbeats during long jobs by moving the worker to a two-thread model. Alternatively, set a maximum time for a job, after which you consider the worker dead. If it turns out not to be dead, you can handle the reconnection then.

On Feb 14, 2017 17:33, "Gyorgy Szekely" <[email protected]> wrote:

> Hi,
> The implemented protocol (ZMQ-RFC 7/MDP) has application level mutual
> heartbeating between the broker and the worker. And this works fine: both
> parties detect if the other side dies via missing heartbeats. The problem
> appears when the worker is assigned a long running job, heartbeating is
> _disabled_ while the job is being processed (as per 7/MDP specifies). This
> enables the worker to be single threaded, and avoids typical multithreaded
> issues (eg. processing thread hangs, heartbeating thread runs; worker in
> inconsistent state).
>
> When a worker crashes during job processing my application doesn't realize
> this since no messages are flowing (the broker is waiting for the job
> result), but the libzmq detects this, as the socket is always closed. My
> goal is to always keep in sync the number of underlying sockets in libzmq
> and Worker related objects in my application.
>
> I've googled around and found a few libzmq features that would suit my
> needs:
> - ZMQ_IDENTITY_FD - this was introduced and shortly removed from the lib
> - ZMQ_SRCFD - deprecated, but it's exactly what I need!
> - "Peer-Address" metadata, the recommended replacement for ZMQ_SRCFD, but
> not suitable for my needs
>
> I know fd's should be handled with care (monitor events are asynchronous,
> fd's get reused), but ZMQ_SRCFD solves my problem with the following
> ruleset:
> 1. When a Worker registers (first message over a connection) save the
> underlying fd
> - and -
> 2. Check that this fd is in use by another Worker, if it is: that Worker
> is dead since libzmq reused its file descriptor
>
> 3. If a Worker's fd is in closed state for a longer period (heartbeat
> expiry time), then it crashed and the fd was not re-used (get this info
> from monitor)
>
> I don't know if this is considered as an ugly hack by hardcore zeromq
> users, but it looks like a legitimate ZMQ_SRCFD use-case to me. It would be
> nice if it wasn't removed in the upcoming versions.
> Any feedback welcome!
>
> Regards,
> Gyorgy
>
> On Mon, Feb 13, 2017 at 10:21 PM, Greg Young <[email protected]>
> wrote:
>
>> I believe the term here is application level heartbeats.
>>
>> It should also be supported that clients can heartbeat to server. It
>> is not always that all clients want similar heartbeat timeouts.
>>
>> On Mon, Feb 13, 2017 at 4:07 PM, Michal Vyskocil
>> <[email protected]> wrote:
>> > Hi,
>> >
>> > You can take inspiration from malamute broker
>> > https://github.com/zeromq/malamute
>> >
>> > There clients pings server regularly. The same does MQTT (just it's a
>> > server, who pings clients).
>> >
>> > Sadly malamute is vulnerable to the same problem, that received service
>> > request may get lost. Solution would be to let client to send a request
>> > again after timeout, however wasn't yet implemented.
>> >
>> > _______________________________________________
>> > zeromq-dev mailing list
>> > [email protected]
>> > https://lists.zeromq.org/mailman/listinfo/zeromq-dev
>>
>> --
>> Studying for the Turing test
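
For the two-thread worker suggested at the top of this mail, a rough sketch of what I have in mind is below. It assumes the job runs on a helper thread while the thread that owns the socket keeps heartbeating, since a ZeroMQ socket should not be used from two threads at once. The names run_job_with_heartbeats, job and send_heartbeat are made up for illustration; send_heartbeat stands in for whatever your worker uses to send an MDP heartbeat.

```python
import threading


def run_job_with_heartbeats(job, send_heartbeat, interval=1.0):
    """Run job() on a helper thread and heartbeat from the calling thread.

    job and send_heartbeat are hypothetical callables supplied by the
    worker; send_heartbeat is only ever called from the calling thread,
    which is assumed to be the one that owns the socket.
    """
    result = {}

    def target():
        # Capture either the return value or the exception of the job.
        try:
            result["value"] = job()
        except Exception as exc:
            result["error"] = exc

    helper = threading.Thread(target=target, daemon=True)
    helper.start()

    # Keep heartbeating until the helper thread has finished the job.
    while helper.is_alive():
        send_heartbeat()
        helper.join(timeout=interval)

    if "error" in result:
        raise result["error"]
    return result["value"]
```

This alone does not catch a job that hangs while the heartbeats keep flowing, which is the inconsistent state you mention, so I would still combine it with a maximum job time.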
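
For the maximum job time alternative, the broker side could look roughly like this. It is only a sketch under my own assumptions: the class JobDeadlines and its method names are invented here, worker ids are whatever identity the broker already keeps per worker, and the clock is injectable so the logic can be tested without sleeping.

```python
import time


class JobDeadlines:
    """Track, per busy worker, the time by which its job must be done."""

    def __init__(self, max_job_seconds, clock=time.monotonic):
        self.max_job_seconds = max_job_seconds
        self.clock = clock
        self._deadline = {}  # worker id -> absolute deadline

    def job_assigned(self, worker_id):
        # Start the timer when the broker hands a job to the worker.
        self._deadline[worker_id] = self.clock() + self.max_job_seconds

    def job_finished(self, worker_id):
        # A reply arrived in time, so stop tracking this worker.
        self._deadline.pop(worker_id, None)

    def expired(self):
        """Return and forget the workers whose job ran past the limit."""
        now = self.clock()
        dead = [w for w, d in self._deadline.items() if d <= now]
        for w in dead:
            del self._deadline[w]
        return dead
```

The broker would call expired() from its normal poll loop and treat the returned workers as dead. If one of them later sends a reply after all, that is the point where you handle it as a reconnection.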
