On 23Feb18, Luca Boccassi allegedly wrote:
> That's because it's round-robin, and the connection is async - so it
> will wait on the first server to respond, and it never does so it's
> blocked there. Sounds like what you really want is "fail-over" - IE if
> the first does not respond, try the second. That might work if you tune
> the tcp reconnect options to have a small timeout, so that the pipe is
> removed - by default it's quite large. Note sure it will work - try it.

The TCP (re)connection is being established just fine, so I doubt the re-connect options will help. (But on your suggestion I did try adjusting the re-connect opts just to be sure - no effect.)

What did help somewhat was setting ZMQ_IMMEDIATE. But that doesn't handle the corner case where the connection appears to be up as far as 0MQ is concerned while, unbeknown to it, the TCP socket is dead or will be dead when it next tries to use it.

My guess is that once a connection is selected by round-robin and the message is queued, the round-robin decision is never re-evaluated, even if the initially selected connection subsequently dies and ZMQ_IMMEDIATE is set. IOW, ZMQ_IMMEDIATE helps, but it's not a bullet-proof way of achieving fail-over, because the outcome depends on when the socket is discovered to be disconnected.

To fix that case 0MQ would have to re-queue and re-round-robin to an active connection on detecting a dead TCP socket. That would be a convenient enhancement, but even if 0MQ did this we're still not out of the woods wrt making a bullet-proof fail-over scheme. That's because the next failure mode is when the REQ gets sent but the REP never makes it back - most likely because the selected server dies. The only solution to that is for the client to re-send the REQ, but doing so breaks the strict REQ/REP sequence on the socket, so we have to set ZMQ_REQ_RELAXED and thus also ZMQ_REQ_CORRELATE.

And then there is still the final failure mode where all servers are dead and we'll never ever get a REP.
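
For concreteness, the socket set-up discussed so far might look roughly like the following in pyzmq. This is a sketch under assumptions, not tested against the failure cases above: the choice of pyzmq and the two endpoint addresses are placeholders that do not come from this thread.

```python
import zmq

# Hypothetical server endpoints, for illustration only.
ENDPOINTS = ("tcp://127.0.0.1:5555", "tcp://127.0.0.1:5556")

ctx = zmq.Context()
req = ctx.socket(zmq.REQ)

# Set options before connect() so they apply to every connection made below.
req.setsockopt(zmq.IMMEDIATE, 1)      # queue only on completed connections
req.setsockopt(zmq.REQ_RELAXED, 1)    # allow a new send() without a prior recv()
req.setsockopt(zmq.REQ_CORRELATE, 1)  # tag requests so stale replies are dropped

# One REQ socket, several connections: 0MQ round-robins requests across them.
for endpoint in ENDPOINTS:
    req.connect(endpoint)
```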
So, all in all, to achieve a bullet-proof fail-over with REQ/REP taking advantage of 0MQ's multiple-connections-per-socket feature we have to deal with:

1) A 0MQ connection dead
2) The selected 0MQ connection alive but TCP dead
3) The selected 0MQ connection alive and TCP alive but server dies prior to REP

and finally:

4) All connections dead

And to achieve that the application has to:

a) set ZMQ_IMMEDIATE to handle 1)
b) set ZMQ_SNDTIMEO to detect 2) and detect 4)
c) set ZMQ_RCVTIMEO to detect 3) and detect 4)
d) set ZMQ_REQ_RELAXED and ZMQ_REQ_CORRELATE to allow...
   d.1) re-send REQ on an EAGAIN return from send() to handle 2)
   d.2) re-send REQ on an EAGAIN return from recv() to handle 3)

and finally:

e) Wrap all of the above in a retry limit to handle 4)

Phew! That's a bit of work, but I can wrap that all up in some sort of reqrep_exchange() routine as it's a common-enough pattern.

Question: What have I missed?

Mark.
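
A minimal sketch of what steps d) and e) might look like as a reqrep_exchange() routine, in Python. It is written against a duck-typed socket so it runs without a live server: the `again_exc` parameter, the `ExchangeFailed` and `Timeout` exceptions, and the `ScriptedSocket` stand-in are all hypothetical names introduced for illustration. It assumes the real socket already has the options from a) to d) set, so that send() and recv() time out with EAGAIN and a re-send without a prior recv() is allowed.

```python
class ExchangeFailed(Exception):
    """No reply within the retry limit - treated as 'all servers dead' (case 4)."""


class Timeout(Exception):
    """Stand-in for the binding's EAGAIN exception (zmq.Again in pyzmq)."""


def reqrep_exchange(sock, request, again_exc, max_attempts=3):
    """Send `request` and return the reply, re-sending on EAGAIN.

    Assumes sock.send() / sock.recv() raise `again_exc` when ZMQ_SNDTIMEO /
    ZMQ_RCVTIMEO expires, and that ZMQ_REQ_RELAXED and ZMQ_REQ_CORRELATE are
    set so that sending again is legal and late replies are discarded.
    """
    for _ in range(max_attempts):          # e) retry limit
        try:
            sock.send(request)
        except again_exc:                  # d.1) send timed out: case 2 (or 4)
            continue
        try:
            return sock.recv()
        except again_exc:                  # d.2) reply never came: case 3 (or 4)
            continue
    raise ExchangeFailed(f"no reply after {max_attempts} attempts")


class ScriptedSocket:
    """Test stand-in: each send() consumes one scripted outcome."""

    def __init__(self, outcomes):
        self.outcomes = list(outcomes)
        self.sent = 0                      # number of sends that succeeded
        self._current = "send_timeout"

    def send(self, data):
        # With nothing left in the script, behave as if every server is dead.
        self._current = self.outcomes.pop(0) if self.outcomes else "send_timeout"
        if self._current == "send_timeout":
            raise Timeout()
        self.sent += 1

    def recv(self):
        if self._current == "recv_timeout":
            raise Timeout()
        return self._current


if __name__ == "__main__":
    # First attempt fails on send, second on recv, third gets a reply.
    sock = ScriptedSocket(["send_timeout", "recv_timeout", b"pong"])
    print(reqrep_exchange(sock, b"ping", Timeout))
```

With pyzmq the call would presumably be reqrep_exchange(req, b"ping", zmq.Again) on a zmq.REQ socket connected to each server. Note that the total worst-case wait is roughly max_attempts times the sum of the send and receive timeouts, so the retry limit and the two timeouts need to be chosen together.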