Make the protocol between the client and server stateful, with some kind of handshake between the two.
So when the server dies and then restarts quickly, a client sending a ping will get an error in reply, because the restarted server doesn't know the client. The client will then re-initiate the handshake and immediately expire any pending requests, which you can now resend.

To summarize:

1. Have a handshake / login process between client and server.
2. The server keeps a hash table of all clients (routing id to client), adding each client as part of the handshake process.
3. The client sends a ping every X seconds.
4. The server receives a ping and replies with a pong if the client is known, or with an error if the client is unknown.
5. A client that receives an error in reply to a ping re-initiates the handshake / login process.
6. The client either expires pending requests immediately when the error is received, or resends all pending requests once the handshake has completed.

On Fri, Mar 20, 2015 at 8:53 PM, Russell Della Rosa <[email protected]> wrote:

> Hi all,
>
> Quite a bit of setup to ask a question...
>
> Setup:
> ------
> I set up a simple zmq async request/reply architecture that looks like this:
> client[DEALER] --tcp--> primary server[ROUTER/proxy/DEALER] --inproc--> workers[DEALER]
>
> An identical secondary server exists for failover.
>
> To make this more robust against server death / network issues I set up a
> ping pong heartbeat system as described in the guide. (I liked the
> elegance of having the client control the timeouts, ping times, etc in
> ping/pong. This makes the server simple since all timings are controlled
> by the client and all the server has to do is reply with a pong. I
> basically modeled it after the Amazon EC2 ELB heartbeating, but for brevity
> won't go into that.)
>
> The client sends requests & pings in an async manner, but it will poll
> waiting for a reply before sending the next request. The client will fail
> a request if the REQUEST_TIMEOUT[300s] is exceeded.
> While waiting for a reply the client will ping the server every
> HB_INTERVAL[10s], wait HB_TIMEOUT[2s] in the poll loop to get a pong, and
> after some threshold of missed pings, HB_UNHEALTHY_THRESHOLD[2], the client
> will fail over to a secondary server. (Note that the REQUEST_TIMEOUT is
> quite long at 300s since some requests can take quite a while to complete.)
>
> Using the settings above, in brackets, all this works very well and only
> causes a worst case delay of around 20s on an outstanding request before it
> will fail over to the secondary.
>
> Problem:
> --------
> This works well, except in this one case:
> - Client sends a request, server receives the request, server dies, server
>   is restarted very quickly (fast enough to miss no more than one ping)
>
> In this case the client will wait the entire REQUEST_TIMEOUT and then
> fail the request. (The client assumes the server was working so it waits.
> The pings kept flowing to the server, save maybe 1, so it is treated as
> alive.)
>
> I have various ideas on how to fix this issue and resend faster, but none
> are that elegant.
> 1) Could add a retry after REQUEST_TIMEOUT, but that is a long time [300s]
>    to wait before retrying. Easiest...
> 2) Could add the server zmq identity to the pong message and force a
>    reconnect when the pong identity changes, but that can get complex with
>    multiple servers.
> 3) I considered using a ROUTER as the client so the pings would be dropped
>    when a server dies, but that is difficult to set up the first time and
>    various posts on this forum (see below) mention client routers coming and
>    going as being troublesome. (And ROUTER to ROUTER looks tricky to get
>    correct.)
>
> I considered Pub/Sub and one way heartbeats but neither would change this
> behavior, the pong messages would still flow.
>
> I have the service set up to auto-recover on a crash so it's more than just
> an edge case.
>
> Question:
> ---------
> I'm curious if anyone has solved this quick server restart problem in a
> clean way with socket patterns? Or if you have other suggestions?
>
> Or if you have example code of ping/pong handling this case I'd love to
> see it.
>
> Thanks!
> -- Russell
>
> Related threads:
> ----------------
> Disconnects / Retry Logic -
> http://lists.zeromq.org/pipermail/zeromq-dev/2012-January/015024.html
> Using a router with an identity issue -
> http://lists.zeromq.org/pipermail/zeromq-dev/2014-February/025206.html
>
> _______________________________________________
> zeromq-dev mailing list
> [email protected]
> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
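
Since the question asks for example code, here is a minimal sketch of the handshake / ping / error scheme from the top of this reply. It is not real ZeroMQ code. The "transport" is a plain Python function call standing in for the sockets, and the command names (HELLO, WELCOME, PING, PONG, ERROR, REQUEST) and the Server / Client classes are hypothetical names made up for illustration. With a real ROUTER socket, the routing id would be the identity frame that ROUTER prepends to each message from a DEALER. The every-X-seconds ping timer in the client's poll loop is left out.

```python
# Hypothetical command names; any distinct byte strings would do.
HELLO, WELCOME, PING, PONG, ERROR, REQUEST = (
    b"HELLO", b"WELCOME", b"PING", b"PONG", b"ERROR", b"REQUEST")


class Server:
    """Server side: only clients that completed the handshake are known."""

    def __init__(self):
        self.clients = {}  # routing id -> per-client state; empty after a restart
        self.queue = []    # accepted requests that have not been processed yet

    def handle(self, routing_id, command, body=b""):
        """Return the reply command for one message, or None if it was queued."""
        if command == HELLO:
            self.clients[routing_id] = {}
            return WELCOME
        if routing_id not in self.clients:
            return ERROR  # unknown client: it must log in again
        if command == PING:
            return PONG
        if command == REQUEST:
            self.queue.append((routing_id, body))
            return None  # the real reply would arrive later, asynchronously
        return ERROR


class Client:
    """Client side: re-handshakes and resends when a ping is answered with ERROR."""

    def __init__(self, routing_id, transport):
        self.routing_id = routing_id
        self.transport = transport  # callable(routing_id, command, body) -> reply
        self.pending = []           # request bodies still waiting for a reply
        self.handshakes = 0
        self.handshake()

    def handshake(self):
        if self.transport(self.routing_id, HELLO, b"") != WELCOME:
            raise ConnectionError("handshake refused")
        self.handshakes += 1

    def request(self, body):
        self.pending.append(body)
        self.transport(self.routing_id, REQUEST, body)

    def ping(self):
        """Send one heartbeat; return True if the server still knows this client."""
        if self.transport(self.routing_id, PING, b"") == PONG:
            return True
        # The server lost our session (most likely it restarted):
        # log in again, then resend everything that was in flight.
        self.handshake()
        for body in self.pending:
            self.transport(self.routing_id, REQUEST, body)
        return False


# Demo: swapping in a fresh Server simulates a crash followed by a fast restart.
current = {"server": Server()}


def transport(routing_id, command, body):
    return current["server"].handle(routing_id, command, body)


client = Client(b"client-1", transport)
client.request(b"long job")
alive_before = client.ping()   # the server knows the client
current["server"] = Server()   # restart: client table and queue are gone
alive_after = client.ping()    # ERROR, so the client re-handshakes and resends
print(alive_before, alive_after, current["server"].queue)
```

The sketch resends pending requests after the new handshake (the second option in step 6). To expire them instead, replace the resend loop in ping() with code that fails each pending request back to the caller. Either way the client reacts on the first ping after the restart, instead of waiting out the full REQUEST_TIMEOUT.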
