Make the protocol between the client and server stateful and have some kind
of handshake between client and server.

So when the server dies and then restarts quickly, a ping from the client
will be answered with an error, because the server no longer knows the
client. The client will then re-initiate the handshake and immediately
expire any pending requests, which you can now resend. To summarize:

1. Have a handshake / login process between client and server
2. Server keeps a hash table of all clients (routing id to client), to
which it adds each client as part of the handshake process
3. Client sends a ping every X seconds
4. Server receives a ping and replies with pong if the client is known, or
with an error if the client is unknown
5. A client that receives an error in reply to a ping re-initiates the
handshake/login process
6. Client immediately expires pending requests when the error is received,
or resends all pending requests after the handshake is completed
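
The steps above can be sketched roughly as follows. This is only a
transport-free illustration in Python, not working zmq code: the message
names (HELLO, WELCOME, PING, PONG, ERROR) and the Server / Client classes
are made up for the example, and in a real setup the routing id would be
the first frame the ROUTER socket hands you rather than a function argument.

```python
from dataclasses import dataclass, field

# Hypothetical message names; no wire format is defined in this thread.
HELLO, WELCOME, PING, PONG, ERROR = "HELLO", "WELCOME", "PING", "PONG", "ERROR"


class Server:
    """Keeps a table of known clients keyed by routing id (steps 2 and 4)."""

    def __init__(self):
        # routing id -> per-client state; empty again after every restart
        self.clients = {}

    def handle(self, routing_id, msg):
        if msg == HELLO:
            # Handshake: remember the client (steps 1 and 2).
            self.clients[routing_id] = {}
            return WELCOME
        if msg == PING:
            # Pong only for clients that completed the handshake (step 4).
            return PONG if routing_id in self.clients else ERROR
        return ERROR


@dataclass
class Client:
    routing_id: str
    pending: list = field(default_factory=list)  # requests awaiting a reply
    handshakes: int = 0

    def handshake(self, server):
        if server.handle(self.routing_id, HELLO) != WELCOME:
            raise RuntimeError("handshake rejected")
        self.handshakes += 1

    def ping(self, server):
        """Send one ping; return the pending requests that must be resent."""
        if server.handle(self.routing_id, PING) == PONG:
            return []
        # The server does not know us, so it must have restarted (step 5).
        self.handshake(server)
        # Expire everything that was in flight; the caller resends (step 6).
        expired, self.pending = self.pending, []
        return expired
```

With this shape the client notices a restart on the next ping, so the
wait before a resend is bounded by the ping interval instead of the
request timeout.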




On Fri, Mar 20, 2015 at 8:53 PM, Russell Della Rosa <
[email protected]> wrote:

> Hi all,
>
> Quite a bit of setup to ask a question...
>
> Setup:
> ------
> I set up a simple zmq async request/reply architecture that looks like this:
> client[DEALER] --tcp--> primary server[ROUTER/proxy/DEALER] --inproc-->
> workers[DEALER]
>
> An identical secondary server exists for failover.
>
> To make this more robust against server death / network issues I set up a
> ping pong heartbeat system as described in the guide.  (I liked the
> elegance of having the client control the timeouts, ping times, etc in
> ping/pong.  This makes the server simple since all timings are controlled
> by the client and all the server has to do is reply with a pong.  I
> basically modeled it after the Amazon EC2 ELB heartbeating, but for brevity
> won't go into that.)
>
> The client sends requests & pings in an async manner, but it will poll
> waiting for a reply before sending the next request.  The client will fail
> a request if the REQUEST_TIMEOUT[300s] is exceeded.  While waiting for a
> reply the client will ping the server every HB_INTERVAL[10s], wait
> HB_TIMEOUT[2s] in the poll loop to get a pong, and after some threshold of
> missed pings, HB_UNHEALTHY_THRESHOLD[2], the client will failover to a
> secondary server.  (Note that the REQUEST_TIMEOUT is quite long at 300s
> since some requests can take quite a while to complete.)
>
> Using the settings above, in brackets, all this works very well and only
> causes a worst case delay of around 20s on an outstanding request before it
> will failover to the secondary.
>
> Problem:
> --------
> This works well, except in this one case:
> - Client sends a request, server receives the request, server dies, server
> is restarted very quickly (fast enough to miss no more than one ping)
>
> In this case the client will wait the entire REQUEST_TIMEOUT and then
> fail the request.  (The client assumes the server was working so it waits.
> The pings kept flowing to the server, save maybe 1, so it was treated as alive.)
>
> I have various ideas on how to fix this issue and resend faster, but none
> are that elegant.
> 1) Could add a retry after REQUEST_TIMEOUT, but that is a long time [300s]
> to wait before retrying.  Easiest...
> 2) Could add the server zmq identity to the pong message and force a
> reconnect when the pong identity changes, but that can get complex with
> multiple servers.
> 3) I considered using a ROUTER as the client so the pings would be dropped
> when a server dies, but that is difficult to setup the first time and
> various posts on this forum (see below) mention client routers coming and
> going as being troublesome.  (And ROUTER to ROUTER looks tricky to get
> correct.)
>
> I considered Pub/Sub and one way heartbeats but neither would change this
> behavior, the pong messages would still flow.
>
> I have the service setup to auto-recover on a crash so it's more than just
> an edge case.
>
> Question:
> ---------
> I'm curious if anyone has solved this quick server restart problem in a
> clean way with socket patterns?  Or if you have other suggestions?
>
> Or if you have example code of ping/pong handling this case I'd love to
> see it.
>
> Thanks!
>  -- Russell
>
> Related threads:
> ----------------
> Disconnects / Retry Logic -
> http://lists.zeromq.org/pipermail/zeromq-dev/2012-January/015024.html
> Using a router with an identity issue -
> http://lists.zeromq.org/pipermail/zeromq-dev/2014-February/025206.html
>
>
> _______________________________________________
> zeromq-dev mailing list
> [email protected]
> http://lists.zeromq.org/mailman/listinfo/zeromq-dev
>
>