I implemented ping pong heartbeats with the UUID idea and it works great. Thanks!
It keeps the server stateless (generating a UUID on construction was the primary change) and very simple. And the client controls the entire ping/pong lifecycle, which is what I really like. (The client is complex, but I'd rather have the controlling logic all in the client.)

Here's a high-level summary of how I did the ping pong heartbeats... I listed the pseudocode as accurately as I could recall, but no warranty implied or otherwise. :) Hopefully this pattern is useful to others...

Setup:
------
client[DEALER] --tcp--> primary server[ROUTER/proxy/DEALER] --inproc--> workers[DEALER]

The ping request contains:
- A message header that denotes it as a ping (no payload)

The pong response contains:
- Server UUID
- Health

The client side keeps the following state information on each server:
- Dead/Alive
- missedPongs
- successfulPongs
- Socket / Url / UUID

Client Ping:
------------
- A ping (async & dontwait) is sent to all servers
- Will poll up to HEARTBEAT_TIMEOUT seconds to receive a pong.
-- Upon pong receipt
--- If the UUID is unknown, set the UUID = received UUID
--- Records the pong as valid IF:
---- The UUID didn't change
---- The pong listed the server as healthy
---- The pong was received before the timeout (implied, see notes)
--- For each valid pong (multiple pongs in one window count as one, see notes):
---- If current status is Alive, reset the missedPongs count to 0
---- If current status is Dead, increment the successfulPongs by 1
- After each HEARTBEAT_TIMEOUT poll completes, update state for all servers
-- If no valid pong was received
--- If current status is Alive, increment the missedPongs count by 1
--- If current status is Dead, reset the successfulPongs count to 0
-- If (Alive && missedPongs == HEARTBEAT_UNHEALTHY_THRESHOLD) or UUID changed
--- Mark the server as Dead
--- Failover to the next best server by rebuilding the socket and resending
--- Reset server state variables properly (missed/successful=0, etc)
-- If Dead && successfulPongs == HEARTBEAT_HEALTHY_THRESHOLD
--- Mark the server as Alive
--- Failover to the next best server by rebuilding the socket and resending (this is the failback case, see notes)
--- Reset server state variables properly (missed/successful=0, etc)
-- If there was a UUID conflict, reset the stored UUID to unknown
- Delay for what is remaining of HEARTBEAT_INTERVAL and then repeat...

Server Pong:
------------
- Generates a UUID on startup. (Also uses this as the Router socket identity)
- Replies to a ping with a pong that includes the UUID & health
-- I decided to include the health in case the server was starting / shutting down

Notes:
The above includes failback also... The pattern is just like missedPongs except you track successfulPongs if the server is dead. And when successfulPongs == HEARTBEAT_HEALTHY_THRESHOLD you bring a server back to life. (You will also failback to the primary if it comes alive. Note that HEARTBEAT_HEALTHY_THRESHOLD should be quite a bit bigger than HEARTBEAT_UNHEALTHY_THRESHOLD.)

Make sure to rebuild the poll list each time also, since a reconnect will foul the old socket.

Also, if both servers are dead I decided to keep trying to send requests to the last known live server. (If both servers die at the same time I will reconnect / resend the first time to handle the quick server death on a single server issue.)

I set the heartbeat high water mark to something low so after a few outstanding pings it wouldn't queue any more. Depending on what you set the HWM to, you will need to properly handle receiving multiple pongs when a server comes to life. (I treated multiple pongs within the same HEARTBEAT_TIMEOUT period as a single pong. This simplified the logic so I didn't have to track if a pong REALLY came back during the window. It can come back in the next window and not be double counted.)

In the actual implementation the client pinger and server ponger are running in separate threads. I did this so the pings wouldn't back up behind real work and real work wouldn't wait on pings. This means a valid response from the server isn't counting as a pong.
I debated this and decided it was OK since the ping/pongs will flow even if the server is working. Normally a valid server reply would count as a pong though.

Hope this helps someone else,
-- Russell

________________________________
From: Stephen Lord <[email protected]>
To: ZeroMQ development list <[email protected]>
Sent: Monday, March 23, 2015 10:50 AM
Subject: Re: [zeromq-dev] Ping Pong Heartbeats & Quick Server Restart Issue

Have the heartbeat reply include a guid which represents the instance of the server; the server picks a guid at startup and always uses it. If the client sees two different guids then it knows the server restarted and can take action. The server side state is minimal; the client needs to track the guids it gets back on a per server basis.

>On Mar 23, 2015, at 9:39 AM, Russell Della Rosa <[email protected]> wrote:
>
>I'm doing this using JeroMq (may use jzmq at some point) so I'm at the mercy of the JVM.
>
>I have a wrapper around the JVM that heartbeats also, and it will kill the JVM if it doesn't reply with a pong. After the wrapper kills the JVM, it will quickly restart the JVM so I'm not sure there is a good point to send this shutdown message. (The wrapper might be able to but I think that might get complex.)
>
>I like this idea though since it keeps the server stateless.
>
>________________________________
>From: Justin Karneges <[email protected]>
>To: [email protected]
>Sent: Friday, March 20, 2015 2:41 PM
>Subject: Re: [zeromq-dev] Ping Pong Heartbeats & Quick Server Restart Issue
>
>> I'm curious if anyone has solved this quick server restart problem in a
>> clean way with socket patterns? Or if you have other suggestions? Or if
>> you have example code of ping/pong handling this case I'd love to see it.
>
>I suggest having the server send some kind of shutdown message.
>This is basically the same as how regular TCP connection loss is indicated,
>except that you have to do it yourself rather than the OS doing it for you.
>
>Of course, the advantage of the OS doing it for you is that you can
>ensure a close packet is sent even if your process crashes. This may be
>a bit harder to do with ZeroMQ, depending on the language.
>_______________________________________________
>zeromq-dev mailing list
>[email protected]
>http://lists.zeromq.org/mailman/listinfo/zeromq-dev
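
For anyone who wants something more concrete than the pseudocode in the "Client Ping" summary earlier in this message, here is a rough Python sketch of just the per-server bookkeeping. It is not the JeroMQ code from this thread: there are no sockets or polling in it, and the names ServerState, Pong and end_of_window, as well as the two threshold values, are made up for illustration. It also uses >= where the summary says ==, and it assumes that a window containing any pong with a mismatched UUID counts as invalid.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical values. The summary only says the healthy threshold
# should be quite a bit bigger than the unhealthy one.
HEARTBEAT_UNHEALTHY_THRESHOLD = 3
HEARTBEAT_HEALTHY_THRESHOLD = 10


@dataclass
class Pong:
    server_uuid: str  # UUID the server generated at startup
    healthy: bool     # False while the server is starting / shutting down


@dataclass
class ServerState:
    url: str
    alive: bool = True
    known_uuid: Optional[str] = None  # None means "unknown"
    missed_pongs: int = 0
    successful_pongs: int = 0

    def end_of_window(self, pongs: List[Pong]) -> Optional[str]:
        """Update state after one HEARTBEAT_TIMEOUT poll window.

        `pongs` holds every pong received in the window; several pongs
        are treated as a single one. Returns "dead" or "alive" when the
        caller should rebuild the socket and resend, otherwise None.
        """
        valid = False
        uuid_changed = False
        for pong in pongs:
            if self.known_uuid is None:
                self.known_uuid = pong.server_uuid
            if pong.server_uuid != self.known_uuid:
                uuid_changed = True  # server instance was restarted
            elif pong.healthy:
                valid = True
        if uuid_changed:
            valid = False  # assumption: a conflicting window never counts

        if self.alive:
            self.missed_pongs = 0 if valid else self.missed_pongs + 1
        else:
            self.successful_pongs = self.successful_pongs + 1 if valid else 0

        transition = None
        if uuid_changed or (
            self.alive and self.missed_pongs >= HEARTBEAT_UNHEALTHY_THRESHOLD
        ):
            self.alive = False
            transition = "dead"
        elif (
            not self.alive
            and self.successful_pongs >= HEARTBEAT_HEALTHY_THRESHOLD
        ):
            self.alive = True
            transition = "alive"

        if transition is not None:
            self.missed_pongs = 0
            self.successful_pongs = 0
        if uuid_changed:
            self.known_uuid = None  # learn the new UUID from the next pong
        return transition
```

A real pinger loop would call end_of_window() once per server after each poll and, whenever it returns "dead" or "alive", rebuild the socket and the poll list before resending.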
