It's similar to the model we use in projects like Malamute. You don't need a UUID, however. The client sends PING, and the server replies PING-OK if it recognizes the client; otherwise it replies with an "unexpected command" error. The client handles that by restarting its protocol handshake.

The server can time out idle clients, and clients can detect dead servers. There are a few corner cases, e.g. don't send more than 3-4 PINGs before getting a response.
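A minimal sketch of that client side, assuming JeroMQ. "PING"/"PING-OK" follow the description above; the "HELLO" handshake command, the endpoint, and the interval values are illustrative, not part of any real protocol:

    import org.zeromq.SocketType;
    import org.zeromq.ZContext;
    import org.zeromq.ZMQ;

    public class PingClient {
        static final long HEARTBEAT_INTERVAL_MS = 1000; // illustrative pacing

        public static void main(String[] args) throws InterruptedException {
            try (ZContext ctx = new ZContext()) {
                ZMQ.Socket dealer = ctx.createSocket(SocketType.DEALER);
                dealer.connect("tcp://localhost:5555");  // illustrative endpoint
                dealer.send("HELLO", ZMQ.DONTWAIT);      // initial protocol handshake

                ZMQ.Poller poller = ctx.createPoller(1);
                poller.register(dealer, ZMQ.Poller.POLLIN);

                int outstanding = 0;
                while (!Thread.currentThread().isInterrupted()) {
                    if (outstanding < 3) {               // don't flood a silent server
                        dealer.send("PING", ZMQ.DONTWAIT);
                        outstanding++;
                    }
                    long start = System.currentTimeMillis();
                    poller.poll(HEARTBEAT_INTERVAL_MS);
                    if (poller.pollin(0)) {
                        outstanding = 0;
                        if (!"PING-OK".equals(dealer.recvStr())) {
                            // "unexpected command": the server restarted and
                            // forgot us, so redo the handshake
                            dealer.send("HELLO", ZMQ.DONTWAIT);
                        }
                    }
                    // Sleep out the rest of the interval before the next PING.
                    long left = HEARTBEAT_INTERVAL_MS - (System.currentTimeMillis() - start);
                    if (left > 0) Thread.sleep(left);
                }
            }
        }
    }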
-Pieter

On Wed, Mar 25, 2015 at 6:38 PM, Russell Della Rosa <[email protected]> wrote:
> I implemented ping pong heartbeats with the UUID idea and it works great.
> Thanks!
>
> It keeps the server stateless (just a UUID on construction was the primary
> change) and very simple.
>
> And the client controls the entire ping/pong lifecycle, which is what I
> really like. (The client is complex, but I'd rather have the controlling
> logic all in the client.)
>
> Here's a high-level summary of how I did the ping pong heartbeats... I
> listed the pseudocode as accurately as I could recall, but no warranty
> implied or otherwise. :)
>
> Hopefully this pattern is useful to others...
>
> Setup:
> ------
> client[DEALER] --tcp--> primary server[ROUTER/proxy/DEALER] --inproc-->
> workers[DEALER]
>
> The ping request contains:
> - A message header that denotes it as a ping (no payload)
>
> The pong response contains:
> - Server UUID
> - Health
>
> The client side keeps the following state on each server:
> - Dead/Alive
> - missedPongs
> - successfulPongs
> - Socket / URL / UUID
>
> Client Ping:
> ------------
> - A ping (async & dontwait) is sent to all servers.
>
> - Poll up to HEARTBEAT_TIMEOUT seconds to receive a pong.
> -- Upon pong receipt:
> --- If the UUID is unknown, set the UUID = the received UUID.
> --- Record the pong as valid IF:
> ---- The UUID didn't change
> ---- The pong listed the server as healthy
> ---- The pong was received before the timeout (implied; see notes)
> --- For each valid pong:
> ---- If the current status is Alive, reset the missedPongs count to 0.
> ---- If the current status is Dead, increment successfulPongs by 1.
>
> - After each HEARTBEAT_TIMEOUT poll completes, update state for all servers:
> -- If no valid pong was received:
> --- If the current status is Alive, increment the missedPongs count by 1.
> --- If the current status is Dead, reset the successfulPongs count to 0.
> -- If (Alive && missedPongs == HEARTBEAT_UNHEALTHY_THRESHOLD) or the UUID
> changed:
> --- Mark the server as Dead.
> --- Fail over to the next best server by rebuilding the socket and
> resending.
> --- Reset the server's state variables (missed/successful = 0, etc.).
> -- If Dead && successfulPongs == HEARTBEAT_HEALTHY_THRESHOLD:
> --- Mark the server as Alive.
> --- Fail back to it by rebuilding the socket and resending.
> --- Reset the server's state variables (missed/successful = 0, etc.).
> -- If there was a UUID conflict, reset the stored UUID to unknown.
>
> - Delay for what remains of HEARTBEAT_INTERVAL and then repeat...
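As a rough sketch of that per-server bookkeeping (Java, since the implementation uses JeroMQ; the threshold values and names are illustrative, and the socket plumbing is omitted):

    import java.util.UUID;

    public class ServerHealth {
        static final int HEARTBEAT_UNHEALTHY_THRESHOLD = 3;
        static final int HEARTBEAT_HEALTHY_THRESHOLD = 10; // much larger, to avoid flapping

        boolean alive = true;
        int missedPongs = 0;
        int successfulPongs = 0;
        UUID serverUuid = null;            // unknown until the first pong

        // Called once per HEARTBEAT_TIMEOUT window. Returns true when the
        // caller should rebuild the socket and resend (failover or failback).
        boolean endOfWindow(boolean validPong, boolean uuidConflict) {
            if (uuidConflict) {
                serverUuid = null;         // conflict: forget the stored UUID
            }
            if (validPong) {
                if (alive) missedPongs = 0; else successfulPongs++;
            } else {
                if (alive) missedPongs++; else successfulPongs = 0;
            }
            if (alive && (missedPongs == HEARTBEAT_UNHEALTHY_THRESHOLD || uuidConflict)) {
                alive = false;
                return reset();            // fail over to the next best server
            }
            if (!alive && successfulPongs == HEARTBEAT_HEALTHY_THRESHOLD) {
                alive = true;
                return reset();            // fail back to this server
            }
            return false;
        }

        private boolean reset() {
            missedPongs = 0;
            successfulPongs = 0;
            return true;
        }
    }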
> Server Pong:
> ------------
> - Generates a UUID on startup (also used as the ROUTER socket identity).
> - Replies to a ping with a pong that includes the UUID & health.
> -- I decided to include the health in case the server was starting up or
> shutting down.
>
> Notes:
>
> The above includes failback also... The pattern is just like missedPongs,
> except you track successfulPongs while the server is dead, and when
> successfulPongs == HEARTBEAT_HEALTHY_THRESHOLD you bring the server back
> to life. (You will also fail back to the primary if it comes alive. Note
> that HEARTBEAT_HEALTHY_THRESHOLD should be quite a bit bigger than
> HEARTBEAT_UNHEALTHY_THRESHOLD.)
>
> Make sure to rebuild the poll list each time also, since a reconnect will
> foul the old socket.
>
> Also, if both servers are dead I decided to keep trying to send requests
> to the last known live server. (If both servers die at the same time I
> will reconnect / resend the first time, to handle a quick server death on
> a single server.)
>
> I set the heartbeat high-water mark to something low so that after a few
> outstanding pings it wouldn't queue any more. Depending on what you set
> the HWM to, you will need to properly handle receiving multiple pongs when
> a server comes to life. (I treated multiple pongs within the same
> HEARTBEAT_TIMEOUT period as a single pong. This simplified the logic so I
> didn't have to track whether a pong REALLY came back during the window; it
> can come back in the next window and not be double counted.)
>
> In the actual implementation the client pinger and server ponger run in
> separate threads. I did this so the pings wouldn't back up behind real
> work, and real work wouldn't wait on pings. This means a valid response
> from the server doesn't count as a pong. I debated this and decided it was
> OK, since the ping/pongs will flow even if the server is working. Normally
> a valid server reply would count as a pong, though.
>
> Hope this helps someone else,
>
> -- Russell
>
> ________________________________
> From: Stephen Lord <[email protected]>
> To: ZeroMQ development list <[email protected]>
> Sent: Monday, March 23, 2015 10:50 AM
> Subject: Re: [zeromq-dev] Ping Pong Heartbeats & Quick Server Restart Issue
>
> Have the heartbeat reply include a GUID which represents the instance of
> the server; the server picks a GUID at startup and always uses it. If the
> client sees two different GUIDs then it knows the server restarted and can
> take action. The server-side state is minimal; the client needs to track
> the GUIDs it gets back on a per-server basis.
>
>> On Mar 23, 2015, at 9:39 AM, Russell Della Rosa <[email protected]> wrote:
>>
>> I'm doing this using JeroMQ (may use jzmq at some point) so I'm at the
>> mercy of the JVM.
>>
>> I have a wrapper around the JVM that heartbeats also, and it will kill
>> the JVM if it doesn't reply with a pong. After the wrapper kills the JVM,
>> it will quickly restart it, so I'm not sure there is a good point to send
>> this shutdown message. (The wrapper might be able to, but I think that
>> might get complex.)
>>
>> I like this idea though, since it keeps the server stateless.
>>
>> ________________________________
>> From: Justin Karneges <[email protected]>
>> To: [email protected]
>> Sent: Friday, March 20, 2015 2:41 PM
>> Subject: Re: [zeromq-dev] Ping Pong Heartbeats & Quick Server Restart Issue
>>
>>> I'm curious if anyone has solved this quick server restart problem in a
>>> clean way with socket patterns? Or if you have other suggestions? Or if
>>> you have example code of ping/pong handling this case I'd love to see it.
>>
>> I suggest having the server send some kind of shutdown message. This is
>> basically the same as how regular TCP connection loss is indicated,
>> except that you have to do it yourself rather than the OS doing it for
>> you.
>>
>> Of course, the advantage of the OS doing it for you is that you can
>> ensure a close packet is sent even if your process crashes. This may be
>> a bit harder to do with ZeroMQ, depending on the language.
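Combining the quoted Server Pong description with Stephen's per-instance GUID, a minimal server-side sketch, again assuming JeroMQ and collapsing the ROUTER/proxy/worker hop into a single socket (the frame layout, "PONG" header, and endpoint are illustrative):

    import java.util.UUID;
    import org.zeromq.SocketType;
    import org.zeromq.ZContext;
    import org.zeromq.ZMQ;

    public class PongServer {
        public static void main(String[] args) {
            // One UUID per process lifetime: a restart produces a new one,
            // which is how clients detect the quick-restart case.
            String uuid = UUID.randomUUID().toString();

            try (ZContext ctx = new ZContext()) {
                ZMQ.Socket router = ctx.createSocket(SocketType.ROUTER);
                router.setIdentity(uuid.getBytes(ZMQ.CHARSET));
                router.bind("tcp://*:5555");             // illustrative endpoint

                while (!Thread.currentThread().isInterrupted()) {
                    byte[] clientId = router.recv();     // ROUTER prepends the sender identity
                    String command = router.recvStr();
                    if ("PING".equals(command)) {
                        boolean healthy = true;          // e.g. false while starting or shutting down
                        router.sendMore(clientId);
                        router.sendMore("PONG");
                        router.sendMore(uuid);
                        router.send(healthy ? "HEALTHY" : "UNHEALTHY");
                    }
                }
            }
        }
    }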
