We have recently received an RB that aims to use `.healthchecksnooze` for the burn-in phase, guarding the state transition to RUNNING.
I am not sure it is a good idea (e.g., a task could remain stuck in STARTING). In any case, it is worth a cross-reference: https://reviews.apache.org/r/58462/

From: David McLaughlin <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, 21. March 2017 at 17:38
To: "[email protected]" <[email protected]>
Subject: Re: Why doesn't announcer delay until task indicates it's ready?

"I'm curious why not? This seems like a fundamental requirement."

This was pretty controversial inside Twitter too. The idea is that the presence of any node in a serverset does not mean it's healthy, which is especially true long after Aurora has finished scheduling the task, so your RPC or routing layer should be able to detect and avoid the node until it recovers. Finagle solves a lot of this for Twitter, and tools like linkerd (which came out of some members of the Twitter traffic team) aim to solve this in a more generic way in OSS: https://linkerd.io/

There are of course services for which the initial batch of failures needed to let the proxy know it's a bad node is unacceptable, or which don't have the necessary load-balancing intelligence in place. So those services tend to manually register to serversets (and avoid AURORA-321), as you've resorted to.

On Tue, Mar 21, 2017 at 8:53 AM, Bill Farner <[email protected]<mailto:[email protected]>> wrote:

Announcement is done immediately to announce the presence of an instance, so that other services can determine what to do from there. A use case we considered was allowing monitoring of a service via HTTP before the service is ready for traffic. This is useful, for example, if the application has a long burn-in setup phase.

In your case, the expectation is that the load balancer (or other upstream service) handles and routes away from unavailable backends, whether it's because they are not yet ready or otherwise. This could be done using independent health checks or retries, depending on what is available.
On Mar 21, 2017, 8:28 AM -0700, Richard Klancer <[email protected]<mailto:[email protected]>>, wrote:

Hi all,

I'm preparing to launch a public-facing Aurora-based HTTP service. As part of this exercise my team recently attempted to `aurora update` the service while it was serving high request volume from an external load generator. We were surprised to find that our ops team was paged due to bursts of 502s from our frontend server, which routes external traffic to our service using the serverset published by the Aurora announcer.

Upon investigation, we discovered that the serverset is announced as soon as the thermos executor runs, even though the app is not ready to serve requests right away. The 502s, of course, were due to the chosen server not yet being able to respond to a connection request.

Last night I searched JIRA, the user and dev mailing lists, and the thermos code, and I didn't see any conversations about delaying announcement until the configured health check passes (thus indicating that the server is ready to accept connections).

I'm curious why not? This seems like a fundamental requirement.

A couple of notes. First, our frontend server doesn't support explicit health checking yet, though this will be implemented soon. Perhaps it is considered the proper task of load balancers and frontend servers to validate the health of servers in the serverset before routing traffic to them?

Also, to work around this problem, we announced the serverset from the app itself. This means we no longer have an 'announce' section in our config, and thus no portmap. But HTTP health checking is silently (in 0.12, though not 0.17) disabled if there is no thermos port named 'health'. We had our "admin" and "health" ports aliased, but with no portmap I had to just rename "admin" to "health" everywhere in our job definition. It works but it's a little silly. This was previously noted in https://issues.apache.org/jira/browse/AURORA-321

Thanks in advance for any comments,
--Richard
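
For readers following the thread, the two approaches discussed above (an app that delays its own announcement until it is ready, and an upstream that only routes to members passing an independent health check) can be sketched roughly as below. This is only an illustration: `wait_until_ready`, `routable`, and the probe callables are hypothetical names, not part of Aurora, Thermos, or any serverset library, and the actual registration step is deliberately left out.

```python
import time
from typing import Callable, Iterable, List, Tuple

Endpoint = Tuple[str, int]  # (host, port); a hypothetical serverset member


def wait_until_ready(check: Callable[[], bool],
                     timeout_s: float = 60.0,
                     interval_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `check` until it returns True or `timeout_s` elapses.

    An app that announces itself could call this before registering,
    so it only joins the serverset once its own health check passes.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def routable(members: Iterable[Endpoint],
             probe: Callable[[Endpoint], bool]) -> List[Endpoint]:
    """Keep only the members whose probe currently succeeds.

    This is the upstream side: presence in the serverset is treated
    as a hint, and the router checks health independently.
    """
    return [m for m in members if probe(m)]
```

In the self-announcing setup, the app would call `wait_until_ready(...)` with its own health check and register only if it returns True; a frontend without built-in health checking could apply `routable(...)` to the serverset before picking a backend.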
