Re: Why doesn't announcer delay until task indicates it's ready?

Erb, Stephan Tue, 18 Apr 2017 09:59:59 -0700

We have recently received an RB that aims to use `.healthchecksnooze` for the 
burn-in phase, guarding the state transition to RUNNING.

I am not sure if it is a good idea (e.g., as one can remain stuck in STARTING). 
In any case, it is worth a cross-reference: https://reviews.apache.org/r/58462/

From: David McLaughlin <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, 21. March 2017 at 17:38
To: "[email protected]" <[email protected]>
Subject: Re: Why doesn't announcer delay until task indicates it's ready?

"I'm curious why not? This seems like a fundamental requirement."

This was pretty controversial inside Twitter too. The idea is that the presence 
of any node in a serverset does not mean it's healthy, which is especially true 
long after Aurora has finished scheduling the task - so your RPC or routing 
layer should be able to detect and avoid the node until it recovers. Finagle 
solves a lot of this for Twitter, and tools like linkerd (which came out of 
some members of the Twitter traffic team) aim to solve it in a more generic way 
in OSS - https://linkerd.io/.

There are of course services for which the initial batch of failures needed to 
let the proxy know it's a bad node is unacceptable, or which don't have the 
necessary load balancing intelligence in place. So those services tend to 
manually register to serversets (and avoid AURORA-321), as you've resorted to.

On Tue, Mar 21, 2017 at 8:53 AM, Bill Farner 
<[email protected]> wrote:
Announcement is done immediately to announce presence of an instance for other 
services to determine what to do from there. A use case we considered was 
allowing monitoring of a service via HTTP before the service is ready for 
traffic. This is useful, for example, if the application has a long burn-in 
setup phase.

In your case, the expectation is that the load balancer (or other upstream 
service) handles and routes away from unavailable backends; whether it's 
because they are not yet ready or otherwise. This could be using independent 
health checks or retries, depending on what is available.
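
As a rough illustration of "independent health checks" on the routing side, 
an upstream could probe each serverset member before sending it traffic. This 
is only a minimal sketch: `tcp_probe` and `pick_backend` are hypothetical 
names, the "host:port" string format is an assumption, and a real load 
balancer would cache probe results rather than probing on every request.

```python
import socket
from typing import Callable, Iterable, Optional


def tcp_probe(backend: str, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to "host:port" succeeds within timeout."""
    host, _, port = backend.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # Connection refused, timeout, DNS failure, or a malformed address.
        return False


def pick_backend(
    backends: Iterable[str],
    probe: Callable[[str], bool] = tcp_probe,
) -> Optional[str]:
    """Return the first backend that passes the probe, or None if none do."""
    for backend in backends:
        if probe(backend):
            return backend
    return None
```

With this shape, an instance that is announced but not yet listening is 
simply skipped until its probe starts succeeding.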

On Mar 21, 2017, 8:28 AM -0700, Richard Klancer 
<[email protected]> wrote:

Hi all,

I'm preparing to launch a public-facing Aurora based HTTP service. As
part of this exercise my team recently attempted to `aurora update`
the service while it was serving high request volume from an external
load generator.

We were surprised to find that our ops team was paged due to bursts of
502's from our frontend server, which routes external traffic to our
service using the serverset published by the Aurora announcer. Upon
investigation, we discovered that the serverset is announced as soon
as the thermos executor runs, even though the app is not ready to
serve requests right away. The 502s, of course, were due to the chosen
server not yet being able to respond to a connection request.

Last night I searched JIRA, the user and dev mailing lists, and the
thermos code, and I didn't see any conversations about delaying
announcement until the configured health check passes (thus indicating
that the server is ready to accept connections).

I'm curious why not? This seems like a fundamental requirement.

A couple of notes. First, our frontend server doesn't support explicit
health checking yet, though this will be implemented soon. Perhaps it
is considered the proper task of load balancers and frontend servers
to validate the health of servers in the serverset before routing
traffic to them?

Also, to work around this problem, we announced the serverset from the
app itself. This means we no longer have an 'announce' section in our
config, and thus no portmap. But HTTP health checking is silently (in
0.12, though not 0.17) disabled if there is no thermos port named
'health'. We had our "admin" and "health" ports aliased, but with no
portmap I had to just rename "admin" to "health" everywhere in our job
definition. It works but it's a little silly. This was previously
noted in https://issues.apache.org/jira/browse/AURORA-321
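
For reference, the shape of the workaround (announce from the app only once
it is ready) is roughly the following. This is a sketch under assumptions,
not our actual code: `announce_when_ready`, `is_ready` and `announce` are
hypothetical names, and the real `announce` callback would be whatever
registers the serverset member with your ZooKeeper client.

```python
import time
from typing import Callable


def announce_when_ready(
    is_ready: Callable[[], bool],
    announce: Callable[[], None],
    interval_secs: float = 1.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready(); call announce() once it passes.

    Returns True if the announcement was made, or False if the app never
    became ready within max_attempts polls.
    """
    for attempt in range(max_attempts):
        if is_ready():
            announce()
            return True
        # Do not sleep after the final failed attempt.
        if attempt < max_attempts - 1:
            sleep(interval_secs)
    return False
```

The `sleep` parameter is injectable only so the loop can be exercised
without real delays.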

Thanks in advance for any comments,

--Richard
