"I'm curious why not? This seems like a fundamental requirement."

This was pretty controversial inside Twitter too. The idea is that the
presence of any node in a serverset does not mean it's healthy, which is
especially true long after Aurora has finished scheduling the task - so
your RPC or routing layer should be able to detect and avoid the node until
it recovers. Finagle solves a lot of this for Twitter, and tools like
linkerd (which came out of some members of the Twitter traffic team) aim to
solve it in a more generic way in OSS - https://linkerd.io/.
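
To make "detect and avoid the node until it recovers" concrete, here is a
minimal sketch of that idea. It is not how Finagle or linkerd implement it;
the class name, the cooldown policy, and the "host:port" member strings are
all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List, Optional


class FailureAwareBalancer:
    """Round-robin over serverset members, skipping recently failed ones.

    Hypothetical helper: a member that failed is avoided for
    `cooldown_secs`, then becomes eligible for traffic again.
    """

    def __init__(self, members: List[str], cooldown_secs: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._members = list(members)
        self._cooldown = cooldown_secs
        self._clock = clock
        self._failed_at: Dict[str, float] = {}
        self._next = 0

    def mark_failed(self, member: str) -> None:
        # Called by the routing layer after e.g. a refused connection.
        self._failed_at[member] = self._clock()

    def mark_ok(self, member: str) -> None:
        # Called after a successful request; clears any failure record.
        self._failed_at.pop(member, None)

    def _available(self, member: str) -> bool:
        failed = self._failed_at.get(member)
        return failed is None or self._clock() - failed >= self._cooldown

    def pick(self) -> Optional[str]:
        # Try each member at most once; None means nothing is usable.
        for _ in range(len(self._members)):
            member = self._members[self._next % len(self._members)]
            self._next += 1
            if self._available(member):
                return member
        return None
```

A caller would retry a failed request against the next `pick()` after
calling `mark_failed()`, so a freshly announced but not yet ready instance
costs one failed attempt rather than a burst of errors.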

There are of course services for which the initial batch of failures needed
to tell the proxy that a node is bad is unacceptable, or which don't have
the necessary load balancing intelligence in place. Those services tend to
register with serversets manually (and avoid AURORA-321), as you've
resorted to.
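
For services that register themselves, one way to avoid announcing too
early is to gate registration on a passing health probe. The sketch below
assumes that approach; `announce` is a hypothetical callback standing in for
whatever ZooKeeper/serverset registration the app already performs, and the
function names are illustrative rather than part of Aurora or thermos.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_secs: float = 1.0) -> bool:
    """Return True if a GET to `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_secs) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, non-2xx response, or a malformed reply.
        return False


def announce_when_healthy(probe: Callable[[], bool],
                          announce: Callable[[], None],
                          attempts: int = 30,
                          interval_secs: float = 1.0,
                          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `announce` once `probe` succeeds; give up after `attempts` tries.

    Returns True if the service was announced, False otherwise.
    """
    for attempt in range(attempts):
        if probe():
            announce()
            return True
        if attempt < attempts - 1:
            sleep(interval_secs)
    return False
```

Here `probe` could be something like
`lambda: http_ok("http://localhost:8080/health")`, where the port and path
are assumptions about the app, not values taken from this thread.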

On Tue, Mar 21, 2017 at 8:53 AM, Bill Farner <[email protected]> wrote:

> Announcement is done immediately to announce presence of an instance for
> other services to determine what to do from there. A use case we considered
> was allowing monitoring of a service via HTTP before the service is ready
> for traffic. This is useful, for example, if the application has a long
> burn-in setup phase.
>
> In your case, the expectation is that the load balancer (or other upstream
> service) handles and routes away from unavailable backends; whether it's
> because they are not yet ready or otherwise. This could be using
> independent health checks or retries, depending on what is available.
>
>
> On Mar 21, 2017, 8:28 AM -0700, Richard Klancer <[email protected]>, wrote:
>
> Hi all,
>
> I'm preparing to launch a public-facing Aurora based HTTP service. As
> part of this exercise my team recently attempted to `aurora update`
> the service while it was serving high request volume from an external
> load generator.
>
> We were surprised to find that our ops team was paged due to bursts of
> 502's from our frontend server, which routes external traffic to our
> service using the serverset published by the Aurora announcer. Upon
> investigation, we discovered that the serverset is announced as soon
> as the thermos executor runs, even though the app is not ready to
> serve requests right away. The 502s, of course, were due to the chosen
> server not yet being able to respond to a connection request.
>
> Last night I searched JIRA, the user and dev mailing lists, and the
> thermos code, and I didn't see any conversations about delaying
> announcement until the configured health check passes (thus indicating
> that the server is ready to accept connections).
>
> I'm curious why not? This seems like a fundamental requirement.
>
> A couple notes. First, our frontend server doesn't support explicit
> health checking, yet, though this will be implemented soon. Perhaps it
> is considered the proper task of load balancers and frontend servers
> to validate the health of servers in the serverset before routing
> traffic to them?
>
> Also, to work around this problem, we announced the serverset from the
> app itself. This means we no longer have an 'announce' section in our
> config, and thus no portmap. But http health checking is silently (in
> 0.12, though not 0.17) disabled if there is no thermos port named
> 'health'. We had our "admin" and "health" ports aliased, but with no
> portmap I had to just rename "admin" to "health" everywhere in our job
> definition. It works but it's a little silly. This was previously
> noted in https://issues.apache.org/jira/browse/AURORA-321
>
> Thanks in advance for any comments,
>
> --Richard
>
>
