Hi all,

I'm preparing to launch a public-facing, Aurora-based HTTP service. As
part of this exercise my team recently ran `aurora update` against
the service while it was serving high request volume from an external
load generator.

We were surprised to find that our ops team was paged due to bursts of
502s from our frontend server, which routes external traffic to our
service using the serverset published by the Aurora announcer. Upon
investigation, we discovered that the serverset is announced as soon
as the thermos executor runs, even though the app is not ready to
serve requests right away. The 502s, of course, were due to the chosen
server not yet being able to respond to a connection request.

Last night I searched JIRA, the user and dev mailing lists, and the
thermos code, and I didn't see any conversations about delaying
announcement until the configured health check passes (thus indicating
that the server is ready to accept connections).

I'm curious why this hasn't come up. It seems like a fundamental
requirement.
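
To make the idea concrete, here is a rough Python sketch of the
behavior I expected. It is not taken from the thermos code, and the
names `announce_when_healthy`, `is_healthy`, and `announce` are
hypothetical: `is_healthy` stands in for whatever health check the job
configures, and `announce` for whatever registers the instance in the
serverset. The thresholds are arbitrary defaults.

```python
import time
from typing import Callable


def announce_when_healthy(
    is_healthy: Callable[[], bool],
    announce: Callable[[], None],
    required_successes: int = 3,
    interval_secs: float = 1.0,
    timeout_secs: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call announce() once is_healthy() passes required_successes times in a row.

    Returns True if the announcement was made, or False if timeout_secs
    elapsed before the check passed often enough. All names here are
    hypothetical and do not correspond to any Aurora or thermos API.
    """
    deadline = clock() + timeout_secs
    consecutive = 0
    while clock() < deadline:
        try:
            ok = is_healthy()
        except Exception:
            # A check that raises (e.g. connection refused) counts as a failure.
            ok = False
        # Any failure resets the streak, so a flapping server is not announced.
        consecutive = consecutive + 1 if ok else 0
        if consecutive >= required_successes:
            announce()
            return True
        sleep(interval_secs)
    return False
```

With something like this in place, the instance would only appear in
the serverset after the app has answered its health check a few times,
so the frontend would never be handed a server that cannot yet accept
connections.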

A couple of notes. First, our frontend server doesn't support explicit
health checking yet, though this will be implemented soon. Perhaps it
is considered the proper task of load balancers and frontend servers
to validate the health of servers in the serverset before routing
traffic to them?

Also, to work around this problem, we announced the serverset from the
app itself. This means we no longer have an 'announce' section in our
config, and thus no portmap. But HTTP health checking is disabled if
there is no thermos port named 'health' (silently in 0.12, though not
in 0.17). We had our "admin" and "health" ports aliased, but with no
portmap I had to rename "admin" to "health" everywhere in our job
definition. It works, but it's a little silly. This was previously
noted in https://issues.apache.org/jira/browse/AURORA-321

Thanks in advance for any comments,

--Richard
