Hi all, I'm preparing to launch a public-facing Aurora-based HTTP service. As part of this exercise my team recently attempted to `aurora update` the service while it was serving high request volume from an external load generator.
We were surprised to find that our ops team was paged due to bursts of 502s from our frontend server, which routes external traffic to our service using the serverset published by the Aurora announcer. Upon investigation, we discovered that the serverset is announced as soon as the thermos executor runs, even though the app is not ready to serve requests right away. The 502s, of course, were due to the chosen server not yet being able to respond to a connection request.

Last night I searched JIRA, the user and dev mailing lists, and the thermos code, and I didn't see any conversations about delaying announcement until the configured health check passes (thus indicating that the server is ready to accept connections). I'm curious: why not? This seems like a fundamental requirement.

A couple of notes.

First, our frontend server doesn't support explicit health checking yet, though this will be implemented soon. Perhaps it is considered the proper task of load balancers and frontend servers to validate the health of servers in the serverset before routing traffic to them?

Also, to work around this problem, we announced the serverset from the app itself. This means we no longer have an 'announce' section in our config, and thus no portmap. But HTTP health checking is silently (in 0.12, though not 0.17) disabled if there is no thermos port named 'health'. We had our "admin" and "health" ports aliased, but with no portmap I had to just rename "admin" to "health" everywhere in our job definition. It works, but it's a little silly. This was previously noted in https://issues.apache.org/jira/browse/AURORA-321

Thanks in advance for any comments,
--Richard
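
P.S. To make the request concrete, here is a minimal sketch of the behavior I have in mind: hold off on announcing until a health probe first succeeds. This is not Aurora or thermos code and not our actual workaround; `http_ok`, `announce_when_healthy`, the `probe` and `announce` callbacks, and the URL in the usage comment are all hypothetical names chosen for illustration.

```python
import http.client
import time
import urllib.request


def http_ok(url, timeout=1.0):
    """Return True if a GET on url answers 200.

    Connection failures and non-2xx responses count as "not ready yet".
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses.
        return False


def announce_when_healthy(probe, announce, interval=1.0, max_attempts=30,
                          sleep=time.sleep):
    """Call announce() once, and only after probe() first returns True.

    probe and announce are caller-supplied callables (hypothetical hooks).
    Returns True if announce() was called, False if the probe never passed.
    """
    for attempt in range(max_attempts):
        if probe():
            announce()
            return True
        if attempt < max_attempts - 1:
            sleep(interval)
    return False


# Hypothetical usage; register_in_serverset stands in for whatever
# actually publishes the instance:
# announce_when_healthy(
#     lambda: http_ok("http://localhost:8080/health"),
#     register_in_serverset,
# )
```

The probe is passed in as a callable so the same gate works with any readiness check, not just an HTTP one.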
