+ dev list for visibility and history. Okay, let's dig into this a little bit : ).
First, it is true that Marathon and Mesos HTTP health checks are not equivalent. It's not just the 1xx status codes; for example, you can't have multiple Mesos health checks.

I don't understand why you say that the operator should know that a failure is an expected response. It is not! Health checks have no concept of "not ready yet"; the grace period serves this purpose. The health check failed because the contract was violated: 111 is considered a failure. If you think that 1xx codes should be treated as success, let's have that discussion separately, probably on the dev list (btw, k8s does the same <https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#define-a-liveness-http-request>).

Second, are you sure about the status code in the second case? The error does not say anything about an empty body; it says empty reply. From what I can see <https://stackoverflow.com/questions/41290792/my-curl-post-gets-empty-reply-from-server>, (52) means a misbehaving server. If you're convinced that your server returned a proper HTTP response with some status code but with an empty body, please file a bug report in the Mesos JIRA.

On Fri, Nov 3, 2017 at 2:20 PM, Alex Rukletsov <a...@mesosphere.com> wrote:

> Tomas, can I reply to you and cc devlist to have our discussion logged
> publicly?
>
>
> On Fri, Nov 3, 2017 at 10:43 AM, Tomas Barton <barton.to...@gmail.com>
> wrote:
>
>> Hi Alex,
>>
>> I'm quite ok with the current contract; treating "codes between 200 and
>> 399 as success" seems reasonable to me. We're using codes < 200 for "not
>> ready yet" and >= 500 for error states.
>>
>> But that's not really the problem. While Marathon's implementation only
>> checked the HTTP code, curl tends to be too smart, meaning that going
>> from Marathon health checks to Mesos-based ones might introduce some
>> incompatibility.
>>
>> For example:
>>
>> (2017-11-02 19:31:25) [INFO ] Request: 127.0.0.1:44172 0x1fcc44f0 HTTP/1.1 GET /health
>> (2017-11-02 19:31:25) [INFO ] Response: 0x1fcc44f0 /health 111 0
>> I1102 19:31:25.548070 23822 checker_process.cpp:959] HTTP health check for task 'reql-dev.3c83761f-c004-11e7-acb9-be622fe0971d' returned: 111
>> W1102 19:31:25.548195 23822 health_checker.cpp:317] HTTP health check for task 'reql-dev.3c83761f-c004-11e7-acb9-be622fe0971d' failed: Unexpected HTTP response code: 111
>>
>> This is sort of ok, the operator should know that "failed: Unexpected
>> HTTP response code: 111" isn't really a failure but an expected response.
>>
>> But in order to get this we had to hack into the HTTP server and
>> introduce some "special" HTTP codes.
>>
>> Another component, whose health checks on Marathon were responding as
>> expected, behaves funny with MESOS_HTTP:
>>
>> W1102 10:50:38.637907 6 health_checker.cpp:307] HTTP health check for task 'xxx' failed: curl exited with status 52: curl: (52) Empty reply from server
>> I1102 10:50:38.637949 6 health_checker.cpp:333] Ignoring failure of HTTP health check for task 'xxx': still in grace period
>>
>> In this case the response code was either 100 or 111. It is hard to tell
>> from the logs, as the return code is not logged. The problem is that the
>> component is written in Java, where a library for creating a simple web
>> server that responds on the /health endpoint uses a pretty standard
>> Jetty server underneath. And Jetty decided that responses with a 1xx
>> code don't have to include a body. On the other side, curl thinks that
>> an HTTP response with 1xx should have a body, thus the error code (52)
>> Empty reply from server. Maybe we should simply respond with HTTP 418
>> I'm a teapot, meaning that the tea is not ready yet :)
>>
>> So, the question is: could curl be configured in a way where it doesn't
>> check for body content? And if a body is present, include it in the logs?
>>
>> Or should I file bug reports against all web servers to include
>> Mesos-compatible HTTP responses? :)
>>
>> Thanks!
>> Tomas
>>
>>
>> On 2 November 2017 at 19:58, Alex Rukletsov <a...@mesosphere.com> wrote:
>>
>>> Hi Tomas!
>>>
>>> I wanted to make health checks as simple as possible. I looked at what
>>> aws, k8s, and nomad do and decided that I will not support
>>> customization of return codes unless someone shows me a very good
>>> reason to do so. Such customization is not easy: once you start,
>>> people will want more and more. Think about the API: enumerate "good"
>>> codes, enumerate "bad" codes, specify a "good" range, specify a "bad"
>>> range, specify a set of "good" ranges, and so on.
>>>
>>> Regarding the empty reply, why should an empty reply be considered ok?
>>> The contract is very explicit: "Default executors treat return codes
>>> between 200 and 399 as success; custom executors may employ a different
>>> strategy, e.g. leveraging the `statuses` field."
>>>
>>> And it actually should affect app scaling, as the task should be
>>> considered unhealthy.
>>>
>>> So, give me a good reason to change my mind ; )
>>>
>>> Alex
>>>
>>> On Thu, Nov 2, 2017 at 4:43 PM, Tomas Barton <barton.to...@gmail.com>
>>> wrote:
>>>
>>>> Hi Alex,
>>>>
>>>> one more question regarding health checks. Marathon health checks have
>>>> an option to ignore 1xx status codes: ignoreHttp1xx.
>>>>
>>>> If I understand correctly, MESOS_HTTP checks have no option to apply
>>>> similar behaviour. What's the motivation?
>>>>
>>>> When using Mesos health checks I see the following in the logs:
>>>>
>>>> I1102 10:50:49.891046 12 health_checker.cpp:333] Ignoring failure of HTTP health check for task 'rcm_worker.5af95051-bfba-11e7-81db-024220a10091': still in grace period
>>>> W1102 10:51:20.690042 10 health_checker.cpp:307] HTTP health check for task 'rcm_worker.5af95051-bfba-11e7-81db-024220a10091' failed: curl exited with status 52: curl: (52) Empty reply from server
>>>> I1102 10:51:20.690389 10 health_checker.cpp:333] Ignoring failure of HTTP health check for task 'rcm_worker.5af95051-bfba-11e7-81db-024220a10091': still in grace period
>>>> W1102 10:51:51.391033 12 health_checker.cpp:307] HTTP health check for task 'rcm_worker.5af95051-bfba-11e7-81db-024220a10091' failed: curl exited with status 52: curl: (52) Empty reply from server
>>>> W1102 10:51:51.391294 12 health_checker.cpp:339] HTTP health check for task 'rcm_worker.5af95051-bfba-11e7-81db-024220a10091' failed 1 times consecutively
>>>>
>>>> It would be much more useful if the Mesos health checker returned the
>>>> corresponding code instead of `curl: (52) Empty reply from server`.
>>>>
>>>> It doesn't affect the app scaling, but it's quite strange to see
>>>> failures that should be tolerated.
>>>>
>>>> Am I missing something?
>>>>
>>>> Regards,
>>>> Tomas
>>>>
>>>
>>>
>>
>
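
For readers of the archive: the contract discussed in this thread ("codes between 200 and 399 as success", with failures ignored during the grace period) can be sketched roughly as below. This is only an illustration in Python, not the Mesos implementation. The names `is_success`, `evaluate`, and `CheckOutcome` are made up for this note, and the grace-period handling is deliberately simplified to a plain time comparison.

```python
from dataclasses import dataclass
from typing import Optional


def is_success(status_code: Optional[int]) -> bool:
    """Apply the quoted contract: 200..399 (inclusive) counts as success.

    `None` models a check that produced no status code at all, e.g. when
    curl exits with an error such as (52) before a response is parsed.
    """
    return status_code is not None and 200 <= status_code <= 399


@dataclass
class CheckOutcome:
    healthy: bool  # did this single check pass?
    ignored: bool  # failed, but not counted because of the grace period


def evaluate(status_code: Optional[int],
             seconds_since_start: float,
             grace_period_seconds: float) -> CheckOutcome:
    """Hypothetical helper: classify one health check result."""
    if is_success(status_code):
        return CheckOutcome(healthy=True, ignored=False)
    # Simplifying assumption: any failure inside the grace window is ignored.
    in_grace = seconds_since_start < grace_period_seconds
    return CheckOutcome(healthy=False, ignored=in_grace)


if __name__ == "__main__":
    print(evaluate(111, seconds_since_start=5, grace_period_seconds=30))
    print(evaluate(None, seconds_since_start=60, grace_period_seconds=30))
    print(evaluate(204, seconds_since_start=60, grace_period_seconds=30))
```

Under this reading, a 1xx code such as 111 is a failure like any other; it is tolerated only while the task is still inside its grace period, which matches the "Ignoring failure ... still in grace period" log lines quoted above.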