Re: Tasks with failed health-checks intermittently not restarted

Jay Taylor Sun, 11 Oct 2015 12:08:26 -0700

Interesting, I just got the new cluster up.  I'll re-run both these scenarios 
and let you know what I find.
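
For reference while re-running, here is a minimal sketch of the behaviour I'd expect from the health-check parameters quoted below. It is a simplified model, not the Mesos implementation: the function name `expected_kill_time` is made up for illustration, and it assumes each check finishes instantly, that failures inside the grace period are ignored, and that a success resets the failure count.

```python
from typing import Optional, Sequence


def expected_kill_time(
    exit_codes: Sequence[int],
    delay: float,
    grace_period: float,
    interval: float,
    consecutive_failures: int,
) -> Optional[float]:
    """Return the time (seconds after the health checker starts) at which
    the task should be killed, or None if the threshold is never reached.

    Assumption: check number i runs at delay + i * interval and takes no time.
    """
    failures = 0
    for i, code in enumerate(exit_codes):
        t = delay + i * interval
        if code == 0:
            # Assumed: a healthy result resets the consecutive-failure count.
            failures = 0
            continue
        if t <= grace_period:
            # Assumed: failures during the grace period are ignored.
            continue
        failures += 1
        if failures >= consecutive_failures:
            return t
    return None


if __name__ == "__main__":
    # Parameters from my original report, with a check that always fails.
    print(expected_kill_time([1] * 4, delay=5.0, grace_period=10.0,
                             interval=10.0, consecutive_failures=3))  # 35.0
```

Under these assumptions the first failure (at 5s) is ignored and the kill should come with the third counted failure at about 35s, which roughly lines up with the 19:30:38 "Health check failed" line in the STDERR below.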


What you're saying makes a lot of sense: many different versions of Mesos were 
installed on the old cluster!  Hopefully that is the cause of the strangeness.

Thanks again for all your help Haosdent!!!



> On Oct 11, 2015, at 11:17 AM, haosdent <[email protected]> wrote:
> 
> I guess your environment has some problems, like the strange behaviour you 
> saw with launcher_dir. I remember hitting strange things on my laptop too; 
> it was because I had run `make install` for an older version of Mesos while 
> launching a newer version. If you `make uninstall`, do a fresh build 
> following http://mesos.apache.org/documentation/latest/getting-started/ and 
> still hit these problems, we can be more sure it is a bug. But so far I have 
> not been able to reproduce your problems after repeated tests, so I still 
> think they may be caused by a build environment that is not clean or is 
> misconfigured.
> 
>> On Sun, Oct 11, 2015 at 11:02 PM, Jay Taylor <[email protected]> wrote:
>> I was testing on a set of VMs on a LAN with ideal, very low latency.
>> 
>> 
>> 
>>> On Oct 11, 2015, at 3:30 AM, haosdent <[email protected]> wrote:
>>> 
>>> My stdout looks like:
>>> 
>>> ```
>>> Launching health check process: 
>>> /home/ld-sgdev/huangh/mesos/build/src/.libs/mesos-health-check 
>>> --executor=(1)@xxxx 
>>> --health_check_json={"command":{"shell":true,"value":"docker exec 
>>> mesos-90f9f9ac-e2e4-4d7d-8d07-50e37f727b91-S0.d8c73ee2-deca-46b6-ab51-d7815bded2d4
>>>  sh -c \" exit 1 
>>> \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":10.0,"interval_seconds":10.0,"timeout_seconds":20.0}
>>>  --task_id=test-health-check.901b938b-7002-11e5-a62b-0a0027000000
>>> Health check process launched at pid: 24720
>>> Received task health update, healthy: false
>>> Received task health update, healthy: false
>>> Received task health update, healthy: false
>>> Killing docker task
>>> Shutting down
>>> ```
>>> 
>>> Does your network latency affect this result?
>>> 
>>> 
>>>> On Sun, Oct 11, 2015 at 6:18 PM, haosdent <[email protected]> wrote:
>>>> I could not reproduce your problem on my side, but I guess it may be 
>>>> related to this ticket: MESOS-1613 HealthCheckTest.ConsecutiveFailures is flaky
>>>> 
>>>>> On Fri, Oct 9, 2015 at 12:13 PM, haosdent <[email protected]> wrote:
>>>>> I think it may be because the health check exits before the executor 
>>>>> receives the TaskHealthStatus. I will try "exit 1" and give you feedback later.
>>>>> 
>>>>>> On Fri, Oct 9, 2015 at 11:30 AM, Jay Taylor <[email protected]> wrote:
>>>>>> Following up on this:
>>>>>> 
>>>>>> This problem is reproducible when the command is "exit 1".
>>>>>> 
>>>>>> Once I set it to a real curl cmd the intermittent failures stopped and 
>>>>>> health checks worked as advertised.
>>>>>> 
>>>>>> 
>>>>>>> On Oct 8, 2015, at 12:45 PM, Jay Taylor <[email protected]> wrote:
>>>>>>> 
>>>>>>> Using the following health-check parameters:
>>>>>>> 
>>>>>>> cmd="exit 1"
>>>>>>> delay=5.0
>>>>>>> grace-period=10.0
>>>>>>> interval=10.0
>>>>>>> timeout=10.0
>>>>>>> consecutiveFailures=3
>>>>>>> 
>>>>>>> Sometimes the tasks are successfully identified as failing and 
>>>>>>> restarted, however other times the health-check command exits yet the 
>>>>>>> task is left in a running state and the failure is ignored.
>>>>>>> 
>>>>>>> Sample of failed Mesos task log:
>>>>>>> 
>>>>>>> STDOUT:
>>>>>>> 
>>>>>>>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>>>>>>>  --docker="docker" --docker_socket="/var/run/docker.sock" 
>>>>>>>> --help="false" --initialize_driver_logging="true" --logbufsecs="0" 
>>>>>>>> --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" 
>>>>>>>> --quiet="false" 
>>>>>>>> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>>>>>>>  --stop_timeout="0ns"
>>>>>>>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>>>>>>>  --docker="docker" --docker_socket="/var/run/docker.sock" 
>>>>>>>> --help="false" --initialize_driver_logging="true" --logbufsecs="0" 
>>>>>>>> --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" 
>>>>>>>> --quiet="false" 
>>>>>>>> --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662"
>>>>>>>>  --stop_timeout="0ns"
>>>>>>>> Registered docker executor on mesos-worker2a
>>>>>>>> Starting task hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
>>>>>>>> Launching health check process: /usr/libexec/mesos/mesos-health-check 
>>>>>>>> --executor=(1)@192.168.225.59:38776 
>>>>>>>> --health_check_json={"command":{"shell":true,"value":"docker exec 
>>>>>>>> mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662
>>>>>>>>  sh -c \" exit 1 
>>>>>>>> \""},"consecutive_failures":3,"delay_seconds":5.0,"grace_period_seconds":10.0,"interval_seconds":10.0,"timeout_seconds":10.0}
>>>>>>>>  --task_id=hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
>>>>>>>> Health check process launched at pid: 7525
>>>>>>>> Received task health update, healthy: false
>>>>>>>> Received task health update, healthy: false
>>>>>>> 
>>>>>>> 
>>>>>>> STDERR:
>>>>>>> 
>>>>>>>> I1008 19:30:02.569856  7408 exec.cpp:134] Version: 0.26.0
>>>>>>>> I1008 19:30:02.571815  7411 exec.cpp:208] Executor registered on slave 
>>>>>>>> 61373c0e-7349-4173-ab8d-9d7b260e8a30-S1
>>>>>>>> WARNING: Your kernel does not support swap limit capabilities, memory 
>>>>>>>> limited without swap.
>>>>>>>> WARNING: Logging before InitGoogleLogging() is written to STDERR
>>>>>>>> I1008 19:30:08.527354  7533 main.cpp:100] Ignoring failure as health 
>>>>>>>> check still in grace period
>>>>>>>> W1008 19:30:38.912325  7525 main.cpp:375] Health check failed Health 
>>>>>>>> command check exited with status 1
>>>>>>> 
>>>>>>> 
>>>>>>> Screenshot of the task still running even though the health-check 
>>>>>>> exited with status code 1:
>>>>>>> 
>>>>>>> http://i.imgur.com/zx9GQuo.png
>>>>>>> 
>>>>>>> The expected behavior when the health-check binary has exited w/ 
>>>>>>> non-zero status is that the task would be killed and restarted (rather 
>>>>>>> than continuing to run as outlined above).
>>>>>>> 
>>>>>>> -----
>>>>>>> Additional note: After hard-coding the "path" string of the 
>>>>>>> health-check binary parent dir into b/src/docker/executor.cpp, I am 
>>>>>>> able to at least test the functionality.  The other issue of 
>>>>>>> health-checks for docker tasks failing to start is still unresolved due 
>>>>>>> to the unpropagated MESOS_LAUNCHER_DIR issue.
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Best Regards,
>>>>> Haosdent Huang
>>>> 
>>>> 
>>>> 
>>>> -- 
>>>> Best Regards,
>>>> Haosdent Huang
>>> 
>>> 
>>> 
>>> -- 
>>> Best Regards,
>>> Haosdent Huang
> 
> 
> 
> -- 
> Best Regards,
> Haosdent Huang
