I suspect your environment has some problems, similar to the strange behaviour you saw with launcher_dir. I remember hitting something odd on my laptop once; it turned out I had `make install`ed an older version of Mesos while launching a newer one. If you run `make uninstall`, do a fresh build by following http://mesos.apache.org/documentation/latest/getting-started/, and still hit these problems, we can be more confident it is a bug. But so far I cannot reproduce your problems after repeated tests, so I still think the cause may be a stale build environment or a misconfiguration.
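To make that concrete, the clean rebuild I mean looks roughly like this (a sketch following the getting-started guide, assuming a git checkout; note that `git clean -xfd` deletes every untracked file, including the old build directory):

```
# Remove the previously installed (possibly older) Mesos binaries first.
cd mesos/build
sudo make uninstall

# Start from a pristine tree, then build per the getting-started guide.
cd ..
git clean -xfd
./bootstrap
mkdir build && cd build
../configure
make
sudo make install
```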
On Sun, Oct 11, 2015 at 11:02 PM, Jay Taylor <[email protected]> wrote:

> I was testing on a set of VMs on a LAN with ideal, very low latency.
>
> On Oct 11, 2015, at 3:30 AM, haosdent <[email protected]> wrote:
>
> My stdout looks like:
>
> ```
> Launching health check process: /home/ld-sgdev/huangh/mesos/build/src/.libs/mesos-health-check --executor=(1)@xxxx --health_check_json={"command":{"shell":true,"value":"docker exec mesos-90f9f9ac-e2e4-4d7d-8d07-50e37f727b91-S0.d8c73ee2-deca-46b6-ab51-d7815bded2d4 sh -c \" exit 1 \""},"consecutive_failures":3,"delay_seconds":0.0,"grace_period_seconds":10.0,"interval_seconds":10.0,"timeout_seconds":20.0} --task_id=test-health-check.901b938b-7002-11e5-a62b-0a0027000000
> Health check process launched at pid: 24720
> Received task health update, healthy: false
> Received task health update, healthy: false
> Received task health update, healthy: false
> Killing docker task
> Shutting down
> ```
>
> Does your network latency affect this result?
>
> On Sun, Oct 11, 2015 at 6:18 PM, haosdent <[email protected]> wrote:
>
>> I could not reproduce your problem on my side. But I guess it may be related to this ticket: MESOS-1613 <https://issues.apache.org/jira/browse/MESOS-1613> "HealthCheckTest.ConsecutiveFailures is flaky".
>>
>> On Fri, Oct 9, 2015 at 12:13 PM, haosdent <[email protected]> wrote:
>>
>>> I think it may be because the health check exits before the executor receives the TaskHealthStatus. I will try "exit 1" and get back to you with my findings.
>>>
>>> On Fri, Oct 9, 2015 at 11:30 AM, Jay Taylor <[email protected]> wrote:
>>>
>>>> Following up on this:
>>>>
>>>> This problem is reproducible when the command is "exit 1".
>>>>
>>>> Once I set it to a real curl command, the intermittent failures stopped and health checks worked as advertised.
>>>>
>>>> On Oct 8, 2015, at 12:45 PM, Jay Taylor <[email protected]> wrote:
>>>>
>>>> Using the following health-check parameters:
>>>>
>>>> cmd="exit 1"
>>>> delay=5.0
>>>> grace-period=10.0
>>>> interval=10.0
>>>> timeout=10.0
>>>> consecutiveFailures=3
>>>>
>>>> Sometimes the tasks are successfully identified as failing and restarted; other times the health-check command exits, yet the task is left in a running state and the failure is ignored.
>>>>
>>>> Sample of failed Mesos task log:
>>>>
>>>> STDOUT:
>>>>
>>>>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662" --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662" --stop_timeout="0ns"
>>>>> --container="mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662" --docker="docker" --docker_socket="/var/run/docker.sock" --help="false" --initialize_driver_logging="true" --logbufsecs="0" --logging_level="INFO" --mapped_directory="/mnt/mesos/sandbox" --quiet="false" --sandbox_directory="/tmp/mesos/slaves/61373c0e-7349-4173-ab8d-9d7b260e8a30-S1/frameworks/20150924-210922-1608624320-5050-1792-0020/executors/hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631/runs/05dd08c5-ffba-47d8-8a8a-b6cb0c58b662" --stop_timeout="0ns"
>>>>> Registered docker executor on mesos-worker2a
>>>>> Starting task hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
>>>>> Launching health check process: /usr/libexec/mesos/mesos-health-check --executor=(1)@192.168.225.59:38776 --health_check_json={"command":{"shell":true,"value":"docker exec mesos-61373c0e-7349-4173-ab8d-9d7b260e8a30-S1.05dd08c5-ffba-47d8-8a8a-b6cb0c58b662 sh -c \" exit 1 \""},"consecutive_failures":3,"delay_seconds":5.0,"grace_period_seconds":10.0,"interval_seconds":10.0,"timeout_seconds":10.0} --task_id=hello-app_web-v3.d14ba30e-6401-4044-a97a-86a2cab65631
>>>>> Health check process launched at pid: 7525
>>>>> Received task health update, healthy: false
>>>>> Received task health update, healthy: false
>>>>
>>>> STDERR:
>>>>
>>>>> I1008 19:30:02.569856  7408 exec.cpp:134] Version: 0.26.0
>>>>> I1008 19:30:02.571815  7411 exec.cpp:208] Executor registered on slave 61373c0e-7349-4173-ab8d-9d7b260e8a30-S1
>>>>> WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
>>>>> WARNING: Logging before InitGoogleLogging() is written to STDERR
>>>>> I1008 19:30:08.527354  7533 main.cpp:100] Ignoring failure as health check still in grace period
>>>>> W1008 19:30:38.912325  7525 main.cpp:375] Health check failed Health command check exited with status 1
>>>>
>>>> Screenshot of the task still running despite the health check having exited with status code 1:
>>>>
>>>> http://i.imgur.com/zx9GQuo.png
>>>>
>>>> The expected behavior when the health-check binary exits with a non-zero status is that the task is killed and restarted (rather than continuing to run as outlined above).
>>>>
>>>> -----
>>>> Additional note: After hard-coding the "path" string of the health-check binary's parent dir into b/src/docker/executor.cpp, I am able to at least test the functionality. The other issue, health checks for docker tasks failing to start at all, is still unresolved due to the unpropagated MESOS_LAUNCH_DIR issue.

--
Best Regards,
Haosdent Huang
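A note on the "real curl command" workaround Jay describes above: a check command like `exit 1` fails in essentially zero time, while an HTTP check takes at least one network round trip, so the timing difference plausibly interacts with the race haosdent suspects (the health checker exiting before the executor has consumed the TaskHealthStatus). A minimal sketch of such a check command; the port and path are hypothetical placeholders, not taken from the thread:

```
# Hypothetical HTTP health check command (port and path are placeholders).
# -f makes curl exit non-zero on HTTP 4xx/5xx responses, and --max-time
# bounds the check well inside the configured timeout_seconds=10.0.
curl -f --max-time 5 http://localhost:8080/health
```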
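On the unpropagated MESOS_LAUNCH_DIR issue: one way to confirm whether the variable actually reaches the docker executor is to dump the running executor's environment on the agent. This is a diagnostic sketch; `<executor-pid>` is a placeholder for the pid found via `ps`, and the binary name may differ between Mesos versions:

```
# Find the docker executor process on the agent (name may vary by version).
ps ax | grep mesos-docker-executor

# Inspect its environment; if MESOS_LAUNCH_DIR is absent here, the executor
# cannot locate mesos-health-check relative to the launch directory, which
# is consistent with the hard-coded-path workaround described above.
tr '\0' '\n' < /proc/<executor-pid>/environ | grep MESOS_LAUNCH_DIR
```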

