Qian Zhang created MESOS-10139:
----------------------------------

             Summary: Mesos agent host may become unresponsive when it is under low memory pressure
                 Key: MESOS-10139
                 URL: https://issues.apache.org/jira/browse/MESOS-10139
             Project: Mesos
          Issue Type: Bug
            Reporter: Qian Zhang

When a user launches a task that uses a large amount of memory on an agent host (e.g., a task that runs `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on an agent host that has 32GB of memory), the whole agent host becomes unresponsive: no commands can be executed anymore, although the host still responds to ping. A few minutes later, the Mesos master marks this agent as unreachable and updates the state of all its tasks to `TASK_UNREACHABLE`.

{code:java}
May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE because of health check timeout
May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: health check timed out
May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: health check timed out
…
May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 of framework 89d2d679-fa08-49be-94c3-880ebb595212-0000 (latest state: TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.108865 15495 master.cpp:11149] Updating the state of task app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1 of framework 89d2d679-fa08-49be-94c3-880ebb595212-0000 (latest state: TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
...
{code}
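
For a rough sense of how little memory the `stress` command in the reproduction leaves for the rest of the host, the arithmetic can be sketched as below. This is only an illustration, not part of the original report: it assumes that "32GB" means 32 GiB (32768 MiB), that the `M` suffix of `--vm-bytes` denotes MiB, and that all physical memory is otherwise free, which overstates the real headroom because the kernel and other processes also consume memory. The helper name `headroom_mib` is hypothetical.

```python
def headroom_mib(total_mib: int, requested_mib: int) -> int:
    """Return the memory (in MiB) left over after the workload's allocation."""
    return total_mib - requested_mib


# Assumption: "32GB" of host memory is read as 32 GiB = 32768 MiB.
TOTAL_MIB = 32 * 1024
# Taken from the reproduction command: --vm-bytes 29800M (assumed to be MiB).
STRESS_MIB = 29800

left = headroom_mib(TOTAL_MIB, STRESS_MIB)
print(f"headroom: {left} MiB ({left / TOTAL_MIB:.1%} of total)")
```

Under these assumptions, less than 3 GiB remains for the kernel, the Mesos agent, and every other process on the host, which is consistent with the host stalling rather than the task alone being affected.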