[jira] [Commented] (MESOS-8051) Killing TASK_GROUP fail to kill some tasks
[ https://issues.apache.org/jira/browse/MESOS-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16226257#comment-16226257 ]

Qian Zhang commented on MESOS-8051:
-----------------------------------

commit 05c7dd88f269692b7248c1087a3f57759eba6853
Author: Qian Zhang
Date:   Mon Oct 9 09:01:15 2017 +0800

    Ignored the tasks already being killed when killing the task group.

    When the scheduler tries to kill multiple tasks in the task group
    simultaneously, the default executor kills the tasks one by one. When
    the first task is killed, the default executor kills all the other
    tasks in the task group; however, we need to ignore the tasks which
    are already being killed, otherwise the check
    `CHECK(!container->killing);` in `DefaultExecutor::kill()` will fail.

    Review: https://reviews.apache.org/r/62836

commit 28831de34d098c894042246dd6fef402eb3b960d
Author: Qian Zhang
Date:   Mon Oct 9 14:25:31 2017 +0800

    Added a test `DefaultExecutorTest.KillMultipleTasks`.

    Review: https://reviews.apache.org/r/62837

> Killing TASK_GROUP fails to kill some tasks
> --
>
>                 Key: MESOS-8051
>                 URL: https://issues.apache.org/jira/browse/MESOS-8051
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, executor
>    Affects Versions: 1.4.0
>            Reporter: A. Dukhovniy
>            Assignee: Qian Zhang
>            Priority: Critical
>         Attachments: dcos-mesos-master.log.gz, dcos-mesos-slave.log.gz, screenshot-1.png
>
> When starting the following pod definition via Marathon:
> {code:java}
> {
>   "id": "/simple-pod",
>   "scaling": { "kind": "fixed", "instances": 3 },
>   "environment": { "PING": "PONG" },
>   "containers": [
>     {
>       "name": "ct1",
>       "resources": { "cpus": 0.1, "mem": 32 },
>       "image": { "kind": "MESOS", "id": "busybox" },
>       "exec": {
>         "command": {
>           "shell": "while true; do echo the current time is $(date) > ./test-v1/clock; sleep 1; done"
>         }
>       },
>       "volumeMounts": [ { "name": "v1", "mountPath": "test-v1" } ]
>     },
>     {
>       "name": "ct2",
>       "resources": { "cpus": 0.1, "mem": 32 },
>       "exec": {
>         "command": {
>           "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep 1; done"
>         }
>       },
>       "volumeMounts": [
>         { "name": "v1", "mountPath": "etc" },
>         { "name": "v2", "mountPath": "docker" }
>       ]
>     }
>   ],
>   "networks": [ { "mode": "host" } ],
>   "volumes": [
>     { "name": "v1" },
>     { "name": "v2", "host": "/var/lib/docker" }
>   ]
> }
> {code}
> Mesos will successfully kill all {{ct2}} containers but fail to kill all or some of the {{ct1}} containers. I've attached both master and agent logs.
> The interesting part starts after Marathon issues 6 kills:
> {code:java}
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.209966 4746 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210033 4746 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210471 4748 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210518 4748 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101
> Oct 04 14:58:25
> {code}
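The fix described in the first commit message above (skip tasks whose kill is already in progress so the `CHECK(!container->killing);` invariant is never violated) can be sketched as follows. This is a minimal illustration with hypothetical types, not the actual Mesos sources:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the default executor's per-task container state.
struct Container {
  std::string taskId;
  bool killing = false;  // Set once a kill for this task has been initiated.
};

// Kills a single container. Callers must not invoke this twice for the same
// container; the assert mirrors `CHECK(!container->killing);` in
// `DefaultExecutor::kill()`.
void killContainer(Container* container) {
  assert(!container->killing);
  container->killing = true;
  // ... send SIGTERM, schedule escalation to SIGKILL ...
}

// Kills every container in the task group, ignoring the ones that are
// already being killed -- the guard added by the fix.
void killTaskGroup(std::map<std::string, Container>& containers) {
  for (auto& entry : containers) {
    Container& container = entry.second;
    if (container.killing) {
      continue;  // Already being killed; do not kill it again.
    }
    killContainer(&container);
  }
}
```

Without the `if (container.killing)` guard, a scheduler that kills several tasks of the group at once makes `killTaskGroup` re-enter the kill path for a task whose kill is still in flight, tripping the check and aborting the executor.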
[jira] [Commented] (MESOS-8051) Killing TASK_GROUP fail to kill some tasks
[ https://issues.apache.org/jira/browse/MESOS-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196406#comment-16196406 ]

Qian Zhang commented on MESOS-8051:
-----------------------------------

RR: https://reviews.apache.org/r/62836/
[jira] [Commented] (MESOS-8051) Killing TASK_GROUP fail to kill some tasks
[ https://issues.apache.org/jira/browse/MESOS-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196097#comment-16196097 ]

Qian Zhang commented on MESOS-8051:
-----------------------------------

I can reproduce this issue with the latest Mesos code (master branch) + Marathon v1.5.1. I created the same task group (pod) as [~zen-dog]'s, and while it was running, ran the following command to delete it:
{code}
curl -X DELETE http://192.168.1.3:8080/v2/pods/simple-pod
{code}
Then I can see one task is in {{TASK_KILLED}} status, which is good:
{code}
I1008 19:58:59.928489 20665 slave.cpp:4411] Handling status update TASK_KILLED (UUID: e4b6562b-629a-4da8-85b8-ec7a57d1157e) for task simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct2 of framework 206eabc8-a6ba-430f-8ecb-b0900b26b820-
{code}
But the other task is in {{TASK_FAILED}} status:
{code}
I1008 19:59:00.413954 20665 slave.cpp:4411] Handling status update TASK_FAILED (UUID: 65467177-066c-4a5d-b04a-f35c66a6d89c) for task simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct1 of framework 206eabc8-a6ba-430f-8ecb-b0900b26b820- from @0.0.0.0:0
{code}
And I see the executor core dumped (I found the same core dump in [~zen-dog]'s agent log as well):
{code}
I1008 19:59:00.412891 20665 slave.cpp:5424] Executor 'instance-simple-pod.c034f882-ac1f-11e7-bc9c-024255337c79' of framework 206eabc8-a6ba-430f-8ecb-b0900b26b820- terminated with signal Aborted (core dumped)
{code}
And then in the {{stderr}} of the executor, I found the root cause of its core dump:
{code}
# cat /opt/mesos/slaves/097ad1b0-4a33-400d-9d06-b554f9c7c009-S0/frameworks/206eabc8-a6ba-430f-8ecb-b0900b26b820-/executors/instance-simple-pod.c034f882-ac1f-11e7-bc9c-024255337c79/runs/019b7103-b5e4-4fa9-aaa6-e4b092167000/stderr
I1008 19:56:43.578035 20917 executor.cpp:192] Version: 1.5.0
I1008 19:56:43.612187 20938 default_executor.cpp:185] Received SUBSCRIBED event
I1008 19:56:43.614189 20938 default_executor.cpp:189] Subscribed executor on workstation
I1008 19:56:43.619079 20932 default_executor.cpp:185] Received LAUNCH_GROUP event
I1008 19:56:43.621937 20931 default_executor.cpp:394] Setting 'MESOS_CONTAINER_IP' to: 192.168.1.3
I1008 19:56:43.703390 20933 default_executor.cpp:624] Successfully launched tasks [ simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct1, simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct2 ] in child containers [ 019b7103-b5e4-4fa9-aaa6-e4b092167000.3d104d9e-9e66-4a55-b067-ec58e782c62a, 019b7103-b5e4-4fa9-aaa6-e4b092167000.6f655d39-7a60-4fb5-84d3-5c9ef61070e8 ]
I1008 19:56:43.707593 20936 default_executor.cpp:697] Waiting for child container 019b7103-b5e4-4fa9-aaa6-e4b092167000.3d104d9e-9e66-4a55-b067-ec58e782c62a of task 'simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct1'
I1008 19:56:43.707864 20936 default_executor.cpp:697] Waiting for child container 019b7103-b5e4-4fa9-aaa6-e4b092167000.6f655d39-7a60-4fb5-84d3-5c9ef61070e8 of task 'simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct2'
I1008 19:56:43.730860 20938 default_executor.cpp:185] Received ACKNOWLEDGED event
I1008 19:56:43.762722 20937 default_executor.cpp:185] Received ACKNOWLEDGED event
I1008 19:58:59.719534 20934 default_executor.cpp:185] Received KILL event
I1008 19:58:59.719614 20934 default_executor.cpp:1119] Received kill for task 'simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct1'
I1008 19:58:59.719633 20934 default_executor.cpp:1004] Killing task simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct1 running in child container 019b7103-b5e4-4fa9-aaa6-e4b092167000.3d104d9e-9e66-4a55-b067-ec58e782c62a with SIGTERM signal
I1008 19:58:59.719643 20934 default_executor.cpp:1026] Scheduling escalation to SIGKILL in 3secs from now
I1008 19:58:59.723572 20931 default_executor.cpp:185] Received KILL event
I1008 19:58:59.723633 20931 default_executor.cpp:1119] Received kill for task 'simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct2'
I1008 19:58:59.723646 20931 default_executor.cpp:1004] Killing task simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct2 running in child container 019b7103-b5e4-4fa9-aaa6-e4b092167000.6f655d39-7a60-4fb5-84d3-5c9ef61070e8 with SIGTERM signal
I1008 19:58:59.723654 20931 default_executor.cpp:1026] Scheduling escalation to SIGKILL in 3secs from now
I1008 19:58:59.896623 20936 default_executor.cpp:842] Child container 019b7103-b5e4-4fa9-aaa6-e4b092167000.6f655d39-7a60-4fb5-84d3-5c9ef61070e8 of task 'simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct2' in state TASK_KILLED terminated with signal Terminated
I1008 19:58:59.896730 20936 default_executor.cpp:879] Killing task group containing tasks [ simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct1, simple-pod.instance-c034f882-ac1f-11e7-bc9c-024255337c79.ct2 ]
F1008 19:58:59.896762 20936 default_executor.cpp:979] Check failed:
{code}
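The stderr log above shows the executor sending SIGTERM and "Scheduling escalation to SIGKILL in 3secs from now" before the fatal check fires. That escalation policy can be illustrated with a small helper; the names and the 3-second default here are illustrative (the grace period is taken from the log line), not the actual Mesos API:

```cpp
#include <chrono>

// Hypothetical helper modeling the SIGTERM -> SIGKILL escalation seen in
// the executor log: within the grace period the task is asked to exit
// with SIGTERM; once the grace period elapses, the executor escalates to
// SIGKILL, which cannot be caught or ignored.
constexpr int kSigKill = 9;
constexpr int kSigTerm = 15;

int signalToSend(std::chrono::seconds sinceFirstKill,
                 std::chrono::seconds gracePeriod = std::chrono::seconds(3)) {
  return sinceFirstKill < gracePeriod ? kSigTerm : kSigKill;
}
```

In the log, {{ct2}} exits on the initial SIGTERM ("terminated with signal Terminated"), so the SIGKILL escalation never fires for it; the executor aborts on the check before {{ct1}}'s kill completes.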
[jira] [Commented] (MESOS-8051) Killing TASK_GROUP fail to kill some tasks
[ https://issues.apache.org/jira/browse/MESOS-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16196014#comment-16196014 ]

Qian Zhang commented on MESOS-8051:
-----------------------------------

[~vinodkone] Sure, I am working on it.