[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16179635#comment-16179635 ]

Vinod Kone commented on MESOS-7963:
-----------------------------------

[~jpe...@apache.org] If you are working on this, can you assign it to yourself? Thanks.

> Task groups can lose the container limitation status.
> -----------------------------------------------------
>
>                 Key: MESOS-7963
>                 URL: https://issues.apache.org/jira/browse/MESOS-7963
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, executor
>            Reporter: James Peach
>
> If you run a single task in a task group and that task fails with a container
> limitation, that status update can be lost and only the executor failure will
> be reported to the framework.
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json --master=jpeach.apple.com:5050 '--task_group={
>   "tasks": [
>     {
>       "name": "7f141aca-55fe-4bb0-af4b-87f5ee26986a",
>       "task_id": {"value": "2866368d-7279-4657-b8eb-bf1d968e8ebf"},
>       "agent_id": {"value": ""},
>       "resources": [
>         {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.2}},
>         {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}},
>         {"name": "disk", "type": "SCALAR", "scalar": {"value": 2}}
>       ],
>       "command": {
>         "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M count=64 ; sleep 1"
>       }
>     }
>   ]
> }'
> I0911 11:48:01.480689  7340 scheduler.cpp:184] Version: 1.5.0
> I0911 11:48:01.488868  7339 scheduler.cpp:470] New master detected at master@17.228.224.108:5050
> Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> Submitted task group with tasks [ 2866368d-7279-4657-b8eb-bf1d968e8ebf ] to agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
> Received status update TASK_RUNNING for task '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   message: 'Command terminated with signal Killed'
>   source: SOURCE_EXECUTOR
> {noformat}
> However, the agent logs show that this failed with a memory limitation:
> {noformat}
> I0911 11:48:02.235818  7012 http.cpp:532] Processing call WAIT_NESTED_CONTAINER
> I0911 11:48:02.236395  7013 status_update_manager.cpp:323] Received status update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:02.237083  7016 slave.cpp:4875] Forwarding the update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 to master@17.228.224.108:5050
> I0911 11:48:02.283661  7007 status_update_manager.cpp:395] Received status update acknowledgement (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:04.771455  7014 memory.cpp:516] OOM detected for container 474388fe-43c3-4372-b903-eaca22740996
> I0911 11:48:04.776445  7014 memory.cpp:556] Memory limit exceeded: Requested: 64MB Maximum Used: 64MB
> ...
> I0911 11:48:04.776943  7012 containerizer.cpp:2681] Container 474388fe-43c3-4372-b903-eaca22740996 has reached its limit for resource [{"name":"mem","scalar":{"value":64.0},"type":"SCALAR"}] and will be terminated
> {noformat}
> The following {{mesos-execute}} task will show the container limitation correctly:
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json --master=jpeach.apple.com:5050 '--task_group={
>   "tasks": [
>     {
>       "name": "37db08f6-4f0f-4ef6-97ee-b10a5c5cc211",
>       "task_id": {"value": "1372b2e2-c501-4e80-bcbd-1a5c5194e206"},
>       "agent_id": {"value": ""},
>       "resources": [
>         {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.2}},
>         {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
>       ],
>       "command": {"value": "sleep 600"}
>     },
>     {
>       "name": "7247643c-5e4d-4b01-9839-e38db49f7f4d",
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174971#comment-16174971 ]

Qian Zhang commented on MESOS-7963:
-----------------------------------

Can you let me know what the special case is? I think that currently, when the default executor gets a limitation, it kills all the other nested containers and then terminates itself, and I do not think we need to change this. And even without my proposal (i.e., raising the limitation only for the root container), all the nested containers would still be killed (by the Mesos containerizer), so the result is the same. I am not sure when we would need to restart a nested container.
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174890#comment-16174890 ]

James Peach commented on MESOS-7963:
------------------------------------

Right now, if an executor gets any limitation, it knows it will be terminated. The special case is that under your proposal some kinds of limitation would not cause the executor to be terminated, so the executor would need to decide how to handle that, either by manually tearing everything down or by restarting the nested container.
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16174089#comment-16174089 ]

Qian Zhang commented on MESOS-7963:
-----------------------------------

{quote}it ends up being an extra special case that executors might have to deal with.{quote}

Can you elaborate a bit on why it would be an extra special case for the executor to handle? I think that no matter whether we raise the limitation for the root container or for a nested container, executors will always wait on the nested containers.
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16173413#comment-16173413 ]

James Peach commented on MESOS-7963:
------------------------------------

I think that steps 1-3 sound fine and will fix the immediate problem.

{quote}if one of the tasks reaches the limit ... the scheduler will receive TASK_FAILED for all the tasks with the exactly same reason and message{quote}

This can happen today, though the raciness generally hides it. In most cases, I think that this behavior is correct and the scheduler ought to deal with it. For example, it is possible for one container to use most of the memory and for a smaller container to trigger the OOM (and similarly with disk). In these cases, it is hard to make a principled decision about which container is responsible.

{quote}least for {{network/ports}} isolator, we should make it raise the limitation for nested container since it has the ability to tell which nested container reaches the limit{quote}

We could do this for the {{network/ports}} isolator, but I would generally prefer resource limitations to have a single, well-defined behavior. If we let this isolator be special, the flexibility it gains makes the behavior inconsistent, and it ends up being an extra special case that executors might have to deal with. At least as a starting point, I'd like the behavior to be consistent and well-defined.
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172846#comment-16172846 ]

Qian Zhang commented on MESOS-7963:
-----------------------------------

[~jieyu] and [~jpe...@apache.org], so I think what we should do is:
# In {{MesosContainerizerProcess::limited()}}, check whether the limited container is a root container; if it is, set not only the root container's {{limitations}} but also the {{limitations}} of all of its nested containers.
# In {{Http::waitNestedContainer()}}, propagate the reason and message of the nested container termination to the default executor, so that it can send a status update with that reason and message for the nested container to the scheduler.
# For the {{network/ports}} isolator, we also want it to raise the limitation only for the root container rather than for any nested containers, just like {{cgroups/mem}} and {{disk/du}}.

Please correct me if I missed anything. One comment: for a task group with multiple tasks, if one of the tasks reaches a limit (e.g., it listens on an unallocated port), the scheduler will receive {{TASK_FAILED}} for all of the tasks with exactly the same reason and message. I think this will confuse the scheduler, because it cannot figure out which particular task reached the limit.
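The first step of the proposal above can be sketched as a minimal, self-contained model. This is not actual Mesos code: the {{Container}} struct, the string container IDs, and the free {{limited()}} function below are hypothetical stand-ins for {{MesosContainerizerProcess}} state.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for a container limitation (reason + message).
struct Limitation {
  std::string reason;
  std::string message;
};

// Hypothetical model of containerizer state: nested containers have a
// parent, and any container may carry an optional limitation.
struct Container {
  std::optional<std::string> parent;  // unset for root containers
  std::optional<Limitation> limitation;
};

using Containers = std::map<std::string, Container>;

// Sketch of the proposed MesosContainerizerProcess::limited() change:
// when a *root* container hits a limit, record the limitation on the
// root and on every container nested directly underneath it, so that a
// later WAIT_NESTED_CONTAINER can report the real cause of death.
void limited(Containers& containers,
             const std::string& containerId,
             const Limitation& limitation) {
  containers[containerId].limitation = limitation;

  if (containers[containerId].parent.has_value()) {
    return;  // Only propagate downward from root containers.
  }

  for (auto& entry : containers) {
    if (entry.second.parent == containerId) {
      entry.second.limitation = limitation;
    }
  }
}
```

With a root container and one nested task container, calling {{limited()}} on the root leaves the nested container carrying the same reason and message, which is what step 2 then forwards to the executor.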
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172091#comment-16172091 ]

Jie Yu commented on MESOS-7963:
-------------------------------

For the racy behavior: this is because the order of the following two events is not deterministic:
1) The executor's WAIT on the nested container returns first, and the executor sends TASK_FAILED.
2) The executor is killed by the containerizer first, and the agent generates the status update for the executor's non-terminal tasks.
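The race above can be illustrated with a minimal model. All names here ({{Framework}}, {{executorWaitReturns}}, {{agentReapsExecutor}}) are invented for illustration; only the messages are taken from the logs in the issue description.

```cpp
#include <cassert>
#include <string>

// Hypothetical model: the framework keeps only the first terminal
// status update it receives for a task.
struct Framework {
  std::string message;  // message of the first terminal update

  void update(const std::string& m) {
    if (message.empty()) {
      message = m;  // later terminal updates for the task are dropped
    }
  }
};

// Event 1: the executor's WAIT on the nested container returns and the
// executor sends TASK_FAILED. Without the fix, the executor only knows
// the task died from a signal, not why.
void executorWaitReturns(Framework& framework) {
  framework.update("Command terminated with signal Killed");
}

// Event 2: the containerizer kills the executor, and the agent
// generates TASK_FAILED for the still non-terminal task, carrying the
// container limitation.
void agentReapsExecutor(Framework& framework) {
  framework.update("Memory limit exceeded: Requested: 64MB Maximum Used: 64MB");
}
```

Running the two events in either order shows the problem: the limitation reaches the framework only when the agent happens to win the race, which matches the intermittent behavior reported below.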
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172089#comment-16172089 ]

James Peach commented on MESOS-7963:
------------------------------------

As per discussion with [~jieyu], the right fix here is for the containerizer to propagate the container limitation into the {{WAIT_NESTED_CONTAINER}} response when it destroys the nested containers. The executor should then pass that information on in its status updates. Though there will still be a race between the executor and the agent, both parties will end up propagating the same status update information.
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171986#comment-16171986 ] James Peach commented on MESOS-7963:
FWIW, if I use the {{disk/du}} isolator to trigger a disk resource limitation from {{mesos-execute}}, I get the correct status update (i.e. the one with the limitation reason) about 20% of the time.
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171327#comment-16171327 ] James Peach commented on MESOS-7963:
{quote}
For the second task, I think before the default executor got a chance to send status update for it, the default executor itself was destroyed by Mesos containerizer, that's why we see a TASK_FAILED for the second task and its source is SOURCE_AGENT.
{quote}
Yes, and I think this would also happen if the executor was terminated before the status update from the failed task was acknowledged, though I'm not positive whether this can actually happen. When running a single task, you will sometimes get different status update results, implying that there is a race.
{quote}
Currently both cgroups isolator (memory subsystem) and disk/du isolator raise the limitation for root container rather than nested container.
{quote}
This is consistent, since resources are always accumulated on the root container. I'm not sure that it is feasible to detect when a nested container triggers a limitation for these resources.
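The race described above can be reduced to a toy model (the names below are illustrative, not Mesos APIs): whether the framework ever sees the limitation reason depends on whether the limitation-bearing update escapes before the executor container is torn down, which would explain the intermittent results.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Toy model of the suspected race. If the limitation-bearing status update
// is sent (and acknowledged) before the executor container is destroyed,
// the framework learns the limitation reason; otherwise only the generic
// executor failure ("Command terminated with signal Killed") survives.
std::optional<std::string> reasonSeenByFramework(bool updateSentBeforeExecutorDestroyed) {
  if (updateSentBeforeExecutorDestroyed) {
    return std::string("REASON_CONTAINER_LIMITATION");
  }
  return std::nullopt;  // limitation lost; only a generic failure remains
}
```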
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171182#comment-16171182 ] Qian Zhang commented on MESOS-7963:
---
[~jpe...@apache.org] In your second example, a task group with two tasks was launched. When the {{disk/du}} isolator raised a limitation for the root container, the Mesos containerizer tried to destroy the root container, but before that it destroyed the two nested containers first. So when the first nested container was destroyed, the default executor saw it (since the default executor was still alive at that moment) and sent a {{TASK_FAILED}} for the first task with source {{SOURCE_EXECUTOR}}. For the second task, I think the default executor itself was destroyed by the Mesos containerizer before it got a chance to send a status update, which is why we see a {{TASK_FAILED}} for the second task with source {{SOURCE_AGENT}}.

In your first example, the task group has only one task, so I think it follows the same flow as the first task in your second example: the default executor sent a {{TASK_FAILED}} for the task, and then the default executor itself was destroyed (or maybe self-terminated).

Currently both the cgroups isolator (memory subsystem) and the {{disk/du}} isolator raise the limitation for the root container rather than the nested container. I think we may need to change them to raise the limitation for the nested container, and enhance the implementation of {{waitNestedContainer()}} so that it propagates the reason and message of the container termination to the default executor, which can then send a status update with that reason and message for the nested container to the scheduler.
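A minimal sketch of the proposed propagation, using hypothetical stand-in types (the real Mesos {{ContainerTermination}} and {{TaskStatus}} are protobufs with different shapes): the wait on a nested container would carry the root container's limitation reason and message through to the nested container's termination, which the default executor then copies into its {{TASK_FAILED}} update instead of sending a bare failure.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Illustrative stand-ins, not the real Mesos protobuf definitions.
struct ContainerTermination {
  std::optional<std::string> reason;   // e.g. "REASON_CONTAINER_LIMITATION"
  std::string message;                 // e.g. "Memory limit exceeded"
};

struct TaskStatus {
  std::string state;                   // e.g. "TASK_FAILED"
  std::string source;                  // e.g. "SOURCE_EXECUTOR"
  std::optional<std::string> reason;
  std::string message;
};

// Proposed behavior: waiting on a nested container surfaces the root
// container's limitation instead of dropping it when the root is destroyed.
ContainerTermination waitNestedContainer(const ContainerTermination& root) {
  ContainerTermination nested;
  nested.reason = root.reason;      // propagate the limitation reason
  nested.message = root.message;    // and the human-readable message
  return nested;
}

// The default executor then builds a status update that carries the
// limitation rather than "Command terminated with signal Killed".
TaskStatus makeStatusUpdate(const ContainerTermination& termination) {
  TaskStatus status;
  status.state = "TASK_FAILED";
  status.source = "SOURCE_EXECUTOR";
  status.reason = termination.reason;
  status.message = termination.message;
  return status;
}
```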
[ https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16170890#comment-16170890 ] James Peach commented on MESOS-7963:
/cc [~jieyu] This covers the executor container limitation we discussed on the Slack channel.