[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.

2017-09-25 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16179635#comment-16179635
 ] 

Vinod Kone commented on MESOS-7963:
---

[~jpe...@apache.org] If you are working on this, can you assign it to yourself?
Thanks.

> Task groups can lose the container limitation status.
> -
>
> Key: MESOS-7963
> URL: https://issues.apache.org/jira/browse/MESOS-7963
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, executor
>Reporter: James Peach
>
> If you run a single task in a task group and that task fails with a container 
> limitation, that status update can be lost and only the executor failure will 
> be reported to the framework.
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json 
> --master=jpeach.apple.com:5050 '--task_group={
> "tasks":
> [
> {
> "name": "7f141aca-55fe-4bb0-af4b-87f5ee26986a",
> "task_id": {"value" : "2866368d-7279-4657-b8eb-bf1d968e8ebf"},
> "agent_id": {"value" : ""},
> "resources": [{
> "name": "cpus",
> "type": "SCALAR",
> "scalar": {
> "value": 0.2
> }
> }, {
> "name": "mem",
> "type": "SCALAR",
> "scalar": {
> "value": 32
> }
> }, {
> "name": "disk",
> "type": "SCALAR",
> "scalar": {
> "value": 2
> }
> }
> ],
> "command": {
> "value": "sleep 2 ; /usr/bin/dd if=/dev/zero of=out.dat bs=1M 
> count=64 ; sleep 1"
> }
> }
> ]
> }'
> I0911 11:48:01.480689  7340 scheduler.cpp:184] Version: 1.5.0
> I0911 11:48:01.488868  7339 scheduler.cpp:470] New master detected at 
> master@17.228.224.108:5050
> Subscribed with ID aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> Submitted task group with tasks [ 2866368d-7279-4657-b8eb-bf1d968e8ebf ] to 
> agent 'aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-S0'
> Received status update TASK_RUNNING for task 
> '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   source: SOURCE_EXECUTOR
> Received status update TASK_FAILED for task 
> '2866368d-7279-4657-b8eb-bf1d968e8ebf'
>   message: 'Command terminated with signal Killed'
>   source: SOURCE_EXECUTOR
> {noformat}
> However, the agent logs show that this failed with a memory limitation:
> {noformat}
> I0911 11:48:02.235818  7012 http.cpp:532] Processing call 
> WAIT_NESTED_CONTAINER
> I0911 11:48:02.236395  7013 status_update_manager.cpp:323] Received status 
> update TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 
> 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework 
> aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:02.237083  7016 slave.cpp:4875] Forwarding the update 
> TASK_RUNNING (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 
> 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework 
> aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010 to master@17.228.224.108:5050
> I0911 11:48:02.283661  7007 status_update_manager.cpp:395] Received status 
> update acknowledgement (UUID: 85e7a8e8-22a7-4561-9000-2cd6d93502d9) for task 
> 2866368d-7279-4657-b8eb-bf1d968e8ebf of framework 
> aabd0847-aabc-4eb4-9c66-7d91fc9e9c32-0010
> I0911 11:48:04.771455  7014 memory.cpp:516] OOM detected for container 
> 474388fe-43c3-4372-b903-eaca22740996
> I0911 11:48:04.776445  7014 memory.cpp:556] Memory limit exceeded: Requested: 
> 64MB Maximum Used: 64MB
> ...
> I0911 11:48:04.776943  7012 containerizer.cpp:2681] Container 
> 474388fe-43c3-4372-b903-eaca22740996 has reached its limit for resource 
> [{"name":"mem","scalar":{"value":64.0},"type":"SCALAR"}] and will be 
> terminated
> {noformat}
> The following {{mesos-execute}} task will show the container limitation 
> correctly:
> {noformat}
> exec /opt/mesos/bin/mesos-execute --content_type=json 
> --master=jpeach.apple.com:5050 '--task_group={
> "tasks":
> [
> {
> "name": "37db08f6-4f0f-4ef6-97ee-b10a5c5cc211",
> "task_id": {"value" : "1372b2e2-c501-4e80-bcbd-1a5c5194e206"},
> "agent_id": {"value" : ""},
> "resources": [{
> "name": "cpus",
> "type": "SCALAR",
> "scalar": {
> "value": 0.2
> }
> },
> {
> "name": "mem",
> "type": "SCALAR",
> "scalar": {
> "value": 32
> }
> }],
> "command": {
> "value": "sleep 600"
> }
> }, {
> "name": "7247643c-5e4d-4b01-9839-e38db49f7f4d",
> 

[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.

2017-09-21 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174971#comment-16174971
 ] 

Qian Zhang commented on MESOS-7963:
---

Can you let me know what the special case is? I think that currently, when the
default executor gets a limitation, it kills all the other nested containers
and then terminates itself, and I do not think we need to change this. And even
without my proposal (i.e., raising the limitation only for the root container),
all the nested containers will be killed as well (by the Mesos containerizer),
so the result is the same; I am not sure when we would need to restart the
nested container.


[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.

2017-09-21 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174890#comment-16174890
 ] 

James Peach commented on MESOS-7963:


Right now, if an executor gets any limitation, it knows it will be terminated.
The special case is that, under your proposal, some kinds of limitation would
not cause the executor to be terminated, so the executor would need to decide
how to handle that: either manually tear everything down or restart the nested
container.


[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.

2017-09-20 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16174089#comment-16174089
 ] 

Qian Zhang commented on MESOS-7963:
---

{quote}it ends up being an extra special case that executors might have to deal 
with.{quote}
Can you elaborate a bit on why it would be an extra special case for the
executor to handle? I think that no matter whether we raise the limitation for
the root container or for a nested container, executors will always wait on
nested containers.


[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.

2017-09-20 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16173413#comment-16173413
 ] 

James Peach commented on MESOS-7963:


I think that steps 1 - 3 sound fine and will fix the immediate problem.

{quote}
if one of the tasks reaches the limit ... the scheduler will receive 
TASK_FAILED for all the tasks with the exactly same reason and message
{quote}

This can happen today, though the raciness generally hides it. In most cases, I
think that this behavior is correct and the scheduler ought to deal with it.
For example, it is possible for one container to use most of the memory and for
a smaller container to trigger the OOM (and similarly with disk). In these
cases, it is hard to make a principled decision about which container is
responsible.

{quote}
at least for the {{network/ports}} isolator, we should make it raise the limitation 
for nested container since it has the ability to tell which nested container 
reaches the limit
{quote}

We could do this for the {{network/ports}} isolator, but I would generally
prefer resource limitations to have a single, well-defined behavior. If we let
this isolator be special, then the flexibility that allows is inconsistent, and
it ends up being an extra special case that executors might have to deal with.
At least as a starting point, I'd like the behavior to be consistent and
well-defined.


[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.

2017-09-20 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172846#comment-16172846
 ] 

Qian Zhang commented on MESOS-7963:
---

[~jieyu] and [~jpe...@apache.org], so I think what we should do is:
# In {{MesosContainerizerProcess::limited()}}, check whether the container is a
root container; if so, set not only the root container's {{limitations}} but
also the {{limitations}} of all of its nested containers (see the sketch at the
end of this comment).
# In {{Http::waitNestedContainer()}}, propagate the reason and message of the
nested container termination to the default executor, so that the default
executor can send a status update with that reason and message for the nested
container to the scheduler.
# For the `network/ports` isolator, we also want it to raise the limitation
only for the root container rather than for any nested containers, just like
`cgroups/mem` and `disk/du`.

Please correct me if I missed anything. My concern is that, for a task group
with multiple tasks, if one of the tasks reaches the limit (e.g., it listens on
an unallocated port), the scheduler will receive {{TASK_FAILED}} for all the
tasks with exactly the same reason and message. I think this will confuse the
scheduler, because it cannot figure out which particular task reached the
limit.
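
To make step 1 concrete, here is a minimal C++ sketch of the idea. The types
and names below ({{ContainerID}}, {{ContainerLimitation}}, {{Container}}, the
container map, and the {{limited()}} helper) are simplified stand-ins for
illustration only, not the actual Mesos internals referenced above:
{noformat}
// Sketch of step 1 (not actual Mesos code): when a root container hits a
// resource limitation, record that limitation on every nested container
// under it as well, so that a later wait on a nested container can report
// the real reason instead of a generic failure.

#include <map>
#include <memory>
#include <string>
#include <vector>

struct ContainerID {
  std::string value;
  std::shared_ptr<ContainerID> parent;  // null for a root container

  bool hasParent() const { return parent != nullptr; }

  // Walk up the parent chain to find the root container ID.
  const ContainerID& rootId() const {
    return hasParent() ? parent->rootId() : *this;
  }
};

struct ContainerLimitation {
  std::string resource;  // e.g. "mem"
  std::string message;   // e.g. "Memory limit exceeded: Requested: 64MB ..."
};

struct Container {
  ContainerID id;
  std::vector<ContainerLimitation> limitations;
};

// Called when an isolator reports that the root container `rootId` reached
// one of its resource limits.
void limited(std::map<std::string, Container>& containers,
             const ContainerID& rootId,
             const ContainerLimitation& limitation) {
  for (auto& entry : containers) {
    Container& container = entry.second;

    // Record the limitation on the root container itself and on every
    // nested container whose ancestry leads back to that root.
    if (container.id.rootId().value == rootId.value) {
      container.limitations.push_back(limitation);
    }
  }
}
{noformat}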


[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.

2017-09-19 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172091#comment-16172091
 ] 

Jie Yu commented on MESOS-7963:
---

For the racy behavior: this happens because the order of the following two
events is not deterministic:
1) the executor's WAIT on the nested container returns first and the executor
sends TASK_FAILED;
2) the executor is killed by the containerizer, and the agent generates a
status update for the executor's non-terminal tasks.


[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.

2017-09-19 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16172089#comment-16172089
 ] 

James Peach commented on MESOS-7963:


As per discussion with [~jieyu], the right fix here is for the containerizer to
propagate the container limitation into the {{WAIT_NESTED_CONTAINER}} response
when it destroys the nested containers. The executor should then pass that
information on in its status updates. Although there will still be a race
between the executor and the agent, both parties will end up propagating the
same status update information.
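
A minimal sketch of that flow on the executor side follows; this is not the
actual default-executor code, and the {{limitation}} field on the wait response
below is an assumed extension for illustration, since adding exactly that
information to the {{WAIT_NESTED_CONTAINER}} response is what is being
discussed here:
{noformat}
// Sketch only: if the WAIT_NESTED_CONTAINER response carried the container
// limitation's reason and message, the executor could copy them into its
// terminal status update instead of sending the generic
// "Command terminated with signal Killed" message.

#include <optional>
#include <string>

struct Limitation {
  std::string reason;   // e.g. "REASON_CONTAINER_LIMITATION_MEMORY"
  std::string message;  // e.g. "Memory limit exceeded: Requested: 64MB ..."
};

// Simplified stand-in for the agent's WAIT_NESTED_CONTAINER response; the
// `limitation` field is the hypothetical addition being discussed.
struct WaitNestedContainerResponse {
  int exitStatus = 0;
  std::optional<Limitation> limitation;
};

// Simplified stand-in for a TaskStatus update.
struct TaskStatus {
  std::string state;
  std::string reason;
  std::string message;
  std::string source = "SOURCE_EXECUTOR";
};

// Build the terminal update for a nested container that has been destroyed.
TaskStatus statusFromWait(const WaitNestedContainerResponse& wait) {
  TaskStatus status;
  status.state = "TASK_FAILED";

  if (wait.limitation.has_value()) {
    // Propagate the limitation so the scheduler sees why the task failed,
    // even if the executor itself is about to be torn down.
    status.reason = wait.limitation->reason;
    status.message = wait.limitation->message;
  } else {
    // Fall back to a message derived only from the wait status, which is
    // what produces the unhelpful update shown in the description.
    status.message =
        "Command terminated with status " + std::to_string(wait.exitStatus);
  }

  return status;
}
{noformat}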


[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.

2017-09-19 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171986#comment-16171986
 ] 

James Peach commented on MESOS-7963:


FWIW, if I use the `disk/du` isolator to trigger a disk resource limitation
from {{mesos-execute}}, I get the correct status update (i.e., the one with the
limitation reason) about 20% of the time.


[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.

2017-09-19 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171327#comment-16171327
 ] 

James Peach commented on MESOS-7963:


{quote}
For the second task, I think before the default executor got a chance to send 
status update for it, the default executor itself was destroyed by Mesos 
containerizer, that's why we see a TASK_FAILED for the second task and its 
source is SOURCE_AGENT.
{quote}

Yes, I think this would also happen if the executor were terminated before the
status update from the failed task was acknowledged, though I'm not positive
whether that can actually happen. When running a single task, you will
sometimes get different status update results, which implies that there is a
race.

{quote}
Currently both cgroups isolator (memory subsystem) and disk/du isolator raise 
the limitation for root container rather than nested container.
{quote}

This is consistent, since resources are always accumulated on the root
container. I'm not sure it is feasible to detect when a nested container
triggers a limitation for these resources.


[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.

2017-09-19 Thread Qian Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16171182#comment-16171182
 ] 

Qian Zhang commented on MESOS-7963:
---

[~jpe...@apache.org] In your second example, a task group with two tasks was
launched. When the disk/du isolator raised a limitation for the root container,
the Mesos containerizer tried to destroy the root container, but before doing
that it destroyed the two nested containers first. So when the first nested
container was destroyed, the default executor knew about it (since the default
executor was still alive at that moment) and sent a {{TASK_FAILED}} for the
first task with source {{SOURCE_EXECUTOR}}. For the second task, I think that
before the default executor got a chance to send a status update for it, the
default executor itself was destroyed by the Mesos containerizer; that's why we
see a {{TASK_FAILED}} for the second task with source {{SOURCE_AGENT}}.

In your first example, the task group has only one task, so I think it follows
the same flow as the first task in your second example, i.e., the default
executor sent a {{TASK_FAILED}} for the task, and then the default executor
itself was destroyed (or maybe terminated itself).

Currently both the cgroups isolator (memory subsystem) and the disk/du isolator
raise the limitation for the root container rather than for the nested
container. I think we may need to change them to raise the limitation for the
nested container, and enhance the implementation of {{waitNestedContainer()}}
so that it propagates the reason and message of the container termination to
the default executor; the default executor can then send a status update with
that reason and message for the nested container to the scheduler.


[jira] [Commented] (MESOS-7963) Task groups can lose the container limitation status.

2017-09-18 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16170890#comment-16170890
 ] 

James Peach commented on MESOS-7963:


/cc [~jieyu] This covers the executor container limitation we discussed on the 
Slack channel.
