[jira] [Assigned] (MESOS-6568) JSON serialization should not omit empty arrays in HTTP APIs

2019-12-06 Thread Benjamin Mahler (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-6568:
--

Assignee: Benjamin Mahler

> JSON serialization should not omit empty arrays in HTTP APIs
> 
>
> Key: MESOS-6568
> URL: https://issues.apache.org/jira/browse/MESOS-6568
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Neil Conway
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: mesosphere
>
> When using the JSON content type with the HTTP APIs, an empty {{repeated}} 
> protobuf field is omitted entirely from the JSON serialization of the message. 
> For example, this is a response to the {{GetTasks}} call:
> {noformat}
> {
>   "get_tasks": {
> "tasks": [{...}]
>   },
>   "type": "GET_TASKS"
> }
> {noformat}
> I think it would be better to include empty arrays for the other fields of 
> the message ({{pending_tasks}}, {{completed_tasks}}, etc.). Advantages:
> # Consistency with the old HTTP endpoints, e.g., /state
> # Semantically, an empty array is more accurate. The master's response should 
> be interpreted as saying it doesn't know about any pending/completed tasks; 
> that is more accurately conveyed by explicitly including an empty array, not 
> by omitting the key entirely.
> *NOTE: The 
> [asV1Protobuf|https://github.com/apache/mesos/blob/d10a33acc426dda9e34db995f16450faf898bb3b/src/common/http.cpp#L172-L423]
> copy needs to also be updated.*
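
A quick way to see the current behavior against a running master (a sketch: the 
host below is a placeholder, and the call may need authentication depending on 
the cluster):

{noformat}
# Ask the master's v1 operator API for GetTasks and list which arrays the
# reply actually contains.
curl -s -X POST http://<master-host>:5050/api/v1 \
  -H 'Content-Type: application/json' \
  -d '{"type": "GET_TASKS"}' | jq '.get_tasks | keys'

# Today this typically prints only the non-empty arrays, e.g. ["tasks"];
# with the proposed change it would also list "pending_tasks",
# "completed_tasks", etc. as explicit empty arrays.
{noformat}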



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns

2019-12-06 Thread Dalton Matos Coelho Barreto (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990017#comment-16990017
 ] 

Dalton Matos Coelho Barreto commented on MESOS-10066:
-

I see. In fact, this could be a reason why the {{--docker_mesos_image}} option 
didn't work.

 

But let's focus on the first attempt, where only the Mesos Agent runs in a 
docker container and the task is running.

Everything works as expected until the Mesos Agent stops; then the executor 
process also dies. Am I missing some configuration needed to make agent 
recovery work as expected?
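
One quick sanity check (a sketch, not something from this ticket: the agent 
host below is a placeholder, and the endpoint may require authentication) is to 
confirm the agent's recovery-related flags via its {{/flags}} endpoint:

{noformat}
# Agent HTTP endpoint defaults to port 5051; "recover" should be "reconnect".
curl -s http://<agent-host>:5051/flags \
  | jq '.flags | {recover, strict, recovery_timeout}'
{noformat}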

Can you tell anything by looking into the logs I attached here?

I think recovery should work transparently and be an easy (almost automatic) 
setup; that's why I suspect I may be missing something.

 

Thanks,
> mesos-docker-executor process dies when agent stops. Recovery fails when 
> agent returns
> --
>
> Key: MESOS-10066
> URL: https://issues.apache.org/jira/browse/MESOS-10066
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, docker, executor
>Affects Versions: 1.7.3
>Reporter: Dalton Matos Coelho Barreto
>Priority: Critical
> Attachments: logs-after.txt, logs-before.txt
>
>
> Hello all,
> The documentation about Agent Recovery shows two conditions for the recovery 
> to be possible:
>  - The agent must have recovery enabled (default true?);
>  - The scheduler must register itself saying that it has checkpointing 
> enabled.
> In my tests I'm using Marathon as the scheduler, and Mesos itself sees 
> Marathon as a checkpoint-enabled scheduler:
> {noformat}
> $ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, 
> "id": .id, "checkpoint": .checkpoint, "active": .active}'
> {
>   "name": "asgard-chronos",
>   "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001",
>   "checkpoint": true,
>   "active": true
> }
> {
>   "name": "marathon",
>   "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-",
>   "checkpoint": true,
>   "active": true
> }
> {noformat}
> Here is what I'm using:
>  # Mesos Master, 1.4.1
>  # Mesos Agent 1.7.3
>  # Using docker image {{mesos/mesos-centos:1.7.x}}
>  # Docker sock mounted from the host
>  # Docker binary also mounted from the host
>  # Marathon: 1.4.12
>  # Docker
> {noformat}
> Client: Docker Engine - Community
>  Version:   19.03.5
>  API version:   1.39 (downgraded from 1.40)
>  Go version:go1.12.12
>  Git commit:633a0ea838
>  Built: Wed Nov 13 07:22:05 2019
>  OS/Arch:   linux/amd64
>  Experimental:  false
> 
> Server: Docker Engine - Community
>  Engine:
>   Version:  18.09.2
>   API version:  1.39 (minimum version 1.12)
>   Go version:   go1.10.6
>   Git commit:   6247962
>   Built:Sun Feb 10 03:42:13 2019
>   OS/Arch:  linux/amd64
>   Experimental: false
> {noformat}
> h2. The problem
> Here is the Marathon test app, a simple {{sleep 99d}} based on the {{debian}} 
> docker image.
> {noformat}
> {
>   "id": "/sleep",
>   "cmd": "sleep 99d",
>   "cpus": 0.1,
>   "mem": 128,
>   "disk": 0,
>   "instances": 1,
>   "constraints": [],
>   "acceptedResourceRoles": [
> "*"
>   ],
>   "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
>   "image": "debian",
>   "network": "HOST",
>   "privileged": false,
>   "parameters": [],
>   "forcePullImage": true
> }
>   },
>   "labels": {},
>   "portDefinitions": []
> }
> {noformat}
> This task runs fine and gets scheduled on the right agent, which is running 
> mesos agent 1.7.3 (using the docker image, {{mesos/mesos-centos:1.7.x}}).
> Here is a sample log:
> {noformat}
> mesos-slave_1  | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing 
> task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-
> mesos-slave_1  | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching 
> task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-
> mesos-slave_1  | I1205 13:24:21.392895 19849 paths.cpp:748] Creating 
> sandbox 
> '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1  | I1205 13:24:21.394399 19849 paths.cpp:748] Creating 
> sandbox 
> '/var/lib/mesos/agent/meta/slaves/

[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns

2019-12-06 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989880#comment-16989880
 ] 

Andrei Budnik commented on MESOS-10066:
---

So the Docker socket is mounted from the host FS into the Docker container? I'm 
not sure Mesos supports such a configuration. Since mesos-docker-executor is 
launched in a separate Docker container, there is no way to establish a socket 
connection from one Docker container (where the agent runs) to another (where 
the executor runs). Is the executor's port 10.234.172.56:9899 exposed by the 
Docker container?

AFAIK, [Mesos mini|http://mesos.apache.org/blog/mesos-mini/] uses the 
Docker-in-Docker technique instead.
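
A rough way to test that (a sketch; the container name is a placeholder and the 
address/port are the ones quoted above) is a plain TCP connect from inside the 
agent container to the executor's libprocess address:

{noformat}
# Prints "reachable" if a TCP connection to the executor's libprocess port
# can be opened from inside the agent container.
docker exec <agent-container> bash -c \
  'timeout 3 bash -c "exec 3<>/dev/tcp/10.234.172.56/9899" && echo reachable || echo unreachable'
{noformat}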


[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns

2019-12-06 Thread Dalton Matos Coelho Barreto (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989821#comment-16989821
 ] 

Dalton Matos Coelho Barreto commented on MESOS-10066:
-

Yes. I used the same image that the agent uses as the argument to 
{{--docker_mesos_image}}. When using this option the tasks didn't even stay 
running, so I didn't try to restart the agent since no task was up.

Do you think those logs would be helpful? I could modify the agent to re-add 
this option and post the full logs here.
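
If it helps, a sketch of how those logs could be grabbed (the {{mesos-}} name 
prefix is the Docker containerizer's naming convention; the exact container 
name is a placeholder):

{noformat}
# List the executor containers started by the Docker containerizer, then
# dump the logs of the one that exited.
docker ps -a --filter "name=mesos-" --format "{{.Names}}\t{{.Status}}"
docker logs <exited-mesos-container>
{noformat}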

 

Thanks,


[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns

2019-12-06 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989808#comment-16989808
 ] 

Andrei Budnik commented on MESOS-10066:
---

Did you try to specify the {{--docker_mesos_image}} command-line option for the 
agent that runs inside the Docker container?
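
For reference, a rough sketch of what that can look like when the agent itself 
runs in Docker with the host's Docker socket and binary mounted (paths, 
addresses, and flag values below are assumptions based on this ticket's setup, 
not a verified configuration):

{noformat}
docker run --net=host --pid=host \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /usr/bin/docker:/usr/bin/docker \
  -v /var/lib/mesos/agent:/var/lib/mesos/agent \
  mesos/mesos-centos:1.7.x \
  mesos-agent --master=10.234.172.27:5050 \
              --work_dir=/var/lib/mesos/agent \
              --containerizers=docker,mesos \
              --docker_mesos_image=mesos/mesos-centos:1.7.x
{noformat}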


[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns

2019-12-06 Thread Dalton Matos Coelho Barreto (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989744#comment-16989744
 ] 

Dalton Matos Coelho Barreto commented on MESOS-10066:
-

Hello [~abudnik],

 

I have attached the raw logs. Let me know if you need anything.

 

Thanks,


[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns

2019-12-06 Thread Dalton Matos Coelho Barreto (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989732#comment-16989732
 ] 

Dalton Matos Coelho Barreto commented on MESOS-10066:
-

Hello [~abudnik],

Sure! I will re-run the tests and attach the logs from before and after the 
reboot here.




[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns

2019-12-06 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989728#comment-16989728
 ] 

Andrei Budnik commented on MESOS-10066:
---

Could you please attach full agent logs?


[jira] [Comment Edited] (MESOS-10047) Update the CPU subsystem in the cgroup isolator to set container's CPU resource limits

2019-12-06 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989493#comment-16989493
 ] 

Qian Zhang edited comment on MESOS-10047 at 12/6/19 8:05 AM:
-

RR: [https://reviews.apache.org/r/71886/]


was (Author: qianzhang):
[https://reviews.apache.org/r/71886/]

> Update the CPU subsystem in the cgroup isolator to set container's CPU 
> resource limits
> --
>
> Key: MESOS-10047
> URL: https://issues.apache.org/jira/browse/MESOS-10047
> Project: Mesos
>  Issue Type: Task
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)