[jira] [Assigned] (MESOS-6568) JSON serialization should not omit empty arrays in HTTP APIs
[ https://issues.apache.org/jira/browse/MESOS-6568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler reassigned MESOS-6568:
--------------------------------------

    Assignee: Benjamin Mahler

> JSON serialization should not omit empty arrays in HTTP APIs
>
>
>                 Key: MESOS-6568
>                 URL: https://issues.apache.org/jira/browse/MESOS-6568
>             Project: Mesos
>          Issue Type: Improvement
>          Components: HTTP API
>            Reporter: Neil Conway
>            Assignee: Benjamin Mahler
>            Priority: Major
>              Labels: mesosphere
>
> When using the JSON content type with the HTTP APIs, an empty {{repeated}} protobuf
> field is omitted entirely from the JSON serialization of the message. For
> example, this is a response to the {{GetTasks}} call:
> {noformat}
> {
>   "get_tasks": {
>     "tasks": [{...}]
>   },
>   "type": "GET_TASKS"
> }
> {noformat}
> I think it would be better to include empty arrays for the other fields of
> the message ({{pending_tasks}}, {{completed_tasks}}, etc.). Advantages:
> # Consistency with the old HTTP endpoints, e.g., /state
> # Semantically, an empty array is more accurate. The master's response should
> be interpreted as saying it doesn't know about any pending/completed tasks;
> that is more accurately conveyed by explicitly including an empty array, not
> by omitting the key entirely.
> *NOTE: The
> [asV1Protobuf|https://github.com/apache/mesos/blob/d10a33acc426dda9e34db995f16450faf898bb3b/src/common/http.cpp#L172-L423]
> copy also needs to be updated.*

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
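For illustration, the proposed behavior would serialize the same {{GetTasks}} response roughly as below; only the field names named in the issue text are shown, and any other {{repeated}} fields of the message would be handled the same way:

{noformat}
{
  "get_tasks": {
    "tasks": [{...}],
    "pending_tasks": [],
    "completed_tasks": []
  },
  "type": "GET_TASKS"
}
{noformat}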
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990017#comment-16990017 ]

Dalton Matos Coelho Barreto commented on MESOS-10066:
-----------------------------------------------------

I see. In fact, this could be the reason why the {{--docker_mesos_image}} option didn't work.

But let's focus on the first try, where only the Mesos Agent is running in a docker container and the task stays running: everything works as expected until the Mesos Agent stops, and then the executor process dies as well.

Am I missing some configuration needed to make agent recovery work as expected? Can you tell anything from the logs I attached here?

I think recovery should work transparently and should be an easy (almost automatic) setup; that's why I suspect I am missing something.

Thanks,

> mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
> ---------------------------------------------------------------------------------------
>
>                 Key: MESOS-10066
>                 URL: https://issues.apache.org/jira/browse/MESOS-10066
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization, docker, executor
>    Affects Versions: 1.7.3
>            Reporter: Dalton Matos Coelho Barreto
>            Priority: Critical
>         Attachments: logs-after.txt, logs-before.txt
>
>
> Hello all,
> The documentation about Agent Recovery shows two conditions for recovery to be possible:
> - The agent must have recovery enabled (default true?);
> - The scheduler must register itself saying that it has checkpointing enabled.
> In my tests I'm using Marathon as the scheduler, and Mesos itself sees Marathon as a checkpoint-enabled scheduler:
> {noformat}
> $ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, "id": .id, "checkpoint": .checkpoint, "active": .active}'
> {
>   "name": "asgard-chronos",
>   "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001",
>   "checkpoint": true,
>   "active": true
> }
> {
>   "name": "marathon",
>   "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-",
>   "checkpoint": true,
>   "active": true
> }
> {noformat}
> Here is what I'm using:
> # Mesos Master, 1.4.1
> # Mesos Agent 1.7.3
> # Using docker image {{mesos/mesos-centos:1.7.x}}
> # Docker sock mounted from the host
> # Docker binary also mounted from the host
> # Marathon: 1.4.12
> # Docker
> {noformat}
> Client: Docker Engine - Community
>  Version: 19.03.5
>  API version: 1.39 (downgraded from 1.40)
>  Go version: go1.12.12
>  Git commit: 633a0ea838
>  Built: Wed Nov 13 07:22:05 2019
>  OS/Arch: linux/amd64
>  Experimental: false
>
> Server: Docker Engine - Community
>  Engine:
>   Version: 18.09.2
>   API version: 1.39 (minimum version 1.12)
>   Go version: go1.10.6
>   Git commit: 6247962
>   Built: Sun Feb 10 03:42:13 2019
>   OS/Arch: linux/amd64
>   Experimental: false
> {noformat}
> h2. The problem
> Here is the Marathon test app, a simple {{sleep 99d}} based on the {{debian}} docker image.
> {noformat}
> {
>   "id": "/sleep",
>   "cmd": "sleep 99d",
>   "cpus": 0.1,
>   "mem": 128,
>   "disk": 0,
>   "instances": 1,
>   "constraints": [],
>   "acceptedResourceRoles": [
>     "*"
>   ],
>   "container": {
>     "type": "DOCKER",
>     "volumes": [],
>     "docker": {
>       "image": "debian",
>       "network": "HOST",
>       "privileged": false,
>       "parameters": [],
>       "forcePullImage": true
>     }
>   },
>   "labels": {},
>   "portDefinitions": []
> }
> {noformat}
> This task runs fine and gets scheduled on the right agent, which is running mesos agent 1.7.3 (using the docker image {{mesos/mesos-centos:1.7.x}}).
> Here is a sample log:
> {noformat}
> mesos-slave_1 | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-
> mesos-slave_1 | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-
> mesos-slave_1 | I1205 13:24:21.392895 19849 paths.cpp:748] Creating sandbox '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1 | I1205 13:24:21.394399 19849 paths.cpp:748] Creating sandbox '/var/lib/mesos/agent/meta/slaves/
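For context, a minimal sketch of how an agent container of this kind is typically started, assuming conventional mount paths and flag values; the exact command used in this setup is not part of the report:

{noformat}
# Illustrative only: host docker socket and binary mounted in, agent work_dir
# persisted on the host, docker containerizer enabled, default recovery mode.
docker run --net=host \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /usr/bin/docker:/usr/bin/docker \
  -v /var/lib/mesos/agent:/var/lib/mesos/agent \
  mesos/mesos-centos:1.7.x \
  mesos-agent --master=10.234.172.27:5050 \
              --work_dir=/var/lib/mesos/agent \
              --containerizers=docker,mesos \
              --recover=reconnect
{noformat}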
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989880#comment-16989880 ]

Andrei Budnik commented on MESOS-10066:
---------------------------------------

So the Docker socket is mounted from the host FS into the Docker container? I'm not sure if Mesos supports such a configuration. Since mesos-docker-executor is launched in a separate Docker container, there is no way to establish a socket connection from one Docker container (where the agent runs) to another (where the executor runs). Is the executor's port 10.234.172.56:9899 exposed by the Docker container?

AFAIK, [Mesos mini|http://mesos.apache.org/blog/mesos-mini/] uses the Docker-in-Docker technique instead.
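To verify the point above, a quick, illustrative check of whether the agent container can open a TCP connection to the executor's advertised libprocess address; the container name and tooling here are assumptions, not taken from the report:

{noformat}
# Run from the host; "mesos-slave_1" is the assumed name of the agent container.
# Uses bash's /dev/tcp so no extra tools are needed inside the image.
docker exec mesos-slave_1 bash -c \
  'timeout 2 bash -c "</dev/tcp/10.234.172.56/9899" && echo reachable || echo unreachable'
{noformat}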
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989821#comment-16989821 ]

Dalton Matos Coelho Barreto commented on MESOS-10066:
-----------------------------------------------------

Yes. I used the same image that the agent uses as the argument to {{--docker_mesos_image}}. When using this option the tasks didn't even stay running, so I didn't try to restart the agent since no task was up.

Do you think those logs would be helpful? I could modify the agent to re-add this option and post the full logs here.

Thanks,
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989808#comment-16989808 ]

Andrei Budnik commented on MESOS-10066:
---------------------------------------

Did you try to specify the {{--docker_mesos_image}} command-line option for the agent that runs inside the Docker container?
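For reference, an illustrative sketch of the flag being asked about: the agent would be pointed at the same image it itself runs from, so the Docker containerizer can launch mesos-docker-executor inside a container and recover it after an agent restart. The exact invocation below is an assumption, not taken from the report:

{noformat}
mesos-agent --master=10.234.172.27:5050 \
            --work_dir=/var/lib/mesos/agent \
            --containerizers=docker,mesos \
            --docker_mesos_image=mesos/mesos-centos:1.7.x
{noformat}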
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989744#comment-16989744 ]

Dalton Matos Coelho Barreto commented on MESOS-10066:
-----------------------------------------------------

Hello [~abudnik],

I have attached the raw logs. Let me know if you need anything.

Thanks,
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989732#comment-16989732 ]

Dalton Matos Coelho Barreto commented on MESOS-10066:
-----------------------------------------------------

Hello [~abudnik],

Sure! I will re-run the tests and attach the logs from before and after the reboot here.
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989728#comment-16989728 ]

Andrei Budnik commented on MESOS-10066:
---------------------------------------

Could you please attach full agent logs?
[jira] [Comment Edited] (MESOS-10047) Update the CPU subsystem in the cgroup isolator to set container's CPU resource limits
[ https://issues.apache.org/jira/browse/MESOS-10047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989493#comment-16989493 ]

Qian Zhang edited comment on MESOS-10047 at 12/6/19 8:05 AM:
-------------------------------------------------------------

RR: [https://reviews.apache.org/r/71886/]

was (Author: qianzhang):
[https://reviews.apache.org/r/71886/]

> Update the CPU subsystem in the cgroup isolator to set container's CPU resource limits
> ---------------------------------------------------------------------------------------
>
>                 Key: MESOS-10047
>                 URL: https://issues.apache.org/jira/browse/MESOS-10047
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Qian Zhang
>            Assignee: Qian Zhang
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
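For background, setting a hard CPU limit through the cgroups v1 {{cpu}} subsystem normally means writing a CFS quota relative to the scheduling period. The sketch below shows that standard mechanism with an assumed cgroup path; it is not the patch under review:

{noformat}
# A limit of 0.5 CPUs with the default 100ms CFS period:
#   cpu.cfs_quota_us = 0.5 * cpu.cfs_period_us = 50000
echo 100000 > /sys/fs/cgroup/cpu/mesos/<container-id>/cpu.cfs_period_us
echo 50000  > /sys/fs/cgroup/cpu/mesos/<container-id>/cpu.cfs_quota_us
{noformat}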