[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995737#comment-16995737 ]

Andrei Budnik commented on MESOS-10066:
---

cc [~qianzhang]

> mesos-docker-executor process dies when agent stops. Recovery fails when
> agent returns
> --
>
>                 Key: MESOS-10066
>                 URL: https://issues.apache.org/jira/browse/MESOS-10066
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization, docker, executor
>    Affects Versions: 1.7.3
>            Reporter: Dalton Matos Coelho Barreto
>            Priority: Critical
>         Attachments: logs-after.txt, logs-before.txt
>
> Hello all,
>
> The documentation about Agent Recovery shows two conditions for recovery to be possible:
> - The agent must have recovery enabled (default true?);
> - The scheduler must register itself saying that it has checkpointing enabled.
>
> In my tests I'm using Marathon as the scheduler, and Mesos itself sees Marathon as a checkpoint-enabled scheduler:
> {noformat}
> $ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, "id": .id, "checkpoint": .checkpoint, "active": .active}'
> {
>   "name": "asgard-chronos",
>   "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001",
>   "checkpoint": true,
>   "active": true
> }
> {
>   "name": "marathon",
>   "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-",
>   "checkpoint": true,
>   "active": true
> }
> {noformat}
>
> Here is what I'm using:
> # Mesos Master 1.4.1
> # Mesos Agent 1.7.3
> # Using docker image {{mesos/mesos-centos:1.7.x}}
> # Docker sock mounted from the host
> # Docker binary also mounted from the host
> # Marathon 1.4.12
> # Docker:
> {noformat}
> Client: Docker Engine - Community
>  Version:           19.03.5
>  API version:       1.39 (downgraded from 1.40)
>  Go version:        go1.12.12
>  Git commit:        633a0ea838
>  Built:             Wed Nov 13 07:22:05 2019
>  OS/Arch:           linux/amd64
>  Experimental:      false
>
> Server: Docker Engine - Community
>  Engine:
>   Version:          18.09.2
>   API version:      1.39 (minimum version 1.12)
>   Go version:       go1.10.6
>   Git commit:       6247962
>   Built:            Sun Feb 10 03:42:13 2019
>   OS/Arch:          linux/amd64
>   Experimental:     false
> {noformat}
>
> h2. The problem
>
> Here is the Marathon test app, a simple {{sleep 99d}} based on the {{debian}} docker image:
> {noformat}
> {
>   "id": "/sleep",
>   "cmd": "sleep 99d",
>   "cpus": 0.1,
>   "mem": 128,
>   "disk": 0,
>   "instances": 1,
>   "constraints": [],
>   "acceptedResourceRoles": ["*"],
>   "container": {
>     "type": "DOCKER",
>     "volumes": [],
>     "docker": {
>       "image": "debian",
>       "network": "HOST",
>       "privileged": false,
>       "parameters": [],
>       "forcePullImage": true
>     }
>   },
>   "labels": {},
>   "portDefinitions": []
> }
> {noformat}
> This task runs fine and gets scheduled on the right agent, which is running mesos agent 1.7.3 (using the docker image {{mesos/mesos-centos:1.7.x}}).
>
> Here is a sample log:
> {noformat}
> mesos-slave_1 | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-
> mesos-slave_1 | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-
> mesos-slave_1 | I1205 13:24:21.392895 19849 paths.cpp:748] Creating sandbox '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1 | I1205 13:24:21.394399 19849 paths.cpp:748] Creating sandbox '/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1 | I1205 13:24:21.394918 19849 slave.cpp:9068] Launching executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7- with resources [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}] in work directory '/var/lib/mesos/agent/slaves/79a
> {noformat}
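The checkpoint check done with `curl | jq` above can also be done programmatically. A minimal Python sketch of the same filtering, run against a parsed `/state` response; the `sample` payload below only mirrors the shape of the output quoted in the ticket (against a live master you would fetch and JSON-decode `http://<master>:5050/state` instead):

```python
def checkpointed_frameworks(state):
    """Given a parsed Mesos master /state response (a dict), return
    (name, checkpoint) pairs for each registered framework."""
    return [(f["name"], bool(f.get("checkpoint", False)))
            for f in state.get("frameworks", [])]

# Offline demo with a payload shaped like the /state output quoted above.
sample = {"frameworks": [
    {"name": "asgard-chronos", "checkpoint": True, "active": True},
    {"name": "marathon", "checkpoint": True, "active": True},
]}
print(checkpointed_frameworks(sample))
# → [('asgard-chronos', True), ('marathon', True)]
```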
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995653#comment-16995653 ]

Dalton Matos Coelho Barreto commented on MESOS-10066:
---

Hello [~abudnik],

Any tips on this so I can debug further? Do you think we can ping anyone else to help me with this issue?

Thanks,
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16990017#comment-16990017 ]

Dalton Matos Coelho Barreto commented on MESOS-10066:
---

I see. In fact this could be a reason why the option {{--docker_mesos_image}} didn't work.

But focusing on the first try, where only the Mesos Agent runs in a docker container and the task stays running: everything works as expected until the Mesos Agent stops, and then the executor process also dies. Am I missing some configuration needed to make agent recovery work as expected? Can you tell anything by looking at the logs I attached here?

I think recovery should work transparently and be an easy (almost automatic) setup; that's why I suspect I may be missing something.

Thanks,
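One way to observe whether the executor containers themselves survive an agent restart is to compare `docker ps` output before and after the restart. A small helper that filters such output down to Mesos-launched containers (the docker containerizer names its containers with a `mesos-` prefix); the sample names are illustrative, not from this ticket:

```python
def mesos_containers(docker_ps_names):
    """Filter `docker ps --format '{{.Names}}'` output lines down to
    containers launched by the Mesos docker containerizer, whose
    names start with the 'mesos-' prefix."""
    return [n for n in docker_ps_names if n.startswith("mesos-")]

# Typical use (illustrative):
#   names = subprocess.run(["docker", "ps", "--format", "{{.Names}}"],
#                          capture_output=True, text=True).stdout.split()
sample = ["mesos-53ec0ef3-3290-476a-b2b6-385099e9b923", "marathon", "registry"]
print(mesos_containers(sample))
# → ['mesos-53ec0ef3-3290-476a-b2b6-385099e9b923']
```

Running this before stopping the agent and again after it returns shows directly whether the executor's container was killed or kept running through the restart.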
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989880#comment-16989880 ]

Andrei Budnik commented on MESOS-10066:
---

So the Docker socket is mounted from the host FS into the Docker container? I'm not sure Mesos supports such a configuration. Since mesos-docker-executor is launched in a separate Docker container, there is no way to establish a socket connection from one Docker container (where the agent runs) to another (where the executor runs). Is the executor's port 10.234.172.56:9899 exposed by the Docker container?

AFAIK, [Mesos mini|http://mesos.apache.org/blog/mesos-mini/] uses the Docker-in-Docker technique instead.
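Since the question above hinges on whether the executor's libprocess endpoint is reachable from inside the agent container, a plain TCP connect is a quick way to probe it. A minimal sketch; the host and port in the usage comment are the ones from the comment above and are examples only:

```python
import socket

def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds
    within `timeout` seconds, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. run inside the agent container against the executor's
# advertised libprocess address from the comment above:
#   tcp_reachable("10.234.172.56", 9899)
```

If this returns False from inside the agent container but True from the host, the executor's port is not exposed across the container boundary, which would explain the failed re-registration.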
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989821#comment-16989821 ]

Dalton Matos Coelho Barreto commented on MESOS-10066:
---

Yes. I used the same image that the agent uses as the argument to {{--docker_mesos_image}}. When using this option the tasks didn't even stay running, so I didn't try restarting the agent since no task was up.

Do you think those logs would be helpful? I could modify the agent to re-add this option and post the full logs here.

Thanks,
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989808#comment-16989808 ]

Andrei Budnik commented on MESOS-10066:
---

Did you try specifying the {{--docker_mesos_image}} command-line option for the agent that runs inside the Docker container?
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989744#comment-16989744 ]

Dalton Matos Coelho Barreto commented on MESOS-10066:
---

Hello [~abudnik],

I have attached the raw logs. Let me know if you need anything.

Thanks,
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989732#comment-16989732 ] Dalton Matos Coelho Barreto commented on MESOS-10066:
---
Hello [~abudnik],

Sure! I will re-run the tests and attach the logs from before and after the reboot here.

> mesos-docker-executor process dies when agent stops. Recovery fails when
> agent returns
> --
>
> Key: MESOS-10066
> URL: https://issues.apache.org/jira/browse/MESOS-10066
> Project: Mesos
> Issue Type: Bug
> Components: agent, containerization, docker, executor
> Affects Versions: 1.7.3
> Reporter: Dalton Matos Coelho Barreto
> Priority: Critical
>
> Hello all,
> The documentation about Agent Recovery shows two conditions for the recovery
> to be possible:
> - The agent must have recovery enabled (default true?);
> - The scheduler must register itself saying that it has checkpointing
> enabled.
> In my tests I'm using Marathon as the scheduler, and Mesos itself sees
> Marathon as a checkpoint-enabled scheduler:
> {noformat}
> $ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, "id": .id, "checkpoint": .checkpoint, "active": .active}'
> {
>   "name": "asgard-chronos",
>   "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001",
>   "checkpoint": true,
>   "active": true
> }
> {
>   "name": "marathon",
>   "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-",
>   "checkpoint": true,
>   "active": true
> }
> {noformat}
> Here is what I'm using:
> # Mesos Master: 1.4.1
> # Mesos Agent: 1.7.3
> # Using docker image {{mesos/mesos-centos:1.7.x}}
> # Docker sock mounted from the host
> # Docker binary also mounted from the host
> # Marathon: 1.4.12
> # Docker:
> {noformat}
> Client: Docker Engine - Community
>  Version:           19.03.5
>  API version:       1.39 (downgraded from 1.40)
>  Go version:        go1.12.12
>  Git commit:        633a0ea838
>  Built:             Wed Nov 13 07:22:05 2019
>  OS/Arch:           linux/amd64
>  Experimental:      false
>
> Server: Docker Engine - Community
>  Engine:
>   Version:          18.09.2
>   API version:      1.39 (minimum version 1.12)
>   Go version:       go1.10.6
>   Git commit:       6247962
>   Built:            Sun Feb 10 03:42:13 2019
>   OS/Arch:          linux/amd64
>   Experimental:     false
> {noformat}
> h2. The problem
> Here is the Marathon test app, a simple {{sleep 99d}} based on the {{debian}}
> Docker image:
> {noformat}
> {
>   "id": "/sleep",
>   "cmd": "sleep 99d",
>   "cpus": 0.1,
>   "mem": 128,
>   "disk": 0,
>   "instances": 1,
>   "constraints": [],
>   "acceptedResourceRoles": [
>     "*"
>   ],
>   "container": {
>     "type": "DOCKER",
>     "volumes": [],
>     "docker": {
>       "image": "debian",
>       "network": "HOST",
>       "privileged": false,
>       "parameters": [],
>       "forcePullImage": true
>     }
>   },
>   "labels": {},
>   "portDefinitions": []
> }
> {noformat}
> This task runs fine and gets scheduled on the right agent, which is running
> Mesos agent 1.7.3 (using the docker image {{mesos/mesos-centos:1.7.x}}).
> Here is a sample log:
> {noformat}
> mesos-slave_1 | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-
> mesos-slave_1 | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-
> mesos-slave_1 | I1205 13:24:21.392895 19849 paths.cpp:748] Creating sandbox '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1 | I1205 13:24:21.394399 19849 paths.cpp:748] Creating sandbox '/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1 | I1205 13:24:21.394918 19849 slave.cpp:9068] Launching executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7- with resources [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}]
> {noformat}
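The curl | jq check quoted above can also be mirrored in a short Python sketch for scripted monitoring. Everything here is illustrative rather than taken from the report: the payload is a hand-made stand-in for real `/state` output (trimmed to the fields the filter reads), and `checkpoint_summary` is a hypothetical helper, not a Mesos API.

```python
def checkpoint_summary(state):
    """Mimic the jq filter '.frameworks[] | {name, id, checkpoint, active}'
    over a decoded /state payload, returning one dict per framework."""
    return [
        {key: fw.get(key) for key in ("name", "id", "checkpoint", "active")}
        for fw in state.get("frameworks", [])
    ]

# Hand-made stand-in for a real /state response; not actual master output.
sample_state = {
    "frameworks": [
        {"name": "marathon",
         "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-",
         "checkpoint": True,
         "active": True,
         "tasks": []},  # extra fields are ignored by the filter
    ]
}

for fw in checkpoint_summary(sample_state):
    # Agent recovery only re-attaches executors of checkpointing frameworks,
    # so "checkpoint" must be true for the recovery path to be exercised.
    print(f"{fw['name']}: checkpoint={fw['checkpoint']}, active={fw['active']}")
```

In a real check the payload would come from `GET <master>:5050/state`, exactly as the curl command in the report does.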
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989728#comment-16989728 ] Andrei Budnik commented on MESOS-10066:
---
Could you please attach full agent logs?