[jira] [Updated] (MESOS-6302) Agent recovery can fail after nested containers are launched
[ https://issues.apache.org/jira/browse/MESOS-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Yu updated MESOS-6302:
--------------------------
    Shepherd: Jie Yu

> Agent recovery can fail after nested containers are launched
> -------------------------------------------------------------
>
>                 Key: MESOS-6302
>                 URL: https://issues.apache.org/jira/browse/MESOS-6302
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Greg Mann
>            Assignee: Gilbert Song
>            Priority: Blocker
>              Labels: mesosphere
>             Fix For: 1.1.0
>
>         Attachments: read_write_app.json
>
>
> After launching a nested container which used a Docker image, I restarted the
> agent which ran that task group and saw the following in the agent logs
> during recovery:
> {code}
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.813596 4640 status_update_manager.cpp:203] Recovering status update manager
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.813622 4640 status_update_manager.cpp:211] Recovering executor 'instance-testvolume.02c26bce-8778-11e6-9ff3-7a3cd7c1568e' of framework 118ca38d-daee-4b2d-b584-b5581738a3dd-
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.814249 4639 docker.cpp:745] Recovering Docker containers
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.815294 4642 containerizer.cpp:581] Recovering containerizer
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: Failed to perform recovery: Collect failed: Unable to list rootfses belonged to container a7d576da-fd0f-4dc1-bd5a-6d0a93ac8a53: Unable to list the container directory: Failed to opendir '/var/lib/mesos/slave/provisioner/containers/a7d576da-fd0f-4dc1-bd5a-6d0a93ac8a53/backends': No such file or directory
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: To remedy this do as follows:
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: This ensures agent doesn't recover old live executors.
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: Step 2: Restart the agent.
> {code}
> and the agent continues to restart in this fashion. Attached is the Marathon
> app definition that I used to launch the task group.
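The fatal error above comes from the provisioner's recovery path assuming that every checkpointed container has a 'backends' directory under /var/lib/mesos/slave/provisioner/containers/<id>/; the log shows a container for which that directory was never created. Below is a minimal C++ sketch of the kind of defensive check that would avoid the fatal opendir: os::exists and path::join are real stout helpers used throughout Mesos, but the function name, signature, and surrounding control flow are illustrative assumptions, not the actual patch.

{code}
// Hypothetical sketch, not the actual MESOS-6302 fix: skip containers
// that were never provisioned with a rootfs instead of failing the
// whole agent recovery.
#include <string>

#include <stout/os.hpp>    // os::exists (real stout helper)
#include <stout/path.hpp>  // path::join (real stout helper)

void recoverContainer(
    const std::string& provisionerDir,
    const std::string& containerId)
{
  const std::string backendsDir = path::join(
      provisionerDir, "containers", containerId, "backends");

  if (!os::exists(backendsDir)) {
    // A container launched without its own image has no provisioned
    // rootfs to recover; returning early here avoids the opendir
    // failure seen in the log above.
    return;
  }

  // ... list the rootfses under 'backendsDir' as before ...
}
{code}

Note that the Step 1/Step 2 workaround the agent prints is destructive: deleting /var/lib/mesos/slave/meta/slaves/latest discards the checkpointed agent state, so the agent starts over as if new and, as the log itself notes, does not recover the previously live executors.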
[jira] [Updated] (MESOS-6302) Agent recovery can fail after nested containers are launched
[ https://issues.apache.org/jira/browse/MESOS-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gilbert Song updated MESOS-6302:
--------------------------------
        Sprint: Mesosphere Sprint 44
  Story Points: 3
   Component/s: containerization
[jira] [Updated] (MESOS-6302) Agent recovery can fail after nested containers are launched
[ https://issues.apache.org/jira/browse/MESOS-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann updated MESOS-6302:
-----------------------------
    Attachment: read_write_app.json

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)