[jira] [Updated] (MESOS-6302) Agent recovery can fail after nested containers are launched

2016-10-13 Thread Jie Yu (JIRA)

 [ https://issues.apache.org/jira/browse/MESOS-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Yu updated MESOS-6302:
--
Shepherd: Jie Yu

> Agent recovery can fail after nested containers are launched
> 
>
> Key: MESOS-6302
> URL: https://issues.apache.org/jira/browse/MESOS-6302
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Greg Mann
>Assignee: Gilbert Song
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 1.1.0
>
> Attachments: read_write_app.json
>
>
> After launching a nested container that used a Docker image, I restarted the
> agent running that task group and saw the following in the agent logs during
> recovery:
> {code}
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: 
> I1001 01:45:10.813596  4640 status_update_manager.cpp:203] Recovering status 
> update manager
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: 
> I1001 01:45:10.813622  4640 status_update_manager.cpp:211] Recovering 
> executor 'instance-testvolume.02c26bce-8778-11e6-9ff3-7a3cd7c1568e' of 
> framework 118ca38d-daee-4b2d-b584-b5581738a3dd-
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: 
> I1001 01:45:10.814249  4639 docker.cpp:745] Recovering Docker containers
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: 
> I1001 01:45:10.815294  4642 containerizer.cpp:581] Recovering containerizer
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: 
> Failed to perform recovery: Collect failed: Unable to list rootfses belonged 
> to container a7d576da-fd0f-4dc1-bd5a-6d0a93ac8a53: Unable to list the 
> container directory: Failed to opendir 
> '/var/lib/mesos/slave/provisioner/containers/a7d576da-fd0f-4dc1-bd5a-6d0a93ac8a53/backends':
>  No such file or directory
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: 
> To remedy this do as follows:
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: 
> Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]:   
>   This ensures agent doesn't recover old live executors.
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: 
> Step 2: Restart the agent.
> {code}
> The agent then continues to restart in this fashion. Attached is the Marathon
> app definition that I used to launch the task group.
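
For reference, a small, self-contained C++17 sketch (illustrative only; this is not Mesos source code and not the eventual fix) that walks the provisioner layout shown in the error above and reports any container directory lacking a 'backends' subdirectory, i.e. the state that the "Failed to opendir .../backends" message complains about. The default path is taken from the log line; everything else here is an assumption for illustration.
{code}
// Illustrative diagnostic only -- not Mesos source. It walks the provisioner
// layout that appears in the error above and prints any container directory
// that is missing its 'backends' subdirectory, which is the state the
// "Failed to opendir .../backends" message complains about.
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

int main(int argc, char** argv) {
  // Default path comes from the log line above; pass a different provisioner
  // directory as argv[1] if the agent uses a non-default work_dir.
  const fs::path provisionerDir =
      argc > 1 ? fs::path(argv[1])
               : fs::path("/var/lib/mesos/slave/provisioner/containers");

  std::error_code ec;
  for (const auto& entry : fs::directory_iterator(provisionerDir, ec)) {
    if (!entry.is_directory()) {
      continue;
    }

    if (!fs::exists(entry.path() / "backends")) {
      // A container directory in this state will make agent recovery fail
      // with the error shown above.
      std::cout << "missing backends dir: " << entry.path() << '\n';
    }
  }

  if (ec) {
    std::cerr << "could not read " << provisionerDir << ": " << ec.message()
              << '\n';
    return 1;
  }

  return 0;
}
{code}
Running something like this against the agent's work directory before restarting would show whether any provisioner container directories are in this state.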



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6302) Agent recovery can fail after nested containers are launched

2016-10-03 Thread Gilbert Song (JIRA)

 [ https://issues.apache.org/jira/browse/MESOS-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gilbert Song updated MESOS-6302:

  Sprint: Mesosphere Sprint 44
Story Points: 3
 Component/s: containerization



[jira] [Updated] (MESOS-6302) Agent recovery can fail after nested containers are launched

2016-09-30 Thread Greg Mann (JIRA)

 [ https://issues.apache.org/jira/browse/MESOS-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann updated MESOS-6302:
-
Attachment: read_write_app.json
