[jira] [Commented] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-05-23 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16486761#comment-16486761
 ] 

Zhitao Li commented on MESOS-8830:
--

[~jieyu] I put up a patch in https://reviews.apache.org/r/67264/. Please let me 
know what you think. Thanks.

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which does not use any 
> container image thus running on host filesystem) saw its persistent volume 
> data got wiped out.
> Upon revisiting logs, we found following suspicious lines:
> {panel:title=log}
> {noformat}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44:
>  Directory not empty
> {noformat}
> {panel}
> (I can try to provide more logs, depending on how much local archive after 
> rotation has)
> This happened on a 1.3.1 agent although I suspect it's not local to that 
> version.
> The path 
> */var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume*
>  is a bind mount to a persistent volume. The fact that agent gc touched that 
> process makes me believe this is what triggered the data loss.

[jira] [Commented] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-05-15 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476604#comment-16476604
 ] 

Zhitao Li commented on MESOS-8830:
--

[~jieyu] Unfortunately I lost the environment on this issue.

Still, I'd like to pursue on the idea of `not follow bind mounts (just like not 
follow symlinks) when doing workdir gc`: Do you mean that we should just filter 
out any FTS node in 
[stout::rmdir|https://github.com/apache/mesos/blob/master/3rdparty/stout/include/stout/os/posix/rmdir.hpp#L43]?

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which does not use any 
> container image thus running on host filesystem) saw its persistent volume 
> data got wiped out.
> Upon revisiting logs, we found following suspicious lines:
> {panel:title=log}
> {noformat}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44:
>  Directory not empty
> {noformat}
> {panel}
> (I can try to provide more logs, depending on how much local archive after 
> rotation has)
> This happened on a 1.3.1 agent although I suspect it's not local to that 
> version.
> The path 
> 

[jira] [Commented] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-05-10 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470780#comment-16470780
 ] 

Jie Yu commented on MESOS-8830:
---

[~zhitao] can you grep for container id `904d8155-e4c3-43e3-bf01-85de6a702149` 
in the agent log?

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which does not use any 
> container image thus running on host filesystem) saw its persistent volume 
> data got wiped out.
> Upon revisiting logs, we found following suspicious lines:
> {panel:title=log}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44:
>  Directory not empty
> {panel}
> (I can try to provide more logs, depending on how much local archive after 
> rotation has)
> This happened on a 1.3.1 agent although I suspect it's not local to that 
> version.
> The path 
> */var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume*
>  is a bind mount to a persistent volume. The fact that agent gc touched that 
> process makes me believe this is what triggered the data loss.
> We had some misconfigurations on out 

[jira] [Commented] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-05-10 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470754#comment-16470754
 ] 

Jie Yu commented on MESOS-8830:
---

I think the problem might be:

1) the destroy of the orphan containers can be in parallel with old agent 
workdir gc because the agent recovery is blocked on the destroy of orphan 
containers
2) when the old agent's workdir is scheduled for gc, since the orphan container 
cleanup is not done yet, it's possible that bind mounted pv is deleted.

Can you provide more agent logs around containerizer recovery. I am curious why 
the destroy of the orphan container takes too long.

But nevertheless, this is an issue. one possible way is to not follow bind 
mounts (just like not follow symlinks) when doing workdir gc.

Another way is to block the agent recovery until all orphans are cleaned up. 
This used to be the case, but was changed due to MESOS-2367. Now i am thinking 
about that, orphans can hold resources (in this case, the pv). allocating those 
resources out before the cleanup is done sounds like not the right approach.

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which does not use any 
> container image thus running on host filesystem) saw its persistent volume 
> data got wiped out.
> Upon revisiting logs, we found following suspicious lines:
> {panel:title=log}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
> 

[jira] [Commented] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-05-10 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470562#comment-16470562
 ] 

Greg Mann commented on MESOS-8830:
--

cc [~gilbert] [~jieyu]

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which does not use any 
> container image thus running on host filesystem) saw its persistent volume 
> data got wiped out.
> Upon revisiting logs, we found following suspicious lines:
> {panel:title=log}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44:
>  Directory not empty
> {panel}
> (I can try to provide more logs, depending on how much local archive after 
> rotation has)
> This happened on a 1.3.1 agent although I suspect it's not local to that 
> version.
> The path 
> */var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume*
>  is a bind mount to a persistent volume. The fact that agent gc touched that 
> process makes me believe this is what triggered the data loss.
> We had some misconfigurations on out fleet, and I do not know whether the 
> previous slave (id-ed 

[jira] [Commented] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-04-26 Thread Chun-Hung Hsiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16454853#comment-16454853
 ] 

Chun-Hung Hsiao commented on MESOS-8830:


How do you restart the agent as a new one? Did you just remove the {{latest}} 
symlink in the meta dir, or did you remove the runtime dir as well?

When an agent is restarted as a new one, it goes through the runtime dir to 
discover existing containers, and check if there is a matching record in its 
checkpoint in the meta dir. In your case, since the agent is a new one, there 
will be no record at all, so all running containers discovered in the runtime 
dir will be considered as orphans, and the containerizer will destroy them and 
clean them up, which includes running cleanup for each isolator.

I'm suspecting that for some reason the containers occured in the log were 
still running and were not treated as orphaned containers. Could you verify if 
this is the case? You could look at the agent log and see if they have been 
cleaned up as orphaned containers during recovery.

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which does not use any 
> container image thus running on host filesystem) saw its persistent volume 
> data got wiped out.
> Upon revisiting logs, we found following suspicious lines:
> {panel:title=log}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
> 

[jira] [Commented] (MESOS-8830) Agent gc on old slave sandboxes could empty persistent volume data

2018-04-24 Thread Zhitao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450468#comment-16450468
 ] 

Zhitao Li commented on MESOS-8830:
--

Minor correction: I mistakenly thought the problematic path is a hard link 
while it's actually a bind mount.

> Agent gc on old slave sandboxes could empty persistent volume data
> --
>
> Key: MESOS-8830
> URL: https://issues.apache.org/jira/browse/MESOS-8830
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.3.1
>Reporter: Zhitao Li
>Priority: Blocker
>
> We had an issue in which custom Cassandra executors (which does not use any 
> container image thus running on host filesystem) saw its persistent volume 
> data got wiped out.
> Upon revisiting logs, we found following suspicious lines:
> {panel:title=log}
> I0424 02:06:11.716380 10980 slave.cpp:5723] Current disk usage 21.93%. Max 
> allowed age: 4.764742265646493days
> I0424 02:06:11.716883 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429704593days
> I0424 02:06:11.716943 10994 gc.cpp:170] Pruning directories with remaining 
> removal time 2.23508429587852days
> I0424 02:06:11.717183 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:11.727033 10994 gc.cpp:146] Deleted 
> '/var/lib/mesos/meta/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44'
> I0424 02:06:11.727094 10994 gc.cpp:133] Deleting 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44
> I0424 02:06:14.933104 10972 http.cpp:1115] HTTP GET for /slave(1)/state from 
> 127.0.0.1:53602 with User-Agent='Go-http-client/1.1'
> E0424 02:06:15.245652 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume:
>  Device or resource busy
> E0424 02:06:15.394328 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149:
>  Directory not empty
> E0424 02:06:15.394419 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs:
>  Directory not empty
> E0424 02:06:15.394459 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a:
>  Directory not empty
> E0424 02:06:15.394477 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors:
>  Directory not empty
> E0424 02:06:15.394511 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004:
>  Directory not empty
> E0424 02:06:15.394536 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks:
>  Directory not empty
> E0424 02:06:15.394556 10994 rmdir.hpp:81] Failed to delete directory 
> /var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44:
>  Directory not empty
> {panel}
> (I can try to provide more logs, depending on how much local archive after 
> rotation has)
> This happened on a 1.3.1 agent although I suspect it's not local to that 
> version.
> The path 
> */var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/var/lib/mesos/slaves/70279b87-553a-4213-a85b-46fdc191849d-S44/frameworks/63a90717-5df8-4f61-bf18-da20eb7a7999-0004/executors/node-5_executor__7e360c28-4138-4175-8999-ffcc5296c34a/runs/904d8155-e4c3-43e3-bf01-85de6a702149/volume*
>  is a bind mount to a persistent volume. The fact that agent gc touched that 
> process makes me believe this is what triggered the data loss.
> We had some misconfigurations on out