[ 
https://issues.apache.org/jira/browse/MESOS-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965152#comment-16965152
 ] 

Greg Mann commented on MESOS-9977:
----------------------------------

I can think of two options for handling this:
1) Force the persistent volume removal by having the agent unset the immutable 
attribute
2) Fail the DESTROY operation

For persistent volumes, I think #2 makes more sense: it is the more
conservative choice, which seems prudent when the data may be critical.
Perhaps the agent could also log which files in the volume carry the
immutable attribute, so that the cause of the failure is visible.

[~kaysoky] you mentioned sandbox GC in the description as well - in this case, 
I might be OK with just forcing the directory removal by having the agent unset 
the immutable attribute.

> Agent does not check for immutable files while removing persistent volumes 
> (and possibly in other GC operations)
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9977
>                 URL: https://issues.apache.org/jira/browse/MESOS-9977
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.6.2, 1.7.2, 1.8.1, 1.9.0
>            Reporter: Joseph Wu
>            Priority: Major
>              Labels: foundations
>
> We observed an exit/crash loop on an agent originating from deleting a 
> persistent volume:
> {code}
> slave.cpp:4557] Deleting persistent volume '<UUID>' at 
> '/path/to/mesos/slave/volumes/roles/my-role/<UUID>'
> {code}
> This persistent volume happened to contain one or more files marked as
> {{immutable}}.
> When the agent tried to delete the volume via {{os::rmdir(...)}}, it
> encountered the immutable file(s) and exited with:
> {code}
> slave.cpp:4423] EXIT with status 1: Failed to sync checkpointed resources: 
> Failed to remove persistent volume '<UUID>' at 
> '/path/to/mesos/slave/volumes/roles/my-role/<UUID>': Operation not permitted
> {code}
> The agent would then be unable to start up again, because during recovery, 
> the agent would attempt to delete the same persistent volume and fail to do 
> so.
> Manually removing the immutable attribute from files within the persistent 
> volume allows the agent to recover:
> {code}
> chattr -R -i /path/to/mesos/slave/volumes/roles/my-role/<UUID>
> {code}
> Immutable attributes can be introduced by any task running on the agent:
> as long as the task has sufficient permissions, it can call
> {{chattr +i ...}}.  This attribute could also affect sandbox GC, which also
> uses {{os::rmdir}} to clean up.  However, sandbox GC tends to warn rather
> than exit on failure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
