Hey Vinod,

I've run into this a few more times and am struggling to understand why
it's happening. It only seems to affect some tasks.

From what I can tell, the path to the sandbox directory is scheduled for GC
after the executor finishes. The GC process then iterates over the
scheduled directories and works out which ones need to be cleaned up. Given
a path that is to be removed, it runs *os::rmdir()* on that directory. It
doesn't seem to do anything explicitly recursive (maybe I'm looking in the
wrong place), which is confusing given that GC works fine for some tasks
that have many levels of nested directories and files.

The log entry comes from this function:
https://github.com/apache/mesos/blob/master/src/slave/gc.cpp#L127-L160
which is where I can see the *os::rmdir()* call.

Any chance you could bring some clarity to how recursive directory deletes
happen? Assuming they do happen, the "Directory not empty" error is even
more frustrating, because the deletes are clearly not behaving correctly.
Perhaps an error is thrown while deleting the contents of the directory and
gets swallowed, so files still remain by the time the removal of the whole
sandbox is attempted, causing the "Directory not empty" error.
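
To make that theory concrete, here is a minimal sketch of the kind of
debugging helper I have in mind. It is NOT the stout or Mesos
implementation: it assumes C++17 std::filesystem rather than os::rmdir(),
and the names removeTreeVerbose and Failures are made up for illustration.
It removes a tree in post-order (children first, then the directory itself)
and records every path that fails along with its error, instead of
swallowing it.

```cpp
#include <filesystem>
#include <system_error>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

// One (path, error) pair per entry that could not be listed or removed.
using Failures = std::vector<std::pair<fs::path, std::error_code>>;

// Hypothetical helper (not part of Mesos or stout). Removes `dir` and
// everything beneath it in post-order, recording each failure rather
// than stopping at, or hiding, the first one.
inline void removeTreeVerbose(const fs::path& dir, Failures& failures)
{
  std::error_code ec;

  fs::directory_iterator it(dir, ec);
  if (ec) {
    failures.emplace_back(dir, ec);
    return;
  }

  // Snapshot the children first so that removing entries does not
  // interfere with the directory iteration.
  std::vector<fs::path> children;
  while (it != fs::directory_iterator()) {
    children.push_back(it->path());
    it.increment(ec);
    if (ec) {
      failures.emplace_back(dir, ec);
      break;
    }
  }

  for (const fs::path& child : children) {
    // symlink_status() does not follow symlinks, so a link that points
    // at a directory is removed as a link and never descended into.
    const fs::file_status status = fs::symlink_status(child, ec);
    if (ec) {
      failures.emplace_back(child, ec);
      continue;
    }

    if (fs::is_directory(status)) {
      removeTreeVerbose(child, failures);
    } else {
      fs::remove(child, ec);
      if (ec) {
        failures.emplace_back(child, ec);
      }
    }
  }

  // Post-order step: the directory itself goes last. If any child
  // survived above, this call fails because the directory is not empty.
  fs::remove(dir, ec);
  if (ec) {
    failures.emplace_back(dir, ec);
  }
}
```

Running something like this against a sandbox that GC failed on, and
printing each failing path with ec.message(), should show whether an
earlier error on a child is what leaves the parent non-empty.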

Appreciate any input!


On 8 September 2014 07:26, Tom Arnfeld <[email protected]> wrote:

> That's useful to know, thanks Vinod. I'll try and dig deeper.
>
>
> On Mon, Sep 8, 2014 at 5:33 AM, Vinod Kone <[email protected]> wrote:
>
>>
>> On Sat, Sep 6, 2014 at 8:23 AM, Tom Arnfeld <[email protected]> wrote:
>>
>>> If I try and manually remove the directory mentioned, it works fine. Is
>>> this a known issue, or should I do a little more debugging? I've not tried
>>> to reproduce it under specific conditions yet.
>>>
>>>
>> This is surprising. GC does a recursive directory removal (see
>> os::rmdir() in stout) using post-order traversal. Definitely some debugging
>> is in order to see which directory failed and why. Does your sandbox
>> contain any special files (other than directories and files) like mounts,
>> devices etc?
>>
>>
>>
>>>  As a side note, should mesos perhaps have some kind of retry mechanism
>>> for GC? Also, will GC still run for an executor if the slave restarts after
>>> an executor terminates but before the GC process runs?
>>>
>>
>> I don't know what the error was above but I doubt a retry would've helped
>> here. And yes GC runs for a terminated executor when slave restarts.
>>
>
>
