os::rmdir() does recursive deletion. See its implementation in stout. I would recommend patching that function to print more details (e.g., directory name and contents) to debug this.
@vinodkone

> On Sep 11, 2014, at 8:32 AM, Tom Arnfeld <[email protected]> wrote:
>
> Hey Vinod,
>
> So I've run into this a few more times and am struggling to understand why
> this is happening. It only seems to happen for some tasks.
>
> From what I can tell, the path to the sandbox directory is scheduled for GC
> after the executor finishes. This GC process then iterates over the scheduled
> directories and figures out what needs to be cleaned up. Given a path that is
> to be removed, it then runs os::rmdir() on that directory. It doesn't seem to
> do anything explicitly recursive (maybe I'm looking in the wrong place) which
> is quite confusing, given the GC process is working fine for some tasks which
> have many levels of nested directories and files.
>
> The log entry comes from this function:
> https://github.com/apache/mesos/blob/master/src/slave/gc.cpp#L127-L160 where
> I can see the os::rmdir() call.
>
> Any chance you could bring some clarity to how recursive directory deletes
> happen? Assuming they do happen, the "Directory not empty" error is even more
> frustrating, because they're clearly not behaving correctly. Perhaps an error
> is being thrown when deleting the contents of the directory and that is being
> swallowed, so files still remain by the time the whole sandbox removal is
> attempted, causing a "Directory is not empty".
>
> Appreciate any input!
>
>
>> On 8 September 2014 07:26, Tom Arnfeld <[email protected]> wrote:
>> That's useful to know, thanks Vinod. I'll try and dig deeper.
>>
>>
>>> On Mon, Sep 8, 2014 at 5:33 AM, Vinod Kone <[email protected]> wrote:
>>>
>>>> On Sat, Sep 6, 2014 at 8:23 AM, Tom Arnfeld <[email protected]> wrote:
>>>> If I try and manually remove the directory mentioned, it works fine. Is
>>>> this a known issue, or should I do a little more debugging? I've not tried
>>>> to reproduce it under specific conditions yet.
>>>
>>> This is surprising. GC does a recursive directory removal (see os::rmdir()
>>> in stout) using post-order traversal. Definitely some debugging is in order
>>> to see which directory failed and why. Does your sandbox contain any
>>> special files (other than directories and files) like mounts, devices etc?
>>>
>>>
>>>> As a side note, should Mesos perhaps have some kind of retry mechanism for
>>>> GC? Also, will GC still run for an executor if the slave restarts after an
>>>> executor terminates but before the GC process runs?
>>>
>>> I don't know what the error was above but I doubt a retry would've helped
>>> here. And yes GC runs for a terminated executor when slave restarts.

