Hi Thomas,
The GC in the mesos slave works as follows:
--> Whenever an executor terminates, its sandbox directory is scheduled by
the slave for gc "--gc_delay" into the future.
--> However, the slave also periodically (every "--disk_watch_interval")
monitors the disk utilization and expedites the gc based on the usage.
For example, if gc_delay is 1 week and the current disk utilization is 80%,
then instead of waiting a week to gc a terminated executor's sandbox,
the slave gc'es it after 16.8 hours (= (1 - GC_DISK_HEADROOM - 0.8) *
7 days = 0.7 days). GC_DISK_HEADROOM is currently set to 0.1.
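To make the math concrete, here's a small standalone C++ sketch of that
calculation. It's a simplification with made-up names, not the actual slave
code; in particular the clamp at zero (immediate gc eligibility at or above
the headroom line) is my assumption:

    #include <algorithm>
    #include <iostream>

    // Hardcoded headroom, matching the slave's GC_DISK_HEADROOM.
    const double GC_DISK_HEADROOM = 0.1;

    // Effective gc delay in days, given the configured --gc_delay
    // (in days) and the current disk utilization (0.0 - 1.0).
    // NOTE: a simplified sketch, not the actual slave code.
    double effectiveGcDelay(double gcDelayDays, double diskUsage)
    {
      // Scale the configured delay by the remaining headroom. The
      // clamp at 0 means anything at or above (1 - headroom)
      // utilization becomes eligible for immediate gc.
      double scale = std::max(0.0, 1.0 - GC_DISK_HEADROOM - diskUsage);
      return scale * gcDelayDays;
    }

    int main()
    {
      // The example above: gc_delay of 1 week, 80% disk utilization.
      double days = effectiveGcDelay(7.0, 0.80);
      std::cout << days << " days = " << days * 24 << " hours\n";
      // Prints: 0.7 days = 16.8 hours
      return 0;
    }

So the closer the disk gets to the 90% line, the sooner sandboxes are
pruned; at or above it, they become eligible on the next disk watch tick.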
However, it might happen that executors are being launched (and sandboxes
created) at a very high rate. In that case the slave might not be able to
react quickly enough to gc sandboxes.
You could grep for "Current usage" in the slave log to see how the disk
utilization varies over time.
HTH,
On Thu, Dec 26, 2013 at 10:56 AM, Thomas Petr <[email protected]> wrote:
> Hi,
>
> We're running Mesos 0.14.0-rc4 on CentOS from the mesosphere repository.
> Last week we had an issue where the mesos-slave process died due to running
> out of disk space. [1]
>
> The mesos-slave usage docs mention the "[GC] delay may be shorter
> depending on the available disk usage." Does anyone have any insight into
> how the GC logic works? Is there a configurable threshold percentage or
> amount that will force it to clean up more often?
>
> If the mesos-slave process is going to die due to lack of disk space,
> would it make sense for it to attempt one last GC run before giving up?
>
> Thanks,
> Tom
>
>
> [1]
> Could not create logging file: No space left on device
> COULD NOT CREATE A LOGGINGFILE 20131221-120618.20562!
> F1221 12:06:18.978813 20567 paths.hpp:333] CHECK_SOME(mkdir): Failed to
> create executor directory
> '/usr/share/hubspot/mesos/slaves/201311111611-3792629514-5050-11268-18/frameworks/Singularity11/executors/singularity-ContactsHadoopDynamicListSegJobs-contacts-wal-dynamic-list-seg-refresher-1387627577839-1-littleslash-us_east_1e/runs/457a8df0-baa7-4d22-a5ac-ba5935ea6032':
> No space left on device
> *** Check failure stack trace: ***
> I1221 12:06:19.008946 20564 cgroups_isolator.cpp:1275] Successfully
> destroyed cgroup
> mesos/framework_Singularity11_executor_singularity-ContactsTasks-parallel-machines:6988:list-intersection-count:1387565552709-1387627447707-1-littleslash-us_east_1e_tag_fc028903-d303-468d-902a-dade8c22e206
> @ 0x7f2c806bcb5d google::LogMessage::Fail()
> @ 0x7f2c806c0b77 google::LogMessage::SendToLog()
> @ 0x7f2c806be9f9 google::LogMessage::Flush()
> @ 0x7f2c806becfd google::LogMessageFatal::~LogMessageFatal()
> @ 0x40f6cf _CheckSome::~_CheckSome()
> @ 0x7f2c804492e3 mesos::internal::slave::paths::createExecutorDirectory()
> @ 0x7f2c80418a6d mesos::internal::slave::Framework::launchExecutor()
> @ 0x7f2c80419dd3 mesos::internal::slave::Slave::_runTask()
> @ 0x7f2c8042d5d1 std::tr1::_Function_handler<>::_M_invoke()
> @ 0x7f2c805d3ae8 process::ProcessManager::resume()
> @ 0x7f2c805d3e8c process::schedule()
> @ 0x7f2c7fe41851 start_thread
> @ 0x7f2c7e78794d clone
>