I am running Mesos 0.24.

I am running some tasks in Marathon, and when they hit an OOM condition the
task is killed, which is expected. Then I get a bunch of errors along the
lines of "Failed to read" for 'memory.limit_in_bytes',
'memory.max_usage_in_bytes' and 'memory.stat'.
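
For anyone who wants to check the same thing on their own slave, here is a
minimal sketch that tests whether those three control files are still readable
for a given container cgroup. It assumes a cgroup v1 memory hierarchy; the
function name check_cgroup_files and the example path in the usage message are
hypothetical, not something taken from Mesos itself.

```python
import os
import sys

# The three cgroup v1 memory control files named in the errors.
CONTROL_FILES = (
    "memory.limit_in_bytes",
    "memory.max_usage_in_bytes",
    "memory.stat",
)


def check_cgroup_files(cgroup_dir):
    """Return {file name: first line of the file, or an 'unreadable: ...' note}.

    Hypothetical helper: it only tries to open each control file and does not
    talk to Mesos or Marathon in any way.
    """
    results = {}
    for name in CONTROL_FILES:
        path = os.path.join(cgroup_dir, name)
        try:
            with open(path) as f:
                results[name] = f.readline().strip()
        except OSError as e:
            # A missing directory or file usually means the cgroup is gone.
            results[name] = "unreadable: " + str(e)
    return results


if __name__ == "__main__":
    if len(sys.argv) > 1:
        for name, value in check_cgroup_files(sys.argv[1]).items():
            print(name, "->", value)
    else:
        # Assumed layout; the real location depends on how cgroups are mounted.
        print("usage: check.py /sys/fs/cgroup/memory/mesos/<container-id>")
```

If all three come back unreadable, the cgroup directory has probably already
been removed, which would at least be consistent with the "invalid cgroups"
messages.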

In addition, the task tries to restart but keeps failing.

A few notes: when the task fails, the sandbox becomes unavailable, making
troubleshooting difficult. When this has occurred before, it seemed the
only way to get things working was to stop the slave, clear out the tmp
directory, and start it again. I'd like to understand why my task won't get
moving again.

There are also lots of errors related to "failed to clean up isolator" and
invalid cgroups; I can get specific logs if people think they're needed. I am
thinking it's related to checkpointing or something like that, i.e. an
executor hit the OOM, got killed, and is trying to start back up, but
something isn't right?

I know this is a jumbled, unorganized question; I can provide logs if needed.
