OK, here is something odd. My kernel is booted with
"cgroup_enable=memory swapaccount=1" in order to enable cgroup memory and swap accounting.
The log for starting a new container:
I1007 11:38:25.881882 3698 slave.cpp:1222] Queuing task
'1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1'
for executor
thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1
of framework '20140919-174559-16842879-5050-27194-0000
I1007 11:38:25.891448 3696 cpushare.cpp:338] Updated 'cpu.shares' to 1280
(cpus 1.25) for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:25.892354 3695 mem.cpp:479] Started listening for OOM events for
container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:25.894224 3695 mem.cpp:293] Updated 'memory.soft_limit_in_bytes'
to 628MB for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:25.897894 3695 mem.cpp:347] Updated 'memory.memsw.limit_in_bytes'
to 628MB for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:25.901499 3693 linux_launcher.cpp:191] Cloning child process with
flags = 0
I1007 11:38:25.982059 3693 containerizer.cpp:678] Checkpointing executor's
forked pid 3985 to
'/var/lib/mesos/meta/slaves/20141007-113221-16842879-5050-2279-0/frameworks/20140919-174559-16842879-5050-27194-0000/executors/thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1/runs/866af1d4-14df-4e55-be5d-a54e2a573cd7/pids/forked.pid'
I1007 11:38:26.170440 3696 containerizer.cpp:510] Fetching URIs for container
'866af1d4-14df-4e55-be5d-a54e2a573cd7' using command
'/usr/local/libexec/mesos/mesos-fetcher'
I1007 11:38:26.796327 3692 slave.cpp:2538] Monitoring executor
'thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1'
of framework '20140919-174559-16842879-5050-27194-0000' in container
'866af1d4-14df-4e55-be5d-a54e2a573cd7'
I1007 11:38:27.611901 3691 slave.cpp:1733] Got registration for executor
'thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1'
of framework 20140919-174559-16842879-5050-27194-0000 from
executor(1)@127.0.1.1:39709
I1007 11:38:27.612476 3691 slave.cpp:1819] Checkpointing executor pid
'executor(1)@127.0.1.1:39709' to
'/var/lib/mesos/meta/slaves/20141007-113221-16842879-5050-2279-0/frameworks/20140919-174559-16842879-5050-27194-0000/executors/thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1/runs/866af1d4-14df-4e55-be5d-a54e2a573cd7/pids/libprocess.pid'
I1007 11:38:27.614302 3691 slave.cpp:1853] Flushing queued task
1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1 for
executor
'thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1'
of framework 20140919-174559-16842879-5050-27194-0000
I1007 11:38:27.615567 3697 cpushare.cpp:338] Updated 'cpu.shares' to 1280
(cpus 1.25) for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:27.615622 3694 mem.cpp:293] Updated 'memory.soft_limit_in_bytes'
to 628MB for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:27.630520 3694 slave.cpp:2088] Handling status update
TASK_STARTING (UUID: 177f83dd-6669-4ead-8e42-95030e5723e4) for task
1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1 of
framework 20140919-174559-16842879-5050-27194-0000 from
executor(1)@127.0.1.1:39709
But when I inspect the container's cgroup, the limits are not set as
expected:
# cat 866af1d4-14df-4e55-be5d-a54e2a573cd7/memory.soft_limit_in_bytes
658505728
# cat 866af1d4-14df-4e55-be5d-a54e2a573cd7/memory.limit_in_bytes
9223372036854775807
# cat 866af1d4-14df-4e55-be5d-a54e2a573cd7/memory.memsw.limit_in_bytes
9223372036854775807
Shouldn't memory.limit_in_bytes and memory.memsw.limit_in_bytes be set as well?
The log above even claims that memory.memsw.limit_in_bytes was updated to
628MB, yet the cgroup still shows the unlimited default.
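As far as I understand the memory controller, memory.memsw.limit_in_bytes can
never be set below memory.limit_in_bytes, so the hard limit has to be written
first. This is roughly the manual equivalent of what I expected to happen
(assuming the memory hierarchy is mounted at /sys/fs/cgroup/memory and that the
slave parents its containers under a 'mesos' group):
CGROUP=/sys/fs/cgroup/memory/mesos/866af1d4-14df-4e55-be5d-a54e2a573cd7
# Hard limit first, then the combined memory+swap limit (628 MB = 658505728 bytes):
echo $((628 * 1024 * 1024)) > $CGROUP/memory.limit_in_bytes
echo $((628 * 1024 * 1024)) > $CGROUP/memory.memsw.limit_in_bytes
If memory.limit_in_bytes is never lowered, I would expect the kernel to reject
the memsw write, which might be what is happening here.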
Best Regards,
Stephan
On 06.10.2014 18:56, Stephan Erb wrote:
Hello,
I am still facing the same issue:
* My process keeps allocating memory until all available system
memory is used, but it is never killed. Its sandbox is limited to
x00 MB but it ends up using several GB.
* There is no OOM- or cgroup-related entry in dmesg (besides the
initialization, i.e., "Initializing cgroup subsys memory" ...)
* The slave log contains nothing suspicious (see the attached logfile)
Updating my Debian kernel from 3.2 to a backported 3.16 kernel did not
help. The system is more responsive under load, but the OOM killer is
still not triggered. I haven't tried running kernelshark on either of
these kernels yet.
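To take Mesos out of the picture entirely, one test I still want to try is
provoking a cgroup OOM by hand, roughly like this (the 'oomtest' group is just
a scratch cgroup, and the mount point is an assumption):
cd /sys/fs/cgroup/memory
mkdir oomtest
echo $((64 * 1024 * 1024)) > oomtest/memory.limit_in_bytes
echo $((64 * 1024 * 1024)) > oomtest/memory.memsw.limit_in_bytes
# Move the current shell into the test group, then allocate well past the limit:
echo $$ > oomtest/cgroup.procs
python -c 'x = "a" * (256 * 1024 * 1024)'
If that python process is not killed either, the problem would be in the kernel
or cgroup setup rather than in Mesos or Aurora.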
The slave command line I am using: /usr/local/sbin/mesos-slave
--master=zk://test-host:2181/mesos --log_dir=/var/log/mesos
--cgroups_limit_swap --isolation=cgroups/cpu,cgroups/mem
--work_dir=/var/lib/mesos --attributes=host:test-host;rack:unspecified
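One more thing worth ruling out is that the runaway process simply is not
inside the container's memory cgroup; a check along these lines should show its
membership ('thermos' is just my guess at how the executor shows up in the
process list):
cat /proc/$(pgrep -f thermos | head -n1)/cgroup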
Any more ideas?
Thanks,
Stephan
On 27.09.2014 19:34, CCAAT wrote:
On 09/26/14 06:20, Stephan Erb wrote:
Hi everyone,
I am having issues with the cgroups isolation of Mesos. It seems like
tasks are not prevented from allocating more memory than their limit,
and they are never killed.
I am running Aurora and Mesos 0.20.1 using the cgroups isolation on
Debian 7 (kernel 3.2.60-1+deb7u3).
Maybe a newer kernel would help? I've poked around for suggestions on
kernel configuration for servers running Mesos, but nobody is talking
about how they "tweak" their kernel settings yet.
Here's a good article on default shared memory limits:
http://lwn.net/Articles/595638/
Also, I'm not sure whether the OOM killer acts on kernel-space problems
where memory is grabbed up continuously by the kernel; that may not
even be your problem. I do know the OOM killer works on userspace
memory problems.
Kernelshark is your friend....
hth,
James