OK, here is something odd. My kernel is booted with
"cgroup_enable=memory swapaccount=1" in order to enable cgroup memory and swap accounting.
The log for starting a new container:
I1007 11:38:25.881882 3698 slave.cpp:1222] Queuing task
'1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1'
for executor
thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1
of framework '20140919-174559-16842879-5050-27194-0000
I1007 11:38:25.891448 3696 cpushare.cpp:338] Updated 'cpu.shares' to 1280
(cpus 1.25) for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:25.892354 3695 mem.cpp:479] Started listening for OOM events for
container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:25.894224 3695 mem.cpp:293] Updated 'memory.soft_limit_in_bytes'
to 628MB for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:25.897894 3695 mem.cpp:347] Updated 'memory.memsw.limit_in_bytes'
to 628MB for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:25.901499 3693 linux_launcher.cpp:191] Cloning child process with
flags = 0
I1007 11:38:25.982059 3693 containerizer.cpp:678] Checkpointing executor's
forked pid 3985 to
'/var/lib/mesos/meta/slaves/20141007-113221-16842879-5050-2279-0/frameworks/20140919-174559-16842879-5050-27194-0000/executors/thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1/runs/866af1d4-14df-4e55-be5d-a54e2a573cd7/pids/forked.pid'
I1007 11:38:26.170440 3696 containerizer.cpp:510] Fetching URIs for container
'866af1d4-14df-4e55-be5d-a54e2a573cd7' using command
'/usr/local/libexec/mesos/mesos-fetcher'
I1007 11:38:26.796327 3692 slave.cpp:2538] Monitoring executor
'thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1'
of framework '20140919-174559-16842879-5050-27194-0000' in container
'866af1d4-14df-4e55-be5d-a54e2a573cd7'
I1007 11:38:27.611901 3691 slave.cpp:1733] Got registration for executor
'thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1'
of framework 20140919-174559-16842879-5050-27194-0000 from
executor(1)@127.0.1.1:39709
I1007 11:38:27.612476 3691 slave.cpp:1819] Checkpointing executor pid
'executor(1)@127.0.1.1:39709' to
'/var/lib/mesos/meta/slaves/20141007-113221-16842879-5050-2279-0/frameworks/20140919-174559-16842879-5050-27194-0000/executors/thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1/runs/866af1d4-14df-4e55-be5d-a54e2a573cd7/pids/libprocess.pid'
I1007 11:38:27.614302 3691 slave.cpp:1853] Flushing queued task
1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1 for
executor
'thermos-1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1'
of framework 20140919-174559-16842879-5050-27194-0000
I1007 11:38:27.615567 3697 cpushare.cpp:338] Updated 'cpu.shares' to 1280
(cpus 1.25) for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:27.615622 3694 mem.cpp:293] Updated 'memory.soft_limit_in_bytes'
to 628MB for container 866af1d4-14df-4e55-be5d-a54e2a573cd7
I1007 11:38:27.630520 3694 slave.cpp:2088] Handling status update
TASK_STARTING (UUID: 177f83dd-6669-4ead-8e42-95030e5723e4) for task
1412674695525-www-data-test-ipython-1-1ecf0bba-6989-4b5c-b800-717914b57dd1 of
framework 20140919-174559-16842879-5050-27194-0000 from
executor(1)@127.0.1.1:39709
But when I inspect the container's cgroup, the limits are not set as
expected:
# cat 866af1d4-14df-4e55-be5d-a54e2a573cd7/memory.soft_limit_in_bytes
658505728
# cat 866af1d4-14df-4e55-be5d-a54e2a573cd7/memory.limit_in_bytes
9223372036854775807
# cat 866af1d4-14df-4e55-be5d-a54e2a573cd7/memory.memsw.limit_in_bytes
9223372036854775807
Shouldn't memory.limit_in_bytes and memory.memsw.limit_in_bytes be set as well?
The log above even claims that memory.memsw.limit_in_bytes was updated to
628MB, yet the cgroup still shows the unlimited default.
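As far as I understand the memory controller, memory.memsw.limit_in_bytes can
never be set below memory.limit_in_bytes, so the hard limit has to be written
first. This is roughly the manual equivalent of what I expected to happen
(assuming the memory hierarchy is mounted at /sys/fs/cgroup/memory and that the
slave parents its containers under a 'mesos' group):
CGROUP=/sys/fs/cgroup/memory/mesos/866af1d4-14df-4e55-be5d-a54e2a573cd7
# Hard limit first, then the combined memory+swap limit (628 MB = 658505728 bytes):
echo $((628 * 1024 * 1024)) > $CGROUP/memory.limit_in_bytes
echo $((628 * 1024 * 1024)) > $CGROUP/memory.memsw.limit_in_bytes
If memory.limit_in_bytes is never lowered, I would expect the kernel to reject
the memsw write, which might be what is happening here.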
Best Regards,
Stephan
On 06.10.2014 18:56, Stephan Erb wrote:
Hello,
I am still facing the same issue:
* My process keeps allocating memory until all available system
memory is used, but it is never killed. Its sandbox is limited to
x00 MB but it ends up using several GB.
* There is no OOM- or cgroup-related entry in dmesg (besides the
initialization, i.e., "Initializing cgroup subsys memory" ...)
* The slave log contains nothing suspicious (see the attached logfile)
Updating my Debian kernel from 3.2 to a backported 3.16 kernel did not
help. The system is more responsive under load, but the OOM killer is
still not triggered. I haven't tried running kernelshark on either of
these kernels yet.
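To take Mesos out of the picture entirely, one test I still want to try is
provoking a cgroup OOM by hand, roughly like this (the 'oomtest' group is just
a scratch cgroup, and the mount point is an assumption):
cd /sys/fs/cgroup/memory
mkdir oomtest
echo $((64 * 1024 * 1024)) > oomtest/memory.limit_in_bytes
echo $((64 * 1024 * 1024)) > oomtest/memory.memsw.limit_in_bytes
# Move the current shell into the test group, then allocate well past the limit:
echo $$ > oomtest/cgroup.procs
python -c 'x = "a" * (256 * 1024 * 1024)'
If that python process is not killed either, the problem would be in the kernel
or cgroup setup rather than in Mesos or Aurora.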
The slave command line I am using: /usr/local/sbin/mesos-slave
--master=zk://test-host:2181/mesos --log_dir=/var/log/mesos
--cgroups_limit_swap --isolation=cgroups/cpu,cgroups/mem
--work_dir=/var/lib/mesos --attributes=host:test-host;rack:unspecified
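One more thing worth ruling out is that the runaway process simply is not
inside the container's memory cgroup; a check along these lines should show its
membership ('thermos' is just my guess at how the executor shows up in the
process list):
cat /proc/$(pgrep -f thermos | head -n1)/cgroup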
Any more ideas?
Thanks,
Stephan
On 27.09.2014 19:34, CCAAT wrote:
On 09/26/14 06:20, Stephan Erb wrote:
Hi everyone,
I am having issues with the cgroups isolation of Mesos. It seems like
tasks are not prevented from allocating more memory than their limit,
and they are never killed.
I am running Aurora and Mesos 0.20.1 using the cgroups isolation on
Debian 7 (kernel 3.2.60-1+deb7u3).
Maybe a newer kernel would help? I've poked around for suggestions on
kernel configuration for servers running Mesos, but nobody is talking
about how they "tweak" their kernel settings yet.
Here's a good article on default shared memory limits:
http://lwn.net/Articles/595638/
Also, I'm not sure whether the OOM killer acts on kernel-space problems
where memory is grabbed up continuously by the kernel; that may not
even be your problem. I do know the OOM killer works on userspace
memory problems.
Kernelshark is your friend....
hth,
James