Hello,

I am still facing the same issue:

 * My process keeps allocating memory until all available system memory
   is used, but it is never killed. Its sandbox is limited to x00 MB
   but it ends up using several GB.
 * There is no OOM or cgroup related entry in dmesg (besides the
   initialization, i.e., "Initializing cgroup subsys memory"...)
 * The slave log contains nothing suspicious (see the attached logfile)

Upgrading my Debian kernel from 3.2 to a backported 3.16 kernel did not help. The system is more responsive under load, but the OOM killer is still not triggered. I have not yet tried running kernelshark on either of these kernels.

The slave command line I use:

/usr/local/sbin/mesos-slave --master=zk://test-host:2181/mesos --log_dir=/var/log/mesos --cgroups_limit_swap --isolation=cgroups/cpu,cgroups/mem --work_dir=/var/lib/mesos --attributes=host:test-host;rack:unspecified
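
One way to narrow this down is to check which limits are actually set on the container's memory cgroup and whether the runaway process is a member of that cgroup at all. The following is only a minimal sketch, not something taken from Mesos itself. It assumes a cgroup v1 memory hierarchy; the directory in the example comment (/sys/fs/cgroup/memory/mesos/<container-id>) is an assumption based on the usual default layout and is not confirmed by the log. The helper names are hypothetical.

```python
import os

# cgroup v1 memory controller files: three limits and the current usage.
MEMORY_FILES = (
    "memory.limit_in_bytes",
    "memory.soft_limit_in_bytes",
    "memory.memsw.limit_in_bytes",
    "memory.usage_in_bytes",
)


def read_memory_settings(cgroup_dir):
    """Return {file name: value in bytes, or None if missing or unreadable}."""
    settings = {}
    for name in MEMORY_FILES:
        try:
            with open(os.path.join(cgroup_dir, name)) as f:
                settings[name] = int(f.read().strip())
        except (OSError, ValueError):
            settings[name] = None
    return settings


def pid_in_cgroup(cgroup_dir, pid):
    """Return True if pid is listed in the cgroup's v1 'tasks' file."""
    try:
        with open(os.path.join(cgroup_dir, "tasks")) as f:
            return str(pid) in f.read().split()
    except OSError:
        return False


# Example call (the path is an assumption, adjust it to the local mount point):
#   read_memory_settings("/sys/fs/cgroup/memory/mesos/<container-id>")
```

If the hard limit turns out to be unset, or the process is missing from the tasks file, the kernel has no reason to trigger an OOM kill for that cgroup.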

Any more ideas?

Thanks,
Stephan


On 27.09.2014 19:34, CCAAT wrote:
On 09/26/14 06:20, Stephan Erb wrote:
Hi everyone,

I am having issues with the cgroups isolation of Mesos. It seems like
tasks are prevented from allocating more memory than their limit.
However, they are never killed.

I am running Aurora and Mesos 0.20.1 using the cgroups isolation on
Debian 7 (kernel 3.2.60-1+deb7u3).


Maybe a newer kernel might help? I've poked around for some suggestions on the kernel-configuration file for servers running mesos, but nobody is talking about how they "tweak" their kernel settings, yet.

Here's a good article on default shared memory limits:
http://lwn.net/Articles/595638/


Also, I'm not sure if OOM-Killer works on kernel space problems
where memory is grabbed up continuously by the kernel. That may
not even be your problem. I know OOM-killer works on userspace
memory problems.

Kernelshark is your friend....

hth,
James







Log file created at: 2014/10/06 16:58:15
Running on machine: test-host
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1006 16:58:15.520334  2266 logging.cpp:142] INFO level logging started!
I1006 16:58:15.522333  2266 main.cpp:126] Build: 2014-09-23 05:35:41 by root
I1006 16:58:15.522378  2266 main.cpp:128] Version: 0.20.1
I1006 16:58:15.522400  2266 main.cpp:131] Git tag: 0.20.1
I1006 16:58:15.522420  2266 main.cpp:135] Git SHA: fe0a39112f3304283f970f1b08b322b1e970829d
I1006 16:58:15.524052  2266 containerizer.cpp:89] Using isolation: cgroups/cpu,cgroups/mem
I1006 16:58:15.927139  2266 linux_launcher.cpp:78] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I1006 16:58:15.929747  2266 main.cpp:149] Starting Mesos slave
I1006 16:58:15.933691  2818 slave.cpp:167] Slave started on 1)@127.0.1.1:5051
I1006 16:58:15.988718  2818 slave.cpp:278] Slave resources: cpus(*):8; mem(*):15061; disk(*):919916; ports(*):[31000-32000]
I1006 16:58:15.992478  2818 slave.cpp:306] Slave hostname: test-host.local
I1006 16:58:15.992552  2818 slave.cpp:307] Slave checkpoint: true
I1006 16:58:16.002214  2815 state.cpp:33] Recovering state from '/var/lib/mesos/meta'
I1006 16:58:16.003589  2815 state.cpp:50] Slave host rebooted
I1006 16:58:16.004365  2816 status_update_manager.cpp:193] Recovering status update manager
I1006 16:58:16.076061  2816 containerizer.cpp:252] Recovering containerizer
I1006 16:58:16.088528  2815 slave.cpp:3198] Finished recovery

... 

I1006 17:28:04.565655  2814 slave.cpp:1002] Got assigned task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c for framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:04.568666  2814 slave.cpp:1112] Launching task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c for framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:05.814142  2814 slave.cpp:3857] Checkpointing ExecutorInfo to '/var/lib/mesos/meta/slaves/20141006-165817-16842879-5050-2264-0/frameworks/20140919-174559-16842879-5050-27194-0000/executors/thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c/executor.info'
I1006 17:28:06.006503  2814 slave.cpp:3972] Checkpointing TaskInfo to '/var/lib/mesos/meta/slaves/20141006-165817-16842879-5050-2264-0/frameworks/20140919-174559-16842879-5050-27194-0000/executors/thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c/runs/899fb038-cb6c-429b-8132-630ac582c846/tasks/1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c/task.info'
I1006 17:28:06.006503  2817 containerizer.cpp:394] Starting container '899fb038-cb6c-429b-8132-630ac582c846' for executor 'thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c' of framework '20140919-174559-16842879-5050-27194-0000'
I1006 17:28:06.008249  2814 slave.cpp:1222] Queuing task '1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c' for executor thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework '20140919-174559-16842879-5050-27194-0000
I1006 17:28:06.011657  2819 cpushare.cpp:338] Updated 'cpu.shares' to 1280 (cpus 1.25) for container 899fb038-cb6c-429b-8132-630ac582c846
I1006 17:28:06.012593  2818 mem.cpp:479] Started listening for OOM events for container 899fb038-cb6c-429b-8132-630ac582c846
I1006 17:28:06.014375  2818 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' to 628MB for container 899fb038-cb6c-429b-8132-630ac582c846
I1006 17:28:06.015763  2818 mem.cpp:347] Updated 'memory.memsw.limit_in_bytes' to 628MB for container 899fb038-cb6c-429b-8132-630ac582c846
I1006 17:28:06.019621  2819 linux_launcher.cpp:191] Cloning child process with flags = 0
I1006 17:28:06.084575  2819 containerizer.cpp:678] Checkpointing executor's forked pid 6299 to '/var/lib/mesos/meta/slaves/20141006-165817-16842879-5050-2264-0/frameworks/20140919-174559-16842879-5050-27194-0000/executors/thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c/runs/899fb038-cb6c-429b-8132-630ac582c846/pids/forked.pid'
I1006 17:28:06.266526  2816 containerizer.cpp:510] Fetching URIs for container '899fb038-cb6c-429b-8132-630ac582c846' using command '/usr/local/libexec/mesos/mesos-fetcher'
I1006 17:28:06.636471  2815 slave.cpp:2538] Monitoring executor 'thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c' of framework '20140919-174559-16842879-5050-27194-0000' in container '899fb038-cb6c-429b-8132-630ac582c846'
I1006 17:28:07.468716  2812 slave.cpp:1733] Got registration for executor 'thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c' of framework 20140919-174559-16842879-5050-27194-0000 from executor(1)@127.0.1.1:60662
I1006 17:28:07.469163  2812 slave.cpp:1819] Checkpointing executor pid 'executor(1)@127.0.1.1:60662' to '/var/lib/mesos/meta/slaves/20141006-165817-16842879-5050-2264-0/frameworks/20140919-174559-16842879-5050-27194-0000/executors/thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c/runs/899fb038-cb6c-429b-8132-630ac582c846/pids/libprocess.pid'
I1006 17:28:07.471247  2812 slave.cpp:1853] Flushing queued task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c for executor 'thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c' of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:07.472482  2816 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' to 628MB for container 899fb038-cb6c-429b-8132-630ac582c846
I1006 17:28:07.472482  2813 cpushare.cpp:338] Updated 'cpu.shares' to 1280 (cpus 1.25) for container 899fb038-cb6c-429b-8132-630ac582c846
I1006 17:28:07.486479  2813 slave.cpp:2088] Handling status update TASK_STARTING (UUID: 02e6c6a9-c02e-4eb8-af91-bf74f8a7e7ca) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000 from executor(1)@127.0.1.1:60662
I1006 17:28:07.486933  2813 status_update_manager.cpp:320] Received status update TASK_STARTING (UUID: 02e6c6a9-c02e-4eb8-af91-bf74f8a7e7ca) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:07.505487  2813 status_update_manager.hpp:342] Checkpointing UPDATE for status update TASK_STARTING (UUID: 02e6c6a9-c02e-4eb8-af91-bf74f8a7e7ca) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:07.597834  2813 status_update_manager.cpp:373] Forwarding status update TASK_STARTING (UUID: 02e6c6a9-c02e-4eb8-af91-bf74f8a7e7ca) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000 to [email protected]:5050
I1006 17:28:07.599594  2813 slave.cpp:2252] Sending acknowledgement for status update TASK_STARTING (UUID: 02e6c6a9-c02e-4eb8-af91-bf74f8a7e7ca) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000 to executor(1)@127.0.1.1:60662
I1006 17:28:07.747617  2815 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 02e6c6a9-c02e-4eb8-af91-bf74f8a7e7ca) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:07.748025  2815 status_update_manager.hpp:342] Checkpointing ACK for status update TASK_STARTING (UUID: 02e6c6a9-c02e-4eb8-af91-bf74f8a7e7ca) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:08.655814  2817 slave.cpp:2088] Handling status update TASK_RUNNING (UUID: 48e39640-eee1-404b-88aa-f31383404d05) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000 from executor(1)@127.0.1.1:60662
I1006 17:28:08.656527  2815 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: 48e39640-eee1-404b-88aa-f31383404d05) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:08.665670  2815 status_update_manager.hpp:342] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 48e39640-eee1-404b-88aa-f31383404d05) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:08.855469  2815 status_update_manager.cpp:373] Forwarding status update TASK_RUNNING (UUID: 48e39640-eee1-404b-88aa-f31383404d05) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000 to [email protected]:5050
I1006 17:28:08.856343  2815 slave.cpp:2252] Sending acknowledgement for status update TASK_RUNNING (UUID: 48e39640-eee1-404b-88aa-f31383404d05) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000 to executor(1)@127.0.1.1:60662
I1006 17:28:08.964869  2812 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 48e39640-eee1-404b-88aa-f31383404d05) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:08.966275  2812 status_update_manager.hpp:342] Checkpointing ACK for status update TASK_RUNNING (UUID: 48e39640-eee1-404b-88aa-f31383404d05) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:16.920442  2819 slave.cpp:3053] Current usage 1.83%. Max allowed age: 6.171900438733218days
I1006 17:29:16.923148  2816 slave.cpp:3053] Current usage 1.83%. Max allowed age: 6.172047379764641days
I1006 17:30:16.924427  2815 slave.cpp:3053] Current usage 1.83%. Max allowed age: 6.172047202406724days
I1006 17:31:16.926211  2819 slave.cpp:3053] Current usage 1.83%. Max allowed age: 6.172046906810208days

