Hello,
I am still facing the same issue:
* My process keeps allocating memory until all available system memory
is used, but it is never killed. Its sandbox is limited to x00 MB
but it ends up using several GB.
* There is no OOM or cgroup-related entry in dmesg (besides the
initialization, i.e., "Initializing cgroup subsys memory"...)
* The slave log contains nothing suspicious (see the attached logfile)
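One way to narrow down the symptoms above might be to read the container's memory cgroup files directly and compare usage against the limits. The following is a minimal sketch, not a definitive diagnostic: the file names are the standard cgroup v1 memory controller files, but the directory passed in is an assumption (for example, something like /sys/fs/cgroup/memory/mesos/<container-id>, which depends on where the memory hierarchy is mounted and on the slave's cgroups root), and the helper names are hypothetical.

```python
from pathlib import Path

# Standard cgroup v1 memory controller files. The directory they live in
# is an assumption and must be supplied by the caller.
MEMORY_FILES = (
    "memory.limit_in_bytes",
    "memory.soft_limit_in_bytes",
    "memory.memsw.limit_in_bytes",
    "memory.usage_in_bytes",
    "memory.failcnt",
    "memory.oom_control",
)


def read_memory_cgroup(cgroup_dir):
    """Return {file name: stripped contents}; None for files that cannot be read."""
    values = {}
    for name in MEMORY_FILES:
        try:
            values[name] = (Path(cgroup_dir) / name).read_text().strip()
        except OSError:
            # File is absent (e.g. swap accounting disabled) or unreadable.
            values[name] = None
    return values


def over_limit(values):
    """True if usage exceeds the hard limit; None if either value is missing."""
    usage = values.get("memory.usage_in_bytes")
    limit = values.get("memory.limit_in_bytes")
    if usage is None or limit is None:
        return None
    return int(usage) > int(limit)
```

Calling read_memory_cgroup() on the container's cgroup directory while the task is growing would show whether the hard limit is set at all, whether memory.failcnt is increasing, and what memory.oom_control reports.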
Updating my Debian kernel from 3.2 to a backported 3.16 kernel did not
help. The system is more responsive under load, but the OOM killer is
still not triggered. I haven't tried running kernelshark on any of these
kernels yet.
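Independent of the kernel version, it may be worth confirming that the runaway process actually sits in the container's memory cgroup and not in the root cgroup. A minimal sketch, assuming the cgroup v1 format of /proc/<pid>/cgroup (one "hierarchy-id:controllers:path" entry per line); the function names are hypothetical:

```python
def parse_proc_cgroup(text):
    """Map each controller name to its cgroup path.

    Expects cgroup v1 lines of the form 'hierarchy-id:controllers:path',
    where 'controllers' is a comma-separated list.
    """
    result = {}
    for line in text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _, controllers, path = parts
        for controller in controllers.split(","):
            if controller:
                result[controller] = path
    return result


def memory_cgroup_of(pid):
    """Return the memory cgroup path of a process, or None if not listed."""
    with open(f"/proc/{pid}/cgroup") as f:
        return parse_proc_cgroup(f.read()).get("memory")
```

If memory_cgroup_of() returns "/" for the task's process, the process is in the root of the memory hierarchy and a limit set on the container's cgroup would not apply to it.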
The slave command line I use: /usr/local/sbin/mesos-slave
--master=zk://test-host:2181/mesos --log_dir=/var/log/mesos
--cgroups_limit_swap --isolation=cgroups/cpu,cgroups/mem
--work_dir=/var/lib/mesos --attributes=host:test-host;rack:unspecified
Any more ideas?
Thanks,
Stephan
On 27.09.2014 19:34, CCAAT wrote:
On 09/26/14 06:20, Stephan Erb wrote:
Hi everyone,
I am having issues with the cgroups isolation of Mesos. It seems like
tasks are prevented from allocating more memory than their limit.
However, they are never killed.
I am running Aurora and Mesos 0.20.1 using the cgroups isolation on
Debian 7 (kernel 3.2.60-1+deb7u3).
Maybe a newer kernel might help? I've poked around for some
suggestions on the kernel-configuration file for servers running
mesos, but nobody is talking about how they "tweak" their kernel
settings, yet.
Here's a good article on default shared memory limits:
http://lwn.net/Articles/595638/
Also, I'm not sure whether the OOM killer handles kernel-space problems
where the kernel itself keeps grabbing memory. That may
not even be your problem. I know the OOM killer works on userspace
memory problems.
Kernelshark is your friend....
hth,
James
Log file created at: 2014/10/06 16:58:15
Running on machine: test-host
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1006 16:58:15.520334 2266 logging.cpp:142] INFO level logging started!
I1006 16:58:15.522333 2266 main.cpp:126] Build: 2014-09-23 05:35:41 by root
I1006 16:58:15.522378 2266 main.cpp:128] Version: 0.20.1
I1006 16:58:15.522400 2266 main.cpp:131] Git tag: 0.20.1
I1006 16:58:15.522420 2266 main.cpp:135] Git SHA: fe0a39112f3304283f970f1b08b322b1e970829d
I1006 16:58:15.524052 2266 containerizer.cpp:89] Using isolation: cgroups/cpu,cgroups/mem
I1006 16:58:15.927139 2266 linux_launcher.cpp:78] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I1006 16:58:15.929747 2266 main.cpp:149] Starting Mesos slave
I1006 16:58:15.933691 2818 slave.cpp:167] Slave started on 1)@127.0.1.1:5051
I1006 16:58:15.988718 2818 slave.cpp:278] Slave resources: cpus(*):8; mem(*):15061; disk(*):919916; ports(*):[31000-32000]
I1006 16:58:15.992478 2818 slave.cpp:306] Slave hostname: test-host.local
I1006 16:58:15.992552 2818 slave.cpp:307] Slave checkpoint: true
I1006 16:58:16.002214 2815 state.cpp:33] Recovering state from '/var/lib/mesos/meta'
I1006 16:58:16.003589 2815 state.cpp:50] Slave host rebooted
I1006 16:58:16.004365 2816 status_update_manager.cpp:193] Recovering status update manager
I1006 16:58:16.076061 2816 containerizer.cpp:252] Recovering containerizer
I1006 16:58:16.088528 2815 slave.cpp:3198] Finished recovery
...
I1006 17:28:04.565655 2814 slave.cpp:1002] Got assigned task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c for framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:04.568666 2814 slave.cpp:1112] Launching task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c for framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:05.814142 2814 slave.cpp:3857] Checkpointing ExecutorInfo to '/var/lib/mesos/meta/slaves/20141006-165817-16842879-5050-2264-0/frameworks/20140919-174559-16842879-5050-27194-0000/executors/thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c/executor.info'
I1006 17:28:06.006503 2814 slave.cpp:3972] Checkpointing TaskInfo to '/var/lib/mesos/meta/slaves/20141006-165817-16842879-5050-2264-0/frameworks/20140919-174559-16842879-5050-27194-0000/executors/thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c/runs/899fb038-cb6c-429b-8132-630ac582c846/tasks/1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c/task.info'
I1006 17:28:06.006503 2817 containerizer.cpp:394] Starting container '899fb038-cb6c-429b-8132-630ac582c846' for executor 'thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c' of framework '20140919-174559-16842879-5050-27194-0000'
I1006 17:28:06.008249 2814 slave.cpp:1222] Queuing task '1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c' for executor thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework '20140919-174559-16842879-5050-27194-0000
I1006 17:28:06.011657 2819 cpushare.cpp:338] Updated 'cpu.shares' to 1280 (cpus 1.25) for container 899fb038-cb6c-429b-8132-630ac582c846
I1006 17:28:06.012593 2818 mem.cpp:479] Started listening for OOM events for container 899fb038-cb6c-429b-8132-630ac582c846
I1006 17:28:06.014375 2818 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' to 628MB for container 899fb038-cb6c-429b-8132-630ac582c846
I1006 17:28:06.015763 2818 mem.cpp:347] Updated 'memory.memsw.limit_in_bytes' to 628MB for container 899fb038-cb6c-429b-8132-630ac582c846
I1006 17:28:06.019621 2819 linux_launcher.cpp:191] Cloning child process with flags = 0
I1006 17:28:06.084575 2819 containerizer.cpp:678] Checkpointing executor's forked pid 6299 to '/var/lib/mesos/meta/slaves/20141006-165817-16842879-5050-2264-0/frameworks/20140919-174559-16842879-5050-27194-0000/executors/thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c/runs/899fb038-cb6c-429b-8132-630ac582c846/pids/forked.pid'
I1006 17:28:06.266526 2816 containerizer.cpp:510] Fetching URIs for container '899fb038-cb6c-429b-8132-630ac582c846' using command '/usr/local/libexec/mesos/mesos-fetcher'
I1006 17:28:06.636471 2815 slave.cpp:2538] Monitoring executor 'thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c' of framework '20140919-174559-16842879-5050-27194-0000' in container '899fb038-cb6c-429b-8132-630ac582c846'
I1006 17:28:07.468716 2812 slave.cpp:1733] Got registration for executor 'thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c' of framework 20140919-174559-16842879-5050-27194-0000 from executor(1)@127.0.1.1:60662
I1006 17:28:07.469163 2812 slave.cpp:1819] Checkpointing executor pid 'executor(1)@127.0.1.1:60662' to '/var/lib/mesos/meta/slaves/20141006-165817-16842879-5050-2264-0/frameworks/20140919-174559-16842879-5050-27194-0000/executors/thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c/runs/899fb038-cb6c-429b-8132-630ac582c846/pids/libprocess.pid'
I1006 17:28:07.471247 2812 slave.cpp:1853] Flushing queued task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c for executor 'thermos-1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c' of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:07.472482 2816 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' to 628MB for container 899fb038-cb6c-429b-8132-630ac582c846
I1006 17:28:07.472482 2813 cpushare.cpp:338] Updated 'cpu.shares' to 1280 (cpus 1.25) for container 899fb038-cb6c-429b-8132-630ac582c846
I1006 17:28:07.486479 2813 slave.cpp:2088] Handling status update TASK_STARTING (UUID: 02e6c6a9-c02e-4eb8-af91-bf74f8a7e7ca) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000 from executor(1)@127.0.1.1:60662
I1006 17:28:07.486933 2813 status_update_manager.cpp:320] Received status update TASK_STARTING (UUID: 02e6c6a9-c02e-4eb8-af91-bf74f8a7e7ca) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:07.505487 2813 status_update_manager.hpp:342] Checkpointing UPDATE for status update TASK_STARTING (UUID: 02e6c6a9-c02e-4eb8-af91-bf74f8a7e7ca) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:07.597834 2813 status_update_manager.cpp:373] Forwarding status update TASK_STARTING (UUID: 02e6c6a9-c02e-4eb8-af91-bf74f8a7e7ca) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000 to [email protected]:5050
I1006 17:28:07.599594 2813 slave.cpp:2252] Sending acknowledgement for status update TASK_STARTING (UUID: 02e6c6a9-c02e-4eb8-af91-bf74f8a7e7ca) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000 to executor(1)@127.0.1.1:60662
I1006 17:28:07.747617 2815 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 02e6c6a9-c02e-4eb8-af91-bf74f8a7e7ca) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:07.748025 2815 status_update_manager.hpp:342] Checkpointing ACK for status update TASK_STARTING (UUID: 02e6c6a9-c02e-4eb8-af91-bf74f8a7e7ca) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:08.655814 2817 slave.cpp:2088] Handling status update TASK_RUNNING (UUID: 48e39640-eee1-404b-88aa-f31383404d05) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000 from executor(1)@127.0.1.1:60662
I1006 17:28:08.656527 2815 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: 48e39640-eee1-404b-88aa-f31383404d05) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:08.665670 2815 status_update_manager.hpp:342] Checkpointing UPDATE for status update TASK_RUNNING (UUID: 48e39640-eee1-404b-88aa-f31383404d05) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:08.855469 2815 status_update_manager.cpp:373] Forwarding status update TASK_RUNNING (UUID: 48e39640-eee1-404b-88aa-f31383404d05) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000 to [email protected]:5050
I1006 17:28:08.856343 2815 slave.cpp:2252] Sending acknowledgement for status update TASK_RUNNING (UUID: 48e39640-eee1-404b-88aa-f31383404d05) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000 to executor(1)@127.0.1.1:60662
I1006 17:28:08.964869 2812 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 48e39640-eee1-404b-88aa-f31383404d05) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:08.966275 2812 status_update_manager.hpp:342] Checkpointing ACK for status update TASK_RUNNING (UUID: 48e39640-eee1-404b-88aa-f31383404d05) for task 1412609276176-www-data-test-ipython-1-b69cccbf-677b-47a7-83f9-74e713b7678c of framework 20140919-174559-16842879-5050-27194-0000
I1006 17:28:16.920442 2819 slave.cpp:3053] Current usage 1.83%. Max allowed age: 6.171900438733218days
I1006 17:29:16.923148 2816 slave.cpp:3053] Current usage 1.83%. Max allowed age: 6.172047379764641days
I1006 17:30:16.924427 2815 slave.cpp:3053] Current usage 1.83%. Max allowed age: 6.172047202406724days
I1006 17:31:16.926211 2819 slave.cpp:3053] Current usage 1.83%. Max allowed age: 6.172046906810208days
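
The log above follows the line format given in its header ("[IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg"). To see at a glance which memory cgroup settings the slave reports having written, the mem.cpp entries can be filtered out. A minimal sketch, assuming that line format and the "Updated '<name>' to <value>" wording seen in this log; the function names are hypothetical:

```python
import re

# One log entry: level, month, day, time, thread id, source file, line, message.
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)

# Message wording as it appears in this log for cgroup control file updates.
UPDATED = re.compile(r"Updated '([^']+)' to (\S+)")


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    m = GLOG_LINE.match(line.rstrip("\n"))
    return m.groupdict() if m else None


def memory_updates(lines):
    """Return (control file, value) pairs logged by mem.cpp, in log order."""
    updates = []
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry["file"] != "mem.cpp":
            continue
        m = UPDATED.search(entry["msg"])
        if m:
            updates.append((m.group(1), m.group(2)))
    return updates
```

Applied to the excerpt above, this would list only 'memory.soft_limit_in_bytes' and 'memory.memsw.limit_in_bytes'; the excerpt contains no entry for 'memory.limit_in_bytes', which may or may not be significant.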