Hey Lin. Mind filing a ticket for this issue? This is definitely a bug we would like to get fixed.
@vinodkone

On Tue, Jan 21, 2014 at 2:00 PM, Benjamin Mahler <[email protected]> wrote:

TLDR: Specify resources in your *executor*, rather than only in your *task*.

No OOM is occurring in the logs. The "triggered" log line is misleading; you can see that the notification was merely discarded:

I0121 19:44:07.180585  8577 cgroups_isolator.cpp:1183] OOM notifier is triggered for executor default of framework 201401171812-2907575306-5050-19011-0020 with uuid 8bc2ab10-8988-4b22-afa2-3433bbedc3ed
I0121 19:44:07.181037  8577 cgroups_isolator.cpp:1188] Discarded OOM notifier for executor default of framework 201401171812-2907575306-5050-19011-0020 with uuid 8bc2ab10-8988-4b22-afa2-3433bbedc3ed

This looks like a bug in Mesos. What's happening is that you're launching an executor with no resources. Consequently, before we fork, we attempt to update the memory control, but we don't call the memory handler since the executor has no memory resources:

I0121 19:39:01.660071  8566 cgroups_isolator.cpp:516] Launching default (/home/lin/test-executor) in /tmp/mesos/slaves/201312032357-3645772810-5050-2033-0/frameworks/201401171812-2907575306-5050-19011-0020/executors/default/runs/8bc2ab10-8988-4b22-afa2-3433bbedc3ed with resources for framework 201401171812-2907575306-5050-19011-0020 in cgroup mesos/framework_201401171812-2907575306-5050-19011-0020_executor_default_tag_8bc2ab10-8988-4b22-afa2-3433bbedc3ed
I0121 19:39:01.663082  8566 cgroups_isolator.cpp:709] Changing cgroup controls for executor default of framework 201401171812-2907575306-5050-19011-0020 with resources
I0121 19:39:01.667129  8566 cgroups_isolator.cpp:1163] Started listening for OOM events for executor default of framework 201401171812-2907575306-5050-19011-0020
I0121 19:39:01.681857  8566 cgroups_isolator.cpp:568] Forked executor at = 27609

Then, later, when we are updating the resources for your 128MB task, we set the soft limit, but we don't set
the hard limit, because the following buggy check is not satisfied:

    // Determine whether to set the hard limit. If this is the first
    // time (info->pid.isNone()), or we're raising the existing limit,
    // then we can update the hard limit safely. Otherwise, if we need
    // to decrease 'memory.limit_in_bytes' we may induce an OOM if too
    // much memory is in use. As a result, we only update the soft
    // limit when the memory reservation is being reduced. This is
    // probably okay if the machine has available resources.
    // TODO(benh): Introduce a MemoryWatcherProcess which monitors the
    // discrepancy between usage and soft limit and introduces a
    // "manual oom" if necessary.
    if (info->pid.isNone() || limit > currentLimit.get()) {

The assumption here was that there would always be an initial call with info->pid.isNone(); however, since your executor has no resources, we did not update the control before forking the executor, and limit was left as the inherited value. I've cc'ed Ian Downes on this since he's re-working the Isolator; I'll leave it to him to determine whether this is a bug that should be filed or not.

On Tue, Jan 21, 2014 at 12:51 PM, Lin Zhao <[email protected]> wrote:

Vinod,

Correction to my message: while my job is sleeping, the values below are 500+ MB as expected. I was looking at the kmem values. The OOM notifier is triggered much later, when the executor is killed. Would appreciate it if you have an idea where to look.

cat /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.usage_in_bytes
cat /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.max_usage_in_bytes

On Tue, Jan 21, 2014 at 2:54 PM, Lin Zhao <[email protected]> wrote:

Interesting. Looking at the log, it seems that the OOM is fired when the executor is shut down (19:44:07.180585), which is 300 seconds after the job launch and memory use.
Within the 300 seconds, usage_in_bytes and max_usage_in_bytes are 0.

Attaching the log. Any idea of the slow OOM? As you can see at https://gist.github.com/lin-zhao/8544495#file-testexecutor-java-L80, 512M mem is used before the sleep.

On Tue, Jan 21, 2014 at 2:28 PM, Vinod Kone <[email protected]> wrote:

The way you set task resources looks correct.

Can you paste what the slave logs say regarding the task/executor, esp. the lines that are from the cgroups isolator? Also, what is the command line of the slave?

@vinodkone

On Tue, Jan 21, 2014 at 11:18 AM, Lin Zhao <[email protected]> wrote:

[lin@mesos2 ~]$ cat /cgroup/mesos/framework_201401171812-2907575306-5050-19011-0019_executor_default_tag_72c003a3-f213-479e-a7e3-9b86930703a7/memory.limit_in_bytes
9223372036854775807

[lin@mesos2 ~]$ cat /cgroup/mesos/framework_201401171812-2907575306-5050-19011-0019_executor_default_tag_72c003a3-f213-479e-a7e3-9b86930703a7/memory.usage_in_bytes
584146944

[lin@mesos2 ~]$ cat /cgroup/mesos/framework_201401171812-2907575306-5050-19011-0019_executor_default_tag_72c003a3-f213-479e-a7e3-9b86930703a7/memory.max_usage_in_bytes
585809920

Hmm, the limit is weird. Can you find anything wrong about the way my mem is defined?

    .addResources(Resource.newBuilder()
        .setName("mem")
        .setType(Value.Type.SCALAR)
        .setScalar(Value.Scalar.newBuilder().setValue(128)))

On Tue, Jan 21, 2014 at 2:02 PM, Vinod Kone <[email protected]> wrote:

Mesos uses cgroups (https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt) to limit cpu and memory.

It is indeed surprising that your executor is not OOMing when using more memory than requested.
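(An aside on the limit Lin pasted above: 9223372036854775807 is not a corrupted value but the memory cgroup's "no limit" sentinel, i.e. the largest signed 64-bit integer, which is exactly what the executor inherits when no hard limit is ever written. A trivial check, not from the thread:)

```java
public class LimitSentinel {
    public static void main(String[] args) {
        // The value seen in memory.limit_in_bytes is exactly 2^63 - 1,
        // the default the kernel reports when no hard limit has been set.
        long pasted = 9223372036854775807L;
        System.out.println(pasted == Long.MAX_VALUE); // true
        System.out.println(Long.MAX_VALUE);           // 9223372036854775807
    }
}
```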
Can you tell us what the following values look like in the executor's cgroup? These are the values the kernel uses to decide whether the cgroup is hitting its limit.

cat /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.limit_in_bytes
cat /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.usage_in_bytes
cat /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.max_usage_in_bytes

@vinodkone

On Tue, Jan 21, 2014 at 9:58 AM, Lin Zhao <[email protected]> wrote:

Hi,

I'm new to Mesos and have some questions about resource management. I want to understand how Mesos limits the resources used by each executor, given the resources defined in TaskInfo. I did some tests and have seen different behavior for different types of resources. It appears that Mesos caps CPU usage for the executors, but doesn't limit the memory accessible to each executor.

I created an example Java framework, which is largely taken from the Mesos example:

https://gist.github.com/lin-zhao/8544495

Basically:

1. The Scheduler launches tasks with *2* cpus and *128 mb* memory.
2. The executor launches java with *-Xms 1500m* and *-Xmx 1500m*.
3. The java executor creates a byte array that uses *512 MB* of memory.
4. The java executor starts 3 threads that loop forever, which potentially use *3* full cpus.

The framework was launched in a 3-slave Mesos (v0.14.2) cluster and finished without error.

CPU: on the slaves, the cpu usage for the TestExecutor process is capped at 199%, indicating that Mesos does cap CPU usage. When the executor is assigned 1 cpu instead of 2, the cpu usage is capped at 99%.
Memory: there is no error thrown. The executors used > 512 MB of memory and got away with it.

Can someone confirm this? I haven't tested the other resource types (ports, disk). Is the behavior documented somewhere?

--
Lin Zhao

https://wiki.groupondev.com/Message_Bus
3101 Park Blvd, Palo Alto, CA 94306

Temporarily based in NY
33 W 19th St.
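(Editor's note: Benjamin's diagnosis boils down to a small decision rule. The sketch below models it with invented names, as the real logic lives in cgroups_isolator.cpp: once the executor has been forked with the inherited "unlimited" value, a 128 MB task limit can never satisfy `limit > currentLimit`, so the hard limit is never written and no OOM can be induced.)

```java
import java.util.OptionalLong;

public class HardLimitModel {
    // Mirrors the check Benjamin quoted: set the hard limit only on the
    // first update (no pid yet) or when raising the existing limit.
    static boolean wouldSetHardLimit(OptionalLong pid, long newLimit, long currentLimit) {
        return !pid.isPresent() || newLimit > currentLimit;
    }

    public static void main(String[] args) {
        long unlimited = Long.MAX_VALUE;     // inherited memory.limit_in_bytes
        long taskMem = 128L * 1024 * 1024;   // the task's 128 MB

        // Executor declared WITH memory: the control is updated before the
        // fork (pid is none), so the hard limit gets set.
        System.out.println(wouldSetHardLimit(OptionalLong.empty(), taskMem, unlimited));   // true

        // Executor declared with NO resources: by the time the task's memory
        // arrives, the pid is known and 128 MB < Long.MAX_VALUE, so only the
        // soft limit is ever updated.
        System.out.println(wouldSetHardLimit(OptionalLong.of(27609), taskMem, unlimited)); // false
    }
}
```

This is why the TLDR advice works: declaring memory on the *executor* forces an update before the fork, while pid is still none, so the hard limit is written exactly once, safely.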

