Filed https://issues.apache.org/jira/browse/MESOS-941. Thanks everyone
for the help!


On Thu, Jan 23, 2014 at 2:03 AM, Vinod Kone <[email protected]> wrote:

> Hey Lin. Mind filing a ticket for this issue? This is definitely a bug we
> would like to get fixed.
>
>
> @vinodkone
>
>
> On Tue, Jan 21, 2014 at 2:00 PM, Benjamin Mahler <
> [email protected]> wrote:
>
>> TLDR: Specify resources in your *executor*, rather than only in your
>> *task*.
>>
>> No OOM occurred in the logs. The "triggered" log line is misleading; you
>> can see that the notification was merely discarded:
>>
>> I0121 19:44:07.180585  8577 cgroups_isolator.cpp:1183] OOM notifier is
>> triggered for executor default of framework
>> 201401171812-2907575306-5050-19011-0020 with uuid
>> 8bc2ab10-8988-4b22-afa2-3433bbedc3ed
>> I0121 19:44:07.181037  8577 cgroups_isolator.cpp:1188] Discarded OOM
>> notifier for executor default of framework
>> 201401171812-2907575306-5050-19011-0020 with uuid
>> 8bc2ab10-8988-4b22-afa2-3433bbedc3ed
>>
>>
>> This looks like a bug in Mesos. What's happening is that you're launching
>> an executor with no resources. Consequently, before we fork we attempt to
>> update the memory control, but we don't call the memory handler since the
>> executor has no memory resources:
>>
>> I0121 19:39:01.660071  8566 cgroups_isolator.cpp:516] Launching default
>> (/home/lin/test-executor) in
>> /tmp/mesos/slaves/201312032357-3645772810-5050-2033-0/frameworks/201401171812-2907575306-5050-19011-0020/executors/default/runs/8bc2ab10-8988-4b22-afa2-3433bbedc3ed
>> with resources  for framework 201401171812-2907575306-5050-19011-0020 in
>> cgroup
>> mesos/framework_201401171812-2907575306-5050-19011-0020_executor_default_tag_8bc2ab10-8988-4b22-afa2-3433bbedc3ed
>> I0121 19:39:01.663082  8566 cgroups_isolator.cpp:709] Changing cgroup
>> controls for executor default of framework
>> 201401171812-2907575306-5050-19011-0020 with resources
>> I0121 19:39:01.667129  8566 cgroups_isolator.cpp:1163] Started listening
>> for OOM events for executor default of framework
>> 201401171812-2907575306-5050-19011-0020
>> I0121 19:39:01.681857  8566 cgroups_isolator.cpp:568] Forked executor at
>> = 27609
>>
>> Then, later, when we are updating the resources for your 128MB task, we
>> set the soft limit, but we don't set the hard limit because the following
>> buggy check is not satisfied:
>>
>>   // Determine whether to set the hard limit. If this is the first
>>   // time (info->pid.isNone()), or we're raising the existing limit,
>>   // then we can update the hard limit safely. Otherwise, if we need
>>   // to decrease 'memory.limit_in_bytes' we may induce an OOM if too
>>   // much memory is in use. As a result, we only update the soft
>>   // limit when the memory reservation is being reduced. This is
>>   // probably okay if the machine has available resources.
>>   // TODO(benh): Introduce a MemoryWatcherProcess which monitors the
>>   // discrepancy between usage and soft limit and introduces a
>>   // "manual oom" if necessary.
>>   if (info->pid.isNone() || limit > currentLimit.get()) {
>>
>> The assumption here was that there would always be an initial call with
>> info->pid.isNone(). However, since your executor has no resources, we did
>> not update the control before forking the executor, and the limit was left
>> at the inherited value. I've cc'ed Ian Downes on this since he's re-working
>> the Isolator; I'll leave it to him to determine whether this is a bug that
>> should be filed.
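For illustration, the flow described above can be modeled as a toy sketch in plain Java (this is not the actual Mesos C++ code; the class, field, and method names are invented, and only the hard-limit branch of the quoted check is modeled):

```java
public class LimitCheckSketch {
    // The kernel's default memory.limit_in_bytes for a fresh cgroup:
    // effectively "unlimited" (2^63 - 1).
    static final long UNLIMITED = Long.MAX_VALUE;

    static long currentLimit = UNLIMITED;
    static boolean pidIsSet = false; // models !info->pid.isNone()

    // Toy version of the update path: the soft limit is always set
    // (not modeled here); the hard limit is only set on the first
    // update (pid still none) or when the limit is being raised.
    static void update(long requestedLimit) {
        if (!pidIsSet || requestedLimit > currentLimit) {
            currentLimit = requestedLimit;
        }
    }

    public static void main(String[] args) {
        // Executor launched with NO memory resources: the pre-fork
        // update is skipped, so by the first real update the pid is set.
        pidIsSet = true;

        // The 128 MB task update: 128 MB < Long.MAX_VALUE, so the
        // hard limit is never lowered and stays "unlimited".
        update(128L * 1024 * 1024);
        System.out.println(currentLimit == UNLIMITED); // prints "true"
    }
}
```

Had memory been declared on the executor, the pre-fork update would have run while the pid was still unset, and the 128 MB hard limit would have been applied.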
>>
>>
>> On Tue, Jan 21, 2014 at 12:51 PM, Lin Zhao <[email protected]> wrote:
>>
>>> Vinod,
>>>
>>> Correction to my message: while my job is sleeping, the values below are
>>> 500+ MB as expected. I was looking at the kmem values earlier. The OOM
>>> notifier is triggered much later, when the executor is killed. Would
>>> appreciate it if you have an idea where to look.
>>>
>>> cat /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.usage_in_bytes
>>> cat /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.max_usage_in_bytes
>>>
>>>
>>> On Tue, Jan 21, 2014 at 2:54 PM, Lin Zhao <[email protected]> wrote:
>>>
>>>> Interesting. Looking at the log, it seems the OOM is fired when the
>>>> executor is shut down (19:44:07.180585), which is 300 seconds after the
>>>> job launches and uses the memory. Within those 300 seconds,
>>>> usage_in_bytes and max_usage_in_bytes are 0.
>>>>
>>>> Attaching the log. Any idea why the OOM fires so late? As you can see at
>>>> https://gist.github.com/lin-zhao/8544495#file-testexecutor-java-L80,
>>>> 512M of memory is used before the sleep.
>>>>
>>>>
>>>> On Tue, Jan 21, 2014 at 2:28 PM, Vinod Kone <[email protected]> wrote:
>>>>
>>>>> The way you set task resources looks correct.
>>>>>
>>>>> Can you paste what the slave logs say regarding the task/executor,
>>>>> esp. the lines that are from the cgroups isolator? Also, what is the
>>>>> command line of the slave?
>>>>>
>>>>>
>>>>> @vinodkone
>>>>>
>>>>>
>>>>> On Tue, Jan 21, 2014 at 11:18 AM, Lin Zhao <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> [lin@mesos2 ~]$ cat /cgroup/mesos/framework_201401171812-2907575306-5050-19011-0019_executor_default_tag_72c003a3-f213-479e-a7e3-9b86930703a7/memory.limit_in_bytes
>>>>>> 9223372036854775807
>>>>>>
>>>>>> [lin@mesos2 ~]$ cat /cgroup/mesos/framework_201401171812-2907575306-5050-19011-0019_executor_default_tag_72c003a3-f213-479e-a7e3-9b86930703a7/memory.usage_in_bytes
>>>>>> 584146944
>>>>>>
>>>>>> [lin@mesos2 ~]$ cat /cgroup/mesos/framework_201401171812-2907575306-5050-19011-0019_executor_default_tag_72c003a3-f213-479e-a7e3-9b86930703a7/memory.max_usage_in_bytes
>>>>>> 585809920
>>>>>>
>>>>>> Hmm, the limit is weird. Can you find anything wrong with the way my
>>>>>> mem is defined?
>>>>>>
>>>>>>
>>>>>> .addResources(Resource.newBuilder()
>>>>>>     .setName("mem")
>>>>>>     .setType(Value.Type.SCALAR)
>>>>>>     .setScalar(Value.Scalar.newBuilder().setValue(128)))
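As an aside, the "weird" limit shown above is not a random number: 9223372036854775807 is 2^63 - 1, the value the kernel reports for memory.limit_in_bytes when no hard limit has ever been set on the cgroup. A quick plain-Java check (class name invented for illustration):

```java
public class LimitValueCheck {
    public static void main(String[] args) {
        // The memory.limit_in_bytes value read from the cgroup above.
        long limit = 9223372036854775807L;

        // 2^63 - 1: the largest signed 64-bit value, which the kernel
        // uses to mean "no hard memory limit set".
        System.out.println(limit == Long.MAX_VALUE); // prints "true"

        // For comparison, the 128 MB the task actually requested.
        System.out.println(128L * 1024 * 1024); // prints "134217728"
    }
}
```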
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 21, 2014 at 2:02 PM, Vinod Kone <[email protected]> wrote:
>>>>>>
>>>>>>> Mesos uses cgroups
>>>>>>> (https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt)
>>>>>>> to limit cpu and memory.
>>>>>>>
>>>>>>> It is indeed surprising that your executor is not OOMing when using
>>>>>>> more memory than requested.
>>>>>>>
>>>>>>> Can you tell us what the following values look like in the
>>>>>>> executor's cgroup? These are the values the kernel uses to decide 
>>>>>>> whether
>>>>>>> the cgroup is hitting its limit.
>>>>>>>
>>>>>>> cat
>>>>>>> /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.limit_in_bytes
>>>>>>>
>>>>>>> cat
>>>>>>> /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.usage_in_bytes
>>>>>>>
>>>>>>> cat
>>>>>>> /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.max_usage_in_bytes
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> @vinodkone
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 21, 2014 at 9:58 AM, Lin Zhao <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm new to Mesos and have some questions about resource management.
>>>>>>>> I want to understand how Mesos limits the resources used by each
>>>>>>>> executor, given the resources defined in TaskInfo. I did some tests and have seen
>>>>>>>> different behavior for different types of resources. It appears that 
>>>>>>>> Mesos
>>>>>>>> caps CPU usage for the executors, but doesn't limit the memory 
>>>>>>>> accessible
>>>>>>>> to each executor.
>>>>>>>>
>>>>>>>> I created an example java framework, which is largely taken from
>>>>>>>> the mesos example:
>>>>>>>>
>>>>>>>> https://gist.github.com/lin-zhao/8544495
>>>>>>>>
>>>>>>>> Basically,
>>>>>>>>
>>>>>>>> 1. The scheduler launches tasks with 2 cpus and 128 MB memory.
>>>>>>>> 2. The executor launches java with -Xms1500m and -Xmx1500m.
>>>>>>>> 3. The java executor creates a byte array that uses 512 MB of memory.
>>>>>>>> 4. The java executor starts 3 threads that loop forever, which
>>>>>>>> potentially use 3 full cpus.
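The executor side of the steps above can be sketched as a bounded, self-contained stand-in for the gist's executor body (class name invented; the allocation is sized down from the gist's 512 MB to 16 MB, and the spin threads are bounded so this sketch terminates instead of looping forever):

```java
public class TestExecutorSketch {
    public static void main(String[] args) throws Exception {
        // Step 3, sized down to 16 MB here; the gist allocates
        // 512 * 1024 * 1024 bytes.
        byte[] block = new byte[16 * 1024 * 1024];
        java.util.Arrays.fill(block, (byte) 1); // touch every page

        // Step 4: busy threads (3, as in the gist); bounded loops
        // here rather than the gist's infinite ones.
        Thread[] spinners = new Thread[3];
        for (int i = 0; i < spinners.length; i++) {
            spinners[i] = new Thread(() -> {
                long x = 0;
                for (long j = 0; j < 10_000_000L; j++) { x += j; }
            });
            spinners[i].start();
        }
        for (Thread t : spinners) { t.join(); }

        System.out.println("allocated " + block.length + " bytes");
    }
}
```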
>>>>>>>>
>>>>>>>> The framework was launched in a 3-slave Mesos (v0.14.2) cluster and
>>>>>>>> finished without error.
>>>>>>>>
>>>>>>>> CPU: on the slaves, the cpu usage of the TestExecutor process is
>>>>>>>> capped at 199%, indicating that Mesos does cap CPU usage. When the
>>>>>>>> executor is assigned 1 cpu instead of 2, usage is capped at 99%.
>>>>>>>>
>>>>>>>> Memory: no error is thrown. The executors used more than 512 MB of
>>>>>>>> memory and got away with it.
>>>>>>>>
>>>>>>>> Can someone confirm this? I haven't tested the other resource types
>>>>>>>> (ports, disk). Is the behavior documented somewhere?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Lin Zhao
>>>>>>>>
>>>>>>>> https://wiki.groupondev.com/Message_Bus
>>>>>>>> 3101 Park Blvd, Palo Alto, CA 94306
>>>>>>>>
>>>>>>>> Temporarily based in NY
>>>>>>>> 33 W 19th St.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Lin Zhao
>>>>>>
>>>>>> https://wiki.groupondev.com/Message_Bus
>>>>>> 3101 Park Blvd, Palo Alto, CA 94306
>>>>>>
>>>>>> Temporarily based in NY
>>>>>> 33 W 19th St.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Lin Zhao
>>>>
>>>> https://wiki.groupondev.com/Message_Bus
>>>> 3101 Park Blvd, Palo Alto, CA 94306
>>>>
>>>> Temporarily based in NY
>>>> 33 W 19th St.
>>>>
>>>>
>>>
>>>
>>> --
>>> Lin Zhao
>>>
>>> https://wiki.groupondev.com/Message_Bus
>>> 3101 Park Blvd, Palo Alto, CA 94306
>>>
>>> Temporarily based in NY
>>> 33 W 19th St.
>>>
>>>
>>
>


