Hey Lin. Thanks for filing the ticket!

Regarding ports and disk usage: Mesos doesn't currently enforce them, but
it is definitely on our radar. You might hear more about the former
(network isolation) sometime this quarter.


@vinodkone


On Thu, Jan 23, 2014 at 8:56 AM, Lin Zhao <[email protected]> wrote:

> One more resources question: how does Mesos control "ports" and "disk"
> resources? I started a framework that claims port1, yet listens on port2,
> and doesn't have a problem doing so. It also claims 10 units (MB, I
> assume) of disk, then writes 512 MB of data to the work directory, and
> succeeds too. Is this expected? I can provide source/logs if requested.
>
>
> On Thu, Jan 23, 2014 at 11:10 AM, Lin Zhao <[email protected]> wrote:
>
>> Entered https://issues.apache.org/jira/browse/MESOS-941. Thanks everyone
>> for the help!
>>
>>
>> On Thu, Jan 23, 2014 at 2:03 AM, Vinod Kone <[email protected]> wrote:
>>
>>> Hey Lin. Mind filing a ticket for this issue? This is definitely a bug
>>> we would like to get fixed.
>>>
>>>
>>> @vinodkone
>>>
>>>
>>> On Tue, Jan 21, 2014 at 2:00 PM, Benjamin Mahler <
>>> [email protected]> wrote:
>>>
>>>> TLDR: Specify resources in your *executor*, rather than only in your
>>>> *task*.
>>>>
>>>> No OOM is occurring in the logs. The "triggered" log line is
>>>> misleading; you can see that the notification was merely discarded:
>>>>
>>>> I0121 19:44:07.180585  8577 cgroups_isolator.cpp:1183] OOM notifier is
>>>> triggered for executor default of framework
>>>> 201401171812-2907575306-5050-19011-0020 with uuid
>>>> 8bc2ab10-8988-4b22-afa2-3433bbedc3ed
>>>> I0121 19:44:07.181037  8577 cgroups_isolator.cpp:1188] Discarded OOM
>>>> notifier for executor default of framework
>>>> 201401171812-2907575306-5050-19011-0020 with uuid
>>>> 8bc2ab10-8988-4b22-afa2-3433bbedc3ed
>>>>
>>>>
>>>> This looks like a bug in Mesos. What's happening is that you're
>>>> launching an executor with no resources. Consequently, before we fork,
>>>> we attempt to update the memory control but never call the memory
>>>> handler, since the executor has no memory resources:
>>>>
>>>> I0121 19:39:01.660071  8566 cgroups_isolator.cpp:516] Launching default
>>>> (/home/lin/test-executor) in
>>>> /tmp/mesos/slaves/201312032357-3645772810-5050-2033-0/frameworks/201401171812-2907575306-5050-19011-0020/executors/default/runs/8bc2ab10-8988-4b22-afa2-3433bbedc3ed
>>>> with resources  for framework 201401171812-2907575306-5050-19011-0020 in
>>>> cgroup
>>>> mesos/framework_201401171812-2907575306-5050-19011-0020_executor_default_tag_8bc2ab10-8988-4b22-afa2-3433bbedc3ed
>>>> I0121 19:39:01.663082  8566 cgroups_isolator.cpp:709] Changing cgroup
>>>> controls for executor default of framework
>>>> 201401171812-2907575306-5050-19011-0020 with resources
>>>> I0121 19:39:01.667129  8566 cgroups_isolator.cpp:1163] Started
>>>> listening for OOM events for executor default of framework
>>>> 201401171812-2907575306-5050-19011-0020
>>>> I0121 19:39:01.681857  8566 cgroups_isolator.cpp:568] Forked executor
>>>> at = 27609
>>>>
>>>> Then, later, when we are updating the resources for your 128MB task, we
>>>> set the soft limit, but we don't set the hard limit because the following
>>>> buggy check is not satisfied:
>>>>
>>>>   // Determine whether to set the hard limit. If this is the first
>>>>   // time (info->pid.isNone()), or we're raising the existing limit,
>>>>   // then we can update the hard limit safely. Otherwise, if we need
>>>>   // to decrease 'memory.limit_in_bytes' we may induce an OOM if too
>>>>   // much memory is in use. As a result, we only update the soft
>>>>   // limit when the memory reservation is being reduced. This is
>>>>   // probably okay if the machine has available resources.
>>>>   // TODO(benh): Introduce a MemoryWatcherProcess which monitors the
>>>>   // discrepancy between usage and soft limit and introduces a
>>>>   // "manual oom" if necessary.
>>>>   if (info->pid.isNone() || limit > currentLimit.get()) {
>>>>
>>>> The assumption here was that there would always be an initial call with
>>>> info->pid.isNone(); however, since your executor has no resources, we
>>>> did not update the control before forking the executor, and the limit
>>>> was left at the inherited value. I've cc'ed Ian Downes on this since
>>>> he's re-working the Isolator; I'll leave it to him to determine whether
>>>> this is a bug that should be filed or not.
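The buggy check can be modeled in a few lines. This is a sketch in Python, not Mesos code; `ExecutorInfo` and `update_hard_limit` are hypothetical names standing in for the isolator's logic. It shows how an executor forked without an initial resource update never takes the `pid.isNone()` branch, so a later, lower limit is never applied:

```python
# Hypothetical model of the isolator's hard-limit decision (not Mesos code).
UNLIMITED = (1 << 63) - 1  # kernel default for memory.limit_in_bytes


class ExecutorInfo:
    def __init__(self):
        self.pid = None               # set once the executor is forked
        self.hard_limit = UNLIMITED   # inherited value


def update_hard_limit(info, new_limit):
    # Mirrors: if (info->pid.isNone() || limit > currentLimit.get())
    if info.pid is None or new_limit > info.hard_limit:
        info.hard_limit = new_limit


# Buggy path: the executor has no resources, so no update happens before
# the fork, and pid gets set while the limit is still "unlimited".
info = ExecutorInfo()
info.pid = 27609

# Later, the 128MB task arrives. pid is set and 128MB is not greater than
# UNLIMITED, so the check fails and the hard limit is never lowered.
update_hard_limit(info, 128 * 1024 * 1024)
print(info.hard_limit == UNLIMITED)   # True: the kernel never sees a limit
```

With an initial call while `pid` is still unset (the intended path), the same function does lower the limit, which is why specifying resources on the executor avoids the bug.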
>>>>
>>>>
>>>> On Tue, Jan 21, 2014 at 12:51 PM, Lin Zhao <[email protected]> wrote:
>>>>
>>>>> Vinod,
>>>>>
>>>>> Correction to my message: when my job is sleeping, the values below
>>>>> are 500+ MB as expected. I was looking at the kmem values. The OOM
>>>>> notifier is triggered much later, when the executor is killed. I would
>>>>> appreciate it if you have an idea of where to look.
>>>>>
>>>>> cat /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.usage_in_bytes
>>>>> cat /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.max_usage_in_bytes
>>>>>
>>>>>
>>>>> On Tue, Jan 21, 2014 at 2:54 PM, Lin Zhao <[email protected]> wrote:
>>>>>
>>>>>> Interesting. Looking at the log, it seems that the OOM is fired when
>>>>>> the executor is shut down (19:44:07.180585), which is 300 seconds
>>>>>> after the job launches and uses the memory. Within those 300 seconds,
>>>>>> usage_in_bytes and max_usage_in_bytes are 0.
>>>>>>
>>>>>> Attaching the log. Any idea why the OOM is so slow? As you can see at
>>>>>> https://gist.github.com/lin-zhao/8544495#file-testexecutor-java-L80,
>>>>>> 512M of mem is used before the sleep.
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 21, 2014 at 2:28 PM, Vinod Kone <[email protected]> wrote:
>>>>>>
>>>>>>> The way you set task resources looks correct.
>>>>>>>
>>>>>>> Can you paste what the slave logs say regarding the task/executor,
>>>>>>> esp. the lines that are from the cgroups isolator? Also, what is the
>>>>>>> command line of the slave?
>>>>>>>
>>>>>>>
>>>>>>> @vinodkone
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 21, 2014 at 11:18 AM, Lin Zhao <[email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> [lin@mesos2 ~]$ cat
>>>>>>>> /cgroup/mesos/framework_201401171812-2907575306-5050-19011-0019_executor_default_tag_72c003a3-f213-479e-a7e3-9b86930703a7/memory.limit_in_bytes
>>>>>>>> 9223372036854775807
>>>>>>>>
>>>>>>>> [lin@mesos2 ~]$ cat
>>>>>>>> /cgroup/mesos/framework_201401171812-2907575306-5050-19011-0019_executor_default_tag_72c003a3-f213-479e-a7e3-9b86930703a7/memory.usage_in_bytes
>>>>>>>> 584146944
>>>>>>>>
>>>>>>>> [lin@mesos2 ~]$ cat
>>>>>>>> /cgroup/mesos/framework_201401171812-2907575306-5050-19011-0019_executor_default_tag_72c003a3-f213-479e-a7e3-9b86930703a7/memory.max_usage_in_bytes
>>>>>>>> 585809920
>>>>>>>>
>>>>>>>> Hmm, the limit is weird. Can you find anything wrong with the way
>>>>>>>> my mem is defined?
>>>>>>>>
>>>>>>>>
>>>>>>>> .addResources(Resource.newBuilder()
>>>>>>>>     .setName("mem")
>>>>>>>>     .setType(Value.Type.SCALAR)
>>>>>>>>     .setScalar(Value.Scalar.newBuilder().setValue(128)))
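The "weird" limit above is easy to decode (a quick sketch; the arithmetic is standard and nothing Mesos-specific is assumed): it is 2^63 - 1, the largest signed 64-bit integer, which is what the kernel reports as the default "unlimited" value of memory.limit_in_bytes. In other words, the hard limit was never set:

```python
# The observed memory.limit_in_bytes from the cgroup output above.
observed = 9223372036854775807

# 2**63 - 1 is the maximum signed 64-bit integer, which the kernel uses
# as the default ("unlimited") value for memory.limit_in_bytes.
print(observed == (1 << 63) - 1)  # True: the limit was never lowered
```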
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jan 21, 2014 at 2:02 PM, Vinod Kone <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Mesos uses cgroups
>>>>>>>>> <https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt>
>>>>>>>>> to limit cpu and memory.
>>>>>>>>>
>>>>>>>>> It is indeed surprising that your executor is not OOMing when
>>>>>>>>> using more memory than requested.
>>>>>>>>>
>>>>>>>>> Can you tell us what the following values look like in the
>>>>>>>>> executor's cgroup? These are the values the kernel uses to decide 
>>>>>>>>> whether
>>>>>>>>> the cgroup is hitting its limit.
>>>>>>>>>
>>>>>>>>> cat /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.limit_in_bytes
>>>>>>>>>
>>>>>>>>> cat /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.usage_in_bytes
>>>>>>>>>
>>>>>>>>> cat /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.max_usage_in_bytes
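The three files above can be collected with a small helper. This is a sketch; the cgroup directory layout follows the commands above, but the mount point varies by distro (the thread uses /cgroup/mesos, while /sys/fs/cgroup/memory is common elsewhere):

```python
import os

def read_cgroup_memory(cgroup_dir):
    """Read the executor cgroup's memory control files.

    `cgroup_dir` is the executor's cgroup directory, e.g.
    /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid> (adjust for
    your mount point). Each file holds a single integer.
    """
    stats = {}
    for name in ("memory.limit_in_bytes",
                 "memory.usage_in_bytes",
                 "memory.max_usage_in_bytes"):
        with open(os.path.join(cgroup_dir, name)) as f:
            stats[name] = int(f.read().strip())
    return stats
```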
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> @vinodkone
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jan 21, 2014 at 9:58 AM, Lin Zhao <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I'm new to Mesos and have some questions about resource
>>>>>>>>>> management. I want to understand how Mesos limits the resources
>>>>>>>>>> used by each executor, given the resources defined in TaskInfo. I
>>>>>>>>>> ran some tests and have seen different behavior for different
>>>>>>>>>> types of resources. It appears that Mesos caps CPU usage for the
>>>>>>>>>> executors, but doesn't limit the memory accessible to each
>>>>>>>>>> executor.
>>>>>>>>>>
>>>>>>>>>> I created an example java framework, which is largely taken from
>>>>>>>>>> the mesos example:
>>>>>>>>>>
>>>>>>>>>> https://gist.github.com/lin-zhao/8544495
>>>>>>>>>>
>>>>>>>>>> Basically,
>>>>>>>>>>
>>>>>>>>>> 1. The scheduler launches tasks with *2* cpus and *128 mb* of
>>>>>>>>>> memory.
>>>>>>>>>> 2. The executor launches java with *-Xms 1500m* and *-Xmx 1500m*.
>>>>>>>>>> 3. The java executor creates a byte array that uses *512 MB* of
>>>>>>>>>> memory.
>>>>>>>>>> 4. The java executor starts 3 threads that loop forever, which
>>>>>>>>>> potentially use *3* full cpus.
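The load in steps 3 and 4 can be sketched like this (a Python stand-in for the Java test executor in the gist above; the allocation size and thread count are parameters so the sketch stays cheap to run):

```python
import threading

def allocate(mb):
    # Step 3: hold `mb` megabytes in a single byte array.
    return bytearray(mb * 1024 * 1024)

def start_spinners(n, stop):
    # Step 4: n busy-loop threads, each potentially using a full cpu.
    def spin():
        while not stop.is_set():
            pass
    threads = [threading.Thread(target=spin) for _ in range(n)]
    for t in threads:
        t.start()
    return threads

# A scaled-down run (the real test used 512 MB and 3 threads forever):
stop = threading.Event()
buf = allocate(1)
workers = start_spinners(3, stop)
stop.set()
for t in workers:
    t.join()
print(len(buf))  # 1048576
```

Under correct enforcement, the cgroup limits described earlier in the thread should cap the spinners' total CPU at the task's cpus and OOM-kill the process once the allocation exceeds the task's memory.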
>>>>>>>>>>
>>>>>>>>>> The framework is launched in a 3 slave Mesos (v0.14.2) cluster
>>>>>>>>>> and finished without error.
>>>>>>>>>>
>>>>>>>>>> CPU: on the slaves, the cpu usage for the TestExecutor process is
>>>>>>>>>> capped at 199%, indicating that Mesos does cap CPU usage. When
>>>>>>>>>> the executor is assigned 1 cpu instead of 2, the cpu usage is
>>>>>>>>>> capped at 99%.
>>>>>>>>>>
>>>>>>>>>> Memory: There is no error thrown. The executors used > 512 MB of
>>>>>>>>>> memory and got away with it.
>>>>>>>>>>
>>>>>>>>>> Can someone confirm this? I haven't tested the other resource
>>>>>>>>>> types (ports, disk). Is the behavior documented somewhere?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Lin Zhao
>>>>>>>>>>
>>>>>>>>>> https://wiki.groupondev.com/Message_Bus
>>>>>>>>>> 3101 Park Blvd, Palo Alto, CA 94306
>>>>>>>>>>
>>>>>>>>>> Temporarily based in NY
>>>>>>>>>> 33 W 19th St.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>>
>>
>
>
>
>
