Interesting. Looking at the log, it seems that the OOM is fired when the
executor is shut down (19:44:07.180585), which is 300 seconds after the job
launches and allocates memory. During those 300 seconds, usage_in_bytes and
max_usage_in_bytes read 0.

I'm attaching the log. Any idea why the OOM is so slow? As you can see at
https://gist.github.com/lin-zhao/8544495#file-testexecutor-java-L80, 512 MB
of memory is allocated before the sleep.
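For reference, the pattern at that point in the gist is roughly the following. This is an illustrative sketch with made-up names, not the exact gist code; the buffer is shrunk from 512 MB to 8 MB so it runs anywhere. Note that every page is touched after allocation, which is what should make the cgroup's usage_in_bytes actually rise (a large -Xms reserves address space, but usage only grows as pages are committed):

```java
// Sketch of "allocate, then sleep" (hypothetical names; the real code
// is TestExecutor.java in the gist).
public class AllocThenSleep {
    static final int MB = 1024 * 1024;

    // Allocate `mb` megabytes and touch every page so the kernel commits
    // the memory (and the cgroup's usage_in_bytes should reflect it).
    static byte[] allocate(int mb) {
        byte[] buf = new byte[mb * MB];
        for (int i = 0; i < buf.length; i += 4096) {
            buf[i] = 1;
        }
        return buf;
    }

    public static void main(String[] args) throws InterruptedException {
        byte[] buf = allocate(8);   // the gist uses ~512 MB
        Thread.sleep(1000);         // the gist sleeps much longer (~300 s)
        System.out.println(buf.length / MB);
    }
}
```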


On Tue, Jan 21, 2014 at 2:28 PM, Vinod Kone <[email protected]> wrote:

> The way you set task resources looks correct.
>
> Can you paste what the slave logs say regarding the task/executor, esp.
> the lines that are from the cgroups isolator? Also, what is the command
> line of the slave?
>
>
> @vinodkone
>
>
> On Tue, Jan 21, 2014 at 11:18 AM, Lin Zhao <[email protected]> wrote:
>
>>
>> [lin@mesos2 ~]$ cat /cgroup/mesos/framework_201401171812-2907575306-5050-19011-0019_executor_default_tag_72c003a3-f213-479e-a7e3-9b86930703a7/memory.limit_in_bytes
>> 9223372036854775807
>>
>> [lin@mesos2 ~]$ cat /cgroup/mesos/framework_201401171812-2907575306-5050-19011-0019_executor_default_tag_72c003a3-f213-479e-a7e3-9b86930703a7/memory.usage_in_bytes
>> 584146944
>>
>> [lin@mesos2 ~]$ cat /cgroup/mesos/framework_201401171812-2907575306-5050-19011-0019_executor_default_tag_72c003a3-f213-479e-a7e3-9b86930703a7/memory.max_usage_in_bytes
>> 585809920
>>
>> Hmm, the limit is weird. Can you see anything wrong with the way my mem
>> resource is defined?
>>
>>
>> .addResources(Resource.newBuilder()
>>     .setName("mem")
>>     .setType(Value.Type.SCALAR)
>>     .setScalar(Value.Scalar.newBuilder().setValue(128)))
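One thing worth noting about the limit reading above: 9223372036854775807 is 2^63 - 1, i.e. Java's Long.MAX_VALUE, which is the memory cgroup's effectively-unlimited default for memory.limit_in_bytes. That suggests the 128 MB limit was never actually written into this cgroup, not that a wrong limit was written:

```java
// The "weird" value read from memory.limit_in_bytes is the kernel's
// "no limit" sentinel: 2^63 - 1, which equals Long.MAX_VALUE in Java.
public class LimitSentinel {
    public static void main(String[] args) {
        long observedLimit = 9223372036854775807L;  // value from the cgroup
        System.out.println(observedLimit == Long.MAX_VALUE);  // prints "true"
    }
}
```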
>>
>>
>>
>>
>> On Tue, Jan 21, 2014 at 2:02 PM, Vinod Kone <[email protected]> wrote:
>>
>>> Mesos uses cgroups
>>> (https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt) to limit
>>> cpu and memory.
>>>
>>> It is indeed surprising that your executor is not OOMing when using more
>>> memory than requested.
>>>
>>> Can you tell us what the following values look like in the executor's
>>> cgroup? These are the values the kernel uses to decide whether the cgroup
>>> is hitting its limit.
>>>
>>> cat
>>> /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.limit_in_bytes
>>>
>>> cat
>>> /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.usage_in_bytes
>>>
>>> cat
>>> /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.max_usage_in_bytes
>>>
>>>
>>>
>>> @vinodkone
>>>
>>>
>>> On Tue, Jan 21, 2014 at 9:58 AM, Lin Zhao <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm new to Mesos and have some questions about resource management. I
>>>> want to understand how Mesos limits resources used by each executors, given
>>>> resources defined in TaskInfo. I did some tests and have seen different
>>>> behavior for different types of resources. It appears that Mesos caps CPU
>>>> usage for the executors, but doesn't limit the memory accessible to each
>>>> executor.
>>>>
>>>> I created an example java framework, which is largely taken from the
>>>> mesos example:
>>>>
>>>> https://gist.github.com/lin-zhao/8544495
>>>>
>>>> Basically,
>>>>
>>>> 1. The Scheduler launches tasks with *2* cpus and *128 MB* of memory.
>>>> 2. The executor launches java with *-Xms1500m* and *-Xmx1500m*.
>>>> 3. The java executor creates a byte array that uses *512 MB* of memory.
>>>> 4. The java executor starts 3 threads that loop forever, which could
>>>> potentially use *3* full cpus.
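The four steps above can be sketched in plain Java. This is an approximation of the gist with hypothetical names, not the exact TestExecutor code; the buffer is shrunk and the busy loops are bounded so the sketch terminates:

```java
// Sketch of the test executor: allocate a big buffer, then pin several
// cores with busy-loop threads (the gist's threads loop forever).
public class ExecutorSketch {
    static final int MB = 1024 * 1024;

    // Spin `n` CPU-bound threads for roughly `millis` ms each.
    static void spin(int n, long millis) {
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            threads[i] = new Thread(() -> {
                long deadline = System.nanoTime() + millis * 1_000_000L;
                while (System.nanoTime() < deadline) { /* burn cpu */ }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        byte[] buf = new byte[8 * MB];  // the gist allocates ~512 MB
        spin(3, 200);                   // the gist's 3 threads never stop
        System.out.println(buf.length / MB);  // prints "8"
    }
}
```

With 3 spinners and a 2-cpu reservation, the cgroup cpu subsystem is what throttles the process to ~200%, matching the 199% observation below.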
>>>>
>>>> The framework is launched on a 3-slave Mesos (v0.14.2) cluster and
>>>> finishes without error.
>>>>
>>>> CPU: on the slaves, the cpu usage of the TestExecutor process is
>>>> capped at 199%, indicating that Mesos does cap CPU usage. When the
>>>> executors are assigned 1 cpu instead of 2, the usage is capped at 99%.
>>>>
>>>> Memory: there is no error thrown. The executors use > 512 MB of memory
>>>> and get away with it.
>>>>
>>>> Can someone confirm this? I haven't tested the other resource types
>>>> (ports, disk). Is the behavior documented somewhere?
>>>>
>>>> --
>>>> Lin Zhao
>>>>
>>>> https://wiki.groupondev.com/Message_Bus
>>>> 3101 Park Blvd, Palo Alto, CA 94306
>>>>
>>>> Temporarily based in NY
>>>> 33 W 19th St.
>>>>
>>>>
>>>
>>
>>
>>
>>
>



Attachment: slave.log