Interesting. Looking at the log, the OOM only fires when the executor is shut down (19:44:07.180585), which is 300 seconds after the job was launched and the memory was allocated. During those 300 seconds, usage_in_bytes and max_usage_in_bytes both read 0.
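A minimal Java sketch for sanity-checking the cgroup numbers quoted in this thread. The class and method names (CgroupMemoryCheck, isEffectivelyUnlimited, toMb, exceedsRequest) are hypothetical and not part of Mesos. It assumes that a memory.limit_in_bytes equal to Long.MAX_VALUE means no limit was applied to the cgroup, which is what the reported value 9223372036854775807 suggests.

```java
// Hypothetical helper, not part of Mesos: interprets the raw cgroup values
// that were pasted in this thread.
class CgroupMemoryCheck {
    static final long BYTES_PER_MB = 1024L * 1024L;

    // Assumption: a limit equal to Long.MAX_VALUE (2^63 - 1) means the
    // cgroup has no memory limit configured.
    static boolean isEffectivelyUnlimited(long limitBytes) {
        return limitBytes == Long.MAX_VALUE;
    }

    // Converts bytes to whole megabytes, rounding down.
    static long toMb(long bytes) {
        return bytes / BYTES_PER_MB;
    }

    // True when the observed usage is above the "mem" value requested in TaskInfo.
    static boolean exceedsRequest(long usageBytes, long requestedMb) {
        return usageBytes > requestedMb * BYTES_PER_MB;
    }

    public static void main(String[] args) {
        long limit = 9223372036854775807L; // memory.limit_in_bytes from the thread
        long usage = 584146944L;           // memory.usage_in_bytes from the thread
        long requestedMb = 128L;           // "mem" scalar set on the task

        System.out.println("limit looks unlimited: " + isEffectivelyUnlimited(limit));
        System.out.println("usage in MB: " + toMb(usage));
        System.out.println("usage exceeds request: " + exceedsRequest(usage, requestedMb));
    }
}
```

With the values from the thread, the limit is exactly Long.MAX_VALUE while the usage is well above the 128 MB requested, which is consistent with no limit having been written to the cgroup at that point.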
Attaching the log. Any idea why the OOM is so slow? As you can see at https://gist.github.com/lin-zhao/8544495#file-testexecutor-java-L80, 512M of memory is used before the sleep.

On Tue, Jan 21, 2014 at 2:28 PM, Vinod Kone <[email protected]> wrote:

> The way you set task resources looks correct.
>
> Can you paste what the slave logs say regarding the task/executor, esp.
> the lines that are from the cgroups isolator? Also, what is the command
> line of the slave?
>
> @vinodkone
>
> On Tue, Jan 21, 2014 at 11:18 AM, Lin Zhao <[email protected]> wrote:
>
>> *[lin@mesos2 ~]$ cat /cgroup/mesos/framework_201401171812-2907575306-5050-19011-0019_executor_default_tag_72c003a3-f213-479e-a7e3-9b86930703a7/memory.limit_in_bytes*
>>
>> *9223372036854775807*
>>
>> *[lin@mesos2 ~]$ cat /cgroup/mesos/framework_201401171812-2907575306-5050-19011-0019_executor_default_tag_72c003a3-f213-479e-a7e3-9b86930703a7/memory.usage_in_bytes*
>>
>> *584146944*
>>
>> *[lin@mesos2 ~]$ cat /cgroup/mesos/framework_201401171812-2907575306-5050-19011-0019_executor_default_tag_72c003a3-f213-479e-a7e3-9b86930703a7/memory.max_usage_in_bytes*
>>
>> *585809920*
>>
>> Hmm the limit is weird. Can you find anything wrong about the way my mem
>> is defined?
>>
>> .addResources(Resource.newBuilder()
>>     .setName("mem")
>>     .setType(Value.Type.SCALAR)
>>     .setScalar(Value.Scalar.newBuilder().setValue(128)))
>>
>> On Tue, Jan 21, 2014 at 2:02 PM, Vinod Kone <[email protected]> wrote:
>>
>>> Mesos uses cgroups <https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt>
>>> to limit cpu and memory.
>>>
>>> It is indeed surprising that your executor is not OOMing when using more
>>> memory than requested.
>>>
>>> Can you tell us what the following values look like in the executor's
>>> cgroup? These are the values the kernel uses to decide whether the cgroup
>>> is hitting its limit.
>>>
>>> cat /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.limit_in_bytes
>>>
>>> cat /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.usage_in_bytes
>>>
>>> cat /cgroup/mesos/framework_<foo>_executor_<bar>_<uuid>/memory.max_usage_in_bytes
>>>
>>> @vinodkone
>>>
>>> On Tue, Jan 21, 2014 at 9:58 AM, Lin Zhao <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm new to Mesos and have some questions about resource management. I
>>>> want to understand how Mesos limits the resources used by each executor,
>>>> given the resources defined in TaskInfo. I did some tests and have seen
>>>> different behavior for different types of resources. It appears that
>>>> Mesos caps CPU usage for the executors, but doesn't limit the memory
>>>> accessible to each executor.
>>>>
>>>> I created an example java framework, which is largely taken from the
>>>> mesos example:
>>>>
>>>> https://gist.github.com/lin-zhao/8544495
>>>>
>>>> Basically,
>>>>
>>>> 1. The Scheduler launches tasks with *2* cpus and *128 mb* memory.
>>>> 2. The executor launches java with *-Xms 1500m* and *-Xmx 1500m*.
>>>> 3. The java executor creates a byte array that uses *512 MB* memory.
>>>> 4. The java executor starts 3 threads that loop forever, which
>>>>    potentially uses *3* full cpus.
>>>>
>>>> The framework is launched in a 3 slave Mesos (v0.14.2) cluster and
>>>> finished without error.
>>>>
>>>> CPU: on the slaves, the cpu usage for the TestExecutor process is
>>>> capped at 199%, indicating that Mesos does cap CPU usage. When the
>>>> executor is assigned 1 cpu instead of 2, the cpu usage is capped at 99%.
>>>>
>>>> Memory: There is no error thrown. The executors used > 512 MB memory
>>>> and get away with it.
>>>>
>>>> Can someone confirm this? I haven't tested the other resource types
>>>> (ports, disk). Is the behavior documented somewhere?
>>>>
>>>> --
>>>> Lin Zhao
>>>>
>>>> https://wiki.groupondev.com/Message_Bus
>>>> 3101 Park Blvd, Palo Alto, CA 94306
>>>>
>>>> Temporarily based in NY
>>>> 33 W 19th St.
>>
>> --
>> Lin Zhao
>>
>> https://wiki.groupondev.com/Message_Bus
>> 3101 Park Blvd, Palo Alto, CA 94306
>>
>> Temporarily based in NY
>> 33 W 19th St.

--
Lin Zhao

https://wiki.groupondev.com/Message_Bus
3101 Park Blvd, Palo Alto, CA 94306

Temporarily based in NY
33 W 19th St.
slave.log
Description: Binary data

