Thank you for the follow-up, Jan. I'll join the discussion on YARN-4681.

--Chris Nauroth
On 2/9/16, 3:22 AM, "Jan Lukavský" <[email protected]> wrote:

>Hi Chris and Varun,
>
>thanks for your suggestions. I played around with cgroups, and although they
>more or less resolve the memory issue, I don't think they fit our needs,
>because of the other restrictions enforced on the container (mainly the CPU
>restrictions). I created https://issues.apache.org/jira/browse/YARN-4681 and
>submitted a very simplistic version of the patch.
>
>Thanks for comments,
> Jan
>
>On 02/05/2016 06:10 PM, Chris Nauroth wrote:
>> Interesting, I didn't know about "Locked" in smaps. Thanks for pointing
>> that out.
>>
>> At this point, if Varun's suggestion to check out YARN-1856 doesn't solve
>> the problem, then I suggest opening a JIRA to track further design
>> discussion.
>>
>> --Chris Nauroth
>>
>> On 2/5/16, 6:10 AM, "Varun Vasudev" <[email protected]> wrote:
>>
>>> Hi Jan,
>>>
>>> YARN-1856 was recently committed, which allows admins to use cgroups
>>> instead of the ProcfsBasedProcessTree monitoring. Would that solve your
>>> problem? However, that requires the use of the LinuxContainerExecutor.
>>>
>>> -Varun
>>>
>>> On 2/5/16, 6:45 PM, "Jan Lukavský" <[email protected]> wrote:
>>>
>>>> Hi Chris,
>>>>
>>>> thanks for your reply. As far as I can see, newer Linux kernels report
>>>> the locked memory in the "Locked" field.
>>>>
>>>> If I mmap a file and mlock it, I see the following in the 'smaps' file:
>>>>
>>>> 7efd20aeb000-7efd2172b000 r--p 00000000 103:04 1870    /tmp/file.bin
>>>> Size:              12544 kB
>>>> Rss:               12544 kB
>>>> Pss:               12544 kB
>>>> Shared_Clean:          0 kB
>>>> Shared_Dirty:          0 kB
>>>> Private_Clean:     12544 kB
>>>> Private_Dirty:         0 kB
>>>> Referenced:        12544 kB
>>>> Anonymous:             0 kB
>>>> AnonHugePages:         0 kB
>>>> Swap:                  0 kB
>>>> KernelPageSize:        4 kB
>>>> MMUPageSize:           4 kB
>>>> Locked:            12544 kB
>>>>
>>>> ...
>>>> # uname -a
>>>> Linux XXXXXX 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u3 x86_64 GNU/Linux
>>>>
>>>> If I do this on an older kernel (2.6.x), the Locked field is missing.
>>>>
>>>> I can prepare a patch for ProcfsBasedProcessTree that counts the "Locked"
>>>> pages instead of the "Private_Clean" pages (based on a configuration
>>>> option). The question is whether even more changes should be made to the
>>>> way the memory footprint is calculated. For instance, I believe the
>>>> kernel can write even dirty pages back to disk (if they are backed by a
>>>> file), making them clean and therefore freeable later. Should I open a
>>>> JIRA to have some discussion on this topic?
>>>>
>>>> Regards,
>>>> Jan
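[A minimal, hypothetical sketch of the idea discussed above: summing the
per-mapping "Locked" values from /proc/<pid>/smaps. The class name is made up
for illustration; this is not the actual ProcfsBasedProcessTree code or the
YARN-4681 patch.]

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Standalone illustration only: sums the "Locked:" values (in kB) across all
// mappings in /proc/<pid>/smaps, i.e. the mlock'ed portion of the address
// space. On older kernels without the Locked field this simply returns 0.
public class SmapsLockedReader {

  public static long lockedKb(int pid) throws IOException {
    long totalKb = 0;
    for (String line : Files.readAllLines(
        Paths.get("/proc/" + pid + "/smaps"), StandardCharsets.UTF_8)) {
      // Matching lines look like: "Locked:            12544 kB"
      if (line.startsWith("Locked:")) {
        String[] parts = line.trim().split("\\s+");
        totalKb += Long.parseLong(parts[1]);
      }
    }
    return totalKb;
  }

  public static void main(String[] args) throws IOException {
    int pid = Integer.parseInt(args[0]);
    System.out.println("Locked (mlock'ed) memory: " + lockedKb(pid) + " kB");
  }
}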
>>>>
>>>> On 02/04/2016 07:20 PM, Chris Nauroth wrote:
>>>>> Hello Jan,
>>>>>
>>>>> I am moving this thread from [email protected] to
>>>>> [email protected], since it's less a question of general usage
>>>>> and more a question of internal code implementation details and
>>>>> possible enhancements.
>>>>>
>>>>> I think the issue is that it's not guaranteed in the general case that
>>>>> Private_Clean pages are easily evictable from page cache by the kernel.
>>>>> For example, if the pages have been pinned into RAM by calling mlock
>>>>> [1], then the kernel cannot evict them. Since YARN can execute any code
>>>>> submitted by an application, including possibly code that calls mlock,
>>>>> it takes a cautious approach and assumes that these pages must be
>>>>> counted towards the process footprint. Although your Spark use case
>>>>> won't mlock the pages (I assume), YARN doesn't have a way to identify
>>>>> this.
>>>>>
>>>>> Perhaps there is room for improvement here. If there is a reliable way
>>>>> to count only mlock'ed pages, then perhaps that behavior could be added
>>>>> as another option in ProcfsBasedProcessTree. Off the top of my head, I
>>>>> can't think of a reliable way to do this, and I can't research it
>>>>> further immediately. Do others on the thread have ideas?
>>>>>
>>>>> --Chris Nauroth
>>>>>
>>>>> [1] http://linux.die.net/man/2/mlock
>>>>>
>>>>> On 2/4/16, 5:11 AM, "Jan Lukavský" <[email protected]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a question about the way LinuxResourceCalculatorPlugin
>>>>>> calculates the memory consumed by a process tree (it is calculated via
>>>>>> the ProcfsBasedProcessTree class). When we enable disk caching in
>>>>>> Apache Spark jobs run on a YARN cluster, the node manager starts to
>>>>>> kill the containers while the cached data is being read, because of
>>>>>> "Container is running beyond memory limits ...". The reason is that
>>>>>> even if we enable parsing of the smaps file
>>>>>> (yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled),
>>>>>> ProcfsBasedProcessTree counts mmapped read-only pages as consumed by
>>>>>> the process tree, while Spark uses FileChannel.map(MapMode.READ_ONLY)
>>>>>> to read the cached data. The JVM then consumes *a lot* more memory
>>>>>> than the configured heap size (and this cannot really be controlled),
>>>>>> but this memory is IMO not really consumed by the process; the kernel
>>>>>> can reclaim these pages if needed. My question is: is there any
>>>>>> explicit reason why "Private_Clean" pages are counted as consumed by
>>>>>> the process tree? I patched ProcfsBasedProcessTree not to count them,
>>>>>> but I don't know whether this is the "correct" solution.
>>>>>>
>>>>>> Thanks for opinions,
>>>>>> cheers,
>>>>>> Jan
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>> For additional commands, e-mail: [email protected]
>>>>>>
>
>--
>
>Jan Lukavský
>Development Team Lead (Vedoucí týmu vývoje)
>Seznam.cz, a.s.
>Radlická 3294/10
>15000, Praha 5
>
>[email protected]
>http://www.seznam.cz
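
[A minimal, hypothetical sketch of the read-only mapping pattern described in
the original question; the path /tmp/file.bin is borrowed from the smaps
output above, and this is not Spark's actual caching code.]

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Standalone illustration: a read-only memory mapping of a file. Touching the
// mapped region faults the pages into the page cache, so the smaps-based RSS
// of the JVM grows by roughly the file size even though no Java heap is used;
// unless something mlocks them, the kernel can reclaim these pages under
// memory pressure. (A single mapping is limited to 2 GB.)
public class ReadOnlyMapExample {

  public static void main(String[] args) throws IOException {
    try (FileChannel ch = FileChannel.open(
        Paths.get("/tmp/file.bin"), StandardOpenOption.READ)) {
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
      long sum = 0;
      for (int i = 0; i < buf.limit(); i++) {
        sum += buf.get(i);  // each access may fault a page into memory
      }
      System.out.println("mapped " + buf.limit() + " bytes, byte sum = " + sum);
    }
  }
}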
