Interesting, I didn't know about "Locked" in smaps. Thanks for pointing that out.
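For reference, counting only locked pages would roughly mean summing the
"Locked:" entries across all of the mappings in /proc/<pid>/smaps. A minimal
standalone sketch of that idea (illustrative only, not the actual
ProcfsBasedProcessTree code; the class and method names are made up, and it
assumes a kernel new enough to expose the Locked field):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

/** Illustrative sketch: sum the "Locked:" fields of /proc/<pid>/smaps (values are in kB). */
public class SmapsLockedReader {

  /** Total locked memory of the given process in kB; 0 if smaps is unreadable or has no Locked field. */
  public static long totalLockedKb(long pid) {
    try (Stream<String> lines = Files.lines(Paths.get("/proc/" + pid + "/smaps"))) {
      return lines.filter(line -> line.startsWith("Locked:"))
          // Lines look like "Locked:         12544 kB"; take the numeric column.
          .mapToLong(line -> Long.parseLong(line.split("\\s+")[1]))
          .sum();
    } catch (IOException e) {
      // Older kernels (e.g. 2.6.x) may not expose smaps or the Locked field at all.
      return 0L;
    }
  }

  public static void main(String[] args) {
    long pid = args.length > 0 ? Long.parseLong(args[0]) : ProcessHandle.current().pid();
    System.out.println("Locked: " + totalLockedKb(pid) + " kB");
  }
}

On kernels that do not report the Locked field, the sum simply comes out as 0,
which lines up with Jan's observation about 2.6.x below.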
At this point, if Varun's suggestion to check out YARN-1856 doesn't solve the
problem, then I suggest opening a JIRA to track further design discussion.

--Chris Nauroth

On 2/5/16, 6:10 AM, "Varun Vasudev" <[email protected]> wrote:

>Hi Jan,
>
>YARN-1856 was recently committed, which allows admins to use cgroups
>instead of the ProcfsBasedProcessTree monitoring. Would that solve your
>problem? However, that requires usage of the LinuxContainerExecutor.
>
>-Varun
>
>
>
>On 2/5/16, 6:45 PM, "Jan Lukavský" <[email protected]> wrote:
>
>>Hi Chris,
>>
>>thanks for your reply. As far as I can see, newer Linux kernels show
>>the locked memory in the "Locked" field.
>>
>>If I mmap a file and mlock it, I see the following in the 'smaps' file:
>>
>>7efd20aeb000-7efd2172b000 r--p 00000000 103:04 1870       /tmp/file.bin
>>Size:           12544 kB
>>Rss:            12544 kB
>>Pss:            12544 kB
>>Shared_Clean:       0 kB
>>Shared_Dirty:       0 kB
>>Private_Clean:  12544 kB
>>Private_Dirty:      0 kB
>>Referenced:     12544 kB
>>Anonymous:          0 kB
>>AnonHugePages:      0 kB
>>Swap:               0 kB
>>KernelPageSize:     4 kB
>>MMUPageSize:        4 kB
>>Locked:         12544 kB
>>
>>...
>># uname -a
>>Linux XXXXXX 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u3 x86_64 GNU/Linux
>>
>>If I do this on an older kernel (2.6.x), the Locked field is missing.
>>
>>I can make a patch for the ProcfsBasedProcessTree that will count the
>>"Locked" pages instead of the "Private_Clean" ones (based on a
>>configuration option). The question is: should even more changes be made
>>to the way the memory footprint is calculated? For instance, I believe
>>the kernel can write even dirty pages to disk (if they are backed by a
>>file), making them clean so that it can later free them. Should I open a
>>JIRA to have some discussion on this topic?
>>
>>Regards,
>> Jan
>>
>>
>>On 02/04/2016 07:20 PM, Chris Nauroth wrote:
>>> Hello Jan,
>>>
>>> I am moving this thread from [email protected] to
>>> [email protected], since it's less a question of general usage and
>>> more a question of internal code implementation details and possible
>>> enhancements.
>>>
>>> I think the issue is that it's not guaranteed in the general case that
>>> Private_Clean pages are easily evictable from the page cache by the
>>> kernel. For example, if the pages have been pinned into RAM by calling
>>> mlock [1], then the kernel cannot evict them. Since YARN can execute
>>> any code submitted by an application, including possibly code that
>>> calls mlock, it takes a cautious approach and assumes that these pages
>>> must be counted towards the process footprint. Although your Spark use
>>> case won't mlock the pages (I assume), YARN doesn't have a way to
>>> identify this.
>>>
>>> Perhaps there is room for improvement here. If there is a reliable way
>>> to count only mlock'ed pages, then perhaps that behavior could be
>>> added as another option in ProcfsBasedProcessTree. Off the top of my
>>> head, I can't think of a reliable way to do this, and I can't research
>>> it further immediately. Do others on the thread have ideas?
>>>
>>> --Chris Nauroth
>>>
>>> [1] http://linux.die.net/man/2/mlock
>>>
>>>
>>>
>>>
>>> On 2/4/16, 5:11 AM, "Jan Lukavský" <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I have a question about the way the LinuxResourceCalculatorPlugin
>>>> calculates the memory consumed by a process tree (it is calculated
>>>> via the ProcfsBasedProcessTree class). When we enable (disk) caching
>>>> in Apache Spark jobs run on a YARN cluster, the NodeManager starts to
>>>> kill the containers while reading the cached data, because of
>>>> "Container is running beyond memory limits ...". The reason is that
>>>> even if we enable parsing of the smaps file
>>>> (yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled),
>>>> the ProcfsBasedProcessTree counts mmapped read-only pages as consumed
>>>> by the process tree, while Spark uses
>>>> FileChannel.map(MapMode.READ_ONLY) to read the cached data. The JVM
>>>> then consumes *a lot* more memory than the configured heap size (and
>>>> this cannot really be controlled), but IMO this memory is not really
>>>> consumed by the process; the kernel can reclaim these pages if
>>>> needed. My question is: is there any explicit reason why
>>>> "Private_Clean" pages are counted as consumed by the process tree? I
>>>> patched the ProcfsBasedProcessTree not to count them, but I don't
>>>> know if this is the "correct" solution.
>>>>
>>>> Thanks for opinions,
>>>> cheers,
>>>> Jan
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>>>
>>
>
>
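A footnote to the original question above: the behaviour Jan describes can be
reproduced outside of Spark with a small standalone program that maps a file
read-only and reads through it. A minimal sketch (the file path and class name
are just examples):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative sketch: read a file through a read-only mapping, similar in spirit to
 *  the FileChannel.map(MapMode.READ_ONLY) path Spark uses for cached blocks. */
public class ReadOnlyMapExample {
  public static void main(String[] args) throws IOException, InterruptedException {
    Path file = Path.of(args.length > 0 ? args[0] : "/tmp/file.bin");  // example path only
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      // Map the whole file (the sketch assumes it fits in a single mapping, i.e. < 2 GB).
      MappedByteBuffer buf = ch.map(MapMode.READ_ONLY, 0, ch.size());
      long sum = 0;
      // Touching the buffer faults the pages in; they are then counted in Rss (and
      // typically Private_Clean) of /proc/<pid>/smaps, while Locked stays at 0 kB.
      while (buf.hasRemaining()) {
        sum += buf.get();
      }
      System.out.println("checksum=" + sum + ", pid=" + ProcessHandle.current().pid());
      // Keep the mapping alive for a while so the smaps entry can be inspected.
      Thread.sleep(60_000);
    }
  }
}

Inspecting /proc/<pid>/smaps for that mapping while the program sleeps should
show the pattern from Jan's mail: Rss and Private_Clean roughly equal to the
file size, and Locked: 0 kB.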
