Hi Chris,
thanks for your reply. As far as I can see, newer Linux kernels report the
locked memory in the "Locked" field.
If I mmap a file and mlock it, I see the following in the 'smaps' file:
7efd20aeb000-7efd2172b000 r--p 00000000 103:04 1870    /tmp/file.bin
Size: 12544 kB
Rss: 12544 kB
Pss: 12544 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 12544 kB
Private_Dirty: 0 kB
Referenced: 12544 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 12544 kB
...
# uname -a
Linux XXXXXX 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u3 x86_64 GNU/Linux
If I do this on an older kernel (2.6.x), the Locked field is missing.
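For reference, a minimal sketch of the kind of test that produces such an
entry (the JNA binding for mlock(2) below is just one way to call it from
Java, any native wrapper would do, and it needs a sufficient RLIMIT_MEMLOCK
or root to succeed):

import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.NativeLong;
import com.sun.jna.Pointer;

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MlockSmapsDemo {

    // Illustrative JNA binding to libc's mlock(2); a JNI wrapper would work too.
    public interface CLib extends Library {
        CLib INSTANCE = Native.load("c", CLib.class);
        int mlock(Pointer addr, NativeLong len);
    }

    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(
                Paths.get("/tmp/file.bin"), StandardOpenOption.READ)) {
            long size = Math.min(ch.size(), Integer.MAX_VALUE);
            // Map the file read-only and touch the pages so they become resident.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            buf.load();
            // Pin the pages; on a 3.x kernel the mapping's "Locked" field in
            // /proc/<pid>/smaps should then match its Rss.
            int rc = CLib.INSTANCE.mlock(
                Native.getDirectBufferPointer(buf), new NativeLong(size));
            System.out.println("mlock returned " + rc);
            Thread.sleep(Long.MAX_VALUE); // keep the mapping alive for inspection
        }
    }
}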
I can prepare a patch for ProcfsBasedProcessTree that counts the "Locked"
pages instead of the "Private_Clean" pages, selected by a configuration
option (a rough sketch of the idea is below). The question is whether the
memory footprint calculation should change even further. For instance, I
believe the kernel can write even dirty pages back to disk (as long as
they are file-backed), making them clean and therefore reclaimable later.
Should I open a JIRA so we can discuss this?
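To make the idea concrete, here is a standalone sketch of the calculation.
This is not the real ProcfsBasedProcessTree code: the class name and the
boolean flag are made up, and the real implementation also accounts for
shared/PSS pages, which I omit here. It only illustrates substituting
"Locked" for "Private_Clean":

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SmapsFootprintSketch {

    /** Sums one field (e.g. "Private_Clean" or "Locked") over all mappings
     *  in /proc/<pid>/smaps, returning the total in kB. */
    static long sumSmapsField(String pid, String field) throws IOException {
        long totalKb = 0;
        String prefix = field + ":";
        try (BufferedReader r =
                 new BufferedReader(new FileReader("/proc/" + pid + "/smaps"))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.startsWith(prefix)) {
                    // Lines look like "Locked:            12544 kB"
                    String[] parts = line.split("\\s+");
                    totalKb += Long.parseLong(parts[1]);
                }
            }
        }
        return totalKb;
    }

    /** If useLockedInsteadOfPrivateClean (the proposed config option) is set,
     *  count Locked pages; otherwise count Private_Clean as today.
     *  Private_Dirty stands in for the rest of the existing formula. */
    static long footprintKb(String pid, boolean useLockedInsteadOfPrivateClean)
            throws IOException {
        long dirty = sumSmapsField(pid, "Private_Dirty");
        String cleanField =
            useLockedInsteadOfPrivateClean ? "Locked" : "Private_Clean";
        return dirty + sumSmapsField(pid, cleanField);
    }
}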
Regards,
Jan
On 02/04/2016 07:20 PM, Chris Nauroth wrote:
Hello Jan,
I am moving this thread from [email protected] to
[email protected], since it's less a question of general usage
and more a question of internal code implementation details and possible
enhancements.
I think the issue is that it's not guaranteed in the general case that
Private_Clean pages are easily evictable from page cache by the kernel.
For example, if the pages have been pinned into RAM by calling mlock [1],
then the kernel cannot evict them. Since YARN can execute any code
submitted by an application, including possibly code that calls mlock, it
takes a cautious approach and assumes that these pages must be counted
towards the process footprint. Although your Spark use case won't mlock
the pages (I assume), YARN doesn't have a way to identify this.
Perhaps there is room for improvement here. If there is a reliable way to
count only mlock'ed pages, then perhaps that behavior could be added as
another option in ProcfsBasedProcessTree. Off the top of my head, I can't
think of a reliable way to do this, and I can't research it further
immediately. Do others on the thread have ideas?
--Chris Nauroth
[1] http://linux.die.net/man/2/mlock
On 2/4/16, 5:11 AM, "Jan Lukavský" <[email protected]> wrote:
Hello,
I have a question about the way LinuxResourceCalculatorPlugin calculates
the memory consumed by a process tree (the calculation is done in the
ProcfsBasedProcessTree class). When we enable disk caching in Apache
Spark jobs running on a YARN cluster, the node manager starts killing the
containers while the cached data is being read, because of "Container is
running beyond memory limits ...". The reason is that even when we enable
parsing of the smaps file
(yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled),
ProcfsBasedProcessTree counts mmapped read-only pages as consumed by the
process tree, and Spark uses FileChannel.map(MapMode.READ_ONLY) to read
the cached data. The JVM then appears to consume *a lot* more memory than
the configured heap size (and this cannot really be controlled), but in my
opinion this memory is not really consumed by the process, because the
kernel can reclaim these pages if needed. My question is: is there any
explicit reason why "Private_Clean" pages are counted as consumed by the
process tree? I patched ProcfsBasedProcessTree not to count them, but I
don't know whether this is the "correct" solution.
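To illustrate the access pattern (this is only a sketch of the same API
usage, not Spark's actual code, and the file name is made up):

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ReadOnlyMapExample {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(
                Paths.get("/tmp/spark-cached-block.bin"),
                StandardOpenOption.READ)) {
            long size = Math.min(ch.size(), Integer.MAX_VALUE);
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            long checksum = 0;
            // Reading every page faults it into the page cache; the resident
            // pages show up in this JVM's Rss (and Private_Clean in smaps)
            // even though they live outside the Java heap and are reclaimable.
            for (long pos = 0; pos < buf.capacity(); pos += 4096) {
                checksum += buf.get((int) pos);
            }
            System.out.println("checksum=" + checksum);
        }
    }
}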
Thanks for your opinions,
cheers,
Jan