Hi Chris and Varun,

thanks for your suggestions. I played around with cgroups, and although
they mostly resolve the memory issue, I don't think they fit our needs
because of the other restrictions enforced on the container (mainly the
CPU restrictions). I created
https://issues.apache.org/jira/browse/YARN-4681 and submitted a very
simplistic version of a patch.

Thanks for the comments,
 Jan

On 02/05/2016 06:10 PM, Chris Nauroth wrote:
Interesting, I didn't know about "Locked" in smaps.  Thanks for pointing
that out.

At this point, if Varun's suggestion to check out YARN-1856 doesn't solve
the problem, then I suggest opening a JIRA to track further design
discussion.

--Chris Nauroth




On 2/5/16, 6:10 AM, "Varun Vasudev" <[email protected]> wrote:

Hi Jan,

YARN-1856 was recently committed; it allows admins to use cgroups
instead of the ProcfsBasedProcessTree-based monitoring. Would that
solve your problem? Note, however, that it requires the use of the
LinuxContainerExecutor.
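
For reference, a minimal yarn-site.xml sketch of the prerequisites (the
LinuxContainerExecutor with its cgroups resources handler). The group
value below is only a placeholder, and the specific property that
YARN-1856 adds to switch memory monitoring to cgroups is version
dependent, so it is not shown here:

<!-- Sketch only: enables the LinuxContainerExecutor with the cgroups
     resources handler that YARN-1856 builds on. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>
<property>
  <!-- Placeholder: the group that owns the container-executor binary. -->
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>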

-Varun



On 2/5/16, 6:45 PM, "Jan Lukavský" <[email protected]> wrote:

Hi Chris,

thanks for your reply. As far as I can tell, newer Linux kernels show
the locked memory in the "Locked" field.

If I mmap a file and mlock it, I see the following in the 'smaps' file:

7efd20aeb000-7efd2172b000 r--p 00000000 103:04 1870              /tmp/file.bin
Size:              12544 kB
Rss:               12544 kB
Pss:               12544 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:     12544 kB
Private_Dirty:         0 kB
Referenced:        12544 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:            12544 kB

...
# uname -a
Linux XXXXXX 3.2.0-4-amd64 #1 SMP Debian 3.2.68-1+deb7u3 x86_64 GNU/Linux

If I do this on an older kernel (2.6.x), the Locked field is missing.

I can make a patch for ProcfsBasedProcessTree that counts the "Locked"
pages instead of the "Private_Clean" pages (based on a configuration
option). The question is whether even more changes should be made to
the way the memory footprint is calculated. For instance, I believe the
kernel can also write dirty pages back to disk (if they are backed by a
file), making them clean, so it can later free them as well. Should I
open a JIRA to have some discussion on this topic?
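
For illustration, a rough standalone sketch of the accounting change I
have in mind. It reads a smaps file directly; it is not the actual
patch and does not use the ProcfsBasedProcessTree internals:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

/**
 * Standalone sketch: sums either the "Locked" or the "Private_Clean"
 * fields of a /proc/<pid>/smaps file. On older kernels (2.6.x) that
 * do not report "Locked", the locked total simply stays 0.
 */
public class SmapsLockedSum {

  public static long sumKb(String smapsPath, boolean useLocked) throws IOException {
    String field = useLocked ? "Locked:" : "Private_Clean:";
    long totalKb = 0;
    for (String line : Files.readAllLines(Paths.get(smapsPath))) {
      if (line.startsWith(field)) {
        // Lines look like "Locked:            12544 kB".
        String[] parts = line.trim().split("\\s+");
        totalKb += Long.parseLong(parts[1]);
      }
    }
    return totalKb;
  }

  public static void main(String[] args) throws IOException {
    String pid = args.length > 0 ? args[0] : "self";
    String path = "/proc/" + pid + "/smaps";
    System.out.println("Locked kB:        " + sumKb(path, true));
    System.out.println("Private_Clean kB: " + sumKb(path, false));
  }
}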

Regards,
  Jan


On 02/04/2016 07:20 PM, Chris Nauroth wrote:
Hello Jan,

I am moving this thread from [email protected] to
[email protected], since it's less a question of general usage
and more a question of internal code implementation details and
possible enhancements.

I think the issue is that it's not guaranteed in the general case that
Private_Clean pages are easily evictable from page cache by the kernel.
For example, if the pages have been pinned into RAM by calling mlock
[1], then the kernel cannot evict them.  Since YARN can execute any
code submitted by an application, including possibly code that calls
mlock, it takes a cautious approach and assumes that these pages must
be counted towards the process footprint.  Although your Spark use case
won't mlock the pages (I assume), YARN doesn't have a way to identify
this.

Perhaps there is room for improvement here.  If there is a reliable way
to count only mlock'ed pages, then perhaps that behavior could be added
as another option in ProcfsBasedProcessTree.  Off the top of my head, I
can't think of a reliable way to do this, and I can't research it
further immediately.  Do others on the thread have ideas?

--Chris Nauroth

[1] http://linux.die.net/man/2/mlock




On 2/4/16, 5:11 AM, "Jan Lukavský" <[email protected]> wrote:

Hello,

I have a question about the way the LinuxResourceCalculatorPlugin
calculates the memory consumed by a process tree (it is calculated via
the ProcfsBasedProcessTree class). When we enable (disk) caching in
Apache Spark jobs run on a YARN cluster, the node manager starts to
kill the containers while they read the cached data, because "Container
is running beyond memory limits ...". The reason is that even if we
enable parsing of the smaps file
(yarn.nodemanager.container-monitor.procfs-tree.smaps-based-rss.enabled),
ProcfsBasedProcessTree counts mmapped read-only pages as consumed by
the process tree, while Spark uses FileChannel.map(MapMode.READ_ONLY)
to read the cached data. The JVM then consumes *a lot* more memory than
the configured heap size (and this cannot really be controlled), but
this memory is IMO not really consumed by the process - the kernel can
reclaim these pages if needed. My question is - is there any explicit
reason why "Private_Clean" pages are counted as consumed by the process
tree? I patched ProcfsBasedProcessTree not to count them, but I don't
know if this is the "correct" solution.
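
For illustration, here is a small self-contained program (not Spark
code; the file path is just a placeholder) that reproduces the effect:
mapping a large file read-only and touching its pages makes them
resident, so they show up in smaps as Private_Clean while the Java heap
stays small:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

/**
 * Illustration only: maps a large file read-only (like the Spark disk
 * cache reads) and touches every page so it becomes resident. The
 * mapped pages then appear in /proc/<pid>/smaps although the Java heap
 * stays small.
 */
public class MmapReadOnlyDemo {
  public static void main(String[] args) throws IOException, InterruptedException {
    // Placeholder path; point it at any sufficiently large local file.
    String path = args.length > 0 ? args[0] : "/tmp/file.bin";
    try (FileChannel ch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
      // A single MappedByteBuffer is limited to 2 GB.
      long size = Math.min(ch.size(), Integer.MAX_VALUE);
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
      long sum = 0;
      for (int i = 0; i < buf.limit(); i += 4096) {
        sum += buf.get(i);   // touch one byte per page to fault it in
      }
      long heapUsed = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
      System.out.println("checksum: " + sum + ", heap used: " + heapUsed);
      Thread.sleep(60_000);  // leave time to inspect /proc/<pid>/smaps
    }
  }
}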

Thanks for opinions,
   cheers,
   Jan







--

Jan Lukavský
Development Team Lead
Seznam.cz, a.s.
Radlická 3294/10
15000, Praha 5

[email protected]
http://www.seznam.cz
