In our production environment of ~1800 nodes we've seen oom-kill
events that looked similar to this bug's pattern: the OOM killer
terminating large server processes while their resident memory was far
lower than the available physical memory.

We were affected by the original bug, and that issue was readily
addressed in newer kernel versions, as mentioned in the earlier
comments on this ticket. However, we kept seeing oom-kill events on
kernel-upgraded systems, albeit in far lower numbers over time. These
remained a mystery for a while, largely because of how infrequently
they occurred.

After a lot of research we think we've pinned it down to a subset of
our multi-socket servers that have more than one NUMA memory pool.
After implementing some scripts to track NUMA stats, we've observed
that one of the two NUMA pools is being fully utilized while the other
has large amounts of memory to spare (often 90-95% free). Either our
server app, the JVM it's running on, or the kernel itself isn't
handling the NUMA memory placement well, and we end up exhausting an
entire NUMA pool.
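
To give a purely illustrative idea of what that tracking looks like
(this is only a sketch, not our actual tooling, and the sysfs paths
assume a typical Linux layout), a small script that reads each node's
meminfo is enough to spot one node running dry while the other sits
mostly idle; "numastat -m" from the numactl package reports much the
same breakdown:

    import glob
    import re

    # Print MemFree vs MemTotal for every NUMA node exposed under
    # sysfs, so an imbalance (one node nearly exhausted, the other
    # mostly free) shows up at a glance.
    for path in sorted(glob.glob("/sys/devices/system/node/node[0-9]*/meminfo")):
        node = path.split("/")[-2]  # e.g. "node0"
        stats = {}
        with open(path) as f:
            for line in f:
                # Lines look like: "Node 0 MemTotal:  131931732 kB"
                m = re.match(r"Node\s+\d+\s+(\w+):\s+(\d+)\s+kB", line)
                if m:
                    stats[m.group(1)] = int(m.group(2))
        if "MemTotal" in stats and "MemFree" in stats:
            free_pct = 100.0 * stats["MemFree"] / stats["MemTotal"]
            print("%s: %d kB free of %d kB (%.1f%% free)"
                  % (node, stats["MemFree"], stats["MemTotal"], free_pct))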

Work is ongoing to trace the causal chain leading to this. We don't
yet have confirmation whether it's something our app (or its
libraries) is doing, whether we just need to make the JVM NUMA-aware
with the right arguments, or whether there's kernel tuning to be done.
But I did want to mention it here as a warning to folks running on
multi-NUMA-pool multi-socket systems who see similar behavior.
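
For reference, and with the caveat that we haven't confirmed any of
these for our own case, the knobs people typically reach for in this
situation are along the lines of:

  - JVM: -XX:+UseNUMA to enable NUMA-aware heap allocation
  - Process placement: launching under "numactl --interleave=all" to
    spread allocations across nodes
  - Kernel: vm.zone_reclaim_mode (a setting of 1 makes the kernel
    prefer reclaiming within the local node over allocating from a
    remote one)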
