Hi Jiatong,

Thanks for emailing me. I am happy to answer questions anytime.

> 1. why linux-hwe-4.15.0 source code is used?

If you look closely at the oops in the description, the customer I was
working with was running:

4.15.0-106-generic #107~16.04.1-Ubuntu
 
This is the Xenial (16.04) HWE kernel. I used the linux-hwe-4.15.0 source
code so that the source matched the debug symbol package exactly.

In your case:

4.15.0-72-generic #81-Ubuntu

you are running the 4.15 kernel on normal Bionic (18.04), so we can use
the normal linux-4.15.0 source code.
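If it helps to tell the two apart mechanically, here is a rough sketch in C. It leans on a heuristic I am assuming rather than on any official interface: HWE builds carry a "~<release>" suffix in the build tag (as in "#107~16.04.1-Ubuntu"), and the release's own kernel does not (as in "#81-Ubuntu"). The function name looks_like_hwe_build() is made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Heuristic only (an assumption, not an Ubuntu API): report whether the
 * build tag that follows '#' contains a '~', as in "#107~16.04.1-Ubuntu".
 */
bool looks_like_hwe_build(const char *version_line)
{
    const char *hash = strchr(version_line, '#');

    if (hash == NULL)
        return false;

    /* Scan the build tag up to the next space or the end of the string. */
    for (const char *p = hash + 1; *p != '\0' && *p != ' '; p++) {
        if (*p == '~')
            return true;
    }
    return false;
}
```

Passing it the two version lines above should classify the first as HWE and the second as the regular Bionic kernel.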

> 2. we are using linux-4.15.0-unsigned and by skimming through the
> source code, looks like try_get_page is not defined at that time?

Yes! You are correct, the original mainline 4.15 kernel did not have
try_get_page() defined at:

https://elixir.bootlin.com/linux/v4.15/source/mm/gup.c#L156

But if you look closely at the actual kernel sources for
4.15.0-72-generic:

https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/mm/gup.c?h=Ubuntu-4.15.0-72.81#n156

We see that try_get_page() is there. That is because we backported:

commit 8fde12ca79aff9b5ba951fce1a2641901b8d8e64
Author: Linus Torvalds <[email protected]>
Date:   Thu Apr 11 10:49:19 2019 -0700
Subject: mm: prevent get_user_pages() from overflowing page refcount
Link: https://github.com/torvalds/linux/commit/8fde12ca79aff9b5ba951fce1a2641901b8d8e64

Ubuntu 4.15 backport link: https://paste.ubuntu.com/p/2bF5WWQy2r/

That commit first turned up in 4.15.0-59-generic, via upstream-stable.

Anyway, let's have a look at your stack trace:

4.15.0-72-generic #81-Ubuntu
RIP: 0010:follow_page_pte+0x663/0x6d0
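The part after "0010:" reads as symbol+offset/size: the faulting instruction is 0x663 bytes into follow_page_pte, which is 0x6d0 bytes long. As a small sketch of pulling those fields apart in C (the struct and function names here are invented for illustration, and the caller is assumed to have already stripped the "RIP: 0010:" prefix):

```c
#include <stdio.h>

/* Hypothetical container for the three fields of "symbol+offset/size". */
struct rip_location {
    char symbol[64];
    unsigned long offset;   /* bytes from the start of the function */
    unsigned long size;     /* total size of the function in bytes */
};

/* Returns 0 on success, -1 if the text is not in symbol+0xOFF/0xSIZE form. */
int parse_rip_location(const char *text, struct rip_location *out)
{
    /* %lx accepts an optional 0x prefix, so the oops format parses directly. */
    if (sscanf(text, "%63[^+]+%lx/%lx",
               out->symbol, &out->offset, &out->size) != 3)
        return -1;
    return 0;
}
```

Only the symbol+offset part is needed for the address lookup; the size is mainly a sanity check that the offset falls inside the function.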

I downloaded the debug symbols:

http://ddebs.ubuntu.com/ubuntu/pool/main/l/linux/linux-image-unsigned-4.15.0-72-generic-dbgsym_4.15.0-72.81_amd64.ddeb

Extracted them:

dpkg -x linux-image-unsigned-4.15.0-72-generic-dbgsym_4.15.0-72.81_amd64.ddeb debug

and looked up:

$ eu-addr2line -e ./vmlinux-4.15.0-72-generic -f follow_page_pte+0x663
try_get_page inlined at /build/linux-E6MDAa/linux-4.15.0/mm/gup.c:156 in follow_page_pte
/build/linux-E6MDAa/linux-4.15.0/mm/gup.c:138

We see that you hit try_get_page() at mm/gup.c:156:

 155     if (flags & FOLL_GET) {
 156         if (unlikely(!try_get_page(page))) {
 157             page = ERR_PTR(-ENOMEM);
 158             goto out;
 159         }
 
Looking at try_get_page() in include/linux/mm.h:

 854 static inline __must_check bool try_get_page(struct page *page)
 855 {
 856     page = compound_head(page);
 857     if (WARN_ON_ONCE(page_ref_count(page) <= 0))
 858         return false;
 859     page_ref_inc(page);
 860     return true;
 861 }
 
We see that you hit the exact same WARN_ON_ONCE, the one that checks
page_ref_count(page) <= 0.

So whatever page you are trying to access has a reference counter that
is zero or negative, which suggests that it has either wrapped around or
been decremented too many times.

Looking at your error log, I can't tell for sure whether it is the
zero_page, but it's quite likely. The zero_page is a frequently used
page in the system, and it is used outside of ksm; it's just that ksm is
a heavy user of the zero_page. If you are constantly allocating large
amounts of new memory, you will be using the zero_page in much the same
way as ksm, and the reference counter will eventually overflow.
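To make the two failure modes concrete, here is a toy userspace model in C. It is only a sketch: struct model_page and the model_* functions are invented for illustration, and the real kernel uses an atomic counter inside struct page. It stores the count as 32 unsigned bits so the wrap-around is well defined, and reads it back as a signed value the way page_ref_count() returns an int.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page; only the reference count is modelled.
 * The bits are stored unsigned so that wrap-around is well defined in C. */
struct model_page {
    uint32_t refcount_bits;
};

/* Read the counter as a signed 32-bit value, like page_ref_count() returning int. */
int32_t model_ref_count(const struct model_page *page)
{
    uint32_t bits = page->refcount_bits;

    if (bits <= (uint32_t)INT32_MAX)
        return (int32_t)bits;
    /* Two's complement interpretation without relying on an out-of-range cast. */
    return -(int32_t)(UINT32_MAX - bits) - 1;
}

/* Drop one reference with no sanity check (models an unbalanced put). */
void model_put_page(struct model_page *page)
{
    page->refcount_bits--;
}

/* Same shape as the try_get_page() quoted above, minus compound_head()
 * and the warning: refuse to take a reference if the count is <= 0. */
bool model_try_get_page(struct model_page *page)
{
    if (model_ref_count(page) <= 0)
        return false;
    page->refcount_bits++;
    return true;
}
```

Starting from a count of INT32_MAX, one more successful get wraps the signed value to INT32_MIN and every later model_try_get_page() is refused. Starting from a count of 1, two puts leave the count at -1 with the same result. Both cases land on the same "<= 0" check, which is why the warning alone cannot say which one happened.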

I think there is a good chance that the fix I submitted in
4.15.0-118-generic will solve your problem. Please run "apt update"
and "apt upgrade" to move to a newer kernel, the newer the better;
it will most likely fix the problem.
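If you want to check a fleet of machines, a rough C sketch of the comparison follows. The threshold 118 is the version mentioned above; the function names are made up, and the check only understands release strings of the form "4.15.0-N-..." (anything else is reported as unknown rather than guessed).

```c
#include <stdio.h>

/* ABI number from this message: the zero_page fix is in 4.15.0-118. */
enum { ABI_ZERO_PAGE_FIX = 118 };

/* Returns N for a "4.15.0-N..." release string, or -1 if it does not match. */
int ubuntu_4_15_abi(const char *release)
{
    int abi;

    if (sscanf(release, "4.15.0-%d", &abi) != 1)
        return -1;
    return abi;
}

/* Returns 1 if the release has the fix, 0 if not, -1 if it cannot tell. */
int has_zero_page_fix(const char *release)
{
    int abi = ubuntu_4_15_abi(release);

    if (abi < 0)
        return -1;
    return abi >= ABI_ZERO_PAGE_FIX ? 1 : 0;
}
```

The release string is what "uname -r" prints, for example "4.15.0-72-generic", which this sketch reports as not yet having the fix.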

Let me know if you have any more questions.

Thanks,
Matthew

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1837810

Title:
  KVM: Fix zero_page reference counter overflow when using KSM on KVM
  compute host
