Hi Jiatong,

Thanks for emailing me. I'm happy to answer questions anytime.

> 1. why linux-hwe-4.15.0 source code is used?

If you look closely at the oops in the description, the customer I was
working with was running:

    4.15.0-106-generic #107~16.04.1-Ubuntu

This is the Xenial (16.04) HWE kernel. I was using the linux-hwe-4.15.0
source code to make sure the source matched the debug symbol package
exactly.

In your case:

    4.15.0-72-generic #81-Ubuntu

you are running the 4.15 kernel on normal Bionic (18.04), so we can use
the normal linux-4.15.0 source code.

> 2. we are using linux-4.15.0-unsigned and by skimming through the
> source code, looks like try_get_page is not defined at that time?

Yes! You are correct, the original mainline 4.15 kernel did not have
try_get_page() defined at:

https://elixir.bootlin.com/linux/v4.15/source/mm/gup.c#L156

But if you look closely at the actual kernel sources for
4.15.0-72-generic:

https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/mm/gup.c?h=Ubuntu-4.15.0-72.81#n156

We see that try_get_page() is there. That is because we backported:

    commit 8fde12ca79aff9b5ba951fce1a2641901b8d8e64
    Author: Linus Torvalds <[email protected]>
    Date: Thu Apr 11 10:49:19 2019 -0700
    Subject: mm: prevent get_user_pages() from overflowing page refcount
    Link: https://github.com/torvalds/linux/commit/8fde12ca79aff9b5ba951fce1a2641901b8d8e64

Ubuntu 4.15 backport link: https://paste.ubuntu.com/p/2bF5WWQy2r/

That commit first turned up in 4.15.0-59-generic, via upstream-stable.
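
As a rough illustration of the version reasoning above, the check "is this kernel new enough to carry the try_get_page() backport?" can be sketched in C. This is only a sketch: the helper names ubuntu_415_abi() and has_try_get_page_backport() are made up for this example, and the threshold of 59 is simply the ABI number named in this mail. It assumes a release string in the form printed by "uname -r" on an Ubuntu 4.15.0 kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: return the ABI number from an Ubuntu 4.15.0 release
 * string such as "4.15.0-72-generic", or -1 if the string does not match. */
static int ubuntu_415_abi(const char *release)
{
    int major, minor, patch, abi;

    if (sscanf(release, "%d.%d.%d-%d", &major, &minor, &patch, &abi) != 4)
        return -1;
    if (major != 4 || minor != 15 || patch != 0)
        return -1;
    return abi;
}

/* 59 is the first ABI this mail names as carrying the try_get_page()
 * backport. A false result also covers "not a 4.15.0 release string". */
static bool has_try_get_page_backport(const char *release)
{
    return ubuntu_415_abi(release) >= 59;
}

/* Small self-check using the version from the stack trace in this thread. */
static int demo_abi_check(void)
{
    assert(ubuntu_415_abi("4.15.0-72-generic") == 72);
    assert(has_try_get_page_backport("4.15.0-72-generic"));
    assert(!has_try_get_page_backport("4.15.0-58-generic"));
    return 0;
}
```

Calling demo_abi_check() from a main() is enough to exercise it. The same comparison could be reused with a different threshold for any other fix whose first ABI number is known.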

Anyway, let's have a look at your stack trace:

    4.15.0-72-generic #81-Ubuntu
    RIP: 0010:follow_page_pte+0x663/0x6d0

I downloaded the debug symbols:

http://ddebs.ubuntu.com/ubuntu/pool/main/l/linux/linux-image-unsigned-4.15.0-72-generic-dbgsym_4.15.0-72.81_amd64.ddeb

Extracted them:

    dpkg -x linux-image-unsigned-4.15.0-72-generic-dbgsym_4.15.0-72.81_amd64.ddeb debug

and looked up:

    $ eu-addr2line -e ./vmlinux-4.15.0-72-generic -f follow_page_pte+0x663
    try_get_page inlined at /build/linux-E6MDAa/linux-4.15.0/mm/gup.c:156 in follow_page_pte
    /build/linux-E6MDAa/linux-4.15.0/mm/gup.c:138

We see that you hit try_get_page() in mm/gup.c:156

    155     if (flags & FOLL_GET) {
    156             if (unlikely(!try_get_page(page))) {
    157                     page = ERR_PTR(-ENOMEM);
    158                     goto out;
    159             }

Looking at try_get_page() in include/linux/mm.h:

    854 static inline __must_check bool try_get_page(struct page *page)
    855 {
    856         page = compound_head(page);
    857         if (WARN_ON_ONCE(page_ref_count(page) <= 0))
    858                 return false;
    859         page_ref_inc(page);
    860         return true;
    861 }

We see that you hit the exact same WARN_ON_ONCE for
page_ref_count(page) <= 0.

So, whatever page you are trying to access has a reference count that is
zero or negative, which suggests that it has either wrapped around or
been decremented too many times.

Looking at your error log, I can't tell for sure if it is the zero_page,
but it's quite likely to be.

The zero_page is a frequently used page in the system, and it is used
outside of ksm; it's just that ksm is a heavy user of the zero_page. If
you are constantly allocating large amounts of new memory, you will be
using the zero_page in much the same way ksm does, and the reference
counter will eventually overflow.

I think there is a good chance that the fix I submitted in
4.15.0-118-generic will solve your problems.

Please do an "apt update" and "apt upgrade" to move to a newer kernel,
the newer the better, and it will most likely fix the problem.

Let me know if you have any more questions.
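
To make the wrap-around concrete, here is a small userspace sketch in C. It is not kernel code: struct fake_page and the fake_* functions are hypothetical stand-ins, and the only thing taken from the kernel is the shape of the "<= 0" check in try_get_page() quoted above. It assumes the counter behaves like a signed 32-bit integer that wraps on overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page: only the reference counter. */
struct fake_page {
    int32_t refcount;
};

/* Unchecked increment that wraps like a 32-bit counter. The addition is done
 * in unsigned arithmetic because signed overflow is undefined in C; the
 * conversion back is implementation-defined before C23 and wraps on the
 * usual two's-complement compilers such as gcc and clang. */
static void fake_page_ref_inc(struct fake_page *p)
{
    uint32_t next = (uint32_t)p->refcount + 1u;

    p->refcount = (int32_t)next;
}

/* Same shape as the try_get_page() listing: refuse a page whose counter is
 * already zero or negative, otherwise take a reference. */
static bool fake_try_get_page(struct fake_page *p)
{
    if (p->refcount <= 0)
        return false;
    fake_page_ref_inc(p);
    return true;
}

/* Self-check: one unchecked increment past INT32_MAX goes negative, and the
 * guarded path then refuses the page without touching the counter. */
static int demo_refcount_wrap(void)
{
    struct fake_page zero_page = { .refcount = INT32_MAX };

    fake_page_ref_inc(&zero_page);
    assert(zero_page.refcount == INT32_MIN);
    assert(!fake_try_get_page(&zero_page));
    assert(zero_page.refcount == INT32_MIN);
    return 0;
}
```

The guarded function does not prevent the overflow by itself. It only detects that the counter has already gone bad, which matches what the WARN_ON_ONCE in the trace is reporting.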

Thanks,
Matthew

--
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1837810

Title:
  KVM: Fix zero_page reference counter overflow when using KSM on KVM
  compute host

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1837810/+subscriptions

--
ubuntu-bugs mailing list
[email protected]
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
