** Summary changed:

- qemu instance gets paused with error: kvm run failed Bad address
+ KVM: Fix zero_page reference counter overflow when using KSM on KVM compute 
host

** Description changed:

- We are occasionally running into the below qemu error on our Ubuntu16
- KVM hypervisors managed using Openstack, where the qemu guest instance
- gets paused unexpectedly.
- 
- pooja@kvm14:~$ sudo cat 
/var/log/libvirt/qemu/f8a20654-4c96-4446-95b3-24b8d28fab7f.log.1
+ BugLink: https://bugs.launchpad.net/bugs/1837810
+ 
+ [Impact]
+ 
+ We are seeing a problem on OpenStack compute nodes, and KVM hosts, where
+ a kernel oops is generated, and all running KVM machines are placed into
+ the pause state.
+ 
+ This is caused by the kernel's reserved zero_page reference counter
+ overflowing from a positive number to a negative number, and hitting a
+ (WARN_ON_ONCE(page_ref_count(page) <= 0)) condition in try_get_page().
+ 
+ This only happens if the machine has Kernel Samepage Mapping (KSM)
+ enabled, with "use_zero_pages" turned on. Each time a new VM starts and
+ the kernel does a KSM merge run during a EPT violation, the reference
+ counter for the zero_page is incremented in try_async_pf() and never
+ decremented. Eventually, the reference counter will overflow, causing
+ the KVM subsystem to fail.
+ 
+ Syslog:
+ error : qemuMonitorJSONCheckError:392 : internal error: unable to execute 
QEMU command 'cont': Resetting the Virtual Machine is required
+ 
+ QEMU Logs:
  error: kvm run failed Bad address
- RAX=0000000000000000 RBX=ffff8d12ffc01b00 RCX=0000000000000000 
RDX=ffff8d14d111c040
- RSI=000000000000000e RDI=ffff8d15bffda000 RBP=ffff8d156afbb898 
RSP=ffff8d156afbb870
- R8 =000000000000001e R9 =0000000000000000 R10=000000000000001d 
R11=ffffffffffffffd0
- R12=ffff8d14d111c000 R13=fffff7de0d444700 R14=ffff8d14d111c000 
R15=0000000000000001
- RIP=ffffffff9ba1911b RFL=00010082 [--S----] CPL=0 II=0 A20=1 SMM=0 HLT=0
- ES =0000 0000000000000000 000fffff 00000000
- CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
- SS =0000 0000000000000000 ffffffff 00c00000
- DS =0000 0000000000000000 000fffff 00000000
- FS =0000 00007f8c7e8ac700 000fffff 00000000
- GS =0000 ffff8d15bfd00000 000fffff 00000000
- LDT=0000 0000000000000000 000fffff 00000000
- TR =0040 ffff8d15bfd04000 00002087 00008b00 DPL=0 TSS64-busy
- GDT=     ffff8d15bfd0c000 0000007f
- IDT=     ffffffffff528000 00000fff
- CR0=80050033 CR2=000000001699e002 CR3=00000000b0d54000 CR4=003606e0
- DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 
DR3=0000000000000000
- DR6=00000000fffe0ff0 DR7=0000000000000400
- EFER=0000000000000d01
- Code=44 39 f8 48 63 43 20 0f 8e e1 00 00 00 48 63 53 18 4c 01 e2 <49> 89 14 
04 41 0f b7 55 1a 48 63 43 18 41 83 c7 01 66 81 e2 ff 7f 49 01 c4 0f b7 c2 44 39
- 2019-07-16T23:59:30.354240Z qemu-system-x86_64: terminating on signal 15 from 
pid 7487 (/usr/sbin/libvirtd)
- 2019-07-16 23:59:31.549+0000: shutting down, reason=destroyed
- 
- 
- We also saw some swap related errors in /var/log/syslog previously:
- 
- Jul  4 08:40:18 kvm14 kernel: \[8084318.769268] audit: type=1400
- audit(1562229618.904:4385): apparmor="STATUS"
- operation="profile_replace" profile="unconfined" name="libvirt-a89c8a67
- -24ae-45cb-a885-8880b49f86fd" pid=53770 comm="apparmor_parser"
- 
- Jul  4 08:40:18 kvm14 libvirtd\[20274]: 2019-07-04 08:40:18.911+0000:
- 20277: warning : AppArmorSetFDLabel:1164 : could not find path for
- descriptor /proc/self/fd/29, skippingJul  4 08:40:19 kvm13 kernel:
- \[8084318.930088] print_req_error: critical medium error, dev nvme0n1,
- sector 392710192
- 
- Jul  4 08:40:19 kvm14 kernel: \[8084318.937754] Read-error on swap-
- device (259:0:392710200)
- 
- Jul  4 08:40:19 kvm14 kernel: \[8084318.943157] Read-error on swap-
- device (259:0:392710208)
- 
- Wondering if these swap read errors happening intermittently, cause qemu
- to pause the guest instance due to memory overcommitment issues.
- 
- pooja@kvm14:~$ lsb_release -a
- No LSB modules are available.
- Distributor ID: Ubuntu
- Description:    Ubuntu 16.04.6 LTS
- Release:        16.04
- Codename:       xenial
- 
- pooja@kvm14:~$ /usr/bin/qemu-system-x86_64 --version
- QEMU emulator version 2.11.1(Debian 1:2.11+dfsg-1ubuntu7.10~cloud0)
- Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers
- 
- pooja@kvm14:~$ dpkg -l | grep libvirt
- ii  libvirt-bin                           4.0.0-1ubuntu8.8~cloud0             
          amd64        programs for the libvirt library
- 
- 
- Sample qemu process args for an instance created using Openstack Nova (Newton 
release):
- 
- pooja@kvm14:~$ sudo ps -ef | grep qemu | tail -1
- libvirt+ 52655     1 83 Jun27 ?        22-23:37:19 qemu-system-x86_64 
-enable-kvm -name guest=157b8c3d-87c6-4a62-a05b-8d1aae47e890,debug-threads=on 
-S -object 
secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-567-157b8c3d-87c6-4a62-a/master-key.aes
 -machine pc-i440fx-bionic,accel=kvm,usb=off,dump-guest-core=off -cpu 
Skylake-Server-IBRS -m 8192 -realtime mlock=off -smp 
2,sockets=2,cores=1,threads=1 -uuid 157b8c3d-87c6-4a62-a05b-8d1aae47e890 
-smbios type=1,manufacturer=OpenStack Foundation,product=OpenStack 
Nova,version=2013.2.1,serial=1a2fd1a2-3df9-4ed9-ad7f-b4c2479a01ca,uuid=157b8c3d-87c6-4a62-a05b-8d1aae47e890,family=Virtual
 Machine -no-user-config -nodefaults -chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-567-157b8c3d-87c6-4a62-a/monitor.sock,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew 
-global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -boot strict=on 
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive 
file=/dev/disk/by-path/ip-192.168.5.47:3260-iscsi-iqn.2013-05.com.daterainc:tc:01:sn:722e8e82b38019b5-lun-0,format=raw,if=none,id=drive-virtio-disk0,serial=1acefaf2-4096-4e85-9a39-aabf3841d022,cache=none,aio=native
 -device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 -netdev tap,fd=53,id=hostnet0,vhost=on,vhostfd=56 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:cd:4e:e1,bus=pci.0,addr=0x3 
-add-fd set=2,fd=58 -chardev file,id=charserial0,path=/dev/fdset/2,append=on 
-device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 
-device isa-serial,chardev=charserial1,id=serial1 -device 
usb-tablet,id=input0,bus=usb.0,port=1 -vnc 127.0.0.1:23 -k en-us -device 
cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device 
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 -msg timestamp=on
+ EAX=000afe00 EBX=0000000b ECX=00000080 EDX=00000cfe
+ ESI=0003fe00 EDI=000afe00 EBP=00000007 ESP=00006d74
+ EIP=000ee344 EFL=00010002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
+ ES =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
+ CS =0008 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
+ SS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
+ DS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
+ FS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
+ GS =0010 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
+ LDT=0000 00000000 0000ffff 00008200 DPL=0 LDT
+ TR =0000 00000000 0000ffff 00008b00 DPL=0 TSS32-busy
+ GDT=     000f7040 00000037
+ IDT=     000f707e 00000000
+ CR0=00000011 CR2=00000000 CR3=00000000 CR4=00000000
+ DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 
DR3=0000000000000000 
+ DR6=00000000ffff0ff0 DR7=0000000000000400
+ EFER=0000000000000000
+ Code=c3 57 56 b8 00 fe 0a 00 be 00 fe 03 00 b9 80 00 00 00 89 c7 <f3> a5 a1 
00 80 03 00 8b 15 04 80 03 00 a3 00 80 0a 00 89 15 04 80 0a 00 b8 ae e2 00 00 31
+ 
+ Kernel Oops:
+ 
+ [  167.695986] WARNING: CPU: 1 PID: 3016 at 
/build/linux-hwe-FEhT7y/linux-hwe-4.15.0/include/linux/mm.h:852 
follow_page_pte+0x6f4/0x710
+ [  167.696023] CPU: 1 PID: 3016 Comm: CPU 0/KVM Tainted: G           OE    
4.15.0-106-generic #107~16.04.1-Ubuntu
+ [  167.696023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.13.0-1ubuntu1 04/01/2014
+ [  167.696025] RIP: 0010:follow_page_pte+0x6f4/0x710
+ [  167.696026] RSP: 0018:ffffa81802023908 EFLAGS: 00010286
+ [  167.696027] RAX: ffffed8786e33a80 RBX: ffffed878c6d21b0 RCX: 
0000000080000000
+ [  167.696027] RDX: 0000000000000000 RSI: 00003ffffffff000 RDI: 
80000001b8cea225
+ [  167.696028] RBP: ffffa81802023970 R08: 80000001b8cea225 R09: 
ffff90c4d55fa340
+ [  167.696028] R10: 0000000000000000 R11: 0000000000000000 R12: 
ffffed8786e33a80
+ [  167.696029] R13: 0000000000000326 R14: ffff90c4db94fc50 R15: 
ffff90c4d55fa340
+ [  167.696030] FS:  00007f6a7798c700(0000) GS:ffff90c4edc80000(0000) 
knlGS:0000000000000000
+ [  167.696030] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
+ [  167.696031] CR2: 0000000000000000 CR3: 0000000315580002 CR4: 
0000000000162ee0
+ [  167.696033] Call Trace:
+ [  167.696047]  follow_pmd_mask+0x273/0x630
+ [  167.696049]  follow_page_mask+0x178/0x230
+ [  167.696051]  __get_user_pages+0xb8/0x740
+ [  167.696052]  get_user_pages+0x42/0x50
+ [  167.696068]  __gfn_to_pfn_memslot+0x18b/0x3b0 [kvm]
+ [  167.696079]  ? mmu_set_spte+0x1dd/0x3a0 [kvm]
+ [  167.696090]  try_async_pf+0x66/0x220 [kvm]
+ [  167.696101]  tdp_page_fault+0x14b/0x2b0 [kvm]
+ [  167.696104]  ? vmexit_fill_RSB+0x10/0x40 [kvm_intel]
+ [  167.696114]  kvm_mmu_page_fault+0x62/0x180 [kvm]
+ [  167.696117]  handle_ept_violation+0xbc/0x160 [kvm_intel]
+ [  167.696119]  vmx_handle_exit+0xa5/0x580 [kvm_intel]
+ [  167.696129]  vcpu_enter_guest+0x414/0x1260 [kvm]
+ [  167.696138]  ? kvm_arch_vcpu_load+0x4d/0x280 [kvm]
+ [  167.696148]  kvm_arch_vcpu_ioctl_run+0xd9/0x3d0 [kvm]
+ [  167.696157]  ? kvm_arch_vcpu_ioctl_run+0xd9/0x3d0 [kvm]
+ [  167.696165]  kvm_vcpu_ioctl+0x33a/0x610 [kvm]
+ [  167.696166]  ? do_futex+0x129/0x590
+ [  167.696171]  ? __switch_to+0x34c/0x4e0
+ [  167.696174]  ? __switch_to_asm+0x35/0x70
+ [  167.696176]  do_vfs_ioctl+0xa4/0x600
+ [  167.696177]  SyS_ioctl+0x79/0x90
+ [  167.696180]  ? exit_to_usermode_loop+0xa5/0xd0
+ [  167.696181]  do_syscall_64+0x73/0x130
+ [  167.696182]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
+ [  167.696184] RIP: 0033:0x7f6a80482007
+ [  167.696184] RSP: 002b:00007f6a7798b8b8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
+ [  167.696185] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 
00007f6a80482007
+ [  167.696185] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 
0000000000000016
+ [  167.696186] RBP: 000055fe135f3240 R08: 000055fe118be530 R09: 
0000000000000001
+ [  167.696186] R10: 0000000000000000 R11: 0000000000000246 R12: 
0000000000000000
+ [  167.696187] R13: 00007f6a85852000 R14: 0000000000000000 R15: 
000055fe135f3240
+ [  167.696188] Code: 4d 63 e6 e9 f2 fc ff ff 4c 89 45 d0 48 8b 47 10 e8 22 f0 
9e 00 4c 8b 45 d0 e9 89 fc ff ff 4c 89 e7 e8 81 3f fd ff e9 aa fc ff ff <0f> 0b 
49 c7 c4 f4 ff ff ff e9 c1 fc ff ff 0f 1f 40 00 66 2e 0f 
+ [  167.696200] ---[ end trace 7573f6868ea8f069 ]---
+ 
+ [Fix]
+ 
+ This was fixed in 5.6-rc1 with the following commit:
+ 
+ commit 7df003c85218b5f5b10a7f6418208f31e813f38f
+ Author: Zhuang Yanying <ann.zhuangyany...@huawei.com>
+ Date:   Sat Oct 12 11:37:31 2019 +0800
+ Subject:  KVM: fix overflow of zero page refcount with ksm running
+ Link: 
https://github.com/torvalds/linux/commit/7df003c85218b5f5b10a7f6418208f31e813f38f
 
+ 
+ The fix adds a check to see if the Page Frame Number (pfn) is linked to
+ the zero page, and if it is, treats it as reserved. This has the effect
+ that put_page() is no longer called on the zero_page, and reference
+ counting is no longer needed.
+ 
+ This is a clean cherry pick to Bionic and Focal kernels.
+ 
+ [Testcase]
+ 
+ Create a new KVM host, and make sure it has plenty of ram. 16gb should
+ be okay.
+ 
+ Install KVM packages:
+ 
+ $ sudo apt install -y qemu-kvm libvirt-bin qemu-utils genisoimage
+ virtinst
+ 
+ Enable Kernel Samepage Mapping, and use_zero_pages:
+ 
+ $ echo 10000 | sudo tee /sys/kernel/mm/ksm/pages_to_scan
+ $ echo 1 | sudo tee /sys/kernel/mm/ksm/run
+ $ echo 1 | sudo tee /sys/kernel/mm/ksm/use_zero_pages
+ 
+ I wrote a script which creates and destroys xenial KVM VMs in a infinite loop:
+ https://paste.ubuntu.com/p/CvRTsDkdC7/
+ 
+ Save the script to disk, and execute it:
+ 
+ $ chmod +x ksm_refcnt_overflow.sh
+ $ ./ksm_refcnt_overflow.sh
+ 
+ Each time a VM is created and destroyed the reference counter will
+ increase.
+ 
+ I wrote a kernel module which exposes a /proc interface, which we can
+ use to look at the value of the zero_page reference counter. It works by
+ taking the memory allocated for the zero page: empty_zero_page, which is
+ defined in arch/x86/include/asm/pgtable.h, running virt_to_page() to get
+ the page struct, which we can then dereference to get _refcount;
+ 
+ https://paste.ubuntu.com/p/MJMN8jMVds/
+ 
+ Save the module to disk, create its Makefile from the included
+ documentation, and build it:
+ 
+ $ make
+ $ sudo insmod zero_page_refcount.ko 
+ 
+ From there, we can examine the reference counter with:
+ 
+ $ cat /proc/zero_page_refcount
+ Zero Page Refcount: 0x687 or 1671
+ $ cat /proc/zero_page_refcount
+ Zero Page Refcount: 0x846 or 2118
+ $ cat /proc/zero_page_refcount
+ Zero Page Refcount: 0x9f8 or 2552
+ $ cat /proc/zero_page_refcount
+ Zero Page Refcount: 0xcb2 or 3250 
+ 
+ We see it steadily increase. Instead of waiting months for it to
+ overflow, I implemented a /proc entry to set it to near overflow. You
+ can use it with:
+ 
+ $ cat /proc/zero_page_refcount_set
+ Zero Page Refcount set to 0x1FFFFFFFFF000 
+ 
+ After that, wait a few seconds and the reference counter will overflow:
+ 
+ $ cat /proc/zero_page_refcount
+ Zero Page Refcount: 0x7fffff16 or 2147483414
+ $ cat /proc/zero_page_refcount
+ Zero Page Refcount: 0x80000000 or -2147483648 
+ 
+ All VMs will become paused:
+ 
+ $ virsh list
+ Id Name State
+ ----------------------------------------------------
+ 1 instance-0 paused
+ 2 instance-1 paused 
+ 
+ QEMU will error out, and the kernel will oops with the messages in the
+ impact section.
+ 
+ I built a test kernel, which is available here:
+ 
+ https://launchpad.net/~mruffell/+archive/ubuntu/sf290373-test
+ 
+ If you install the test kernel and try reproduce, you will notice the
+ reference counter is never incremented past 1:
+ 
+ $ cat /proc/zero_page_refcount
+ Zero Page Refcount: 0x1 or 1
+ $ cat /proc/zero_page_refcount
+ Zero Page Refcount: 0x1 or 1
+ $ cat /proc/zero_page_refcount
+ Zero Page Refcount: 0x1 or 1 
+ 
+ This resolves the problem.
+ 
+ [Regression Potential]
+ 
+ While the change itself seems simple, it changes how the kernel treats
+ the zero_page. The zero_page is important, since it is just a page full
+ of 0's. Each time memory is allocated which is all 0s, the kernel sets
+ it to use the zero_page to save memory. When an application writes to
+ the buffer, a EPT violation happens, and the kernel does a COW to new
+ pages to hold the data.
+ 
+ The change is limited to how the KVM subsystem handles the zero_page.
+ This will not break the entire kernel if a regression occurs, only KVM.
+ 
+ If a regression were to occur, users could turn off KSM and disable KSM
+ use_zero_pages until a fix is ready, as this particular use of
+ zero_pages is limited to KSM.
+ 
+ The fix landed in upstream 5.6, and has not been backported to stable
+ kernels.
+ 
+ I have read a bit of the paging code, especially around where the
+ zero_page is used, and where its reference counters were being
+ incorrectly incremented.
+ 
+ I think the fix is correct, and I believe it won't cause any
+ regressions.

** Tags removed: xenial
** Tags added: bionic focal sts

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1837810

Title:
  KVM: Fix zero_page reference counter overflow when using KSM on KVM
  compute host

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1837810/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs

Reply via email to