------- Comment From [email protected] 2017-10-17 18:34 EDT -------

Hi folks.
Good news! We got a test window on the Ubuntu KVM host today. We provisioned 24 new virtual Ubuntu guests for this test. Each virtual domain uses a single qcow2 virtual boot volume. All guests are configured identically, except that guests zs93kag100080, zs93kag100081, and zs93kag100082 are on a macvtap interface.

Here is a sample of one (running) guest's XML:

ubuntu@zm93k8:/home/scottg$ virsh dumpxml zs93kag100080
<domain type='kvm' id='65'>
  <name>zs93kag100080</name>
  <uuid>6bd4ebad-414b-4e1e-9995-7d061331ec01</uuid>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='s390x' machine='s390-ccw-virtio-xenial'>hvm</type>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>preserve</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none' io='native'/>
      <source file='/guestimages/data1/zs93kag100080.qcow2'/>
      <backingStore type='file' index='1'>
        <format type='raw'/>
        <source file='/rawimages/ubu1604qcow2/ubuntu.1604-1.20161206.v1.raw.backing'/>
        <backingStore/>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <boot order='1'/>
      <alias name='virtio-disk0'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0000'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='/guestimages/data1/zs93kag100080.prm'/>
      <backingStore/>
      <target dev='vdc' bus='virtio'/>
      <alias name='virtio-disk2'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0006'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <backingStore/>
      <target dev='sda' bus='scsi'/>
      <readonly/>
      <alias name='scsi0-0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0' model='none'>
      <alias name='usb'/>
    </controller>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <alias name='scsi0'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0002'/>
    </controller>
    <interface type='bridge'>
      <mac address='02:00:00:00:40:80'/>
      <source bridge='ovsbridge1'/>
      <vlan>
        <tag id='1297'/>
      </vlan>
      <virtualport type='openvswitch'>
        <parameters interfaceid='cd58c548-0b1f-47e7-9ed5-ad4a1bc8b8e0'/>
      </virtualport>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0001'/>
    </interface>
    <console type='pty' tty='/dev/pts/3'>
      <source path='/dev/pts/3'/>
      <target type='sclp' port='0'/>
      <alias name='console0'/>
    </console>
    <memballoon model='none'>
      <alias name='balloon0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='apparmor' relabel='yes'>
    <label>libvirt-6bd4ebad-414b-4e1e-9995-7d061331ec01</label>
    <imagelabel>libvirt-6bd4ebad-414b-4e1e-9995-7d061331ec01</imagelabel>
  </seclabel>
</domain>

To set up the test, we shut down all virtual domains and then ran a script that starts the guests one at a time, capturing fs.aio-nr before and after each 'virsh start'. After attempting to start every guest in the list, the script loops, checking fs.aio-nr once a minute for 10 minutes to see whether the value changes (it does not).
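For reference, here is a minimal sketch of that start-and-monitor loop. This is a reconstruction for readability, not the actual start_macvtaps_debug.sh, and the GUESTS list below is a placeholder:

#!/bin/bash
# Sketch of the test: start guests one at a time, record fs.aio-nr after each
# 'virsh start', then watch fs.aio-nr for 10 minutes. GUESTS is a placeholder.
GUESTS="zs93kag100080 zs93kag100081 zs93kag100082 zs93kag70024"   # ...and so on

echo "Test started at $(date)"
echo "cat /proc/sys/fs/aio-max-nr"
cat /proc/sys/fs/aio-max-nr
sysctl fs.aio-nr

count=0
for guest in $GUESTS; do
    count=$((count + 1))
    echo "Starting $guest ; Count = $count"
    if virsh start "$guest"; then
        echo "$guest started successfully ..."
    else
        echo "Error starting guest $guest ."
    fi
    sysctl fs.aio-nr           # capture the AIO usage after each start attempt
done

echo "Monitor fs.aio-nr for 10 minutes, capture value every 60 seconds..."
for i in $(seq 1 10); do
    echo "Sleeping 60 seconds."
    sleep 60
    echo "Loop count = $i"
    sysctl fs.aio-nr
done

The actual run produced the output below.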
ubuntu@zm93k8:/home/scottg$ ./start_macvtaps_debug.sh
Test started at Tue Oct 17 17:48:29 EDT 2017
cat /proc/sys/fs/aio-max-nr
65535
fs.aio-nr = 0
Starting zs93kag100080 ; Count = 1
zs93kag100080 started succesfully ...
fs.aio-nr = 6144
Starting zs93kag100081 ; Count = 2
zs93kag100081 started succesfully ...
fs.aio-nr = 12288
Starting zs93kag100082 ; Count = 3
zs93kag100082 started succesfully ...
fs.aio-nr = 18432
Starting zs93kag100083 ; Count = 4
zs93kag100083 started succesfully ...
fs.aio-nr = 24576
Starting zs93kag100084 ; Count = 5
zs93kag100084 started succesfully ...
fs.aio-nr = 30720
Starting zs93kag100085 ; Count = 6
zs93kag100085 started succesfully ...
fs.aio-nr = 36864
Starting zs93kag70024 ; Count = 7
zs93kag70024 started succesfully ...
fs.aio-nr = 43008
Starting zs93kag70025 ; Count = 8
zs93kag70025 started succesfully ...
fs.aio-nr = 49152
Starting zs93kag70026 ; Count = 9
zs93kag70026 started succesfully ...
fs.aio-nr = 55296
Starting zs93kag70027 ; Count = 10
zs93kag70027 started succesfully ...
fs.aio-nr = 61440
Starting zs93kag70038 ; Count = 11
zs93kag70038 started succesfully ...
fs.aio-nr = 67584
Starting zs93kag70039 ; Count = 12
zs93kag70039 started succesfully ...
fs.aio-nr = 73728
Starting zs93kag70040 ; Count = 13
zs93kag70040 started succesfully ...
fs.aio-nr = 79872
Starting zs93kag70043 ; Count = 14
zs93kag70043 started succesfully ...
fs.aio-nr = 86016
Starting zs93kag70045 ; Count = 15
zs93kag70045 started succesfully ...
fs.aio-nr = 92160
Starting zs93kag70046 ; Count = 16
zs93kag70046 started succesfully ...
fs.aio-nr = 98304
Starting zs93kag70047 ; Count = 17
zs93kag70047 started succesfully ...
fs.aio-nr = 104448
Starting zs93kag70048 ; Count = 18
zs93kag70048 started succesfully ...
fs.aio-nr = 110592
Starting zs93kag70049 ; Count = 19
zs93kag70049 started succesfully ...
fs.aio-nr = 116736
Starting zs93kag70050 ; Count = 20
zs93kag70050 started succesfully ...
fs.aio-nr = 122880
Starting zs93kag70051 ; Count = 21
zs93kag70051 started succesfully ...
fs.aio-nr = 129024
Starting zs93kag70052 ; Count = 22
Error starting guest zs93kag70052 .
error: Failed to start domain zs93kag70052
error: internal error: process exited while connecting to monitor: 2017-10-17T21:49:06.684444Z qemu-kvm: -drive file=/guestimages/data1/zs93kag70052.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native: Could not refresh total sector count: Bad file descriptor
fs.aio-nr = 129024
Starting zs93kag70053 ; Count = 23
Error starting guest zs93kag70053 .
error: Failed to start domain zs93kag70053
error: internal error: process exited while connecting to monitor: 2017-10-17T21:49:07.933457Z qemu-kvm: -drive file=/guestimages/data1/zs93kag70053.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native: Could not refresh total sector count: Bad file descriptor
fs.aio-nr = 129024
Starting zs93kag70054 ; Count = 24
Error starting guest zs93kag70054 .
error: Failed to start domain zs93kag70054
error: internal error: process exited while connecting to monitor: 2017-10-17T21:49:09.084863Z qemu-kvm: -drive file=/guestimages/data1/zs93kag70054.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native: Could not refresh total sector count: Bad file descriptor
fs.aio-nr = 129024
Monitor fs.aio-nr for 10 minutes, capture value every 60 seconds...
Sleeping 60 seconds.
Loop count = 1
fs.aio-nr = 129024
Sleeping 60 seconds.
Loop count = 2
fs.aio-nr = 129024
Sleeping 60 seconds.
Loop count = 3
fs.aio-nr = 129024
Sleeping 60 seconds.
Loop count = 4
fs.aio-nr = 129024
Sleeping 60 seconds.
Loop count = 5
fs.aio-nr = 129024
Sleeping 60 seconds.
Loop count = 6
fs.aio-nr = 129024
Sleeping 60 seconds.
Loop count = 7
fs.aio-nr = 129024
Sleeping 60 seconds.
Loop count = 8
fs.aio-nr = 129024
Sleeping 60 seconds.
Loop count = 9
fs.aio-nr = 129024
Sleeping 60 seconds.
Loop count = 10
fs.aio-nr = 129024
Test completed successfully.

## I couldn't understand why the startup error messages were different this time;
## however, it appears to be the same underlying cause. That is, if I stop one domain,
## I am then able to successfully start a previously failed domain. For example:

ubuntu@zm93k8:/home/scottg$ virsh start zs93kag70052
Domain zs93kag70052 started

ubuntu@zm93k8:/home/scottg$ virsh list |grep zs93kag70052
 89    zs93kag70052                   running

ubuntu@zm93k8:/home/scottg$

## And now, if I try to start zs93kag70051 (which started fine the first time),
## it fails, with yet a different error:

ubuntu@zm93k8:/home/scottg$ virsh start zs93kag70051
error: Disconnected from qemu:///system due to I/O error
error: Failed to start domain zs93kag70051
error: End of file while reading data: Input/output error
error: One or more references were leaked after disconnect from the hypervisor

ubuntu@zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:16:18 EDT 2017
fs.aio-nr = 129024

## This time, I will kill one of the ovs-osa networked guests and see if that
## then allows me to start zs93kag70051 ... (it does)

ubuntu@zm93k8:/home/scottg$ date;virsh destroy zs93kag100080
Tue Oct 17 18:18:29 EDT 2017
Domain zs93kag100080 destroyed

ubuntu@zm93k8:/home/scottg$ sysctl fs.aio-nr
Tue Oct 17 18:19:18 EDT 2017
fs.aio-nr = 122880

ubuntu@zm93k8:/home/scottg$ date;virsh start zs93kag70051
Tue Oct 17 18:18:41 EDT 2017
Domain zs93kag70051 started

ubuntu@zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:18:52 EDT 2017
fs.aio-nr = 129024

## It appears that fs.aio-nr = 129024 is "The Brick Wall".
## Now, let's try increasing fs.aio-max-nr to 4194304 and see if that allows me
## to start more guests (it does).

ubuntu@zm93k8:/home/scottg$ sudo sysctl -p /etc/sysctl.conf
fs.aio-max-nr = 4194304

ubuntu@zm93k8:/home/scottg$ cat /proc/sys/fs/aio-max-nr
4194304

ubuntu@zm93k8:/home/scottg$ date;virsh start zs93kag70051
Tue Oct 17 18:27:54 EDT 2017
Domain zs93kag70051 started

ubuntu@zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:28:12 EDT 2017
fs.aio-nr = 129024

ubuntu@zm93k8:/home/scottg$ date;virsh start zs93kag70053
Tue Oct 17 18:29:38 EDT 2017
Domain zs93kag70053 started

ubuntu@zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:29:42 EDT 2017
fs.aio-nr = 135168

ubuntu@zm93k8:/home/scottg$ date;virsh start zs93kag70054
Tue Oct 17 18:29:55 EDT 2017
Domain zs93kag70054 started

ubuntu@zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:29:58 EDT 2017
fs.aio-nr = 141312

I saved dmesg output in case you need that.

ubuntu@zm93k8:/home/scottg$ dmesg > dmesg.out.Oct17_bug157241

I will also keep this test environment up for a couple of days in case you need additional data. Thank you.
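For anyone hitting the same wall: in this configuration each started guest consumed exactly 6144 AIO events (fs.aio-nr grew by 6144 per start, so 21 guests reached 129024), and the workaround amounts to one persistent sysctl setting. A minimal sketch of applying it, assuming the line is appended to /etc/sysctl.conf as in the run above (a file under /etc/sysctl.d/ would work as well):

# Raise the system-wide AIO event limit so more guests with aio=native disks
# can start; each guest here consumed 6144 events out of fs.aio-max-nr.
echo 'fs.aio-max-nr = 4194304' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p /etc/sysctl.conf       # apply immediately, no reboot needed
sysctl fs.aio-max-nr fs.aio-nr        # verify the new limit and current usage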
https://bugs.launchpad.net/bugs/1717224

Title:
  virsh start of virtual guest domain fails with internal error due to
  low default aio-max-nr sysctl value

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-z-systems/+bug/1717224/+subscriptions
