------- Comment From [email protected] 2017-10-17 18:34 EDT-------
Hi folks.

Good news!  We got a test window on the Ubuntu KVM host today.

We provisioned a collection of 24 new virtual Ubuntu guests for this
test.  Each virtual domain uses a single qcow2 virtual boot volume.  All
guests are configured identically, except that zs93kag100080,
zs93kag100081 and zs93kag100082 are on a macvtap interface.

Here's a sample of one (running) guest's XML:

ubuntu@zm93k8:/home/scottg$ virsh dumpxml zs93kag100080
<domain type='kvm' id='65'>
  <name>zs93kag100080</name>
  <uuid>6bd4ebad-414b-4e1e-9995-7d061331ec01</uuid>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <vcpu placement='static'>2</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='s390x' machine='s390-ccw-virtio-xenial'>hvm</type>
  </os>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>preserve</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none' io='native'/>
      <source file='/guestimages/data1/zs93kag100080.qcow2'/>
      <backingStore type='file' index='1'>
        <format type='raw'/>
        <source file='/rawimages/ubu1604qcow2/ubuntu.1604-1.20161206.v1.raw.backing'/>
        <backingStore/>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <boot order='1'/>
      <alias name='virtio-disk0'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0000'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='/guestimages/data1/zs93kag100080.prm'/>
      <backingStore/>
      <target dev='vdc' bus='virtio'/>
      <alias name='virtio-disk2'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0006'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <backingStore/>
      <target dev='sda' bus='scsi'/>
      <readonly/>
      <alias name='scsi0-0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0' model='none'>
      <alias name='usb'/>
    </controller>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <alias name='scsi0'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0002'/>
    </controller>
    <interface type='bridge'>
      <mac address='02:00:00:00:40:80'/>
      <source bridge='ovsbridge1'/>
      <vlan>
        <tag id='1297'/>
      </vlan>
      <virtualport type='openvswitch'>
        <parameters interfaceid='cd58c548-0b1f-47e7-9ed5-ad4a1bc8b8e0'/>
      </virtualport>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='ccw' cssid='0xfe' ssid='0x0' devno='0x0001'/>
    </interface>
    <console type='pty' tty='/dev/pts/3'>
      <source path='/dev/pts/3'/>
      <target type='sclp' port='0'/>
      <alias name='console0'/>
    </console>
    <memballoon model='none'>
      <alias name='balloon0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='apparmor' relabel='yes'>
    <label>libvirt-6bd4ebad-414b-4e1e-9995-7d061331ec01</label>
    <imagelabel>libvirt-6bd4ebad-414b-4e1e-9995-7d061331ec01</imagelabel>
  </seclabel>
</domain>
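
One detail in that XML worth highlighting: the two file-backed virtio
disks use driver io='native' with cache='none'.  With aio=native, QEMU
creates a Linux-native AIO context (via io_setup()) for each such disk,
and those contexts are what get charged against the fs.aio-nr /
fs.aio-max-nr sysctls.  Both counters can be read in one shot:

sysctl fs.aio-nr fs.aio-max-nr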

To set up the test, we shut down all the virtual domains and then ran a
script which simply starts the guests one at a time and captures
fs.aio-nr before and after each 'virsh start'.

After attempting to start all guests in the list, the script goes into a
loop, checking fs.aio-nr once every minute for 10 minutes to see whether
that value changes (it does not).
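
The script itself isn't attached, but its logic amounts to the following
sketch (guest list abbreviated; the real start_macvtaps_debug.sh may
differ in detail):

#!/bin/bash
# Sketch of the test logic (not the actual start_macvtaps_debug.sh).
# Start each guest in turn, capturing fs.aio-nr after every attempt,
# then watch fs.aio-nr for 10 minutes to confirm it stays flat.

GUESTS="zs93kag100080 zs93kag100081 zs93kag100082 zs93kag100083"  # ...all 24

echo "Test started at $(date)"
cat /proc/sys/fs/aio-max-nr
sysctl fs.aio-nr

count=0
for g in $GUESTS; do
    count=$((count + 1))
    echo "Starting $g ;  Count = $count"
    if virsh start "$g" > /dev/null; then
        echo "$g started successfully ..."
    else
        echo "Error starting guest $g ."
    fi
    sysctl fs.aio-nr
done

echo "Monitor fs.aio-nr for 10 minutes, capture value every 60 seconds..."
for i in $(seq 1 10); do
    echo "Sleeping 60 seconds.  Loop count = $i"
    sleep 60
    sysctl fs.aio-nr
done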

ubuntu@zm93k8:/home/scottg$ ./start_macvtaps_debug.sh

Test started at Tue Oct 17 17:48:29 EDT 2017

cat /proc/sys/fs/aio-max-nr
65535

fs.aio-nr = 0

Starting zs93kag100080 ;  Count = 1
zs93kag100080 started successfully ...

fs.aio-nr = 6144

Starting zs93kag100081 ;  Count = 2
zs93kag100081 started successfully ...

fs.aio-nr = 12288

Starting zs93kag100082 ;  Count = 3
zs93kag100082 started successfully ...

fs.aio-nr = 18432

Starting zs93kag100083 ;  Count = 4
zs93kag100083 started successfully ...

fs.aio-nr = 24576

Starting zs93kag100084 ;  Count = 5
zs93kag100084 started successfully ...

fs.aio-nr = 30720

Starting zs93kag100085 ;  Count = 6
zs93kag100085 started successfully ...

fs.aio-nr = 36864

Starting zs93kag70024 ;  Count = 7
zs93kag70024 started successfully ...

fs.aio-nr = 43008

Starting zs93kag70025 ;  Count = 8
zs93kag70025 started successfully ...

fs.aio-nr = 49152

Starting zs93kag70026 ;  Count = 9
zs93kag70026 started successfully ...

fs.aio-nr = 55296

Starting zs93kag70027 ;  Count = 10
zs93kag70027 started successfully ...

fs.aio-nr = 61440

Starting zs93kag70038 ;  Count = 11
zs93kag70038 started successfully ...

fs.aio-nr = 67584

Starting zs93kag70039 ;  Count = 12
zs93kag70039 started successfully ...

fs.aio-nr = 73728

Starting zs93kag70040 ;  Count = 13
zs93kag70040 started successfully ...

fs.aio-nr = 79872

Starting zs93kag70043 ;  Count = 14
zs93kag70043 started successfully ...

fs.aio-nr = 86016

Starting zs93kag70045 ;  Count = 15
zs93kag70045 started successfully ...

fs.aio-nr = 92160

Starting zs93kag70046 ;  Count = 16
zs93kag70046 started successfully ...

fs.aio-nr = 98304

Starting zs93kag70047 ;  Count = 17
zs93kag70047 started successfully ...

fs.aio-nr = 104448

Starting zs93kag70048 ;  Count = 18
zs93kag70048 started successfully ...

fs.aio-nr = 110592

Starting zs93kag70049 ;  Count = 19
zs93kag70049 started successfully ...

fs.aio-nr = 116736

Starting zs93kag70050 ;  Count = 20
zs93kag70050 started successfully ...

fs.aio-nr = 122880

Starting zs93kag70051 ;  Count = 21
zs93kag70051 started successfully ...

fs.aio-nr = 129024

Starting zs93kag70052 ;  Count = 22
Error starting guest zs93kag70052 .

error: Failed to start domain zs93kag70052
error: internal error: process exited while connecting to monitor: 
2017-10-17T21:49:06.684444Z qemu-kvm: -drive 
file=/guestimages/data1/zs93kag70052.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native:
 Could not refresh total sector count: Bad file descriptor

fs.aio-nr = 129024

Starting zs93kag70053 ;  Count = 23
Error starting guest zs93kag70053 .

error: Failed to start domain zs93kag70053
error: internal error: process exited while connecting to monitor: 
2017-10-17T21:49:07.933457Z qemu-kvm: -drive 
file=/guestimages/data1/zs93kag70053.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native:
 Could not refresh total sector count: Bad file descriptor
fs.aio-nr = 129024

Starting zs93kag70054 ;  Count = 24
Error starting guest zs93kag70054 .

error: Failed to start domain zs93kag70054
error: internal error: process exited while connecting to monitor: 
2017-10-17T21:49:09.084863Z qemu-kvm: -drive 
file=/guestimages/data1/zs93kag70054.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,cache=none,aio=native:
 Could not refresh total sector count: Bad file descriptor

fs.aio-nr = 129024

Monitor fs.aio-nr for 10 minutes, capture value every 60 seconds...
Sleeping 60 seconds.  Loop count = 1
fs.aio-nr = 129024

Sleeping 60 seconds.  Loop count = 2
fs.aio-nr = 129024

Sleeping 60 seconds.  Loop count = 3
fs.aio-nr = 129024

Sleeping 60 seconds.  Loop count = 4
fs.aio-nr = 129024

Sleeping 60 seconds.  Loop count = 5
fs.aio-nr = 129024

Sleeping 60 seconds.  Loop count = 6
fs.aio-nr = 129024

Sleeping 60 seconds.  Loop count = 7
fs.aio-nr = 129024

Sleeping 60 seconds.  Loop count = 8
fs.aio-nr = 129024

Sleeping 60 seconds.  Loop count = 9
fs.aio-nr = 129024

Sleeping 60 seconds.  Loop count = 10
fs.aio-nr = 129024

Test completed successfully.

## I couldn't understand why the error messages on startup were different
## this time; however, it seems to be the same underlying cause.  That is,
## if I stop one domain, I am then able to successfully start a failed
## domain.  For example:

ubuntu@zm93k8:/home/scottg$ virsh start zs93kag70052
Domain zs93kag70052 started

ubuntu@zm93k8:/home/scottg$ virsh list |grep zs93kag70052
89    zs93kag70052                   running
ubuntu@zm93k8:/home/scottg$

## And now, if I try to start zs93kag70051 (which started fine the first
## time), it fails with yet a different error:

ubuntu@zm93k8:/home/scottg$ virsh start zs93kag70051
error: Disconnected from qemu:///system due to I/O error
error: Failed to start domain zs93kag70051
error: End of file while reading data: Input/output error

error: One or more references were leaked after disconnect from the hypervisor
ubuntu@zm93k8:/home/scottg$

ubuntu@zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:16:18 EDT 2017
fs.aio-nr = 129024

## This time, I will kill one of the ovs-osa networked guests, and see
## if that then allows me to start zs93kag70051 ...  (it does)

ubuntu@zm93k8:/home/scottg$ date;virsh destroy zs93kag100080
Tue Oct 17 18:18:29 EDT 2017
Domain zs93kag100080 destroyed

ubuntu@zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:19:18 EDT 2017
fs.aio-nr = 122880

ubuntu@zm93k8:/home/scottg$ date;virsh start zs93kag70051
Tue Oct 17 18:18:41 EDT 2017
Domain zs93kag70051 started

ubuntu@zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:18:52 EDT 2017
fs.aio-nr = 129024

## It appears that fs.aio-nr = 129024  is "The Brick Wall".
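
## For what it's worth, the numbers are self-consistent: every successful
## start added exactly 6144 to fs.aio-nr, and 21 x 6144 = 129024 (a 22nd
## guest would need 135168).  Destroying one guest gives its 6144 back,
## which is why the one-for-one swap above works.  Note also that 129024
## is already well above the old aio-max-nr of 65535, so the limit
## apparently isn't enforced as a simple hard ceiling on fs.aio-nr.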

## Now, let's try increasing fs.aio-max-nr to 4194304 and see if that
## allows me to start more guests  (it does).
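
## (How: the line "fs.aio-max-nr = 4194304" had been added to
## /etc/sysctl.conf beforehand, and "sysctl -p" re-applies that file.
## The one-shot, non-persistent equivalent would be
## "sudo sysctl -w fs.aio-max-nr=4194304".)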

ubuntu@zm93k8:/home/scottg$ sudo sysctl -p /etc/sysctl.conf
fs.aio-max-nr = 4194304

ubuntu@zm93k8:/home/scottg$ cat /proc/sys/fs/aio-max-nr
4194304

ubuntu@zm93k8:/home/scottg$ date;virsh start zs93kag70051
Tue Oct 17 18:27:54 EDT 2017
Domain zs93kag70051 started

ubuntu@zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:28:12 EDT 2017
fs.aio-nr = 129024

ubuntu@zm93k8:/home/scottg$ date;virsh start zs93kag70053
Tue Oct 17 18:29:38 EDT 2017
Domain zs93kag70053 started

ubuntu@zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:29:42 EDT 2017
fs.aio-nr = 135168

ubuntu@zm93k8:/home/scottg$ date;virsh start zs93kag70054
Tue Oct 17 18:29:55 EDT 2017
Domain zs93kag70054 started

ubuntu@zm93k8:/home/scottg$ date;sysctl fs.aio-nr
Tue Oct 17 18:29:58 EDT 2017
fs.aio-nr = 141312

I saved dmesg output in case you need that.

ubuntu@zm93k8:/home/scottg$ dmesg > dmesg.out.Oct17_bug157241

I will also keep this test environment up for a couple of days in case
you need additional data.

Thank you.

https://bugs.launchpad.net/bugs/1717224

Title:
  virsh start of virtual guest domain fails with internal error due to
  low default aio-max-nr sysctl value
