Public bug reported:

Hello,

When issuing a live migration between KVM nodes that have multipath Cinder volumes attached, the migration sometimes hangs and causes qemu-kvm to crash; the only solution is a restart of the KVM node. Sometimes the live migration gets stuck while it tries to migrate the active RAM, and you will see something like this in nova-compute.log: http://paste.openstack.org/show/481773/ As you can see, it gets nowhere.

What is happening in the background is that, for some reason, the multipath volumes (as seen with 'multipath -ll') go into a 'faulty running' state. This breaks the block device and causes the qemu-kvm process to hang. The KVM node also tries to run blkid and kpartx, but all of those hang as well, which means you can get a load of 100+ just from those stuck processes.

[1015086.978188] end_request: I/O error, dev sdg, sector 41942912
[1015086.978398] device-mapper: multipath: Failing path 8:96.
[1015088.547034] qbr8eff45f7-ed: port 1(qvb8eff45f7-ed) entered disabled state
[1015088.791695] INFO: task qemu-system-x86:19383 blocked for more than 120 seconds.
[1015088.791940]       Tainted: P OX 3.13.0-68-generic #111-Ubuntu
[1015088.792147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1015088.792396] qemu-system-x86 D ffff88301f2f3180     0 19383      1 0x00000000
[1015088.792404]  ffff8817440ada88 0000000000000086 ffff8817fa574800 ffff8817440adfd8
[1015088.792414]  0000000000013180 0000000000013180 ffff8817fa574800 ffff88301f2f3a18
[1015088.792420]  0000000000000000 ffff882ff7ab5280 0000000000000000 ffff8817fa574800
[1015088.792426] Call Trace:
[1015088.792440]  [<ffffffff8172880d>] io_schedule+0x9d/0x140
[1015088.792449]  [<ffffffff811fc304>] do_blockdev_direct_IO+0x1ce4/0x2910
[1015088.792456]  [<ffffffff811f7170>] ? I_BDEV+0x10/0x10
[1015088.792462]  [<ffffffff811fcf85>] __blockdev_direct_IO+0x55/0x60
[1015088.792467]  [<ffffffff811f7170>] ? I_BDEV+0x10/0x10
[1015088.792472]  [<ffffffff811f7866>] blkdev_direct_IO+0x56/0x60
[1015088.792476]  [<ffffffff811f7170>] ? I_BDEV+0x10/0x10
[1015088.792482]  [<ffffffff81150ce1>] generic_file_direct_write+0xc1/0x180
[1015088.792487]  [<ffffffff811510a5>] __generic_file_aio_write+0x305/0x3d0
[1015088.792492]  [<ffffffff811f8146>] blkdev_aio_write+0x46/0x90
[1015088.792501]  [<ffffffff811bdc2a>] do_sync_write+0x5a/0x90
[1015088.792507]  [<ffffffff811be3b4>] vfs_write+0xb4/0x1f0
[1015088.792512]  [<ffffffff811bef62>] SyS_pwrite64+0x72/0xb0
[1015088.792519]  [<ffffffff81734cdd>] system_call_fastpath+0x1a/0x1f

root 19410 0.0 0.0     0    0 ? D 08:12 0:00 [blkid]
root 19575 0.0 0.0     0    0 ? D 08:13 0:00 [blkid]
root 19584 0.0 0.0 28276 1076 ? S 08:13 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root 21734 0.0 0.0 28276 1080 ? D 08:15 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root 21735 0.0 0.0 28276 1076 ? S 08:15 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000ed
root 22419 0.0 0.0 28276 1076 ? D 08:16 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000ed
root 22420 0.0 0.0 28276 1076 ? D 08:16 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root 22864 0.0 0.0 28276 1076 ? D 08:16 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000ed
root 22865 0.0 0.0 28276 1076 ? D 08:16 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root 23316 0.0 0.0 28276 1076 ? D 08:17 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root 23317 0.0 0.0 28276 1072 ? D 08:17 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000ed
root 23756 0.0 0.0 28276 1076 ? D 08:17 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root 24200 0.0 0.0 28276 1076 ? D 08:18 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root 24637 0.0 0.0 28276 1072 ? D 08:18 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root 25058 0.0 0.0 28276 1076 ? D 08:19 0:00 /sbin/kpartx -a -p -part /dev/mapper/36000d31000a6500000000000000000c6
root@kvm3:~#

Ultimately this causes so many issues on the KVM node that the only fix is a restart: because of all the libvirt locks you won't be able to stop, restart, or destroy the qemu-kvm process, and issuing a kill -9 won't help you either. The live migration will then fail with something like this:

2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [req-99771cf6-d17e-49f7-a01d-38201afbce69 212f451de64b4ae89c853f1430510037 e47ebdf3f3934025b37df3b85bdfd565 - - -] [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] Live migration failed.
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] Traceback (most recent call last):
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]   File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 5108, in _do_live_migration
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]     block_migration, migrate_data)
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5431, in live_migration
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]     migrate_data)
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 6071, in _live_migration
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]     dom, finish_event)
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]   File "/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py", line 5919, in _live_migration_monitor
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]     dom.abortJob()
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]     result = proxy_call(self._autowrap, f, *args, **kwargs)
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]     rv = execute(f, *args, **kwargs)
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]     six.reraise(c, e, tb)
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]   File "/usr/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]     rv = meth(*args, **kwargs)
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]   File "/usr/lib/python2.7/dist-packages/libvirt.py", line 482, in abortJob
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2]     if ret == -1: raise libvirtError ('virDomainAbortJob() failed', dom=self)
2015-12-14 08:19:51.577 23821 ERROR nova.compute.manager [instance: babf696c-55d1-4bde-be83-3124be2ac7f2] libvirtError: Timed out during operation: cannot acquire state change
lock (held by remoteDispatchDomainMigratePerform3)

I'm not sure whether this is the same bug as https://bugs.launchpad.net/nova/+bug/1419577, but I want to bring some light into this area since it affects our operations very much.

Best regards

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: cinder multipath

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1525802

Title:
  live migration with multipath cinder volumes crashes node

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1525802/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
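P.S. for anyone debugging the same symptoms: the two things the report describes (multipath paths going faulty, and blkid/kpartx/qemu stuck in uninterruptible sleep) can be spotted with a quick check along these lines. This is only an illustrative sketch, not part of the bug's logs; it assumes a Linux node with /proc mounted and, for the first part, multipath-tools installed.

```shell
#!/bin/sh
# Illustrative symptom check for the condition described in this report.

# 1. Look for dm-multipath paths in a faulty state (needs multipath-tools).
if command -v multipath >/dev/null 2>&1; then
    echo "faulty multipath paths:"
    multipath -ll 2>/dev/null | grep 'faulty' || echo "  (none)"
fi

# 2. List processes stuck in uninterruptible sleep (D state), like the hung
#    blkid/kpartx processes in the ps output above. In /proc/<pid>/stat,
#    field 1 is the pid, field 2 the command name, field 3 the state.
#    (A command name containing spaces would confuse this simple parse.)
for stat in /proc/[0-9]*/stat; do
    set -- $(cat "$stat" 2>/dev/null)
    [ "$3" = "D" ] && echo "D-state: pid $1 $2"
done
exit 0
```

A node exhibiting this bug would show 'faulty' path lines plus a steadily growing list of D-state kpartx/blkid entries; on a healthy node both lists are normally empty.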

