[Yahoo-eng-team] [Bug 1945983] Re: Duplicate iSCSI initiators causing live migration failures
** No longer affects: nova

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1945983

Title:
  Duplicate iSCSI initiators causing live migration failures

Status in devstack:
  Fix Released

Bug description:
  Description
  ===========
  See c#2 for the actual issue here.

  Steps to reproduce
  ==================
  LiveAutoBlockMigrationV225Test:test_live_migration_with_trunk or any
  live migration test with trunk ports fails during cleanup.

  Expected result
  ===============
  Both the test and cleanup pass without impacting libvirtd.

  Actual result
  =============
  The test passes, but cleanup locks up the single thread handling the
  libvirtd event loop in 6.0.0.

  Environment
  ===========
  1. Exact version of OpenStack you are running. See the following list
     for all releases: http://docs.openstack.org/releases/
     stable/xena and master

  2. Which hypervisor did you use? (For example: Libvirt + KVM,
     Libvirt + XEN, Hyper-V, PowerKVM, ...) What's the version of that?
     libvirt (6.0.0) and QEMU

  3. Which storage type did you use? (For example: Ceph, LVM, GPFS, ...)
     What's the version of that?
     N/A

  4. Which networking type did you use? (For example: nova-network,
     Neutron with OpenVSwitch, ...)
     Trunk ports

  Logs & Configs
  ==============
  Initially discovered and discussed as part of
  https://bugs.launchpad.net/nova/+bug/1912310 where the locking up
  within libvirtd causes other tests to then fail.

To manage notifications about this bug go to:
https://bugs.launchpad.net/devstack/+bug/1945983/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
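As the bug title indicates, the underlying problem is that every CI image boots with the same baked-in open-iscsi initiator name. A hedged sketch of how the duplicate could be confirmed: on a real deployment you would compare /etc/iscsi/initiatorname.iscsi across the compute hosts; the snippet below simulates that comparison with two stand-in copies of the image-baked file (the iqn value is made up for illustration).

```shell
# On each compute host the baked-in name lives in the open-iscsi config;
# identical values across hosts confirm the duplicate:
#   grep ^InitiatorName= /etc/iscsi/initiatorname.iscsi
# Simulated here with two illustrative copies of the image-baked file.
set -eu
node0=$(mktemp)
node1=$(mktemp)
printf 'InitiatorName=iqn.1993-08.org.debian:01:deadbeefcafe\n' > "$node0"
printf 'InitiatorName=iqn.1993-08.org.debian:01:deadbeefcafe\n' > "$node1"
if cmp -s "$node0" "$node1"; then
  echo "duplicate initiator names detected"
fi
```

With duplicated names, the storage backend treats both computes as the same iSCSI client, which is what breaks volume attachments during live migration.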
[Yahoo-eng-team] [Bug 1945983] Re: Duplicate iSCSI initiators causing live migration failures
Reviewed:  https://review.opendev.org/c/openstack/devstack/+/812391
Committed: https://opendev.org/openstack/devstack/commit/714826d1a27085ba2384ca495c876588d77f0d27
Submitter: "Zuul (22348)"
Branch:    master

commit 714826d1a27085ba2384ca495c876588d77f0d27
Author: Lee Yarwood
Date:   Mon Oct 4 18:07:17 2021 +0100

    nova: Ensure each compute uses a unique iSCSI initiator

    The current initiator name embedded in our CI images is not unique
    at present and can often cause failures during live migrations with
    attached volumes.

    This change ensures the name is unique by running iscsi-iname again
    and overwriting the existing name. We could potentially do this
    during the image build process itself, but given that devstack
    systems are not supposed to be multi-purpose, this should be safe
    to do during the devstack run.

    Closes-Bug: #1945983
    Change-Id: I9ed26a17858df96c04be9ae52bf2e33e023869a5

** Changed in: devstack
   Status: In Progress => Fix Released

Status in devstack:
  Fix Released
Status in OpenStack Compute (nova):
  Incomplete
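The fix above regenerates the initiator name with iscsi-iname at stack time. A rough sketch of the idea follows; note the assumptions: the iqn prefix used below is the Debian open-iscsi default (iscsi-iname's actual output prefix may differ), a random hex suffix stands in for iscsi-iname's output so the snippet runs without open-iscsi installed, and a temp file stands in for /etc/iscsi/initiatorname.iscsi. See the linked review for the actual devstack implementation.

```shell
# Sketch only: write a freshly generated, host-unique initiator name.
# On a real devstack node the name would come from `iscsi-iname` and be
# written to /etc/iscsi/initiatorname.iscsi before (re)starting iscsid.
set -eu
SUFFIX=$(hexdump -n 6 -e '6/1 "%02x"' /dev/urandom)  # 12 random hex chars
INITIATOR="iqn.1993-08.org.debian:01:${SUFFIX}"      # prefix is an assumption
CONF=$(mktemp)                                       # stand-in for the real config path
echo "InitiatorName=${INITIATOR}" > "$CONF"
cat "$CONF"
```

Because the suffix is generated per host at stack time rather than baked into the image, each compute presents a distinct initiator to the storage backend.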
[Yahoo-eng-team] [Bug 1945983] Re: Duplicate iSCSI initiators causing live migration failures
** Also affects: devstack
   Importance: Undecided
       Status: New

** Description changed:

  Description
  ===========
- The initial request to destroy the instance
- (45adbb55-491d-418b-ba68-7db43d1c235b / instance-000d) that had been
- live migrated to the host comes into n-cpu here:
-
- https://e29606aade3009a4207d-11b479ab8ac0999ee2009c93a602f83a.ssl.cf2.rackcdn.com/811748/4/check/nova-live-migration/1cb6fb3/compute1/logs/screen-n-cpu.txt
-
- 9916 Oct 01 11:10:33.162056 ubuntu-focal-iweb-mtl01-0026751352 nova-compute[67743]: DEBUG oslo_concurrency.lockutils [None req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest-LiveMigrationTest-22574426 tempest-LiveMigrationTest-22574426-project] Lock "45adbb55-491d-418b-ba68-7db43d1c235b" acquired by "nova.compute.manager.ComputeManager.terminate_instance.<locals>.do_terminate_instance" :: waited 0.000s {{(pid=67743) inner /usr/local/lib/python3.8/dist-packages/oslo_concurrency/lockutils.py:355}}
-
- In libvirtd we see the thread attempt and fail to terminate the process
- that appears busy:
-
- https://e29606aade3009a4207d-11b479ab8ac0999ee2009c93a602f83a.ssl.cf2.rackcdn.com/811748/4/check/nova-live-migration/1cb6fb3/compute1/logs/libvirt/libvirt/libvirtd_log.txt
-
- 77557 2021-10-01 11:10:33.173+0000: 57258: debug : virThreadJobSet:93 : Thread 57258 (virNetServerHandleJob) is now running job remoteDispatchDomainDestroy
- 77558 2021-10-01 11:10:33.173+0000: 57258: debug : qemuProcessKill:7197 : vm=0x7f93400f98c0 name=instance-000d pid=73535 flags=0x1
- 77559 2021-10-01 11:10:33.173+0000: 57258: debug : virProcessKillPainfullyDelay:356 : vpid=73535 force=1 extradelay=0
- [..]
- 77673 2021-10-01 11:10:43.180+0000: 57258: debug : virProcessKillPainfullyDelay:375 : Timed out waiting after SIGTERM to process 73535, sending SIGKILL
- [..]
- 77674 2021-10-01 11:11:13.202+0000: 57258: error : virProcessKillPainfullyDelay:403 : Failed to terminate process 73535 with SIGKILL: Device or resource busy
- 77675 2021-10-01 11:11:13.202+0000: 57258: debug : qemuDomainObjBeginJobInternal:9416 : Starting job: job=destroy agentJob=none asyncJob=none (vm=0x7f93400f98c0 name=instance-000d, current job=none agentJob=none async=none)
- 77676 2021-10-01 11:11:13.202+0000: 57258: debug : qemuDomainObjBeginJobInternal:9470 : Started job: destroy (async=none vm=0x7f93400f98c0 name=instance-000d)
- 77677 2021-10-01 11:11:13.203+0000: 57258: debug : qemuProcessStop:7279 : Shutting down vm=0x7f93400f98c0 name=instance-000d id=14 pid=73535, reason=destroyed, asyncJob=none, flags=0x0
- 77678 2021-10-01 11:11:13.203+0000: 57258: debug : qemuDomainLogAppendMessage:10691 : Append log message (vm='instance-000d' message='2021-10-01 11:11:13.203+0000: shutting down, reason=destroyed
- 77679 ) stdioLogD=1
- 77680 2021-10-01 11:11:13.204+0000: 57258: info : qemuMonitorClose:916 : QEMU_MONITOR_CLOSE: mon=0x7f93400ecce0 refs=4
- 77681 2021-10-01 11:11:13.205+0000: 57258: debug : qemuProcessKill:7197 : vm=0x7f93400f98c0 name=instance-000d pid=73535 flags=0x5
- 77682 2021-10-01 11:11:13.205+0000: 57258: debug : virProcessKillPainfullyDelay:356 : vpid=73535 force=1 extradelay=0
- 77683 2021-10-01 11:11:23.213+0000: 57258: debug : virProcessKillPainfullyDelay:375 : Timed out waiting after SIGTERM to process 73535, sending SIGKILL
- [..]
- 77684 2021-10-01 11:11:53.237+0000: 57258: error : virProcessKillPainfullyDelay:403 : Failed to terminate process 73535 with SIGKILL: Device or resource busy
-
- Towards the end we get an idea of why this could be happening, with the
- following virNetDevTapDelete failure:
-
- 77689 2021-10-01 11:11:53.241+0000: 57258: error : virNetDevTapDelete:339 : Unable to associate TAP device: Device or resource busy
- 77690 2021-10-01 11:11:53.241+0000: 57258: debug : virSystemdTerminateMachine:488 : Attempting to terminate machine via systemd
-
- libvirtd then attempts to kill the domain via systemd and this
- eventually completes, but in the meantime the attached volume has
- disappeared; this might be due to Tempest cleaning the volume up in the
- background:
-
- 77955 2021-10-01 11:11:54.266+0000: 57258: debug : virCgroupV1Remove:699 : Done removing cgroup /machine.slice/machine-qemu\x2d14\x2dinstance\x2d000d.scope
- 77956 2021-10-01 11:11:54.266+0000: 57258: warning : qemuProcessStop:7488 : Failed to remove cgroup for instance-000d
- 77957 2021-10-01 11:12:30.391+0000: 57258: error : virProcessRunInFork:1159 : internal error: child reported (status=125): unable to open /dev/sdb: No such device or address
- 77958 2021-10-01 11:12:30.391+0000: 57258: warning : qemuBlockRemoveImageMetadata:2629 : Unable to remove disk metadata on vm instance-000d from /dev/sdb (disk target vda)
- 77959 2021-10-01 11:12:30.392+0000: 57258: debug : virObjectEventNew:631 : obj=0x7f934c087670
- 77960 2021-10-01
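The virProcessKillPainfullyDelay lines in the log show a SIGTERM-then-SIGKILL escalation. As an illustrative sketch only (the function name, return convention, and one-second polling below are approximations, not libvirt's actual implementation; the log shows libvirt itself waiting roughly 10 seconds after SIGTERM and 30 more after SIGKILL):

```shell
# Illustrative sketch of a SIGTERM -> SIGKILL escalation in the style
# of libvirt's virProcessKillPainfullyDelay. Returns 0 once the target
# process is gone, 1 if it survives both signals (the "Failed to
# terminate process ... with SIGKILL" case in the log above).
kill_painfully() {
  pid=$1
  timeout=${2:-2}                                  # seconds to wait per signal
  for sig in TERM KILL; do
    kill -"$sig" "$pid" 2>/dev/null || return 0    # already gone
    i=0
    while [ "$i" -lt "$timeout" ]; do
      kill -0 "$pid" 2>/dev/null || return 0       # process exited
      sleep 1
      i=$((i + 1))
    done
  done
  return 1
}
```

A process stuck in uninterruptible I/O, as here with the busy TAP device and vanished volume, can outlive even SIGKILL until the kernel releases it, which is consistent with the repeated "Device or resource busy" errors above.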