[Yahoo-eng-team] [Bug 1945983] Re: Duplicate iSCSI initiators causing live migration failures

2021-10-07, Lee Yarwood
** No longer affects: nova

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1945983

Title:
  Duplicate iSCSI initiators causing live migration failures

Status in devstack:
  Fix Released

Bug description:
  Description
  ===========

  See c#2 for the actual issue here.

  Steps to reproduce
  ==================

  LiveAutoBlockMigrationV225Test:test_live_migration_with_trunk or any
  live migration test with trunk ports fails during cleanup.

  Expected result
  ===============

  Both the test and cleanup pass without impacting libvirtd.

  Actual result
  =============

  The test passes, but the cleanup locks up the single thread handling the
  libvirtd event loop in libvirt 6.0.0.

  Environment
  ===========
  1. Exact version of OpenStack you are running. See the following
    list for all releases: http://docs.openstack.org/releases/

     stable/xena and master

  2. Which hypervisor did you use?
     (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
     What's the version of that?

     libvirt (6.0.0) and QEMU

  3. Which storage type did you use?
     (For example: Ceph, LVM, GPFS, ...)
     What's the version of that?

     N/A

  4. Which networking type did you use?
     (For example: nova-network, Neutron with OpenVSwitch, ...)

     Trunk ports.

  Logs & Configs
  ==============

  Initially discovered and discussed as part of
  https://bugs.launchpad.net/nova/+bug/1912310, where the lock-up inside
  libvirtd then causes other tests to fail.
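
  Since the failure ultimately comes down to compute hosts sharing the same
  baked-in initiator IQN, a quick sanity check on a multinode environment is
  to compare the initiator files across the hosts (a hypothetical check, not
  part of the original report, assuming the stock open-iscsi layout):

      # Run on every compute node taking part in the live migration; if two
      # hosts print the same IQN they will collide on the iSCSI backend.
      sudo cat /etc/iscsi/initiatorname.iscsi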

To manage notifications about this bug go to:
https://bugs.launchpad.net/devstack/+bug/1945983/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1945983] Re: Duplicate iSCSI initiators causing live migration failures

2021-10-06, OpenStack Infra
Reviewed:  https://review.opendev.org/c/openstack/devstack/+/812391
Committed: https://opendev.org/openstack/devstack/commit/714826d1a27085ba2384ca495c876588d77f0d27
Submitter: "Zuul (22348)"
Branch: master

commit 714826d1a27085ba2384ca495c876588d77f0d27
Author: Lee Yarwood 
Date:   Mon Oct 4 18:07:17 2021 +0100

nova: Ensure each compute uses a unique iSCSI initiator

The current initiator name embedded in our CI images is not unique at
present and can often cause failures during live migrations with
attached volumes. This change ensures the name is unique by running
iscsi-iname again and overwriting the existing name.

We could potentially do this during the image build process itself but
given that devstack systems are not supposed to be multi-purpose this
should be safe to do during the devstack run.

Closes-Bug: #1945983
Change-Id: I9ed26a17858df96c04be9ae52bf2e33e023869a5
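
In practice the change described above boils down to regenerating the IQN at
stack time and writing it back to the open-iscsi configuration. A minimal
sketch of that idea (not the verbatim devstack patch, which is in the change
linked above) looks like this:

    # Generate a fresh, random initiator IQN with iscsi-iname and overwrite
    # the name baked into the CI image so every compute node ends up unique.
    echo "InitiatorName=$(sudo iscsi-iname)" | sudo tee /etc/iscsi/initiatorname.iscsi
    # Restart iscsid so a running daemon picks up the new name; whether this
    # is needed depends on the deployment, so treat it as optional.
    sudo systemctl restart iscsid || true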


** Changed in: devstack
   Status: In Progress => Fix Released

Title:
  Duplicate iSCSI initiators causing live migration failures

Status in devstack:
  Fix Released
Status in OpenStack Compute (nova):
  Incomplete



[Yahoo-eng-team] [Bug 1945983] Re: Duplicate iSCSI initiators causing live migration failures

2021-10-04, Lee Yarwood
** Also affects: devstack
   Importance: Undecided
   Status: New

** Description changed:

  Description
  ===========
  
- The initial request to destroy the instance
- (45adbb55-491d-418b-ba68-7db43d1c235b / instance-0000000d) that had been
- live migrated to the host comes into n-cpu here:
- 
- https://e29606aade3009a4207d-11b479ab8ac0999ee2009c93a602f83a.ssl.cf2.rackcdn.com/811748/4/check/nova-live-migration/1cb6fb3/compute1/logs/screen-n-cpu.txt
- 
-  9916 Oct 01 11:10:33.162056 ubuntu-focal-iweb-mtl01-0026751352 nova-compute[67743]: DEBUG oslo_concurrency.lockutils [None req-9a2b5677-0383-480a-bc26-9a70831bd975 tempest-LiveMigrationTest-22574426 tempest-LiveMigrationTest-22574426-project] Lock "45adbb55-491d-418b-ba68-7db43d1c235b" acquired by "nova.compute.manager.ComputeManager.terminate_instance..do_terminate_instance" :: waited 0.000s {{(pid=67743) inner /usr/local/lib/python3.8/dist-packages/oslo_concurrency/lockutils.py:355}}
- 
- In libvirtd we see the thread attempt and fail to terminate the process
- that appears busy:
- 
- https://e29606aade3009a4207d-11b479ab8ac0999ee2009c93a602f83a.ssl.cf2.rackcdn.com/811748/4/check/nova-live-migration/1cb6fb3/compute1/logs/libvirt/libvirt/libvirtd_log.txt
- 
- 77557 2021-10-01 11:10:33.173+0000: 57258: debug : virThreadJobSet:93 : Thread 57258 (virNetServerHandleJob) is now running job remoteDispatchDomainDestroy
- 77558 2021-10-01 11:10:33.173+0000: 57258: debug : qemuProcessKill:7197 : vm=0x7f93400f98c0 name=instance-0000000d pid=73535 flags=0x1
- 77559 2021-10-01 11:10:33.173+0000: 57258: debug : virProcessKillPainfullyDelay:356 : vpid=73535 force=1 extradelay=0
- [..]
- 77673 2021-10-01 11:10:43.180+0000: 57258: debug : virProcessKillPainfullyDelay:375 : Timed out waiting after SIGTERM to process 73535, sending SIGKILL
- [..]
- 77674 2021-10-01 11:11:13.202+0000: 57258: error : virProcessKillPainfullyDelay:403 : Failed to terminate process 73535 with SIGKILL: Device or resource busy
- 77675 2021-10-01 11:11:13.202+0000: 57258: debug : qemuDomainObjBeginJobInternal:9416 : Starting job: job=destroy agentJob=none asyncJob=none (vm=0x7f93400f98c0 name=instance-0000000d, current job=none agentJob=none async=none)
- 77676 2021-10-01 11:11:13.202+0000: 57258: debug : qemuDomainObjBeginJobInternal:9470 : Started job: destroy (async=none vm=0x7f93400f98c0 name=instance-0000000d)
- 77677 2021-10-01 11:11:13.203+0000: 57258: debug : qemuProcessStop:7279 : Shutting down vm=0x7f93400f98c0 name=instance-0000000d id=14 pid=73535, reason=destroyed, asyncJob=none, flags=0x0
- 77678 2021-10-01 11:11:13.203+0000: 57258: debug : qemuDomainLogAppendMessage:10691 : Append log message (vm='instance-0000000d' message='2021-10-01 11:11:13.203+0000: shutting down, reason=destroyed
- 77679 ) stdioLogD=1
- 77680 2021-10-01 11:11:13.204+0000: 57258: info : qemuMonitorClose:916 : QEMU_MONITOR_CLOSE: mon=0x7f93400ecce0 refs=4
- 77681 2021-10-01 11:11:13.205+0000: 57258: debug : qemuProcessKill:7197 : vm=0x7f93400f98c0 name=instance-0000000d pid=73535 flags=0x5
- 77682 2021-10-01 11:11:13.205+0000: 57258: debug : virProcessKillPainfullyDelay:356 : vpid=73535 force=1 extradelay=0
- 77683 2021-10-01 11:11:23.213+0000: 57258: debug : virProcessKillPainfullyDelay:375 : Timed out waiting after SIGTERM to process 73535, sending SIGKILL
- [..]
- 77684 2021-10-01 11:11:53.237+0000: 57258: error : virProcessKillPainfullyDelay:403 : Failed to terminate process 73535 with SIGKILL: Device or resource busy
- 
- Towards the end we get an idea why this could be happening with the
- following virNetDevTapDelete failure:
- 
- 77689 2021-10-01 11:11:53.241+0000: 57258: error : virNetDevTapDelete:339 : Unable to associate TAP device: Device or resource busy
- 77690 2021-10-01 11:11:53.241+0000: 57258: debug : virSystemdTerminateMachine:488 : Attempting to terminate machine via systemd
- 
- libvirtd then attempts to kill the domain via systemd and this
- eventually completes but in the meantime the attached volume has
- disappeared, this might be due to Tempest cleaning the volume up in the
- background:
- 
- 77955 2021-10-01 11:11:54.266+0000: 57258: debug : virCgroupV1Remove:699 : Done removing cgroup /machine.slice/machine-qemu\x2d14\x2dinstance\x2d0000000d.scope
- 77956 2021-10-01 11:11:54.266+0000: 57258: warning : qemuProcessStop:7488 : Failed to remove cgroup for instance-0000000d
- 77957 2021-10-01 11:12:30.391+0000: 57258: error : virProcessRunInFork:1159 : internal error: child reported (status=125): unable to open /dev/sdb: No such device or address
- 77958 2021-10-01 11:12:30.391+0000: 57258: warning : qemuBlockRemoveImageMetadata:2629 : Unable to remove disk metadata on vm instance-0000000d from /dev/sdb (disk target vda)
- 77959 2021-10-01 11:12:30.392+0000: 57258: debug : virObjectEventNew:631 : obj=0x7f934c087670
- 77960 2021-10-01