[Yahoo-eng-team] [Bug 1840686] Re: Xenial images won't reboot if disk size is > 2TB when using GPT
Uploaded grub2-signed "Rebuild against grub2 2.02~beta2-36ubuntu3.23. (LP: #1840686)"

** Changed in: grub2-signed (Ubuntu)
   Status: In Progress => Fix Released

** Changed in: grub2-signed (Ubuntu)
   Assignee: Eric Desrochers (slashd) => (unassigned)

** Changed in: grub2-signed (Ubuntu)
   Importance: High => Undecided

https://bugs.launchpad.net/bugs/1840686

Title:
  Xenial images won't reboot if disk size is > 2TB when using GPT

Status in cloud-init: Won't Fix
Status in grub package in Ubuntu: Fix Released
Status in grub2-signed package in Ubuntu: Fix Released
Status in grub source package in Xenial: In Progress

Bug description:

[Impact]

On Xenial images that use GPT instead of MBR to enable EFI-based booting, an instance with a disk size of 2049 GB or larger boots once and then hangs on the next boot (the logs indicate it hangs at "Booting Hard Disk 0"). This is a grub2 bug: the system becomes unbootable after an ext* online resize if no resize_inode was created at ext* format time.

[Test Case]

To reproduce:

1) Create an instance with a 3072 GB disk, using an image serial that has GPT:
   gcloud compute instances create test-3072-xenial --image daily-ubuntu-1604-xenial-v20190731 --image-project ubuntu-os-cloud-devel --boot-disk-size 3072
2) Reboot the instance.

The instance hangs on reboot and you cannot connect. In the GCP console, under Logs > Serial port 1 (console), you will see the boot process has stopped at "Booting Hard Disk 0".

I have built a test package, which is available here:
https://launchpad.net/~mruffell/+archive/ubuntu/lp1840686-test

If you do step 1) but do not reboot, and instead add the PPA and install the new grub like so:

1) gcloud compute instances create test-3072-xenial --image daily-ubuntu-1604-xenial-v20190731 --image-project ubuntu-os-cloud-devel --boot-disk-size 3072
2) sudo add-apt-repository ppa:mruffell/lp1840686-test
3) sudo apt-get update
4) sudo apt remove grub-common grub-efi-amd64 grub-efi-amd64-bin grub-efi-amd64-signed grub-pc-bin grub2-common
5) sudo apt install grub-common grub-efi-amd64 grub-efi-amd64-bin grub-pc-bin grub2-common
6) sudo grub-install /dev/sda
7) sudo reboot

then the instance boots successfully and you can connect.

Note: you must use "daily-ubuntu-1604-xenial-v20190731" as the image, as it is enabled for GPT and EFI. GCP was reverted back to MBR and BIOS booting because of this bug, so the latest images will not reproduce the problem.

[Regression Potential]

Grub is a core package, and every care must be taken not to introduce regressions. The commit is present in B, D, E and F (Bionic, Disco, Eoan and Focal), and is considered well tested and widely adopted by the community. The commit comes with its own test case for the ext4_metabg fix. The changes are localised to ext*-based filesystems; since those are the most popular family of filesystems in the community, this does not reduce the risk of breakage by much. A regression would have a large impact and, in the worst case, could lead to unbootable systems and data loss for users who are not technical enough to reinstall grub from a working package inside the broken system's chroot.

[Other Info]

In comment #4, Sultan identifies the fix as:

commit e20aa39ea4298011ba716087713cff26c6c52006
Author: Vladimir Serbinenko
Date: Mon Feb 16 20:53:26 2015 +0100
Subject: ext2: Support META_BG.

This commit is from upstream grub2, and can be found here:
https://git.savannah.gnu.org/cgit/grub.git/commit/?id=e20aa39ea4298011ba716087713cff26c6c52006

Looking at when this was merged:

$ git describe --contains e20aa39ea4298011ba716087713cff26c6c52006
2.02-beta3~429

The commit is present in B, D, E and F, leaving X (Xenial) as the only series needing an SRU. The commit cherry-picks cleanly to Xenial, because the delta from 2.02~beta2-36ubuntu3.22 to 2.02-beta3~429 is small.
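A quick way to check whether a given ext4 filesystem hits the combination described in [Impact] — meta_bg enabled with no resize_inode, which the unpatched Xenial grub2 cannot parse — is to read the superblock feature list. A diagnostic sketch, with /dev/sda1 standing in for the boot filesystem:

$ sudo dumpe2fs -h /dev/sda1 2>/dev/null | grep -i 'filesystem features'
  # affected combination: 'meta_bg' present in the list while 'resize_inode' is absent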
[Yahoo-eng-team] [Bug 1840686] Re: Xenial images won't reboot if disk size is > 2TB when using GPT
** Also affects: grub2-signed (Ubuntu)
   Importance: Undecided
   Status: New

** Changed in: grub2-signed (Ubuntu)
   Status: New => In Progress

** Changed in: grub2-signed (Ubuntu)
   Importance: Undecided => High

** Changed in: grub2-signed (Ubuntu)
   Assignee: (unassigned) => Eric Desrochers (slashd)

https://bugs.launchpad.net/bugs/1840686

Title:
  Xenial images won't reboot if disk size is > 2TB when using GPT

Status in cloud-init: Won't Fix
Status in grub package in Ubuntu: Fix Released
Status in grub2-signed package in Ubuntu: Fix Released
Status in grub source package in Xenial: In Progress
[Yahoo-eng-team] [Bug 1560965] Re: libvirt selects wrong root device name
** Changed in: horizon
   Status: In Progress => Fix Released

https://bugs.launchpad.net/bugs/1560965

Title:
  libvirt selects wrong root device name

Status in Cinder: Invalid
Status in OpenStack Dashboard (Horizon): Fix Released
Status in OpenStack Compute (nova): Invalid

Bug description:

Referring to Liberty; Compute runs with the Xen hypervisor.

When trying to boot an instance from volume via Horizon, the VM fails to spawn because of an invalid block device mapping. I found one cause in the default initial "device_name=vda" in the file /srv/www/openstack-dashboard/openstack_dashboard/dashboards/project/instances/workflows/create_instance.py. The log file nova-compute.log reports "Ignoring supplied device name: /dev/vda", but in the next step it uses it anyway and says "Booting with blank volume at /dev/vda".

To test my assumption, I blanked device_name and edited the array dev_mapping_2 to only append device_name if it is not empty. That works perfectly for booting from Horizon and could be one way to fix this. But if you use the nova boot command, you can still provide (multiple) device names, for example when launching an instance and attaching a blank volume. It seems that libvirt is indeed ignoring the supplied device name, but only if it is not the root device.

If I understand correctly, a user-supplied device_name should also be nulled out for root_device_name and picked by libvirt if it is not valid. Likewise, the default value for device_name in the Horizon dashboard should be None; if one is supplied, it should be processed and probably validated by libvirt.

Steps to reproduce from Horizon (using Xen as hypervisor):
1. Go to the Horizon dashboard and launch an instance
2. Select "Boot from image (creates a new volume)" as Instance Boot Source

Expected result: the instance starts with /dev/xvda as root device.
Actual result: the build of the instance fails; nova-compute.log reports "BuildAbortException: Build of instance c15f3344-f9e3-4853-9c18-ea8741563205 aborted: Block Device Mapping is Invalid"

Steps to reproduce from the nova CLI:
1. Launch an instance from the command line via
   nova boot --flavor 1 --block-device source=image,id=IMAGE_ID,dest=volume,size=1,shutdown=remove,bootindex=0,device=vda --block-device source=blank,dest=volume,size=1,shutdown=remove,device=vdb VM

Expected result: the instance starts with /dev/xvda as root device.
Actual result: the build of the instance fails; the device name for vdb is ignored and replaced correctly, but the root device is not.
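Consistent with the reporter's suggestion that device names be left to nova/libvirt, the same CLI reproduction can be rerun with the device= attributes simply omitted — a sketch of the workaround, not the merged fix:

$ nova boot --flavor 1 \
    --block-device source=image,id=IMAGE_ID,dest=volume,size=1,shutdown=remove,bootindex=0 \
    --block-device source=blank,dest=volume,size=1,shutdown=remove VM

With no device name supplied, libvirt on Xen should be free to assign /dev/xvda to the root disk itself, which is the behaviour the report argues for.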
[Yahoo-eng-team] [Bug 1560965] Re: libvirt selects wrong root device name
As agreed with amotoki, we will fix this in two phases.

Phase 1: My current fix addresses instance creation using an image with a new volume, in the function setFinalSpecBootImageToVolume(), by leaving the device name decision to nova.

Phase 2: Address instance creation using an existing volume or snapshot (which still defaults to 'vda'), handled by another function, setFinalSpecBootFromVolumeDevice(), which uses BDMv1 (legacy) instead of BDMv2. As we fix that part in the second phase, we should consider taking the time to upgrade the code to BDMv2.

So this is the plan right now. Phase 1 is covered here and is waiting for Code-Review:
https://review.openstack.org/644982

- Eric

** Changed in: nova
   Status: Confirmed => Invalid

** Changed in: cinder
   Status: New => Invalid

https://bugs.launchpad.net/bugs/1560965

Title:
  libvirt selects wrong root device name

Status in Cinder: Invalid
Status in OpenStack Dashboard (Horizon): In Progress
Status in OpenStack Compute (nova): Invalid
[Yahoo-eng-team] [Bug 1374508] Re: Mismatch happens between BDM and domain XML If instance does not respond to ACPI hotplug during detach/attach.
** Also affects: nova (Ubuntu Trusty)
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1374508

Title:
  Mismatch happens between BDM and domain XML If instance does not respond to ACPI hotplug during detach/attach.

Status in OpenStack Compute (nova): Fix Released
Status in nova package in Ubuntu: New
Status in nova source package in Trusty: New

Bug description:

tempest.api.compute.servers.test_server_rescue_negative:ServerRescueNegativeTestJSON.test_rescued_vm_detach_volume

This test passes, but it fails to clean up properly after itself: the detach completes without running the necessary iscsiadm commands. In nova.virt.libvirt.volume.LibvirtISCSIVolumeDriver.disconnect_volume, the list returned by self.connection._get_all_block_devices includes the host_device, which means self._disconnect_from_iscsi_portal is never run. You can see evidence of this in /etc/iscsi/nodes, as well as errors logged in /var/log/syslog.

I'm guessing there is a race between the unrescue and the detach within libvirt. In nova.virt.libvirt.driver.LibvirtDriver.detach_volume, if I put a sleep before virt_dom.detachDeviceFlags(xml, flags), the detach appears to work properly; a sleep after that line does not appear to have any effect.
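The leaked attachment described above can be confirmed on the compute host after the tempest run — a diagnostic sketch using standard open-iscsi tooling:

$ sudo iscsiadm -m session              # sessions for the detached volume's target should be gone
$ ls /etc/iscsi/nodes                   # stale node records are evidence of the skipped logout
$ grep -i iscsi /var/log/syslog | tail  # the cleanup errors mentioned in the report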
[Yahoo-eng-team] [Bug 1692397] Re: hypervisor statistics could be incorrect
** Also affects: nova (Ubuntu Xenial)
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1692397

Title:
  hypervisor statistics could be incorrect

Status in Ubuntu Cloud Archive: Fix Released
Status in OpenStack Compute (nova): Fix Released
Status in OpenStack Compute (nova) newton series: Fix Committed
Status in OpenStack Compute (nova) ocata series: Fix Committed
Status in nova package in Ubuntu: Fix Released
Status in nova source package in Xenial: New

Bug description:

Hypervisor statistics can be incorrect: if we kill a nova-compute service, delete the service from the nova DB, and then start the nova-compute service again, the result of the hypervisor statistics API (nova hypervisor-stats) will be incorrect.

How to reproduce:

Step 1. Check the correct statistics before doing anything:

root@SZX1000291919:/opt/stack/nova# nova hypervisor-stats
+----------------------+-------+
| Property             | Value |
+----------------------+-------+
| count                | 1     |
| current_workload     | 0     |
| disk_available_least | 14    |
| free_disk_gb         | 34    |
| free_ram_mb          | 6936  |
| local_gb             | 35    |
| local_gb_used        | 1     |
| memory_mb            | 7960  |
| memory_mb_used       | 1024  |
| running_vms          | 1     |
| vcpus                | 8     |
| vcpus_used           | 1     |
+----------------------+-------+

Step 2. Kill the compute service:

root@SZX1000291919:/var/log/nova# ps -ef | grep nova-com
root 120419 120411 0 11:06 pts/27 00:00:00 sg libvirtd /usr/local/bin/nova-compute --config-file /etc/nova/nova.conf --log-file /var/log/nova/nova-compute.log
root 120420 120419 0 11:06 pts/27 00:00:07 /usr/bin/python /usr/local/bin/nova-compute --config-file /etc/nova/nova.conf --log-file /var/log/nova/nova-compute.log
root@SZX1000291919:/var/log/nova# kill -9 120419
root@SZX1000291919:/var/log/nova# /usr/local/bin/stack: line 19: 120419 Killed sg libvirtd '/usr/local/bin/nova-compute --config-file /etc/nova/nova.conf --log-file /var/log/nova/nova-compute.log' > /dev/null 2>&1
root@SZX1000291919:/var/log/nova# nova service-list
+----+------------------+---------------+----------+---------+-------+------------------------+-----------------+
| Id | Binary           | Host          | Zone     | Status  | State | Updated_at             | Disabled Reason |
+----+------------------+---------------+----------+---------+-------+------------------------+-----------------+
| 4  | nova-conductor   | SZX1000291919 | internal | enabled | up    | 2017-05-22T03:24:36.00 | -               |
| 6  | nova-scheduler   | SZX1000291919 | internal | enabled | up    | 2017-05-22T03:24:36.00 | -               |
| 7  | nova-consoleauth | SZX1000291919 | internal | enabled | up    | 2017-05-22T03:24:37.00 | -               |
| 8  | nova-compute     | SZX1000291919 | nova     | enabled | down  | 2017-05-22T03:23:38.00 | -               |
| 9  | nova-cert        | SZX1000291919 | internal | enabled | down  | 2017-05-17T02:50:13.00 | -               |
+----+------------------+---------------+----------+---------+-------+------------------------+-----------------+

Step 3. Delete the service from the DB:

root@SZX1000291919:/var/log/nova# nova service-delete 8
root@SZX1000291919:/var/log/nova# nova service-list
+----+------------------+---------------+----------+---------+-------+------------------------+-----------------+
| Id | Binary           | Host          | Zone     | Status  | State | Updated_at             | Disabled Reason |
+----+------------------+---------------+----------+---------+-------+------------------------+-----------------+
| 4  | nova-conductor   | SZX1000291919 | internal | enabled | up    | 2017-05-22T03:25:16.00 | -               |
| 6  | nova-scheduler   | SZX1000291919 | internal | enabled | up    | 2017-05-22T03:25:16.00 | -               |
| 7  | nova-consoleauth | SZX1000291919 | internal | enabled | up    | 2017-05-22T03:25:17.00 | -               |
| 9  | nova-cert        | SZX1000291919 | internal | enabled | down  | 2017-05-17T02:50:13.00 | -               |
+----+------------------+---------------+----------+---------+-------+------------------------+-----------------+

Step 4. Start the compute service again:

root@SZX1000291919:/var/log/nova# nova service-list
| Id | Binary | Host | Zone | Status | State | Updated_at | Disabled Reason |
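The message is cut off above, but the four steps condense into the following sketch (assuming a devstack-style host as in the transcript, with service id 8 taken from 'nova service-list'); a non-empty before/after diff demonstrates the skewed statistics:

$ nova hypervisor-stats > /tmp/stats-before.txt
$ sudo pkill -9 -f nova-compute                          # Step 2: kill the compute service
$ nova service-delete 8                                  # Step 3: delete it from the DB
$ sg libvirtd '/usr/local/bin/nova-compute --config-file /etc/nova/nova.conf --log-file /var/log/nova/nova-compute.log' &   # Step 4: restart, as in the transcript
$ diff /tmp/stats-before.txt <(nova hypervisor-stats)    # a non-empty diff reproduces the bug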
[Yahoo-eng-team] [Bug 1535918] Re: instance.host not updated on evacuation
** Also affects: nova (Ubuntu Artful)
   Importance: Undecided
   Status: New

** Also affects: nova (Ubuntu Zesty)
   Importance: Undecided
   Status: New

** Changed in: nova (Ubuntu Artful)
   Status: New => Fix Released

** Changed in: nova (Ubuntu Zesty)
   Status: New => Fix Released

** Changed in: nova (Ubuntu Xenial)
   Status: New => In Progress

https://bugs.launchpad.net/bugs/1535918

Title:
  instance.host not updated on evacuation

Status in Ubuntu Cloud Archive: Fix Released
Status in OpenStack Compute (nova): Fix Released
Status in nova-powervm: Fix Released
Status in nova package in Ubuntu: Fix Released
Status in nova source package in Xenial: In Progress
Status in nova source package in Zesty: Fix Released
Status in nova source package in Artful: Fix Released

Bug description:

[Impact]

I created several VM instances and checked that they were all in ACTIVE state. Right after checking them, I shut down nova-compute on their host (to set up the test), then tried to evacuate them to the other host. The evacuation failed, leaving the VMs in ERROR state.

After some testing and analysis I found that the two commits below are related (please refer to the [Others] section). In this context, migration_context is a DB field used to pass information during migration or evacuation.

For [1]: this gets the host info from migration_context; if migration_context is abnormal or empty, the migration fails. With only this patch applied, migration_context is in fact empty, so [2] is also needed. I adjusted the self.client.prepare part in rpcapi.py from the original patch, which was replaced in newer versions; because that change is tied to newer functionality, I kept Mitaka's function call for this issue.

For [2]: this moves the recreate check into the preceding if condition, and calls rebuild_claim to create a migration_context in the recreate case, not only the scheduled case. I adjusted the test code changes that came up during the backport and seemed to be needed. Anyone wanting to backport or cherry-pick code related to this will find it already exists. Applying only one of the two patches did not fix the issue, as testing showed.

[Test case]

Environment: http://pastebin.ubuntu.com/25337153/
The network configuration is important in this case; I tested a different configuration and could not reproduce the issue.
Reproduction test script (based on juju): http://pastebin.ubuntu.com/25360805/

[Regression Potential]

Existing ACTIVE VMs and newly created VMs are not affected by this code, because these commits are only called during migration or evacuation. If a host has ACTIVE VMs as well as VMs that went to ERROR state because of this issue, none of them should be affected by upgrading the package; re-attempting the evacuation of a problematic VM should then take it from ERROR back to ACTIVE. I tested this scenario in a simple environment; the possibility of issues in complex, crowded environments still needs to be considered.

[Others]

In testing, I had to apply two commits, one of them from https://bugs.launchpad.net/nova/+bug/1686041.

Related patches:
[1] https://github.com/openstack/nova/commit/a5b920a197c70d2ae08a1e1335d979857f923b4f
[2] https://github.com/openstack/nova/commit/0f2d87416eff1e96c0fbf0f4b08bf6b6b22246d5 (backported to Newton from the original below)
    https://github.com/openstack/nova/commit/a2b0824aca5cb4a2ae579f625327c51ed0414d35 (original)

[Original description]

I'm working on the nova-powervm driver for Mitaka and trying to add support for evacuation.

The problem I'm hitting is that instance.host is not updated when the compute driver is called to spawn the instance on the destination host; it is still set to the source host. It's not until after the spawn completes that the compute manager updates instance.host to reflect the destination host.

The nova-powervm driver uses the instance events callback mechanism during plug VIF to determine when Neutron has finished provisioning the network. The instance events code sends the event to instance.host, and hence sends it to the source host (which is down). This causes the spawn to fail, and also causes weirdness when the source host gets the events after it is powered back up.

To temporarily work around the problem, I hacked in setting instance.host = CONF.host; instance.save() in the compute driver, but that's not a good solution.
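A minimal verification sketch following the [Impact] narrative (<server-id> and <target-host> are placeholders; run after stopping nova-compute on the instance's current host):

$ nova service-list | grep nova-compute            # wait until the source host shows State 'down'
$ nova evacuate <server-id> <target-host>          # rebuild the instance on the surviving host
$ nova show <server-id> | grep -E 'status|host'    # ERROR before the patches, ACTIVE with both applied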
[Yahoo-eng-team] [Bug 1535918] Re: instance.host not updated on evacuation
** Also affects: nova (Ubuntu Xenial)
   Importance: Undecided
   Status: New

** Changed in: nova (Ubuntu Xenial)
   Assignee: (unassigned) => Seyeong Kim (xtrusia)

https://bugs.launchpad.net/bugs/1535918

Title:
  instance.host not updated on evacuation

Status in Ubuntu Cloud Archive: Fix Released
Status in OpenStack Compute (nova): Fix Released
Status in nova-powervm: Fix Released
Status in nova package in Ubuntu: New
Status in nova source package in Xenial: New
[Yahoo-eng-team] [Bug 1639230] Re: reschedule fails with ip already allocated error
** Also affects: nova (Ubuntu Xenial)
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1639230

Title:
  reschedule fails with ip already allocated error

Status in OpenStack Compute (nova): Fix Released
Status in OpenStack Compute (nova) newton series: Fix Committed
Status in nova package in Ubuntu: Fix Released
Status in nova source package in Xenial: New

Bug description:

Tried to create a server in a multi-host environment. The create failed on the first host that was attempted, due to a ClientException raised by nova.volume.cinder.API.initialize_connection while trying to attach a volume. When the build was rescheduled on a different host, it should have realized that the network was already allocated by the first attempt and reused that, but the network_allocated=True from instance.system_metadata somehow disappeared, leading to the following exception that causes the reschedule to fail:

2016-10-13 04:48:29.007 16273 WARNING nova.network.neutronv2.api [req-9b343ef7-e8d9-4a61-b86c-a61908afe4df 0688b01e6439ca32d698d20789d52169126fb41fb1a4ddafcebb97d854e836c9 94e1baed634145e0aade858973ae88e8 - - -] [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b] Neutron error creating port on network 5038a36b-cb1e-4a61-b26c-a05a80b37ed6
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b] Traceback (most recent call last):
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 392, in _create_port_minimal
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]     port_response = port_client.create_port(port_req_body)
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 98, in wrapper
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]     ret = obj(*args, **kwargs)
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 750, in create_port
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]     return self.post(self.ports_path, body=body)
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 98, in wrapper
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]     ret = obj(*args, **kwargs)
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 365, in post
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]     headers=headers, params=params)
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 98, in wrapper
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]     ret = obj(*args, **kwargs)
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 300, in do_request
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]     self._handle_fault_response(status_code, replybody, resp)
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]   File "/usr/lib/python2.7/site-packages/nova/network/neutronv2/api.py", line 98, in wrapper
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]     ret = obj(*args, **kwargs)
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]   File "/usr/lib/python2.7/site-packages/neutronclient/v2_0/client.py", line 275, in _handle_fault_response
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.api [instance: b85d6c6c-e385-4601-aa47-5c580f893c9b]     exception_handler_v20(status_code, error_body)
2016-10-13 04:48:29.007 16273 ERROR nova.network.neutronv2.a
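One way to observe the vanished marker described above is to query the instance's system metadata directly in the nova database — a sketch under the assumption that the deployment's nova schema of that era exposes an instance_system_metadata table with key/value columns:

$ mysql nova -e 'SELECT `key`, value FROM instance_system_metadata WHERE instance_uuid="b85d6c6c-e385-4601-aa47-5c580f893c9b" AND `key`="network_allocated";'
  # an empty result between the first attempt and the reschedule matches the reported behaviour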
[Yahoo-eng-team] [Bug 1502136] Re: Everything returns 403 if show_multiple_locations is true and get_image_location policy is set
** Also affects: glance (Ubuntu Trusty)
   Importance: Undecided
   Status: New

** Also affects: glance (Ubuntu Xenial)
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1502136

Title:
  Everything returns 403 if show_multiple_locations is true and get_image_location policy is set

Status in Glance: Fix Released
Status in glance package in Ubuntu: Fix Released
Status in glance source package in Trusty: In Progress
Status in glance source package in Xenial: Fix Released

Bug description:

If, in glance-api.conf, you set:

show_multiple_locations = true

things work as expected:

$ glance --os-image-api-version 2 image-show 13ae74f0-74bf-4792-a8bb-7c622abc5410
+------------------+-----------------------------------------------------------------------------+
| Property         | Value                                                                       |
+------------------+-----------------------------------------------------------------------------+
| checksum         | 9cb02fe7fcac26f8a25d6db3109063ae                                            |
| container_format | bare                                                                        |
| created_at       | 2015-10-02T12:43:33Z                                                        |
| disk_format      | raw                                                                         |
| id               | 13ae74f0-74bf-4792-a8bb-7c622abc5410                                        |
| locations        | [{"url": "swift+config://ref1/glance/13ae74f0-74bf-4792-a8bb-7c622abc5410", |
|                  | "metadata": {}}]                                                            |
| min_disk         | 0                                                                           |
| min_ram          | 0                                                                           |
| name             | good-image                                                                  |
| owner            | 88cffb9c8aee457788066c97b359585b                                            |
| protected        | False                                                                       |
| size             | 145                                                                         |
| status           | active                                                                      |
| tags             | []                                                                          |
| updated_at       | 2015-10-02T12:43:34Z                                                        |
| virtual_size     | None                                                                        |
| visibility       | private                                                                     |
+------------------+-----------------------------------------------------------------------------+

but if you then set the get_image_location policy to role:admin, most calls return 403:

$ glance --os-image-api-version 2 image-list
403 Forbidden: You are not authorized to complete this action. (HTTP 403)
$ glance --os-image-api-version 2 image-show 13ae74f0-74bf-4792-a8bb-7c622abc5410
403 Forbidden: You are not authorized to complete this action. (HTTP 403)
$ glance --os-image-api-version 2 image-delete 13ae74f0-74bf-4792-a8bb-7c622abc5410
403 Forbidden: You are not authorized to complete this action. (HTTP 403)

etc.

As https://review.openstack.org/#/c/48401/ says:
1. A user should be able to list/show/update/download an image without needing permission on get_image_location.
2. A policy failure should result in a 403 return code. We're getting a 500.

This is v2 only; v1 works OK.
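The broken combination reproduces by toggling exactly the two settings named in the report and restarting the API — a sketch, with file paths assuming a packaged Ubuntu installation:

$ # in /etc/glance/glance-api.conf:  show_multiple_locations = true
$ # in /etc/glance/policy.json:      "get_image_location": "role:admin"
$ sudo service glance-api restart
$ glance --os-image-api-version 2 image-show 13ae74f0-74bf-4792-a8bb-7c622abc5410
403 Forbidden: You are not authorized to complete this action. (HTTP 403)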
[Yahoo-eng-team] [Bug 1694337] Re: Port information (binding:host_id) not updated for network:router_gateway after qRouter failover
** Also affects: neutron (Ubuntu Xenial)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Yakkety)
   Importance: Undecided
   Status: New

** Also affects: neutron (Ubuntu Zesty)
   Importance: Undecided
   Status: New

** Changed in: neutron (Ubuntu Xenial)
   Assignee: (unassigned) => Felipe Reyes (freyes)

** Changed in: neutron (Ubuntu Yakkety)
   Assignee: (unassigned) => Felipe Reyes (freyes)

** Changed in: neutron (Ubuntu Zesty)
   Assignee: (unassigned) => Felipe Reyes (freyes)

https://bugs.launchpad.net/bugs/1694337

Title:
  Port information (binding:host_id) not updated for network:router_gateway after qRouter failover

Status in neutron: Fix Released
Status in neutron package in Ubuntu: New
Status in neutron source package in Xenial: New
Status in neutron source package in Yakkety: New
Status in neutron source package in Zesty: New

Bug description:

[Impact]

When using L3 HA and a router agent fails over, the port holding the network:router_gateway interface does not get its binding:host_id property updated to reflect where keepalived moved the router.

[Steps to reproduce]

0) Deploy a cloud with l3ha enabled. If familiar with juju, this bundle can be used: http://paste.ubuntu.com/24707730/ (the deployment tool is not relevant).
1) Once it's deployed, configure it and create a router (see https://docs.openstack.org/mitaka/networking-guide/deploy-lb-ha-vrrp.html). This is the script used during the troubleshooting:

--8<--
#!/bin/bash -x
source novarc # admin
neutron net-create ext-net --router:external True --provider:physical_network physnet1 --provider:network_type flat
neutron subnet-create ext-net 10.5.0.0/16 --name ext-subnet --allocation-pool start=10.5.254.100,end=10.5.254.199 --disable-dhcp --gateway 10.5.0.1 --dns-nameserver 10.5.0.3
keystone tenant-create --name demo 2>/dev/null
keystone user-role-add --user admin --tenant demo --role Admin 2>/dev/null
export TENANT_ID_DEMO=$(keystone tenant-list | grep demo | awk -F'|' '{print $2}' | tr -d ' ' 2>/dev/null)
neutron net-create demo-net --tenant-id ${TENANT_ID_DEMO} --provider:network_type vxlan
env OS_TENANT_NAME=demo neutron subnet-create demo-net 192.168.1.0/24 --name demo-subnet --gateway 192.168.1.1
env OS_TENANT_NAME=demo neutron router-create demo-router
env OS_TENANT_NAME=demo neutron router-interface-add demo-router demo-subnet
env OS_TENANT_NAME=demo neutron router-gateway-set demo-router ext-net
# verification
neutron net-list
neutron l3-agent-list-hosting-router demo-router
neutron router-port-list demo-router
--8<--

2) Kill the associated master keepalived process for the router:
   ps aux | grep keepalived | grep $ROUTER_ID
   kill $PID
3) Wait until "neutron l3-agent-list-hosting-router demo-router" shows the other host as active.
4) Check the binding:host_id property for the interfaces of the router:
   for ID in `neutron port-list --device-id $ROUTER_ID | tail -n +4 | head -n -1 | awk -F' ' '{print $2}'`; do neutron port-show $ID; done

Expected result: the interface whose device_owner is network:router_gateway has its binding:host_id property set to the host where the keepalived process is master.

Actual result: binding:host_id is never updated; it stays set to the value obtained when the port was created.
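Since the report confirms that 'neutron port-list --device-id' works, the check in step 4 can be narrowed to just the two fields the bug is about — a sketch reusing the report's own loop:

$ for ID in `neutron port-list --device-id $ROUTER_ID | tail -n +4 | head -n -1 | awk -F' ' '{print $2}'`; do
      neutron port-show $ID | grep -E 'device_owner|binding:host_id'
  done
  # the network:router_gateway port's binding:host_id should (but does not) follow the keepalived master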
[Yahoo-eng-team] [Bug 1649616] Re: Keystone Token Flush job does not complete in HA deployed environment
** Also affects: keystone (Ubuntu Xenial)
   Importance: Undecided
   Status: New

** Also affects: keystone (Ubuntu Yakkety)
   Importance: Undecided
   Status: New

** Also affects: keystone (Ubuntu Zesty)
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1649616

Title:
  Keystone Token Flush job does not complete in HA deployed environment

Status in Ubuntu Cloud Archive: New
Status in OpenStack Identity (keystone): Fix Released
Status in puppet-keystone: Triaged
Status in tripleo: Triaged
Status in keystone package in Ubuntu: New
Status in keystone source package in Xenial: New
Status in keystone source package in Yakkety: New
Status in keystone source package in Zesty: New

Bug description:

The Keystone token flush job can get into a state where it will never complete, because the transaction size exceeds the MySQL Galera transaction size limit, wsrep_max_ws_size (1073741824).

Steps to reproduce:
1. Authenticate many times
2. Observe that the keystone token flush job runs for a very long time, depending on disk (>20 hours in my environment)
3. Observe errors in mysql.log indicating a transaction that is too large

Actual results: expired tokens are not actually flushed from the database, with no errors in keystone.log; errors appear only in mysql.log.

Expected results: expired tokens are removed from the database.

Additional info:

It is likely that you can demonstrate this with fewer than 1 million tokens, as the >1 million row token table is larger than 13 GiB while the max transaction size is 1 GiB; my token bench-marking Browbeat job creates more than needed. Once the token flush job can no longer complete, the token table will never decrease in size and eventually the cloud will run out of disk space; furthermore, the flush job will keep consuming disk utilization resources. This was demonstrated on slow disks (a single 7.2K SATA disk). On faster disks you have more capacity to generate tokens, so you can exceed the transaction size even faster.

Log evidence:

[root@overcloud-controller-0 log]# grep " Total expired" /var/log/keystone/keystone.log
2016-12-08 01:33:40.530 21614 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1082434
2016-12-09 09:31:25.301 14120 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1084241
2016-12-11 01:35:39.082 4223 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1086504
2016-12-12 01:08:16.170 32575 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1087823
2016-12-13 01:22:18.121 28669 INFO keystone.token.persistence.backends.sql [-] Total expired tokens removed: 1089202

[root@overcloud-controller-0 log]# tail mysqld.log
161208  1:33:41 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161208  1:33:41 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161209  9:31:26 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161209  9:31:26 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161211  1:35:39 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161211  1:35:40 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161212  1:08:16 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161212  1:08:17 [ERROR] WSREP: rbr write fail, data_len: 0, 2
161213  1:22:18 [Warning] WSREP: transaction size limit (1073741824) exceeded: 1073774592
161213  1:22:19 [ERROR] WSREP: rbr write fail, data_len: 0, 2

A graph of the disk utilization issue is attached. The entire job in that graph, starting from the first spike in disk util (~05:18 UTC), culminates in about ~90 minutes of pegging the disk (between 01:09 UTC and 02:43 UTC).
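Until a batched delete lands in keystone itself, one common mitigation is to flush often enough that no single run deletes a transaction-breaking number of rows — e.g. an hourly cron entry. A sketch, with paths assuming a packaged Ubuntu install (the Ubuntu keystone package ships a similar hourly cron job):

# /etc/cron.d/keystone-token-flush — hourly flushes keep each DELETE well under wsrep_max_ws_size
0 * * * * keystone /usr/bin/keystone-manage token_flush >> /var/log/keystone/keystone-manage.log 2>&1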