[Yahoo-eng-team] [Bug 2019190] Re: [RBD] Retyping of in-use boot volumes renders instances unusable (possible data corruption)
Since we are using Yoga and hitting this issue, I had a go at reverting the patch there too and can confirm that it does resolve the problem.

** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/antelope
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/caracal
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/yoga
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/zed
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/bobcat
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/2019190

Title:
  [RBD] Retyping of in-use boot volumes renders instances unusable
  (possible data corruption)

Status in Cinder: New
Status in Cinder wallaby series: New
Status in Ubuntu Cloud Archive: New
Status in Ubuntu Cloud Archive antelope series: New
Status in Ubuntu Cloud Archive bobcat series: New
Status in Ubuntu Cloud Archive caracal series: New
Status in Ubuntu Cloud Archive yoga series: New
Status in Ubuntu Cloud Archive zed series: New
Status in OpenStack Compute (nova): Invalid

Bug description:

While trying out the volume retype feature in cinder, we noticed that after an instance is rebooted it will not come back online: it is either stuck in an error state or, if it does come back online, its filesystem is corrupted.

## Observations

Say there are two volume types, `fast` (stored in ceph pool `volumes`) and `slow` (stored in ceph pool `volumes.hdd`). Before the retype we can see that the volume is present in the `volumes.hdd` pool and has a watcher accessing it.

```sh
[ceph: root@mon0 /]# rbd ls volumes.hdd
volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9

[ceph: root@mon0 /]# rbd status volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
Watchers:
        watcher=[2001:XX:XX:XX::10ad]:0/3914407456 client.365192 cookie=140370268803456
```

Starting the retype process with the migration policy `on-demand` for that volume, either via the horizon dashboard or the CLI, causes the volume to be correctly transferred to the `volumes` pool within the ceph cluster. However, the watcher does not get transferred, so nobody is accessing the volume after it has been transferred.

```sh
[ceph: root@mon0 /]# rbd ls volumes
volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9

[ceph: root@mon0 /]# rbd status volumes/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
Watchers: none
```

Taking a look at the libvirt XML of the instance in question, one can see that the `rbd` volume path does not change after the retype is completed. Therefore, if the instance is restarted, nova will not be able to find its volume, preventing an instance start.

Pre retype:

```xml
[...]
<source protocol='rbd' name='volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9'>
[...]
```

Post retype (no change):

```xml
[...]
<source protocol='rbd' name='volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9'>
[...]
```

### Possible cause

While looking through the code that is responsible for the volume retype, we found a function `swap_volume` which, by our understanding, should be responsible for fixing the association above. As we understand it, cinder should use an internal API path to let nova perform this action. This doesn't seem to happen.

(`_swap_volume`: https://github.com/openstack/nova/blob/stable/wallaby/nova/compute/manager.py#L7218)

## Further observations

If one tries to regenerate the libvirt XML by e.g. live migrating the instance and rebooting the instance after, the filesystem gets corrupted.
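The mismatch described above can be cross-checked from the hypervisor. The following is a minimal sketch, assuming `rbd` and `virsh` are available on the host; the pool names and volume ID are taken from this report, and the libvirt domain name is a placeholder.

```python
# Hedged sketch: after a retype, compare the pool that actually contains the
# RBD image with the pool the running libvirt domain still references.
import subprocess
import xml.etree.ElementTree as ET

VOLUME = "volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9"
POOLS = ("volumes", "volumes.hdd")
DOMAIN = "instance-00000001"  # hypothetical libvirt domain name

def ceph_pool_of(volume):
    """Return the pool that actually contains the RBD image."""
    for pool in POOLS:
        out = subprocess.run(["rbd", "ls", pool], capture_output=True,
                             text=True, check=True).stdout
        if volume in out.split():
            return pool
    return None

def domain_rbd_sources(domain):
    """Return the 'pool/image' names referenced by the domain's disks."""
    xml = subprocess.run(["virsh", "dumpxml", domain], capture_output=True,
                         text=True, check=True).stdout
    root = ET.fromstring(xml)
    return [src.get("name") for src in root.iter("source")
            if src.get("protocol") == "rbd"]

print(f"ceph has {VOLUME} in pool: {ceph_pool_of(VOLUME)}")
print(f"domain XML references: {domain_rbd_sources(DOMAIN)}")
# A mismatch (e.g. ceph says 'volumes' while the XML still says
# 'volumes.hdd/...') reproduces the symptom described above.
```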
## Environmental Information and possibly related reports

We are running the latest version of TripleO Wallaby using the hardened (whole disk) overcloud image for the nodes.

Cinder Volume Version: `openstack-cinder-18.2.2-0.20230219112414.f9941d2.el8.noarch`

### Possibly related

- https://bugzilla.redhat.com/show_bug.cgi?id=1293440

(might want to paste the above to a markdown file for better readability)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cinder/+bug/2019190/+subscriptions
[Yahoo-eng-team] [Bug 2048785] Re: Trunk parent port (tpt port) vlan_mode is wrong in ovs
Thanks for flagging the potential security impact of this. Can someone provide a succinct exploit scenario for how an attacker might cause this to occur and then take advantage of it? Or is it merely one of those situations where someone could take advantage of the issue if they happen to find an environment where the necessary conditions were already met?

** Also affects: ossa
   Importance: Undecided
   Status: New

** Changed in: ossa
   Status: New => Incomplete

https://bugs.launchpad.net/bugs/2048785

Title:
  Trunk parent port (tpt port) vlan_mode is wrong in ovs

Status in neutron: In Progress
Status in OpenStack Security Advisory: Incomplete

Bug description:

... therefore a forwarding loop, packet duplication, packet loss and double tagging are possible.

Today a trunk bridge with one parent and one subport looks like this:

```
# ovs-vsctl show
...
    Bridge tbr-b2781877-3
        datapath_type: system
        Port spt-28c9689e-9e
            tag: 101
            Interface spt-28c9689e-9e
                type: patch
                options: {peer=spi-28c9689e-9e}
        Port tap3709f1a1-a5
            Interface tap3709f1a1-a5
        Port tpt-3709f1a1-a5
            Interface tpt-3709f1a1-a5
                type: patch
                options: {peer=tpi-3709f1a1-a5}
        Port tbr-b2781877-3
            Interface tbr-b2781877-3
                type: internal
...

# ovs-vsctl find Port name=tpt-3709f1a1-a5 | egrep 'tag|vlan_mode|trunks'
tag                 : []
trunks              : []
vlan_mode           : []

# ovs-vsctl find Port name=spt-28c9689e-9e | egrep 'tag|vlan_mode|trunks'
tag                 : 101
trunks              : []
vlan_mode           : []
```

I believe the vlan_mode of the tpt port is wrong (at least when the port is not "vlan_transparent") and it should have the value "access". Even when the port is "vlan_transparent", forwarding loops between br-int and a trunk bridge should be prevented.

According to http://www.openvswitch.org/support/dist-docs/ovs-vswitchd.conf.db.5.txt:

"""
vlan_mode: optional string, one of access, dot1q-tunnel, native-tagged, native-untagged, or trunk

The VLAN mode of the port, as described above. When this column is empty, a default mode is selected as follows:

• If tag contains a value, the port is an access port. The trunks column should be empty.
• Otherwise, the port is a trunk port. The trunks column value is honored if it is present.
"""

"""
trunks: set of up to 4,096 integers, in range 0 to 4,095

For a trunk, native-tagged, or native-untagged port, the 802.1Q VLAN or VLANs that this port trunks; if it is empty, then the port trunks all VLANs. Must be empty if this is an access port. A native-tagged or native-untagged port always trunks its native VLAN, regardless of whether trunks includes that VLAN.
"""

The above combination of tag, trunks and vlan_mode for the tpt port means that it is in trunk mode (in the ovs sense) and it forwards both untagged and tagged frames with any vlan tag. But the tpt port should only forward untagged frames.

Feel free to treat this as the end of the bug report. Below I'll add more about how we found this bug, under what conditions it can be triggered, and what consequences it may have. However, please keep in mind that I don't have a full upstream reproduction at the moment, nor a full analysis of every suspicion mentioned below.

I'm aware of a full reproduction of this bug only in a downstream environment, which looked like below. While the following was sufficient to reproduce the problem, this was likely far from a minimal reproduction and some/many of the below steps are unnecessary.
* [securitygroup].firewall_driver = openvswitch ((edited, originally was: noop))
* [agent].explicitly_egress_direct = True ((edited, originally was: [ovs].explicitly_egress_direct = True))
* 2 VMs started on the same compute.
* Both having a trunk port with one parent and one subport.
* The parent and the subport of each trunk have the same MAC address.
* All ports are on vlan networks belonging to the same physnet.
* All ports are created with --disable-port-security and --no-security-group.
* The subport segmentation IDs and the corresponding vlan network segmentation IDs were the same (as if they used "inherit").
* Traffic was generated from a 3rd VM on a different compute, addressed to one of the VMs' subport IPs, for which the destination MAC was not yet learned by either br-int or the two trunk bridges on the host.

I believe
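For a quick audit of the condition the report describes, here is a minimal sketch, assuming `ovs-vsctl` is available on the host and that trunk parent ports carry the `tpt-` prefix shown in the bridge listing above.

```python
# Hedged sketch: list trunk-bridge parent ports whose OVS Port row leaves
# vlan_mode and tag empty, i.e. ports that default to trunk mode and forward
# both untagged and tagged frames, as described in the report.
import json
import subprocess

def port_rows():
    out = subprocess.run(
        ["ovs-vsctl", "--format=json", "--columns=name,tag,vlan_mode",
         "list", "Port"],
        capture_output=True, text=True, check=True).stdout
    data = json.loads(out)
    return [dict(zip(data["headings"], row)) for row in data["data"]]

def is_empty(value):
    # OVSDB's JSON output encodes an empty optional column as ["set", []]
    return isinstance(value, list) and value[0] == "set" and not value[1]

for row in port_rows():
    if row["name"].startswith("tpt-") and is_empty(row["vlan_mode"]):
        print(f"{row['name']}: vlan_mode unset -> implicit trunk mode, "
              f"forwards untagged and any tagged frames")
```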
[Yahoo-eng-team] [Bug 2025813] Fix included in openstack/nova 26.2.1
This issue was fixed in the openstack/nova 26.2.1 release.

** Changed in: nova/zed
   Status: Triaged => Fix Released

https://bugs.launchpad.net/bugs/2025813

Title:
  test_rebuild_volume_backed_server failing 100% on nova-lvm job

Status in devstack-plugin-ceph: New
Status in OpenStack Compute (nova): Fix Released
Status in OpenStack Compute (nova) antelope series: Fix Released
Status in OpenStack Compute (nova) yoga series: Fix Released
Status in OpenStack Compute (nova) zed series: Fix Released

Bug description:

After the tempest patch [1] was merged, the nova-lvm job started to fail with the following error in test_rebuild_volume_backed_server:

```
Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 70, in wrapper
    return f(*func_args, **func_kwargs)
  File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 868, in test_rebuild_volume_backed_server
    self.get_server_ip(server, validation_resources),
  File "/opt/stack/tempest/tempest/api/compute/base.py", line 519, in get_server_ip
    return compute.get_server_ip(
  File "/opt/stack/tempest/tempest/common/compute.py", line 76, in get_server_ip
    raise lib_exc.InvalidParam(invalid_param=msg)
tempest.lib.exceptions.InvalidParam: Invalid Parameter passed: When validation.connect_method equals floating, validation_resources cannot be None
```

As discussed on IRC with Sean [2], SSH validation is now mandatory, but it is disabled in the job config [3].

[1] https://review.opendev.org/c/openstack/tempest/+/831018
[2] https://meetings.opendev.org/irclogs/%23openstack-nova/%23openstack-nova.2023-07-04.log.html#t2023-07-04T15:33:38
[3] https://opendev.org/openstack/nova/src/commit/4b454febf73cdd7b5be0a2dad272c1d7685fac9e/.zuul.yaml#L266-L267

To manage notifications about this bug go to:
https://bugs.launchpad.net/devstack-plugin-ceph/+bug/2025813/+subscriptions
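The failing precondition is visible in the traceback. As a rough illustration (simplified from the error message above; not tempest's actual code), the guard behaves like this:

```python
# Hedged sketch of the guard that produces the error above: with
# connect_method 'floating', server IP lookup needs the validation
# resources (the floating IP) that the job config never created.
def get_server_ip(server, validation_resources, connect_method="floating"):
    if connect_method == "floating":
        if validation_resources is None:
            raise ValueError(
                "When validation.connect_method equals floating, "
                "validation_resources cannot be None")
        return validation_resources["floating_ip"]["ip"]
    # other connect methods read a fixed address off the server itself
    return server["addresses"]

# With SSH validation disabled, the test passes validation_resources=None:
try:
    get_server_ip({"addresses": {}}, None)
except ValueError as exc:
    print(exc)
```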
[Yahoo-eng-team] [Bug 2027755] Fix included in openstack/nova 26.2.1
This issue was fixed in the openstack/nova 26.2.1 release.

** Changed in: nova/zed
   Status: Fix Committed => Fix Released

https://bugs.launchpad.net/bugs/2027755

Title:
  "sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 50
  reached, connection timed out, timeout 30.00" error raised after
  repeated calls of Flavor.get_* methods

Status in OpenStack Compute (nova): Fix Released
Status in OpenStack Compute (nova) antelope series: Fix Released
Status in OpenStack Compute (nova) wallaby series: In Progress
Status in OpenStack Compute (nova) xena series: In Progress
Status in OpenStack Compute (nova) yoga series: In Progress
Status in OpenStack Compute (nova) zed series: Fix Released

Bug description:

This bug was reported downstream by a user who is using some automation to periodically query the nova API for flavor information; occasionally they receive an HTTP 500 error from nova API with a message related to database connection pools.

The error:

  sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 50 reached, connection timed out, timeout 30.00
  (Background on this error at: https://sqlalche.me/e/14/3o7r)

is being raised in nova-api, causing a 500 error to be returned.

I think this is happening because of the placement of the @api_db_api.context_manager.reader decorator (which manages the database session) on a helper method instead of on the methods that actually execute the database queries. I think it's resulting in connections not being closed and eventually reaching the database connection pool size limits.

The database context manager decorator needs to be on the methods that execute the queries, because part of what it does is close connections after the method is run.
Full traceback:

```
Jul 13 22:06:48 ubuntu-jammy devstack@n-api.service[270259]: DEBUG nova.api.openstack.wsgi [None req-b47b7ad6-eecd-44a9-8264-706742dd8539 demo demo] Calling method '>' {{(pid=270259) _process_stack /opt/stack/nova/nova/api/openstack/wsgi.py:513}}
Jul 13 22:06:58 ubuntu-jammy devstack@n-api.service[270259]: DEBUG dbcounter [-] [270259] Writing DB stats nova_api:SELECT=2 {{(pid=270259) stat_writer /usr/local/lib/python3.10/dist-packages/dbcounter.py:117}}
Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi [None req-b47b7ad6-eecd-44a9-8264-706742dd8539 demo demo] Unexpected exception in API method: sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 50 reached, connection timed out, timeout 30.00 (Background on this error at: https://sqlalche.me/e/14/3o7r)
Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi Traceback (most recent call last):
Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi   File "/opt/stack/nova/nova/api/openstack/wsgi.py", line 658, in wrapped
Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi     return f(*args, **kwargs)
Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi   File "/opt/stack/nova/nova/api/openstack/compute/flavors_extraspecs.py", line 64, in index
Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi     return self._get_extra_specs(context, flavor_id)
Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi   File "/opt/stack/nova/nova/api/openstack/compute/flavors_extraspecs.py", line 34, in _get_extra_specs
Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi     flavor = common.get_flavor(context, flavor_id)
Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi   File "/opt/stack/nova/nova/api/openstack/common.py", line 494, in get_flavor
Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi     return objects.Flavor.get_by_flavor_id(context, flavor_id)
Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi   File "/usr/local/lib/python3.10/dist-packages/oslo_versionedobjects/base.py", line 184, in wrapper
Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi     result = fn(cls, context, *args, **kwargs)
Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi   File "/opt/stack/nova/nova/objects/flavor.py", line 395, in get_by_flavor_id
Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi     db_flavor = cls._flavor_get_by_flavor_id_from_db(context,
Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi   File
```
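Below is a minimal, self-contained sketch of the decorator-placement pattern described above, using oslo.db's enginefacade directly (nova's api_db_api.context_manager is such a transaction context). The model and names are illustrative, not nova's actual code.

```python
# Hedged sketch of the bug pattern: the reader decorator opens a session and
# closes it (returning the connection to the pool) when the decorated
# function exits. If it wraps only a helper that *builds* a query, the query
# executes outside the managed scope and connections can leak.
from oslo_db.sqlalchemy import enginefacade
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Flavor(Base):
    __tablename__ = "flavors"
    id = Column(Integer, primary_key=True)
    flavorid = Column(String(255))

context_manager = enginefacade.transaction_context()
context_manager.configure(connection="sqlite://")  # illustrative backend

@enginefacade.transaction_context_provider
class Context:
    """Stand-in for nova's RequestContext."""

# Broken shape: the session is torn down when _flavor_query returns, but the
# returned query object is executed later, outside that managed scope.
@context_manager.reader
def _flavor_query(context):
    return context.session.query(Flavor)

def get_by_flavor_id_broken(context, flavor_id):
    return _flavor_query(context).filter_by(flavorid=flavor_id).first()

# Fixed shape: decorate the method that actually executes the query, so
# session teardown brackets the real database work.
@context_manager.reader
def get_by_flavor_id_fixed(context, flavor_id):
    return context.session.query(Flavor).filter_by(
        flavorid=flavor_id).first()
```

Repeated calls through the broken shape can pin connections until the pool reports exactly the QueuePool timeout shown above; the fixed shape releases each connection when the method returns.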
[Yahoo-eng-team] [Bug 2043036] Re: [ironic] list_instances/list_instance_uuid does not respect conductor_group/partition_key
Reviewed: https://review.opendev.org/c/openstack/nova/+/900831
Committed: https://opendev.org/openstack/nova/commit/fa3cf7d50cba921ea67eb161e6a199067ea62deb
Submitter: "Zuul (22348)"
Branch: master

commit fa3cf7d50cba921ea67eb161e6a199067ea62deb
Author: Jay Faulkner
Date: Mon Nov 13 15:21:31 2023 -0800

    [ironic] Partition & use cache for list_instance*

    list_instances and list_instance_uuids, as written in the Ironic
    driver, do not currently respect conductor_group partitioning. Given
    a nova compute is intended to limit its scope of work to the
    conductor group it is configured to work with, this is a bug.

    Additionally, this should be a significant performance boost for a
    couple of reasons: firstly, instead of calling the Ironic API and
    getting all nodes rather than just the subset (when using a
    conductor group), we're now properly getting the subset of nodes --
    this is the optimized path in the Ironic DB and API code. Secondly,
    we're now using the driver's node cache to respond to these
    requests. Since list_instances and list_instance_uuids are used by
    periodic tasks, operating on data that may be slightly stale should
    have minimal impact compared to the performance benefits.

    Closes-bug: #2043036
    Change-Id: If31158e3269e5e06848c29294fdaa147beedb5a5

** Changed in: nova
   Status: In Progress => Fix Released

https://bugs.launchpad.net/bugs/2043036

Title:
  [ironic] list_instances/list_instance_uuid does not respect
  conductor_group/partition_key

Status in Ironic: Triaged
Status in OpenStack Compute (nova): Fix Released

Bug description:

The methods on the Ironic driver, list_instances and list_instance_uuids, are not currently respecting the conductor_group option: https://opendev.org/openstack/nova/src/branch/master/nova/conf/ironic.py#L71

This leads to significant performance degradation, as querying Ironic for all nodes (/v1/nodes) instead of all nodes managed by the compute (/v1/nodes?conductor_group=blah) is a significantly more expensive API call. In addition, this can lead to unexpected behavior for operators, such as an action being taken by a compute serving conductor group "A" to resolve an issue that would normally be resolved by a compute service serving conductor group "B".

While troubleshooting this error, we dug deeply into what this data is used for; it's used for two things:

- Reconciling deleted instances as a periodic job
- Ensuring no instances exist on a newly-started compute host

These are tasks which either could use stale data or would not be impacted by using the Ironic driver's existing node cache. Therefore, a suggested fix is: revise list_instances and list_instance_uuids to reuse the node cache to reduce the overall API calls being made to Ironic, and ensure all /v1/nodes calls use the same codepath in the Ironic driver.

It's the belief of JayF, TheJulia, and Johnthetubaguy (on a video call right now) that using stale data, without refreshing the cache, should be safe for these use cases. (Even if we decide to refresh the cache, we should use this code path anyway.)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ironic/+bug/2043036/+subscriptions
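As a rough illustration of the suggested fix (simplified; not the actual nova driver code), answering instance listings from the conductor-group-scoped node cache looks like this:

```python
# Hedged sketch: the driver's node cache is refreshed periodically and is
# already limited to this compute's conductor_group, so instance listings
# can be served from it instead of a full /v1/nodes API scan.
class IronicDriverSketch:
    def __init__(self, node_cache):
        # node_cache: {node_uuid: node_dict}, populated by the existing
        # cache-refresh path with ?conductor_group=<group> applied
        self.node_cache = node_cache

    def list_instance_uuids(self):
        return [node["instance_uuid"] for node in self.node_cache.values()
                if node.get("instance_uuid")]

    def list_instances(self):
        # names also come from cached node data (possibly slightly stale,
        # which the periodic-task callers tolerate)
        return [node["instance_name"] for node in self.node_cache.values()
                if node.get("instance_uuid")]

cache = {
    "node-1": {"instance_uuid": "uuid-a", "instance_name": "vm-a"},
    "node-2": {"instance_uuid": None, "instance_name": None},
}
print(IronicDriverSketch(cache).list_instance_uuids())  # ['uuid-a']
```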
[Yahoo-eng-team] [Bug 2049899] [NEW] disk remaining in logs during live migration says 100 when no disk is migrated
Public bug reported:

When doing live migrations for boot-from-volume (BFV) instances, the "disk remaining" figure in the nova log says 100 even if there is no disk to migrate.

** Affects: nova
   Importance: Undecided
   Status: In Progress

https://bugs.launchpad.net/bugs/2049899

Title:
  disk remaining in logs during live migration says 100 when no disk is
  migrated

Status in OpenStack Compute (nova): In Progress

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2049899/+subscriptions
[Yahoo-eng-team] [Bug 2049903] [NEW] nova-compute starts even if resource provider creation fails with conflict
Public bug reported:

If an operator has a compute node and reinstalls it but forgets to run "openstack compute service delete <service>" first (which would wipe the nova-compute service record and the resource provider in placement), the reinstalled compute node with the same hostname happily reports its state as up, even though the resource provider creation that nova-compute attempted failed due to a conflict with the existing resource provider.

To do operators a big favor, we should make nova-compute startup fail if the resource provider creation failed (such as when there is a conflict).

** Affects: nova
   Importance: Undecided
   Status: In Progress

https://bugs.launchpad.net/bugs/2049903

Title:
  nova-compute starts even if resource provider creation fails with
  conflict

Status in OpenStack Compute (nova): In Progress

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2049903/+subscriptions
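A minimal sketch of the proposed behaviour (illustrative; the function names and exception are placeholders, not nova's actual code):

```python
# Hedged sketch: treat a placement resource-provider conflict at startup as
# fatal instead of letting the compute service report itself as up.
class ResourceProviderCreationFailed(Exception):
    """Placeholder for 'provider name exists with a different UUID'."""

def ensure_resource_provider(hostname, existing_providers):
    # stand-in for the placement report-client call; a reinstall changes
    # the compute node UUID while the provider name (hostname) stays the same
    if hostname in existing_providers:
        raise ResourceProviderCreationFailed(hostname)

def start_compute(hostname, existing_providers):
    try:
        ensure_resource_provider(hostname, existing_providers)
    except ResourceProviderCreationFailed:
        # proposed: abort startup so operators notice the stale record
        raise SystemExit(
            f"resource provider conflict for {hostname}; "
            "run 'openstack compute service delete' for the old record")
    print(f"{hostname}: compute service up")

start_compute("compute-1", existing_providers={"compute-1"})
```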
[Yahoo-eng-team] [Bug 2049909] [NEW] MLDv2 packets sent from L3 Agent managed networks cause backup routers to be preferred
Public bug reported:

Neutron 2023.1 29cc1a634e530972614c09fbb212b5f63fd4c374
Ubuntu 20.04

This issue has been identified in a Neutron system running Linux Bridge networking, but whilst this may no longer be supported, I'm posting it in case the same issue might be relevant for other drivers.

When running multiple network nodes, the tenant networks in the namespaces on each node share MAC addresses. We have noted that, particularly when rebooting a network node, traffic from the Internet to tenant networks can be disrupted when a node which was acting as the backup for a given tenant network comes back online.

We have traced this to Linux sending out MLDv2 responses to the upstream switches when the tenant network processes (keepalived etc.) start up. As a result, the upstream switches update their MAC tables to prefer that host despite it not being the primary. If there is minimal tenant traffic (such as when running a web server), this network will be inaccessible from the outside until a request is made from the inside to the outside and the switches re-update their MAC tables to reflect the correct state.

There is already handling to prevent some IPv6 packets being sent out in these cases here: https://opendev.org/openstack/neutron/src/branch/master/neutron/agent/l3/router_info.py#L808, and there is theoretically something explicitly referencing issues with MLDv2 in the same area: https://opendev.org/openstack/neutron/src/branch/master/neutron/agent/l3/router_info.py#L813. Unfortunately these don't appear to be sufficient.

There is no sysctl mechanism to prevent these gratuitous MLDv2 responses as far as I can tell, so we are working around this by using iptables rules inserted by the L3 agent into tenant networks. There may well be a better solution, but I will link our workaround to this bug report shortly.

** Affects: neutron
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/2049909

Title:
  MLDv2 packets sent from L3 Agent managed networks cause backup
  routers to be preferred

Status in neutron: New

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2049909/+subscriptions
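The workaround described (not yet linked in the report) would presumably take a shape like the following sketch, assuming the rule is applied inside each router namespace; the namespace name is a placeholder.

```python
# Hedged sketch of the class of workaround described above: drop outbound
# MLDv2 reports (ICMPv6 type 143) from a router namespace so a freshly
# booted backup node cannot attract upstream MAC learning.
import subprocess

def block_mldv2_reports(router_namespace):
    subprocess.run(
        ["ip", "netns", "exec", router_namespace,
         "ip6tables", "-A", "OUTPUT",
         "-p", "icmpv6", "--icmpv6-type", "143", "-j", "DROP"],
        check=True)

# block_mldv2_reports("qrouter-<router-uuid>")  # hypothetical namespace name
```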
[Yahoo-eng-team] [Bug 1943934] Re: report extra gpu device when config one enabled_vgpu_types
Fixed by https://review.opendev.org/c/openstack/nova/+/899406/2

** Changed in: nova
   Status: Triaged => Fix Released

https://bugs.launchpad.net/bugs/1943934

Title:
  report extra gpu device when config one enabled_vgpu_types

Status in OpenStack Compute (nova): Fix Released

Bug description:

If there are two GPU devices virtualized on the environment, and only one of them is configured via enabled_vgpu_types and device_addresses, nova will still report both GPU devices to Placement. We should only report the devices matching the configured device_addresses to Placement.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1943934/+subscriptions
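In other words (a minimal sketch with hypothetical type names and PCI addresses, not nova's actual code), the discovered devices should be filtered against the configured addresses before being reported:

```python
# Hedged sketch: only report GPU devices whose address is listed in
# device_addresses for an enabled vGPU type.
enabled_vgpu_types = ["nvidia-35"]                    # hypothetical type
device_addresses = {"nvidia-35": {"0000:84:00.0"}}    # hypothetical address

def reportable_devices(discovered):
    allowed = set()
    for vgpu_type in enabled_vgpu_types:
        allowed |= device_addresses.get(vgpu_type, set())
    return [dev for dev in discovered if dev["address"] in allowed]

discovered = [{"address": "0000:84:00.0"}, {"address": "0000:85:00.0"}]
print(reportable_devices(discovered))  # only the configured device remains
```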
[Yahoo-eng-team] [Bug 2027975] Re: Add check on network interface name's length
** Changed in: cloud-init
   Status: New => Fix Released

https://bugs.launchpad.net/bugs/2027975

Title:
  Add check on network interface name's length

Status in cloud-init: Fix Released
Status in MAAS: Triaged

Bug description:

When adding a VLAN on long network interface names, cloud-init will silently fail to add the sub-interfaces.

The deployed environment is:

* MAAS 3.3.4
* Server Dell R630 + Mellanox ConnectX-5
* Name of interfaces: enp130s0f0np0 / enp130s0f1np1
* Add a VLAN like 100-4093

Cloud-init will not display any error message, but the VLAN interfaces will not be added after the deployment. If we try to perform the operation manually, we are greeted with the following error message:

```
ubuntu@R630:~$ sudo ip link add link enp130s0f0np0 name enp130s0f0np0.103 type vlan id 103
Error: argument "enp130s0f0np0.103" is wrong: "name" not a valid ifname
```

From the iproute2 and kernel perspective, it is not possible to have interfaces with a name longer than 15 characters in total. A quick workaround is simply to rename the network interface to something shorter. Having a quick warning from MAAS would be nice, to help understand the origin of the issue.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/2027975/+subscriptions
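A minimal sketch of the requested check follows; the interface names mirror the report's example, and the validation is shown standalone rather than as cloud-init's actual implementation.

```python
# Hedged sketch: Linux limits interface names to IFNAMSIZ - 1 = 15 bytes,
# so parent name + '.' + VLAN id must fit within that budget.
IFNAMSIZ = 16  # kernel constant, includes the trailing NUL byte

def vlan_ifname_ok(parent: str, vlan_id: int) -> bool:
    return len(f"{parent}.{vlan_id}".encode()) <= IFNAMSIZ - 1

print(vlan_ifname_ok("enp130s0f0np0", 103))  # False: 17 bytes, rejected
print(vlan_ifname_ok("ens3", 103))           # True: 8 bytes
```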
[Yahoo-eng-team] [Bug 2049852] [NEW] VM reboot issues
Public bug reported:

Problem statement: Creating a VM from an image works fine, but rebooting that instance from OpenStack puts it into an Error state.

Steps to re-create:

1. Create a VM from OpenStack with an image as the source. The VM is created successfully.
2. Now issue an openstack server reboot.
3. The soft reboot fails after some time and a hard reboot is attempted, but it also fails and moves the instance to an Error state.

Observations:

1. A weird thing I noticed is that when the instance is being created, the nova log shows this (note the volume ID):

   2024-01-19 06:45:06.536 7 INFO nova.virt.block_device [req-da372940-2784-4aba-881f-7636460efb46 req-bffdffd6-7521-4450-9ab9-93bac32789e1 a828b4ad4d794e18ac9c6238e893522d 1f4d24639d564e40816d90be4cac8ecd - - default default] [instance: 330922ac-8333-4ee7-a634-0075a00f1fd7] Booting with volume-backed-image 0e1aa7ba-23cc-4ced-8aa2-c9f78535d435 at /dev/vda

2. The created volume (c22d487d-ab7a-4373-b69f-91cd02f32cc8) in openstack does not match this ID at all.
3. openstack server show <VM> shows the correct volume ID as in 2.
4. During a reboot issued through openstack, multipath just fails and the nova logs are stuck at cleaning stale cinder volumes.
5. A reboot command from inside the VM works fine (sudo reboot).
6. When I use a volume as the source to create the instance, I have none of these problems. The nova log shows the correct volume ID during the creation phase and reboot works great with no issues.

I can consistently reproduce this error.

Host OS - Ubuntu 22.04
Deployment - Kolla Ansible with Docker containers
Storage - NetApp with iSCSI
Hypervisor - KVM
Nova Version - 18.4.0

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/2049852

Title:
  VM reboot issues

Status in OpenStack Compute (nova): New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2049852/+subscriptions