[Yahoo-eng-team] [Bug 2019190] Re: [RBD] Retyping of in-use boot volumes renders instances unusable (possible data corruption)

2024-01-19 Thread Edward Hope-Morley
Since we are using Yoga and hitting this issue I had a go at reverting
the patch there too and can confirm that it does resolve the problem.

** Also affects: cloud-archive
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/antelope
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/caracal
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/yoga
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/zed
   Importance: Undecided
   Status: New

** Also affects: cloud-archive/bobcat
   Importance: Undecided
   Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2019190

Title:
  [RBD] Retyping of in-use boot volumes renders instances unusable
  (possible data corruption)

Status in Cinder:
  New
Status in Cinder wallaby series:
  New
Status in Ubuntu Cloud Archive:
  New
Status in Ubuntu Cloud Archive antelope series:
  New
Status in Ubuntu Cloud Archive bobcat series:
  New
Status in Ubuntu Cloud Archive caracal series:
  New
Status in Ubuntu Cloud Archive yoga series:
  New
Status in Ubuntu Cloud Archive zed series:
  New
Status in OpenStack Compute (nova):
  Invalid

Bug description:
  While trying out the volume retype feature in cinder, we noticed that after an
  instance is rebooted it either does not come back online and is stuck in an
  error state, or it comes back online with a corrupted filesystem.

  ## Observations

  Say there are two volume types `fast` (stored in ceph pool `volumes`) and
  `slow` (stored in ceph pool `volumes.hdd`). Before the retyping we can see,
  for example, that the volume is present in the `volumes.hdd` pool and has a
  watcher accessing it.

  ```sh
  [ceph: root@mon0 /]# rbd ls volumes.hdd
  volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9

  [ceph: root@mon0 /]# rbd status volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
  Watchers:
  watcher=[2001:XX:XX:XX::10ad]:0/3914407456 client.365192 cookie=140370268803456
  ```

  Starting the retyping process using the migration policy `on-demand` for that
  volume either via the horizon dashboard or the CLI causes the volume to be
  correctly transferred to the `volumes` pool within the ceph cluster. However,
  the watcher does not get transferred, so nobody is accessing the volume after
  it has been transferred.
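
  For reference, the retype can be triggered from the CLI roughly as follows
  (volume ID and type names follow the example above; the exact client syntax
  may vary by release):

  ```sh
  # python-cinderclient syntax
  cinder retype --migration-policy on-demand 81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9 fast

  # openstackclient syntax
  openstack volume set --type fast --retype-policy on-demand 81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
  ```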

  ```sh
  [ceph: root@mon0 /]# rbd ls volumes
  volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9

  [ceph: root@mon0 /]# rbd status volumes/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9
  Watchers: none
  ```

  Taking a look at the libvirt XML of the instance in question, one can see that
  the `rbd` volume path does not change after the retyping is completed.
  Therefore, if the instance is restarted, nova will not be able to find its
  volume, preventing the instance from starting.

   Pre retype

  ```xml
  [...]
  
  
  
  
  
  [...]
  ```

   Post retype (no change)

  ```xml
  [...]
  
  
  
  
  
  [...]
  ```
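
  (The disk XML above was stripped by the mail archive; below is a hypothetical
  illustration of the element in question, with attribute values that are
  typical rather than taken from the affected instance. The point is that the
  pool in the source `name` still says `volumes.hdd` after the retype.)

  ```xml
  <disk type='network' device='disk'>
    <driver name='qemu' type='raw' cache='writeback'/>
    <!-- illustrative only: after the retype this should reference the
         "volumes" pool, but it still points at "volumes.hdd" -->
    <source protocol='rbd' name='volumes.hdd/volume-81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9'>
      <host name='192.0.2.10' port='6789'/>  <!-- placeholder monitor address -->
    </source>
    <target dev='vda' bus='virtio'/>
    <serial>81cfbafc-4fbb-41b0-abcb-8ec7359d0bf9</serial>
  </disk>
  ```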

  ### Possible cause

  While looking through the code that is responsible for the volume retype, we
  found a function `swap_volume`, which by our understanding should be
  responsible for fixing the association above. As we understand it, cinder
  should use an internal API path to let nova perform this action, but this
  doesn't seem to happen.

  (`_swap_volume`:
  https://github.com/openstack/nova/blob/stable/wallaby/nova/compute/manager.py#L7218)

  ## Further observations

  If one tries to regenerate the libvirt XML by e.g. live migrating the instance
  and then rebooting it, the filesystem gets corrupted.

  ## Environmental Information and possibly related reports

  We are running the latest version of TripleO Wallaby using the hardened
  (whole disk) overcloud image for the nodes.

  Cinder Volume Version:
  `openstack-cinder-18.2.2-0.20230219112414.f9941d2.el8.noarch`

  ### Possibly related

  - https://bugzilla.redhat.com/show_bug.cgi?id=1293440

  
  (might want to paste the above to a markdown file for better readability)

To manage notifications about this bug go to:
https://bugs.launchpad.net/cinder/+bug/2019190/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 2048785] Re: Trunk parent port (tpt port) vlan_mode is wrong in ovs

2024-01-19 Thread Jeremy Stanley
Thanks for flagging the potential security impact of this. Can someone
provide a succinct exploit scenario for how an attacker might cause this
to occur and then take advantage of it? Or is it merely one of those
situations where someone could take advantage of the issue if they
happen to find an environment where the necessary conditions were
already met?

** Also affects: ossa
   Importance: Undecided
   Status: New

** Changed in: ossa
   Status: New => Incomplete

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2048785

Title:
  Trunk parent port (tpt port) vlan_mode is wrong in ovs

Status in neutron:
  In Progress
Status in OpenStack Security Advisory:
  Incomplete

Bug description:
  ... therefore a forwarding loop, packet duplication, packet loss and
  double tagging are possible.

  Today a trunk bridge with one parent and one subport looks like this:

  # ovs-vsctl show
  ...
  Bridge tbr-b2781877-3
  datapath_type: system
  Port spt-28c9689e-9e
  tag: 101
  Interface spt-28c9689e-9e
  type: patch
  options: {peer=spi-28c9689e-9e}
  Port tap3709f1a1-a5
  Interface tap3709f1a1-a5
  Port tpt-3709f1a1-a5
  Interface tpt-3709f1a1-a5
  type: patch
  options: {peer=tpi-3709f1a1-a5}
  Port tbr-b2781877-3
  Interface tbr-b2781877-3
  type: internal
  ...

  # ovs-vsctl find Port name=tpt-3709f1a1-a5 | egrep 'tag|vlan_mode|trunks'
  tag : []
  trunks  : []
  vlan_mode   : []

  # ovs-vsctl find Port name=spt-28c9689e-9e | egrep 'tag|vlan_mode|trunks'
  tag : 101
  trunks  : []
  vlan_mode   : []

  I believe the vlan_mode of the tpt port is wrong (at least when the port is
  not "vlan_transparent") and it should have the value "access". Even when the
  port is "vlan_transparent", forwarding loops between br-int and a trunk
  bridge should be prevented.

  According to:
  http://www.openvswitch.org/support/dist-docs/ovs-vswitchd.conf.db.5.txt

  """
     vlan_mode: optional string, one of access, dot1q-tunnel, native-tagged,
     native-untagged, or trunk
    The VLAN mode of the port, as described above. When this  column
    is empty, a default mode is selected as follows:

    •  If  tag contains a value, the port is an access port. The
   trunks column should be empty.

    •  Otherwise, the port is a trunk port.  The  trunks  column
   value is honored if it is present.
  """

  """
     trunks: set of up to 4,096 integers, in range 0 to 4,095
    For  a trunk, native-tagged, or native-untagged port, the 802.1Q
    VLAN or VLANs that this port trunks; if it is  empty,  then  the
    port trunks all VLANs. Must be empty if this is an access port.

    A native-tagged or native-untagged port always trunks its native
    VLAN, regardless of whether trunks includes that VLAN.
  """

  The above combination of tag, trunks and vlan_mode for the tpt port
  means that it is in trunk mode (in the ovs sense) and it forwards both
  untagged and tagged frames with any vlan tag. But the tpt port should
  only forward untagged frames.
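
  As a rough illustration (not the proposed neutron fix, just the equivalent
  manual change on the example port above), the mode could be forced with:

  # ovs-vsctl set Port tpt-3709f1a1-a5 vlan_mode=access
  # ovs-vsctl find Port name=tpt-3709f1a1-a5 | egrep 'tag|vlan_mode|trunks'

  Depending on the desired behaviour, the tag column may also need to be set so
  that the port only carries the untagged/local VLAN.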

  Feel free to treat this as the end of the bug report. But below I'll
  add more about how we found this bug, under what conditions it can be
  triggered, and what consequences it may have. However, please keep in
  mind that I don't have a full upstream reproduction at the moment, nor
  a full analysis of every suspicion mentioned below.

  I'm aware of a full reproduction of this bug only in a downstream
  environment, which looked like below. While the following was
  sufficient to reproduce the problem, this was likely far from a
  minimal reproduction and some/many of the below steps are unnecessary.

  * [securitygroup].firewall_driver = openvswitch ((edited, originally was: noop))
  * [agent].explicitly_egress_direct = True ((edited, originally was: [ovs].explicitly_egress_direct = True))
  * 2 VMs started on the same compute.
  * Both having a trunk port with one parent and one subport.
  * The parent and the subport of each trunk have the same MAC address.
  * All ports are on vlan networks belonging to the same physnet.
  * All ports are created with --disable-port-security and --no-security-group.
  * The subport segmentation IDs and the corresponding vlan network segmentation IDs were the same (as if they used "inherit").
  * Traffic was generated from a 3rd VM on a different compute addressed to one of the VM's subport IP, for which
  * the destination MAC was not yet learned by either br-int or the two trunk bridges on the host.

  I believe 

[Yahoo-eng-team] [Bug 2025813] Fix included in openstack/nova 26.2.1

2024-01-19 Thread OpenStack Infra
This issue was fixed in the openstack/nova 26.2.1 release.

** Changed in: nova/zed
   Status: Triaged => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2025813

Title:
  test_rebuild_volume_backed_server failing 100% on nova-lvm job

Status in devstack-plugin-ceph:
  New
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) antelope series:
  Fix Released
Status in OpenStack Compute (nova) yoga series:
  Fix Released
Status in OpenStack Compute (nova) zed series:
  Fix Released

Bug description:
  After the tempest patch was merged [1] nova-lvm job started to fail
  with the following error in test_rebuild_volume_backed_server:

  
  Traceback (most recent call last):
    File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 70, in wrapper
      return f(*func_args, **func_kwargs)
    File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 868, in test_rebuild_volume_backed_server
      self.get_server_ip(server, validation_resources),
    File "/opt/stack/tempest/tempest/api/compute/base.py", line 519, in get_server_ip
      return compute.get_server_ip(
    File "/opt/stack/tempest/tempest/common/compute.py", line 76, in get_server_ip
      raise lib_exc.InvalidParam(invalid_param=msg)
  tempest.lib.exceptions.InvalidParam: Invalid Parameter passed: When validation.connect_method equals floating, validation_resources cannot be None

  As discussed on IRC with Sean [2], the SSH validation is now mandatory, but it
  is disabled in the job config [3].

  [1] https://review.opendev.org/c/openstack/tempest/+/831018
  [2] https://meetings.opendev.org/irclogs/%23openstack-nova/%23openstack-nova.2023-07-04.log.html#t2023-07-04T15:33:38
  [3] https://opendev.org/openstack/nova/src/commit/4b454febf73cdd7b5be0a2dad272c1d7685fac9e/.zuul.yaml#L266-L267
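
  For reference, the behaviour in question is controlled by tempest's
  [validation] options; a minimal sketch of what a job would need to enable
  (assuming the floating-IP connect method) looks like:

    [validation]
    run_validation = True
    connect_method = floating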

To manage notifications about this bug go to:
https://bugs.launchpad.net/devstack-plugin-ceph/+bug/2025813/+subscriptions




[Yahoo-eng-team] [Bug 2027755] Fix included in openstack/nova 26.2.1

2024-01-19 Thread OpenStack Infra
This issue was fixed in the openstack/nova 26.2.1 release.

** Changed in: nova/zed
   Status: Fix Committed => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2027755

Title:
  "sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 50
  reached, connection timed out, timeout 30.00" error raised after
  repeated calls of Flavor.get_* methods

Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) antelope series:
  Fix Released
Status in OpenStack Compute (nova) wallaby series:
  In Progress
Status in OpenStack Compute (nova) xena series:
  In Progress
Status in OpenStack Compute (nova) yoga series:
  In Progress
Status in OpenStack Compute (nova) zed series:
  Fix Released

Bug description:
  This bug was reported downstream by a user who is using some automation to
  query the nova API periodically for flavor information, and occasionally they
  receive an HTTP 500 error from nova API with a message related to database
  connection pools.

  The error:

    sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 50
  reached, connection timed out, timeout 30.00 (Background on this error
  at: https://sqlalche.me/e/14/3o7r)

  is being raised in nova-api, causing a 500 error to be returned.

  I think this is happening because of the placement of the
  @api_db_api.context_manager.reader decorator (which manages the
  database session) on a helper method instead of on the methods that
  actually execute the database queries. I think it's resulting in
  connections not being closed and eventually reaching the database
  connection pool size limits.

  The database context manager decorator needs to be on the methods that
  execute the queries because part of what it does is close connections
  after the method is run.
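
  As a minimal, self-contained sketch of the pattern (names are illustrative,
  not the actual nova/oslo.db code), the session-managing decorator has to wrap
  the function that actually runs the query, so the pooled connection is
  released as soon as that function returns:

    import contextlib

    @contextlib.contextmanager
    def _session():
        """Stand-in for an oslo.db reader session: check a connection out of
        the pool and guarantee it is returned when the block exits."""
        conn = object()  # placeholder for a pooled SQLAlchemy connection
        try:
            yield conn
        finally:
            pass  # a real session would be closed here, releasing the connection

    def reader(fn):
        """Illustrative equivalent of @api_db_api.context_manager.reader."""
        def wrapper(*args, **kwargs):
            with _session() as conn:
                return fn(conn, *args, **kwargs)
        return wrapper

    # Correct placement: decorate the function that executes the query.
    @reader
    def flavor_get_by_flavor_id_from_db(conn, flavor_id):
        return {"flavorid": flavor_id}  # placeholder for conn.execute(...)

    print(flavor_get_by_flavor_id_from_db("m1.small"))

  Putting the decorator only on a helper that merely dispatches to other
  functions (the shape described in this report) means the connections used by
  the real queries are not released when the helper returns, so they can
  accumulate until the QueuePool limit is reached.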

  Full traceback:

  Jul 13 22:06:48 ubuntu-jammy devstack@n-api.service[270259]: DEBUG nova.api.openstack.wsgi [None req-b47b7ad6-eecd-44a9-8264-706742dd8539 demo demo] Calling method '>' {{(pid=270259) _process_stack /opt/stack/nova/nova/api/openstack/wsgi.py:513}}
  Jul 13 22:06:58 ubuntu-jammy devstack@n-api.service[270259]: DEBUG dbcounter [-] [270259] Writing DB stats nova_api:SELECT=2 {{(pid=270259) stat_writer /usr/local/lib/python3.10/dist-packages/dbcounter.py:117}}
  Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi [None req-b47b7ad6-eecd-44a9-8264-706742dd8539 demo demo] Unexpected exception in API method: sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 50 reached, connection timed out, timeout 30.00 (Background on this error at: https://sqlalche.me/e/14/3o7r)
  Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi Traceback (most recent call last):
  Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi   File "/opt/stack/nova/nova/api/openstack/wsgi.py", line 658, in wrapped
  Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi return f(*args, **kwargs)
  Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi   File "/opt/stack/nova/nova/api/openstack/compute/flavors_extraspecs.py", line 64, in index
  Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi return self._get_extra_specs(context, flavor_id)
  Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi   File "/opt/stack/nova/nova/api/openstack/compute/flavors_extraspecs.py", line 34, in _get_extra_specs
  Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi flavor = common.get_flavor(context, flavor_id)
  Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi   File "/opt/stack/nova/nova/api/openstack/common.py", line 494, in get_flavor
  Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi return objects.Flavor.get_by_flavor_id(context, flavor_id)
  Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi   File "/usr/local/lib/python3.10/dist-packages/oslo_versionedobjects/base.py", line 184, in wrapper
  Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi result = fn(cls, context, *args, **kwargs)
  Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi   File "/opt/stack/nova/nova/objects/flavor.py", line 395, in get_by_flavor_id
  Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi db_flavor = cls._flavor_get_by_flavor_id_from_db(context,
  Jul 13 22:07:18 ubuntu-jammy devstack@n-api.service[270259]: ERROR nova.api.openstack.wsgi   File

[Yahoo-eng-team] [Bug 2043036] Re: [ironic] list_instances/list_instance_uuid does not respect conductor_group/partition_key

2024-01-19 Thread OpenStack Infra
Reviewed:  https://review.opendev.org/c/openstack/nova/+/900831
Committed: https://opendev.org/openstack/nova/commit/fa3cf7d50cba921ea67eb161e6a199067ea62deb
Submitter: "Zuul (22348)"
Branch: master

commit fa3cf7d50cba921ea67eb161e6a199067ea62deb
Author: Jay Faulkner 
Date:   Mon Nov 13 15:21:31 2023 -0800

[ironic] Partition & use cache for list_instance*

list_instances and list_instance_uuids, as written in the Ironic driver,
do not currently respect conductor_group partitioning. Given that a nova
compute is intended to limit its scope of work to the conductor group it
is configured to work with, this is a bug.

Additionally, this should be a significant performance boost for a
couple of reasons: firstly, instead of calling the Ironic API and
getting all nodes rather than only the subset (when using a conductor
group), we're now properly getting the subset of nodes -- this is the
optimized path in the Ironic DB and API code. Secondly, we're now using
the driver's node cache to respond to these requests. Since
list_instances and list_instance_uuids are used by periodic tasks, these
operating with data that may be slightly stale should have minimal
impact compared to the performance benefits.

Closes-bug: #2043036
Change-Id: If31158e3269e5e06848c29294fdaa147beedb5a5


** Changed in: nova
   Status: In Progress => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2043036

Title:
  [ironic] list_instances/list_instance_uuid does not respect
  conductor_group/partition_key

Status in Ironic:
  Triaged
Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  The methods on the Ironic driver, list_instances and
  list_instance_uuids are not currently respecting the conductor_group
  option:
  https://opendev.org/openstack/nova/src/branch/master/nova/conf/ironic.py#L71.

  This leads to significant performance degradation, as querying Ironic
  for all nodes (/v1/nodes) instead of all nodes managed by the compute
  (/v1/nodes?conductor_group=blah) is a significantly more expensive API
  call.

  In addition, this can lead to unexpected behavior for operators, such
  as an action being taken by a compute serving conductor group "A" to
  resolve an issue that would normally be resolved by a compute serving
  conductor group "B".

  
  While troubleshooting this error, we dug deeply into what this data is used
  for; it's used for two things:
  - Reconciling deleted instances as a periodic job
  - Ensuring no instances exist on a newly-started compute host

  These are tasks which either could use stale data or would not be impacted by
  using the Ironic driver's existing node cache. Therefore, a suggested fix is:

  Revise list_instances and list_instance_uuids to reuse the node cache
  to reduce the overall API calls being made to Ironic, and ensure all
  /v1/nodes calls use the same codepath in the Ironic driver. It's the
  belief of JayF, TheJulia, and Johnthetubaguy (on a video call right
  now) that using stale data, without refreshing the cache, should be
  safe for these use cases. (Even if we decide to refresh the cache, we
  should use this code path anyway.)
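
  A hypothetical sketch of the suggested direction (names and signatures are
  illustrative, not the actual driver API): answer these queries from the
  driver's node cache, which is already limited to the configured
  conductor_group, instead of listing every node via the Ironic API.

    from collections import namedtuple

    Node = namedtuple("Node", ["uuid", "instance_uuid", "conductor_group"])

    def list_instance_uuids(node_cache):
        """Return instance UUIDs for the nodes this compute manages, using the
        (possibly slightly stale) node cache rather than GET /v1/nodes."""
        return [n.instance_uuid for n in node_cache.values() if n.instance_uuid]

    cache = {
        "n1": Node("n1", "8e2f4e6a-0000-4000-8000-000000000001", "groupA"),
        "n2": Node("n2", None, "groupA"),  # no instance on this node
    }
    print(list_instance_uuids(cache))  # ['8e2f4e6a-0000-4000-8000-000000000001']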

To manage notifications about this bug go to:
https://bugs.launchpad.net/ironic/+bug/2043036/+subscriptions




[Yahoo-eng-team] [Bug 2049899] [NEW] disk remaining in logs during live migration says 100 when no disk is migrated

2024-01-19 Thread Tobias Urdin
Public bug reported:

When doing live migrations for BFV (boot-from-volume) instances, the disk
remaining reported in the nova log says 100 even if there is no disk to migrate.

** Affects: nova
 Importance: Undecided
 Status: In Progress

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2049899

Title:
  disk remaining in logs during live migration says 100 when no disk is
  migrated

Status in OpenStack Compute (nova):
  In Progress

Bug description:
  When doing live migrations for BFV (boot-from-volume) instances, the disk
  remaining reported in the nova log says 100 even if there is no disk to
  migrate.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2049899/+subscriptions




[Yahoo-eng-team] [Bug 2049903] [NEW] nova-compute starts even if resource provider creation fails with conflict

2024-01-19 Thread Tobias Urdin
Public bug reported:

If an operator reinstalls a compute node but forgets to do an "openstack
compute service delete" first (which would wipe the nova-compute service
record and the resource provider in placement), the reinstalled compute node
with the same hostname happily reports its state as up, even though the
resource provider creation that nova-compute attempted failed due to a
conflict with the existing resource provider.

To do operators a big favor, we should make nova-compute startup fail if the
resource provider creation failed (for example, when there is a conflict).

** Affects: nova
 Importance: Undecided
 Status: In Progress

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2049903

Title:
  nova-compute starts even if resource provider creation fails with
  conflict

Status in OpenStack Compute (nova):
  In Progress

Bug description:
  If an operator reinstalls a compute node but forgets to do an "openstack
  compute service delete" first (which would wipe the nova-compute service
  record and the resource provider in placement), the reinstalled compute node
  with the same hostname happily reports its state as up, even though the
  resource provider creation that nova-compute attempted failed due to a
  conflict with the existing resource provider.

  To do operators a big favor, we should make nova-compute startup fail if the
  resource provider creation failed (for example, when there is a conflict).
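
  For reference, a hedged sketch of the manual cleanup alluded to above
  (identifiers are placeholders; the resource provider command requires the
  osc-placement client plugin):

    # remove the stale nova-compute service record for the old host
    openstack compute service delete <service-id>

    # remove the orphaned resource provider left behind in placement
    openstack resource provider delete <provider-uuid>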

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2049903/+subscriptions




[Yahoo-eng-team] [Bug 2049909] [NEW] MLDv2 packets sent from L3 Agent managed networks cause backup routers to be preferred

2024-01-19 Thread Andrew Bonney
Public bug reported:

Neutron 2023.1 29cc1a634e530972614c09fbb212b5f63fd4c374
Ubuntu 20.04

This issue has been identified in a Neutron system running Linux Bridge
networking, but whilst this may no longer be supported I'm posting it in
case the same issue might be relevant for other drivers.

When running multiple network nodes, the tenant networks in the
namespaces on each node share MAC addresses. We have noted that
particularly when rebooting a network node, traffic from the Internet to
tenant networks can be disrupted when a node which was acting as the
backup for a given tenant network comes back online. We have traced this
to Linux sending out MLDv2 responses to the upstream switches when the
tenant network processes (keepalived etc) start up. As a result, the
upstream switches update their MAC tables to prefer that host despite it
not being the primary. If there is minimal tenant traffic (such as when
running a web server), this network will be inaccessible from the
outside until a request is made from the inside to the outside and the
switches re-update their MAC tables to reflect the correct state.

There is already handling to prevent some IPv6 packets being sent out in
these cases here:
https://opendev.org/openstack/neutron/src/branch/master/neutron/agent/l3/router_info.py#L808,
and there is theoretically something explicitly referencing issues with
MLDv2 in the same area:
https://opendev.org/openstack/neutron/src/branch/master/neutron/agent/l3/router_info.py#L813.
Unfortunately these don't appear to be sufficient.

There is no sysctl mechanism to prevent these gratuitous MLDv2 responses
as far as I can tell, so we are working around this by using iptables
rules inserted by the L3 agent into tenant networks. There may well be a
better solution, but I will link our workaround to this bug report
shortly.

** Affects: neutron
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/2049909

Title:
  MLDv2 packets sent from L3 Agent managed networks cause backup routers
  to be preferred

Status in neutron:
  New

Bug description:
  Neutron 2023.1 29cc1a634e530972614c09fbb212b5f63fd4c374
  Ubuntu 20.04

  This issue has been identified in a Neutron system running Linux
  Bridge networking, but whilst this may no longer be supported I'm
  posting it in case the same issue might be relevant for other drivers.

  When running multiple network nodes, the tenant networks in the
  namespaces on each node share MAC addresses. We have noted that
  particularly when rebooting a network node, traffic from the Internet
  to tenant networks can be disrupted when a node which was acting as
  the backup for a given tenant network comes back online. We have
  traced this to Linux sending out MLDv2 responses to the upstream
  switches when the tenant network processes (keepalived etc) start up.
  As a result, the upstream switches update their MAC tables to prefer
  that host despite it not being the primary. If there is minimal tenant
  traffic (such as when running a web server), this network will be
  inaccessible from the outside until a request is made from the inside
  to the outside and the switches re-update their MAC tables to reflect
  the correct state.

  There is already handling to prevent some IPv6 packets being sent out
  in these cases here:
  
https://opendev.org/openstack/neutron/src/branch/master/neutron/agent/l3/router_info.py#L808,
  and there is theoretically something explicitly referencing issues
  with MLDv2 in the same area:
  
https://opendev.org/openstack/neutron/src/branch/master/neutron/agent/l3/router_info.py#L813.
  Unfortunately these don't appear to be sufficient.

  There is no sysctl mechanism to prevent these gratuitous MLDv2
  responses as far as I can tell, so we are working around this by using
  iptables rules inserted by the L3 agent into tenant networks. There
  may well be a better solution, but I will link our workaround to this
  bug report shortly.
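
  As a rough illustration of the kind of rule described (interface naming and
  the exact match are assumptions, not the actual workaround that will be
  linked): MLDv2 listener reports are ICMPv6 type 143, so a rule such as the
  following inside the router namespace suppresses them on the external port.

    ip netns exec qrouter-<router-uuid> ip6tables -A OUTPUT -o qg-<port-id> \
        -p icmpv6 --icmpv6-type 143 -j DROP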

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/2049909/+subscriptions




[Yahoo-eng-team] [Bug 1943934] Re: report extra gpu device when config one enabled_vgpu_types

2024-01-19 Thread Sylvain Bauza
Fixed by https://review.opendev.org/c/openstack/nova/+/899406/2

** Changed in: nova
   Status: Triaged => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1943934

Title:
  report extra gpu device when config one enabled_vgpu_types

Status in OpenStack Compute (nova):
  Fix Released

Bug description:
  if there are two gpu devices virtualized on the env, and config one
  enabled_vgpu_types and device_addresses, Nova will report these two
  gpu devices to Placement. we should only report the configured
  device_addresses to Placement.
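
  For context, a hedged example of the kind of configuration being described
  (type name and PCI address are placeholders; in the affected releases the
  dynamic group is [vgpu_<type>], renamed to [mdev_<type>] in newer ones):

    [devices]
    enabled_vgpu_types = nvidia-35

    [vgpu_nvidia-35]
    device_addresses = 0000:84:00.0

  With only one address listed, only that device should be reported as a
  resource provider in Placement.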

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1943934/+subscriptions




[Yahoo-eng-team] [Bug 2027975] Re: Add check on network interface name's length

2024-01-19 Thread Bug Watch Updater
** Changed in: cloud-init
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to cloud-init.
https://bugs.launchpad.net/bugs/2027975

Title:
  Add check on network interface name's length

Status in cloud-init:
  Fix Released
Status in MAAS:
  Triaged

Bug description:
  When adding a VLAN on long network interfaces, Cloud-init will fail
  silently to add the sub-interfaces.

  Deployed environment is :
  * MAAS 3.3.4
  * Server Dell R630 + Mellanox Connectx-5
  * name of interfaces : enp130s0f0np0 / enp130s0f1np1
  * add a VLAN like 100-4093

  Cloud-Init will not display any error message but the VLAN interfaces will not
  be added after the deployment. If trying to perform the operation manually, we
  are then greeted with the following error message.

  ```
  ubuntu@R630:~$ sudo ip link add link enp130s0f0np0 name enp130s0f0np0.103 type vlan id 103
  Error: argument "enp130s0f0np0.103" is wrong: "name" not a valid ifname
  ```

  From Iproute2 and Kernel perspective, it is not possible to have
  interfaces with a name longer than 15 characters in total.
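
  A quick way to illustrate the limit (IFNAMSIZ is 16 bytes including the
  terminating NUL, leaving 15 usable characters):

  ```
  IFNAMSIZ_USABLE = 15  # Linux IFNAMSIZ is 16, minus the trailing NUL

  def vlan_ifname_ok(parent: str, vlan_id: int) -> bool:
      """Check whether '<parent>.<vlan_id>' fits in a Linux interface name."""
      return len(f"{parent}.{vlan_id}") <= IFNAMSIZ_USABLE

  print(vlan_ifname_ok("enp130s0f0np0", 103))  # False -- 17 characters
  print(vlan_ifname_ok("eno1", 103))           # True  -- 8 characters
  ```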

  A quick workaround is simply to rename the network interface to something
  shorter. Having a quick warning from MAAS would be nice, to help understand
  the origin of the issue.

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-init/+bug/2027975/+subscriptions




[Yahoo-eng-team] [Bug 2049852] [NEW] VM reboot issues

2024-01-19 Thread Aravindh
Public bug reported:

Problem Statement: Creating a VM from an image works fine, but rebooting that
instance from openstack results in an Error state.


Steps to re-create:

1. Create a VM from Openstack with an image as the source. The VM is created successfully.
2. Now issue an openstack server reboot.
3. Soft reboot fails after some time, and a hard reboot is attempted, but it fails and moves the instance to an Error state.


Observations. 

1. A weird thing I noticed is that when the instance is being created, the nova
log shows this (note the volume ID):

2024-01-19 06:45:06.536 7 INFO nova.virt.block_device [req-da372940-2784-4aba-881f-7636460efb46 req-bffdffd6-7521-4450-9ab9-93bac32789e1 a828b4ad4d794e18ac9c6238e893522d 1f4d24639d564e40816d90be4cac8ecd - - default default] [instance: 330922ac-8333-4ee7-a634-0075a00f1fd7] Booting with volume-backed-image 0e1aa7ba-23cc-4ced-8aa2-c9f78535d435 at /dev/vda

2. The created volume (c22d487d-ab7a-4373-b69f-91cd02f32cc8) in openstack
does not match this ID at all.

3. openstack server show VM shows the correct volume ID as in 2.

4. During a reboot issued through openstack, the multipath just fails and
nova logs are stuck at cleaning stale cinder volumes.

5. A reboot command from inside the VM works fine. (sudo reboot)



6. When I use a volume as the source to create the instance, I have none of
these problems. The nova log shows the correct volume ID during the creation
phase and reboot works great with no issues.


I can consistently reproduce this error.


Host OS - Ubuntu 22.04
Deployment - Kolla Ansible with Docker containers
Storage - Netapp with iSCSI
Hypervisor - KVM
Nova Version - 18.4.0

** Affects: nova
 Importance: Undecided
 Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2049852

Title:
  VM reboot issues

Status in OpenStack Compute (nova):
  New

Bug description:
  Problem Statement: Creating a VM from an image works fine, but rebooting
  that instance from openstack results in an Error state.

  
  Steps to re-create:

  1. Create a VM from Openstack with an image as the source. The VM is created successfully.
  2. Now issue an openstack server reboot.
  3. Soft reboot fails after some time, and a hard reboot is attempted, but it fails and moves the instance to an Error state.

  
  Observations. 

  1. A weird thing I noticed is that when the instance is being created, the
  nova log shows this (note the volume ID):

  2024-01-19 06:45:06.536 7 INFO nova.virt.block_device [req-da372940-2784-4aba-881f-7636460efb46 req-bffdffd6-7521-4450-9ab9-93bac32789e1 a828b4ad4d794e18ac9c6238e893522d 1f4d24639d564e40816d90be4cac8ecd - - default default] [instance: 330922ac-8333-4ee7-a634-0075a00f1fd7] Booting with volume-backed-image 0e1aa7ba-23cc-4ced-8aa2-c9f78535d435 at /dev/vda

  2. The created volume (c22d487d-ab7a-4373-b69f-91cd02f32cc8) in
  openstack does not match this ID at all.

  3. openstack server show VM shows the correct volume ID as in 2.

  4. During a reboot issued through openstack, the multipath just fails and
  nova logs are stuck at cleaning stale cinder volumes.

  5. A reboot command from inside the VM works fine. (sudo reboot)

  

  6. When I use a volume as the source to create the instance, I have none of
  these problems. The nova log shows the correct volume ID during the
  creation phase and reboot works great with no issues.

  
  I can consistently reproduce this error.

  
  Host OS - Ubuntu 22.04
  Deployment - Kolla Ansible with Docker containers
  Storage - Netapp with iSCSI
  Hypervisor - KVM
  Nova Version - 18.4.0

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2049852/+subscriptions

