[Yahoo-eng-team] [Bug 2028112] [NEW] Unable to create VM when using the sriov agent with the ml2/ovn driver.
Public bug reported:

I am planning to operate nodes using HWOL with the OVN controller alongside nodes using SR-IOV in the same OVN environment. The test setup is as follows.

** Controller server **

The ml2 mechanism_drivers were specified as follows:
```
[ml2]
mechanism_drivers = sriovnicswitch,ovn
```

Upon checking the log, both drivers were confirmed to be loaded normally:
```
2023-07-19 00:44:37.403 1697414 INFO neutron.plugins.ml2.managers [-] Configured mechanism driver names: ['ovn', 'sriovnicswitch']
2023-07-19 00:44:37.464 1697414 INFO neutron.plugins.ml2.managers [-] Loaded mechanism driver names: ['ovn', 'sriovnicswitch']
2023-07-19 00:44:37.464 1697414 INFO neutron.plugins.ml2.managers [-] Registered mechanism drivers: ['ovn', 'sriovnicswitch']
2023-07-19 00:44:37.464 1697414 INFO neutron.plugins.ml2.managers [-] No mechanism drivers provide segment reachability information for agent scheduling.
2023-07-19 00:44:38.358 1697414 INFO neutron.plugins.ml2.managers [req-bc634856-2d9a-44d0-ae0e-351b440a2a0b - - - - -] Initializing mechanism driver 'ovn'
2023-07-19 00:44:38.378 1697414 INFO neutron.plugins.ml2.managers [req-bc634856-2d9a-44d0-ae0e-351b440a2a0b - - - - -] Initializing mechanism driver 'sriovnicswitch'
```

** Compute **

nova.conf:
```
[pci]
passthrough_whitelist = { "devname": "enp94s0f1np1", "physical_network": "physnet1" }
```

plugin/ml2/sriov-agent.ini:
```
[DEFAULT]
debug = true

[securitygroup]
firewall_driver = neutron.agent.firewall.NoopFirewallDriver

[sriov_nic]
physical_device_mappings = physnet1:enp94s0f1np1
```

Neutron agent status:
```
+--------------------------------------+------------------------------+---------------+-------------------+-------+-------+----------------------------+
| ID                                   | Agent Type                   | Host          | Availability Zone | Alive | State | Binary                     |
+--------------------------------------+------------------------------+---------------+-------------------+-------+-------+----------------------------+
| 24e9395c-379f-4afd-aa84-ae0d970794ff | NIC Switch agent             | Qacloudhost06 | None              | :-)   | UP    | neutron-sriov-nic-agent    |
| 43ba481c-c0f2-49bc-a34a-c94faa284ac7 | NIC Switch agent             | Qaamdhost02   | None              | :-)   | UP    | neutron-sriov-nic-agent    |
| 4c1a6c78-e58a-48d9-aa4a-abdf44d2f359 | NIC Switch agent             | Qacloudhost07 | None              | :-)   | UP    | neutron-sriov-nic-agent    |
| 534f0946-6eb3-491f-a57d-65cbc0133399 | NIC Switch agent             | Qacloudhost02 | None              | :-)   | UP    | neutron-sriov-nic-agent    |
| 2275f9d4-7c69-51db-ae71-b6e0be15e9b8 | OVN Metadata agent           | Qacloudhost05 |                   | :-)   | UP    | neutron-ovn-metadata-agent |
| 92a7b8dc-e122-49c8-a3bc-ae6a38b56cc0 | OVN Controller Gateway agent | Qacloudhost05 |                   | :-)   | UP    | ovn-controller             |
| c3a1e8fe-8669-5e7a-a3d7-3a2b638fae26 | OVN Metadata agent           | Qaamdhost02   |                   | :-)   | UP    | neutron-ovn-metadata-agent |
| d203ff10-0835-4d7e-bc63-5ff274ade5a3 | OVN Controller agent         | Qaamdhost02   |                   | :-)   | UP    | ovn-controller             |
| fc4c5075-9b44-5c21-a24d-f86dfd0009f9 | OVN Metadata agent           | Qacloudhost02 |                   | :-)   | UP    | neutron-ovn-metadata-agent |
| bed0-1519-47f8-b52f-3a9116e1408f     | OVN Controller Gateway agent | Qacloudhost02 |                   | :-)   | UP    | ovn-controller             |
+--------------------------------------+------------------------------+---------------+-------------------+-------+-------+----------------------------+
```

When creating a VM, the neutron server logs the following error:
```
2023-07-19 02:44:30.463 1725695 ERROR neutron.plugins.ml2.managers [req-9204d6f7-ddc3-44e2-878c-bfa9c3f761ef fbec686e249e4818be7a686833140326 7a4dd87db099460795d775b055a648ea - default default] Mechanism driver 'ovn' failed in update_port_precommit: neutron_lib.exceptions.InvalidInput: Invalid input for operation: Invalid binding:profile. too many parameters.
2023-07-19 02:44:30.463 1725695 ERROR neutron.plugins.ml2.managers Traceback (most recent call last):
2023-07-19 02:44:30.463 1725695 ERROR neutron.plugins.ml2.managers   File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/managers.py", line 482, in _call_on_drivers
2023-07-19 02:44:30.463 1725695 ERROR neutron.plugins.ml2.managers     getattr(driver.obj, method_name)(context)
2023-07-19 02:44:30.463 1725695 ERROR neutron.plugins.ml2.managers   File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 792, in update_port_precommit
2023-07-19 02:44:30.463 1725695 ERROR neutron.plugins.ml2.managers     ovn_utils.validate_and_get_data_from_binding_profile(port)
2023-07-19 02:44:30.463 1725695 ERROR
```
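For readers unfamiliar with the failure mode: the OVN mechanism driver validates a port's binding:profile against a whitelist of keys it knows, while the SR-IOV agent binds ports with extra PCI details in that same field. A minimal illustrative sketch of this clash follows; `ALLOWED_KEYS` and `validate_binding_profile` are hypothetical names for illustration, not the actual neutron API.

```python
# Sketch only: why a strict binding:profile validator ends up raising
# "Invalid binding:profile. too many parameters." when one ML2 driver
# (sriovnicswitch) writes keys another driver (ovn) does not whitelist.
ALLOWED_KEYS = {"parent_name", "tag"}  # hypothetical whitelist


def validate_binding_profile(profile: dict) -> dict:
    """Reject any key outside the whitelist, as a strict validator would."""
    extra = set(profile) - ALLOWED_KEYS
    if extra:
        # Mirrors the InvalidInput message seen in the traceback above.
        raise ValueError("Invalid binding:profile. too many parameters.")
    return profile


# A trunk-style sub-port profile passes:
validate_binding_profile({"parent_name": "p1", "tag": 42})

# An SR-IOV-style profile carrying PCI details is rejected:
try:
    validate_binding_profile({"pci_slot": "0000:5e:02.1"})
except ValueError as exc:
    print(exc)  # prints: Invalid binding:profile. too many parameters.
```

The point of the sketch is only that the two drivers disagree about which binding:profile keys are legitimate when both are loaded.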
[Yahoo-eng-team] [Bug 1996594] Re: OVN metadata randomly stops working
Many fixes related to the reported issue have landed in neutron and OVS since this bug report. Some that I quickly caught are:

- https://patchwork.ozlabs.org/project/openvswitch/patch/20220819230810.2626573-1-i.maxim...@ovn.org/
- https://review.opendev.org/c/openstack/ovsdbapp/+/856200
- https://review.opendev.org/c/openstack/ovsdbapp/+/862524
- https://review.opendev.org/c/openstack/neutron/+/857775
- https://review.opendev.org/c/openstack/neutron/+/871825

Closing it based on the above and comment #5. If the issues are still seen with python-ovs>=2.17 and the above fixes included, please feel free to reopen the issue along with the ovsdb-server SB logs and neutron server and metadata agent debug logs.

** Bug watch added: Red Hat Bugzilla #2214289
   https://bugzilla.redhat.com/show_bug.cgi?id=2214289

** Changed in: neutron
       Status: New => Fix Released

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1996594

Title: OVN metadata randomly stops working

Status in neutron: Fix Released

Bug description:

We found that OVN metadata randomly stops working while OVN is writing a snapshot.

1. At 12:30:35, OVN started to transfer leadership to write a snapshot:
```
$ find sosreport-juju-2752e1-*/var/log/ovn/* | xargs zgrep -i -E 'Transferring leadership'
sosreport-juju-2752e1-6-lxd-24-xxx-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.322Z|80962|raft|INFO|Transferring leadership to write a snapshot.
sosreport-juju-2752e1-6-lxd-24-xxx-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.024Z|82382|raft|INFO|Transferring leadership to write a snapshot.
sosreport-juju-2752e1-7-lxd-27-xxx-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.330Z|92698|raft|INFO|Transferring leadership to write a snapshot.
```
2. At 12:30:36, neutron-ovn-metadata-agent reported an OVSDB error:
```
$ find sosreport-srv1*/var/log/neutron/* | xargs zgrep -i -E 'OVSDB Error'
sosreport-srv1xxx2d-xxx-2022-08-18-cuvkufw/var/log/neutron/neutron-ovn-metadata-agent.log:2022-08-18 12:30:36.103 75556 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: no error details available
sosreport-srv1xxx6d-xxx-2022-08-18-bgnovqu/var/log/neutron/neutron-ovn-metadata-agent.log:2022-08-18 12:30:36.104 2171 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: no error details available
```

3. At 12:57:53, we saw the error 'No port found in network'; from then on, OVN metadata randomly fails:
```
2022-08-18 12:57:53.800 3730 ERROR neutron.agent.ovn.metadata.server [-] No port found in network 63e2c276-60dd-40e3-baa1-c16342eacce2 with IP address 100.94.98.135
```

After the problem occurs, restarting neutron-ovn-metadata-agent, or restarting the haproxy instance as follows, can be used as a workaround:
```
/usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec ovnmeta-63e2c276-60dd-40e3-baa1-c16342eacce2 haproxy -f /var/lib/neutron/ovn-metadata-proxy/63e2c276-60dd-40e3-baa1-c16342eacce2.conf
```

LP bug #1990978 [1] is trying to reduce the frequency of the transfers, which should help with this problem, but it only reduces how often the problem occurs rather than avoiding it completely. I wonder if we need to add some retry logic on the neutron side.

NOTE: The OpenStack version we are using is focal-xena, and openvswitch's version is 2.16.0-0ubuntu2.1~cloud0.

[1] https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1990978

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1996594/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
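The "retry logic on the neutron side" the reporter floats would look roughly like the sketch below: re-issue a failed OVSDB transaction a few times, on the assumption that the failure is transient while a new raft leader is elected. `OvsdbError` and `with_retries` are placeholder names for illustration, not real neutron or ovsdbapp APIs.

```python
import time


class OvsdbError(Exception):
    """Placeholder for a transient OVSDB transaction failure."""


def with_retries(fn, attempts=3, delay=0.01):
    """Call fn(), retrying up to `attempts` times on OvsdbError."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except OvsdbError:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(delay)  # back off while a new leader is elected


# Usage: a transaction that fails twice (mid-leadership-transfer)
# and then succeeds on the third attempt.
calls = {"n": 0}


def run_txn():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OvsdbError("no error details available")
    return "committed"


print(with_retries(run_txn))  # prints: committed
```

The fixes listed at the top of this message took a related approach in ovsdbapp and neutron (reconnect/retry handling around cluster leadership changes), so the sketch is only a simplified illustration of the idea.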
[Yahoo-eng-team] [Bug 1969592] Re: [OVN] Frequent DB leader changes causes 'VIF creation failed' on nova side
Many fixes related to the reported issue have landed in neutron and OVS since this bug report. Some that I quickly caught are:

- https://patchwork.ozlabs.org/project/openvswitch/patch/20220819230810.2626573-1-i.maxim...@ovn.org/
- https://review.opendev.org/c/openstack/ovsdbapp/+/856200
- https://review.opendev.org/c/openstack/ovsdbapp/+/862524
- https://review.opendev.org/c/openstack/neutron/+/857775
- https://review.opendev.org/c/openstack/neutron/+/871825

One issue that I know is still open, but which happens very rarely:
https://bugzilla.redhat.com/show_bug.cgi?id=2214289

Closing it based on the above and comments #2 and #3. If the issues are still seen with python-ovs>=2.17 and the above fixes included, please feel free to reopen the issue along with neutron and nova debug logs.

** Bug watch added: Red Hat Bugzilla #2214289
   https://bugzilla.redhat.com/show_bug.cgi?id=2214289

** Changed in: neutron
       Status: New => Fix Released

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1969592

Title: [OVN] Frequent DB leader changes causes 'VIF creation failed' on nova side

Status in neutron: Fix Released
Status in OpenStack Compute (nova): Incomplete

Bug description:

Hello guys. As I found, in 2021 there was a commit to OVS [since OVS v2.16, or v2.15 with that backport] that changed behavior during OVN DB snapshotting: before the leader creates a snapshot, it now transfers leadership to another node. We've run tests with rally and tempest, and it looks like there is now a problem in the interaction between nova and neutron. For example, a simple rally test like 'create {network,router,subnet} -> add interface to router' looks okay even with 256 concurrent same tests/threads. But something like 'neutron.create_subnet -> nova.boot_server -> nova.attach_interface' will fail whenever a leadership transfer happens.
Since it happens so often [each 10m + rand(10)], we get a lot of errors. This problem can be observed on all versions where OVS 2.16 [or the backport] or higher is included :)

Some tracing from the logs [neutron, nova, ovn-sb-db]:

CONTROL-NODES:
```
ctl01-ovn-sb-db.log:2022-04-19T12:30:03.089Z|01002|raft|INFO|Transferring leadership to write a snapshot.
ctl01-ovn-sb-db.log:2022-04-19T12:30:03.099Z|01003|raft|INFO|server 1c5f is leader for term 42
ctl03-ovn-sb-db.log:2022-04-19T12:30:03.090Z|00938|raft|INFO|received leadership transfer from 1f46 in term 41
ctl03-ovn-sb-db.log:2022-04-19T12:30:03.092Z|00940|raft|INFO|term 42: elected leader by 2+ of 3 servers
ctl03-ovn-sb-db.log:2022-04-19T12:30:10.941Z|00941|jsonrpc|WARN|tcp:xx.yy.zz.26:41882: send error: Connection reset by peer
ctl03-ovn-sb-db.log:2022-04-19T12:30:27.324Z|00943|jsonrpc|WARN|tcp:xx.yy.zz.26:41896: send error: Connection reset by peer
ctl03-ovn-sb-db.log:2022-04-19T12:30:27.325Z|00945|jsonrpc|WARN|tcp:xx.yy.zz.26:41880: send error: Connection reset by peer
ctl03-ovn-sb-db.log:2022-04-19T12:30:27.325Z|00947|jsonrpc|WARN|tcp:xx.yy.zz.26:41892: send error: Connection reset by peer
ctl03-ovn-sb-db.log:2022-04-19T12:30:27.327Z|00949|jsonrpc|WARN|tcp:xx.yy.zz.26:41884: send error: Connection reset by peer
ctl03-ovn-sb-db.log:2022-04-19T12:31:49.244Z|00951|jsonrpc|WARN|tcp:xx.yy.zz.25:40260: send error: Connection timed out
ctl03-ovn-sb-db.log:2022-04-19T12:31:49.244Z|00953|jsonrpc|WARN|tcp:xx.yy.zz.25:40264: send error: Connection timed out
ctl03-ovn-sb-db.log:2022-04-19T12:31:49.244Z|00955|jsonrpc|WARN|tcp:xx.yy.zz.24:37440: send error: Connection timed out
ctl03-ovn-sb-db.log:2022-04-19T12:31:49.244Z|00957|jsonrpc|WARN|tcp:xx.yy.zz.24:37442: send error: Connection timed out
ctl03-ovn-sb-db.log:2022-04-19T12:31:49.245Z|00959|jsonrpc|WARN|tcp:xx.yy.zz.24:37446: send error: Connection timed out
ctl03-ovn-sb-db.log:2022-04-19T12:32:01.533Z|01001|jsonrpc|WARN|tcp:xx.yy.zz.67:57586: send error: Connection timed out
```
```
2022-04-19 12:30:08.898 27 INFO neutron.db.ovn_revision_numbers_db [req-7fcfdd74-482d-46b2-9f76-07190669d76d ff1516be452b4b939314bf3864a63f35 9d3ae9a7b121488285203b0fdeabc3a3 - default default] Successfully bumped revision number for resource be178a9a-26d7-4bf0-a4e8-d206a6965205 (type: ports) to 1
2022-04-19 12:30:09.644 27 INFO neutron.db.ovn_revision_numbers_db [req-a8278418-3ad9-450c-89bb-e7a5c1c0a06d a9864cd890224c079051b3f56021be64 72db34087b9b401d842b66643b647e16 - default default] Successfully bumped revision number for resource be178a9a-26d7-4bf0-a4e8-d206a6965205 (type: ports) to 2
2022-04-19 12:30:10.235 27 INFO neutron.wsgi [req-571b53cc-ca04-46f7-89f9-fdf8e5931f4c a9864cd890224c079051b3f56021be64 72db34087b9b401d842b66643b647e16 - default default] xx.yy.zz.68,xx.yy.zz.26 "GET /v2.0/ports?tenant_id=9d3ae9a7b121488285203b0fdeabc3a3_id=7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34
```
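The closing comment asks reporters to confirm python-ovs >= 2.17 before re-filing. A minimal sketch of that version gate; `at_least` is a hypothetical helper, not part of python-ovs, and it assumes plain dotted numeric version strings.

```python
def at_least(installed: str, required: str = "2.17") -> bool:
    """Return True if a dotted version string meets the required minimum.

    Tuple comparison handles differing lengths: (2, 17, 9) >= (2, 17).
    """
    to_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return to_tuple(installed) >= to_tuple(required)


print(at_least("2.17.9"))  # prints: True
print(at_least("2.16.0"))  # prints: False
```

In practice one would feed this the version reported by the installed `ovs` Python bindings (or by the package manager) rather than a hard-coded string.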
[Yahoo-eng-team] [Bug 2028037] [NEW] Error 500 on create network when PostgreSQL is used
Public bug reported:

The error has been seen in the periodic job neutron-ovn-tempest-postgres-full every day since 5.07.2023:
https://zuul.openstack.org/builds?job_name=neutron-ovn-tempest-postgres-full=openstack%2Fneutron=master=0

Error from the neutron server log:
```
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: WARNING oslo_db.sqlalchemy.exc_filters [None req-eb023ace-8737-4e89-9557-e058931a7892 demo demo] DBAPIError exception wrapped.: psycopg2.errors.GroupingError: column "subnet_service_types.subnet_id" must appear in the GROUP BY clause or be used in an aggregate function
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: LINE 2: ...de, subnets.standard_attr_id AS standard_attr_id, subnet_ser...
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ^
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters     self.dialect.do_execute(
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/engine/default.py", line 736, in do_execute
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement, parameters)
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters psycopg2.errors.GroupingError: column "subnet_service_types.subnet_id" must appear in the GROUP BY clause or be used in an aggregate function
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters LINE 2: ...de, subnets.standard_attr_id AS standard_attr_id, subnet_ser...
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters ^
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters
Jul 18 03:38:15.533633 np0034701773 neutron-server[55620]: ERROR neutron.plugins.ml2.managers [None req-eb023ace-8737-4e89-9557-e058931a7892 demo demo] Mechanism driver 'ovn' failed in create_network_postcommit: oslo_db.exception.DBError: (psycopg2.errors.GroupingError) column "subnet_service_types.subnet_id" must appear in the GROUP BY clause or be used in an aggregate function
Jul 18 03:38:15.533633 np0034701773 neutron-server[55620]: LINE 2: ...de, subnets.standard_attr_id AS standard_attr_id, subnet_ser...
Jul 18 03:38:15.533633 np0034701773 neutron-server[55620]: ^
Jul 18 03:38:15.533860 np0034701773 neutron-server[55620]: [SQL: SELECT anon_1.project_id AS anon_1_project_id, anon_1.id AS anon_1_id,
    anon_1.name AS anon_1_name, anon_1.network_id AS anon_1_network_id, anon_1.segment_id AS anon_1_segment_id,
    anon_1.subnetpool_id AS anon_1_subnetpool_id, anon_1.ip_version AS anon_1_ip_version, anon_1.cidr AS anon_1_cidr,
    anon_1.gateway_ip AS anon_1_gateway_ip, anon_1.enable_dhcp AS anon_1_enable_dhcp,
    anon_1.ipv6_ra_mode AS anon_1_ipv6_ra_mode, anon_1.ipv6_address_mode AS anon_1_ipv6_address_mode,
    anon_1.standard_attr_id AS anon_1_standard_attr_id, subnetpools_1.shared AS subnetpools_1_shared,
    subnetpoolrbacs_1.project_id AS subnetpoolrbacs_1_project_id, subnetpoolrbacs_1.id AS subnetpoolrbacs_1_id,
    subnetpoolrbacs_1.target_project AS subnetpoolrbacs_1_target_project,
    subnetpoolrbacs_1.action AS subnetpoolrbacs_1_action, subnetpoolrbacs_1.object_id AS subnetpoolrbacs_1_object_id,
    standardattributes_1.id AS standardattributes_1_id,
    standardattributes_1.resource_type AS standardattributes_1_resource_type,
    standardattributes_1.description AS standardattributes_1_description,
    standardattributes_1.revision_number AS standardattributes_1_revision_number,
    standardattributes_1.created_at AS standardattributes_1_created_at,
    standardattributes_1.updated_at AS standardattributes_1_updated_at,
    subnetpools_1.project_id AS subnetpools_1_project_id, subnetpools_1.id AS subnetpools_1_id,
    subnetpools_1.name AS subnetpools_1_name, subnetpools_1.ip_version AS subnetpools_1_ip_version,
    subnetpools_1.default_prefixlen AS subnetpools_1_default_prefixlen,
    subnetpools_1.min_prefixlen AS subnetpools_1_min_prefixlen,
    subnetpools_1.max_prefixlen AS subnetpools_1_max_prefixlen,
    subnetpools_1.is_default AS subnetpools_1_is_default,
    subnetpools_1.default_quota AS subnetpools_1_default_quota, subnetpools_1.hash AS subnetpools_1_hash,
    subnetpools_1.address_scope_id AS
```
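The GroupingError above comes from PostgreSQL enforcing the SQL-standard rule that every selected column must either appear in the GROUP BY clause or sit inside an aggregate function, a rule MySQL has historically been laxer about. The sketch below illustrates the rule on a toy schema with hypothetical rows (SQLite is used here only to keep the example runnable; it, like lax MySQL modes, would also accept the form Postgres rejects):

```python
import sqlite3

# Toy schema echoing the tables in the failing query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subnets (id TEXT, network_id TEXT);
CREATE TABLE subnet_service_types (subnet_id TEXT, service_type TEXT);
INSERT INTO subnets VALUES ('s1', 'n1'), ('s2', 'n1');
INSERT INTO subnet_service_types VALUES ('s1', 'compute:nova');
""")

# PostgreSQL rejects selecting subnet_service_types.subnet_id bare while
# grouping only by subnets.id; the portable form wraps it in an aggregate:
rows = conn.execute("""
SELECT subnets.id, COUNT(subnet_service_types.subnet_id)
FROM subnets
LEFT JOIN subnet_service_types
  ON subnet_service_types.subnet_id = subnets.id
GROUP BY subnets.id
ORDER BY subnets.id
""").fetchall()

print(rows)  # prints: [('s1', 1), ('s2', 0)]
```

In other words, the query neutron generated in create_network_postcommit selected subnet_service_types columns without aggregating or grouping them, which only PostgreSQL among the gate databases refuses.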