[Yahoo-eng-team] [Bug 2028112] [NEW] Unable to create VM when using the sriov agent with the ml2/ovn driver.

2023-07-18 Thread kyle yoon
Public bug reported:

I am planning to operate nodes using HWOL with the OVN controller alongside nodes using SR-IOV in the same OVN environment.

The test setup is as follows.

** Controller server **

The ml2 mechanism_drivers were specified as follows:

```
[ml2]
mechanism_drivers = sriovnicswitch,ovn
```

Upon checking the log, the driver was confirmed to be loaded normally.

```
2023-07-19 00:44:37.403 1697414 INFO neutron.plugins.ml2.managers [-] Configured mechanism driver names: ['ovn', 'sriovnicswitch']
2023-07-19 00:44:37.464 1697414 INFO neutron.plugins.ml2.managers [-] Loaded mechanism driver names: ['ovn', 'sriovnicswitch']
2023-07-19 00:44:37.464 1697414 INFO neutron.plugins.ml2.managers [-] Registered mechanism drivers: ['ovn', 'sriovnicswitch']
2023-07-19 00:44:37.464 1697414 INFO neutron.plugins.ml2.managers [-] No mechanism drivers provide segment reachability information for agent scheduling.
2023-07-19 00:44:38.358 1697414 INFO neutron.plugins.ml2.managers [req-bc634856-2d9a-44d0-ae0e-351b440a2a0b - - - - -] Initializing mechanism driver 'ovn'
2023-07-19 00:44:38.378 1697414 INFO neutron.plugins.ml2.managers [req-bc634856-2d9a-44d0-ae0e-351b440a2a0b - - - - -] Initializing mechanism driver 'sriovnicswitch'
```

** Compute **

nova.conf

```
[pci]
passthrough_whitelist = { "devname": "enp94s0f1np1", "physical_network": "physnet1" }
```

plugin/ml2/sriov-agent.ini
```
[DEFAULT]
debug = true

[securitygroup]
firewall_driver = neutron.agent.firewall.NoopFirewallDriver

[sriov_nic]
physical_device_mappings = physnet1:enp94s0f1np1
```

Neutron Agent status
```
+--------------------------------------+------------------------------+---------------+-------------------+-------+-------+----------------------------+
| ID                                   | Agent Type                   | Host          | Availability Zone | Alive | State | Binary                     |
+--------------------------------------+------------------------------+---------------+-------------------+-------+-------+----------------------------+
| 24e9395c-379f-4afd-aa84-ae0d970794ff | NIC Switch agent             | Qacloudhost06 | None              | :-)   | UP    | neutron-sriov-nic-agent    |
| 43ba481c-c0f2-49bc-a34a-c94faa284ac7 | NIC Switch agent             | Qaamdhost02   | None              | :-)   | UP    | neutron-sriov-nic-agent    |
| 4c1a6c78-e58a-48d9-aa4a-abdf44d2f359 | NIC Switch agent             | Qacloudhost07 | None              | :-)   | UP    | neutron-sriov-nic-agent    |
| 534f0946-6eb3-491f-a57d-65cbc0133399 | NIC Switch agent             | Qacloudhost02 | None              | :-)   | UP    | neutron-sriov-nic-agent    |
| 2275f9d4-7c69-51db-ae71-b6e0be15e9b8 | OVN Metadata agent           | Qacloudhost05 |                   | :-)   | UP    | neutron-ovn-metadata-agent |
| 92a7b8dc-e122-49c8-a3bc-ae6a38b56cc0 | OVN Controller Gateway agent | Qacloudhost05 |                   | :-)   | UP    | ovn-controller             |
| c3a1e8fe-8669-5e7a-a3d7-3a2b638fae26 | OVN Metadata agent           | Qaamdhost02   |                   | :-)   | UP    | neutron-ovn-metadata-agent |
| d203ff10-0835-4d7e-bc63-5ff274ade5a3 | OVN Controller agent         | Qaamdhost02   |                   | :-)   | UP    | ovn-controller             |
| fc4c5075-9b44-5c21-a24d-f86dfd0009f9 | OVN Metadata agent           | Qacloudhost02 |                   | :-)   | UP    | neutron-ovn-metadata-agent |
| bed0-1519-47f8-b52f-3a9116e1408f     | OVN Controller Gateway agent | Qacloudhost02 |                   | :-)   | UP    | ovn-controller             |
+--------------------------------------+------------------------------+---------------+-------------------+-------+-------+----------------------------+
```

When creating a VM, the Neutron server logs the following error:
```
2023-07-19 02:44:30.463 1725695 ERROR neutron.plugins.ml2.managers [req-9204d6f7-ddc3-44e2-878c-bfa9c3f761ef fbec686e249e4818be7a686833140326 7a4dd87db099460795d775b055a648ea - default default] Mechanism driver 'ovn' failed in update_port_precommit: neutron_lib.exceptions.InvalidInput: Invalid input for operation: Invalid binding:profile. too many parameters.
2023-07-19 02:44:30.463 1725695 ERROR neutron.plugins.ml2.managers Traceback (most recent call last):
2023-07-19 02:44:30.463 1725695 ERROR neutron.plugins.ml2.managers   File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/managers.py", line 482, in _call_on_drivers
2023-07-19 02:44:30.463 1725695 ERROR neutron.plugins.ml2.managers     getattr(driver.obj, method_name)(context)
2023-07-19 02:44:30.463 1725695 ERROR neutron.plugins.ml2.managers   File "/usr/lib/python3/dist-packages/neutron/plugins/ml2/drivers/ovn/mech_driver/mech_driver.py", line 792, in update_port_precommit
2023-07-19 02:44:30.463 1725695 ERROR neutron.plugins.ml2.managers     ovn_utils.validate_and_get_data_from_binding_profile(port)
2023-07-19 02:44:30.463 1725695 ERROR
```
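For context, the traceback points at ovn_utils.validate_and_get_data_from_binding_profile(). The sketch below is illustrative only, not the actual neutron code: the allowed-key set and the sample port are assumptions, but it shows how a strict binding:profile whitelist rejects an SR-IOV port whose profile carries PCI details.

```
# Illustrative sketch only -- not the real neutron/OVN validation code.
# It mimics a strict binding:profile whitelist of the kind applied in
# ovn_utils.validate_and_get_data_from_binding_profile().  The allowed
# keys and the sample port below are assumptions for this example.

ALLOWED_PROFILE_KEYS = {"parent_name", "tag"}   # hypothetical whitelist


def validate_binding_profile(port):
    profile = port.get("binding:profile") or {}
    unexpected = set(profile) - ALLOWED_PROFILE_KEYS
    if unexpected:
        # The real code raises neutron_lib.exceptions.InvalidInput, which is
        # rendered as "Invalid input for operation: Invalid binding:profile.
        # too many parameters."
        raise ValueError("Invalid binding:profile. too many parameters.")
    return profile


# A typical SR-IOV "direct" port carries PCI details in its binding profile,
# so a whitelist like the one above rejects it:
sriov_port = {
    "binding:vnic_type": "direct",
    "binding:profile": {
        "pci_slot": "0000:5e:00.1",        # example values only
        "pci_vendor_info": "15b3:1018",
        "physical_network": "physnet1",
    },
}

if __name__ == "__main__":
    validate_binding_profile(sriov_port)   # raises ValueError as above
```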

[Yahoo-eng-team] [Bug 1996594] Re: OVN metadata randomly stops working

2023-07-18 Thread yatin
There have been many fixes related to the reported issue in neutron and OVS since this bug report; some that I quickly caught are:
- https://patchwork.ozlabs.org/project/openvswitch/patch/20220819230810.2626573-1-i.maxim...@ovn.org/
- https://review.opendev.org/c/openstack/ovsdbapp/+/856200
- https://review.opendev.org/c/openstack/ovsdbapp/+/862524
- https://review.opendev.org/c/openstack/neutron/+/857775
- https://review.opendev.org/c/openstack/neutron/+/871825

Closing it based on the above and Comment #5. If the issues are still seen with python-ovs>=2.17 and the above fixes included, please feel free to reopen the issue along with ovsdb-server SB logs and neutron-server and metadata-agent debug logs.

** Bug watch added: Red Hat Bugzilla #2214289
   https://bugzilla.redhat.com/show_bug.cgi?id=2214289

** Changed in: neutron
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1996594

Title:
  OVN metadata randomly stops working

Status in neutron:
  Fix Released

Bug description:
  We found that OVN metadata will randomly stop working when OVN is writing a snapshot.

  1, At 12:30:35, OVN started to transfer leadership to write a snapshot

  $ find sosreport-juju-2752e1-*/var/log/ovn/* |xargs zgrep -i -E 'Transferring leadership'
  sosreport-juju-2752e1-6-lxd-24-xxx-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.322Z|80962|raft|INFO|Transferring leadership to write a snapshot.
  sosreport-juju-2752e1-6-lxd-24-xxx-2022-08-18-entowko/var/log/ovn/ovsdb-server-sb.log:2022-08-18T17:52:53.024Z|82382|raft|INFO|Transferring leadership to write a snapshot.
  sosreport-juju-2752e1-7-lxd-27-xxx-2022-08-18-hhxxqci/var/log/ovn/ovsdb-server-sb.log:2022-08-18T12:30:35.330Z|92698|raft|INFO|Transferring leadership to write a snapshot.

  2, At 12:30:36, neutron-ovn-metadata-agent reported OVSDB Error

  $ find sosreport-srv1*/var/log/neutron/* |xargs zgrep -i -E 'OVSDB Error'
  sosreport-srv1xxx2d-xxx-2022-08-18-cuvkufw/var/log/neutron/neutron-ovn-metadata-agent.log:2022-08-18 12:30:36.103 75556 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: no error details available
  sosreport-srv1xxx6d-xxx-2022-08-18-bgnovqu/var/log/neutron/neutron-ovn-metadata-agent.log:2022-08-18 12:30:36.104 2171 ERROR ovsdbapp.backend.ovs_idl.transaction [-] OVSDB Error: no error details available

  3, At 12:57:53, we saw the error 'No port found in network'; after that we hit the problem where OVN metadata randomly stops working

  2022-08-18 12:57:53.800 3730 ERROR neutron.agent.ovn.metadata.server
  [-] No port found in network 63e2c276-60dd-40e3-baa1-c16342eacce2 with
  IP address 100.94.98.135

  After the problem occurs, restarting neutron-ovn-metadata-agent or restarting the haproxy instance as follows can be used as a workaround.

  /usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ip netns exec ovnmeta-63e2c276-60dd-40e3-baa1-c16342eacce2 haproxy -f /var/lib/neutron/ovn-metadata-proxy/63e2c276-60dd-40e3-baa1-c16342eacce2.conf

  One LP bug #1990978 [1] is trying to reduce the frequency of the transfers, which should help with this problem.
  But it only reduces the occurrence of the problem rather than completely avoiding it. I wonder if we need to add some retry logic on the neutron side
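  As a rough illustration of what such retry logic could look like (a sketch under assumptions, not existing neutron code; the exception class and the lookup call are placeholders):

```
# Sketch of client-side retry around an OVSDB lookup, as suggested above.
# Not existing neutron code; the exception types and the lookup function in
# the usage note are placeholders/assumptions.
import time


def retry_on_ovsdb_error(func, attempts=3, delay=1.0, retry_on=(Exception,)):
    """Call func(); retry up to `attempts` times on the given exceptions,
    sleeping with a simple linear backoff between attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)


# Hypothetical usage around the southbound lookup the metadata agent performs
# when it logs "No port found in network ... with IP address ...":
#
#   ports = retry_on_ovsdb_error(
#       lambda: sb_idl.get_network_port_bindings_by_ip(network_id, ip),
#       retry_on=(SomeOvsdbError,))  # real code would catch the OVSDB error class
```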

  NOTE: The openstack version we are using is focal-xena, and
  openvswitch's version is 2.16.0-0ubuntu2.1~cloud0

  [1] https://bugs.launchpad.net/ubuntu/+source/openvswitch/+bug/1990978

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1996594/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp


[Yahoo-eng-team] [Bug 1969592] Re: [OVN] Frequent DB leader changes causes 'VIF creation failed' on nova side

2023-07-18 Thread yatin
There have been many fixes related to the reported issue in neutron and OVS since this bug report; some that I quickly caught are:
- https://patchwork.ozlabs.org/project/openvswitch/patch/20220819230810.2626573-1-i.maxim...@ovn.org/
- https://review.opendev.org/c/openstack/ovsdbapp/+/856200
- https://review.opendev.org/c/openstack/ovsdbapp/+/862524
- https://review.opendev.org/c/openstack/neutron/+/857775
- https://review.opendev.org/c/openstack/neutron/+/871825

One issue that I know is still open, but that happens very rarely:
https://bugzilla.redhat.com/show_bug.cgi?id=2214289

Closing it based on the above and Comments #2 and #3. If the issues are still seen with python-ovs>=2.17 and the above fixes included, please feel free to reopen the issue along with neutron and nova debug logs.


** Bug watch added: Red Hat Bugzilla #2214289
   https://bugzilla.redhat.com/show_bug.cgi?id=2214289

** Changed in: neutron
   Status: New => Fix Released

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1969592

Title:
  [OVN] Frequent DB leader changes causes 'VIF creation failed' on nova
  side

Status in neutron:
  Fix Released
Status in OpenStack Compute (nova):
  Incomplete

Bug description:
  Hello guys.

  As I found, in 2021 there was a commit to OVS [in OVS v2.16, or 2.15 with that backport] that changed the behavior during OVN DB snapshotting.
  Now, before the leader creates a snapshot, it will transfer leadership to another node.

  We've run tests with rally and tempest, and it looks like there is now a problem in the interaction between nova and neutron.
  For example, a simple rally test like 'create {network,router,subnet} -> add interface to router' looks okay even with 256 concurrent identical tests/threads. But something like 'neutron.create_subnet -> nova.boot_server -> nova.attach_interface' will fail whenever a leadership transfer happens; the sequence is sketched below.
  Since it happens so often [each 10m + rand(10)] we get a lot of errors.
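  For reference, that failing sequence maps roughly onto the openstacksdk calls below (a sketch only: the cloud name, image ID and flavor ID are placeholders, and this is not the rally scenario code):

```
# Rough sketch of the 'create_subnet -> boot_server -> attach_interface'
# sequence described above, using openstacksdk.  Cloud name, image ID and
# flavor ID are placeholders; this is illustrative, not the rally code.
import openstack

conn = openstack.connect(cloud="mycloud")          # placeholder cloud name

net = conn.network.create_network(name="rally-net")
subnet = conn.network.create_subnet(
    network_id=net.id, name="rally-subnet", ip_version=4, cidr="10.10.0.0/24")

server = conn.compute.create_server(
    name="rally-vm",
    image_id="IMAGE_ID",       # placeholder
    flavor_id="FLAVOR_ID",     # placeholder
    networks=[{"uuid": net.id}])
server = conn.compute.wait_for_server(server)

# This attach step is the one that tends to fail while an OVN SB leadership
# transfer is in progress ('VIF creation failed' on the nova side).
iface = conn.compute.create_server_interface(server, net_id=net.id)
print(iface.port_id)
```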

  This problem can be observed on all versions where OVS 2.16 [or the backport] or higher is involved :)

  
  Some tracing from logs [neutron, nova, ovn-sb-db]:

  CONTROL-NODES:

  
  ctl01-ovn-sb-db.log:2022-04-19T12:30:03.089Z|01002|raft|INFO|Transferring leadership to write a snapshot.
  ctl01-ovn-sb-db.log:2022-04-19T12:30:03.099Z|01003|raft|INFO|server 1c5f is leader for term 42

  ctl03-ovn-sb-db.log:2022-04-19T12:30:03.090Z|00938|raft|INFO|received leadership transfer from 1f46 in term 41
  ctl03-ovn-sb-db.log:2022-04-19T12:30:03.092Z|00940|raft|INFO|term 42: elected leader by 2+ of 3 servers
  ctl03-ovn-sb-db.log:2022-04-19T12:30:10.941Z|00941|jsonrpc|WARN|tcp:xx.yy.zz.26:41882: send error: Connection reset by peer
  ctl03-ovn-sb-db.log:2022-04-19T12:30:27.324Z|00943|jsonrpc|WARN|tcp:xx.yy.zz.26:41896: send error: Connection reset by peer
  ctl03-ovn-sb-db.log:2022-04-19T12:30:27.325Z|00945|jsonrpc|WARN|tcp:xx.yy.zz.26:41880: send error: Connection reset by peer
  ctl03-ovn-sb-db.log:2022-04-19T12:30:27.325Z|00947|jsonrpc|WARN|tcp:xx.yy.zz.26:41892: send error: Connection reset by peer
  ctl03-ovn-sb-db.log:2022-04-19T12:30:27.327Z|00949|jsonrpc|WARN|tcp:xx.yy.zz.26:41884: send error: Connection reset by peer
  ctl03-ovn-sb-db.log:2022-04-19T12:31:49.244Z|00951|jsonrpc|WARN|tcp:xx.yy.zz.25:40260: send error: Connection timed out
  ctl03-ovn-sb-db.log:2022-04-19T12:31:49.244Z|00953|jsonrpc|WARN|tcp:xx.yy.zz.25:40264: send error: Connection timed out
  ctl03-ovn-sb-db.log:2022-04-19T12:31:49.244Z|00955|jsonrpc|WARN|tcp:xx.yy.zz.24:37440: send error: Connection timed out
  ctl03-ovn-sb-db.log:2022-04-19T12:31:49.244Z|00957|jsonrpc|WARN|tcp:xx.yy.zz.24:37442: send error: Connection timed out
  ctl03-ovn-sb-db.log:2022-04-19T12:31:49.245Z|00959|jsonrpc|WARN|tcp:xx.yy.zz.24:37446: send error: Connection timed out
  ctl03-ovn-sb-db.log:2022-04-19T12:32:01.533Z|01001|jsonrpc|WARN|tcp:xx.yy.zz.67:57586: send error: Connection timed out

  
  2022-04-19 12:30:08.898 27 INFO neutron.db.ovn_revision_numbers_db [req-7fcfdd74-482d-46b2-9f76-07190669d76d ff1516be452b4b939314bf3864a63f35 9d3ae9a7b121488285203b0fdeabc3a3 - default default] Successfully bumped revision number for resource be178a9a-26d7-4bf0-a4e8-d206a6965205 (type: ports) to 1

  2022-04-19 12:30:09.644 27 INFO neutron.db.ovn_revision_numbers_db [req-a8278418-3ad9-450c-89bb-e7a5c1c0a06d a9864cd890224c079051b3f56021be64 72db34087b9b401d842b66643b647e16 - default default] Successfully bumped revision number for resource be178a9a-26d7-4bf0-a4e8-d206a6965205 (type: ports) to 2

  
  2022-04-19 12:30:10.235 27 INFO neutron.wsgi [req-571b53cc-ca04-46f7-89f9-fdf8e5931f4c a9864cd890224c079051b3f56021be64 72db34087b9b401d842b66643b647e16 - default default] xx.yy.zz.68,xx.yy.zz.26 "GET /v2.0/ports?tenant_id=9d3ae9a7b121488285203b0fdeabc3a3_id=7560fbb7-3ec7-41ef-b7a5-5e955ca4ff34

[Yahoo-eng-team] [Bug 2028037] [NEW] Error 500 on create network when PostgreSQL is used

2023-07-18 Thread Slawek Kaplonski
Public bug reported:

Error seen in the periodic job neutron-ovn-tempest-postgres-full every day since 5.07.2023: https://zuul.openstack.org/builds?job_name=neutron-ovn-tempest-postgres-full=openstack%2Fneutron=master=0

Error from neutron server log:
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: WARNING oslo_db.sqlalchemy.exc_filters [None req-eb023ace-8737-4e89-9557-e058931a7892 demo demo] DBAPIError exception wrapped.: psycopg2.errors.GroupingError: column "subnet_service_types.subnet_id" must appear in the GROUP BY clause or be used in an aggregate function
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: LINE 2: ...de, subnets.standard_attr_id AS standard_attr_id, subnet_ser...
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ^
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters Traceback (most recent call last):
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters     self.dialect.do_execute(
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters   File "/usr/local/lib/python3.10/dist-packages/sqlalchemy/engine/default.py", line 736, in do_execute
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters     cursor.execute(statement, parameters)
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters psycopg2.errors.GroupingError: column "subnet_service_types.subnet_id" must appear in the GROUP BY clause or be used in an aggregate function
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters LINE 2: ...de, subnets.standard_attr_id AS standard_attr_id, subnet_ser...
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters ^
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters
Jul 18 03:38:15.524757 np0034701773 neutron-server[55620]: ERROR oslo_db.sqlalchemy.exc_filters
Jul 18 03:38:15.533633 np0034701773 neutron-server[55620]: ERROR neutron.plugins.ml2.managers [None req-eb023ace-8737-4e89-9557-e058931a7892 demo demo] Mechanism driver 'ovn' failed in create_network_postcommit: oslo_db.exception.DBError: (psycopg2.errors.GroupingError) column "subnet_service_types.subnet_id" must appear in the GROUP BY clause or be used in an aggregate function
Jul 18 03:38:15.533633 np0034701773 neutron-server[55620]: LINE 2: ...de, subnets.standard_attr_id AS standard_attr_id, subnet_ser...
Jul 18 03:38:15.533633 np0034701773 neutron-server[55620]: ^
Jul 18 03:38:15.533860 np0034701773 neutron-server[55620]: [SQL: SELECT 
anon_1.project_id AS anon_1_project_id, anon_1.id AS anon_1_id, anon_1.name AS 
anon_1_name, anon_1.network_id AS anon_1_network_id, anon_1.segment_id AS 
anon_1_segment_id, anon_1.subnetpool_id AS anon_1_subnetpool_id, 
anon_1.ip_version AS anon_1_ip_version, anon_1.cidr AS anon_1_cidr, 
anon_1.gateway_ip AS anon_1_gateway_ip, anon_1.enable_dhcp AS 
anon_1_enable_dhcp, anon_1.ipv6_ra_mode AS anon_1_ipv6_ra_mode, 
anon_1.ipv6_address_mode AS anon_1_ipv6_address_mode, anon_1.standard_attr_id 
AS anon_1_standard_attr_id, subnetpools_1.shared AS subnetpools_1_shared, 
subnetpoolrbacs_1.project_id AS subnetpoolrbacs_1_project_id, 
subnetpoolrbacs_1.id AS subnetpoolrbacs_1_id, subnetpoolrbacs_1.target_project 
AS subnetpoolrbacs_1_target_project, subnetpoolrbacs_1.action AS 
subnetpoolrbacs_1_action, subnetpoolrbacs_1.object_id AS 
subnetpoolrbacs_1_object_id, standardattributes_1.id AS 
standardattributes_1_id, standardattributes
 _1.resource_type AS standardattributes_1_resource_type, 
standardattributes_1.description AS standardattributes_1_description, 
standardattributes_1.revision_number AS standardattributes_1_revision_number, 
standardattributes_1.created_at AS standardattributes_1_created_at, 
standardattributes_1.updated_at AS standardattributes_1_updated_at, 
subnetpools_1.project_id AS subnetpools_1_project_id, subnetpools_1.id AS 
subnetpools_1_id, subnetpools_1.name AS subnetpools_1_name, 
subnetpools_1.ip_version AS subnetpools_1_ip_version, 
subnetpools_1.default_prefixlen AS subnetpools_1_default_prefixlen, 
subnetpools_1.min_prefixlen AS subnetpools_1_min_prefixlen, 
subnetpools_1.max_prefixlen AS subnetpools_1_max_prefixlen, 
subnetpools_1.is_default AS subnetpools_1_is_default, 
subnetpools_1.default_quota AS subnetpools_1_default_quota, subnetpools_1.hash 
AS subnetpools_1_hash, subnetpools_1.address_scope_id AS
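The SQL statement above is truncated in the original log. As a minimal illustration of the error class only (not the actual neutron query; the tables below are simplified stand-ins), PostgreSQL rejects any selected column that is neither listed in GROUP BY nor wrapped in an aggregate function:

```
# Minimal illustration of the GroupingError above, not the actual neutron
# query.  Table and column definitions are simplified stand-ins.
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql

metadata = sa.MetaData()
subnets = sa.Table(
    "subnets", metadata,
    sa.Column("id", sa.String, primary_key=True),
    sa.Column("standard_attr_id", sa.Integer),
)
subnet_service_types = sa.Table(
    "subnet_service_types", metadata,
    sa.Column("subnet_id", sa.String),
    sa.Column("service_type", sa.String),
)

join = subnets.join(subnet_service_types,
                    subnets.c.id == subnet_service_types.c.subnet_id)

# PostgreSQL rejects this: subnet_service_types.subnet_id is selected but is
# neither listed in GROUP BY nor wrapped in an aggregate function.
bad = (sa.select(subnets.c.id, subnet_service_types.c.subnet_id)
       .select_from(join)
       .group_by(subnets.c.id))

# Fix option 1: also group by the extra column.
ok_grouped = bad.group_by(subnet_service_types.c.subnet_id)

# Fix option 2: aggregate the extra column instead.
ok_aggregated = (sa.select(subnets.c.id,
                           sa.func.count(subnet_service_types.c.subnet_id))
                 .select_from(join)
                 .group_by(subnets.c.id))

# Compiling against the PostgreSQL dialect shows the generated SQL without
# needing a live database.
print(bad.compile(dialect=postgresql.dialect()))
```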