Re: [ovs-discuss] ovs-vswitchd crashes several times a day

2023-04-17 Thread Plato, Michael via discuss
Hi Paolo,
I installed the patch for 2.17 on April 6th in our test environment and can 
confirm that it works. We haven't had any crashes since then. Many thanks for 
the quick solution!

Best regards

Michael

-----Original Message-----
From: Paolo Valerio  
Sent: Monday, April 17, 2023 10:36
To: Lazuardi Nasution 
Cc: ovs-discuss@openvswitch.org; Plato, Michael 
Subject: Re: Re: [ovs-discuss] ovs-vswitchd crashes several times a day

Lazuardi Nasution  writes:

> Hi Paolo,
>
> I'm interested in your statement of "expired connections (but not yet 
> reclaimed)". Do you think that shortening the conntrack timeout policy will help?
> Or, should we make it larger so there will be fewer conntrack table 
> updates and flush attempts?
>

It's hard to say, as it depends on the specific use case.
Making it larger could probably help in this specific case, but in general, I 
would not rely on that.
Of course, an actual fix is needed. It would be great if the patch I sent could 
be tested, but in any case, I'll work on a formal patch.
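
For reference, per-zone conntrack timeouts in the userspace datapath can be
tuned without patching, via timeout policies. A minimal sketch, assuming the
default userspace datapath name "netdev" and conntrack zone 2 (the zone seen
in the crash reports later in this digest); see ovs-vsctl(8) for the exact
keys your build supports:

# Shorten the TCP timeouts (in seconds) for conntrack zone 2:
ovs-vsctl add-zone-tp netdev zone=2 tcp_syn_sent=30 tcp_time_wait=15
# Inspect or remove the policy again:
ovs-vsctl list-zone-tp netdev
ovs-vsctl del-zone-tp netdev zone=2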

> Best regards.
>
> On Wed, Apr 5, 2023 at 2:51 AM Paolo Valerio  wrote:
>
> Hello,
>
> thanks for reporting this.
> I had a look at it, and, although this needs to be confirmed, I suspect
> it's related to nat (CT_CONN_TYPE_UN_NAT) and expired connections (but
> not yet reclaimed).
>
> The nat part does not necessarily perform any actual translation, but
> could still be triggered by ct(nat(src)...) which is the all-zero binding
> to avoid collisions, if any.
>
> Is there any chance to test the following patch (targeted for ovs 2.17)?
> This should help to confirm.
>
> -- >8 --
> diff --git a/lib/conntrack.c b/lib/conntrack.c
> index 08da4ddf7..ba334afb0 100644
> --- a/lib/conntrack.c
> +++ b/lib/conntrack.c
> @@ -94,9 +94,8 @@ static bool valid_new(struct dp_packet *pkt, struct conn_key *);
>  static struct conn *new_conn(struct conntrack *ct, struct dp_packet *pkt,
>                               struct conn_key *, long long now,
>                               uint32_t tp_id);
> -static void delete_conn_cmn(struct conn *);
> +static void delete_conn__(struct conn *);
>  static void delete_conn(struct conn *);
> -static void delete_conn_one(struct conn *conn);
>  static enum ct_update_res conn_update(struct conntrack *ct, struct conn *conn,
>                                        struct dp_packet *pkt,
>                                        struct conn_lookup_ctx *ctx,
> @@ -444,14 +443,13 @@ zone_limit_delete(struct conntrack *ct, uint16_t zone)
>  }
>
>  static void
> -conn_clean_cmn(struct conntrack *ct, struct conn *conn)
> +conn_clean_cmn(struct conntrack *ct, struct conn *conn, uint32_t hash)
>      OVS_REQUIRES(ct->ct_lock)
>  {
>      if (conn->alg) {
>          expectation_clean(ct, &conn->key);
>      }
>
> -    uint32_t hash = conn_key_hash(&conn->key, ct->hash_basis);
>      cmap_remove(&ct->conns, &conn->cm_node, hash);
>
>      struct zone_limit *zl = zone_limit_lookup(ct, conn->admit_zone);
> @@ -467,11 +465,14 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>      OVS_REQUIRES(ct->ct_lock)
>  {
>      ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
> +    uint32_t conn_hash = conn_key_hash(&conn->key, ct->hash_basis);
>
> -    conn_clean_cmn(ct, conn);
> +    conn_clean_cmn(ct, conn, conn_hash);
>      if (conn->nat_conn) {
>          uint32_t hash = conn_key_hash(&conn->nat_conn->key, ct->hash_basis);
> -        cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
> +        if (conn_hash != hash) {
> +            cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
> +        }
>      }
>      ovs_list_remove(&conn->exp_node);
>      conn->cleaned = true;
> @@ -479,19 +480,6 @@ conn_clean(struct conntrack *ct, struct conn *conn)
>      atomic_count_dec(&ct->n_conn);
>  }
>
> -static void
> -conn_clean_one(struct conntrack *ct, struct conn *conn)
> -    OVS_REQUIRES(ct->ct_lock)
> -{
> -    conn_clean_cmn(ct, conn);
> -    if (conn->conn_type == CT_CONN_TYPE_DEFAULT) {
> -        ovs_list_remove(&conn->exp_node);
> -        conn->cleaned = true;
> -        atomic_count_dec(&ct->n_conn);
> -    }
> -    ovsrcu_postpone(delete_conn_one, conn);
> -}
> -
>  /* Destroys the connection tracker 'ct' and frees all the allocated memory.
>   * The caller of this function must already have shut down packet input
>   * and PMD threads (which would have been quiesced).  */
> @@ -505,7 +493,10 @@ conntrack_destroy(struct conntrack *ct)
>
>      ovs_mutex_lock(&ct->ct_lock);
>      CMAP_FOR_EACH (conn, cm_node, &ct->conns) {
> -        conn_clean_one(ct, conn);
> +        if (conn->conn_type == CT_CONN_TYPE_UN_NAT) {
> +            continue;
> +        }
> +        conn_clean(ct, conn);

Re: [ovs-discuss] Problem with ovn and neutron dynamic routing

2023-04-05 Thread Plato, Michael via discuss
The priority for the snat out rule of the router is different from 
that of the FIP.
You can change this value so that the snat rule for the router IP occurs first! 
This sounds very strange, but since n-d-r advertises the FIP address using the 
router's IP as the next-hop, it can be an alternative for this specific case.

Please open a bug on Launchpad.


Regards,
Roberto

On Tue, Apr 4, 2023 at 04:11, Plato, Michael 
<michael.pl...@tu-berlin.de> wrote:
Hi,
I managed to create a working setup by omitting this flow for bgp routed 
networks 
(https://github.com/ovn-org/ovn/blob/branch-22.03/northd/northd.c#L13234). It 
is also important to keep snat enabled in the openstack router, otherwise no 
communication between a floating ip and a routed tenant network ip on the same 
network will be possible. But so far I have no idea how to decide in northd 
whether it is a routed network or not. From my point of view, the CMS (neutron) 
should pass this information to OVN. In my proof of concept, I excluded 
specific subnet ranges, but that's not useful for a production setup.

Best regards

Michael

From: Roberto Bartzen Acosta 
<roberto.aco...@luizalabs.com>
Sent: Wednesday, March 8, 2023 14:03
To: Lajos Katona <katonal...@gmail.com>
Cc: Plato, Michael 
<michael.pl...@tu-berlin.de>; 
ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] Problem with ovn and neutron dynamic routing

Hi Plato,

An alternative would be to segment the networks of the racks so that the next 
hop is announced as the IP of the segment of each rack (I'm not sure if this 
will work with OVN).
Take a look at this doc [2].

[2] 
https://docs.openstack.org/neutron/latest/admin/config-bgp-floating-ip-over-l2-segmented-network.html#setting-up-the-provider-subnets-for-the-bgp-next-hop-routing

On Wed, Mar 8, 2023 at 08:54, Roberto Bartzen Acosta 
<roberto.aco...@luizalabs.com> wrote:
Hey folks,

Please correct me if I'm wrong, but this problem seems related to the logical 
flow order.

How does it work when there is no n-d-r?
- the FIP traffic is redirected from the external host to the openstack 
provider network (no explicit next-hop), the path is discovered via ARP, and the 
traffic is then forwarded by the FIP's dnat_and_snat action (see the sketch 
below).

When n-d-r starts to advertise the FIPs via BGP, it announces the router's 
external IP as the FIP's next_hop [1].
The order of the logical flows must be interfering with the action performed 
(should it do a default router nat action first, or a dnat_and_snat for the FIP?).
I think you have two alternatives to test a possible fix:
1 - Look at the OVN backend in Neutron, map (Neutron/OVN) the tables used for 
the two types of traffic (NAT for routers and NAT for FIPs), and maybe change 
the priority of the flow actions (very complex).
2 - Change the n-d-r to not advertise the router's gw port IP in next_hop [1] 
(ovn case), maybe changing it to the FIP address (one would need to study how 
the bgp peer expects to receive the next_hop to compose the AS_PATH).

[1] - 
https://opendev.org/openstack/neutron-dynamic-routing/src/commit/e9529f7dc5449714c76afd7fce62f228f877/neutron_dynamic_routing/services/bgp/bgp_plugin.py#L261
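
To make the NAT entries in question concrete: in OVN, a floating IP is a
dnat_and_snat entry on the logical router, alongside the router's own snat
entry. A sketch with an invented router name and addresses, for illustration
only:

# Floating IP 203.0.113.10 for VM 10.0.0.5; this is what produces the FIP's
# dnat_and_snat logical flows discussed above.
ovn-nbctl lr-nat-add lr0 dnat_and_snat 203.0.113.10 10.0.0.5
# The router's default SNAT for the tenant subnet, by comparison:
ovn-nbctl lr-nat-add lr0 snat 203.0.113.1 10.0.0.0/24
# Inspect the resulting NAT entries:
ovn-nbctl lr-nat-list lr0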

Best regards,
Roberto

On Wed, Mar 8, 2023 at 04:25, Lajos Katona via discuss 
<ovs-discuss@openvswitch.org> wrote:
Hi,
If you feel that the OVN + Neutron + neutron-dynamic-routing combination has 
some issue, feel free to open a bug report on Launchpad:
https://bugs.launchpad.net/neutron

If it is a more complex issue, we have weekly meetings where you can ask the 
Neutron team for advice and help (we use IRC), or just write a mail to the 
OpenStack Discuss List <openstack-disc...@lists.openstack.org> with [Neutron] 
in the subject.

Best wishes
Lajos Katona


Plato, Michael via discuss 
<ovs-discuss@openvswitch.org> wrote 
(on Thu, Feb 23, 2023, 10:26):
Hello,

many thanks for the quick response. As far as I can see, the ticket is a bit 
older. Are there any ideas for a solution so far, or first patches that could be 
tested?

Best regards

Michael

From: Luis Tomas Bolivar <ltoma...@redhat.com>
Sent: Monday, February 20, 2023 10:03
To: Plato, Michael 
<michael.pl...@tu-berlin.de>
Cc: ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] Problem with ovn and neutron dynamic routing

We hit this problem a while ago and reported it here: 
https://bugzilla.redhat.com/show_bug.cgi?id=1906455

On Mon, Feb 20, 2023 at 9:56 AM Plato, Michael via discuss 
<ovs-discuss@openvswitch.org> wrote:
Hello,

we have a problem with ovn in connection with neutron dynamic routing (which is 
now supported with ovn). We can announce our internal networks via BGP, and the 
VMs in this network can also be reached directly without nat.
But if we attach a public floating ip to the internal self service network

Re: [ovs-discuss] ovs-vswitchd crashes several times a day

2023-04-05 Thread Plato, Michael via discuss
Hi Paolo,
many thanks for the patch. I'll try it asap...

Regards

Michael

-----Original Message-----
From: Paolo Valerio  
Sent: Tuesday, April 4, 2023 21:51
To: ovs-discuss@openvswitch.org
Cc: Plato, Michael ; mrxlazuar...@gmail.com
Subject: Re: Re: [ovs-discuss] ovs-vswitchd crashes several times a day

Hello,

thanks for reporting this.
I had a look at it, and, although this needs to be confirmed, I suspect it's 
related to nat (CT_CONN_TYPE_UN_NAT) and expired connections (but not yet 
reclaimed).

The nat part does not necessarily perform any actual translation, but could 
still be triggered by ct(nat(src)...) which is the all-zero binding to avoid 
collisions, if any.
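
To make the trigger concrete: a bare nat(src), with no address range, requests
exactly that all-zero binding; the connection is committed through NAT, but
nothing is rewritten unless the source tuple would collide with an existing
entry. A hypothetical minimal flow setup for illustration (bridge and table
numbers invented; not what OVN actually installs):

# Send IP traffic through conntrack first.
ovs-ofctl add-flow br-int "table=0, ip, actions=ct(table=1)"
# Commit new connections with the all-zero SNAT binding described above.
ovs-ofctl add-flow br-int \
    "table=1, ip, ct_state=+new+trk, actions=ct(commit,nat(src)),normal"
# Established traffic is translated according to the committed binding.
ovs-ofctl add-flow br-int \
    "table=1, ip, ct_state=+est+trk, actions=ct(nat),normal"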

Is there any chance to test the following patch (targeted for ovs 2.17)?
This should help to confirm.

-- >8 --
diff --git a/lib/conntrack.c b/lib/conntrack.c
index 08da4ddf7..ba334afb0 100644
--- a/lib/conntrack.c
+++ b/lib/conntrack.c
@@ -94,9 +94,8 @@ static bool valid_new(struct dp_packet *pkt, struct conn_key *);
 static struct conn *new_conn(struct conntrack *ct, struct dp_packet *pkt,
                              struct conn_key *, long long now,
                              uint32_t tp_id);
-static void delete_conn_cmn(struct conn *);
+static void delete_conn__(struct conn *);
 static void delete_conn(struct conn *);
-static void delete_conn_one(struct conn *conn);
 static enum ct_update_res conn_update(struct conntrack *ct, struct conn *conn,
                                       struct dp_packet *pkt,
                                       struct conn_lookup_ctx *ctx,
@@ -444,14 +443,13 @@ zone_limit_delete(struct conntrack *ct, uint16_t zone)
 }
 
 static void
-conn_clean_cmn(struct conntrack *ct, struct conn *conn)
+conn_clean_cmn(struct conntrack *ct, struct conn *conn, uint32_t hash)
     OVS_REQUIRES(ct->ct_lock)
 {
     if (conn->alg) {
         expectation_clean(ct, &conn->key);
     }
 
-    uint32_t hash = conn_key_hash(&conn->key, ct->hash_basis);
     cmap_remove(&ct->conns, &conn->cm_node, hash);
 
     struct zone_limit *zl = zone_limit_lookup(ct, conn->admit_zone);
@@ -467,11 +465,14 @@ conn_clean(struct conntrack *ct, struct conn *conn)
     OVS_REQUIRES(ct->ct_lock)
 {
     ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
+    uint32_t conn_hash = conn_key_hash(&conn->key, ct->hash_basis);
 
-    conn_clean_cmn(ct, conn);
+    conn_clean_cmn(ct, conn, conn_hash);
     if (conn->nat_conn) {
         uint32_t hash = conn_key_hash(&conn->nat_conn->key, ct->hash_basis);
-        cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
+        if (conn_hash != hash) {
+            cmap_remove(&ct->conns, &conn->nat_conn->cm_node, hash);
+        }
     }
     ovs_list_remove(&conn->exp_node);
     conn->cleaned = true;
@@ -479,19 +480,6 @@ conn_clean(struct conntrack *ct, struct conn *conn)
     atomic_count_dec(&ct->n_conn);
 }
 
-static void
-conn_clean_one(struct conntrack *ct, struct conn *conn)
-    OVS_REQUIRES(ct->ct_lock)
-{
-    conn_clean_cmn(ct, conn);
-    if (conn->conn_type == CT_CONN_TYPE_DEFAULT) {
-        ovs_list_remove(&conn->exp_node);
-        conn->cleaned = true;
-        atomic_count_dec(&ct->n_conn);
-    }
-    ovsrcu_postpone(delete_conn_one, conn);
-}
-
 /* Destroys the connection tracker 'ct' and frees all the allocated memory.
  * The caller of this function must already have shut down packet input
  * and PMD threads (which would have been quiesced).  */
@@ -505,7 +493,10 @@ conntrack_destroy(struct conntrack *ct)
 
     ovs_mutex_lock(&ct->ct_lock);
     CMAP_FOR_EACH (conn, cm_node, &ct->conns) {
-        conn_clean_one(ct, conn);
+        if (conn->conn_type == CT_CONN_TYPE_UN_NAT) {
+            continue;
+        }
+        conn_clean(ct, conn);
     }
     cmap_destroy(&ct->conns);
 
@@ -1052,7 +1043,10 @@ conn_not_found(struct conntrack *ct, struct dp_packet *pkt,
         nat_conn->alg = NULL;
         nat_conn->nat_conn = NULL;
         uint32_t nat_hash = conn_key_hash(&nat_conn->key, ct->hash_basis);
-        cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
+
+        if (nat_hash != ctx->hash) {
+            cmap_insert(&ct->conns, &nat_conn->cm_node, nat_hash);
+        }
     }
 
     nc->nat_conn = nat_conn;
@@ -1080,7 +1074,7 @@ conn_not_found(struct conntrack *ct, struct dp_packet *pkt,
 nat_res_exhaustion:
     free(nat_conn);
     ovs_list_remove(&nc->exp_node);
-    delete_conn_cmn(nc);
+    delete_conn__(nc);
     static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 5);
     VLOG_WARN_RL(&rl, "Unable to NAT due to tuple space exhaustion - "
                  "if DoS attack, use firewalling and/or zone partitioning.");
@@ -2549,7 +2543,7 @@ new_conn(struct conntrack *ct, struct dp_packet *pkt, struct conn_key *key,
 }
 
 static void
-delete_conn_cmn(struct conn *conn)
+delete_conn__(struct conn *conn)
 {
     free(conn->alg);
     free(conn);
@@ -2561,17 +2555,7 @@ delete_conn(struct conn *conn)
     ovs_assert(conn->conn_type == CT_CONN_TYPE_DEFAULT);
     ovs_mutex_destroy(&conn->lock);
 

Re: [ovs-discuss] ovs-vswitchd crashes several times a day

2023-04-05 Thread Plato, Michael via discuss
Hi,

yes, our k8s cluster is on the same subnet. I stopped one of the etcd nodes 
yesterday, which triggered a lot of reconnection attempts from the other cluster 
members. Still no issues so far and no ovs crashes 

Regards

Michael

From: Lazuardi Nasution 
Sent: Tuesday, April 4, 2023 09:56
To: Plato, Michael 
Cc: ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] ovs-vswitchd crashes several times a day

Hi Michael,

I assume that your k8s cluster is on the same subnet, right? Would you mind 
testing it by shutting down one of the etcd instances and seeing if this bug 
still exists?

Best regards.

On Tue, Apr 4, 2023 at 2:50 PM Plato, Michael 
<michael.pl...@tu-berlin.de> wrote:
Hi,
from my perspective the patch works for all cases. My test environment runs 
with several k8s clusters and I haven't noticed any etcd failures so far.

Best regards

Michael

From: Lazuardi Nasution <mrxlazuar...@gmail.com>
Sent: Tuesday, April 4, 2023 09:41
To: Plato, Michael 
<michael.pl...@tu-berlin.de>
Cc: ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] ovs-vswitchd crashes several times a day

Hi Michael,

Is your patch working for same-subnet unreachable traffic too? In my case, 
crashes happen when there are too many unreachable replies, even from the same 
subnet. For example, when one of the etcd instances is down, there will be huge 
numbers of reconnection attempts and then unreachable replies from the 
destination VM where the down etcd instance exists.

Best regards.

On Tue, Apr 4, 2023 at 1:06 PM Plato, Michael 
<michael.pl...@tu-berlin.de> wrote:
Hi,
I have some news on this topic. Unfortunately I could not find the root cause, 
but I managed to implement a workaround (see patch in attachment). The basic 
idea is to mark the nat flows as invalid if there is no longer an associated 
connection. From my point of view it is a race condition, and it can be triggered 
by many short-lived connections. With the patch we no longer have any crashes. 
I can't say if it has any negative effects though, as I'm not an expert; so far 
I haven't found any problems at least. Without this patch we had hundreds of 
crashes a day :/
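
The attachment itself is not preserved in this archive, so the following is
only a rough sketch of the idea, using names from lib/conntrack.c in OVS 2.17
but with the exact placement and details guessed; it is not the submitted
patch:

/* Hypothetical sketch: in conn_update_state(), instead of asserting when a
 * lookup yields an orphaned CT_CONN_TYPE_UN_NAT entry whose parent
 * connection has already been cleaned up, mark the packet invalid so it is
 * dropped and the connection can be re-created. */
if (conn->conn_type == CT_CONN_TYPE_UN_NAT) {
    pkt->md.ct_state = CS_INVALID;  /* instead of ovs_assert(conn->conn_type
                                     * == CT_CONN_TYPE_DEFAULT); */
    return false;                   /* do not create a new connection */
}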

Best regards

Michael

From: Lazuardi Nasution <mrxlazuar...@gmail.com>
Sent: Monday, April 3, 2023 13:50
To: ovs-discuss@openvswitch.org
Cc: Plato, Michael 
<michael.pl...@tu-berlin.de>
Subject: Re: [ovs-discuss] ovs-vswitchd crashes several times a day

Hi,

Is this related to the following glibc bug? I'm not so sure about this, because 
when I check the glibc source of the installed version (2.35), the proposed 
patch has been applied.

https://sourceware.org/bugzilla/show_bug.cgi?id=12889

I can confirm that this problem only happens if I use stateful ACLs, which are 
related to conntrack. The racing situation happens when massive numbers of 
unreachable replies are received. For example, if I run etcd on VMs but one etcd 
node has been disabled, which causes massive connection attempts and unreachable 
replies.

Best regards.

On Mon, Mar 20, 2023, 10:58 PM Lazuardi Nasution 
<mrxlazuar...@gmail.com> wrote:
Hi Michael,

Have you found the solution for this case? I see the same weird problem, 
without any information about which conntrack entries are causing this issue.

I'm using OVS 3.0.1 with DPDK 21.11.2 on Ubuntu 22.04. By the way, this 
problem disappears after I remove some Kubernetes cluster VMs and some DB 
cluster VMs.

Best regards.

Date: Thu, 29 Sep 2022 07:56:32 +
From: "Plato, Michael" 
mailto:michael.pl...@tu-berlin.de>>
To: "ovs-discuss@openvswitch.org" 
mailto:ovs-discuss@openvswitch.org>>
Subject: [ovs-discuss] ovs-vswitchd crashes serveral times a day
Message-ID: 
<8e53d3d0674049e69b2b7f3c4b0b8...@tu-berlin.de>
Content-Type: text/plain; charset="us-ascii"

Hi,

we are about to roll out our new openstack infrastructure based on yoga, and 
during our testing we observed that the openvswitch-switch systemd unit 
restarts several times a day, causing network interruptions for all VMs on the 
compute node in question.
After some research we found that ovs-vswitchd crashes with the following 
assertion failure:

"2022-09-29T06:51:05.195Z|3|util(pmd-c01/id:8)|EMER|../lib/conntrack.c:1095:
 assertion conn->conn_type == CT_CONN_TYPE_DEFAULT failed in 
conn_update_state()"

To get more information about the connection that leads to this assertion 
failure, I added some debug code to conntrack.c.
We have seen that we can trigger this issue when trying to connect from a VM to 
a destination which is unreachable, for example: curl https://www.google.de:444

Shortly after that we get an assertion and the debug code says:

conn_type=1 (may be CT_CONN_TYPE_UN_NAT) ?
src ip 172.217.16.67 dst ip 141.23.xx.xx rev src ip 141.23.xx.xx rev dst ip 
172.217.16.67 src/dst ports 444/46212 rev src/dst ports 46212/444 zone/rev zone 
2/2 nw_proto/rev nw_proto 6/6

Re: [ovs-discuss] ovs-vswitchd crashes several times a day

2023-04-04 Thread Plato, Michael via discuss
Hi,
from my perspective the patch works for all cases. My test environment runs 
with several k8s clusters and I haven't noticed any etcd failures so far.

Best regards

Michael

From: Lazuardi Nasution 
Sent: Tuesday, April 4, 2023 09:41
To: Plato, Michael 
Cc: ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] ovs-vswitchd crashes several times a day

Hi Michael,

Is your patch working for same-subnet unreachable traffic too? In my case, 
crashes happen when there are too many unreachable replies, even from the same 
subnet. For example, when one of the etcd instances is down, there will be huge 
numbers of reconnection attempts and then unreachable replies from the 
destination VM where the down etcd instance exists.

Best regards.

On Tue, Apr 4, 2023 at 1:06 PM Plato, Michael 
<michael.pl...@tu-berlin.de> wrote:
Hi,
I have some news on this topic. Unfortunately I could not find the root cause, 
but I managed to implement a workaround (see patch in attachment). The basic 
idea is to mark the nat flows as invalid if there is no longer an associated 
connection. From my point of view it is a race condition, and it can be triggered 
by many short-lived connections. With the patch we no longer have any crashes. 
I can't say if it has any negative effects though, as I'm not an expert; so far 
I haven't found any problems at least. Without this patch we had hundreds of 
crashes a day :/

Best regards

Michael

From: Lazuardi Nasution <mrxlazuar...@gmail.com>
Sent: Monday, April 3, 2023 13:50
To: ovs-discuss@openvswitch.org
Cc: Plato, Michael 
<michael.pl...@tu-berlin.de>
Subject: Re: [ovs-discuss] ovs-vswitchd crashes several times a day

Hi,

Is this related to the following glibc bug? I'm not so sure about this, because 
when I check the glibc source of the installed version (2.35), the proposed 
patch has been applied.

https://sourceware.org/bugzilla/show_bug.cgi?id=12889

I can confirm that this problem only happens if I use stateful ACLs, which are 
related to conntrack. The racing situation happens when massive numbers of 
unreachable replies are received. For example, if I run etcd on VMs but one etcd 
node has been disabled, which causes massive connection attempts and unreachable 
replies.

Best regards.

On Mon, Mar 20, 2023, 10:58 PM Lazuardi Nasution 
<mrxlazuar...@gmail.com> wrote:
Hi Michael,

Have you found the solution for this case? I see the same weird problem, 
without any information about which conntrack entries are causing this issue.

I'm using OVS 3.0.1 with DPDK 21.11.2 on Ubuntu 22.04. By the way, this 
problem disappears after I remove some Kubernetes cluster VMs and some DB 
cluster VMs.

Best regards.

Date: Thu, 29 Sep 2022 07:56:32 +
From: "Plato, Michael" 
mailto:michael.pl...@tu-berlin.de>>
To: "ovs-discuss@openvswitch.org" 
mailto:ovs-discuss@openvswitch.org>>
Subject: [ovs-discuss] ovs-vswitchd crashes serveral times a day
Message-ID: 
<8e53d3d0674049e69b2b7f3c4b0b8...@tu-berlin.de>
Content-Type: text/plain; charset="us-ascii"

Hi,

we are about to roll out our new openstack infrastructure based on yoga, and 
during our testing we observed that the openvswitch-switch systemd unit 
restarts several times a day, causing network interruptions for all VMs on the 
compute node in question.
After some research we found that ovs-vswitchd crashes with the following 
assertion failure:

"2022-09-29T06:51:05.195Z|3|util(pmd-c01/id:8)|EMER|../lib/conntrack.c:1095:
 assertion conn->conn_type == CT_CONN_TYPE_DEFAULT failed in 
conn_update_state()"

To get more information about the connection that leads to this assertion 
failure, I added some debug code to conntrack.c.
We have seen that we can trigger this issue when trying to connect from a VM to 
a destination which is unreachable, for example: curl https://www.google.de:444

Shortly after that we get an assertion and the debug code says:

conn_type=1 (may be CT_CONN_TYPE_UN_NAT) ?
src ip 172.217.16.67 dst ip 141.23.xx.xx rev src ip 141.23.xx.xx rev dst ip 
172.217.16.67 src/dst ports 444/46212 rev src/dst ports 46212/444 zone/rev zone 
2/2 nw_proto/rev nw_proto 6/6

ovs-appctl dpctl/dump-conntrack | grep "444"
tcp,orig=(src=141.23.xx.xx,dst=172.217.16.67,sport=46212,dport=444),reply=(src=172.217.16.67,dst=141.23.xx.xx,sport=444,dport=46212),zone=2,protoinfo=(state=SYN_SENT)

Versions:
ovs-vsctl --version
ovs-vsctl (Open vSwitch) 2.17.2
DB Schema 8.3.0

ovn-controller --version
ovn-controller 22.03.0
Open vSwitch Library 2.17.0
OpenFlow versions 0x6:0x6
SB DB Schema 20.21.0

DPDK 21.11.2

We are now unsure if this is a misconfiguration or if we hit a bug.

Thanks for any feedback

Michael
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] Problem with ovn and neutron dynamic routing

2023-04-04 Thread Plato, Michael via discuss
Hi,
I managed to create a working setup by omitting this flow for bgp routed 
networks 
(https://github.com/ovn-org/ovn/blob/branch-22.03/northd/northd.c#L13234). It 
is also important to keep snat enabled in the openstack router, otherwise no 
communication between a floating ip and a routed tenant network ip on the same 
network will be possible. But so far I have no idea how to decide in northd 
whether it is a routed network or not. From my point of view, the CMS (neutron) 
should pass this information to OVN. In my proof of concept, I excluded 
specific subnet ranges, but that's not useful for a production setup.

Best regards

Michael

From: Roberto Bartzen Acosta 
Sent: Wednesday, March 8, 2023 14:03
To: Lajos Katona 
Cc: Plato, Michael ; ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] Problem with ovn and neutron dynamic routing

Hi Plato,

An alternative would be to segment the networks of the racks so that the next 
hop is announced as the IP of the segment of each rack (I'm not sure if this 
will work with OVN).
Take a look at this doc [2].

[2] 
https://docs.openstack.org/neutron/latest/admin/config-bgp-floating-ip-over-l2-segmented-network.html#setting-up-the-provider-subnets-for-the-bgp-next-hop-routing

On Wed, Mar 8, 2023 at 08:54, Roberto Bartzen Acosta 
<roberto.aco...@luizalabs.com> wrote:
Hey folks,

Please correct me if I'm wrong, but this problem seems related to the logical 
flow order.

How does it work when there is no n-d-r?
- the FIP traffic is redirected from the external host to the openstack 
provider network (no explicit next-hop), the path is discovered via ARP, and the 
traffic is then forwarded by the FIP's dnat_and_snat action.

When n-d-r starts to advertise the FIPs via BGP, it announces the router's 
external IP as the FIP's next_hop [1].
The order of the logical flows must be interfering with the action performed 
(should it do a default router nat action first, or a dnat_and_snat for the FIP?).
I think you have two alternatives to test a possible fix:
1 - Look at the OVN backend in Neutron, map (Neutron/OVN) the tables used for 
the two types of traffic (NAT for routers and NAT for FIPs), and maybe change 
the priority of the flow actions (very complex).
2 - Change the n-d-r to not advertise the router's gw port IP in next_hop [1] 
(ovn case), maybe changing it to the FIP address (one would need to study how 
the bgp peer expects to receive the next_hop to compose the AS_PATH).

[1] - 
https://opendev.org/openstack/neutron-dynamic-routing/src/commit/e9529f7dc5449714c76afd7fce62f228f877/neutron_dynamic_routing/services/bgp/bgp_plugin.py#L261

Best regards,
Roberto

On Wed, Mar 8, 2023 at 04:25, Lajos Katona via discuss 
<ovs-discuss@openvswitch.org> wrote:
Hi,
If you feel that the OVN + Neutron + neutron-dynamic-routing combination has 
some issue, feel free to open a bug report on Launchpad:
https://bugs.launchpad.net/neutron

If it is a more complex issue, we have weekly meetings where you can ask the 
Neutron team for advice and help (we use IRC), or just write a mail to the 
OpenStack Discuss List <openstack-disc...@lists.openstack.org> with [Neutron] 
in the subject.

Best wishes
Lajos Katona


Plato, Michael via discuss 
<ovs-discuss@openvswitch.org> wrote 
(on Thu, Feb 23, 2023, 10:26):
Hello,

many thanks for the quick response. As far as I can see, the ticket is a bit 
older. Are there any ideas for a solution so far, or first patches that could be 
tested?

Best regards

Michael

From: Luis Tomas Bolivar <ltoma...@redhat.com>
Sent: Monday, February 20, 2023 10:03
To: Plato, Michael 
<michael.pl...@tu-berlin.de>
Cc: ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] Problem with ovn and neutron dynamic routing

We hit this problem a while ago and reported it here: 
https://bugzilla.redhat.com/show_bug.cgi?id=1906455

On Mon, Feb 20, 2023 at 9:56 AM Plato, Michael via discuss 
<ovs-discuss@openvswitch.org> wrote:
Hello,

we have a problem with ovn in connection with neutron dynamic routing (which is 
now supported with ovn). We can announce our internal networks via BGP, and the 
VMs in this network can also be reached directly without nat.
But if we attach a public floating ip to the internal self service network ip, 
we see some strange effects. The VM can still be reached via ping with both 
ips, but SSH, for example, only works via the floating ip. I did some network 
traces and found that the return traffic is being natted even though no nat was 
applied on the incoming way. From my point of view we need a conntrack marker 
which identifies traffic that was d-natted on the incoming way, and to s-nat 
only that traffic on the return way. Is it possible to implement something like 
this to fully support ovn with BGP announced networks which are directly 
reachable via routing?

Thanks for reply and best regards!

Michael
_

Re: [ovs-discuss] ovs-vswitchd crashes several times a day

2023-04-04 Thread Plato, Michael via discuss
Hi,
I have some news on this topic. Unfortunately I could not find the root cause, 
but I managed to implement a workaround (see patch in attachment). The basic 
idea is to mark the nat flows as invalid if there is no longer an associated 
connection. From my point of view it is a race condition, and it can be triggered 
by many short-lived connections. With the patch we no longer have any crashes. 
I can't say if it has any negative effects though, as I'm not an expert; so far 
I haven't found any problems at least. Without this patch we had hundreds of 
crashes a day :/

Best regards

Michael

From: Lazuardi Nasution 
Sent: Monday, April 3, 2023 13:50
To: ovs-discuss@openvswitch.org
Cc: Plato, Michael 
Subject: Re: [ovs-discuss] ovs-vswitchd crashes several times a day

Hi,

Is this related to the following glibc bug? I'm not so sure about this, because 
when I check the glibc source of the installed version (2.35), the proposed 
patch has been applied.

https://sourceware.org/bugzilla/show_bug.cgi?id=12889

I can confirm that this problem only happens if I use stateful ACLs, which are 
related to conntrack. The racing situation happens when massive numbers of 
unreachable replies are received. For example, if I run etcd on VMs but one etcd 
node has been disabled, which causes massive connection attempts and unreachable 
replies.

Best regards.

On Mon, Mar 20, 2023, 10:58 PM Lazuardi Nasution 
<mrxlazuar...@gmail.com> wrote:
Hi Michael,

Have you found the solution for this case? I see the same weird problem, 
without any information about which conntrack entries are causing this issue.

I'm using OVS 3.0.1 with DPDK 21.11.2 on Ubuntu 22.04. By the way, this 
problem disappears after I remove some Kubernetes cluster VMs and some DB 
cluster VMs.

Best regards.

Date: Thu, 29 Sep 2022 07:56:32 +
From: "Plato, Michael" 
mailto:michael.pl...@tu-berlin.de>>
To: "ovs-discuss@openvswitch.org" 
mailto:ovs-discuss@openvswitch.org>>
Subject: [ovs-discuss] ovs-vswitchd crashes serveral times a day
Message-ID: 
<8e53d3d0674049e69b2b7f3c4b0b8...@tu-berlin.de>
Content-Type: text/plain; charset="us-ascii"

Hi,

we are about to roll out our new openstack infrastructure based on yoga, and 
during our testing we observed that the openvswitch-switch systemd unit 
restarts several times a day, causing network interruptions for all VMs on the 
compute node in question.
After some research we found that ovs-vswitchd crashes with the following 
assertion failure:

"2022-09-29T06:51:05.195Z|3|util(pmd-c01/id:8)|EMER|../lib/conntrack.c:1095:
 assertion conn->conn_type == CT_CONN_TYPE_DEFAULT failed in 
conn_update_state()"

To get more information about the connection that leads to this assertion 
failure, I added some debug code to conntrack.c.
We have seen that we can trigger this issue when trying to connect from a VM to 
a destination which is unreachable, for example: curl https://www.google.de:444

Shortly after that we get an assertion and the debug code says:

conn_type=1 (may be CT_CONN_TYPE_UN_NAT) ?
src ip 172.217.16.67 dst ip 141.23.xx.xx rev src ip 141.23.xx.xx rev dst ip 
172.217.16.67 src/dst ports 444/46212 rev src/dst ports 46212/444 zone/rev zone 
2/2 nw_proto/rev nw_proto 6/6

ovs-appctl dpctl/dump-conntrack | grep "444"
tcp,orig=(src=141.23.xx.xx,dst=172.217.16.67,sport=46212,dport=444),reply=(src=172.217.16.67,dst=141.23.xx.xx,sport=444,dport=46212),zone=2,protoinfo=(state=SYN_SENT)

Versions:
ovs-vsctl --version
ovs-vsctl (Open vSwitch) 2.17.2
DB Schema 8.3.0

ovn-controller --version
ovn-controller 22.03.0
Open vSwitch Library 2.17.0
OpenFlow versions 0x6:0x6
SB DB Schema 20.21.0

DPDK 21.11.2

We are now unsure if this is a misconfiguration or if we hit a bug.

Thanks for any feedback

Michael


ovs-conntrack.patch
Description: ovs-conntrack.patch
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] Problem with ovn and neutron dynamic routing

2023-02-23 Thread Plato, Michael via discuss
Hello,

many thanks for the quick response. As far as I can see, the ticket is a bit 
older. Are there any ideas for a solution so far, or first patches that could be 
tested?

Best regards

Michael

From: Luis Tomas Bolivar 
Sent: Monday, February 20, 2023 10:03
To: Plato, Michael 
Cc: ovs-discuss@openvswitch.org
Subject: Re: [ovs-discuss] Problem with ovn and neutron dynamic routing

We hit this problem a while ago and reported it here: 
https://bugzilla.redhat.com/show_bug.cgi?id=1906455

On Mon, Feb 20, 2023 at 9:56 AM Plato, Michael via discuss 
<ovs-discuss@openvswitch.org> wrote:
Hello,

we have a problem with ovn in connection with neutron dynamic routing (which is 
now supported with ovn). We can announce our internal networks via BGP, and the 
VMs in this network can also be reached directly without nat.
But if we attach a public floating ip to the internal self service network ip, 
we see some strange effects. The VM can still be reached via ping with both 
ips, but SSH, for example, only works via the floating ip. I did some network 
traces and found that the return traffic is being natted even though no nat was 
applied on the incoming way. From my point of view we need a conntrack marker 
which identifies traffic that was d-natted on the incoming way, and to s-nat 
only that traffic on the return way. Is it possible to implement something like 
this to fully support ovn with BGP announced networks which are directly 
reachable via routing?

Thanks for reply and best regards!

Michael
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


--
LUIS TOMÁS BOLÍVAR
Principal Software Engineer
Red Hat
Madrid, Spain
ltoma...@redhat.com

___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


[ovs-discuss] Problem with ovn and neutron dynamic routing

2023-02-20 Thread Plato, Michael via discuss
Hello,

we have a problem with ovn in connection with neutron dynamic routing (which is 
now supported with ovn). We can announce our internal networks via BGP, and the 
VMs in this network can also be reached directly without nat.
But if we attach a public floating ip to the internal self service network ip, 
we see some strange effects. The VM can still be reached via ping with both 
ips, but SSH, for example, only works via the floating ip. I did some network 
traces and found that the return traffic is being natted even though no nat was 
applied on the incoming way. From my point of view we need a conntrack marker 
which identifies traffic that was d-natted on the incoming way, and to s-nat 
only that traffic on the return way. Is it possible to implement something like 
this to fully support ovn with BGP announced networks which are directly 
reachable via routing?
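
Such a marker could be expressed with plain OpenFlow using ct_mark, which is
stored in the conntrack entry itself. A hypothetical sketch (bridge, table
numbers, and addresses invented; OVN's real pipeline is structured
differently):

# Mark connections that get DNATed on the way in (FIP 203.0.113.10 -> VM
# 10.0.0.5) by setting ct_mark when the connection is committed.
ovs-ofctl add-flow br-int "table=0, ip, nw_dst=203.0.113.10, \
    actions=ct(commit,nat(dst=10.0.0.5),exec(set_field:1->ct_mark),table=1)"
# On the return path (assuming an earlier ct() pass made the reply traffic
# tracked), SNAT only connections carrying the mark; unmarked, directly
# routed traffic would pass untranslated.
ovs-ofctl add-flow br-int \
    "table=2, ip, ct_state=+rpl+trk, ct_mark=1, actions=ct(nat,table=3)"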

Thanks for reply and best regards!

Michael
___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Re: [ovs-discuss] ovs-dpdk conntrack assert panic

2022-10-25 Thread Plato, Michael via discuss
Hi,
it looks like I ran into the same bug 
(https://mail.openvswitch.org/pipermail/ovs-discuss/2022-September/052065.html).
 Did you find a solution for the problem?

Best regards

Michael

___
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss