Re: [ovs-dev] [PATCH] net: openvswitch: Fix Use-After-Free in ovs_ct_exit

2024-04-22 Thread Eric Dumazet via dev
On Mon, Apr 22, 2024 at 11:37 AM Hyunwoo Kim  wrote:
>
> Since kfree_rcu, which is called in the hlist_for_each_entry_rcu traversal
> of ovs_ct_limit_exit, is not part of the RCU read critical section, it
> is possible that the RCU grace period will pass during the traversal and
> the key will be free.
>
> To prevent this, it should be changed to hlist_for_each_entry_safe.
>
> Fixes: 11efd5cb04a1 ("openvswitch: Support conntrack zone limit")
> Signed-off-by: Hyunwoo Kim 

Reviewed-by: Eric Dumazet 

Thanks !
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net 1/2] net: openvswitch: limit the number of recursions from action sets

2024-02-06 Thread Eric Dumazet via dev
On Tue, Feb 6, 2024 at 3:55 PM Aaron Conole  wrote:
>
>
> Oops - I didn't consider it.
>
> Given that, maybe the best approach would not to rely on per-cpu
> counter. I'll respin in the next series with a depth counter that I pass
> to the function instead and compare that.  I guess that should address
> migration and eliminate the need for per-cpu counter.
>
> Does it make sense?

Sure, a depth parameter would work much better ;)
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net 1/2] net: openvswitch: limit the number of recursions from action sets

2024-02-06 Thread Eric Dumazet via dev
On Tue, Feb 6, 2024 at 2:11 PM Aaron Conole  wrote:
>
> The ovs module allows for some actions to recursively contain an action
> list for complex scenarios, such as sampling, checking lengths, etc.
> When these actions are copied into the internal flow table, they are
> evaluated to validate that such actions make sense, and these calls
> happen recursively.
>
> The ovs-vswitchd userspace won't emit more than 16 recursion levels
> deep.  However, the module has no such limit and will happily accept
> limits larger than 16 levels nested.  Prevent this by tracking the
> number of recursions happening and manually limiting it to 16 levels
> nested.
>
> The initial implementation of the sample action would track this depth
> and prevent more than 3 levels of recursion, but this was removed to
> support the clone use case, rather than limited at the current userspace
> limit.
>
> Fixes: 798c166173ff ("openvswitch: Optimize sample action for the clone use 
> cases")
> Signed-off-by: Aaron Conole 
> ---
>  net/openvswitch/flow_netlink.c | 33 -
>  1 file changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> index 88965e2068ac..ba5cfa67a720 100644
> --- a/net/openvswitch/flow_netlink.c
> +++ b/net/openvswitch/flow_netlink.c
> @@ -48,6 +48,9 @@ struct ovs_len_tbl {
>
>  #define OVS_ATTR_NESTED -1
>  #define OVS_ATTR_VARIABLE -2
> +#define OVS_COPY_ACTIONS_MAX_DEPTH 16
> +
> +static DEFINE_PER_CPU(int, copy_actions_depth);
>
>  static bool actions_may_change_flow(const struct nlattr *actions)
>  {
> @@ -3148,11 +3151,11 @@ static int copy_action(const struct nlattr *from,
> return 0;
>  }
>
> -static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
> - const struct sw_flow_key *key,
> - struct sw_flow_actions **sfa,
> - __be16 eth_type, __be16 vlan_tci,
> - u32 mpls_label_count, bool log)
> +static int ___ovs_nla_copy_actions(struct net *net, const struct nlattr 
> *attr,
> +  const struct sw_flow_key *key,
> +  struct sw_flow_actions **sfa,
> +  __be16 eth_type, __be16 vlan_tci,
> +  u32 mpls_label_count, bool log)
>  {
> u8 mac_proto = ovs_key_mac_proto(key);
> const struct nlattr *a;
> @@ -3478,6 +3481,26 @@ static int __ovs_nla_copy_actions(struct net *net, 
> const struct nlattr *attr,
> return 0;
>  }
>
> +static int __ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
> + const struct sw_flow_key *key,
> + struct sw_flow_actions **sfa,
> + __be16 eth_type, __be16 vlan_tci,
> + u32 mpls_label_count, bool log)
> +{
> +   int level = this_cpu_read(copy_actions_depth);
> +   int ret;
> +
> +   if (level > OVS_COPY_ACTIONS_MAX_DEPTH)
> +   return -EOVERFLOW;
> +

This code seems to run in process context.

Using per cpu limit would not work, unless you disabled migration ?

> +   __this_cpu_inc(copy_actions_depth);
> +   ret = ___ovs_nla_copy_actions(net, attr, key, sfa, eth_type,
> + vlan_tci, mpls_label_count, log);
> +   __this_cpu_dec(copy_actions_depth);
> +
> +   return ret;
> +}
> +
>  /* 'key' must be the masked key. */
>  int ovs_nla_copy_actions(struct net *net, const struct nlattr *attr,
>  const struct sw_flow_key *key,
> --
> 2.41.0
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net-next v2] openvswitch: reduce stack usage in do_execute_actions

2023-09-21 Thread Eric Dumazet via dev
On Thu, Sep 21, 2023 at 9:42 PM Ilya Maximets  wrote:
>
> do_execute_actions() function can be called recursively multiple
> times while executing actions that require pipeline forking or
> recirculations.  It may also be re-entered multiple times if the packet
> leaves openvswitch module and re-enters it through a different port.
>
> Currently, there is a 256-byte array allocated on stack in this
> function that is supposed to hold NSH header.  Compilers tend to
> pre-allocate that space right at the beginning of the function:
>
>  a88:   48 81 ec b0 01 00 00sub$0x1b0,%rsp
>
> NSH is not a very common protocol, but the space is allocated on every
> recursive call or re-entry multiplying the wasted stack space.
>

Reviewed-by: Eric Dumazet 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net-next] openvswitch: reduce stack usage in do_execute_actions

2023-09-21 Thread Eric Dumazet via dev
On Thu, Sep 21, 2023 at 9:03 PM Ilya Maximets  wrote:
>
> do_execute_actions() function can be called recursively multiple
> times while executing actions that require pipeline forking or
> recirculations.  It may also be re-entered multiple times if the packet
> leaves openvswitch module and re-enters it through a different port.
>
> Currently, there is a 256-byte array allocated on stack in this
> function that is supposed to hold NSH header.  Compilers tend to
> pre-allocate that space right at the beginning of the function:
>
>  a88:   48 81 ec b0 01 00 00sub$0x1b0,%rsp
>
> NSH is not a very common protocol, but the space is allocated on every
> recursive call or re-entry multiplying the wasted stack space.
>
> Move the stack allocation to push_nsh() function that is only used
> if NSH actions are actually present.  push_nsh() is also a simple
> function without a possibility for re-entry, so the stack is returned
> right away.
>
> With this change the preallocated space is reduced by 256 B per call:
>
>  b18:   48 81 ec b0 00 00 00sub$0xb0,%rsp
>
> Signed-off-by: Ilya Maximets 
> ---
>  net/openvswitch/actions.c | 20 +---
>  1 file changed, 9 insertions(+), 11 deletions(-)
>
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index 5f8094acd056..80cc5c512d7b 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -312,10 +312,16 @@ static int push_eth(struct sk_buff *skb, struct 
> sw_flow_key *key,
>  }
>
>  static int push_nsh(struct sk_buff *skb, struct sw_flow_key *key,
> -   const struct nshhdr *nh)
> +   const struct nlattr *a)

Presumably this function should be inlined. (one caller only)

I would add noinline_for_stack to make sure the compiler will not play
games with this attempt.

>  {
> +   u8 buffer[NSH_HDR_MAX_LEN];
> +   struct nshhdr *nh = (struct nshhdr *)buffer;
> int err;
>
> +   err = nsh_hdr_from_nlattr(a, nh, NSH_HDR_MAX_LEN);
> +   if (err)
> +   return err;
> +
> err = nsh_push(skb, nh);
> if (err)
> return err;
> @@ -1439,17 +1445,9 @@ static int do_execute_actions(struct datapath *dp, 
> struct sk_buff *skb,
> err = pop_eth(skb, key);
> break;
>
> -   case OVS_ACTION_ATTR_PUSH_NSH: {
> -   u8 buffer[NSH_HDR_MAX_LEN];
> -   struct nshhdr *nh = (struct nshhdr *)buffer;
> -
> -   err = nsh_hdr_from_nlattr(nla_data(a), nh,
> - NSH_HDR_MAX_LEN);
> -   if (unlikely(err))
> -   break;
> -   err = push_nsh(skb, key, nh);
> +   case OVS_ACTION_ATTR_PUSH_NSH:
> +   err = push_nsh(skb, key, nla_data(a));
> break;
> -   }
>
> case OVS_ACTION_ATTR_POP_NSH:
> err = pop_nsh(skb, key);
> --
> 2.41.0
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net v3] net: openvswitch: fix race on port output

2023-04-07 Thread Eric Dumazet via dev
On Wed, Apr 5, 2023 at 9:53 AM Felix Huettner
 wrote:
>
> assume the following setup on a single machine:
> 1. An openvswitch instance with one bridge and default flows
> 2. two network namespaces "server" and "client"
> 3. two ovs interfaces "server" and "client" on the bridge
> 4. for each ovs interface a veth pair with a matching name and 32 rx and
>tx queues
> 5. move the ends of the veth pairs to the respective network namespaces
> 6. assign ip addresses to each of the veth ends in the namespaces (needs
>to be the same subnet)
> 7. start some http server on the server network namespace
> 8. test if a client in the client namespace can reach the http server
>
> when following the actions below the host has a chance of getting a cpu
> stuck in a infinite loop:
> 1. send a large amount of parallel requests to the http server (around
>3000 curls should work)
> 2. in parallel delete the network namespace (do not delete interfaces or
>stop the server, just kill the namespace)
>
> there is a low chance that this will cause the below kernel cpu stuck
> message. If this does not happen just retry.
> Below there is also the output of bpftrace for the functions mentioned
> in the output.
>
...
Reviewed-by: Eric Dumazet 
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net v2] net: openvswitch: fix race on port output

2023-04-04 Thread Eric Dumazet via dev
On Tue, Apr 4, 2023 at 9:33 AM Felix Huettner
 wrote:
>
> assume the following setup on a single machine:
> 1. An openvswitch instance with one bridge and default flows
> 2. two network namespaces "server" and "client"
> 3. two ovs interfaces "server" and "client" on the bridge
> 4. for each ovs interface a veth pair with a matching name and 32 rx and
>tx queues
> 5. move the ends of the veth pairs to the respective network namespaces
> 6. assign ip addresses to each of the veth ends in the namespaces (needs
>to be the same subnet)
> 7. start some http server on the server network namespace
> 8. test if a client in the client namespace can reach the http server
>
> when following the actions below the host has a chance of getting a cpu
> stuck in a infinite loop:
> 1. send a large amount of parallel requests to the http server (around
>3000 curls should work)
> 2. in parallel delete the network namespace (do not delete interfaces or
>stop the server, just kill the namespace)
>

> Fixes: 7f8a436eaa2c ("openvswitch: Add conntrack action")
> Co-developed-by: Luca Czesla 
> Signed-off-by: Luca Czesla 
> Signed-off-by: Felix Huettner 
> ---
> v2:
>   - replace BUG_ON with DEBUG_NET_WARN_ON_ONCE
>   - use netif_carrier_ok() instead of checking for NETREG_REGISTERED
> v1: https://lore.kernel.org/netdev/ZCaXfZTwS9MVk8yZ@kernel-bug-kernel-bug/
>
>  net/core/dev.c| 1 +
>  net/openvswitch/actions.c | 2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 253584777101..37b26017f458 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3199,6 +3199,7 @@ static u16 skb_tx_hash(const struct net_device *dev,
> }
>
> if (skb_rx_queue_recorded(skb)) {
> +   DEBUG_NET_WARN_ON_ONCE(unlikely(qcount == 0));

No need for unlikely(), it is already done in DEBUG_NET_WARN_ON_ONCE()

Thanks.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net] net: ensure all external references are released in deferred skbuffs

2022-06-22 Thread Eric Dumazet via dev
On Wed, Jun 22, 2022 at 9:04 PM Eric Dumazet  wrote:
>
> On Wed, Jun 22, 2022 at 8:19 PM Ilya Maximets  wrote:
> >
> > On 6/22/22 19:03, Eric Dumazet wrote:
> > > On Wed, Jun 22, 2022 at 6:47 PM Eric Dumazet  wrote:
> > >>
> > >> On Wed, Jun 22, 2022 at 6:39 PM Eric Dumazet  wrote:
> > >>>
> > >>> On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet  
> > >>> wrote:
> > 
> >  On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets  
> >  wrote:
> > >
> > > On 6/22/22 13:43, Eric Dumazet wrote:
> > 
> > >
> > > I tested the patch below and it seems to fix the issue seen
> > > with OVS testsuite.  Though it's not obvious for me why this
> > > happens.  Can you explain a bit more?
> > 
> >  Anyway, I am not sure we can call nf_reset_ct(skb) that early.
> > 
> >  git log seems to say that xfrm check needs to be done before
> >  nf_reset_ct(skb), I have no idea why.
> > >>>
> > >>> Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called
> > >>> after nf_reset_ct(skb)
> > >>>
> > >>> Steffen, do you have some comments ?
> > >>>
> > >>> Some context:
> > >>> commit b59c270104f03960069596722fea70340579244d
> > >>> Author: Patrick McHardy 
> > >>> Date:   Fri Jan 6 23:06:10 2006 -0800
> > >>>
> > >>> [NETFILTER]: Keep conntrack reference until IPsec policy checks are 
> > >>> done
> > >>>
> > >>> Keep the conntrack reference until policy checks have been 
> > >>> performed for
> > >>> IPsec NAT support. The reference needs to be dropped before a 
> > >>> packet is
> > >>> queued to avoid having the conntrack module unloadable.
> > >>>
> > >>> Signed-off-by: Patrick McHardy 
> > >>> Signed-off-by: David S. Miller 
> > >>>
> > >>
> > >> Oh well... __xfrm_policy_check() has :
> > >>
> > >> nf_nat_decode_session(skb, &fl, family);
> > >>
> > >> This  answers my questions.
> > >>
> > >> This means we are probably missing at least one XFRM check in TCP
> > >> stack in some cases.
> > >> (Only after adding this XFRM check we can call nf_reset_ct(skb))
> > >>
> > >
> > > Maybe this will help ?
> >
> > I tested this patch and it seems to fix the OVS problem.
> > I did not test the xfrm part of it.
> >
> > Will you post an official patch?
>
> Yes I will. I need to double check we do not leak either the req, or the 
> child.
>
> Maybe the XFRM check should be done even earlier, on the listening socket ?
>
> Or if we assume the SYNACK packet has been sent after the XFRM test
> has been applied to the SYN,
> maybe we could just call nf_reset_ct(skb) to lower risk of regressions.
>
> With the last patch, it would be strange that we accept the 3WHS and
> establish a socket,
> but drop the payload in the 3rd packet...

Ilya, can you test the following patch ?
I think it makes more sense to let XFRM reject the packet earlier, and
not complete the 3WHS,
if for some reason this happens.

Thanks !

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 
fe8f23b95d32ca4a35d05166d471327bc608fa91..da5a3c44c4fb70f1d3ecc596e694a86267f1c44a
100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1964,7 +1964,10 @@ int tcp_v4_rcv(struct sk_buff *skb)
struct sock *nsk;

sk = req->rsk_listener;
-   drop_reason = tcp_inbound_md5_hash(sk, skb,
+   if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
+   drop_reason = SKB_DROP_REASON_XFRM_POLICY;
+   else
+   drop_reason = tcp_inbound_md5_hash(sk, skb,
   &iph->saddr, &iph->daddr,
   AF_INET, dif, sdif);
if (unlikely(drop_reason)) {
@@ -2016,6 +2019,7 @@ int tcp_v4_rcv(struct sk_buff *skb)
}
goto discard_and_relse;
}
+   nf_reset_ct(skb);
if (nsk == sk) {
reqsk_put(req);
tcp_v4_restore_cb(skb);
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net] net: ensure all external references are released in deferred skbuffs

2022-06-22 Thread Eric Dumazet via dev
On Wed, Jun 22, 2022 at 8:19 PM Ilya Maximets  wrote:
>
> On 6/22/22 19:03, Eric Dumazet wrote:
> > On Wed, Jun 22, 2022 at 6:47 PM Eric Dumazet  wrote:
> >>
> >> On Wed, Jun 22, 2022 at 6:39 PM Eric Dumazet  wrote:
> >>>
> >>> On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet  wrote:
> 
>  On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets  wrote:
> >
> > On 6/22/22 13:43, Eric Dumazet wrote:
> 
> >
> > I tested the patch below and it seems to fix the issue seen
> > with OVS testsuite.  Though it's not obvious for me why this
> > happens.  Can you explain a bit more?
> 
>  Anyway, I am not sure we can call nf_reset_ct(skb) that early.
> 
>  git log seems to say that xfrm check needs to be done before
>  nf_reset_ct(skb), I have no idea why.
> >>>
> >>> Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called
> >>> after nf_reset_ct(skb)
> >>>
> >>> Steffen, do you have some comments ?
> >>>
> >>> Some context:
> >>> commit b59c270104f03960069596722fea70340579244d
> >>> Author: Patrick McHardy 
> >>> Date:   Fri Jan 6 23:06:10 2006 -0800
> >>>
> >>> [NETFILTER]: Keep conntrack reference until IPsec policy checks are 
> >>> done
> >>>
> >>> Keep the conntrack reference until policy checks have been performed 
> >>> for
> >>> IPsec NAT support. The reference needs to be dropped before a packet 
> >>> is
> >>> queued to avoid having the conntrack module unloadable.
> >>>
> >>> Signed-off-by: Patrick McHardy 
> >>> Signed-off-by: David S. Miller 
> >>>
> >>
> >> Oh well... __xfrm_policy_check() has :
> >>
> >> nf_nat_decode_session(skb, &fl, family);
> >>
> >> This  answers my questions.
> >>
> >> This means we are probably missing at least one XFRM check in TCP
> >> stack in some cases.
> >> (Only after adding this XFRM check we can call nf_reset_ct(skb))
> >>
> >
> > Maybe this will help ?
>
> I tested this patch and it seems to fix the OVS problem.
> I did not test the xfrm part of it.
>
> Will you post an official patch?

Yes I will. I need to double check we do not leak either the req, or the child.

Maybe the XFRM check should be done even earlier, on the listening socket ?

Or if we assume the SYNACK packet has been sent after the XFRM test
has been applied to the SYN,
maybe we could just call nf_reset_ct(skb) to lower risk of regressions.

With the last patch, it would be strange that we accept the 3WHS and
establish a socket,
but drop the payload in the 3rd packet...

>
> >
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index 
> > fe8f23b95d32ca4a35d05166d471327bc608fa91..49c1348e40b6c7b6a98b54d716f29c948e00ba33
> > 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -2019,12 +2019,19 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > if (nsk == sk) {
> > reqsk_put(req);
> > tcp_v4_restore_cb(skb);
> > -   } else if (tcp_child_process(sk, nsk, skb)) {
> > -   tcp_v4_send_reset(nsk, skb);
> > -   goto discard_and_relse;
> > } else {
> > -   sock_put(sk);
> > -   return 0;
> > +   if (!xfrm4_policy_check(nsk, XFRM_POLICY_IN, skb)) {
> > +   drop_reason = SKB_DROP_REASON_XFRM_POLICY;
> > +   goto discard_and_relse;
> > +   }
> > +   nf_reset_ct(skb);
> > +   if (tcp_child_process(sk, nsk, skb)) {
> > +   tcp_v4_send_reset(nsk, skb);
> > +   goto discard_and_relse;
> > +   } else {
> > +   sock_put(sk);
> > +   return 0;
> > +   }
> > }
> > }
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net] net: ensure all external references are released in deferred skbuffs

2022-06-22 Thread Eric Dumazet via dev
On Wed, Jun 22, 2022 at 6:47 PM Eric Dumazet  wrote:
>
> On Wed, Jun 22, 2022 at 6:39 PM Eric Dumazet  wrote:
> >
> > On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet  wrote:
> > >
> > > On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets  wrote:
> > > >
> > > > On 6/22/22 13:43, Eric Dumazet wrote:
> > >
> > > >
> > > > I tested the patch below and it seems to fix the issue seen
> > > > with OVS testsuite.  Though it's not obvious for me why this
> > > > happens.  Can you explain a bit more?
> > >
> > > Anyway, I am not sure we can call nf_reset_ct(skb) that early.
> > >
> > > git log seems to say that xfrm check needs to be done before
> > > nf_reset_ct(skb), I have no idea why.
> >
> > Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called
> > after nf_reset_ct(skb)
> >
> > Steffen, do you have some comments ?
> >
> > Some context:
> > commit b59c270104f03960069596722fea70340579244d
> > Author: Patrick McHardy 
> > Date:   Fri Jan 6 23:06:10 2006 -0800
> >
> > [NETFILTER]: Keep conntrack reference until IPsec policy checks are done
> >
> > Keep the conntrack reference until policy checks have been performed for
> > IPsec NAT support. The reference needs to be dropped before a packet is
> > queued to avoid having the conntrack module unloadable.
> >
> > Signed-off-by: Patrick McHardy 
> > Signed-off-by: David S. Miller 
> >
>
> Oh well... __xfrm_policy_check() has :
>
> nf_nat_decode_session(skb, &fl, family);
>
> This  answers my questions.
>
> This means we are probably missing at least one XFRM check in TCP
> stack in some cases.
> (Only after adding this XFRM check we can call nf_reset_ct(skb))
>

Maybe this will help ?

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 
fe8f23b95d32ca4a35d05166d471327bc608fa91..49c1348e40b6c7b6a98b54d716f29c948e00ba33
100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2019,12 +2019,19 @@ int tcp_v4_rcv(struct sk_buff *skb)
if (nsk == sk) {
reqsk_put(req);
tcp_v4_restore_cb(skb);
-   } else if (tcp_child_process(sk, nsk, skb)) {
-   tcp_v4_send_reset(nsk, skb);
-   goto discard_and_relse;
} else {
-   sock_put(sk);
-   return 0;
+   if (!xfrm4_policy_check(nsk, XFRM_POLICY_IN, skb)) {
+   drop_reason = SKB_DROP_REASON_XFRM_POLICY;
+   goto discard_and_relse;
+   }
+   nf_reset_ct(skb);
+   if (tcp_child_process(sk, nsk, skb)) {
+   tcp_v4_send_reset(nsk, skb);
+   goto discard_and_relse;
+   } else {
+   sock_put(sk);
+   return 0;
+   }
}
}
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net] net: ensure all external references are released in deferred skbuffs

2022-06-22 Thread Eric Dumazet via dev
On Wed, Jun 22, 2022 at 6:39 PM Eric Dumazet  wrote:
>
> On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet  wrote:
> >
> > On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets  wrote:
> > >
> > > On 6/22/22 13:43, Eric Dumazet wrote:
> >
> > >
> > > I tested the patch below and it seems to fix the issue seen
> > > with OVS testsuite.  Though it's not obvious for me why this
> > > happens.  Can you explain a bit more?
> >
> > Anyway, I am not sure we can call nf_reset_ct(skb) that early.
> >
> > git log seems to say that xfrm check needs to be done before
> > nf_reset_ct(skb), I have no idea why.
>
> Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called
> after nf_reset_ct(skb)
>
> Steffen, do you have some comments ?
>
> Some context:
> commit b59c270104f03960069596722fea70340579244d
> Author: Patrick McHardy 
> Date:   Fri Jan 6 23:06:10 2006 -0800
>
> [NETFILTER]: Keep conntrack reference until IPsec policy checks are done
>
> Keep the conntrack reference until policy checks have been performed for
> IPsec NAT support. The reference needs to be dropped before a packet is
> queued to avoid having the conntrack module unloadable.
>
> Signed-off-by: Patrick McHardy 
> Signed-off-by: David S. Miller 
>

Oh well... __xfrm_policy_check() has :

nf_nat_decode_session(skb, &fl, family);

This  answers my questions.

This means we are probably missing at least one XFRM check in TCP
stack in some cases.
(Only after adding this XFRM check we can call nf_reset_ct(skb))

>
> >
> > I suspect some incoming packets are not going through
> > xfrm4_policy_check() and end up being stored in a TCP receive queue.
> >
> > Maybe something is missing before calling tcp_child_process()
> >
> >
> > >
> > > >
> > > > I note that IPv6 does the nf_reset_ct() earlier, from 
> > > > ip6_protocol_deliver_rcu()
> > > >
> > > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > > index 
> > > > fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db
> > > > 100644
> > > > --- a/net/ipv4/tcp_ipv4.c
> > > > +++ b/net/ipv4/tcp_ipv4.c
> > > > @@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > > > struct sock *sk;
> > > > int ret;
> > > >
> > > > +   nf_reset_ct(skb);
> > > > +
> > > > drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
> > > > if (skb->pkt_type != PACKET_HOST)
> > > > goto discard_it;
> > > > @@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > > > if (drop_reason)
> > > > goto discard_and_relse;
> > > >
> > > > -   nf_reset_ct(skb);
> > > > -
> > > > if (tcp_filter(sk, skb)) {
> > > > drop_reason = SKB_DROP_REASON_SOCKET_FILTER;
> > > > goto discard_and_relse;
> > >
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net] net: ensure all external references are released in deferred skbuffs

2022-06-22 Thread Eric Dumazet via dev
On Wed, Jun 22, 2022 at 6:29 PM Eric Dumazet  wrote:
>
> On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets  wrote:
> >
> > On 6/22/22 13:43, Eric Dumazet wrote:
>
> >
> > I tested the patch below and it seems to fix the issue seen
> > with OVS testsuite.  Though it's not obvious for me why this
> > happens.  Can you explain a bit more?
>
> Anyway, I am not sure we can call nf_reset_ct(skb) that early.
>
> git log seems to say that xfrm check needs to be done before
> nf_reset_ct(skb), I have no idea why.

Additional remark: In IPv6 side, xfrm6_policy_check() _is_ called
after nf_reset_ct(skb)

Steffen, do you have some comments ?

Some context:
commit b59c270104f03960069596722fea70340579244d
Author: Patrick McHardy 
Date:   Fri Jan 6 23:06:10 2006 -0800

[NETFILTER]: Keep conntrack reference until IPsec policy checks are done

Keep the conntrack reference until policy checks have been performed for
IPsec NAT support. The reference needs to be dropped before a packet is
queued to avoid having the conntrack module unloadable.

Signed-off-by: Patrick McHardy 
Signed-off-by: David S. Miller 


>
> I suspect some incoming packets are not going through
> xfrm4_policy_check() and end up being stored in a TCP receive queue.
>
> Maybe something is missing before calling tcp_child_process()
>
>
> >
> > >
> > > I note that IPv6 does the nf_reset_ct() earlier, from 
> > > ip6_protocol_deliver_rcu()
> > >
> > > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > > index 
> > > fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db
> > > 100644
> > > --- a/net/ipv4/tcp_ipv4.c
> > > +++ b/net/ipv4/tcp_ipv4.c
> > > @@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > > struct sock *sk;
> > > int ret;
> > >
> > > +   nf_reset_ct(skb);
> > > +
> > > drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
> > > if (skb->pkt_type != PACKET_HOST)
> > > goto discard_it;
> > > @@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > > if (drop_reason)
> > > goto discard_and_relse;
> > >
> > > -   nf_reset_ct(skb);
> > > -
> > > if (tcp_filter(sk, skb)) {
> > > drop_reason = SKB_DROP_REASON_SOCKET_FILTER;
> > > goto discard_and_relse;
> >
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net] net: ensure all external references are released in deferred skbuffs

2022-06-22 Thread Eric Dumazet via dev
On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets  wrote:
>
> On 6/22/22 13:43, Eric Dumazet wrote:

>
> I tested the patch below and it seems to fix the issue seen
> with OVS testsuite.  Though it's not obvious for me why this
> happens.  Can you explain a bit more?

Anyway, I am not sure we can call nf_reset_ct(skb) that early.

git log seems to say that xfrm check needs to be done before
nf_reset_ct(skb), I have no idea why.

I suspect some incoming packets are not going through
xfrm4_policy_check() and end up being stored in a TCP receive queue.

Maybe something is missing before calling tcp_child_process()


>
> >
> > I note that IPv6 does the nf_reset_ct() earlier, from 
> > ip6_protocol_deliver_rcu()
> >
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index 
> > fda811a5251f2d76ac24a036e6b4f4e7d7d96d6f..a06464f96fe0cc94dd78272738ddaab2c19e87db
> > 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -1919,6 +1919,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > struct sock *sk;
> > int ret;
> >
> > +   nf_reset_ct(skb);
> > +
> > drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
> > if (skb->pkt_type != PACKET_HOST)
> > goto discard_it;
> > @@ -2046,8 +2048,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
> > if (drop_reason)
> > goto discard_and_relse;
> >
> > -   nf_reset_ct(skb);
> > -
> > if (tcp_filter(sk, skb)) {
> > drop_reason = SKB_DROP_REASON_SOCKET_FILTER;
> > goto discard_and_relse;
>
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net] net: ensure all external references are released in deferred skbuffs

2022-06-22 Thread Eric Dumazet via dev
On Wed, Jun 22, 2022 at 4:26 PM Ilya Maximets  wrote:
>
> On 6/22/22 13:43, Eric Dumazet wrote:
> > On Wed, Jun 22, 2022 at 1:32 PM Ilya Maximets  wrote:
> >>
> >> On 6/22/22 12:36, Eric Dumazet wrote:
> >>> On Wed, Jun 22, 2022 at 12:28 PM Florian Westphal  wrote:
> 
>  Eric Dumazet  wrote:
> > On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets  
> > wrote:
> >>
> >> Open vSwitch system test suite is broken due to inability to
> >> load/unload netfilter modules.  kworker thread is getting trapped
> >> in the infinite loop while running a net cleanup inside the
> >> nf_conntrack_cleanup_net_list, because deferred skbuffs are still
> >> holding nfct references and not being freed by their CPU cores.
> >>
> >> In general, the idea that we will have an rx interrupt on every
> >> CPU core at some point in a near future doesn't seem correct.
> >> Devices are getting created and destroyed, interrupts are getting
> >> re-scheduled, CPUs are going online and offline dynamically.
> >> Any of these events may leave packets stuck in defer list for a
> >> long time.  It might be OK, if they are just a piece of memory,
> >> but we can't afford them holding references to any other resources.
> >>
> >> In case of OVS, nfct reference keeps the kernel thread in busy loop
> >> while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
> >> later modprobe request from user space:
> >>
> >>   # ps
> >>299 root  R  99.3  200:25.89 kworker/u96:4+
> >>
> >>   # journalctl
> >>   INFO: task modprobe:11787 blocked for more than 1228 seconds.
> >> Not tainted 5.19.0-rc2 #8
> >>   task:modprobe state:D
> >>   Call Trace:
> >>
> >>__schedule+0x8aa/0x21d0
> >>schedule+0xcc/0x200
> >>rwsem_down_write_slowpath+0x8e4/0x1580
> >>down_write+0xfc/0x140
> >>register_pernet_subsys+0x15/0x40
> >>nf_nat_init+0xb6/0x1000 [nf_nat]
> >>do_one_initcall+0xbb/0x410
> >>do_init_module+0x1b4/0x640
> >>load_module+0x4c1b/0x58d0
> >>__do_sys_init_module+0x1d7/0x220
> >>do_syscall_64+0x3a/0x80
> >>entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >>
> >> At this point OVS testsuite is unresponsive and never recover,
> >> because these skbuffs are never freed.
> >>
> >> Solution is to make sure no external references attached to skb
> >> before pushing it to the defer list.  Using skb_release_head_state()
> >> for that purpose.  The function modified to be re-enterable, as it
> >> will be called again during the defer list flush.
> >>
> >> Another approach that can fix the OVS use-case, is to kick all
> >> cores while waiting for references to be released during the net
> >> cleanup.  But that sounds more like a workaround for a current
> >> issue rather than a proper solution and will not cover possible
> >> issues in other parts of the code.
> >>
> >> Additionally checking for skb_zcopy() while deferring.  This might
> >> not be necessary, as I'm not sure if we can actually have zero copy
> >> packets on this path, but seems worth having for completeness as we
> >> should never defer such packets regardless.
> >>
> >> CC: Eric Dumazet 
> >> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu 
> >> lists")
> >> Signed-off-by: Ilya Maximets 
> >> ---
> >>  net/core/skbuff.c | 16 +++-
> >>  1 file changed, 11 insertions(+), 5 deletions(-)
> >
> > I do not think this patch is doing the right thing.
> >
> > Packets sitting in TCP receive queues should not hold state that is
> > not relevant for TCP recvmsg().
> 
>  Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would
>  not be possible to remove nf_conntrack module in practice.
> >>>
> >>> Well, existing nf_reset_ct() does not catch all cases, like TCP fastopen ?
> >>
> >> Yeah, that is kind of the main problem I have with the current
> >> code.  It's very hard to find all the cases where skb has to be
> >> cleaned up and almost impossible for someone who doesn't know
> >> every aspect of every network subsystem in the kernel.  That's
> >> why I went with the more or less bulletproof approach of cleaning
> >> up while actually deferring.  I can try and test the code you
> >> proposed in the other reply, but at least, I think, we need a
> >> bunch of debug warnings in the skb_attempt_defer_free() to catch
> >> possible issues.
> >
> > Debug warnings are expensive if they need to bring new cache lines.
> >
> > So far skb_attempt_defer_free() is only used by TCP in well known 
> > conditions.
>
> That's true for skb_attempt_defer_free() itself, but it's
> hard to tell the same for all the parts of the code that
> are enqueuing these skbuffs.  Even if all the places are
> well known, it looks to me h

Re: [ovs-dev] [PATCH net] net: ensure all external references are released in deferred skbuffs

2022-06-22 Thread Eric Dumazet via dev
On Wed, Jun 22, 2022 at 1:32 PM Ilya Maximets  wrote:
>
> On 6/22/22 12:36, Eric Dumazet wrote:
> > On Wed, Jun 22, 2022 at 12:28 PM Florian Westphal  wrote:
> >>
> >> Eric Dumazet  wrote:
> >>> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets  wrote:
> 
>  Open vSwitch system test suite is broken due to inability to
>  load/unload netfilter modules.  kworker thread is getting trapped
>  in the infinite loop while running a net cleanup inside the
>  nf_conntrack_cleanup_net_list, because deferred skbuffs are still
>  holding nfct references and not being freed by their CPU cores.
> 
>  In general, the idea that we will have an rx interrupt on every
>  CPU core at some point in a near future doesn't seem correct.
>  Devices are getting created and destroyed, interrupts are getting
>  re-scheduled, CPUs are going online and offline dynamically.
>  Any of these events may leave packets stuck in defer list for a
>  long time.  It might be OK, if they are just a piece of memory,
>  but we can't afford them holding references to any other resources.
> 
>  In case of OVS, nfct reference keeps the kernel thread in busy loop
>  while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
>  later modprobe request from user space:
> 
>    # ps
> 299 root  R  99.3  200:25.89 kworker/u96:4+
> 
>    # journalctl
>    INFO: task modprobe:11787 blocked for more than 1228 seconds.
>  Not tainted 5.19.0-rc2 #8
>    task:modprobe state:D
>    Call Trace:
> 
> __schedule+0x8aa/0x21d0
> schedule+0xcc/0x200
> rwsem_down_write_slowpath+0x8e4/0x1580
> down_write+0xfc/0x140
> register_pernet_subsys+0x15/0x40
> nf_nat_init+0xb6/0x1000 [nf_nat]
> do_one_initcall+0xbb/0x410
> do_init_module+0x1b4/0x640
> load_module+0x4c1b/0x58d0
> __do_sys_init_module+0x1d7/0x220
> do_syscall_64+0x3a/0x80
> entry_SYSCALL_64_after_hwframe+0x46/0xb0
> 
>  At this point OVS testsuite is unresponsive and never recover,
>  because these skbuffs are never freed.
> 
>  Solution is to make sure no external references attached to skb
>  before pushing it to the defer list.  Using skb_release_head_state()
>  for that purpose.  The function modified to be re-enterable, as it
>  will be called again during the defer list flush.
> 
>  Another approach that can fix the OVS use-case, is to kick all
>  cores while waiting for references to be released during the net
>  cleanup.  But that sounds more like a workaround for a current
>  issue rather than a proper solution and will not cover possible
>  issues in other parts of the code.
> 
>  Additionally checking for skb_zcopy() while deferring.  This might
>  not be necessary, as I'm not sure if we can actually have zero copy
>  packets on this path, but seems worth having for completeness as we
>  should never defer such packets regardless.
> 
>  CC: Eric Dumazet 
>  Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu 
>  lists")
>  Signed-off-by: Ilya Maximets 
>  ---
>   net/core/skbuff.c | 16 +++-
>   1 file changed, 11 insertions(+), 5 deletions(-)
> >>>
> >>> I do not think this patch is doing the right thing.
> >>>
> >>> Packets sitting in TCP receive queues should not hold state that is
> >>> not relevant for TCP recvmsg().
> >>
> >> Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would
> >> not be possible to remove nf_conntrack module in practice.
> >
> > Well, existing nf_reset_ct() does not catch all cases, like TCP fastopen ?
>
> Yeah, that is kind of the main problem I have with the current
> code.  It's very hard to find all the cases where skb has to be
> cleaned up and almost impossible for someone who doesn't know
> every aspect of every network subsystem in the kernel.  That's
> why I went with the more or less bulletproof approach of cleaning
> up while actually deferring.  I can try and test the code you
> proposed in the other reply, but at least, I think, we need a
> bunch of debug warnings in the skb_attempt_defer_free() to catch
> possible issues.

Debug warnings are expensive if they need to bring new cache lines.

So far skb_attempt_defer_free() is only used by TCP in well known conditions.


>
> Also, what about cleaning up extensions?  IIUC, at least one
> of them can hold external references.  SKB_EXT_SEC_PATH holds
> xfrm_state.  We should, probably, free them as well?

I do not know about this, I would ask XFRM maintainers

>
> And what about zerocopy skb?  I think, we should still not
> allow them to be deferred as they seem to hold HW resources.

The point of skb_attempt_defer_free() i is to make the freeing happen
at producer
 much instead of consumer.

I do not think there is anything special in this regard 

Re: [ovs-dev] [PATCH net] net: ensure all external references are released in deferred skbuffs

2022-06-22 Thread Eric Dumazet via dev
On Wed, Jun 22, 2022 at 12:28 PM Florian Westphal  wrote:
>
> Eric Dumazet  wrote:
> > On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets  wrote:
> > >
> > > Open vSwitch system test suite is broken due to inability to
> > > load/unload netfilter modules.  kworker thread is getting trapped
> > > in the infinite loop while running a net cleanup inside the
> > > nf_conntrack_cleanup_net_list, because deferred skbuffs are still
> > > holding nfct references and not being freed by their CPU cores.
> > >
> > > In general, the idea that we will have an rx interrupt on every
> > > CPU core at some point in a near future doesn't seem correct.
> > > Devices are getting created and destroyed, interrupts are getting
> > > re-scheduled, CPUs are going online and offline dynamically.
> > > Any of these events may leave packets stuck in defer list for a
> > > long time.  It might be OK, if they are just a piece of memory,
> > > but we can't afford them holding references to any other resources.
> > >
> > > In case of OVS, nfct reference keeps the kernel thread in busy loop
> > > while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
> > > later modprobe request from user space:
> > >
> > >   # ps
> > >299 root  R  99.3  200:25.89 kworker/u96:4+
> > >
> > >   # journalctl
> > >   INFO: task modprobe:11787 blocked for more than 1228 seconds.
> > > Not tainted 5.19.0-rc2 #8
> > >   task:modprobe state:D
> > >   Call Trace:
> > >
> > >__schedule+0x8aa/0x21d0
> > >schedule+0xcc/0x200
> > >rwsem_down_write_slowpath+0x8e4/0x1580
> > >down_write+0xfc/0x140
> > >register_pernet_subsys+0x15/0x40
> > >nf_nat_init+0xb6/0x1000 [nf_nat]
> > >do_one_initcall+0xbb/0x410
> > >do_init_module+0x1b4/0x640
> > >load_module+0x4c1b/0x58d0
> > >__do_sys_init_module+0x1d7/0x220
> > >do_syscall_64+0x3a/0x80
> > >entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > >
> > > At this point OVS testsuite is unresponsive and never recover,
> > > because these skbuffs are never freed.
> > >
> > > Solution is to make sure no external references attached to skb
> > > before pushing it to the defer list.  Using skb_release_head_state()
> > > for that purpose.  The function modified to be re-enterable, as it
> > > will be called again during the defer list flush.
> > >
> > > Another approach that can fix the OVS use-case, is to kick all
> > > cores while waiting for references to be released during the net
> > > cleanup.  But that sounds more like a workaround for a current
> > > issue rather than a proper solution and will not cover possible
> > > issues in other parts of the code.
> > >
> > > Additionally checking for skb_zcopy() while deferring.  This might
> > > not be necessary, as I'm not sure if we can actually have zero copy
> > > packets on this path, but seems worth having for completeness as we
> > > should never defer such packets regardless.
> > >
> > > CC: Eric Dumazet 
> > > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu 
> > > lists")
> > > Signed-off-by: Ilya Maximets 
> > > ---
> > >  net/core/skbuff.c | 16 +++-
> > >  1 file changed, 11 insertions(+), 5 deletions(-)
> >
> > I do not think this patch is doing the right thing.
> >
> > Packets sitting in TCP receive queues should not hold state that is
> > not relevant for TCP recvmsg().
>
> Agree, but tcp_v4/6_rcv() already call nf_reset_ct(), else it would
> not be possible to remove nf_conntrack module in practice.

Well, existing nf_reset_ct() does not catch all cases, like TCP fastopen ?

Maybe 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
only widened the problem.

>
> I wonder where the deferred skbs are coming from, any and all
> queued skbs need the conntrack state dropped.
>
> I don't mind a new helper that does a combined dst+ct release though.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net] net: ensure all external references are released in deferred skbuffs

2022-06-22 Thread Eric Dumazet via dev
On Wed, Jun 22, 2022 at 12:02 PM Eric Dumazet  wrote:
>
> On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets  wrote:
> >
> > Open vSwitch system test suite is broken due to inability to
> > load/unload netfilter modules.  kworker thread is getting trapped
> > in the infinite loop while running a net cleanup inside the
> > nf_conntrack_cleanup_net_list, because deferred skbuffs are still
> > holding nfct references and not being freed by their CPU cores.
> >
> > In general, the idea that we will have an rx interrupt on every
> > CPU core at some point in a near future doesn't seem correct.
> > Devices are getting created and destroyed, interrupts are getting
> > re-scheduled, CPUs are going online and offline dynamically.
> > Any of these events may leave packets stuck in defer list for a
> > long time.  It might be OK, if they are just a piece of memory,
> > but we can't afford them holding references to any other resources.
> >
> > In case of OVS, nfct reference keeps the kernel thread in busy loop
> > while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
> > later modprobe request from user space:
> >
> >   # ps
> >299 root  R  99.3  200:25.89 kworker/u96:4+
> >
> >   # journalctl
> >   INFO: task modprobe:11787 blocked for more than 1228 seconds.
> > Not tainted 5.19.0-rc2 #8
> >   task:modprobe state:D
> >   Call Trace:
> >
> >__schedule+0x8aa/0x21d0
> >schedule+0xcc/0x200
> >rwsem_down_write_slowpath+0x8e4/0x1580
> >down_write+0xfc/0x140
> >register_pernet_subsys+0x15/0x40
> >nf_nat_init+0xb6/0x1000 [nf_nat]
> >do_one_initcall+0xbb/0x410
> >do_init_module+0x1b4/0x640
> >load_module+0x4c1b/0x58d0
> >__do_sys_init_module+0x1d7/0x220
> >do_syscall_64+0x3a/0x80
> >entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >
> > At this point OVS testsuite is unresponsive and never recover,
> > because these skbuffs are never freed.
> >
> > Solution is to make sure no external references attached to skb
> > before pushing it to the defer list.  Using skb_release_head_state()
> > for that purpose.  The function modified to be re-enterable, as it
> > will be called again during the defer list flush.
> >
> > Another approach that can fix the OVS use-case, is to kick all
> > cores while waiting for references to be released during the net
> > cleanup.  But that sounds more like a workaround for a current
> > issue rather than a proper solution and will not cover possible
> > issues in other parts of the code.
> >
> > Additionally checking for skb_zcopy() while deferring.  This might
> > not be necessary, as I'm not sure if we can actually have zero copy
> > packets on this path, but seems worth having for completeness as we
> > should never defer such packets regardless.
> >
> > CC: Eric Dumazet 
> > Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu 
> > lists")
> > Signed-off-by: Ilya Maximets 
> > ---
> >  net/core/skbuff.c | 16 +++-
> >  1 file changed, 11 insertions(+), 5 deletions(-)
>
> I do not think this patch is doing the right thing.
>
> Packets sitting in TCP receive queues should not hold state that is
> not relevant for TCP recvmsg().
>
> This consumes extra memory for no good reason, and defer expensive
> atomic operations.
>
> We for instance release skb dst before skb is queued, we should do the
> same for conntrack state.
>
> This would increase performance anyway, as we free ct state while cpu
> caches are hot.

I am thinking of the following instead.

A new helper can be added (and later be used in net/packet/af_packet.c
and probably elsewhere)

diff --git a/include/net/dst.h b/include/net/dst.h
index 
6aa252c3fc55ccaee58faebf265510469e91d780..7c3316d9d6e73daea17223a5261f6a5c4f68eae3
100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -276,6 +276,15 @@ static inline void skb_dst_drop(struct sk_buff *skb)
}
 }

+/* Before queueing skb in a receive queue, get rid of
+ * potentially expensive components.
+ */
+static inline void skb_cleanup(struct sk_buff *skb)
+{
+   skb_dst_drop(skb);
+   nf_reset_ct(skb);
+}
+
 static inline void __skb_dst_copy(struct sk_buff *nskb, unsigned long refdst)
 {
nskb->slow_gro |= !!refdst;
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index 
fdbcf2a6d08ef4a5164247b5a5b4b89b191a..913c98e446d56ee067b54b2c704ac1195ef1a81e
100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -177,7 +177,7 @@ void tcp_fastopen_add_skb(struct sock *sk, struct
sk_buff *skb)
if (!skb)
return;

-   skb_dst_drop(skb);
+   skb_cleanup(skb);
/* segs_in has been initialized to 1 in tcp_create_openreq_child().
 * Hence, reset segs_in to 0 before calling tcp_segs_in()
 * to avoid double counting.  Also, tcp_segs_in() expects
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 
2e2a9ece9af27372e6b653d685a89a2c71ba05d1..987981a16ee34e0601e7e722abef1bb098c307c5
100644
--- a/net/

Re: [ovs-dev] [PATCH net] net: ensure all external references are released in deferred skbuffs

2022-06-22 Thread Eric Dumazet via dev
On Sun, Jun 19, 2022 at 2:39 AM Ilya Maximets  wrote:
>
> Open vSwitch system test suite is broken due to inability to
> load/unload netfilter modules.  kworker thread is getting trapped
> in the infinite loop while running a net cleanup inside the
> nf_conntrack_cleanup_net_list, because deferred skbuffs are still
> holding nfct references and not being freed by their CPU cores.
>
> In general, the idea that we will have an rx interrupt on every
> CPU core at some point in a near future doesn't seem correct.
> Devices are getting created and destroyed, interrupts are getting
> re-scheduled, CPUs are going online and offline dynamically.
> Any of these events may leave packets stuck in defer list for a
> long time.  It might be OK, if they are just a piece of memory,
> but we can't afford them holding references to any other resources.
>
> In case of OVS, nfct reference keeps the kernel thread in busy loop
> while holding a 'pernet_ops_rwsem' semaphore.  That blocks the
> later modprobe request from user space:
>
>   # ps
>299 root  R  99.3  200:25.89 kworker/u96:4+
>
>   # journalctl
>   INFO: task modprobe:11787 blocked for more than 1228 seconds.
> Not tainted 5.19.0-rc2 #8
>   task:modprobe state:D
>   Call Trace:
>
>__schedule+0x8aa/0x21d0
>schedule+0xcc/0x200
>rwsem_down_write_slowpath+0x8e4/0x1580
>down_write+0xfc/0x140
>register_pernet_subsys+0x15/0x40
>nf_nat_init+0xb6/0x1000 [nf_nat]
>do_one_initcall+0xbb/0x410
>do_init_module+0x1b4/0x640
>load_module+0x4c1b/0x58d0
>__do_sys_init_module+0x1d7/0x220
>do_syscall_64+0x3a/0x80
>entry_SYSCALL_64_after_hwframe+0x46/0xb0
>
> At this point OVS testsuite is unresponsive and never recover,
> because these skbuffs are never freed.
>
> Solution is to make sure no external references attached to skb
> before pushing it to the defer list.  Using skb_release_head_state()
> for that purpose.  The function modified to be re-enterable, as it
> will be called again during the defer list flush.
>
> Another approach that can fix the OVS use-case, is to kick all
> cores while waiting for references to be released during the net
> cleanup.  But that sounds more like a workaround for a current
> issue rather than a proper solution and will not cover possible
> issues in other parts of the code.
>
> Additionally checking for skb_zcopy() while deferring.  This might
> not be necessary, as I'm not sure if we can actually have zero copy
> packets on this path, but seems worth having for completeness as we
> should never defer such packets regardless.
>
> CC: Eric Dumazet 
> Fixes: 68822bdf76f1 ("net: generalize skb freeing deferral to per-cpu lists")
> Signed-off-by: Ilya Maximets 
> ---
>  net/core/skbuff.c | 16 +++-
>  1 file changed, 11 insertions(+), 5 deletions(-)

I do not think this patch is doing the right thing.

Packets sitting in TCP receive queues should not hold state that is
not relevant for TCP recvmsg().

This consumes extra memory for no good reason, and defer expensive
atomic operations.

We for instance release skb dst before skb is queued, we should do the
same for conntrack state.

This would increase performance anyway, as we free ct state while cpu
caches are hot.
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net-next] net: rename reference+tracking helpers

2022-06-09 Thread Eric Dumazet via dev
On Thu, Jun 9, 2022 at 4:50 AM Paolo Abeni  wrote:
>
> On Wed, 2022-06-08 at 16:00 -0700, Eric Dumazet wrote:
> > On Wed, Jun 8, 2022 at 3:58 PM David Ahern  wrote:
> > >
> > > On 6/8/22 8:58 AM, Jakub Kicinski wrote:
> > > > IMO to encourage use of the track-capable API we could keep their names
> > > > short and call the legacy functions __netdev_hold() as I mentioned or
> > > > maybe netdev_hold_notrack().
> > >
> > > I like that option. Similar to the old nla_parse functions that were
> > > renamed with _deprecated - makes it easier to catch new uses.
> >
> > I think we need to clearly document the needed conversions for future
> > bugfix backports.
> >
>
> To be on the same page: do you think we need something under
> Documentation with this patch? or with the later dev_hold rename? or
> did I misunderstood completely?

Adding instructions in the comments describing the functions would probably help
stable teams (or ourselves because they will ask us to take care of conflicts)

And backport the dev_put()/dev_hold() rename to kernels without
CONFIG_NET_DEV_REFCNT_TRACKER infra.

s/dev_put()/netdev_put_notrack()/
s/dev_hold()/netdev_hold_notrack()/
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev


Re: [ovs-dev] [PATCH net-next] net: rename reference+tracking helpers

2022-06-09 Thread Eric Dumazet via dev
On Wed, Jun 8, 2022 at 3:58 PM David Ahern  wrote:
>
> On 6/8/22 8:58 AM, Jakub Kicinski wrote:
> > IMO to encourage use of the track-capable API we could keep their names
> > short and call the legacy functions __netdev_hold() as I mentioned or
> > maybe netdev_hold_notrack().
>
> I like that option. Similar to the old nla_parse functions that were
> renamed with _deprecated - makes it easier to catch new uses.

I think we need to clearly document the needed conversions for future
bugfix backports.

Alternative would be to _backport_ the renaming for all stable versions ;)
___
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev