Re: [PATCH net-next v2 0/4] net: mitigate retpoline overhead

2018-12-07 Thread Paolo Abeni
Hi,

On Thu, 2018-12-06 at 22:28 -0800, David Miller wrote:
> From: David Miller 
> Date: Thu, 06 Dec 2018 22:24:09 -0800 (PST)
> 
> > Series applied, thanks!
> 
> Erm... actually reverted.  Please fix these build failures:

oops ...
I'm sorry for the late reply. I'm travelling and I will not be able to
repost soon.

> ld: net/ipv6/ip6_offload.o: in function `ipv6_gro_receive':
> ip6_offload.c:(.text+0xda2): undefined reference to `udp6_gro_receive'
> ld: ip6_offload.c:(.text+0xdb6): undefined reference to `udp6_gro_receive'
> ld: net/ipv6/ip6_offload.o: in function `ipv6_gro_complete':
> ip6_offload.c:(.text+0x1953): undefined reference to `udp6_gro_complete'
> ld: ip6_offload.c:(.text+0x1966): undefined reference to `udp6_gro_complete'
> make: *** [Makefile:1036: vmlinux] Error 1

Are you building with CONFIG_IPV6=m? I tested against some common configs,
but I omitted that one in my last iteration (my bad). With such a config the
ip6 offloads are builtin while the udp6 offloads end up in the ipv6 module,
so I can't use them with the given config.

I'll try to fix the above in v3.

I'm sorry for this mess,

Paolo



Re: [PATCH net-next 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

2018-12-05 Thread Paolo Abeni
On Tue, 2018-12-04 at 18:12 +, Edward Cree wrote:
> On 04/12/18 17:44, Paolo Abeni wrote:
> > On Tue, 2018-12-04 at 17:13 +, Edward Cree wrote:
> > > On 03/12/18 11:40, Paolo Abeni wrote:
> > > > This header defines a bunch of helpers that allow avoiding the
> > > > retpoline overhead when calling builtin functions via function pointers.
> > > > It boils down to explicitly comparing the function pointers against
> > > > known builtin functions and, on a match, invoking the latter directly.
> > > > 
> > > > The macros defined here implement the boilerplate for the above scheme
> > > > and will be used by the next patches.
> > > > 
> > > > rfc -> v1:
> > > >  - use branch prediction hint, as suggested by Eric
> > > > 
> > > > Suggested-by: Eric Dumazet 
> > > > Signed-off-by: Paolo Abeni 
> > > > ---
> > > I'm not sure I see the reason why this is done with numbers and
> > >  'name ## NR', adding extra distance between the callsite and the
> > >  list of callees.  In particular it means that each callable needs
> > >  to specify its index.
> > > Wouldn't it be simpler just to have
> > > #define INDIRECT_CALL_1(f, f1, ...) \
> > > (likely(f == f1) ? f1(__VA_ARGS__) : f(__VA_ARGS__))
> > > #define INDIRECT_CALL_2(f, f2, f1, ...) \
> > > (likely(f == f2) ? f2(__VA_ARGS__) : INDIRECT_CALL_1(f, f1, 
> > > __VA_ARGS__))
> > > etc.?  Removing the need for INDIRECT_CALLABLE_DECLARE_* entirely.
> > Thank you for the review!
> > 
> > As some of the builtin symbols are static, we would still need some
> > macro wrappers to properly specify the scope when retpoline is enabled.
> Ah I see, it hadn't occurred to me that static callees might not be
>  available at the callsite.  Makes sense now.  In that case, have my
>  Acked-By for this patch, if you want it.

I gave your idea a shot, and I think it is cleaner after all. So I'll
send a v2 with that change.

Thanks,

Paolo
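Edward's flat macro shape can be exercised in plain userspace C. The sketch
below uses invented callee names (recv_v4/recv_v6 standing in for builtin
handlers, recv_mod for a module one); it is an illustration of the pattern,
not the kernel code:

```c
#include <assert.h>

/* userspace stand-in for the kernel's branch prediction hint */
#define likely(x) __builtin_expect(!!(x), 1)

/* compare the pointer against known builtins; fall back to the
 * (retpolined, in the kernel) indirect call otherwise */
#define INDIRECT_CALL_1(f, f1, ...) \
	(likely((f) == (f1)) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))
#define INDIRECT_CALL_2(f, f2, f1, ...) \
	(likely((f) == (f2)) ? f2(__VA_ARGS__) : \
			       INDIRECT_CALL_1(f, f1, __VA_ARGS__))

static int recv_v4(int x)  { return x + 4; } /* "builtin", checked second */
static int recv_v6(int x)  { return x + 6; } /* "builtin", checked first */
static int recv_mod(int x) { return x; }     /* "module": indirect fallback */

static int dispatch(int (*cb)(int), int x)
{
	return INDIRECT_CALL_2(cb, recv_v6, recv_v4, x);
}
```

When the compared pointer matches, the compiler emits a direct (even
inlinable) call on that branch, which is the whole point of the exercise.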



Re: [PATCH net-next 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

2018-12-04 Thread Paolo Abeni
On Tue, 2018-12-04 at 17:13 +, Edward Cree wrote:
> On 03/12/18 11:40, Paolo Abeni wrote:
> > This header defines a bunch of helpers that allow avoiding the
> > retpoline overhead when calling builtin functions via function pointers.
> > It boils down to explicitly comparing the function pointers against
> > known builtin functions and, on a match, invoking the latter directly.
> > 
> > The macros defined here implement the boilerplate for the above scheme
> > and will be used by the next patches.
> > 
> > rfc -> v1:
> >  - use branch prediction hint, as suggested by Eric
> > 
> > Suggested-by: Eric Dumazet 
> > Signed-off-by: Paolo Abeni 
> > ---
> I'm not sure I see the reason why this is done with numbers and
>  'name ## NR', adding extra distance between the callsite and the
>  list of callees.  In particular it means that each callable needs
>  to specify its index.
> Wouldn't it be simpler just to have
> #define INDIRECT_CALL_1(f, f1, ...) \
> (likely(f == f1) ? f1(__VA_ARGS__) : f(__VA_ARGS__))
> #define INDIRECT_CALL_2(f, f2, f1, ...) \
> (likely(f == f2) ? f2(__VA_ARGS__) : INDIRECT_CALL_1(f, f1, 
> __VA_ARGS__))
> etc.?  Removing the need for INDIRECT_CALLABLE_DECLARE_* entirely.

Thank you for the review!

As some of the builtin symbols are static, we would still need some
macro wrappers to properly specify the scope when retpoline is enabled.
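The scope problem can be seen with a userspace rendition of what the
INDIRECT_CALLABLE wrapper does: the static implementation gains a second,
global name via the compiler's alias attribute, so a call site in another
translation unit can compare against and call it. All names here are
invented, and this assumes GCC/clang on an ELF target:

```c
#include <assert.h>

/* a callee with internal linkage, invisible to other files */
static int my_static_impl(int x)
{
	return x * 2;
}

/* roughly what INDIRECT_CALLABLE(my_static_impl, 1, int, generic_name, int)
 * expands to: a global alias sharing my_static_impl's address */
int generic_name1(int x) __attribute__((alias("my_static_impl")));
```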

Also, I think that declaring f1, f2... before INDIRECT_CALL_ would be
uglier, as we would need to list the function names there (so we would
have the same list in 2 places).

Anyway, this really sounds like the kind of thing that will enrage the
guys on lkml. Suggestions for alternative solutions are more than welcome ;)

> PS: this has reminded me of my desire to try runtime creation of
> these kinds of branch tables with self-modifying code

This:

https://lore.kernel.org/lkml/cover.1543200841.git.jpoim...@redhat.com/T/#ma30f6b2aa655c99e93cfb267fef75b8fe9fca29b

is possibly related to what you are planning. AFAICS it should work only
for global function pointers, not for e.g. function pointers inside lists,
so the above and this series should be complementary.

Cheers,

Paolo





Re: [PATCH net-next 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

2018-12-04 Thread Paolo Abeni
On Mon, 2018-12-03 at 10:04 -0800, Eric Dumazet wrote:
> On 12/03/2018 03:40 AM, Paolo Abeni wrote:
> > This header defines a bunch of helpers that allow avoiding the
> > retpoline overhead when calling builtin functions via function pointers.
> > It boils down to explicitly comparing the function pointers against
> > known builtin functions and, on a match, invoking the latter directly.
> > 
> > The macros defined here implement the boilerplate for the above scheme
> > and will be used by the next patches.
> > 
> > rfc -> v1:
> >  - use branch prediction hint, as suggested by Eric
> > 
> > Suggested-by: Eric Dumazet 
> > Signed-off-by: Paolo Abeni 
> > ---
> >  include/linux/indirect_call_wrapper.h | 77 +++
> >  1 file changed, 77 insertions(+)
> >  create mode 100644 include/linux/indirect_call_wrapper.h
> 
> This needs to be discussed more broadly, please include lkml 

Agreed. @David: please let me know if you prefer a repost or a v2 with
the expanded recipients list.

> Also please include Paul Turner   in the discussion, since 
> he was
> the inventor of this idea.

I will do.

Thanks,
Paolo



Re: [PATCH net-next v4 2/3] udp: elide zerocopy operation in hot path

2018-12-03 Thread Paolo Abeni
On Fri, 2018-11-30 at 15:32 -0500, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
> Release of its last reference triggers a completion notification.
> 
> The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
> the skbs, because it can build, send and free skbs within its loop,
> possibly reaching refcount zero and freeing the ubuf_info too soon.
> 
> The UDP stack currently also takes this extra ref, but does not need
> it as all skbs are sent after return from __ip(6)_append_data.
> 
> Avoid the extra refcount_inc and refcount_dec_and_test, and generally
> the sock_zerocopy_put in the common path, by passing the initial
> reference to the first skb.
> 
> This approach is taken instead of initializing the refcount to 0, as
> that would generate error "refcount_t: increment on 0" on the
> next skb_zcopy_set.
> 
> Changes
>   v3 -> v4
> - Move skb_zcopy_set below the only kfree_skb that might cause
>   a premature uarg destroy before skb_zerocopy_put_abort
>   - Move the entire skb_shinfo assignment block, to keep that
> cacheline access in one place
> 
> Signed-off-by: Willem de Bruijn 

I like this solution!

Acked-by: Paolo Abeni 
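The reference handoff described in the commit message can be sketched
abstractly: instead of taking an extra reference up front and dropping it
in the common path, the caller's initial reference is passed to the first
skb. This is an invented toy model, not the kernel's actual ubuf_info API:

```c
#include <assert.h>
#include <stdbool.h>

/* toy stand-ins for ubuf_info and sk_buff */
struct ubuf {
	int refcnt;      /* starts at 1: the caller's initial reference */
	int completions; /* how many times the notification fired */
};

struct skb {
	struct ubuf *uarg;
};

static void ubuf_put(struct ubuf *u)
{
	if (--u->refcnt == 0)
		u->completions++; /* completion notification on last put */
}

/* Attach @u to @skb. The first skb inherits the initial reference
 * instead of taking a new one, so the hot path needs no inc/dec pair. */
static void skb_zcopy_set(struct skb *skb, struct ubuf *u, bool first)
{
	if (!first)
		u->refcnt++;
	skb->uarg = u;
}

static void skb_free(struct skb *skb)
{
	ubuf_put(skb->uarg);
}
```

Freeing all attached skbs then fires the completion exactly once, with one
fewer atomic pair than the get/put variant.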



Re: [PATCH net-next v4 1/3] udp: msg_zerocopy

2018-12-03 Thread Paolo Abeni
On Fri, 2018-11-30 at 15:32 -0500, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
> interpret flag MSG_ZEROCOPY.
> 
> This patch was previously part of the zerocopy RFC patchsets. Zerocopy
> is not effective at small MTU. With segmentation offload building
> larger datagrams, the benefit of page flipping outweighs the cost of
> generating a completion notification.
> 
> tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
> test patch and making skb_orphan_frags_rx same as skb_orphan_frags:
> 
> ipv4 udp -t 1
> tx=191312 (11938 MB) txc=0 zc=n
> rx=191312 (11938 MB)
> ipv4 udp -z -t 1
> tx=304507 (19002 MB) txc=304507 zc=y
> rx=304507 (19002 MB)
> ok
> ipv6 udp -t 1
> tx=174485 (10888 MB) txc=0 zc=n
> rx=174485 (10888 MB)
> ipv6 udp -z -t 1
> tx=294801 (18396 MB) txc=294801 zc=y
> rx=294801 (18396 MB)
> ok
> 
> Changes
>   v1 -> v2
> - Fixup reverse christmas tree violation
>   v2 -> v3
> - Split refcount avoidance optimization into separate patch
>   - Fix refcount leak on error in fragmented case
> (thanks to Paolo Abeni for pointing this one out!)
>   - Fix refcount inc on zero
>   - Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data.
> This is needed since commit 5cf4a8532c99 ("tcp: really ignore
>   MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.
> 
> Signed-off-by: Willem de Bruijn 

Acked-by: Paolo Abeni 



[PATCH net-next 2/4] net: use indirect call wrappers at GRO network layer

2018-12-03 Thread Paolo Abeni
This avoids an indirect call in the L3 GRO receive path, both for
ipv4 and for ipv6, if the latter is not compiled as a module.

Note that when IPv6 is compiled as builtin, it will be checked first,
so we have a single additional compare on the more common path.

Signed-off-by: Paolo Abeni 
---
 include/net/inet_common.h |  2 ++
 net/core/dev.c| 10 --
 net/ipv4/af_inet.c|  4 
 net/ipv6/ip6_offload.c|  4 
 4 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 3ca969cbd161..56e7592811ea 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -2,6 +2,8 @@
 #ifndef _INET_COMMON_H
 #define _INET_COMMON_H
 
+#include 
+
 extern const struct proto_ops inet_stream_ops;
 extern const struct proto_ops inet_dgram_ops;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 04a6b7100aac..a0ece218cf13 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -145,6 +145,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "net-sysfs.h"
 
@@ -5320,6 +5321,7 @@ static void flush_all_backlogs(void)
put_online_cpus();
 }
 
+INDIRECT_CALLABLE_DECLARE_2(int, network_gro_complete, struct sk_buff *, int);
 static int napi_gro_complete(struct sk_buff *skb)
 {
struct packet_offload *ptype;
@@ -5339,7 +5341,8 @@ static int napi_gro_complete(struct sk_buff *skb)
if (ptype->type != type || !ptype->callbacks.gro_complete)
continue;
 
-   err = ptype->callbacks.gro_complete(skb, 0);
+   err = INDIRECT_CALL_INET(ptype->callbacks.gro_complete,
+network_gro_complete, skb, 0);
break;
}
rcu_read_unlock();
@@ -5486,6 +5489,8 @@ static void gro_flush_oldest(struct list_head *head)
napi_gro_complete(oldest);
 }
 
+INDIRECT_CALLABLE_DECLARE_2(struct sk_buff *, network_gro_receive,
+   struct list_head *, struct sk_buff *);
 static enum gro_result dev_gro_receive(struct napi_struct *napi, struct 
sk_buff *skb)
 {
u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
@@ -5535,7 +5540,8 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
NAPI_GRO_CB(skb)->csum_valid = 0;
}
 
-   pp = ptype->callbacks.gro_receive(gro_head, skb);
+   pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive,
+   network_gro_receive, gro_head, skb);
break;
}
rcu_read_unlock();
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 326c422c22f8..04ab7ebd6e9b 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1505,6 +1505,8 @@ struct sk_buff *inet_gro_receive(struct list_head *head, 
struct sk_buff *skb)
return pp;
 }
 EXPORT_SYMBOL(inet_gro_receive);
+INDIRECT_CALLABLE(inet_gro_receive, 1, struct sk_buff *, network_gro_receive,
+ struct list_head *, struct sk_buff *);
 
 static struct sk_buff *ipip_gro_receive(struct list_head *head,
struct sk_buff *skb)
@@ -1589,6 +1591,8 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff)
return err;
 }
 EXPORT_SYMBOL(inet_gro_complete);
+INDIRECT_CALLABLE(inet_gro_complete, 1, int, network_gro_complete,
+ struct sk_buff *, int);
 
 static int ipip_gro_complete(struct sk_buff *skb, int nhoff)
 {
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 70f525c33cb6..a1c2bfb2ce0d 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -270,6 +270,8 @@ static struct sk_buff *ipv6_gro_receive(struct list_head 
*head,
 
return pp;
 }
+INDIRECT_CALLABLE(ipv6_gro_receive, 2, struct sk_buff *, network_gro_receive,
+ struct list_head *, struct sk_buff *);
 
 static struct sk_buff *sit_ip6ip6_gro_receive(struct list_head *head,
  struct sk_buff *skb)
@@ -327,6 +329,8 @@ static int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
 
return err;
 }
+INDIRECT_CALLABLE(ipv6_gro_complete, 2, int, network_gro_complete,
+ struct sk_buff *, int);
 
 static int sit_gro_complete(struct sk_buff *skb, int nhoff)
 {
-- 
2.19.2



[PATCH net-next 4/4] udp: use indirect call wrapper for GRO socket lookup

2018-12-03 Thread Paolo Abeni
This avoids another indirect call for UDP GRO. Again, the test
for the IPv6 variant is performed first.

Signed-off-by: Paolo Abeni 
---
 net/ipv4/udp.c | 2 ++
 net/ipv4/udp_offload.c | 6 --
 net/ipv6/udp.c | 2 ++
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index aff2a8e99e01..9ea851f47598 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -544,6 +544,8 @@ struct sock *udp4_lib_lookup_skb(struct sk_buff *skb,
return __udp4_lib_lookup_skb(skb, sport, dport, _table);
 }
 EXPORT_SYMBOL_GPL(udp4_lib_lookup_skb);
+INDIRECT_CALLABLE(udp4_lib_lookup_skb, 1, struct sock *, udp_lookup,
+ struct sk_buff *skb, __be16 sport, __be16 dport);
 
 /* Must be called under rcu_read_lock().
  * Does increment socket refcount.
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index c3c5b237c8e0..0ccd2aa1ab98 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -392,6 +392,8 @@ static struct sk_buff *udp_gro_receive_segment(struct 
list_head *head,
return NULL;
 }
 
+INDIRECT_CALLABLE_DECLARE_2(struct sock *, udp_lookup, struct sk_buff *skb,
+   __be16 sport, __be16 dport);
 struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
struct udphdr *uh, udp_lookup_t lookup)
 {
@@ -403,7 +405,7 @@ struct sk_buff *udp_gro_receive(struct list_head *head, 
struct sk_buff *skb,
struct sock *sk;
 
rcu_read_lock();
-   sk = (*lookup)(skb, uh->source, uh->dest);
+   sk = INDIRECT_CALL_INET(lookup, udp_lookup, skb, uh->source, uh->dest);
if (!sk)
goto out_unlock;
 
@@ -505,7 +507,7 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
uh->len = newlen;
 
rcu_read_lock();
-   sk = (*lookup)(skb, uh->source, uh->dest);
+   sk = INDIRECT_CALL_INET(lookup, udp_lookup, skb, uh->source, uh->dest);
if (sk && udp_sk(sk)->gro_enabled) {
err = udp_gro_complete_segment(skb);
} else if (sk && udp_sk(sk)->gro_complete) {
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 09cba4cfe31f..616f374760d1 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -282,6 +282,8 @@ struct sock *udp6_lib_lookup_skb(struct sk_buff *skb,
 inet6_sdif(skb), _table, skb);
 }
 EXPORT_SYMBOL_GPL(udp6_lib_lookup_skb);
+INDIRECT_CALLABLE(udp6_lib_lookup_skb, 2, struct sock *, udp_lookup,
+ struct sk_buff *skb, __be16 sport, __be16 dport);
 
 /* Must be called under rcu_read_lock().
  * Does increment socket refcount.
-- 
2.19.2



[PATCH net-next 3/4] net: use indirect call wrapper at GRO transport layer

2018-12-03 Thread Paolo Abeni
This avoids an indirect call in the receive path for TCP and UDP
packets. TCP takes precedence over UDP, so that we have a single
additional conditional in the common case.

Signed-off-by: Paolo Abeni 
---
 include/net/inet_common.h |  7 +++
 net/ipv4/af_inet.c| 11 +--
 net/ipv4/tcp_offload.c|  5 +
 net/ipv4/udp_offload.c|  5 +
 net/ipv6/ip6_offload.c| 10 --
 net/ipv6/tcpv6_offload.c  |  5 +
 net/ipv6/udp_offload.c|  5 +
 7 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 56e7592811ea..667bb8247f9a 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -56,4 +56,11 @@ static inline void inet_ctl_sock_destroy(struct sock *sk)
sock_release(sk->sk_socket);
 }
 
+#define indirect_call_gro_receive(name, cb, head, skb) \
+({ \
+   unlikely(gro_recursion_inc_test(skb)) ? \
+   NAPI_GRO_CB(skb)->flush |= 1, NULL :\
+   INDIRECT_CALL_2(cb, name, head, skb);   \
+})
+
 #endif
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 04ab7ebd6e9b..774f183f56e3 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1385,6 +1385,8 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 }
 EXPORT_SYMBOL(inet_gso_segment);
 
+INDIRECT_CALLABLE_DECLARE_2(struct sk_buff *, transport4_gro_receive,
+   struct list_head *head, struct sk_buff *skb);
 struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
 {
const struct net_offload *ops;
@@ -1494,7 +1496,8 @@ struct sk_buff *inet_gro_receive(struct list_head *head, 
struct sk_buff *skb)
skb_gro_pull(skb, sizeof(*iph));
skb_set_transport_header(skb, skb_gro_offset(skb));
 
-   pp = call_gro_receive(ops->callbacks.gro_receive, head, skb);
+   pp = indirect_call_gro_receive(transport4_gro_receive,
+  ops->callbacks.gro_receive, head, skb);
 
 out_unlock:
rcu_read_unlock();
@@ -1558,6 +1561,8 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, 
int len, int *addr_len)
return -EINVAL;
 }
 
+INDIRECT_CALLABLE_DECLARE_2(int, transport4_gro_complete, struct sk_buff *skb,
+   int);
 int inet_gro_complete(struct sk_buff *skb, int nhoff)
 {
__be16 newlen = htons(skb->len - nhoff);
@@ -1583,7 +1588,9 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff)
 * because any hdr with option will have been flushed in
 * inet_gro_receive().
 */
-   err = ops->callbacks.gro_complete(skb, nhoff + sizeof(*iph));
+   err = INDIRECT_CALL_2(ops->callbacks.gro_complete,
+ transport4_gro_complete, skb,
+ nhoff + sizeof(*iph));
 
 out_unlock:
rcu_read_unlock();
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 870b0a335061..3d5dfac4cd1b 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -10,6 +10,7 @@
  * TCPv4 GSO/GRO support
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -317,6 +318,8 @@ static struct sk_buff *tcp4_gro_receive(struct list_head 
*head, struct sk_buff *
 
return tcp_gro_receive(head, skb);
 }
+INDIRECT_CALLABLE(tcp4_gro_receive, 2, struct sk_buff *, 
transport4_gro_receive,
+ struct list_head *, struct sk_buff *);
 
 static int tcp4_gro_complete(struct sk_buff *skb, int thoff)
 {
@@ -332,6 +335,8 @@ static int tcp4_gro_complete(struct sk_buff *skb, int thoff)
 
return tcp_gro_complete(skb);
 }
+INDIRECT_CALLABLE(tcp4_gro_complete, 2, int, transport4_gro_complete,
+ struct sk_buff *, int);
 
 static const struct net_offload tcpv4_offload = {
.callbacks = {
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 0646d61f4fa8..c3c5b237c8e0 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
netdev_features_t features,
@@ -477,6 +478,8 @@ static struct sk_buff *udp4_gro_receive(struct list_head 
*head,
NAPI_GRO_CB(skb)->flush = 1;
return NULL;
 }
+INDIRECT_CALLABLE(udp4_gro_receive, 1, struct sk_buff *, 
transport4_gro_receive,
+ struct list_head *, struct sk_buff *);
 
 static int udp_gro_complete_segment(struct sk_buff *skb)
 {
@@ -536,6 +539,8 @@ static int udp4_gro_complete(struct sk_buff *skb, int nhoff)
 
return udp_gro_complete(skb, nhoff, udp4_lib_lookup_skb);
 }
+INDIRECT_CALLABLE(udp4_gro_complete, 1, int, transport4_gro_complete,
+ struct sk_buff *, int);
 
 static const struct net_offload udpv4_offload = {
.callbacks = {
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index a1c2bfb2ce0d..eeca4

[PATCH net-next 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

2018-12-03 Thread Paolo Abeni
This header defines a bunch of helpers that allow avoiding the
retpoline overhead when calling builtin functions via function pointers.
It boils down to explicitly comparing the function pointers against
known builtin functions and, on a match, invoking the latter directly.

The macros defined here implement the boilerplate for the above scheme
and will be used by the next patches.

rfc -> v1:
 - use branch prediction hint, as suggested by Eric

Suggested-by: Eric Dumazet 
Signed-off-by: Paolo Abeni 
---
 include/linux/indirect_call_wrapper.h | 77 +++
 1 file changed, 77 insertions(+)
 create mode 100644 include/linux/indirect_call_wrapper.h

diff --git a/include/linux/indirect_call_wrapper.h 
b/include/linux/indirect_call_wrapper.h
new file mode 100644
index ..d6894ffc33f6
--- /dev/null
+++ b/include/linux/indirect_call_wrapper.h
@@ -0,0 +1,77 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_INDIRECT_CALL_WRAPPER_H
+#define _LINUX_INDIRECT_CALL_WRAPPER_H
+
+#ifdef CONFIG_RETPOLINE
+
+/*
+ * INDIRECT_CALL_$NR - wrapper for indirect calls with $NR known builtin
+ *  @f: function pointer
+ *  @name: base name for builtin functions, see INDIRECT_CALLABLE_DECLARE_$NR
+ *  @__VA_ARGS__: arguments for @f
+ *
+ * Avoid retpoline overhead for known builtin, checking @f vs each of them and
+ * eventually invoking directly the builtin function. Fallback to the indirect
+ * call
+ */
+#define INDIRECT_CALL_1(f, name, ...)  \
+   ({  \
+   likely(f == name ## 1) ? name ## 1(__VA_ARGS__) :   \
+f(__VA_ARGS__);\
+   })
+#define INDIRECT_CALL_2(f, name, ...)  \
+   ({  \
+   likely(f == name ## 2) ? name ## 2(__VA_ARGS__) :   \
+INDIRECT_CALL_1(f, name, __VA_ARGS__); \
+   })
+
+/*
+ * INDIRECT_CALLABLE_DECLARE_$NR - declare $NR known builtin for
+ * INDIRECT_CALL_$NR usage
+ *  @type: return type for the builtin function
+ *  @name: base name for builtin functions, the full list is generated 
appending
+ *the numbers in the 1..@NR range
+ *  @__VA_ARGS__: arguments type list for the builtin function
+ *
+ * Builtin with higher $NR will be checked first by INDIRECT_CALL_$NR
+ */
+#define INDIRECT_CALLABLE_DECLARE_1(type, name, ...)   \
+   extern type name ## 1(__VA_ARGS__)
+#define INDIRECT_CALLABLE_DECLARE_2(type, name, ...)   \
+   extern type name ## 2(__VA_ARGS__); \
+   INDIRECT_CALLABLE_DECLARE_1(type, name, __VA_ARGS__)
+
+/*
+ * INDIRECT_CALLABLE - allow usage of a builtin function from INDIRECT_CALL_$NR
+ *  @f: builtin function name
+ *  @nr: id associated with this builtin, higher values will be checked first 
by
+ *  INDIRECT_CALL_$NR
+ *  @type: function return type
+ *  @name: base name used by INDIRECT_CALL_ to access the builtin list
+ *  @__VA_ARGS__: arguments type list for @f
+ */
+#define INDIRECT_CALLABLE(f, nr, type, name, ...)  \
+   __alias(f) type name ## nr(__VA_ARGS__)
+
+#else
+#define INDIRECT_CALL_1(f, name, ...) f(__VA_ARGS__)
+#define INDIRECT_CALL_2(f, name, ...) f(__VA_ARGS__)
+#define INDIRECT_CALLABLE_DECLARE_1(type, name, ...)
+#define INDIRECT_CALLABLE_DECLARE_2(type, name, ...)
+#define INDIRECT_CALLABLE(f, nr, type, name, ...)
+#endif
+
+/*
+ * We can use INDIRECT_CALL_$NR for ipv6 related functions only if ipv6 is
+ * builtin, this macro simplify dealing with indirect calls with only ipv4/ipv6
+ * alternatives
+ */
+#if IS_BUILTIN(CONFIG_IPV6)
+#define INDIRECT_CALL_INET INDIRECT_CALL_2
+#elif IS_ENABLED(CONFIG_INET)
+#define INDIRECT_CALL_INET INDIRECT_CALL_1
+#else
+#define INDIRECT_CALL_INET(...)
+#endif
+
+#endif
-- 
2.19.2
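The config-driven fallback at the bottom of the header can be emulated in
plain C, with an ordinary define standing in for IS_BUILTIN(CONFIG_IPV6);
the callee names are invented (name ## N pasting matches the header above):

```c
#include <assert.h>

#define likely(x) __builtin_expect(!!(x), 1)

/* same shape as the header's macros: higher-numbered builtin checked first */
#define INDIRECT_CALL_1(f, name, ...) \
	(likely((f) == name ## 1) ? name ## 1(__VA_ARGS__) : (f)(__VA_ARGS__))
#define INDIRECT_CALL_2(f, name, ...) \
	(likely((f) == name ## 2) ? name ## 2(__VA_ARGS__) : \
				    INDIRECT_CALL_1(f, name, __VA_ARGS__))

/* stand-in for IS_BUILTIN(CONFIG_IPV6): flip to 0 to drop the v6 compare */
#define SIMULATE_IPV6_BUILTIN 1

#if SIMULATE_IPV6_BUILTIN
#define INDIRECT_CALL_INET INDIRECT_CALL_2
#else
#define INDIRECT_CALL_INET INDIRECT_CALL_1
#endif

/* proto_rcv1 plays the ipv4 handler, proto_rcv2 the ipv6 one */
static int proto_rcv1(int len) { return len + 4; }
static int proto_rcv2(int len) { return len + 6; }

static int deliver(int (*cb)(int), int len)
{
	return INDIRECT_CALL_INET(cb, proto_rcv, len);
}
```

This is how a single call site can transparently gain or lose the ipv6
compare depending on how ipv6 is built.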



[PATCH net-next 0/4] net: mitigate retpoline overhead

2018-12-03 Thread Paolo Abeni
The spectre v2 counter-measures, aka retpolines, are a source of measurable
overhead [1]. We can partially address that, when the function pointer refers
to a builtin symbol, by resorting to a list of tests against well-known
builtin functions plus direct calls.

Experimental results [2] show that replacing a single indirect call via
retpoline with several branches and direct calls gives performance gains
even when multiple branches are added - 5 or more, as reported in [2].

This may lead to some uglification around the indirect calls. In netconf 2018
Eric Dumazet described a technique to hide the most relevant part of the needed
boilerplate with some macro help.

This series is a [re-]implementation of that idea, exposing the introduced
helpers in a new header file. They are later leveraged to avoid the indirect
call overhead in the GRO path, when possible.

Overall this gives a > 10% performance improvement for the UDP GRO benchmark,
and a smaller but measurable one for TCP syn flood.

The added infra can be used in follow-up patches to cope with retpoline
overhead at other points of the networking stack (e.g. at the qdisc layer)
and possibly even in other subsystems.

rfc -> v1:
 - use branch prediction hints, as suggested by Eric

[1] http://vger.kernel.org/netconf2018_files/PaoloAbeni_netconf2018.pdf
[2] https://linuxplumbersconf.org/event/2/contributions/99/attachments/98/117/lpc18_paper_af_xdp_perf-v2.pdf

Paolo Abeni (4):
  indirect call wrappers: helpers to speed-up indirect calls of builtin
  net: use indirect call wrappers at GRO network layer
  net: use indirect call wrapper at GRO transport layer
  udp: use indirect call wrapper for GRO socket lookup

 include/linux/indirect_call_wrapper.h | 77 +++
 include/net/inet_common.h |  9 
 net/core/dev.c| 10 +++-
 net/ipv4/af_inet.c| 15 +-
 net/ipv4/tcp_offload.c|  5 ++
 net/ipv4/udp.c|  2 +
 net/ipv4/udp_offload.c| 11 +++-
 net/ipv6/ip6_offload.c| 14 -
 net/ipv6/tcpv6_offload.c  |  5 ++
 net/ipv6/udp.c|  2 +
 net/ipv6/udp_offload.c|  5 ++
 11 files changed, 147 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/indirect_call_wrapper.h

-- 
2.19.2



Re: [RFC PATCH 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

2018-11-30 Thread Paolo Abeni
Hi,

On Thu, 2018-11-29 at 15:25 -0800, Eric Dumazet wrote:
> 
> On 11/29/2018 03:00 PM, Paolo Abeni wrote:
> > This header defines a bunch of helpers that allow avoiding the
> > retpoline overhead when calling builtin functions via function pointers.
> > It boils down to explicitly comparing the function pointers against
> > known builtin functions and, on a match, invoking the latter directly.
> > 
> > The macros defined here implement the boilerplate for the above scheme
> > and will be used by the next patches.
> > 
> > Suggested-by: Eric Dumazet 

Oops... typo here. For some reason checkpatch did not catch it.

> > Signed-off-by: Paolo Abeni 
> > ---
> >  include/linux/indirect_call_wrapper.h | 77 +++
> >  1 file changed, 77 insertions(+)
> >  create mode 100644 include/linux/indirect_call_wrapper.h
> > 
> > diff --git a/include/linux/indirect_call_wrapper.h 
> > b/include/linux/indirect_call_wrapper.h
> > new file mode 100644
> > index ..57e82b4a166d
> > --- /dev/null
> > +++ b/include/linux/indirect_call_wrapper.h
> > @@ -0,0 +1,77 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_INDIRECT_CALL_WRAPPER_H
> > +#define _LINUX_INDIRECT_CALL_WRAPPER_H
> > +
> > +#ifdef CONFIG_RETPOLINE
> > +
> > +/*
> > + * INDIRECT_CALL_$NR - wrapper for indirect calls with $NR known builtin
> > + *  @f: function pointer
> > + *  @name: base name for builtin functions, see 
> > INDIRECT_CALLABLE_DECLARE_$NR
> > + *  @__VA_ARGS__: arguments for @f
> > + *
> > + * Avoid retpoline overhead for known builtin, checking @f vs each of them 
> > and
> > + * eventually invoking directly the builtin function. Fallback to the 
> > indirect
> > + * call
> > + */
> > +#define INDIRECT_CALL_1(f, name, ...)  
> > \
> > +   ({  \
> > +   f == name ## 1 ? name ## 1(__VA_ARGS__) :   \
> 
>   likely(f == name ## 1) ? ...

Thank you for the feedback!

I thought about the above, and then I avoided it, because I was not
100% sure it would fit cases (if any) where we have 2 or more equally
likely builtins.

I guess we can address such cases if and when they pop up. I'll do
some more benchmarks with the branch prediction hints, and then, if
there are no surprises, I'll add them in v1.
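For reference, the hint in question is the kernel's likely() macro, which
maps to __builtin_expect. A userspace sketch showing that the hint changes
codegen, not semantics (the dispatch shape and names are invented):

```c
#include <assert.h>

/* userspace rendition of the kernel's branch prediction hints */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* hint that @fp almost always equals the hot builtin callee */
static int dispatch(int (*fp)(int), int (*hot)(int), int arg)
{
	if (likely(fp == hot))
		return hot(arg);  /* direct call on the predicted path */
	return fp(arg);           /* indirect (retpolined) fallback */
}

static int hot_fn(int x)  { return x + 1; }
static int cold_fn(int x) { return x - 1; }
```

__builtin_expect only steers block layout and prediction; the value of the
condition, and therefore which callee runs, is unchanged.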

BTW I would like to give the correct attribution here. Does 'Suggested-by'
fit? Should I list some other guy @google?

Thanks,

Paolo



[RFC PATCH 2/4] net: use indirect call wrappers at GRO network layer

2018-11-29 Thread Paolo Abeni
This avoids an indirect call in the L3 GRO receive path, both for
ipv4 and for ipv6, if the latter is not compiled as a module.

Note that when IPv6 is compiled as builtin, it will be checked first,
so we have a single additional compare on the more common path.

Signed-off-by: Paolo Abeni 
---
 include/net/inet_common.h |  2 ++
 net/core/dev.c| 10 --
 net/ipv4/af_inet.c|  4 
 net/ipv6/ip6_offload.c|  4 
 4 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 3ca969cbd161..56e7592811ea 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -2,6 +2,8 @@
 #ifndef _INET_COMMON_H
 #define _INET_COMMON_H
 
+#include 
+
 extern const struct proto_ops inet_stream_ops;
 extern const struct proto_ops inet_dgram_ops;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index f69b2fcdee40..619f5902600f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -145,6 +145,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "net-sysfs.h"
 
@@ -5306,6 +5307,7 @@ static void flush_all_backlogs(void)
put_online_cpus();
 }
 
+INDIRECT_CALLABLE_DECLARE_2(int, network_gro_complete, struct sk_buff *, int);
 static int napi_gro_complete(struct sk_buff *skb)
 {
struct packet_offload *ptype;
@@ -5325,7 +5327,8 @@ static int napi_gro_complete(struct sk_buff *skb)
if (ptype->type != type || !ptype->callbacks.gro_complete)
continue;
 
-   err = ptype->callbacks.gro_complete(skb, 0);
+   err = INDIRECT_CALL_INET(ptype->callbacks.gro_complete,
+network_gro_complete, skb, 0);
break;
}
rcu_read_unlock();
@@ -5472,6 +5475,8 @@ static void gro_flush_oldest(struct list_head *head)
napi_gro_complete(oldest);
 }
 
+INDIRECT_CALLABLE_DECLARE_2(struct sk_buff *, network_gro_receive,
+   struct list_head *, struct sk_buff *);
 static enum gro_result dev_gro_receive(struct napi_struct *napi, struct 
sk_buff *skb)
 {
u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
@@ -5521,7 +5526,8 @@ static enum gro_result dev_gro_receive(struct napi_struct 
*napi, struct sk_buff
NAPI_GRO_CB(skb)->csum_valid = 0;
}
 
-   pp = ptype->callbacks.gro_receive(gro_head, skb);
+   pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive,
+   network_gro_receive, gro_head, skb);
break;
}
rcu_read_unlock();
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 326c422c22f8..04ab7ebd6e9b 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1505,6 +1505,8 @@ struct sk_buff *inet_gro_receive(struct list_head *head, 
struct sk_buff *skb)
return pp;
 }
 EXPORT_SYMBOL(inet_gro_receive);
+INDIRECT_CALLABLE(inet_gro_receive, 1, struct sk_buff *, network_gro_receive,
+ struct list_head *, struct sk_buff *);
 
 static struct sk_buff *ipip_gro_receive(struct list_head *head,
struct sk_buff *skb)
@@ -1589,6 +1591,8 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff)
return err;
 }
 EXPORT_SYMBOL(inet_gro_complete);
+INDIRECT_CALLABLE(inet_gro_complete, 1, int, network_gro_complete,
+ struct sk_buff *, int);
 
 static int ipip_gro_complete(struct sk_buff *skb, int nhoff)
 {
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 70f525c33cb6..a1c2bfb2ce0d 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -270,6 +270,8 @@ static struct sk_buff *ipv6_gro_receive(struct list_head *head,
 
return pp;
 }
+INDIRECT_CALLABLE(ipv6_gro_receive, 2, struct sk_buff *, network_gro_receive,
+ struct list_head *, struct sk_buff *);
 
 static struct sk_buff *sit_ip6ip6_gro_receive(struct list_head *head,
  struct sk_buff *skb)
@@ -327,6 +329,8 @@ static int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
 
return err;
 }
+INDIRECT_CALLABLE(ipv6_gro_complete, 2, int, network_gro_complete,
+ struct sk_buff *, int);
 
 static int sit_gro_complete(struct sk_buff *skb, int nhoff)
 {
-- 
2.19.2



[RFC PATCH 3/4] net: use indirect call wrapper at GRO transport layer

2018-11-29 Thread Paolo Abeni
This avoids an indirect call in the receive path for TCP and UDP
packets. TCP takes precedence over UDP, so that we have a single
additional conditional in the common case.

Signed-off-by: Paolo Abeni 
---
 include/net/inet_common.h |  7 +++
 net/ipv4/af_inet.c| 11 +--
 net/ipv4/tcp_offload.c|  5 +
 net/ipv4/udp_offload.c|  5 +
 net/ipv6/ip6_offload.c| 10 --
 net/ipv6/tcpv6_offload.c  |  5 +
 net/ipv6/udp_offload.c|  5 +
 7 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 56e7592811ea..667bb8247f9a 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -56,4 +56,11 @@ static inline void inet_ctl_sock_destroy(struct sock *sk)
sock_release(sk->sk_socket);
 }
 
+#define indirect_call_gro_receive(name, cb, head, skb) \
+({ \
+   unlikely(gro_recursion_inc_test(skb)) ? \
+   NAPI_GRO_CB(skb)->flush |= 1, NULL :\
+   INDIRECT_CALL_2(cb, name, head, skb);   \
+})
+
 #endif
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 04ab7ebd6e9b..774f183f56e3 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1385,6 +1385,8 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 }
 EXPORT_SYMBOL(inet_gso_segment);
 
+INDIRECT_CALLABLE_DECLARE_2(struct sk_buff *, transport4_gro_receive,
+   struct list_head *head, struct sk_buff *skb);
 struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
 {
const struct net_offload *ops;
@@ -1494,7 +1496,8 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
skb_gro_pull(skb, sizeof(*iph));
skb_set_transport_header(skb, skb_gro_offset(skb));
 
-   pp = call_gro_receive(ops->callbacks.gro_receive, head, skb);
+   pp = indirect_call_gro_receive(transport4_gro_receive,
+  ops->callbacks.gro_receive, head, skb);
 
 out_unlock:
rcu_read_unlock();
@@ -1558,6 +1561,8 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
return -EINVAL;
 }
 
+INDIRECT_CALLABLE_DECLARE_2(int, transport4_gro_complete, struct sk_buff *skb,
+   int);
 int inet_gro_complete(struct sk_buff *skb, int nhoff)
 {
__be16 newlen = htons(skb->len - nhoff);
@@ -1583,7 +1588,9 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff)
 * because any hdr with option will have been flushed in
 * inet_gro_receive().
 */
-   err = ops->callbacks.gro_complete(skb, nhoff + sizeof(*iph));
+   err = INDIRECT_CALL_2(ops->callbacks.gro_complete,
+ transport4_gro_complete, skb,
+ nhoff + sizeof(*iph));
 
 out_unlock:
rcu_read_unlock();
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 870b0a335061..3d5dfac4cd1b 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -10,6 +10,7 @@
  * TCPv4 GSO/GRO support
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -317,6 +318,8 @@ static struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *
 
return tcp_gro_receive(head, skb);
 }
+INDIRECT_CALLABLE(tcp4_gro_receive, 2, struct sk_buff *, transport4_gro_receive,
+ struct list_head *, struct sk_buff *);
 
 static int tcp4_gro_complete(struct sk_buff *skb, int thoff)
 {
@@ -332,6 +335,8 @@ static int tcp4_gro_complete(struct sk_buff *skb, int thoff)
 
return tcp_gro_complete(skb);
 }
+INDIRECT_CALLABLE(tcp4_gro_complete, 2, int, transport4_gro_complete,
+ struct sk_buff *, int);
 
 static const struct net_offload tcpv4_offload = {
.callbacks = {
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 0646d61f4fa8..c3c5b237c8e0 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
netdev_features_t features,
@@ -477,6 +478,8 @@ static struct sk_buff *udp4_gro_receive(struct list_head *head,
NAPI_GRO_CB(skb)->flush = 1;
return NULL;
 }
+INDIRECT_CALLABLE(udp4_gro_receive, 1, struct sk_buff *, transport4_gro_receive,
+ struct list_head *, struct sk_buff *);
 
 static int udp_gro_complete_segment(struct sk_buff *skb)
 {
@@ -536,6 +539,8 @@ static int udp4_gro_complete(struct sk_buff *skb, int nhoff)
 
return udp_gro_complete(skb, nhoff, udp4_lib_lookup_skb);
 }
+INDIRECT_CALLABLE(udp4_gro_complete, 1, int, transport4_gro_complete,
+ struct sk_buff *, int);
 
 static const struct net_offload udpv4_offload = {
.callbacks = {
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index a1c2bfb2ce0d..eeca4

[RFC PATCH 4/4] udp: use indirect call wrapper for GRO socket lookup

2018-11-29 Thread Paolo Abeni
This avoids another indirect call for UDP GRO. Again, the test
for the IPv6 variant is performed first.

Signed-off-by: Paolo Abeni 
---
 net/ipv4/udp.c | 2 ++
 net/ipv4/udp_offload.c | 6 --
 net/ipv6/udp.c | 2 ++
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index aff2a8e99e01..9ea851f47598 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -544,6 +544,8 @@ struct sock *udp4_lib_lookup_skb(struct sk_buff *skb,
	return __udp4_lib_lookup_skb(skb, sport, dport, &udp_table);
 }
 EXPORT_SYMBOL_GPL(udp4_lib_lookup_skb);
+INDIRECT_CALLABLE(udp4_lib_lookup_skb, 1, struct sock *, udp_lookup,
+ struct sk_buff *skb, __be16 sport, __be16 dport);
 
 /* Must be called under rcu_read_lock().
  * Does increment socket refcount.
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index c3c5b237c8e0..0ccd2aa1ab98 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -392,6 +394,8 @@ static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
return NULL;
 }
 
+INDIRECT_CALLABLE_DECLARE_2(struct sock *, udp_lookup, struct sk_buff *skb,
+   __be16 sport, __be16 dport);
 struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
struct udphdr *uh, udp_lookup_t lookup)
 {
@@ -403,7 +405,7 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
struct sock *sk;
 
rcu_read_lock();
-   sk = (*lookup)(skb, uh->source, uh->dest);
+   sk = INDIRECT_CALL_INET(lookup, udp_lookup, skb, uh->source, uh->dest);
if (!sk)
goto out_unlock;
 
@@ -505,7 +507,7 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
uh->len = newlen;
 
rcu_read_lock();
-   sk = (*lookup)(skb, uh->source, uh->dest);
+   sk = INDIRECT_CALL_INET(lookup, udp_lookup, skb, uh->source, uh->dest);
if (sk && udp_sk(sk)->gro_enabled) {
err = udp_gro_complete_segment(skb);
} else if (sk && udp_sk(sk)->gro_complete) {
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 09cba4cfe31f..616f374760d1 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -282,6 +282,8 @@ struct sock *udp6_lib_lookup_skb(struct sk_buff *skb,
				 inet6_sdif(skb), &udp_table, skb);
 }
 EXPORT_SYMBOL_GPL(udp6_lib_lookup_skb);
+INDIRECT_CALLABLE(udp6_lib_lookup_skb, 2, struct sock *, udp_lookup,
+ struct sk_buff *skb, __be16 sport, __be16 dport);
 
 /* Must be called under rcu_read_lock().
  * Does increment socket refcount.
-- 
2.19.2



[RFC PATCH 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

2018-11-29 Thread Paolo Abeni
This header defines a set of helpers that avoid the retpoline
overhead when calling builtin functions via function pointers.
It boils down to explicitly comparing the function pointer to
known builtin functions and, on a match, invoking the latter directly.

The macros defined here implement the boilerplate for the above scheme
and will be used by the next patches.

Suggested-by: Eric Dumazet 
Signed-off-by: Paolo Abeni 
---
 include/linux/indirect_call_wrapper.h | 77 +++
 1 file changed, 77 insertions(+)
 create mode 100644 include/linux/indirect_call_wrapper.h

diff --git a/include/linux/indirect_call_wrapper.h b/include/linux/indirect_call_wrapper.h
new file mode 100644
index ..57e82b4a166d
--- /dev/null
+++ b/include/linux/indirect_call_wrapper.h
@@ -0,0 +1,77 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_INDIRECT_CALL_WRAPPER_H
+#define _LINUX_INDIRECT_CALL_WRAPPER_H
+
+#ifdef CONFIG_RETPOLINE
+
+/*
+ * INDIRECT_CALL_$NR - wrapper for indirect calls with $NR known builtin
+ *  @f: function pointer
+ *  @name: base name for builtin functions, see INDIRECT_CALLABLE_DECLARE_$NR
+ *  @__VA_ARGS__: arguments for @f
+ *
+ * Avoid retpoline overhead for known builtin, checking @f vs each of them and
+ * eventually invoking directly the builtin function. Fallback to the indirect
+ * call
+ */
+#define INDIRECT_CALL_1(f, name, ...)  \
+   ({  \
+   f == name ## 1 ? name ## 1(__VA_ARGS__) :   \
+f(__VA_ARGS__);\
+   })
+#define INDIRECT_CALL_2(f, name, ...)  \
+   ({  \
+   f == name ## 2 ? name ## 2(__VA_ARGS__) :   \
+INDIRECT_CALL_1(f, name, __VA_ARGS__); \
+   })
+
+/*
+ * INDIRECT_CALLABLE_DECLARE_$NR - declare $NR known builtin for
+ * INDIRECT_CALL_$NR usage
+ *  @type: return type for the builtin function
+ *  @name: base name for builtin functions, the full list is generated appending
+ *the numbers in the 1..@NR range
+ *  @__VA_ARGS__: arguments type list for the builtin function
+ *
+ * Builtin with higher $NR will be checked first by INDIRECT_CALL_$NR
+ */
+#define INDIRECT_CALLABLE_DECLARE_1(type, name, ...)   \
+   extern type name ## 1(__VA_ARGS__)
+#define INDIRECT_CALLABLE_DECLARE_2(type, name, ...)   \
+   extern type name ## 2(__VA_ARGS__); \
+   INDIRECT_CALLABLE_DECLARE_1(type, name, __VA_ARGS__)
+
+/*
+ * INDIRECT_CALLABLE - allow usage of a builtin function from INDIRECT_CALL_$NR
+ *  @f: builtin function name
+ *  @nr: id associated with this builtin, higher values will be checked first by
+ *  INDIRECT_CALL_$NR
+ *  @type: function return type
+ *  @name: base name used by INDIRECT_CALL_ to access the builtin list
+ *  @__VA_ARGS__: arguments type list for @f
+ */
+#define INDIRECT_CALLABLE(f, nr, type, name, ...)  \
+   __alias(f) type name ## nr(__VA_ARGS__)
+
+#else
+#define INDIRECT_CALL_1(f, name, ...) f(__VA_ARGS__)
+#define INDIRECT_CALL_2(f, name, ...) f(__VA_ARGS__)
+#define INDIRECT_CALLABLE_DECLARE_1(type, name, ...)
+#define INDIRECT_CALLABLE_DECLARE_2(type, name, ...)
+#define INDIRECT_CALLABLE(f, nr, type, name, ...)
+#endif
+
+/*
+ * We can use INDIRECT_CALL_$NR for ipv6 related functions only if ipv6 is
+ * builtin, this macro simplify dealing with indirect calls with only ipv4/ipv6
+ * alternatives
+ */
+#if IS_BUILTIN(CONFIG_IPV6)
+#define INDIRECT_CALL_INET INDIRECT_CALL_2
+#elif IS_ENABLED(CONFIG_INET)
+#define INDIRECT_CALL_INET INDIRECT_CALL_1
+#else
+#define INDIRECT_CALL_INET(...)
+#endif
+
+#endif
-- 
2.19.2



[RFC PATCH 0/4] net: mitigate retpoline overhead

2018-11-29 Thread Paolo Abeni
The Spectre v2 counter-measures, aka retpolines, are a source of measurable
overhead[1]. We can partially address that when the function pointer refers to
a builtin symbol, by resorting to a list of tests against well-known builtin
functions and direct calls.

This may lead to some uglification around the indirect calls. At netconf 2018,
Eric Dumazet described a technique to hide the most relevant part of the needed
boilerplate with some macro help.

This series is a [re-]implementation of such idea, exposing the introduced
helpers in a new header file. They are later leveraged to avoid the indirect
call overhead in the GRO path, when possible.

Overall this gives a >10% performance improvement on the UDP GRO benchmark,
and a smaller but measurable one under TCP syn flood.

The added infra can be used in follow-up patches to cope with retpoline overhead
in other points of the networking stack (e.g. at the qdisc layer) and possibly
even in other subsystems.

Paolo Abeni (4):
  indirect call wrappers: helpers to speed-up indirect calls of builtin
  net: use indirect call wrappers at GRO network layer
  net: use indirect call wrapper at GRO transport layer
  udp: use indirect call wrapper for GRO socket lookup

 include/linux/indirect_call_wrapper.h | 77 +++
 include/net/inet_common.h |  9 
 net/core/dev.c| 10 +++-
 net/ipv4/af_inet.c| 15 +-
 net/ipv4/tcp_offload.c|  5 ++
 net/ipv4/udp.c|  2 +
 net/ipv4/udp_offload.c| 11 +++-
 net/ipv6/ip6_offload.c| 14 -
 net/ipv6/tcpv6_offload.c  |  5 ++
 net/ipv6/udp.c|  2 +
 net/ipv6/udp_offload.c|  5 ++
 11 files changed, 147 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/indirect_call_wrapper.h

-- 
2.19.2



Re: [PATCH] selftests/bpf: add config fragment CONFIG_NF_NAT_IPV6

2018-11-29 Thread Paolo Abeni
Hi,

On Thu, 2018-11-29 at 15:01 +0530, Naresh Kamboju wrote:
> CONFIG_NF_NAT_IPV6=y is required for bpf test_sockmap test case
> Fixes,
> ip6tables v1.6.1: can't initialize ip6tables table `nat': Table does
> not exist
> 
> Signed-off-by: Naresh Kamboju 
> ---
>  tools/testing/selftests/bpf/config | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
> index 37f947ec44ed..d7076cf04a9d 100644
> --- a/tools/testing/selftests/bpf/config
> +++ b/tools/testing/selftests/bpf/config
> @@ -23,3 +23,4 @@ CONFIG_LWTUNNEL=y
>  CONFIG_BPF_STREAM_PARSER=y
>  CONFIG_XDP_SOCKETS=y
>  CONFIG_FTRACE_SYSCALLS=y
> +CONFIG_NF_NAT_IPV6=y

AFAIK, the selftest Kconfig infra does not pull dependent CONFIG items
('depends on' ...) automatically: if the bpf test_sockmap test needs
CONFIG_NF_NAT_IPV6, you should also include the non-trivial chain of
dependencies on top of whatever is currently explicitly requested.

On the other side, the self-tests already pull CONFIG_NF_NAT_IPV6=m for
'net' related tests, so if you run:

make kselftest-merge

before compiling the kernel for self-tests (which is AFAIK a mandatory
step) you should not get the reported error. Did you perform the
above step?

Cheers,

Paolo



Re: [PATCH net-next v2 1/2] udp: msg_zerocopy

2018-11-29 Thread Paolo Abeni
Hi,

Thank you for the update!

On Wed, 2018-11-28 at 18:50 -0500, Willem de Bruijn wrote:
> I did revert to the basic implementation using an extra ref
> for the function call, similar to TCP, as you suggested.
> 
> On top of that as a separate optimization patch I have a
> variant that uses refcnt zero by replacing refcount_inc with
> refcount_set(.., refcount_read(..) + 1). Not very pretty.

If the skb/uarg is not shared (no other threads can touch the refcnt)
before ip*_append_data() completes, how about something like the
following (incremental diff on top of patch 1/2, untested, uncompiled,
just to give the idea):

---
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 04f52e719571..1e3d195ffdfb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -480,6 +480,13 @@ static inline void sock_zerocopy_get(struct ubuf_info *uarg)
	refcount_inc(&uarg->refcnt);
 }
 
+/* use only before uarg is actually shared */
+static inline void __sock_zerocopy_init(struct ubuf_info *uarg, int cnt)
+{
+   if (uarg)
+   refcount_set(&uarg->refcnt, cnt);
+}
+
 void sock_zerocopy_put(struct ubuf_info *uarg);
 void sock_zerocopy_put_abort(struct ubuf_info *uarg);
 
@@ -1326,13 +1333,20 @@ static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
return is_zcopy ? skb_uarg(skb) : NULL;
 }
 
-static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
+static inline int __skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
 {
if (skb && uarg && !skb_zcopy(skb)) {
-   sock_zerocopy_get(uarg);
skb_shinfo(skb)->destructor_arg = uarg;
skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;
+   return 1;
}
+   return 0;
+}
+
+static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
+{
+   if (__skb_zcopy_set(skb, uarg))
+   sock_zerocopy_get(uarg);
 }
 
 static inline void skb_zcopy_set_nouarg(struct sk_buff *skb, void *val)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2179ef84bb44..435bac91d293 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -957,7 +957,7 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
uarg->len = 1;
uarg->bytelen = size;
uarg->zerocopy = 1;
-   refcount_set(&uarg->refcnt, sk->sk_type == SOCK_STREAM ? 1 : 0);
+   refcount_set(&uarg->refcnt, 1);
sock_hold(sk);
 
return uarg;
@@ -1097,13 +1097,6 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg)
		atomic_dec(&sk->sk_zckey);
uarg->len--;
 
-   /* Stream socks hold a ref for the syscall, as skbs can be sent
-* and freed inside the loop, dropping refcnt to 0 inbetween.
-* Datagrams do not need this, but sock_zerocopy_put expects it.
-*/
-   if (sk->sk_type != SOCK_STREAM && !refcount_read(&uarg->refcnt))
-   refcount_set(>refcnt, 1);
-
sock_zerocopy_put(uarg);
}
 }
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 7504da2f33d6..d3285613d87a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -882,6 +882,7 @@ static int __ip_append_data(struct sock *sk,
struct rtable *rt = (struct rtable *)cork->dst;
unsigned int wmem_alloc_delta = 0;
u32 tskey = 0;
+   int uarg_refs = 0;
bool paged;
 
skb = skb_peek_tail(queue);
@@ -919,6 +920,7 @@ static int __ip_append_data(struct sock *sk,
 
if (flags & MSG_ZEROCOPY && length) {
uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+   uarg_refs = 1;
if (!uarg)
return -ENOBUFS;
if (rt->dst.dev->features & NETIF_F_SG &&
@@ -926,7 +928,7 @@ static int __ip_append_data(struct sock *sk,
paged = true;
} else {
uarg->zerocopy = 0;
-   skb_zcopy_set(skb, uarg);
+   uarg_refs += __skb_zcopy_set(skb, uarg);
}
}
 
@@ -1019,7 +1021,7 @@ static int __ip_append_data(struct sock *sk,
cork->tx_flags = 0;
skb_shinfo(skb)->tskey = tskey;
tskey = 0;
-   skb_zcopy_set(skb, uarg);
+   uarg_refs += __skb_zcopy_set(skb, uarg);
 
/*
 *  Find where to start putting bytes.
@@ -1121,6 +1123,7 @@ static int __ip_append_data(struct sock *sk,
length -= copy;
}
 
+   __sock_zerocopy_init(uarg, uarg_refs);
if (wmem_alloc_delta)
		refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
return 0;
@@ -1128,6 +1131,7 @@ static int __ip_append_data(struct sock *sk,
 error_efault:
err = -EFAULT;
 error:
+   __sock_zerocopy_init(uarg, uarg_refs);

[PATCH net-next] net: reorder flowi_common fields to avoid holes

2018-11-28 Thread Paolo Abeni
The flowi* structures are used and memset by several functions in the
critical path. Currently flowi_common has a couple of holes that
we can eliminate by reordering the struct fields. As a side effect,
both flowi4 and flowi6 shrink by 8 bytes.

Before:
pahole -EC flowi_common
struct flowi_common {
// ...
/* size: 40, cachelines: 1, members: 10 */
/* sum members: 32, holes: 1, sum holes: 4 */
/* padding: 4 */
/* last cacheline: 40 bytes */
};
pahole -EC flowi6
struct flowi6 {
// ...
/* size: 88, cachelines: 2, members: 6 */
/* padding: 4 */
/* last cacheline: 24 bytes */
};
pahole -EC flowi4
struct flowi4 {
// ...
/* size: 56, cachelines: 1, members: 4 */
/* padding: 4 */
/* last cacheline: 56 bytes */
};

After:
struct flowi_common {
// ...
/* size: 32, cachelines: 1, members: 10 */
/* last cacheline: 32 bytes */
};
struct flowi6 {
// ...
/* size: 80, cachelines: 2, members: 6 */
/* padding: 4 */
/* last cacheline: 16 bytes */
};
struct flowi4 {
// ...
/* size: 48, cachelines: 1, members: 4 */
/* padding: 4 */
/* last cacheline: 48 bytes */
};

Signed-off-by: Paolo Abeni 
---
 include/net/flow.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/flow.h b/include/net/flow.h
index 8ce21793094e..93f2c9a0f098 100644
--- a/include/net/flow.h
+++ b/include/net/flow.h
@@ -38,8 +38,8 @@ struct flowi_common {
 #define FLOWI_FLAG_KNOWN_NH0x02
 #define FLOWI_FLAG_SKIP_NH_OIF 0x04
__u32   flowic_secid;
-   struct flowi_tunnel flowic_tun_key;
kuid_t  flowic_uid;
+   struct flowi_tunnel flowic_tun_key;
 };
 
 union flowi_uli {
-- 
2.19.2



Re: [PATCH net-next v2 1/2] udp: msg_zerocopy

2018-11-26 Thread Paolo Abeni
On Mon, 2018-11-26 at 12:59 -0500, Willem de Bruijn wrote:
> The callers of this function do flush the queue of the other skbs on
> error, but only after the call to sock_zerocopy_put_abort.
> 
> sock_zerocopy_put_abort depends on total rollback to revert the
> sk_zckey increment and suppress the completion notification (which
> must not happen on return with error).
> 
> I don't immediately have a fix. Need to think about this some more..

[still out of sheer ignorance] How about taking a refcnt for the whole
ip_append_data() scope, like in the TCP case? That will add an atomic
op per loop (likely hitting the cache) but will remove some code hunks
in sock_zerocopy_put_abort() and sock_zerocopy_alloc().

Cheers,

Paolo



Re: [PATCH net-next v2 1/2] udp: msg_zerocopy

2018-11-26 Thread Paolo Abeni
Hi,

Sorry for the long delay...

On Mon, 2018-11-26 at 10:29 -0500, Willem de Bruijn wrote:
> @@ -1109,6 +1128,7 @@ static int __ip_append_data(struct sock *sk,
>  error_efault:
>   err = -EFAULT;
>  error:
> + sock_zerocopy_put_abort(uarg);
>   cork->length -= length;
>   IP_INC_STATS(sock_net(sk), IPSTATS_MIB_OUTDISCARDS);
>   refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);

Out of sheer ignorance on my side, don't we have bad reference
accounting if e.g.:

- uarg is attached to multiple skbs, each holding a ref, 
- there is a failure on 'getfrag()'

Such a failure will release 2 references (one in kfree_skb(), and another in
the above sock_zerocopy_put_abort(), as the count is still positive).

Cheers,

Paolo



[PATCH net v2] net: don't keep lonely packets forever in the gro hash

2018-11-21 Thread Paolo Abeni
Eric noted that with UDP GRO and NAPI timeout, we could keep a single
UDP packet inside the GRO hash forever, if the related NAPI instance
calls napi_gro_complete() at a higher frequency than the NAPI timeout.
Willem noted that even TCP packets could be trapped there, till the
next retransmission.
This patch tries to address the issue, flushing the old packets -
those with a NAPI_GRO_CB age before the current jiffy - before scheduling
the NAPI timeout. The rationale is that such a timeout should be
well below a jiffy and we are not flushing packets eligible for sane GRO.

v1  -> v2:
 - clarified the commit message and comment

RFC -> v1:
 - added 'Fixes tags', cleaned-up the wording.

Reported-by: Eric Dumazet 
Fixes: 3b47d30396ba ("net: gro: add a per device gro flush timer")
Signed-off-by: Paolo Abeni 
---
 net/core/dev.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 066aa902d85c..ddc551f24ba2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5970,11 +5970,14 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
if (work_done)
timeout = n->dev->gro_flush_timeout;
 
+   /* When the NAPI instance uses a timeout and keeps postponing
+* it, we need to bound somehow the time packets are kept in
+* the GRO layer
+*/
+   napi_gro_flush(n, !!timeout);
if (timeout)
		hrtimer_start(&n->timer, ns_to_ktime(timeout),
  HRTIMER_MODE_REL_PINNED);
-   else
-   napi_gro_flush(n, false);
}
	if (unlikely(!list_empty(&n->poll_list))) {
/* If n->poll_list is not empty, we need to mask irqs */
-- 
2.17.2



[PATCH net-next] selftests: explicitly require kernel features needed by udpgro tests

2018-11-21 Thread Paolo Abeni
commit 3327a9c46352f1 ("selftests: add functionals test for UDP GRO")
make use of ipv6 NAT, but such a feature is not currently implied by
selftests. Since the 'ip[6]tables' commands may actually create nft rules,
depending on the specific user-space version, let's pull both NF and
NFT nat modules plus the needed deps.

Reported-by: Naresh Kamboju 
Fixes: 3327a9c46352f1 ("selftests: add functionals test for UDP GRO")
Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/config | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
index cd3a2f1545b5..5821bdd98d20 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -14,3 +14,17 @@ CONFIG_IPV6_VTI=y
 CONFIG_DUMMY=y
 CONFIG_BRIDGE=y
 CONFIG_VLAN_8021Q=y
+CONFIG_NETFILTER=y
+CONFIG_NETFILTER_ADVANCED=y
+CONFIG_NF_CONNTRACK=m
+CONFIG_NF_NAT_IPV6=m
+CONFIG_NF_NAT_IPV4=m
+CONFIG_IP6_NF_IPTABLES=m
+CONFIG_IP_NF_IPTABLES=m
+CONFIG_IP6_NF_NAT=m
+CONFIG_IP_NF_NAT=m
+CONFIG_NF_TABLES=m
+CONFIG_NF_TABLES_IPV6=y
+CONFIG_NF_TABLES_IPV4=y
+CONFIG_NFT_CHAIN_NAT_IPV6=m
+CONFIG_NFT_CHAIN_NAT_IPV4=m
-- 
2.17.2



[PATCH net-next] net: don't keep lonely packets forever in the gro hash

2018-11-20 Thread Paolo Abeni
Eric noted that with UDP GRO and NAPI timeout, we could keep a single
UDP packet inside the GRO hash forever, if the related NAPI instance
calls napi_gro_complete() at a higher frequency than the NAPI timeout.
Willem noted that even TCP packets could be trapped there, till the
next retransmission.
This patch tries to address the issue, flushing the oldest packets before
scheduling the NAPI timeout. The rationale is that such a timeout should be
well below a jiffy and we are not flushing packets eligible for sane GRO.

RFC -> v1:
 - added 'Fixes tags', cleaned-up the wording.

Reported-by: Eric Dumazet 
Fixes: 3b47d30396ba ("net: gro: add a per device gro flush timer")
Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.")
Signed-off-by: Paolo Abeni 
--
Note: since one of the fixed commits is currently only on net-next,
the other one is really old, and the affected scenario without the
more recent commit is really a corner case, this targets net-next.
---
 net/core/dev.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 5927f6a7c301..b6eb4e0bfa91 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5975,11 +5975,14 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
if (work_done)
timeout = n->dev->gro_flush_timeout;
 
+   /* When the NAPI instance uses a timeout, we still need to
+* somehow bound the time packets are kept in the GRO layer
+* under heavy traffic
+*/
+   napi_gro_flush(n, !!timeout);
if (timeout)
		hrtimer_start(&n->timer, ns_to_ktime(timeout),
  HRTIMER_MODE_REL_PINNED);
-   else
-   napi_gro_flush(n, false);
}
	if (unlikely(!list_empty(&n->poll_list))) {
/* If n->poll_list is not empty, we need to mask irqs */
-- 
2.17.2



Re: [RFC PATCH] net: don't keep lonely packets forever in the gro hash

2018-11-20 Thread Paolo Abeni
Hi,

On Tue, 2018-11-20 at 05:49 -0800, Eric Dumazet wrote:
> 
> On 11/20/2018 02:17 AM, Paolo Abeni wrote:
> > Eric noted that with UDP GRO and napi timeout, we could keep a single
> > UDP packet inside the GRO hash forever, if the related NAPI instance
> > calls napi_gro_complete() at an higher frequency than the napi timeout.
> > Willem noted that even TCP packets could be trapped there, till the
> > next retransmission.
> > This patch tries to address the issue, flushing the oldest packets before
> > scheduling the NAPI timeout. The rationale is that such a timeout should be
> > well below a jiffy and we are not flushing packets eligible for sane GRO.
> > 
> > Reported-by: Eric Dumazet 
> > Signed-off-by: Paolo Abeni 
> > ---
> > Sending as RFC, as I fear I'm missing some relevant pieces.
> > Also I'm unsure if this should considered a fixes for "udp: implement
> > GRO for plain UDP sockets." or for "net: gro: add a per device gro flush 
> > timer"

Thank you for your feedback!

> Truth be told, relying on jiffies change is a bit fragile for HZ=100 or 
> HZ=250 kernels.

Yes, we have a higher bound there.

> See recent TCP commit that got rid of tcp_tso_should_defer() dependency on 
> HZ/jiffies
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=a682850a114aef947da5d603f7fd2cfe7eabbd72

I'm unsure I follow correctly. Are you suggesting to use ns precision
for skb aging in GRO? If so, could that be a separate change? (looks
more invasive)

Thanks,

Paolo



[RFC PATCH] net: don't keep lonely packets forever in the gro hash

2018-11-20 Thread Paolo Abeni
Eric noted that with UDP GRO and napi timeout, we could keep a single
UDP packet inside the GRO hash forever, if the related NAPI instance
calls napi_gro_complete() at a higher frequency than the napi timeout.
Willem noted that even TCP packets could be trapped there, till the
next retransmission.
This patch tries to address the issue, flushing the oldest packets before
scheduling the NAPI timeout. The rationale is that such a timeout should be
well below a jiffy and we are not flushing packets eligible for sane GRO.

Reported-by: Eric Dumazet 
Signed-off-by: Paolo Abeni 
---
Sending as RFC, as I fear I'm missing some relevant pieces.
Also I'm unsure if this should be considered a fix for "udp: implement
GRO for plain UDP sockets." or for "net: gro: add a per device gro flush timer"
---
 net/core/dev.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 5927f6a7c301..5cc4c4961869 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5975,11 +5975,14 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
if (work_done)
timeout = n->dev->gro_flush_timeout;
 
+   /* When the NAPI instance uses a timeout, we still need to
+* somehow bound the time packets are kept in the GRO layer
+* under heavy traffic
+*/
+   napi_gro_flush(n, !!timeout);
if (timeout)
		hrtimer_start(&n->timer, ns_to_ktime(timeout),
  HRTIMER_MODE_REL_PINNED);
-   else
-   napi_gro_flush(n, false);
}
	if (unlikely(!list_empty(&n->poll_list))) {
/* If n->poll_list is not empty, we need to mask irqs */
-- 
2.17.2



Re: selftests: net: udpgro.sh hangs on DUT devices running Linux -next

2018-11-17 Thread Paolo Abeni
Hi,

On Fri, 2018-11-16 at 14:55 +0530, Naresh Kamboju wrote:
> Kernel selftests: net: udpgro.sh hangs / waits forever on x86_64 and
> arm32 devices running Linux -next. Test getting PASS on arm64 devices.
> 
> Do you see this problem ?
> 
> Short error log:
> -
> ip6tables v1.6.1: can't initialize ip6tables table `nat': Table does
> not exist (do you need to insmod?)

Thank you for the report.

It looks like your kernel config has 

# CONFIG_NF_NAT_IPV6 is not set

Can you please confirm ?

net selftests do not explicitly ask for that, despite using such
functionality (my bad).

I'll be travelling through Monday. I'll have a better look
after that.

Cheers,

Paolo



[PATCH net-next] selftests: add explicit test for multiple concurrent GRO sockets

2018-11-14 Thread Paolo Abeni
This covers proper accounting of the encap-needed static keys.

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/udpgro.sh | 34 +++
 1 file changed, 34 insertions(+)

diff --git a/tools/testing/selftests/net/udpgro.sh b/tools/testing/selftests/net/udpgro.sh
index e94ef8067173..aeac53a99aeb 100755
--- a/tools/testing/selftests/net/udpgro.sh
+++ b/tools/testing/selftests/net/udpgro.sh
@@ -91,6 +91,28 @@ run_one_nat() {
wait $(jobs -p)
 }
 
+run_one_2sock() {
+   # use 'rx' as separator between sender args and receiver args
+   local -r all="$@"
+   local -r tx_args=${all%rx*}
+   local -r rx_args=${all#*rx}
+
+   cfg_veth
+
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${rx_args} -p 12345 &
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${rx_args} && \
+   echo "ok" || \
+   echo "failed" &
+
+   # Hack: let bg programs complete the startup
+   sleep 0.1
+   ./udpgso_bench_tx ${tx_args} -p 12345
+   sleep 0.1
+   # first UDP GSO socket should be closed at this point
+   ./udpgso_bench_tx ${tx_args}
+   wait $(jobs -p)
+}
+
 run_nat_test() {
local -r args=$@
 
@@ -98,6 +120,13 @@ run_nat_test() {
./in_netns.sh $0 __subprocess_nat $2 rx -r $3
 }
 
+run_2sock_test() {
+   local -r args=$@
+
+   printf " %-40s" "$1"
+   ./in_netns.sh $0 __subprocess_2sock $2 rx -G -r $3
+}
+
 run_all() {
local -r core_args="-l 4"
local -r ipv4_args="${core_args} -4 -D 192.168.1.1"
@@ -120,6 +149,7 @@ run_all() {
run_test "GRO with custom segment size cmsg" "${ipv4_args} -M 1 -s 
14720 -S 500 " "-4 -n 1 -l 14720 -S 500"
 
run_nat_test "bad GRO lookup" "${ipv4_args} -M 1 -s 14720 -S 0" "-n 10 
-l 1472"
+   run_2sock_test "multiple GRO socks" "${ipv4_args} -M 1 -s 14720 -S 0 " 
"-4 -n 1 -l 14720 -S 1472"
 
echo "ipv6"
run_test "no GRO" "${ipv6_args} -M 10 -s 1400" "-n 10 -l 1400"
@@ -130,6 +160,7 @@ run_all() {
run_test "GRO with custom segment size cmsg" "${ipv6_args} -M 1 -s 
14520 -S 500" "-n 1 -l 14520 -S 500"
 
run_nat_test "bad GRO lookup" "${ipv6_args} -M 1 -s 14520 -S 0" "-n 10 
-l 1452"
+   run_2sock_test "multiple GRO socks" "${ipv6_args} -M 1 -s 14520 -S 0 " 
"-n 1 -l 14520 -S 1452"
 }
 
 if [ ! -f ../bpf/xdp_dummy.o ]; then
@@ -145,4 +176,7 @@ elif [[ $1 == "__subprocess" ]]; then
 elif [[ $1 == "__subprocess_nat" ]]; then
shift
run_one_nat $@
+elif [[ $1 == "__subprocess_2sock" ]]; then
+   shift
+   run_one_2sock $@
 fi
-- 
2.17.2
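The `${all%rx*}` / `${all#*rx}` splitting used by `run_one_2sock()` above is plain POSIX parameter expansion; a stand-alone sketch (the argument values are made up for illustration):

```shell
# 'rx' is a literal separator token: everything before it goes to the
# sender, everything after it to the receiver.
all="-l 4 -4 -D 192.168.1.1 -M 1 -s 14720 rx -4 -n 1 -l 14720"
tx_args=${all%rx*}    # drop the shortest suffix starting at "rx"
rx_args=${all#*rx}    # drop the shortest prefix ending at "rx"
echo "tx_args:${tx_args}"
echo "rx_args:${rx_args}"
```

Note the trick relies on the literal string "rx" not appearing inside any real argument.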



[PATCH net-next] udp: fix jump label misuse

2018-11-14 Thread Paolo Abeni
The commit 60fb9567bf30 ("udp: implement complete book-keeping for
encap_needed") introduced a severe misuse of jump label APIs, which
syzbot, as reported by Eric, was able to exploit.

When multiple sockets/processes can concurrently request (and then
disable) the udp encap, we need to track the activation counter with
*_inc()/*_dec() jump label variants, or we can experience bad things
at disable time.

Fixes: 60fb9567bf30 ("udp: implement complete book-keeping for encap_needed")
Reported-by: Eric Dumazet 
Signed-off-by: Paolo Abeni 
---
 net/ipv4/udp.c | 4 ++--
 net/ipv6/udp.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 6f8890c5bc7e..aff2a8e99e01 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -587,7 +587,7 @@ static inline bool __udp_is_mcast_sock(struct net *net, 
struct sock *sk,
 DEFINE_STATIC_KEY_FALSE(udp_encap_needed_key);
 void udp_encap_enable(void)
 {
-   static_branch_enable(&udp_encap_needed_key);
+   static_branch_inc(&udp_encap_needed_key);
 }
 EXPORT_SYMBOL(udp_encap_enable);
 
@@ -2524,7 +2524,7 @@ void udp_destroy_sock(struct sock *sk)
encap_destroy(sk);
}
if (up->encap_enabled)
-   static_branch_disable(&udp_encap_needed_key);
+   static_branch_dec(&udp_encap_needed_key);
}
 }
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index dde51fc7ac16..09cba4cfe31f 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -448,7 +448,7 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len,
 DEFINE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
 void udpv6_encap_enable(void)
 {
-   static_branch_enable(&udpv6_encap_needed_key);
+   static_branch_inc(&udpv6_encap_needed_key);
 }
 EXPORT_SYMBOL(udpv6_encap_enable);
 
@@ -1579,7 +1579,7 @@ void udpv6_destroy_sock(struct sock *sk)
encap_destroy(sk);
}
if (up->encap_enabled)
-   static_branch_disable(&udpv6_encap_needed_key);
+   static_branch_dec(&udpv6_encap_needed_key);
}
 
inet6_destroy_sock(sk);
-- 
2.17.2
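The difference between the boolean and the counted static-key APIs can be modeled outside the kernel; a purely illustrative shell sketch (the `key_inc`/`key_dec`/`key_on` names are invented here, the real kernel APIs are `static_branch_inc()`/`static_branch_dec()`):

```shell
# Two sockets share one key. With the boolean enable/disable pair, the
# first socket to be destroyed would switch the key off under the
# second socket's feet; reference counting keeps it on.
count=0
key_inc() { count=$((count + 1)); }    # models static_branch_inc()
key_dec() { count=$((count - 1)); }    # models static_branch_dec()
key_on()  { [ "$count" -gt 0 ]; }      # key active while any user remains

key_inc      # socket A enables UDP encap
key_inc      # socket B enables UDP encap
key_dec      # socket A is destroyed
key_on && echo "key still enabled for socket B"
```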



[PATCH net-next] udp6: cleanup stats accounting in recvmsg()

2018-11-09 Thread Paolo Abeni
In the udp6 code path, we needed multiple tests to select the correct
mib to be updated. Since we touch at least one counter at each
invocation, it's convenient to use the recently introduced __UDPX_MIB()
helper once and remove some code duplication.

Signed-off-by: Paolo Abeni 
---
 net/ipv6/udp.c | 32 +++-
 1 file changed, 7 insertions(+), 25 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 0c0cb1611aef..dde51fc7ac16 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -326,6 +326,7 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len,
int err;
int is_udplite = IS_UDPLITE(sk);
bool checksum_valid = false;
+   struct udp_mib *mib;
int is_udp4;
 
if (flags & MSG_ERRQUEUE)
@@ -349,6 +350,7 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len,
msg->msg_flags |= MSG_TRUNC;
 
is_udp4 = (skb->protocol == htons(ETH_P_IP));
+   mib = __UDPX_MIB(sk, is_udp4);
 
/*
 * If checksum is needed at all, try to do it while copying the
@@ -377,24 +379,13 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len,
if (unlikely(err)) {
if (!peeked) {
atomic_inc(&sk->sk_drops);
-   if (is_udp4)
-   UDP_INC_STATS(sock_net(sk), UDP_MIB_INERRORS,
- is_udplite);
-   else
-   UDP6_INC_STATS(sock_net(sk), UDP_MIB_INERRORS,
-  is_udplite);
+   SNMP_INC_STATS(mib, UDP_MIB_INERRORS);
}
kfree_skb(skb);
return err;
}
-   if (!peeked) {
-   if (is_udp4)
-   UDP_INC_STATS(sock_net(sk), UDP_MIB_INDATAGRAMS,
- is_udplite);
-   else
-   UDP6_INC_STATS(sock_net(sk), UDP_MIB_INDATAGRAMS,
-  is_udplite);
-   }
+   if (!peeked)
+   SNMP_INC_STATS(mib, UDP_MIB_INDATAGRAMS);
 
sock_recv_ts_and_drops(msg, sk, skb);
 
@@ -443,17 +434,8 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len,
 csum_copy_err:
if (!__sk_queue_drop_skb(sk, &udp_sk(sk)->reader_queue, skb, flags,
 udp_skb_destructor)) {
-   if (is_udp4) {
-   UDP_INC_STATS(sock_net(sk),
- UDP_MIB_CSUMERRORS, is_udplite);
-   UDP_INC_STATS(sock_net(sk),
- UDP_MIB_INERRORS, is_udplite);
-   } else {
-   UDP6_INC_STATS(sock_net(sk),
-  UDP_MIB_CSUMERRORS, is_udplite);
-   UDP6_INC_STATS(sock_net(sk),
-  UDP_MIB_INERRORS, is_udplite);
-   }
+   SNMP_INC_STATS(mib, UDP_MIB_CSUMERRORS);
+   SNMP_INC_STATS(mib, UDP_MIB_INERRORS);
}
kfree_skb(skb);
 
-- 
2.17.2



Re: [PATCH][RFC] udp: cache sock to avoid searching it twice

2018-11-09 Thread Paolo Abeni
Hi,

Adding Willem, I think he can be interested.

On Fri, 2018-11-09 at 14:21 +0800, Li RongQing wrote:
> GRO for UDP needs to look up the socket twice, first in gro receive,
> then in gro complete; if we store the sock in the skb to avoid the
> second lookup, this can give a small performance boost
> 
> netperf -t UDP_RR -l 10
> 
> Before:
>   Rate per sec: 28746.01
> After:
>   Rate per sec: 29401.67
> 
> Signed-off-by: Li RongQing 
> ---
>  net/ipv4/udp_offload.c | 18 +-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index 0646d61f4fa8..429570112a33 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -408,6 +408,11 @@ struct sk_buff *udp_gro_receive(struct list_head *head, 
> struct sk_buff *skb,
>  
>   if (udp_sk(sk)->gro_enabled) {
>   pp = call_gro_receive(udp_gro_receive_segment, head, skb);
> +
> + if (!IS_ERR(pp) && NAPI_GRO_CB(pp)->count > 1) {
> + sock_hold(sk);
> + pp->sk = sk;
> + }
>   rcu_read_unlock();
>   return pp;
>   }

What if 'pp' is NULL?

Aside from that, this replaces a lookup with 2 atomic ops, and only
when such lookup is amortized on multiple aggregated packets: I'm
unsure if it's worthwhile, and I don't understand how that improves RR
tests (where the socket can't see multiple, consecutive skbs, AFAIK).

Cheers,

Paolo



Re: [PATCH net-next 05/10] ipv6: factor out protocol delivery helper

2018-11-08 Thread Paolo Abeni
Hi,

On Thu, 2018-11-08 at 12:01 +0300, Sergei Shtylyov wrote:
> On 11/7/2018 2:38 PM, Paolo Abeni wrote:
> 
> > So that we can re-use it at the UDP level in the next patch
> > 
> > rfc v3 -> v1:
> >   - add the helper declaration into the ipv6 header
> > 
> > Signed-off-by: Paolo Abeni 
> > ---
> >   include/net/ipv6.h   |  2 ++
> >   net/ipv6/ip6_input.c | 28 
> >   2 files changed, 18 insertions(+), 12 deletions(-)
> > 
> > diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> > index 829650540780..daf80863d3a5 100644
> > --- a/include/net/ipv6.h
> > +++ b/include/net/ipv6.h
> 
> [...]
> > @@ -319,28 +319,26 @@ void ipv6_list_rcv(struct list_head *head, struct 
> > packet_type *pt,
> >   /*
> >*Deliver the packet to the host
> >*/
> > -
> > -
> > -static int ip6_input_finish(struct net *net, struct sock *sk, struct 
> > sk_buff *skb)
> > +void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
> > nexthdr,
> > + bool have_final)
> >   {
> > const struct inet6_protocol *ipprot;
> > struct inet6_dev *idev;
> > unsigned int nhoff;
> > -   int nexthdr;
> > bool raw;
> > -   bool have_final = false;
> >   
> > /*
> >  *  Parse extension headers
> >  */
> >   
> > -   rcu_read_lock();
> >   resubmit:
> > idev = ip6_dst_idev(skb_dst(skb));
> > -   if (!pskb_pull(skb, skb_transport_offset(skb)))
> > -   goto discard;
> > nhoff = IP6CB(skb)->nhoff;
> > -   nexthdr = skb_network_header(skb)[nhoff];
> > +   if (!have_final) {
> 
> Haven't you removed this variable above?
> 
> > +   if (!pskb_pull(skb, skb_transport_offset(skb)))
> > +   goto discard;
> > +   nexthdr = skb_network_header(skb)[nhoff];
> 
> And this?

Thanks for reviewing. 

Both local variables are now function arguments (see the function
signature, far above in this chunk).

Cheers,

Paolo




[PATCH net-next 05/10] ipv6: factor out protocol delivery helper

2018-11-07 Thread Paolo Abeni
So that we can re-use it at the UDP level in the next patch

rfc v3 -> v1:
 - add the helper declaration into the ipv6 header

Signed-off-by: Paolo Abeni 
---
 include/net/ipv6.h   |  2 ++
 net/ipv6/ip6_input.c | 28 
 2 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 829650540780..daf80863d3a5 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -975,6 +975,8 @@ int ip6_output(struct net *net, struct sock *sk, struct 
sk_buff *skb);
 int ip6_forward(struct sk_buff *skb);
 int ip6_input(struct sk_buff *skb);
 int ip6_mc_input(struct sk_buff *skb);
+void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
nexthdr,
+ bool have_final);
 
 int __ip6_local_out(struct net *net, struct sock *sk, struct sk_buff *skb);
 int ip6_local_out(struct net *net, struct sock *sk, struct sk_buff *skb);
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 96577e742afd..3065226bdc57 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -319,28 +319,26 @@ void ipv6_list_rcv(struct list_head *head, struct 
packet_type *pt,
 /*
  * Deliver the packet to the host
  */
-
-
-static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff 
*skb)
+void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
nexthdr,
+ bool have_final)
 {
const struct inet6_protocol *ipprot;
struct inet6_dev *idev;
unsigned int nhoff;
-   int nexthdr;
bool raw;
-   bool have_final = false;
 
/*
 *  Parse extension headers
 */
 
-   rcu_read_lock();
 resubmit:
idev = ip6_dst_idev(skb_dst(skb));
-   if (!pskb_pull(skb, skb_transport_offset(skb)))
-   goto discard;
nhoff = IP6CB(skb)->nhoff;
-   nexthdr = skb_network_header(skb)[nhoff];
+   if (!have_final) {
+   if (!pskb_pull(skb, skb_transport_offset(skb)))
+   goto discard;
+   nexthdr = skb_network_header(skb)[nhoff];
+   }
 
 resubmit_final:
raw = raw6_local_deliver(skb, nexthdr);
@@ -411,13 +409,19 @@ static int ip6_input_finish(struct net *net, struct sock 
*sk, struct sk_buff *sk
consume_skb(skb);
}
}
-   rcu_read_unlock();
-   return 0;
+   return;
 
 discard:
__IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS);
-   rcu_read_unlock();
kfree_skb(skb);
+}
+
+static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff 
*skb)
+{
+   rcu_read_lock();
+   ip6_protocol_deliver_rcu(net, skb, 0, false);
+   rcu_read_unlock();
+
return 0;
 }
 
-- 
2.17.2



[PATCH net-next 01/10] udp: implement complete book-keeping for encap_needed

2018-11-07 Thread Paolo Abeni
The *encap_needed static keys are enabled by UDP tunnels
and several UDP encapsulation types, but they are never
turned off. This can cause unneeded overall performance
degradation for systems where such features are used
transiently.

This patch introduces complete book-keeping for such keys,
decreasing the usage at socket destruction time, if needed,
and avoiding that the same socket could increase the key
usage multiple times.

rfc v3 -> v1:
 - add socket lock around udp_tunnel_encap_enable()

rfc v2 -> rfc v3:
 - use udp_tunnel_encap_enable() in setsockopt()

Signed-off-by: Paolo Abeni 
---
 include/linux/udp.h  |  7 ++-
 include/net/udp_tunnel.h |  6 ++
 net/ipv4/udp.c   | 19 +--
 net/ipv6/udp.c   | 14 +-
 4 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 320d49d85484..a4dafff407fb 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -49,7 +49,12 @@ struct udp_sock {
unsigned int corkflag;  /* Cork is required */
__u8 encap_type;/* Is this an Encapsulation socket? */
	unsigned char	 no_check6_tx:1,/* Send zero UDP6 checksums on TX? */
-			 no_check6_rx:1;/* Allow zero UDP6 checksums on RX? */
+			 no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
+			 encap_enabled:1; /* This socket enabled encap
+					   * processing; UDP tunnels and
+					   * different encapsulation layer set
+					   * this
+					   */
/*
 * Following member retains the information to create a UDP header
 * when the socket is uncorked.
diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index fe680ab6b15a..3fbe56430e3b 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -165,6 +165,12 @@ static inline int udp_tunnel_handle_offloads(struct 
sk_buff *skb, bool udp_csum)
 
 static inline void udp_tunnel_encap_enable(struct socket *sock)
 {
+   struct udp_sock *up = udp_sk(sock->sk);
+
+   if (up->encap_enabled)
+   return;
+
+   up->encap_enabled = 1;
 #if IS_ENABLED(CONFIG_IPV6)
if (sock->sk->sk_family == PF_INET6)
ipv6_stub->udpv6_encap_enable();
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1976fddb9e00..0ed715a72249 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -115,6 +115,7 @@
 #include "udp_impl.h"
 #include 
 #include 
+#include 
 
 struct udp_table udp_table __read_mostly;
 EXPORT_SYMBOL(udp_table);
@@ -2398,11 +2399,15 @@ void udp_destroy_sock(struct sock *sk)
bool slow = lock_sock_fast(sk);
udp_flush_pending_frames(sk);
unlock_sock_fast(sk, slow);
-   if (static_branch_unlikely(&udp_encap_needed_key) && up->encap_type) {
-   void (*encap_destroy)(struct sock *sk);
-   encap_destroy = READ_ONCE(up->encap_destroy);
-   if (encap_destroy)
-   encap_destroy(sk);
+   if (static_branch_unlikely(&udp_encap_needed_key)) {
+   if (up->encap_type) {
+   void (*encap_destroy)(struct sock *sk);
+   encap_destroy = READ_ONCE(up->encap_destroy);
+   if (encap_destroy)
+   encap_destroy(sk);
+   }
+   if (up->encap_enabled)
+   static_branch_disable(&udp_encap_needed_key);
}
 }
 
@@ -2447,7 +2452,9 @@ int udp_lib_setsockopt(struct sock *sk, int level, int 
optname,
/* FALLTHROUGH */
case UDP_ENCAP_L2TPINUDP:
up->encap_type = val;
-   udp_encap_enable();
+   lock_sock(sk);
+   udp_tunnel_encap_enable(sk->sk_socket);
+   release_sock(sk);
break;
default:
err = -ENOPROTOOPT;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index d2d97d07ef27..fc0ce6c59ebb 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1458,11 +1458,15 @@ void udpv6_destroy_sock(struct sock *sk)
udp_v6_flush_pending_frames(sk);
release_sock(sk);
 
-   if (static_branch_unlikely(&udpv6_encap_needed_key) && up->encap_type) {
-   void (*encap_destroy)(struct sock *sk);
-   encap_destroy = READ_ONCE(up->encap_destroy);
-   if (encap_destroy)
-   encap_destroy(sk);
+   if (static_branch_unlikely(&udpv6_encap_needed_key)) {
+   if (up->encap_type) {
+   void (*encap_destroy)(struct sock *sk);
+   encap_destroy = READ_ONCE(up->encap_destroy);
+   

[PATCH net-next 10/10] selftests: add functionals test for UDP GRO

2018-11-07 Thread Paolo Abeni
Extend the existing udp programs to allow checking for proper
GRO aggregation/GSO size, and run the tests via a shell script, using
a veth pair with an XDP program attached to trigger the GRO code path.

rfc v3 -> v1:
 - use ip route to attach the xdp helper to the veth

rfc v2 -> rfc v3:
 - add missing test program options documentation
 - fix sporadic test failures (receiver faster than sender)

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/Makefile  |   2 +-
 tools/testing/selftests/net/udpgro.sh | 148 ++
 tools/testing/selftests/net/udpgro_bench.sh   |   8 +-
 tools/testing/selftests/net/udpgso_bench.sh   |   2 +-
 tools/testing/selftests/net/udpgso_bench_rx.c | 123 +--
 tools/testing/selftests/net/udpgso_bench_tx.c |  22 ++-
 6 files changed, 282 insertions(+), 23 deletions(-)
 create mode 100755 tools/testing/selftests/net/udpgro.sh

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index b1bba2f60467..eec359895feb 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -7,7 +7,7 @@ CFLAGS += -I../../../../usr/include/
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh 
rtnetlink.sh
 TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
-TEST_PROGS += udpgro_bench.sh
+TEST_PROGS += udpgro_bench.sh udpgro.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
diff --git a/tools/testing/selftests/net/udpgro.sh 
b/tools/testing/selftests/net/udpgro.sh
new file mode 100755
index ..e94ef8067173
--- /dev/null
+++ b/tools/testing/selftests/net/udpgro.sh
@@ -0,0 +1,148 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Run a series of udpgro functional tests.
+
+readonly PEER_NS="ns-peer-$(mktemp -u XXXXXX)"
+
+cleanup() {
+   local -r jobs="$(jobs -p)"
+   local -r ns="$(ip netns list|grep $PEER_NS)"
+
+   [ -n "${jobs}" ] && kill -1 ${jobs} 2>/dev/null
+   [ -n "$ns" ] && ip netns del $ns 2>/dev/null
+}
+trap cleanup EXIT
+
+cfg_veth() {
+   ip netns add "${PEER_NS}"
+   ip -netns "${PEER_NS}" link set lo up
+   ip link add type veth
+   ip link set dev veth0 up
+   ip addr add dev veth0 192.168.1.2/24
+   ip addr add dev veth0 2001:db8::2/64 nodad
+
+   ip link set dev veth1 netns "${PEER_NS}"
+   ip -netns "${PEER_NS}" addr add dev veth1 192.168.1.1/24
+   ip -netns "${PEER_NS}" addr add dev veth1 2001:db8::1/64 nodad
+   ip -netns "${PEER_NS}" link set dev veth1 up
+   ip -n "${PEER_NS}" link set veth1 xdp object ../bpf/xdp_dummy.o section 
xdp_dummy
+}
+
+run_one() {
+   # use 'rx' as separator between sender args and receiver args
+   local -r all="$@"
+   local -r tx_args=${all%rx*}
+   local -r rx_args=${all#*rx}
+
+   cfg_veth
+
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${rx_args} && \
+   echo "ok" || \
+   echo "failed" &
+
+   # Hack: let bg programs complete the startup
+   sleep 0.1
+   ./udpgso_bench_tx ${tx_args}
+   wait $(jobs -p)
+}
+
+run_test() {
+   local -r args=$@
+
+   printf " %-40s" "$1"
+   ./in_netns.sh $0 __subprocess $2 rx -G -r $3
+}
+
+run_one_nat() {
+   # use 'rx' as separator between sender args and receiver args
+   local addr1 addr2 pid family="" ipt_cmd=ip6tables
+   local -r all="$@"
+   local -r tx_args=${all%rx*}
+   local -r rx_args=${all#*rx}
+
+   if [[ ${tx_args} = *-4* ]]; then
+   ipt_cmd=iptables
+   family=-4
+   addr1=192.168.1.1
+   addr2=192.168.1.3/24
+   else
+   addr1=2001:db8::1
+   addr2="2001:db8::3/64 nodad"
+   fi
+
+   cfg_veth
+   ip -netns "${PEER_NS}" addr add dev veth1 ${addr2}
+
+   # fool the GRO engine changing the destination address ...
+   ip netns exec "${PEER_NS}" $ipt_cmd -t nat -I PREROUTING -d ${addr1} -j 
DNAT --to-destination ${addr2%/*}
+
+   # ... so that GRO will match the UDP_GRO enabled socket, but packets
+   # will land on the 'plain' one
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -G ${family} -b ${addr1} 
-n 0 &
+   pid=$!
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${family} -b ${addr2%/*} 
${rx_args} && \
+   echo "ok" || \
+   echo "failed"&
+
+   sleep 0.1
+   ./udpgso_bench_tx ${tx_args}
+   kill -INT $pid
+   wait $(jo

[PATCH net-next 04/10] ip: factor out protocol delivery helper

2018-11-07 Thread Paolo Abeni
So that we can re-use it at the UDP level in a later patch

rfc v3 -> v1
 - add the helper declaration into the ip header

Signed-off-by: Paolo Abeni 
---
 include/net/ip.h|  1 +
 net/ipv4/ip_input.c | 73 ++---
 2 files changed, 37 insertions(+), 37 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 72593e171d14..cece69b5c531 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -155,6 +155,7 @@ int ip_rcv(struct sk_buff *skb, struct net_device *dev, 
struct packet_type *pt,
 void ip_list_rcv(struct list_head *head, struct packet_type *pt,
 struct net_device *orig_dev);
 int ip_local_deliver(struct sk_buff *skb);
+void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int proto);
 int ip_mr_input(struct sk_buff *skb);
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb);
 int ip_mc_output(struct net *net, struct sock *sk, struct sk_buff *skb);
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 35a786c0aaa0..72250b4e466d 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -188,51 +188,50 @@ bool ip_call_ra_chain(struct sk_buff *skb)
return false;
 }
 
-static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct 
sk_buff *skb)
+void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
protocol)
 {
-   __skb_pull(skb, skb_network_header_len(skb));
-
-   rcu_read_lock();
-   {
-   int protocol = ip_hdr(skb)->protocol;
-   const struct net_protocol *ipprot;
-   int raw;
+   const struct net_protocol *ipprot;
+   int raw, ret;
 
-   resubmit:
-   raw = raw_local_deliver(skb, protocol);
+resubmit:
+   raw = raw_local_deliver(skb, protocol);
 
-   ipprot = rcu_dereference(inet_protos[protocol]);
-   if (ipprot) {
-   int ret;
-
-   if (!ipprot->no_policy) {
-   if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, 
skb)) {
-   kfree_skb(skb);
-   goto out;
-   }
-   nf_reset(skb);
+   ipprot = rcu_dereference(inet_protos[protocol]);
+   if (ipprot) {
+   if (!ipprot->no_policy) {
+   if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
+   kfree_skb(skb);
+   return;
}
-   ret = ipprot->handler(skb);
-   if (ret < 0) {
-   protocol = -ret;
-   goto resubmit;
+   nf_reset(skb);
+   }
+   ret = ipprot->handler(skb);
+   if (ret < 0) {
+   protocol = -ret;
+   goto resubmit;
+   }
+   __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+   } else {
+   if (!raw) {
+   if (xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
+   __IP_INC_STATS(net, 
IPSTATS_MIB_INUNKNOWNPROTOS);
+   icmp_send(skb, ICMP_DEST_UNREACH,
+ ICMP_PROT_UNREACH, 0);
}
-   __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+   kfree_skb(skb);
} else {
-   if (!raw) {
-   if (xfrm4_policy_check(NULL, XFRM_POLICY_IN, 
skb)) {
-   __IP_INC_STATS(net, 
IPSTATS_MIB_INUNKNOWNPROTOS);
-   icmp_send(skb, ICMP_DEST_UNREACH,
- ICMP_PROT_UNREACH, 0);
-   }
-   kfree_skb(skb);
-   } else {
-   __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
-   consume_skb(skb);
-   }
+   __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+   consume_skb(skb);
}
}
- out:
+}
+
+static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct 
sk_buff *skb)
+{
+   __skb_pull(skb, skb_network_header_len(skb));
+
+   rcu_read_lock();
+   ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
rcu_read_unlock();
 
return 0;
-- 
2.17.2



[PATCH net-next 07/10] selftests: add GRO support to udp bench rx program

2018-11-07 Thread Paolo Abeni
And fix a couple of buglets (port option processing,
clean termination on SIGINT). This is preparatory work
for GRO tests.

rfc v2 -> rfc v3:
 - use ETH_MAX_MTU

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/udpgso_bench_rx.c | 37 +++
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/net/udpgso_bench_rx.c 
b/tools/testing/selftests/net/udpgso_bench_rx.c
index 727cf67a3f75..8f48d7fb32cf 100644
--- a/tools/testing/selftests/net/udpgso_bench_rx.c
+++ b/tools/testing/selftests/net/udpgso_bench_rx.c
@@ -31,9 +31,15 @@
 #include 
 #include 
 
+#ifndef UDP_GRO
+#define UDP_GRO	104
+#endif
+
 static int  cfg_port   = 8000;
 static bool cfg_tcp;
 static bool cfg_verify;
+static bool cfg_read_all;
+static bool cfg_gro_segment;
 
 static bool interrupted;
 static unsigned long packets, bytes;
@@ -63,6 +69,8 @@ static void do_poll(int fd)
 
do {
ret = poll(&pfd, 1, 10);
+   if (interrupted)
+   break;
if (ret == -1)
error(1, errno, "poll");
if (ret == 0)
@@ -70,7 +78,7 @@ static void do_poll(int fd)
if (pfd.revents != POLLIN)
error(1, errno, "poll: 0x%x expected 0x%x\n",
pfd.revents, POLLIN);
-   } while (!ret && !interrupted);
+   } while (!ret);
 }
 
 static int do_socket(bool do_tcp)
@@ -102,6 +110,8 @@ static int do_socket(bool do_tcp)
error(1, errno, "listen");
 
do_poll(accept_fd);
+   if (interrupted)
+   exit(0);
 
fd = accept(accept_fd, NULL, NULL);
if (fd == -1)
@@ -167,10 +177,10 @@ static void do_verify_udp(const char *data, int len)
 /* Flush all outstanding datagrams. Verify first few bytes of each. */
 static void do_flush_udp(int fd)
 {
-   static char rbuf[ETH_DATA_LEN];
+   static char rbuf[ETH_MAX_MTU];
int ret, len, budget = 256;
 
-   len = cfg_verify ? sizeof(rbuf) : 0;
+   len = cfg_read_all ? sizeof(rbuf) : 0;
while (budget--) {
/* MSG_TRUNC will make return value full datagram length */
ret = recv(fd, rbuf, len, MSG_TRUNC | MSG_DONTWAIT);
@@ -178,7 +188,7 @@ static void do_flush_udp(int fd)
return;
if (ret == -1)
error(1, errno, "recv");
-   if (len) {
+   if (len && cfg_verify) {
if (ret == 0)
error(1, errno, "recv: 0 byte datagram\n");
 
@@ -192,23 +202,30 @@ static void do_flush_udp(int fd)
 
 static void usage(const char *filepath)
 {
-   error(1, 0, "Usage: %s [-tv] [-p port]", filepath);
+   error(1, 0, "Usage: %s [-Grtv] [-p port]", filepath);
 }
 
 static void parse_opts(int argc, char **argv)
 {
int c;
 
-   while ((c = getopt(argc, argv, "ptv")) != -1) {
+   while ((c = getopt(argc, argv, "Gp:rtv")) != -1) {
switch (c) {
+   case 'G':
+   cfg_gro_segment = true;
+   break;
case 'p':
-   cfg_port = htons(strtoul(optarg, NULL, 0));
+   cfg_port = strtoul(optarg, NULL, 0);
+   break;
+   case 'r':
+   cfg_read_all = true;
break;
case 't':
cfg_tcp = true;
break;
case 'v':
cfg_verify = true;
+   cfg_read_all = true;
break;
}
}
@@ -227,6 +244,12 @@ static void do_recv(void)
 
fd = do_socket(cfg_tcp);
 
+   if (cfg_gro_segment && !cfg_tcp) {
+   int val = 1;
+   if (setsockopt(fd, IPPROTO_UDP, UDP_GRO, &val, sizeof(val)))
+   error(1, errno, "setsockopt UDP_GRO");
+   }
+
treport = gettimeofday_ms() + 1000;
do {
do_poll(fd);
-- 
2.17.2



[PATCH net-next 09/10] selftests: add some benchmark for UDP GRO

2018-11-07 Thread Paolo Abeni
Run on top of a veth pair, using a dummy XDP program to enable the GRO.

 rfc v3 -> v1:
  - use ip route to attach the xdp helper to the veth

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/Makefile|  1 +
 tools/testing/selftests/net/udpgro_bench.sh | 93 +
 2 files changed, 94 insertions(+)
 create mode 100755 tools/testing/selftests/net/udpgro_bench.sh

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index 256d82d5fa87..b1bba2f60467 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -7,6 +7,7 @@ CFLAGS += -I../../../../usr/include/
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh 
rtnetlink.sh
 TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
+TEST_PROGS += udpgro_bench.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
diff --git a/tools/testing/selftests/net/udpgro_bench.sh 
b/tools/testing/selftests/net/udpgro_bench.sh
new file mode 100755
index ..9ef4b1ee0f76
--- /dev/null
+++ b/tools/testing/selftests/net/udpgro_bench.sh
@@ -0,0 +1,93 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Run a series of udpgro benchmarks
+
+readonly PEER_NS="ns-peer-$(mktemp -u XXXXXX)"
+
+cleanup() {
+   local -r jobs="$(jobs -p)"
+   local -r ns="$(ip netns list|grep $PEER_NS)"
+
+   [ -n "${jobs}" ] && kill -INT ${jobs} 2>/dev/null
+   [ -n "$ns" ] && ip netns del $ns 2>/dev/null
+}
+trap cleanup EXIT
+
+run_one() {
+   # use 'rx' as separator between sender args and receiver args
+   local -r all="$@"
+   local -r tx_args=${all%rx*}
+   local -r rx_args=${all#*rx}
+
+   ip netns add "${PEER_NS}"
+   ip -netns "${PEER_NS}" link set lo up
+   ip link add type veth
+   ip link set dev veth0 up
+   ip addr add dev veth0 192.168.1.2/24
+   ip addr add dev veth0 2001:db8::2/64 nodad
+
+   ip link set dev veth1 netns "${PEER_NS}"
+   ip -netns "${PEER_NS}" addr add dev veth1 192.168.1.1/24
+   ip -netns "${PEER_NS}" addr add dev veth1 2001:db8::1/64 nodad
+   ip -netns "${PEER_NS}" link set dev veth1 up
+
+   ip -n "${PEER_NS}" link set veth1 xdp object ../bpf/xdp_dummy.o section 
xdp_dummy
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${rx_args} -r &
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -t ${rx_args} -r &
+
+   # Hack: let bg programs complete the startup
+   sleep 0.1
+   ./udpgso_bench_tx ${tx_args}
+}
+
+run_in_netns() {
+   local -r args=$@
+
+   ./in_netns.sh $0 __subprocess ${args}
+}
+
+run_udp() {
+   local -r args=$@
+
+   echo "udp gso - over veth touching data"
+   run_in_netns ${args} -S rx
+
+   echo "udp gso and gro - over veth touching data"
+   run_in_netns ${args} -S rx -G
+}
+
+run_tcp() {
+   local -r args=$@
+
+   echo "tcp - over veth touching data"
+   run_in_netns ${args} -t rx
+}
+
+run_all() {
+   local -r core_args="-l 4"
+   local -r ipv4_args="${core_args} -4 -D 192.168.1.1"
+   local -r ipv6_args="${core_args} -6 -D 2001:db8::1"
+
+   echo "ipv4"
+   run_tcp "${ipv4_args}"
+   run_udp "${ipv4_args}"
+
+   echo "ipv6"
+   run_tcp "${ipv6_args}"
+   run_udp "${ipv6_args}"
+}
+
+if [ ! -f ../bpf/xdp_dummy.o ]; then
+   echo "Missing xdp_dummy helper. Build bpf selftest first"
+   exit -1
+fi
+
+if [[ $# -eq 0 ]]; then
+   run_all
+elif [[ $1 == "__subprocess" ]]; then
+   shift
+   run_one $@
+else
+   run_in_netns $@
+fi
-- 
2.17.2
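The `__subprocess` dispatch at the end of the script re-runs the script itself inside a fresh namespace via `in_netns.sh`; a minimal self-contained model of that pattern (the demo re-execs directly instead of entering a netns, and the file name is a temp file rather than a real selftest):

```shell
# The script with no args acts as the driver and re-invokes itself with
# a "__subprocess" marker; the marked invocation runs a single case.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/bash
run_one() { echo "running benchmark with args: $*"; }
if [[ $# -eq 0 ]]; then
	# real script: ./in_netns.sh $0 __subprocess ${args}
	bash "$0" __subprocess -l 4
elif [[ $1 == "__subprocess" ]]; then
	shift
	run_one "$@"
fi
EOF
out=$(bash "$demo")
echo "$out"    # prints: running benchmark with args: -l 4
rm -f "$demo"
```

The indirection matters because the namespace entry must happen in a new process; shell functions alone cannot carry the caller into a different netns.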



[PATCH net-next 08/10] selftests: add dummy xdp test helper

2018-11-07 Thread Paolo Abeni
This trivial XDP program does nothing, but will be used by the
next patch to test the GRO path in a net namespace, leveraging
the veth XDP implementation.

It's added here, despite its 'net' usage, to avoid the duplication
of the llc-related makefile boilerplate.

rfc v3 -> v1:
 - move the helper implementation into the bpf directory, don't
   touch udpgso_bench_rx

rfc v2 -> rfc v3:
 - move 'x' option handling here

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/bpf/Makefile|  3 ++-
 tools/testing/selftests/bpf/xdp_dummy.c | 13 +
 2 files changed, 15 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/xdp_dummy.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index e39dfb4e7970..4a54236475ae 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -37,7 +37,8 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o 
test_lirc_mode2_kern.o \
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o \
-   test_sk_lookup_kern.o test_xdp_vlan.o test_queue_map.o test_stack_map.o
+   test_sk_lookup_kern.o test_xdp_vlan.o test_queue_map.o test_stack_map.o 
\
+   xdp_dummy.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/xdp_dummy.c 
b/tools/testing/selftests/bpf/xdp_dummy.c
new file mode 100644
index ..43b0ef1001ed
--- /dev/null
+++ b/tools/testing/selftests/bpf/xdp_dummy.c
@@ -0,0 +1,13 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define KBUILD_MODNAME "xdp_dummy"
+#include 
+#include "bpf_helpers.h"
+
+SEC("xdp_dummy")
+int xdp_dummy_prog(struct xdp_md *ctx)
+{
+   return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.17.2



[PATCH net-next 06/10] udp: cope with UDP GRO packet misdirection

2018-11-07 Thread Paolo Abeni
In some scenarios, the GRO engine can assemble a UDP GRO packet
that ultimately lands on a non GRO-enabled socket.
This patch tries to address the issue by explicitly checking the UDP
socket features before enqueuing the packet, and eventually segmenting
the unexpected GRO packet, as needed.

We must also cope with re-insertion requests: after segmentation the
UDP code calls the helper introduced by the previous patches, as needed.

Segmentation is performed by a common helper, which takes care of
updating socket and protocol stats in case of failure.

rfc v3 -> v1
 - fix compile issues with rxrpc
 - when gso_segment returns NULL, treat is as an error
 - added 'ipv4' argument to udp_rcv_segment()

rfc v2 -> rfc v3
 - moved udp_rcv_segment() into net/udp.h, account errors to socket
   and ns, always return NULL or segs list

Signed-off-by: Paolo Abeni 
---
 include/linux/udp.h |  6 ++
 include/net/udp.h   | 45 +
 net/ipv4/udp.c  | 23 ++-
 net/ipv6/udp.c  | 24 +++-
 4 files changed, 88 insertions(+), 10 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index e23d5024f42f..0a9c54e76305 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -132,6 +132,12 @@ static inline void udp_cmsg_recv(struct msghdr *msg, 
struct sock *sk,
}
 }
 
+static inline bool udp_unexpected_gso(struct sock *sk, struct sk_buff *skb)
+{
+   return !udp_sk(sk)->gro_enabled && skb_is_gso(skb) &&
+  skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4;
+}
+
 #define udp_portaddr_for_each_entry(__sk, list) \
hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
 
diff --git a/include/net/udp.h b/include/net/udp.h
index 9e82cb391dea..90a2954962eb 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -406,17 +406,24 @@ static inline int copy_linear_skb(struct sk_buff *skb, 
int len, int off,
 } while(0)
 
 #if IS_ENABLED(CONFIG_IPV6)
-#define __UDPX_INC_STATS(sk, field)\
-do {   \
-   if ((sk)->sk_family == AF_INET) \
-   __UDP_INC_STATS(sock_net(sk), field, 0);\
-   else\
-   __UDP6_INC_STATS(sock_net(sk), field, 0);   \
-} while (0)
+#define __UDPX_MIB(sk, ipv4)   \
+({ \
+   ipv4 ? (IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_statistics : \
+sock_net(sk)->mib.udp_statistics) :\
+   (IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_stats_in6 : \
+sock_net(sk)->mib.udp_stats_in6);  \
+})
 #else
-#define __UDPX_INC_STATS(sk, field) __UDP_INC_STATS(sock_net(sk), field, 0)
+#define __UDPX_MIB(sk, ipv4)   \
+({ \
+   IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_statistics : \
+sock_net(sk)->mib.udp_statistics;  \
+})
 #endif
 
+#define __UDPX_INC_STATS(sk, field) \
+   __SNMP_INC_STATS(__UDPX_MIB(sk, (sk)->sk_family == AF_INET), field)
+
 #ifdef CONFIG_PROC_FS
 struct udp_seq_afinfo {
sa_family_t family;
@@ -450,4 +457,26 @@ DECLARE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
 void udpv6_encap_enable(void);
 #endif
 
+static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
+ struct sk_buff *skb, bool ipv4)
+{
+   struct sk_buff *segs;
+
+   /* the GSO CB lays after the UDP one, no need to save and restore any
+* CB fragment
+*/
+   segs = __skb_gso_segment(skb, NETIF_F_SG, false);
+   if (unlikely(IS_ERR_OR_NULL(segs))) {
+   int segs_nr = skb_shinfo(skb)->gso_segs;
+
+   atomic_add(segs_nr, &sk->sk_drops);
+   SNMP_ADD_STATS(__UDPX_MIB(sk, ipv4), UDP_MIB_INERRORS, segs_nr);
+   kfree_skb(skb);
+   return NULL;
+   }
+
+   consume_skb(skb);
+   return segs;
+}
+
 #endif /* _UDP_H */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 725ece9d78af..73926352960c 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1909,7 +1909,7 @@ EXPORT_SYMBOL(udp_encap_enable);
  * Note that in the success and error cases, the skb is assumed to
  * have either been requeued or freed.
  */
-static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+static int udp_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
 {
struct udp_sock *up = udp_sk(sk);
int is_udplite = IS_UDPLITE(sk);
@@ -2012,6 +2012,27 @@ static int udp_queue_rcv_skb(struct sock 

[PATCH net-next 00/10] udp: implement GRO support

2018-11-07 Thread Paolo Abeni
This series implements GRO support for UDP sockets, as the RX counterpart
of commit bec1f6f69736 ("udp: generate gso with UDP_SEGMENT").
The core functionality is implemented by the second patch, introducing a new
sockopt to enable UDP_GRO, while patch 3 implements support for passing the
segment size to the user space via a new cmsg.
UDP GRO performs a socket lookup for each ingress packet and aggregates
datagrams directed to UDP GRO enabled sockets with a constant l4 tuple.

UDP GRO packets can land on non GRO-enabled sockets, e.g. due to iptables NAT
rules, and that could potentially confuse existing applications.

The solution adopted here is to de-segment the GRO packet before enqueuing
as needed. Since we must cope with packet reinsertion after de-segmentation,
the relevant code is factored-out in ipv4 and ipv6 specific helpers and exposed
to UDP usage.

While the current code can probably be improved, this safeguard, implemented in
patches 4-7, allows future enhancements to enable UDP GSO offload on more
virtual devices, and eventually even on forwarded packets.

The last 4 patches implement some performance and functional self-tests,
re-using the existing udpgso infrastructure. The problematic scenario described
above is explicitly tested.

This revision of the series tries to address the feedback provided by Willem and
Subash on the previous iteration.

Paolo Abeni (10):
  udp: implement complete book-keeping for encap_needed
  udp: implement GRO for plain UDP sockets.
  udp: add support for UDP_GRO cmsg
  ip: factor out protocol delivery helper
  ipv6: factor out protocol delivery helper
  udp: cope with UDP GRO packet misdirection
  selftests: add GRO support to udp bench rx program
  selftests: add dummy xdp test helper
  selftests: add some benchmark for UDP GRO
  selftests: add functionals test for UDP GRO

 include/linux/udp.h   |  25 ++-
 include/net/ip.h  |   1 +
 include/net/ipv6.h|   2 +
 include/net/udp.h |  45 -
 include/net/udp_tunnel.h  |   6 +
 include/uapi/linux/udp.h  |   1 +
 net/ipv4/ip_input.c   |  73 
 net/ipv4/udp.c|  54 +-
 net/ipv4/udp_offload.c| 109 +---
 net/ipv6/ip6_input.c  |  28 ++--
 net/ipv6/udp.c|  41 -
 net/ipv6/udp_offload.c|   6 +-
 tools/testing/selftests/bpf/Makefile  |   3 +-
 tools/testing/selftests/bpf/xdp_dummy.c   |  13 ++
 tools/testing/selftests/net/Makefile  |   1 +
 tools/testing/selftests/net/udpgro.sh | 148 +
 tools/testing/selftests/net/udpgro_bench.sh   |  95 +++
 tools/testing/selftests/net/udpgso_bench.sh   |   2 +-
 tools/testing/selftests/net/udpgso_bench_rx.c | 156 --
 tools/testing/selftests/net/udpgso_bench_tx.c |  22 ++-
 20 files changed, 708 insertions(+), 123 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/xdp_dummy.c
 create mode 100755 tools/testing/selftests/net/udpgro.sh
 create mode 100755 tools/testing/selftests/net/udpgro_bench.sh

-- 
2.17.2



[PATCH net-next 03/10] udp: add support for UDP_GRO cmsg

2018-11-07 Thread Paolo Abeni
When UDP GRO is enabled, the UDP_GRO cmsg will carry the ingress
datagram size. User-space can use such info to compute the original
packet layout.

Signed-off-by: Paolo Abeni 
---
 include/linux/udp.h | 11 +++
 net/ipv4/udp.c  |  4 
 net/ipv6/udp.c  |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index f613b329852e..e23d5024f42f 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -121,6 +121,17 @@ static inline bool udp_get_no_check6_rx(struct sock *sk)
return udp_sk(sk)->no_check6_rx;
 }
 
+static inline void udp_cmsg_recv(struct msghdr *msg, struct sock *sk,
+struct sk_buff *skb)
+{
+   int gso_size;
+
+   if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
+   gso_size = skb_shinfo(skb)->gso_size;
+   put_cmsg(msg, SOL_UDP, UDP_GRO, sizeof(gso_size), &gso_size);
+   }
+}
+
 #define udp_portaddr_for_each_entry(__sk, list) \
hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 0d447fd194c5..725ece9d78af 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1714,6 +1714,10 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int noblock,
memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
*addr_len = sizeof(*sin);
}
+
+   if (udp_sk(sk)->gro_enabled)
+   udp_cmsg_recv(msg, sk, skb);
+
if (inet->cmsg_flags)
ip_cmsg_recv_offset(msg, sk, skb, sizeof(struct udphdr), off);
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index fc0ce6c59ebb..8e76e719305c 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -421,6 +421,9 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len,
*addr_len = sizeof(*sin6);
}
 
+   if (udp_sk(sk)->gro_enabled)
+   udp_cmsg_recv(msg, sk, skb);
+
if (np->rxopt.all)
ip6_datagram_recv_common_ctl(sk, msg, skb);
 
-- 
2.17.2



[PATCH net-next 02/10] udp: implement GRO for plain UDP sockets.

2018-11-07 Thread Paolo Abeni
This is the RX counterpart of commit bec1f6f69736 ("udp: generate gso
with UDP_SEGMENT"). When UDP_GRO is enabled, such socket is also
eligible for GRO in the rx path: UDP segments directed to such socket
are assembled into a larger GSO_UDP_L4 packet.

The core UDP GRO support is enabled with setsockopt(UDP_GRO).

Initial benchmark numbers:

Before:
udp rx:   1079 MB/s   769065 calls/s

After:
udp rx:   1466 MB/s    24877 calls/s

This change introduces a side effect with respect to UDP tunnels:
after a UDP tunnel creation, the kernel now performs a lookup per ingress
UDP packet, while before such a lookup happened only if the ingress packet
carried a valid internal header csum.

rfc v2 -> rfc v3:
 - fixed typos in macro name and comments
 - really enforce UDP_GRO_CNT_MAX, instead of UDP_GRO_CNT_MAX + 1
 - acquire socket lock in UDP_GRO setsockopt

rfc v1 -> rfc v2:
 - use a new option to enable UDP GRO
 - use static keys to protect the UDP GRO socket lookup

Signed-off-by: Paolo Abeni 
--
Note: I opted for acquiring the socket lock only for the newly introduced
setsockopt instead of for every value, despite the previous conversation on
this topic, to avoid introducing somewhat larger and unrelated changes.
---
 include/linux/udp.h  |   3 +-
 include/uapi/linux/udp.h |   1 +
 net/ipv4/udp.c   |   8 +++
 net/ipv4/udp_offload.c   | 109 +++
 net/ipv6/udp_offload.c   |   6 +--
 5 files changed, 99 insertions(+), 28 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index a4dafff407fb..f613b329852e 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -50,11 +50,12 @@ struct udp_sock {
__u8 encap_type;/* Is this an Encapsulation socket? */
unsigned charno_check6_tx:1,/* Send zero UDP6 checksums on TX? */
 no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
-encap_enabled:1; /* This socket enabled encap
+encap_enabled:1, /* This socket enabled encap
   * processing; UDP tunnels and
   * different encapsulation layer set
   * this
   */
+gro_enabled:1; /* Can accept GRO packets */
/*
 * Following member retains the information to create a UDP header
 * when the socket is uncorked.
diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
index 09502de447f5..30baccb6c9c4 100644
--- a/include/uapi/linux/udp.h
+++ b/include/uapi/linux/udp.h
@@ -33,6 +33,7 @@ struct udphdr {
 #define UDP_NO_CHECK6_TX 101   /* Disable sending checksum for UDP6X */
 #define UDP_NO_CHECK6_RX 102   /* Disable accpeting checksum for UDP6 */
 #define UDP_SEGMENT103 /* Set GSO segmentation size */
+#define UDP_GRO		104	/* This socket can receive UDP GRO packets */
 
 /* UDP encapsulation types */
 #define UDP_ENCAP_ESPINUDP_NON_IKE 1 /* draft-ietf-ipsec-nat-t-ike-00/01 */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 0ed715a72249..0d447fd194c5 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2476,6 +2476,14 @@ int udp_lib_setsockopt(struct sock *sk, int level, int 
optname,
up->gso_size = val;
break;
 
+   case UDP_GRO:
+   lock_sock(sk);
+   if (valbool)
+   udp_tunnel_encap_enable(sk->sk_socket);
+   up->gro_enabled = valbool;
+   release_sock(sk);
+   break;
+
/*
 *  UDP-Lite's partial checksum coverage (RFC 3828).
 */
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 802f2bc00d69..0646d61f4fa8 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -343,6 +343,54 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff 
*skb,
return segs;
 }
 
+#define UDP_GRO_CNT_MAX 64
+static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
+  struct sk_buff *skb)
+{
+   struct udphdr *uh = udp_hdr(skb);
+   struct sk_buff *pp = NULL;
+   struct udphdr *uh2;
+   struct sk_buff *p;
+
+   /* requires non zero csum, for symmetry with GSO */
+   if (!uh->check) {
+   NAPI_GRO_CB(skb)->flush = 1;
+   return NULL;
+   }
+
+   /* pull encapsulating udp header */
+   skb_gro_pull(skb, sizeof(struct udphdr));
+   skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr));
+
+   list_for_each_entry(p, head, list) {
+   if (!NAPI_GRO_CB(p)->same_flow)
+   continue;
+
+   uh2 = udp_hdr(p);
+
+   /* Match ports only, as csum is always non zero */
+   if ((*(u32 *)&uh->source != *(u32 *)&uh2->source)) {
+

[PATCH net-next] tun: compute the RFS hash only if needed.

2018-11-07 Thread Paolo Abeni
The tun XDP sendmsg code path unconditionally computes the symmetric
hash of each packet for RFS's sake, even when we could skip it, e.g.
when the device has a single queue.

This change adds the check already in-place for the skb sendmsg path
to avoid unneeded hashing.

The above gives a small, but measurable, performance gain for the VM xmit
path when zerocopy is not enabled.

Signed-off-by: Paolo Abeni 
---
 drivers/net/tun.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 060135ceaf0e..a65779c6d72f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2448,7 +2448,8 @@ static int tun_xdp_one(struct tun_struct *tun,
goto out;
}
 
-   if (!rcu_dereference(tun->steering_prog))
+   if (!rcu_dereference(tun->steering_prog) && tun->numqueues > 1 &&
+   !tfile->detached)
rxhash = __skb_get_hash_symmetric(skb);
 
netif_receive_skb(skb);
-- 
2.17.2



Re: [RFC PATCH v3 06/10] udp: cope with UDP GRO packet misdirection

2018-11-02 Thread Paolo Abeni
On Thu, 2018-11-01 at 17:01 -0400, Willem de Bruijn wrote:
> On Wed, Oct 31, 2018 at 5:57 AM Paolo Abeni  wrote:
> > @@ -450,4 +457,32 @@ DECLARE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
> > >  void udpv6_encap_enable(void);
> > >  #endif
> > > 
> > > +static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
> > > +   struct sk_buff *skb)
> > > +{
> > > + bool ipv4 = skb->protocol == htons(ETH_P_IP);
> > 
> > And this cause a compile warning when # CONFIG_IPV6 is not set, I will
> > fix in the next iteration (again thanks kbuildbot)
> 
> Can also just pass it as argument. 

Agreed. 

> This skb->protocol should work correctly
> with tunneled packets, but it wasn't as immediately obvious to me.
> 
> Also
> 
> +   if (unlikely(!segs))
> +   goto drop;
> 
> this should not happen. But if it could and the caller treats it the
> same as error (both now return NULL), then skb needs to be freed.

Right you are. Will do in the next iteration.

Thanks,

Paolo



Re: [RFC PATCH v3 01/10] udp: implement complete book-keeping for encap_needed

2018-11-02 Thread Paolo Abeni
Hi,

On Thu, 2018-11-01 at 16:59 -0400, Willem de Bruijn wrote:
> On Tue, Oct 30, 2018 at 1:28 PM Paolo Abeni  wrote:
> > 
> > The *encap_needed static keys are enabled by UDP tunnels
> > and several UDP encapsulations type, but they are never
> > turned off. This can cause unneeded overall performance
> > degradation for systems where such features are used
> > transiently.
> > 
> > This patch introduces complete book-keeping for such keys,
> > decreasing the usage at socket destruction time, if needed,
> > and avoiding that the same socket could increase the key
> > usage multiple times.
> > 
> > rfc v2 - rfc v3:
> >  - use udp_tunnel_encap_enable() in setsockopt()
> > 
> > Signed-off-by: Paolo Abeni 
> > @@ -2447,7 +2452,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, 
> > int optname,
> > /* FALLTHROUGH */
> > case UDP_ENCAP_L2TPINUDP:
> > up->encap_type = val;
> > -   udp_encap_enable();
> > +   udp_tunnel_encap_enable(sk->sk_socket);
> 
> this now also needs lock_sock?

Yep, you are right. I'll add it in the next iteration.

Thanks,

Paolo



Re: [RFC PATCH v3 04/10] ip: factor out protocol delivery helper

2018-11-02 Thread Paolo Abeni
On Thu, 2018-11-01 at 00:35 -0600, Subash Abhinov Kasiviswanathan
wrote:
> On 2018-10-30 11:24, Paolo Abeni wrote:
> > So that we can re-use it at the UDP lavel in a later patch
> > 
> 
> Hi Paolo
> 
> Minor queries -
> Should it be "level" instead of "lavel"? Similar comment for the ipv6
> patch as well.

Indeed. Will fix in the next iteration.

> > diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> > index 35a786c0aaa0..72250b4e466d 100644
> > --- a/net/ipv4/ip_input.c
> > +++ b/net/ipv4/ip_input.c
> > @@ -188,51 +188,50 @@ bool ip_call_ra_chain(struct sk_buff *skb)
> > return false;
> >  }
> > 
> > -static int ip_local_deliver_finish(struct net *net, struct sock *sk,
> > struct sk_buff *skb)
> > +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb,
> > int protocol)
> 
> Would it be better if this function was declared in include/net/ip.h &
> include/net/ipv6.h rather than in net/ipv4/udp.c & net/ipv6/udp.c as in
> the patch "udp: cope with UDP GRO packet misdirection"
> 
> diff --git a/include/net/ip.h b/include/net/ip.h
> index 72593e1..3d7fdb4 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -717,4 +717,6 @@ static inline void ip_cmsg_recv(struct msghdr *msg, 
> struct sk_buff *skb)
>   int rtm_getroute_parse_ip_proto(struct nlattr *attr, u8 *ip_proto,
>  struct netlink_ext_ack *extack);
> 
> +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
> proto);
> +
>   #endif /* _IP_H */
> diff --git a/include/net/ipv6.h b/include/net/ipv6.h
> index 8296505..4d4d2cfe 100644
> --- a/include/net/ipv6.h
> +++ b/include/net/ipv6.h
> @@ -1101,4 +1101,8 @@ int ipv6_sock_mc_join_ssm(struct sock *sk, int 
> ifindex,
>const struct in6_addr *addr, unsigned int 
> mode);
>   int ipv6_sock_mc_drop(struct sock *sk, int ifindex,
>const struct in6_addr *addr);
> +
> +void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
> nexthdr,
> +  bool have_final);
> +
>   #endif /* _NET_IPV6_H */
> 

Agreed, I will do in the next iteration. 

Thanks,

Paolo



Re: [RFC PATCH v3 06/10] udp: cope with UDP GRO packet misdirection

2018-10-31 Thread Paolo Abeni
On Tue, 2018-10-30 at 18:24 +0100, Paolo Abeni wrote:
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -406,17 +406,24 @@ static inline int copy_linear_skb(struct sk_buff *skb, 
> int len, int off,
>  } while(0)
>  
>  #if IS_ENABLED(CONFIG_IPV6)
> -#define __UDPX_INC_STATS(sk, field)  \
> -do { \
> - if ((sk)->sk_family == AF_INET) \
> - __UDP_INC_STATS(sock_net(sk), field, 0);\
> - else\
> - __UDP6_INC_STATS(sock_net(sk), field, 0);   \
> -} while (0)
> +#define __UDPX_MIB(sk, ipv4) \
> +({   \
> + ipv4 ? (IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_statistics : \
> +  sock_net(sk)->mib.udp_statistics) :\
> + (IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_stats_in6 : \
> +  sock_net(sk)->mib.udp_stats_in6);  \
> +})
>  #else
> -#define __UDPX_INC_STATS(sk, field) __UDP_INC_STATS(sock_net(sk), field, 0)
> +#define __UDPX_MIB(sk, ipv4) \
> +({   \
> + IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_statistics : \
> +  sock_net(sk)->mib.udp_statistics;  \
> +})
>  #endif
>  
> +#define __UDPX_INC_STATS(sk, field) \
> + __SNMP_INC_STATS(__UDPX_MIB(sk, (sk)->sk_family == AF_INET, field)
> +

This is broken (complains only if CONFIG_AF_RXRPC is set), will fix in
next iteration (thanks kbuildbot).

But I'd prefer to keep the above helper: it can be used in a follow-up
patch to cleanup a bit udp6_recvmsg().

>  #ifdef CONFIG_PROC_FS
>  struct udp_seq_afinfo {
>   sa_family_t family;
> @@ -450,4 +457,32 @@ DECLARE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
>  void udpv6_encap_enable(void);
>  #endif
>  
> +static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
> +   struct sk_buff *skb)
> +{
> + bool ipv4 = skb->protocol == htons(ETH_P_IP);

And this cause a compile warning when # CONFIG_IPV6 is not set, I will
fix in the next iteration (again thanks kbuildbot)

Cheers,

Paolo



[RFC PATCH v3 10/10] selftests: add functionals test for UDP GRO

2018-10-30 Thread Paolo Abeni
Extends the existing udp programs to allow checking for proper
GRO aggregation/GSO size, and runs the tests via a shell script, using
a veth pair with an XDP program attached to trigger the GRO code path.

rfc v2 -> rfc v3:
 - add missing test program options documentation
 - fix sporadic test failures (receiver faster than sender)

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/Makefile  |   2 +-
 tools/testing/selftests/net/udpgro.sh | 147 ++
 tools/testing/selftests/net/udpgro_bench.sh   |   8 +-
 tools/testing/selftests/net/udpgso_bench.sh   |   2 +-
 tools/testing/selftests/net/udpgso_bench_rx.c | 123 +--
 tools/testing/selftests/net/udpgso_bench_tx.c |  22 ++-
 6 files changed, 281 insertions(+), 23 deletions(-)
 create mode 100755 tools/testing/selftests/net/udpgro.sh

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index ac999354af54..a8a0d256aafb 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -7,7 +7,7 @@ CFLAGS += -I../../../../usr/include/
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh 
rtnetlink.sh
 TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
-TEST_PROGS += udpgro_bench.sh
+TEST_PROGS += udpgro_bench.sh udpgro.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
diff --git a/tools/testing/selftests/net/udpgro.sh 
b/tools/testing/selftests/net/udpgro.sh
new file mode 100755
index ..3f12b72a3568
--- /dev/null
+++ b/tools/testing/selftests/net/udpgro.sh
@@ -0,0 +1,147 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Run a series of udpgro functional tests.
+
+readonly PEER_NS="ns-peer-$(mktemp -u XX)"
+
+cleanup() {
+   local -r jobs="$(jobs -p)"
+   local -r ns="$(ip netns list|grep $PEER_NS)"
+
+   [ -n "${jobs}" ] && kill -1 ${jobs} 2>/dev/null
+   [ -n "$ns" ] && ip netns del $ns 2>/dev/null
+}
+trap cleanup EXIT
+
+cfg_veth() {
+   ip netns add "${PEER_NS}"
+   ip -netns "${PEER_NS}" link set lo up
+   ip link add type veth
+   ip link set dev veth0 up
+   ip addr add dev veth0 192.168.1.2/24
+   ip addr add dev veth0 2001:db8::2/64 nodad
+
+   ip link set dev veth1 netns "${PEER_NS}"
+   ip -netns "${PEER_NS}" addr add dev veth1 192.168.1.1/24
+   ip -netns "${PEER_NS}" addr add dev veth1 2001:db8::1/64 nodad
+   ip -netns "${PEER_NS}" link set dev veth1 up
+}
+
+run_one() {
+   # use 'rx' as separator between sender args and receiver args
+   local -r all="$@"
+   local -r tx_args=${all%rx*}
+   local -r rx_args=${all#*rx}
+
+   cfg_veth
+
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${rx_args} && \
+   echo "ok" || \
+   echo "failed" &
+
+   # Hack: let bg programs complete the startup
+   sleep 0.1
+   ./udpgso_bench_tx ${tx_args}
+   wait $(jobs -p)
+}
+
+run_test() {
+   local -r args=$@
+
+   printf " %-40s" "$1"
+   ./in_netns.sh $0 __subprocess $2 rx -G -r -x veth1 $3
+}
+
+run_one_nat() {
+   # use 'rx' as separator between sender args and receiver args
+   local addr1 addr2 pid family="" ipt_cmd=ip6tables
+   local -r all="$@"
+   local -r tx_args=${all%rx*}
+   local -r rx_args=${all#*rx}
+
+   if [[ ${tx_args} = *-4* ]]; then
+   ipt_cmd=iptables
+   family=-4
+   addr1=192.168.1.1
+   addr2=192.168.1.3/24
+   else
+   addr1=2001:db8::1
+   addr2="2001:db8::3/64 nodad"
+   fi
+
+   cfg_veth
+   ip -netns "${PEER_NS}" addr add dev veth1 ${addr2}
+
+   # fool the GRO engine changing the destination address ...
+   ip netns exec "${PEER_NS}" $ipt_cmd -t nat -I PREROUTING -d ${addr1} -j 
DNAT --to-destination ${addr2%/*}
+
+   # ... so that GRO will match the UDP_GRO enabled socket, but packets
+   # will land on the 'plain' one
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -G ${family} -x veth1 -b 
${addr1} -n 0 &
+   pid=$!
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${family} -b ${addr2%/*} 
${rx_args} && \
+   echo "ok" || \
+   echo "failed"&
+
+   sleep 0.1
+   ./udpgso_bench_tx ${tx_args}
+   kill -INT $pid
+   wait $(jobs -p)
+}
+
+run_nat_test() {
+   local -r args=$@
+
+   printf " %-40s" "$1"
+   ./in_netns.sh $0 __subprocess_nat $2 rx -r

[RFC PATCH v3 08/10] selftests: conditionally enable XDP support in udpgso_bench_rx

2018-10-30 Thread Paolo Abeni
XDP support will be used by a later patch to test the GRO path
in a net namespace, leveraging the veth XDP implementation.
To avoid breaking existing setups, XDP support is conditionally
enabled and built only if llc is locally available.

rfc v2 -> rfc v3:
 - move 'x' option handling here

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/Makefile  | 69 +++
 tools/testing/selftests/net/udpgso_bench_rx.c | 41 ++-
 tools/testing/selftests/net/xdp_dummy.c   | 13 
 3 files changed, 121 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/net/xdp_dummy.c

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index 256d82d5fa87..176459b7c4d6 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -16,8 +16,77 @@ TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu 
reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict tls
 
 KSFT_KHDR_INSTALL := 1
+
+# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
+#  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
CLANG=~/git/llvm/build/bin/clang
+LLC ?= llc
+CLANG ?= clang
+LLVM_OBJCOPY ?= llvm-objcopy
+BTF_PAHOLE ?= pahole
+HAS_LLC := $(shell which $(LLC) 2>/dev/null)
+
+# conditionally enable tests requiring llc
+ifneq (, $(HAS_LLC))
+TEST_GEN_FILES += xdp_dummy.o
+endif
+
 include ../lib.mk
 
+ifneq (, $(HAS_LLC))
+
+# Detect that we're cross compiling and use the cross compiler
+ifdef CROSS_COMPILE
+CLANG_ARCH_ARGS = -target $(ARCH)
+endif
+
+PROBE := $(shell $(LLC) -march=bpf -mcpu=probe -filetype=null /dev/null 2>&1)
+
+# Let newer LLVM versions transparently probe the kernel for availability
+# of full BPF instruction set.
+ifeq ($(PROBE),)
+  CPU ?= probe
+else
+  CPU ?= generic
+endif
+
+SRC_PATH := $(abspath ../../../..)
+LIB_PATH := $(SRC_PATH)/tools/lib
+XDP_CFLAGS := -D SUPPORT_XDP=1 -I$(LIB_PATH)
+LIBBPF = $(LIB_PATH)/bpf/libbpf.a
+BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
+BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
+BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 
'usage.*llvm')
+CLANG_SYS_INCLUDES := $(shell $(CLANG) -v -E - </dev/null 2>&1 \
+| sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }')
+CLANG_FLAGS = -I. -I$(SRC_PATH)/include -I../bpf/ \
+ $(CLANG_SYS_INCLUDES) -Wno-compare-distinct-pointer-types
+
+ifneq ($(and $(BTF_LLC_PROBE),$(BTF_PAHOLE_PROBE),$(BTF_OBJCOPY_PROBE)),)
+   CLANG_CFLAGS += -g
+   LLC_FLAGS += -mattr=dwarfris
+   DWARF2BTF = y
+endif
+
+$(LIBBPF): FORCE
+# Fix up variables inherited from Kbuild that tools/ build system won't like
+   $(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(SRC_PATH) O= 
$(nodir $@)
+
+$(OUTPUT)/udpgso_bench_rx: $(OUTPUT)/udpgso_bench_rx.c $(LIBBPF)
+   $(CC) -o $@ $(XDP_CFLAGS) $(CFLAGS) $(LOADLIBES) $(LDLIBS) $^ -lelf
+
+FORCE:
+
+# bpf program[s] generation
+$(OUTPUT)/%.o: %.c
+   $(CLANG) $(CLANG_FLAGS) \
+-O2 -target bpf -emit-llvm -c $< -o - |  \
+   $(LLC) -march=bpf -mcpu=$(CPU) $(LLC_FLAGS) -filetype=obj -o $@
+ifeq ($(DWARF2BTF),y)
+   $(BTF_PAHOLE) -J $@
+endif
+
+endif
+
 $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma
 $(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread
 $(OUTPUT)/tcp_inq: LDFLAGS += -lpthread
diff --git a/tools/testing/selftests/net/udpgso_bench_rx.c 
b/tools/testing/selftests/net/udpgso_bench_rx.c
index 8f48d7fb32cf..5dcb719abe04 100644
--- a/tools/testing/selftests/net/udpgso_bench_rx.c
+++ b/tools/testing/selftests/net/udpgso_bench_rx.c
@@ -31,6 +31,10 @@
 #include 
 #include 
 
+#ifdef SUPPORT_XDP
+#include "bpf/libbpf.h"
+#endif
+
 #ifndef UDP_GRO
 #define UDP_GRO	104
 #endif
@@ -40,6 +44,9 @@ static bool cfg_tcp;
 static bool cfg_verify;
 static bool cfg_read_all;
 static bool cfg_gro_segment;
+#ifdef SUPPORT_XDP
+static int cfg_xdp_iface;
+#endif
 
 static bool interrupted;
 static unsigned long packets, bytes;
@@ -202,14 +209,14 @@ static void do_flush_udp(int fd)
 
 static void usage(const char *filepath)
 {
-   error(1, 0, "Usage: %s [-Grtv] [-p port]", filepath);
+   error(1, 0, "Usage: %s [-Grtv] [-p port] [-x device]", filepath);
 }
 
 static void parse_opts(int argc, char **argv)
 {
int c;
 
-   while ((c = getopt(argc, argv, "Gp:rtv")) != -1) {
+   while ((c = getopt(argc, argv, "Gp:rtvx:")) != -1) {
switch (c) {
case 'G':
cfg_gro_segment = true;
@@ -227,6 +234,13 @@ static void parse_opts(int argc, char **argv)
cfg_verify = true;
cfg_read_all = true;
break;
+#ifdef SUPPORT_XDP
+   case 'x':
+

[RFC PATCH v3 07/10] selftests: add GRO support to udp bench rx program

2018-10-30 Thread Paolo Abeni
And fix a couple of buglets (port option processing,
clean termination on SIGINT). This is preparatory work
for GRO tests.

rfc v2 -> rfc v3:
 - use ETH_MAX_MTU

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/udpgso_bench_rx.c | 37 +++
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/net/udpgso_bench_rx.c 
b/tools/testing/selftests/net/udpgso_bench_rx.c
index 727cf67a3f75..8f48d7fb32cf 100644
--- a/tools/testing/selftests/net/udpgso_bench_rx.c
+++ b/tools/testing/selftests/net/udpgso_bench_rx.c
@@ -31,9 +31,15 @@
 #include 
 #include 
 
+#ifndef UDP_GRO
+#define UDP_GRO	104
+#endif
+
 static int  cfg_port   = 8000;
 static bool cfg_tcp;
 static bool cfg_verify;
+static bool cfg_read_all;
+static bool cfg_gro_segment;
 
 static bool interrupted;
 static unsigned long packets, bytes;
@@ -63,6 +69,8 @@ static void do_poll(int fd)
 
do {
ret = poll(&pfd, 1, 10);
+   if (interrupted)
+   break;
if (ret == -1)
error(1, errno, "poll");
if (ret == 0)
@@ -70,7 +78,7 @@ static void do_poll(int fd)
if (pfd.revents != POLLIN)
error(1, errno, "poll: 0x%x expected 0x%x\n",
pfd.revents, POLLIN);
-   } while (!ret && !interrupted);
+   } while (!ret);
 }
 
 static int do_socket(bool do_tcp)
@@ -102,6 +110,8 @@ static int do_socket(bool do_tcp)
error(1, errno, "listen");
 
do_poll(accept_fd);
+   if (interrupted)
+   exit(0);
 
fd = accept(accept_fd, NULL, NULL);
if (fd == -1)
@@ -167,10 +177,10 @@ static void do_verify_udp(const char *data, int len)
 /* Flush all outstanding datagrams. Verify first few bytes of each. */
 static void do_flush_udp(int fd)
 {
-   static char rbuf[ETH_DATA_LEN];
+   static char rbuf[ETH_MAX_MTU];
int ret, len, budget = 256;
 
-   len = cfg_verify ? sizeof(rbuf) : 0;
+   len = cfg_read_all ? sizeof(rbuf) : 0;
while (budget--) {
/* MSG_TRUNC will make return value full datagram length */
ret = recv(fd, rbuf, len, MSG_TRUNC | MSG_DONTWAIT);
@@ -178,7 +188,7 @@ static void do_flush_udp(int fd)
return;
if (ret == -1)
error(1, errno, "recv");
-   if (len) {
+   if (len && cfg_verify) {
if (ret == 0)
error(1, errno, "recv: 0 byte datagram\n");
 
@@ -192,23 +202,30 @@ static void do_flush_udp(int fd)
 
 static void usage(const char *filepath)
 {
-   error(1, 0, "Usage: %s [-tv] [-p port]", filepath);
+   error(1, 0, "Usage: %s [-Grtv] [-p port]", filepath);
 }
 
 static void parse_opts(int argc, char **argv)
 {
int c;
 
-   while ((c = getopt(argc, argv, "ptv")) != -1) {
+   while ((c = getopt(argc, argv, "Gp:rtv")) != -1) {
switch (c) {
+   case 'G':
+   cfg_gro_segment = true;
+   break;
case 'p':
-   cfg_port = htons(strtoul(optarg, NULL, 0));
+   cfg_port = strtoul(optarg, NULL, 0);
+   break;
+   case 'r':
+   cfg_read_all = true;
break;
case 't':
cfg_tcp = true;
break;
case 'v':
cfg_verify = true;
+   cfg_read_all = true;
break;
}
}
@@ -227,6 +244,12 @@ static void do_recv(void)
 
fd = do_socket(cfg_tcp);
 
+   if (cfg_gro_segment && !cfg_tcp) {
+   int val = 1;
+   if (setsockopt(fd, IPPROTO_UDP, UDP_GRO, &val, sizeof(val)))
+   error(1, errno, "setsockopt UDP_GRO");
+   }
+
treport = gettimeofday_ms() + 1000;
do {
do_poll(fd);
-- 
2.17.2



[RFC PATCH v3 02/10] udp: implement GRO for plain UDP sockets.

2018-10-30 Thread Paolo Abeni
This is the RX counterpart of commit bec1f6f69736 ("udp: generate gso
with UDP_SEGMENT"). When UDP_GRO is enabled, such socket is also
eligible for GRO in the rx path: UDP segments directed to such socket
are assembled into a larger GSO_UDP_L4 packet.

The core UDP GRO support is enabled with setsockopt(UDP_GRO).

Initial benchmark numbers:

Before:
udp rx:   1079 MB/s   769065 calls/s

After:
udp rx:   1466 MB/s24877 calls/s

This change introduces a side effect with respect to UDP tunnels:
after a UDP tunnel creation, now the kernel performs a lookup per ingress
UDP packet, while before such lookup happened only if the ingress packet
carried a valid internal header csum.

rfc v2 -> rfc v3:
 - fixed typos in macro name and comments
 - really enforce UDP_GRO_CNT_MAX, instead of UDP_GRO_CNT_MAX + 1
 - acquire socket lock in UDP_GRO setsockopt

rfc v1 -> rfc v2:
 - use a new option to enable UDP GRO
 - use static keys to protect the UDP GRO socket lookup

Signed-off-by: Paolo Abeni 
--
Note: I opted for acquiring the socket lock only for the newly introduced
setsockopt instead for every value, despite the previous conversation on
this topic, to avoid introducing somewhat larger and unrelated changes.
---
 include/linux/udp.h  |   3 +-
 include/uapi/linux/udp.h |   1 +
 net/ipv4/udp.c   |   8 +++
 net/ipv4/udp_offload.c   | 109 +++
 net/ipv6/udp_offload.c   |   6 +--
 5 files changed, 99 insertions(+), 28 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index a4dafff407fb..f613b329852e 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -50,11 +50,12 @@ struct udp_sock {
__u8 encap_type;/* Is this an Encapsulation socket? */
unsigned charno_check6_tx:1,/* Send zero UDP6 checksums on TX? */
 no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
-encap_enabled:1; /* This socket enabled encap
+encap_enabled:1, /* This socket enabled encap
   * processing; UDP tunnels and
   * different encapsulation layer set
   * this
   */
+gro_enabled:1; /* Can accept GRO packets */
/*
 * Following member retains the information to create a UDP header
 * when the socket is uncorked.
diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
index 09502de447f5..30baccb6c9c4 100644
--- a/include/uapi/linux/udp.h
+++ b/include/uapi/linux/udp.h
@@ -33,6 +33,7 @@ struct udphdr {
 #define UDP_NO_CHECK6_TX 101   /* Disable sending checksum for UDP6X */
 #define UDP_NO_CHECK6_RX 102   /* Disable accpeting checksum for UDP6 */
 #define UDP_SEGMENT103 /* Set GSO segmentation size */
+#define UDP_GRO 104 /* This socket can receive UDP GRO packets */
 
 /* UDP encapsulation types */
 #define UDP_ENCAP_ESPINUDP_NON_IKE 1 /* draft-ietf-ipsec-nat-t-ike-00/01 */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c51721fb293a..4d4f4d044c28 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2474,6 +2474,14 @@ int udp_lib_setsockopt(struct sock *sk, int level, int 
optname,
up->gso_size = val;
break;
 
+   case UDP_GRO:
+   lock_sock(sk);
+   if (valbool)
+   udp_tunnel_encap_enable(sk->sk_socket);
+   up->gro_enabled = valbool;
+   release_sock(sk);
+   break;
+
/*
 *  UDP-Lite's partial checksum coverage (RFC 3828).
 */
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 802f2bc00d69..0646d61f4fa8 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -343,6 +343,54 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff 
*skb,
return segs;
 }
 
+#define UDP_GRO_CNT_MAX 64
+static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
+  struct sk_buff *skb)
+{
+   struct udphdr *uh = udp_hdr(skb);
+   struct sk_buff *pp = NULL;
+   struct udphdr *uh2;
+   struct sk_buff *p;
+
+   /* requires non zero csum, for symmetry with GSO */
+   if (!uh->check) {
+   NAPI_GRO_CB(skb)->flush = 1;
+   return NULL;
+   }
+
+   /* pull encapsulating udp header */
+   skb_gro_pull(skb, sizeof(struct udphdr));
+   skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr));
+
+   list_for_each_entry(p, head, list) {
+   if (!NAPI_GRO_CB(p)->same_flow)
+   continue;
+
+   uh2 = udp_hdr(p);
+
+   /* Match ports only, as csum is always non zero */
+   if ((*(u32 *)&uh2->source != *(u32 *)&uh->source)) {
+

[RFC PATCH v3 09/10] selftests: add some benchmark for UDP GRO

2018-10-30 Thread Paolo Abeni
Run on top of veth pair, using a dummy XDP program to enable the GRO.

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/Makefile|  1 +
 tools/testing/selftests/net/udpgro_bench.sh | 92 +
 2 files changed, 93 insertions(+)
 create mode 100755 tools/testing/selftests/net/udpgro_bench.sh

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index 176459b7c4d6..ac999354af54 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -7,6 +7,7 @@ CFLAGS += -I../../../../usr/include/
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh 
rtnetlink.sh
 TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
+TEST_PROGS += udpgro_bench.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
diff --git a/tools/testing/selftests/net/udpgro_bench.sh 
b/tools/testing/selftests/net/udpgro_bench.sh
new file mode 100755
index ..03d37e5e7424
--- /dev/null
+++ b/tools/testing/selftests/net/udpgro_bench.sh
@@ -0,0 +1,92 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Run a series of udpgro benchmarks
+
+readonly PEER_NS="ns-peer-$(mktemp -u XX)"
+
+cleanup() {
+   local -r jobs="$(jobs -p)"
+   local -r ns="$(ip netns list|grep $PEER_NS)"
+
+   [ -n "${jobs}" ] && kill -INT ${jobs} 2>/dev/null
+   [ -n "$ns" ] && ip netns del $ns 2>/dev/null
+}
+trap cleanup EXIT
+
+run_one() {
+   # use 'rx' as separator between sender args and receiver args
+   local -r all="$@"
+   local -r tx_args=${all%rx*}
+   local -r rx_args=${all#*rx}
+
+   ip netns add "${PEER_NS}"
+   ip -netns "${PEER_NS}" link set lo up
+   ip link add type veth
+   ip link set dev veth0 up
+   ip addr add dev veth0 192.168.1.2/24
+   ip addr add dev veth0 2001:db8::2/64 nodad
+
+   ip link set dev veth1 netns "${PEER_NS}"
+   ip -netns "${PEER_NS}" addr add dev veth1 192.168.1.1/24
+   ip -netns "${PEER_NS}" addr add dev veth1 2001:db8::1/64 nodad
+   ip -netns "${PEER_NS}" link set dev veth1 up
+
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${rx_args} -r -x veth1 &
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -t ${rx_args} -r &
+
+   # Hack: let bg programs complete the startup
+   sleep 0.1
+   ./udpgso_bench_tx ${tx_args}
+}
+
+run_in_netns() {
+   local -r args=$@
+
+   ./in_netns.sh $0 __subprocess ${args}
+}
+
+run_udp() {
+   local -r args=$@
+
+   echo "udp gso - over veth touching data"
+   run_in_netns ${args} -S rx
+
+   echo "udp gso and gro - over veth touching data"
+   run_in_netns ${args} -S rx -G
+}
+
+run_tcp() {
+   local -r args=$@
+
+   echo "tcp - over veth touching data"
+   run_in_netns ${args} -t rx
+}
+
+run_all() {
+   local -r core_args="-l 4"
+   local -r ipv4_args="${core_args} -4 -D 192.168.1.1"
+   local -r ipv6_args="${core_args} -6 -D 2001:db8::1"
+
+   echo "ipv4"
+   run_tcp "${ipv4_args}"
+   run_udp "${ipv4_args}"
+
+   echo "ipv6"
+   run_tcp "${ipv6_args}"
+   run_udp "${ipv6_args}"
+}
+
+if [ ! -f xdp_dummy.o ]; then
+   echo "Skipping GRO benchmarks - missing LLC"
+   exit 0
+fi
+
+if [[ $# -eq 0 ]]; then
+   run_all
+elif [[ $1 == "__subprocess" ]]; then
+   shift
+   run_one $@
+else
+   run_in_netns $@
+fi
-- 
2.17.2



[RFC PATCH v3 01/10] udp: implement complete book-keeping for encap_needed

2018-10-30 Thread Paolo Abeni
The *encap_needed static keys are enabled by UDP tunnels
and several UDP encapsulations type, but they are never
turned off. This can cause unneeded overall performance
degradation for systems where such features are used
transiently.

This patch introduces complete book-keeping for such keys,
decreasing the usage at socket destruction time, if needed,
and avoiding that the same socket could increase the key
usage multiple times.

rfc v2 - rfc v3:
 - use udp_tunnel_encap_enable() in setsockopt()

Signed-off-by: Paolo Abeni 
---
 include/linux/udp.h  |  7 ++-
 include/net/udp_tunnel.h |  6 ++
 net/ipv4/udp.c   | 17 +++--
 net/ipv6/udp.c   | 14 +-
 4 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 320d49d85484..a4dafff407fb 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -49,7 +49,12 @@ struct udp_sock {
unsigned int corkflag;  /* Cork is required */
__u8 encap_type;/* Is this an Encapsulation socket? */
unsigned charno_check6_tx:1,/* Send zero UDP6 checksums on TX? */
-no_check6_rx:1;/* Allow zero UDP6 checksums on RX? */
+no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
+encap_enabled:1; /* This socket enabled encap
+  * processing; UDP tunnels and
+  * different encapsulation layer set
+  * this
+  */
/*
 * Following member retains the information to create a UDP header
 * when the socket is uncorked.
diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index fe680ab6b15a..3fbe56430e3b 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -165,6 +165,12 @@ static inline int udp_tunnel_handle_offloads(struct 
sk_buff *skb, bool udp_csum)
 
 static inline void udp_tunnel_encap_enable(struct socket *sock)
 {
+   struct udp_sock *up = udp_sk(sock->sk);
+
+   if (up->encap_enabled)
+   return;
+
+   up->encap_enabled = 1;
 #if IS_ENABLED(CONFIG_IPV6)
if (sock->sk->sk_family == PF_INET6)
ipv6_stub->udpv6_encap_enable();
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ca3ed931f2a9..c51721fb293a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -115,6 +115,7 @@
 #include "udp_impl.h"
 #include 
 #include 
+#include 
 
 struct udp_table udp_table __read_mostly;
 EXPORT_SYMBOL(udp_table);
@@ -2398,11 +2399,15 @@ void udp_destroy_sock(struct sock *sk)
bool slow = lock_sock_fast(sk);
udp_flush_pending_frames(sk);
unlock_sock_fast(sk, slow);
-   if (static_branch_unlikely(&udp_encap_needed_key) && up->encap_type) {
-   void (*encap_destroy)(struct sock *sk);
-   encap_destroy = READ_ONCE(up->encap_destroy);
-   if (encap_destroy)
-   encap_destroy(sk);
+   if (static_branch_unlikely(&udp_encap_needed_key)) {
+   if (up->encap_type) {
+   void (*encap_destroy)(struct sock *sk);
+   encap_destroy = READ_ONCE(up->encap_destroy);
+   if (encap_destroy)
+   encap_destroy(sk);
+   }
+   if (up->encap_enabled)
+   static_branch_disable(&udp_encap_needed_key);
}
 }
 
@@ -2447,7 +2452,7 @@ int udp_lib_setsockopt(struct sock *sk, int level, int 
optname,
/* FALLTHROUGH */
case UDP_ENCAP_L2TPINUDP:
up->encap_type = val;
-   udp_encap_enable();
+   udp_tunnel_encap_enable(sk->sk_socket);
break;
default:
err = -ENOPROTOOPT;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index d2d97d07ef27..fc0ce6c59ebb 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1458,11 +1458,15 @@ void udpv6_destroy_sock(struct sock *sk)
udp_v6_flush_pending_frames(sk);
release_sock(sk);
 
-   if (static_branch_unlikely(&udpv6_encap_needed_key) && up->encap_type) {
-   void (*encap_destroy)(struct sock *sk);
-   encap_destroy = READ_ONCE(up->encap_destroy);
-   if (encap_destroy)
-   encap_destroy(sk);
+   if (static_branch_unlikely(&udpv6_encap_needed_key)) {
+   if (up->encap_type) {
+   void (*encap_destroy)(struct sock *sk);
+   encap_destroy = READ_ONCE(up->encap_destroy);
+   if (encap_destroy)
+   encap_destroy(sk);
+   }
+   if (up->encap_enabled)
+   sta

[RFC PATCH v3 04/10] ip: factor out protocol delivery helper

2018-10-30 Thread Paolo Abeni
So that we can re-use it at the UDP level in a later patch

Signed-off-by: Paolo Abeni 
---
 net/ipv4/ip_input.c | 73 ++---
 1 file changed, 36 insertions(+), 37 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 35a786c0aaa0..72250b4e466d 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -188,51 +188,50 @@ bool ip_call_ra_chain(struct sk_buff *skb)
return false;
 }
 
-static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct 
sk_buff *skb)
+void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
protocol)
 {
-   __skb_pull(skb, skb_network_header_len(skb));
-
-   rcu_read_lock();
-   {
-   int protocol = ip_hdr(skb)->protocol;
-   const struct net_protocol *ipprot;
-   int raw;
+   const struct net_protocol *ipprot;
+   int raw, ret;
 
-   resubmit:
-   raw = raw_local_deliver(skb, protocol);
+resubmit:
+   raw = raw_local_deliver(skb, protocol);
 
-   ipprot = rcu_dereference(inet_protos[protocol]);
-   if (ipprot) {
-   int ret;
-
-   if (!ipprot->no_policy) {
-   if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, 
skb)) {
-   kfree_skb(skb);
-   goto out;
-   }
-   nf_reset(skb);
+   ipprot = rcu_dereference(inet_protos[protocol]);
+   if (ipprot) {
+   if (!ipprot->no_policy) {
+   if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
+   kfree_skb(skb);
+   return;
}
-   ret = ipprot->handler(skb);
-   if (ret < 0) {
-   protocol = -ret;
-   goto resubmit;
+   nf_reset(skb);
+   }
+   ret = ipprot->handler(skb);
+   if (ret < 0) {
+   protocol = -ret;
+   goto resubmit;
+   }
+   __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+   } else {
+   if (!raw) {
+   if (xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
+   __IP_INC_STATS(net, 
IPSTATS_MIB_INUNKNOWNPROTOS);
+   icmp_send(skb, ICMP_DEST_UNREACH,
+ ICMP_PROT_UNREACH, 0);
}
-   __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+   kfree_skb(skb);
} else {
-   if (!raw) {
-   if (xfrm4_policy_check(NULL, XFRM_POLICY_IN, 
skb)) {
-   __IP_INC_STATS(net, 
IPSTATS_MIB_INUNKNOWNPROTOS);
-   icmp_send(skb, ICMP_DEST_UNREACH,
- ICMP_PROT_UNREACH, 0);
-   }
-   kfree_skb(skb);
-   } else {
-   __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
-   consume_skb(skb);
-   }
+   __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+   consume_skb(skb);
}
}
- out:
+}
+
+static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct 
sk_buff *skb)
+{
+   __skb_pull(skb, skb_network_header_len(skb));
+
+   rcu_read_lock();
+   ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
rcu_read_unlock();
 
return 0;
-- 
2.17.2



[RFC PATCH v3 06/10] udp: cope with UDP GRO packet misdirection

2018-10-30 Thread Paolo Abeni
In some scenarios, the GRO engine can assemble a UDP GRO packet
that ultimately lands on a non GRO-enabled socket.
This patch tries to address the issue by explicitly checking the UDP
socket features before enqueuing the packet, and eventually segmenting
the unexpected GRO packet, as needed.

We must also cope with re-insertion requests: after segmentation the
UDP code calls the helper introduced by the previous patches, as needed.

Segmentation is performed by a common helper, which takes care of
updating socket and protocol stats in case of failure.

rfc v2 -> rfc v3
 - moved udp_rcv_segment() into net/udp.h, account errors to socket
   and ns, always return NULL or segs list

Signed-off-by: Paolo Abeni 
---
 include/linux/udp.h |  6 ++
 include/net/udp.h   | 51 ++---
 net/ipv4/udp.c  | 25 +-
 net/ipv6/udp.c  | 27 +++-
 4 files changed, 99 insertions(+), 10 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index e23d5024f42f..0a9c54e76305 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -132,6 +132,12 @@ static inline void udp_cmsg_recv(struct msghdr *msg, 
struct sock *sk,
}
 }
 
+static inline bool udp_unexpected_gso(struct sock *sk, struct sk_buff *skb)
+{
+   return !udp_sk(sk)->gro_enabled && skb_is_gso(skb) &&
+  skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4;
+}
+
 #define udp_portaddr_for_each_entry(__sk, list) \
hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
 
diff --git a/include/net/udp.h b/include/net/udp.h
index 9e82cb391dea..f94aed316a04 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -406,17 +406,24 @@ static inline int copy_linear_skb(struct sk_buff *skb, 
int len, int off,
 } while(0)
 
 #if IS_ENABLED(CONFIG_IPV6)
-#define __UDPX_INC_STATS(sk, field)\
-do {   \
-   if ((sk)->sk_family == AF_INET) \
-   __UDP_INC_STATS(sock_net(sk), field, 0);\
-   else\
-   __UDP6_INC_STATS(sock_net(sk), field, 0);   \
-} while (0)
+#define __UDPX_MIB(sk, ipv4)   \
+({ \
+   ipv4 ? (IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_statistics : \
+sock_net(sk)->mib.udp_statistics) :\
+   (IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_stats_in6 : \
+sock_net(sk)->mib.udp_stats_in6);  \
+})
 #else
-#define __UDPX_INC_STATS(sk, field) __UDP_INC_STATS(sock_net(sk), field, 0)
+#define __UDPX_MIB(sk, ipv4)   \
+({ \
+   IS_UDPLITE(sk) ? sock_net(sk)->mib.udplite_statistics : \
+sock_net(sk)->mib.udp_statistics;  \
+})
 #endif
 
+#define __UDPX_INC_STATS(sk, field) \
+   __SNMP_INC_STATS(__UDPX_MIB(sk, (sk)->sk_family == AF_INET), field)
+
 #ifdef CONFIG_PROC_FS
 struct udp_seq_afinfo {
sa_family_t family;
@@ -450,4 +457,32 @@ DECLARE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
 void udpv6_encap_enable(void);
 #endif
 
+static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
+ struct sk_buff *skb)
+{
+   bool ipv4 = skb->protocol == htons(ETH_P_IP);
+   int segs_nr = skb_shinfo(skb)->gso_segs;
+   struct sk_buff *segs;
+
+   /* the GSO CB lays after the UDP one, no need to save and restore any
+* CB fragment
+*/
+   segs = __skb_gso_segment(skb, NETIF_F_SG, false);
+   if (unlikely(IS_ERR(segs))) {
+   kfree_skb(skb);
+   goto drop;
+   }
+
+   if (unlikely(!segs))
+   goto drop;
+
+   consume_skb(skb);
+   return segs;
+
+drop:
+   atomic_add(segs_nr, &sk->sk_drops);
+   SNMP_ADD_STATS(__UDPX_MIB(sk, ipv4), UDP_MIB_INERRORS, segs_nr);
+   return NULL;
+}
+
 #endif /* _UDP_H */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index b345f71b1cbb..b45033f63673 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1909,7 +1909,7 @@ EXPORT_SYMBOL(udp_encap_enable);
  * Note that in the success and error cases, the skb is assumed to
  * have either been requeued or freed.
  */
-static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+static int udp_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
 {
struct udp_sock *up = udp_sk(sk);
int is_udplite = IS_UDPLITE(sk);
@@ -2012,6 +2012,29 @@ static int udp_queue_rcv_skb(struct sock *sk, struct 
sk_buff *skb)
return -1;
 }
 
+void ip_

[RFC PATCH v3 05/10] ipv6: factor out protocol delivery helper

2018-10-30 Thread Paolo Abeni
So that we can re-use it at the UDP level in the next patch

Signed-off-by: Paolo Abeni 
---
 net/ipv6/ip6_input.c | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 96577e742afd..3065226bdc57 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -319,28 +319,26 @@ void ipv6_list_rcv(struct list_head *head, struct 
packet_type *pt,
 /*
  * Deliver the packet to the host
  */
-
-
-static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff 
*skb)
+void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
nexthdr,
+ bool have_final)
 {
const struct inet6_protocol *ipprot;
struct inet6_dev *idev;
unsigned int nhoff;
-   int nexthdr;
bool raw;
-   bool have_final = false;
 
/*
 *  Parse extension headers
 */
 
-   rcu_read_lock();
 resubmit:
idev = ip6_dst_idev(skb_dst(skb));
-   if (!pskb_pull(skb, skb_transport_offset(skb)))
-   goto discard;
nhoff = IP6CB(skb)->nhoff;
-   nexthdr = skb_network_header(skb)[nhoff];
+   if (!have_final) {
+   if (!pskb_pull(skb, skb_transport_offset(skb)))
+   goto discard;
+   nexthdr = skb_network_header(skb)[nhoff];
+   }
 
 resubmit_final:
raw = raw6_local_deliver(skb, nexthdr);
@@ -411,13 +409,19 @@ static int ip6_input_finish(struct net *net, struct sock 
*sk, struct sk_buff *sk
consume_skb(skb);
}
}
-   rcu_read_unlock();
-   return 0;
+   return;
 
 discard:
__IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS);
-   rcu_read_unlock();
kfree_skb(skb);
+}
+
+static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff 
*skb)
+{
+   rcu_read_lock();
+   ip6_protocol_deliver_rcu(net, skb, 0, false);
+   rcu_read_unlock();
+
return 0;
 }
 
-- 
2.17.2



[RFC PATCH v3 03/10] udp: add support for UDP_GRO cmsg

2018-10-30 Thread Paolo Abeni
When UDP GRO is enabled, the UDP_GRO cmsg will carry the ingress
datagram size. User-space can use such info to compute the original
packets layout.

Signed-off-by: Paolo Abeni 
---
Note: I avoided setting a bit in cmsg_flag for UDP_GRO, as that
attempt produced some uglification, especially on the ipv6 side,
with no measurable performance benefits.
---
 include/linux/udp.h | 11 +++
 net/ipv4/udp.c  |  4 
 net/ipv6/udp.c  |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index f613b329852e..e23d5024f42f 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -121,6 +121,17 @@ static inline bool udp_get_no_check6_rx(struct sock *sk)
return udp_sk(sk)->no_check6_rx;
 }
 
+static inline void udp_cmsg_recv(struct msghdr *msg, struct sock *sk,
+struct sk_buff *skb)
+{
+   int gso_size;
+
+   if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
+   gso_size = skb_shinfo(skb)->gso_size;
+   put_cmsg(msg, SOL_UDP, UDP_GRO, sizeof(gso_size), &gso_size);
+   }
+}
+
 #define udp_portaddr_for_each_entry(__sk, list) \
hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 4d4f4d044c28..b345f71b1cbb 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1714,6 +1714,10 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int noblock,
memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
*addr_len = sizeof(*sin);
}
+
+   if (udp_sk(sk)->gro_enabled)
+   udp_cmsg_recv(msg, sk, skb);
+
if (inet->cmsg_flags)
ip_cmsg_recv_offset(msg, sk, skb, sizeof(struct udphdr), off);
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index fc0ce6c59ebb..8e76e719305c 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -421,6 +421,9 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len,
*addr_len = sizeof(*sin6);
}
 
+   if (udp_sk(sk)->gro_enabled)
+   udp_cmsg_recv(msg, sk, skb);
+
if (np->rxopt.all)
ip6_datagram_recv_common_ctl(sk, msg, skb);
 
-- 
2.17.2



[RFC PATCH v3 00/10] udp: implement GRO support

2018-10-30 Thread Paolo Abeni
This series implements GRO support for UDP sockets, as the RX counterpart
of commit bec1f6f69736 ("udp: generate gso with UDP_SEGMENT").
The core functionality is implemented by the second patch, introducing a new
sockopt to enable UDP_GRO, while patch 3 implements support for passing the
segment size to the user space via a new cmsg.
UDP GRO performs a socket lookup for each ingress packet and aggregates datagrams
directed to UDP GRO-enabled sockets with a constant L4 tuple.

UDP GRO packets can land on non GRO-enabled sockets, e.g. due to iptables NAT
rules, and that could potentially confuse existing applications.

The solution adopted here is to de-segment the GRO packet before enqueuing
as needed. Since we must cope with packet reinsertion after de-segmentation,
the relevant code is factored-out in ipv4 and ipv6 specific helpers and exposed
to UDP usage.

While the current code can probably be improved, this safeguard, implemented in
patches 4-7, allows future enhancements to enable UDP GSO offload on more
virtual devices, eventually even on forwarded packets.

The last 4 patches implement some performance and functional self-tests,
re-using the existing udpgso infrastructure. The problematic scenario described
above is explicitly tested.

This revision of the series tries to address the feedback provided by Willem,
Steffen and Subash, fixing several bugs along the way.

rfc v2 - rfc v3:
 - cope better with exceptional conditions
 - test cases cleanup

rfc v1 - rfc v2:
 - use a new option to enable UDP GRO
 - use static keys to protect the UDP GRO socket lookup
 - cope with UDP GRO misdirection
 - add self-tests

Paolo Abeni (10):
  udp: implement complete book-keeping for encap_needed
  udp: implement GRO for plain UDP sockets.
  udp: add support for UDP_GRO cmsg
  ip: factor out protocol delivery helper
  ipv6: factor out protocol delivery helper
  udp: cope with UDP GRO packet misdirection
  selftests: add GRO support to udp bench rx program
  selftests: conditionally enable XDP support in udpgso_bench_rx
  selftests: add some benchmark for UDP GRO
  selftests: add functionals test for UDP GRO

 include/linux/udp.h   |  25 ++-
 include/net/udp.h |  51 -
 include/net/udp_tunnel.h  |   6 +
 include/uapi/linux/udp.h  |   1 +
 net/ipv4/ip_input.c   |  73 ---
 net/ipv4/udp.c|  54 -
 net/ipv4/udp_offload.c| 109 --
 net/ipv6/ip6_input.c  |  28 +--
 net/ipv6/udp.c|  44 +++-
 net/ipv6/udp_offload.c|   6 +-
 tools/testing/selftests/net/Makefile  |  70 +++
 tools/testing/selftests/net/udpgro.sh | 147 +
 tools/testing/selftests/net/udpgro_bench.sh   |  94 +
 tools/testing/selftests/net/udpgso_bench.sh   |   2 +-
 tools/testing/selftests/net/udpgso_bench_rx.c | 193 --
 tools/testing/selftests/net/udpgso_bench_tx.c |  22 +-
 tools/testing/selftests/net/xdp_dummy.c   |  13 ++
 17 files changed, 816 insertions(+), 122 deletions(-)
 create mode 100755 tools/testing/selftests/net/udpgro.sh
 create mode 100755 tools/testing/selftests/net/udpgro_bench.sh
 create mode 100644 tools/testing/selftests/net/xdp_dummy.c

-- 
2.17.2



Re: [RFC PATCH v2 01/10] udp: implement complete book-keeping for encap_needed

2018-10-25 Thread Paolo Abeni
Hi,

I'm sorry for lagging behind, this one fell outside my radar.

On Mon, 2018-10-22 at 12:06 -0400, Willem de Bruijn wrote:
> @@ -2431,7 +2435,9 @@ int udp_lib_setsockopt(struct sock *sk, int level, int 
> optname,
> > /* FALLTHROUGH */
> > case UDP_ENCAP_L2TPINUDP:
> > up->encap_type = val;
> > -   udp_encap_enable();
> > +   if (!up->encap_enabled)
> > +   udp_encap_enable();
> > +   up->encap_enabled = 1;
> 
> nit: no need for the branch: udp_encap_enable already has a branch and
> is static inline.

Uhm... I think it's needed, so that we call udp_encap_enable() at most
once per socket. If up->encap_enabled we also call static_key_disable
at socket destruction time (once per socket, again) and hopefully we
don't "leak" static_key_enable invocation.

> Perhaps it makes sense to convert that to take the udp_sock and handle
> the state change within, to avoid having to open code at multiple
> locations.

Possibly calling directly udp_tunnel_encap_enable()? that additionally
cope with ipv6, which is not needed here, but should not hurt.

Cheers,

Paolo



Re: [RFC PATCH v2 00/10] udp: implement GRO support

2018-10-23 Thread Paolo Abeni
Hi,

On Tue, 2018-10-23 at 14:10 +0200, Steffen Klassert wrote:
> On Fri, Oct 19, 2018 at 04:25:10PM +0200, Paolo Abeni wrote:
> > This series implements GRO support for UDP sockets, as the RX counterpart
> > of commit bec1f6f69736 ("udp: generate gso with UDP_SEGMENT").
> > The core functionality is implemented by the second patch, introducing a new
> > sockopt to enable UDP_GRO, while patch 3 implements support for passing the
> > segment size to the user space via a new cmsg.
> > UDP GRO performs a socket lookup for each ingress packets and aggregate 
> > datagram
> > directed to UDP GRO enabled sockets with constant l4 tuple.
> > 
> > UDP GRO packets can land on non GRO-enabled sockets, e.g. due to iptables 
> > NAT
> > rules, and that could potentially confuse existing applications.
> > 
> > The solution adopted here is to de-segment the GRO packet before enqueuing
> > as needed. Since we must cope with packet reinsertion after de-segmentation,
> > the relevant code is factored-out in ipv4 and ipv6 specific helpers and 
> > exposed
> > to UDP usage.
> > 
> > While the current code can probably be improved, this safeguard 
> > ,implemented in
> > the patches 4-7, allows future enachements to enable UDP GSO offload on more
> > virtual devices eventually even on forwarded packets.
> 
> I was curious what I get with this when doing packet forwarding.
> So I added forwarding support with the patch below (on top of
> your patchset). While the packet processing could work the way
> I do it in this patch, I'm not sure about the way I enable
> UDP GRO on forwarding. Maybe there it a better way than just
> do UDP GRO if forwarding is enabled on the receiving device.

My idea/hope is slightly different: ensure that UDP GRO + UDP input
path + UDP desegmentation on socket enqueue is safe and as fast as
plain UDP input path and then enable UDP GRO without socket lookup, for
each incoming frame (!!!).

If we land in the input path on a non UDP GRO-enabled socket, the
packet is de-segmented before enqueuing.

Socket lookup would be needed only if tunnels are enabled, to check if
the ingress frame match a tunnel. We will use UDP tunnel GRO in that
case.

> Some quick benchmark numbers with UDP packet forwarding
> (1460 byte packets) through two gateways:
> 
> net-next: 16.4 Gbps
> 
> net-next + UDP GRO: 20.3 Gbps

uhmmm... what do you think about this speed-up?

I hoped that without the socket lookup we can get measurably better
figures.

Cheers,

Paolo



Re: [RFC PATCH v2 06/10] udp: cope with UDP GRO packet misdirection

2018-10-23 Thread Paolo Abeni
Hi,

On Mon, 2018-10-22 at 13:04 -0600, Subash Abhinov Kasiviswanathan
wrote:
> On 2018-10-19 08:25, Paolo Abeni wrote:
> > In some scenarios, the GRO engine can assemble an UDP GRO packet
> > that ultimately lands on a non GRO-enabled socket.
> > This patch tries to address the issue explicitly checking for the UDP
> > socket features before enqueuing the packet, and eventually segmenting
> > the unexpected GRO packet, as needed.
> > 
> > We must also cope with re-insertion requests: after segmentation the
> > UDP code calls the helper introduced by the previous patches, as 
> > needed.
> > 
> > Signed-off-by: Paolo Abeni 
> > ---
> > +static inline bool udp_unexpected_gso(struct sock *sk, struct sk_buff
> > *skb)
> > +{
> > +   return !udp_sk(sk)->gro_enabled && skb_is_gso(skb) &&
> > +  skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4;
> > +}
> > +
> > +static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
> > + struct sk_buff *skb)
> > +{
> > +   struct sk_buff *segs;
> > +
> > +   /* the GSO CB lays after the UDP one, no need to save and restore
> > any
> > +* CB fragment, just initialize it
> > +*/
> > +   segs = __skb_gso_segment(skb, NETIF_F_SG, false);
> > +   if (unlikely(IS_ERR(segs)))
> > +   kfree_skb(skb);
> > +   else if (segs)
> > +   consume_skb(skb);
> > +   return segs;
> > +}
> > +
> > +
> 
> Hi Paolo
> 
> Do we need to check for IS_ERR_OR_NULL(segs)

Yes, thanks.

(also Williem already noted the above)

> > 
> > +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int
> > proto);
> > +
> > +static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> > +{
> > +   struct sk_buff *next, *segs;
> > +   int ret;
> > +
> > +   if (likely(!udp_unexpected_gso(sk, skb)))
> > +   return udp_queue_rcv_one_skb(sk, skb);
> > +static int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> > +{
> > +   struct sk_buff *next, *segs;
> > +   int ret;
> > +
> > +   if (likely(!udp_unexpected_gso(sk, skb)))
> > +   return udpv6_queue_rcv_one_skb(sk, skb);
> > +
> 
> Is the "likely" required here?

Not required, but currently helpful IMHO, as we should hit the above
only on an unlikely and really unwanted configuration.

Note that only SKB_GSO_UDP_L4 GSO packets will not match the above
likely condition.

> HW can coalesce all incoming streams of UDP and may not know the socket 
> state.
> In that case, a socket not having UDP GRO option might see a penalty 
> here.

Really? Is there any HW creating SKB_GSO_UDP_L4 packets on RX? If the
HW is doing that, then without this patch I think it's breaking
existing applications (which may expect that the received UDP frame
length implicitly describes the application-level message length).

Cheers,

Paolo






Re: [PATCH net-next 1/3] net/sock: factor out dequeue/peek with offset code

2018-10-23 Thread Paolo Abeni
Hi,

On Mon, 2018-10-22 at 21:49 -0700, Alexei Starovoitov wrote:
> On Mon, May 15, 2017 at 11:01:42AM +0200, Paolo Abeni wrote:
> > And update __sk_queue_drop_skb() to work on the specified queue.
> > This will help the udp protocol to use an additional private
> > rx queue in a later patch.
> > 
> > Signed-off-by: Paolo Abeni 
> > ---
> >  include/linux/skbuff.h |  7 
> >  include/net/sock.h |  4 +--
> >  net/core/datagram.c| 90 
> > --
> >  3 files changed, 60 insertions(+), 41 deletions(-)
> > 
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index a098d95..bfc7892 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -3056,6 +3056,13 @@ static inline void skb_frag_list_init(struct sk_buff 
> > *skb)
> >  
> >  int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
> > const struct sk_buff *skb);
> > +struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
> > + struct sk_buff_head *queue,
> > + unsigned int flags,
> > + void (*destructor)(struct sock *sk,
> > +  struct sk_buff *skb),
> > + int *peeked, int *off, int *err,
> > + struct sk_buff **last);
> >  struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
> > void (*destructor)(struct sock *sk,
> >struct sk_buff *skb),
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 66349e4..49d226f 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -2035,8 +2035,8 @@ void sk_reset_timer(struct sock *sk, struct 
> > timer_list *timer,
> >  
> >  void sk_stop_timer(struct sock *sk, struct timer_list *timer);
> >  
> > -int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
> > -   unsigned int flags,
> > +int __sk_queue_drop_skb(struct sock *sk, struct sk_buff_head *sk_queue,
> > +   struct sk_buff *skb, unsigned int flags,
> > void (*destructor)(struct sock *sk,
> >struct sk_buff *skb));
> >  int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
> > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > index db1866f2..a4592b4 100644
> > --- a/net/core/datagram.c
> > +++ b/net/core/datagram.c
> > @@ -161,6 +161,43 @@ static struct sk_buff *skb_set_peeked(struct sk_buff 
> > *skb)
> > return skb;
> >  }
> >  
> > +struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
> > + struct sk_buff_head *queue,
> > + unsigned int flags,
> > + void (*destructor)(struct sock *sk,
> > +  struct sk_buff *skb),
> > + int *peeked, int *off, int *err,
> > + struct sk_buff **last)
> > +{
> > +   struct sk_buff *skb;
> > +
> > +   *last = queue->prev;
> 
> this refactoring changed the behavior.
> Now queue->prev is returned as last.
> Whereas it was *last = queue before.
> 
> > +   skb_queue_walk(queue, skb) {
> 
> and *last = skb assignment is gone too.
> 
> Was this intentional ? 

Yes.

> Is this the right behavior?

I think so. queue->prev is the last skb in the queue. With the old
code, __skb_try_recv_datagram(), when returning NULL, used the
instructions you quoted to eventually set 'last' to the last skb in
the queue. We did not use 'last' elsewhere, so overall this just
reduces the number of instructions inside the loop (unless I'm missing
something).

Are you experiencing any specific issues due to the mentioned commit?

Thanks,

Paolo



Re: [RFC PATCH v2 03/10] udp: add support for UDP_GRO cmsg

2018-10-22 Thread Paolo Abeni
On Sun, 2018-10-21 at 16:07 -0400, Willem de Bruijn wrote:
> On Fri, Oct 19, 2018 at 10:30 AM Paolo Abeni  wrote:
> > 
> > When UDP GRO is enabled, the UDP_GRO cmsg will carry the ingress
> > datagram size. User-space can use such info to compute the original
> > packets layout.
> > 
> > Signed-off-by: Paolo Abeni 
> > ---
> > CHECK: should we use a separate setsockopt to explicitly enable
> > gso_size cmsg reception? So that user space can enable UDP_GRO and
> > fetch cmsg without forcefully receiving GRO related info.
> 
> A user can avoid the message by not passing control data. Though in
> most practical cases it seems unsafe to do so, anyway, as the path MTU
> can be lower than the expected device MTU.
> 
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index 3c277378814f..2331ac9de954 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -1714,6 +1714,10 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, 
> > size_t len, int noblock,
> > memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
> > *addr_len = sizeof(*sin);
> > }
> > +
> > +   if (udp_sk(sk)->gro_enabled)
> > +   udp_cmsg_recv(msg, sk, skb);
> > +
> 
> Perhaps we can avoid adding a branch by setting a bit in
> inet->cmsg_flags for gso_size to let the below branch handle the cmsg
> processing.

Uhmm... I think that for ipv6 sockets we need to set a bit in rxopt
instead (and we already have some conditionals we could reuse for ipv6
socket recv cmsg processing).

> I'd still set that as part of the UDP_GRO setsockopt. Though if you
> insist it could be a value 2 instead of 1, effectively allowing the
> above mentioned opt-out.

I'm ok with the current impl (no additional value to opt-in the UDP GRO
cmsg).

Cheers,

Paolo



Re: [RFC PATCH v2 02/10] udp: implement GRO for plain UDP sockets.

2018-10-22 Thread Paolo Abeni
On Mon, 2018-10-22 at 13:24 +0200, Steffen Klassert wrote:
> On Fri, Oct 19, 2018 at 04:25:12PM +0200, Paolo Abeni wrote:
> >  
> > +#define UDO_GRO_CNT_MAX 64
> 
> Maybe better UDP_GRO_CNT_MAX?

Oops, typo. Yes, sure, will address in the next iteration.

> Btw. do we really need this explicit limit?
> We should not get more than 64 packets during
> one napi poll cycle.

With HZ >= 1000, gro_flush happens at most once per jiffy: we can get
many more than 64 packets per GRO segment, given an appropriate packet
length.

> 
> > +static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
> > +  struct sk_buff *skb)
> > +{
> > +   struct udphdr *uh = udp_hdr(skb);
> > +   struct sk_buff *pp = NULL;
> > +   struct udphdr *uh2;
> > +   struct sk_buff *p;
> > +
> > +   /* requires non zero csum, for simmetry with GSO */
> > +   if (!uh->check) {
> > +   NAPI_GRO_CB(skb)->flush = 1;
> > +   return NULL;
> > +   }
> 
> Why is the requirement of checksums different than in 
> udp_gro_receive? It's not that I care much about UDP
> packets without a checksum, but you would not need
> to implement your own loop if the requirement could
> be the same as in udp_gro_receive.

Uhm...
AFAIU, we need to generate aggregated packets that UDP GSO is able to
process/segment. I was unable to get a no-checksum packet segmented
(possibly PEBKAC), so I enforced that condition on the rx path.

@Willem: did I see ghosts here? Is UDP_SEGMENT fine with no-checksum
segments?

> > +
> > +   /* pull encapsulating udp header */
> > +   skb_gro_pull(skb, sizeof(struct udphdr));
> > +   skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr));
> > +
> > +   list_for_each_entry(p, head, list) {
> > +   if (!NAPI_GRO_CB(p)->same_flow)
> > +   continue;
> > +
> > +   uh2 = udp_hdr(p);
> > +
> > +   /* Match ports only, as csum is always non zero */
> > +   if ((*(u32 *)&uh->source != *(u32 *)&uh2->source)) {
> > +   NAPI_GRO_CB(p)->same_flow = 0;
> > +   continue;
> > +   }
> > +
> > +   /* Terminate the flow on len mismatch or if it grow "too much".
> > +* Under small packet flood GRO count could elsewhere grow a lot
> > +* leading to execessive truesize values
> > +*/
> > +   if (!skb_gro_receive(p, skb) &&
> > +   NAPI_GRO_CB(p)->count > UDO_GRO_CNT_MAX)
> 
> This allows to merge UDO_GRO_CNT_MAX + 1 packets.

Thanks, will address in the next iteration.

Cheers,

Paolo



Re: [RFC PATCH v2 06/10] udp: cope with UDP GRO packet misdirection

2018-10-22 Thread Paolo Abeni
Hi,

On Mon, 2018-10-22 at 13:43 +0200, Steffen Klassert wrote:
> On Fri, Oct 19, 2018 at 04:25:16PM +0200, Paolo Abeni wrote:
> > +
> > +static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
> > + struct sk_buff *skb)
> > +{
> > +   struct sk_buff *segs;
> > +
> > +   /* the GSO CB lays after the UDP one, no need to save and restore any
> > +* CB fragment, just initialize it
> > +*/
> > +   segs = __skb_gso_segment(skb, NETIF_F_SG, false);
> > +   if (unlikely(IS_ERR(segs)))
> > +   kfree_skb(skb);
> > +   else if (segs)
> > +   consume_skb(skb);
> > +   return segs;
> > +}
> > +
> > +
> 
> One empty line too much.

Thank you, will handle in the next iteration.

> >  #define udp_portaddr_for_each_entry(__sk, list) \
> > hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
> >  
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index 2331ac9de954..0d55145ce9f5 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -1909,7 +1909,7 @@ EXPORT_SYMBOL(udp_encap_enable);
> >   * Note that in the success and error cases, the skb is assumed to
> >   * have either been requeued or freed.
> >   */
> > -static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> > +static int udp_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
> >  {
> > struct udp_sock *up = udp_sk(sk);
> > int is_udplite = IS_UDPLITE(sk);
> > @@ -2012,6 +2012,29 @@ static int udp_queue_rcv_skb(struct sock *sk, struct 
> > sk_buff *skb)
> > return -1;
> >  }
> >  
> > +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
> > proto);
> > +
> > +static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> > +{
> > +   struct sk_buff *next, *segs;
> > +   int ret;
> > +
> > +   if (likely(!udp_unexpected_gso(sk, skb)))
> > +   return udp_queue_rcv_one_skb(sk, skb);
> > +
> > +   BUILD_BUG_ON(sizeof(struct udp_skb_cb) > SKB_SGO_CB_OFFSET);
> > +   __skb_push(skb, -skb_mac_offset(skb));
> > +   segs = udp_rcv_segment(sk, skb);
> > +   for (skb = segs; skb; skb = next) {
> > +   next = skb->next;
> > +   __skb_pull(skb, skb_transport_offset(skb));
> > +   ret = udp_queue_rcv_one_skb(sk, skb);
> 
> udp_queue_rcv_one_skb() starts with doing a xfrm4_policy_check().
> Maybe we can do this on the GSO packet instead of the segments.
> So far this code is just for handling a corner case, but this might
> change.

I thought about keeping the policy check here, but then I preferred
what looked like the safest option. Perhaps we can improve with a
follow-up?

Cheers,

Paolo



Re: [RFC PATCH v2 10/10] selftests: add functionals test for UDP GRO

2018-10-22 Thread Paolo Abeni
On Sun, 2018-10-21 at 16:09 -0400, Willem de Bruijn wrote:
> On Fri, Oct 19, 2018 at 10:31 AM Paolo Abeni  wrote:
> > 
> > Extends the existing udp programs to allow checking for proper
> > GRO aggregation/GSO size, and run the tests via a shell script, using
> > a veth pair with XDP program attached to trigger the GRO code path.
> > 
> > Signed-off-by: Paolo Abeni 
> > ---
> >  tools/testing/selftests/net/Makefile  |   2 +-
> >  tools/testing/selftests/net/udpgro.sh | 144 ++
> >  tools/testing/selftests/net/udpgro_bench.sh   |   8 +-
> >  tools/testing/selftests/net/udpgso_bench.sh   |   2 +-
> >  tools/testing/selftests/net/udpgso_bench_rx.c | 125 +--
> >  tools/testing/selftests/net/udpgso_bench_tx.c |  22 ++-
> >  6 files changed, 281 insertions(+), 22 deletions(-)
> >  create mode 100755 tools/testing/selftests/net/udpgro.sh
> > 
> > diff --git a/tools/testing/selftests/net/udpgro.sh 
> > b/tools/testing/selftests/net/udpgro.sh
> > +   run_test "no GRO chk cmsg" "${ipv4_args} -M 10 -s 1400" "-4 -n 10 
> > -l 1400 -S -1"
> > +   run_test "no GRO chk cmsg" "${ipv6_args} -M 10 -s 1400" "-n 10 -l 
> > 1400 -S -1"
> 
> why expected segment size -1 in these two?

I was unable to come up with a self-explaining option name/syntax.
'-1' really means 'no UDP_SEGMENT cmsg': since the receiver did not
enable UDP_GRO, it should not receive such a cmsg.
> 
> > diff --git a/tools/testing/selftests/net/udpgso_bench_tx.c 
> > b/tools/testing/selftests/net/udpgso_bench_tx.c
> >  static void usage(const char *filepath)
> >  {
> > -   error(1, 0, "Usage: %s [-46cmStuz] [-C cpu] [-D dst ip] [-l secs] 
> > [-p port] [-s sendsize]",
> > +   error(1, 0, "Usage: %s [-46cmtuz] [-C cpu] [-D dst ip] [-l secs] 
> > [-m messagenr] [-p port] [-s sendsize] [-S gsosize]",
> > filepath);
> 
> missing -M

Will add in next iteration.

Additional note: in the current test implementation, 'no GRO chk cmsg'
sometimes wrongly reports a failure. I'll try to address it in the next
iteration (it's an issue in the test code I added, not a kernel one).

Cheers,

Paolo



Re: [RFC PATCH v2 08/10] selftests: conditionally enable XDP support in udpgso_bench_rx

2018-10-22 Thread Paolo Abeni
On Sun, 2018-10-21 at 16:09 -0400, Willem de Bruijn wrote:
> On Fri, Oct 19, 2018 at 10:31 AM Paolo Abeni  wrote:
> > 
> > XDP support will be used by a later patch to test the GRO path
> > in a net namespace, leveraging the veth XDP implementation.
> > To avoid breaking existing setup, XDP support is conditionally
> > enabled and build only if llc is locally available.
> > 
> > Signed-off-by: Paolo Abeni 
> > ---
> > diff --git a/tools/testing/selftests/net/Makefile 
> > b/tools/testing/selftests/net/Makefile
> > index 256d82d5fa87..176459b7c4d6 100644
> > --- a/tools/testing/selftests/net/Makefile
> > +++ b/tools/testing/selftests/net/Makefile
> > @@ -16,8 +16,77 @@ TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu 
> > reuseport_bpf_numa
> >  TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict tls
> > 
> >  KSFT_KHDR_INSTALL := 1
> > +
> > +# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine 
> > on cmdline:
> > +#  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
> > CLANG=~/git/llvm/build/bin/clang
> > +LLC ?= llc
> > +CLANG ?= clang
> > +LLVM_OBJCOPY ?= llvm-objcopy
> > +BTF_PAHOLE ?= pahole
> > +HAS_LLC := $(shell which $(LLC) 2>/dev/null)
> > +
> > +# conditional enable testes requiring llc
> > +ifneq (, $(HAS_LLC))
> > +TEST_GEN_FILES += xdp_dummy.o
> > +endif
> > +
> >  include ../lib.mk
> > 
> > +ifneq (, $(HAS_LLC))
> > +
> > +# Detect that we're cross compiling and use the cross compiler
> > +ifdef CROSS_COMPILE
> > +CLANG_ARCH_ARGS = -target $(ARCH)
> > +endif
> > +
> > +PROBE := $(shell $(LLC) -march=bpf -mcpu=probe -filetype=null /dev/null 
> > 2>&1)
> > +
> > +# Let newer LLVM versions transparently probe the kernel for availability
> > +# of full BPF instruction set.
> > +ifeq ($(PROBE),)
> > +  CPU ?= probe
> > +else
> > +  CPU ?= generic
> > +endif
> > +
> > +SRC_PATH := $(abspath ../../../..)
> > +LIB_PATH := $(SRC_PATH)/tools/lib
> > +XDP_CFLAGS := -D SUPPORT_XDP=1 -I$(LIB_PATH)
> > +LIBBPF = $(LIB_PATH)/bpf/libbpf.a
> > +BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep 
> > dwarfris)
> > +BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
> > +BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 
> > 'usage.*llvm')
> > +CLANG_SYS_INCLUDES := $(shell $(CLANG) -v -E - </dev/null 2>&1 \
> > +| sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }')
> > +CLANG_FLAGS = -I. -I$(SRC_PATH)/include -I../bpf/ \
> > + $(CLANG_SYS_INCLUDES) -Wno-compare-distinct-pointer-types
> > +
> > +ifneq ($(and $(BTF_LLC_PROBE),$(BTF_PAHOLE_PROBE),$(BTF_OBJCOPY_PROBE)),)
> > +   CLANG_CFLAGS += -g
> > +   LLC_FLAGS += -mattr=dwarfris
> > +   DWARF2BTF = y
> > +endif
> > +
> > +$(LIBBPF): FORCE
> > +# Fix up variables inherited from Kbuild that tools/ build system won't 
> > like
> > +   $(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(SRC_PATH) O= 
> > $(nodir $@)
> > +
> 
> This is a lot of XDP specific code. Not for this patchset, per se, but
> would be nice if we can reuse the logic in selftests/bpf for all this.

Agreed. Very similar code is already present, almost duplicated, in 3
different places (samples/bpf/Makefile, tools/testing/selftests/tc-
testing/bpf/Makefile and tools/testing/selftests/bpf/Makefile). A
bpf_lib.mk or the like would be nice ;). But I felt it was a bit out of
scope for this patch, and I'm new to XDP/ebpf, so I preferred to avoid
additional issues.

> > --- a/tools/testing/selftests/net/udpgso_bench_rx.c
> > @@ -227,6 +234,13 @@ static void parse_opts(int argc, char **argv)
> > cfg_verify = true;
> > cfg_read_all = true;
> > break;
> > +#ifdef SUPPORT_XDP
> > +   case 'x':
> > +   cfg_xdp_iface = if_nametoindex(optarg);
> > +   if (!cfg_xdp_iface)
> > +   error(1, errno, "unknown interface %s", 
> > optarg);
> > +   break;
> > +#endif
> 
> nit: needs to be added to getopt string in this patch.

Thanks, will do in next iteration.

Cheers,

Paolo



Re: [RFC PATCH v2 07/10] selftests: add GRO support to udp bench rx program

2018-10-22 Thread Paolo Abeni
On Sun, 2018-10-21 at 16:08 -0400, Willem de Bruijn wrote:
> On Fri, Oct 19, 2018 at 10:31 AM Paolo Abeni  wrote:
> > 
> > And fix a couple of buglets (port option processing,
> > clean termination on SIGINT). This is preparatory work
> > for GRO tests.
> > 
> > Signed-off-by: Paolo Abeni 
> > ---
> >  tools/testing/selftests/net/udpgso_bench_rx.c | 37 +++
> >  1 file changed, 30 insertions(+), 7 deletions(-)
> > 
> > diff --git a/tools/testing/selftests/net/udpgso_bench_rx.c 
> > b/tools/testing/selftests/net/udpgso_bench_rx.c
> > @@ -167,10 +177,10 @@ static void do_verify_udp(const char *data, int len)
> >  /* Flush all outstanding datagrams. Verify first few bytes of each. */
> >  static void do_flush_udp(int fd)
> >  {
> > -   static char rbuf[ETH_DATA_LEN];
> > +   static char rbuf[65535];
> 
> we can use ETH_MAX_MTU.

Thanks, will do in next iteration.

Paolo



Re: [RFC PATCH v2 06/10] udp: cope with UDP GRO packet misdirection

2018-10-22 Thread Paolo Abeni
On Sun, 2018-10-21 at 16:08 -0400, Willem de Bruijn wrote:
> On Fri, Oct 19, 2018 at 10:31 AM Paolo Abeni  wrote:
> > 
> > In some scenarios, the GRO engine can assemble an UDP GRO packet
> > that ultimately lands on a non GRO-enabled socket.
> > This patch tries to address the issue explicitly checking for the UDP
> > socket features before enqueuing the packet, and eventually segmenting
> > the unexpected GRO packet, as needed.
> > 
> > We must also cope with re-insertion requests: after segmentation the
> > UDP code calls the helper introduced by the previous patches, as needed.
> > 
> > Signed-off-by: Paolo Abeni 
> > ---
> > +static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
> > + struct sk_buff *skb)
> > +{
> > +   struct sk_buff *segs;
> > +
> > +   /* the GSO CB lays after the UDP one, no need to save and restore 
> > any
> > +* CB fragment, just initialize it
> > +*/
> > +   segs = __skb_gso_segment(skb, NETIF_F_SG, false);
> > +   if (unlikely(IS_ERR(segs)))
> > +   kfree_skb(skb);
> > +   else if (segs)
> > +   consume_skb(skb);
> > +   return segs;
> > +}
> > +
> > +
> > +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
> > proto);
> > +
> > +static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> > +{
> > +   struct sk_buff *next, *segs;
> > +   int ret;
> > +
> > +   if (likely(!udp_unexpected_gso(sk, skb)))
> > +   return udp_queue_rcv_one_skb(sk, skb);
> > +
> > +   BUILD_BUG_ON(sizeof(struct udp_skb_cb) > SKB_SGO_CB_OFFSET);
> > +   __skb_push(skb, -skb_mac_offset(skb));
> > +   segs = udp_rcv_segment(sk, skb);
> > +   for (skb = segs; skb; skb = next) {
> 
> need to check IS_ERR(segs) again?

whooops ... yes, I think so, thanks for catching it.

Since the error code is always discarded, perhaps udp_rcv_segment() can
simply return 0 when IS_ERR(segs) is true, so we can save a conditional
here. This is currently a slower/exceptional path, but if we ever
enable UDP GRO for forwarded packets, it will be hit often.

Cheers,

Paolo



Re: [RFC PATCH v2 02/10] udp: implement GRO for plain UDP sockets.

2018-10-22 Thread Paolo Abeni
On Sun, 2018-10-21 at 16:06 -0400, Willem de Bruijn wrote:
> On Fri, Oct 19, 2018 at 10:30 AM Paolo Abeni  wrote:
> > 
> > This is the RX counterpart of commit bec1f6f69736 ("udp: generate gso
> > with UDP_SEGMENT"). When UDP_GRO is enabled, such socket is also
> > eligible for GRO in the rx path: UDP segments directed to such socket
> > are assembled into a larger GSO_UDP_L4 packet.
> > 
> > The core UDP GRO support is enabled with setsockopt(UDP_GRO).
> > 
> > Initial benchmark numbers:
> > 
> > Before:
> > udp rx:   1079 MB/s   769065 calls/s
> > 
> > After:
> > udp rx:   1466 MB/s24877 calls/s
> > 
> > 
> > This change introduces a side effect in respect to UDP tunnels:
> > after a UDP tunnel creation, now the kernel performs a lookup per ingress
> > UDP packet, while before such lookup happened only if the ingress packet
> > carried a valid internal header csum.
> > 
> > v1 -> v2:
> >  - use a new option to enable UDP GRO
> >  - use static keys to protect the UDP GRO socket lookup
> > 
> > Signed-off-by: Paolo Abeni 
> > ---
> >  include/linux/udp.h  |   3 +-
> >  include/uapi/linux/udp.h |   1 +
> >  net/ipv4/udp.c   |   7 +++
> >  net/ipv4/udp_offload.c   | 109 +++
> >  net/ipv6/udp_offload.c   |   6 +--
> >  5 files changed, 98 insertions(+), 28 deletions(-)
> > 
> > diff --git a/include/linux/udp.h b/include/linux/udp.h
> > index a4dafff407fb..f613b329852e 100644
> > --- a/include/linux/udp.h
> > +++ b/include/linux/udp.h
> > @@ -50,11 +50,12 @@ struct udp_sock {
> > __u8 encap_type;/* Is this an Encapsulation socket? 
> > */
> > unsigned charno_check6_tx:1,/* Send zero UDP6 checksums on TX? 
> > */
> >  no_check6_rx:1,/* Allow zero UDP6 checksums on RX? 
> > */
> > -encap_enabled:1; /* This socket enabled encap
> > +encap_enabled:1, /* This socket enabled encap
> >* processing; UDP tunnels and
> >* different encapsulation layer 
> > set
> >* this
> >*/
> > +gro_enabled:1; /* Can accept GRO packets */
> > 
> > /*
> >  * Following member retains the information to create a UDP header
> >  * when the socket is uncorked.
> > diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
> > index 09502de447f5..30baccb6c9c4 100644
> > --- a/include/uapi/linux/udp.h
> > +++ b/include/uapi/linux/udp.h
> > @@ -33,6 +33,7 @@ struct udphdr {
> >  #define UDP_NO_CHECK6_TX 101   /* Disable sending checksum for UDP6X */
> >  #define UDP_NO_CHECK6_RX 102   /* Disable accpeting checksum for UDP6 */
> >  #define UDP_SEGMENT103 /* Set GSO segmentation size */
> > +#define UDP_GRO104 /* This socket can receive UDP GRO 
> > packets */
> > 
> >  /* UDP encapsulation types */
> >  #define UDP_ENCAP_ESPINUDP_NON_IKE 1 /* 
> > draft-ietf-ipsec-nat-t-ike-00/01 */
> > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > index 9fcb5374e166..3c277378814f 100644
> > --- a/net/ipv4/udp.c
> > +++ b/net/ipv4/udp.c
> > @@ -115,6 +115,7 @@
> >  #include "udp_impl.h"
> >  #include 
> >  #include 
> > +#include 
> > 
> >  struct udp_table udp_table __read_mostly;
> >  EXPORT_SYMBOL(udp_table);
> > @@ -2459,6 +2460,12 @@ int udp_lib_setsockopt(struct sock *sk, int level, 
> > int optname,
> > up->gso_size = val;
> > break;
> > 
> > +   case UDP_GRO:
> > +   if (valbool)
> > +   udp_tunnel_encap_enable(sk->sk_socket);
> > +   up->gro_enabled = valbool;
> 
> The socket lock is not held here, so multiple updates to
> up->gro_enabled and the up->encap_enabled and the static branch can
> race. Syzkaller is adept at generating those.

Good catch. I was fooled by the existing code. I think there are
potentially similar issues for UDP_ENCAP, UDPLITE_SEND_CSCOV, ...

Since the rx path doesn't take it anymore and we don't risk starvation,
I think we could/should always acquire the socket lock on setsockopt,
wdyt?

> > +#define UDO_GRO_CNT_MAX 64
> > +static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
> > +  struct sk_buff *skb)
> > +{
> > +   struct udphdr *uh = udp_hdr(skb);
> > +   struct sk_buff *pp = NULL;
> > +   struct udphdr *uh2;
> > +   struct sk_buff *p;
> > +
> > +   /* requires non zero csum, for simmetry with GSO */
> 
> symmetry

Thanks ;)

Paolo



Re: [RFC PATCH v2 00/10] udp: implement GRO support

2018-10-22 Thread Paolo Abeni
Hi all,

On Sun, 2018-10-21 at 16:05 -0400, Willem de Bruijn wrote:
> On Fri, Oct 19, 2018 at 10:30 AM Paolo Abeni  wrote:
> > 
> > This series implements GRO support for UDP sockets, as the RX counterpart
> > of commit bec1f6f69736 ("udp: generate gso with UDP_SEGMENT").
> > The core functionality is implemented by the second patch, introducing a new
> > sockopt to enable UDP_GRO, while patch 3 implements support for passing the
> > segment size to the user space via a new cmsg.
> > UDP GRO performs a socket lookup for each ingress packets and aggregate 
> > datagram
> > directed to UDP GRO enabled sockets with constant l4 tuple.
> > 
> > UDP GRO packets can land on non GRO-enabled sockets, e.g. due to iptables 
> > NAT
> > rules, and that could potentially confuse existing applications.
> 
> Good catch.
> 
> > The solution adopted here is to de-segment the GRO packet before enqueuing
> > as needed. Since we must cope with packet reinsertion after de-segmentation,
> > the relevant code is factored-out in ipv4 and ipv6 specific helpers and 
> > exposed
> > to UDP usage.
> > 
> > While the current code can probably be improved, this safeguard 
> > ,implemented in
> > the patches 4-7, allows future enachements to enable UDP GSO offload on more
> > virtual devices eventually even on forwarded packets.
> > 
> > The last 4 for patches implement some performance and functional self-tests,
> > re-using the existing udpgso infrastructure. The problematic scenario 
> > described
> > above is explicitly tested.
> 
> This looks awesome! Impressive testing, too.
> 
> A few comments in the individual patches, mostly minor.

Thank you for the in-depth review! (over the weekend, too ;)

I'll try to address the comments on each patch individually.

In the end I rushed this RFC a bit, because the misdirection issue
(and the tentative fix) bothered me more than a bit: I wanted to check
that I was not completely off track.

Cheers,

Paolo




[RFC PATCH v2 08/10] selftests: conditionally enable XDP support in udpgso_bench_rx

2018-10-19 Thread Paolo Abeni
XDP support will be used by a later patch to test the GRO path
in a net namespace, leveraging the veth XDP implementation.
To avoid breaking existing setup, XDP support is conditionally
enabled and build only if llc is locally available.

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/Makefile  | 69 +++
 tools/testing/selftests/net/udpgso_bench_rx.c | 37 ++
 tools/testing/selftests/net/xdp_dummy.c   | 13 
 3 files changed, 119 insertions(+)
 create mode 100644 tools/testing/selftests/net/xdp_dummy.c

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index 256d82d5fa87..176459b7c4d6 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -16,8 +16,77 @@ TEST_GEN_PROGS = reuseport_bpf reuseport_bpf_cpu 
reuseport_bpf_numa
 TEST_GEN_PROGS += reuseport_dualstack reuseaddr_conflict tls
 
 KSFT_KHDR_INSTALL := 1
+
+# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
+#  make samples/bpf/ LLC=~/git/llvm/build/bin/llc 
CLANG=~/git/llvm/build/bin/clang
+LLC ?= llc
+CLANG ?= clang
+LLVM_OBJCOPY ?= llvm-objcopy
+BTF_PAHOLE ?= pahole
+HAS_LLC := $(shell which $(LLC) 2>/dev/null)
+
+# conditionally enable tests requiring llc
+ifneq (, $(HAS_LLC))
+TEST_GEN_FILES += xdp_dummy.o
+endif
+
 include ../lib.mk
 
+ifneq (, $(HAS_LLC))
+
+# Detect that we're cross compiling and use the cross compiler
+ifdef CROSS_COMPILE
+CLANG_ARCH_ARGS = -target $(ARCH)
+endif
+
+PROBE := $(shell $(LLC) -march=bpf -mcpu=probe -filetype=null /dev/null 2>&1)
+
+# Let newer LLVM versions transparently probe the kernel for availability
+# of full BPF instruction set.
+ifeq ($(PROBE),)
+  CPU ?= probe
+else
+  CPU ?= generic
+endif
+
+SRC_PATH := $(abspath ../../../..)
+LIB_PATH := $(SRC_PATH)/tools/lib
+XDP_CFLAGS := -D SUPPORT_XDP=1 -I$(LIB_PATH)
+LIBBPF = $(LIB_PATH)/bpf/libbpf.a
+BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
+BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
+BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 
'usage.*llvm')
+CLANG_SYS_INCLUDES := $(shell $(CLANG) -v -E - </dev/null 2>&1 \
+| sed -n '/<...> search starts here:/,/End of search list./{ s| \(/.*\)|-idirafter \1|p }')
+CLANG_FLAGS = -I. -I$(SRC_PATH)/include -I../bpf/ \
+ $(CLANG_SYS_INCLUDES) -Wno-compare-distinct-pointer-types
+
+ifneq ($(and $(BTF_LLC_PROBE),$(BTF_PAHOLE_PROBE),$(BTF_OBJCOPY_PROBE)),)
+   CLANG_CFLAGS += -g
+   LLC_FLAGS += -mattr=dwarfris
+   DWARF2BTF = y
+endif
+
+$(LIBBPF): FORCE
+# Fix up variables inherited from Kbuild that tools/ build system won't like
+   $(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(SRC_PATH) O= 
$(nodir $@)
+
+$(OUTPUT)/udpgso_bench_rx: $(OUTPUT)/udpgso_bench_rx.c $(LIBBPF)
+   $(CC) -o $@ $(XDP_CFLAGS) $(CFLAGS) $(LOADLIBES) $(LDLIBS) $^ -lelf
+
+FORCE:
+
+# bpf program[s] generation
+$(OUTPUT)/%.o: %.c
+   $(CLANG) $(CLANG_FLAGS) \
+-O2 -target bpf -emit-llvm -c $< -o - |  \
+   $(LLC) -march=bpf -mcpu=$(CPU) $(LLC_FLAGS) -filetype=obj -o $@
+ifeq ($(DWARF2BTF),y)
+   $(BTF_PAHOLE) -J $@
+endif
+
+endif
+
 $(OUTPUT)/reuseport_bpf_numa: LDFLAGS += -lnuma
 $(OUTPUT)/tcp_mmap: LDFLAGS += -lpthread
 $(OUTPUT)/tcp_inq: LDFLAGS += -lpthread
diff --git a/tools/testing/selftests/net/udpgso_bench_rx.c 
b/tools/testing/selftests/net/udpgso_bench_rx.c
index c55acdd1e27b..84f101852805 100644
--- a/tools/testing/selftests/net/udpgso_bench_rx.c
+++ b/tools/testing/selftests/net/udpgso_bench_rx.c
@@ -31,6 +31,10 @@
 #include 
 #include 
 
+#ifdef SUPPORT_XDP
+#include "bpf/libbpf.h"
+#endif
+
 #ifndef UDP_GRO
 #define UDP_GRO104
 #endif
@@ -40,6 +44,9 @@ static bool cfg_tcp;
 static bool cfg_verify;
 static bool cfg_read_all;
 static bool cfg_gro_segment;
+#ifdef SUPPORT_XDP
+static int cfg_xdp_iface;
+#endif
 
 static bool interrupted;
 static unsigned long packets, bytes;
@@ -227,6 +234,13 @@ static void parse_opts(int argc, char **argv)
cfg_verify = true;
cfg_read_all = true;
break;
+#ifdef SUPPORT_XDP
+   case 'x':
+   cfg_xdp_iface = if_nametoindex(optarg);
+   if (!cfg_xdp_iface)
+   error(1, errno, "unknown interface %s", optarg);
+   break;
+#endif
}
}
 
@@ -240,6 +254,9 @@ static void parse_opts(int argc, char **argv)
 static void do_recv(void)
 {
unsigned long tnow, treport;
+#ifdef SUPPORT_XDP
+   int prog_fd = -1;
+#endif
int fd;
 
fd = do_socket(cfg_tcp);
@@ -250,6 +267,22 @@ static void do_recv(void)
error(1, errno, "setsockopt UDP_GRO");
}
 
+#ifdef SUPPORT_XDP
+   if 

[RFC PATCH v2 07/10] selftests: add GRO support to udp bench rx program

2018-10-19 Thread Paolo Abeni
And fix a couple of buglets (port option processing,
clean termination on SIGINT). This is preparatory work
for GRO tests.

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/udpgso_bench_rx.c | 37 +++
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/net/udpgso_bench_rx.c 
b/tools/testing/selftests/net/udpgso_bench_rx.c
index 727cf67a3f75..c55acdd1e27b 100644
--- a/tools/testing/selftests/net/udpgso_bench_rx.c
+++ b/tools/testing/selftests/net/udpgso_bench_rx.c
@@ -31,9 +31,15 @@
 #include 
 #include 
 
+#ifndef UDP_GRO
 #define UDP_GRO	104
+#endif
+
 static int  cfg_port   = 8000;
 static bool cfg_tcp;
 static bool cfg_verify;
+static bool cfg_read_all;
+static bool cfg_gro_segment;
 
 static bool interrupted;
 static unsigned long packets, bytes;
@@ -63,6 +69,8 @@ static void do_poll(int fd)
 
do {
	ret = poll(&pfd, 1, 10);
+   if (interrupted)
+   break;
if (ret == -1)
error(1, errno, "poll");
if (ret == 0)
@@ -70,7 +78,7 @@ static void do_poll(int fd)
if (pfd.revents != POLLIN)
error(1, errno, "poll: 0x%x expected 0x%x\n",
pfd.revents, POLLIN);
-   } while (!ret && !interrupted);
+   } while (!ret);
 }
 
 static int do_socket(bool do_tcp)
@@ -102,6 +110,8 @@ static int do_socket(bool do_tcp)
error(1, errno, "listen");
 
do_poll(accept_fd);
+   if (interrupted)
+   exit(0);
 
fd = accept(accept_fd, NULL, NULL);
if (fd == -1)
@@ -167,10 +177,10 @@ static void do_verify_udp(const char *data, int len)
 /* Flush all outstanding datagrams. Verify first few bytes of each. */
 static void do_flush_udp(int fd)
 {
-   static char rbuf[ETH_DATA_LEN];
+   static char rbuf[65535];
int ret, len, budget = 256;
 
-   len = cfg_verify ? sizeof(rbuf) : 0;
+   len = cfg_read_all ? sizeof(rbuf) : 0;
while (budget--) {
/* MSG_TRUNC will make return value full datagram length */
ret = recv(fd, rbuf, len, MSG_TRUNC | MSG_DONTWAIT);
@@ -178,7 +188,7 @@ static void do_flush_udp(int fd)
return;
if (ret == -1)
error(1, errno, "recv");
-   if (len) {
+   if (len && cfg_verify) {
if (ret == 0)
error(1, errno, "recv: 0 byte datagram\n");
 
@@ -192,23 +202,30 @@ static void do_flush_udp(int fd)
 
 static void usage(const char *filepath)
 {
-   error(1, 0, "Usage: %s [-tv] [-p port]", filepath);
+   error(1, 0, "Usage: %s [-Grtv] [-p port]", filepath);
 }
 
 static void parse_opts(int argc, char **argv)
 {
int c;
 
-   while ((c = getopt(argc, argv, "ptv")) != -1) {
+   while ((c = getopt(argc, argv, "Gp:rtvx:")) != -1) {
switch (c) {
+   case 'G':
+   cfg_gro_segment = true;
+   break;
case 'p':
-   cfg_port = htons(strtoul(optarg, NULL, 0));
+   cfg_port = strtoul(optarg, NULL, 0);
+   break;
+   case 'r':
+   cfg_read_all = true;
break;
case 't':
cfg_tcp = true;
break;
case 'v':
cfg_verify = true;
+   cfg_read_all = true;
break;
}
}
@@ -227,6 +244,12 @@ static void do_recv(void)
 
fd = do_socket(cfg_tcp);
 
+   if (cfg_gro_segment && !cfg_tcp) {
+   int val = 1;
+   if (setsockopt(fd, IPPROTO_UDP, UDP_GRO, &val, sizeof(val)))
+   error(1, errno, "setsockopt UDP_GRO");
+   }
+
treport = gettimeofday_ms() + 1000;
do {
do_poll(fd);
-- 
2.17.2



[RFC PATCH v2 09/10] selftests: add some benchmark for UDP GRO

2018-10-19 Thread Paolo Abeni
Run on top of veth pair, using a dummy XDP program to enable the GRO.

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/Makefile|  1 +
 tools/testing/selftests/net/udpgro_bench.sh | 92 +
 2 files changed, 93 insertions(+)
 create mode 100755 tools/testing/selftests/net/udpgro_bench.sh

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index 176459b7c4d6..ac999354af54 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -7,6 +7,7 @@ CFLAGS += -I../../../../usr/include/
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh 
rtnetlink.sh
 TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
+TEST_PROGS += udpgro_bench.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
diff --git a/tools/testing/selftests/net/udpgro_bench.sh 
b/tools/testing/selftests/net/udpgro_bench.sh
new file mode 100755
index ..03d37e5e7424
--- /dev/null
+++ b/tools/testing/selftests/net/udpgro_bench.sh
@@ -0,0 +1,92 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Run a series of udpgro benchmarks
+
+readonly PEER_NS="ns-peer-$(mktemp -u XX)"
+
+cleanup() {
+   local -r jobs="$(jobs -p)"
+   local -r ns="$(ip netns list|grep $PEER_NS)"
+
+   [ -n "${jobs}" ] && kill -INT ${jobs} 2>/dev/null
+   [ -n "$ns" ] && ip netns del $ns 2>/dev/null
+}
+trap cleanup EXIT
+
+run_one() {
+   # use 'rx' as separator between sender args and receiver args
+   local -r all="$@"
+   local -r tx_args=${all%rx*}
+   local -r rx_args=${all#*rx}
+
+   ip netns add "${PEER_NS}"
+   ip -netns "${PEER_NS}" link set lo up
+   ip link add type veth
+   ip link set dev veth0 up
+   ip addr add dev veth0 192.168.1.2/24
+   ip addr add dev veth0 2001:db8::2/64 nodad
+
+   ip link set dev veth1 netns "${PEER_NS}"
+   ip -netns "${PEER_NS}" addr add dev veth1 192.168.1.1/24
+   ip -netns "${PEER_NS}" addr add dev veth1 2001:db8::1/64 nodad
+   ip -netns "${PEER_NS}" link set dev veth1 up
+
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${rx_args} -r -x veth1 &
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -t ${rx_args} -r &
+
+   # Hack: let bg programs complete the startup
+   sleep 0.1
+   ./udpgso_bench_tx ${tx_args}
+}
+
+run_in_netns() {
+   local -r args=$@
+
+   ./in_netns.sh $0 __subprocess ${args}
+}
+
+run_udp() {
+   local -r args=$@
+
+   echo "udp gso - over veth touching data"
+   run_in_netns ${args} -S rx
+
+   echo "udp gso and gro - over veth touching data"
+   run_in_netns ${args} -S rx -G
+}
+
+run_tcp() {
+   local -r args=$@
+
+   echo "tcp - over veth touching data"
+   run_in_netns ${args} -t rx
+}
+
+run_all() {
+   local -r core_args="-l 4"
+   local -r ipv4_args="${core_args} -4 -D 192.168.1.1"
+   local -r ipv6_args="${core_args} -6 -D 2001:db8::1"
+
+   echo "ipv4"
+   run_tcp "${ipv4_args}"
+   run_udp "${ipv4_args}"
+
+   echo "ipv6"
+   run_tcp "${ipv6_args}"
+   run_udp "${ipv6_args}"
+}
+
+if [ ! -f xdp_dummy.o ]; then
+   echo "Skipping GRO benchmarks - missing LLC"
+   exit 0
+fi
+
+if [[ $# -eq 0 ]]; then
+   run_all
+elif [[ $1 == "__subprocess" ]]; then
+   shift
+   run_one $@
+else
+   run_in_netns $@
+fi
-- 
2.17.2



[RFC PATCH v2 10/10] selftests: add functionals test for UDP GRO

2018-10-19 Thread Paolo Abeni
Extends the existing udp programs to allow checking for proper
GRO aggregation/GSO size, and run the tests via a shell script, using
a veth pair with XDP program attached to trigger the GRO code path.

Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/Makefile  |   2 +-
 tools/testing/selftests/net/udpgro.sh | 144 ++
 tools/testing/selftests/net/udpgro_bench.sh   |   8 +-
 tools/testing/selftests/net/udpgso_bench.sh   |   2 +-
 tools/testing/selftests/net/udpgso_bench_rx.c | 125 +--
 tools/testing/selftests/net/udpgso_bench_tx.c |  22 ++-
 6 files changed, 281 insertions(+), 22 deletions(-)
 create mode 100755 tools/testing/selftests/net/udpgro.sh

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index ac999354af54..a8a0d256aafb 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -7,7 +7,7 @@ CFLAGS += -I../../../../usr/include/
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh 
rtnetlink.sh
 TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh udpgso.sh ip_defrag.sh
 TEST_PROGS += udpgso_bench.sh fib_rule_tests.sh msg_zerocopy.sh psock_snd.sh
-TEST_PROGS += udpgro_bench.sh
+TEST_PROGS += udpgro_bench.sh udpgro.sh
 TEST_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
diff --git a/tools/testing/selftests/net/udpgro.sh 
b/tools/testing/selftests/net/udpgro.sh
new file mode 100755
index ..eb380c7babf0
--- /dev/null
+++ b/tools/testing/selftests/net/udpgro.sh
@@ -0,0 +1,144 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Run a series of udpgro functional tests.
+
+readonly PEER_NS="ns-peer-$(mktemp -u XX)"
+
+cleanup() {
+   local -r jobs="$(jobs -p)"
+   local -r ns="$(ip netns list|grep $PEER_NS)"
+
+   [ -n "${jobs}" ] && kill -1 ${jobs} 2>/dev/null
+   [ -n "$ns" ] && ip netns del $ns 2>/dev/null
+}
+trap cleanup EXIT
+
+cfg_veth() {
+   ip netns add "${PEER_NS}"
+   ip -netns "${PEER_NS}" link set lo up
+   ip link add type veth
+   ip link set dev veth0 up
+   ip addr add dev veth0 192.168.1.2/24
+   ip addr add dev veth0 2001:db8::2/64 nodad
+
+   ip link set dev veth1 netns "${PEER_NS}"
+   ip -netns "${PEER_NS}" addr add dev veth1 192.168.1.1/24
+   ip -netns "${PEER_NS}" addr add dev veth1 2001:db8::1/64 nodad
+   ip -netns "${PEER_NS}" link set dev veth1 up
+}
+
+run_one() {
+   # use 'rx' as separator between sender args and receiver args
+   local -r all="$@"
+   local -r tx_args=${all%rx*}
+   local -r rx_args=${all#*rx}
+
+   cfg_veth
+
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${rx_args} && \
+   echo "ok" || \
+   echo "failed" &
+
+   # Hack: let bg programs complete the startup
+   sleep 0.1
+   ./udpgso_bench_tx ${tx_args}
+   wait $(jobs -p)
+}
+
+run_test() {
+   local -r args=$@
+
+   printf " %-40s" "$1"
+   ./in_netns.sh $0 __subprocess $2 rx -G -r -x veth1 $3
+}
+
+run_one_nat() {
+   # use 'rx' as separator between sender args and receiver args
+   local addr1 addr2 pid family="" ipt_cmd=ip6tables
+   local -r all="$@"
+   local -r tx_args=${all%rx*}
+   local -r rx_args=${all#*rx}
+
+   if [[ ${tx_args} = *-4* ]]; then
+   ipt_cmd=iptables
+   family=-4
+   addr1=192.168.1.1
+   addr2=192.168.1.3/24
+   else
+   addr1=2001:db8::1
+   addr2="2001:db8::3/64 nodad"
+   fi
+
+   cfg_veth
+   ip -netns "${PEER_NS}" addr add dev veth1 ${addr2}
+
+   # fool the GRO engine changing the destination address ...
+   ip netns exec "${PEER_NS}" $ipt_cmd -t nat -I PREROUTING -d ${addr1} -j 
DNAT --to-destination ${addr2%/*}
+
+   # ... so that GRO will match the UDP_GRO enabled socket, but packets
+   # will land on the 'plain' one
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx -G ${family} -x veth1 -b 
${addr1} -n 0 &
+   pid=$!
+   ip netns exec "${PEER_NS}" ./udpgso_bench_rx ${family} -b ${addr2%/*} 
${rx_args} && \
+   echo "ok" || \
+   echo "failed"&
+
+   sleep 0.1
+   ./udpgso_bench_tx ${tx_args}
+   kill -INT $pid
+   wait $(jobs -p)
+}
+
+run_nat_test() {
+   local -r args=$@
+
+   printf " %-40s" "$1"
+   ./in_netns.sh $0 __subprocess_nat $2 rx -r $3
+}
+
+run_all() {
+   local -r core_args="-l 4"
+   local -r ipv4_args="${core_args} -4 -D 192.168.1.1"

[RFC PATCH v2 06/10] udp: cope with UDP GRO packet misdirection

2018-10-19 Thread Paolo Abeni
In some scenarios, the GRO engine can assemble a UDP GRO packet
that ultimately lands on a non GRO-enabled socket.
This patch addresses the issue by explicitly checking the UDP
socket features before enqueuing the packet, and eventually segmenting
the unexpected GRO packet, as needed.

We must also cope with re-insertion requests: after segmentation the
UDP code calls the helper introduced by the previous patches, as needed.

Signed-off-by: Paolo Abeni 
---
 include/linux/udp.h | 23 +++
 net/ipv4/udp.c  | 25 -
 net/ipv6/udp.c  | 27 ++-
 3 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index e23d5024f42f..19bcb396cd1b 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -132,6 +132,29 @@ static inline void udp_cmsg_recv(struct msghdr *msg, 
struct sock *sk,
}
 }
 
+static inline bool udp_unexpected_gso(struct sock *sk, struct sk_buff *skb)
+{
+   return !udp_sk(sk)->gro_enabled && skb_is_gso(skb) &&
+  skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4;
+}
+
+static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
+ struct sk_buff *skb)
+{
+   struct sk_buff *segs;
+
+   /* the GSO CB lays after the UDP one, no need to save and restore any
+* CB fragment, just initialize it
+*/
+   segs = __skb_gso_segment(skb, NETIF_F_SG, false);
+   if (unlikely(IS_ERR(segs)))
+   kfree_skb(skb);
+   else if (segs)
+   consume_skb(skb);
+   return segs;
+}
+
+
 #define udp_portaddr_for_each_entry(__sk, list) \
hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 2331ac9de954..0d55145ce9f5 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1909,7 +1909,7 @@ EXPORT_SYMBOL(udp_encap_enable);
  * Note that in the success and error cases, the skb is assumed to
  * have either been requeued or freed.
  */
-static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+static int udp_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
 {
struct udp_sock *up = udp_sk(sk);
int is_udplite = IS_UDPLITE(sk);
@@ -2012,6 +2012,29 @@ static int udp_queue_rcv_skb(struct sock *sk, struct 
sk_buff *skb)
return -1;
 }
 
+void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int proto);
+
+static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+{
+   struct sk_buff *next, *segs;
+   int ret;
+
+   if (likely(!udp_unexpected_gso(sk, skb)))
+   return udp_queue_rcv_one_skb(sk, skb);
+
+   BUILD_BUG_ON(sizeof(struct udp_skb_cb) > SKB_SGO_CB_OFFSET);
+   __skb_push(skb, -skb_mac_offset(skb));
+   segs = udp_rcv_segment(sk, skb);
+   for (skb = segs; skb; skb = next) {
+   next = skb->next;
+   __skb_pull(skb, skb_transport_offset(skb));
+   ret = udp_queue_rcv_one_skb(sk, skb);
+   if (ret > 0)
+   ip_protocol_deliver_rcu(dev_net(skb->dev), skb, -ret);
+   }
+   return 0;
+}
+
 /* For TCP sockets, sk_rx_dst is protected by socket lock
  * For UDP, we use xchg() to guard against concurrent changes.
  */
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 05f723b98cab..d892c064657c 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -558,7 +558,7 @@ void udpv6_encap_enable(void)
 }
 EXPORT_SYMBOL(udpv6_encap_enable);
 
-static int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+static int udpv6_queue_rcv_one_skb(struct sock *sk, struct sk_buff *skb)
 {
struct udp_sock *up = udp_sk(sk);
int is_udplite = IS_UDPLITE(sk);
@@ -641,6 +641,31 @@ static int udpv6_queue_rcv_skb(struct sock *sk, struct 
sk_buff *skb)
return -1;
 }
 
+void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
nexthdr,
+ bool have_final);
+
+static int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+{
+   struct sk_buff *next, *segs;
+   int ret;
+
+   if (likely(!udp_unexpected_gso(sk, skb)))
+   return udpv6_queue_rcv_one_skb(sk, skb);
+
+   __skb_push(skb, -skb_mac_offset(skb));
+   segs = udp_rcv_segment(sk, skb);
+   for (skb = segs; skb; skb = next) {
+   next = skb->next;
+   __skb_pull(skb, skb_transport_offset(skb));
+
+   ret = udpv6_queue_rcv_one_skb(sk, skb);
+   if (ret > 0)
+   ip6_protocol_deliver_rcu(dev_net(skb->dev), skb, ret,
+true);
+   }
+   return 0;
+}
+
 static bool __udp_v6_is_mcast_sock(struct net *net, struct sock *sk,
   __be16 loc_port, const struct in6_addr 
*loc_addr,
   __

[RFC PATCH v2 03/10] udp: add support for UDP_GRO cmsg

2018-10-19 Thread Paolo Abeni
When UDP GRO is enabled, the UDP_GRO cmsg will carry the ingress
datagram size. User-space can use such info to compute the original
packets layout.

Signed-off-by: Paolo Abeni 
---
CHECK: should we use a separate setsockopt to explicitly enable
gso_size cmsg reception? So that user space can enable UDP_GRO and
fetch cmsg without forcefully receiving GRO related info.
---
 include/linux/udp.h | 11 +++
 net/ipv4/udp.c  |  4 
 net/ipv6/udp.c  |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index f613b329852e..e23d5024f42f 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -121,6 +121,17 @@ static inline bool udp_get_no_check6_rx(struct sock *sk)
return udp_sk(sk)->no_check6_rx;
 }
 
+static inline void udp_cmsg_recv(struct msghdr *msg, struct sock *sk,
+struct sk_buff *skb)
+{
+   int gso_size;
+
+   if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
+   gso_size = skb_shinfo(skb)->gso_size;
+   put_cmsg(msg, SOL_UDP, UDP_GRO, sizeof(gso_size), &gso_size);
+   }
+}
+
 #define udp_portaddr_for_each_entry(__sk, list) \
hlist_for_each_entry(__sk, list, __sk_common.skc_portaddr_node)
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3c277378814f..2331ac9de954 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1714,6 +1714,10 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int noblock,
memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
*addr_len = sizeof(*sin);
}
+
+   if (udp_sk(sk)->gro_enabled)
+   udp_cmsg_recv(msg, sk, skb);
+
if (inet->cmsg_flags)
ip_cmsg_recv_offset(msg, sk, skb, sizeof(struct udphdr), off);
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 8bb50ba32a6f..05f723b98cab 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -421,6 +421,9 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len,
*addr_len = sizeof(*sin6);
}
 
+   if (udp_sk(sk)->gro_enabled)
+   udp_cmsg_recv(msg, sk, skb);
+
if (np->rxopt.all)
ip6_datagram_recv_common_ctl(sk, msg, skb);
 
-- 
2.17.2



[RFC PATCH v2 02/10] udp: implement GRO for plain UDP sockets.

2018-10-19 Thread Paolo Abeni
This is the RX counterpart of commit bec1f6f69736 ("udp: generate gso
with UDP_SEGMENT"). When UDP_GRO is enabled, such socket is also
eligible for GRO in the rx path: UDP segments directed to such socket
are assembled into a larger GSO_UDP_L4 packet.

The core UDP GRO support is enabled with setsockopt(UDP_GRO).

Initial benchmark numbers:

Before:
udp rx:   1079 MB/s   769065 calls/s

After:
udp rx:   1466 MB/s    24877 calls/s

This change introduces a side effect with respect to UDP tunnels:
after a UDP tunnel creation, now the kernel performs a lookup per ingress
UDP packet, while before such lookup happened only if the ingress packet
carried a valid internal header csum.

v1 -> v2:
 - use a new option to enable UDP GRO
 - use static keys to protect the UDP GRO socket lookup

Signed-off-by: Paolo Abeni 
---
 include/linux/udp.h  |   3 +-
 include/uapi/linux/udp.h |   1 +
 net/ipv4/udp.c   |   7 +++
 net/ipv4/udp_offload.c   | 109 +++
 net/ipv6/udp_offload.c   |   6 +--
 5 files changed, 98 insertions(+), 28 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index a4dafff407fb..f613b329852e 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -50,11 +50,12 @@ struct udp_sock {
__u8 encap_type;/* Is this an Encapsulation socket? */
unsigned charno_check6_tx:1,/* Send zero UDP6 checksums on TX? */
 no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
-encap_enabled:1; /* This socket enabled encap
+encap_enabled:1, /* This socket enabled encap
   * processing; UDP tunnels and
   * different encapsulation layer set
   * this
   */
+gro_enabled:1; /* Can accept GRO packets */
/*
 * Following member retains the information to create a UDP header
 * when the socket is uncorked.
diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
index 09502de447f5..30baccb6c9c4 100644
--- a/include/uapi/linux/udp.h
+++ b/include/uapi/linux/udp.h
@@ -33,6 +33,7 @@ struct udphdr {
 #define UDP_NO_CHECK6_TX 101   /* Disable sending checksum for UDP6X */
 #define UDP_NO_CHECK6_RX 102   /* Disable accpeting checksum for UDP6 */
 #define UDP_SEGMENT	103	/* Set GSO segmentation size */
+#define UDP_GRO	104	/* This socket can receive UDP GRO packets */
 
 /* UDP encapsulation types */
 #define UDP_ENCAP_ESPINUDP_NON_IKE 1 /* draft-ietf-ipsec-nat-t-ike-00/01 */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 9fcb5374e166..3c277378814f 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -115,6 +115,7 @@
 #include "udp_impl.h"
 #include 
 #include 
+#include 
 
 struct udp_table udp_table __read_mostly;
 EXPORT_SYMBOL(udp_table);
@@ -2459,6 +2460,12 @@ int udp_lib_setsockopt(struct sock *sk, int level, int 
optname,
up->gso_size = val;
break;
 
+   case UDP_GRO:
+   if (valbool)
+   udp_tunnel_encap_enable(sk->sk_socket);
+   up->gro_enabled = valbool;
+   break;
+
/*
 *  UDP-Lite's partial checksum coverage (RFC 3828).
 */
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 802f2bc00d69..d93c1e8097ba 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -343,6 +343,54 @@ static struct sk_buff *udp4_ufo_fragment(struct sk_buff 
*skb,
return segs;
 }
 
+#define UDP_GRO_CNT_MAX 64
+static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
+  struct sk_buff *skb)
+{
+   struct udphdr *uh = udp_hdr(skb);
+   struct sk_buff *pp = NULL;
+   struct udphdr *uh2;
+   struct sk_buff *p;
+
+   /* requires non zero csum, for symmetry with GSO */
+   if (!uh->check) {
+   NAPI_GRO_CB(skb)->flush = 1;
+   return NULL;
+   }
+
+   /* pull encapsulating udp header */
+   skb_gro_pull(skb, sizeof(struct udphdr));
+   skb_gro_postpull_rcsum(skb, uh, sizeof(struct udphdr));
+
+   list_for_each_entry(p, head, list) {
+   if (!NAPI_GRO_CB(p)->same_flow)
+   continue;
+
+   uh2 = udp_hdr(p);
+
+   /* Match ports only, as csum is always non zero */
+   if ((*(u32 *)&uh2->source != *(u32 *)&uh->source)) {
+   NAPI_GRO_CB(p)->same_flow = 0;
+   continue;
+   }
+
+   /* Terminate the flow on len mismatch or if it grows "too much".
+* Under small packet flood GRO count could elsewhere grow a lot
+* leading to excessive truesize values
+  

[RFC PATCH v2 05/10] ipv6: factor out protocol delivery helper

2018-10-19 Thread Paolo Abeni
So that we can re-use it at the UDP level in the next patch.

Signed-off-by: Paolo Abeni 
---
 net/ipv6/ip6_input.c | 28 
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index 96577e742afd..3065226bdc57 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -319,28 +319,26 @@ void ipv6_list_rcv(struct list_head *head, struct 
packet_type *pt,
 /*
  * Deliver the packet to the host
  */
-
-
-static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff 
*skb)
+void ip6_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
nexthdr,
+ bool have_final)
 {
const struct inet6_protocol *ipprot;
struct inet6_dev *idev;
unsigned int nhoff;
-   int nexthdr;
bool raw;
-   bool have_final = false;
 
/*
 *  Parse extension headers
 */
 
-   rcu_read_lock();
 resubmit:
idev = ip6_dst_idev(skb_dst(skb));
-   if (!pskb_pull(skb, skb_transport_offset(skb)))
-   goto discard;
nhoff = IP6CB(skb)->nhoff;
-   nexthdr = skb_network_header(skb)[nhoff];
+   if (!have_final) {
+   if (!pskb_pull(skb, skb_transport_offset(skb)))
+   goto discard;
+   nexthdr = skb_network_header(skb)[nhoff];
+   }
 
 resubmit_final:
raw = raw6_local_deliver(skb, nexthdr);
@@ -411,13 +409,19 @@ static int ip6_input_finish(struct net *net, struct sock 
*sk, struct sk_buff *sk
consume_skb(skb);
}
}
-   rcu_read_unlock();
-   return 0;
+   return;
 
 discard:
__IP6_INC_STATS(net, idev, IPSTATS_MIB_INDISCARDS);
-   rcu_read_unlock();
kfree_skb(skb);
+}
+
+static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff 
*skb)
+{
+   rcu_read_lock();
+   ip6_protocol_deliver_rcu(net, skb, 0, false);
+   rcu_read_unlock();
+
return 0;
 }
 
-- 
2.17.2



[RFC PATCH v2 00/10] udp: implement GRO support

2018-10-19 Thread Paolo Abeni
This series implements GRO support for UDP sockets, as the RX counterpart
of commit bec1f6f69736 ("udp: generate gso with UDP_SEGMENT").
The core functionality is implemented by the second patch, introducing a new
sockopt to enable UDP_GRO, while patch 3 implements support for passing the
segment size to the user space via a new cmsg.
UDP GRO performs a socket lookup for each ingress packet and aggregates
datagrams directed to UDP GRO enabled sockets with a constant L4 tuple.

UDP GRO packets can land on non GRO-enabled sockets, e.g. due to iptables NAT
rules, and that could potentially confuse existing applications.

The solution adopted here is to de-segment the GRO packet before enqueuing
as needed. Since we must cope with packet reinsertion after de-segmentation,
the relevant code is factored out into ipv4- and ipv6-specific helpers and exposed
to UDP usage.

While the current code can probably be improved, this safeguard, implemented in
patches 4-7, allows future enhancements to enable UDP GSO offload on more
virtual devices, eventually even on forwarded packets.

The last 4 patches implement performance and functional self-tests,
re-using the existing udpgso infrastructure. The problematic scenario described
above is explicitly tested.

v1 -> v2:
 - use a new option to enable UDP GRO
 - use static keys to protect the UDP GRO socket lookup
 - cope with UDP GRO misdirection
 - add self-tests

Paolo Abeni (10):
  udp: implement complete book-keeping for encap_needed
  udp: implement GRO for plain UDP sockets.
  udp: add support for UDP_GRO cmsg
  ip: factor out protocol delivery helper
  ipv6: factor out protocol delivery helper
  udp: cope with UDP GRO packet misdirection
  selftests: add GRO support to udp bench rx program
  selftests: conditionally enable XDP support in udpgso_bench_rx
  selftests: add some benchmark for UDP GRO
  selftests: add functionals test for UDP GRO

 include/linux/udp.h   |  42 +++-
 include/net/udp_tunnel.h  |   6 +
 include/uapi/linux/udp.h  |   1 +
 net/ipv4/ip_input.c   |  73 ---
 net/ipv4/udp.c|  54 -
 net/ipv4/udp_offload.c| 109 --
 net/ipv6/ip6_input.c  |  28 +--
 net/ipv6/udp.c|  44 +++-
 net/ipv6/udp_offload.c|   6 +-
 tools/testing/selftests/net/Makefile  |  70 +++
 tools/testing/selftests/net/udpgro.sh | 144 +
 tools/testing/selftests/net/udpgro_bench.sh   |  94 +
 tools/testing/selftests/net/udpgso_bench.sh   |   2 +-
 tools/testing/selftests/net/udpgso_bench_rx.c | 195 --
 tools/testing/selftests/net/udpgso_bench_tx.c |  22 +-
 tools/testing/selftests/net/xdp_dummy.c   |  13 ++
 16 files changed, 790 insertions(+), 113 deletions(-)
 create mode 100755 tools/testing/selftests/net/udpgro.sh
 create mode 100755 tools/testing/selftests/net/udpgro_bench.sh
 create mode 100644 tools/testing/selftests/net/xdp_dummy.c

-- 
2.17.2



[RFC PATCH v2 01/10] udp: implement complete book-keeping for encap_needed

2018-10-19 Thread Paolo Abeni
The *encap_needed static keys are enabled by UDP tunnels
and several UDP encapsulation types, but they are never
turned off. This can cause unneeded overall performance
degradation for systems where such features are used
transiently.

This patch introduces complete book-keeping for such keys,
decreasing the usage at socket destruction time, if needed,
and preventing the same socket from increasing the key
usage multiple times.

Signed-off-by: Paolo Abeni 
---
 include/linux/udp.h  |  7 ++-
 include/net/udp_tunnel.h |  6 ++
 net/ipv4/udp.c   | 18 --
 net/ipv6/udp.c   | 14 +-
 4 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 320d49d85484..a4dafff407fb 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -49,7 +49,12 @@ struct udp_sock {
unsigned int corkflag;  /* Cork is required */
__u8 encap_type;/* Is this an Encapsulation socket? */
unsigned charno_check6_tx:1,/* Send zero UDP6 checksums on TX? */
-no_check6_rx:1;/* Allow zero UDP6 checksums on RX? */
+no_check6_rx:1,/* Allow zero UDP6 checksums on RX? */
+encap_enabled:1; /* This socket enabled encap
+  * processing; UDP tunnels and
+  * different encapsulation layer set
+  * this
+  */
/*
 * Following member retains the information to create a UDP header
 * when the socket is uncorked.
diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index fe680ab6b15a..3fbe56430e3b 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -165,6 +165,12 @@ static inline int udp_tunnel_handle_offloads(struct 
sk_buff *skb, bool udp_csum)
 
 static inline void udp_tunnel_encap_enable(struct socket *sock)
 {
+   struct udp_sock *up = udp_sk(sock->sk);
+
+   if (up->encap_enabled)
+   return;
+
+   up->encap_enabled = 1;
 #if IS_ENABLED(CONFIG_IPV6)
if (sock->sk->sk_family == PF_INET6)
ipv6_stub->udpv6_encap_enable();
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index cf8252d05a01..9fcb5374e166 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2382,11 +2382,15 @@ void udp_destroy_sock(struct sock *sk)
bool slow = lock_sock_fast(sk);
udp_flush_pending_frames(sk);
unlock_sock_fast(sk, slow);
-   if (static_branch_unlikely(&udp_encap_needed_key) && up->encap_type) {
-   void (*encap_destroy)(struct sock *sk);
-   encap_destroy = READ_ONCE(up->encap_destroy);
-   if (encap_destroy)
-   encap_destroy(sk);
+   if (static_branch_unlikely(&udp_encap_needed_key)) {
+   if (up->encap_type) {
+   void (*encap_destroy)(struct sock *sk);
+   encap_destroy = READ_ONCE(up->encap_destroy);
+   if (encap_destroy)
+   encap_destroy(sk);
+   }
+   if (up->encap_enabled)
+   static_branch_disable(&udp_encap_needed_key);
}
 }
 
@@ -2431,7 +2435,9 @@ int udp_lib_setsockopt(struct sock *sk, int level, int 
optname,
/* FALLTHROUGH */
case UDP_ENCAP_L2TPINUDP:
up->encap_type = val;
-   udp_encap_enable();
+   if (!up->encap_enabled)
+   udp_encap_enable();
+   up->encap_enabled = 1;
break;
default:
err = -ENOPROTOOPT;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 374e7d302f26..8bb50ba32a6f 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1460,11 +1460,15 @@ void udpv6_destroy_sock(struct sock *sk)
udp_v6_flush_pending_frames(sk);
release_sock(sk);
 
-   if (static_branch_unlikely(&udpv6_encap_needed_key) && up->encap_type) {
-   void (*encap_destroy)(struct sock *sk);
-   encap_destroy = READ_ONCE(up->encap_destroy);
-   if (encap_destroy)
-   encap_destroy(sk);
+   if (static_branch_unlikely(&udpv6_encap_needed_key)) {
+   if (up->encap_type) {
+   void (*encap_destroy)(struct sock *sk);
+   encap_destroy = READ_ONCE(up->encap_destroy);
+   if (encap_destroy)
+   encap_destroy(sk);
+   }
+   if (up->encap_enabled)
+   static_branch_disable(&udpv6_encap_needed_key);
}
 
inet6_destroy_sock(sk);
-- 
2.17.2



[RFC PATCH v2 04/10] ip: factor out protocol delivery helper

2018-10-19 Thread Paolo Abeni
So that we can re-use it at the UDP level in a later patch.

Signed-off-by: Paolo Abeni 
---
 net/ipv4/ip_input.c | 73 ++---
 1 file changed, 36 insertions(+), 37 deletions(-)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 35a786c0aaa0..72250b4e466d 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -188,51 +188,50 @@ bool ip_call_ra_chain(struct sk_buff *skb)
return false;
 }
 
-static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct 
sk_buff *skb)
+void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int 
protocol)
 {
-   __skb_pull(skb, skb_network_header_len(skb));
-
-   rcu_read_lock();
-   {
-   int protocol = ip_hdr(skb)->protocol;
-   const struct net_protocol *ipprot;
-   int raw;
+   const struct net_protocol *ipprot;
+   int raw, ret;
 
-   resubmit:
-   raw = raw_local_deliver(skb, protocol);
+resubmit:
+   raw = raw_local_deliver(skb, protocol);
 
-   ipprot = rcu_dereference(inet_protos[protocol]);
-   if (ipprot) {
-   int ret;
-
-   if (!ipprot->no_policy) {
-   if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, 
skb)) {
-   kfree_skb(skb);
-   goto out;
-   }
-   nf_reset(skb);
+   ipprot = rcu_dereference(inet_protos[protocol]);
+   if (ipprot) {
+   if (!ipprot->no_policy) {
+   if (!xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
+   kfree_skb(skb);
+   return;
}
-   ret = ipprot->handler(skb);
-   if (ret < 0) {
-   protocol = -ret;
-   goto resubmit;
+   nf_reset(skb);
+   }
+   ret = ipprot->handler(skb);
+   if (ret < 0) {
+   protocol = -ret;
+   goto resubmit;
+   }
+   __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+   } else {
+   if (!raw) {
+   if (xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
+   __IP_INC_STATS(net, 
IPSTATS_MIB_INUNKNOWNPROTOS);
+   icmp_send(skb, ICMP_DEST_UNREACH,
+ ICMP_PROT_UNREACH, 0);
}
-   __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+   kfree_skb(skb);
} else {
-   if (!raw) {
-   if (xfrm4_policy_check(NULL, XFRM_POLICY_IN, skb)) {
-   __IP_INC_STATS(net, IPSTATS_MIB_INUNKNOWNPROTOS);
-   icmp_send(skb, ICMP_DEST_UNREACH,
- ICMP_PROT_UNREACH, 0);
-   }
-   kfree_skb(skb);
-   } else {
-   __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
-   consume_skb(skb);
-   }
+   __IP_INC_STATS(net, IPSTATS_MIB_INDELIVERS);
+   consume_skb(skb);
}
}
- out:
+}
+
+static int ip_local_deliver_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
+{
+   __skb_pull(skb, skb_network_header_len(skb));
+
+   rcu_read_lock();
+   ip_protocol_deliver_rcu(net, skb, ip_hdr(skb)->protocol);
rcu_read_unlock();
 
return 0;
-- 
2.17.2



[PATCH net] udp6: fix encap return code for resubmitting

2018-10-17 Thread Paolo Abeni
The commit eb63f2964dbe ("udp6: add missing checks on edumux packet
processing") used the same return code convention of the ipv4 counterpart,
but ipv6 uses the opposite one: positive values means resubmit.

This change addresses the issue, using positive return value for
resubmitting. Also update the related comment, which was broken, too.

Fixes: eb63f2964dbe ("udp6: add missing checks on edumux packet processing")
Signed-off-by: Paolo Abeni 
---
Note: I could not find any in-kernel udp6 encap using the above
feature; that would explain why nobody has complained so far...
---
 net/ipv6/udp.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 28c4aa5078fc..b36694b6716e 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -766,11 +766,9 @@ static int udp6_unicast_rcv_skb(struct sock *sk, struct sk_buff *skb,
 
ret = udpv6_queue_rcv_skb(sk, skb);
 
-   /* a return value > 0 means to resubmit the input, but
-* it wants the return to be -protocol, or 0
-*/
+   /* a return value > 0 means to resubmit the input */
if (ret > 0)
-   return -ret;
+   return ret;
return 0;
 }
 
-- 
2.17.2



[PATCH net-next] selftests: use posix-style redirection in ip_defrag.sh

2018-10-11 Thread Paolo Abeni
The ip_defrag.sh script requires bash-style output redirection but
uses the default shell. This may cause random failures if the default
shell is not bash.
Address the above by using POSIX-compliant output redirection.

Fixes: 02c7f38b7ace ("selftests/net: add ip_defrag selftest")
Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/ip_defrag.sh | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/net/ip_defrag.sh b/tools/testing/selftests/net/ip_defrag.sh
index 3a4042d918f0..f34672796044 100755
--- a/tools/testing/selftests/net/ip_defrag.sh
+++ b/tools/testing/selftests/net/ip_defrag.sh
@@ -11,10 +11,10 @@ readonly NETNS="ns-$(mktemp -u XX)"
 setup() {
ip netns add "${NETNS}"
ip -netns "${NETNS}" link set lo up
-   ip netns exec "${NETNS}" sysctl -w net.ipv4.ipfrag_high_thresh=900 &> /dev/null
-   ip netns exec "${NETNS}" sysctl -w net.ipv4.ipfrag_low_thresh=700 &> /dev/null
-   ip netns exec "${NETNS}" sysctl -w net.ipv6.ip6frag_high_thresh=900 &> /dev/null
-   ip netns exec "${NETNS}" sysctl -w net.ipv6.ip6frag_low_thresh=700 &> /dev/null
+   ip netns exec "${NETNS}" sysctl -w net.ipv4.ipfrag_high_thresh=900 >/dev/null 2>&1
+   ip netns exec "${NETNS}" sysctl -w net.ipv4.ipfrag_low_thresh=700 >/dev/null 2>&1
+   ip netns exec "${NETNS}" sysctl -w net.ipv6.ip6frag_high_thresh=900 >/dev/null 2>&1
+   ip netns exec "${NETNS}" sysctl -w net.ipv6.ip6frag_low_thresh=700 >/dev/null 2>&1
 }
 
 cleanup() {
-- 
2.17.2



Re: [PATCH net 2/2] selftest: udpgso_bench.sh explicitly requires bash

2018-10-11 Thread Paolo Abeni
On Thu, 2018-10-11 at 10:54 +0200, Paolo Abeni wrote:
> The udpgso_bench.sh script requires several bash-only features. This
> may cause random failures if the default shell is not bash.
> Address the above explicitly requiring bash as the script interpreter
> Fixes: 3a687bef148d ("selftests: udp gso benchmark")
> Signed-off-by: Paolo Abeni 

Oops... fat fingers! this is a duplicate of: 

https://patchwork.ozlabs.org/patch/982350/

David, can you please drop this one from patchwork?

Thanks!

Paolo



[PATCH net 0/2] net: explicitly requires bash when needed.

2018-10-11 Thread Paolo Abeni
Some test scripts require bash-only features but use the default shell.
This may cause random failures if the default shell is not bash.
Instead of doing a potentially complex rewrite of such scripts, these patches 
require the bash interpreter, where needed.

Paolo Abeni (2):
  selftests: rtnetlink.sh explicitly requires bash.
  selftests: udpgso_bench.sh explicitly requires bash

 tools/testing/selftests/net/rtnetlink.sh| 2 +-
 tools/testing/selftests/net/udpgso_bench.sh | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

-- 
2.17.2



[PATCH net 2/2] selftests: udpgso_bench.sh explicitly requires bash

2018-10-11 Thread Paolo Abeni
The udpgso_bench.sh script requires several bash-only features. This
may cause random failures if the default shell is not bash.
Address the above by explicitly requiring bash as the script interpreter.

Fixes: 3a687bef148d ("selftests: udp gso benchmark")
Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/udpgso_bench.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/udpgso_bench.sh b/tools/testing/selftests/net/udpgso_bench.sh
index 850767befa47..99e537ab5ad9 100755
--- a/tools/testing/selftests/net/udpgso_bench.sh
+++ b/tools/testing/selftests/net/udpgso_bench.sh
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/bash
 # SPDX-License-Identifier: GPL-2.0
 #
 # Run a series of udpgso benchmarks
-- 
2.17.2



[PATCH net 2/2] selftest: udpgso_bench.sh explicitly requires bash

2018-10-11 Thread Paolo Abeni
The udpgso_bench.sh script requires several bash-only features. This
may cause random failures if the default shell is not bash.
Address the above by explicitly requiring bash as the script interpreter.

Fixes: 3a687bef148d ("selftests: udp gso benchmark")
Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/udpgso_bench.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/udpgso_bench.sh b/tools/testing/selftests/net/udpgso_bench.sh
index 850767befa47..99e537ab5ad9 100755
--- a/tools/testing/selftests/net/udpgso_bench.sh
+++ b/tools/testing/selftests/net/udpgso_bench.sh
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/bash
 # SPDX-License-Identifier: GPL-2.0
 #
 # Run a series of udpgso benchmarks
-- 
2.17.2



[PATCH net 1/2] selftests: rtnetlink.sh explicitly requires bash.

2018-10-11 Thread Paolo Abeni
The rtnetlink.sh script requires a bash-only feature (sleep with sub-second
precision). This may cause random test failures if the default shell is not
bash.
Address the above by explicitly requiring bash as the script interpreter.

Fixes: 33b01b7b4f19 ("selftests: add rtnetlink test script")
Signed-off-by: Paolo Abeni 
---
 tools/testing/selftests/net/rtnetlink.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 08c341b49760..e101af52d1d6 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/bash
 #
 # This test is for checking rtnetlink callpaths, and get as much coverage as 
possible.
 #
-- 
2.17.2



Re: [PATCH net-next RFC 0/8] udp and configurable gro

2018-10-05 Thread Paolo Abeni
On Fri, 2018-10-05 at 11:45 -0400, Willem de Bruijn wrote:
> On Fri, Oct 5, 2018 at 11:30 AM Paolo Abeni  wrote:
> > Would love that. We need to care of key decr, too (and possibly don't
> > be fooled by encap_rcv() users).
> 
> I just sent  http://patchwork.ozlabs.org/patch/979525/
> 
> Right now all users are those that call setup_udp_tunnel_sock
> to register encap_rcv.

plus setsockopt(UDP_ENCAP)

> If accepted, I'll add a separate patch to decrement the key. That's
> probably in udp_tunnel_sock_release, but I need to take a closer
> look.

l2tp calls setup_udp_tunnel_sock() but doesn't use
udp_tunnel_sock_release(). Possibly it would be safer to check for:

up->encap_type || udp_sk(sk)->gro_receive

in udp_destroy_sock()

Cheers,

Paolo



Re: [PATCH net-next] udp: gro behind static key

2018-10-05 Thread Paolo Abeni
On Fri, 2018-10-05 at 11:31 -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> Avoid the socket lookup cost in udp_gro_receive if no socket has a
> udp tunnel callback configured.
> 
> udp_sk(sk)->gro_receive requires a registration with
> setup_udp_tunnel_sock, which enables the static key.
> 
> Signed-off-by: Willem de Bruijn 
> ---
>  include/net/udp.h  | 2 ++
>  net/ipv4/udp.c | 2 +-
>  net/ipv4/udp_offload.c | 2 +-
>  net/ipv6/udp.c | 2 +-
>  net/ipv6/udp_offload.c | 2 +-
>  5 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/include/net/udp.h b/include/net/udp.h
> index 8482a990b0bb..9e82cb391dea 100644
> --- a/include/net/udp.h
> +++ b/include/net/udp.h
> @@ -443,8 +443,10 @@ int udpv4_offload_init(void);
>  
>  void udp_init(void);
>  
> +DECLARE_STATIC_KEY_FALSE(udp_encap_needed_key);
>  void udp_encap_enable(void);
>  #if IS_ENABLED(CONFIG_IPV6)
> +DECLARE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
>  void udpv6_encap_enable(void);
>  #endif
>  
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 5fc4beb1c336..1bec2203d558 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1889,7 +1889,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>   return 0;
>  }
>  
> -static DEFINE_STATIC_KEY_FALSE(udp_encap_needed_key);
> +DEFINE_STATIC_KEY_FALSE(udp_encap_needed_key);
>  void udp_encap_enable(void)
>  {
>   static_branch_enable(&udp_encap_needed_key);
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index 0c0522b79b43..802f2bc00d69 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -405,7 +405,7 @@ static struct sk_buff *udp4_gro_receive(struct list_head *head,
>  {
>   struct udphdr *uh = udp_gro_udphdr(skb);
>  
> - if (unlikely(!uh))
> + if (unlikely(!uh) || !static_branch_unlikely(&udp_encap_needed_key))
>   goto flush;
>  
>   /* Don't bother verifying checksum if we're going to flush anyway. */
> diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> index 28c4aa5078fc..374e7d302f26 100644
> --- a/net/ipv6/udp.c
> +++ b/net/ipv6/udp.c
> @@ -548,7 +548,7 @@ static __inline__ void udpv6_err(struct sk_buff *skb,
>   __udp6_lib_err(skb, opt, type, code, offset, info, &udp_table);
>  }
>  
> -static DEFINE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
> +DEFINE_STATIC_KEY_FALSE(udpv6_encap_needed_key);
>  void udpv6_encap_enable(void)
>  {
>   static_branch_enable(&udpv6_encap_needed_key);
> diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
> index 95dee9ca8d22..1b8e161ac527 100644
> --- a/net/ipv6/udp_offload.c
> +++ b/net/ipv6/udp_offload.c
> @@ -119,7 +119,7 @@ static struct sk_buff *udp6_gro_receive(struct list_head *head,
>  {
>   struct udphdr *uh = udp_gro_udphdr(skb);
>  
> - if (unlikely(!uh))
> + if (unlikely(!uh) || !static_branch_unlikely(&udpv6_encap_needed_key))
>   goto flush;
>  
>   /* Don't bother verifying checksum if we're going to flush anyway. */

Acked-by: Paolo Abeni 



Re: [PATCH net-next RFC 0/8] udp and configurable gro

2018-10-05 Thread Paolo Abeni
Hi,

Thank you for the prompt reply!

On Fri, 2018-10-05 at 10:41 -0400, Willem de Bruijn wrote:
> On Fri, Oct 5, 2018 at 9:53 AM Paolo Abeni  wrote:
> > 
> > Hi all,
> > 
> > On Fri, 2018-09-14 at 13:59 -0400, Willem de Bruijn wrote:
> > > This is a *very rough* draft. Mainly for discussion while we also
> > > look at another partially overlapping approach [1].
> > 
> > I'm wondering how we go on from this? I'm fine with either approach.
> 
> Let me send the udp gro static_key patch.

Would love that. We need to care of key decr, too (and possibly don't
be fooled by encap_rcv() users).

> Then we don't need the enable udp on demand logic (patch 2/4).

ok.

> Your implementation of GRO is more fleshed out (patch 3/4) than
> my quick hack. My only request would be to use a separate
> UDP_GRO socket option instead of adding this to the existing
> UDP_SEGMENT.
> 
> Sounds good?

Indeed!
I need also to add a cmsg to expose to the user the skb gro_size, and
some test cases. Locally I'm [ab-]using the GRO functionality
introduced recently on veth to test the code in a namespace pair
(attaching a dummy XDP program to the RX-side veth). I'm not sure if
that could fit a selftest.

> > Also, I'm interested in [try to] enable GRO/GSO batching in the
> > forwarding path, as you outlined initially in the GSO series
> > submission. That should cover Steffen use-case, too, right?
> 
> Great. Indeed. Though there is some unresolved discussion on
> one large gso skb vs frag list. There has been various concerns
> around the use of frag lists for GSO in the past, and it does not
> match h/w offload. So I think the answer would be the first unless
> the second proves considerably faster (in which case it could also
> be added later as optimization).

Agreed.

Let's try the first step first ;)

Final but relevant note: I'll try my best to avoid delaying this, but
lately I tend to be pre-empted by other tasks; it's difficult for me to
assure a deadline here.

Cheers,

Paolo



Re: [PATCH net-next RFC 0/8] udp and configurable gro

2018-10-05 Thread Paolo Abeni
Hi all,

On Fri, 2018-09-14 at 13:59 -0400, Willem de Bruijn wrote:
> This is a *very rough* draft. Mainly for discussion while we also
> look at another partially overlapping approach [1].

I'm wondering how we go on from this? I'm fine with either approach.

Also, I'm interested in [try to] enable GRO/GSO batching in the
forwarding path, as you outlined initially in the GSO series
submission. That should cover Steffen use-case, too, right? 

Cheers,

Paolo



[PATCH net-next] net: drop unused skb_append_datato_frags()

2018-10-02 Thread Paolo Abeni
This helper is unused since commit 988cf74deb45 ("inet:
Stop generating UFO packets.")

Signed-off-by: Paolo Abeni 
---
 include/linux/skbuff.h |  5 
 net/core/skbuff.c  | 58 --
 2 files changed, 63 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 87e29710373f..119d092c6b13 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1082,11 +1082,6 @@ static inline int skb_pad(struct sk_buff *skb, int pad)
 }
 #define dev_kfree_skb(a)   consume_skb(a)
 
-int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
-   int getfrag(void *from, char *to, int offset,
-   int len, int odd, struct sk_buff *skb),
-   void *from, int length);
-
 int skb_append_pagefrags(struct sk_buff *skb, struct page *page,
 int offset, size_t size);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b2c807f67aba..0e937d3d85b5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3381,64 +3381,6 @@ unsigned int skb_find_text(struct sk_buff *skb, unsigned int from,
 }
 EXPORT_SYMBOL(skb_find_text);
 
-/**
- * skb_append_datato_frags - append the user data to a skb
- * @sk: sock  structure
- * @skb: skb structure to be appended with user data.
- * @getfrag: call back function to be used for getting the user data
- * @from: pointer to user message iov
- * @length: length of the iov message
- *
- * Description: This procedure append the user data in the fragment part
- * of the skb if any page alloc fails user this procedure returns  -ENOMEM
- */
-int skb_append_datato_frags(struct sock *sk, struct sk_buff *skb,
-   int (*getfrag)(void *from, char *to, int offset,
-   int len, int odd, struct sk_buff *skb),
-   void *from, int length)
-{
-   int frg_cnt = skb_shinfo(skb)->nr_frags;
-   int copy;
-   int offset = 0;
-   int ret;
-   struct page_frag *pfrag = &sk->task_frag;
-
-   do {
-   /* Return error if we don't have space for new frag */
-   if (frg_cnt >= MAX_SKB_FRAGS)
-   return -EMSGSIZE;
-
-   if (!sk_page_frag_refill(sk, pfrag))
-   return -ENOMEM;
-
-   /* copy the user data to page */
-   copy = min_t(int, length, pfrag->size - pfrag->offset);
-
-   ret = getfrag(from, page_address(pfrag->page) + pfrag->offset,
- offset, copy, 0, skb);
-   if (ret < 0)
-   return -EFAULT;
-
-   /* copy was successful so update the size parameters */
-   skb_fill_page_desc(skb, frg_cnt, pfrag->page, pfrag->offset,
-  copy);
-   frg_cnt++;
-   pfrag->offset += copy;
-   get_page(pfrag->page);
-
-   skb->truesize += copy;
-   refcount_add(copy, &sk->sk_wmem_alloc);
-   skb->len += copy;
-   skb->data_len += copy;
-   offset += copy;
-   length -= copy;
-
-   } while (length > 0);
-
-   return 0;
-}
-EXPORT_SYMBOL(skb_append_datato_frags);
-
 int skb_append_pagefrags(struct sk_buff *skb, struct page *page,
 int offset, size_t size)
 {
-- 
2.17.1



[PATCH net] ip_tunnel: be careful when accessing the inner header

2018-09-24 Thread Paolo Abeni
Cong noted that we need the same checks introduced by commit 76c0ddd8c3a6
("ip6_tunnel: be careful when accessing the inner header")
even for ipv4 tunnels.

Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
Suggested-by: Cong Wang 
Signed-off-by: Paolo Abeni 
---
Note: the bug is probably present even before the commit mentioned in
the fixes tag, but porting this change to older kernels is likely worthless
---
 net/ipv4/ip_tunnel.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index c4f5602308ed..284a22154b4e 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -627,6 +627,7 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
const struct iphdr *tnl_params, u8 protocol)
 {
struct ip_tunnel *tunnel = netdev_priv(dev);
+   unsigned int inner_nhdr_len = 0;
const struct iphdr *inner_iph;
struct flowi4 fl4;
u8 tos, ttl;
@@ -636,6 +637,14 @@ void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
__be32 dst;
bool connected;
 
+   /* ensure we can access the inner net header, for several users below */
+   if (skb->protocol == htons(ETH_P_IP))
+   inner_nhdr_len = sizeof(struct iphdr);
+   else if (skb->protocol == htons(ETH_P_IPV6))
+   inner_nhdr_len = sizeof(struct ipv6hdr);
+   if (unlikely(!pskb_may_pull(skb, inner_nhdr_len)))
+   goto tx_error;
+
inner_iph = (const struct iphdr *)skb_inner_network_header(skb);
connected = (tunnel->parms.iph.daddr != 0);
 
-- 
2.17.1



Re: [PATCH net] ip6_tunnel: be careful when accessing the inner header

2018-09-24 Thread Paolo Abeni
On Fri, 2018-09-21 at 11:51 -0700, Cong Wang wrote:
> On Wed, Sep 19, 2018 at 6:04 AM Paolo Abeni  wrote:
> > diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
> > index 419960b0ba16..a0b6932c3afd 100644
> > --- a/net/ipv6/ip6_tunnel.c
> > +++ b/net/ipv6/ip6_tunnel.c
> > @@ -1234,7 +1234,7 @@ static inline int
> >  ip4ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev)
> >  {
> > struct ip6_tnl *t = netdev_priv(dev);
> > -   const struct iphdr  *iph = ip_hdr(skb);
> > +   const struct iphdr  *iph;
> > int encap_limit = -1;
> > struct flowi6 fl6;
> > __u8 dsfield;
> > @@ -1242,6 +1242,11 @@ ip4ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev)
> > u8 tproto;
> > int err;
> > 
> > +   /* ensure we can access the full inner ip header */
> > +   if (!pskb_may_pull(skb, sizeof(struct iphdr)))
> > +   return -1;
> > +
> > +   iph = ip_hdr(skb);
> 
> Hmm...
> 
> How do IPv4 tunnels ensure they have the right inner header to access?
> ip_tunnel_xmit() uses skb_inner_network_header() to access inner header
> which doesn't have any check either AFAIK.

You are right, I think we need similar checks for ip_tunnel_xmit(),
too.

I'll try to cook a patch.

Cheers,

Paolo



Re: [PATCH net-next 4/5] ipv6: do not drop vrf udp multicast packets

2018-09-20 Thread Paolo Abeni
Hi,

On Thu, 2018-09-20 at 09:58 +0100, Mike Manning wrote:
> diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> index 108f5f88ec98..fc60f297d95b 100644
> --- a/net/ipv6/ip6_input.c
> +++ b/net/ipv6/ip6_input.c
> @@ -325,9 +325,12 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
>  {
>   const struct inet6_protocol *ipprot;
>   struct inet6_dev *idev;
> + struct net_device *dev;
>   unsigned int nhoff;
> + int sdif = inet6_sdif(skb);
>   int nexthdr;
>   bool raw;
> + bool deliver;
>   bool have_final = false;

Please try instead to sort the variables in reverse x-mas tree order.
>  
>   /*
> @@ -371,9 +374,27 @@ static int ip6_input_finish(struct net *net, struct sock *sk, struct sk_buff *sk
>   skb_postpull_rcsum(skb, skb_network_header(skb),
>  skb_network_header_len(skb));
>   hdr = ipv6_hdr(skb);
> - if (ipv6_addr_is_multicast(&hdr->daddr) &&
> - !ipv6_chk_mcast_addr(skb->dev, &hdr->daddr,
> - &hdr->saddr) &&
> +
> + /* skb->dev passed may be master dev for vrfs. */
> + if (sdif) {
> + rcu_read_lock();

AFAICS, the rcu lock is already acquired at the beginning of
ip6_input_finish(), no need to acquire it here again.
> +dev = dev_get_by_index_rcu(dev_net(skb->dev), sdif);
> + if (!dev) {
> + rcu_read_unlock();
> + kfree_skb(skb);
> + return -ENODEV;
> + }
> + } else {
> + dev = skb->dev;

The above fragment of code is a recurring pattern in this series,
perhaps adding an helper for it would reduce code duplication ?

Cheers,

Paolo



[PATCH net] ip6_tunnel: be careful when accessing the inner header

2018-09-19 Thread Paolo Abeni
The ip6 tunnel xmit ndo assumes that the processed skb always
contains an ip[v6] header, but syzbot has found a way to send
frames that fall short of this assumption, leading to the following splat:

BUG: KMSAN: uninit-value in ip6ip6_tnl_xmit net/ipv6/ip6_tunnel.c:1307
[inline]
BUG: KMSAN: uninit-value in ip6_tnl_start_xmit+0x7d2/0x1ef0
net/ipv6/ip6_tunnel.c:1390
CPU: 0 PID: 4504 Comm: syz-executor558 Not tainted 4.16.0+ #87
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
  ip6ip6_tnl_xmit net/ipv6/ip6_tunnel.c:1307 [inline]
  ip6_tnl_start_xmit+0x7d2/0x1ef0 net/ipv6/ip6_tunnel.c:1390
  __netdev_start_xmit include/linux/netdevice.h:4066 [inline]
  netdev_start_xmit include/linux/netdevice.h:4075 [inline]
  xmit_one net/core/dev.c:3026 [inline]
  dev_hard_start_xmit+0x5f1/0xc70 net/core/dev.c:3042
  __dev_queue_xmit+0x27ee/0x3520 net/core/dev.c:3557
  dev_queue_xmit+0x4b/0x60 net/core/dev.c:3590
  packet_snd net/packet/af_packet.c:2944 [inline]
  packet_sendmsg+0x7c70/0x8a30 net/packet/af_packet.c:2969
  sock_sendmsg_nosec net/socket.c:630 [inline]
  sock_sendmsg net/socket.c:640 [inline]
  ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
  __sys_sendmmsg+0x42d/0x800 net/socket.c:2136
  SYSC_sendmmsg+0xc4/0x110 net/socket.c:2167
  SyS_sendmmsg+0x63/0x90 net/socket.c:2162
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x441819
RSP: 002b:7ffe58ee8268 EFLAGS: 0213 ORIG_RAX: 0133
RAX: ffda RBX: 0003 RCX: 00441819
RDX: 0002 RSI: 2100 RDI: 0003
RBP: 006cd018 R08:  R09: 
R10:  R11: 0213 R12: 00402510
R13: 004025a0 R14:  R15: 

Uninit was created at:
  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
  kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
  kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
  kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
  slab_post_alloc_hook mm/slab.h:445 [inline]
  slab_alloc_node mm/slub.c:2737 [inline]
  __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
  __kmalloc_reserve net/core/skbuff.c:138 [inline]
  __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
  alloc_skb include/linux/skbuff.h:984 [inline]
  alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234
  sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085
  packet_alloc_skb net/packet/af_packet.c:2803 [inline]
  packet_snd net/packet/af_packet.c:2894 [inline]
  packet_sendmsg+0x6454/0x8a30 net/packet/af_packet.c:2969
  sock_sendmsg_nosec net/socket.c:630 [inline]
  sock_sendmsg net/socket.c:640 [inline]
  ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
  __sys_sendmmsg+0x42d/0x800 net/socket.c:2136
  SYSC_sendmmsg+0xc4/0x110 net/socket.c:2167
  SyS_sendmmsg+0x63/0x90 net/socket.c:2162
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

This change addresses the issue by adding the needed check before
accessing the inner header.

The ipv4 side of the issue is apparently there since the ipv4 over ipv6
initial support, and the ipv6 side predates git history.

Fixes: c4d3efafcc93 ("[IPV6] IP6TUNNEL: Add support to IPv4 over IPv6 tunnel.")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: syzbot+3fde91d4d394747d6...@syzkaller.appspotmail.com
Tested-by: Alexander Potapenko 
Signed-off-by: Paolo Abeni 
---
 net/ipv6/ip6_tunnel.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 419960b0ba16..a0b6932c3afd 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1234,7 +1234,7 @@ static inline int
 ip4ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct ip6_tnl *t = netdev_priv(dev);
-   const struct iphdr  *iph = ip_hdr(skb);
+   const struct iphdr  *iph;
int encap_limit = -1;
struct flowi6 fl6;
__u8 dsfield;
@@ -1242,6 +1242,11 @@ ip4ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev)
u8 tproto;
int err;
 
+   /* ensure we can access the full inner ip header */
+   if (!pskb_may_pull(skb, sizeof(struct iphdr)))
+   return -1;
+
+   iph = ip_hdr(skb);
memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
 
tproto = READ_ONCE(t->parms.proto);
@@ -1306,7 +1311,7 @@ static inline int
 ip6ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev)
 {
struct ip6_tnl *t = netdev_priv(dev);
-   struct ipv6hdr *ipv6h = ipv6_hdr(skb);
+   struct ipv6hdr *ipv6h;
int encap_limit = -1;
__u16 
