from:"Stefano Brivio"

[PATCH net v2 2/2] neighbour: Avoid writing before skb->head in neigh_hh_output()

2018-12-06 Thread Stefano Brivio

While skb_push() makes the kernel panic if the skb headroom is less than
the unaligned hardware header size, it will proceed normally in case we
copy more than that because of alignment, and we'll silently corrupt
adjacent slabs.

In the case fixed by the previous patch,
"ipv6: Check available headroom in ip6_xmit() even without options", we
end up in neigh_hh_output() with 14 bytes headroom, 14 bytes hardware
header and write 16 bytes, starting 2 bytes before the allocated buffer.

Always check we're not writing before skb->head and, if the headroom is
not enough, warn and drop the packet.

v2:
 - instead of panicking with BUG_ON(), WARN_ON_ONCE() and drop the packet
   (Eric Dumazet)
 - if we avoid the panic, though, we need to explicitly check the headroom
   before the memcpy(), otherwise we'll have corrupted slabs on a running
   kernel, after we warn
 - use __skb_push() instead of skb_push(), as the headroom check is
   already implemented here explicitly (Eric Dumazet)

Signed-off-by: Stefano Brivio 
---
 include/net/neighbour.h | 28 +++-
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index f58b384aa6c9..665990c7dec8 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -454,6 +454,7 @@ static inline int neigh_hh_bridge(struct hh_cache *hh, 
struct sk_buff *skb)
 
 static inline int neigh_hh_output(const struct hh_cache *hh, struct sk_buff 
*skb)
 {
+   unsigned int hh_alen = 0;
unsigned int seq;
unsigned int hh_len;
 
@@ -461,16 +462,33 @@ static inline int neigh_hh_output(const struct hh_cache 
*hh, struct sk_buff *skb
seq = read_seqbegin(>hh_lock);
hh_len = hh->hh_len;
if (likely(hh_len <= HH_DATA_MOD)) {
-   /* this is inlined by gcc */
-   memcpy(skb->data - HH_DATA_MOD, hh->hh_data, 
HH_DATA_MOD);
+   hh_alen = HH_DATA_MOD;
+
+   /* skb_push() would proceed silently if we have room for
+* the unaligned size but not for the aligned size:
+* check headroom explicitly.
+*/
+   if (likely(skb_headroom(skb) >= HH_DATA_MOD)) {
+   /* this is inlined by gcc */
+   memcpy(skb->data - HH_DATA_MOD, hh->hh_data,
+  HH_DATA_MOD);
+   }
} else {
-   unsigned int hh_alen = HH_DATA_ALIGN(hh_len);
+   hh_alen = HH_DATA_ALIGN(hh_len);
 
-   memcpy(skb->data - hh_alen, hh->hh_data, hh_alen);
+   if (likely(skb_headroom(skb) >= hh_alen)) {
+   memcpy(skb->data - hh_alen, hh->hh_data,
+  hh_alen);
+   }
}
} while (read_seqretry(>hh_lock, seq));
 
-   skb_push(skb, hh_len);
+   if (WARN_ON_ONCE(skb_headroom(skb) < hh_alen)) {
+   kfree_skb(skb);
+   return NET_XMIT_DROP;
+   }
+
+   __skb_push(skb, hh_len);
return dev_queue_xmit(skb);
 }
 
-- 
2.19.2

[PATCH net v2 0/2] Fix slab out-of-bounds on insufficient headroom for IPv6 packets

2018-12-06 Thread Stefano Brivio

Patch 1/2 fixes a slab out-of-bounds occurring with short SCTP packets over
IPv4 over L2TP over IPv6 on a configuration with relatively low HEADER_MAX.

Patch 2/2 makes sure we avoid writing before the allocated buffer in
neigh_hh_output() in case the headroom is enough for the unaligned hardware
header size, but not enough for the aligned one, and that we warn if we hit
this condition.

Stefano Brivio (2):
  ipv6: Check available headroom in ip6_xmit() even without options
  neighbour: Avoid writing before skb->head in neigh_hh_output()

 include/net/neighbour.h | 28 ++-
 net/ipv6/ip6_output.c   | 42 -
 2 files changed, 44 insertions(+), 26 deletions(-)

-- 
2.19.2

[PATCH net v2 1/2] ipv6: Check available headroom in ip6_xmit() even without options

2018-12-06 Thread Stefano Brivio

Even if we send an IPv6 packet without options, MAX_HEADER might not be
enough to account for the additional headroom required by alignment of
hardware headers.

On a configuration without HYPERV_NET, WLAN, AX25, and with IPV6_TUNNEL,
sending short SCTP packets over IPv4 over L2TP over IPv6, we start with
100 bytes of allocated headroom in sctp_packet_transmit(), end up with 54
bytes after l2tp_xmit_skb(), and 14 bytes in ip6_finish_output2().

Those would be enough to append our 14 bytes header, but we're going to
align that to 16 bytes, and write 2 bytes out of the allocated slab in
neigh_hh_output().

KASan says:

[  264.967848] 
==
[  264.967861] BUG: KASAN: slab-out-of-bounds in 
ip6_finish_output2+0x1aec/0x1c70
[  264.967866] Write of size 16 at addr 6af1c7fe by task netperf/6201
[  264.967870]
[  264.967876] CPU: 0 PID: 6201 Comm: netperf Not tainted 4.20.0-rc4+ #1
[  264.967881] Hardware name: IBM 2827 H43 400 (z/VM 6.4.0)
[  264.967887] Call Trace:
[  264.967896] ([<001347d6>] show_stack+0x56/0xa0)
[  264.967903]  [<017e379c>] dump_stack+0x23c/0x290
[  264.967912]  [<007bc594>] print_address_description+0xf4/0x290
[  264.967919]  [<007bc8fc>] kasan_report+0x13c/0x240
[  264.967927]  [<0162f5e4>] ip6_finish_output2+0x1aec/0x1c70
[  264.967935]  [<0163f890>] ip6_finish_output+0x430/0x7f0
[  264.967943]  [<0163fe44>] ip6_output+0x1f4/0x580
[  264.967953]  [<0163882a>] ip6_xmit+0xfea/0x1ce8
[  264.967963]  [<017396e2>] inet6_csk_xmit+0x282/0x3f8
[  264.968033]  [<03ff805fb0ba>] l2tp_xmit_skb+0xe02/0x13e0 [l2tp_core]
[  264.968037]  [<03ff80631192>] l2tp_eth_dev_xmit+0xda/0x150 [l2tp_eth]
[  264.968041]  [<01220020>] dev_hard_start_xmit+0x268/0x928
[  264.968069]  [<01330e8e>] sch_direct_xmit+0x7ae/0x1350
[  264.968071]  [<0122359c>] __dev_queue_xmit+0x2b7c/0x3478
[  264.968075]  [<013d2862>] ip_finish_output2+0xce2/0x11a0
[  264.968078]  [<013d9b14>] ip_finish_output+0x56c/0x8c8
[  264.968081]  [<013ddd1e>] ip_output+0x226/0x4c0
[  264.968083]  [<013dbd6c>] __ip_queue_xmit+0x894/0x1938
[  264.968100]  [<03ff80bc3a5c>] sctp_packet_transmit+0x29d4/0x3648 [sctp]
[  264.968116]  [<03ff80b7bf68>] 
sctp_outq_flush_ctrl.constprop.5+0x8d0/0xe50 [sctp]
[  264.968131]  [<03ff80b7c716>] sctp_outq_flush+0x22e/0x7d8 [sctp]
[  264.968146]  [<03ff80b35c68>] sctp_cmd_interpreter.isra.16+0x530/0x6800 
[sctp]
[  264.968161]  [<03ff80b3410a>] sctp_do_sm+0x222/0x648 [sctp]
[  264.968177]  [<03ff80bbddac>] sctp_primitive_ASSOCIATE+0xbc/0xf8 [sctp]
[  264.968192]  [<03ff80b93328>] __sctp_connect+0x830/0xc20 [sctp]
[  264.968208]  [<03ff80bb11ce>] sctp_inet_connect+0x2e6/0x378 [sctp]
[  264.968212]  [<01197942>] __sys_connect+0x21a/0x450
[  264.968215]  [<0119aff8>] sys_socketcall+0x3d0/0xb08
[  264.968218]  [<0184ea7a>] system_call+0x2a2/0x2c0

[...]

Just like ip_finish_output2() does for IPv4, check that we have enough
headroom in ip6_xmit(), and reallocate it if we don't.

This issue is older than git history.

Reported-by: Jianlin Shi 
Signed-off-by: Stefano Brivio 
---
v2: Fixed Jianlin's name in commit message

 net/ipv6/ip6_output.c | 42 +-
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 827a3f5ff3bb..fcd3c66ded16 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -195,37 +195,37 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, 
struct flowi6 *fl6,
const struct ipv6_pinfo *np = inet6_sk(sk);
struct in6_addr *first_hop = >daddr;
struct dst_entry *dst = skb_dst(skb);
+   unsigned int head_room;
struct ipv6hdr *hdr;
u8  proto = fl6->flowi6_proto;
int seg_len = skb->len;
int hlimit = -1;
u32 mtu;
 
-   if (opt) {
-   unsigned int head_room;
+   head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dst->dev);
+   if (opt)
+   head_room += opt->opt_nflen + opt->opt_flen;
 
-   /* First: exthdrs may take lots of space (~8K for now)
-  MAX_HEADER is not enough.
-*/
-   head_room = opt->opt_nflen + opt->opt_flen;
-   seg_len += head_room;
-   head_room += sizeof(struct ipv6hdr) + 
LL_RESERVED_SPACE(dst->dev);
-
-   if (skb_headroom(skb) < head_room) {
-   struct sk_buff *skb2 = skb_realloc_headroom(skb, 
head_room);
-   if (!skb2) {
-   IP6_INC_ST

Re: [PATCH net 2/2] neighbour: BUG_ON() writing before skb->head in neigh_hh_output()

2018-12-04 Thread Stefano Brivio

On Tue, 4 Dec 2018 16:26:05 -0800
Eric Dumazet  wrote:

> > +   /* skb_push() won't panic if we have room for the unaligned size
> > only */
> > +   BUG_ON(skb_headroom(skb) < hh_alen);
> >  
> 
> What about avoiding the panic and instead call kfree_skb() ?
> 
>  if (WARN_ON_ONCE(skb_headroom(skb) < hh_alen)) {
>   kfree_skb(skb);
>  return NET_XMIT_DROP;
> }

Okay, I guess it won't go unnoticed anyway, and it's probably better
than the alternative.

> > +
> > skb_push(skb, hh_len);
> >  
> 
> Maybe we can use __skb_push() here, since prior safety check should be
> enough ?

Indeed, I'll change that in v2. Thanks!

-- 
Stefano

[PATCH net 0/2] Fix slab out-of-bounds on insufficient headroom for IPv6 packets

2018-12-04 Thread Stefano Brivio

Patch 1/2 fixes a slab out-of-bounds occurring with short SCTP packets over
IPv4 over L2TP over IPv6 on a configuration with relatively low HEADER_MAX.

Patch 2/2 makes sure we panic in neigh_hh_output() instead of silently
writing before the allocated buffer in case the headroom is enough for the
unaligned hardware header size, but not enough for the aligned one.

Stefano Brivio (2):
  ipv6: Check available headroom in ip6_xmit() even without options
  neighbour: BUG_ON() writing before skb->head in neigh_hh_output()

 include/net/neighbour.h |  8 ++--
 net/ipv6/ip6_output.c   | 42 -
 2 files changed, 27 insertions(+), 23 deletions(-)

-- 
2.19.2

[PATCH net 2/2] neighbour: BUG_ON() writing before skb->head in neigh_hh_output()

2018-12-04 Thread Stefano Brivio

While skb_push() makes the kernel panic if the skb headroom is less than
the unaligned hardware header size in neigh_hh_output(), it will proceed
silently in case we copy more than that because of alignment.

In the case fixed by the previous patch,
"ipv6: Check available headroom in ip6_xmit() even without options", we
end up in neigh_hh_output() with 14 bytes headroom, 14 bytes hardware
header and write 16 bytes, starting 2 bytes before the allocated buffer.

Panic, instead of silently corrupting adjacent slabs.

Signed-off-by: Stefano Brivio 
---
 include/net/neighbour.h | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index f58b384aa6c9..95dcba741fd5 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -454,6 +454,7 @@ static inline int neigh_hh_bridge(struct hh_cache *hh, 
struct sk_buff *skb)
 
 static inline int neigh_hh_output(const struct hh_cache *hh, struct sk_buff 
*skb)
 {
+   unsigned int hh_alen = 0;
unsigned int seq;
unsigned int hh_len;
 
@@ -461,15 +462,18 @@ static inline int neigh_hh_output(const struct hh_cache 
*hh, struct sk_buff *skb
seq = read_seqbegin(>hh_lock);
hh_len = hh->hh_len;
if (likely(hh_len <= HH_DATA_MOD)) {
+   hh_alen = HH_DATA_MOD;
/* this is inlined by gcc */
memcpy(skb->data - HH_DATA_MOD, hh->hh_data, 
HH_DATA_MOD);
} else {
-   unsigned int hh_alen = HH_DATA_ALIGN(hh_len);
-
+   hh_alen = HH_DATA_ALIGN(hh_len);
memcpy(skb->data - hh_alen, hh->hh_data, hh_alen);
}
} while (read_seqretry(>hh_lock, seq));
 
+   /* skb_push() won't panic if we have room for the unaligned size only */
+   BUG_ON(skb_headroom(skb) < hh_alen);
+
skb_push(skb, hh_len);
return dev_queue_xmit(skb);
 }
-- 
2.19.2

[PATCH net 1/2] ipv6: Check available headroom in ip6_xmit() even without options

2018-12-04 Thread Stefano Brivio

Even if we send an IPv6 packet without options, MAX_HEADER might not be
enough to account for the additional headroom required by alignment of
hardware headers.

On a configuration without HYPERV_NET, WLAN, AX25, and with IPV6_TUNNEL,
sending short SCTP packets over IPv4 over L2TP over IPv6, we start with
100 bytes of allocated headroom in sctp_packet_transmit(), end up with 54
bytes after l2tp_xmit_skb(), and 14 bytes in ip6_finish_output2().

Those would be enough to append our 14 bytes header, but we're going to
align that to 16 bytes, and write 2 bytes out of the allocated slab in
neigh_hh_output().

KASan says:

[  264.967848] 
==
[  264.967861] BUG: KASAN: slab-out-of-bounds in 
ip6_finish_output2+0x1aec/0x1c70
[  264.967866] Write of size 16 at addr 6af1c7fe by task netperf/6201
[  264.967870]
[  264.967876] CPU: 0 PID: 6201 Comm: netperf Not tainted 4.20.0-rc4+ #1
[  264.967881] Hardware name: IBM 2827 H43 400 (z/VM 6.4.0)
[  264.967887] Call Trace:
[  264.967896] ([<001347d6>] show_stack+0x56/0xa0)
[  264.967903]  [<017e379c>] dump_stack+0x23c/0x290
[  264.967912]  [<007bc594>] print_address_description+0xf4/0x290
[  264.967919]  [<007bc8fc>] kasan_report+0x13c/0x240
[  264.967927]  [<0162f5e4>] ip6_finish_output2+0x1aec/0x1c70
[  264.967935]  [<0163f890>] ip6_finish_output+0x430/0x7f0
[  264.967943]  [<0163fe44>] ip6_output+0x1f4/0x580
[  264.967953]  [<0163882a>] ip6_xmit+0xfea/0x1ce8
[  264.967963]  [<017396e2>] inet6_csk_xmit+0x282/0x3f8
[  264.968033]  [<03ff805fb0ba>] l2tp_xmit_skb+0xe02/0x13e0 [l2tp_core]
[  264.968037]  [<03ff80631192>] l2tp_eth_dev_xmit+0xda/0x150 [l2tp_eth]
[  264.968041]  [<01220020>] dev_hard_start_xmit+0x268/0x928
[  264.968069]  [<01330e8e>] sch_direct_xmit+0x7ae/0x1350
[  264.968071]  [<0122359c>] __dev_queue_xmit+0x2b7c/0x3478
[  264.968075]  [<013d2862>] ip_finish_output2+0xce2/0x11a0
[  264.968078]  [<013d9b14>] ip_finish_output+0x56c/0x8c8
[  264.968081]  [<013ddd1e>] ip_output+0x226/0x4c0
[  264.968083]  [<013dbd6c>] __ip_queue_xmit+0x894/0x1938
[  264.968100]  [<03ff80bc3a5c>] sctp_packet_transmit+0x29d4/0x3648 [sctp]
[  264.968116]  [<03ff80b7bf68>] 
sctp_outq_flush_ctrl.constprop.5+0x8d0/0xe50 [sctp]
[  264.968131]  [<03ff80b7c716>] sctp_outq_flush+0x22e/0x7d8 [sctp]
[  264.968146]  [<03ff80b35c68>] sctp_cmd_interpreter.isra.16+0x530/0x6800 
[sctp]
[  264.968161]  [<03ff80b3410a>] sctp_do_sm+0x222/0x648 [sctp]
[  264.968177]  [<03ff80bbddac>] sctp_primitive_ASSOCIATE+0xbc/0xf8 [sctp]
[  264.968192]  [<03ff80b93328>] __sctp_connect+0x830/0xc20 [sctp]
[  264.968208]  [<03ff80bb11ce>] sctp_inet_connect+0x2e6/0x378 [sctp]
[  264.968212]  [<01197942>] __sys_connect+0x21a/0x450
[  264.968215]  [<0119aff8>] sys_socketcall+0x3d0/0xb08
[  264.968218]  [<0184ea7a>] system_call+0x2a2/0x2c0

[...]

Just like ip_finish_output2() does for IPv4, check that we have enough
headroom in ip6_xmit(), and reallocate it if we don't.

This issue is older than git history.

Reported-by: Jianlin Shin 
Signed-off-by: Stefano Brivio 
---
 net/ipv6/ip6_output.c | 42 +-
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 827a3f5ff3bb..fcd3c66ded16 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -195,37 +195,37 @@ int ip6_xmit(const struct sock *sk, struct sk_buff *skb, 
struct flowi6 *fl6,
const struct ipv6_pinfo *np = inet6_sk(sk);
struct in6_addr *first_hop = >daddr;
struct dst_entry *dst = skb_dst(skb);
+   unsigned int head_room;
struct ipv6hdr *hdr;
u8  proto = fl6->flowi6_proto;
int seg_len = skb->len;
int hlimit = -1;
u32 mtu;
 
-   if (opt) {
-   unsigned int head_room;
+   head_room = sizeof(struct ipv6hdr) + LL_RESERVED_SPACE(dst->dev);
+   if (opt)
+   head_room += opt->opt_nflen + opt->opt_flen;
 
-   /* First: exthdrs may take lots of space (~8K for now)
-  MAX_HEADER is not enough.
-*/
-   head_room = opt->opt_nflen + opt->opt_flen;
-   seg_len += head_room;
-   head_room += sizeof(struct ipv6hdr) + 
LL_RESERVED_SPACE(dst->dev);
-
-   if (skb_headroom(skb) < head_room) {
-   struct sk_buff *skb2 = skb_realloc_headroom(skb, 
head_room);
-   if (!skb2) {
-   IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-

Re: [PATCH iproute2] ss: add support for bytes_sent, bytes_retrans, dsack_dups and reord_seen

2018-11-30 Thread Stefano Brivio

Hi David,

On Thu, 29 Nov 2018 11:51:48 -0700
David Ahern  wrote:

> On 11/29/18 11:50 AM, Stephen Hemminger wrote:
> > PS: ss still doesn't support JSON output, given the volume of output it 
> > would be good.  
> 
> I thought Stefano was investigating it as an alternative to the 'display
> selected columns' patches.

Thanks for Cc'ing me, I missed this.

I'm not really investigating it at the moment, I'm convinced it's not
complicated, it's on my to-do list, but I'm not working on it right now
(and maybe also not in a very near future).

I would still say the "display selected columns" patches could be a
stopgap for now (even though it's not related to this one patch).

-- 
Stefano

Re: [PATCH net-next 16/17] net: sched: conditionally take rtnl lock on rules update path

2018-11-13 Thread Stefano Brivio

On Tue, 13 Nov 2018 16:53:07 +0100
Stefano Brivio  wrote:

> But to make that effective, you would need to protect the read too, and
> that makes your optimisation not really overzealous I think.
> 
> I'd rather go with an additional comment, if that doesn't become
> unreadable.

Oh, and of course, this whole thing makes sense only if for some reason
you end up keeping this function. Maybe ignore my comments on this patch
at least for the moment being ;)

-- 
Stefano

Re: [PATCH net-next 16/17] net: sched: conditionally take rtnl lock on rules update path

2018-11-13 Thread Stefano Brivio

On Tue, 13 Nov 2018 13:58:05 +
Vlad Buslov  wrote:

> On Tue 13 Nov 2018 at 13:40, Stefano Brivio  wrote:
> > On Tue, 13 Nov 2018 13:25:52 +
> > Vlad Buslov  wrote:
> >  
> >> On Tue 13 Nov 2018 at 09:40, Stefano Brivio  wrote:  
> >> > Hi Vlad,
> >> >
> >> > On Mon, 12 Nov 2018 09:55:45 +0200
> >> > Vlad Buslov  wrote:
> >> >
> >> >> @@ -179,9 +179,25 @@ static void tcf_proto_destroy_work(struct 
> >> >> work_struct *work)
> >> >> rtnl_unlock();
> >> >>  }
> >> >>  
> >> >> +/* Helper function to lock rtnl mutex when specified condition is true 
> >> >> and mutex
> >> >> + * hasn't been locked yet. Will set rtnl_held to 'true' before taking 
> >> >> rtnl lock.
> >> >> + * Note that this function does nothing if rtnl is already held. This 
> >> >> is
> >> >> + * intended to be used by cls API rules update API when multiple 
> >> >> conditions
> >> >> + * could require rtnl lock and its state needs to be tracked to 
> >> >> prevent trying
> >> >> + * to obtain lock multiple times.
> >> >> + */
> >> >> +
> >> >> +static void tcf_require_rtnl(bool cond, bool *rtnl_held)
> >> >> +{
> >> >> +   if (!*rtnl_held && cond) {
> >> >> +   *rtnl_held = true;
> >> >> +   rtnl_lock();
> >> >> +   }
> >> >> +}
> >> >
> >> > I guess calls to this function are supposed to be serialised. If that's
> >> > the case (which is my tentative understanding so far), I would indicate
> >> > that in the comment.
> >> >
> >> > If that's not the case, you would be introducing a race I guess.
> >> >
> >> > Same applies to tcf_block_release() from 17/17.
> >> 
> >> Hi Stefano,
> >> 
> >> Thank you for reviewing my code!
> >> 
> >> I did not intend for this function to be serialized. First argument to
> >> tcf_require_rtnl() is passed by value, and second argument is always a
> >> pointer to local stack-allocated value of the caller.  
> >
> > Yes, sorry, I haven't been terribly clear, that's what I meant by
> > serialised: it won't be called concurrently with the same *rtnl_held.
> >
> > Perhaps the risk that somebody uses it that way is close to zero, so
> > I'm not even too sure this is worth a comment, but if you can come up
> > with a concise way of saying this, that would be nice.  
> 
> I considered my comment that function "Will set rtnl_held to 'true'
> before taking rtnl lock" as a red flag for caller to not pass pointer to
> a variable that can be accessed concurrently. I guess I can add
> additional sentence to explicitly warn potential users. Or I can just
> move rtnl_held assignment in both functions to be performed while
> holding rtnl mutex. I implemented it the way I did as an overzealous
> optimization, but realistically price of an assignment is negligible in
> this case.

But to make that effective, you would need to protect the read too, and
that makes your optimisation not really overzealous I think.

I'd rather go with an additional comment, if that doesn't become
unreadable.

-- 
Stefano

Re: [PATCH net-next 16/17] net: sched: conditionally take rtnl lock on rules update path

2018-11-13 Thread Stefano Brivio

On Tue, 13 Nov 2018 13:25:52 +
Vlad Buslov  wrote:

> On Tue 13 Nov 2018 at 09:40, Stefano Brivio  wrote:
> > Hi Vlad,
> >
> > On Mon, 12 Nov 2018 09:55:45 +0200
> > Vlad Buslov  wrote:
> >  
> >> @@ -179,9 +179,25 @@ static void tcf_proto_destroy_work(struct work_struct 
> >> *work)
> >>rtnl_unlock();
> >>  }
> >>  
> >> +/* Helper function to lock rtnl mutex when specified condition is true 
> >> and mutex
> >> + * hasn't been locked yet. Will set rtnl_held to 'true' before taking 
> >> rtnl lock.
> >> + * Note that this function does nothing if rtnl is already held. This is
> >> + * intended to be used by cls API rules update API when multiple 
> >> conditions
> >> + * could require rtnl lock and its state needs to be tracked to prevent 
> >> trying
> >> + * to obtain lock multiple times.
> >> + */
> >> +
> >> +static void tcf_require_rtnl(bool cond, bool *rtnl_held)
> >> +{
> >> +  if (!*rtnl_held && cond) {
> >> +  *rtnl_held = true;
> >> +  rtnl_lock();
> >> +  }
> >> +}  
> >
> > I guess calls to this function are supposed to be serialised. If that's
> > the case (which is my tentative understanding so far), I would indicate
> > that in the comment.
> >
> > If that's not the case, you would be introducing a race I guess.
> >
> > Same applies to tcf_block_release() from 17/17.  
> 
> Hi Stefano,
> 
> Thank you for reviewing my code!
> 
> I did not intend for this function to be serialized. First argument to
> tcf_require_rtnl() is passed by value, and second argument is always a
> pointer to local stack-allocated value of the caller.

Yes, sorry, I haven't been terribly clear, that's what I meant by
serialised: it won't be called concurrently with the same *rtnl_held.

Perhaps the risk that somebody uses it that way is close to zero, so
I'm not even too sure this is worth a comment, but if you can come up
with a concise way of saying this, that would be nice.

> Same applies to tcf_block_release() - its arguments are Qdisc and block
> which support concurrency-safe reference counting, and pointer to local
> variable rtnl_held, which is not accessible to concurrent users.

Same there.

> What is the race in these cases? Am I missing something?

No, no race then. My only concern was:

thread A: thread B:
- x = false;
- tcf_require_rtnl(true, ); - tcf_require_rtnl(true, );
  - if (!*x && true)- if (!*x && true)
- *x = true;
- rtnl_lock() - *x = true;
  - rtnl_lock()

but this cannot happen as you explained.

-- 
Stefano

Re: [PATCH net-next 02/17] net: sched: protect block state with spinlock

2018-11-13 Thread Stefano Brivio

Vlad,

On Mon, 12 Nov 2018 09:28:59 -0800 (PST)
David Miller  wrote:

> From: Vlad Buslov 
> Date: Mon, 12 Nov 2018 09:55:31 +0200
> 
> > +#define ASSERT_BLOCK_LOCKED(block) \
> > +   WARN_ONCE(!spin_is_locked(&(block)->lock),  \
> > + "BLOCK: assertion failed at %s (%d)\n", __FILE__,  __LINE__)  
> 
> spin_is_locked() is not usable for assertions.

See also b86077207d0c ("igbvf: Replace spin_is_locked() with lockdep").

-- 
Stefano

Re: [PATCH net-next 16/17] net: sched: conditionally take rtnl lock on rules update path

2018-11-13 Thread Stefano Brivio

Hi Vlad,

On Mon, 12 Nov 2018 09:55:45 +0200
Vlad Buslov  wrote:

> @@ -179,9 +179,25 @@ static void tcf_proto_destroy_work(struct work_struct 
> *work)
>   rtnl_unlock();
>  }
>  
> +/* Helper function to lock rtnl mutex when specified condition is true and 
> mutex
> + * hasn't been locked yet. Will set rtnl_held to 'true' before taking rtnl 
> lock.
> + * Note that this function does nothing if rtnl is already held. This is
> + * intended to be used by cls API rules update API when multiple conditions
> + * could require rtnl lock and its state needs to be tracked to prevent 
> trying
> + * to obtain lock multiple times.
> + */
> +
> +static void tcf_require_rtnl(bool cond, bool *rtnl_held)
> +{
> + if (!*rtnl_held && cond) {
> + *rtnl_held = true;
> + rtnl_lock();
> + }
> +}

I guess calls to this function are supposed to be serialised. If that's
the case (which is my tentative understanding so far), I would indicate
that in the comment.

If that's not the case, you would be introducing a race I guess.

Same applies to tcf_block_release() from 17/17.

-- 
Stefano

Re: [PATCH iproute2] testsuite: ss: Fix spacing in expected output for ssfilter.t

2018-11-11 Thread Stefano Brivio

Hi Phil,

On Sat, 10 Nov 2018 22:48:44 +0100
Phil Sutter  wrote:

> On Sat, Nov 10, 2018 at 10:21:59AM +0100, Stefano Brivio wrote:
>
> > @@ -12,37 +12,37 @@ export TCPDIAG_FILE="$(dirname $0)/ss1.dump"
> >  ts_log "[Testing ssfilter]"
> >  
> >  ts_ss "$0" "Match dport = 22" -Htna dport = 22
> > -test_on "ESTAB0   0 10.0.0.1:36266   
> > 10.0.0.1:22"
> > +test_on "ESTAB 0   010.0.0.1:36266   
> > 10.0.0.1:22"  
> 
> How about using a regular expression ('test_on' calls grep with '-E')?
> E.g. this instead of the above:
> 
> | test_on "ESTAB *0 *0 *10.0.0.1:36266 *10.0.0.1:22"

I also thought about something similar (perhaps uglier: piping the
output through tr -s ' ' in ts_ss()).

But then I thought we might like to use this test to also check that we
don't accidentally modify spacing, so I'd rather leave it as it is,
with this patch on top.

-- 
Stefano

[PATCH iproute2] testsuite: ss: Fix spacing in expected output for ssfilter.t

2018-11-10 Thread Stefano Brivio

Since commit 00240899ec0b ("ss: Actually print left delimiter for
columns") changes spacing in ss output, we also need to adjust for that in
the ss filter test.

Fixes: 00240899ec0b ("ss: Actually print left delimiter for columns")
Signed-off-by: Stefano Brivio 
---
 testsuite/tests/ss/ssfilter.t | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/testsuite/tests/ss/ssfilter.t b/testsuite/tests/ss/ssfilter.t
index e74f1765cb72..3091054f2892 100755
--- a/testsuite/tests/ss/ssfilter.t
+++ b/testsuite/tests/ss/ssfilter.t
@@ -12,37 +12,37 @@ export TCPDIAG_FILE="$(dirname $0)/ss1.dump"
 ts_log "[Testing ssfilter]"
 
 ts_ss "$0" "Match dport = 22" -Htna dport = 22
-test_on "ESTAB0   0 10.0.0.1:36266   
10.0.0.1:22"
+test_on "ESTAB 0   010.0.0.1:36266   
10.0.0.1:22"
 
 ts_ss "$0" "Match dport 22" -Htna dport 22
-test_on "ESTAB0   0 10.0.0.1:36266   
10.0.0.1:22"
+test_on "ESTAB 0   010.0.0.1:36266   
10.0.0.1:22"
 
 ts_ss "$0" "Match (dport)" -Htna '( dport = 22 )'
-test_on "ESTAB0   0 10.0.0.1:36266   
10.0.0.1:22"
+test_on "ESTAB 0   010.0.0.1:36266   
10.0.0.1:22"
 
 ts_ss "$0" "Match src = 0.0.0.0" -Htna src = 0.0.0.0
-test_on "LISTEN 0   1280.0.0.0:22 
0.0.0.0:*"
+test_on "LISTEN  0   128   0.0.0.0:22 
0.0.0.0:*"
 
 ts_ss "$0" "Match src 0.0.0.0" -Htna src 0.0.0.0
-test_on "LISTEN 0   1280.0.0.0:22 
0.0.0.0:*"
+test_on "LISTEN  0   128   0.0.0.0:22 
0.0.0.0:*"
 
 ts_ss "$0" "Match src sport" -Htna src 0.0.0.0 sport = 22
-test_on "LISTEN 0   1280.0.0.0:22 
0.0.0.0:*"
+test_on "LISTEN  0   128   0.0.0.0:22 
0.0.0.0:*"
 
 ts_ss "$0" "Match src and sport" -Htna src 0.0.0.0 and sport = 22
-test_on "LISTEN 0   1280.0.0.0:22 
0.0.0.0:*"
+test_on "LISTEN  0   128   0.0.0.0:22 
0.0.0.0:*"
 
 ts_ss "$0" "Match src and sport and dport" -Htna src 10.0.0.1 and sport = 22 
and dport = 50312
-test_on "ESTAB0   0 10.0.0.1:22   
10.0.0.2:50312"
+test_on "ESTAB 0   010.0.0.1:22   
10.0.0.2:50312"
 
 ts_ss "$0" "Match src and sport and (dport)" -Htna 'src 10.0.0.1 and sport = 
22 and ( dport = 50312 )'
-test_on "ESTAB0   0 10.0.0.1:22   
10.0.0.2:50312"
+test_on "ESTAB 0   010.0.0.1:22   
10.0.0.2:50312"
 
 ts_ss "$0" "Match src and (sport and dport)" -Htna 'src 10.0.0.1 and ( sport = 
22 and dport = 50312 )'
-test_on "ESTAB0   0 10.0.0.1:22   
10.0.0.2:50312"
+test_on "ESTAB 0   010.0.0.1:22   
10.0.0.2:50312"
 
 ts_ss "$0" "Match (src and sport) and dport" -Htna '( src 10.0.0.1 and sport = 
22 ) and dport = 50312'
-test_on "ESTAB0   0 10.0.0.1:22   
10.0.0.2:50312"
+test_on "ESTAB 0   010.0.0.1:22   
10.0.0.2:50312"
 
 ts_ss "$0" "Match (src or src) and dst" -Htna '( src 0.0.0.0 or src 10.0.0.1 ) 
and dst 10.0.0.2'
-test_on "ESTAB0   0 10.0.0.1:22   
10.0.0.2:50312"
+test_on "ESTAB 0   010.0.0.1:22   
10.0.0.2:50312"
-- 
2.19.1

Re: [PATCH iproute] ss: Actually print left delimiter for columns

2018-11-09 Thread Stefano Brivio

On Fri, 9 Nov 2018 09:05:46 -0800
Stephen Hemminger  wrote:

> On Mon, 29 Oct 2018 23:04:25 +0100
> Stefano Brivio  wrote:
> 
> > While rendering columns, we use a local variable to keep track of the
> > field currently being printed, without touching current_field, which is
> > used for buffering.
> > 
> > Use the right pointer to access the left delimiter for the current column,
> > instead of always printing the left delimiter for the last buffered field,
> > which is usually an empty string.
> > 
> > This fixes an issue especially visible on narrow terminals, where some
> > columns might be displayed without separation.
> > 
> > Reported-by: YoyPa 
> > Fixes: 691bd854bf4a ("ss: Buffer raw fields first, then render them as a 
> > table")
> > Signed-off-by: Stefano Brivio 
> > Tested-by: YoyPa   
> 
> This test broke the testsuite/ss/ssfilter.t test.
> Please fix the test to match your new output format, or I will have to revert 
> it.

Ouch, sorry, I didn't notice that "new" test. I'll fix that by tomorrow.

-- 
Stefano

Re: [Patch net-next] net: move __skb_checksum_complete*() to skbuff.c

2018-11-08 Thread Stefano Brivio

On Thu, 8 Nov 2018 12:02:00 -0800
Cong Wang  wrote:

> On Thu, Nov 8, 2018 at 11:54 AM Stefano Brivio  wrote:
> >
> > Hi,
> >
> > On Thu,  8 Nov 2018 11:49:49 -0800
> > Cong Wang  wrote:
> >  
> > > +EXPORT_SYMBOL(__skb_checksum_complete);
> > > +
> > >  /* Both of above in one bottle. */  
> >
> > Maybe you should also update/drop this comment now?  
> 
> I have no idea what that comment means. Do you?

I think it refers to the fact that skb_copy_and_csum_bits() implements
both skb_checksum() and skb_copy_bits(), that were just above it at
1da177e4c3f4.

Then more stuff was moved in between, and the comment was never updated
or dropped.

-- 
Stefano

Re: [Patch net-next] net: move __skb_checksum_complete*() to skbuff.c

2018-11-08 Thread Stefano Brivio

Hi,

On Thu,  8 Nov 2018 11:49:49 -0800
Cong Wang  wrote:

> +EXPORT_SYMBOL(__skb_checksum_complete);
> +
>  /* Both of above in one bottle. */

Maybe you should also update/drop this comment now?
 
>  __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset,

-- 
Stefano

[PATCH iproute2 net-next v2 1/2] iplink_vxlan: Add DF configuration

2018-11-08 Thread Stefano Brivio

Allow to set the DF bit behaviour for outgoing IPv4 packets: it can be
always on, inherited from the inner header, or, by default, always off,
which is the current behaviour.

v2:
- Indicate in the man page what DF refers to, using RFC 791 wording
  (David Ahern)

Signed-off-by: Stefano Brivio 
---
 include/uapi/linux/if_link.h |  9 +
 ip/iplink_vxlan.c| 29 +
 man/man8/ip-link.8.in| 14 ++
 3 files changed, 52 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 9c254603ebda..4caf683ce546 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -530,6 +530,7 @@ enum {
IFLA_VXLAN_LABEL,
IFLA_VXLAN_GPE,
IFLA_VXLAN_TTL_INHERIT,
+   IFLA_VXLAN_DF,
__IFLA_VXLAN_MAX
 };
 #define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1)
@@ -539,6 +540,14 @@ struct ifla_vxlan_port_range {
__be16  high;
 };
 
+enum ifla_vxlan_df {
+   VXLAN_DF_UNSET = 0,
+   VXLAN_DF_SET,
+   VXLAN_DF_INHERIT,
+   __VXLAN_DF_END,
+   VXLAN_DF_MAX = __VXLAN_DF_END - 1,
+};
+
 /* GENEVE section */
 enum {
IFLA_GENEVE_UNSPEC,
diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index 7fc0e2b4eb06..86afbe1334f0 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -31,6 +31,7 @@ static void print_explain(FILE *f)
" [ local ADDR ]\n"
" [ ttl TTL ]\n"
" [ tos TOS ]\n"
+   " [ df DF ]\n"
" [ flowlabel LABEL ]\n"
" [ dev PHYS_DEV ]\n"
" [ dstport PORT ]\n"
@@ -52,6 +53,7 @@ static void print_explain(FILE *f)
"   ADDR  := { IP_ADDRESS | any }\n"
"   TOS   := { NUMBER | inherit }\n"
"   TTL   := { 1..255 | auto | inherit }\n"
+   "   DF:= { unset | set | inherit }\n"
"   LABEL := 0-1048575\n"
);
 }
@@ -170,6 +172,22 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, 
char **argv,
} else
tos = 1;
addattr8(n, 1024, IFLA_VXLAN_TOS, tos);
+   } else if (!matches(*argv, "df")) {
+   enum ifla_vxlan_df df;
+
+   NEXT_ARG();
+   check_duparg(, IFLA_VXLAN_DF, "df", *argv);
+   if (strcmp(*argv, "unset") == 0)
+   df = VXLAN_DF_UNSET;
+   else if (strcmp(*argv, "set") == 0)
+   df = VXLAN_DF_SET;
+   else if (strcmp(*argv, "inherit") == 0)
+   df = VXLAN_DF_INHERIT;
+   else
+   invarg("DF must be 'unset', 'set' or 'inherit'",
+  *argv);
+
+   addattr8(n, 1024, IFLA_VXLAN_DF, df);
} else if (!matches(*argv, "label") ||
   !matches(*argv, "flowlabel")) {
__u32 uval;
@@ -538,6 +556,17 @@ static void vxlan_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
print_string(PRINT_FP, NULL, "ttl %s ", "auto");
}
 
+   if (tb[IFLA_VXLAN_DF]) {
+   enum ifla_vxlan_df df = rta_getattr_u8(tb[IFLA_VXLAN_DF]);
+
+   if (df == VXLAN_DF_UNSET)
+   print_string(PRINT_JSON, "df", "df %s ", "unset");
+   else if (df == VXLAN_DF_SET)
+   print_string(PRINT_ANY, "df", "df %s ", "set");
+   else if (df == VXLAN_DF_INHERIT)
+   print_string(PRINT_ANY, "df", "df %s ", "inherit");
+   }
+
if (tb[IFLA_VXLAN_LABEL]) {
__u32 label = rta_getattr_u32(tb[IFLA_VXLAN_LABEL]);
 
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 5132f514b279..a94cf4f19f1e 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -496,6 +496,8 @@ the following additional arguments are supported:
 ] [
 .BI tos " TOS "
 ] [
+.BI df " DF "
+] [
 .BI flowlabel " FLOWLABEL "
 ] [
 .BI dstport " PORT "
@@ -565,6 +567,18 @@ parameter.
 .BI tos " TOS"
 - specifies the TOS value to use in outgoing packets.
 
+.sp
+.BI df " DF"
+- specifies the usage of the Don't Fragment flag (DF) bit in outgoing packets
+with IPv4 headers. The value
+.B inherit
+causes the bit to be copied from the original IP header. The values
+.B unset
+and
+.B set
+cause the bit to be always unset or always set, respectively. By default, the
+bit is not set.
+
 .sp
 .BI flowlabel " FLOWLABEL"
 - specifies the flow label to use in outgoing packets.
-- 
2.19.1

[PATCH iproute2 net-next v2 0/2] Add DF configuration for VXLAN and GENEVE link types

2018-11-08 Thread Stefano Brivio

This series adds configuration of the DF bit in outgoing IPv4 packets for
VXLAN and GENEVE link types.

Stefano Brivio (2):
  iplink_vxlan: Add DF configuration
  iplink_geneve: Add DF configuration

 include/uapi/linux/if_link.h | 18 ++
 ip/iplink_geneve.c   | 29 +
 ip/iplink_vxlan.c| 29 +
 man/man8/ip-link.8.in| 28 
 4 files changed, 104 insertions(+)

-- 
2.19.1

[PATCH iproute2 net-next v2 2/2] iplink_geneve: Add DF configuration

2018-11-08 Thread Stefano Brivio

Allow to set the DF bit behaviour for outgoing IPv4 packets: it can be
always on, inherited from the inner header, or, by default, always off,
which is the current behaviour.

v2:
- Indicate in the man page what DF refers to, using RFC 791 wording
  (David Ahern)

Signed-off-by: Stefano Brivio 
---
 include/uapi/linux/if_link.h |  9 +
 ip/iplink_geneve.c   | 29 +
 man/man8/ip-link.8.in| 14 ++
 3 files changed, 52 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 4caf683ce546..183ca7527178 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -563,10 +563,19 @@ enum {
IFLA_GENEVE_UDP_ZERO_CSUM6_RX,
IFLA_GENEVE_LABEL,
IFLA_GENEVE_TTL_INHERIT,
+   IFLA_GENEVE_DF,
__IFLA_GENEVE_MAX
 };
 #define IFLA_GENEVE_MAX(__IFLA_GENEVE_MAX - 1)
 
+enum ifla_geneve_df {
+   GENEVE_DF_UNSET = 0,
+   GENEVE_DF_SET,
+   GENEVE_DF_INHERIT,
+   __GENEVE_DF_END,
+   GENEVE_DF_MAX = __GENEVE_DF_END - 1,
+};
+
 /* PPP section */
 enum {
IFLA_PPP_UNSPEC,
diff --git a/ip/iplink_geneve.c b/ip/iplink_geneve.c
index c417842b2a5b..1872b74c5d70 100644
--- a/ip/iplink_geneve.c
+++ b/ip/iplink_geneve.c
@@ -24,6 +24,7 @@ static void print_explain(FILE *f)
"  remote ADDR\n"
"  [ ttl TTL ]\n"
"  [ tos TOS ]\n"
+   "  [ df DF ]\n"
"  [ flowlabel LABEL ]\n"
"  [ dstport PORT ]\n"
"  [ [no]external ]\n"
@@ -35,6 +36,7 @@ static void print_explain(FILE *f)
"   ADDR  := IP_ADDRESS\n"
"   TOS   := { NUMBER | inherit }\n"
"   TTL   := { 1..255 | auto | inherit }\n"
+   "   DF:= { unset | set | inherit }\n"
"   LABEL := 0-1048575\n"
);
 }
@@ -115,6 +117,22 @@ static int geneve_parse_opt(struct link_util *lu, int 
argc, char **argv,
tos = uval;
} else
tos = 1;
+   } else if (!matches(*argv, "df")) {
+   enum ifla_geneve_df df;
+
+   NEXT_ARG();
+   check_duparg(, IFLA_GENEVE_DF, "df", *argv);
+   if (strcmp(*argv, "unset") == 0)
+   df = GENEVE_DF_UNSET;
+   else if (strcmp(*argv, "set") == 0)
+   df = GENEVE_DF_SET;
+   else if (strcmp(*argv, "inherit") == 0)
+   df = GENEVE_DF_INHERIT;
+   else
+   invarg("DF must be 'unset', 'set' or 'inherit'",
+  *argv);
+
+   addattr8(n, 1024, IFLA_GENEVE_DF, df);
} else if (!matches(*argv, "label") ||
   !matches(*argv, "flowlabel")) {
__u32 uval;
@@ -287,6 +305,17 @@ static void geneve_print_opt(struct link_util *lu, FILE 
*f, struct rtattr *tb[])
print_string(PRINT_FP, NULL, "tos %s ", "inherit");
}
 
+   if (tb[IFLA_GENEVE_DF]) {
+   enum ifla_geneve_df df = rta_getattr_u8(tb[IFLA_GENEVE_DF]);
+
+   if (df == GENEVE_DF_UNSET)
+   print_string(PRINT_JSON, "df", "df %s ", "unset");
+   else if (df == GENEVE_DF_SET)
+   print_string(PRINT_ANY, "df", "df %s ", "set");
+   else if (df == GENEVE_DF_INHERIT)
+   print_string(PRINT_ANY, "df", "df %s ", "inherit");
+   }
+
if (tb[IFLA_GENEVE_LABEL]) {
__u32 label = rta_getattr_u32(tb[IFLA_GENEVE_LABEL]);
 
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index a94cf4f19f1e..73d37c190fff 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -1180,6 +1180,8 @@ the following additional arguments are supported:
 ] [
 .BI tos " TOS "
 ] [
+.BI df " DF "
+] [
 .BI flowlabel " FLOWLABEL "
 ] [
 .BI dstport " PORT"
@@ -1212,6 +1214,18 @@ ttl. Default option is "0".
 .BI tos " TOS"
 - specifies the TOS value to use in outgoing packets.
 
+.sp
+.BI df " DF"
+- specifies the usage of the Don't Fragment flag (DF) bit in outgoing packets
+with IPv4 headers. The value
+.B inherit
+causes the bit to be copied from the original IP header. The values
+.B unset
+and
+.B set
+cause the bit to be always unset or always set, respectively. By default, the
+bit is not set.
+
 .sp
 .BI flowlabel " FLOWLABEL"
 - specifies the flow label to use in outgoing packets.
-- 
2.19.1

[PATCH net-next v2 11/11] selftests: pmtu: Introduce FoU and GUE PMTU exceptions tests

2018-11-08 Thread Stefano Brivio

Introduce eight tests, for FoU and GUE, with IPv4 and IPv6 payload,
on IPv4 and IPv6 transport, that check that PMTU exceptions are created
with the right value when exceeding the MTU on a link of the path.

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
v2: no changes

 tools/testing/selftests/net/pmtu.sh | 179 
 1 file changed, 179 insertions(+)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index 14d20782ccb7..e2c94e47707c 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -59,6 +59,14 @@
 #  Same as pmtu_ipv6_vxlan6_exception, but using a GENEVE tunnel instead of
 #  VXLAN
 #
+# - pmtu_ipv{4,6}_fou{4,6}_exception
+#  Same as pmtu_ipv4_vxlan4, but using a direct IPv4/IPv6 encapsulation
+#  (FoU) over IPv4/IPv6, instead of VXLAN
+#
+# - pmtu_ipv{4,6}_fou{4,6}_exception
+#  Same as pmtu_ipv4_vxlan4, but using a generic UDP IPv4/IPv6
+#  encapsulation (GUE) over IPv4/IPv6, instead of VXLAN
+#
 # - pmtu_vti4_exception
 #  Set up vti tunnel on top of veth, with xfrm states and policies, in two
 #  namespaces with matching endpoints. Check that route exception is not
@@ -113,6 +121,14 @@ tests="
pmtu_ipv6_geneve4_exception IPv6 over geneve4: PMTU exceptions
pmtu_ipv4_geneve6_exception IPv4 over geneve6: PMTU exceptions
pmtu_ipv6_geneve6_exception IPv6 over geneve6: PMTU exceptions
+   pmtu_ipv4_fou4_exceptionIPv4 over fou4: PMTU exceptions
+   pmtu_ipv6_fou4_exceptionIPv6 over fou4: PMTU exceptions
+   pmtu_ipv4_fou6_exceptionIPv4 over fou6: PMTU exceptions
+   pmtu_ipv6_fou6_exceptionIPv6 over fou6: PMTU exceptions
+   pmtu_ipv4_gue4_exceptionIPv4 over gue4: PMTU exceptions
+   pmtu_ipv6_gue4_exceptionIPv6 over gue4: PMTU exceptions
+   pmtu_ipv4_gue6_exceptionIPv4 over gue6: PMTU exceptions
+   pmtu_ipv6_gue6_exceptionIPv6 over gue6: PMTU exceptions
pmtu_vti6_exception vti6: PMTU exceptions
pmtu_vti4_exception vti4: PMTU exceptions
pmtu_vti4_default_mtu   vti4: default MTU assignment
@@ -200,6 +216,89 @@ nsname() {
eval echo \$NS_$1
 }
 
+setup_fou_or_gue() {
+   outer="${1}"
+   inner="${2}"
+   encap="${3}"
+
+   if [ "${outer}" = "4" ]; then
+   modprobe fou || return 2
+   a_addr="${prefix4}.${a_r1}.1"
+   b_addr="${prefix4}.${b_r1}.1"
+   if [ "${inner}" = "4" ]; then
+   type="ipip"
+   ipproto="4"
+   else
+   type="sit"
+   ipproto="41"
+   fi
+   else
+   modprobe fou6 || return 2
+   a_addr="${prefix6}:${a_r1}::1"
+   b_addr="${prefix6}:${b_r1}::1"
+   if [ "${inner}" = "4" ]; then
+   type="ip6tnl"
+   mode="mode ipip6"
+   ipproto="4 -6"
+   else
+   type="ip6tnl"
+   mode="mode ip6ip6"
+   ipproto="41 -6"
+   fi
+   fi
+
+   ${ns_a} ip fou add port  ipproto ${ipproto} || return 2
+   ${ns_a} ip link add ${encap}_a type ${type} ${mode} local ${a_addr} 
remote ${b_addr} encap ${encap} encap-sport auto encap-dport 5556 || return 2
+
+   ${ns_b} ip fou add port 5556 ipproto ${ipproto}
+   ${ns_b} ip link add ${encap}_b type ${type} ${mode} local ${b_addr} 
remote ${a_addr} encap ${encap} encap-sport auto encap-dport 
+
+   if [ "${inner}" = "4" ]; then
+   ${ns_a} ip addr add ${tunnel4_a_addr}/${tunnel4_mask} dev 
${encap}_a
+   ${ns_b} ip addr add ${tunnel4_b_addr}/${tunnel4_mask} dev 
${encap}_b
+   else
+   ${ns_a} ip addr add ${tunnel6_a_addr}/${tunnel6_mask} dev 
${encap}_a
+   ${ns_b} ip addr add ${tunnel6_b_addr}/${tunnel6_mask} dev 
${encap}_b
+   fi
+
+   ${ns_a} ip link set ${encap}_a up
+   ${ns_b} ip link set ${encap}_b up
+
+   sleep 1
+}
+
+setup_fou44() {
+   setup_fou_or_gue 4 4 fou
+}
+
+setup_fou46() {
+   setup_fou_or_gue 4 6 fou
+}
+
+setup_fou64() {
+   setup_fou_or_gue 6 4 fou
+}
+
+setup_fou66() {
+   setup_fou_or_gue 6 6 fou
+}
+
+setup_gue44() {
+   setup_fou_or_gue 4 4 gue
+}
+
+setup_gue46() {
+   setup_fou_or_gue 4 6 gue
+}
+
+setup_gue64() {
+   setup_fou_or_gue 6 4 gue
+}
+
+setup_gue66() {
+   setup_fou_or_gue 6 6 gue
+}
+
 setup_namespaces() {
for n in ${NS_A} ${NS_B} $

[PATCH net-next v2 10/11] fou, fou6: ICMP error handlers for FoU and GUE

2018-11-08 Thread Stefano Brivio

As the destination port in FoU and GUE receiving sockets doesn't
necessarily match the remote destination port, we can't associate errors
to the encapsulating tunnels with a socket lookup -- we need to blindly
try them instead. This means we don't even know if we are handling errors
for FoU or GUE without digging into the packets.

Hence, implement a single handler for both, one for IPv4 and one for IPv6,
that will check whether the packet that generated the ICMP error used a
direct IP encapsulation or if it had a GUE header, and send the error to
the matching protocol handler, if any.

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
v2: no changes

 net/ipv4/fou.c  | 68 +
 net/ipv4/protocol.c |  1 +
 net/ipv6/fou6.c | 74 +
 3 files changed, 143 insertions(+)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 500a59906b87..0d0ad19ecb87 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -3,6 +3,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1003,15 +1004,82 @@ static int gue_build_header(struct sk_buff *skb, struct 
ip_tunnel_encap *e,
return 0;
 }
 
+static int gue_err_proto_handler(int proto, struct sk_buff *skb, u32 info)
+{
+   const struct net_protocol *ipprot = rcu_dereference(inet_protos[proto]);
+
+   if (ipprot && ipprot->err_handler) {
+   if (!ipprot->err_handler(skb, info))
+   return 0;
+   }
+
+   return -ENOENT;
+}
+
+static int gue_err(struct sk_buff *skb, u32 info)
+{
+   int transport_offset = skb_transport_offset(skb);
+   struct guehdr *guehdr;
+   size_t optlen;
+   int ret;
+
+   if (skb->len < sizeof(struct udphdr) + sizeof(struct guehdr))
+   return -EINVAL;
+
+   guehdr = (struct guehdr *)_hdr(skb)[1];
+
+   switch (guehdr->version) {
+   case 0: /* Full GUE header present */
+   break;
+   case 1: {
+   /* Direct encasulation of IPv4 or IPv6 */
+   skb_set_transport_header(skb, -(int)sizeof(struct icmphdr));
+
+   switch (((struct iphdr *)guehdr)->version) {
+   case 4:
+   ret = gue_err_proto_handler(IPPROTO_IPIP, skb, info);
+   goto out;
+#if IS_ENABLED(CONFIG_IPV6)
+   case 6:
+   ret = gue_err_proto_handler(IPPROTO_IPV6, skb, info);
+   goto out;
+#endif
+   default:
+   ret = -EOPNOTSUPP;
+   goto out;
+   }
+   }
+   default: /* Undefined version */
+   return -EOPNOTSUPP;
+   }
+
+   if (guehdr->control)
+   return -ENOENT;
+
+   optlen = guehdr->hlen << 2;
+
+   if (validate_gue_flags(guehdr, optlen))
+   return -EINVAL;
+
+   skb_set_transport_header(skb, -(int)sizeof(struct icmphdr));
+   ret = gue_err_proto_handler(guehdr->proto_ctype, skb, info);
+
+out:
+   skb_set_transport_header(skb, transport_offset);
+   return ret;
+}
+
 
 static const struct ip_tunnel_encap_ops fou_iptun_ops = {
.encap_hlen = fou_encap_hlen,
.build_header = fou_build_header,
+   .err_handler = gue_err,
 };
 
 static const struct ip_tunnel_encap_ops gue_iptun_ops = {
.encap_hlen = gue_encap_hlen,
.build_header = gue_build_header,
+   .err_handler = gue_err,
 };
 
 static int ip_tunnel_encap_add_fou_ops(void)
diff --git a/net/ipv4/protocol.c b/net/ipv4/protocol.c
index 32a691b7ce2c..92d249e053be 100644
--- a/net/ipv4/protocol.c
+++ b/net/ipv4/protocol.c
@@ -29,6 +29,7 @@
 #include 
 
 struct net_protocol __rcu *inet_protos[MAX_INET_PROTOS] __read_mostly;
+EXPORT_SYMBOL(inet_protos);
 const struct net_offload __rcu *inet_offloads[MAX_INET_PROTOS] __read_mostly;
 EXPORT_SYMBOL(inet_offloads);
 
diff --git a/net/ipv6/fou6.c b/net/ipv6/fou6.c
index 6de3c04b0f30..bd675c61deb1 100644
--- a/net/ipv6/fou6.c
+++ b/net/ipv6/fou6.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -69,14 +70,87 @@ static int gue6_build_header(struct sk_buff *skb, struct 
ip_tunnel_encap *e,
return 0;
 }
 
+static int gue6_err_proto_handler(int proto, struct sk_buff *skb,
+ struct inet6_skb_parm *opt,
+ u8 type, u8 code, int offset, u32 info)
+{
+   const struct inet6_protocol *ipprot;
+
+   ipprot = rcu_dereference(inet6_protos[proto]);
+   if (ipprot && ipprot->err_handler) {
+   if (!ipprot->err_handler(skb, opt, type, code, offset, info))
+   return 0;
+   }
+
+   return -ENOENT;
+}
+
+static int gue6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+   u8 type, u8 code, int offset, __be32 info)
+{

[PATCH net-next v2 03/11] vxlan: Allow configuration of DF behaviour

2018-11-08 Thread Stefano Brivio

Allow users to set the IPv4 DF bit in outgoing packets, or to inherit its
value from the IPv4 inner header. If the encapsulated protocol is IPv6 and
DF is configured to be inherited, always set it.

For IPv4, inheriting DF from the inner header was probably intended from
the very beginning judging by the comment to vxlan_xmit(), but it wasn't
actually implemented -- also because it would have done more harm than
good, without handling for ICMP Fragmentation Needed messages.

According to RFC 7348, "Path MTU discovery MAY be used". An expired RFC
draft, draft-saum-nvo3-pmtud-over-vxlan-05, whose purpose was to describe
PMTUD implementation, says that "is a MUST that Vxlan gateways [...]
SHOULD set the DF-bit [...]", whatever that means.

Given this background, the only sane option is probably to let the user
decide, and keep the current behaviour as default.

This only applies to non-lwt tunnels: if an external control plane is
used, tunnel key will still control the DF flag.

v2:
- DF behaviour configuration only applies for non-lwt tunnels, move DF
  setting to if (!info) block in vxlan_xmit_one() (Stephen Hemminger)

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
 drivers/net/vxlan.c  | 29 -
 include/net/vxlan.h  |  1 +
 include/uapi/linux/if_link.h |  9 +
 3 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 0851af6733f3..c3e65e78f015 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2278,13 +2278,24 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct 
net_device *dev,
goto tx_error;
}
 
-   /* Bypass encapsulation if the destination is local */
if (!info) {
+   /* Bypass encapsulation if the destination is local */
err = encap_bypass_if_local(skb, dev, vxlan, dst,
dst_port, ifindex, vni,
>dst, rt->rt_flags);
if (err)
goto out_unlock;
+
+   if (vxlan->cfg.df == VXLAN_DF_SET) {
+   df = htons(IP_DF);
+   } else if (vxlan->cfg.df == VXLAN_DF_INHERIT) {
+   struct ethhdr *eth = eth_hdr(skb);
+
+   if (ntohs(eth->h_proto) == ETH_P_IPV6 ||
+   (ntohs(eth->h_proto) == ETH_P_IP &&
+old_iph->frag_off & htons(IP_DF)))
+   df = htons(IP_DF);
+   }
} else if (info->key.tun_flags & TUNNEL_DONT_FRAGMENT) {
df = htons(IP_DF);
}
@@ -2837,6 +2848,7 @@ static const struct nla_policy 
vxlan_policy[IFLA_VXLAN_MAX + 1] = {
[IFLA_VXLAN_GPE]= { .type = NLA_FLAG, },
[IFLA_VXLAN_REMCSUM_NOPARTIAL]  = { .type = NLA_FLAG },
[IFLA_VXLAN_TTL_INHERIT]= { .type = NLA_FLAG },
+   [IFLA_VXLAN_DF] = { .type = NLA_U8 },
 };
 
 static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[],
@@ -2893,6 +2905,16 @@ static int vxlan_validate(struct nlattr *tb[], struct 
nlattr *data[],
}
}
 
+   if (data[IFLA_VXLAN_DF]) {
+   enum ifla_vxlan_df df = nla_get_u8(data[IFLA_VXLAN_DF]);
+
+   if (df < 0 || df > VXLAN_DF_MAX) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_DF],
+   "Invalid DF attribute");
+   return -EINVAL;
+   }
+   }
+
return 0;
 }
 
@@ -3538,6 +3560,9 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
nlattr *data[],
conf->mtu = nla_get_u32(tb[IFLA_MTU]);
}
 
+   if (data[IFLA_VXLAN_DF])
+   conf->df = nla_get_u8(data[IFLA_VXLAN_DF]);
+
return 0;
 }
 
@@ -3630,6 +3655,7 @@ static size_t vxlan_get_size(const struct net_device *dev)
nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_TTL */
nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_TTL_INHERIT */
nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_TOS */
+   nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_DF */
nla_total_size(sizeof(__be32)) + /* IFLA_VXLAN_LABEL */
nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_LEARNING */
nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_PROXY */
@@ -3696,6 +3722,7 @@ static int vxlan_fill_info(struct sk_buff *skb, const 
struct net_device *dev)
nla_put_u8(skb, IFLA_VXLAN_TTL_INHERIT,
   !!(vxlan->cfg.flags & VXLAN_F_TTL_INHERIT)) ||
nla_put_u8

[PATCH net-next v2 08/11] net: Convert protocol error handlers from void to int

2018-11-08 Thread Stefano Brivio

We'll need this to handle ICMP errors for tunnels without a sending socket
(i.e. FoU and GUE). There, we might have to look up different types of IP
tunnels, registered as network protocols, before we get a match, so we
want this for the error handlers of IPPROTO_IPIP and IPPROTO_IPV6 in both
inet_protos and inet6_protos. These error codes will be used in the next
patch.

For consistency, return sensible error codes in protocol error handlers
whenever handlers can't handle errors because, even if valid, they don't
match a protocol or any of its states.

This has no effect on existing error handling paths.

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
v2: no changes

 include/net/icmp.h|  2 +-
 include/net/protocol.h|  9 ++--
 include/net/sctp/sctp.h   |  2 +-
 include/net/tcp.h |  2 +-
 include/net/udp.h |  2 +-
 net/dccp/ipv4.c   | 13 +++
 net/dccp/ipv6.c   | 13 +++
 net/ipv4/gre_demux.c  |  9 ++--
 net/ipv4/icmp.c   |  6 +++--
 net/ipv4/ip_gre.c | 48 ---
 net/ipv4/ipip.c   | 14 ++--
 net/ipv4/tcp_ipv4.c   | 22 ++
 net/ipv4/tunnel4.c| 18 ++-
 net/ipv4/udp.c| 10 
 net/ipv4/udp_impl.h   |  2 +-
 net/ipv4/udplite.c|  4 ++--
 net/ipv4/xfrm4_protocol.c | 18 ++-
 net/ipv6/icmp.c   |  4 +++-
 net/ipv6/ip6_gre.c| 18 ---
 net/ipv6/tcp_ipv6.c   | 13 +++
 net/ipv6/tunnel6.c| 12 ++
 net/ipv6/udp.c| 18 +++
 net/ipv6/udp_impl.h   |  4 ++--
 net/ipv6/udplite.c|  5 ++--
 net/ipv6/xfrm6_protocol.c | 18 ++-
 net/sctp/input.c  |  5 ++--
 net/sctp/ipv6.c   |  7 --
 27 files changed, 177 insertions(+), 121 deletions(-)

diff --git a/include/net/icmp.h b/include/net/icmp.h
index 3ef2743a8eec..6ac3a5bd0117 100644
--- a/include/net/icmp.h
+++ b/include/net/icmp.h
@@ -41,7 +41,7 @@ struct net;
 
 void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info);
 int icmp_rcv(struct sk_buff *skb);
-void icmp_err(struct sk_buff *skb, u32 info);
+int icmp_err(struct sk_buff *skb, u32 info);
 int icmp_init(void);
 void icmp_out_count(struct net *net, unsigned char type);
 
diff --git a/include/net/protocol.h b/include/net/protocol.h
index 4fc75f7ae23b..92b3eaad6088 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -42,7 +42,10 @@ struct net_protocol {
int (*early_demux)(struct sk_buff *skb);
int (*early_demux_handler)(struct sk_buff *skb);
int (*handler)(struct sk_buff *skb);
-   void(*err_handler)(struct sk_buff *skb, u32 info);
+
+   /* This returns an error if we weren't able to handle the error. */
+   int (*err_handler)(struct sk_buff *skb, u32 info);
+
unsigned intno_policy:1,
netns_ok:1,
/* does the protocol do more stringent
@@ -58,10 +61,12 @@ struct inet6_protocol {
void(*early_demux_handler)(struct sk_buff *skb);
int (*handler)(struct sk_buff *skb);
 
-   void(*err_handler)(struct sk_buff *skb,
+   /* This returns an error if we weren't able to handle the error. */
+   int (*err_handler)(struct sk_buff *skb,
   struct inet6_skb_parm *opt,
   u8 type, u8 code, int offset,
   __be32 info);
+
unsigned intflags;  /* INET6_PROTO_xxx */
 };
 
diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
index 8c2caa370e0f..9a3b48a35e90 100644
--- a/include/net/sctp/sctp.h
+++ b/include/net/sctp/sctp.h
@@ -151,7 +151,7 @@ int sctp_primitive_RECONF(struct net *net, struct 
sctp_association *asoc,
  * sctp/input.c
  */
 int sctp_rcv(struct sk_buff *skb);
-void sctp_v4_err(struct sk_buff *skb, u32 info);
+int sctp_v4_err(struct sk_buff *skb, u32 info);
 void sctp_hash_endpoint(struct sctp_endpoint *);
 void sctp_unhash_endpoint(struct sctp_endpoint *);
 struct sock *sctp_err_lookup(struct net *net, int family, struct sk_buff *,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a18914d20486..4743836bed2e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -313,7 +313,7 @@ extern struct proto tcp_prot;
 
 void tcp_tasklet_init(void);
 
-void tcp_v4_err(struct sk_buff *skb, u32);
+int tcp_v4_err(struct sk_buff *skb, u32);
 
 void tcp_shutdown(struct sock *sk, int how);
 
diff --git a/include/net/udp.h b/include/net/udp.h
index eccca2325ee6..fd6d948755c8 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -283,7 +283,7 @@ bool udp_sk_rx_dst_set(struct sock *sk, struct dst_entry 
*dst);
 int udp_get_port(struct sock *sk, unsigned short snum,
 int (*saddr_cmp

[PATCH net-next v2 07/11] selftests: pmtu: Introduce tests for IPv4/IPv6 over GENEVE over IPv4/IPv6

2018-11-08 Thread Stefano Brivio

Use a router between endpoints, implemented via namespaces, set a low MTU
between router and destination endpoint, exceed it and check PMTU value in
route exceptions.

v2:
- Introduce IPv4 tests right away, if iproute2 doesn't support the 'df'
  link option they will be skipped (David Ahern)

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
 tools/testing/selftests/net/pmtu.sh | 117 
 1 file changed, 86 insertions(+), 31 deletions(-)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index 33cba295ad45..14d20782ccb7 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -43,6 +43,22 @@
 # - pmtu_ipv6_vxlan6_exception
 #  Same as pmtu_ipv4_vxlan6_exception, but send IPv6 packets from A to B
 #
+# - pmtu_ipv4_geneve4_exception
+#  Same as pmtu_ipv4_vxlan4_exception, but using a GENEVE tunnel instead of
+#  VXLAN
+#
+# - pmtu_ipv6_geneve4_exception
+#  Same as pmtu_ipv6_vxlan4_exception, but using a GENEVE tunnel instead of
+#  VXLAN
+#
+# - pmtu_ipv4_geneve6_exception
+#  Same as pmtu_ipv4_vxlan6_exception, but using a GENEVE tunnel instead of
+#  VXLAN
+#
+# - pmtu_ipv6_geneve6_exception
+#  Same as pmtu_ipv6_vxlan6_exception, but using a GENEVE tunnel instead of
+#  VXLAN
+#
 # - pmtu_vti4_exception
 #  Set up vti tunnel on top of veth, with xfrm states and policies, in two
 #  namespaces with matching endpoints. Check that route exception is not
@@ -93,6 +109,10 @@ tests="
pmtu_ipv6_vxlan4_exception  IPv6 over vxlan4: PMTU exceptions
pmtu_ipv4_vxlan6_exception  IPv4 over vxlan6: PMTU exceptions
pmtu_ipv6_vxlan6_exception  IPv6 over vxlan6: PMTU exceptions
+   pmtu_ipv4_geneve4_exception IPv4 over geneve4: PMTU exceptions
+   pmtu_ipv6_geneve4_exception IPv6 over geneve4: PMTU exceptions
+   pmtu_ipv4_geneve6_exception IPv4 over geneve6: PMTU exceptions
+   pmtu_ipv6_geneve6_exception IPv6 over geneve6: PMTU exceptions
pmtu_vti6_exception vti6: PMTU exceptions
pmtu_vti4_exception vti4: PMTU exceptions
pmtu_vti4_default_mtu   vti4: default MTU assignment
@@ -230,32 +250,50 @@ setup_vti6() {
setup_vti 6 ${veth6_a_addr} ${veth6_b_addr} ${tunnel6_a_addr} 
${tunnel6_b_addr} ${tunnel6_mask}
 }
 
-setup_vxlan() {
-   a_addr="${1}"
-   b_addr="${2}"
-   opts="${3}"
+setup_vxlan_or_geneve() {
+   type="${1}"
+   a_addr="${2}"
+   b_addr="${3}"
+   opts="${4}"
+
+   if [ "${type}" = "vxlan" ]; then
+   opts="${opts} ttl 64 dstport 4789"
+   opts_a="local ${a_addr}"
+   opts_b="local ${b_addr}"
+   else
+   opts_a=""
+   opts_b=""
+   fi
 
-   ${ns_a} ip link add vxlan_a type vxlan id 1 local ${a_addr} remote 
${b_addr} ttl 64 dstport 4789 ${opts} || return 1
-   ${ns_b} ip link add vxlan_b type vxlan id 1 local ${b_addr} remote 
${a_addr} ttl 64 dstport 4789 ${opts}
+   ${ns_a} ip link add ${type}_a type ${type} id 1 ${opts_a} remote 
${b_addr} ${opts} || return 1
+   ${ns_b} ip link add ${type}_b type ${type} id 1 ${opts_b} remote 
${a_addr} ${opts}
 
-   ${ns_a} ip addr add ${tunnel4_a_addr}/${tunnel4_mask}   dev vxlan_a
-   ${ns_b} ip addr add ${tunnel4_b_addr}/${tunnel4_mask}   dev vxlan_b
+   ${ns_a} ip addr add ${tunnel4_a_addr}/${tunnel4_mask} dev ${type}_a
+   ${ns_b} ip addr add ${tunnel4_b_addr}/${tunnel4_mask} dev ${type}_b
 
-   ${ns_a} ip addr add ${tunnel6_a_addr}/${tunnel6_mask}   dev vxlan_a
-   ${ns_b} ip addr add ${tunnel6_b_addr}/${tunnel6_mask}   dev vxlan_b
+   ${ns_a} ip addr add ${tunnel6_a_addr}/${tunnel6_mask} dev ${type}_a
+   ${ns_b} ip addr add ${tunnel6_b_addr}/${tunnel6_mask} dev ${type}_b
 
-   ${ns_a} ip link set vxlan_a up
-   ${ns_b} ip link set vxlan_b up
+   ${ns_a} ip link set ${type}_a up
+   ${ns_b} ip link set ${type}_b up
 
sleep 1
 }
 
+setup_geneve4() {
+   setup_vxlan_or_geneve geneve ${prefix4}.${a_r1}.1  ${prefix4}.${b_r1}.1 
 "df set"
+}
+
 setup_vxlan4() {
-   setup_vxlan ${prefix4}.${a_r1}.1 ${prefix4}.${b_r1}.1 "df set"
+   setup_vxlan_or_geneve vxlan  ${prefix4}.${a_r1}.1  ${prefix4}.${b_r1}.1 
 "df set"
+}
+
+setup_geneve6() {
+   setup_vxlan_or_geneve geneve ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1
 }
 
 setup_vxlan6() {
-   setup_vxlan ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1 ""
+   setup_vxlan_or_geneve vxlan  ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1
 }
 
 setup_xfrm() {
@@ -514,22 +552,23 @@ test_pmtu_ipv6_exception() {
test_pmtu_ipvX 6
 }
 
-test_pmtu_ipvX_over_vxlanY_except

[PATCH net-next v2 09/11] udp: Support for error handlers of tunnels with arbitrary destination port

2018-11-08 Thread Stefano Brivio

ICMP error handling is currently not possible for UDP tunnels not
employing a receiving socket with local destination port matching the
remote one, because we have no way to look them up.

Add an err_handler tunnel encapsulation operation that can be exported by
tunnels in order to pass the error to the protocol implementing the
encapsulation. We can't easily use a lookup function as we did for VXLAN
and GENEVE, as protocol error handlers, which would be in turn called by
implementations of this new operation, handle the errors themselves,
together with the tunnel lookup.

Without a socket, we can't be sure which encapsulation error handler is
the appropriate one: encapsulation handlers (the ones for FoU and GUE
introduced in the next patch, e.g.) will need to check the new error codes
returned by protocol handlers to figure out if errors match the given
encapsulation, and, in turn, report this error back, so that we can try
all of them in __udp{4,6}_lib_err_encap_no_sk() until we have a match.

v2:
- Name all arguments in err_handler prototypes (David Miller)

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
 include/net/ip6_tunnel.h |  2 ++
 include/net/ip_tunnels.h |  1 +
 net/ipv4/udp.c   | 70 +++--
 net/ipv6/udp.c   | 75 +++-
 4 files changed, 113 insertions(+), 35 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 236e40ba06bf..69b4bcf880c9 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -69,6 +69,8 @@ struct ip6_tnl_encap_ops {
size_t (*encap_hlen)(struct ip_tunnel_encap *e);
int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e,
u8 *protocol, struct flowi6 *fl6);
+   int (*err_handler)(struct sk_buff *skb, struct inet6_skb_parm *opt,
+  u8 type, u8 code, int offset, __be32 info);
 };
 
 #ifdef CONFIG_INET
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index b0d022ff6ea1..db6b2218a2ad 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -311,6 +311,7 @@ struct ip_tunnel_encap_ops {
size_t (*encap_hlen)(struct ip_tunnel_encap *e);
int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e,
u8 *protocol, struct flowi4 *fl4);
+   int (*err_handler)(struct sk_buff *skb, u32 info);
 };
 
 #define MAX_IPTUN_ENCAP_OPS 8
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index a505ee5eb92c..6f8890c5bc7e 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -105,6 +105,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -590,6 +591,26 @@ void udp_encap_enable(void)
 }
 EXPORT_SYMBOL(udp_encap_enable);
 
+/* Handler for tunnels with arbitrary destination ports: no socket lookup, go
+ * through error handlers in encapsulations looking for a match.
+ */
+static int __udp4_lib_err_encap_no_sk(struct sk_buff *skb, u32 info)
+{
+   int i;
+
+   for (i = 0; i < MAX_IPTUN_ENCAP_OPS; i++) {
+   int (*handler)(struct sk_buff *skb, u32 info);
+
+   if (!iptun_encaps[i])
+   continue;
+   handler = rcu_dereference(iptun_encaps[i]->err_handler);
+   if (handler && !handler(skb, info))
+   return 0;
+   }
+
+   return -ENOENT;
+}
+
 /* Try to match ICMP errors to UDP tunnels by looking up a socket without
  * reversing source and destination port: this will match tunnels that force 
the
  * same destination port on both endpoints (e.g. VXLAN, GENEVE). Note that
@@ -597,28 +618,25 @@ EXPORT_SYMBOL(udp_encap_enable);
  * different destination ports on endpoints, in this case we won't be able to
  * trace ICMP messages back to them.
  *
+ * If this doesn't match any socket, probe tunnels with arbitrary destination
+ * ports (e.g. FoU, GUE): there, the receiving socket is useless, as the port
+ * we've sent packets to won't necessarily match the local destination port.
+ *
  * Then ask the tunnel implementation to match the error against a valid
  * association.
  *
- * Return the socket if we have a match.
+ * Return an error if we can't find a match, the socket if we need further
+ * processing, zero otherwise.
  */
 static struct sock *__udp4_lib_err_encap(struct net *net,
 const struct iphdr *iph,
 struct udphdr *uh,
 struct udp_table *udptable,
-struct sk_buff *skb)
+struct sk_buff *skb, u32 info)
 {
-   int (*lookup)(struct sock *sk, struct sk_buff *skb);
int network_offset, transport_offset;
-   struct udp_sock *up;
struct sock *sk;
 
-   sk = __udp4_lib_lookup(net, iph->daddr, uh->source,
-

[PATCH net-next v2 04/11] selftests: pmtu: Introduce tests for IPv4/IPv6 over VXLAN over IPv4/IPv6

2018-11-08 Thread Stefano Brivio

Use a router between endpoints, implemented via namespaces, set a low MTU
between router and destination endpoint, exceed it and check PMTU value in
route exceptions.

v2:
- Change all occurrences of VxLAN to VXLAN (Jiri Benc)
- Introduce IPv4 tests right away, if iproute2 doesn't support the 'df'
  link option they will be skipped (David Ahern)

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
 tools/testing/selftests/net/pmtu.sh | 143 
 1 file changed, 125 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index a369d616b390..33cba295ad45 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -26,6 +26,23 @@
 # - pmtu_ipv6
 #  Same as pmtu_ipv4, except for locked PMTU tests, using IPv6
 #
+# - pmtu_ipv4_vxlan4_exception
+#  Set up the same network topology as pmtu_ipv4, create a VXLAN tunnel
+#  over IPv4 between A and B, routed via R1. On the link between R1 and B,
+#  set a MTU lower than the VXLAN MTU and the MTU on the link between A and
+#  R1. Send IPv4 packets, exceeding the MTU between R1 and B, over VXLAN
+#  from A to B and check that the PMTU exception is created with the right
+#  value on A
+#
+# - pmtu_ipv6_vxlan4_exception
+#  Same as pmtu_ipv4_vxlan4_exception, but send IPv6 packets from A to B
+#
+# - pmtu_ipv4_vxlan6_exception
+#  Same as pmtu_ipv4_vxlan4_exception, but use IPv6 transport from A to B
+#
+# - pmtu_ipv6_vxlan6_exception
+#  Same as pmtu_ipv4_vxlan6_exception, but send IPv6 packets from A to B
+#
 # - pmtu_vti4_exception
 #  Set up vti tunnel on top of veth, with xfrm states and policies, in two
 #  namespaces with matching endpoints. Check that route exception is not
@@ -72,6 +89,10 @@ which ping6 > /dev/null 2>&1 && ping6=$(which ping6) || 
ping6=$(which ping)
 tests="
pmtu_ipv4_exception ipv4: PMTU exceptions
pmtu_ipv6_exception ipv6: PMTU exceptions
+   pmtu_ipv4_vxlan4_exception  IPv4 over vxlan4: PMTU exceptions
+   pmtu_ipv6_vxlan4_exception  IPv6 over vxlan4: PMTU exceptions
+   pmtu_ipv4_vxlan6_exception  IPv4 over vxlan6: PMTU exceptions
+   pmtu_ipv6_vxlan6_exception  IPv6 over vxlan6: PMTU exceptions
pmtu_vti6_exception vti6: PMTU exceptions
pmtu_vti4_exception vti4: PMTU exceptions
pmtu_vti4_default_mtu   vti4: default MTU assignment
@@ -95,8 +116,8 @@ ns_r2="ip netns exec ${NS_R2}"
 # Addresses are:
 # - IPv4: PREFIX4.SEGMENT.ID (/24)
 # - IPv6: PREFIX6:SEGMENT::ID (/64)
-prefix4="192.168"
-prefix6="fd00"
+prefix4="10.0"
+prefix6="fc00"
 a_r1=1
 a_r2=2
 b_r1=3
@@ -129,12 +150,12 @@ veth6_a_addr="fd00:1::a"
 veth6_b_addr="fd00:1::b"
 veth6_mask="64"
 
-vti4_a_addr="192.168.2.1"
-vti4_b_addr="192.168.2.2"
-vti4_mask="24"
-vti6_a_addr="fd00:2::a"
-vti6_b_addr="fd00:2::b"
-vti6_mask="64"
+tunnel4_a_addr="192.168.2.1"
+tunnel4_b_addr="192.168.2.2"
+tunnel4_mask="24"
+tunnel6_a_addr="fd00:2::a"
+tunnel6_b_addr="fd00:2::b"
+tunnel6_mask="64"
 
 dummy6_0_addr="fc00:1000::0"
 dummy6_1_addr="fc00:1001::0"
@@ -202,11 +223,39 @@ setup_vti() {
 }
 
 setup_vti4() {
-   setup_vti 4 ${veth4_a_addr} ${veth4_b_addr} ${vti4_a_addr} 
${vti4_b_addr} ${vti4_mask}
+   setup_vti 4 ${veth4_a_addr} ${veth4_b_addr} ${tunnel4_a_addr} 
${tunnel4_b_addr} ${tunnel4_mask}
 }
 
 setup_vti6() {
-   setup_vti 6 ${veth6_a_addr} ${veth6_b_addr} ${vti6_a_addr} 
${vti6_b_addr} ${vti6_mask}
+   setup_vti 6 ${veth6_a_addr} ${veth6_b_addr} ${tunnel6_a_addr} 
${tunnel6_b_addr} ${tunnel6_mask}
+}
+
+setup_vxlan() {
+   a_addr="${1}"
+   b_addr="${2}"
+   opts="${3}"
+
+   ${ns_a} ip link add vxlan_a type vxlan id 1 local ${a_addr} remote 
${b_addr} ttl 64 dstport 4789 ${opts} || return 1
+   ${ns_b} ip link add vxlan_b type vxlan id 1 local ${b_addr} remote 
${a_addr} ttl 64 dstport 4789 ${opts}
+
+   ${ns_a} ip addr add ${tunnel4_a_addr}/${tunnel4_mask}   dev vxlan_a
+   ${ns_b} ip addr add ${tunnel4_b_addr}/${tunnel4_mask}   dev vxlan_b
+
+   ${ns_a} ip addr add ${tunnel6_a_addr}/${tunnel6_mask}   dev vxlan_a
+   ${ns_b} ip addr add ${tunnel6_b_addr}/${tunnel6_mask}   dev vxlan_b
+
+   ${ns_a} ip link set vxlan_a up
+   ${ns_b} ip link set vxlan_b up
+
+   sleep 1
+}
+
+setup_vxlan4() {
+   setup_vxlan ${prefix4}.${a_r1}.1 ${prefix4}.${b_r1}.1 "df set"
+}
+
+setup_vxlan6() {
+   setup_vxlan ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1 ""
 }
 
 setup_xfrm() {
@@ -465,6 +514,64 @@ test_pmtu_ipv6_exception() {
test_

[PATCH net-next v2 06/11] geneve: Allow configuration of DF behaviour

2018-11-08 Thread Stefano Brivio

draft-ietf-nvo3-geneve-08 says:

   It is strongly RECOMMENDED that Path MTU Discovery ([RFC1191],
   [RFC1981]) be used by setting the DF bit in the IP header when Geneve
   packets are transmitted over IPv4 (this is the default with IPv6).

Now that ICMP error handling is working for GENEVE, we can comply with
this recommendation.

Make this configurable, though, to avoid breaking existing setups. By
default, DF won't be set. It can be set or inherited from inner IPv4
packets. If it's configured to be inherited and we are encapsulating IPv6,
it will be set.

This only applies to non-lwt tunnels: if an external control plane is
used, tunnel key will still control the DF flag.

v2:
- DF behaviour configuration only applies for non-lwt tunnels, apply DF
  setting only if (!geneve->collect_md) in geneve_xmit_skb()
  (Stephen Hemminger)

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
 drivers/net/geneve.c | 55 ++--
 include/uapi/linux/if_link.h |  9 ++
 2 files changed, 56 insertions(+), 8 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 4dc457489770..7c53e06b31c3 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -70,6 +70,7 @@ struct geneve_dev {
bool   collect_md;
bool   use_udp6_rx_checksums;
bool   ttl_inherit;
+   enum ifla_geneve_df df;
 };
 
 struct geneve_sock {
@@ -875,8 +876,8 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct 
net_device *dev,
struct rtable *rt;
struct flowi4 fl4;
__u8 tos, ttl;
+   __be16 df = 0;
__be16 sport;
-   __be16 df;
int err;
 
rt = geneve_get_v4_rt(skb, dev, gs4, , info);
@@ -890,6 +891,8 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct 
net_device *dev,
if (geneve->collect_md) {
tos = ip_tunnel_ecn_encap(key->tos, ip_hdr(skb), skb);
ttl = key->ttl;
+
+   df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
} else {
tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb);
if (geneve->ttl_inherit)
@@ -897,8 +900,22 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct 
net_device *dev,
else
ttl = key->ttl;
ttl = ttl ? : ip4_dst_hoplimit(>dst);
+
+   if (geneve->df == GENEVE_DF_SET) {
+   df = htons(IP_DF);
+   } else if (geneve->df == GENEVE_DF_INHERIT) {
+   struct ethhdr *eth = eth_hdr(skb);
+
+   if (ntohs(eth->h_proto) == ETH_P_IPV6) {
+   df = htons(IP_DF);
+   } else if (ntohs(eth->h_proto) == ETH_P_IP) {
+   struct iphdr *iph = ip_hdr(skb);
+
+   if (iph->frag_off & htons(IP_DF))
+   df = htons(IP_DF);
+   }
+   }
}
-   df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
 
err = geneve_build_skb(>dst, skb, info, xnet, sizeof(struct iphdr));
if (unlikely(err))
@@ -1145,6 +1162,7 @@ static const struct nla_policy 
geneve_policy[IFLA_GENEVE_MAX + 1] = {
[IFLA_GENEVE_UDP_ZERO_CSUM6_TX] = { .type = NLA_U8 },
[IFLA_GENEVE_UDP_ZERO_CSUM6_RX] = { .type = NLA_U8 },
[IFLA_GENEVE_TTL_INHERIT]   = { .type = NLA_U8 },
+   [IFLA_GENEVE_DF]= { .type = NLA_U8 },
 };
 
 static int geneve_validate(struct nlattr *tb[], struct nlattr *data[],
@@ -1180,6 +1198,16 @@ static int geneve_validate(struct nlattr *tb[], struct 
nlattr *data[],
}
}
 
+   if (data[IFLA_GENEVE_DF]) {
+   enum ifla_geneve_df df = nla_get_u8(data[IFLA_GENEVE_DF]);
+
+   if (df < 0 || df > GENEVE_DF_MAX) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_GENEVE_DF],
+   "Invalid DF attribute");
+   return -EINVAL;
+   }
+   }
+
return 0;
 }
 
@@ -1225,7 +1253,7 @@ static int geneve_configure(struct net *net, struct 
net_device *dev,
struct netlink_ext_ack *extack,
const struct ip_tunnel_info *info,
bool metadata, bool ipv6_rx_csum,
-   bool ttl_inherit)
+   bool ttl_inherit, enum ifla_geneve_df df)
 {
struct geneve_net *gn = net_generic(net, geneve_net_id);
struct geneve_dev *t, *geneve = netdev_priv(dev);
@@ -1275,6 +1303,7 @@ static int geneve_configure(struct net *net, struct 
net_device *dev,
geneve->collect_md = metadata;
geneve->use_udp6_rx_checksums = ipv6_rx_csum;
geneve->ttl_

[PATCH net-next v2 05/11] geneve: ICMP error lookup handler

2018-11-08 Thread Stefano Brivio

Export an encap_err_lookup() operation to match an ICMP error against a
valid VNI.

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
v2: no changes

 drivers/net/geneve.c | 52 
 1 file changed, 52 insertions(+)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index fbfc13d81f66..4dc457489770 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -387,6 +387,57 @@ static int geneve_udp_encap_recv(struct sock *sk, struct 
sk_buff *skb)
return 0;
 }
 
+/* Callback from net/ipv{4,6}/udp.c to check that we have a tunnel for errors 
*/
+static int geneve_udp_encap_err_lookup(struct sock *sk, struct sk_buff *skb)
+{
+   struct genevehdr *geneveh;
+   struct geneve_sock *gs;
+   u8 zero_vni[3] = { 0 };
+   u8 *vni = zero_vni;
+
+   if (skb->len < GENEVE_BASE_HLEN)
+   return -EINVAL;
+
+   geneveh = geneve_hdr(skb);
+   if (geneveh->ver != GENEVE_VER)
+   return -EINVAL;
+
+   if (geneveh->proto_type != htons(ETH_P_TEB))
+   return -EINVAL;
+
+   gs = rcu_dereference_sk_user_data(sk);
+   if (!gs)
+   return -ENOENT;
+
+   if (geneve_get_sk_family(gs) == AF_INET) {
+   struct iphdr *iph = ip_hdr(skb);
+   __be32 addr4 = 0;
+
+   if (!gs->collect_md) {
+   vni = geneve_hdr(skb)->vni;
+   addr4 = iph->daddr;
+   }
+
+   return geneve_lookup(gs, addr4, vni) ? 0 : -ENOENT;
+   }
+
+#if IS_ENABLED(CONFIG_IPV6)
+   if (geneve_get_sk_family(gs) == AF_INET6) {
+   struct ipv6hdr *ip6h = ipv6_hdr(skb);
+   struct in6_addr addr6 = { 0 };
+
+   if (!gs->collect_md) {
+   vni = geneve_hdr(skb)->vni;
+   addr6 = ip6h->daddr;
+   }
+
+   return geneve6_lookup(gs, addr6, vni) ? 0 : -ENOENT;
+   }
+#endif
+
+   return -EPFNOSUPPORT;
+}
+
 static struct socket *geneve_create_sock(struct net *net, bool ipv6,
 __be16 port, bool ipv6_rx_csum)
 {
@@ -544,6 +595,7 @@ static struct geneve_sock *geneve_socket_create(struct net 
*net, __be16 port,
tunnel_cfg.gro_receive = geneve_gro_receive;
tunnel_cfg.gro_complete = geneve_gro_complete;
tunnel_cfg.encap_rcv = geneve_udp_encap_recv;
+   tunnel_cfg.encap_err_lookup = geneve_udp_encap_err_lookup;
tunnel_cfg.encap_destroy = NULL;
setup_udp_tunnel_sock(net, sock, _cfg);
list_add(>list, >sock_list);
-- 
2.19.1

[PATCH net-next v2 01/11] udp: Handle ICMP errors for tunnels with same destination port on both endpoints

2018-11-08 Thread Stefano Brivio

For both IPv4 and IPv6, if we can't match errors to a socket, try
tunnels before ignoring them. Look up a socket with the original source
and destination ports as found in the UDP packet inside the ICMP payload,
this will work for tunnels that force the same destination port for both
endpoints, i.e. VXLAN and GENEVE.

Actually, lwtunnels could break this assumption if they are configured by
an external control plane to have different destination ports on the
endpoints: in this case, we won't be able to trace ICMP messages back to
them.

For IPv6 redirect messages, call ip6_redirect() directly with the output
interface argument set to the interface we received the packet from (as
it's the very interface we should build the exception on), otherwise the
new nexthop will be rejected. There's no such need for IPv4.

Tunnels can now export an encap_err_lookup() operation that indicates a
match. Pass the packet to the lookup function, and if the tunnel driver
reports a matching association, continue with regular ICMP error handling.

v2:
- Added newline between network and transport header sets in
  __udp{4,6}_lib_err_encap() (David Miller)
- Removed redundant skb_reset_network_header(skb); in
  __udp4_lib_err_encap()
- Removed redundant reassignment of iph in __udp4_lib_err_encap()
  (Sabrina Dubroca)
- Edited comment to __udp{4,6}_lib_err_encap() to reflect the fact this
  won't work with lwtunnels configured to use asymmetric ports. By the way,
  it's VXLAN, not VxLAN (Jiri Benc)

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
 include/linux/udp.h  |  1 +
 include/net/udp_tunnel.h |  3 ++
 net/ipv4/udp.c   | 79 +++
 net/ipv4/udp_tunnel.c|  1 +
 net/ipv6/udp.c   | 89 +++-
 5 files changed, 153 insertions(+), 20 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 0a9c54e76305..2725c83395bf 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -77,6 +77,7 @@ struct udp_sock {
 * For encapsulation sockets.
 */
int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
+   int (*encap_err_lookup)(struct sock *sk, struct sk_buff *skb);
void (*encap_destroy)(struct sock *sk);
 
/* GRO functions for UDP socket */
diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index 3fbe56430e3b..dc8d804af3b4 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -64,6 +64,8 @@ static inline int udp_sock_create(struct net *net,
 }
 
 typedef int (*udp_tunnel_encap_rcv_t)(struct sock *sk, struct sk_buff *skb);
+typedef int (*udp_tunnel_encap_err_lookup_t)(struct sock *sk,
+struct sk_buff *skb);
 typedef void (*udp_tunnel_encap_destroy_t)(struct sock *sk);
 typedef struct sk_buff *(*udp_tunnel_gro_receive_t)(struct sock *sk,
struct list_head *head,
@@ -76,6 +78,7 @@ struct udp_tunnel_sock_cfg {
/* Used for setting up udp_sock fields, see udp.h for details */
__u8  encap_type;
udp_tunnel_encap_rcv_t encap_rcv;
+   udp_tunnel_encap_err_lookup_t encap_err_lookup;
udp_tunnel_encap_destroy_t encap_destroy;
udp_tunnel_gro_receive_t gro_receive;
udp_tunnel_gro_complete_t gro_complete;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 3488650b90ac..ce759b61f6cd 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -583,6 +583,62 @@ static inline bool __udp_is_mcast_sock(struct net *net, 
struct sock *sk,
return true;
 }
 
+DEFINE_STATIC_KEY_FALSE(udp_encap_needed_key);
+void udp_encap_enable(void)
+{
+   static_branch_enable(_encap_needed_key);
+}
+EXPORT_SYMBOL(udp_encap_enable);
+
+/* Try to match ICMP errors to UDP tunnels by looking up a socket without
+ * reversing source and destination port: this will match tunnels that force 
the
+ * same destination port on both endpoints (e.g. VXLAN, GENEVE). Note that
+ * lwtunnels might actually break this assumption by being configured with
+ * different destination ports on endpoints, in this case we won't be able to
+ * trace ICMP messages back to them.
+ *
+ * Then ask the tunnel implementation to match the error against a valid
+ * association.
+ *
+ * Return the socket if we have a match.
+ */
+static struct sock *__udp4_lib_err_encap(struct net *net,
+const struct iphdr *iph,
+struct udphdr *uh,
+struct udp_table *udptable,
+struct sk_buff *skb)
+{
+   int (*lookup)(struct sock *sk, struct sk_buff *skb);
+   int network_offset, transport_offset;
+   struct udp_sock *up;
+   struct sock *sk;
+
+   sk = __udp4_lib_lookup(net, iph->daddr, uh->source,
+  iph->saddr, uh->dest, skb->

[PATCH net-next v2 00/11] ICMP error handling for UDP tunnels

2018-11-08 Thread Stefano Brivio

This series introduces ICMP error handling for UDP tunnels and
encapsulations and related selftests. We need to handle ICMP errors to
support PMTU discovery and route redirection -- this support is entirely
missing right now:

- patch 1/11 adds a socket lookup for UDP tunnels that use, by design,
  the same destination port on both endpoints -- i.e. VXLAN and GENEVE
- patches 2/11 to 7/11 are specific to VxLAN and GENEVE
- patches 8/11 and 9/11 add infrastructure for lookup of encapsulations
  where sent packets cannot be matched via receiving socket lookup, i.e.
  FoU and GUE
- patches 10/11 and 11/11 are specific to FoU and GUE

v2: changes are listed in the single patches

Stefano Brivio (11):
  udp: Handle ICMP errors for tunnels with same destination port on both
endpoints
  vxlan: ICMP error lookup handler
  vxlan: Allow configuration of DF behaviour
  selftests: pmtu: Introduce tests for IPv4/IPv6 over VXLAN over
IPv4/IPv6
  geneve: ICMP error lookup handler
  geneve: Allow configuration of DF behaviour
  selftests: pmtu: Introduce tests for IPv4/IPv6 over GENEVE over
IPv4/IPv6
  net: Convert protocol error handlers from void to int
  udp: Support for error handlers of tunnels with arbitrary destination
port
  fou, fou6: ICMP error handlers for FoU and GUE
  selftests: pmtu: Introduce FoU and GUE PMTU exceptions tests

 drivers/net/geneve.c| 107 +++-
 drivers/net/vxlan.c |  58 -
 include/linux/udp.h |   1 +
 include/net/icmp.h  |   2 +-
 include/net/ip6_tunnel.h|   2 +
 include/net/ip_tunnels.h|   1 +
 include/net/protocol.h  |   9 +-
 include/net/sctp/sctp.h |   2 +-
 include/net/tcp.h   |   2 +-
 include/net/udp.h   |   2 +-
 include/net/udp_tunnel.h|   3 +
 include/net/vxlan.h |   1 +
 include/uapi/linux/if_link.h|  18 ++
 net/dccp/ipv4.c |  13 +-
 net/dccp/ipv6.c |  13 +-
 net/ipv4/fou.c  |  68 +
 net/ipv4/gre_demux.c|   9 +-
 net/ipv4/icmp.c |   6 +-
 net/ipv4/ip_gre.c   |  48 ++--
 net/ipv4/ipip.c |  14 +-
 net/ipv4/protocol.c |   1 +
 net/ipv4/tcp_ipv4.c |  22 +-
 net/ipv4/tunnel4.c  |  18 +-
 net/ipv4/udp.c  | 121 -
 net/ipv4/udp_impl.h |   2 +-
 net/ipv4/udp_tunnel.c   |   1 +
 net/ipv4/udplite.c  |   4 +-
 net/ipv4/xfrm4_protocol.c   |  18 +-
 net/ipv6/fou6.c |  74 ++
 net/ipv6/icmp.c |   4 +-
 net/ipv6/ip6_gre.c  |  18 +-
 net/ipv6/tcp_ipv6.c |  13 +-
 net/ipv6/tunnel6.c  |  12 +-
 net/ipv6/udp.c  | 146 +--
 net/ipv6/udp_impl.h |   4 +-
 net/ipv6/udplite.c  |   5 +-
 net/ipv6/xfrm6_protocol.c   |  18 +-
 net/sctp/input.c|   5 +-
 net/sctp/ipv6.c |   7 +-
 tools/testing/selftests/net/pmtu.sh | 377 ++--
 40 files changed, 1083 insertions(+), 166 deletions(-)

-- 
2.19.1

[PATCH net-next v2 02/11] vxlan: ICMP error lookup handler

2018-11-08 Thread Stefano Brivio

Export an encap_err_lookup() operation to match an ICMP error against a
valid VNI.

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
v2: no changes

 drivers/net/vxlan.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ae969f806d56..0851af6733f3 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1552,6 +1552,34 @@ static int vxlan_rcv(struct sock *sk, struct sk_buff 
*skb)
return 0;
 }
 
+/* Callback from net/ipv{4,6}/udp.c to check that we have a VNI for errors */
+static int vxlan_err_lookup(struct sock *sk, struct sk_buff *skb)
+{
+   struct vxlan_dev *vxlan;
+   struct vxlan_sock *vs;
+   struct vxlanhdr *hdr;
+   __be32 vni;
+
+   if (skb->len < VXLAN_HLEN)
+   return -EINVAL;
+
+   hdr = vxlan_hdr(skb);
+
+   if (!(hdr->vx_flags & VXLAN_HF_VNI))
+   return -EINVAL;
+
+   vs = rcu_dereference_sk_user_data(sk);
+   if (!vs)
+   return -ENOENT;
+
+   vni = vxlan_vni(hdr->vx_vni);
+   vxlan = vxlan_vs_find_vni(vs, skb->dev->ifindex, vni);
+   if (!vxlan)
+   return -ENOENT;
+
+   return 0;
+}
+
 static int arp_reduce(struct net_device *dev, struct sk_buff *skb, __be32 vni)
 {
struct vxlan_dev *vxlan = netdev_priv(dev);
@@ -2948,6 +2976,7 @@ static struct vxlan_sock *vxlan_socket_create(struct net 
*net, bool ipv6,
tunnel_cfg.sk_user_data = vs;
tunnel_cfg.encap_type = 1;
tunnel_cfg.encap_rcv = vxlan_rcv;
+   tunnel_cfg.encap_err_lookup = vxlan_err_lookup;
tunnel_cfg.encap_destroy = NULL;
tunnel_cfg.gro_receive = vxlan_gro_receive;
tunnel_cfg.gro_complete = vxlan_gro_complete;
-- 
2.19.1

Re: [PATCH iproute2 net-next 1/2] iplink_vxlan: Add DF configuration

2018-11-07 Thread Stefano Brivio

On Wed, 7 Nov 2018 13:03:34 -0700
David Ahern  wrote:

> On 11/6/18 2:39 PM, Stefano Brivio wrote:
> > diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
> > index 7fc0e2b4eb06..86afbe1334f0 100644
> > --- a/ip/iplink_vxlan.c
> > +++ b/ip/iplink_vxlan.c
> > @@ -31,6 +31,7 @@ static void print_explain(FILE *f)
> > " [ local ADDR ]\n"
> > " [ ttl TTL ]\n"
> > " [ tos TOS ]\n"
> > +   " [ df DF ]\n"
> > " [ flowlabel LABEL ]\n"
> > " [ dev PHYS_DEV ]\n"
> > " [ dstport PORT ]\n"  
> 
> Since it is the df bit, that user option seems fine to me. Should be ok
> to use that for your probe on iproute2 support.

Okay, then I'll use it in v2 of the kernel series.

> That said, the man-page update should spell out what df refers to.

Sure, I'll add that.

-- 
Stefano

Re: [PATCH net-next 04/11] selftests: pmtu: Introduce tests for IPv4/IPv6 over VxLAN over IPv6

2018-11-07 Thread Stefano Brivio

On Wed, 7 Nov 2018 12:28:21 -0700
David Ahern  wrote:

> On 11/6/18 2:39 PM, Stefano Brivio wrote:
> > Use a router between endpoints, implemented via namespaces, set a low MTU
> > between router and destination endpoint, exceed it and check PMTU value in
> > route exceptions.
> > 
> > Reviewed-by: Sabrina Dubroca 
> > Signed-off-by: Stefano Brivio 
> > ---
> > This only introduces tests over VxLAN over IPv6 right now. I'll introduce
> > tests over IPv4 (they can be added trivially) once DF configuration support
> > is accepted into iproute2.  
> 
> you can add them now and wrapped in a 'does ip support the df option'
> check. That is needed regardless of order (kernel vs iproute2).

True, I thought about that, but then I also thought that if we end up
with a different syntax for the iproute2 command, that becomes ugly.

Then yes, the check would be there -- it's actually already there, that
|| return 1 after the ip-link command.

-- 
Stefano

Re: [PATCH net-next 00/11] ICMP error handling for UDP tunnels

2018-11-07 Thread Stefano Brivio

On Wed, 7 Nov 2018 12:09:51 +0100
Jiri Benc  wrote:

> On Tue,  6 Nov 2018 22:38:56 +0100, Stefano Brivio wrote:
> > - patch 1/11 adds a socket lookup for UDP tunnels that use, by design,
> >   the same destination port on both endpoints -- i.e. VxLAN and GENEVE  
> 
> This is not necessarily true with lwtunnels (collect_md mode of VXLAN
> and GENEVE). While any sane setup will use the same dst ports, there's
> really nothing that enforces it. Of course, in that case we have no way
> to map the ICMP error back to the tunnel.

Right, thanks for pointing that out. I will expand on that in the
comments to __udp{4,6}_lib_err_encap().

> Generally speaking, I'm not sure how ICMP error handling should work
> for external control planes. Are we sure they want PMTU discovery and
> route redirection done by the kernel? (I am not sure, neither way.)

I'm not sure either, even though I have a slight preference on
making this work by default, rather than not even giving lwtunnels a
chance by dropping ICMP messages, as it currently stands.

By the way, if needed, it's easy to disable or make it configurable:

- in geneve_udp_encap_err_lookup():
if (gs->collect_md)
return -ENOENT;

- in vxlan_err_lookup():
if (vxlan_collect_metadata(gs))
return -ENOENT;

but I would rather deal with this at a later moment, and only if the
need arises.

-- 
Stefano

Re: [PATCH net-next 03/11] vxlan: Allow configuration of DF behaviour

2018-11-07 Thread Stefano Brivio

Stephen, thanks for reviewing this.

On Tue, 6 Nov 2018 21:00:18 -0800
Stephen Hemminger  wrote:

> On Tue,  6 Nov 2018 22:38:59 +0100
> Stefano Brivio  wrote:
> 
> > df = htons(IP_DF);
> > }
> >  
> > +   if (!df) {
> > +   if (vxlan->cfg.df == VXLAN_DF_SET) {
> > +   df = htons(IP_DF);  
> 
> I am confused, this looks like this new flag is duplicating the exiting 
> tunnel DF flag.
> (in info->key.tun.flags). Why is another flag needed?

The reason is that for non-lwt tunnels the tunnel key is not used in
VXLAN, and this patch is intended for non-lwt tunnels only, as external
control planes already have a way to set the DF bit.

But now I see that the way I wrote this is misleading: this should
really be in an if (rdst) or if (!info) clause. I'll place this into
the if (!info) block just above. TTL and TOS are handled in a similar
way, using the if (rdst) clause above.

Functionally, it's equivalent, because external control planes will
never set non-default values for vxlan->cfg.df, but still not a good
reason to write it this way.

Similar story for GENEVE, I will place that under if
(!geneve->collect_md) instead.

With a notable difference, though: GENEVE already uses struct
ip_tunnel_key for some of its interface configuration, so I had already
thought about adding a TUNNEL_DONT_FRAGMENT_INHERIT flag that could be
used in tun_flags.

However, I would use the last available bit there, and this wouldn't be
terribly useful: an external control plane already has access to the
inner DF bit, and would most likely decide on its own whether to set DF
or not, by setting that in tun_flags. So I'd rather keep that in struct
geneve_dev.

-- 
Stefano

[PATCH iproute2 net-next 0/2] Add DF configuration for VxLAN and GENEVE link types

2018-11-06 Thread Stefano Brivio

This series adds configuration of the DF bit in outgoing IPv4 packets for
VxLAN and GENEVE link types.

I also included uapi/linux/if_link.h changes for convenience, please let
me know if I should repost without them.

Stefano Brivio (2):
  iplink_vxlan: Add DF configuration
  iplink_geneve: Add DF configuration

 include/uapi/linux/if_link.h | 18 ++
 ip/iplink_geneve.c   | 29 +
 ip/iplink_vxlan.c| 29 +
 man/man8/ip-link.8.in| 28 
 4 files changed, 104 insertions(+)

-- 
2.19.1

[PATCH iproute2 net-next 2/2] iplink_geneve: Add DF configuration

2018-11-06 Thread Stefano Brivio

Allow to set the DF bit behaviour for outgoing IPv4 packets: it can be
always on, inherited from the inner header, or, by default, always off,
which is the current behaviour.

Signed-off-by: Stefano Brivio 
---
 include/uapi/linux/if_link.h |  9 +
 ip/iplink_geneve.c   | 29 +
 man/man8/ip-link.8.in| 14 ++
 3 files changed, 52 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 4caf683ce546..183ca7527178 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -563,10 +563,19 @@ enum {
IFLA_GENEVE_UDP_ZERO_CSUM6_RX,
IFLA_GENEVE_LABEL,
IFLA_GENEVE_TTL_INHERIT,
+   IFLA_GENEVE_DF,
__IFLA_GENEVE_MAX
 };
 #define IFLA_GENEVE_MAX(__IFLA_GENEVE_MAX - 1)
 
+enum ifla_geneve_df {
+   GENEVE_DF_UNSET = 0,
+   GENEVE_DF_SET,
+   GENEVE_DF_INHERIT,
+   __GENEVE_DF_END,
+   GENEVE_DF_MAX = __GENEVE_DF_END - 1,
+};
+
 /* PPP section */
 enum {
IFLA_PPP_UNSPEC,
diff --git a/ip/iplink_geneve.c b/ip/iplink_geneve.c
index c417842b2a5b..1872b74c5d70 100644
--- a/ip/iplink_geneve.c
+++ b/ip/iplink_geneve.c
@@ -24,6 +24,7 @@ static void print_explain(FILE *f)
"  remote ADDR\n"
"  [ ttl TTL ]\n"
"  [ tos TOS ]\n"
+   "  [ df DF ]\n"
"  [ flowlabel LABEL ]\n"
"  [ dstport PORT ]\n"
"  [ [no]external ]\n"
@@ -35,6 +36,7 @@ static void print_explain(FILE *f)
"   ADDR  := IP_ADDRESS\n"
"   TOS   := { NUMBER | inherit }\n"
"   TTL   := { 1..255 | auto | inherit }\n"
+   "   DF:= { unset | set | inherit }\n"
"   LABEL := 0-1048575\n"
);
 }
@@ -115,6 +117,22 @@ static int geneve_parse_opt(struct link_util *lu, int 
argc, char **argv,
tos = uval;
} else
tos = 1;
+   } else if (!matches(*argv, "df")) {
+   enum ifla_geneve_df df;
+
+   NEXT_ARG();
+   check_duparg(, IFLA_GENEVE_DF, "df", *argv);
+   if (strcmp(*argv, "unset") == 0)
+   df = GENEVE_DF_UNSET;
+   else if (strcmp(*argv, "set") == 0)
+   df = GENEVE_DF_SET;
+   else if (strcmp(*argv, "inherit") == 0)
+   df = GENEVE_DF_INHERIT;
+   else
+   invarg("DF must be 'unset', 'set' or 'inherit'",
+  *argv);
+
+   addattr8(n, 1024, IFLA_GENEVE_DF, df);
} else if (!matches(*argv, "label") ||
   !matches(*argv, "flowlabel")) {
__u32 uval;
@@ -287,6 +305,17 @@ static void geneve_print_opt(struct link_util *lu, FILE 
*f, struct rtattr *tb[])
print_string(PRINT_FP, NULL, "tos %s ", "inherit");
}
 
+   if (tb[IFLA_GENEVE_DF]) {
+   enum ifla_geneve_df df = rta_getattr_u8(tb[IFLA_GENEVE_DF]);
+
+   if (df == GENEVE_DF_UNSET)
+   print_string(PRINT_JSON, "df", "df %s ", "unset");
+   else if (df == GENEVE_DF_SET)
+   print_string(PRINT_ANY, "df", "df %s ", "set");
+   else if (df == GENEVE_DF_INHERIT)
+   print_string(PRINT_ANY, "df", "df %s ", "inherit");
+   }
+
if (tb[IFLA_GENEVE_LABEL]) {
__u32 label = rta_getattr_u32(tb[IFLA_GENEVE_LABEL]);
 
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 1b899dd06b92..568e0aa02579 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -1180,6 +1180,8 @@ the following additional arguments are supported:
 ] [
 .BI tos " TOS "
 ] [
+.BI df " DF "
+] [
 .BI flowlabel " FLOWLABEL "
 ] [
 .BI dstport " PORT"
@@ -1212,6 +1214,18 @@ ttl. Default option is "0".
 .BI tos " TOS"
 - specifies the TOS value to use in outgoing packets.
 
+.sp
+.BI df " DF"
+- specifies the usage of the DF bit in outgoing packets with IPv4 headers.
+The value
+.B inherit
+causes the bit to be copied from the original IP header. The values
+.B unset
+and
+.B set
+cause the bit to be always unset or always set, respectively. By default, the
+bit is not set.
+
 .sp
 .BI flowlabel " FLOWLABEL"
 - specifies the flow label to use in outgoing packets.
-- 
2.19.1

[PATCH iproute2 net-next 1/2] iplink_vxlan: Add DF configuration

2018-11-06 Thread Stefano Brivio

Allow to set the DF bit behaviour for outgoing IPv4 packets: it can be
always on, inherited from the inner header, or, by default, always off,
which is the current behaviour.

Signed-off-by: Stefano Brivio 
---
 include/uapi/linux/if_link.h |  9 +
 ip/iplink_vxlan.c| 29 +
 man/man8/ip-link.8.in| 14 ++
 3 files changed, 52 insertions(+)

diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 9c254603ebda..4caf683ce546 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -530,6 +530,7 @@ enum {
IFLA_VXLAN_LABEL,
IFLA_VXLAN_GPE,
IFLA_VXLAN_TTL_INHERIT,
+   IFLA_VXLAN_DF,
__IFLA_VXLAN_MAX
 };
 #define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1)
@@ -539,6 +540,14 @@ struct ifla_vxlan_port_range {
__be16  high;
 };
 
+enum ifla_vxlan_df {
+   VXLAN_DF_UNSET = 0,
+   VXLAN_DF_SET,
+   VXLAN_DF_INHERIT,
+   __VXLAN_DF_END,
+   VXLAN_DF_MAX = __VXLAN_DF_END - 1,
+};
+
 /* GENEVE section */
 enum {
IFLA_GENEVE_UNSPEC,
diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index 7fc0e2b4eb06..86afbe1334f0 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -31,6 +31,7 @@ static void print_explain(FILE *f)
" [ local ADDR ]\n"
" [ ttl TTL ]\n"
" [ tos TOS ]\n"
+   " [ df DF ]\n"
" [ flowlabel LABEL ]\n"
" [ dev PHYS_DEV ]\n"
" [ dstport PORT ]\n"
@@ -52,6 +53,7 @@ static void print_explain(FILE *f)
"   ADDR  := { IP_ADDRESS | any }\n"
"   TOS   := { NUMBER | inherit }\n"
"   TTL   := { 1..255 | auto | inherit }\n"
+   "   DF:= { unset | set | inherit }\n"
"   LABEL := 0-1048575\n"
);
 }
@@ -170,6 +172,22 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, 
char **argv,
} else
tos = 1;
addattr8(n, 1024, IFLA_VXLAN_TOS, tos);
+   } else if (!matches(*argv, "df")) {
+   enum ifla_vxlan_df df;
+
+   NEXT_ARG();
+   check_duparg(, IFLA_VXLAN_DF, "df", *argv);
+   if (strcmp(*argv, "unset") == 0)
+   df = VXLAN_DF_UNSET;
+   else if (strcmp(*argv, "set") == 0)
+   df = VXLAN_DF_SET;
+   else if (strcmp(*argv, "inherit") == 0)
+   df = VXLAN_DF_INHERIT;
+   else
+   invarg("DF must be 'unset', 'set' or 'inherit'",
+  *argv);
+
+   addattr8(n, 1024, IFLA_VXLAN_DF, df);
} else if (!matches(*argv, "label") ||
   !matches(*argv, "flowlabel")) {
__u32 uval;
@@ -538,6 +556,17 @@ static void vxlan_print_opt(struct link_util *lu, FILE *f, 
struct rtattr *tb[])
print_string(PRINT_FP, NULL, "ttl %s ", "auto");
}
 
+   if (tb[IFLA_VXLAN_DF]) {
+   enum ifla_vxlan_df df = rta_getattr_u8(tb[IFLA_VXLAN_DF]);
+
+   if (df == VXLAN_DF_UNSET)
+   print_string(PRINT_JSON, "df", "df %s ", "unset");
+   else if (df == VXLAN_DF_SET)
+   print_string(PRINT_ANY, "df", "df %s ", "set");
+   else if (df == VXLAN_DF_INHERIT)
+   print_string(PRINT_ANY, "df", "df %s ", "inherit");
+   }
+
if (tb[IFLA_VXLAN_LABEL]) {
__u32 label = rta_getattr_u32(tb[IFLA_VXLAN_LABEL]);
 
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 5132f514b279..1b899dd06b92 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -496,6 +496,8 @@ the following additional arguments are supported:
 ] [
 .BI tos " TOS "
 ] [
+.BI df " DF "
+] [
 .BI flowlabel " FLOWLABEL "
 ] [
 .BI dstport " PORT "
@@ -565,6 +567,18 @@ parameter.
 .BI tos " TOS"
 - specifies the TOS value to use in outgoing packets.
 
+.sp
+.BI df " DF"
+- specifies the usage of the DF bit in outgoing packets with IPv4 headers.
+The value
+.B inherit
+causes the bit to be copied from the original IP header. The values
+.B unset
+and
+.B set
+cause the bit to be always unset or always set, respectively. By default, the
+bit is not set.
+
 .sp
 .BI flowlabel " FLOWLABEL"
 - specifies the flow label to use in outgoing packets.
-- 
2.19.1

[PATCH net-next 03/11] vxlan: Allow configuration of DF behaviour

2018-11-06 Thread Stefano Brivio

Allow users to set the IPv4 DF bit in outgoing packets, or to inherit its
value from the IPv4 inner header. If the encapsulated protocol is IPv6 and
DF is configured to be inherited, always set it.

For IPv4, inheriting DF from the inner header was probably intended from
the very beginning judging by the comment to vxlan_xmit(), but it wasn't
actually implemented -- also because it would have done more harm than
good, without handling for ICMP Fragmentation Needed messages.

According to RFC 7348, "Path MTU discovery MAY be used". An expired RFC
draft, draft-saum-nvo3-pmtud-over-vxlan-05, whose purpose was to describe
PMTUD implementation, says that "is a MUST that Vxlan gateways [...]
SHOULD set the DF-bit [...]", whatever that means.

Given this background, the only sane option is probably to let the user
decide, and keep the current behaviour as default.

Reviewed-by: Sabrina Dubroca 
Signed-off-by: Stefano Brivio 
---
 drivers/net/vxlan.c  | 29 +
 include/net/vxlan.h  |  1 +
 include/uapi/linux/if_link.h |  9 +
 3 files changed, 39 insertions(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 7706e392b2a7..ccb19b833706 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2289,6 +2289,19 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct 
net_device *dev,
df = htons(IP_DF);
}
 
+   if (!df) {
+   if (vxlan->cfg.df == VXLAN_DF_SET) {
+   df = htons(IP_DF);
+   } else if (vxlan->cfg.df == VXLAN_DF_INHERIT) {
+   struct ethhdr *eth = eth_hdr(skb);
+
+   if (ntohs(eth->h_proto) == ETH_P_IPV6 ||
+   (ntohs(eth->h_proto) == ETH_P_IP &&
+old_iph->frag_off & htons(IP_DF)))
+   df = htons(IP_DF);
+   }
+   }
+
ndst = >dst;
skb_tunnel_check_pmtu(skb, ndst, VXLAN_HEADROOM);
 
@@ -2837,6 +2850,7 @@ static const struct nla_policy 
vxlan_policy[IFLA_VXLAN_MAX + 1] = {
[IFLA_VXLAN_GPE]= { .type = NLA_FLAG, },
[IFLA_VXLAN_REMCSUM_NOPARTIAL]  = { .type = NLA_FLAG },
[IFLA_VXLAN_TTL_INHERIT]= { .type = NLA_FLAG },
+   [IFLA_VXLAN_DF] = { .type = NLA_U8 },
 };
 
 static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[],
@@ -2893,6 +2907,16 @@ static int vxlan_validate(struct nlattr *tb[], struct 
nlattr *data[],
}
}
 
+   if (data[IFLA_VXLAN_DF]) {
+   enum ifla_vxlan_df df = nla_get_u8(data[IFLA_VXLAN_DF]);
+
+   if (df < 0 || df > VXLAN_DF_MAX) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_DF],
+   "Invalid DF attribute");
+   return -EINVAL;
+   }
+   }
+
return 0;
 }
 
@@ -3538,6 +3562,9 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
nlattr *data[],
conf->mtu = nla_get_u32(tb[IFLA_MTU]);
}
 
+   if (data[IFLA_VXLAN_DF])
+   conf->df = nla_get_u8(data[IFLA_VXLAN_DF]);
+
return 0;
 }
 
@@ -3630,6 +3657,7 @@ static size_t vxlan_get_size(const struct net_device *dev)
nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_TTL */
nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_TTL_INHERIT */
nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_TOS */
+   nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_DF */
nla_total_size(sizeof(__be32)) + /* IFLA_VXLAN_LABEL */
nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_LEARNING */
nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_PROXY */
@@ -3696,6 +3724,7 @@ static int vxlan_fill_info(struct sk_buff *skb, const 
struct net_device *dev)
nla_put_u8(skb, IFLA_VXLAN_TTL_INHERIT,
   !!(vxlan->cfg.flags & VXLAN_F_TTL_INHERIT)) ||
nla_put_u8(skb, IFLA_VXLAN_TOS, vxlan->cfg.tos) ||
+   nla_put_u8(skb, IFLA_VXLAN_DF, vxlan->cfg.df) ||
nla_put_be32(skb, IFLA_VXLAN_LABEL, vxlan->cfg.label) ||
nla_put_u8(skb, IFLA_VXLAN_LEARNING,
!!(vxlan->cfg.flags & VXLAN_F_LEARN)) ||
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index 03431c148e16..ec999c49df1f 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -216,6 +216,7 @@ struct vxlan_config {
unsigned long   age_interval;
unsigned intaddrmax;
boolno_share;
+   enum ifla_vxlan_df  df;
 };
 
 struct vxlan_dev_node {
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
inde

[PATCH net-next 01/11] udp: Handle ICMP errors for tunnels with same destination port on both endpoints

2018-11-06 Thread Stefano Brivio

For both IPv4 and IPv6, if we can't match errors to a socket, try
tunnels before ignoring them. Look up a socket with the original source
and destination ports as found in the UDP packet inside the ICMP payload,
this will work for tunnels that force the same destination port for both
endpoints, i.e. VxLAN and GENEVE.

For IPv6 redirect messages, call ip6_redirect() directly with the output
interface argument set to the interface we received the packet from (as
it's the very interface we should build the exception on), otherwise the
new nexthop will be rejected. There's no such need for IPv4.

Tunnels can now export an encap_err_lookup() operation that indicates a
match. Pass the packet to the lookup function, and if the tunnel driver
reports a matching association, continue with regular ICMP error handling.

Reviewed-by: Sabrina Dubroca 
Signed-off-by: Stefano Brivio 
---
 include/linux/udp.h  |  1 +
 include/net/udp_tunnel.h |  3 ++
 net/ipv4/udp.c   | 76 +++-
 net/ipv4/udp_tunnel.c|  1 +
 net/ipv6/udp.c   | 83 ++--
 5 files changed, 144 insertions(+), 20 deletions(-)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 320d49d85484..c8410837f044 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -71,6 +71,7 @@ struct udp_sock {
 * For encapsulation sockets.
 */
int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
+   int (*encap_err_lookup)(struct sock *sk, struct sk_buff *skb);
void (*encap_destroy)(struct sock *sk);
 
/* GRO functions for UDP socket */
diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index fe680ab6b15a..bf2f84984392 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -64,6 +64,8 @@ static inline int udp_sock_create(struct net *net,
 }
 
 typedef int (*udp_tunnel_encap_rcv_t)(struct sock *sk, struct sk_buff *skb);
+typedef int (*udp_tunnel_encap_err_lookup_t)(struct sock *sk,
+struct sk_buff *skb);
 typedef void (*udp_tunnel_encap_destroy_t)(struct sock *sk);
 typedef struct sk_buff *(*udp_tunnel_gro_receive_t)(struct sock *sk,
struct list_head *head,
@@ -76,6 +78,7 @@ struct udp_tunnel_sock_cfg {
/* Used for setting up udp_sock fields, see udp.h for details */
__u8  encap_type;
udp_tunnel_encap_rcv_t encap_rcv;
+   udp_tunnel_encap_err_lookup_t encap_err_lookup;
udp_tunnel_encap_destroy_t encap_destroy;
udp_tunnel_gro_receive_t gro_receive;
udp_tunnel_gro_complete_t gro_complete;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ca3ed931f2a9..1f054a85062d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -585,6 +585,59 @@ static inline bool __udp_is_mcast_sock(struct net *net, 
struct sock *sk,
return true;
 }
 
+DEFINE_STATIC_KEY_FALSE(udp_encap_needed_key);
+void udp_encap_enable(void)
+{
+   static_branch_enable(_encap_needed_key);
+}
+EXPORT_SYMBOL(udp_encap_enable);
+
+/* Try to match ICMP errors to UDP tunnels by looking up a socket without
+ * reversing source and destination port: this will match tunnels that force 
the
+ * same destination port on both endpoints (e.g. VxLAN, GENEVE). Then ask the
+ * tunnel implementation to match the error against a valid association.
+ *
+ * Return the socket if we have a match.
+ */
+static struct sock *__udp4_lib_err_encap(struct net *net,
+const struct iphdr *iph,
+struct udphdr *uh,
+struct udp_table *udptable,
+struct sk_buff *skb)
+{
+   int (*lookup)(struct sock *sk, struct sk_buff *skb);
+   int network_offset, transport_offset;
+   struct udp_sock *up;
+   struct sock *sk;
+
+   sk = __udp4_lib_lookup(net, iph->daddr, uh->source,
+  iph->saddr, uh->dest, skb->dev->ifindex, 0,
+  udptable, NULL);
+   if (!sk)
+   return NULL;
+
+   network_offset = skb_network_offset(skb);
+   transport_offset = skb_transport_offset(skb);
+
+   skb_reset_network_header(skb);
+
+   /* Network header needs to point to the outer IPv4 header inside ICMP */
+   skb_reset_network_header(skb);
+   iph = ip_hdr(skb);
+   /* Transport header needs to point to the UDP header */
+   skb_set_transport_header(skb, iph->ihl << 2);
+
+   up = udp_sk(sk);
+   lookup = READ_ONCE(up->encap_err_lookup);
+   if (!lookup || lookup(sk, skb))
+   sk = NULL;
+
+   skb_set_transport_header(skb, transport_offset);
+   skb_set_network_header(skb, network_offset);
+
+   return sk;
+}
+
 /*
  * This routine is called by the ICMP module when it gets some
  * sort of error conditio

[PATCH net-next 05/11] geneve: ICMP error lookup handler

2018-11-06 Thread Stefano Brivio

Export an encap_err_lookup() operation to match an ICMP error against a
valid VNI.

Reviewed-by: Sabrina Dubroca 
Signed-off-by: Stefano Brivio 
---
 drivers/net/geneve.c | 52 
 1 file changed, 52 insertions(+)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index a0cd1c41cf5f..8a69879d516a 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -387,6 +387,57 @@ static int geneve_udp_encap_recv(struct sock *sk, struct 
sk_buff *skb)
return 0;
 }
 
+/* Callback from net/ipv{4,6}/udp.c to check that we have a tunnel for errors 
*/
+static int geneve_udp_encap_err_lookup(struct sock *sk, struct sk_buff *skb)
+{
+   struct genevehdr *geneveh;
+   struct geneve_sock *gs;
+   u8 zero_vni[3] = { 0 };
+   u8 *vni = zero_vni;
+
+   if (skb->len < GENEVE_BASE_HLEN)
+   return -EINVAL;
+
+   geneveh = geneve_hdr(skb);
+   if (geneveh->ver != GENEVE_VER)
+   return -EINVAL;
+
+   if (geneveh->proto_type != htons(ETH_P_TEB))
+   return -EINVAL;
+
+   gs = rcu_dereference_sk_user_data(sk);
+   if (!gs)
+   return -ENOENT;
+
+   if (geneve_get_sk_family(gs) == AF_INET) {
+   struct iphdr *iph = ip_hdr(skb);
+   __be32 addr4 = 0;
+
+   if (!gs->collect_md) {
+   vni = geneve_hdr(skb)->vni;
+   addr4 = iph->daddr;
+   }
+
+   return geneve_lookup(gs, addr4, vni) ? 0 : -ENOENT;
+   }
+
+#if IS_ENABLED(CONFIG_IPV6)
+   if (geneve_get_sk_family(gs) == AF_INET6) {
+   struct ipv6hdr *ip6h = ipv6_hdr(skb);
+   struct in6_addr addr6 = { 0 };
+
+   if (!gs->collect_md) {
+   vni = geneve_hdr(skb)->vni;
+   addr6 = ip6h->daddr;
+   }
+
+   return geneve6_lookup(gs, addr6, vni) ? 0 : -ENOENT;
+   }
+#endif
+
+   return -EPFNOSUPPORT;
+}
+
 static struct socket *geneve_create_sock(struct net *net, bool ipv6,
 __be16 port, bool ipv6_rx_csum)
 {
@@ -544,6 +595,7 @@ static struct geneve_sock *geneve_socket_create(struct net 
*net, __be16 port,
tunnel_cfg.gro_receive = geneve_gro_receive;
tunnel_cfg.gro_complete = geneve_gro_complete;
tunnel_cfg.encap_rcv = geneve_udp_encap_recv;
+   tunnel_cfg.encap_err_lookup = geneve_udp_encap_err_lookup;
tunnel_cfg.encap_destroy = NULL;
setup_udp_tunnel_sock(net, sock, _cfg);
list_add(>list, >sock_list);
-- 
2.19.1

[PATCH net-next 09/11] udp: Support for error handlers of tunnels with arbitrary destination port

2018-11-06 Thread Stefano Brivio

ICMP error handling is currently not possible for UDP tunnels not
employing a receiving socket with local destination port matching the
remote one, because we have no way to look them up.

Add an err_handler tunnel encapsulation operation that can be exported by
tunnels in order to pass the error to the protocol implementing the
encapsulation. We can't easily use a lookup function as we did for VxLAN
and GENEVE, as protocol error handlers, which would be in turn called by
implementations of this new operation, handle the errors themselves,
together with the tunnel lookup.

Without a socket, we can't be sure which encapsulation error handler is
the appropriate one: encapsulation handlers (the ones for FoU and GUE
introduced in the next patch, e.g.) will need to check the new error codes
returned by protocol handlers to figure out if errors match the given
encapsulation, and, in turn, report this error back, so that we can try
all of them in __udp{4,6}_lib_err_encap_no_sk() until we have a match.

Reviewed-by: Sabrina Dubroca 
Signed-off-by: Stefano Brivio 
---
 include/net/ip6_tunnel.h |  2 +
 include/net/ip_tunnels.h |  1 +
 net/ipv4/udp.c   | 75 +++--
 net/ipv6/udp.c   | 80 ++--
 4 files changed, 119 insertions(+), 39 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 236e40ba06bf..7855966b4a19 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -69,6 +69,8 @@ struct ip6_tnl_encap_ops {
size_t (*encap_hlen)(struct ip_tunnel_encap *e);
int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e,
u8 *protocol, struct flowi6 *fl6);
+   int (*err_handler)(struct sk_buff *, struct inet6_skb_parm *opt,
+  u8 type, u8 code, int offset, __be32 info);
 };
 
 #ifdef CONFIG_INET
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index b0d022ff6ea1..5980659312e5 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -311,6 +311,7 @@ struct ip_tunnel_encap_ops {
size_t (*encap_hlen)(struct ip_tunnel_encap *e);
int (*build_header)(struct sk_buff *skb, struct ip_tunnel_encap *e,
u8 *protocol, struct flowi4 *fl4);
+   int (*err_handler)(struct sk_buff *, u32);
 };
 
 #define MAX_IPTUN_ENCAP_OPS 8
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index b89c4cfd7c62..83950b4faced 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -105,6 +105,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -592,30 +593,48 @@ void udp_encap_enable(void)
 }
 EXPORT_SYMBOL(udp_encap_enable);
 
+/* Handler for tunnels with arbitrary destination ports: no socket lookup, go
+ * through error handlers in encapsulations looking for a match.
+ */
+static int __udp4_lib_err_encap_no_sk(struct sk_buff *skb, u32 info)
+{
+   int i;
+
+   for (i = 0; i < MAX_IPTUN_ENCAP_OPS; i++) {
+   int (*handler)(struct sk_buff *skb, u32 info);
+
+   if (!iptun_encaps[i])
+   continue;
+   handler = rcu_dereference(iptun_encaps[i]->err_handler);
+   if (handler && !handler(skb, info))
+   return 0;
+   }
+
+   return -ENOENT;
+}
+
 /* Try to match ICMP errors to UDP tunnels by looking up a socket without
  * reversing source and destination port: this will match tunnels that force 
the
- * same destination port on both endpoints (e.g. VxLAN, GENEVE). Then ask the
- * tunnel implementation to match the error against a valid association.
+ * same destination port on both endpoints (e.g. VxLAN, GENEVE). If this 
doesn't
+ * match any socket, probe tunnels with arbitrary destination ports (e.g. FoU,
+ * GUE): there, the receiving socket is useless, as the port we've sent packets
+ * to won't necessarily match the local destination port.
+ *
+ * Then ask the tunnel implementation to match the error against a valid
+ * association.
  *
- * Return the socket if we have a match.
+ * Return an error if we can't find a match, the socket if we need further
+ * processing, zero otherwise.
  */
 static struct sock *__udp4_lib_err_encap(struct net *net,
 const struct iphdr *iph,
 struct udphdr *uh,
 struct udp_table *udptable,
-struct sk_buff *skb)
+struct sk_buff *skb, u32 info)
 {
-   int (*lookup)(struct sock *sk, struct sk_buff *skb);
int network_offset, transport_offset;
-   struct udp_sock *up;
struct sock *sk;
 
-   sk = __udp4_lib_lookup(net, iph->daddr, uh->source,
-  iph->saddr, uh->dest, skb->dev->ifindex, 0,
-

[PATCH net-next 08/11] net: Convert protocol error handlers from void to int

2018-11-06 Thread Stefano Brivio

We'll need this to handle ICMP errors for tunnels without a sending socket
(i.e. FoU and GUE). There, we might have to look up different types of IP
tunnels, registered as network protocols, before we get a match, so we
want this for the error handlers of IPPROTO_IPIP and IPPROTO_IPV6 in both
inet_protos and inet6_protos. These error codes will be used in the next
patch.

For consistency, return sensible error codes in protocol error handlers
whenever errors don't match a protocol or any of its states.

This has no effect on existing error handling paths.

Reviewed-by: Sabrina Dubroca 
Signed-off-by: Stefano Brivio 
---
 include/net/icmp.h|  2 +-
 include/net/protocol.h|  9 ++--
 include/net/sctp/sctp.h   |  2 +-
 include/net/tcp.h |  2 +-
 include/net/udp.h |  2 +-
 net/dccp/ipv4.c   | 13 +++
 net/dccp/ipv6.c   | 13 +++
 net/ipv4/gre_demux.c  |  9 ++--
 net/ipv4/icmp.c   |  6 +++--
 net/ipv4/ip_gre.c | 48 ---
 net/ipv4/ipip.c   | 14 ++--
 net/ipv4/tcp_ipv4.c   | 22 ++
 net/ipv4/tunnel4.c| 18 ++-
 net/ipv4/udp.c| 10 
 net/ipv4/udp_impl.h   |  2 +-
 net/ipv4/udplite.c|  4 ++--
 net/ipv4/xfrm4_protocol.c | 18 ++-
 net/ipv6/icmp.c   |  4 +++-
 net/ipv6/ip6_gre.c| 18 ---
 net/ipv6/tcp_ipv6.c   | 13 +++
 net/ipv6/tunnel6.c| 12 ++
 net/ipv6/udp.c| 18 +++
 net/ipv6/udp_impl.h   |  4 ++--
 net/ipv6/udplite.c|  5 ++--
 net/ipv6/xfrm6_protocol.c | 18 ++-
 net/sctp/input.c  |  5 ++--
 net/sctp/ipv6.c   |  7 --
 27 files changed, 177 insertions(+), 121 deletions(-)

diff --git a/include/net/icmp.h b/include/net/icmp.h
index 3ef2743a8eec..6ac3a5bd0117 100644
--- a/include/net/icmp.h
+++ b/include/net/icmp.h
@@ -41,7 +41,7 @@ struct net;
 
 void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info);
 int icmp_rcv(struct sk_buff *skb);
-void icmp_err(struct sk_buff *skb, u32 info);
+int icmp_err(struct sk_buff *skb, u32 info);
 int icmp_init(void);
 void icmp_out_count(struct net *net, unsigned char type);
 
diff --git a/include/net/protocol.h b/include/net/protocol.h
index 4fc75f7ae23b..92b3eaad6088 100644
--- a/include/net/protocol.h
+++ b/include/net/protocol.h
@@ -42,7 +42,10 @@ struct net_protocol {
int (*early_demux)(struct sk_buff *skb);
int (*early_demux_handler)(struct sk_buff *skb);
int (*handler)(struct sk_buff *skb);
-   void(*err_handler)(struct sk_buff *skb, u32 info);
+
+   /* This returns an error if we weren't able to handle the error. */
+   int (*err_handler)(struct sk_buff *skb, u32 info);
+
unsigned intno_policy:1,
netns_ok:1,
/* does the protocol do more stringent
@@ -58,10 +61,12 @@ struct inet6_protocol {
void(*early_demux_handler)(struct sk_buff *skb);
int (*handler)(struct sk_buff *skb);
 
-   void(*err_handler)(struct sk_buff *skb,
+   /* This returns an error if we weren't able to handle the error. */
+   int (*err_handler)(struct sk_buff *skb,
   struct inet6_skb_parm *opt,
   u8 type, u8 code, int offset,
   __be32 info);
+
unsigned intflags;  /* INET6_PROTO_xxx */
 };
 
diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
index 8c2caa370e0f..9a3b48a35e90 100644
--- a/include/net/sctp/sctp.h
+++ b/include/net/sctp/sctp.h
@@ -151,7 +151,7 @@ int sctp_primitive_RECONF(struct net *net, struct 
sctp_association *asoc,
  * sctp/input.c
  */
 int sctp_rcv(struct sk_buff *skb);
-void sctp_v4_err(struct sk_buff *skb, u32 info);
+int sctp_v4_err(struct sk_buff *skb, u32 info);
 void sctp_hash_endpoint(struct sctp_endpoint *);
 void sctp_unhash_endpoint(struct sctp_endpoint *);
 struct sock *sctp_err_lookup(struct net *net, int family, struct sk_buff *,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a18914d20486..4743836bed2e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -313,7 +313,7 @@ extern struct proto tcp_prot;
 
 void tcp_tasklet_init(void);
 
-void tcp_v4_err(struct sk_buff *skb, u32);
+int tcp_v4_err(struct sk_buff *skb, u32);
 
 void tcp_shutdown(struct sock *sk, int how);
 
diff --git a/include/net/udp.h b/include/net/udp.h
index 9e82cb391dea..7e3f1e2b68eb 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -272,7 +272,7 @@ bool udp_sk_rx_dst_set(struct sock *sk, struct dst_entry 
*dst);
 int udp_get_port(struct sock *sk, unsigned short snum,
 int (*saddr_cmp)(const struct sock *,
  const struct

[PATCH net-next 11/11] selftests: pmtu: Introduce FoU and GUE PMTU exceptions tests

2018-11-06 Thread Stefano Brivio

Introduce eight tests, for FoU and GUE, with IPv4 and IPv6 payload,
on IPv4 and IPv6 transport, that check that PMTU exceptions are created
with the right value when exceeding the MTU on a link of the path.

Reviewed-by: Sabrina Dubroca 
Signed-off-by: Stefano Brivio 
---
 tools/testing/selftests/net/pmtu.sh | 179 
 1 file changed, 179 insertions(+)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index e9bb0c37bdfc..b746bcb29e0f 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -43,6 +43,14 @@
 # - pmtu_ipv6_geneve6_exception
 #  Same as pmtu_ipv6_vxlan6, but using a GENEVE tunnel instead of VxLAN
 #
+# - pmtu_ipv{4,6}_fou{4,6}_exception
+#  Same as pmtu_ipv4_vxlan6, but using a direct IPv4/IPv6 encapsulation
+#  (FoU) over IPv4/IPv6, instead of VxLAN
+#
+# - pmtu_ipv{4,6}_fou{4,6}_exception
+#  Same as pmtu_ipv4_vxlan6, but using a generic UDP IPv4/IPv6
+#  encapsulation (GUE) over IPv4/IPv6, instead of VxLAN
+#
 # - pmtu_vti4_exception
 #  Set up vti tunnel on top of veth, with xfrm states and policies, in two
 #  namespaces with matching endpoints. Check that route exception is not
@@ -93,6 +101,14 @@ tests="
pmtu_ipv6_vxlan6_exception  IPv6 over vxlan6: PMTU exceptions
pmtu_ipv4_geneve6_exception IPv4 over geneve6: PMTU exceptions
pmtu_ipv6_geneve6_exception IPv6 over geneve6: PMTU exceptions
+   pmtu_ipv4_fou4_exceptionIPv4 over fou4: PMTU exceptions
+   pmtu_ipv6_fou4_exceptionIPv6 over fou4: PMTU exceptions
+   pmtu_ipv4_fou6_exceptionIPv4 over fou6: PMTU exceptions
+   pmtu_ipv6_fou6_exceptionIPv6 over fou6: PMTU exceptions
+   pmtu_ipv4_gue4_exceptionIPv4 over gue4: PMTU exceptions
+   pmtu_ipv6_gue4_exceptionIPv6 over gue4: PMTU exceptions
+   pmtu_ipv4_gue6_exceptionIPv4 over gue6: PMTU exceptions
+   pmtu_ipv6_gue6_exceptionIPv6 over gue6: PMTU exceptions
pmtu_vti6_exception vti6: PMTU exceptions
pmtu_vti4_exception vti4: PMTU exceptions
pmtu_vti4_default_mtu   vti4: default MTU assignment
@@ -180,6 +196,89 @@ nsname() {
eval echo \$NS_$1
 }
 
+setup_fou_or_gue() {
+   outer="${1}"
+   inner="${2}"
+   encap="${3}"
+
+   if [ "${outer}" = "4" ]; then
+   modprobe fou || return 2
+   a_addr="${prefix4}.${a_r1}.1"
+   b_addr="${prefix4}.${b_r1}.1"
+   if [ "${inner}" = "4" ]; then
+   type="ipip"
+   ipproto="4"
+   else
+   type="sit"
+   ipproto="41"
+   fi
+   else
+   modprobe fou6 || return 2
+   a_addr="${prefix6}:${a_r1}::1"
+   b_addr="${prefix6}:${b_r1}::1"
+   if [ "${inner}" = "4" ]; then
+   type="ip6tnl"
+   mode="mode ipip6"
+   ipproto="4 -6"
+   else
+   type="ip6tnl"
+   mode="mode ip6ip6"
+   ipproto="41 -6"
+   fi
+   fi
+
+   ${ns_a} ip fou add port  ipproto ${ipproto} || return 2
+   ${ns_a} ip link add ${encap}_a type ${type} ${mode} local ${a_addr} 
remote ${b_addr} encap ${encap} encap-sport auto encap-dport 5556 || return 2
+
+   ${ns_b} ip fou add port 5556 ipproto ${ipproto}
+   ${ns_b} ip link add ${encap}_b type ${type} ${mode} local ${b_addr} 
remote ${a_addr} encap ${encap} encap-sport auto encap-dport 
+
+   if [ "${inner}" = "4" ]; then
+   ${ns_a} ip addr add ${tunnel4_a_addr}/${tunnel4_mask} dev 
${encap}_a
+   ${ns_b} ip addr add ${tunnel4_b_addr}/${tunnel4_mask} dev 
${encap}_b
+   else
+   ${ns_a} ip addr add ${tunnel6_a_addr}/${tunnel6_mask} dev 
${encap}_a
+   ${ns_b} ip addr add ${tunnel6_b_addr}/${tunnel6_mask} dev 
${encap}_b
+   fi
+
+   ${ns_a} ip link set ${encap}_a up
+   ${ns_b} ip link set ${encap}_b up
+
+   sleep 1
+}
+
+setup_fou44() {
+   setup_fou_or_gue 4 4 fou
+}
+
+setup_fou46() {
+   setup_fou_or_gue 4 6 fou
+}
+
+setup_fou64() {
+   setup_fou_or_gue 6 4 fou
+}
+
+setup_fou66() {
+   setup_fou_or_gue 6 6 fou
+}
+
+setup_gue44() {
+   setup_fou_or_gue 4 4 gue
+}
+
+setup_gue46() {
+   setup_fou_or_gue 4 6 gue
+}
+
+setup_gue64() {
+   setup_fou_or_gue 6 4 gue
+}
+
+setup_gue66() {
+   setup_fou_or_gue 6 6 gue
+}
+
 setup_namespaces() {
for

[PATCH net-next 10/11] fou, fou6: ICMP error handlers for FoU and GUE

2018-11-06 Thread Stefano Brivio

As the destination port in FoU and GUE receiving sockets doesn't
necessarily match the remote destination port, we can't associate errors
to the encapsulating tunnels with a socket lookup -- we need to blindly
try them instead. This means we don't even know if we are handling errors
for FoU or GUE without digging into the packets.

Hence, implement a single handler for both, one for IPv4 and one for IPv6,
that will check whether the packet that generated the ICMP error used a
direct IP encapsulation or if it had a GUE header, and send the error to
the matching protocol handler, if any.

Reviewed-by: Sabrina Dubroca 
Signed-off-by: Stefano Brivio 
---
 net/ipv4/fou.c  | 68 +
 net/ipv4/protocol.c |  1 +
 net/ipv6/fou6.c | 74 +
 3 files changed, 143 insertions(+)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 500a59906b87..0d0ad19ecb87 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -3,6 +3,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1003,15 +1004,82 @@ static int gue_build_header(struct sk_buff *skb, struct 
ip_tunnel_encap *e,
return 0;
 }
 
+static int gue_err_proto_handler(int proto, struct sk_buff *skb, u32 info)
+{
+   const struct net_protocol *ipprot = rcu_dereference(inet_protos[proto]);
+
+   if (ipprot && ipprot->err_handler) {
+   if (!ipprot->err_handler(skb, info))
+   return 0;
+   }
+
+   return -ENOENT;
+}
+
+static int gue_err(struct sk_buff *skb, u32 info)
+{
+   int transport_offset = skb_transport_offset(skb);
+   struct guehdr *guehdr;
+   size_t optlen;
+   int ret;
+
+   if (skb->len < sizeof(struct udphdr) + sizeof(struct guehdr))
+   return -EINVAL;
+
+   guehdr = (struct guehdr *)_hdr(skb)[1];
+
+   switch (guehdr->version) {
+   case 0: /* Full GUE header present */
+   break;
+   case 1: {
+   /* Direct encasulation of IPv4 or IPv6 */
+   skb_set_transport_header(skb, -(int)sizeof(struct icmphdr));
+
+   switch (((struct iphdr *)guehdr)->version) {
+   case 4:
+   ret = gue_err_proto_handler(IPPROTO_IPIP, skb, info);
+   goto out;
+#if IS_ENABLED(CONFIG_IPV6)
+   case 6:
+   ret = gue_err_proto_handler(IPPROTO_IPV6, skb, info);
+   goto out;
+#endif
+   default:
+   ret = -EOPNOTSUPP;
+   goto out;
+   }
+   }
+   default: /* Undefined version */
+   return -EOPNOTSUPP;
+   }
+
+   if (guehdr->control)
+   return -ENOENT;
+
+   optlen = guehdr->hlen << 2;
+
+   if (validate_gue_flags(guehdr, optlen))
+   return -EINVAL;
+
+   skb_set_transport_header(skb, -(int)sizeof(struct icmphdr));
+   ret = gue_err_proto_handler(guehdr->proto_ctype, skb, info);
+
+out:
+   skb_set_transport_header(skb, transport_offset);
+   return ret;
+}
+
 
 static const struct ip_tunnel_encap_ops fou_iptun_ops = {
.encap_hlen = fou_encap_hlen,
.build_header = fou_build_header,
+   .err_handler = gue_err,
 };
 
 static const struct ip_tunnel_encap_ops gue_iptun_ops = {
.encap_hlen = gue_encap_hlen,
.build_header = gue_build_header,
+   .err_handler = gue_err,
 };
 
 static int ip_tunnel_encap_add_fou_ops(void)
diff --git a/net/ipv4/protocol.c b/net/ipv4/protocol.c
index 32a691b7ce2c..92d249e053be 100644
--- a/net/ipv4/protocol.c
+++ b/net/ipv4/protocol.c
@@ -29,6 +29,7 @@
 #include 
 
 struct net_protocol __rcu *inet_protos[MAX_INET_PROTOS] __read_mostly;
+EXPORT_SYMBOL(inet_protos);
 const struct net_offload __rcu *inet_offloads[MAX_INET_PROTOS] __read_mostly;
 EXPORT_SYMBOL(inet_offloads);
 
diff --git a/net/ipv6/fou6.c b/net/ipv6/fou6.c
index 6de3c04b0f30..bd675c61deb1 100644
--- a/net/ipv6/fou6.c
+++ b/net/ipv6/fou6.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -69,14 +70,87 @@ static int gue6_build_header(struct sk_buff *skb, struct 
ip_tunnel_encap *e,
return 0;
 }
 
+static int gue6_err_proto_handler(int proto, struct sk_buff *skb,
+ struct inet6_skb_parm *opt,
+ u8 type, u8 code, int offset, u32 info)
+{
+   const struct inet6_protocol *ipprot;
+
+   ipprot = rcu_dereference(inet6_protos[proto]);
+   if (ipprot && ipprot->err_handler) {
+   if (!ipprot->err_handler(skb, opt, type, code, offset, info))
+   return 0;
+   }
+
+   return -ENOENT;
+}
+
+static int gue6_err(struct sk_buff *skb, struct inet6_skb_parm *opt,
+   u8 type, u8 code, int offset, __be32 info)
+{
+   int trans

[PATCH net-next 02/11] vxlan: ICMP error lookup handler

2018-11-06 Thread Stefano Brivio

Export an encap_err_lookup() operation to match an ICMP error against a
valid VNI.

Reviewed-by: Sabrina Dubroca 
Signed-off-by: Stefano Brivio 
---
 drivers/net/vxlan.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 297cdeaef479..7706e392b2a7 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1552,6 +1552,34 @@ static int vxlan_rcv(struct sock *sk, struct sk_buff 
*skb)
return 0;
 }
 
+/* Callback from net/ipv{4,6}/udp.c to check that we have a VNI for errors */
+static int vxlan_err_lookup(struct sock *sk, struct sk_buff *skb)
+{
+   struct vxlan_dev *vxlan;
+   struct vxlan_sock *vs;
+   struct vxlanhdr *hdr;
+   __be32 vni;
+
+   if (skb->len < VXLAN_HLEN)
+   return -EINVAL;
+
+   hdr = vxlan_hdr(skb);
+
+   if (!(hdr->vx_flags & VXLAN_HF_VNI))
+   return -EINVAL;
+
+   vs = rcu_dereference_sk_user_data(sk);
+   if (!vs)
+   return -ENOENT;
+
+   vni = vxlan_vni(hdr->vx_vni);
+   vxlan = vxlan_vs_find_vni(vs, skb->dev->ifindex, vni);
+   if (!vxlan)
+   return -ENOENT;
+
+   return 0;
+}
+
 static int arp_reduce(struct net_device *dev, struct sk_buff *skb, __be32 vni)
 {
struct vxlan_dev *vxlan = netdev_priv(dev);
@@ -2948,6 +2976,7 @@ static struct vxlan_sock *vxlan_socket_create(struct net 
*net, bool ipv6,
tunnel_cfg.sk_user_data = vs;
tunnel_cfg.encap_type = 1;
tunnel_cfg.encap_rcv = vxlan_rcv;
+   tunnel_cfg.encap_err_lookup = vxlan_err_lookup;
tunnel_cfg.encap_destroy = NULL;
tunnel_cfg.gro_receive = vxlan_gro_receive;
tunnel_cfg.gro_complete = vxlan_gro_complete;
-- 
2.19.1

[PATCH net-next 04/11] selftests: pmtu: Introduce tests for IPv4/IPv6 over VxLAN over IPv6

2018-11-06 Thread Stefano Brivio

Use a router between endpoints, implemented via namespaces, set a low MTU
between router and destination endpoint, exceed it and check PMTU value in
route exceptions.

Reviewed-by: Sabrina Dubroca 
Signed-off-by: Stefano Brivio 
---
This only introduces tests over VxLAN over IPv6 right now. I'll introduce
tests over IPv4 (they can be added trivially) once DF configuration support
is accepted into iproute2.

 tools/testing/selftests/net/pmtu.sh | 115 +++-
 1 file changed, 97 insertions(+), 18 deletions(-)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index a369d616b390..19ede74af560 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -26,6 +26,17 @@
 # - pmtu_ipv6
 #  Same as pmtu_ipv4, except for locked PMTU tests, using IPv6
 #
+# - pmtu_ipv4_vxlan6_exception
+#  Set up the same network topology as pmtu_ipv4, create a VxLAN tunnel
+#  over IPv6 between A and B, routed via R1. On the link between R1 and B,
+#  set a MTU lower than the VxLAN MTU and the MTU on the link between A and
+#  R1. Send IPv4 packets, exceeding the MTU between R1 and B, over VxLAN
+#  from A to B and check that the PMTU exception is created with the right
+#  value on A
+#
+# - pmtu_ipv6_vxlan6_exception
+#  Same as pmtu_ipv4_vxlan6_exception, but send IPv6 packets from A to B
+#
 # - pmtu_vti4_exception
 #  Set up vti tunnel on top of veth, with xfrm states and policies, in two
 #  namespaces with matching endpoints. Check that route exception is not
@@ -72,6 +83,8 @@ which ping6 > /dev/null 2>&1 && ping6=$(which ping6) || 
ping6=$(which ping)
 tests="
pmtu_ipv4_exception ipv4: PMTU exceptions
pmtu_ipv6_exception ipv6: PMTU exceptions
+   pmtu_ipv4_vxlan6_exception  IPv4 over vxlan6: PMTU exceptions
+   pmtu_ipv6_vxlan6_exception  IPv6 over vxlan6: PMTU exceptions
pmtu_vti6_exception vti6: PMTU exceptions
pmtu_vti4_exception vti4: PMTU exceptions
pmtu_vti4_default_mtu   vti4: default MTU assignment
@@ -95,8 +108,8 @@ ns_r2="ip netns exec ${NS_R2}"
 # Addresses are:
 # - IPv4: PREFIX4.SEGMENT.ID (/24)
 # - IPv6: PREFIX6:SEGMENT::ID (/64)
-prefix4="192.168"
-prefix6="fd00"
+prefix4="10.0"
+prefix6="fc00"
 a_r1=1
 a_r2=2
 b_r1=3
@@ -129,12 +142,12 @@ veth6_a_addr="fd00:1::a"
 veth6_b_addr="fd00:1::b"
 veth6_mask="64"
 
-vti4_a_addr="192.168.2.1"
-vti4_b_addr="192.168.2.2"
-vti4_mask="24"
-vti6_a_addr="fd00:2::a"
-vti6_b_addr="fd00:2::b"
-vti6_mask="64"
+tunnel4_a_addr="192.168.2.1"
+tunnel4_b_addr="192.168.2.2"
+tunnel4_mask="24"
+tunnel6_a_addr="fd00:2::a"
+tunnel6_b_addr="fd00:2::b"
+tunnel6_mask="64"
 
 dummy6_0_addr="fc00:1000::0"
 dummy6_1_addr="fc00:1001::0"
@@ -202,11 +215,34 @@ setup_vti() {
 }
 
 setup_vti4() {
-   setup_vti 4 ${veth4_a_addr} ${veth4_b_addr} ${vti4_a_addr} 
${vti4_b_addr} ${vti4_mask}
+   setup_vti 4 ${veth4_a_addr} ${veth4_b_addr} ${tunnel4_a_addr} 
${tunnel4_b_addr} ${tunnel4_mask}
 }
 
 setup_vti6() {
-   setup_vti 6 ${veth6_a_addr} ${veth6_b_addr} ${vti6_a_addr} 
${vti6_b_addr} ${vti6_mask}
+   setup_vti 6 ${veth6_a_addr} ${veth6_b_addr} ${tunnel6_a_addr} 
${tunnel6_b_addr} ${tunnel6_mask}
+}
+
+setup_vxlan() {
+   a_addr="${1}"
+   b_addr="${2}"
+
+   ${ns_a} ip link add vxlan_a type vxlan id 1 local ${a_addr} remote 
${b_addr} ttl 64 dstport 4789 || return 1
+   ${ns_b} ip link add vxlan_b type vxlan id 1 local ${b_addr} remote 
${a_addr} ttl 64 dstport 4789
+
+   ${ns_a} ip addr add ${tunnel4_a_addr}/${tunnel4_mask}   dev vxlan_a
+   ${ns_b} ip addr add ${tunnel4_b_addr}/${tunnel4_mask}   dev vxlan_b
+
+   ${ns_a} ip addr add ${tunnel6_a_addr}/${tunnel6_mask}   dev vxlan_a
+   ${ns_b} ip addr add ${tunnel6_b_addr}/${tunnel6_mask}   dev vxlan_b
+
+   ${ns_a} ip link set vxlan_a up
+   ${ns_b} ip link set vxlan_b up
+
+   sleep 1
+}
+
+setup_vxlan6() {
+   setup_vxlan ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1
 }
 
 setup_xfrm() {
@@ -465,6 +501,49 @@ test_pmtu_ipv6_exception() {
test_pmtu_ipvX 6
 }
 
+test_pmtu_ipvX_over_vxlan6_exception() {
+   family=${1}
+   ll_mtu=4000
+
+   setup namespaces routing vxlan6 || return 2
+   #  IPv6 header   UDP header   VxLAN header   
Ethernet header
+   exp_mtu=$((${ll_mtu} - 40  - 8  - 8- 14))
+
+   trace "${ns_a}" vxlan_a  "${ns_b}"  vxlan_b \
+ "${ns_a}" veth_A-R1"${ns_r1}" veth_R1-A \
+ "${ns_b}" veth_B-R1"${ns

[PATCH net-next 06/11] geneve: Allow configuration of DF behaviour

2018-11-06 Thread Stefano Brivio

draft-ietf-nvo3-geneve-08 says:

   It is strongly RECOMMENDED that Path MTU Discovery ([RFC1191],
   [RFC1981]) be used by setting the DF bit in the IP header when Geneve
   packets are transmitted over IPv4 (this is the default with IPv6).

Now that ICMP error handling is working for GENEVE, we can comply with
this recommendation.

Make this configurable, though, to avoid breaking existing setups. By
default, DF won't be set. It can be set or inherited from inner IPv4
packets. If it's configured to be inherited and we are encapsulating IPv6,
it will be set.

Reviewed-by: Sabrina Dubroca 
Signed-off-by: Stefano Brivio 
---
 drivers/net/geneve.c | 52 +++-
 include/uapi/linux/if_link.h |  9 +++
 2 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 8a69879d516a..cafdee06b5c8 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -70,6 +70,7 @@ struct geneve_dev {
bool   collect_md;
bool   use_udp6_rx_checksums;
bool   ttl_inherit;
+   enum ifla_geneve_df df;
 };
 
 struct geneve_sock {
@@ -898,7 +899,24 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct 
net_device *dev,
ttl = key->ttl;
ttl = ttl ? : ip4_dst_hoplimit(>dst);
}
+
df = key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
+   if (!df) {
+   if (geneve->df == GENEVE_DF_SET) {
+   df = htons(IP_DF);
+   } else if (geneve->df == GENEVE_DF_INHERIT) {
+   struct ethhdr *eth = eth_hdr(skb);
+
+   if (ntohs(eth->h_proto) == ETH_P_IPV6) {
+   df = htons(IP_DF);
+   } else if (ntohs(eth->h_proto) == ETH_P_IP) {
+   struct iphdr *iph = ip_hdr(skb);
+
+   if (iph->frag_off & htons(IP_DF))
+   df = htons(IP_DF);
+   }
+   }
+   }
 
err = geneve_build_skb(>dst, skb, info, xnet, sizeof(struct iphdr));
if (unlikely(err))
@@ -1145,6 +1163,7 @@ static const struct nla_policy 
geneve_policy[IFLA_GENEVE_MAX + 1] = {
[IFLA_GENEVE_UDP_ZERO_CSUM6_TX] = { .type = NLA_U8 },
[IFLA_GENEVE_UDP_ZERO_CSUM6_RX] = { .type = NLA_U8 },
[IFLA_GENEVE_TTL_INHERIT]   = { .type = NLA_U8 },
+   [IFLA_GENEVE_DF]= { .type = NLA_U8 },
 };
 
 static int geneve_validate(struct nlattr *tb[], struct nlattr *data[],
@@ -1180,6 +1199,16 @@ static int geneve_validate(struct nlattr *tb[], struct 
nlattr *data[],
}
}
 
+   if (data[IFLA_GENEVE_DF]) {
+   enum ifla_geneve_df df = nla_get_u8(data[IFLA_GENEVE_DF]);
+
+   if (df < 0 || df > GENEVE_DF_MAX) {
+   NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_GENEVE_DF],
+   "Invalid DF attribute");
+   return -EINVAL;
+   }
+   }
+
return 0;
 }
 
@@ -1225,7 +1254,7 @@ static int geneve_configure(struct net *net, struct 
net_device *dev,
struct netlink_ext_ack *extack,
const struct ip_tunnel_info *info,
bool metadata, bool ipv6_rx_csum,
-   bool ttl_inherit)
+   bool ttl_inherit, enum ifla_geneve_df df)
 {
struct geneve_net *gn = net_generic(net, geneve_net_id);
struct geneve_dev *t, *geneve = netdev_priv(dev);
@@ -1275,6 +1304,7 @@ static int geneve_configure(struct net *net, struct 
net_device *dev,
geneve->collect_md = metadata;
geneve->use_udp6_rx_checksums = ipv6_rx_csum;
geneve->ttl_inherit = ttl_inherit;
+   geneve->df = df;
 
err = register_netdevice(dev);
if (err)
@@ -1294,7 +1324,7 @@ static int geneve_nl2info(struct nlattr *tb[], struct 
nlattr *data[],
  struct netlink_ext_ack *extack,
  struct ip_tunnel_info *info, bool *metadata,
  bool *use_udp6_rx_checksums, bool *ttl_inherit,
- bool changelink)
+ enum ifla_geneve_df *df, bool changelink)
 {
int attrtype;
 
@@ -1382,6 +1412,9 @@ static int geneve_nl2info(struct nlattr *tb[], struct 
nlattr *data[],
if (data[IFLA_GENEVE_TOS])
info->key.tos = nla_get_u8(data[IFLA_GENEVE_TOS]);
 
+   if (data[IFLA_GENEVE_DF])
+   *df = nla_get_u8(data[IFLA_GENEVE_DF]);
+
if (data[IFLA_GENEVE_LABEL]) {
info->key.label = nla_get_be32(data[IFLA_GENEVE_LABEL]) &
  IPV6_FLOWLABEL_MASK;
@@ -1500,6 +153

[PATCH net-next 00/11] ICMP error handling for UDP tunnels

2018-11-06 Thread Stefano Brivio

This series introduces ICMP error handling for UDP tunnels and
encapsulations and related selftests. We need to handle ICMP errors to
support PMTU discovery and route redirection -- this support is entirely
missing right now:

- patch 1/11 adds a socket lookup for UDP tunnels that use, by design,
  the same destination port on both endpoints -- i.e. VxLAN and GENEVE
- patches 2/11 to 7/11 are specific to VxLAN and GENEVE
- patches 8/11 and 9/11 add infrastructure for lookup of encapsulations
  where sent packets cannot be matched via receiving socket lookup, i.e.
  FoU and GUE
- patches 10/11 and 11/11 are specific to FoU and GUE

Stefano Brivio (11):
  udp: Handle ICMP errors for tunnels with same destination port on both
endpoints
  vxlan: ICMP error lookup handler
  vxlan: Allow configuration of DF behaviour
  selftests: pmtu: Introduce tests for IPv4/IPv6 over VxLAN over IPv6
  geneve: ICMP error lookup handler
  geneve: Allow configuration of DF behaviour
  selftests: pmtu: Introduce tests for IPv4/IPv6 over GENEVE over IPv6
  net: Convert protocol error handlers from void to int
  udp: Support for error handlers of tunnels with arbitrary destination
port
  fou, fou6: ICMP error handlers for FoU and GUE
  selftests: pmtu: Introduce FoU and GUE PMTU exceptions tests

 drivers/net/geneve.c| 104 -
 drivers/net/vxlan.c |  58 +
 include/linux/udp.h |   1 +
 include/net/icmp.h  |   2 +-
 include/net/ip6_tunnel.h|   2 +
 include/net/ip_tunnels.h|   1 +
 include/net/protocol.h  |   9 +-
 include/net/sctp/sctp.h |   2 +-
 include/net/tcp.h   |   2 +-
 include/net/udp.h   |   2 +-
 include/net/udp_tunnel.h|   3 +
 include/net/vxlan.h |   1 +
 include/uapi/linux/if_link.h|  18 ++
 net/dccp/ipv4.c |  13 +-
 net/dccp/ipv6.c |  13 +-
 net/ipv4/fou.c  |  68 ++
 net/ipv4/gre_demux.c|   9 +-
 net/ipv4/icmp.c |   6 +-
 net/ipv4/ip_gre.c   |  48 ++--
 net/ipv4/ipip.c |  14 +-
 net/ipv4/protocol.c |   1 +
 net/ipv4/tcp_ipv4.c |  22 +-
 net/ipv4/tunnel4.c  |  18 +-
 net/ipv4/udp.c  | 119 --
 net/ipv4/udp_impl.h |   2 +-
 net/ipv4/udp_tunnel.c   |   1 +
 net/ipv4/udplite.c  |   4 +-
 net/ipv4/xfrm4_protocol.c   |  18 +-
 net/ipv6/fou6.c |  74 +++
 net/ipv6/icmp.c |   4 +-
 net/ipv6/ip6_gre.c  |  18 +-
 net/ipv6/tcp_ipv6.c |  13 +-
 net/ipv6/tunnel6.c  |  12 +-
 net/ipv6/udp.c  | 141 ++--
 net/ipv6/udp_impl.h |   4 +-
 net/ipv6/udplite.c  |   5 +-
 net/ipv6/xfrm6_protocol.c   |  18 +-
 net/sctp/input.c|   5 +-
 net/sctp/ipv6.c |   7 +-
 tools/testing/selftests/net/pmtu.sh | 326 ++--
 40 files changed, 1025 insertions(+), 163 deletions(-)

-- 
2.19.1

[PATCH net-next 07/11] selftests: pmtu: Introduce tests for IPv4/IPv6 over GENEVE over IPv6

2018-11-06 Thread Stefano Brivio

Use a router between endpoints, implemented via namespaces, set a low MTU
between router and destination endpoint, exceed it and check PMTU value in
route exceptions.

Reviewed-by: Sabrina Dubroca 
Signed-off-by: Stefano Brivio 
---
This only introduces tests over GENEVE over IPv6 right now. I'll introduce
tests over IPv4 (they can be added trivially) once DF configuration support
is accepted into iproute2.

 tools/testing/selftests/net/pmtu.sh | 78 -
 1 file changed, 55 insertions(+), 23 deletions(-)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index 19ede74af560..e9bb0c37bdfc 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -37,6 +37,12 @@
 # - pmtu_ipv6_vxlan6_exception
 #  Same as pmtu_ipv4_vxlan6_exception, but send IPv6 packets from A to B
 #
+# - pmtu_ipv4_geneve6_exception
+#  Same as pmtu_ipv4_vxlan6, but using a GENEVE tunnel instead of VxLAN
+#
+# - pmtu_ipv6_geneve6_exception
+#  Same as pmtu_ipv6_vxlan6, but using a GENEVE tunnel instead of VxLAN
+#
 # - pmtu_vti4_exception
 #  Set up vti tunnel on top of veth, with xfrm states and policies, in two
 #  namespaces with matching endpoints. Check that route exception is not
@@ -85,6 +91,8 @@ tests="
pmtu_ipv6_exception ipv6: PMTU exceptions
pmtu_ipv4_vxlan6_exception  IPv4 over vxlan6: PMTU exceptions
pmtu_ipv6_vxlan6_exception  IPv6 over vxlan6: PMTU exceptions
+   pmtu_ipv4_geneve6_exception IPv4 over geneve6: PMTU exceptions
+   pmtu_ipv6_geneve6_exception IPv6 over geneve6: PMTU exceptions
pmtu_vti6_exception vti6: PMTU exceptions
pmtu_vti4_exception vti4: PMTU exceptions
pmtu_vti4_default_mtu   vti4: default MTU assignment
@@ -222,27 +230,42 @@ setup_vti6() {
setup_vti 6 ${veth6_a_addr} ${veth6_b_addr} ${tunnel6_a_addr} 
${tunnel6_b_addr} ${tunnel6_mask}
 }
 
-setup_vxlan() {
-   a_addr="${1}"
-   b_addr="${2}"
+setup_vxlan_or_geneve() {
+   type="${1}"
+   a_addr="${2}"
+   b_addr="${3}"
+
+   if [ "${type}" = "vxlan" ]; then
+   opts="ttl 64 dstport 4789"
+   opts_a="local ${a_addr}"
+   opts_b="local ${b_addr}"
+   else
+   opts=""
+   opts_a=""
+   opts_b=""
+   fi
 
-   ${ns_a} ip link add vxlan_a type vxlan id 1 local ${a_addr} remote 
${b_addr} ttl 64 dstport 4789 || return 1
-   ${ns_b} ip link add vxlan_b type vxlan id 1 local ${b_addr} remote 
${a_addr} ttl 64 dstport 4789
+   ${ns_a} ip link add ${type}_a type ${type} id 1 ${opts_a} remote 
${b_addr} ${opts} || return 1
+   ${ns_b} ip link add ${type}_b type ${type} id 1 ${opts_b} remote 
${a_addr} ${opts}
 
-   ${ns_a} ip addr add ${tunnel4_a_addr}/${tunnel4_mask}   dev vxlan_a
-   ${ns_b} ip addr add ${tunnel4_b_addr}/${tunnel4_mask}   dev vxlan_b
+   ${ns_a} ip addr add ${tunnel4_a_addr}/${tunnel4_mask} dev ${type}_a
+   ${ns_b} ip addr add ${tunnel4_b_addr}/${tunnel4_mask} dev ${type}_b
 
-   ${ns_a} ip addr add ${tunnel6_a_addr}/${tunnel6_mask}   dev vxlan_a
-   ${ns_b} ip addr add ${tunnel6_b_addr}/${tunnel6_mask}   dev vxlan_b
+   ${ns_a} ip addr add ${tunnel6_a_addr}/${tunnel6_mask} dev ${type}_a
+   ${ns_b} ip addr add ${tunnel6_b_addr}/${tunnel6_mask} dev ${type}_b
 
-   ${ns_a} ip link set vxlan_a up
-   ${ns_b} ip link set vxlan_b up
+   ${ns_a} ip link set ${type}_a up
+   ${ns_b} ip link set ${type}_b up
 
sleep 1
 }
 
+setup_geneve6() {
+   setup_vxlan_or_geneve geneve ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1
+}
+
 setup_vxlan6() {
-   setup_vxlan ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1
+   setup_vxlan_or_geneve vxlan ${prefix6}:${a_r1}::1 ${prefix6}:${b_r1}::1
 }
 
 setup_xfrm() {
@@ -501,15 +524,16 @@ test_pmtu_ipv6_exception() {
test_pmtu_ipvX 6
 }
 
-test_pmtu_ipvX_over_vxlan6_exception() {
-   family=${1}
+test_pmtu_ipvX_over_vxlan6_or_geneve6_exception() {
+   type=${1}
+   family=${2}
ll_mtu=4000
 
-   setup namespaces routing vxlan6 || return 2
-   #  IPv6 header   UDP header   VxLAN header   
Ethernet header
-   exp_mtu=$((${ll_mtu} - 40  - 8  - 8- 14))
+   setup namespaces routing ${type}6 || return 2
+   #  IPv6 header   UDP header   VxLAN/GENEVE header   
Ethernet header
+   exp_mtu=$((${ll_mtu} - 40  - 8  - 8   - 
14))
 
-   trace "${ns_a}" vxlan_a  "${ns_b}"  vxlan_b \
+   trace "${ns_a}" ${type}_a"${ns_b}"  ${type}_b \
  "${ns_a}" veth_A

Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed

2018-11-02 Thread Stefano Brivio

On Wed, 31 Oct 2018 20:48:05 -0600
David Ahern  wrote:

> On 10/30/18 11:34 AM, Stefano Brivio wrote:
> > 
> > - the current column abstraction is rather lightweight, things are
> >   already buffered in the defined column order so we don't have to jump
> >   back and forth in the buffer while rendering. Doing that needs some
> >   extra care to avoid a performance hit, but it's probably doable, I
> >   can put that on my to-do list  
> 
> The ss command is always a pain; it's much better in newer releases but
> I am always having to adjust terminal width.

Ouch. Do you have some examples?

-- 
Stefano

Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed

2018-11-02 Thread Stefano Brivio

On Thu, 1 Nov 2018 14:06:23 -0700
Jakub Kicinski  wrote:

> JSONification would probably be quite an undertaking for ss :(

Probably not too much, and we would skip buffering for JSON as we don't
need to print columns, so we don't have to care about buffering and
rendering performance too much.

All it takes is to extend a bit the out() function and pass the right
arguments to it -- it looks way easier than implementing format
strings, even though I'd like the latter more.

-- 
Stefano

Re: [PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed

2018-10-30 Thread Stefano Brivio

On Tue, 30 Oct 2018 10:34:45 -0600
David Ahern  wrote:

> A more flexible approach is to use format strings to allow users to
> customize the output order and whitespace as well. So for ss and your
> column list (winging it here):
> 
> netid  = %N
> state  = %S
> recv Q = %Qr
> send Q = %Qs
> local address  = %Al
> lport port = %Pl
> remote address = %Ar
> remote port= %Pr
> process data   = %p
> ...
> 
> then a format string could be: "%S  %Qr %Qs  %Al:%Pl %Ar:%Pr  %p\n"

I like the idea indeed, but I see two issues with ss:

- the current column abstraction is rather lightweight, things are
  already buffered in the defined column order so we don't have to jump
  back and forth in the buffer while rendering. Doing that needs some
  extra care to avoid a performance hit, but it's probably doable, I
  can put that on my to-do list

- how would you model automatic spacing in a format string? Should we
  support width specifiers? Disable automatic spacing if a format
  string is given? It might even make sense to allow partial automatic
  spacing with a special character in the format string, that is:

"%S.%Qr.%Qs  %Al:%Pl %Ar:%Pr  %p\n"

  would mean "align everything to the right, distribute remaining
  whitespace between %S, %Qr and %Qs". But it looks rather complicated
  at a glance.

-- 
Stefano

[PATCH iproute2 net-next 1/3] ss: Discard empty descriptor at the end of buffer, if any, before rendering

2018-10-30 Thread Stefano Brivio

This will allow us to disable display of any given column.

Signed-off-by: Stefano Brivio 
---
 misc/ss.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index c8970438ce73..c3f61ef66258 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -1245,8 +1245,15 @@ static void render(void)
 
token = (struct buf_token *)buffer.head->data;
 
-   /* Ensure end alignment of last token, it wasn't necessarily flushed */
-   buffer.tail->end += buffer.cur->len % 2;
+   if (!buffer.cur->len) {
+   /* Last token was flushed, a new empty descriptor was appended:
+* discard it
+*/
+   buffer.tail->end -= sizeof(buffer.cur->len);
+   } else {
+   /* Last token wasn't flushed: ensure end alignment */
+   buffer.tail->end += buffer.cur->len % 2;
+   }
 
render_calc_width();
 
-- 
2.19.1

[PATCH iproute2 net-next 0/3] ss: Allow selection of columns to be displayed

2018-10-30 Thread Stefano Brivio

Now that we have an abstraction for columns, it's relatively easy to
selectively display only some of them, and Yoann has a use case for it.

Patch 1/3 fixes a rendering issue that shows up only when display of
arbitrary columns is disabled. Patch 2/3 implements the relevant option,
and patch 3/3 makes the output more readable when some columns are
disabled.

Stefano Brivio (3):
  ss: Discard empty descriptor at the end of buffer, if any, before
rendering
  ss: Introduce option to display selected columns only
  ss: Beautify output when arbitrary columns are hidden

 man/man8/ss.8 |  5 +++
 misc/ss.c | 85 +++
 2 files changed, 77 insertions(+), 13 deletions(-)

-- 
2.19.1

[PATCH iproute2 net-next 3/3] ss: Beautify output when arbitrary columns are hidden

2018-10-30 Thread Stefano Brivio

Define a secondary alignment for columns in case the next column is
hidden, this avoids awkward outputs if e.g. the local address is shown,
but not the local port.

Omit embedded delimiter in socket specifiers if the port or service field
is hidden.

Signed-off-by: Stefano Brivio 
---
 misc/ss.c | 30 --
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 91be3c6db151..d489233681e9 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -131,7 +131,8 @@ enum col_align {
 };
 
 struct column {
-   const enum col_align align;
+   enum col_align align;
+   const enum col_align align_without_next;
const char *optname;
const char *header;
const char *ldelim;
@@ -141,15 +142,15 @@ struct column {
 };
 
 static struct column columns[] = {
-   { ALIGN_LEFT,   "netid","Netid","",  0, 0, 0 },
-   { ALIGN_LEFT,   "state","State"," ", 0, 0, 0 },
-   { ALIGN_LEFT,   "recvq","Recv-Q",   " ", 0, 0, 0 },
-   { ALIGN_LEFT,   "sendq","Send-Q",   " ", 0, 0, 0 },
-   { ALIGN_RIGHT,  "local","Local Address:",   " ", 0, 0, 0 },
-   { ALIGN_LEFT,   "lport","Port", "",  0, 0, 0 },
-   { ALIGN_RIGHT,  "peer", "Peer Address:"," ", 0, 0, 0 },
-   { ALIGN_LEFT,   "pport","Port", "",  0, 0, 0 },
-   { ALIGN_LEFT,   "ext",  "", "",  0, 0, 0 },
+   { ALIGN_LEFT,  ALIGN_LEFT, "netid", "Netid",  "",  0, 0, 0 },
+   { ALIGN_LEFT,  ALIGN_LEFT, "state", "State",  " ", 0, 0, 0 },
+   { ALIGN_LEFT,  ALIGN_LEFT, "recvq", "Recv-Q", " ", 0, 0, 0 },
+   { ALIGN_LEFT,  ALIGN_LEFT, "sendq", "Send-Q", " ", 0, 0, 0 },
+   { ALIGN_RIGHT, ALIGN_LEFT, "local", "Local Address:", " ", 0, 0, 0 },
+   { ALIGN_LEFT,  ALIGN_LEFT, "lport", "Port",   "",  0, 0, 0 },
+   { ALIGN_RIGHT, ALIGN_LEFT, "peer",  "Peer Address:",  " ", 0, 0, 0 },
+   { ALIGN_LEFT,  ALIGN_LEFT, "pport", "Port",   "",  0, 0, 0 },
+   { ALIGN_LEFT,  ALIGN_LEFT, "ext",   "",   "",  0, 0, 0 },
 };
 
 static struct column *current_field = columns;
@@ -1374,6 +1375,9 @@ static void sock_details_print(struct sockstat *s)
 static void sock_addr_print(const char *addr, char *delim, const char *port,
const char *ifname)
 {
+   if ((current_field + 1)->disabled)
+   delim = "";
+
if (ifname)
out("%s" "%%" "%s%s", addr, ifname, delim);
else
@@ -5006,6 +5010,12 @@ int main(int argc, char *argv[])
}
p = p1 + 1;
} while (p1);
+
+   for (f = columns; field_is_valid(f + 1); f++) {
+   if ((f + 1)->disabled)
+   f->align = f->align_without_next;
+   }
+
break;
}
case 'h':
-- 
2.19.1

[PATCH iproute2 net-next 2/3] ss: Introduce option to display selected columns only

2018-10-30 Thread Stefano Brivio

The new option --columns (short: -c) allows to select columns to be
displayed. Note that this doesn't affect the order in which columns are
displayed.

Reported-by: Yoann P. 
Signed-off-by: Stefano Brivio 
---
 man/man8/ss.8 |  5 +
 misc/ss.c | 62 ++-
 2 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/man/man8/ss.8 b/man/man8/ss.8
index 7a6572b17364..c987dec6bcd7 100644
--- a/man/man8/ss.8
+++ b/man/man8/ss.8
@@ -24,6 +24,11 @@ Output version information.
 .B \-H, \-\-no-header
 Suppress header line.
 .TP
+.B \-c COLS, \-\-columns=COLS
+Only display selected columns, separated by commas. The following column names
+are understood: netid, state, local, lport, peer, pport, ext. This does not
+define the order of columns.
+.TP
 .B \-n, \-\-numeric
 Do not try to resolve service names.
 .TP
diff --git a/misc/ss.c b/misc/ss.c
index c3f61ef66258..91be3c6db151 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -132,6 +132,7 @@ enum col_align {
 
 struct column {
const enum col_align align;
+   const char *optname;
const char *header;
const char *ldelim;
int disabled;
@@ -140,15 +141,15 @@ struct column {
 };
 
 static struct column columns[] = {
-   { ALIGN_LEFT,   "Netid","", 0, 0, 0 },
-   { ALIGN_LEFT,   "State"," ",0, 0, 0 },
-   { ALIGN_LEFT,   "Recv-Q",   " ",0, 0, 0 },
-   { ALIGN_LEFT,   "Send-Q",   " ",0, 0, 0 },
-   { ALIGN_RIGHT,  "Local Address:",   " ",0, 0, 0 },
-   { ALIGN_LEFT,   "Port", "", 0, 0, 0 },
-   { ALIGN_RIGHT,  "Peer Address:"," ",0, 0, 0 },
-   { ALIGN_LEFT,   "Port", "", 0, 0, 0 },
-   { ALIGN_LEFT,   "", "", 0, 0, 0 },
+   { ALIGN_LEFT,   "netid","Netid","",  0, 0, 0 },
+   { ALIGN_LEFT,   "state","State"," ", 0, 0, 0 },
+   { ALIGN_LEFT,   "recvq","Recv-Q",   " ", 0, 0, 0 },
+   { ALIGN_LEFT,   "sendq","Send-Q",   " ", 0, 0, 0 },
+   { ALIGN_RIGHT,  "local","Local Address:",   " ", 0, 0, 0 },
+   { ALIGN_LEFT,   "lport","Port", "",  0, 0, 0 },
+   { ALIGN_RIGHT,  "peer", "Peer Address:"," ", 0, 0, 0 },
+   { ALIGN_LEFT,   "pport","Port", "",  0, 0, 0 },
+   { ALIGN_LEFT,   "ext",  "", "",  0, 0, 0 },
 };
 
 static struct column *current_field = columns;
@@ -1073,6 +1074,11 @@ static int field_is_last(struct column *f)
return f - columns == COL_MAX - 1;
 }
 
+static int field_is_valid(struct column *f)
+{
+   return f >= columns && f - columns < COL_MAX;
+}
+
 static void field_next(void)
 {
field_flush(current_field);
@@ -4666,6 +4672,8 @@ static void _usage(FILE *dest)
 "\n"
 "   -K, --kill  forcibly close sockets, display what was closed\n"
 "   -H, --no-header Suppress header line\n"
+"   -c, --columns=COLS  display only COLS columns\n"
+"   COLS := {netid|state|local|lport|peer|pport|ext}[,COLS]\n"
 "\n"
 "   -A, --query=QUERY, --socket=QUERY\n"
 "   QUERY := 
{all|inet|tcp|udp|raw|unix|unix_dgram|unix_stream|unix_seqpacket|packet|netlink|vsock_stream|vsock_dgram|tipc}[,QUERY]\n"
@@ -4785,6 +4793,7 @@ static const struct option long_opts[] = {
{ "tipcinfo", 0, 0, OPT_TIPCINFO},
{ "kill", 0, 0, 'K' },
{ "no-header", 0, 0, 'H' },
+   { "columns", 1, 0, 'c' },
{ 0 }
 
 };
@@ -4800,7 +4809,7 @@ int main(int argc, char *argv[])
int state_filter = 0;
 
while ((ch = getopt_long(argc, argv,
-"dhaletuwxnro460spbEf:miA:D:F:vVzZN:KHS",
+"dhaletuwxnro460spbEf:miA:D:F:vVzZN:KHc:S",
 long_opts, NULL)) != EOF) {
switch (ch) {
case 'n':
@@ -4966,6 +4975,39 @@ int main(int argc, char *argv[])
case 'H':
show_header = 0;
break;
+   case 'c':
+   {
+   struct column *f;
+   char *p, *p1;
+
+   if (!optarg) {
+

[PATCH iproute] ss: Actually print left delimiter for columns

2018-10-29 Thread Stefano Brivio

While rendering columns, we use a local variable to keep track of the
field currently being printed, without touching current_field, which is
used for buffering.

Use the right pointer to access the left delimiter for the current column,
instead of always printing the left delimiter for the last buffered field,
which is usually an empty string.

This fixes an issue especially visible on narrow terminals, where some
columns might be displayed without separation.

Reported-by: YoyPa 
Fixes: 691bd854bf4a ("ss: Buffer raw fields first, then render them as a table")
Signed-off-by: Stefano Brivio 
Tested-by: YoyPa 
---
 misc/ss.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/misc/ss.c b/misc/ss.c
index c8970438ce73..4d12fb5d19df 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -1260,7 +1260,7 @@ static void render(void)
while (token) {
/* Print left delimiter only if we already started a line */
if (line_started++)
-   printed = printf("%s", current_field->ldelim);
+   printed = printf("%s", f->ldelim);
else
printed = 0;
 
-- 
2.19.1

Re: [PATCH] Fix ss Netid column and Local/Peer_Address

2018-10-29 Thread Stefano Brivio

On Mon, 29 Oct 2018 21:07:47 +0100
"Yoann P."  wrote:

> > -   printed = printf("%s", current_field->ldelim);
> > +   printed = printf("%s", f->ldelim);
> > else
> > printed = 0;  
> 
> I can't reproduce the issue with this modification. :).

Thanks a lot for testing, and especially for reporting this! I'll
submit this as a patch in a moment.

-- 
Stefano

Re: [PATCH] Fix ss Netid column and Local/Peer_Address

2018-10-29 Thread Stefano Brivio

On Mon, 29 Oct 2018 21:06:35 +0100
"Yoann P."  wrote:

> > By the way, why do you use column(1), when ss already prints output in
> > columns? Any other issue you are working around?  
>
> column can hide columns with "-H -" and is a bit faster than awk to output a 
> single column according to time, it's the only reason I mentioned it.

Okay, but why do you need to hide some columns in the first place? I'm
wondering if your use case would justify adding options to print
selected columns only, in a generic way (right now, you can only
disable some).

Another possibility would be to rename "Local Address:" to "Local:" and
"Peer Address:" to "Peer:" -- in some cases (UNIX sockets) it's already
not so much of an address, more of a path, and "Address" doesn't really
add value when the field contains an address.

I don't like too much "Local_Address:" and "Peer_Address:" as the
output is supposed to be human-readable by default, and that underscore
just doesn't fit.

-- 
Stefano

Re: [PATCH] Fix ss Netid column and Local/Peer_Address

2018-10-29 Thread Stefano Brivio

On Mon, 29 Oct 2018 19:20:36 +0100
Stefano Brivio  wrote:

> The actual issue seems to be that in some cases the left delimiter for
> the State column is not printed

Much worse, we always print the left delimiter of the last buffered
column, which is usually empty. My bad.

The issue is not so visible in general as we almost always have spaces
to distribute around, but not if you start going below 70/75 columns.
Can you try this?

diff --git a/misc/ss.c b/misc/ss.c
index f99b6874c228..90986b1dc15f 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -1260,7 +1260,7 @@ static void render(void)
while (token) {
/* Print left delimiter only if we already started a line */
if (line_started++)
-   printed = printf("%s", current_field->ldelim);
+   printed = printf("%s", f->ldelim);
else
printed = 0;

-- 
Stefano

Re: [PATCH] Fix ss Netid column and Local/Peer_Address

2018-10-29 Thread Stefano Brivio

Hi Yohann,

On Fri, 26 Oct 2018 22:53:32 +0200
"Yoann P."  wrote:

> When using ss -Hutn4 or -utn3, Netid and State columns are sometime merged, 
> it 
> can be confusing when trying to pipe into awk or column.

Thanks for fixing this. A few comments though:

> @@ -144,9 +144,9 @@ static struct column columns[] = {
> { ALIGN_LEFT,   "State"," ",0, 0, 0 },
> { ALIGN_LEFT,   "Recv-Q",   " ",0, 0, 0 },
> { ALIGN_LEFT,   "Send-Q",   " ",0, 0, 0 },
> -   { ALIGN_RIGHT,  "Local Address:",   " ",0, 0, 0 },
> +   { ALIGN_RIGHT,  "Local_Address:",   " ",0, 0, 0 },
> { ALIGN_LEFT,   "Port", "", 0, 0, 0 },
> -   { ALIGN_RIGHT,  "Peer Address:"," ",0, 0, 0 },
> +   { ALIGN_RIGHT,  "Peer_Address:"," ",0, 0, 0 },

This is needed only if you pipe the output to column(1), I don't think
it's a bug, because printing the header when you pass the output to
column(1) makes little sense -- one should use -H then.

By the way, why do you use column(1), when ss already prints output in
columns? Any other issue you are working around?

> { ALIGN_LEFT,   "Port", "", 0, 0, 0 },
> { ALIGN_LEFT,   "", "", 0, 0, 0 },
>  };
> @@ -1334,7 +1334,7 @@ static void sock_state_print(struct sockstat *s)
> out("`- %s", sctp_sstate_name[s->state]);
> } else {
> field_set(COL_NETID);
> -   out("%s", sock_name);
> +   out("%-6s", sock_name);

I could reproduce this issue with a 70-columns terminal and the options
you gave.

Anyway, I don't think this is the right way to fix it: this will waste
one to two columns in case we have three letters for the Netid
specifier, and won't work the day we get six-letters names. In general,
it looks like a bad idea to reintroduce hardcoded width counts.

The actual issue seems to be that in some cases the left delimiter for
the State column is not printed, and I think you should fix that
instead. I'll look into this within a couple of days and give you some
more specific hints in case you still need them by then.

-- 
Stefano

[PATCH net] ipv6/ndisc: Preserve IPv6 control buffer if protocol error handlers are called

2018-10-24 Thread Stefano Brivio

Commit a61bbcf28a8c ("[NET]: Store skb->timestamp as offset to a base
timestamp") introduces a neighbour control buffer and zeroes it out in
ndisc_rcv(), as ndisc_recv_ns() uses it.

Commit f2776ff04722 ("[IPV6]: Fix address/interface handling in UDP and
DCCP, according to the scoping architecture.") introduces the usage of the
IPv6 control buffer in protocol error handlers (e.g. inet6_iif() in
present-day __udp6_lib_err()).

Now, with commit b94f1c0904da ("ipv6: Use icmpv6_notify() to propagate
redirect, instead of rt6_redirect()."), we call protocol error handlers
from ndisc_redirect_rcv(), after the control buffer is already stolen and
some parts are already zeroed out. This implies that inet6_iif() on this
path will always return zero.

This gives unexpected results on UDP socket lookup in __udp6_lib_err(), as
we might actually need to match sockets for a given interface.

Instead of always claiming the control buffer in ndisc_rcv(), do that only
when needed.

Fixes: b94f1c0904da ("ipv6: Use icmpv6_notify() to propagate redirect, instead 
of rt6_redirect().")
Signed-off-by: Stefano Brivio 
---
 net/ipv6/ndisc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 0ec273997d1d..673a4a932f2a 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -1732,10 +1732,9 @@ int ndisc_rcv(struct sk_buff *skb)
return 0;
}
 
-   memset(NEIGH_CB(skb), 0, sizeof(struct neighbour_cb));
-
switch (msg->icmph.icmp6_type) {
case NDISC_NEIGHBOUR_SOLICITATION:
+   memset(NEIGH_CB(skb), 0, sizeof(struct neighbour_cb));
ndisc_recv_ns(skb);
break;
 
-- 
2.19.1

[PATCH net] ip6_tunnel: Fix encapsulation layout

2018-10-18 Thread Stefano Brivio

Commit 058214a4d1df ("ip6_tun: Add infrastructure for doing
encapsulation") added the ip6_tnl_encap() call in ip6_tnl_xmit(), before
the call to ipv6_push_frag_opts() to append the IPv6 Tunnel Encapsulation
Limit option (option 4, RFC 2473, par. 5.1) to the outer IPv6 header.

As long as the option didn't actually end up in generated packets, this
wasn't an issue. Then commit 89a23c8b528b ("ip6_tunnel: Fix missing tunnel
encapsulation limit option") fixed sending of this option, and the
resulting layout, e.g. for FoU, is:

.---..--.---.- - -
| Outer IPv6 Header | UDP header | Option 4 | Inner IPv6 Header | Payload
'---''--'---'- - -

Needless to say, FoU and GUE (at least) won't work over IPv6. The option
is appended by default, and I couldn't find a way to disable it with the
current iproute2.

Turn this into a more reasonable:

.---.--..---.- - -
| Outer IPv6 Header | Option 4 | UDP header | Inner IPv6 Header | Payload
'---'--''---'- - -

With this, and with 84dad55951b0 ("udp6: fix encap return code for
resubmitting"), FoU and GUE work again over IPv6.

Fixes: 058214a4d1df ("ip6_tun: Add infrastructure for doing encapsulation")
Signed-off-by: Stefano Brivio 
---
 net/ipv6/ip6_tunnel.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 6f05e0f74bdf..77b3593e7f76 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1185,11 +1185,6 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
}
skb_dst_set(skb, dst);
 
-   if (encap_limit >= 0) {
-   init_tel_txopt(, encap_limit);
-   ipv6_push_frag_opts(skb, , );
-   }
-
if (hop_limit == 0) {
if (skb->protocol == htons(ETH_P_IP))
hop_limit = ip_hdr(skb)->ttl;
@@ -1211,6 +1206,11 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
if (err)
return err;
 
+   if (encap_limit >= 0) {
+   init_tel_txopt(, encap_limit);
+   ipv6_push_frag_opts(skb, , );
+   }
+
skb_push(skb, sizeof(struct ipv6hdr));
skb_reset_network_header(skb);
ipv6h = ipv6_hdr(skb);
-- 
2.19.1

Re: [PATCH net 1/2] geneve, vxlan: Don't check skb_dst() twice

2018-10-15 Thread Stefano Brivio

On Mon, 15 Oct 2018 12:19:41 +0200
Nicolas Dichtel  wrote:

> Le 12/10/2018 à 23:53, Stefano Brivio a écrit :
> > Commit f15ca723c1eb ("net: don't call update_pmtu unconditionally") avoids
> > that we try updating PMTU for a non-existent destination, but didn't clean
> > up cases where the check was already explicit. Drop those redundant checks. 
> >  
> Yes, I leave them to avoid calculating the new mtu value when not needed. We 
> are
> in the xmit path.

Before 2/2 of this series, though, we call skb_dst_update_pmtu() (and
in turn dst->ops->update_pmtu()) for *every* packet with a dst, which
I'd dare saying is by far the most common case. Besides, 2/2 needs
anyway to calculate the MTU to fix a bug.

So I think this is a vast improvement overall.

If we want to improve this further and avoid any indirect calls in the
most common path, we would need to cache the MTU in the dst -- it's
probably doable, but I would fix the specific issue addressed by 2/2
first.

-- 
Stefano

Re: [PATCH net] ip6_tunnel: Don't update PMTU on tunnels with collect_md

2018-10-15 Thread Stefano Brivio

On Mon, 15 Oct 2018 10:48:05 +0200
Nicolas Dichtel  wrote:

> Le 12/10/2018 à 18:34, Stefano Brivio a écrit :
> > On Fri, 12 Oct 2018 17:58:55 +0200
> > Nicolas Dichtel  wrote:  
> [snip]
> >> Could you explain in your commit log which problem your patch fixes?  
> > 
> > Nothing really.
> > 
> > The change in f15ca723c1eb looked accidental and I thought it doesn't
> > make sense to update the PMTU in that case, but I didn't figure out
> > it's not actually done anyway.
> > 
> > Maybe it makes things a bit more readable, in that case I'd target it
> > for net-next. What do you think?
> >   
> I don't think that this patch helps. The purpose of the skb_dst_update_pmtu()
> helper is to hide those things. If one day, update_pmtu is defined for
> md_dst_op, I bet that we won't remove this test.

I see, makes sense.

David, please drop this patch, and sorry for the noise.

-- 
Stefano

Re: [PATCH net 2/2] geneve, vxlan: Don't set exceptions if skb->len < mtu

2018-10-15 Thread Stefano Brivio

On Mon, 15 Oct 2018 15:01:31 +0900
Xin Long  wrote:

> On Sat, Oct 13, 2018 at 6:54 AM Stefano Brivio  wrote:
> >
> > We shouldn't abuse exceptions: if the destination MTU is already higher
> > than what we're transmitting, no exception should be created.  
> makes sense, shouldn't ip(6) tunnels also do this?

I should probably have mentioned this in the cover letter: in theory
yes, but I'm doing this as preparation for ICMP handling in UDP
tunnels, and those will get selftests soon (once I'm done).

Writing extensive selftests for IP tunnels will take significantly
longer, so I'm not too confident to change this right now. I'd prefer
to address that at a later time.

-- 
Stefano

[PATCH net-next 2/2] selftests: pmtu: Add optional traffic captures for single tests

2018-10-12 Thread Stefano Brivio

If --trace is passed as an option and tcpdump is available,
capture traffic for all relevant interfaces to per-test pcap
files named _.pcap.

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
 tools/testing/selftests/net/pmtu.sh | 60 +
 1 file changed, 53 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index 8278a24f5ba6..ed549c03d2a7 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -142,6 +142,7 @@ dummy6_mask="64"
 
 cleanup_done=1
 err_buf=
+tcpdump_pids=
 
 err() {
err_buf="${err_buf}${1}
@@ -284,7 +285,24 @@ setup() {
done
 }
 
+trace() {
+   [ $tracing -eq 0 ] && return
+
+   for arg do
+   [ "${ns_cmd}" = "" ] && ns_cmd="${arg}" && continue
+   ${ns_cmd} tcpdump -s 0 -i "${arg}" -w "${name}_${arg}.pcap" 2> 
/dev/null &
+   tcpdump_pids="${tcpdump_pids} $!"
+   ns_cmd=
+   done
+   sleep 1
+}
+
 cleanup() {
+   for pid in ${tcpdump_pids}; do
+   kill ${pid}
+   done
+   tcpdump_pids=
+
[ ${cleanup_done} -eq 1 ] && return
for n in ${NS_A} ${NS_B} ${NS_R1} ${NS_R2}; do
ip netns del ${n} 2> /dev/null
@@ -357,6 +375,10 @@ test_pmtu_ipvX() {
family=${1}
 
setup namespaces routing || return 2
+   trace "${ns_a}"  veth_A-R1"${ns_r1}" veth_R1-A \
+ "${ns_r1}" veth_R1-B"${ns_b}"  veth_B-R1 \
+ "${ns_a}"  veth_A-R2"${ns_r2}" veth_R2-A \
+ "${ns_r2}" veth_R2-B"${ns_b}"  veth_B-R2
 
if [ ${family} -eq 4 ]; then
ping=ping
@@ -445,6 +467,8 @@ test_pmtu_ipv6_exception() {
 
 test_pmtu_vti4_exception() {
setup namespaces veth vti4 xfrm4 || return 2
+   trace "${ns_a}" veth_a"${ns_b}" veth_b \
+ "${ns_a}" vti4_a"${ns_b}" vti4_b
 
veth_mtu=1500
vti_mtu=$((veth_mtu - 20))
@@ -473,6 +497,8 @@ test_pmtu_vti4_exception() {
 
 test_pmtu_vti6_exception() {
setup namespaces veth vti6 xfrm6 || return 2
+   trace "${ns_a}" veth_a"${ns_b}" veth_b \
+ "${ns_a}" vti6_a"${ns_b}" vti6_b
fail=0
 
# Create route exception by exceeding link layer MTU
@@ -643,29 +669,49 @@ test_pmtu_vti6_link_change_mtu() {
 
 usage() {
echo
-   echo "$0 [TEST]..."
+   echo "$0 [OPTIONS] [TEST]..."
echo "If no TEST argument is given, all tests will be run."
echo
+   echo "Options"
+   echo "  --trace: capture traffic to TEST_INTERFACE.pcap"
+   echo
echo "Available tests${tests}"
exit 1
 }
 
+exitcode=0
+desc=0
+IFS="  
+"
+
+tracing=0
 for arg do
-   # Check first that all requested tests are available before running any
-   command -v > /dev/null "test_${arg}" || { echo "=== Test ${arg} not 
found"; usage; }
+   if [ "${arg}" != "${arg#--*}" ]; then
+   opt="${arg#--}"
+   if [ "${opt}" = "trace" ]; then
+   if which tcpdump > /dev/null 2>&1; then
+   tracing=1
+   else
+   echo "=== tcpdump not available, tracing 
disabled"
+   fi
+   else
+   usage
+   fi
+   else
+   # Check first that all requested tests are available before
+   # running any
+   command -v > /dev/null "test_${arg}" || { echo "=== Test ${arg} 
not found"; usage; }
+   fi
 done
 
 trap cleanup EXIT
 
-exitcode=0
-desc=0
-IFS="  
-"
 for t in ${tests}; do
[ $desc -eq 0 ] && name="${t}" && desc=1 && continue || desc=0
 
run_this=1
for arg do
+   [ "${arg}" != "${arg#--*}" ] && continue
[ "${arg}" = "${name}" ] && run_this=1 && break
run_this=0
done
-- 
2.19.1

[PATCH net-next 0/2] selftests: pmtu: Add test choice and captures

2018-10-12 Thread Stefano Brivio

This series adds a couple of features useful for debugging: 1/2
allows selecting single tests and 2/2 adds optional traffic
captures.

Semantics for current invocation of test script are preserved.

Stefano Brivio (2):
  selftests: pmtu: Allow selection of single tests
  selftests: pmtu: Add optional traffic captures for single tests

 tools/testing/selftests/net/pmtu.sh | 69 -
 1 file changed, 68 insertions(+), 1 deletion(-)

-- 
2.19.1

[PATCH net-next 1/2] selftests: pmtu: Allow selection of single tests

2018-10-12 Thread Stefano Brivio

As number of tests is growing, it's quite convenient to allow
single tests to be run.

Display usage when the script is run with any invalid argument,
keep existing semantics when no arguments are passed so that
automated runs won't break.

Instead of just looping on the list of requested tests, if any,
check first that they exist, and go through them in a nested
loop to keep the existing way to display test descriptions.

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
 tools/testing/selftests/net/pmtu.sh | 21 +
 1 file changed, 21 insertions(+)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index b9cdb68df4c5..8278a24f5ba6 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -641,6 +641,20 @@ test_pmtu_vti6_link_change_mtu() {
return ${fail}
 }
 
+usage() {
+   echo
+   echo "$0 [TEST]..."
+   echo "If no TEST argument is given, all tests will be run."
+   echo
+   echo "Available tests${tests}"
+   exit 1
+}
+
+for arg do
+   # Check first that all requested tests are available before running any
+   command -v > /dev/null "test_${arg}" || { echo "=== Test ${arg} not 
found"; usage; }
+done
+
 trap cleanup EXIT
 
 exitcode=0
@@ -650,6 +664,13 @@ IFS="
 for t in ${tests}; do
[ $desc -eq 0 ] && name="${t}" && desc=1 && continue || desc=0
 
+   run_this=1
+   for arg do
+   [ "${arg}" = "${name}" ] && run_this=1 && break
+   run_this=0
+   done
+   [ $run_this -eq 0 ] && continue
+
(
unset IFS
eval test_${name}
-- 
2.19.1

[PATCH net 2/2] geneve, vxlan: Don't set exceptions if skb->len < mtu

2018-10-12 Thread Stefano Brivio

We shouldn't abuse exceptions: if the destination MTU is already higher
than what we're transmitting, no exception should be created.

Fixes: 52a589d51f10 ("geneve: update skb dst pmtu on tx path")
Fixes: a93bf0ff4490 ("vxlan: update skb dst pmtu on tx path")
Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
 drivers/net/geneve.c |  7 +++
 drivers/net/vxlan.c  |  4 ++--
 include/net/dst.h| 10 ++
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 61c4bfbeb41c..493cd382b8aa 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -830,8 +830,8 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct 
net_device *dev,
if (IS_ERR(rt))
return PTR_ERR(rt);
 
-   skb_dst_update_pmtu(skb, dst_mtu(>dst) -
-GENEVE_IPV4_HLEN - info->options_len);
+   skb_tunnel_check_pmtu(skb, >dst,
+ GENEVE_IPV4_HLEN + info->options_len);
 
sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
if (geneve->collect_md) {
@@ -872,8 +872,7 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct 
net_device *dev,
if (IS_ERR(dst))
return PTR_ERR(dst);
 
-   skb_dst_update_pmtu(skb, dst_mtu(dst) -
-GENEVE_IPV6_HLEN - info->options_len);
+   skb_tunnel_check_pmtu(skb, dst, GENEVE_IPV6_HLEN + info->options_len);
 
sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
if (geneve->collect_md) {
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 22e0ce592e07..27bd586b94b0 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2194,7 +2194,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct 
net_device *dev,
}
 
ndst = >dst;
-   skb_dst_update_pmtu(skb, dst_mtu(ndst) - VXLAN_HEADROOM);
+   skb_tunnel_check_pmtu(skb, ndst, VXLAN_HEADROOM);
 
tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
ttl = ttl ? : ip4_dst_hoplimit(>dst);
@@ -2231,7 +2231,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct 
net_device *dev,
goto out_unlock;
}
 
-   skb_dst_update_pmtu(skb, dst_mtu(ndst) - VXLAN6_HEADROOM);
+   skb_tunnel_check_pmtu(skb, ndst, VXLAN6_HEADROOM);
 
tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
ttl = ttl ? : ip6_dst_hoplimit(ndst);
diff --git a/include/net/dst.h b/include/net/dst.h
index 7f735e76ca73..6cf0870414c7 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -527,4 +527,14 @@ static inline void skb_dst_update_pmtu(struct sk_buff 
*skb, u32 mtu)
dst->ops->update_pmtu(dst, NULL, skb, mtu);
 }
 
+static inline void skb_tunnel_check_pmtu(struct sk_buff *skb,
+struct dst_entry *encap_dst,
+int headroom)
+{
+   u32 encap_mtu = dst_mtu(encap_dst);
+
+   if (skb->len > encap_mtu - headroom)
+   skb_dst_update_pmtu(skb, encap_mtu - headroom);
+}
+
 #endif /* _NET_DST_H */
-- 
2.19.1

[PATCH net 1/2] geneve, vxlan: Don't check skb_dst() twice

2018-10-12 Thread Stefano Brivio

Commit f15ca723c1eb ("net: don't call update_pmtu unconditionally") avoids
that we try updating PMTU for a non-existent destination, but didn't clean
up cases where the check was already explicit. Drop those redundant checks.

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
 drivers/net/geneve.c | 15 ---
 drivers/net/vxlan.c  | 12 ++--
 2 files changed, 6 insertions(+), 21 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 6acb6b5718b9..61c4bfbeb41c 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -830,12 +830,8 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct 
net_device *dev,
if (IS_ERR(rt))
return PTR_ERR(rt);
 
-   if (skb_dst(skb)) {
-   int mtu = dst_mtu(>dst) - GENEVE_IPV4_HLEN -
- info->options_len;
-
-   skb_dst_update_pmtu(skb, mtu);
-   }
+   skb_dst_update_pmtu(skb, dst_mtu(>dst) -
+GENEVE_IPV4_HLEN - info->options_len);
 
sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
if (geneve->collect_md) {
@@ -876,11 +872,8 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct 
net_device *dev,
if (IS_ERR(dst))
return PTR_ERR(dst);
 
-   if (skb_dst(skb)) {
-   int mtu = dst_mtu(dst) - GENEVE_IPV6_HLEN - info->options_len;
-
-   skb_dst_update_pmtu(skb, mtu);
-   }
+   skb_dst_update_pmtu(skb, dst_mtu(dst) -
+GENEVE_IPV6_HLEN - info->options_len);
 
sport = udp_flow_src_port(geneve->net, skb, 1, USHRT_MAX, true);
if (geneve->collect_md) {
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 2b8da2b7e721..22e0ce592e07 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2194,11 +2194,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct 
net_device *dev,
}
 
ndst = >dst;
-   if (skb_dst(skb)) {
-   int mtu = dst_mtu(ndst) - VXLAN_HEADROOM;
-
-   skb_dst_update_pmtu(skb, mtu);
-   }
+   skb_dst_update_pmtu(skb, dst_mtu(ndst) - VXLAN_HEADROOM);
 
tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
ttl = ttl ? : ip4_dst_hoplimit(>dst);
@@ -2235,11 +2231,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct 
net_device *dev,
goto out_unlock;
}
 
-   if (skb_dst(skb)) {
-   int mtu = dst_mtu(ndst) - VXLAN6_HEADROOM;
-
-   skb_dst_update_pmtu(skb, mtu);
-   }
+   skb_dst_update_pmtu(skb, dst_mtu(ndst) - VXLAN6_HEADROOM);
 
tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
ttl = ttl ? : ip6_dst_hoplimit(ndst);
-- 
2.19.1

[PATCH net 0/2] geneve, vxlan: Don't set exceptions if skb->len < mtu

2018-10-12 Thread Stefano Brivio

This series fixes the exception abuse described in 2/2, and 1/2
is just a preparatory change to make 2/2 less ugly.

Stefano Brivio (2):
  geneve, vxlan: Don't check skb_dst() twice
  geneve, vxlan: Don't set exceptions if skb->len < mtu

 drivers/net/geneve.c | 14 +++---
 drivers/net/vxlan.c  | 12 ++--
 include/net/dst.h| 10 ++
 3 files changed, 15 insertions(+), 21 deletions(-)

-- 
2.19.1

Re: [PATCH net] ip6_tunnel: Don't update PMTU on tunnels with collect_md

2018-10-12 Thread Stefano Brivio

On Fri, 12 Oct 2018 17:58:55 +0200
Nicolas Dichtel  wrote:

> Le 12/10/2018 à 14:32, Stefano Brivio a écrit :
> > Commit 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6
> > tunnels") introduced a check to avoid updating PMTU when
> > collect_md mode is enabled.
> > 
> > Later, commit f15ca723c1eb ("net: don't call update_pmtu
> > unconditionally") dropped this check, I guess inadvertently.  
> I removed it because update_pmtu() is not set for md_dst_op, thus I assume 
> this
> check was done for that purpose.

Ha, sounds reasonable, yes.

> Could you explain in your commit log which problem your patch fixes?

Nothing really.

The change in f15ca723c1eb looked accidental and I thought it doesn't
make sense to update the PMTU in that case, but I didn't figure out
it's not actually done anyway.

Maybe it makes things a bit more readable, in that case I'd target it
for net-next. What do you think?

-- 
Stefano

[PATCH net] ip6_tunnel: Don't update PMTU on tunnels with collect_md

2018-10-12 Thread Stefano Brivio

Commit 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6
tunnels") introduced a check to avoid updating PMTU when
collect_md mode is enabled.

Later, commit f15ca723c1eb ("net: don't call update_pmtu
unconditionally") dropped this check, I guess inadvertently.
Restore it.

Fixes: f15ca723c1eb ("net: don't call update_pmtu unconditionally")
Signed-off-by: Stefano Brivio 
---
 net/ipv6/ip6_tunnel.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index a0b6932c3afd..6f05e0f74bdf 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1136,7 +1136,8 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
mtu = max(mtu, skb->protocol == htons(ETH_P_IPV6) ?
   IPV6_MIN_MTU : IPV4_MIN_MTU);
 
-   skb_dst_update_pmtu(skb, mtu);
+   if (!t->parms.collect_md)
+   skb_dst_update_pmtu(skb, mtu);
if (skb->len - t->tun_hlen - eth_hlen > mtu && !skb_is_gso(skb)) {
*pmtu = mtu;
err = -EMSGSIZE;
-- 
2.19.1

Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-09-26 Thread Stefano Brivio

Hi Pravin,

On Wed, 15 Aug 2018 00:19:39 -0700
Pravin Shelar  wrote:

> I understand fairness has cost, but we need to find right balance
> between performance and fairness. Current fairness scheme is a
> lockless algorithm without much computational overhead, did you try to
> improve current algorithm so that it uses less number of ports.

In the end, we tried harder as you suggested, and found out a way to
avoid the need for per-thread sets of per-vport sockets: with a few
changes, a process-wide set of per-vport sockets is actually enough to
maintain the current fairness behaviour.

Further, we can get rid of duplicate fd events if the EPOLLEXCLUSIVE
epoll() flag is available, which improves performance noticeably. This
is explained in more detail in the commit message of the ovs-vswitchd
patch Matteo wrote [1], now merged.

This approach solves the biggest issue (i.e. huge amount of netlink
sockets) in ovs-vswitchd itself, without the need for kernel changes.

It doesn't address some proposed improvements though, that is, it does
nothing to improve the current fairness scheme, it won't allow neither
the configurable fairness criteria proposed by Ben, nor usage of BPF
maps for extensibility as suggested by William.

I would however say that those topics bear definitely lower priority,
and perhaps addressing them at all becomes overkill now that we don't
any longer have a serious issue.

[1] https://patchwork.ozlabs.org/patch/974304/

-- 
Stefano

Re: [PATCH net] selftests: pmtu: properly redirect stderr to /dev/null

2018-09-17 Thread Stefano Brivio

On Mon, 17 Sep 2018 15:30:06 +0200
Sabrina Dubroca  wrote:

> The cleanup function uses "$CMD 2 > /dev/null", which doesn't actually
> send stderr to /dev/null, so when the netns doesn't exist, the error
> message is shown. Use "2> /dev/null" instead, so that those messages
> disappear, as was intended.

Oops, thanks for catching this.

> Fixes: d1f1b9cbf34c ("selftests: net: Introduce first PMTU test")
> Signed-off-by: Sabrina Dubroca 

Acked-by: Stefano Brivio 

-- 
Stefano

Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-08-16 Thread Stefano Brivio

Pravin,

On Wed, 15 Aug 2018 00:19:39 -0700
Pravin Shelar  wrote:

> My argument is not about proposed fairness algorithm. It is about cost
> of the fairness and I do not see it is addressed in any of the follow
> ups.

We are still working on it (especially on the points that you mentioned
already), that's why there hasn't been any follow-up yet.

After all, we marked this patch as RFC because we thought we needed to
gather some feedback before we'd reach a solution that's good enough :)

> I revisited the original patch, here is what I see in term of added
> cost to existing upcall processing:

Thanks for the review and the detailed summary, some answers follow:

> 1. one "kzalloc(sizeof(*upcall), GFP_ATOMIC);" This involve allocate
> and initialize memory

We would use kmem_cache_alloc() here, the queue is rather small.
Initialisation, we don't really need it, we can drop it.

> 2. copy flow key which is more than 1 KB (upcall->key = *key)

The current idea here is to find a way to safely hold the pointer to the
flow key. Do you have any suggestion?

> 3. Acquire spin_lock_bh dp->upcalls.lock, which would disable bottom
> half processing on CPU while waiting for the global lock.

A double list, whose pointer is swapped when we start dequeuing
packets (same as it's done e.g. for the flow table on rehashing), would
avoid the need for this spinlock. We're trying that out.

> 4. Iterate list of queued upcalls, one of objective it is to avoid out
> of order packet. But I do not see point of ordering packets from
> different streams.

Please note, though, that we also have packets from the same stream.
Actually, the whole point of this exercise is to get packets from
different streams out of order, while maintaining order for a given
stream.

> 5. signal upcall thread after delay ovs_dp_upcall_delay(). This adds
> further to the latency.

The idea behind ovs_dp_upcall_delay() is to schedule without delay if
we don't currently have a storm of upcalls.

But if we do, we're probably introducing less latency by doing this than
by letting ovs-vswitchd handle them. It's also a fundamental requirement
to have fairness: we need to schedule upcalls, and to schedule we need
some (small, in the overall picture) delay. This is another point where
we need to show some detailed measurements, I guess.

> 6. upcall is then handed over to different thread (context switch),
> likely on different CPU.
> 8. the upcall object is freed on remote CPU.

The solution could be to use cmwq instead and have per-CPU workers and
queues. But I wonder what would be the drawbacks of having per-CPU
fairness. I think this depends a lot on how ovs-vswitchd handles the
upcalls. We could check how that performs. Any thoughts?

> 9. single lock essentially means OVS kernel datapath upcall processing
> is single threaded no matter number of cores in system.

This should also be solved by keeping two queues.

> I would be interested in how are we going to address these issues.
> 
> In example you were talking about netlink fd issue on server with 48
> core, how does this solution works when there are 5K ports each
> triggering upcall ? Can you benchmark your patch? Do you have
> performance numbers for TCP_CRR with and without this patch ? Also
> publish latency numbers for this patch. Please turn off megaflow to
> exercise upcall handling.

We just run some tests that show that fairness is maintained with a
much lower number of ports, but we have no performance numbers at the
moment -- other than the consideration that when flooding with upcalls,
ovs-vswitchd is the bottleneck. We'll run proper performance tests,
focusing especially on latency (which we kind of ignored so far).

> I understand fairness has cost, but we need to find right balance
> between performance and fairness. Current fairness scheme is a
> lockless algorithm without much computational overhead, did you try to
> improve current algorithm so that it uses less number of ports.

We tried with one socket per thread, it just doesn't work. We can
definitely try a bit harder. The problem I see here is that the current
mechanism is not actually a fairness scheme. It kind of works for most
workloads, but if a port happens to be flooding with a given timing, I
don't see how fairness can be guaranteed.

-- 
Stefano

Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-08-14 Thread Stefano Brivio

Hi William,

On Fri, 10 Aug 2018 07:11:01 -0700
William Tu  wrote:

> > int rr_select_srcport(struct dp_upcall_info *upcall)
> > {
> > /* look up source port from upcall->skb... */
> > }
> >
> > And we could then easily extend this to use BPF with maps one day.
> >
> >  
> Hi Stefano,
> 
> If you want to experiment with BPF, Joe and I have some prototype.
> We implemented the upcall mechanism using BPF perf event helper function
> https://github.com/williamtu/ovs-ebpf/blob/master/bpf/datapath.c#L62
> 
> And there are threads polling the perf ring buffer to receive packets from
> BPF.
> https://github.com/williamtu/ovs-ebpf/blob/master/lib/perf-event.c#L232

Interesting, thanks for the pointers!

> If I follow the discussion correctly, before upcall, you need to queue
> packets based on different configurations (vport/hash/vni/5-tuple/...)
> and queue to different buckets when congestion happens.

Yes, correct.

> In this case, you
> probably needs a BPF map to enqueue/dequeue the packet.BPF queue map is
> not supported yet, but there is patch available:
> [iovisor-dev] [RFC PATCH 1/3] bpf: add bpf queue map
> 
> So how to enqueue and dequeue packets depends on user's BPF implementation.
> This allows fairness scheme to be extensible.

For the moment being we'll try to ensure that BPF can be plugged there
rather easily. I see the advantage, but I'd rather do this as a second
step.

-- 
Stefano

Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-08-07 Thread Stefano Brivio

On Tue, 7 Aug 2018 15:31:11 +0200
Stefano Brivio  wrote:

> I would instead try to address the concerns that you had about the
> original patch adding fairness in the kernel, rather than trying to
> make the issue appear less severe in ovs-vswitchd.

And, by the way, if we introduce a way to configure the possible
fairness algorithms via netlink, we can also rather easily allow
ovs-vswitchd to disable fairness on kernel side altogether, should it
be needed.

-- 
Stefano

Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-08-07 Thread Stefano Brivio

Hi Pravin,

On Tue, 31 Jul 2018 16:12:03 -0700
Pravin Shelar  wrote:

> Rather than reducing number of thread down to 1, we could find better
> number of FDs per port.
> How about this simple solution:
> 1. Allocate (N * P) FDs as long as it is under FD limit.
> 2. If FD limit (-EMFILE) is hit reduce N value by half and repeat step 1.

I still see a few quantitative issues with this approach, other than
Ben's observation about design (which, by the way, looks entirely
reasonable to me).

We're talking about a disproportionate amount of sockets in any case.
We can have up to 2^16 vports here, with 5k vports being rather common,
and for any reasonable value of N that manages somehow to perturbate the
distribution of upcalls per thread, we are talking about something well
in excess of 100k sockets. I think this doesn't really scale.

With the current value for N (3/4 * number of threads) this can even get
close to /proc/sys/fs/file-max on some systems, and there raising the
number of allowed file descriptors for ovs-vswitchd isn't a solution
anymore.

I would instead try to address the concerns that you had about the
original patch adding fairness in the kernel, rather than trying to
make the issue appear less severe in ovs-vswitchd.

-- 
Stefano

Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-08-03 Thread Stefano Brivio

On Fri, 3 Aug 2018 16:01:08 -0700
Ben Pfaff  wrote:

> I think that a simple mechanism for fairness is fine.  The direction
> of extensibility that makes me anxious is how to decide what matters
> for fairness.  So far, we've talked about per-vport fairness.  That
> works pretty well for packets coming in from virtual interfaces where
> each vport represents a separate VM.

Yes, right, that's the case where we have significant issues currently.

> It does not work well if the traffic filling your queues all comes
> from a single physical port because some source of traffic is sending
> traffic at a high rate.  In that case, you'll do a lot better if you
> do fairness based on the source 5-tuple. But if you're doing network
> virtualization, then the outer source 5-tuples won't necessarily vary
> much and you'd be better off looking at the VNI and maybe some Geneve
> TLV options and maybe the inner 5-tuple...

Sure, I see what you mean now. That looks entirely doable if we
abstract the round-robin bucket selection out of the current patch.

> I would be very pleased if we could integrate a simple mechanism for
> fairness, based for now on some simple criteria like the source port,
> but thinking ahead to how we could later make it gracefully extensible
> to consider more general and possibly customizable criteria.

We could change the patch so that instead of just using the vport for
round-robin queue insertion, we generalise that and use "buckets"
instead of vports, and have a set of possible functions that are called
instead of using port_no directly in ovs_dp_upcall_queue_roundrobin(),
making this configurable via netlink, per datapath.

We could implement selection based on source port or a hash on the
source 5-tuple, and the relevant bits of
ovs_dp_upcall_queue_roundrobin() would look like this:

static int ovs_dp_upcall_queue_roundrobin(struct datapath *dp,
  struct dp_upcall_info *upcall)
{

[...]

list_for_each_entry(pos, head, list) {
int bucket = dp->rr_select(pos);

/* Count per-bucket upcalls. */
if (dp->upcalls.count[bucket] == U8_MAX) {
err = -ENOSPC;
goto out_clear;
}
dp->upcalls.count[bucket]++;

if (bucket == upcall->bucket) {
/* Another upcall for the same bucket: move insertion
 * point here, keep looking for insertion condition to
 * be still met further on.
 */
find_next = true;
here = pos;
continue;
}

count = dp->upcalls.count[bucket];
if (find_next && dp->upcalls.count[bucket] >= count) {
/* Insertion condition met: no need to look further,
 * unless another upcall for the same port occurs later.
 */
find_next = false;
here = pos;
}
}

[...]

}

and implementations for dp->rr_select() would look like:

int rr_select_vport(struct dp_upcall_info *upcall)
{
return upcall->port_no;
}

int rr_select_srcport(struct dp_upcall_info *upcall)
{
/* look up source port from upcall->skb... */
}

And we could then easily extend this to use BPF with maps one day.

This is for clarity by the way, but I guess we should avoid indirect
calls in the final implementation. 

What do you think?

-- 
Stefano

Re: [PATCH RFC net-next] openvswitch: Queue upcalls to userspace in per-port round-robin order

2018-08-03 Thread Stefano Brivio

Hi Ben,

On Tue, 31 Jul 2018 15:06:57 -0700
Ben Pfaff  wrote:

> This is an awkward problem to try to solve with sockets because of the
> nature of sockets, which are strictly first-in first-out.  What you
> really want is something closer to the algorithm that we use in
> ovs-vswitchd to send packets to an OpenFlow controller.  When the
> channel becomes congested, then for each packet to be sent to the
> controller, OVS appends it to a queue associated with its input port.
> (This could be done on a more granular basis than just port.)  If the
> maximum amount of queued packets is reached, then OVS discards a packet
> from the longest queue.  When space becomes available in the channel,
> OVS round-robins through the queues to send a packet.  This achieves
> pretty good fairness but it can't be done with sockets because you can't
> drop a packet that is already queued to one.

Thanks for your feedback. What you describe is, though, functionally
equivalent to what this patch does, minus the per-port queueing limit.

However, instead of having one explicit queue for each port, and
then fetching packets in a round-robin fashion from all the queues, we
implemented this with a single queue and choose insertion points while
queueing in such a way that the result is equivalent. This way, we
avoid the massive overhead associated with having one queue per each
port (we can have up to 2^16 ports), and cycling over them.

Let's say we have two ports, A and B, and three upcalls are sent for
each port. If we implement one queue for each port as you described, we
end up with this:

. - - -
| A1 | A2 | A3 |
' - - -

. - - -
| B1 | B2 | B3 |
' - - -

and then send upcalls in this order: A1, B1, A2, B2, A3, B3.

What we are doing here with a single queue is inserting the upcalls
directly in this order:

.--- - - -
| A1 | B1 | A2 | B2 | A3 | B3 |
'--- - - -

and dequeueing from the head.

About the per-port queueing limit: we currently have a global one
(UPCALL_QUEUE_MAX_LEN), while the per-port limit is simply given by
implementation constraints in our case:

if (dp->upcalls.count[pos->port_no] == U8_MAX - 1) {
err = -ENOSPC;
goto out_clear;
}

but we can easily swap that U8_MAX - 1 with another macro or a
configurable value, if there's any value in doing that.

> My current thought is that any fairness scheme we implement directly in
> the kernel is going to need to evolve over time.  Maybe we could do
> something flexible with BPF and maps, instead of hard-coding it.

Honestly, I fail to see what else we might want to do here, other than
adding a simple mechanism for fairness, to solve the specific issue at
hand. Flexibility would probably come at a higher cost. We could easily
make limits configurable if needed. Do you have anything else in mind?

-- 
Stefano

[PATCH net-next v2] net: Move skb decrypted field, avoid explicity copy

2018-07-17 Thread Stefano Brivio

Commit 784abe24c903 ("net: Add decrypted field to skb")
introduced a 'decrypted' field that is explicitly copied on skb
copy and clone.

Move it between headers_start[0] and headers_end[0], so that we
don't need to copy it explicitly as it's copied by the memcpy()
in __copy_skb_header().

While at it, drop the assignment in __skb_clone(), it was
already redundant.

This doesn't change the size of sk_buff or cacheline boundaries.

The 15-bits hole before tc_index becomes a 14-bits hole, and
will be again a 15-bits hole when this change is merged with
commit 8b7008620b84 ("net: Don't copy pfmemalloc flag in
__copy_skb_header()").

v2: as reported by kbuild test robot (oops, I forgot to build
with CONFIG_TLS_DEVICE it seems), we can't use
CHECK_SKB_FIELD() on a bit-field member. Just drop the
check for the moment being, perhaps we could think of some
magic to also check bit-field members one day.

Fixes: 784abe24c903 ("net: Add decrypted field to skb")
Signed-off-by: Stefano Brivio 
---
 include/linux/skbuff.h | 9 -
 net/core/skbuff.c  | 6 --
 2 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3ceb8dcc54da..14bc9ebe30f2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -630,7 +630,6 @@ typedef unsigned char *sk_buff_data_t;
  * @hash: the packet hash
  * @queue_mapping: Queue mapping for multiqueue devices
  * @xmit_more: More SKBs are pending for this queue
- * @decrypted: Decrypted SKB
  * @ndisc_nodetype: router type (from link layer)
  * @ooo_okay: allow the mapping of a socket to a queue to be changed
  * @l4_hash: indicate hash is a canonical 4-tuple hash over transport
@@ -641,6 +640,7 @@ typedef unsigned char *sk_buff_data_t;
  * @no_fcs:  Request NIC to treat last 4 bytes as Ethernet FCS
  * @csum_not_inet: use CRC32c to resolve CHECKSUM_PARTIAL
  * @dst_pending_confirm: need to confirm neighbour
+ * @decrypted: Decrypted SKB
   *@napi_id: id of the NAPI struct this skb came from
  * @secmark: security marking
  * @mark: Generic packet mark
@@ -737,11 +737,7 @@ struct sk_buff {
peeked:1,
head_frag:1,
xmit_more:1,
-#ifdef CONFIG_TLS_DEVICE
-   decrypted:1;
-#else
__unused:1;
-#endif
 
/* fields enclosed in headers_start/headers_end are copied
 * using a single memcpy() in __copy_skb_header()
@@ -797,6 +793,9 @@ struct sk_buff {
__u8tc_redirected:1;
__u8tc_from_ingress:1;
 #endif
+#ifdef CONFIG_TLS_DEVICE
+   __u8decrypted:1;
+#endif
 
 #ifdef CONFIG_NET_SCHED
__u16   tc_index;   /* traffic control index */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index cfd6c6f35f9c..c4e24ac27464 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -805,9 +805,6 @@ static void __copy_skb_header(struct sk_buff *new, const 
struct sk_buff *old)
 * It is not yet because we do not want to have a 16 bit hole
 */
new->queue_mapping = old->queue_mapping;
-#ifdef CONFIG_TLS_DEVICE
-   new->decrypted = old->decrypted;
-#endif
 
memcpy(>headers_start, >headers_start,
   offsetof(struct sk_buff, headers_end) -
@@ -868,9 +865,6 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, 
struct sk_buff *skb)
C(head_frag);
C(data);
C(truesize);
-#ifdef CONFIG_TLS_DEVICE
-   C(decrypted);
-#endif
refcount_set(>users, 1);
 
atomic_inc(&(skb_shinfo(skb)->dataref));
-- 
2.15.1

[PATCH net-next] net: Move skb decrypted field, avoid explicity copy

2018-07-17 Thread Stefano Brivio

Commit 784abe24c903 ("net: Add decrypted field to skb")
introduced a 'decrypted' field that is explicitly copied on skb
copy and clone.

Move it between headers_start[0] and headers_end[0], so that we
don't need to copy it explicitly as it's copied by the memcpy()
in __copy_skb_header().

While at it, drop the assignment in __skb_clone(), it was
already redundant.

This doesn't change the size of sk_buff or cacheline boundaries.

The 15-bits hole before tc_index becomes a 14-bits hole, and
will be again a 15-bits hole when this change is merged with
commit 8b7008620b84 ("net: Don't copy pfmemalloc flag in
__copy_skb_header()").

Fixes: 784abe24c903 ("net: Add decrypted field to skb")
Signed-off-by: Stefano Brivio 
---
 include/linux/skbuff.h | 9 -
 net/core/skbuff.c  | 9 +++--
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3ceb8dcc54da..14bc9ebe30f2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -630,7 +630,6 @@ typedef unsigned char *sk_buff_data_t;
  * @hash: the packet hash
  * @queue_mapping: Queue mapping for multiqueue devices
  * @xmit_more: More SKBs are pending for this queue
- * @decrypted: Decrypted SKB
  * @ndisc_nodetype: router type (from link layer)
  * @ooo_okay: allow the mapping of a socket to a queue to be changed
  * @l4_hash: indicate hash is a canonical 4-tuple hash over transport
@@ -641,6 +640,7 @@ typedef unsigned char *sk_buff_data_t;
  * @no_fcs:  Request NIC to treat last 4 bytes as Ethernet FCS
  * @csum_not_inet: use CRC32c to resolve CHECKSUM_PARTIAL
  * @dst_pending_confirm: need to confirm neighbour
+ * @decrypted: Decrypted SKB
   *@napi_id: id of the NAPI struct this skb came from
  * @secmark: security marking
  * @mark: Generic packet mark
@@ -737,11 +737,7 @@ struct sk_buff {
peeked:1,
head_frag:1,
xmit_more:1,
-#ifdef CONFIG_TLS_DEVICE
-   decrypted:1;
-#else
__unused:1;
-#endif
 
/* fields enclosed in headers_start/headers_end are copied
 * using a single memcpy() in __copy_skb_header()
@@ -797,6 +793,9 @@ struct sk_buff {
__u8tc_redirected:1;
__u8tc_from_ingress:1;
 #endif
+#ifdef CONFIG_TLS_DEVICE
+   __u8decrypted:1;
+#endif
 
 #ifdef CONFIG_NET_SCHED
__u16   tc_index;   /* traffic control index */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index cfd6c6f35f9c..b8a563b2d2df 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -805,9 +805,6 @@ static void __copy_skb_header(struct sk_buff *new, const 
struct sk_buff *old)
 * It is not yet because we do not want to have a 16 bit hole
 */
new->queue_mapping = old->queue_mapping;
-#ifdef CONFIG_TLS_DEVICE
-   new->decrypted = old->decrypted;
-#endif
 
memcpy(>headers_start, >headers_start,
   offsetof(struct sk_buff, headers_end) -
@@ -836,6 +833,9 @@ static void __copy_skb_header(struct sk_buff *new, const 
struct sk_buff *old)
 #ifdef CONFIG_XPS
CHECK_SKB_FIELD(sender_cpu);
 #endif
+#ifdef CONFIG_TLS_DEVICE
+   CHECK_SKB_FIELD(decrypted);
+#endif
 #ifdef CONFIG_NET_SCHED
CHECK_SKB_FIELD(tc_index);
 #endif
@@ -868,9 +868,6 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, 
struct sk_buff *skb)
C(head_frag);
C(data);
C(truesize);
-#ifdef CONFIG_TLS_DEVICE
-   C(decrypted);
-#endif
refcount_set(>users, 1);
 
atomic_inc(&(skb_shinfo(skb)->dataref));
-- 
2.15.1

Re: [PATCH net] net/ipv6: Do not allow device only routes via the multipath API

2018-07-13 Thread Stefano Brivio

On Thu, 12 Jul 2018 14:48:23 -0700
dsah...@kernel.org wrote:

> @@ -4388,6 +4388,13 @@ static int ip6_route_multipath_add(struct fib6_config 
> *cfg,
>   rt = NULL;
>   goto cleanup;
>   }
> + if (!rt6_qualify_for_ecmp(rt)) {
> + err = EINVAL;

Shouldn't this be -EINVAL, just for consistency? E.g. we might return a
-ENOMEM from ip6_route_info_append(), etc.

-- 
Stefano

[PATCH net] skbuff: Unconditionally copy pfmemalloc in __skb_clone()

2018-07-13 Thread Stefano Brivio

Commit 8b7008620b84 ("net: Don't copy pfmemalloc flag in
__copy_skb_header()") introduced a different handling for the
pfmemalloc flag in copy and clone paths.

In __skb_clone(), now, the flag is set only if it was set in the
original skb, but not cleared if it wasn't. This is wrong and
might lead to socket buffers being flagged with pfmemalloc even
if the skb data wasn't allocated from pfmemalloc reserves. Copy
the flag instead of ORing it.

Reported-by: Sabrina Dubroca 
Fixes: 8b7008620b84 ("net: Don't copy pfmemalloc flag in __copy_skb_header()")
Signed-off-by: Stefano Brivio 
---
 net/core/skbuff.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4df3164bb5fc..8e51f8555e11 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -858,8 +858,7 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, 
struct sk_buff *skb)
n->cloned = 1;
n->nohdr = 0;
n->peeked = 0;
-   if (skb->pfmemalloc)
-   n->pfmemalloc = 1;
+   C(pfmemalloc);
n->destructor = NULL;
C(tail);
C(end);
-- 
2.15.1

Re: [PATCH] net: convert gro_count to bitmask

2018-07-12 Thread Stefano Brivio

On Thu, 12 Jul 2018 02:31:10 +
"Li,Rongqing"  wrote:

> > -邮件原件-
> > 发件人: Stefano Brivio [mailto:sbri...@redhat.com]
> > 发送时间: 2018年7月11日 18:52
> > 收件人: Li,Rongqing 
> > 抄送: netdev@vger.kernel.org; Eric Dumazet 
> > 主题: Re: [PATCH] net: convert gro_count to bitmask
> > 
> > On Wed, 11 Jul 2018 17:15:53 +0800
> > Li RongQing  wrote:
> >   
> > > @@ -5380,6 +5382,12 @@ static enum gro_result dev_gro_receive(struct  
> > napi_struct *napi, struct sk_buff  
> > >   if (grow > 0)
> > >   gro_pull_from_frag0(skb, grow);
> > >  ok:
> > > + if (napi->gro_hash[hash].count)
> > > + if (!test_bit(hash, >gro_bitmask))
> > > + set_bit(hash, >gro_bitmask);
> > > + else if (test_bit(hash, >gro_bitmask))
> > > + clear_bit(hash, >gro_bitmask);  
> > 
> > This might not do what you want.
> > 
> > --  
> 
> could you show detail ?

$ cat if1.c; gcc -o if1 if1.c
#include 

int main()
{
if (1)
if (0)
;
else if (2)
printf("whoops\n");

return 0;
}
$ ./if1
whoops

$ cat if2.c; gcc -o if2 if2.c
#include 

int main()
{
if (1) {
if (0)
;
} else if (2) {
printf("whoops\n");
}

return 0;
}
$ ./if2

-- 
Stefano

[PATCH net] net: Don't copy pfmemalloc flag in __copy_skb_header()

2018-07-11 Thread Stefano Brivio

The pfmemalloc flag indicates that the skb was allocated from
the PFMEMALLOC reserves, and the flag is currently copied on skb
copy and clone.

However, an skb copied from an skb flagged with pfmemalloc
wasn't necessarily allocated from PFMEMALLOC reserves, and on
the other hand an skb allocated that way might be copied from an
skb that wasn't.

So we should not copy the flag on skb copy, and rather decide
whether to allow an skb to be associated with sockets unrelated
to page reclaim depending only on how it was allocated.

Move the pfmemalloc flag before headers_start[0] using an
existing 1-bit hole, so that __copy_skb_header() doesn't copy
it.

When cloning, we'll now take care of this flag explicitly,
contravening to the warning comment of __skb_clone().

While at it, restore the newline usage introduced by commit
b19372273164 ("net: reorganize sk_buff for faster
__copy_skb_header()") to visually separate bytes used in
bitfields after headers_start[0], that was gone after commit
a9e419dc7be6 ("netfilter: merge ctinfo into nfct pointer storage
area"), and describe the pfmemalloc flag in the kernel-doc
structure comment.

This doesn't change the size of sk_buff or cacheline boundaries,
but consolidates the 15 bits hole before tc_index into a 2 bytes
hole before csum, that could now be filled more easily.

Reported-by: Patrick Talbert 
Fixes: c93bdd0e03e8 ("netvm: allow skb allocation to use PFMEMALLOC reserves")
Signed-off-by: Stefano Brivio 
---
 include/linux/skbuff.h | 10 +-
 net/core/skbuff.c  |  2 ++
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 164cdedf6012..610a201126ee 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -630,6 +630,7 @@ typedef unsigned char *sk_buff_data_t;
  * @hash: the packet hash
  * @queue_mapping: Queue mapping for multiqueue devices
  * @xmit_more: More SKBs are pending for this queue
+ * @pfmemalloc: skbuff was allocated from PFMEMALLOC reserves
  * @ndisc_nodetype: router type (from link layer)
  * @ooo_okay: allow the mapping of a socket to a queue to be changed
  * @l4_hash: indicate hash is a canonical 4-tuple hash over transport
@@ -735,7 +736,7 @@ struct sk_buff {
peeked:1,
head_frag:1,
xmit_more:1,
-   __unused:1; /* one bit hole */
+   pfmemalloc:1;
 
/* fields enclosed in headers_start/headers_end are copied
 * using a single memcpy() in __copy_skb_header()
@@ -754,31 +755,30 @@ struct sk_buff {
 
__u8__pkt_type_offset[0];
__u8pkt_type:3;
-   __u8pfmemalloc:1;
__u8ignore_df:1;
-
__u8nf_trace:1;
__u8ip_summed:2;
__u8ooo_okay:1;
+
__u8l4_hash:1;
__u8sw_hash:1;
__u8wifi_acked_valid:1;
__u8wifi_acked:1;
-
__u8no_fcs:1;
/* Indicates the inner headers are valid in the skbuff. */
__u8encapsulation:1;
__u8encap_hdr_csum:1;
__u8csum_valid:1;
+
__u8csum_complete_sw:1;
__u8csum_level:2;
__u8csum_not_inet:1;
-
__u8dst_pending_confirm:1;
 #ifdef CONFIG_IPV6_NDISC_NODETYPE
__u8ndisc_nodetype:2;
 #endif
__u8ipvs_property:1;
+
__u8inner_protocol_type:1;
__u8remcsum_offload:1;
 #ifdef CONFIG_NET_SWITCHDEV
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index eba8dae22c25..4df3164bb5fc 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -858,6 +858,8 @@ static struct sk_buff *__skb_clone(struct sk_buff *n, 
struct sk_buff *skb)
n->cloned = 1;
n->nohdr = 0;
n->peeked = 0;
+   if (skb->pfmemalloc)
+   n->pfmemalloc = 1;
n->destructor = NULL;
C(tail);
C(end);
-- 
2.15.1

Re: [PATCH] net: convert gro_count to bitmask

2018-07-11 Thread Stefano Brivio

On Wed, 11 Jul 2018 17:15:53 +0800
Li RongQing  wrote:

> @@ -5380,6 +5382,12 @@ static enum gro_result dev_gro_receive(struct 
> napi_struct *napi, struct sk_buff
>   if (grow > 0)
>   gro_pull_from_frag0(skb, grow);
>  ok:
> + if (napi->gro_hash[hash].count)
> + if (!test_bit(hash, >gro_bitmask))
> + set_bit(hash, >gro_bitmask);
> + else if (test_bit(hash, >gro_bitmask))
> + clear_bit(hash, >gro_bitmask);

This might not do what you want.

-- 
Stefano

Re: [PATCH] bpfilter: fix user mode helper cross compilation

2018-06-20 Thread Stefano Brivio

On Wed, 20 Jun 2018 16:04:34 +0200
Matteo Croce  wrote:

> Use $(OBJDUMP) instead of literal 'objdump' to avoid
> using host toolchain when cross compiling.
> 
> Fixes: 421780fd4983 ("bpfilter: fix build error")
> Signed-off-by: Matteo Croce 

Reported-by: Stefano Brivio 

-- 
Stefano

Re: [PATCH] bpfilter: fix build error

2018-06-20 Thread Stefano Brivio

On Tue, 19 Jun 2018 17:16:20 +0200
Matteo Croce  wrote:

> bpfilter Makefile assumes that the system locale is en_US, and the
> parsing of objdump output fails.
> Set LC_ALL=C and, while at it, rewrite the objdump parsing so it spawns
> only 2 processes instead of 7.
> 
> Fixes: d2ba09c17a064 ("net: add skeleton of bpfilter kernel module")
> Signed-off-by: Matteo Croce 
> ---
>  net/bpfilter/Makefile | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/bpfilter/Makefile b/net/bpfilter/Makefile
> index e0bbe7583e58..dd86b022eff0 100644
> --- a/net/bpfilter/Makefile
> +++ b/net/bpfilter/Makefile
> @@ -21,8 +21,10 @@ endif
>  # which bpfilter_kern.c passes further into umh blob loader at run-time
>  quiet_cmd_copy_umh = GEN $@
>cmd_copy_umh = echo ':' > $(obj)/.bpfilter_umh.o.cmd; \
> -  $(OBJCOPY) -I binary -O `$(OBJDUMP) -f $<|grep format|cut -d' ' -f8` \
> -  -B `$(OBJDUMP) -f $<|grep architecture|cut -d, -f1|cut -d' ' -f2` \
> +  $(OBJCOPY) -I binary \
> +  `LC_ALL=C objdump -f net/bpfilter/bpfilter_umh \

Why do you use objdump instead of $(OBJDUMP) now? I guess this might
cause issues if you're cross-compiling.

-- 
Stefano

[PATCH net] selftests/net: Add missing config options for PMTU tests

2018-05-24 Thread Stefano Brivio

PMTU tests in pmtu.sh need support for VTI, VTI6 and dummy
interfaces: add them to config file.

Reported-by: Naresh Kamboju <naresh.kamb...@linaro.org>
Fixes: d1f1b9cbf34c ("selftests: net: Introduce first PMTU test")
Signed-off-by: Stefano Brivio <sbri...@redhat.com>
---
 tools/testing/selftests/net/config | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tools/testing/selftests/net/config 
b/tools/testing/selftests/net/config
index 6a75a3ea44ad..7ba089b33e8b 100644
--- a/tools/testing/selftests/net/config
+++ b/tools/testing/selftests/net/config
@@ -7,3 +7,8 @@ CONFIG_NET_L3_MASTER_DEV=y
 CONFIG_IPV6=y
 CONFIG_IPV6_MULTIPLE_TABLES=y
 CONFIG_VETH=y
+CONFIG_INET_XFRM_MODE_TUNNEL=y
+CONFIG_NET_IPVTI=y
+CONFIG_INET6_XFRM_MODE_TUNNEL=y
+CONFIG_IPV6_VTI=y
+CONFIG_DUMMY=y
-- 
2.15.1

[PATCH net] openvswitch: Don't swap table in nlattr_set() after OVS_ATTR_NESTED is found

2018-05-03 Thread Stefano Brivio

If an OVS_ATTR_NESTED attribute type is found while walking
through netlink attributes, we call nlattr_set() recursively
passing the length table for the following nested attributes, if
different from the current one.

However, once we're done with those sub-nested attributes, we
should continue walking through attributes using the current
table, instead of using the one related to the sub-nested
attributes.

For example, given this sequence:

1  OVS_KEY_ATTR_PRIORITY
2  OVS_KEY_ATTR_TUNNEL
3   OVS_TUNNEL_KEY_ATTR_ID
4   OVS_TUNNEL_KEY_ATTR_IPV4_SRC
5   OVS_TUNNEL_KEY_ATTR_IPV4_DST
6   OVS_TUNNEL_KEY_ATTR_TTL
7   OVS_TUNNEL_KEY_ATTR_TP_SRC
8   OVS_TUNNEL_KEY_ATTR_TP_DST
9  OVS_KEY_ATTR_IN_PORT
10 OVS_KEY_ATTR_SKB_MARK
11 OVS_KEY_ATTR_MPLS

we switch to the 'ovs_tunnel_key_lens' table on attribute #3,
and we don't switch back to 'ovs_key_lens' while setting
attributes #9 to #11 in the sequence. As OVS_KEY_ATTR_MPLS
evaluates to 21, and the array size of 'ovs_tunnel_key_lens' is
15, we also get this kind of KASan splat while accessing the
wrong table:

[ 7654.586496] 
==
[ 7654.594573] BUG: KASAN: global-out-of-bounds in nlattr_set+0x164/0xde9 
[openvswitch]
[ 7654.603214] Read of size 4 at addr c169ecf0 by task handler29/87430
[ 7654.610983]
[ 7654.612644] CPU: 21 PID: 87430 Comm: handler29 Kdump: loaded Not tainted 
3.10.0-866.el7.test.x86_64 #1
[ 7654.623030] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.1.7 
06/16/2016
[ 7654.631379] Call Trace:
[ 7654.634108]  [] dump_stack+0x19/0x1b
[ 7654.639843]  [] print_address_description+0x33/0x290
[ 7654.647129]  [] ? nlattr_set+0x164/0xde9 [openvswitch]
[ 7654.654607]  [] kasan_report.part.3+0x242/0x330
[ 7654.661406]  [] __asan_report_load4_noabort+0x34/0x40
[ 7654.668789]  [] nlattr_set+0x164/0xde9 [openvswitch]
[ 7654.676076]  [] ovs_nla_get_match+0x10c8/0x1900 
[openvswitch]
[ 7654.684234]  [] ? genl_rcv+0x28/0x40
[ 7654.689968]  [] ? netlink_unicast+0x3f3/0x590
[ 7654.696574]  [] ? ovs_nla_put_tunnel_info+0xb0/0xb0 
[openvswitch]
[ 7654.705122]  [] ? unwind_get_return_address+0xb0/0xb0
[ 7654.712503]  [] ? system_call_fastpath+0x1c/0x21
[ 7654.719401]  [] ? update_stack_state+0x229/0x370
[ 7654.726298]  [] ? update_stack_state+0x229/0x370
[ 7654.733195]  [] ? kasan_unpoison_shadow+0x35/0x50
[ 7654.740187]  [] ? kasan_kmalloc+0xaa/0xe0
[ 7654.746406]  [] ? kasan_slab_alloc+0x12/0x20
[ 7654.752914]  [] ? memset+0x31/0x40
[ 7654.758456]  [] ovs_flow_cmd_new+0x2b2/0xf00 [openvswitch]

[snip]

[ 7655.132484] The buggy address belongs to the variable:
[ 7655.138226]  ovs_tunnel_key_lens+0xf0/0xd400 [openvswitch]
[ 7655.145507]
[ 7655.147166] Memory state around the buggy address:
[ 7655.152514]  c169eb80: 00 00 00 00 00 00 00 00 00 00 fa fa fa fa fa 
fa
[ 7655.160585]  c169ec00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00
[ 7655.168644] >c169ec80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fa 
fa
[ 7655.176701]  ^
[ 7655.184372]  c169ed00: fa fa fa fa 00 00 00 00 fa fa fa fa 00 00 00 
05
[ 7655.192431]  c169ed80: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 
00
[ 7655.200490] 
==

Reported-by: Hangbin Liu <liuhang...@gmail.com>
Fixes: 982b52700482 ("openvswitch: Fix mask generation for nested attributes.")
Signed-off-by: Stefano Brivio <sbri...@redhat.com>
Reviewed-by: Sabrina Dubroca <s...@queasysnail.net>
---
 net/openvswitch/flow_netlink.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 7322aa1e382e..492ab0c36f7c 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -1712,13 +1712,10 @@ static void nlattr_set(struct nlattr *attr, u8 val,
 
/* The nlattr stream should already have been validated */
nla_for_each_nested(nla, attr, rem) {
-   if (tbl[nla_type(nla)].len == OVS_ATTR_NESTED) {
-   if (tbl[nla_type(nla)].next)
-   tbl = tbl[nla_type(nla)].next;
-   nlattr_set(nla, val, tbl);
-   } else {
+   if (tbl[nla_type(nla)].len == OVS_ATTR_NESTED)
+   nlattr_set(nla, val, tbl[nla_type(nla)].next ? : tbl);
+   else
memset(nla_data(nla), val, nla_len(nla));
-   }
 
if (nla_type(nla) == OVS_KEY_ATTR_CT_STATE)
*(u32 *)nla_data(nla) &= CT_SUPPORTED_MASK;
-- 
2.15.1

[PATCH net-next] selftests: pmtu: Minimum MTU for vti6 is 68

2018-04-26 Thread Stefano Brivio

A vti6 interface can carry IPv4 packets too.

Signed-off-by: Stefano Brivio <sbri...@redhat.com>
---
 tools/testing/selftests/net/pmtu.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index 1e428781a625..7651fd4d86fe 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -368,7 +368,7 @@ test_pmtu_vti6_link_add_mtu() {
 
fail=0
 
-   min=1280
+   min=68  # vti6 can carry IPv4 packets too
max=$((65535 - 40))
# Check invalid values first
for v in $((min - 1)) $((max + 1)); do
@@ -384,7 +384,7 @@ test_pmtu_vti6_link_add_mtu() {
done
 
# Now check valid values
-   for v in 1280 1300 $((65535 - 40)); do
+   for v in 68 1280 1300 $((65535 - 40)); do
${ns_a} ip link add vti6_a mtu ${v} type vti6 local 
${veth6_a_addr} remote ${veth6_b_addr} key 10
mtu="$(link_get_mtu "${ns_a}" vti6_a)"
${ns_a} ip link del vti6_a
-- 
2.15.1

[PATCH ipsec] vti6: Change minimum MTU to IPV4_MIN_MTU, vti6 can carry IPv4 too

2018-04-26 Thread Stefano Brivio

A vti6 interface can carry IPv4 as well, so it makes no sense to
enforce a minimum MTU of IPV6_MIN_MTU.

If the user sets an MTU below IPV6_MIN_MTU, IPv6 will be
disabled on the interface, courtesy of addrconf_notify().

Reported-by: Xin Long <lucien@gmail.com>
Fixes: b96f9afee4eb ("ipv4/6: use core net MTU range checking")
Fixes: c6741fbed6dc ("vti6: Properly adjust vti6 MTU from MTU of lower device")
Fixes: 53c81e95df17 ("ip6_vti: adjust vti mtu according to mtu of lower device")
Signed-off-by: Stefano Brivio <sbri...@redhat.com>
---
 net/ipv6/ip6_vti.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index c214ffec02f0..ca957dd93a29 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -669,7 +669,7 @@ static void vti6_link_config(struct ip6_tnl *t, bool 
keep_mtu)
else
mtu = ETH_DATA_LEN - LL_MAX_HEADER - sizeof(struct ipv6hdr);
 
-   dev->mtu = max_t(int, mtu, IPV6_MIN_MTU);
+   dev->mtu = max_t(int, mtu, IPV4_MIN_MTU);
 }
 
 /**
@@ -881,7 +881,7 @@ static void vti6_dev_setup(struct net_device *dev)
dev->priv_destructor = vti6_dev_free;
 
dev->type = ARPHRD_TUNNEL6;
-   dev->min_mtu = IPV6_MIN_MTU;
+   dev->min_mtu = IPV4_MIN_MTU;
dev->max_mtu = IP_MAX_MTU - sizeof(struct ipv6hdr);
dev->flags |= IFF_NOARP;
dev->addr_len = sizeof(struct in6_addr);
-- 
2.15.1

Re: [PATCH v4 ipsec-next] xfrm: remove VLA usage in __xfrm6_sort()

2018-04-25 Thread Stefano Brivio

On Wed, 25 Apr 2018 07:46:39 -0700
Kees Cook <keesc...@chromium.org> wrote:

> In the quest to remove all stack VLA usage removed from the kernel[1],
> just use XFRM_MAX_DEPTH as already done for the "class" array. In one
> case, it'll do this loop up to 5, the other caller up to 6.
> 
> [1] https://lkml.org/lkml/2018/3/7/621
> 
> Co-developed-by: Andreas Christoforou <andreaschrist...@gmail.com>
> Signed-off-by: Kees Cook <keesc...@chromium.org>
> ---
> v4:
> - actually remove memset(). :)
> v3:
> - adjust Subject and commit log (Steffen)
> - use "= { }" instead of memset() (Stefano)
> v2:
> - use XFRM_MAX_DEPTH for "count" array (Steffen and Mathias).
> ---

Acked-by: Stefano Brivio <sbri...@redhat.com>

Re: [PATCH v3 ipsec-next] xfrm: remove VLA usage in __xfrm6_sort()

2018-04-25 Thread Stefano Brivio

Hi Kees,

On Tue, 24 Apr 2018 16:46:51 -0700
Kees Cook  wrote:

> In the quest to remove all stack VLA usage removed from the kernel[1],
> just use XFRM_MAX_DEPTH as already done for the "class" array. In one
> case, it'll do this loop up to 5, the other caller up to 6.
> 
> [1] https://lkml.org/lkml/2018/3/7/621
> 
> Co-developed-by: Andreas Christoforou 
> Signed-off-by: Kees Cook 
> ---
> v3:
> - adjust Subject and commit log (Steffen)
> - use "= { }" instead of memset() (Stefano)
> - reorder variables (Stefano)
> v2:
> - use XFRM_MAX_DEPTH for "count" array (Steffen and Mathias).
> ---
>  net/ipv6/xfrm6_state.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv6/xfrm6_state.c b/net/ipv6/xfrm6_state.c
> index 16f434791763..eeb44b64ae7f 100644
> --- a/net/ipv6/xfrm6_state.c
> +++ b/net/ipv6/xfrm6_state.c
> @@ -60,9 +60,9 @@ xfrm6_init_temprop(struct xfrm_state *x, const struct 
> xfrm_tmpl *tmpl,
>  static int
>  __xfrm6_sort(void **dst, void **src, int n, int (*cmp)(void *p), int 
> maxclass)
>  {
> - int i;
> + int count[XFRM_MAX_DEPTH] = { };
>   int class[XFRM_MAX_DEPTH];
> - int count[maxclass];
> + int i;
>  
>   memset(count, 0, sizeof(count));

I guess you forgot to remove the memset() here. Just to be clear, I
think this is how it should look like:

--- a/net/ipv6/xfrm6_state.c
+++ b/net/ipv6/xfrm6_state.c
@@ -60,11 +60,9 @@ xfrm6_init_temprop(struct xfrm_state *x, const struct 
xfrm_tmpl *tmpl,
 static int
 __xfrm6_sort(void **dst, void **src, int n, int (*cmp)(void *p), int maxclass)
 {
-   int i;
+   int count[XFRM_MAX_DEPTH] = { };
int class[XFRM_MAX_DEPTH];
-   int count[maxclass];
-
-   memset(count, 0, sizeof(count));
+   int i;
 
for (i = 0; i < n; i++) {
int c;

-- 
Stefano

1 2 3 >

1 - 100 of 270 matches

Mail list logo