Re: [PATCH] net: Fix xps_needed inc/dec mismatch

2018-12-07 Thread Sabrina Dubroca
Hi Ross,

2018-12-07, 10:16:21 +, Ross Lagerwall wrote:
> xps_needed is incremented only when a new dev map is allocated (in
> __netif_set_xps_queue). Therefore it should be decremented only when we
> actually have a dev map to destroy. Without this, it may be decremented
> too many times which causes netif_reset_xps_queues to return early and
> not actually clean up the old dev maps. This results in a crash in
> __netif_set_xps_queue when it is called later.
> 
> The crash occurred when having multiple ixgbe devices in a host. lldpad
> would reconfigure them to be FCoE-capable causing reset_xps_queues /
> set_xps_queue to be called several times. The xps_needed count would get
> out of sync and eventually the above-mentioned crash would occur.
> 
> Signed-off-by: Ross Lagerwall 

I posted another patchset recently (commits f28c020fb488 and
867d0ad476db in the "net" tree) for issues in XPS, including broken
xps_needed accounting, so your patch won't apply to David's "net"
tree. Could you try it with your use case, and if you still see
issues, fix them on top? You can grab the latest net tree here:

git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git


Thanks,

-- 
Sabrina


Re: [PATCH v6 1/4] udp_tunnel: add config option to bind to a device

2018-11-30 Thread Sabrina Dubroca
2018-11-27, 14:05:42 +0100, Alexis Bauvin wrote:
> diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
> index 6539ff15e9a3..dc68e15a4f72 100644
> --- a/net/ipv4/udp_tunnel.c
> +++ b/net/ipv4/udp_tunnel.c
> @@ -20,6 +20,16 @@ int udp_sock_create4(struct net *net, struct udp_port_cfg 
> *cfg,
>   if (err < 0)
>   goto error;
>  
> + if (cfg->bind_ifindex) {
> + struct net_device *dev;
> +
> + dev = __dev_get_by_index(net, cfg->bind_ifindex);

Quoting from net/core/dev.c:

 *  [...]The device has not
 *  had its reference counter increased so the caller must be careful
 *  about locking. The caller must hold either the RTNL semaphore
 *  or @dev_base_lock.
 */

which is the case for VXLAN (and GENEVE) during ndo_open, but I don't
think other UDP tunnels (FOU, L2TP) are holding RTNL when they call
udp_sock_create(). dev_get_by_index() + dev_put() should be safe.

Also, I don't think it's a problem with vxlan, but this could handle
the case where __dev_get_by_index returns NULL.

> + err = kernel_setsockopt(sock, SOL_SOCKET, SO_BINDTODEVICE,
> + dev->name, strlen(dev->name) + 1);
> + if (err < 0)
> + goto error;
> + }
> +
>   udp_addr.sin_family = AF_INET;
>   udp_addr.sin_addr = cfg->local_ip;
>   udp_addr.sin_port = cfg->local_udp_port;

-- 
Sabrina


Re: [PATCH net-next v2 3/3] vxlan: move flag sets to use a helper func

2018-11-29 Thread Sabrina Dubroca
2018-11-29, 07:27:17 -0800, Roopa Prabhu wrote:
> On Thu, Nov 29, 2018 at 6:19 AM Sabrina Dubroca  wrote:
> > 2018-11-28, 14:27:59 -0800, Roopa Prabhu wrote:
> > nit: This patch would have been easier to review if it came first in
> > the series. Converting:
> 
> I considered that. But the first patch really removes some checks
> making it easier for the follow-on patches...and that mostly dictated
> the order.

ok.

> > if (nla_get_u8(data[IFLA_VXLAN_FOO]))
> > conf->flags |= VXLAN_F_FOO;
> >
> > to this new helper, which in effect does
> >
> > if (nla_get_u8(data[IFLA_VXLAN_FOO]))
> > conf->flags |= VXLAN_F_FOO;
> > else
> > conf->flags &= ~VXLAN_F_FOO;
> >
> > seems to change behavior, but since you're (unless I missed one) only
> > applying it to flags that couldn't be changed before patch 1, it looks
> > fine.
> 
> It does not change behavior...the earlier code only supported the
> flags during create time and did not support changing of the flag with
> changelink.
> this patch adds changelink support. But if you do see any change in
> behavior, pls report. thanks.

Yeah, looks correct. I just had to go back to patch 1 and check.
A note "only flags [list] are changed, and they could not be modified
after creation of the device prior to patch 1" in the cover and/or in
this patch's commit message would have spared reviewers a bit of a
scare :)

-- 
Sabrina


Re: [PATCH net-next v2 1/3] vxlan: support changelink for a few more attributes

2018-11-29 Thread Sabrina Dubroca
2018-11-29, 07:33:11 -0800, Roopa Prabhu wrote:
> On Thu, Nov 29, 2018 at 5:56 AM Sabrina Dubroca  wrote:
> >
> > 2018-11-28, 14:27:57 -0800, Roopa Prabhu wrote:
> > > From: Roopa Prabhu 
> > >
> > > We started very conservative when supporting changelink
> > > especially because not all attribute changes could be
> > > tested. This patch opens up a few more attributes for
> > > changelink. The reason for choosing this set of attributes
> > > is based on code references for these attributes. I have
> > > tested TTL changes and did some changelink api testing
> > > to sanity test the others.
> > >
> > > Signed-off-by: Roopa Prabhu 
> > > ---
> > >  drivers/net/vxlan.c | 36 
> > >  1 file changed, 4 insertions(+), 32 deletions(-)
> > >
> > > diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> > > index 9110662..73caa65 100644
> > > --- a/drivers/net/vxlan.c
> > > +++ b/drivers/net/vxlan.c
> > > @@ -3438,11 +3438,8 @@ static int vxlan_nl2conf(struct nlattr *tb[], 
> > > struct nlattr *data[],
> > >   if (data[IFLA_VXLAN_TTL])
> > >   conf->ttl = nla_get_u8(data[IFLA_VXLAN_TTL]);
> > >
> > > - if (data[IFLA_VXLAN_TTL_INHERIT]) {
> > > - if (changelink)
> > > - return -EOPNOTSUPP;
> > > + if (data[IFLA_VXLAN_TTL_INHERIT])
> > >   conf->flags |= VXLAN_F_TTL_INHERIT;
> > > - }
> >
> > This doesn't give us an option to disable TTL_INHERIT after it was
> > enabled once. Same thing with GBP, GPE, REMCSUM_NOPARTIAL.
> 
> that is provided by patch3, with the changelink patch. I can squash
> them, if thats easier.

Yes, I see it now. Sequential review got me :(
I think you can just add a note in the commit message, "will be fixed
in a further patch introducing the vxlan_nl2flag() helper" or similar.

> > > - if (data[IFLA_VXLAN_GBP]) {
> > > - if (changelink)
> > > - return -EOPNOTSUPP;
> > > + if (data[IFLA_VXLAN_GBP])
> > >   conf->flags |= VXLAN_F_GBP;
> > > - }
> > >
> > > - if (data[IFLA_VXLAN_GPE]) {
> > > - if (changelink)
> > > - return -EOPNOTSUPP;
> > > + if (data[IFLA_VXLAN_GPE])
> > >   conf->flags |= VXLAN_F_GPE;
> > > - }
> >
> > GPE implies running a different setup function (vxlan_raw_setup() vs
> > vxlan_ether_setup()), that vxlan_config_apply() only calls for
> > !changelink. I think this is incomplete.
> >
> > I think we'd also end up with mixed tunnel types (GPE/!GPE) on the
> > same socket, I'm not sure how that would work. Normally, they each try
> > to create a separate socket, and pass the GPE flag on to the
> > associated vxlan_sock. I suspect that's also a problem with rx
> > offload.
> 
> that is good to know. I will drop the change to GPE and also the rx
> offload flag and let somebody else using it
> open it up for changelink. thanks for the review

Looking at vxlan_rcv(), there's also a possible issue with the GBP
flag:

if (vs->flags & VXLAN_F_GBP)
vxlan_parse_gbp_hdr(, skb, vs->flags, md);

I don't know why that flag must match on all tunnels for a socket, but
given that it was the reason for commit ac5132d1a03f ("vxlan: Only
bind to sockets with compatible flags enabled"), I'd leave it alone as
well.

> > > - if (data[IFLA_VXLAN_REMCSUM_NOPARTIAL]) {
> > > - if (changelink)
> > > - return -EOPNOTSUPP;
> > > + if (data[IFLA_VXLAN_REMCSUM_NOPARTIAL])
> > >   conf->flags |= VXLAN_F_REMCSUM_NOPARTIAL;
> > > - }
> > >
> > >   if (tb[IFLA_MTU]) {
> > >   if (changelink)
> > > --

-- 
Sabrina


Re: [PATCH net-next v2 3/3] vxlan: move flag sets to use a helper func

2018-11-29 Thread Sabrina Dubroca
2018-11-28, 14:27:59 -0800, Roopa Prabhu wrote:
> +/* Set/clear flags based on attribute */
> +static void vxlan_nl2flag(struct vxlan_config *conf, struct nlattr *tb[],
> +   int attrtype, unsigned long mask)
> +{
> + unsigned long flags;
> +
> + if (!tb[attrtype])
> + return;
> +
> + if (nla_get_u8(tb[attrtype]))
> + flags = conf->flags | mask;
> + else
> + flags = conf->flags & ~mask;
> +
> + conf->flags = flags;
> +}
> +
>  static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[],
>struct net_device *dev, struct vxlan_config *conf,
>bool changelink, struct netlink_ext_ack *extack)
> @@ -3458,45 +3475,27 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
> nlattr *data[],
>   if (data[IFLA_VXLAN_TTL])
>   conf->ttl = nla_get_u8(data[IFLA_VXLAN_TTL]);
>  
> - if (data[IFLA_VXLAN_TTL_INHERIT])
> - conf->flags |= VXLAN_F_TTL_INHERIT;
> + vxlan_nl2flag(conf, data, IFLA_VXLAN_TTL_INHERIT,
> +   VXLAN_F_TTL_INHERIT);

IFLA_VXLAN_TTL_INHERIT is a NLA_FLAG. It doesn't have any u8 data to
get. Same for:

[IFLA_VXLAN_GBP]= { .type = NLA_FLAG, },
[IFLA_VXLAN_GPE]= { .type = NLA_FLAG, },
[IFLA_VXLAN_REMCSUM_NOPARTIAL]  = { .type = NLA_FLAG },
[IFLA_VXLAN_TTL_INHERIT]= { .type = NLA_FLAG },



nit: This patch would have been easier to review if it came first in
the series. Converting:

if (nla_get_u8(data[IFLA_VXLAN_FOO]))
conf->flags |= VXLAN_F_FOO;

to this new helper, which in effect does

if (nla_get_u8(data[IFLA_VXLAN_FOO]))
conf->flags |= VXLAN_F_FOO;
else
conf->flags &= ~VXLAN_F_FOO;

seems to change behavior, but since you're (unless I missed one) only
applying it to flags that couldn't be changed before patch 1, it looks
fine.

-- 
Sabrina


Re: [PATCH net-next v2 1/3] vxlan: support changelink for a few more attributes

2018-11-29 Thread Sabrina Dubroca
2018-11-28, 14:27:57 -0800, Roopa Prabhu wrote:
> From: Roopa Prabhu 
> 
> We started very conservative when supporting changelink
> especially because not all attribute changes could be
> tested. This patch opens up a few more attributes for
> changelink. The reason for choosing this set of attributes
> is based on code references for these attributes. I have
> tested TTL changes and did some changelink api testing
> to sanity test the others.
> 
> Signed-off-by: Roopa Prabhu 
> ---
>  drivers/net/vxlan.c | 36 
>  1 file changed, 4 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 9110662..73caa65 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -3438,11 +3438,8 @@ static int vxlan_nl2conf(struct nlattr *tb[], struct 
> nlattr *data[],
>   if (data[IFLA_VXLAN_TTL])
>   conf->ttl = nla_get_u8(data[IFLA_VXLAN_TTL]);
>  
> - if (data[IFLA_VXLAN_TTL_INHERIT]) {
> - if (changelink)
> - return -EOPNOTSUPP;
> + if (data[IFLA_VXLAN_TTL_INHERIT])
>   conf->flags |= VXLAN_F_TTL_INHERIT;
> - }

This doesn't give us an option to disable TTL_INHERIT after it was
enabled once. Same thing with GBP, GPE, REMCSUM_NOPARTIAL.

> - if (data[IFLA_VXLAN_GBP]) {
> - if (changelink)
> - return -EOPNOTSUPP;
> + if (data[IFLA_VXLAN_GBP])
>   conf->flags |= VXLAN_F_GBP;
> - }
>  
> - if (data[IFLA_VXLAN_GPE]) {
> - if (changelink)
> - return -EOPNOTSUPP;
> + if (data[IFLA_VXLAN_GPE])
>   conf->flags |= VXLAN_F_GPE;
> - }

GPE implies running a different setup function (vxlan_raw_setup() vs
vxlan_ether_setup()), that vxlan_config_apply() only calls for
!changelink. I think this is incomplete.

I think we'd also end up with mixed tunnel types (GPE/!GPE) on the
same socket, I'm not sure how that would work. Normally, they each try
to create a separate socket, and pass the GPE flag on to the
associated vxlan_sock. I suspect that's also a problem with rx
offload.

> - if (data[IFLA_VXLAN_REMCSUM_NOPARTIAL]) {
> - if (changelink)
> - return -EOPNOTSUPP;
> + if (data[IFLA_VXLAN_REMCSUM_NOPARTIAL])
>   conf->flags |= VXLAN_F_REMCSUM_NOPARTIAL;
> - }
>  
>   if (tb[IFLA_MTU]) {
>   if (changelink)
> -- 
> 2.1.4
> 

-- 
Sabrina


[PATCH net 0/2] fixes for XPS configuration after Symmetric queue selection

2018-11-29 Thread Sabrina Dubroca
This fixes some bugs introduced by the "Symmetric queue selection
using XPS for Rx queues".

First, the refactoring of the cleanup function skipped resetting the
queue's NUMA node under some conditions.

Second, the accounting on static keys for XPS and RXQS-XPS is
unbalanced, so the static key for XPS won't actually disable itself,
once enabled. The RXQS-XPS static key can actually be disabled by
reconfiguring a device that didn't have RXQS-XPS configured at all.

Sabrina Dubroca (2):
  net: restore call to netdev_queue_numa_node_write when resetting XPS
  net: fix XPS static_key accounting

 net/core/dev.c | 53 +++---
 1 file changed, 29 insertions(+), 24 deletions(-)

-- 
2.19.2



[PATCH net 2/2] net: fix XPS static_key accounting

2018-11-29 Thread Sabrina Dubroca
Commit 04157469b7b8 ("net: Use static_key for XPS maps") introduced a
static key for XPS, but the increments/decrements don't match.

First, the static key's counter is incremented once for each queue, but
only decremented once for a whole batch of queues, leading to large
unbalances.

Second, the xps_rxqs_needed key is decremented whenever we reset a batch
of queues, whether they had any rxqs mapping or not, so that if we setup
cpu-XPS on em1 and RXQS-XPS on em2, resetting the queues on em1 would
decrement the xps_rxqs_needed key.

This reworks the accounting scheme so that the xps_needed key is
incremented only once for each type of XPS for all the queues on a
device, and the xps_rxqs_needed key is incremented only once for all
queues. This is sufficient to let us retrieve queues via
get_xps_queue().

This patch introduces a new reset_xps_maps(), which reinitializes and
frees the appropriate map (xps_rxqs_map or xps_cpus_map), and drops a
reference to the needed keys:
 - both xps_needed and xps_rxqs_needed, in case of rxqs maps,
 - only xps_needed, in case of CPU maps.

Now, we also need to call reset_xps_maps() at the end of
__netif_set_xps_queue() when there's no active map left, for example
when writing ',' to all queues' xps_rxqs setting.

Fixes: 04157469b7b8 ("net: Use static_key for XPS maps")
Signed-off-by: Sabrina Dubroca 
---
 net/core/dev.c | 45 -
 1 file changed, 24 insertions(+), 21 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 32a63f4c3a92..3470e7fff1f4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2175,6 +2175,20 @@ static bool remove_xps_queue_cpu(struct net_device *dev,
return active;
 }
 
+static void reset_xps_maps(struct net_device *dev,
+  struct xps_dev_maps *dev_maps,
+  bool is_rxqs_map)
+{
+   if (is_rxqs_map) {
+   static_key_slow_dec_cpuslocked(_rxqs_needed);
+   RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
+   } else {
+   RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
+   }
+   static_key_slow_dec_cpuslocked(_needed);
+   kfree_rcu(dev_maps, rcu);
+}
+
 static void clean_xps_maps(struct net_device *dev, const unsigned long *mask,
   struct xps_dev_maps *dev_maps, unsigned int nr_ids,
   u16 offset, u16 count, bool is_rxqs_map)
@@ -2186,13 +2200,8 @@ static void clean_xps_maps(struct net_device *dev, const 
unsigned long *mask,
 j < nr_ids;)
active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
   count);
-   if (!active) {
-   if (is_rxqs_map)
-   RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
-   else
-   RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
-   kfree_rcu(dev_maps, rcu);
-   }
+   if (!active)
+   reset_xps_maps(dev, dev_maps, is_rxqs_map);
 
if (!is_rxqs_map) {
for (i = offset + (count - 1); count--; i--) {
@@ -2236,10 +2245,6 @@ static void netif_reset_xps_queues(struct net_device 
*dev, u16 offset,
   false);
 
 out_no_maps:
-   if (static_key_enabled(_rxqs_needed))
-   static_key_slow_dec_cpuslocked(_rxqs_needed);
-
-   static_key_slow_dec_cpuslocked(_needed);
mutex_unlock(_map_mutex);
cpus_read_unlock();
 }
@@ -2357,9 +2362,12 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
if (!new_dev_maps)
goto out_no_new_maps;
 
-   static_key_slow_inc_cpuslocked(_needed);
-   if (is_rxqs_map)
-   static_key_slow_inc_cpuslocked(_rxqs_needed);
+   if (!dev_maps) {
+   /* Increment static keys at most once per type */
+   static_key_slow_inc_cpuslocked(_needed);
+   if (is_rxqs_map)
+   static_key_slow_inc_cpuslocked(_rxqs_needed);
+   }
 
for (j = -1; j = netif_attrmask_next(j, possible_mask, nr_ids),
 j < nr_ids;) {
@@ -2457,13 +2465,8 @@ int __netif_set_xps_queue(struct net_device *dev, const 
unsigned long *mask,
}
 
/* free map if not active */
-   if (!active) {
-   if (is_rxqs_map)
-   RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
-   else
-   RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
-   kfree_rcu(dev_maps, rcu);
-   }
+   if (!active)
+   reset_xps_maps(dev, dev_maps, is_rxqs_map);
 
 out_no_maps:
mutex_unlock(_map_mutex);
-- 
2.19.2



[PATCH net 1/2] net: restore call to netdev_queue_numa_node_write when resetting XPS

2018-11-29 Thread Sabrina Dubroca
Before commit 80d19669ecd3 ("net: Refactor XPS for CPUs and Rx queues"),
netif_reset_xps_queues() did netdev_queue_numa_node_write() for all the
queues being reset. Now, this is only done when the "active" variable in
clean_xps_maps() is false, ie when on all the CPUs, there's no active
XPS mapping left.

Fixes: 80d19669ecd3 ("net: Refactor XPS for CPUs and Rx queues")
Signed-off-by: Sabrina Dubroca 
---
 net/core/dev.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index ddc551f24ba2..32a63f4c3a92 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2187,17 +2187,19 @@ static void clean_xps_maps(struct net_device *dev, 
const unsigned long *mask,
active |= remove_xps_queue_cpu(dev, dev_maps, j, offset,
   count);
if (!active) {
-   if (is_rxqs_map) {
+   if (is_rxqs_map)
RCU_INIT_POINTER(dev->xps_rxqs_map, NULL);
-   } else {
+   else
RCU_INIT_POINTER(dev->xps_cpus_map, NULL);
+   kfree_rcu(dev_maps, rcu);
+   }
 
-   for (i = offset + (count - 1); count--; i--)
-   netdev_queue_numa_node_write(
-   netdev_get_tx_queue(dev, i),
-   NUMA_NO_NODE);
+   if (!is_rxqs_map) {
+   for (i = offset + (count - 1); count--; i--) {
+   netdev_queue_numa_node_write(
+   netdev_get_tx_queue(dev, i),
+   NUMA_NO_NODE);
}
-   kfree_rcu(dev_maps, rcu);
}
 }
 
-- 
2.19.2



Re: [PATCH v2] ipv4: make DSCP values works with ip rules

2018-11-20 Thread Sabrina Dubroca
Hi Pavel,

2018-11-20, 16:29:36 +0300, Pavel Balaev wrote:
> This patch adds ability to set DSCP values in ip rules.

You dropped the RFC reference that you had in v1.

> Values presented in /etc/iproute3/rt_dsfield and now can be used in rules.

iproute3?

> 
> Example:
> $ ip ru add from 10.88.0.2 tos AF23 lookup dscp.
> Result:
> ---

Be careful when you write commit messages. "git am" is going to cut
out everything that comes after the first ^---$ line it sees, so the
rest of those explanations, as well as your sign-off, will be dropped.

> 32762:from 10.88.0.2 tos AF43 lookup dscp
> ---
> 
> Patch was tested with such configuration:
> 
> +---+
> | 10.88.0.2 | -> ping 172.16.0.1 -Q 0x58 ->
> |   host0   |
> +---+
> 
>+-+
>| router with patched kernel  |
>| |
> -> | from 10.88.0.2 tos AF43 lookup dscp |->
>| table dscp: |
>|   172.16.0.0/24 via 10.200.0.2  |
>+-+
> 
>+--+
> -> | eth0: 10.200.0.2 |
>| eth1: 172.16.0.1 |
>|  host1   |
>+--+
> 
> Signed-off-by: Pavel Balaev 

-- 
Sabrina


[PATCH net] ip_tunnel: don't force DF when MTU is locked

2018-11-16 Thread Sabrina Dubroca
The various types of tunnels running over IPv4 can ask to set the DF
bit to do PMTU discovery. However, PMTU discovery is subject to the
threshold set by the net.ipv4.route.min_pmtu sysctl, and is also
disabled on routes with "mtu lock". In those cases, we shouldn't set
the DF bit.

This patch makes setting the DF bit conditional on the route's MTU
locking state.

This issue seems to be older than git history.

Signed-off-by: Sabrina Dubroca 
Reviewed-by: Stefano Brivio 
---
 net/ipv4/ip_tunnel_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index f45b96d715f0..c857ec6b9784 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -80,7 +80,7 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct 
sk_buff *skb,
 
iph->version=   4;
iph->ihl=   sizeof(struct iphdr) >> 2;
-   iph->frag_off   =   df;
+   iph->frag_off   =   ip_mtu_locked(>dst) ? 0 : df;
iph->protocol   =   proto;
iph->tos=   tos;
iph->daddr  =   dst;
-- 
2.19.1



[PATCH net 0/2] macsec: linkstate fixes

2018-10-28 Thread Sabrina Dubroca
This series fixes issues with handling administrative and operstate of
macsec devices.

Radu Rendec proposed another version of the first patch [0] but I'd
rather not follow the behavior of vlan devices, going with macvlan
does instead. Patrick Talbert also reported the same issue to me.

The second patch is a follow-up. The restriction on setting the device
up is a bit unreasonable, and operstate provides the information we
need in this case.

[0] https://patchwork.ozlabs.org/patch/971374/

Sabrina Dubroca (2):
  macsec: update operstate when lower device changes
  macsec: let the administrator set UP state even if lowerdev is down

 drivers/net/macsec.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

-- 
2.19.1



[PATCH net 2/2] macsec: let the administrator set UP state even if lowerdev is down

2018-10-28 Thread Sabrina Dubroca
Currently, the kernel doesn't let the administrator set a macsec device
up unless its lower device is currently up. This is inconsistent, as a
macsec device that is up won't automatically go down when its lower
device goes down.

Now that linkstate propagation works, there's really no reason for this
limitation, so let's remove it.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Reported-by: Radu Rendec 
Signed-off-by: Sabrina Dubroca 
---
 drivers/net/macsec.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index 6195b8edafc0..64a982563d59 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -2812,9 +2812,6 @@ static int macsec_dev_open(struct net_device *dev)
struct net_device *real_dev = macsec->real_dev;
int err;
 
-   if (!(real_dev->flags & IFF_UP))
-   return -ENETDOWN;
-
err = dev_uc_add(real_dev, dev->dev_addr);
if (err < 0)
return err;
-- 
2.19.1



[PATCH net 1/2] macsec: update operstate when lower device changes

2018-10-28 Thread Sabrina Dubroca
Like all other virtual devices (macvlan, vlan), the operstate of a
macsec device should match the state of its lower device. This is done
by calling netif_stacked_transfer_operstate from its netdevice notifier.

We also need to call netif_stacked_transfer_operstate when a new macsec
device is created, so that its operstate is set properly. This is only
relevant when we try to bring the device up directly when we create it.

Radu Rendec proposed a similar patch, inspired from the 802.1q driver,
that included changing the administrative state of the macsec device,
instead of just the operstate. This version is similar to what the
macvlan driver does, and updates only the operstate.

Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver")
Reported-by: Radu Rendec 
Reported-by: Patrick Talbert 
Signed-off-by: Sabrina Dubroca 
---
 drivers/net/macsec.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index 4bb90b6867a2..6195b8edafc0 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -3306,6 +3306,9 @@ static int macsec_newlink(struct net *net, struct 
net_device *dev,
if (err < 0)
goto del_dev;
 
+   netif_stacked_transfer_operstate(real_dev, dev);
+   linkwatch_fire_event(dev);
+
macsec_generation++;
 
return 0;
@@ -3490,6 +3493,20 @@ static int macsec_notify(struct notifier_block *this, 
unsigned long event,
return NOTIFY_DONE;
 
switch (event) {
+   case NETDEV_DOWN:
+   case NETDEV_UP:
+   case NETDEV_CHANGE: {
+   struct macsec_dev *m, *n;
+   struct macsec_rxh_data *rxd;
+
+   rxd = macsec_data_rtnl(real_dev);
+   list_for_each_entry_safe(m, n, >secys, secys) {
+   struct net_device *dev = m->secy.netdev;
+
+   netif_stacked_transfer_operstate(real_dev, dev);
+   }
+   break;
+   }
case NETDEV_UNREGISTER: {
struct macsec_dev *m, *n;
struct macsec_rxh_data *rxd;
-- 
2.19.1



Re: [PATCH net] ipv6/ndisc: Preserve IPv6 control buffer if protocol error handlers are called

2018-10-26 Thread Sabrina Dubroca
2018-10-24, 14:37:21 +0200, Stefano Brivio wrote:
> Commit a61bbcf28a8c ("[NET]: Store skb->timestamp as offset to a base
> timestamp") introduces a neighbour control buffer and zeroes it out in
> ndisc_rcv(), as ndisc_recv_ns() uses it.
> 
> Commit f2776ff04722 ("[IPV6]: Fix address/interface handling in UDP and
> DCCP, according to the scoping architecture.") introduces the usage of the
> IPv6 control buffer in protocol error handlers (e.g. inet6_iif() in
> present-day __udp6_lib_err()).
> 
> Now, with commit b94f1c0904da ("ipv6: Use icmpv6_notify() to propagate
> redirect, instead of rt6_redirect()."), we call protocol error handlers
> from ndisc_redirect_rcv(), after the control buffer is already stolen and
> some parts are already zeroed out. This implies that inet6_iif() on this
> path will always return zero.
> 
> This gives unexpected results on UDP socket lookup in __udp6_lib_err(), as
> we might actually need to match sockets for a given interface.
> 
> Instead of always claiming the control buffer in ndisc_rcv(), do that only
> when needed.
> 
> Fixes: b94f1c0904da ("ipv6: Use icmpv6_notify() to propagate redirect, 
> instead of rt6_redirect().")
> Signed-off-by: Stefano Brivio 

Reviewed-by: Sabrina Dubroca 

-- 
Sabrina


Re: [PATCH iproute2] macsec: fix off-by-one when parsing attributes

2018-10-15 Thread Sabrina Dubroca
2018-10-15, 09:36:58 -0700, Stephen Hemminger wrote:
> On Fri, 12 Oct 2018 17:34:12 +0200
> Sabrina Dubroca  wrote:
> 
> > I seem to have had a massive brainfart with uses of
> > parse_rtattr_nested(). The rtattr* array must have MAX+1 elements, and
> > the call to parse_rtattr_nested must have MAX as its bound. Let's fix
> > those.
> > 
> > Fixes: b26fc590ce62 ("ip: add MACsec support")
> > Signed-off-by: Sabrina Dubroca 
> 
> Applied,
> How did it ever work??

I'm guessing it wrote over some other stack variables before their
first use. It worked without issue until the JSON patch.

Thanks,

-- 
Sabrina


Re: Bug in MACSec - stops passing traffic after approx 5TB

2018-10-14 Thread Sabrina Dubroca
2018-10-14, 10:59:31 -0400, Josh Coombs wrote:
> I initially mistook this for a traffic control issue, but after
> stripping the test beds down to just the MACSec component, I can still
> replicate the issue.  After approximately 5TB of transfer / 4 billion
> packets over a MACSec link it stops passing traffic.

I think you're just hitting packet number exhaustion. After 2^32
packets, the packet number would wrap to 0 and start being reused,
which breaks the crypto used by macsec. Before this point, you have to
add a new SA, and tell the macsec device to switch to it.

That's why you should be using wpa_supplicant. It will monitor the
growth of the packet number, and handle the rekey for you.

If you start with a PN already close to exhaustion (say, 4294967000),
you should hit the "bug" very quickly.

> # Bring up macsec:
> echo "* Enable MACSec"
> modprobe macsec
> ip link add link "$dif" "$eif" type macsec
> ip macsec add "$eif" tx sa 0 pn 1 on key 02 "$txkey"

Keep the rest of the configuration, and replace that one with:
ip macsec add "$eif" tx sa 0 pn 4294967000 on key 02 "$txkey"

to trigger the issue faster.

-- 
Sabrina


Re: [PATCH net] ipv6: rate-limit probes for neighbourless routes

2018-10-12 Thread Sabrina Dubroca
2018-10-12, 08:17:28 -0700, Eric Dumazet wrote:
> 
> 
> On 10/12/2018 07:22 AM, Sabrina Dubroca wrote:
> > When commit 270972554c91 ("[IPV6]: ROUTE: Add Router Reachability
> > Probing (RFC4191).") introduced router probing, the rt6_probe() function
> > required that a neighbour entry existed. This neighbour entry is used to
> > record the timestamp of the last probe via the ->updated field.
> > 
> > Later, commit 2152caea7196 ("ipv6: Do not depend on rt->n in rt6_probe().")
> > removed the requirement for a neighbour entry. Neighbourless routes skip
> > the interval check and are not rate-limited.
> > 
> 
> Interesting. Can you describe a little more what happens here ?

The machine gets two default routes, one valid, one via a gateway that
doesn't exist. Every time a packet is sent via the default route, we
probe the dead gateway, which doesn't have a neighbour entry. Since
there's no rate-limiting, we send too many neighbour solicitations,
disobeying the RFC.

> Could this explain gazillions of syzbot/syzkaller reports that have all in 
> common :

That seems completely unrelated. That particular trace is about router
solicitations, not neighbour solicitations. This seems more like an
RCU issue.


> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
> rcu:(detected by 0, t=10712 jiffies, g=90369, q=135)
>  
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1c4/0x2b4 lib/dump_stack.c:113
> rcu: All QSes seen, last rcu_preempt kthread activity 10548 
> (4295003843-4294993295), jiffies_till_next_fqs=1, root ->qsmask 0x0
> syz-executor0   R
>   running task
>  warn_alloc.cold.119+0xb7/0x1bd mm/page_alloc.c:3426
> 22896  7592   5826 0x801c
> Call Trace:
>  
>  sched_show_task.cold.83+0x2b6/0x30a kernel/sched/core.c:5296
>  __alloc_pages_slowpath+0x2667/0x2d80 mm/page_alloc.c:4297
>  print_other_cpu_stall.cold.79+0xa83/0xba5 kernel/rcu/tree.c:1430
>  check_cpu_stall kernel/rcu/tree.c:1557 [inline]
>  __rcu_pending kernel/rcu/tree.c:3276 [inline]
>  rcu_pending kernel/rcu/tree.c:3319 [inline]
>  rcu_check_callbacks+0xafc/0x1990 kernel/rcu/tree.c:2665
>  __alloc_pages_nodemask+0xa80/0xde0 mm/page_alloc.c:4390
>  __alloc_pages include/linux/gfp.h:473 [inline]
>  __alloc_pages_node include/linux/gfp.h:486 [inline]
>  kmem_getpages mm/slab.c:1409 [inline]
>  cache_grow_begin+0x91/0x8c0 mm/slab.c:2677
>  fallback_alloc+0x203/0x2e0 mm/slab.c:3219
>  cache_alloc_node+0x1c7/0x1e0 mm/slab.c:3287
>  slab_alloc_node mm/slab.c:3327 [inline]
>  kmem_cache_alloc_node+0xe3/0x730 mm/slab.c:3642
>  __alloc_skb+0x119/0x770 net/core/skbuff.c:193
>  alloc_skb include/linux/skbuff.h:997 [inline]
>  ndisc_alloc_skb+0x144/0x340 net/ipv6/ndisc.c:403
>  ndisc_send_rs+0x331/0x6e0 net/ipv6/ndisc.c:669
>  update_process_times+0x2d/0x70 kernel/time/timer.c:1636
>  addrconf_rs_timer+0x314/0x690 net/ipv6/addrconf.c:3836
>  tick_sched_handle+0x9f/0x180 kernel/time/tick-sched.c:164
>  tick_sched_timer+0x45/0x130 kernel/time/tick-sched.c:1274
>  __run_hrtimer kernel/time/hrtimer.c:1398 [inline]
>  __hrtimer_run_queues+0x41c/0x10d0 kernel/time/hrtimer.c:1460
>  call_timer_fn+0x272/0x920 kernel/time/timer.c:1326
>  hrtimer_interrupt+0x313/0x780 kernel/time/hrtimer.c:1518 

-- 
Sabrina


[PATCH iproute2] json: make 0xhex handle u64

2018-10-12 Thread Sabrina Dubroca
Stephen converted macsec's sci to use 0xhex, but 0xhex handles
unsigned int's, not 64 bits ints. Thus, the output of the "ip macsec
show" command is mangled, with half of the SCI replaced with 0s:

# ip macsec show
11: macsec0: [...]
cipher suite: GCM-AES-128, using ICV length 16
TXSC: 01560001 on SA 0

# ip -d link show macsec0
11: macsec0@ens3: [...]
link/ether 52:54:00:12:01:56 brd ff:ff:ff:ff:ff:ff promiscuity 0 
macsec sci 5254001201560001 [...]

where TXSC and sci should match.

Fixes: c0b904de6211 ("macsec: support JSON")
Signed-off-by: Sabrina Dubroca 
---
 include/json_print.h | 2 +-
 lib/json_print.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/json_print.h b/include/json_print.h
index 78a6c83fb516..218da31a73fe 100644
--- a/include/json_print.h
+++ b/include/json_print.h
@@ -66,7 +66,7 @@ _PRINT_FUNC(uint, unsigned int);
 _PRINT_FUNC(u64, uint64_t);
 _PRINT_FUNC(hu, unsigned short);
 _PRINT_FUNC(hex, unsigned int);
-_PRINT_FUNC(0xhex, unsigned int);
+_PRINT_FUNC(0xhex, unsigned long long int);
 _PRINT_FUNC(luint, unsigned long int);
 _PRINT_FUNC(lluint, unsigned long long int);
 _PRINT_FUNC(float, double);
diff --git a/lib/json_print.c b/lib/json_print.c
index eed109c56401..f7ef41c1570f 100644
--- a/lib/json_print.c
+++ b/lib/json_print.c
@@ -172,12 +172,12 @@ void print_color_0xhex(enum output_type type,
   enum color_attr color,
   const char *key,
   const char *fmt,
-  unsigned int hex)
+  unsigned long long hex)
 {
if (_IS_JSON_CONTEXT(type)) {
SPRINT_BUF(b1);
 
-   snprintf(b1, sizeof(b1), "%#x", hex);
+   snprintf(b1, sizeof(b1), "%#llx", hex);
print_string(PRINT_JSON, key, NULL, b1);
} else if (_IS_FP_CONTEXT(type)) {
color_fprintf(stdout, color, fmt, hex);
-- 
2.19.1



[PATCH iproute2] macsec: fix off-by-one when parsing attributes

2018-10-12 Thread Sabrina Dubroca
I seem to have had a massive brainfart with uses of
parse_rtattr_nested(). The rtattr* array must have MAX+1 elements, and
the call to parse_rtattr_nested must have MAX as its bound. Let's fix
those.

Fixes: b26fc590ce62 ("ip: add MACsec support")
Signed-off-by: Sabrina Dubroca 
---
 ip/ipmacsec.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/ip/ipmacsec.c b/ip/ipmacsec.c
index fa56e0eee774..007ce5404788 100644
--- a/ip/ipmacsec.c
+++ b/ip/ipmacsec.c
@@ -727,7 +727,7 @@ static void print_txsc_stats(const char *prefix, struct 
rtattr *attr)
if (!attr || show_stats == 0)
return;
 
-   parse_rtattr_nested(stats, MACSEC_TXSC_STATS_ATTR_MAX + 1, attr);
+   parse_rtattr_nested(stats, MACSEC_TXSC_STATS_ATTR_MAX, attr);
 
print_stats(prefix, txsc_stats_names, NUM_MACSEC_TXSC_STATS_ATTR,
stats);
@@ -751,7 +751,7 @@ static void print_secy_stats(const char *prefix, struct 
rtattr *attr)
if (!attr || show_stats == 0)
return;
 
-   parse_rtattr_nested(stats, MACSEC_SECY_STATS_ATTR_MAX + 1, attr);
+   parse_rtattr_nested(stats, MACSEC_SECY_STATS_ATTR_MAX, attr);
 
print_stats(prefix, secy_stats_names,
NUM_MACSEC_SECY_STATS_ATTR, stats);
@@ -772,7 +772,7 @@ static void print_rxsa_stats(const char *prefix, struct 
rtattr *attr)
if (!attr || show_stats == 0)
return;
 
-   parse_rtattr_nested(stats, MACSEC_SA_STATS_ATTR_MAX + 1, attr);
+   parse_rtattr_nested(stats, MACSEC_SA_STATS_ATTR_MAX, attr);
 
print_stats(prefix, rxsa_stats_names, NUM_MACSEC_SA_STATS_ATTR, stats);
 }
@@ -789,7 +789,7 @@ static void print_txsa_stats(const char *prefix, struct 
rtattr *attr)
if (!attr || show_stats == 0)
return;
 
-   parse_rtattr_nested(stats, MACSEC_SA_STATS_ATTR_MAX + 1, attr);
+   parse_rtattr_nested(stats, MACSEC_SA_STATS_ATTR_MAX, attr);
 
print_stats(prefix, txsa_stats_names, NUM_MACSEC_SA_STATS_ATTR, stats);
 }
@@ -817,7 +817,7 @@ static void print_tx_sc(const char *prefix, __u64 sci, __u8 
encoding_sa,
bool state;
 
open_json_object(NULL);
-   parse_rtattr_nested(sa_attr, MACSEC_SA_ATTR_MAX + 1, a);
+   parse_rtattr_nested(sa_attr, MACSEC_SA_ATTR_MAX, a);
state = rta_getattr_u8(sa_attr[MACSEC_SA_ATTR_ACTIVE]);
 
print_string(PRINT_FP, NULL, "%s", prefix);
@@ -858,7 +858,7 @@ static void print_rxsc_stats(const char *prefix, struct 
rtattr *attr)
if (!attr || show_stats == 0)
return;
 
-   parse_rtattr_nested(stats, MACSEC_RXSC_STATS_ATTR_MAX + 1, attr);
+   parse_rtattr_nested(stats, MACSEC_RXSC_STATS_ATTR_MAX, attr);
 
print_stats(prefix, rxsc_stats_names,
NUM_MACSEC_RXSC_STATS_ATTR, stats);
@@ -885,7 +885,7 @@ static void print_rx_sc(const char *prefix, __be64 sci, 
__u8 active,
bool state;
 
open_json_object(NULL);
-   parse_rtattr_nested(sa_attr, MACSEC_SA_ATTR_MAX + 1, a);
+   parse_rtattr_nested(sa_attr, MACSEC_SA_ATTR_MAX, a);
state = rta_getattr_u8(sa_attr[MACSEC_SA_ATTR_ACTIVE]);
 
print_string(PRINT_FP, NULL, "%s", prefix);
@@ -918,7 +918,7 @@ static void print_rxsc_list(struct rtattr *sc)
 
open_json_object(NULL);
 
-   parse_rtattr_nested(sc_attr, MACSEC_RXSC_ATTR_MAX + 1, c);
+   parse_rtattr_nested(sc_attr, MACSEC_RXSC_ATTR_MAX, c);
print_rx_sc("",
rta_getattr_u64(sc_attr[MACSEC_RXSC_ATTR_SCI]),
rta_getattr_u32(sc_attr[MACSEC_RXSC_ATTR_ACTIVE]),
@@ -958,7 +958,7 @@ static int process(const struct sockaddr_nl *who, struct 
nlmsghdr *n,
}
 
ifindex = rta_getattr_u32(attrs[MACSEC_ATTR_IFINDEX]);
-   parse_rtattr_nested(attrs_secy, MACSEC_SECY_ATTR_MAX + 1,
+   parse_rtattr_nested(attrs_secy, MACSEC_SECY_ATTR_MAX,
attrs[MACSEC_ATTR_SECY]);
 
if (!validate_secy_dump(attrs_secy)) {
-- 
2.19.1



[PATCH net] ipv6: rate-limit probes for neighbourless routes

2018-10-12 Thread Sabrina Dubroca
When commit 270972554c91 ("[IPV6]: ROUTE: Add Router Reachability
Probing (RFC4191).") introduced router probing, the rt6_probe() function
required that a neighbour entry existed. This neighbour entry is used to
record the timestamp of the last probe via the ->updated field.

Later, commit 2152caea7196 ("ipv6: Do not depend on rt->n in rt6_probe().")
removed the requirement for a neighbour entry. Neighbourless routes skip
the interval check and are not rate-limited.

This patch adds rate-limiting for neighbourless routes, by recording the
timestamp of the last probe in the fib6_info itself.

Fixes: 2152caea7196 ("ipv6: Do not depend on rt->n in rt6_probe().")
Signed-off-by: Sabrina Dubroca 
Reviewed-by: Stefano Brivio 
---
 include/net/ip6_fib.h |  4 
 net/ipv6/route.c  | 12 ++--
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 3d4930528db0..2d31e22babd8 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -159,6 +159,10 @@ struct fib6_info {
struct rt6_info * __percpu  *rt6i_pcpu;
struct rt6_exception_bucket __rcu *rt6i_exception_bucket;
 
+#ifdef CONFIG_IPV6_ROUTER_PREF
+   unsigned long   last_probe;
+#endif
+
u32 fib6_metric;
u8  fib6_protocol;
u8  fib6_type;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index a366c05a239d..abcb5ae77319 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -520,10 +520,11 @@ static void rt6_probe_deferred(struct work_struct *w)
 
 static void rt6_probe(struct fib6_info *rt)
 {
-   struct __rt6_probe_work *work;
+   struct __rt6_probe_work *work = NULL;
const struct in6_addr *nh_gw;
struct neighbour *neigh;
struct net_device *dev;
+   struct inet6_dev *idev;
 
/*
 * Okay, this does not seem to be appropriate
@@ -539,15 +540,12 @@ static void rt6_probe(struct fib6_info *rt)
nh_gw = >fib6_nh.nh_gw;
dev = rt->fib6_nh.nh_dev;
rcu_read_lock_bh();
+   idev = __in6_dev_get(dev);
neigh = __ipv6_neigh_lookup_noref(dev, nh_gw);
if (neigh) {
-   struct inet6_dev *idev;
-
if (neigh->nud_state & NUD_VALID)
goto out;
 
-   idev = __in6_dev_get(dev);
-   work = NULL;
write_lock(>lock);
if (!(neigh->nud_state & NUD_VALID) &&
time_after(jiffies,
@@ -557,11 +555,13 @@ static void rt6_probe(struct fib6_info *rt)
__neigh_set_probe_once(neigh);
}
write_unlock(>lock);
-   } else {
+   } else if (time_after(jiffies, rt->last_probe +
+  idev->cnf.rtr_probe_interval)) {
work = kmalloc(sizeof(*work), GFP_ATOMIC);
}
 
if (work) {
+   rt->last_probe = jiffies;
INIT_WORK(>work, rt6_probe_deferred);
work->target = *nh_gw;
dev_hold(dev);
-- 
2.19.1



[PATCH net v2 0/2] net: ipv4: fixes for PMTU when link MTU changes

2018-10-09 Thread Sabrina Dubroca
The first patch adapts the changes that commit e9fa1495d738 ("ipv6:
Reflect MTU changes on PMTU of exceptions for MTU-less routes") did in
IPv6 to IPv4: lower PMTU when the first hop's MTU drops below it, and
raise PMTU when the first hop was limiting PMTU discovery and its MTU
is increased.

The second patch fixes bugs introduced in commit d52e5a7e7ca4 ("ipv4:
lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu") that
only appear once the first patch is applied.

Selftests for these cases were introduced in net-next commit
e44e428f59e4 ("selftests: pmtu: add basic IPv4 and IPv6 PMTU tests")

v2: add cover letter, and fix a few small things in patch 1

Sabrina Dubroca (2):
  net: ipv4: update fnhe_pmtu when first hop's MTU changes
  net: ipv4: don't let PMTU updates increase route MTU

 include/linux/netdevice.h |  7 ++
 include/net/ip_fib.h  |  1 +
 net/core/dev.c| 28 --
 net/ipv4/fib_frontend.c   | 12 ++
 net/ipv4/fib_semantics.c  | 50 +++
 net/ipv4/route.c  |  7 +++---
 6 files changed, 96 insertions(+), 9 deletions(-)

-- 
2.19.0



[PATCH net v2 2/2] net: ipv4: don't let PMTU updates increase route MTU

2018-10-09 Thread Sabrina Dubroca
When an MTU update with PMTU smaller than net.ipv4.route.min_pmtu is
received, we must clamp its value. However, we can receive a PMTU
exception with PMTU < old_mtu < ip_rt_min_pmtu, which would lead to an
increase in PMTU.

To fix this, take the smallest of the old MTU and ip_rt_min_pmtu.

Before this patch, in case of an update, the exception's MTU would
always change. Now, an exception can have only its lock flag updated,
but not the MTU, so we need to add a check on locking to the following
"is this exception getting updated, or close to expiring?" test.

Fixes: d52e5a7e7ca4 ("ipv4: lock mtu in fnhe when received PMTU < 
net.ipv4.route.min_pmtu")
Signed-off-by: Sabrina Dubroca 
Reviewed-by: Stefano Brivio 
---
v2: unchanged

 net/ipv4/route.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index b678466da451..8501554e96a4 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1001,21 +1001,22 @@ out:kfree_skb(skb);
 static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
 {
struct dst_entry *dst = >dst;
+   u32 old_mtu = ipv4_mtu(dst);
struct fib_result res;
bool lock = false;
 
if (ip_mtu_locked(dst))
return;
 
-   if (ipv4_mtu(dst) < mtu)
+   if (old_mtu < mtu)
return;
 
if (mtu < ip_rt_min_pmtu) {
lock = true;
-   mtu = ip_rt_min_pmtu;
+   mtu = min(old_mtu, ip_rt_min_pmtu);
}
 
-   if (rt->rt_pmtu == mtu &&
+   if (rt->rt_pmtu == mtu && !lock &&
time_before(jiffies, dst->expires - ip_rt_mtu_expires / 2))
return;
 
-- 
2.19.0



[PATCH net v2 1/2] net: ipv4: update fnhe_pmtu when first hop's MTU changes

2018-10-09 Thread Sabrina Dubroca
Since commit 5aad1de5ea2c ("ipv4: use separate genid for next hop
exceptions"), exceptions get deprecated separately from cached
routes. In particular, administrative changes don't clear PMTU anymore.

As Stefano described in commit e9fa1495d738 ("ipv6: Reflect MTU changes
on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
the local MTU change can become stale:
 - if the local MTU is now lower than the PMTU, that PMTU is now
   incorrect
 - if the local MTU was the lowest value in the path, and is increased,
   we might discover a higher PMTU

Similarly to what commit e9fa1495d738 did for IPv6, update PMTU in those
cases.

If the exception was locked, the discovered PMTU was smaller than the
minimal accepted PMTU. In that case, if the new local MTU is smaller
than the current PMTU, let PMTU discovery figure out if locking of the
exception is still needed.

To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
notifier. By the time the notifier is called, dev->mtu has been
changed. This patch adds the old MTU as additional information in the
notifier structure, and a new call_netdevice_notifiers_u32() function.

Fixes: 5aad1de5ea2c ("ipv4: use separate genid for next hop exceptions")
Signed-off-by: Sabrina Dubroca 
Reviewed-by: Stefano Brivio 
---
v2:
 - s/u32/mtu/ in netdev_notifier_info_ext and call_netdevice_notifiers_
   helper, suggested by David Ahern
 - don't EXPORT_SYMBOL the helper, it's only used in net/core/dev.c
 - fix typo in commit message
 - fix kerneldoc comment, spotted by kbuild bot

 include/linux/netdevice.h |  7 ++
 include/net/ip_fib.h  |  1 +
 net/core/dev.c| 28 --
 net/ipv4/fib_frontend.c   | 12 ++
 net/ipv4/fib_semantics.c  | 50 +++
 5 files changed, 92 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c7861e4b402c..d837dad24b4c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2458,6 +2458,13 @@ struct netdev_notifier_info {
struct netlink_ext_ack  *extack;
 };
 
+struct netdev_notifier_info_ext {
+   struct netdev_notifier_info info; /* must be first */
+   union {
+   u32 mtu;
+   } ext;
+};
+
 struct netdev_notifier_change_info {
struct netdev_notifier_info info; /* must be first */
unsigned int flags_changed;
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 69c91d1934c1..c9b7b136939d 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -394,6 +394,7 @@ int ip_fib_check_default(__be32 gw, struct net_device *dev);
 int fib_sync_down_dev(struct net_device *dev, unsigned long event, bool force);
 int fib_sync_down_addr(struct net_device *dev, __be32 local);
 int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
+void fib_sync_mtu(struct net_device *dev, u32 orig_mtu);
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 int fib_multipath_hash(const struct net *net, const struct flowi4 *fl4,
diff --git a/net/core/dev.c b/net/core/dev.c
index 82114ee6..93243479085f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1752,6 +1752,28 @@ int call_netdevice_notifiers(unsigned long val, struct 
net_device *dev)
 }
 EXPORT_SYMBOL(call_netdevice_notifiers);
 
+/**
+ * call_netdevice_notifiers_mtu - call all network notifier blocks
+ * @val: value passed unmodified to notifier function
+ * @dev: net_device pointer passed unmodified to notifier function
+ * @arg: additional u32 argument passed to the notifier function
+ *
+ * Call all network notifier blocks.  Parameters and return value
+ * are as for raw_notifier_call_chain().
+ */
+static int call_netdevice_notifiers_mtu(unsigned long val,
+   struct net_device *dev, u32 arg)
+{
+   struct netdev_notifier_info_ext info = {
+   .info.dev = dev,
+   .ext.mtu = arg,
+   };
+
+   BUILD_BUG_ON(offsetof(struct netdev_notifier_info_ext, info) != 0);
+
+   return call_netdevice_notifiers_info(val, );
+}
+
 #ifdef CONFIG_NET_INGRESS
 static DEFINE_STATIC_KEY_FALSE(ingress_needed_key);
 
@@ -7574,14 +7596,16 @@ int dev_set_mtu_ext(struct net_device *dev, int new_mtu,
err = __dev_set_mtu(dev, new_mtu);
 
if (!err) {
-   err = call_netdevice_notifiers(NETDEV_CHANGEMTU, dev);
+   err = call_netdevice_notifiers_mtu(NETDEV_CHANGEMTU, dev,
+  orig_mtu);
err = notifier_to_errno(err);
if (err) {
/* setting mtu back and notifying everyone again,
 * so that they have a chance to revert changes.
 */
__dev_set_mtu(dev, orig_mtu);
-   call_netdevice_notifiers(NETDEV_CHANGEMTU, dev);
+   call_netdevice_no

Re: [PATCH net 1/2] net: ipv4: update fnhe_pmtu when first hop's MTU changes

2018-10-08 Thread Sabrina Dubroca
2018-10-08, 11:18:49 -0600, David Ahern wrote:
> On 10/8/18 6:36 AM, Sabrina Dubroca wrote:
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index c7861e4b402c..dc9d2668d9bb 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -2458,6 +2458,13 @@ struct netdev_notifier_info {
> > struct netlink_ext_ack  *extack;
> >  };
> >  
> > +struct netdev_notifier_info_ext {
> > +   struct netdev_notifier_info info; /* must be first */
> > +   union {
> > +   u32 u32;
> 
> I realize you want this to be generic, but that is a really odd
> definition. can you make that mtu instead? the union allows other use
> cases to add new names.

It might get ugly if we end up with 4 different u32, but ok, I'll
rename this and we can see how it evolves.

-- 
Sabrina


[PATCH net-next 0/3] selftests: add more PMTU tests

2018-10-08 Thread Sabrina Dubroca
The current selftests for PMTU cover VTI tunnels, but there's nothing
about the generation and handling of PMTU exceptions by intermediate
routers. This series adds and improves existing helpers, then adds
IPv4 and IPv6 selftests with a setup involving an intermediate router.

Joint work with Stefano Brivio.

Sabrina Dubroca (2):
  selftests: pmtu: extend MTU parsing helper to locked MTU
  selftests: pmtu: add basic IPv4 and IPv6 PMTU tests

Stefano Brivio (1):
  selftests: pmtu: Introduce check_pmtu_value()

 tools/testing/selftests/net/pmtu.sh | 258 
 1 file changed, 227 insertions(+), 31 deletions(-)

-- 
2.19.0



[PATCH net-next 2/3] selftests: pmtu: extend MTU parsing helper to locked MTU

2018-10-08 Thread Sabrina Dubroca
The mtu_parse helper introduced in commit f2c929feeccd ("selftests:
pmtu: Factor out MTU parsing helper") can only handle "mtu 1234", but
not "mtu lock 1234". Extend it, so that we can do IPv4 tests with PMTU
smaller than net.ipv4.route.min_pmtu

Signed-off-by: Sabrina Dubroca 
---
 tools/testing/selftests/net/pmtu.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index 6c38a7637744..03e56a27f69c 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -196,7 +196,9 @@ mtu_parse() {
 
next=0
for i in ${input}; do
+   [ ${next} -eq 1 -a "${i}" = "lock" ] && next=2 && continue
[ ${next} -eq 1 ] && echo "${i}" && return
+   [ ${next} -eq 2 ] && echo "lock ${i}" && return
[ "${i}" = "mtu" ] && next=1
done
 }
-- 
2.19.0



[PATCH net-next 1/3] selftests: pmtu: Introduce check_pmtu_value()

2018-10-08 Thread Sabrina Dubroca
From: Stefano Brivio 

Introduce and use a function that checks PMTU values against
expected values and logs error messages, to remove some clutter.

Signed-off-by: Stefano Brivio 
Signed-off-by: Sabrina Dubroca 
---
 tools/testing/selftests/net/pmtu.sh | 49 +
 1 file changed, 22 insertions(+), 27 deletions(-)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index 0ab9423d009f..6c38a7637744 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -229,6 +229,19 @@ route_get_dst_pmtu_from_exception() {
mtu_parse "$(route_get_dst_exception "${ns_cmd}" ${dst})"
 }
 
+check_pmtu_value() {
+   expected="${1}"
+   value="${2}"
+   event="${3}"
+
+   [ "${expected}" = "any" ] && [ -n "${value}" ] && return 0
+   [ "${value}" = "${expected}" ] && return 0
+   [ -z "${value}" ] &&err "  PMTU exception wasn't created after 
${event}" && return 1
+   [ -z "${expected}" ] && err "  PMTU exception shouldn't exist after 
${event}" && return 1
+   err "  found PMTU exception with incorrect MTU ${value}, expected 
${expected}, after ${event}"
+   return 1
+}
+
 test_pmtu_vti4_exception() {
setup namespaces veth vti4 xfrm4 || return 2
 
@@ -248,24 +261,13 @@ test_pmtu_vti4_exception() {
# exception is created
${ns_a} ping -q -M want -i 0.1 -w 2 -s ${ping_payload} ${vti4_b_addr} > 
/dev/null
pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${vti4_b_addr})"
-   if [ "${pmtu}" != "" ]; then
-   err "  unexpected exception created with PMTU ${pmtu} for IP 
payload length ${esp_payload_rfc4106}"
-   return 1
-   fi
+   check_pmtu_value "" "${pmtu}" "sending packet smaller than PMTU (IP 
payload length ${esp_payload_rfc4106})" || return 1
 
# Now exceed link layer MTU by one byte, check that exception is created
+   # with the right PMTU value
${ns_a} ping -q -M want -i 0.1 -w 2 -s $((ping_payload + 1)) 
${vti4_b_addr} > /dev/null
pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${vti4_b_addr})"
-   if [ "${pmtu}" = "" ]; then
-   err "  exception not created for IP payload length 
$((esp_payload_rfc4106 + 1))"
-   return 1
-   fi
-
-   # ...with the right PMTU value
-   if [ ${pmtu} -ne ${esp_payload_rfc4106} ]; then
-   err "  wrong PMTU ${pmtu} in exception, expected: 
${esp_payload_rfc4106}"
-   return 1
-   fi
+   check_pmtu_value "${esp_payload_rfc4106}" "${pmtu}" "exceeding PMTU (IP 
payload length $((esp_payload_rfc4106 + 1)))"
 }
 
 test_pmtu_vti6_exception() {
@@ -280,25 +282,18 @@ test_pmtu_vti6_exception() {
${ns_a} ${ping6} -q -i 0.1 -w 2 -s 6 ${vti6_b_addr} > /dev/null
 
# Check that exception was created
-   if [ "$(route_get_dst_pmtu_from_exception "${ns_a}" ${vti6_b_addr})" = 
"" ]; then
-   err "  tunnel exceeding link layer MTU didn't create route 
exception"
-   return 1
-   fi
+   pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${vti6_b_addr})"
+   check_pmtu_value any "${pmtu}" "creating tunnel exceeding link layer 
MTU" || return 1
 
# Decrease tunnel MTU, check for PMTU decrease in route exception
mtu "${ns_a}" vti6_a 3000
-
-   if [ "$(route_get_dst_pmtu_from_exception "${ns_a}" ${vti6_b_addr})" 
-ne 3000 ]; then
-   err "  decreasing tunnel MTU didn't decrease route exception 
PMTU"
-   fail=1
-   fi
+   pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${vti6_b_addr})"
+   check_pmtu_value "3000" "${pmtu}" "decreasing tunnel MTU" || fail=1
 
# Increase tunnel MTU, check for PMTU increase in route exception
mtu "${ns_a}" vti6_a 9000
-   if [ "$(route_get_dst_pmtu_from_exception "${ns_a}" ${vti6_b_addr})" 
-ne 9000 ]; then
-   err "  increasing tunnel MTU didn't increase route exception 
PMTU"
-   fail=1
-   fi
+   pmtu="$(route_get_dst_pmtu_from_exception "${ns_a}" ${vti6_b_addr})"
+   check_pmtu_value "9000" "${pmtu}" "increasing tunnel MTU" || fail=1
 
return ${fail}
 }
-- 
2.19.0



[PATCH net-next 3/3] selftests: pmtu: add basic IPv4 and IPv6 PMTU tests

2018-10-08 Thread Sabrina Dubroca
Commit d1f1b9cbf34c ("selftests: net: Introduce first PMTU test") and
follow-ups introduced some PMTU tests, but they all rely on tunneling,
and, particularly, on VTI.

These new tests use simple routing to exercise the generation and
update of PMTU exceptions in IPv4 and IPv6.

Signed-off-by: Sabrina Dubroca 
Signed-off-by: Stefano Brivio 
---
 tools/testing/selftests/net/pmtu.sh | 207 +++-
 1 file changed, 203 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index 03e56a27f69c..b9cdb68df4c5 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -6,6 +6,26 @@
 #
 # Tests currently implemented:
 #
+# - pmtu_ipv4
+#  Set up two namespaces, A and B, with two paths between them over routers
+#  R1 and R2 (also implemented with namespaces), with different MTUs:
+#
+#segment a_r1segment b_r1  a_r1: 2000
+#  .--R1--.a_r2: 1500
+#  A   B   a_r3: 2000
+#  '--R2--'a_r4: 1400
+#segment a_r2segment b_r2
+#
+#  Check that PMTU exceptions with the correct PMTU are created. Then
+#  decrease and increase the MTU of the local link for one of the paths,
+#  A to R1, checking that route exception PMTU changes accordingly over
+#  this path. Also check that locked exceptions are created when an ICMP
+#  message advertising a PMTU smaller than net.ipv4.route.min_pmtu is
+#  received
+#
+# - pmtu_ipv6
+#  Same as pmtu_ipv4, except for locked PMTU tests, using IPv6
+#
 # - pmtu_vti4_exception
 #  Set up vti tunnel on top of veth, with xfrm states and policies, in two
 #  namespaces with matching endpoints. Check that route exception is not
@@ -50,6 +70,8 @@ ksft_skip=4
 which ping6 > /dev/null 2>&1 && ping6=$(which ping6) || ping6=$(which ping)
 
 tests="
+   pmtu_ipv4_exception ipv4: PMTU exceptions
+   pmtu_ipv6_exception ipv6: PMTU exceptions
pmtu_vti6_exception vti6: PMTU exceptions
pmtu_vti4_exception vti4: PMTU exceptions
pmtu_vti4_default_mtu   vti4: default MTU assignment
@@ -60,8 +82,45 @@ tests="
 
 NS_A="ns-$(mktemp -u XX)"
 NS_B="ns-$(mktemp -u XX)"
+NS_R1="ns-$(mktemp -u XX)"
+NS_R2="ns-$(mktemp -u XX)"
 ns_a="ip netns exec ${NS_A}"
 ns_b="ip netns exec ${NS_B}"
+ns_r1="ip netns exec ${NS_R1}"
+ns_r2="ip netns exec ${NS_R2}"
+
+# Addressing and routing for tests with routers: four network segments, with
+# index SEGMENT between 1 and 4, a common prefix (PREFIX4 or PREFIX6) and an
+# identifier ID, which is 1 for hosts (A and B), 2 for routers (R1 and R2).
+# Addresses are:
+# - IPv4: PREFIX4.SEGMENT.ID (/24)
+# - IPv6: PREFIX6:SEGMENT::ID (/64)
+prefix4="192.168"
+prefix6="fd00"
+a_r1=1
+a_r2=2
+b_r1=3
+b_r2=4
+#  ns  peersegment
+routing_addrs="
+   A   R1  ${a_r1}
+   A   R2  ${a_r2}
+   B   R1  ${b_r1}
+   B   R2  ${b_r2}
+"
+# Traffic from A to B goes through R1 by default, and through R2, if destined 
to
+# B's address on the b_r2 segment.
+# Traffic from B to A goes through R1.
+#  ns  destination gateway
+routes="
+   A   default ${prefix4}.${a_r1}.2
+   A   ${prefix4}.${b_r2}.1${prefix4}.${a_r2}.2
+   B   default ${prefix4}.${b_r1}.2
+
+   A   default ${prefix6}:${a_r1}::2
+   A   ${prefix6}:${b_r2}::1   ${prefix6}:${a_r2}::2
+   B   default ${prefix6}:${b_r1}::2
+"
 
 veth4_a_addr="192.168.1.1"
 veth4_b_addr="192.168.1.2"
@@ -94,9 +153,15 @@ err_flush() {
err_buf=
 }
 
+# Find the auto-generated name for this namespace
+nsname() {
+   eval echo \$NS_$1
+}
+
 setup_namespaces() {
-   ip netns add ${NS_A} || return 1
-   ip netns add ${NS_B}
+   for n in ${NS_A} ${NS_B} ${NS_R1} ${NS_R2}; do
+   ip netns add ${n} || return 1
+   done
 }
 
 setup_veth() {
@@ -167,6 +232,49 @@ setup_xfrm6() {
setup_xfrm 6 ${veth6_a_addr} ${veth6_b_addr}
 }
 
+setup_routing() {
+   for i in ${NS_R1} ${NS_R2}; do
+   ip netns exec ${i} sysctl -q net/ipv4/ip_forward=1
+   ip netns exec ${i} sysctl -q net/ipv6/conf/all/forwarding=1
+   done
+
+   for i in ${routing_addrs}; do
+   [ "${ns}" = "" ]&& ns="${i}"&& continue
+   [ "${peer}" = "" ]  && peer="${i}"  && continue
+   [ "${segment}" = "&qu

[PATCH net 2/2] ipv4: don't let PMTU updates increase route MTU

2018-10-08 Thread Sabrina Dubroca
When an MTU update with PMTU smaller than net.ipv4.route.min_pmtu is
received, we must clamp its value. However, we can receive a PMTU
exception with PMTU < old_mtu < ip_rt_min_pmtu, which would lead to an
increase in PMTU.

To fix this, take the smallest of the old MTU and ip_rt_min_pmtu.

Before this patch, in case of an update, the exception's MTU would
always change. Now, an exception can have only its lock flag updated,
but not the MTU, so we need to add a check on locking to the following
"is this exception getting updated, or close to expiring?" test.

Fixes: d52e5a7e7ca4 ("ipv4: lock mtu in fnhe when received PMTU < 
net.ipv4.route.min_pmtu")
Signed-off-by: Sabrina Dubroca 
Reviewed-by: Stefano Brivio 
---
 net/ipv4/route.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index b678466da451..8501554e96a4 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1001,21 +1001,22 @@ out:kfree_skb(skb);
 static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
 {
struct dst_entry *dst = >dst;
+   u32 old_mtu = ipv4_mtu(dst);
struct fib_result res;
bool lock = false;
 
if (ip_mtu_locked(dst))
return;
 
-   if (ipv4_mtu(dst) < mtu)
+   if (old_mtu < mtu)
return;
 
if (mtu < ip_rt_min_pmtu) {
lock = true;
-   mtu = ip_rt_min_pmtu;
+   mtu = min(old_mtu, ip_rt_min_pmtu);
}
 
-   if (rt->rt_pmtu == mtu &&
+   if (rt->rt_pmtu == mtu && !lock &&
time_before(jiffies, dst->expires - ip_rt_mtu_expires / 2))
return;
 
-- 
2.19.0



[PATCH net 1/2] net: ipv4: update fnhe_pmtu when first hop's MTU changes

2018-10-08 Thread Sabrina Dubroca
Since commit 5aad1de5ea2c ("ipv4: use separate genid for next hop
exceptions"), exceptions get deprecated separately from cached
routes. In particular, administrative changes don't clear PMTU anymore.

As Stefano described in commit e9fa1495d738 ("ipv6: Reflect MTU changes
on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
the local MTU change can become stale:
 - if the local MTU is now lower than the PMTU, that PMTU is now
   incorrect
 - if the local MTU was the lowest value in the path, and is increased,
   we might discover a higher PMTU

Similarly to what commit e9fa1495d738 did for IPv6, update PMTU in those
cases.

If the exception was locked, discovered the PMTU was smaller than the
minimal accepted PMTU. In that case, if the new local MTU is smaller
than the current PMTU, let PMTU discovery figure out if locking of the
exception is still needed.

To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
notifier. By the time the notifier is called, dev->mtu has been
changed. This patch adds the old MTU as additional information in the
notifier structure, and a new call_netdevice_notifiers_u32() function.

Fixes: 5aad1de5ea2c ("ipv4: use separate genid for next hop exceptions")
Signed-off-by: Sabrina Dubroca 
Reviewed-by: Stefano Brivio 
---
 include/linux/netdevice.h |  7 ++
 include/net/ip_fib.h  |  1 +
 net/core/dev.c| 29 +--
 net/ipv4/fib_frontend.c   | 12 ++
 net/ipv4/fib_semantics.c  | 50 +++
 5 files changed, 93 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index c7861e4b402c..dc9d2668d9bb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2458,6 +2458,13 @@ struct netdev_notifier_info {
struct netlink_ext_ack  *extack;
 };
 
+struct netdev_notifier_info_ext {
+   struct netdev_notifier_info info; /* must be first */
+   union {
+   u32 u32;
+   } ext;
+};
+
 struct netdev_notifier_change_info {
struct netdev_notifier_info info; /* must be first */
unsigned int flags_changed;
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 69c91d1934c1..c9b7b136939d 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -394,6 +394,7 @@ int ip_fib_check_default(__be32 gw, struct net_device *dev);
 int fib_sync_down_dev(struct net_device *dev, unsigned long event, bool force);
 int fib_sync_down_addr(struct net_device *dev, __be32 local);
 int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
+void fib_sync_mtu(struct net_device *dev, u32 orig_mtu);
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 int fib_multipath_hash(const struct net *net, const struct flowi4 *fl4,
diff --git a/net/core/dev.c b/net/core/dev.c
index 82114ee6..f4a505fb5cb2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1752,6 +1752,29 @@ int call_netdevice_notifiers(unsigned long val, struct 
net_device *dev)
 }
 EXPORT_SYMBOL(call_netdevice_notifiers);
 
+/**
+ * call_netdevice_notifiers_u32 - call all network notifier blocks
+ * @val: value passed unmodified to notifier function
+ *  @dev: net_device pointer passed unmodified to notifier function
+ *  @u32: additional u32 argument passed to the notifier function
+ *
+ * Call all network notifier blocks.  Parameters and return value
+ * are as for raw_notifier_call_chain().
+ */
+int call_netdevice_notifiers_u32(unsigned long val, struct net_device *dev,
+u32 arg)
+{
+   struct netdev_notifier_info_ext info = {
+   .info.dev = dev,
+   .ext.u32 = arg,
+   };
+
+   BUILD_BUG_ON(offsetof(struct netdev_notifier_info_ext, info) != 0);
+
+   return call_netdevice_notifiers_info(val, );
+}
+EXPORT_SYMBOL(call_netdevice_notifiers_u32);
+
 #ifdef CONFIG_NET_INGRESS
 static DEFINE_STATIC_KEY_FALSE(ingress_needed_key);
 
@@ -7574,14 +7597,16 @@ int dev_set_mtu_ext(struct net_device *dev, int new_mtu,
err = __dev_set_mtu(dev, new_mtu);
 
if (!err) {
-   err = call_netdevice_notifiers(NETDEV_CHANGEMTU, dev);
+   err = call_netdevice_notifiers_u32(NETDEV_CHANGEMTU, dev,
+  orig_mtu);
err = notifier_to_errno(err);
if (err) {
/* setting mtu back and notifying everyone again,
 * so that they have a chance to revert changes.
 */
__dev_set_mtu(dev, orig_mtu);
-   call_netdevice_notifiers(NETDEV_CHANGEMTU, dev);
+   call_netdevice_notifiers_u32(NETDEV_CHANGEMTU, dev,
+new_mtu);
}
}
return err;
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 2998b0e47d4b..8

Re: [PATCH 1/1] macsec: reflect the master interface state

2018-10-01 Thread Sabrina Dubroca
2018-09-19, 12:44:41 -0400, Radu Rendec wrote:
> Hello,

Gah, sorry, this fell into the "week-end" crack and I forgot to answer :(

> On Wed, Sep 19, 2018 at 11:24 AM Sabrina Dubroca  wrote:
> > 2018-09-18, 20:16:12 -0400, Radu Rendec wrote:
> > > This patch makes macsec interfaces reflect the state of the underlying
> > > interface: if the master interface changes state to up/down, the macsec
> > > interface changes its state accordingly.
> >
> > Yes, that's a bit unfortunate. I was wondering if we should just allow
> > setting the device up, and let link state handle the rest.
> 
> Yes, that could work. It would also be consistent with the idea of
> propagating only the link state.

Do you want to handle it, or should I?

> > It's missing small parts of link state handling that I have been
> > testing, mainly when creating a new device.
> 
> Can you please be more specific? I've just looked at macsec_newlink()
> and I didn't notice anything related to link state handling.

Yes, that's actually the problem. For example,
macvlan_common_newlink() does:

netif_stacked_transfer_operstate(lowerdev, dev);
linkwatch_fire_event(dev);

If you try to create a macsec device UP with its lowerdev UP:

ip link set $lower up
ip link add link $lower up type macsec


You'll get:

macsec0@$lower:  [...] state UNKNOWN [...]

and you want "state UP".


> > My version of the patch only does netif_stacked_transfer_operstate(),
> > and skips setting the device administratively down (ie, same as
> > NETDEV_CHANGE).
> 
> That makes sense. But, since you mentioned you had your own patch, does
> it make sense for me to continue working on mine?

Here's what I have (without a commit message, because I hadn't written
one yet -- yours looks pretty good, other than missing a "Fixes:" tag):

diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index 7de88b33d5b9..3532cdee2761 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -3308,6 +3308,9 @@ static int macsec_newlink(struct net *net, struct 
net_device *dev,
if (err < 0)
goto del_dev;
 
+   netif_stacked_transfer_operstate(real_dev, dev);
+   linkwatch_fire_event(dev);
+
macsec_generation++;
 
return 0;
@@ -3492,6 +3495,20 @@ static int macsec_notify(struct notifier_block *this, 
unsigned long event,
return NOTIFY_DONE;
 
switch (event) {
+   case NETDEV_DOWN:
+   case NETDEV_UP:
+   case NETDEV_CHANGE: {
+   struct macsec_dev *m, *n;
+   struct macsec_rxh_data *rxd;
+
+   rxd = macsec_data_rtnl(real_dev);
+   list_for_each_entry_safe(m, n, >secys, secys) {
+   struct net_device *dev = m->secy.netdev;
+
+   netif_stacked_transfer_operstate(real_dev, dev);
+   }
+   break;
+   }
case NETDEV_UNREGISTER: {
struct macsec_dev *m, *n;
struct macsec_rxh_data *rxd;



If you want to keep working on this, that's okay for me. I'm not hung
up on who gets authorship.


> > >   }
> > > + case NETDEV_UP:
> > > + list_for_each_entry(m, >secys, secys) {
> > > + struct net_device *dev = m->secy.netdev;
> > > + int flags = dev_get_flags(dev);
> > > +
> > > + if (!(flags & IFF_UP)) {
> > > + dev_change_flags(dev, flags | IFF_UP);
> > > + netif_stacked_transfer_operstate(real_dev, 
> > > dev);
> > > + }
> > > + }
> > > + break;
> >
> > I don't like that this completely ignores whether the device was done
> > independently of the lower link. Maybe the administrator actually
> > wants the macsec device down, regardless of state changes on the lower
> > device.
> 
> I thought about that too and I don't like it either. Perhaps the vlan
> approach of having a "loose binding" flag is worth considering. Then the
> admin can decice for themselves.

Do you have a use case where you'd want that? If that solves some
problem for you (a problem that can't be solved just by fixing the
current bug), then ok, we can consider it. I prefer to avoid adding
obscure options and more code unless they're necessary.

> However, I guess the problem disappears if only the link state is
> propagated. The only caveat to that is to not propagate an "up" link
> state while the macsec interface is administratively down, but
> "remember" to propagate it later if the macsec interface is
> administratively set to "up".
> 
> Thanks for the feedback!

And sorry for the delay in answering :/


-- 
Sabrina


Re: [PATCH 1/1] macsec: reflect the master interface state

2018-09-19 Thread Sabrina Dubroca
Hello,

2018-09-18, 20:16:12 -0400, Radu Rendec wrote:
> This patch makes macsec interfaces reflect the state of the underlying
> interface: if the master interface changes state to up/down, the macsec
> interface changes its state accordingly.

I got a separate report of the same issue and I've been looking in
that area too.

> This closes a loophole in the macsec interface state logic: the macsec
> interface cannot be brought up if the master interface is down (the
> operation is rejected with ENETDOWN); however, if the macsec interface
> is brought up while the master interface is up and then the master
> interface goes down, the macsec interface stays up.

Yes, that's a bit unfortunate. I was wondering if we should just allow
setting the device up, and let link state handle the rest.

> Reflecting the master interface state can also be useful if the macsec
> interface is used as a bridge port: if the master (physical) interface
> goes down, the state propagates through the macsec interface to the
> bridge module, which can react to the state change (for example if it
> runs STP).
> 
> The patch does nothing original. The same logic is implemented for vlan
> interfaces in vlan_device_event() in net/8021q/vlan.c. In fact, the code
> was copied and adapted from there.

It's missing small parts of link state handling that I have been
testing, mainly when creating a new device.

> Signed-off-by: Radu Rendec 
> ---
>  drivers/net/macsec.c | 57 +++-
>  1 file changed, 46 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
> index 7de88b33d5b9..cb93a1290f85 100644
> --- a/drivers/net/macsec.c
> +++ b/drivers/net/macsec.c
> @@ -3486,20 +3486,21 @@ static int macsec_notify(struct notifier_block *this, 
> unsigned long event,
>void *ptr)
>  {
>   struct net_device *real_dev = netdev_notifier_info_to_dev(ptr);
> - LIST_HEAD(head);
> + struct macsec_dev *m;
> + struct macsec_rxh_data *rxd;
>  
>   if (!is_macsec_master(real_dev))
>   return NOTIFY_DONE;
>  
> + rxd = macsec_data_rtnl(real_dev);
> +
>   switch (event) {
>   case NETDEV_UNREGISTER: {
> - struct macsec_dev *m, *n;
> - struct macsec_rxh_data *rxd;
> + struct macsec_dev *n;
> + LIST_HEAD(head);
>  
> - rxd = macsec_data_rtnl(real_dev);
> - list_for_each_entry_safe(m, n, >secys, secys) {
> + list_for_each_entry_safe(m, n, >secys, secys)
>   macsec_common_dellink(m->secy.netdev, );
> - }

Please don't mix coding style changes with bug fixes.

[...]
> + case NETDEV_DOWN: {
> + struct net_device *dev, *tmp;
> + LIST_HEAD(close_list);
> +
> + list_for_each_entry(m, >secys, secys) {
> + dev = m->secy.netdev;
> +
> + if (dev->flags & IFF_UP)
> + list_add(>close_list, _list);
> + }
> +
> + dev_close_many(_list, false);
> +
> + list_for_each_entry_safe(dev, tmp, _list, close_list) {
> + netif_stacked_transfer_operstate(real_dev, dev);
> + list_del_init(>close_list);
> + }
> + list_del(_list);
> + break;

My version of the patch only does netif_stacked_transfer_operstate(),
and skips setting the device administratively down (ie, same as
NETDEV_CHANGE).

>   }
> + case NETDEV_UP:
> + list_for_each_entry(m, >secys, secys) {
> + struct net_device *dev = m->secy.netdev;
> + int flags = dev_get_flags(dev);
> +
> + if (!(flags & IFF_UP)) {
> + dev_change_flags(dev, flags | IFF_UP);
> + netif_stacked_transfer_operstate(real_dev, dev);
> + }
> + }
> + break;

I don't like that this completely ignores whether the device was done
independently of the lower link. Maybe the administrator actually
wants the macsec device down, regardless of state changes on the lower
device.

I know it's consistent with what vlan is doing, but I'm not convinced
macsec should adopt this behavior.

>   }
>  
>   return NOTIFY_OK;
> -- 
> 2.17.1
> 

-- 
Sabrina


[PATCH net] selftests: pmtu: properly redirect stderr to /dev/null

2018-09-17 Thread Sabrina Dubroca
The cleanup function uses "$CMD 2 > /dev/null", which doesn't actually
send stderr to /dev/null, so when the netns doesn't exist, the error
message is shown. Use "2> /dev/null" instead, so that those messages
disappear, as was intended.

Fixes: d1f1b9cbf34c ("selftests: net: Introduce first PMTU test")
Signed-off-by: Sabrina Dubroca 
---
 tools/testing/selftests/net/pmtu.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index 32a194e3e07a..0ab9423d009f 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -178,8 +178,8 @@ setup() {
 
 cleanup() {
[ ${cleanup_done} -eq 1 ] && return
-   ip netns del ${NS_A} 2 > /dev/null
-   ip netns del ${NS_B} 2 > /dev/null
+   ip netns del ${NS_A} 2> /dev/null
+   ip netns del ${NS_B} 2> /dev/null
cleanup_done=1
 }
 
-- 
2.19.0



[PATCH net v2 2/3] tls: zero the crypto information from tls_context before freeing

2018-09-12 Thread Sabrina Dubroca
This contains key material in crypto_send_aes_gcm_128 and
crypto_recv_aes_gcm_128.

Introduce union tls_crypto_context, and replace the two identical
unions directly embedded in struct tls_context with it. We can then
use this union to clean up the memory in the new tls_ctx_free()
function.

Fixes: 3c4d7559159b ("tls: kernel TLS support")
Signed-off-by: Sabrina Dubroca 
---
v2: introduce union tls_crypto_context

 include/net/tls.h | 19 +--
 net/tls/tls_device.c  |  6 +++---
 net/tls/tls_device_fallback.c |  2 +-
 net/tls/tls_main.c| 20 +++-
 net/tls/tls_sw.c  |  8 
 5 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index d5c683e8bb22..0a769cf2f5f3 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -171,15 +171,14 @@ struct cipher_context {
char *rec_seq;
 };
 
+union tls_crypto_context {
+   struct tls_crypto_info info;
+   struct tls12_crypto_info_aes_gcm_128 aes_gcm_128;
+};
+
 struct tls_context {
-   union {
-   struct tls_crypto_info crypto_send;
-   struct tls12_crypto_info_aes_gcm_128 crypto_send_aes_gcm_128;
-   };
-   union {
-   struct tls_crypto_info crypto_recv;
-   struct tls12_crypto_info_aes_gcm_128 crypto_recv_aes_gcm_128;
-   };
+   union tls_crypto_context crypto_send;
+   union tls_crypto_context crypto_recv;
 
struct list_head list;
struct net_device *netdev;
@@ -367,8 +366,8 @@ static inline void tls_fill_prepend(struct tls_context *ctx,
 * size KTLS_DTLS_HEADER_SIZE + KTLS_DTLS_NONCE_EXPLICIT_SIZE
 */
buf[0] = record_type;
-   buf[1] = TLS_VERSION_MINOR(ctx->crypto_send.version);
-   buf[2] = TLS_VERSION_MAJOR(ctx->crypto_send.version);
+   buf[1] = TLS_VERSION_MINOR(ctx->crypto_send.info.version);
+   buf[2] = TLS_VERSION_MAJOR(ctx->crypto_send.info.version);
/* we can use IV for nonce explicit according to spec */
buf[3] = pkt_len >> 8;
buf[4] = pkt_len & 0xFF;
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 292742e50bfa..961b07d4d41c 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -686,7 +686,7 @@ int tls_set_device_offload(struct sock *sk, struct 
tls_context *ctx)
goto free_marker_record;
}
 
-   crypto_info = >crypto_send;
+   crypto_info = >crypto_send.info;
switch (crypto_info->cipher_type) {
case TLS_CIPHER_AES_GCM_128:
nonce_size = TLS_CIPHER_AES_GCM_128_IV_SIZE;
@@ -780,7 +780,7 @@ int tls_set_device_offload(struct sock *sk, struct 
tls_context *ctx)
 
ctx->priv_ctx_tx = offload_ctx;
rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_TX,
->crypto_send,
+>crypto_send.info,
 tcp_sk(sk)->write_seq);
if (rc)
goto release_netdev;
@@ -862,7 +862,7 @@ int tls_set_device_offload_rx(struct sock *sk, struct 
tls_context *ctx)
goto release_ctx;
 
rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, TLS_OFFLOAD_CTX_DIR_RX,
->crypto_recv,
+>crypto_recv.info,
 tcp_sk(sk)->copied_seq);
if (rc) {
pr_err_ratelimited("%s: The netdev has refused to offload this 
socket\n",
diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
index 6102169239d1..450a6dbc5a88 100644
--- a/net/tls/tls_device_fallback.c
+++ b/net/tls/tls_device_fallback.c
@@ -320,7 +320,7 @@ static struct sk_buff *tls_enc_skb(struct tls_context 
*tls_ctx,
goto free_req;
 
iv = buf;
-   memcpy(iv, tls_ctx->crypto_send_aes_gcm_128.salt,
+   memcpy(iv, tls_ctx->crypto_send.aes_gcm_128.salt,
   TLS_CIPHER_AES_GCM_128_SALT_SIZE);
aad = buf + TLS_CIPHER_AES_GCM_128_SALT_SIZE +
  TLS_CIPHER_AES_GCM_128_IV_SIZE;
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 180b6640e531..737b3865be1b 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -241,6 +241,16 @@ static void tls_write_space(struct sock *sk)
ctx->sk_write_space(sk);
 }
 
+static void tls_ctx_free(struct tls_context *ctx)
+{
+   if (!ctx)
+   return;
+
+   memzero_explicit(>crypto_send, sizeof(ctx->crypto_send));
+   memzero_explicit(>crypto_recv, sizeof(ctx->crypto_recv));
+   kfree(ctx);
+}
+
 static void tls_sk_proto_close(struct sock *sk, long timeout)
 {
struct tls_context *ctx = tls_get_ctx(sk);
@@ -294,7 +304,7 @@ static void tls_sk_proto_close(struct sock 

[PATCH net v2 1/3] tls: don't copy the key out of tls12_crypto_info_aes_gcm_128

2018-09-12 Thread Sabrina Dubroca
There's no need to copy the key to an on-stack buffer before calling
crypto_aead_setkey().

Fixes: 3c4d7559159b ("tls: kernel TLS support")
Signed-off-by: Sabrina Dubroca 
---
 net/tls/tls_sw.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index e28a6ff25d96..f29b7c49cbf2 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1136,7 +1136,6 @@ void tls_sw_free_resources_rx(struct sock *sk)
 
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 {
-   char keyval[TLS_CIPHER_AES_GCM_128_KEY_SIZE];
struct tls_crypto_info *crypto_info;
struct tls12_crypto_info_aes_gcm_128 *gcm_128_info;
struct tls_sw_context_tx *sw_ctx_tx = NULL;
@@ -1265,9 +1264,7 @@ int tls_set_sw_offload(struct sock *sk, struct 
tls_context *ctx, int tx)
 
ctx->push_pending_record = tls_sw_push_pending_record;
 
-   memcpy(keyval, gcm_128_info->key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
-
-   rc = crypto_aead_setkey(*aead, keyval,
+   rc = crypto_aead_setkey(*aead, gcm_128_info->key,
TLS_CIPHER_AES_GCM_128_KEY_SIZE);
if (rc)
goto free_aead;
-- 
2.18.0



[PATCH net v2 3/3] tls: clear key material from kernel memory when do_tls_setsockopt_conf fails

2018-09-12 Thread Sabrina Dubroca
Fixes: 3c4d7559159b ("tls: kernel TLS support")
Signed-off-by: Sabrina Dubroca 
Signed-off-by: Sabrina Dubroca 
---
v2: use the new union tls_crypto_context

 net/tls/tls_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 737b3865be1b..523622dc74f8 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -509,7 +509,7 @@ static int do_tls_setsockopt_conf(struct sock *sk, char 
__user *optval,
goto out;
 
 err_crypto_info:
-   memset(crypto_info, 0, sizeof(*crypto_info));
+   memzero_explicit(crypto_info, sizeof(union tls_crypto_context));
 out:
return rc;
 }
-- 
2.18.0



[PATCH net v2 0/3] tls: don't leave keys in kernel memory

2018-09-12 Thread Sabrina Dubroca
There are a few places where the RX/TX key for a TLS socket is copied
to kernel memory. This series clears those memory areas when they're no
longer needed.

v2: add union tls_crypto_context, following Vakul Garg's comment
swap patch 2 and 3, using new union in patch 3

Sabrina Dubroca (3):
  tls: don't copy the key out of tls12_crypto_info_aes_gcm_128
  tls: zero the crypto information from tls_context before freeing
  tls: clear key material from kernel memory when do_tls_setsockopt_conf
fails

 include/net/tls.h | 19 +--
 net/tls/tls_device.c  |  6 +++---
 net/tls/tls_device_fallback.c |  2 +-
 net/tls/tls_main.c| 22 --
 net/tls/tls_sw.c  | 13 +
 5 files changed, 34 insertions(+), 28 deletions(-)

-- 
2.18.0



Re: [PATCH net-next v1] net/tls: Set count of SG entries if sk_alloc_sg returns -ENOSPC

2018-09-06 Thread Sabrina Dubroca
2018-09-05, 21:57:43 +0530, Vakul Garg wrote:
> tls_sw_sendmsg() allocates plaintext and encrypted SG entries using
> function sk_alloc_sg(). In case the number of SG entries hit
> MAX_SKB_FRAGS, sk_alloc_sg() returns -ENOSPC and sets the variable for
> current SG index to '0'. This leads to calling of function
> tls_push_record() with 'sg_encrypted_num_elem = 0' and later causes
> kernel crash. To fix this, set the number of SG elements to the number
> of elements in plaintext/encrypted SG arrays in case sk_alloc_sg()
> returns -ENOSPC.
> 
> Signed-off-by: Vakul Garg 

What commit introduced this bug? Can you add Fixes: tag? And if it's a
bug present in the net tree, the fix should go into the net tree as
well.


Another thing I've noticed a few times with your patch submissions:
the clock on the system you're using for git-send-email seems to be
set something like 5 hours and 18 minutes in the future. Could you fix
it?  It makes your email appear out of order.

Date: Wed,  5 Sep 2018 21:57:43 +0530

Received: from lti.ap.freescale.net (14.143.30.134) by
AM0PR04MB4243.eurprd04.prod.outlook.com (2603:10a6:208:66::29) with 
Microsoft
SMTP Server (version=TLS1_2, 
cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
15.20.1101.17; Wed, 5 Sep 2018 11:09:20 +

-- 
Sabrina


Re: [PATCH net 2/3] tls: clear key material from kernel memory when do_tls_setsockopt_conf fails

2018-09-06 Thread Sabrina Dubroca
2018-09-05, 16:53:54 +0300, Boris Pismenny wrote:
> Hi Sabrina,
> 
> On 9/5/2018 4:21 PM, Sabrina Dubroca wrote:
> > Fixes: 3c4d7559159b ("tls: kernel TLS support")
> > Signed-off-by: Sabrina Dubroca 
> > ---
> >   net/tls/tls_main.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> > index 180b6640e531..0d432d025471 100644
> > --- a/net/tls/tls_main.c
> > +++ b/net/tls/tls_main.c
> > @@ -499,7 +499,7 @@ static int do_tls_setsockopt_conf(struct sock *sk, char 
> > __user *optval,
> > goto out;
> >   err_crypto_info:
> > -   memset(crypto_info, 0, sizeof(*crypto_info));
> > +   memzero_explicit(crypto_info, sizeof(struct 
> > tls12_crypto_info_aes_gcm_128));
> 
> Besides the key, there are other (not secret) information in
> tls12_crypto_info_aes_gcm_128. I'd prefer you do not delete it to enable
> users to obtain it (using getsockopt) in case we decide to implement a
> fallback to userspace in the future. Such a fallback must obtain the
> kernel's iv, and record sequence number.

The operation failed. There's no reason for this stale data to remain
in the kernel. And you couldn't recover it with getsockopt, because
you'll hit !TLS_CRYPTO_INFO_READY, since we're already resetting
tls_crypto_info. Resetting tls_crypto_info on failure is necessary,
otherwise your socket will be in a broken state, with TLS not setup
but unable to perform a new attempt.

Userspace already has all this information anyway, since it passed it
to the kernel, so why would it need to recover it from the kernel?

-- 
Sabrina


Re: [PATCH net 3/3] tls: zero the crypto information from tls_context before freeing

2018-09-06 Thread Sabrina Dubroca
2018-09-05, 13:48:48 +, Vakul Garg wrote:
> 
> 
> > -Original Message-
> > From: netdev-ow...@vger.kernel.org  On
> > Behalf Of Sabrina Dubroca
> > Sent: Wednesday, September 5, 2018 6:52 PM
> > To: netdev@vger.kernel.org
> > Cc: Sabrina Dubroca ; Boris Pismenny
> > ; Ilya Lesokhin ; Aviad
> > Yehezkel ; Dave Watson 
> > Subject: [PATCH net 3/3] tls: zero the crypto information from tls_context
> > before freeing
> > 
> > This contains key material in crypto_send_aes_gcm_128 and
> > crypto_recv_aes_gcm_128.
> > 
> > Fixes: 3c4d7559159b ("tls: kernel TLS support")
> > Signed-off-by: Sabrina Dubroca 
> > ---
> >  include/net/tls.h  |  1 +
> >  net/tls/tls_main.c | 14 --
> >  2 files changed, 13 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/net/tls.h b/include/net/tls.h index
> > d5c683e8bb22..2010d23112f9 100644
> > --- a/include/net/tls.h
> > +++ b/include/net/tls.h
> > @@ -180,6 +180,7 @@ struct tls_context {
> > struct tls_crypto_info crypto_recv;
> > struct tls12_crypto_info_aes_gcm_128
> > crypto_recv_aes_gcm_128;
> > };
> > +   char tls_crypto_ctx_end[0];
> 
> This makes the key zeroization dependent upon the position of unions
> in structure.

Yes. I could add char tls_crypto_ctx_start[0].

> Can you zeroise the crypto_send, crypto_recv separately using two
> memzero_explicit calls?

It's not crypto_send, it's crypto_send_aes_gcm_128. I don't know how
likely it is that another crypto algorithm will ever be added, but
then you'd have to switch to zeroing that thing.

I had a different version of the patch that gave a name to those
unions, but then I need to change all the references everywhere and
the patch becomes a bit ugly, especially for net.

struct tls_context {
union {
struct tls_crypto_info info;
struct tls12_crypto_info_aes_gcm_128 aes_gcm_128;
} crypto_send;
union {
struct tls_crypto_info info;
struct tls12_crypto_info_aes_gcm_128 aes_gcm_128;
} crypto_recv;

[...]
}


Or even:

union tls_crypto_context {
struct tls_crypto_info info;
struct tls12_crypto_info_aes_gcm_128 aes_gcm_128;
};

struct tls_context {
union tls_crypto_context crypto_send;
union tls_crypto_context crypto_recv;

[...]
}


-- 
Sabrina


[PATCH net 1/3] tls: don't copy the key out of tls12_crypto_info_aes_gcm_128

2018-09-05 Thread Sabrina Dubroca
There's no need to copy the key to an on-stack buffer before calling
crypto_aead_setkey().

Fixes: 3c4d7559159b ("tls: kernel TLS support")
Signed-off-by: Sabrina Dubroca 
---
 net/tls/tls_sw.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 52fbe727d7c1..268053b04563 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1130,7 +1130,6 @@ void tls_sw_free_resources_rx(struct sock *sk)
 
 int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 {
-   char keyval[TLS_CIPHER_AES_GCM_128_KEY_SIZE];
struct tls_crypto_info *crypto_info;
struct tls12_crypto_info_aes_gcm_128 *gcm_128_info;
struct tls_sw_context_tx *sw_ctx_tx = NULL;
@@ -1259,9 +1258,7 @@ int tls_set_sw_offload(struct sock *sk, struct 
tls_context *ctx, int tx)
 
ctx->push_pending_record = tls_sw_push_pending_record;
 
-   memcpy(keyval, gcm_128_info->key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
-
-   rc = crypto_aead_setkey(*aead, keyval,
+   rc = crypto_aead_setkey(*aead, gcm_128_info->key,
TLS_CIPHER_AES_GCM_128_KEY_SIZE);
if (rc)
goto free_aead;
-- 
2.18.0



[PATCH net 3/3] tls: zero the crypto information from tls_context before freeing

2018-09-05 Thread Sabrina Dubroca
This contains key material in crypto_send_aes_gcm_128 and
crypto_recv_aes_gcm_128.

Fixes: 3c4d7559159b ("tls: kernel TLS support")
Signed-off-by: Sabrina Dubroca 
---
 include/net/tls.h  |  1 +
 net/tls/tls_main.c | 14 --
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index d5c683e8bb22..2010d23112f9 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -180,6 +180,7 @@ struct tls_context {
struct tls_crypto_info crypto_recv;
struct tls12_crypto_info_aes_gcm_128 crypto_recv_aes_gcm_128;
};
+   char tls_crypto_ctx_end[0];
 
struct list_head list;
struct net_device *netdev;
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 0d432d025471..d3a57c0b2182 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -241,6 +241,16 @@ static void tls_write_space(struct sock *sk)
ctx->sk_write_space(sk);
 }
 
+static void tls_ctx_free(struct tls_context *ctx)
+{
+   if (!ctx)
+   return;
+
+   memzero_explicit(>crypto_send,
+offsetof(struct tls_context, tls_crypto_ctx_end));
+   kfree(ctx);
+}
+
 static void tls_sk_proto_close(struct sock *sk, long timeout)
 {
struct tls_context *ctx = tls_get_ctx(sk);
@@ -294,7 +304,7 @@ static void tls_sk_proto_close(struct sock *sk, long 
timeout)
 #else
{
 #endif
-   kfree(ctx);
+   tls_ctx_free(ctx);
ctx = NULL;
}
 
@@ -305,7 +315,7 @@ static void tls_sk_proto_close(struct sock *sk, long 
timeout)
 * for sk->sk_prot->unhash [tls_hw_unhash]
 */
if (free_ctx)
-   kfree(ctx);
+   tls_ctx_free(ctx);
 }
 
 static int do_tls_getsockopt_tx(struct sock *sk, char __user *optval,
-- 
2.18.0



[PATCH net 2/3] tls: clear key material from kernel memory when do_tls_setsockopt_conf fails

2018-09-05 Thread Sabrina Dubroca
Fixes: 3c4d7559159b ("tls: kernel TLS support")
Signed-off-by: Sabrina Dubroca 
---
 net/tls/tls_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 180b6640e531..0d432d025471 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -499,7 +499,7 @@ static int do_tls_setsockopt_conf(struct sock *sk, char 
__user *optval,
goto out;
 
 err_crypto_info:
-   memset(crypto_info, 0, sizeof(*crypto_info));
+   memzero_explicit(crypto_info, sizeof(struct 
tls12_crypto_info_aes_gcm_128));
 out:
return rc;
 }
-- 
2.18.0



[PATCH net 0/3] tls: don't leave keys in kernel memory

2018-09-05 Thread Sabrina Dubroca
There are a few places where the RX/TX key for a TLS socket is copied
to kernel memory. This series clears those memory areas when they're no
longer needed.

Sabrina Dubroca (3):
  tls: don't copy the key out of tls12_crypto_info_aes_gcm_128
  tls: clear key material from kernel memory when do_tls_setsockopt_conf
fails
  tls: zero the crypto information from tls_context before freeing

 include/net/tls.h  |  1 +
 net/tls/tls_main.c | 16 +---
 net/tls/tls_sw.c   |  5 +
 3 files changed, 15 insertions(+), 7 deletions(-)

-- 
2.18.0


[PATCH net 2/2] selftests: pmtu: detect correct binary to ping ipv6 addresses

2018-08-30 Thread Sabrina Dubroca
Some systems don't have the ping6 binary anymore, and use ping for
everything. Detect the absence of ping6 and try to use ping instead.

Fixes: d1f1b9cbf34c ("selftests: net: Introduce first PMTU test")
Signed-off-by: Sabrina Dubroca 
Acked-by: Stefano Brivio 
---
 tools/testing/selftests/net/pmtu.sh | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index 0ecf2609b9a4..cc2798a0a2d7 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -46,6 +46,9 @@
 # Kselftest framework requirement - SKIP code is 4.
 ksft_skip=4
 
+# Some systems don't have a ping6 binary anymore
+which ping6 > /dev/null 2>&1 && ping6=$(which ping6) || ping6=$(which ping)
+
 tests="
pmtu_vti6_exception vti6: PMTU exceptions
pmtu_vti4_exception vti4: PMTU exceptions
@@ -274,7 +277,7 @@ test_pmtu_vti6_exception() {
mtu "${ns_b}" veth_b 4000
mtu "${ns_a}" vti6_a 5000
mtu "${ns_b}" vti6_b 5000
-   ${ns_a} ping6 -q -i 0.1 -w 2 -s 6 ${vti6_b_addr} > /dev/null
+   ${ns_a} ${ping6} -q -i 0.1 -w 2 -s 6 ${vti6_b_addr} > /dev/null
 
# Check that exception was created
if [ "$(route_get_dst_pmtu_from_exception "${ns_a}" ${vti6_b_addr})" = 
"" ]; then
-- 
2.18.0



[PATCH net 1/2] selftests: pmtu: maximum MTU for vti4 is 2^16-1-20

2018-08-30 Thread Sabrina Dubroca
Since commit 82612de1c98e ("ip_tunnel: restore binding to ifaces with a
large mtu"), the maximum MTU for vti4 is based on IP_MAX_MTU instead of
the mysterious constant 0xFFF8.  This makes this selftest fail.

Fixes: 82612de1c98e ("ip_tunnel: restore binding to ifaces with a large mtu")
Signed-off-by: Sabrina Dubroca 
Acked-by: Stefano Brivio 
---
 tools/testing/selftests/net/pmtu.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/pmtu.sh 
b/tools/testing/selftests/net/pmtu.sh
index f8cc38afffa2..0ecf2609b9a4 100755
--- a/tools/testing/selftests/net/pmtu.sh
+++ b/tools/testing/selftests/net/pmtu.sh
@@ -334,7 +334,7 @@ test_pmtu_vti4_link_add_mtu() {
fail=0
 
min=68
-   max=$((65528 - 20))
+   max=$((65535 - 20))
# Check invalid values first
for v in $((min - 1)) $((max + 1)); do
${ns_a} ip link add vti4_a mtu ${v} type vti local 
${veth4_a_addr} remote ${veth4_b_addr} key 10 2>/dev/null
-- 
2.18.0



Re: [PATCH 1/2] xfrm6: call kfree_skb when skb is toobig

2018-08-30 Thread Sabrina Dubroca
2018-08-30, 09:58:16 -0300, Thadeu Lima de Souza Cascardo wrote:
> After commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching
> and reporting on xmit"), some too big skbs might be potentially passed down to
> __xfrm6_output, causing it to fail to transmit but not free the skb, causing a
> leak of skb, and consequentially a leak of dst references.
> 
> After running pmtu.sh, that shows as failure to unregister devices in a 
> namespace:
> 
> [  311.397671] unregister_netdevice: waiting for veth_b to become free. Usage 
> count = 1
> 
> The fix is to call kfree_skb in case of transmit failures.
> 
> Signed-off-by: Thadeu Lima de Souza Cascardo 

Reviewed-by: Sabrina Dubroca 

I was about to post the same patch. Arguably, the commit introducing
this bug is the one that added those "return -EMSGSIZE" to
__xfrm6_output without freeing.

Either way, it's missing a Fixes: tag, which should be one of those,
or both:

Fixes: d6990976af7c ("vti6: fix PMTU caching and reporting on xmit")
Fixes: dd767856a36e ("xfrm6: Don't call icmpv6_send on local error")

-- 
Sabrina


[PATCH net 2/3] ipv6: fix cleanup ordering for pingv6 registration

2018-08-28 Thread Sabrina Dubroca
Commit 6d0bfe226116 ("net: ipv6: Add IPv6 support to the ping socket.")
contains an error in the cleanup path of inet6_init(): when
proto_register(_prot, 1) fails, we try to unregister
_prot. When rawv6_init() fails, we skip unregistering
_prot.

Example of panic (triggered by faking a failure of
 proto_register(_prot, 1)):

general protection fault:  [#1] PREEMPT SMP KASAN PTI
[...]
RIP: 0010:__list_del_entry_valid+0x79/0x160
[...]
Call Trace:
 proto_unregister+0xbb/0x550
 ? trace_preempt_on+0x6f0/0x6f0
 ? sock_no_shutdown+0x10/0x10
 inet6_init+0x153/0x1b8

Fixes: 6d0bfe226116 ("net: ipv6: Add IPv6 support to the ping socket.")
Signed-off-by: Sabrina Dubroca 
---
 net/ipv6/af_inet6.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index e5da133c6932..9a4261e50272 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -938,14 +938,14 @@ static int __init inet6_init(void)
 
err = proto_register(_prot, 1);
if (err)
-   goto out_unregister_ping_proto;
+   goto out_unregister_raw_proto;
 
/* We MUST register RAW sockets before we create the ICMP6,
 * IGMP6, or NDISC control sockets.
 */
err = rawv6_init();
if (err)
-   goto out_unregister_raw_proto;
+   goto out_unregister_ping_proto;
 
/* Register the family here so that the init calls below will
 * be able to create sockets. (?? is this dangerous ??)
-- 
2.18.0



[PATCH net 3/3] net: rtnl: return early from rtnl_unregister_all when protocol isn't registered

2018-08-28 Thread Sabrina Dubroca
rtnl_unregister_all(PF_INET6) gets called from inet6_init in cases when
no handler has been registered for PF_INET6 yet, for example if
ip6_mr_init() fails. Abort and avoid a NULL pointer deref in that case.

Example of panic (triggered by faking a failure of
 register_pernet_subsys):

general protection fault:  [#1] PREEMPT SMP KASAN PTI
[...]
RIP: 0010:rtnl_unregister_all+0x17e/0x2a0
[...]
Call Trace:
 ? rtnetlink_net_init+0x250/0x250
 ? sock_unregister+0x103/0x160
 ? kernel_getsockopt+0x200/0x200
 inet6_init+0x197/0x20d

Fixes: e2fddf5e96df ("[IPV6]: Make af_inet6 to check ip6_route_init return 
value.")
Signed-off-by: Sabrina Dubroca 
---
 net/core/rtnetlink.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 24431e578310..60c928894a78 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -324,6 +324,10 @@ void rtnl_unregister_all(int protocol)
 
rtnl_lock();
tab = rtnl_msg_handlers[protocol];
+   if (!tab) {
+   rtnl_unlock();
+   return;
+   }
RCU_INIT_POINTER(rtnl_msg_handlers[protocol], NULL);
for (msgindex = 0; msgindex < RTM_NR_MSGTYPES; msgindex++) {
link = tab[msgindex];
-- 
2.18.0



[PATCH net 1/3] ipv6: fix cleanup ordering for ip6_mr failure

2018-08-28 Thread Sabrina Dubroca
Commit 15e668070a64 ("ipv6: reorder icmpv6_init() and ip6_mr_init()")
moved the cleanup label for ipmr_fail, but should have changed the
contents of the cleanup labels as well. Now we can end up cleaning up
icmpv6 even though it hasn't been initialized (jump to icmp_fail or
ipmr_fail).

Simply undo things in the reverse order of their initialization.

Example of panic (triggered by faking a failure of icmpv6_init):

kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] PREEMPT SMP KASAN PTI
[...]
RIP: 0010:__list_del_entry_valid+0x79/0x160
[...]
Call Trace:
 ? lock_release+0x8a0/0x8a0
 unregister_pernet_operations+0xd4/0x560
 ? ops_free_list+0x480/0x480
 ? down_write+0x91/0x130
 ? unregister_pernet_subsys+0x15/0x30
 ? down_read+0x1b0/0x1b0
 ? up_read+0x110/0x110
 ? kmem_cache_create_usercopy+0x1b4/0x240
 unregister_pernet_subsys+0x1d/0x30
 icmpv6_cleanup+0x1d/0x30
 inet6_init+0x1b5/0x23f

Fixes: 15e668070a64 ("ipv6: reorder icmpv6_init() and ip6_mr_init()")
Signed-off-by: Sabrina Dubroca 
---
 net/ipv6/af_inet6.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 673bba31eb18..e5da133c6932 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1113,11 +1113,11 @@ static int __init inet6_init(void)
 igmp_fail:
ndisc_cleanup();
 ndisc_fail:
-   ip6_mr_cleanup();
+   icmpv6_cleanup();
 icmp_fail:
-   unregister_pernet_subsys(_net_ops);
+   ip6_mr_cleanup();
 ipmr_fail:
-   icmpv6_cleanup();
+   unregister_pernet_subsys(_net_ops);
 register_pernet_fail:
sock_unregister(PF_INET6);
rtnl_unregister_all(PF_INET6);
-- 
2.18.0



[PATCH net 0/3] ipv6: fix error path of inet6_init()

2018-08-28 Thread Sabrina Dubroca
The error path of inet6_init() can trigger multiple kernel panics,
mostly due to wrong ordering of cleanups. This series fixes those
issues.

Sabrina Dubroca (3):
  ipv6: fix cleanup ordering for ip6_mr failure
  ipv6: fix cleanup ordering for pingv6 registration
  net: rtnl: return early from rtnl_unregister_all when protocol isn't
registered

 net/core/rtnetlink.c |  4 
 net/ipv6/af_inet6.c  | 10 +-
 2 files changed, 9 insertions(+), 5 deletions(-)

-- 
2.18.0



Re: I found a strange place while reading “net/ipv6/reassembly.c”

2018-08-15 Thread Sabrina Dubroca
2018-08-15, 04:38:29 +, Ttttabcd wrote:
> Hello everyone who develops the kernel.
> 
> At the beginning I was looking for the source author, but his email
> address has expired, so I can only come here to ask questions.
> 
> The problem is in the /net/ipv6/reassembly.c file, the author is
> Pedro Roque.
> 
> I found some strange places when I read the code for this file
> (Linux Kernel version 4.18).
> 
> In the "/net/ipv6/reassembly.c"
> 
> In the function "ip6_frag_queue"
> 
>   offset = ntohs(fhdr->frag_off) & ~0x7;
>   end = offset + (ntohs(ipv6_hdr(skb)->payload_len) -
>   ((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
> 
>   if ((unsigned int)end > IPV6_MAXPLEN) {
>   *prob_offset = (u8 *)>frag_off - skb_network_header(skb);
>   return -1;
>   }
> 
> Here the length of the payload is judged.

This check is based on the fragment currently being processed, and
only considers the reassembled payload.

> And in the function "ip6_frag_reasm"
> 

Re-adding the comment that comes just above this:

/* Unfragmented part is taken from the first segment. */
>   payload_len = ((head->data - skb_network_header(head)) -
>  sizeof(struct ipv6hdr) + fq->q.len -
>  sizeof(struct frag_hdr));
>   if (payload_len > IPV6_MAXPLEN)
>   goto out_oversize;

This considers the reassembled payload (ie, as above) *and* any
extension headers that might come before it.


>   ..
>   out_oversize:
>   net_dbg_ratelimited("ip6_frag_reasm: payload len = %d\n", 
> payload_len);
>   goto out_fail;
> 
> Here also judges the length of the payload.
> 
> Judged the payload length twice.
> 
> I tested that the code in the label "out_oversize:" does not execute
> at all, because it has been returned in "ip6_frag_queue".

If you try again with a routing header, I think you'll see it trigger.

> Unless I comment out the code that judge the payload length in the
> function "ip6_frag_queue", the code labeled "out_oversize:" can be
> executed.
> 
> So, is this repeated?

-- 
Sabrina


[PATCH net] net/ipv6: fix metrics leak

2018-07-30 Thread Sabrina Dubroca
Since commit d4ead6b34b67 ("net/ipv6: move metrics from dst to
rt6_info"), ipv6 metrics are shared and refcounted. rt6_set_from()
assigns the rt->from pointer and increases the refcount on from's
metrics. This reference is never released.

Introduce the fib6_metrics_release() helper and use it to release the
metrics.

Fixes: d4ead6b34b67 ("net/ipv6: move metrics from dst to rt6_info")
Signed-off-by: Sabrina Dubroca 
---
 net/ipv6/ip6_fib.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index d212738e9d10..211a2d437b56 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -167,11 +167,22 @@ struct fib6_info *fib6_info_alloc(gfp_t gfp_flags)
return f6i;
 }
 
+static void fib6_metrics_release(struct fib6_info *f6i)
+{
+   struct dst_metrics *m;
+
+   if (!f6i)
+   return;
+
+   m = f6i->fib6_metrics;
+   if (m != _default_metrics && refcount_dec_and_test(>refcnt))
+   kfree(m);
+}
+
 void fib6_info_destroy_rcu(struct rcu_head *head)
 {
struct fib6_info *f6i = container_of(head, struct fib6_info, rcu);
struct rt6_exception_bucket *bucket;
-   struct dst_metrics *m;
 
WARN_ON(f6i->fib6_node);
 
@@ -201,9 +212,7 @@ void fib6_info_destroy_rcu(struct rcu_head *head)
if (f6i->fib6_nh.nh_dev)
dev_put(f6i->fib6_nh.nh_dev);
 
-   m = f6i->fib6_metrics;
-   if (m != _default_metrics && refcount_dec_and_test(>refcnt))
-   kfree(m);
+   fib6_metrics_release(f6i);
 
kfree(f6i);
 }
@@ -887,6 +896,7 @@ static void fib6_drop_pcpu_from(struct fib6_info *f6i,
 
from = rcu_dereference_protected(pcpu_rt->from,
 lockdep_is_held(>tb6_lock));
+   fib6_metrics_release(from);
rcu_assign_pointer(pcpu_rt->from, NULL);
fib6_info_release(from);
}
-- 
2.18.0



[PATCH net] ipv6: make DAD fail with enhanced DAD when nonce length differs

2018-07-13 Thread Sabrina Dubroca
Commit adc176c54722 ("ipv6 addrconf: Implemented enhanced DAD (RFC7527)")
added enhanced DAD with a nonce length of 6 bytes. However, RFC7527
doesn't specify the length of the nonce, other than being 6 + 8*k bytes,
with integer k >= 0 (RFC3971 5.3.2). The current implementation simply
assumes that the nonce will always be 6 bytes, but others systems are
free to choose different sizes.

If another system sends a nonce of different length but with the same 6
bytes prefix, it shouldn't be considered as the same nonce. Thus, check
that the length of the received nonce is the same as the length we sent.

Ugly scapy test script running on veth0:

def loop():
pkt=sniff(iface="veth0", filter="icmp6", count=1)
pkt = pkt[0]
b = bytearray(pkt[Raw].load)
b[1] += 1
b += b'\xde\xad\xbe\xef\xde\xad\xbe\xef'
pkt[Raw].load = bytes(b)
pkt[IPv6].plen += 8
# fixup checksum after modifying the payload
pkt[IPv6].payload.cksum -= 0x3b44
if pkt[IPv6].payload.cksum < 0:
pkt[IPv6].payload.cksum += 0x
sendp(pkt, iface="veth0")

This should result in DAD failure for any address added to veth0's peer,
but is currently ignored.

Fixes: adc176c54722 ("ipv6 addrconf: Implemented enhanced DAD (RFC7527)")
Signed-off-by: Sabrina Dubroca 
Reviewed-by: Stefano Brivio 
---
 net/ipv6/ndisc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index e640d2f3c55c..0ec273997d1d 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -811,7 +811,7 @@ static void ndisc_recv_ns(struct sk_buff *skb)
return;
}
}
-   if (ndopts.nd_opts_nonce)
+   if (ndopts.nd_opts_nonce && ndopts.nd_opts_nonce->nd_opt_len == 1)
memcpy(, (u8 *)(ndopts.nd_opts_nonce + 1), 6);
 
inc = ipv6_addr_is_multicast(daddr);
-- 
2.18.0



Re: [PATCH net] skbuff: Unconditionally copy pfmemalloc in __skb_clone()

2018-07-13 Thread Sabrina Dubroca
2018-07-13, 13:21:07 +0200, Stefano Brivio wrote:
> Commit 8b7008620b84 ("net: Don't copy pfmemalloc flag in
> __copy_skb_header()") introduced a different handling for the
> pfmemalloc flag in copy and clone paths.
> 
> In __skb_clone(), now, the flag is set only if it was set in the
> original skb, but not cleared if it wasn't. This is wrong and
> might lead to socket buffers being flagged with pfmemalloc even
> if the skb data wasn't allocated from pfmemalloc reserves. Copy
> the flag instead of ORing it.
> 
> Reported-by: Sabrina Dubroca 
> Fixes: 8b7008620b84 ("net: Don't copy pfmemalloc flag in __copy_skb_header()")
> Signed-off-by: Stefano Brivio 

Thanks,

Tested-by: Sabrina Dubroca 

-- 
Sabrina


Re: [PATCH net] ipv4: reset fnhe_mtu_locked after cache route flushed

2018-07-10 Thread Sabrina Dubroca
2018-05-10, 15:43:11 -0400, David Miller wrote:
> From: Hangbin Liu 
> Date: Wed,  9 May 2018 18:06:44 +0800
> 
> > After route cache is flushed via ipv4_sysctl_rtcache_flush(), we forget
> > to reset fnhe_mtu_locked in rt_bind_exception(). When pmtu is updated
> > in __ip_rt_update_pmtu(), it will return directly since the pmtu is
> > still locked. e.g.
> > 
> > + ip netns exec client ping 10.10.1.1 -c 1 -s 1400 -M do
> > PING 10.10.1.1 (10.10.1.1) 1400(1428) bytes of data.
> > From 10.10.0.254 icmp_seq=1 Frag needed and DF set (mtu = 0)
> > 
> > --- 10.10.1.1 ping statistics ---
> > 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
> > 
> > + ip netns exec client ip route get 10.10.1.1
> > 10.10.1.1 via 10.10.0.254 dev veth0_c src 10.10.0.1 uid 0
> > cache expires 599sec mtu lock 552
> > + ip netns exec client ip route flush cache
> > + ip netns exec client ip route get 10.10.1.1
> > 10.10.1.1 via 10.10.0.254 dev veth0_c src 10.10.0.1 uid 0
> > cache
> > + ip netns exec client ping 10.10.1.1 -c 1 -s 1400 -M do
> > PING 10.10.1.1 (10.10.1.1) 1400(1428) bytes of data.
> > ping: local error: Message too long, mtu=576
> > 
> > --- 10.10.1.1 ping statistics ---
> > 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
> > 
> > + ip netns exec client ip route get 10.10.1.1
> > 10.10.1.1 via 10.10.0.254 dev veth0_c src 10.10.0.1 uid 0
> > cache
> > 
> > Fixes: d52e5a7e7ca49 ("ipv4: lock mtu in fnhe when received PMTU < 
> > net.ipv4.route.min_pmtu")
> > Reported-by: Jianlin Shi 
> > Reviewed-by: Stefano Brivio 
> > Signed-off-by: Hangbin Liu 
> 
> Applied.

meh, I know it's way too late, but I just noticed this patch lost most
of its commit message, and all of its tags, when it went into git :(
Patchwork actually thinks the format is ok (it counts one Fixes and
one Reviewed-by), but git cut all that out.

Would there be a way to detect this when you apply patches? "there's
no tag above the first '^---', but there are some tags below it",
something like that?

commit 0e8411e426e277f55bd21e287ec89fab6f8eacae
Author: Hangbin Liu 
Date:   Wed May 9 18:06:44 2018 +0800

ipv4: reset fnhe_mtu_locked after cache route flushed

After route cache is flushed via ipv4_sysctl_rtcache_flush(), we forget
to reset fnhe_mtu_locked in rt_bind_exception(). When pmtu is updated
in __ip_rt_update_pmtu(), it will return directly since the pmtu is
still locked. e.g.

+ ip netns exec client ping 10.10.1.1 -c 1 -s 1400 -M do
PING 10.10.1.1 (10.10.1.1) 1400(1428) bytes of data.
>From 10.10.0.254 icmp_seq=1 Frag needed and DF set (mtu = 0)

Signed-off-by: David S. Miller 

-- 
Sabrina


Re: [PATCH net v2 4/5] net/ipv6: propagate net.ipv6.conf.all.addr_gen_mode to devices

2018-07-10 Thread Sabrina Dubroca
2018-07-09, 11:24:49 -0600, David Ahern wrote:
> On 7/9/18 4:25 AM, Sabrina Dubroca wrote:
> > This aligns the addr_gen_mode sysctl with the expected behavior of the
> > "all" variant.
> > 
> > Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address 
> > generation mode")
> > Suggested-by: David Ahern 
> > Signed-off-by: Sabrina Dubroca 
> > ---
> >  net/ipv6/addrconf.c | 12 
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> > index e89bca83e0e4..1659a6b3cf42 100644
> > --- a/net/ipv6/addrconf.c
> > +++ b/net/ipv6/addrconf.c
> > @@ -5926,6 +5926,18 @@ static int addrconf_sysctl_addr_gen_mode(struct 
> > ctl_table *ctl, int write,
> > idev->cnf.addr_gen_mode = new_val;
> > addrconf_dev_config(idev->dev);
> > }
> > +   } else if (>ipv6.devconf_all->addr_gen_mode == ctl->data) {
> > +   struct net_device *dev;
> > +
> > +   net->ipv6.devconf_dflt->addr_gen_mode = new_val;
> > +   for_each_netdev(net, dev) {
> > +   idev = __in6_dev_get(dev);
> > +   if (idev &&
> > +   idev->cnf.addr_gen_mode != new_val) {
> > +   idev->cnf.addr_gen_mode = new_val;
> > +   addrconf_dev_config(idev->dev);
> 
> This call is adding a new LL address without removing the previous one:
> 
> # ip -6 addr sh dev eth2
> 4: eth2:  mtu 1500 state UP qlen 1000
> inet6 2001:db8:2::4/64 scope global
>valid_lft forever preferred_lft forever
> inet6 fe80::e0:f9ff:fe45:6480/64 scope link
>valid_lft forever preferred_lft forever
> 
> # sysctl -w net.ipv6.conf.eth2.addr_gen_mode=3
> net.ipv6.conf.eth2.addr_gen_mode = 3
> 
> # ip -6 addr sh dev eth2
> 4: eth2:  mtu 1500 state UP qlen 1000
> inet6 2001:db8:2::4/64 scope global
>valid_lft forever preferred_lft forever
> inet6 fe80::bc31:8009:270d:e019/64 scope link stable-privacy
>valid_lft forever preferred_lft forever
> inet6 fe80::e0:f9ff:fe45:6480/64 scope link
>valid_lft forever preferred_lft forever

Yes. That's also what will happen with global addresses, once the next
RA is received: a new address corresponding to the new generation mode
will be added, and the old one isn't removed.

I think that was the expected behavior of d35a00b8e33d, but since it
never actually worked... OTOH, the netlink attribute only sets
idev->cnf.addr_gen_mode and doesn't add the new LL address (not until
a DOWN/UP cycle), which I personally find surprising. If I set the
mode to random or stable_secret, I would expect the privacy address to
show up without having to take the device down and then up.

I think removing the previous address immediately would break things
(and the user wouldn't expect an address to disappear that way, since
they're not explicitly asking for it to be removed), but I guess we
could play games with the lifetimes (reduce the lifetime of the old
address from forever to some limit). That limit would need to be
configurable I think, and I would rather target that change for
net-next.

-- 
Sabrina


[PATCH net v2 4/5] net/ipv6: propagate net.ipv6.conf.all.addr_gen_mode to devices

2018-07-09 Thread Sabrina Dubroca
This aligns the addr_gen_mode sysctl with the expected behavior of the
"all" variant.

Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address 
generation mode")
Suggested-by: David Ahern 
Signed-off-by: Sabrina Dubroca 
---
 net/ipv6/addrconf.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index e89bca83e0e4..1659a6b3cf42 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5926,6 +5926,18 @@ static int addrconf_sysctl_addr_gen_mode(struct 
ctl_table *ctl, int write,
idev->cnf.addr_gen_mode = new_val;
addrconf_dev_config(idev->dev);
}
+   } else if (>ipv6.devconf_all->addr_gen_mode == ctl->data) {
+   struct net_device *dev;
+
+   net->ipv6.devconf_dflt->addr_gen_mode = new_val;
+   for_each_netdev(net, dev) {
+   idev = __in6_dev_get(dev);
+   if (idev &&
+   idev->cnf.addr_gen_mode != new_val) {
+   idev->cnf.addr_gen_mode = new_val;
+   addrconf_dev_config(idev->dev);
+   }
+   }
}
 
*((u32 *)ctl->data) = new_val;
-- 
2.18.0



[PATCH net v2 3/5] net/ipv6: reserve room for IFLA_INET6_ADDR_GEN_MODE

2018-07-09 Thread Sabrina Dubroca
inet6_ifla6_size() is called to check how much space is needed by
inet6_fill_link_af() and inet6_fill_ifinfo(), both of which include
the IFLA_INET6_ADDR_GEN_MODE attribute. Reserve some room for it.

Fixes: bc91b0f07ada ("ipv6: addrconf: implement address generation modes")
Signed-off-by: Sabrina Dubroca 
Reviewed-by: David Ahern 
---
 net/ipv6/addrconf.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index e20f8a1d8cdb..e89bca83e0e4 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5208,7 +5208,9 @@ static inline size_t inet6_ifla6_size(void)
 + nla_total_size(DEVCONF_MAX * 4) /* IFLA_INET6_CONF */
 + nla_total_size(IPSTATS_MIB_MAX * 8) /* IFLA_INET6_STATS */
 + nla_total_size(ICMP6_MIB_MAX * 8) /* IFLA_INET6_ICMP6STATS */
-+ nla_total_size(sizeof(struct in6_addr)); /* IFLA_INET6_TOKEN */
++ nla_total_size(sizeof(struct in6_addr)) /* IFLA_INET6_TOKEN */
++ nla_total_size(1) /* IFLA_INET6_ADDR_GEN_MODE */
++ 0;
 }
 
 static inline size_t inet6_if_nlmsg_size(void)
-- 
2.18.0



[PATCH net v2 2/5] net/ipv6: don't reinitialize ndev->cnf.addr_gen_mode on new inet6_dev

2018-07-09 Thread Sabrina Dubroca
The value has already been copied from this netns's devconf_dflt, it
shouldn't be reset to the global kernel default.

Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address 
generation mode")
Signed-off-by: Sabrina Dubroca 
Reviewed-by: David Ahern 
---
 net/ipv6/addrconf.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index e9ba53d2a147..e20f8a1d8cdb 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -385,8 +385,6 @@ static struct inet6_dev *ipv6_add_dev(struct net_device 
*dev)
 
if (ndev->cnf.stable_secret.initialized)
ndev->cnf.addr_gen_mode = IN6_ADDR_GEN_MODE_STABLE_PRIVACY;
-   else
-   ndev->cnf.addr_gen_mode = ipv6_devconf_dflt.addr_gen_mode;
 
ndev->cnf.mtu6 = dev->mtu;
ndev->nd_parms = neigh_parms_alloc(dev, _tbl);
-- 
2.18.0



[PATCH net v2 5/5] Documentation: ip-sysctl.txt: document addr_gen_mode

2018-07-09 Thread Sabrina Dubroca
addr_gen_mode was introduced in without documentation, add it now.

Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address 
generation mode")
Signed-off-by: Sabrina Dubroca 
---
 Documentation/networking/ip-sysctl.txt | 9 +
 1 file changed, 9 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index ce8fbf5aa63c..c6df84be12f9 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1834,6 +1834,16 @@ stable_secret - IPv6 address
 
By default the stable secret is unset.
 
+addr_gen_mode - INTEGER
+   Defines how link-local and autoconf addresses are generated.
+
+   0: generate address based on EUI64 (default)
+   1: do no generate a link-local address, use EUI64 for addresses 
generated
+  from autoconf
+   2: generate stable privacy addresses, using the secret from
+  stable_secret (RFC7217)
+   3: generate stable privacy addresses, using a random secret if unset
+
 drop_unicast_in_l2_multicast - BOOLEAN
Drop any unicast IPv6 packets that are received in link-layer
multicast (or broadcast) frames.


[PATCH net v2 1/5] net/ipv6: fix addrconf_sysctl_addr_gen_mode

2018-07-09 Thread Sabrina Dubroca
addrconf_sysctl_addr_gen_mode() has multiple problems. First, it ignores
the errors returned by proc_dointvec().

addrconf_sysctl_addr_gen_mode() calls proc_dointvec() directly, which
writes the value to memory, and then checks if it's valid and may return
EINVAL. If a bad value is given, the value displayed when reading
net.ipv6.conf.foo.addr_gen_mode next time will be invalid. In case the
value provided by the user was valid, addrconf_dev_config() won't be
called since idev->cnf.addr_gen_mode has already been updated.

Fix this in the usual way we deal with values that need to be checked
after the proc_do*() helper has returned: define a local ctl_table and
storage, call proc_dointvec() on that temporary area, then check and
store.

addrconf_sysctl_addr_gen_mode() also writes the new value to the global
ipv6_devconf_dflt, when we're writing to some netns's default, so that
new netns will inherit the value that was set by the change occuring in
any netns. That doesn't make any sense, so let's drop this assignment.

Finally, since addr_gen_mode is a __u32, switch to proc_douintvec().

Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address 
generation mode")
Signed-off-by: Sabrina Dubroca 
---
 net/ipv6/addrconf.c | 27 ++-
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 91580c62bb86..e9ba53d2a147 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5892,32 +5892,31 @@ static int addrconf_sysctl_addr_gen_mode(struct 
ctl_table *ctl, int write,
 loff_t *ppos)
 {
int ret = 0;
-   int new_val;
+   u32 new_val;
struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1;
struct net *net = (struct net *)ctl->extra2;
+   struct ctl_table tmp = {
+   .data = _val,
+   .maxlen = sizeof(new_val),
+   .mode = ctl->mode,
+   };
 
if (!rtnl_trylock())
return restart_syscall();
 
-   ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
+   new_val = *((u32 *)ctl->data);
 
-   if (write) {
-   new_val = *((int *)ctl->data);
+   ret = proc_douintvec(, write, buffer, lenp, ppos);
+   if (ret != 0)
+   goto out;
 
+   if (write) {
if (check_addr_gen_mode(new_val) < 0) {
ret = -EINVAL;
goto out;
}
 
-   /* request for default */
-   if (>ipv6.devconf_dflt->addr_gen_mode == ctl->data) {
-   ipv6_devconf_dflt.addr_gen_mode = new_val;
-
-   /* request for individual net device */
-   } else {
-   if (!idev)
-   goto out;
-
+   if (idev) {
if (check_stable_privacy(idev, net, new_val) < 0) {
ret = -EINVAL;
goto out;
@@ -5928,6 +5927,8 @@ static int addrconf_sysctl_addr_gen_mode(struct ctl_table 
*ctl, int write,
addrconf_dev_config(idev->dev);
}
}
+
+   *((u32 *)ctl->data) = new_val;
}
 
 out:
-- 
2.18.0



[PATCH net v2 0/5] net/ipv6: addr_gen_mode fixes

2018-07-09 Thread Sabrina Dubroca
This series fixes bugs in handling of the addr_gen_mode option, mainly
related to the sysctl. A minor netlink issue was also present in the
initial commit introducing the option on a per-netdevice basis.

v2: add patch 4, requested by David Ahern during review of v1
add patch 5, missing documentation for the sysctl
patches 1, 2, 3 are unchanged

Sabrina Dubroca (5):
  net/ipv6: fix addrconf_sysctl_addr_gen_mode
  net/ipv6: don't reinitialize ndev->cnf.addr_gen_mode on new inet6_dev
  net/ipv6: reserve room for IFLA_INET6_ADDR_GEN_MODE
  net/ipv6: propagate net.ipv6.conf.all.addr_gen_mode
  Documentation: ip-sysctl.txt: document addr_gen_mode

 Documentation/networking/ip-sysctl.txt | 10 
 net/ipv6/addrconf.c| 45 ++
 2 files changed, 39 insertions(+), 16 deletions(-)

-- 
2.18.0


Re: [PATCH net 1/3] net/ipv6: fix addrconf_sysctl_addr_gen_mode

2018-07-06 Thread Sabrina Dubroca
2018-07-06, 09:28:48 -0600, David Ahern wrote:
> On 7/6/18 9:02 AM, Sabrina Dubroca wrote:
> > 2018-07-06, 08:42:01 -0600, David Ahern wrote:
> >> On 7/6/18 7:49 AM, Sabrina Dubroca wrote:
> >>> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> >>> index 91580c62bb86..e9ba53d2a147 100644
> >>> --- a/net/ipv6/addrconf.c
> >>> +++ b/net/ipv6/addrconf.c
> >>> @@ -5892,32 +5892,31 @@ static int addrconf_sysctl_addr_gen_mode(struct 
> >>> ctl_table *ctl, int write,
> >>>loff_t *ppos)
> >>>  {
> >>>   int ret = 0;
> >>> - int new_val;
> >>> + u32 new_val;
> >>>   struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1;
> >>>   struct net *net = (struct net *)ctl->extra2;
> >>> + struct ctl_table tmp = {
> >>> + .data = _val,
> >>> + .maxlen = sizeof(new_val),
> >>> + .mode = ctl->mode,
> >>> + };
> >>>  
> >>>   if (!rtnl_trylock())
> >>>   return restart_syscall();
> >>>  
> >>> - ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
> >>> + new_val = *((u32 *)ctl->data);
> >>>  
> >>> - if (write) {
> >>> - new_val = *((int *)ctl->data);
> >>> + ret = proc_douintvec(, write, buffer, lenp, ppos);
> >>> + if (ret != 0)
> >>> + goto out;
> >>>  
> >>> + if (write) {
> >>>   if (check_addr_gen_mode(new_val) < 0) {
> >>>   ret = -EINVAL;
> >>>   goto out;
> >>>   }
> >>>  
> >>> - /* request for default */
> >>> - if (>ipv6.devconf_dflt->addr_gen_mode == ctl->data) {
> >>> - ipv6_devconf_dflt.addr_gen_mode = new_val;
> >>
> >> updating the template is wrong, but you still need to update the
> >> namespace's default value for new devices.
> > 
> > That's already handled by storing new_val into ctl->data at the end of
> > the 'write' block.
> 
> ok. missed that. It's part of your change below.
> 
> > 
> > BTW, I'd like to make ipv6_devconf and ipv6_devconf_dflt read-only, so
> > that people aren't tempted to update the template, but I'm thinking of
> > doing that in net-next rather than net.
> > 
> >> also, if you are fixing this it would be good to handle the change to
> >> 'all' as well and update all existing devices.
> > 
> > I thought about it, and wasn't sure if that change of behavior was
> > acceptable, especially for stable (I think the current patch should go
> > into stable). OTOH, it's quite clearly what "all" should do.
> 
> IMHO it's a bug that changing 'all' does not actually change all
> existing devices nor is it ever used.

Right. I'll add that as a separate patch in this series, unless you
really prefer the change squashed into this patch.


> Looking at other addr_gen_mode sites, addrconf_sysctl_stable_secret is
> messed up as well. It propagates a change to 'default' to all existing
> devices.

I guess it was intentional, given:

if (>ipv6.devconf_all->stable_secret == ctl->data)
return -EIO;

It only propagates the mode, and not the secret itself, to all
devices. After thinking about it for a while, I guess it considers the
new default not only as default for newly created devices, but also
for newly added addresses/prefixes.
Or am I making stuff up?

-- 
Sabrina


Re: [PATCH net 1/3] net/ipv6: fix addrconf_sysctl_addr_gen_mode

2018-07-06 Thread Sabrina Dubroca
2018-07-06, 08:42:01 -0600, David Ahern wrote:
> On 7/6/18 7:49 AM, Sabrina Dubroca wrote:
> > diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> > index 91580c62bb86..e9ba53d2a147 100644
> > --- a/net/ipv6/addrconf.c
> > +++ b/net/ipv6/addrconf.c
> > @@ -5892,32 +5892,31 @@ static int addrconf_sysctl_addr_gen_mode(struct 
> > ctl_table *ctl, int write,
> >  loff_t *ppos)
> >  {
> > int ret = 0;
> > -   int new_val;
> > +   u32 new_val;
> > struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1;
> > struct net *net = (struct net *)ctl->extra2;
> > +   struct ctl_table tmp = {
> > +   .data = _val,
> > +   .maxlen = sizeof(new_val),
> > +   .mode = ctl->mode,
> > +   };
> >  
> > if (!rtnl_trylock())
> > return restart_syscall();
> >  
> > -   ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
> > +   new_val = *((u32 *)ctl->data);
> >  
> > -   if (write) {
> > -   new_val = *((int *)ctl->data);
> > +   ret = proc_douintvec(, write, buffer, lenp, ppos);
> > +   if (ret != 0)
> > +   goto out;
> >  
> > +   if (write) {
> > if (check_addr_gen_mode(new_val) < 0) {
> > ret = -EINVAL;
> > goto out;
> > }
> >  
> > -   /* request for default */
> > -   if (>ipv6.devconf_dflt->addr_gen_mode == ctl->data) {
> > -   ipv6_devconf_dflt.addr_gen_mode = new_val;
> 
> updating the template is wrong, but you still need to update the
> namespace's default value for new devices.

That's already handled by storing new_val into ctl->data at the end of
the 'write' block.

BTW, I'd like to make ipv6_devconf and ipv6_devconf_dflt read-only, so
that people aren't tempted to update the template, but I'm thinking of
doing that in net-next rather than net.

> also, if you are fixing this it would be good to handle the change to
> 'all' as well and update all existing devices.

I thought about it, and wasn't sure if that change of behavior was
acceptable, especially for stable (I think the current patch should go
into stable). OTOH, it's quite clearly what "all" should do.

> > -
> > -   /* request for individual net device */
> > -   } else {
> > -   if (!idev)
> > -   goto out;
> > -
> > +   if (idev) {
> > if (check_stable_privacy(idev, net, new_val) < 0) {
> > ret = -EINVAL;
> > goto out;
> > @@ -5928,6 +5927,8 @@ static int addrconf_sysctl_addr_gen_mode(struct 
> > ctl_table *ctl, int write,
> > addrconf_dev_config(idev->dev);
> > }
> > }
> > +
> > +   *((u32 *)ctl->data) = new_val;
> > }
> >  
> >  out:
> > 
> 

-- 
Sabrina


[PATCH net 2/3] net/ipv6: don't reinitialize ndev->cnf.addr_gen_mode on new inet6_dev

2018-07-06 Thread Sabrina Dubroca
The value has already been copied from this netns's devconf_dflt, it
shouldn't be reset to the global kernel default.

Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address 
generation mode")
Signed-off-by: Sabrina Dubroca 
---
 net/ipv6/addrconf.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index e9ba53d2a147..e20f8a1d8cdb 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -385,8 +385,6 @@ static struct inet6_dev *ipv6_add_dev(struct net_device 
*dev)
 
if (ndev->cnf.stable_secret.initialized)
ndev->cnf.addr_gen_mode = IN6_ADDR_GEN_MODE_STABLE_PRIVACY;
-   else
-   ndev->cnf.addr_gen_mode = ipv6_devconf_dflt.addr_gen_mode;
 
ndev->cnf.mtu6 = dev->mtu;
ndev->nd_parms = neigh_parms_alloc(dev, _tbl);
-- 
2.18.0



[PATCH net 1/3] net/ipv6: fix addrconf_sysctl_addr_gen_mode

2018-07-06 Thread Sabrina Dubroca
addrconf_sysctl_addr_gen_mode() has multiple problems. First, it ignores
the errors returned by proc_dointvec().

addrconf_sysctl_addr_gen_mode() calls proc_dointvec() directly, which
writes the value to memory, and then checks if it's valid and may return
EINVAL. If a bad value is given, the value displayed when reading
net.ipv6.conf.foo.addr_gen_mode next time will be invalid. In case the
value provided by the user was valid, addrconf_dev_config() won't be
called since idev->cnf.addr_gen_mode has already been updated.

Fix this in the usual way we deal with values that need to be checked
after the proc_do*() helper has returned: define a local ctl_table and
storage, call proc_dointvec() on that temporary area, then check and
store.

addrconf_sysctl_addr_gen_mode() also writes the new value to the global
ipv6_devconf_dflt, when we're writing to some netns's default, so that
new netns will inherit the value that was set by the change occuring in
any netns. That doesn't make any sense, so let's drop this assignment.

Finally, since addr_gen_mode is a __u32, switch to proc_douintvec().

Fixes: d35a00b8e33d ("net/ipv6: allow sysctl to change link-local address 
generation mode")
Signed-off-by: Sabrina Dubroca 
---
 net/ipv6/addrconf.c | 27 ++-
 1 file changed, 14 insertions(+), 13 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 91580c62bb86..e9ba53d2a147 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5892,32 +5892,31 @@ static int addrconf_sysctl_addr_gen_mode(struct 
ctl_table *ctl, int write,
 loff_t *ppos)
 {
int ret = 0;
-   int new_val;
+   u32 new_val;
struct inet6_dev *idev = (struct inet6_dev *)ctl->extra1;
struct net *net = (struct net *)ctl->extra2;
+   struct ctl_table tmp = {
+   .data = _val,
+   .maxlen = sizeof(new_val),
+   .mode = ctl->mode,
+   };
 
if (!rtnl_trylock())
return restart_syscall();
 
-   ret = proc_dointvec(ctl, write, buffer, lenp, ppos);
+   new_val = *((u32 *)ctl->data);
 
-   if (write) {
-   new_val = *((int *)ctl->data);
+   ret = proc_douintvec(, write, buffer, lenp, ppos);
+   if (ret != 0)
+   goto out;
 
+   if (write) {
if (check_addr_gen_mode(new_val) < 0) {
ret = -EINVAL;
goto out;
}
 
-   /* request for default */
-   if (>ipv6.devconf_dflt->addr_gen_mode == ctl->data) {
-   ipv6_devconf_dflt.addr_gen_mode = new_val;
-
-   /* request for individual net device */
-   } else {
-   if (!idev)
-   goto out;
-
+   if (idev) {
if (check_stable_privacy(idev, net, new_val) < 0) {
ret = -EINVAL;
goto out;
@@ -5928,6 +5927,8 @@ static int addrconf_sysctl_addr_gen_mode(struct ctl_table 
*ctl, int write,
addrconf_dev_config(idev->dev);
}
}
+
+   *((u32 *)ctl->data) = new_val;
}
 
 out:
-- 
2.18.0



[PATCH net 0/3] net/ipv6: addr_gen_mode fixes

2018-07-06 Thread Sabrina Dubroca
This series fixes bugs in handling of the addr_gen_mode option, mainly
related to the sysctl. A minor netlink issue was also present in the
initial commit introducing the option on a per-netdevice basis.

Sabrina Dubroca (3):
  net/ipv6: fix addrconf_sysctl_addr_gen_mode
  net/ipv6: don't reinitialize ndev->cnf.addr_gen_mode on new inet6_dev
  net/ipv6: reserve room for IFLA_INET6_ADDR_GEN_MODE

 net/ipv6/addrconf.c | 33 +
 1 file changed, 17 insertions(+), 16 deletions(-)

-- 
2.18.0



[PATCH net 3/3] net/ipv6: reserve room for IFLA_INET6_ADDR_GEN_MODE

2018-07-06 Thread Sabrina Dubroca
inet6_ifla6_size() is called to check how much space is needed by
inet6_fill_link_af() and inet6_fill_ifinfo(), both of which include
the IFLA_INET6_ADDR_GEN_MODE attribute. Reserve some room for it.

Fixes: bc91b0f07ada ("ipv6: addrconf: implement address generation modes")
Signed-off-by: Sabrina Dubroca 
---
 net/ipv6/addrconf.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index e20f8a1d8cdb..e89bca83e0e4 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5208,7 +5208,9 @@ static inline size_t inet6_ifla6_size(void)
 + nla_total_size(DEVCONF_MAX * 4) /* IFLA_INET6_CONF */
 + nla_total_size(IPSTATS_MIB_MAX * 8) /* IFLA_INET6_STATS */
 + nla_total_size(ICMP6_MIB_MAX * 8) /* IFLA_INET6_ICMP6STATS */
-+ nla_total_size(sizeof(struct in6_addr)); /* IFLA_INET6_TOKEN */
++ nla_total_size(sizeof(struct in6_addr)) /* IFLA_INET6_TOKEN */
++ nla_total_size(1) /* IFLA_INET6_ADDR_GEN_MODE */
++ 0;
 }
 
 static inline size_t inet6_if_nlmsg_size(void)
-- 
2.18.0



[PATCH net] net: fix use-after-free in GRO with ESP

2018-06-30 Thread Sabrina Dubroca
Since the addition of GRO for ESP, gro_receive can consume the skb and
return -EINPROGRESS. In that case, the lower layer GRO handler cannot
touch the skb anymore.

Commit 5f114163f2f5 ("net: Add a skb_gro_flush_final helper.") converted
some of the gro_receive handlers that can lead to ESP's gro_receive so
that they wouldn't access the skb when -EINPROGRESS is returned, but
missed other spots, mainly in tunneling protocols.

This patch finishes the conversion to using skb_gro_flush_final(), and
adds a new helper, skb_gro_flush_final_remcsum(), used in VXLAN and
GUE.

Fixes: 5f114163f2f5 ("net: Add a skb_gro_flush_final helper.")
Signed-off-by: Sabrina Dubroca 
Reviewed-by: Stefano Brivio 
---
 drivers/net/geneve.c  |  2 +-
 drivers/net/vxlan.c   |  4 +---
 include/linux/netdevice.h | 20 
 net/8021q/vlan.c  |  2 +-
 net/ipv4/fou.c|  4 +---
 net/ipv4/gre_offload.c|  2 +-
 net/ipv4/udp_offload.c|  2 +-
 7 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 750eaa53bf0c..ada33c2d9ac2 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -476,7 +476,7 @@ static struct sk_buff **geneve_gro_receive(struct sock *sk,
 out_unlock:
rcu_read_unlock();
 out:
-   NAPI_GRO_CB(skb)->flush |= flush;
+   skb_gro_flush_final(skb, pp, flush);
 
return pp;
 }
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index aee0e60471f1..f6bb1d54d4bd 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -623,9 +623,7 @@ static struct sk_buff **vxlan_gro_receive(struct sock *sk,
flush = 0;
 
 out:
-   skb_gro_remcsum_cleanup(skb, );
-   skb->remcsum_offload = 0;
-   NAPI_GRO_CB(skb)->flush |= flush;
+   skb_gro_flush_final_remcsum(skb, pp, flush, );
 
return pp;
 }
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec9850c7936..3d0cc0b5cec2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2789,11 +2789,31 @@ static inline void skb_gro_flush_final(struct sk_buff 
*skb, struct sk_buff **pp,
if (PTR_ERR(pp) != -EINPROGRESS)
NAPI_GRO_CB(skb)->flush |= flush;
 }
+static inline void skb_gro_flush_final_remcsum(struct sk_buff *skb,
+  struct sk_buff **pp,
+  int flush,
+  struct gro_remcsum *grc)
+{
+   if (PTR_ERR(pp) != -EINPROGRESS) {
+   NAPI_GRO_CB(skb)->flush |= flush;
+   skb_gro_remcsum_cleanup(skb, grc);
+   skb->remcsum_offload = 0;
+   }
+}
 #else
 static inline void skb_gro_flush_final(struct sk_buff *skb, struct sk_buff 
**pp, int flush)
 {
NAPI_GRO_CB(skb)->flush |= flush;
 }
+static inline void skb_gro_flush_final_remcsum(struct sk_buff *skb,
+  struct sk_buff **pp,
+  int flush,
+  struct gro_remcsum *grc)
+{
+   NAPI_GRO_CB(skb)->flush |= flush;
+   skb_gro_remcsum_cleanup(skb, grc);
+   skb->remcsum_offload = 0;
+}
 #endif
 
 static inline int dev_hard_header(struct sk_buff *skb, struct net_device *dev,
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 73a65789271b..8ccee3d01822 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -693,7 +693,7 @@ static struct sk_buff **vlan_gro_receive(struct sk_buff 
**head,
 out_unlock:
rcu_read_unlock();
 out:
-   NAPI_GRO_CB(skb)->flush |= flush;
+   skb_gro_flush_final(skb, pp, flush);
 
return pp;
 }
diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 1540db65241a..c9ec1603666b 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -448,9 +448,7 @@ static struct sk_buff **gue_gro_receive(struct sock *sk,
 out_unlock:
rcu_read_unlock();
 out:
-   NAPI_GRO_CB(skb)->flush |= flush;
-   skb_gro_remcsum_cleanup(skb, );
-   skb->remcsum_offload = 0;
+   skb_gro_flush_final_remcsum(skb, pp, flush, );
 
return pp;
 }
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index 1859c473b21a..6a7d980105f6 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -223,7 +223,7 @@ static struct sk_buff **gre_gro_receive(struct sk_buff 
**head,
 out_unlock:
rcu_read_unlock();
 out:
-   NAPI_GRO_CB(skb)->flush |= flush;
+   skb_gro_flush_final(skb, pp, flush);
 
return pp;
 }
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 92dc9e5a7ff3..69c54540d5b4 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -394,7 +394,7 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head, 
struct sk_buff *skb,
 out_unlock:
rcu_read_unlock();
 out:
-   NAPI_GRO_CB(skb)->flush |= flush;
+   skb_gro_flush_final(skb, pp

[PATCH net] alx: take rtnl before calling __alx_open from resume

2018-06-29 Thread Sabrina Dubroca
The __alx_open function can be called from ndo_open, which is called
under RTNL, or from alx_resume, which isn't. Since commit d768319cd427,
we're calling the netif_set_real_num_{tx,rx}_queues functions, which
need to be called under RTNL.

This is similar to commit 0c2cc02e571a ("igb: Move the calls to set the
Tx and Rx queues into igb_open").

Fixes: d768319cd427 ("alx: enable multiple tx queues")
Signed-off-by: Sabrina Dubroca 
---
 drivers/net/ethernet/atheros/alx/main.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/atheros/alx/main.c 
b/drivers/net/ethernet/atheros/alx/main.c
index 567ee54504bc..5e5022fa1d04 100644
--- a/drivers/net/ethernet/atheros/alx/main.c
+++ b/drivers/net/ethernet/atheros/alx/main.c
@@ -1897,13 +1897,19 @@ static int alx_resume(struct device *dev)
struct pci_dev *pdev = to_pci_dev(dev);
struct alx_priv *alx = pci_get_drvdata(pdev);
struct alx_hw *hw = >hw;
+   int err;
 
alx_reset_phy(hw);
 
if (!netif_running(alx->dev))
return 0;
netif_device_attach(alx->dev);
-   return __alx_open(alx, true);
+
+   rtnl_lock();
+   err = __alx_open(alx, true);
+   rtnl_unlock();
+
+   return err;
 }
 
 static SIMPLE_DEV_PM_OPS(alx_pm_ops, alx_suspend, alx_resume);
-- 
2.18.0



[PATCH net v2 1/2] ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds

2018-06-05 Thread Sabrina Dubroca
Currently, raw6_sk(sk)->ip6mr_table is set unconditionally during
ip6_mroute_setsockopt(MRT6_TABLE). A subsequent attempt at the same
setsockopt will fail with -ENOENT, since we haven't actually created
that table.

A similar fix for ipv4 was included in commit 5e1859fbcc3c ("ipv4: ipmr:
various fixes and cleanups").

Fixes: d1db275dd3f6 ("ipv6: ip6mr: support multiple tables")
Signed-off-by: Sabrina Dubroca 
---
 net/ipv6/ip6mr.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 298fd8b6ed17..42eca2689c3b 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -1759,7 +1759,8 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, 
char __user *optval, uns
ret = 0;
if (!ip6mr_new_table(net, v))
ret = -ENOMEM;
-   raw6_sk(sk)->ip6mr_table = v;
+   else
+   raw6_sk(sk)->ip6mr_table = v;
rtnl_unlock();
return ret;
}
-- 
2.17.1



[PATCH net v2 2/2] ipmr: fix error path when ipmr_new_table fails

2018-06-05 Thread Sabrina Dubroca
commit 0bbbf0e7d0e7 ("ipmr, ip6mr: Unite creation of new mr_table")
refactored ipmr_new_table, so that it now returns NULL when
mr_table_alloc fails. Unfortunately, all callers of ipmr_new_table
expect an ERR_PTR.

This can result in NULL deref, for example when ipmr_rules_exit calls
ipmr_free_table with NULL net->ipv4.mrt in the
!CONFIG_IP_MROUTE_MULTIPLE_TABLES version.

This patch makes mr_table_alloc return errors, and changes
ip6mr_new_table and its callers to return/expect error pointers as
well. It also removes the version of mr_table_alloc defined under
!CONFIG_IP_MROUTE_COMMON, since it is never used.

Fixes: 0bbbf0e7d0e7 ("ipmr, ip6mr: Unite creation of new mr_table")
Signed-off-by: Sabrina Dubroca 
---
v2: - fixed brainfart that shadowed mrt variable in ip6_mroute_setsockopt
- rebased on top of ip6_mroute_setsockopt fix

 include/linux/mroute_base.h | 10 --
 net/ipv4/ipmr_base.c|  8 +---
 net/ipv6/ip6mr.c| 18 --
 3 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
index d617fe45543e..d633f737b3c6 100644
--- a/include/linux/mroute_base.h
+++ b/include/linux/mroute_base.h
@@ -307,16 +307,6 @@ static inline void vif_device_init(struct vif_device *v,
 {
 }
 
-static inline void *
-mr_table_alloc(struct net *net, u32 id,
-  struct mr_table_ops *ops,
-  void (*expire_func)(struct timer_list *t),
-  void (*table_set)(struct mr_table *mrt,
-struct net *net))
-{
-   return NULL;
-}
-
 static inline void *mr_mfc_find_parent(struct mr_table *mrt,
   void *hasharg, int parent)
 {
diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c
index 30221701614c..cafb0506c8c9 100644
--- a/net/ipv4/ipmr_base.c
+++ b/net/ipv4/ipmr_base.c
@@ -35,17 +35,19 @@ mr_table_alloc(struct net *net, u32 id,
 struct net *net))
 {
struct mr_table *mrt;
+   int err;
 
mrt = kzalloc(sizeof(*mrt), GFP_KERNEL);
if (!mrt)
-   return NULL;
+   return ERR_PTR(-ENOMEM);
mrt->id = id;
write_pnet(>net, net);
 
mrt->ops = *ops;
-   if (rhltable_init(>mfc_hash, mrt->ops.rht_params)) {
+   err = rhltable_init(>mfc_hash, mrt->ops.rht_params);
+   if (err) {
kfree(mrt);
-   return NULL;
+   return ERR_PTR(err);
}
INIT_LIST_HEAD(>mfc_cache_list);
INIT_LIST_HEAD(>mfc_unres_queue);
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 42eca2689c3b..37936671dcb3 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -227,8 +227,8 @@ static int __net_init ip6mr_rules_init(struct net *net)
INIT_LIST_HEAD(>ipv6.mr6_tables);
 
mrt = ip6mr_new_table(net, RT6_TABLE_DFLT);
-   if (!mrt) {
-   err = -ENOMEM;
+   if (IS_ERR(mrt)) {
+   err = PTR_ERR(mrt);
goto err1;
}
 
@@ -301,8 +301,13 @@ static int ip6mr_fib_lookup(struct net *net, struct flowi6 
*flp6,
 
 static int __net_init ip6mr_rules_init(struct net *net)
 {
-   net->ipv6.mrt6 = ip6mr_new_table(net, RT6_TABLE_DFLT);
-   return net->ipv6.mrt6 ? 0 : -ENOMEM;
+   struct mr_table *mrt;
+
+   mrt = ip6mr_new_table(net, RT6_TABLE_DFLT);
+   if (IS_ERR(mrt))
+   return PTR_ERR(mrt);
+   net->ipv6.mrt6 = mrt;
+   return 0;
 }
 
 static void __net_exit ip6mr_rules_exit(struct net *net)
@@ -1757,8 +1762,9 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, 
char __user *optval, uns
 
rtnl_lock();
ret = 0;
-   if (!ip6mr_new_table(net, v))
-   ret = -ENOMEM;
+   mrt = ip6mr_new_table(net, v);
+   if (IS_ERR(mrt))
+   ret = PTR_ERR(mrt);
else
raw6_sk(sk)->ip6mr_table = v;
rtnl_unlock();
-- 
2.17.1



Re: [PATCH net] ipmr: fix error path when mr_table_alloc fails

2018-06-05 Thread Sabrina Dubroca
2018-06-04, 17:25:14 -0400, David Miller wrote:
> From: Sabrina Dubroca 
> Date: Mon,  4 Jun 2018 13:55:54 +0200
> 
> > commit 0bbbf0e7d0e7 ("ipmr, ip6mr: Unite creation of new mr_table")
> > refactored ipmr_new_table, so that it now returns NULL when
> > mr_table_alloc fails. Unfortunately, all callers of ipmr_new_table
> > expect an ERR_PTR. commit 66fb33254f45 ("ipmr: properly check
> > rhltable_init() return value") followed suit.
> > 
> > This can result in NULL deref, when ipmr_rules_exit calls
> > ipmr_free_table with NULL net->ipv4.mrt in the
> > !CONFIG_IP_MROUTE_MULTIPLE_TABLES version.
> > 
> > This patch makes mr_table_alloc return errors, and changes
> > ip6mr_new_table and its callers to return/expect error pointers as
> > well. It also removes the version of mr_table_alloc defined under
> > !CONFIG_IP_MROUTE_COMMON, since it is never used.
> > 
> > Fixes: 0bbbf0e7d0e7 ("ipmr, ip6mr: Unite creation of new mr_table")
> > Fixes: 66fb33254f45 ("ipmr: properly check rhltable_init() return value")
> > Signed-off-by: Sabrina Dubroca 
> 
> This adds a new warning with gcc-8.1.1 on Fedora 28
> 
>   CC [M]  net/ipv6/ip6mr.o
> In file included from ./arch/x86/include/asm/current.h:5,
>  from ./include/linux/sched.h:12,
>  from ./include/linux/uaccess.h:5,
>  from net/ipv6/ip6mr.c:19:
> net/ipv6/ip6mr.c: In function ‘ip6_mroute_setsockopt’:
> ./include/linux/compiler.h:177:26: warning: ‘mrt’ may be used uninitialized 
> in this function [-Wmaybe-uninitialized]
>   case 8: *(__u64 *)res = *(volatile __u64 *)p; break;  \
>   ^
> net/ipv6/ip6mr.c:1752:20: note: ‘mrt’ was declared here
>struct mr_table *mrt;
> ^~~

grmbl, CONFIG_UBSAN disables -Wmaybe-uninitialized. I'll prepare a v2,
sorry.

-- 
Sabrina


[PATCH net] ipmr: fix error path when mr_table_alloc fails

2018-06-04 Thread Sabrina Dubroca
commit 0bbbf0e7d0e7 ("ipmr, ip6mr: Unite creation of new mr_table")
refactored ipmr_new_table, so that it now returns NULL when
mr_table_alloc fails. Unfortunately, all callers of ipmr_new_table
expect an ERR_PTR. commit 66fb33254f45 ("ipmr: properly check
rhltable_init() return value") followed suit.

This can result in NULL deref, when ipmr_rules_exit calls
ipmr_free_table with NULL net->ipv4.mrt in the
!CONFIG_IP_MROUTE_MULTIPLE_TABLES version.

This patch makes mr_table_alloc return errors, and changes
ip6mr_new_table and its callers to return/expect error pointers as
well. It also removes the version of mr_table_alloc defined under
!CONFIG_IP_MROUTE_COMMON, since it is never used.

Fixes: 0bbbf0e7d0e7 ("ipmr, ip6mr: Unite creation of new mr_table")
Fixes: 66fb33254f45 ("ipmr: properly check rhltable_init() return value")
Signed-off-by: Sabrina Dubroca 
---
 include/linux/mroute_base.h | 10 --
 net/ipv4/ipmr_base.c|  8 +---
 net/ipv6/ip6mr.c| 19 +--
 3 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
index d617fe45543e..d633f737b3c6 100644
--- a/include/linux/mroute_base.h
+++ b/include/linux/mroute_base.h
@@ -307,16 +307,6 @@ static inline void vif_device_init(struct vif_device *v,
 {
 }
 
-static inline void *
-mr_table_alloc(struct net *net, u32 id,
-  struct mr_table_ops *ops,
-  void (*expire_func)(struct timer_list *t),
-  void (*table_set)(struct mr_table *mrt,
-struct net *net))
-{
-   return NULL;
-}
-
 static inline void *mr_mfc_find_parent(struct mr_table *mrt,
   void *hasharg, int parent)
 {
diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c
index 30221701614c..cafb0506c8c9 100644
--- a/net/ipv4/ipmr_base.c
+++ b/net/ipv4/ipmr_base.c
@@ -35,17 +35,19 @@ mr_table_alloc(struct net *net, u32 id,
 struct net *net))
 {
struct mr_table *mrt;
+   int err;
 
mrt = kzalloc(sizeof(*mrt), GFP_KERNEL);
if (!mrt)
-   return NULL;
+   return ERR_PTR(-ENOMEM);
mrt->id = id;
write_pnet(>net, net);
 
mrt->ops = *ops;
-   if (rhltable_init(>mfc_hash, mrt->ops.rht_params)) {
+   err = rhltable_init(>mfc_hash, mrt->ops.rht_params);
+   if (err) {
kfree(mrt);
-   return NULL;
+   return ERR_PTR(err);
}
INIT_LIST_HEAD(>mfc_cache_list);
INIT_LIST_HEAD(>mfc_unres_queue);
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 298fd8b6ed17..f9b801bd00f8 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -227,8 +227,8 @@ static int __net_init ip6mr_rules_init(struct net *net)
INIT_LIST_HEAD(>ipv6.mr6_tables);
 
mrt = ip6mr_new_table(net, RT6_TABLE_DFLT);
-   if (!mrt) {
-   err = -ENOMEM;
+   if (IS_ERR(mrt)) {
+   err = PTR_ERR(mrt);
goto err1;
}
 
@@ -301,8 +301,13 @@ static int ip6mr_fib_lookup(struct net *net, struct flowi6 
*flp6,
 
 static int __net_init ip6mr_rules_init(struct net *net)
 {
-   net->ipv6.mrt6 = ip6mr_new_table(net, RT6_TABLE_DFLT);
-   return net->ipv6.mrt6 ? 0 : -ENOMEM;
+   struct mr_table *mrt;
+
+   mrt = ip6mr_new_table(net, RT6_TABLE_DFLT);
+   if (IS_ERR(mrt))
+   return PTR_ERR(mrt);
+   net->ipv6.mrt6 = mrt;
+   return 0;
 }
 
 static void __net_exit ip6mr_rules_exit(struct net *net)
@@ -1743,6 +1748,7 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, 
char __user *optval, uns
 #ifdef CONFIG_IPV6_MROUTE_MULTIPLE_TABLES
case MRT6_TABLE:
{
+   struct mr_table *mrt;
u32 v;
 
if (optlen != sizeof(u32))
@@ -1757,8 +1763,9 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, 
char __user *optval, uns
 
rtnl_lock();
ret = 0;
-   if (!ip6mr_new_table(net, v))
-   ret = -ENOMEM;
+   mrt = ip6mr_new_table(net, v);
+   if (IS_ERR(mrt))
+   ret = PTR_ERR(mrt);
raw6_sk(sk)->ip6mr_table = v;
rtnl_unlock();
return ret;
-- 
2.17.1



[PATCH iproute2-next v2 1/2] man: ip link: document GRE tunnels

2018-04-20 Thread Sabrina Dubroca
GRE tunnels are currently only documented together with IPIP and SIT
tunnels, but they actually have very different configuration
options. Let's separate them.

Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
---
 man/man8/ip-link.8.in | 152 --
 1 file changed, 148 insertions(+), 4 deletions(-)

diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 5dee9fcd627a..77ab8a3b9723 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -693,13 +693,13 @@ tunnel.
 .in -8
 
 .TP
-GRE, IPIP, SIT Type Support
-For a link of types
-.I GRE/IPIP/SIT
+IPIP, SIT Type Support
+For a link of type
+.IR IPIP or SIT
 the following additional arguments are supported:
 
 .BI "ip link add " DEVICE
-.BR type " { " gre " | " ipip " | " sit " }"
+.BR type " { " ipip " | " sit " }"
 .BI " remote " ADDR " local " ADDR
 [
 .BR encap " { " fou " | " gue " | " none " }"
@@ -764,6 +764,150 @@ IPv6-Over-IPv4 is not supported for IPIP.
 - make this tunnel externally controlled
 .RB "(e.g. " "ip route encap" ).
 
+.in -8
+.TP
+GRE Type Support
+For a link of type
+.IR GRE " or " GRETAP
+the following additional arguments are supported:
+
+.BI "ip link add " DEVICE
+.BR type " { " gre " | " gretap " }"
+.BI " remote " ADDR " local " ADDR
+[
+.RB [ i | o ] seq
+] [
+.RB [ i | o ] key
+.I KEY
+] [
+.RB [ i | o ] csum
+] [
+.BI ttl " TTL "
+] [
+.BI tos " TOS "
+] [
+.RB [ no ] pmtudisc
+] [
+.RB [ no ] ignore-df
+] [
+.BI dev " PHYS_DEV "
+] [
+.BR encap " { " fou " | " gue " | " none " }"
+] [
+.BR encap-sport " { " \fIPORT " | " auto " }"
+] [
+.BI "encap-dport " PORT
+] [
+.RB [ no ] encap-csum
+] [
+.RB [ no ] encap-remcsum
+] [
+.BR external
+]
+
+.in +8
+.sp
+.BI  remote " ADDR "
+- specifies the remote address of the tunnel.
+
+.sp
+.BI  local " ADDR "
+- specifies the fixed local address for tunneled packets.
+It must be an address on another interface on this host.
+
+.sp
+.RB [ i | o ] seq
+- serialize packets.
+The
+.B oseq
+flag enables sequencing of outgoing packets.
+The
+.B iseq
+flag requires that all input packets are serialized.
+
+.sp
+.RB [ i | o ] key
+.I KEY
+- use keyed GRE with key
+.IR KEY ". "KEY
+is either a number or an IPv4 address-like dotted quad.
+The
+.B key
+parameter specifies the same key to use in both directions.
+The
+.BR ikey " and " okey
+parameters specify different keys for input and output.
+
+.sp
+.RB  [ i | o ] csum
+- generate/require checksums for tunneled packets.
+The
+.B ocsum
+flag calculates checksums for outgoing packets.
+The
+.B icsum
+flag requires that all input packets have the correct
+checksum. The
+.B csum
+flag is equivalent to the combination
+.B "icsum ocsum" .
+
+.sp
+.BI ttl " TTL"
+- specifies the TTL value to use in outgoing packets.
+
+.sp
+.BI tos " TOS"
+- specifies the TOS value to use in outgoing packets.
+
+.sp
+.RB [ no ] pmtudisc
+- enables/disables Path MTU Discovery on this tunnel.
+It is enabled by default. Note that a fixed ttl is incompatible
+with this option: tunneling with a fixed ttl always makes pmtu
+discovery.
+
+.sp
+.RB [ no ] ignore-df
+- enables/disables IPv4 DF suppression on this tunnel.
+Normally datagrams that exceed the MTU will be fragmented; the presence
+of the DF flag inhibits this, resulting instead in an ICMP Unreachable
+(Fragmentation Required) message.  Enabling this attribute casues the
+DF flag to be ignored.
+
+.sp
+.BI dev " PHYS_DEV"
+- specifies the physical device to use for tunnel endpoint communication.
+
+.sp
+.BR encap " { " fou " | " gue " | " none " }"
+- specifies type of secondary UDP encapsulation. "fou" indicates
+Foo-Over-UDP, "gue" indicates Generic UDP Encapsulation.
+
+.sp
+.BR encap-sport " { " \fIPORT " | " auto " }"
+- specifies the source port in UDP encapsulation.
+.IR PORT
+indicates the port by number, "auto"
+indicates that the port number should be chosen automatically
+(the kernel picks a flow based on the flow hash of the
+encapsulated packet).
+
+.sp
+.RB [ no ] encap-csum
+- specifies if UDP checksums are enabled in the secondary
+encapsulation.
+
+.sp
+.RB [ no ] encap-remcsum
+- specifies if Remote Checksum Offload is enabled. This is only
+applicable for Generic UDP Encapsulation.
+
+.sp
+.BR external
+- make this tunnel externally controlled
+.RB "(e.g. " "ip route encap" ).
+
 .in -8
 
 .TP
-- 
2.17.0



[PATCH iproute2-next v2 2/2] gre/gre6: allow clearing {,i,o}{key,seq,csum} flags

2018-04-20 Thread Sabrina Dubroca
Currently, iproute allows setting those flags, but it's impossible to
clear them, since their current value is fetched from the kernel and
then we OR in the additional flags passed on the command line.

Add no* variants to allow clearing them.

Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
---
v2: fixed up okey flag clearing
also reset the actual value of the key, not just the flag

 ip/link_gre.c | 30 +++---
 ip/link_gre6.c| 30 +++---
 man/man8/ip-link.8.in | 27 ++-
 3 files changed, 72 insertions(+), 15 deletions(-)

diff --git a/ip/link_gre.c b/ip/link_gre.c
index bc1cee8fbca2..ede761b23a8c 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -31,9 +31,9 @@ static void gre_print_help(struct link_util *lu, int argc, 
char **argv, FILE *f)
);
fprintf(f,
" [ local ADDR ]\n"
-   " [ [i|o]seq ]\n"
-   " [ [i|o]key KEY ]\n"
-   " [ [i|o]csum ]\n"
+   " [ [no][i|o]seq ]\n"
+   " [ [i|o]key KEY | no[i|o]key ]\n"
+   " [ [no][i|o]csum ]\n"
" [ ttl TTL ]\n"
" [ tos TOS ]\n"
" [ [no]pmtudisc ]\n"
@@ -210,28 +210,52 @@ get_failed:
iflags |= GRE_KEY;
oflags |= GRE_KEY;
ikey = okey = tnl_parse_key("key", *argv);
+   } else if (!matches(*argv, "nokey")) {
+   iflags &= ~GRE_KEY;
+   oflags &= ~GRE_KEY;
+   ikey = okey = 0;
} else if (!matches(*argv, "ikey")) {
NEXT_ARG();
iflags |= GRE_KEY;
ikey = tnl_parse_key("ikey", *argv);
+   } else if (!matches(*argv, "noikey")) {
+   iflags &= ~GRE_KEY;
+   ikey = 0;
} else if (!matches(*argv, "okey")) {
NEXT_ARG();
oflags |= GRE_KEY;
okey = tnl_parse_key("okey", *argv);
+   } else if (!matches(*argv, "nookey")) {
+   oflags &= ~GRE_KEY;
+   okey = 0;
} else if (!matches(*argv, "seq")) {
iflags |= GRE_SEQ;
oflags |= GRE_SEQ;
+   } else if (!matches(*argv, "noseq")) {
+   iflags &= ~GRE_SEQ;
+   oflags &= ~GRE_SEQ;
} else if (!matches(*argv, "iseq")) {
iflags |= GRE_SEQ;
+   } else if (!matches(*argv, "noiseq")) {
+   iflags &= ~GRE_SEQ;
} else if (!matches(*argv, "oseq")) {
oflags |= GRE_SEQ;
+   } else if (!matches(*argv, "nooseq")) {
+   oflags &= ~GRE_SEQ;
} else if (!matches(*argv, "csum")) {
iflags |= GRE_CSUM;
oflags |= GRE_CSUM;
+   } else if (!matches(*argv, "nocsum")) {
+   iflags &= ~GRE_CSUM;
+   oflags &= ~GRE_CSUM;
} else if (!matches(*argv, "icsum")) {
iflags |= GRE_CSUM;
+   } else if (!matches(*argv, "noicsum")) {
+   iflags &= ~GRE_CSUM;
} else if (!matches(*argv, "ocsum")) {
oflags |= GRE_CSUM;
+   } else if (!matches(*argv, "noocsum")) {
+   oflags &= ~GRE_CSUM;
} else if (!matches(*argv, "nopmtudisc")) {
pmtudisc = 0;
} else if (!matches(*argv, "pmtudisc")) {
diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index a6fe0b73d235..181b2eae808b 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -38,9 +38,9 @@ static void gre_print_help(struct link_util *lu, int argc, 
char **argv, FILE *f)
);
fprintf(f,
" [ local ADDR ]\n"
-   " [ [i|o]seq ]\n"
-   " [ [i|o]key KEY ]\n"
-   " [ [i|o]csum ]\n"
+   " [ [no][i|o]seq ]\n"
+   "  

Re: [PATCH iproute2-next 2/2] gre/gre6: allow clearing {,i,o}{key,seq,csum} flags

2018-04-19 Thread Sabrina Dubroca
2018-04-19, 12:22:42 +0200, Sabrina Dubroca wrote:
> @@ -210,28 +210,49 @@ get_failed:
>   iflags |= GRE_KEY;
>   oflags |= GRE_KEY;
>   ikey = okey = tnl_parse_key("key", *argv);
> + } else if (!matches(*argv, "nokey")) {
> + iflags &= ~GRE_KEY;
> + oflags &= ~GRE_KEY;
>   } else if (!matches(*argv, "ikey")) {
>   NEXT_ARG();
>   iflags |= GRE_KEY;
>   ikey = tnl_parse_key("ikey", *argv);
> + } else if (!matches(*argv, "noikey")) {
> + iflags &= ~GRE_KEY;
>   } else if (!matches(*argv, "okey")) {
>   NEXT_ARG();
>   oflags |= GRE_KEY;
>   okey = tnl_parse_key("okey", *argv);
> + } else if (!matches(*argv, "noikey")) {
> + iflags &= ~GRE_KEY;

Sorry, posted the wrong version. I'll send v2 after I've had a bucket
of coffee.

-- 
Sabrina


[PATCH iproute2-next 2/2] gre/gre6: allow clearing {,i,o}{key,seq,csum} flags

2018-04-19 Thread Sabrina Dubroca
Currently, iproute allows setting those flags, but it's impossible to
clear them, since their current value is fetched from the kernel and
then we OR in the additional flags passed on the command line.

Add no* variants to allow clearing them.

Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
---
 ip/link_gre.c | 27 ---
 ip/link_gre6.c| 27 ---
 man/man8/ip-link.8.in | 27 ++-
 3 files changed, 66 insertions(+), 15 deletions(-)

diff --git a/ip/link_gre.c b/ip/link_gre.c
index bc1cee8fbca2..4c8849f7051d 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -31,9 +31,9 @@ static void gre_print_help(struct link_util *lu, int argc, 
char **argv, FILE *f)
);
fprintf(f,
" [ local ADDR ]\n"
-   " [ [i|o]seq ]\n"
-   " [ [i|o]key KEY ]\n"
-   " [ [i|o]csum ]\n"
+   " [ [no][i|o]seq ]\n"
+   " [ [i|o]key KEY | no[i|o]key ]\n"
+   " [ [no][i|o]csum ]\n"
" [ ttl TTL ]\n"
" [ tos TOS ]\n"
" [ [no]pmtudisc ]\n"
@@ -210,28 +210,49 @@ get_failed:
iflags |= GRE_KEY;
oflags |= GRE_KEY;
ikey = okey = tnl_parse_key("key", *argv);
+   } else if (!matches(*argv, "nokey")) {
+   iflags &= ~GRE_KEY;
+   oflags &= ~GRE_KEY;
} else if (!matches(*argv, "ikey")) {
NEXT_ARG();
iflags |= GRE_KEY;
ikey = tnl_parse_key("ikey", *argv);
+   } else if (!matches(*argv, "noikey")) {
+   iflags &= ~GRE_KEY;
} else if (!matches(*argv, "okey")) {
NEXT_ARG();
oflags |= GRE_KEY;
okey = tnl_parse_key("okey", *argv);
+   } else if (!matches(*argv, "noikey")) {
+   iflags &= ~GRE_KEY;
} else if (!matches(*argv, "seq")) {
iflags |= GRE_SEQ;
oflags |= GRE_SEQ;
+   } else if (!matches(*argv, "noseq")) {
+   iflags &= ~GRE_SEQ;
+   oflags &= ~GRE_SEQ;
} else if (!matches(*argv, "iseq")) {
iflags |= GRE_SEQ;
+   } else if (!matches(*argv, "noiseq")) {
+   iflags &= ~GRE_SEQ;
} else if (!matches(*argv, "oseq")) {
oflags |= GRE_SEQ;
+   } else if (!matches(*argv, "nooseq")) {
+   oflags &= ~GRE_SEQ;
} else if (!matches(*argv, "csum")) {
iflags |= GRE_CSUM;
oflags |= GRE_CSUM;
+   } else if (!matches(*argv, "nocsum")) {
+   iflags &= ~GRE_CSUM;
+   oflags &= ~GRE_CSUM;
} else if (!matches(*argv, "icsum")) {
iflags |= GRE_CSUM;
+   } else if (!matches(*argv, "noicsum")) {
+   iflags &= ~GRE_CSUM;
} else if (!matches(*argv, "ocsum")) {
oflags |= GRE_CSUM;
+   } else if (!matches(*argv, "noocsum")) {
+   oflags &= ~GRE_CSUM;
} else if (!matches(*argv, "nopmtudisc")) {
pmtudisc = 0;
} else if (!matches(*argv, "pmtudisc")) {
diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index a6fe0b73d235..542da0c3ccc9 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -38,9 +38,9 @@ static void gre_print_help(struct link_util *lu, int argc, 
char **argv, FILE *f)
);
fprintf(f,
" [ local ADDR ]\n"
-   " [ [i|o]seq ]\n"
-   " [ [i|o]key KEY ]\n"
-   " [ [i|o]csum ]\n"
+   " [ [no][i|o]seq ]\n"
+   " [ [i|o]key KEY | no[i|o]key ]\n"
+   " [ [no][i|o]csum ]\n"
" [ hoplimit TTL ]\n"
&q

[PATCH iproute2-next 1/2] man: ip link: document GRE tunnels

2018-04-19 Thread Sabrina Dubroca
GRE tunnels are currently only documented together with IPIP and SIT
tunnels, but they actually have very different configuration
options. Let's separate them.

Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
---
 man/man8/ip-link.8.in | 152 --
 1 file changed, 148 insertions(+), 4 deletions(-)

diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 5dee9fcd627a..77ab8a3b9723 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -693,13 +693,13 @@ tunnel.
 .in -8
 
 .TP
-GRE, IPIP, SIT Type Support
-For a link of types
-.I GRE/IPIP/SIT
+IPIP, SIT Type Support
+For a link of type
+.IR IPIP or SIT
 the following additional arguments are supported:
 
 .BI "ip link add " DEVICE
-.BR type " { " gre " | " ipip " | " sit " }"
+.BR type " { " ipip " | " sit " }"
 .BI " remote " ADDR " local " ADDR
 [
 .BR encap " { " fou " | " gue " | " none " }"
@@ -764,6 +764,150 @@ IPv6-Over-IPv4 is not supported for IPIP.
 - make this tunnel externally controlled
 .RB "(e.g. " "ip route encap" ).
 
+.in -8
+.TP
+GRE Type Support
+For a link of type
+.IR GRE " or " GRETAP
+the following additional arguments are supported:
+
+.BI "ip link add " DEVICE
+.BR type " { " gre " | " gretap " }"
+.BI " remote " ADDR " local " ADDR
+[
+.RB [ i | o ] seq
+] [
+.RB [ i | o ] key
+.I KEY
+] [
+.RB [ i | o ] csum
+] [
+.BI ttl " TTL "
+] [
+.BI tos " TOS "
+] [
+.RB [ no ] pmtudisc
+] [
+.RB [ no ] ignore-df
+] [
+.BI dev " PHYS_DEV "
+] [
+.BR encap " { " fou " | " gue " | " none " }"
+] [
+.BR encap-sport " { " \fIPORT " | " auto " }"
+] [
+.BI "encap-dport " PORT
+] [
+.RB [ no ] encap-csum
+] [
+.RB [ no ] encap-remcsum
+] [
+.BR external
+]
+
+.in +8
+.sp
+.BI  remote " ADDR "
+- specifies the remote address of the tunnel.
+
+.sp
+.BI  local " ADDR "
+- specifies the fixed local address for tunneled packets.
+It must be an address on another interface on this host.
+
+.sp
+.RB [ i | o ] seq
+- serialize packets.
+The
+.B oseq
+flag enables sequencing of outgoing packets.
+The
+.B iseq
+flag requires that all input packets are serialized.
+
+.sp
+.RB [ i | o ] key
+.I KEY
+- use keyed GRE with key
+.IR KEY ". "KEY
+is either a number or an IPv4 address-like dotted quad.
+The
+.B key
+parameter specifies the same key to use in both directions.
+The
+.BR ikey " and " okey
+parameters specify different keys for input and output.
+
+.sp
+.RB  [ i | o ] csum
+- generate/require checksums for tunneled packets.
+The
+.B ocsum
+flag calculates checksums for outgoing packets.
+The
+.B icsum
+flag requires that all input packets have the correct
+checksum. The
+.B csum
+flag is equivalent to the combination
+.B "icsum ocsum" .
+
+.sp
+.BI ttl " TTL"
+- specifies the TTL value to use in outgoing packets.
+
+.sp
+.BI tos " TOS"
+- specifies the TOS value to use in outgoing packets.
+
+.sp
+.RB [ no ] pmtudisc
+- enables/disables Path MTU Discovery on this tunnel.
+It is enabled by default. Note that a fixed ttl is incompatible
+with this option: tunneling with a fixed ttl always makes pmtu
+discovery.
+
+.sp
+.RB [ no ] ignore-df
+- enables/disables IPv4 DF suppression on this tunnel.
+Normally datagrams that exceed the MTU will be fragmented; the presence
+of the DF flag inhibits this, resulting instead in an ICMP Unreachable
+(Fragmentation Required) message.  Enabling this attribute casues the
+DF flag to be ignored.
+
+.sp
+.BI dev " PHYS_DEV"
+- specifies the physical device to use for tunnel endpoint communication.
+
+.sp
+.BR encap " { " fou " | " gue " | " none " }"
+- specifies type of secondary UDP encapsulation. "fou" indicates
+Foo-Over-UDP, "gue" indicates Generic UDP Encapsulation.
+
+.sp
+.BR encap-sport " { " \fIPORT " | " auto " }"
+- specifies the source port in UDP encapsulation.
+.IR PORT
+indicates the port by number, "auto"
+indicates that the port number should be chosen automatically
+(the kernel picks a flow based on the flow hash of the
+encapsulated packet).
+
+.sp
+.RB [ no ] encap-csum
+- specifies if UDP checksums are enabled in the secondary
+encapsulation.
+
+.sp
+.RB [ no ] encap-remcsum
+- specifies if Remote Checksum Offload is enabled. This is only
+applicable for Generic UDP Encapsulation.
+
+.sp
+.BR external
+- make this tunnel externally controlled
+.RB "(e.g. " "ip route encap" ).
+
 .in -8
 
 .TP
-- 
2.17.0



Re: Regression with 5dcd8400884c ("macsec: missing dev_put() on error in macsec_newlink()")

2018-04-14 Thread Sabrina Dubroca
Hello Laura,

2018-04-14, 10:56:55 -0700, Laura Abbott wrote:
> Hi,
> 
> Fedora got a bug report of a regression when trying to remove the
> the macsec module (https://bugzilla.redhat.com/show_bug.cgi?id=1566410).
> I did a bisect and found
> 
> commit 5dcd8400884cc4a043a6d4617e042489e5d566a9
> Author: Dan Carpenter 
> Date:   Wed Mar 21 11:09:01 2018 +0300
> 
> macsec: missing dev_put() on error in macsec_newlink()
> We moved the dev_hold(real_dev); call earlier in the function but forgot
> to update the error paths.
> Fixes: 0759e552bce7 ("macsec: fix negative refcnt on parent link")
> Signed-off-by: Dan Carpenter 
> Signed-off-by: David S. Miller 
> 
> The script I used for testing based on the reporter is attached. It
> looks like modprobe is stuck in the D state. Any idea?

I don't think that reference was actually leaked. It gets released in
macsec_free_netdev() when the device is deleted.

modprobe getting stuck is just a side-effect of the refcount going
negative on the parent device, since removing the module needs to take
the lock that is held by device deletion.

I'll send a revert tomorrow.

Thanks for the report,

-- 
Sabrina


[PATCH net 2/2] tun: send netlink notification when the device is modified

2018-04-10 Thread Sabrina Dubroca
I added dumping of link information about tun devices over netlink in
commit 1ec010e70593 ("tun: export flags, uid, gid, queue information
over netlink"), but didn't add the missing netlink notifications when
the device's exported properties change.

This patch adds notifications when owner/group or flags are modified,
when queues are attached/detached, and when a tun fd is closed.

Reported-by: Thomas Haller <thal...@redhat.com>
Fixes: 1ec010e70593 ("tun: export flags, uid, gid, queue information over 
netlink")
Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
---
 drivers/net/tun.c | 24 ++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index c9e68fd76a37..28583aa0c17d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -743,8 +743,15 @@ static void __tun_detach(struct tun_file *tfile, bool 
clean)
 
 static void tun_detach(struct tun_file *tfile, bool clean)
 {
+   struct tun_struct *tun;
+   struct net_device *dev;
+
rtnl_lock();
+   tun = rtnl_dereference(tfile->tun);
+   dev = tun ? tun->dev : NULL;
__tun_detach(tfile, clean);
+   if (dev)
+   netdev_state_change(dev);
rtnl_unlock();
 }
 
@@ -2562,13 +2569,15 @@ static int tun_set_iff(struct net *net, struct file 
*file, struct ifreq *ifr)
/* One or more queue has already been attached, no need
 * to initialize the device again.
 */
+   netdev_state_change(dev);
return 0;
}
 
tun->flags = (tun->flags & ~TUN_FEATURES) |
  (ifr->ifr_flags & TUN_FEATURES);
-   }
-   else {
+
+   netdev_state_change(dev);
+   } else {
char *name;
unsigned long flags = 0;
int queues = ifr->ifr_flags & IFF_MULTI_QUEUE ?
@@ -2808,6 +2817,9 @@ static int tun_set_queue(struct file *file, struct ifreq 
*ifr)
} else
ret = -EINVAL;
 
+   if (ret >= 0)
+   netdev_state_change(tun->dev);
+
 unlock:
rtnl_unlock();
return ret;
@@ -2848,6 +2860,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned 
int cmd,
unsigned int ifindex;
int le;
int ret;
+   bool do_notify = false;
 
if (cmd == TUNSETIFF || cmd == TUNSETQUEUE ||
(_IOC_TYPE(cmd) == SOCK_IOC_TYPE && cmd != SIOCGSKNS)) {
@@ -2944,10 +2957,12 @@ static long __tun_chr_ioctl(struct file *file, unsigned 
int cmd,
if (arg && !(tun->flags & IFF_PERSIST)) {
tun->flags |= IFF_PERSIST;
__module_get(THIS_MODULE);
+   do_notify = true;
}
if (!arg && (tun->flags & IFF_PERSIST)) {
tun->flags &= ~IFF_PERSIST;
module_put(THIS_MODULE);
+   do_notify = true;
}
 
tun_debug(KERN_INFO, tun, "persist %s\n",
@@ -2962,6 +2977,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned 
int cmd,
break;
}
tun->owner = owner;
+   do_notify = true;
tun_debug(KERN_INFO, tun, "owner set to %u\n",
  from_kuid(_user_ns, tun->owner));
break;
@@ -2974,6 +2990,7 @@ static long __tun_chr_ioctl(struct file *file, unsigned 
int cmd,
break;
}
tun->group = group;
+   do_notify = true;
tun_debug(KERN_INFO, tun, "group set to %u\n",
  from_kgid(_user_ns, tun->group));
break;
@@ -3133,6 +3150,9 @@ static long __tun_chr_ioctl(struct file *file, unsigned 
int cmd,
break;
}
 
+   if (do_notify)
+   netdev_state_change(tun->dev);
+
 unlock:
rtnl_unlock();
if (tun)
-- 
2.16.2



[PATCH net 1/2] tun: set the flags before registering the netdevice

2018-04-10 Thread Sabrina Dubroca
Otherwise, register_netdevice advertises the creation of the device with
the default flags, instead of what the user requested.

Reported-by: Thomas Haller <thal...@redhat.com>
Fixes: 1ec010e70593 ("tun: export flags, uid, gid, queue information over 
netlink")
Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
---
 drivers/net/tun.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a1ba262f40ad..c9e68fd76a37 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2564,6 +2564,9 @@ static int tun_set_iff(struct net *net, struct file 
*file, struct ifreq *ifr)
 */
return 0;
}
+
+   tun->flags = (tun->flags & ~TUN_FEATURES) |
+ (ifr->ifr_flags & TUN_FEATURES);
}
else {
char *name;
@@ -2642,6 +2645,9 @@ static int tun_set_iff(struct net *net, struct file 
*file, struct ifreq *ifr)
 ~(NETIF_F_HW_VLAN_CTAG_TX |
   NETIF_F_HW_VLAN_STAG_TX);
 
+   tun->flags = (tun->flags & ~TUN_FEATURES) |
+ (ifr->ifr_flags & TUN_FEATURES);
+
INIT_LIST_HEAD(>disabled);
err = tun_attach(tun, file, false, ifr->ifr_flags & IFF_NAPI);
if (err < 0)
@@ -2656,9 +2662,6 @@ static int tun_set_iff(struct net *net, struct file 
*file, struct ifreq *ifr)
 
tun_debug(KERN_INFO, tun, "tun_set_iff\n");
 
-   tun->flags = (tun->flags & ~TUN_FEATURES) |
-   (ifr->ifr_flags & TUN_FEATURES);
-
/* Make sure persistent devices do not get stuck in
 * xoff state.
 */
-- 
2.16.2



[PATCH net] ip_gre: clear feature flags when incompatible o_flags are set

2018-04-10 Thread Sabrina Dubroca
Commit dd9d598c6657 ("ip_gre: add the support for i/o_flags update via
netlink") added the ability to change o_flags, but missed that the
GSO/LLTX features are disabled by default, and only enabled some gre
features are unused. Thus we also need to disable the GSO/LLTX features
on the device when the TUNNEL_SEQ or TUNNEL_CSUM flags are set.

These two examples should result in the same features being set:

ip link add gre_none type gre local 192.168.0.10 remote 192.168.0.20 ttl 
255 key 0

ip link set gre_none type gre seq
ip link add gre_seq type gre local 192.168.0.10 remote 192.168.0.20 ttl 255 
key 1 seq

Fixes: dd9d598c6657 ("ip_gre: add the support for i/o_flags update via netlink")
Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
---
 net/ipv4/ip_gre.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index a8772a978224..9c169bb2444d 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -781,8 +781,14 @@ static void ipgre_link_update(struct net_device *dev, bool 
set_mtu)
tunnel->encap.type == TUNNEL_ENCAP_NONE) {
dev->features |= NETIF_F_GSO_SOFTWARE;
dev->hw_features |= NETIF_F_GSO_SOFTWARE;
+   } else {
+   dev->features &= ~NETIF_F_GSO_SOFTWARE;
+   dev->hw_features &= ~NETIF_F_GSO_SOFTWARE;
}
dev->features |= NETIF_F_LLTX;
+   } else {
+   dev->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+   dev->features &= ~(NETIF_F_LLTX | NETIF_F_GSO_SOFTWARE);
}
 }
 
-- 
2.16.2



Re: [PATCH v14 net-next 09/12] crypto: chtls - Inline TLS record Tx

2018-03-29 Thread Sabrina Dubroca
2018-03-29, 21:27:51 +0530, Atul Gupta wrote:
> TLS handler for record transmit.
> Create Inline TLS work request and post to FW.
> Create Inline TLS record CPLs for hardware
> 
> Signed-off-by: Atul Gupta 
> Signed-off-by: Michael Werner 
> ---

...
> +int chtls_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
> +{
> + struct chtls_sock *csk = rcu_dereference_sk_user_data(sk);
> + struct chtls_dev *cdev = csk->cdev;
> + struct tcp_sock *tp = tcp_sk(sk);
> + struct sk_buff *skb;
> + int mss, flags, err;
> + int recordsz = 0;
> + int copied = 0;
> + int hdrlen = 0;
> + long timeo;
> +
> + lock_sock(sk);
> + flags = msg->msg_flags;
> + timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);
> +
> + if (!sk_in_state(sk, TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) {
> + err = sk_stream_wait_connect(sk, );
> + if (err)
> + goto out_err;
> + }
> +
> + if (sk->sk_prot->sendmsg != chtls_sendmsg) {

Can that actually happen? If so, how? AFAICT, this function is only
called when sk->sk_prot has been set to be chtls_cpl_prot.

> + release_sock(sk);
> + if (sk->sk_prot->sendmsg)
> + return sk->sk_prot->sendmsg(sk, msg, size);
> + else
> + return sk->sk_socket->ops->sendmsg(sk->sk_socket,
> +msg, size);
> + }

-- 
Sabrina


Re: [PATCH v14 net-next 08/12] crypto : chtls - CPL handler definition

2018-03-29 Thread Sabrina Dubroca
2018-03-29, 21:27:50 +0530, Atul Gupta wrote:
...
> +static void chtls_pass_accept_request(struct sock *sk,
> +   struct sk_buff *skb)
> +{
...
> + if (chtls_get_module(newsk))
> + goto reject;
> + inet_csk_reqsk_queue_added(sk);
> + reply_skb->sk = newsk;
> + chtls_install_cpl_ops(newsk);

Function defined in patch 11, declared in patch 6, and used in patch
8.  Are you actually listening to the comments we've been sending?

-- 
Sabrina


Re: [PATCH v13 net-next 02/12] ethtool: enable Inline TLS in HW

2018-03-27 Thread Sabrina Dubroca
2018-03-27, 23:06:31 +0530, Atul Gupta wrote:
> Ethtool option enables TLS record offload on HW, user
> configures the feature for netdev capable of Inline TLS.
> This allows user to define custom sk_prot for Inline TLS sock
> 
> Signed-off-by: Atul Gupta <atul.gu...@chelsio.com>
> Reviewed-by: Sabrina Dubroca <sdubr...@redhat.com>

uh, what? I definitely didn't give my "Reviewed-by" for any of these
patches. Please never do that again.

-- 
Sabrina


Re: [PATCH v11 crypto 06/12] crypto: chtls - structure and macro for Inline TLS

2018-03-18 Thread Sabrina Dubroca
2018-03-16, 21:07:35 +0530, Atul Gupta wrote:
[...]
> +#define SOCK_INLINE (31)
[...]

> +static inline int csk_flag(const struct sock *sk, enum csk_flags flag)
> +{
> + struct chtls_sock *csk = rcu_dereference_sk_user_data(sk);
> +
> + if (!sock_flag(sk, SOCK_INLINE))
> + return 0;
> + return test_bit(flag, >flags);
> +}

Should drivers really start defining their own socket flags?


> +static inline void set_queue(struct sk_buff *skb,
> +  unsigned int queue, const struct sock *sk)
> +{
> + skb->queue_mapping = queue;
> +}

That's skb_set_queue_mapping(), no need to define your own.

-- 
Sabrina


[PATCH net v2] ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu

2018-03-14 Thread Sabrina Dubroca
Prior to the rework of PMTU information storage in commit
2c8cec5c10bc ("ipv4: Cache learned PMTU information in inetpeer."),
when a PMTU event advertising a PMTU smaller than
net.ipv4.route.min_pmtu was received, we would disable setting the DF
flag on packets by locking the MTU metric, and set the PMTU to
net.ipv4.route.min_pmtu.

Since then, we don't disable DF, and set PMTU to
net.ipv4.route.min_pmtu, so the intermediate router that has this link
with a small MTU will have to drop the packets.

This patch reestablishes pre-2.6.39 behavior by splitting
rtable->rt_pmtu into a bitfield with rt_mtu_locked and rt_pmtu.
rt_mtu_locked indicates that we shouldn't set the DF bit on that path,
and is checked in ip_dont_fragment().

One possible workaround is to set net.ipv4.route.min_pmtu to a value low
enough to accommodate the lowest MTU encountered.

Fixes: 2c8cec5c10bc ("ipv4: Cache learned PMTU information in inetpeer.")
Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
Reviewed-by: Stefano Brivio <sbri...@redhat.com>
---
v2: make rt_pmtu a bitfield
fix missing initializations of rt_mtu_locked

 include/net/ip.h| 11 +--
 include/net/ip_fib.h|  1 +
 include/net/route.h |  3 ++-
 net/ipv4/route.c| 26 +++---
 net/ipv4/xfrm4_policy.c |  1 +
 5 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 746abff9ce51..f49b3a576bec 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -328,6 +328,13 @@ int ip_decrease_ttl(struct iphdr *iph)
return --iph->ttl;
 }
 
+static inline int ip_mtu_locked(const struct dst_entry *dst)
+{
+   const struct rtable *rt = (const struct rtable *)dst;
+
+   return rt->rt_mtu_locked || dst_metric_locked(dst, RTAX_MTU);
+}
+
 static inline
 int ip_dont_fragment(const struct sock *sk, const struct dst_entry *dst)
 {
@@ -335,7 +342,7 @@ int ip_dont_fragment(const struct sock *sk, const struct 
dst_entry *dst)
 
return  pmtudisc == IP_PMTUDISC_DO ||
(pmtudisc == IP_PMTUDISC_WANT &&
-!(dst_metric_locked(dst, RTAX_MTU)));
+!ip_mtu_locked(dst));
 }
 
 static inline bool ip_sk_accept_pmtu(const struct sock *sk)
@@ -361,7 +368,7 @@ static inline unsigned int ip_dst_mtu_maybe_forward(const 
struct dst_entry *dst,
struct net *net = dev_net(dst->dev);
 
if (net->ipv4.sysctl_ip_fwd_use_pmtu ||
-   dst_metric_locked(dst, RTAX_MTU) ||
+   ip_mtu_locked(dst) ||
!forwarding)
return dst_mtu(dst);
 
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index f80524396c06..77d0a78cf7d2 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -59,6 +59,7 @@ struct fib_nh_exception {
int fnhe_genid;
__be32  fnhe_daddr;
u32 fnhe_pmtu;
+   boolfnhe_mtu_locked;
__be32  fnhe_gw;
unsigned long   fnhe_expires;
struct rtable __rcu *fnhe_rth_input;
diff --git a/include/net/route.h b/include/net/route.h
index 1eb9ce470e25..3d21e3686838 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -63,7 +63,8 @@ struct rtable {
__be32  rt_gateway;
 
/* Miscellaneous cached information */
-   u32 rt_pmtu;
+   u32 rt_mtu_locked:1,
+   rt_pmtu:31;
 
u32 rt_table_id;
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 860b3fd2f54b..dc6909448f41 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -634,6 +634,7 @@ static inline u32 fnhe_hashfun(__be32 daddr)
 static void fill_route_from_fnhe(struct rtable *rt, struct fib_nh_exception 
*fnhe)
 {
rt->rt_pmtu = fnhe->fnhe_pmtu;
+   rt->rt_mtu_locked = fnhe->fnhe_mtu_locked;
rt->dst.expires = fnhe->fnhe_expires;
 
if (fnhe->fnhe_gw) {
@@ -644,7 +645,7 @@ static void fill_route_from_fnhe(struct rtable *rt, struct 
fib_nh_exception *fnh
 }
 
 static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
- u32 pmtu, unsigned long expires)
+ u32 pmtu, bool lock, unsigned long expires)
 {
struct fnhe_hash_bucket *hash;
struct fib_nh_exception *fnhe;
@@ -681,8 +682,10 @@ static void update_or_create_fnhe(struct fib_nh *nh, 
__be32 daddr, __be32 gw,
fnhe->fnhe_genid = genid;
if (gw)
fnhe->fnhe_gw = gw;
-   if (pmtu)
+   if (pmtu) {
fnhe->fnhe_pmtu = pmtu;
+   fnhe->fnhe_mtu_locked = lock;
+   }
fnhe->fnhe_expir

Re: [PATCH net] ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu

2018-03-12 Thread Sabrina Dubroca
2018-03-09, 16:06:19 -0500, David Miller wrote:
> From: Sabrina Dubroca <s...@queasysnail.net>
> Date: Fri,  9 Mar 2018 17:43:21 +0100
> 
> > diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> > index f80524396c06..77d0a78cf7d2 100644
> > --- a/include/net/ip_fib.h
> > +++ b/include/net/ip_fib.h
> > @@ -59,6 +59,7 @@ struct fib_nh_exception {
> > int fnhe_genid;
> > __be32  fnhe_daddr;
> > u32 fnhe_pmtu;
> > +   boolfnhe_mtu_locked;
> > __be32  fnhe_gw;
> > unsigned long   fnhe_expires;
> > struct rtable __rcu *fnhe_rth_input;
> > diff --git a/include/net/route.h b/include/net/route.h
> > index 1eb9ce470e25..729bb5e61e9a 100644
> > --- a/include/net/route.h
> > +++ b/include/net/route.h
> > @@ -64,6 +64,7 @@ struct rtable {
> >  
> > /* Miscellaneous cached information */
> > u32 rt_pmtu;
> > +   boolrt_mtu_locked;
> >  
> > u32 rt_table_id;
> >  
> 
> Please use a flag bit for this, we've worked hard to shrink these
> datastructures as much as possible.

Oops, sorry.

> I think if you just choose an unused RTCF_* bit (f.e. 0x0200) for
> the state, you can use that because values propagate into the
> rtable->rt_flags, and do not propagate out.  So you should be able to
> use it in this way privately inside the kernel.

What about a bitfield?

-   u32 rt_pmtu;
+   u32 rt_mtu_locked:1,
+   rt_pmtu:31;

Since it's going to be private to the kernel, I'd rather not use a
value that's in uapi, especially considering that they're almost all
used (unless we start recycling).

-- 
Sabrina


[PATCH net] ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu

2018-03-09 Thread Sabrina Dubroca
Prior to the rework of PMTU information storage in commit
2c8cec5c10bc ("ipv4: Cache learned PMTU information in inetpeer."),
when a PMTU event advertising a PMTU smaller than
net.ipv4.route.min_pmtu was received, we would disable setting the DF
flag on packets by locking the MTU metric, and set the PMTU to
net.ipv4.route.min_pmtu.

Since then, we don't disable DF, and set the PMTU to
net.ipv4.route.min_pmtu, so the intermediate router that has this link
with a small MTU will have to drop the packets.

This patch reestablishes pre-2.6.39 behavior by adding rt_mtu_locked
to rtable, to record that we shouldn't set the DF bit on that path, and
use this in ip_dont_fragment().

One possible workaround is to set net.ipv4.route.min_pmtu to a value low
enough to accommodate the lowest MTU encountered.

Fixes: 2c8cec5c10bc ("ipv4: Cache learned PMTU information in inetpeer.")
Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
Reviewed-by: Stefano Brivio <sbri...@redhat.com>
---
 include/net/ip.h | 11 +--
 include/net/ip_fib.h |  1 +
 include/net/route.h  |  1 +
 net/ipv4/route.c | 25 ++---
 4 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 746abff9ce51..f49b3a576bec 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -328,6 +328,13 @@ int ip_decrease_ttl(struct iphdr *iph)
return --iph->ttl;
 }
 
+static inline int ip_mtu_locked(const struct dst_entry *dst)
+{
+   const struct rtable *rt = (const struct rtable *)dst;
+
+   return rt->rt_mtu_locked || dst_metric_locked(dst, RTAX_MTU);
+}
+
 static inline
 int ip_dont_fragment(const struct sock *sk, const struct dst_entry *dst)
 {
@@ -335,7 +342,7 @@ int ip_dont_fragment(const struct sock *sk, const struct 
dst_entry *dst)
 
return  pmtudisc == IP_PMTUDISC_DO ||
(pmtudisc == IP_PMTUDISC_WANT &&
-!(dst_metric_locked(dst, RTAX_MTU)));
+!ip_mtu_locked(dst));
 }
 
 static inline bool ip_sk_accept_pmtu(const struct sock *sk)
@@ -361,7 +368,7 @@ static inline unsigned int ip_dst_mtu_maybe_forward(const 
struct dst_entry *dst,
struct net *net = dev_net(dst->dev);
 
if (net->ipv4.sysctl_ip_fwd_use_pmtu ||
-   dst_metric_locked(dst, RTAX_MTU) ||
+   ip_mtu_locked(dst) ||
!forwarding)
return dst_mtu(dst);
 
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index f80524396c06..77d0a78cf7d2 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -59,6 +59,7 @@ struct fib_nh_exception {
int fnhe_genid;
__be32  fnhe_daddr;
u32 fnhe_pmtu;
+   boolfnhe_mtu_locked;
__be32  fnhe_gw;
unsigned long   fnhe_expires;
struct rtable __rcu *fnhe_rth_input;
diff --git a/include/net/route.h b/include/net/route.h
index 1eb9ce470e25..729bb5e61e9a 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -64,6 +64,7 @@ struct rtable {
 
/* Miscellaneous cached information */
u32 rt_pmtu;
+   boolrt_mtu_locked;
 
u32 rt_table_id;
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 860b3fd2f54b..c53a33af953a 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -634,6 +634,7 @@ static inline u32 fnhe_hashfun(__be32 daddr)
 static void fill_route_from_fnhe(struct rtable *rt, struct fib_nh_exception 
*fnhe)
 {
rt->rt_pmtu = fnhe->fnhe_pmtu;
+   rt->rt_mtu_locked = fnhe->fnhe_mtu_locked;
rt->dst.expires = fnhe->fnhe_expires;
 
if (fnhe->fnhe_gw) {
@@ -644,7 +645,7 @@ static void fill_route_from_fnhe(struct rtable *rt, struct 
fib_nh_exception *fnh
 }
 
 static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
- u32 pmtu, unsigned long expires)
+ u32 pmtu, bool lock, unsigned long expires)
 {
struct fnhe_hash_bucket *hash;
struct fib_nh_exception *fnhe;
@@ -681,8 +682,10 @@ static void update_or_create_fnhe(struct fib_nh *nh, 
__be32 daddr, __be32 gw,
fnhe->fnhe_genid = genid;
if (gw)
fnhe->fnhe_gw = gw;
-   if (pmtu)
+   if (pmtu) {
fnhe->fnhe_pmtu = pmtu;
+   fnhe->fnhe_mtu_locked = lock;
+   }
fnhe->fnhe_expires = max(1UL, expires);
/* Update all cached dsts too */
rt = rcu_dereference(fnhe->fnhe_rth_input);
@@ -706,6 +709,7 @@ static void update_or_create_fnhe(struct fib_nh *nh, __be32 
daddr, __be32 gw,
fnhe->

Re: [PATCH v9 crypto 08/12] chtls: Key program

2018-03-07 Thread Sabrina Dubroca
2018-03-06, 21:09:27 +0530, Atul Gupta wrote:
[snip]
> +static int chtls_set_tcb_field(struct sock *sk, u16 word, u64 mask, u64 val)
> +{
> + struct chtls_sock *csk = rcu_dereference_sk_user_data(sk);
> + struct sk_buff *skb;
> + struct cpl_set_tcb_field *req;
> + struct ulptx_idata *sc;
> + unsigned int wrlen = roundup(sizeof(*req) + sizeof(*sc), 16);
> + unsigned int credits_needed = DIV_ROUND_UP(wrlen, 16);
> +
> + skb = alloc_skb(wrlen, GFP_ATOMIC);

I haven't followed the whole call chain, but does this really need to
be an atomic allocation?
Should this skb be charged to the socket's memory accounting?

> + if (!skb)
> + return -ENOMEM;
> +
> + __set_tcb_field(sk, skb, word, mask, val, 0, 1);
> + set_queue(skb, (csk->txq_idx << 1) | CPL_PRIORITY_DATA, sk);
> + csk->wr_credits -= credits_needed;
> + csk->wr_unacked += credits_needed;
> + enqueue_wr(csk, skb);
> + cxgb4_ofld_send(csk->egress_dev, skb);

Should the return value be checked?

> + return 0;
> +}


[snip]
> +static void *chtls_alloc_mem(unsigned long size)
> +{
> + void *p = kmalloc(size, GFP_KERNEL);
> +
> + if (!p)
> + p = vmalloc(size);
> + if (p)
> + memset(p, 0, size);
> + return p;
> +}

I think you replace this with kvzalloc().

> +static void chtls_free_mem(void *addr)
> +{
> + unsigned long p = (unsigned long)addr;
> +
> + if (p >= VMALLOC_START && p < VMALLOC_END)
> + vfree(addr);
> + else
> + kfree(addr);
> +}

Use kvfree().

> +/* TLS Key bitmap processing */
> +int chtls_init_kmap(struct chtls_dev *cdev, struct cxgb4_lld_info *lldi)
> +{
> + unsigned int num_key_ctx, bsize;
> +
> + num_key_ctx = (lldi->vr->key.size / TLS_KEY_CONTEXT_SZ);
> + bsize = BITS_TO_LONGS(num_key_ctx);
> +
> + cdev->kmap.size = num_key_ctx;
> + cdev->kmap.available = bsize;
> + cdev->kmap.addr = chtls_alloc_mem(sizeof(*cdev->kmap.addr) *
> +   bsize);
> + if (!cdev->kmap.addr)
> + return -1;

The return value of this function is never checked.

> +
> + cdev->kmap.start = lldi->vr->key.start;
> + spin_lock_init(>kmap.lock);
> + return 0;
> +}
> +
> +void chtls_free_kmap(struct chtls_dev *cdev)
> +{
> + if (cdev->kmap.addr)
> + chtls_free_mem(cdev->kmap.addr);
> +}

I think this wrapper function is not necessary.

> +static int get_new_keyid(struct chtls_sock *csk, u32 optname)
> +{
> + struct chtls_dev *cdev = csk->cdev;
> + struct chtls_hws *hws = >tlshws;
> + struct net_device *dev = csk->egress_dev;
> + struct adapter *adap = netdev2adap(dev);
> + int keyid;
> +
> + spin_lock_bh(>kmap.lock);
> + keyid = find_first_zero_bit(cdev->kmap.addr, cdev->kmap.size);
> + if (keyid < cdev->kmap.size) {
> + __set_bit(keyid, cdev->kmap.addr);
> + if (optname == TLS_RX)

TLS_RX only gets defined in patch 11, so the only reason you're not
breaking the build is because all these new files aren't getting
compiled until patch 12.

> + hws->rxkey = keyid;
> + else
> + hws->txkey = keyid;
> + atomic_inc(>chcr_stats.tls_key);
> + } else {
> + keyid = -1;
> + }
> + spin_unlock_bh(>kmap.lock);
> + return keyid;
> +}
> +

[snip]
> +int chtls_setkey(struct chtls_sock *csk, u32 keylen, u32 optname)
> +{
...
>+  skb = alloc_skb(len, GFP_KERNEL);
>+  if (!skb)
>+  return -ENOMEM;

This return code is also ignored by its caller.  Please review the
whole patchset for that problem.

Same question as before, does this need to be accounted?


> + /* ulptx command */
> + kwr->req.cmd = cpu_to_be32(ULPTX_CMD_V(ULP_TX_MEM_WRITE) |
> + T5_ULP_MEMIO_ORDER_V(1) |
> + T5_ULP_MEMIO_IMM_V(1));
> + kwr->req.len16 = cpu_to_be32((csk->tid << 8) |
> +   DIV_ROUND_UP(len - sizeof(kwr->wr), 16));
> + kwr->req.dlen = cpu_to_be32(ULP_MEMIO_DATA_LEN_V(klen >> 5));
> + kwr->req.lock_addr = cpu_to_be32(ULP_MEMIO_ADDR_V(keyid_to_addr
> + (cdev->kmap.start, keyid)));

Breaking this line in that way makes it really hard to read for
humans.

-- 
Sabrina


Re: [PATCH v9 crypto 06/12] cxgb4: LLD driver changes to enable TLS

2018-03-07 Thread Sabrina Dubroca
2018-03-06, 21:09:25 +0530, Atul Gupta wrote:
> Read FW capability. Read key area size. Dump the TLS record count.

That's not a really helpful commit message. Have a look at other
commit messages and try to be more descriptive.
It's also not clear if those changes belong together in one patch, but
I can't judge because the commit message is really too terse.

> Signed-off-by: Atul Gupta 
> ---
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 32 +---
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  |  7 ++
>  drivers/net/ethernet/chelsio/cxgb4/sge.c| 98 
> -
>  3 files changed, 126 insertions(+), 11 deletions(-)

[snip]
> +int cxgb4_immdata_send(struct net_device *dev, unsigned int idx,
> +const void *src, unsigned int len)
> +{
> + struct adapter *adap = netdev2adap(dev);
> + struct sge_uld_txq_info *txq_info;
> + struct sge_uld_txq *txq;
> + int ret;
> +
> + local_bh_disable();
> + txq_info = adap->sge.uld_txq_info[CXGB4_TX_OFLD];
> + if (unlikely(!txq_info)) {
> + WARN_ON(true);
> + return NET_XMIT_DROP;

Don't you need to local_bh_enable() before returning? If not, this
needs a comment to explain why.

> + }
> + txq = _info->uldtxq[idx];
> +
> + ret = ofld_xmit_direct(txq, src, len);
> + local_bh_enable();
> + return net_xmit_eval(ret);
> +}
> +EXPORT_SYMBOL(cxgb4_immdata_send);

-- 
Sabrina


Re: [PATCH v9 crypto 02/12] ethtool: enable Inline TLS in HW

2018-03-07 Thread Sabrina Dubroca
Since you're saying the driver supports offloading TLS records to the
HW, why not call the feature "record offloading"?  With, for example,
NETIF_F_HW_TLS_RECORD as the feature, and maybe "tls-hw-record" for
the ethtool string.  This "Inline TLS" name sounds rather like
marketing to me.

2018-03-06, 21:07:22 +0530, Atul Gupta wrote:
> Signed-off-by: Atul Gupta 

There should be a description here of what the feature covers, so that
other drivers can decide if it matches what their HW supports.

-- 
Sabrina


Re: [PATCH v9 crypto 00/12] Chelsio Inline TLS

2018-03-07 Thread Sabrina Dubroca
2018-03-06, 21:05:23 +0530, Atul Gupta wrote:
> Series for Chelsio Inline TLS driver (chtls)
> 
> Use tls ULP infrastructure to register chtls as Inline TLS driver.
> Chtls use TCP Sockets to transmit and receive TLS record. TCP proto_ops is 
> extended to offload TLS record.
> 
> T6 adapter provides the following features:
> -TLS record offload, TLS header, encrypt, digest and transmit
> -TLS record receive and decrypt
> -TLS keys store
> -TCP/IP engine
> -TLS engine
> -GCM crypto engine [support CBC also]
> 
> TLS provides security at the transport layer. It uses TCP to provide reliable 
> end-to-end transport of application data. It relies on TCP for any 
> retransmission. TLS session comprises of three parts:

Please wrap long lines.

[snip]

> v9: corrected __u8 and similar usage

That's not the only changes since v8, actually. There's also some new
code in patch 3.

-- 
Sabrina


Re: [PATCH v9 crypto 01/12] tls: tls_device struct to register TLS drivers

2018-03-07 Thread Sabrina Dubroca
Hello Atul,

One quick note before you start replying: please fix your email
client, as you've been told before. Quoting has to add a quoting
marker (the '>' character) at the beginning of the line, otherwise
it's impossible to separate your reply from the email you're quoting.

2018-03-06, 21:06:20 +0530, Atul Gupta wrote:
> tls_device structure to register Inline TLS drivers
> with net/tls
> 
> Signed-off-by: Atul Gupta 
> ---
>  include/net/tls.h | 26 ++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/include/net/tls.h b/include/net/tls.h
> index 4913430..9bfb91f 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
> @@ -55,6 +55,28 @@
>  #define TLS_RECORD_TYPE_DATA 0x17
>  
>  #define TLS_AAD_SPACE_SIZE   13
> +#define TLS_DEVICE_NAME_MAX  32

Why 32 characters?


> +enum {
> + TLS_BASE_TX,
> + TLS_SW_TX,
> + TLS_FULL_HW, /* TLS record processed Inline */
> + TLS_NUM_CONFIG,
> +};
> +
> +extern struct proto tls_prots[TLS_NUM_CONFIG];

Don't break bisection. The code has to compile after every
patch. These are already defined in net/tls/tls_main.c.

> +struct tls_device {
> + char name[TLS_DEVICE_NAME_MAX];

I could find only one use of it, in chtls_register_dev. Is this
actually needed?

> + struct list_head dev_list;
> +
> + /* netdev present in registered inline tls driver */
> + int (*netdev)(struct tls_device *device,
> +   struct net_device *netdev);

I was trying to figure out what this did, because the name is really
not explicit, and the comment doesn't make sense, but noticed it's
never actually called.

> + int (*feature)(struct tls_device *device);
> + int (*hash)(struct tls_device *device, struct sock *sk);
> + void (*unhash)(struct tls_device *device, struct sock *sk);

I think you should add a kerneldoc comment, like all the ndo_*
methods have.

> +};
>  
>  struct tls_sw_context {
>   struct crypto_aead *aead_send;
> @@ -115,6 +137,8 @@ struct tls_context {
>   int  (*getsockopt)(struct sock *sk, int level,
>  int optname, char __user *optval,
>  int __user *optlen);
> + int (*hash)(struct sock *sk);
> + void (*unhash)(struct sock *sk);
>  };
>  
>  int wait_on_pending_writer(struct sock *sk, long *timeo);
> @@ -256,5 +280,7 @@ static inline struct tls_offload_context *tls_offload_ctx(
>  
>  int tls_proccess_cmsg(struct sock *sk, struct msghdr *msg,
> unsigned char *record_type);
> +void tls_register_device(struct tls_device *device);
> +void tls_unregister_device(struct tls_device *device);

Prototype without implementation, please don't do that.  This happens
because you split your patchset so that each patch has all the changes
for exactly one file.

-- 
Sabrina


[PATCH net-next v2] ipv6: allow userspace to add IFA_F_OPTIMISTIC addresses

2018-02-28 Thread Sabrina Dubroca
According to RFC 4429 (section 3.1), adding new IPv6 addresses as
optimistic addresses is acceptable, as long as the implementation
follows some rules:

   * Optimistic DAD SHOULD only be used when the implementation is aware
that the address is based on a most likely unique interface
identifier (such as in [RFC2464]), generated randomly [RFC3041],
or by a well-distributed hash function [RFC3972] or assigned by
Dynamic Host Configuration Protocol for IPv6 (DHCPv6) [RFC3315].
Optimistic DAD SHOULD NOT be used for manually entered
addresses.

Thus, it seems reasonable to allow userspace to set the optimistic flag
when adding new addresses.

We must not let userspace set NODAD + OPTIMISTIC, since if the kernel is
not performing DAD we would never clear the optimistic flag. We must
also ignore userspace's request to add OPTIMISTIC flag to addresses that
have already completed DAD (addresses that don't have the TENTATIVE
flag, or that have the DADFAILED flag).

Then we also need to clear the OPTIMISTIC flag on permanent addresses
when DAD fails. Otherwise, IFA_F_OPTIMISTIC addresses added by userspace
can still be used after DAD has failed, because in
ipv6_chk_addr_and_flags(), IFA_F_OPTIMISTIC overrides IFA_F_TENTATIVE.

Setting IFA_F_OPTIMISTIC from userspace is conditional on
CONFIG_IPV6_OPTIMISTIC_DAD and the optimistic_dad sysctl.

Signed-off-by: Sabrina Dubroca <s...@queasysnail.net>
Reviewed-by: Stefano Brivio <sbri...@redhat.com>
---
v2:
 - add extack (David Ahern)
 - make all of this conditional on CONFIG_IPV6_OPTIMISTIC_DAD and the
   the optimistic_dad sysctl (David Ahern)
 - fixup for permanent addresses added to a device that is down (Stefano
   Brivio)
 - fixup condition for DAD already completed (Stefano Brivio)

 net/ipv6/addrconf.c | 35 ++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 4facfe0b1888..b5fd116c046a 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1459,6 +1459,21 @@ static bool ipv6_use_optimistic_addr(struct net *net,
 #endif
 }
 
+static bool ipv6_allow_optimistic_dad(struct net *net,
+ struct inet6_dev *idev)
+{
+#ifdef CONFIG_IPV6_OPTIMISTIC_DAD
+   if (!idev)
+   return false;
+   if (!net->ipv6.devconf_all->optimistic_dad && !idev->cnf.optimistic_dad)
+   return false;
+
+   return true;
+#else
+   return false;
+#endif
+}
+
 static int ipv6_get_saddr_eval(struct net *net,
   struct ipv6_saddr_score *score,
   struct ipv6_saddr_dst *dst,
@@ -1968,6 +1983,8 @@ static void addrconf_dad_stop(struct inet6_ifaddr *ifp, 
int dad_failed)
spin_lock_bh(>lock);
addrconf_del_dad_work(ifp);
ifp->flags |= IFA_F_TENTATIVE;
+   if (dad_failed)
+   ifp->flags &= ~IFA_F_OPTIMISTIC;
spin_unlock_bh(>lock);
if (dad_failed)
ipv6_ifa_notify(0, ifp);
@@ -4501,6 +4518,9 @@ static int inet6_addr_modify(struct inet6_ifaddr *ifp, 
u32 ifa_flags,
(ifp->flags & IFA_F_TEMPORARY || ifp->prefix_len != 64))
return -EINVAL;
 
+   if (!(ifp->flags & IFA_F_TENTATIVE) || ifp->flags & IFA_F_DADFAILED)
+   ifa_flags &= ~IFA_F_OPTIMISTIC;
+
timeout = addrconf_timeout_fixup(valid_lft, HZ);
if (addrconf_finite_timeout(timeout)) {
expires = jiffies_to_clock_t(timeout * HZ);
@@ -4574,6 +4594,7 @@ inet6_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr 
*nlh,
struct in6_addr *pfx, *peer_pfx;
struct inet6_ifaddr *ifa;
struct net_device *dev;
+   struct inet6_dev *idev;
u32 valid_lft = INFINITY_LIFE_TIME, preferred_lft = INFINITY_LIFE_TIME;
u32 ifa_flags;
int err;
@@ -4607,7 +4628,19 @@ inet6_rtm_newaddr(struct sk_buff *skb, struct nlmsghdr 
*nlh,
 
/* We ignore other flags so far. */
ifa_flags &= IFA_F_NODAD | IFA_F_HOMEADDRESS | IFA_F_MANAGETEMPADDR |
-IFA_F_NOPREFIXROUTE | IFA_F_MCAUTOJOIN;
+IFA_F_NOPREFIXROUTE | IFA_F_MCAUTOJOIN | IFA_F_OPTIMISTIC;
+
+   idev = ipv6_find_idev(dev);
+   if (IS_ERR(idev))
+   return PTR_ERR(idev);
+
+   if (!ipv6_allow_optimistic_dad(net, idev))
+   ifa_flags &= ~IFA_F_OPTIMISTIC;
+
+   if (ifa_flags & IFA_F_NODAD && ifa_flags & IFA_F_OPTIMISTIC) {
+   NL_SET_ERR_MSG(extack, "IFA_F_NODAD and IFA_F_OPTIMISTIC are 
mutually exclusive");
+   return -EINVAL;
+   }
 
ifa = ipv6_get_ifaddr(net, pfx, dev, 1);
if (!ifa) {
-- 
2.16.2



Re: [PATCH] net: ipv4: avoid unused variable warning for sysctl

2018-02-28 Thread Sabrina Dubroca
2018-02-28, 14:32:48 +0100, Arnd Bergmann wrote:
> The newly introudced ip_min_valid_pmtu variable is only used when
> CONFIG_SYSCTL is set:
> 
> net/ipv4/route.c:135:12: error: 'ip_min_valid_pmtu' defined but not used 
> [-Werror=unused-variable]
> 
> This moves it to the other variables like it, to avoid the harmless
> warning.
> 
> Fixes: c7272c2f1229 ("net: ipv4: don't allow setting net.ipv4.route.min_pmtu 
> below 68")
> Signed-off-by: Arnd Bergmann <a...@arndb.de>

Crap. Thanks, and sorry for the mess.

Acked-by: Sabrina Dubroca <s...@queasysnail.net>

-- 
Sabrina


Re: [PATCH net-next] ipv6: allow userspace to add IFA_F_OPTIMISTIC addresses

2018-02-27 Thread Sabrina Dubroca
2018-02-27, 10:47:08 -0500, David Miller wrote:
> From: Sabrina Dubroca <s...@queasysnail.net>
> Date: Tue, 27 Feb 2018 15:13:28 +0100
> 
> > 2018-02-26, 12:11:27 -0500, David Miller wrote:
> >> From: Sabrina Dubroca <s...@queasysnail.net>
> >> Date: Mon, 26 Feb 2018 17:56:19 +0100
> >> 
> >> That's completely different to this case, which is a bonfide explicit
> >> allowance for userspace to take over these fundamental protocol tasks
> >> from the kernel.
> > 
> > This is not letting userspace take over. On the contrary, it allows
> > userspace to take advantage of the kernel's DAD, without suffering the
> > delay. The alternative, without optimistic DAD, would be to completely
> > disable DAD done by the kernel.
> > 
> > This follows RFC 4429, which explicitly allows optimistic DAD for
> > addresses generated by (among other mechanisms) DHCPv6.
> 
> Fair enough.
> 
> We can resume this conversation if and when problems pop up in the
> future :-)

Hehe :)

Thanks, I'll submit a v2 with the changes (the other) David asked for.

-- 
Sabrina


  1   2   3   4   >