[net-next:master 1099/1116] net/ipv4/tunnel4.c:218:3: error: label 'err_mpls' used but not defined

2016-07-09 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   927265bc6cd6374c9bafc43408ece4e92311b149
commit: 8afe97e5d4165c9d181d42504af3f96c8427659a [1099/1116] tunnels: support MPLS over IPv4 tunnels
config: x86_64-randconfig-s4-07101153 (attached as .config)
compiler: gcc-6 (Debian 6.1.1-1) 6.1.1 20160430
reproduce:
git checkout 8afe97e5d4165c9d181d42504af3f96c8427659a
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   net/ipv4/tunnel4.c: In function 'tunnel4_init':
   net/ipv4/tunnel4.c:226:1: warning: label 'err_ipv6' defined but not used [-Wunused-label]
err_ipv6:
^~~~
>> net/ipv4/tunnel4.c:218:3: error: label 'err_mpls' used but not defined
  goto err_mpls;
  ^~~~

vim +/err_mpls +218 net/ipv4/tunnel4.c

   212  #if IS_ENABLED(CONFIG_IPV6)
   213  	if (inet_add_protocol(&tunnel64_protocol, IPPROTO_IPV6))
   214  		goto err_ipv6;
   215  #endif
   216  #if IS_ENABLED(CONFIG_MPLS)
   217  	if (inet_add_protocol(&tunnelmpls4_protocol, IPPROTO_MPLS))
 > 218  		goto err_mpls;
   219  #endif
   220  	return 0;
   221  
   222  #if IS_ENABLED(CONFIG_IPV6)
   223  err_mpls:
   224  	inet_del_protocol(&tunnel4_protocol, IPPROTO_IPV6);
   225  #endif
 > 226  err_ipv6:
   227  	inet_del_protocol(&tunnel4_protocol, IPPROTO_IPIP);
   228  err_ipip:
   229  	pr_err("%s: can't add protocol\n", __func__);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation




[PATCH net-next v2] tunnels: correct conditional build of MPLS and IPv6

2016-07-09 Thread Simon Horman
Using a combination of #if conditionals and goto labels to unwind
tunnel4_init seems unwieldy. This patch takes a simpler approach of
directly unregistering previously registered protocols when an error
occurs.

This fixes a number of problems with the current implementation
including the potential presence of labels when they are unused
and the potential absence of unregister code when it is needed.

Fixes: 8afe97e5d416 ("tunnels: support MPLS over IPv4 tunnels")
Signed-off-by: Simon Horman 
---
 net/ipv4/tunnel4.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/tunnel4.c b/net/ipv4/tunnel4.c
index 45cd4253583a..ec35eaa5c029 100644
--- a/net/ipv4/tunnel4.c
+++ b/net/ipv4/tunnel4.c
@@ -208,24 +208,25 @@ static const struct net_protocol tunnelmpls4_protocol = {
 static int __init tunnel4_init(void)
 {
 	if (inet_add_protocol(&tunnel4_protocol, IPPROTO_IPIP))
-		goto err_ipip;
+		goto err;
 #if IS_ENABLED(CONFIG_IPV6)
-	if (inet_add_protocol(&tunnel64_protocol, IPPROTO_IPV6))
-		goto err_ipv6;
+	if (inet_add_protocol(&tunnel64_protocol, IPPROTO_IPV6)) {
+		inet_del_protocol(&tunnel4_protocol, IPPROTO_IPIP);
+		goto err;
+	}
 #endif
 #if IS_ENABLED(CONFIG_MPLS)
-	if (inet_add_protocol(&tunnelmpls4_protocol, IPPROTO_MPLS))
-		goto err_mpls;
+	if (inet_add_protocol(&tunnelmpls4_protocol, IPPROTO_MPLS)) {
+		inet_del_protocol(&tunnel4_protocol, IPPROTO_IPIP);
+#if IS_ENABLED(CONFIG_IPV6)
+		inet_del_protocol(&tunnel64_protocol, IPPROTO_IPV6);
+#endif
+		goto err;
+	}
 #endif
 	return 0;
 
-#if IS_ENABLED(CONFIG_IPV6)
-err_mpls:
-	inet_del_protocol(&tunnel4_protocol, IPPROTO_IPV6);
-#endif
-err_ipv6:
-	inet_del_protocol(&tunnel4_protocol, IPPROTO_IPIP);
-err_ipip:
+err:
 	pr_err("%s: can't add protocol\n", __func__);
 	return -EAGAIN;
 }
-- 
2.7.0.rc3.207.g0ac5344
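
For readers following the thread, the unwind style used in v2 can be sketched outside the kernel. This is only an illustration under stated assumptions: add_protocol(), del_protocol(), fail_on and the PROTO_* slots are hypothetical stand-ins for inet_add_protocol()/inet_del_protocol(), and runtime booleans replace the #if IS_ENABLED() blocks so that every configuration can be exercised from one binary.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for the protocol table: each "protocol" is a
 * slot, and fail_on lets a caller force one registration to fail. */
enum { PROTO_IPIP, PROTO_IPV6, PROTO_MPLS, PROTO_MAX };

bool registered[PROTO_MAX];
int fail_on = -1;

int add_protocol(int proto)
{
	if (proto == fail_on)
		return -1;
	registered[proto] = true;
	return 0;
}

void del_protocol(int proto)
{
	registered[proto] = false;
}

/* Same shape as the v2 tunnel4_init(): every failure branch undoes
 * exactly what was registered before it, then jumps to a single label.
 * have_ipv6/have_mpls play the role of the CONFIG_* options. */
int tunnel_init_sketch(bool have_ipv6, bool have_mpls)
{
	if (add_protocol(PROTO_IPIP))
		goto err;
	if (have_ipv6 && add_protocol(PROTO_IPV6)) {
		del_protocol(PROTO_IPIP);
		goto err;
	}
	if (have_mpls && add_protocol(PROTO_MPLS)) {
		del_protocol(PROTO_IPIP);
		if (have_ipv6)
			del_protocol(PROTO_IPV6);
		goto err;
	}
	return 0;
err:
	fprintf(stderr, "%s: can't add protocol\n", __func__);
	return -1;
}

/* Number of slots still registered; 0 is expected after a failed init. */
int count_registered(void)
{
	int i, n = 0;

	for (i = 0; i < PROTO_MAX; i++)
		n += registered[i];
	return n;
}
```

Because no label sits inside a conditional block, there is nothing that can end up "used but not defined" or "defined but not used" for any combination of options, which is the class of build failure the robot reported.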



Re: [PATCH] tunnels: correct conditional build of MPLS

2016-07-09 Thread Simon Horman
On Sun, Jul 10, 2016 at 09:20:54AM +0900, Simon Horman wrote:
> * If MPLS is not enabled then err_mpls is not needed
> * If IPV6 is not enabled then unregistering IPPROTO_IPV6 is not
>   needed in the error path and err_ipv6 is not needed
> * When unregistering IPPROTO_IPV6 the tunnel64_protocol structure
>   rather than the tunnel4_protocol structure should be used.
> * If neither MPLS nor IPV6 are enabled then unregistering IPPROTO_IPIP
>   is not needed in the error path
> 
> Fixes: 8afe97e5d416 ("tunnels: support MPLS over IPv4 tunnels")
> Signed-off-by: Simon Horman 

I realised this still has some problems. I will send v2.

> ---
>  net/ipv4/tunnel4.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/tunnel4.c b/net/ipv4/tunnel4.c
> index 45cd4253583a..1dd098b890a7 100644
> --- a/net/ipv4/tunnel4.c
> +++ b/net/ipv4/tunnel4.c
> @@ -219,12 +219,16 @@ static int __init tunnel4_init(void)
>  #endif
>   return 0;
>  
> -#if IS_ENABLED(CONFIG_IPV6)
> +#if IS_ENABLED(CONFIG_MPLS)
>  err_mpls:
> - inet_del_protocol(&tunnel4_protocol, IPPROTO_IPV6);
>  #endif
> +#if IS_ENABLED(CONFIG_IPV6)
> + inet_del_protocol(&tunnel64_protocol, IPPROTO_IPV6);
>  err_ipv6:
> +#endif
> +#if IS_ENABLED(CONFIG_MPLS) || IS_ENABLED(CONFIG_IPV6)
>   inet_del_protocol(&tunnel4_protocol, IPPROTO_IPIP);
> +#endif
>  err_ipip:
>   pr_err("%s: can't add protocol\n", __func__);
>   return -EAGAIN;
> -- 
> 2.7.0.rc3.207.g0ac5344
> 


[PATCH] tunnels: correct conditional build of MPLS

2016-07-09 Thread Simon Horman
* If MPLS is not enabled then err_mpls is not needed
* If IPV6 is not enabled then unregistering IPPROTO_IPV6 is not
  needed in the error path and err_ipv6 is not needed
* When unregistering IPPROTO_IPV6 the tunnel64_protocol structure
  rather than the tunnel4_protocol structure should be used.
* If neither MPLS nor IPV6 are enabled then unregistering IPPROTO_IPIP
  is not needed in the error path

Fixes: 8afe97e5d416 ("tunnels: support MPLS over IPv4 tunnels")
Signed-off-by: Simon Horman 
---
 net/ipv4/tunnel4.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tunnel4.c b/net/ipv4/tunnel4.c
index 45cd4253583a..1dd098b890a7 100644
--- a/net/ipv4/tunnel4.c
+++ b/net/ipv4/tunnel4.c
@@ -219,12 +219,16 @@ static int __init tunnel4_init(void)
 #endif
 	return 0;
 
-#if IS_ENABLED(CONFIG_IPV6)
+#if IS_ENABLED(CONFIG_MPLS)
 err_mpls:
-	inet_del_protocol(&tunnel4_protocol, IPPROTO_IPV6);
 #endif
+#if IS_ENABLED(CONFIG_IPV6)
+	inet_del_protocol(&tunnel64_protocol, IPPROTO_IPV6);
 err_ipv6:
+#endif
+#if IS_ENABLED(CONFIG_MPLS) || IS_ENABLED(CONFIG_IPV6)
 	inet_del_protocol(&tunnel4_protocol, IPPROTO_IPIP);
+#endif
 err_ipip:
 	pr_err("%s: can't add protocol\n", __func__);
 	return -EAGAIN;
-- 
2.7.0.rc3.207.g0ac5344



Re: [PATCH net-next v2 0/4] net: support MPLS in IPv4 and UDP

2016-07-09 Thread Simon Horman
On Sat, Jul 09, 2016 at 05:47:32PM -0400, David Miller wrote:
> From: Simon Horman 
> Date: Thu,  7 Jul 2016 07:56:11 +0200
> 
> > This short series provides support for MPLS in IPv4 (RFC4023), and by
> > virtue of FOU, MPLS in UDP (RFC7510).
> 
> Series applied, thanks Simon.

Thanks.

It looks like I messed up the error path in
"tunnels: support MPLS over IPv4 tunnels" as flagged by the
kbuild test robot.

I'll get a fix for that to you soon.


[net-next:master 1099/1116] net/ipv4/tunnel4.c:226:1: warning: label 'err_ipv6' defined but not used

2016-07-09 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   927265bc6cd6374c9bafc43408ece4e92311b149
commit: 8afe97e5d4165c9d181d42504af3f96c8427659a [1099/1116] tunnels: support MPLS over IPv4 tunnels
config: m32r-usrv_defconfig (attached as .config)
compiler: m32r-linux-gcc (GCC) 4.9.0
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
git checkout 8afe97e5d4165c9d181d42504af3f96c8427659a
# save the attached .config to linux build tree
make.cross ARCH=m32r 

All warnings (new ones prefixed by >>):

   net/ipv4/tunnel4.c: In function 'tunnel4_init':
>> net/ipv4/tunnel4.c:226:1: warning: label 'err_ipv6' defined but not used [-Wunused-label]
err_ipv6:
^

vim +/err_ipv6 +226 net/ipv4/tunnel4.c

   210  	if (inet_add_protocol(&tunnel4_protocol, IPPROTO_IPIP))
   211  		goto err_ipip;
   212  #if IS_ENABLED(CONFIG_IPV6)
   213  	if (inet_add_protocol(&tunnel64_protocol, IPPROTO_IPV6))
   214  		goto err_ipv6;
   215  #endif
   216  #if IS_ENABLED(CONFIG_MPLS)
   217  	if (inet_add_protocol(&tunnelmpls4_protocol, IPPROTO_MPLS))
   218  		goto err_mpls;
   219  #endif
   220  	return 0;
   221  
   222  #if IS_ENABLED(CONFIG_IPV6)
   223  err_mpls:
   224  	inet_del_protocol(&tunnel4_protocol, IPPROTO_IPV6);
   225  #endif
 > 226  err_ipv6:
   227  	inet_del_protocol(&tunnel4_protocol, IPPROTO_IPIP);
   228  err_ipip:
   229  	pr_err("%s: can't add protocol\n", __func__);
   230  	return -EAGAIN;
   231  }
   232  
   233  static void __exit tunnel4_fini(void)
   234  {



Re: [PATCH net] ipv6: addrconf: fix Juniper SSL VPN client regression

2016-07-09 Thread Bjørn Mork
Hannes Frederic Sowa  writes:
> On Sat, Jul 9, 2016, at 23:13, Bjørn Mork wrote:
>> The Juniper SSL VPN client uses a "tun" interface and seems to
>> be picky about visible changes to it. Commit cc9da6cc4f56
>> ("ipv6: addrconf: use stable address generator for ARPHRD_NONE")
>> made such interfaces get an auto-generated IPv6 link local address
>> by default, similar to most other interface types. This made the
>> Juniper SSL VPN client fail for unknown reasons.
>> 
>> Fixing this regression by effectively reverting the behaviour to
>> what we had before, while keeping the new "addrgenmode random"
>> feature.
>
> I wonder if we can simply add a flag, something like
> IFF_SUPPRESS_AUTO_IPV6_LL, to net_device->priv_flags and use that. So we
> can keep behavior for qmi, vxlan-gpe and gre. tun is the only device
> that is really user space facing, so maybe we just limit it to this?

Sounds good to me, but I don't know if the use case really qualifies as

 "* You should have a pretty good reason to be extending these flags."


The automatic address is certainly nice to have, but "good reason"?  I
don't know...  We can always just configure those devices for automatic
LL addresses using "ip link set foo addrgen random" or similar.


Bjørn


[net-next:master 1099/1116] net/ipv4/tunnel4.c:223:1: warning: label 'err_mpls' defined but not used

2016-07-09 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git master
head:   927265bc6cd6374c9bafc43408ece4e92311b149
commit: 8afe97e5d4165c9d181d42504af3f96c8427659a [1099/1116] tunnels: support MPLS over IPv4 tunnels
config: arm-at91_dt_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 5.3.1-8) 5.3.1 20160205
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
git checkout 8afe97e5d4165c9d181d42504af3f96c8427659a
# save the attached .config to linux build tree
make.cross ARCH=arm 

All warnings (new ones prefixed by >>):

   net/ipv4/tunnel4.c: In function 'tunnel4_init':
>> net/ipv4/tunnel4.c:223:1: warning: label 'err_mpls' defined but not used [-Wunused-label]
err_mpls:
^

vim +/err_mpls +223 net/ipv4/tunnel4.c

   207  
   208  static int __init tunnel4_init(void)
   209  {
   210  	if (inet_add_protocol(&tunnel4_protocol, IPPROTO_IPIP))
   211  		goto err_ipip;
   212  #if IS_ENABLED(CONFIG_IPV6)
   213  	if (inet_add_protocol(&tunnel64_protocol, IPPROTO_IPV6))
   214  		goto err_ipv6;
   215  #endif
   216  #if IS_ENABLED(CONFIG_MPLS)
   217  	if (inet_add_protocol(&tunnelmpls4_protocol, IPPROTO_MPLS))
   218  		goto err_mpls;
   219  #endif
   220  	return 0;
   221  
   222  #if IS_ENABLED(CONFIG_IPV6)
 > 223  err_mpls:
   224  	inet_del_protocol(&tunnel4_protocol, IPPROTO_IPV6);
   225  #endif
   226  err_ipv6:
   227  	inet_del_protocol(&tunnel4_protocol, IPPROTO_IPIP);
   228  err_ipip:
   229  	pr_err("%s: can't add protocol\n", __func__);
   230  	return -EAGAIN;
   231  }



Re: [PATCH net] ipv6: addrconf: fix Juniper SSL VPN client regression

2016-07-09 Thread Hannes Frederic Sowa
Hi Bjorn,

On Sat, Jul 9, 2016, at 23:13, Bjørn Mork wrote:
> The Juniper SSL VPN client uses a "tun" interface and seems to
> be picky about visible changes to it. Commit cc9da6cc4f56
> ("ipv6: addrconf: use stable address generator for ARPHRD_NONE")
> made such interfaces get an auto-generated IPv6 link local address
> by default, similar to most other interface types. This made the
> Juniper SSL VPN client fail for unknown reasons.
> 
> Fixing this regression by effectively reverting the behaviour to
> what we had before, while keeping the new "addrgenmode random"
> feature.

I wonder if we can simply add a flag, something like
IFF_SUPPRESS_AUTO_IPV6_LL, to net_device->priv_flags and use that. So we
can keep behavior for qmi, vxlan-gpe and gre. tun is the only device
that is really user space facing, so maybe we just limit it to this?

Thanks,
Hannes


Re: [PATCH net] dccp: avoid deadlock in dccp_v4_ctl_send_reset

2016-07-09 Thread David Miller
From: Eric Dumazet 
Date: Fri, 08 Jul 2016 11:03:57 +0200

> From: Eric Dumazet 
> 
> In the prep work I did before enabling BH while handling socket backlog,
> I missed two points in DCCP:
> 
> 1) dccp_v4_ctl_send_reset() uses bh_lock_sock(), assuming BH was
> blocked. That is no longer always true.
> 
> 2) dccp_v4_route_skb() was using __IP_INC_STATS() instead of
>   IP_INC_STATS()
> 
> A similar fix was done for TCP, in commit 47dcc20a39d0
> ("ipv4: tcp: ip_send_unicast_reply() is not BH safe")
> 
> Fixes: 7309f8821fd6 ("dccp: do not assume DCCP code is non preemptible")
> Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
> Signed-off-by: Eric Dumazet 
> Reported-by: Dmitry Vyukov 

Applied and queued up for -stable, thanks.


Re: [PATCH net-next] ipv6: do not abuse GFP_ATOMIC in inet6_netconf_notify_devconf()

2016-07-09 Thread David Miller
From: Eric Dumazet 
Date: Fri, 08 Jul 2016 05:46:04 +0200

> From: Eric Dumazet 
> 
> All inet6_netconf_notify_devconf() callers are in process context,
> so we can use GFP_KERNEL allocations if we take care of not holding
> a rwlock while not needed in ip6mr (we hold RTNL there)
> 
> 
> Fixes: d67b8c616b48 ("netconf: advertise mc_forwarding status")
> Fixes: f3a1bfb11ccb ("rtnl/ipv6: use netconf msg to advertise forwarding status")
> Signed-off-by: Eric Dumazet 
> Cc: Nicolas Dichtel 

Applied.


Re: [PATCH net-next] ipv4: do not abuse GFP_ATOMIC in inet_netconf_notify_devconf()

2016-07-09 Thread David Miller
From: Eric Dumazet 
Date: Fri, 08 Jul 2016 05:18:24 +0200

> From: Eric Dumazet 
> 
> inet_forward_change() runs with RTNL held.
> We are allowed to sleep if required.
> 
> If we use __in_dev_get_rtnl() instead of __in_dev_get_rcu(),
> we no longer have to use GFP_ATOMIC allocations in
> inet_netconf_notify_devconf(), meaning we are less likely to miss
> notifications under memory pressure, and won't touch precious memory
> reserves either and risk dropping incoming packets.
> 
> inet_netconf_get_devconf() can also use GFP_KERNEL allocation.
> 
> Fixes: edc9e748934c ("rtnl/ipv4: use netconf msg to advertise forwarding status")
> Fixes: 9e5511106f99 ("rtnl/ipv4: add support of RTM_GETNETCONF")
> Signed-off-by: Eric Dumazet 
> Cc: Nicolas Dichtel 

Applied.


Re: [PATCH v2 0/6] net: ethernet: bgmac: Add platform device support

2016-07-09 Thread David Miller
From: Jon Mason 
Date: Thu,  7 Jul 2016 19:08:52 -0400

> David Miller, Please consider including patches 1-5 in net-next

Done.


Re: [PATCH] Need proper type casting before assignment, Remove compilation Warning.

2016-07-09 Thread David Miller
From: Arvind Yadav 
Date: Fri,  8 Jul 2016 00:07:54 +0530

> - The return type of 'qe_muram_alloc' is 'unsigned long', but its result
> was being assigned to ucc_fast_tx_virtual_fifo_base_offset and
> ucc_fast_rx_virtual_fifo_base_offset, which are 'unsigned int'.
> A proper type cast is needed before the assignment.
> 
> - Passing the value to IS_ERR_VALUE() is wrong, as they pass an 'int'
> into a function that takes an 'unsigned long' argument. This happens
> to work because the type is sign-extended on 64-bit architectures
> before it gets converted into an unsigned type.
> 
> - Passing an 'unsigned short' or 'unsigned int' argument into
> IS_ERR_VALUE() is guaranteed to be broken, as are 8-bit integers
> and types that are wider than 'unsigned long'.
> 
> - Any caller that does not pass an 'unsigned long' argument will get
> a compilation warning.
> 
> Signed-off-by: Arvind Yadav 

Your subject line is improperly formed.

It must have the subsystem or driver name, followed by a colon ":"
and a space.  Such as:

[PATCH] ucc_geth: Need proper type ...
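
The width problem described in the quoted changelog can be reproduced in user space. The sketch below is an assumption-laden illustration, not the kernel's code: IS_ERR_VALUE_SKETCH() is a simplified stand-in for IS_ERR_VALUE(), and -12 stands for -ENOMEM as an allocator such as qe_muram_alloc() might return it.

```c
#define MAX_ERRNO 4095

/* Simplified stand-in for the kernel macro: the argument is converted to
 * unsigned long and compared against the top MAX_ERRNO values. */
#define IS_ERR_VALUE_SKETCH(x) \
	((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

/* Error code kept in an unsigned long: always detected. */
int check_as_ulong(void)
{
	unsigned long v = (unsigned long)-12L;

	return IS_ERR_VALUE_SKETCH(v);
}

/* Same error in a signed int: sign-extended on conversion, so it
 * "happens to work". */
int check_as_int(void)
{
	int v = -12;

	return IS_ERR_VALUE_SKETCH(v);
}

/* Same error truncated into an unsigned int: zero-extended on conversion.
 * Where unsigned long is wider than unsigned int, the value becomes
 * 0x00000000fffffff4 and the error is missed. */
int check_as_uint(void)
{
	unsigned int v = (unsigned int)-12;

	return IS_ERR_VALUE_SKETCH(v);
}
```

On a platform where unsigned long is 64 bits and unsigned int is 32 bits, check_as_uint() returns 0 while the other two return 1. Where both types have the same width, all three return 1, which is why the bug can go unnoticed on 32-bit builds.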



Re: [net-next PATCH V2] net: tracepoint napi:napi_poll add work and budget

2016-07-09 Thread David Miller
From: Jesper Dangaard Brouer 
Date: Thu, 07 Jul 2016 18:01:32 +0200

> An important piece of information for the napi_poll tracepoint is
> the work done (packets processed) by the napi_poll() call. Add
> both the work done and budget, as they are related.
> 
> Handle the trace_napi_poll() parameter change in dropwatch/drop_monitor
> and in the python perf script netdev-times.py in a backward-compatible
> way, as python fortunately supports optional parameter handling.
> 
> Signed-off-by: Jesper Dangaard Brouer 

Applied, thanks.


Re: [PATCH net-next 0/3] r8152: remove the redundant code

2016-07-09 Thread David Miller
From: Hayes Wang 
Date: Thu, 7 Jul 2016 15:09:17 +0800

> Remove the unnecessary code.

Series applied.


Re: [PATCH net-next v2 0/4] net: support MPLS in IPv4 and UDP

2016-07-09 Thread David Miller
From: Simon Horman 
Date: Thu,  7 Jul 2016 07:56:11 +0200

> This short series provides support for MPLS in IPv4 (RFC4023), and by
> virtue of FOU, MPLS in UDP (RFC7510).

Series applied, thanks Simon.


Re: [PATCH net v2 0/4] ibmvnic driver bugfixes and improvements

2016-07-09 Thread David Miller
From: Thomas Falcon 
Date: Fri,  8 Jul 2016 02:09:02 -0500

> Miscellaneous fixes and improvements on the ibmvnic driver.

Series applied.


Re: [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program

2016-07-09 Thread Or Gerlitz
On Sat, Jul 9, 2016 at 10:58 PM, Saeed Mahameed
 wrote:
> On Fri, Jul 8, 2016 at 5:15 AM, Brenden Blanco  wrote:
>> Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
>>
>> In tc/socket bpf programs, helpers linearize skb fragments as needed
>> when the program touches the packet data. However, in the pursuit of
>> speed, XDP programs will not be allowed to use these slower functions,
>> especially if it involves allocating an skb.
>>
>> Therefore, disallow MTU settings that would produce a multi-fragment
>> packet that XDP programs would fail to access. Future enhancements could
>> be done to increase the allowable MTU.
>>
>> Signed-off-by: Brenden Blanco 
>> ---
>>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 38 
>> ++
>>  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 36 +---
>>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  5 
>>  3 files changed, 75 insertions(+), 4 deletions(-)
>>
> [...]
>> +   /* A bpf program gets first chance to drop the packet. It may
>> +* read bytes but not past the end of the frag.
>> +*/
>> +   if (prog) {
>> +   struct xdp_buff xdp;
>> +   dma_addr_t dma;
>> +   u32 act;
>> +
>> +   dma = be64_to_cpu(rx_desc->data[0].addr);
>> +   dma_sync_single_for_cpu(priv->ddev, dma,
>> +   priv->frag_info[0].frag_size,
>> +   DMA_FROM_DEVICE);
>
> In case of XDP_PASS we will dma_sync again in the normal path,

yep, correct

> this can be improved by doing the dma_sync as soon as we can and once and
> for all, regardless of the path the packet is going to take
> (XDP_DROP/mlx4_en_complete_rx_desc/mlx4_en_rx_skb).

How would you envision doing this in a way that is not very ugly?


>> +
>> +   xdp.data = page_address(frags[0].page) +
>> +   frags[0].page_offset;
>> +   xdp.data_end = xdp.data + length;
>> +
>> +   act = bpf_prog_run_xdp(prog, &xdp);
>> +   switch (act) {
>> +   case XDP_PASS:
>> +   break;
>> +   default:
>> +   bpf_warn_invalid_xdp_action(act);
>> +   case XDP_DROP:
>> +   goto next;
>
> The drop action here (goto next) will release the current rx_desc
> buffers and use new ones to refill. I know that the mlx4 rx scheme
> only releases/allocates new pages once every ~32 packets, but one
> improvement that could really help here, especially for XDP_DROP
> benchmarks, is to reuse the current rx_desc buffers when the packet
> is going to be dropped.

> Considering that the mlx4 rx buffer scheme might not allow gaps, maybe
> this can be added later as a future improvement for the whole mlx4 rx
> data path drop decisions.

yes, I think it makes sense to look at this as a future improvement.


[PATCH net] ipv6: addrconf: fix Juniper SSL VPN client regression

2016-07-09 Thread Bjørn Mork
The Juniper SSL VPN client uses a "tun" interface and seems to
be picky about visible changes to it. Commit cc9da6cc4f56
("ipv6: addrconf: use stable address generator for ARPHRD_NONE")
made such interfaces get an auto-generated IPv6 link local address
by default, similar to most other interface types. This made the
Juniper SSL VPN client fail for unknown reasons.

Fixing this regression by effectively reverting the behaviour to
what we had before, while keeping the new "addrgenmode random"
feature.

This will cause a regression for any userspace application which
has started to depend on the new behaviour.  But it is still
considered the lesser evil, considering the short period this
behaviour has been default.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=121131
Fixes: cc9da6cc4f56 ("ipv6: addrconf: use stable address generator for ARPHRD_NONE")
Reported-by: Valdis Kletnieks 
Reported-and-tested-by: Jonas Lippuner 
Cc: Hannes Frederic Sowa 
Cc: 吉藤英明 
Signed-off-by: Bjørn Mork 
---
Valdis,

I know you reported this regression back in April, but I never
saw any answers to Hannes' and my follow-up questions so I forgot
all about it.  Sorry about that.  Could you test this patch and
see if it works for you too?


Bjørn

 net/ipv6/addrconf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 47f837a58e0a..34e80c6aa810 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -3120,7 +3120,7 @@ static void addrconf_dev_config(struct net_device *dev)
/* this device type has no EUI support */
if (dev->type == ARPHRD_NONE &&
idev->addr_gen_mode == IN6_ADDR_GEN_MODE_EUI64)
-   idev->addr_gen_mode = IN6_ADDR_GEN_MODE_RANDOM;
+   idev->addr_gen_mode = IN6_ADDR_GEN_MODE_NONE;
 
addrconf_addr_gen(idev, false);
 }
-- 
2.8.1
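
The behaviour change in the one-line patch above can be modelled in a few lines. This is a rough sketch under stated assumptions: the enum and function names are local stand-ins that only mirror the kernel's IN6_ADDR_GEN_MODE_* and ARPHRD_* names, and the "patched" parameter is a hypothetical switch for comparing the two behaviours side by side.

```c
/* Local stand-ins; names mirror the kernel's, values are illustrative. */
enum addr_gen_mode {
	GEN_MODE_EUI64,
	GEN_MODE_NONE,
	GEN_MODE_STABLE_PRIVACY,
	GEN_MODE_RANDOM,
};

enum dev_type {
	DEV_ETHER,
	DEV_NONE,	/* e.g. a "tun" interface */
};

/* Mode actually used when the device is configured. The device type has
 * no EUI support, so a configured EUI64 mode must be replaced: before
 * the patch with RANDOM, after the patch with NONE. */
enum addr_gen_mode effective_mode(enum dev_type type,
				  enum addr_gen_mode configured,
				  int patched)
{
	if (type == DEV_NONE && configured == GEN_MODE_EUI64)
		return patched ? GEN_MODE_NONE : GEN_MODE_RANDOM;
	return configured;
}

/* Whether a link-local address is generated automatically. */
int gets_auto_link_local(enum addr_gen_mode mode)
{
	return mode != GEN_MODE_NONE;
}
```

Only the default changes: an interface that was explicitly set to the random mode still maps to GEN_MODE_RANDOM in this model, which matches the stated intent of keeping the new "addrgenmode random" feature.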



Re: [PATCH v6 05/12] Add sample for adding simple drop program to link

2016-07-09 Thread Saeed Mahameed
On Fri, Jul 8, 2016 at 5:15 AM, Brenden Blanco  wrote:
> Add a sample program that only drops packets at the BPF_PROG_TYPE_XDP_RX
> hook of a link. With the drop-only program, observed single core rate is
> ~20Mpps.
>
> Other tests were run, for instance without the dropcnt increment or
> without reading from the packet header, the packet rate was mostly
> unchanged.
>
> $ perf record -a samples/bpf/xdp1 $( proto 17:   20403027 drops/s
>
> ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
> Running... ctrl^C to stop
> Device: eth4@0
> Result: OK: 11791017(c11788327+d2689) usec, 59622913 (60byte,0frags)
>   5056638pps 2427Mb/sec (2427186240bps) errors: 0
> Device: eth4@1
> Result: OK: 11791012(c11787906+d3106) usec, 60526944 (60byte,0frags)
>   5133311pps 2463Mb/sec (2463989280bps) errors: 0
> Device: eth4@2
> Result: OK: 11791019(c11788249+d2769) usec, 59868091 (60byte,0frags)
>   5077431pps 2437Mb/sec (2437166880bps) errors: 0
> Device: eth4@3
> Result: OK: 11795039(c11792403+d2636) usec, 59483181 (60byte,0frags)
>   5043067pps 2420Mb/sec (2420672160bps) errors: 0
>
> perf report --no-children:
>  26.05%  ksoftirqd/0  [mlx4_en] [k] mlx4_en_process_rx_cq
>  17.84%  ksoftirqd/0  [mlx4_en] [k] mlx4_en_alloc_frags
>   5.52%  ksoftirqd/0  [mlx4_en] [k] mlx4_en_free_frag

This just proves my point on the previous patch: reusing the rx_desc
buffers we are going to drop would save the ~23% of CPU wasted here on
(alloc_frags & free_frags), and this can improve benchmark
results where the CPU is the bottleneck.


Re: [PATCH net] ipv4: reject RTNH_F_LINKDOWN for incompatible routes

2016-07-09 Thread Julian Anastasov

Hello,

On Sat, 9 Jul 2016, Andy Gospodarek wrote:

> On Sat, Jul 09, 2016 at 12:00:15PM +0300, Julian Anastasov wrote:
> > Vegard Nossum is reporting a crash in fib_dump_info (fib_nhs==1)
> > when nh_dev = NULL. The problem happens when RTNH_F_LINKDOWN is
> > provided from user space for routes that do not use the flag;
> > it was caught with a netlink fuzzer.
> 
> Can you also include the panic log in the changelog or at a minimum post
> it here?

Now that Vegard has posted it, I'll include it in v2.

> > RTNH_F_LINKDOWN should be used only for link routes, not for
> > local routes or for routes with error code. Do not complicate
> > fast path with more checks, reject the flag early when configured
> > for incompatible routes.
> 
> Did the netlink fuzzer (trinity?) happen to check any of the other flags
> (like RTNH_F_DEAD) that are normally set by the kernel but could be
> problematic when sent down from userspace?

My guess is that fib_flush will release it sooner or
later. I preferred to reject the RTNH_F_LINKDOWN flag only
for some kinds of routes, but another alternative is to always
reject both RTNH_F_LINKDOWN and RTNH_F_DEAD: RTNH_F_LINKDOWN
is recalculated, and there is no good reason for user space to
provide an initial value for a flag that is maintained by the kernel.

> > if (fib_props[cfg->fc_type].error) {
> > -   if (cfg->fc_gw || cfg->fc_oif || cfg->fc_mp)
> > +   if (cfg->fc_gw || cfg->fc_oif || cfg->fc_mp ||
> > +   (fi->fib_nh->nh_flags & RTNH_F_LINKDOWN))
> > goto err_inval;
> 
> It looks a bit odd to use cfg in the existing checks and fi->fib_nh in
> the rest, but not a huge issue since cfg->fc_flags and
> fi->fib_nh->nh_flags should be equivalent for single
> and multipath routes.

Using fc_flags works too for the above case. In fact,
it also goes to fib_flags, so we should have our checks there.
But it is true that RTNH_F_LINKDOWN is not used from fib_flags;
I think we already had a chance to talk about it on 27 Oct 2015.

Maybe we can reject both flags, once for
rtnh_flags in fib_get_nhs() and then for fc_flags in
fib_create_info().

Regards

--
Julian Anastasov 
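
The alternative floated above, rejecting kernel-maintained flags outright, could look roughly like the following. It is a sketch only: validate_user_nh_flags() and KERNEL_OWNED_NH_FLAGS are hypothetical names, not existing kernel symbols, and the RTNH_F_* values are repeated locally so that the fragment stands alone.

```c
#include <errno.h>

/* Flag values as defined in the rtnetlink uapi header, repeated here so
 * the fragment compiles on its own. */
#define RTNH_F_DEAD      1
#define RTNH_F_ONLINK    4
#define RTNH_F_LINKDOWN  16

/* Flags that the kernel recalculates itself, so user space has no good
 * reason to supply an initial value for them. */
#define KERNEL_OWNED_NH_FLAGS (RTNH_F_DEAD | RTNH_F_LINKDOWN)

/* Hypothetical helper: one check that could be applied both to the
 * per-nexthop rtnh_flags and to the route-level fc_flags. */
int validate_user_nh_flags(unsigned int flags)
{
	if (flags & KERNEL_OWNED_NH_FLAGS)
		return -EINVAL;
	return 0;
}
```

Checking once at configuration time keeps the dump and lookup paths free of extra tests, which is the same trade-off the original changelog argues for.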


Re: [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program

2016-07-09 Thread Saeed Mahameed
On Fri, Jul 8, 2016 at 5:15 AM, Brenden Blanco  wrote:
> Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
>
> In tc/socket bpf programs, helpers linearize skb fragments as needed
> when the program touches the packet data. However, in the pursuit of
> speed, XDP programs will not be allowed to use these slower functions,
> especially if it involves allocating an skb.
>
> Therefore, disallow MTU settings that would produce a multi-fragment
> packet that XDP programs would fail to access. Future enhancements could
> be done to increase the allowable MTU.
>
> Signed-off-by: Brenden Blanco 
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 38 
> ++
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 36 +---
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  5 
>  3 files changed, 75 insertions(+), 4 deletions(-)
>
[...]
> +   /* A bpf program gets first chance to drop the packet. It may
> +* read bytes but not past the end of the frag.
> +*/
> +   if (prog) {
> +   struct xdp_buff xdp;
> +   dma_addr_t dma;
> +   u32 act;
> +
> +   dma = be64_to_cpu(rx_desc->data[0].addr);
> +   dma_sync_single_for_cpu(priv->ddev, dma,
> +   priv->frag_info[0].frag_size,
> +   DMA_FROM_DEVICE);

In case of XDP_PASS we will dma_sync again in the normal path, this
can be improved by doing the dma_sync as soon as we can and once and
for all, regardless of the path the packet is going to take
(XDP_DROP/mlx4_en_complete_rx_desc/mlx4_en_rx_skb).

> +
> +   xdp.data = page_address(frags[0].page) +
> +   frags[0].page_offset;
> +   xdp.data_end = xdp.data + length;
> +
> +   act = bpf_prog_run_xdp(prog, &xdp);
> +   switch (act) {
> +   case XDP_PASS:
> +   break;
> +   default:
> +   bpf_warn_invalid_xdp_action(act);
> +   case XDP_DROP:
> +   goto next;

The drop action here (goto next) will release the current rx_desc
buffers and use new ones to refill. I know that the mlx4 rx scheme
only releases/allocates new pages once every ~32 packets, but one
improvement that could really help here, especially for XDP_DROP
benchmarks, is to reuse the current rx_desc buffers when the packet
is going to be dropped.

Considering that the mlx4 rx buffer scheme might not allow gaps, maybe
this can be added later as a future improvement for the whole mlx4 rx
data path drop decisions.


Re: [PATCH net] ipv4: reject RTNH_F_LINKDOWN for incompatible routes

2016-07-09 Thread Vegard Nossum

On 07/09/2016 07:23 PM, Andy Gospodarek wrote:
> On Sat, Jul 09, 2016 at 12:00:15PM +0300, Julian Anastasov wrote:
>> Vegard Nossum is reporting a crash in fib_dump_info (fib_nhs==1)
>> when nh_dev = NULL. The problem happens when RTNH_F_LINKDOWN is
>> provided from user space for routes that do not use the flag;
>> it was caught with a netlink fuzzer.
>
> Can you also include the panic log in the changelog or at a minimum post
> it here?


Pid: 50, comm: netlink.exe Not tainted 4.7.0-rc5+
RIP: 0033:[<602b3d18>]
RSP: 62623890  EFLAGS: 00010202
RAX:  RBX: 6261b800 RCX: 
RDX:  RSI: 0024 RDI: 6245ba00
RBP: 626238f0 R08: 029c R09: 
R10: 62468038 R11: 6245ba00 R12: 6245ba00
R13: 625f96c0 R14: 601e16f0 R15: 
Kernel panic - not syncing: Kernel mode fault at addr 0x2e0, ip 0x602b3d18
CPU: 0 PID: 50 Comm: netlink.exe Not tainted 4.7.0-rc5+ #581
Stack:
 626238f0 960226a02 0400 00fe
 62623910 600afca7 62623970 62623a48
 62468038 0018  
Call Trace:
 [<602b3e93>] rtmsg_fib+0xd3/0x190
 [<602b6680>] fib_table_insert+0x260/0x500
 [<602b0e5d>] inet_rtm_newroute+0x4d/0x60
 [<60250def>] rtnetlink_rcv_msg+0x8f/0x270
 [<60267079>] netlink_rcv_skb+0xc9/0xe0
 [<60250d4b>] rtnetlink_rcv+0x3b/0x50
 [<60265400>] netlink_unicast+0x1a0/0x2c0
 [<60265e47>] netlink_sendmsg+0x3f7/0x470
 [<6021dc9a>] sock_sendmsg+0x3a/0x90
 [<6021e0d0>] ___sys_sendmsg+0x300/0x360
 [<6021fa64>] __sys_sendmsg+0x54/0xa0
 [<6021fac0>] SyS_sendmsg+0x10/0x20
 [<6001ea68>] handle_syscall+0x88/0x90
 [<600295fd>] userspace+0x3fd/0x500
 [<6001ac55>] fork_handler+0x85/0x90

$ addr2line -e vmlinux -i 0x602b3d18
include/linux/inetdevice.h:222
net/ipv4/fib_semantics.c:1264

220 static inline struct in_device *__in_dev_get_rtnl(const struct net_device *dev)
221 {
222 	return rtnl_dereference(dev->ip_ptr);
223 }

1263 	if (fi->fib_nh->nh_flags & RTNH_F_LINKDOWN) {
1264 		in_dev = __in_dev_get_rtnl(fi->fib_nh->nh_dev);
1265 		if (in_dev &&


>> RTNH_F_LINKDOWN should be used only for link routes, not for
>> local routes or for routes with error code. Do not complicate
>> fast path with more checks, reject the flag early when configured
>> for incompatible routes.
>
> Did the netlink fuzzer (trinity?) happen to check any of the other flags
> (like RTNH_F_DEAD) that are normally set by the kernel but could be
> problematic when sent down from userspace?


I honestly don't know -- the fuzzer (based on AFL) doesn't know anything
about netlink in particular, so if it passed/tested any other flags it
was by chance and not by design.


Vegard



Re: [PATCH net] ipv4: reject RTNH_F_LINKDOWN for incompatible routes

2016-07-09 Thread Andy Gospodarek
On Sat, Jul 09, 2016 at 12:00:15PM +0300, Julian Anastasov wrote:
> Vegard Nossum is reporting a crash in fib_dump_info (fib_nhs==1)
> when nh_dev = NULL. The problem happens when RTNH_F_LINKDOWN is
> provided from user space for routes that do not use the flag;
> it was caught with a netlink fuzzer.

Can you also include the panic log in the changelog or at a minimum post
it here?

> RTNH_F_LINKDOWN should be used only for link routes, not for
> local routes or for routes with error code. Do not complicate
> fast path with more checks, reject the flag early when configured
> for incompatible routes.

Did the netlink fuzzer (trinity?) happen to check any of the other flags
(like RTNH_F_DEAD) that are normally set by the kernel but could be
problematic when sent down from userspace?

> Reported-by: Vegard Nossum 
> Fixes: 0eeb075fad73 ("net: ipv4 sysctl option to ignore routes when nexthop 
> link is down")
> Tested-by: Vegard Nossum 
> Signed-off-by: Julian Anastasov 
> Cc: Andy Gospodarek 
> Cc: Dinesh Dutt 
> Cc: Scott Feldman 
> ---
>  net/ipv4/fib_semantics.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> Note: works for all kernels: net, net-next, 4.4.14, 4.5.7, 4.6.3
> 
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index d09173b..b642479 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -1113,7 +1113,8 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
>   }
>  
>   if (fib_props[cfg->fc_type].error) {
> - if (cfg->fc_gw || cfg->fc_oif || cfg->fc_mp)
> + if (cfg->fc_gw || cfg->fc_oif || cfg->fc_mp ||
> + (fi->fib_nh->nh_flags & RTNH_F_LINKDOWN))
>   goto err_inval;

It looks a bit odd to use cfg in the existing checks and fi->fib_nh in
the rest, but it is not a huge issue since cfg->fc_flags and
fi->fib_nh->nh_flags should be equivalent for single and multipath
routes.

>   goto link_it;
>   } else {
> @@ -1136,7 +1137,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
>   struct fib_nh *nh = fi->fib_nh;
>  
>   /* Local address is added. */
> - if (nhs != 1 || nh->nh_gw)
> + if (nhs != 1 || nh->nh_gw || (nh->nh_flags & RTNH_F_LINKDOWN))
>   goto err_inval;
>   nh->nh_scope = RT_SCOPE_NOWHERE;
>   nh->nh_dev = dev_get_by_index(net, fi->fib_nh->nh_oif);
> -- 
> 1.9.3
> 


Re: [PATCH net-next 0/4] net: cleanup for UDP tunnel's GRO

2016-07-09 Thread Shmulik Ladkani
On Sat, 9 Jul 2016 11:35:03 -0400 Hannes Frederic Sowa 
 wrote:
> On 09.07.2016 11:18, Shmulik Ladkani wrote:
> > On Fri, 8 Jul 2016 19:04:27 -0400 Hannes Frederic Sowa 
> >  wrote:  
> >>>> I really do wonder if GRO on top of fragmentation does have any effect.
> >>>> Would be great if someone has data for that already?
> >>>
> >>> I think that logic is kind of backwards.  It is already there.
> >>> Instead of asking people to prove that this change is invalid the onus
> >>> should be on the submitter to prove the change causes no harm.
> >>
> >> Of course, sorry, I didn't want to make the impression others should do
> >> that. I asked because Shmulik made the impression on me he had
> >> experience with GRO+fragmentation on vxlan and/or geneve and could
> >> provide some data, maybe even just anecdotal.  
> > 
> > Few anecdotal updates.
> > 
> > I don't have ready-made data as the systems are not using this exact
> > kind of setup.
> > 
> > However, some quick experiments reveal that GRO on top of the
> > tunnels, where tunnel datagrams are fragmented, has some effect.
> > The packets indeed get aggregated, although not as aggressively as
> > in the non-fragmented case.
> > 
> > Whether the effect is significant depends on the system.
> > 
> > In a system that is very sensitive to non-aggregated skbs (due to a cpu
> > bottleneck during further processing of the decapsulated packets), the
> > effect of aggregation is indeed significant.  
> 
> Cool, thanks. I thought it wouldn't happen because of the packet pacing.
> We will also do some more tests ourselves. Maybe it is time to add
> fragmentation support to inet_gro_receive to handle those cases much
> more easily without going through fragmentation engine at all, would
> probably speed up your usage significantly?

Indeed, that seems beneficial. I wondered about this a while ago. I found
it not trivial, though. Without the transport headers available per
received SKB, it makes GRO more complex than it currently is :)

> Talking about ip fragmentation in general, are you end-host or
> mid-router fragmented? 

Currently dealing with end-host fragmentation.
(follow the thread at [1] - usecase is better explained there)

[1] http://www.spinics.net/lists/netdev/msg385085.html

Regards,
Shmulik


Re: [PATCH net-next 0/4] net: cleanup for UDP tunnel's GRO

2016-07-09 Thread Hannes Frederic Sowa
On 09.07.2016 11:18, Shmulik Ladkani wrote:
> On Fri, 8 Jul 2016 19:04:27 -0400 Hannes Frederic Sowa 
>  wrote:
>>>> I really do wonder if GRO on top of fragmentation does have any effect.
>>>> Would be great if someone has data for that already?
>>>
>>> I think that logic is kind of backwards.  It is already there.
>>> Instead of asking people to prove that this change is invalid the onus
>>> should be on the submitter to prove the change causes no harm.  
>>
>> Of course, sorry, I didn't want to make the impression others should do
>> that. I asked because Shmulik made the impression on me he had
>> experience with GRO+fragmentation on vxlan and/or geneve and could
>> provide some data, maybe even just anecdotal.
> 
> Few anecdotal updates.
> 
> I don't have ready-made data as the systems are not using this exact
> kind of setup.
> 
> However, some quick experiments reveal that GRO on top of the
> tunnels, where tunnel datagrams are fragmented, has some effect.
> The packets indeed get aggregated, although not as aggressively as
> in the non-fragmented case.
> 
> Whether the effect is significant depends on the system.
> 
> In a system that is very sensitive to non-aggregated skbs (due to a cpu
> bottleneck during further processing of the decapsulated packets), the
> effect of aggregation is indeed significant.

Cool, thanks. I thought it wouldn't happen because of the packet pacing.
We will also do some more tests ourselves. Maybe it is time to add
fragmentation support to inet_gro_receive to handle those cases much
more easily without going through fragmentation engine at all, would
probably speed up your usage significantly?

Talking about ip fragmentation in general, are you end-host or
mid-router fragmented? Do you know if there are different
characteristics if linux fragments vs. some kind of hw-router in the
middle (do fragments get paced?).

Thanks,
Hannes



Re: [PATCH net-next 0/4] net: cleanup for UDP tunnel's GRO

2016-07-09 Thread Shmulik Ladkani
On Fri, 8 Jul 2016 19:04:27 -0400 Hannes Frederic Sowa 
 wrote:
> >> I really do wonder if GRO on top of fragmentation does have any effect.
> >> Would be great if someone has data for that already?  
> > 
> > I think that logic is kind of backwards.  It is already there.
> > Instead of asking people to prove that this change is invalid the onus
> > should be on the submitter to prove the change causes no harm.  
> 
> Of course, sorry, I didn't want to make the impression others should do
> that. I asked because Shmulik made the impression on me he had
> experience with GRO+fragmentation on vxlan and/or geneve and could
> provide some data, maybe even just anecdotal.

Few anecdotal updates.

I don't have ready-made data as the systems are not using this exact
kind of setup.

However, some quick experiments reveal that GRO on top of the
tunnels, where tunnel datagrams are fragmented, has some effect.
The packets indeed get aggregated, although not as aggressively as
in the non-fragmented case.

Whether the effect is significant depends on the system.

In a system that is very sensitive to non-aggregated skbs (due to a cpu
bottleneck during further processing of the decapsulated packets), the
effect of aggregation is indeed significant.

Regards,
Shmulik


Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, do segmentation even for non IPSKB_FORWARDED skbs

2016-07-09 Thread Hannes Frederic Sowa
On 09.07.2016 05:00, Florian Westphal wrote:
> I am worried about this patch, skb_gso_validate_mtu is more costly than
> the ->flags & FORWARD check; everyone pays this extra cost.
> 
> What about setting IPCB FORWARD flag in iptunnel_xmit if
> skb->skb_iif != 0... instead?

That came to my mind first, too.

> Yet another possibility would be to reduce gso_size but that violates
> gro/gso symmetry...

If you see vxlan (or any other UDP encap protocol) as an in-kernel
tunnel solution/program acting as an end point it is actually legal to
modify gso_size in my opinion. Adding headers to an skb can also not be
symmetric in this case. I would not see anything bad happening because
of doing so. Do you? It is a matter of implementation. vxlan could also
eat all data and splice it anew onto a socket. It already doesn't care
about end-to-end pmtu consistency, so I can't see anything wrong with it.

To make this all safe one needs to handle the ttl in vxlan much better.
It needs to be inherited from the inner header, checked and properly
decremented on the outer header, so that vxlan itself acts as a full
blown router by itself. Otherwise endless loops are possible, as the
packet will always fit the size.

Bye,
Hannes



Re: [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program

2016-07-09 Thread Or Gerlitz
On Fri, Jul 8, 2016 at 5:15 AM, Brenden Blanco  wrote:
> Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
>
> In tc/socket bpf programs, helpers linearize skb fragments as needed
> when the program touchs the packet data. However, in the pursuit of

nit, for the next version touchs --> touches

> speed, XDP programs will not be allowed to use these slower functions,
> especially if it involves allocating an skb.


[...]

> @@ -835,6 +838,34 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
> mlx4_en_cq *cq, int bud
> l2_tunnel = (dev->hw_enc_features & NETIF_F_RXCSUM) &&
> (cqe->vlan_my_qpn & cpu_to_be32(MLX4_CQE_L2_TUNNEL));
>
> +   /* A bpf program gets first chance to drop the packet. It may
> +* read bytes but not past the end of the frag.
> +*/
> +   if (prog) {
> +   struct xdp_buff xdp;
> +   dma_addr_t dma;
> +   u32 act;
> +
> +   dma = be64_to_cpu(rx_desc->data[0].addr);
> +   dma_sync_single_for_cpu(priv->ddev, dma,
> +   priv->frag_info[0].frag_size,
> +   DMA_FROM_DEVICE);
> +
> +   xdp.data = page_address(frags[0].page) +
> +   frags[0].page_offset;
> +   xdp.data_end = xdp.data + length;
> +
> +   act = bpf_prog_run_xdp(prog, &xdp);
> +   switch (act) {
> +   case XDP_PASS:
> +   break;
> +   default:
> +   bpf_warn_invalid_xdp_action(act);
> +   case XDP_DROP:
> +   goto next;
> +   }
> +   }


(probably a nit too, but wanted to make sure we don't miss something
here) is the default case preceding the DROP one on purpose? Is there
any special reason to do that?


Re: [PATCH v6 01/12] bpf: add XDP prog type for early driver filter

2016-07-09 Thread Tom Herbert
On Sat, Jul 9, 2016 at 3:14 AM, Jesper Dangaard Brouer
 wrote:
> On Thu,  7 Jul 2016 19:15:13 -0700
> Brenden Blanco  wrote:
>
>> Add a new bpf prog type that is intended to run in early stages of the
>> packet rx path. Only minimal packet metadata will be available, hence a
>> new context type, struct xdp_md, is exposed to userspace. So far only
>> expose the packet start and end pointers, and only in read mode.
>>
>> An XDP program must return one of the well known enum values, all other
>> return codes are reserved for future use. Unfortunately, this
>> restriction is hard to enforce at verification time, so take the
>> approach of warning at runtime when such programs are encountered. The
>> driver can choose to implement unknown return codes however it wants,
>> but must invoke the warning helper with the action value.
>
> I believe we should define stronger semantics for unknown/future
> return codes than the ones stated above:
>  "driver can choose to implement unknown return codes however it wants"
>
> The mlx4 driver implementation in:
>  [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program
>
> On Thu,  7 Jul 2016 19:15:16 -0700 Brenden Blanco  
> wrote:
>
>> + /* A bpf program gets first chance to drop the packet. It may
>> +  * read bytes but not past the end of the frag.
>> +  */
>> + if (prog) {
>> + struct xdp_buff xdp;
>> + dma_addr_t dma;
>> + u32 act;
>> +
>> + dma = be64_to_cpu(rx_desc->data[0].addr);
>> + dma_sync_single_for_cpu(priv->ddev, dma,
>> + priv->frag_info[0].frag_size,
>> + DMA_FROM_DEVICE);
>> +
>> + xdp.data = page_address(frags[0].page) +
>> + frags[0].page_offset;
>> + xdp.data_end = xdp.data + length;
>> +
>> + act = bpf_prog_run_xdp(prog, &xdp);
>> + switch (act) {
>> + case XDP_PASS:
>> + break;
>> + default:
>> + bpf_warn_invalid_xdp_action(act);
>> + case XDP_DROP:
>> + goto next;
>> + }
>> + }
>
> Thus, mlx4 choice is to drop packets for unknown/future return codes.
>
> I think this is the wrong choice.  I think the choice should be
> XDP_PASS, to pass the packet up the stack.
>
> I find "XDP_DROP" problematic because it happens so early in the driver
> that we lose all possibility of debugging which packets get dropped.  We
> get a single kernel log warning, but we cannot inspect the packets any
> longer.  By defaulting to XDP_PASS, all the normal stack tools (e.g.
> tcpdump) are available.
>
It's an API issue though not a problem with the packet. Allowing
unknown return codes to pass seems like a major security problem also.

Tom

>
> I can also imagine that, defaulting to XDP_PASS, can be an important
> feature in the future.
>
> In the future we will likely have features, where XDP can "offload"
> packet delivery from the normal stack (e.g. delivery into a VM).  On a
> running production system you can then load your XDP program.  If the
> driver was too old and defaulted to XDP_DROP, you would lose your service;
> if it instead defaulted to XDP_PASS, your service would survive, falling
> back to normal delivery.
>
> (For the VM delivery use-case, there will likely be a need for having a
> fallback delivery method in place, when the XDP program is not active,
> in-order to support VM migration).
>
>
>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index c14ca1c..5b47ac3 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
> [...]
>>
>> +/* User return codes for XDP prog type.
>> + * A valid XDP program must return one of these defined values. All other
>> + * return codes are reserved for future use. Unknown return codes will 
>> result
>> + * in driver-dependent behavior.
>> + */
>> +enum xdp_action {
>> + XDP_DROP,
>> + XDP_PASS,
>> +};
>> +
> [...]
>>  #endif /* _UAPI__LINUX_BPF_H__ */
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index e206c21..a8d67d0 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
> [...]
>> +void bpf_warn_invalid_xdp_action(int act)
>> +{
>> + WARN_ONCE(1, "\n"
>> +  "*\n"
>> +  "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
>> +  "**   **\n"
>> +  "** XDP program returned unknown value %-10u **\n"
>> +  "**   **\n"
>> +  "** XDP programs must return a well-kn

Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, do segmentation even for non IPSKB_FORWARDED skbs

2016-07-09 Thread Florian Westphal
Shmulik Ladkani  wrote:
> I'd appreciate any suggestion how to determine traffic is local OTHER
> THAN testing IPSKB_FORWARDED; If we have such a way, there wouldn't be an
> impact on local traffic.
> 
> > What about setting IPCB FORWARD flag in iptunnel_xmit if
> > skb->skb_iif != 0... instead?
> 
> Can you please elaborate?

[ not even compile tested ]

diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -65,6 +65,7 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct 
sk_buff *skb,
struct net_device *dev = skb->dev;
struct iphdr *iph;
int err;
+   bool fwd = skb->skb_iif > 0;
 
skb_scrub_packet(skb, xnet);
 
@@ -72,6 +73,9 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct 
sk_buff *skb,
skb_dst_set(skb, &rt->dst);
memset(IPCB(skb), 0, sizeof(*IPCB(skb)));
 
+   if (fwd)
+   IPCB(skb)->flags = IPSKB_FORWARDED;
+
/* Push down and install the IP header. */
skb_push(skb, sizeof(struct iphdr));
skb_reset_network_header(skb);


Re: [PATCH net] udp: prevent bugcheck if filter truncates packet too much

2016-07-09 Thread Willem de Bruijn
On Sat, Jul 9, 2016 at 6:43 AM, Michal Kubecek  wrote:
> On Sat, Jul 09, 2016 at 11:48:49AM +0200, Daniel Borkmann wrote:
>> On 07/09/2016 02:20 AM, Alexei Starovoitov wrote:
>> >On Sat, Jul 09, 2016 at 01:31:40AM +0200, Eric Dumazet wrote:
>> >>On Fri, 2016-07-08 at 17:52 +0200, Michal Kubecek wrote:
>> >>>If socket filter truncates an udp packet below the length of UDP header
>> >>>in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a
>> >>>BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if
>> >>>kernel is configured that way) can be easily enforced by an unprivileged
>> >>>user which was reported as CVE-2016-6162. For a reproducer, see
>> >>>http://seclists.org/oss-sec/2016/q3/8
>> >>>
>> >>>Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before 
>> >>>queueing")
>> >>>Reported-by: Marco Grassi 
>> >>>Signed-off-by: Michal Kubecek 
>> >>>---
>
>> >>Acked-by: Eric Dumazet 
>> >
>> >this is incomplete fix. Please do not apply. See discussion at 
>> >security@kernel
>>
>> Ohh well, didn't see it earlier before starting the discussion at 
>> security@...
>>
>> I'm okay if we take this for now as a quick band aid and find a better
>> way how to deal with the underlying issue long-term so that it's
>> /guaranteed/ that it doesn't bite us any further in such fragile ways.
>
> Agreed. As rc7 is due in a day or two, rushing a complex and intrusive
> solution in might be too risky.

Acked-by: Willem de Bruijn 

Thanks, Michal.


Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, do segmentation even for non IPSKB_FORWARDED skbs

2016-07-09 Thread Shmulik Ladkani
On Sat, 9 Jul 2016 11:00:20 +0200 Florian Westphal  wrote:
> Shmulik Ladkani  wrote:
> > > How does work if e.g. 1460-sized udp packet arrives on tap0?
> > > Do we fragment (possibly ignoring DF?)  
> > 
> > A *non* gso udp packet arriving on tap0 is bridged to vxlan0 (assuming
> > vxlan mtu is sufficient), and the original DF of the inner packet is
> > preserved.
> > 
> > The skb gets vxlan-udp encapsulated, with outer IP header having DF=0
> > (unless tun_flags & TUNNEL_DONT_FRAGMENT), and then, if skb->len > mtu,
> > fragmented normally at the ip_finish_output --> ip_fragment code path.  
> 
> I see.
> 
> If I understand correctly you have vxlan stacked on top of eth0, and tap
> and vxlan in a bridge.
> 
> ... and eth mtu smaller than bridge mtu.
> 
> I think that this is "working" by pure accident, and that better fix is
> to set mtu values correctly so that when vxlan header is added we don't
> exceed what can be handled by the real device (yes, I know you have
> no control over this).

Let me elaborate a bit regarding the usecase.

Consider nested virtualization. The "host" is a VM which may run in
various different cloud deployments, thus the mtu of this virtual host's
eth0 varies (I've no control over it, nor should I have).

The "host" provides a virtual network for its Nested Guest VMs - the
users of the system.
They are provided with a virtual L2 network with an MTU of their choice
(usually 1500).

Forcing on users runtime-varying restrictions on the MTUs they can use
in their virtual network (restrictions that depend on where the virtual
"host" is currently deployed) means they must alter their applications'
settings per deployment. This is a non-solution.

This is why neither the guests' MTUs nor the host's eth0 MTU can be set.

We have the option of using a user-space based UDP tunnel (instead of the
kernel's vxlan or geneve). Aside from the downside of not utilizing existing
protocols and implementations, this works well as the encapsulated guest
packets are sent over a standard UDP socket via sendmsg.
As such, these datagrams may get fragmented (in deployments where host's
eth0 mtu is too small), and reassembled at the remote tunnel
termination's ip stack.

Regardless of the use-case, there's currently an inconsistency in the
kernel's behavior:

Consider:

   VM
  eth0  # 1500 mtu; any virtual/emulated NIC that supports TSO
   .
   .
+---HOST---------+
|  .   __br0__   |
|  .  /       \  |
| tap0   vxlan0  |
|(1500)  (1500)  |
|          .     |
|          .     |
|        eth0    |
|       (1200)   |
+----------------+

1. If VM disables TSO, it sends 1500-sized IP packets down eth0,
   the frames arrive on tap0, bridged, get encapsulated by vxlan0, and
   the vxlan datagram is then fragmented (as any local datagram would
   have) by ip_finish_output on eth0.

2. If VM enables TSO, we have a superpacket arriving at tap0
   with total length of say 1 bytes, with gso_size 1460.
   The superpacket gets encapsulated by vxlan0.
   Finally upon eth0's validate_xmit_skb, the packet is udp-tunnel-segmented
   according to original gso_size of 1460, creating encapsulation
   datagrams bigger than eth0's mtu - which are eventually dropped on
   the wire.

This is simply inconsistent: The GSO path should align to the non-GSO
case.
Thus my suggestion: in this specific case, within ip_finish_output_gso,
segment the GSO skb first, then fragment each segment according to dst
mtu. This aligns the GSO vs non-GSO behavior.

> I am worried about this patch, skb_gso_validate_mtu is more costly than
> the ->flags & FORWARD check; everyone pays this extra cost.

I can get back with numbers regarding the impact on local traffic.

I'd appreciate any suggestion how to determine traffic is local OTHER
THAN testing IPSKB_FORWARDED; If we have such a way, there wouldn't be an
impact on local traffic.

> What about setting IPCB FORWARD flag in iptunnel_xmit if
> skb->skb_iif != 0... instead?

Can you please elaborate?

> Yet another possibility would be to reduce gso_size but that violates
> gro/gso symmetry...

We're experimenting with this path as well. But as said, fixing the
inconsistency above would still be valid.

> [ I tried to check rfc but seems rfc7348 simply declares that
>   endpoints are not allowed to fragment so problem solved :-/ ]

Funny, in Geneve it's only a "best practice" recommendation:
  https://tools.ietf.org/html/draft-ietf-nvo3-geneve-02
  section 4.1.1

I'm not keen on vxlan; any UDP based tunnel would do ;-)

Regards,
Shmulik


Re: [LEDE-DEV] DHCP via bridge in case of IPv4

2016-07-09 Thread Alexey Brodkin
Hi Aaron,

On Sat, 2016-07-09 at 07:47 -0400, Aaron Z wrote:
> On Sat, Jul 9, 2016 at 4:37 AM, Alexey Brodkin
>  wrote:
> > 
> > Hello,
> > 
> > I was playing with a quite simple bridged setup on different boards with
> > very recent kernels (4.6.3 as of this writing) and found one interesting
> > behavior that I cannot yet understand, and googling didn't help either.
> > 
> > My setup is pretty simple:
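> > -------------   --------------------   -------------------------
> > | HOST      |   | "Dumb AP"        |   | Wireless client       |
> > | with DHCP |<->(eth0)       (wlan0)<->| attempting to         |
> > | server    |   |     \ br0 /      |   | get settings via DHCP |
> > -------------   --------------------   -------------------------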
> > 
> > * HOST is my laptop with DHCP server that works for sure.
> > * "Dumb AP" is a separate board (I tried ARM-based Wandboard and ARC-based
> >   AXS10x boards but results are exactly the same) with wired (eth0) and 
> > wireless
> >   (wlan0) network controllers bridged together (br0). That "br0" bridge 
> > flawlessly
> >   gets its settings from DHCP server on host.
> > * Wireless client could be either a smartphone or another laptop etc but
> >   what's important it should be configured to get network settings by DHCP 
> > as well.
> > 
> > So what happens "br0" always gets network settings from DHCP server on HOST.
> > That's fine. But wireless client only reliably gets settings from DHCP 
> > server
> > if IPv6 is enabled on "Dumb AP" board. If IPv6 is disabled I may see that
> > wireless client sends "DHCP Discover" then server replies with "DHCP Offer" 
> > but
> > that offer never reaches wireless client.
> 
>
> Do you have WDS enabled? If not, DHCP has issues in that scenario:
> https://wiki.openwrt.org/doc/howto/clientmode

I don't have WDS enabled. I tried to have as simple setup as possible.
Still from what I see in the Wiki article above problem happens when
there're 4 devices in the chain, right? Because as it says:
>8
The 802.11 standard only uses three MAC addresses for frames transmitted between
the Access Point and the Station. Frames transmitted from the Station to the AP
don't include the ethernet source MAC of the requesting host and response frames
are missing the destination ethernet MAC to address the target host behind the
client bridge.
>8

But in my case I only have 3 devices in the chain so I would think it's
something other than the issue described in the article.

Anyways thanks for the hint.

-Alexey


Re: [PATCH net-next 0/6] sctp: implement rfc7496 in sctp

2016-07-09 Thread Xin Long
On Sat, Jul 9, 2016 at 7:47 PM, Xin Long  wrote:
> This patchset implements "Additional Policies for the Partially Reliable
> Stream Control Transmission Protocol Extension" described on RFC7496.
>
> The Partially Reliable SCTP (PR-SCTP) extension defined in [RFC3758]
> provides a generic method for senders to abandon user messages. The
> decision to abandon a user message is sender side only, and the exact
> condition is called a "PR-SCTP policy". This patchset implements 3
> policies:
>
>  1. Timed Reliability:  This allows the sender to specify a timeout for
> a user message after which the SCTP stack abandons the user message.
>
>  2. Limited Retransmission Policy:  Allows limitation of the number of
> retransmissions.
>
>  3. Priority Policy:  Allows removal of lower-priority messages if space
> for higher-priority messages is needed in the send buffer.
>
> Patch 1-3 add some sockopts in sctp to set/get pr_sctp policy status.
> Patch 4-6 implement these 3 policies one by one.
>

Attachment is some tests for this patchset.


prsctp_test.tar.gz
Description: GNU Zip compressed data


[PATCH net-next 3/6] sctp: add SCTP_PR_ASSOC_STATUS on sctp sockopt

2016-07-09 Thread Xin Long
This patch adds SCTP_PR_ASSOC_STATUS to sctp sockopt, which is used
to dump the prsctp statistics info from the asoc. The prsctp statistics
include abandoned_sent/unsent from the asoc. abandoned_sent is the
count of the packets we drop from the retransmit/transmitted queue,
and abandoned_unsent is the count of the packets we drop from out_queue
according to the policy.

Note: another option for prsctp statistics dump described in the rfc is
SCTP_PR_STREAM_STATUS, which is used to dump the prsctp statistics
info from each stream. For now, linux doesn't have per-stream
statistics info yet; that needs rfc6525 to be implemented. As the prsctp
statistics for each stream have to be based on per-stream statistics,
we will delay it until rfc6525 is done in linux.

Signed-off-by: Xin Long 
---
 include/net/sctp/structs.h |  3 +++
 include/uapi/linux/sctp.h  | 12 +
 net/sctp/socket.c  | 62 ++
 3 files changed, 77 insertions(+)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 07115ca..d8e464a 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1853,6 +1853,9 @@ struct sctp_association {
 prsctp_enable:1;
 
struct sctp_priv_assoc_stats stats;
+
+   __u64 abandoned_unsent[SCTP_PR_INDEX(MAX) + 1];
+   __u64 abandoned_sent[SCTP_PR_INDEX(MAX) + 1];
 };
 
 
diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 984cf2e..d304f4c 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -114,6 +114,7 @@ typedef __s32 sctp_assoc_t;
 #define SCTP_GET_ASSOC_STATS   112 /* Read only */
 #define SCTP_PR_SUPPORTED  113
 #define SCTP_DEFAULT_PRINFO114
+#define SCTP_PR_ASSOC_STATUS   115
 
 /* PR-SCTP policies */
 #define SCTP_PR_SCTP_NONE  0x
@@ -926,6 +927,17 @@ struct sctp_paddrthlds {
__u16 spt_pathpfthld;
 };
 
+/*
+ * Socket Option for Getting the Association/Stream-Specific PR-SCTP Status
+ */
+struct sctp_prstatus {
+   sctp_assoc_t sprstat_assoc_id;
+   __u16 sprstat_sid;
+   __u16 sprstat_policy;
+   __u64 sprstat_abandoned_unsent;
+   __u64 sprstat_abandoned_sent;
+};
+
 struct sctp_default_prinfo {
sctp_assoc_t pr_assoc_id;
__u32 pr_value;
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index c03fe1b..c3167c4 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -6330,6 +6330,64 @@ out:
return retval;
 }
 
+static int sctp_getsockopt_pr_assocstatus(struct sock *sk, int len,
+ char __user *optval,
+ int __user *optlen)
+{
+   struct sctp_prstatus params;
+   struct sctp_association *asoc;
+   int policy;
+   int retval = -EINVAL;
+
+   if (len < sizeof(params))
+   goto out;
+
+   len = sizeof(params);
+   if (copy_from_user(&params, optval, len)) {
+   retval = -EFAULT;
+   goto out;
+   }
+
+   policy = params.sprstat_policy;
+   if (policy & ~SCTP_PR_SCTP_MASK)
+   goto out;
+
+   asoc = sctp_id2assoc(sk, params.sprstat_assoc_id);
+   if (!asoc)
+   goto out;
+
+   if (policy == SCTP_PR_SCTP_NONE) {
+   params.sprstat_abandoned_unsent = 0;
+   params.sprstat_abandoned_sent = 0;
+   for (policy = 0; policy <= SCTP_PR_INDEX(MAX); policy++) {
+   params.sprstat_abandoned_unsent +=
+   asoc->abandoned_unsent[policy];
+   params.sprstat_abandoned_sent +=
+   asoc->abandoned_sent[policy];
+   }
+   } else {
+   params.sprstat_abandoned_unsent =
+   asoc->abandoned_unsent[__SCTP_PR_INDEX(policy)];
+   params.sprstat_abandoned_sent =
+   asoc->abandoned_sent[__SCTP_PR_INDEX(policy)];
+   }
+
+   if (put_user(len, optlen)) {
+   retval = -EFAULT;
+   goto out;
+   }
+
+   if (copy_to_user(optval, &params, len)) {
+   retval = -EFAULT;
+   goto out;
+   }
+
+   retval = 0;
+
+out:
+   return retval;
+}
+
 static int sctp_getsockopt(struct sock *sk, int level, int optname,
   char __user *optval, int __user *optlen)
 {
@@ -6490,6 +6548,10 @@ static int sctp_getsockopt(struct sock *sk, int level, 
int optname,
retval = sctp_getsockopt_default_prinfo(sk, len, optval,
optlen);
break;
+   case SCTP_PR_ASSOC_STATUS:
+   retval = sctp_getsockopt_pr_assocstatus(sk, len, optval,
+   optlen);
+   break;
default:
retval = -ENOPROTOOPT;
break;
-- 
2.1.0



[PATCH net-next 5/6] sctp: implement prsctp RTX policy

2016-07-09 Thread Xin Long
prsctp RTX policy is a policy to abandon chunks when they are
retransmitted beyond the max count.

This patch uses sent_count to count how many times one chunk has
been sent, and prsctp_param is the max rtx count, which is from
sinfo->sinfo_timetolive in sctp_set_prsctp_policy(). So similar
to TTL policy, if RTX policy is enabled, msg->expires_at won't
work.

Then in sctp_chunk_abandoned, this patch checks if chunk->sent_count
is bigger than chunk->prsctp_param to abandon this chunk.
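
A rough user-space sketch of the RTX check described above. This is not
the kernel code: struct chunk_sketch and both helpers are hypothetical
names, and max_rtx merely plays the role of chunk->prsctp_param.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the chunk fields the patch uses. */
struct chunk_sketch {
	int sent_count;		/* times the chunk was put on the wire */
	unsigned long max_rtx;	/* plays the role of chunk->prsctp_param */
};

/* Count one transmission or retransmission of the chunk. */
static void chunk_sent(struct chunk_sketch *c)
{
	c->sent_count++;
}

/* RTX policy: abandon once the send count exceeds the configured limit. */
static bool rtx_should_abandon(const struct chunk_sketch *c)
{
	assert(c->sent_count >= 0);
	return (unsigned long)c->sent_count > c->max_rtx;
}
```

With max_rtx set to 2, the chunk survives its first transmission and one
retransmission untouched, and is reported as abandoned once it has been
sent a third time.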

Signed-off-by: Xin Long 
---
 net/sctp/chunk.c | 4 
 net/sctp/sm_make_chunk.c | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
index 2698d12..b3692b5 100644
--- a/net/sctp/chunk.c
+++ b/net/sctp/chunk.c
@@ -355,6 +355,10 @@ int sctp_chunk_abandoned(struct sctp_chunk *chunk)
else
chunk->asoc->abandoned_unsent[SCTP_PR_INDEX(TTL)]++;
return 1;
+   } else if (SCTP_PR_RTX_ENABLED(chunk->sinfo.sinfo_flags) &&
+  chunk->sent_count > chunk->prsctp_param) {
+   chunk->asoc->abandoned_sent[SCTP_PR_INDEX(RTX)]++;
+   return 1;
}
 
return 0;
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 2c431ee..cfde934 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -720,6 +720,8 @@ static void sctp_set_prsctp_policy(struct sctp_chunk *chunk,
if (SCTP_PR_TTL_ENABLED(sinfo->sinfo_flags))
chunk->prsctp_param =
jiffies + msecs_to_jiffies(sinfo->sinfo_timetolive);
+   else if (SCTP_PR_RTX_ENABLED(sinfo->sinfo_flags))
+   chunk->prsctp_param = sinfo->sinfo_timetolive;
 }
 
 /* Make a DATA chunk for the given association from the provided
-- 
2.1.0



[PATCH net-next 4/6] sctp: implement prsctp TTL policy

2016-07-09 Thread Xin Long
prsctp TTL policy is a policy to abandon chunks when they expire
at a specific time in the local stack. It's similar to expires_at
in struct sctp_datamsg.

This patch uses sinfo->sinfo_timetolive to set the specific time for
TTL policy. sinfo->sinfo_timetolive is also used for msg->expires_at.
So if prsctp_enable or TTL policy is not enabled, msg->expires_at
still works as before.
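
A minimal user-space sketch of the TTL idea, assuming a free-running
unsigned tick counter in place of jiffies. tick_after() follows the same
idea as the kernel's time_after(); struct ttl_chunk, ttl_arm() and
ttl_expired() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Wrap-safe "a is after b" for a free-running tick counter. Relies on
 * the usual two's complement conversion, like the kernel's time_after().
 */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

struct ttl_chunk {
	unsigned long drop_at;	/* absolute tick; role of chunk->prsctp_param */
};

/* Record the absolute expiry tick when the chunk is created. */
static void ttl_arm(struct ttl_chunk *c, unsigned long now,
		    unsigned long ttl_ticks)
{
	assert(c);
	c->drop_at = now + ttl_ticks;	/* unsigned wraparound is defined */
}

/* TTL policy: the chunk may be abandoned once "now" has passed drop_at. */
static bool ttl_expired(const struct ttl_chunk *c, unsigned long now)
{
	return tick_after(now, c->drop_at);
}
```

Storing an absolute expiry tick and comparing with a wrap-safe helper
keeps the check correct even when the counter overflows between arming
and testing.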

Signed-off-by: Xin Long 
---
 include/net/sctp/structs.h | 10 ++
 net/sctp/chunk.c   | 20 +---
 net/sctp/output.c  |  2 ++
 net/sctp/sm_make_chunk.c   | 12 
 net/sctp/socket.c  |  4 ++--
 5 files changed, 43 insertions(+), 5 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index d8e464a..6bcda71 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -602,6 +602,16 @@ struct sctp_chunk {
/* This needs to be recoverable for SCTP_SEND_FAILED events. */
struct sctp_sndrcvinfo sinfo;
 
+   /* We use this field to record param for prsctp policies,
+* for TTL policy, it is the time_to_drop of this chunk,
+* for RTX policy, it is the max_sent_count of this chunk,
+* for PRIO policy, it is the priority of this chunk.
+*/
+   unsigned long prsctp_param;
+
+   /* How many times this chunk have been sent, for prsctp RTX policy */
+   int sent_count;
+
/* Which association does this belong to?  */
struct sctp_association *asoc;
 
diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
index 1eb94bf..2698d12 100644
--- a/net/sctp/chunk.c
+++ b/net/sctp/chunk.c
@@ -335,13 +335,27 @@ errout:
 /* Check whether this message has expired. */
 int sctp_chunk_abandoned(struct sctp_chunk *chunk)
 {
-   struct sctp_datamsg *msg = chunk->msg;
+   if (!chunk->asoc->prsctp_enable ||
+   !SCTP_PR_POLICY(chunk->sinfo.sinfo_flags)) {
+   struct sctp_datamsg *msg = chunk->msg;
+
+   if (!msg->can_abandon)
+   return 0;
+
+   if (time_after(jiffies, msg->expires_at))
+   return 1;
 
-   if (!msg->can_abandon)
return 0;
+   }
 
-   if (time_after(jiffies, msg->expires_at))
+   if (SCTP_PR_TTL_ENABLED(chunk->sinfo.sinfo_flags) &&
+   time_after(jiffies, chunk->prsctp_param)) {
+   if (chunk->sent_count)
+   chunk->asoc->abandoned_sent[SCTP_PR_INDEX(TTL)]++;
+   else
+   chunk->asoc->abandoned_unsent[SCTP_PR_INDEX(TTL)]++;
return 1;
+   }
 
return 0;
 }
diff --git a/net/sctp/output.c b/net/sctp/output.c
index 2e9223b..7425f6c 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -316,6 +316,8 @@ static sctp_xmit_t __sctp_packet_append_chunk(struct 
sctp_packet *packet,
packet->has_data = 1;
/* timestamp the chunk for rtx purposes */
chunk->sent_at = jiffies;
+   /* Mainly used for prsctp RTX policy */
+   chunk->sent_count++;
break;
case SCTP_CID_COOKIE_ECHO:
packet->has_cookie_echo = 1;
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 0e3045e..2c431ee 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -711,6 +711,17 @@ nodata:
return retval;
 }
 
+static void sctp_set_prsctp_policy(struct sctp_chunk *chunk,
+  const struct sctp_sndrcvinfo *sinfo)
+{
+   if (!chunk->asoc->prsctp_enable)
+   return;
+
+   if (SCTP_PR_TTL_ENABLED(sinfo->sinfo_flags))
+   chunk->prsctp_param =
+   jiffies + msecs_to_jiffies(sinfo->sinfo_timetolive);
+}
+
 /* Make a DATA chunk for the given association from the provided
  * parameters.  However, do not populate the data payload.
  */
@@ -744,6 +755,7 @@ struct sctp_chunk *sctp_make_datafrag_empty(struct 
sctp_association *asoc,
 
retval->subh.data_hdr = sctp_addto_chunk(retval, sizeof(dp), &dp);
memcpy(&retval->sinfo, sinfo, sizeof(struct sctp_sndrcvinfo));
+   sctp_set_prsctp_policy(retval, sinfo);
 
 nodata:
return retval;
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index c3167c4..0861429 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -7099,7 +7099,7 @@ static int sctp_msghdr_parse(const struct msghdr *msg, 
sctp_cmsgs_t *cmsgs)
 
if (cmsgs->srinfo->sinfo_flags &
~(SCTP_UNORDERED | SCTP_ADDR_OVER |
- SCTP_SACK_IMMEDIATELY |
+ SCTP_SACK_IMMEDIATELY | SCTP_PR_SCTP_MASK |
  SCTP_ABORT | SCTP_EOF))
return -EINVAL;
break;
@@ -7123,7 +7123,7 @@ static int sctp_msghdr_parse(const struct msghdr *msg, 
sctp_cmsgs_t *cmsgs)
 
if (cmsg

Re: [LEDE-DEV] DHCP via bridge in case of IPv4

2016-07-09 Thread Aaron Z
On Sat, Jul 9, 2016 at 4:37 AM, Alexey Brodkin
 wrote:
> Hello,
>
> I was playing with quite simple bridged setup on different boards with
> very recent kernels (4.6.3 as of this writing) and found one interesting
> behavior that I cannot yet understand and googling didn't help here as well.
>
> My setup is pretty simple:
> -------------   --------------   -------------------------
> | HOST      |   | "Dumb AP"  |   | Wireless client       |
> | with DHCP |<->(eth0) (wlan0)<->| attempting to         |
> | server    |   |   \ br0 /  |   | get settings via DHCP |
> -------------   --------------   -------------------------
>
> * HOST is my laptop with DHCP server that works for sure.
> * "Dumb AP" is a separate board (I tried ARM-based Wandboard and ARC-based
>   AXS10x boards but results are exactly the same) with wired (eth0) and 
> wireless
>   (wlan0) network controllers bridged together (br0). That "br0" bridge 
> flawlessly
>   gets its settings from DHCP server on host.
> * Wireless client could be either a smartphone or another laptop etc but
>   what's important it should be configured to get network settings by DHCP as 
> well.
>
> So what happens "br0" always gets network settings from DHCP server on HOST.
> That's fine. But wireless client only reliably gets settings from DHCP server
> if IPv6 is enabled on "Dumb AP" board. If IPv6 is disabled I may see that
> wireless client sends "DHCP Discover" then server replies with "DHCP Offer" 
> but
> that offer never reaches wireless client.
Do you have WDS enabled? If not, DHCP has issues in that scenario:
https://wiki.openwrt.org/doc/howto/clientmode

Aaron Z


[PATCH net-next 2/6] sctp: add SCTP_DEFAULT_PRINFO into sctp sockopt

2016-07-09 Thread Xin Long
This patch adds SCTP_DEFAULT_PRINFO to sctp sockopt. It is used
to set/get sctp Partially Reliable Policies' default params,
which includes 3 policies (ttl, rtx, prio) and their values.

Still, if we set policy params in sndinfo, we will use the params
of sndinfo against chunks, instead of the default params.

In this patch, we will use 5-8bit of sp/asoc->default_flags
to store prsctp policies, and reuse asoc->default_timetolive
to store their values. It means if we enable and set prsctp
policy, prior ttl timeout in sctp will not work any more.
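
To illustrate how the policy shares a flags word with other bits, here
is a small stand-alone sketch. The TTL/RTX/PRIO/MASK values mirror the
defines in the diff below; PR_NONE == 0 and all helper names are
assumptions made for this example, not the kernel's macros.

```c
#include <assert.h>
#include <stdint.h>

/* TTL/RTX/PRIO/MASK mirror the patch; PR_NONE == 0 is assumed here. */
enum {
	PR_NONE = 0x0000,
	PR_TTL  = 0x0010,
	PR_RTX  = 0x0020,
	PR_PRIO = 0x0030,
	PR_MASK = 0x0030,
};

/* Extract the policy field from a flags word. */
static uint16_t pr_policy(uint16_t flags)
{
	return flags & PR_MASK;
}

/* Replace the policy field, leaving all other flag bits untouched. */
static uint16_t pr_set_policy(uint16_t flags, uint16_t policy)
{
	assert((policy & ~PR_MASK) == 0);
	return (uint16_t)((flags & ~PR_MASK) | policy);
}

/* Counter array index: TTL -> 0, RTX -> 1, PRIO -> 2. */
static int pr_index(uint16_t policy)
{
	assert(policy != PR_NONE);
	return (policy >> 4) - 1;
}
```

Note that the policy is a value in a two-bit field rather than a set of
independent flags: PR_PRIO equals PR_TTL | PR_RTX, so a policy has to be
tested with an equality compare on the masked value, not with a plain
bit test.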

Signed-off-by: Xin Long 
---
 include/uapi/linux/sctp.h | 29 +++
 net/sctp/socket.c | 91 +++
 2 files changed, 120 insertions(+)

diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index aa08906..984cf2e 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -113,6 +113,29 @@ typedef __s32 sctp_assoc_t;
 #define SCTP_SOCKOPT_CONNECTX3 111 /* CONNECTX requests (updated) */
 #define SCTP_GET_ASSOC_STATS   112 /* Read only */
 #define SCTP_PR_SUPPORTED  113
+#define SCTP_DEFAULT_PRINFO114
+
+/* PR-SCTP policies */
+#define SCTP_PR_SCTP_NONE  0x
+#define SCTP_PR_SCTP_TTL   0x0010
+#define SCTP_PR_SCTP_RTX   0x0020
+#define SCTP_PR_SCTP_PRIO  0x0030
+#define SCTP_PR_SCTP_MAX   SCTP_PR_SCTP_PRIO
+#define SCTP_PR_SCTP_MASK  0x0030
+
+#define __SCTP_PR_INDEX(x) ((x >> 4) - 1)
+#define SCTP_PR_INDEX(x)   __SCTP_PR_INDEX(SCTP_PR_SCTP_ ## x)
+
+#define SCTP_PR_POLICY(x)  ((x) & SCTP_PR_SCTP_MASK)
+#define SCTP_PR_SET_POLICY(flags, x)   \
+   do {\
+   flags &= ~SCTP_PR_SCTP_MASK;\
+   flags |= x; \
+   } while (0)
+
+#define SCTP_PR_TTL_ENABLED(x) (SCTP_PR_POLICY(x) == SCTP_PR_SCTP_TTL)
+#define SCTP_PR_RTX_ENABLED(x) (SCTP_PR_POLICY(x) == SCTP_PR_SCTP_RTX)
+#define SCTP_PR_PRIO_ENABLED(x)(SCTP_PR_POLICY(x) == SCTP_PR_SCTP_PRIO)
 
 /* These are bit fields for msghdr->msg_flags.  See section 5.1.  */
 /* On user space Linux, these live in  as an enum.  */
@@ -903,4 +926,10 @@ struct sctp_paddrthlds {
__u16 spt_pathpfthld;
 };
 
+struct sctp_default_prinfo {
+   sctp_assoc_t pr_assoc_id;
+   __u32 pr_value;
+   __u16 pr_policy;
+};
+
 #endif /* _UAPI_SCTP_H */
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 7460dde..c03fe1b 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -3694,6 +3694,47 @@ out:
return retval;
 }
 
+static int sctp_setsockopt_default_prinfo(struct sock *sk,
+ char __user *optval,
+ unsigned int optlen)
+{
+   struct sctp_default_prinfo info;
+   struct sctp_association *asoc;
+   int retval = -EINVAL;
+
+   if (optlen != sizeof(info))
+   goto out;
+
+   if (copy_from_user(&info, optval, sizeof(info))) {
+   retval = -EFAULT;
+   goto out;
+   }
+
+   if (info.pr_policy & ~SCTP_PR_SCTP_MASK)
+   goto out;
+
+   if (info.pr_policy == SCTP_PR_SCTP_NONE)
+   info.pr_value = 0;
+
+   asoc = sctp_id2assoc(sk, info.pr_assoc_id);
+   if (asoc) {
+   SCTP_PR_SET_POLICY(asoc->default_flags, info.pr_policy);
+   asoc->default_timetolive = info.pr_value;
+   } else if (!info.pr_assoc_id) {
+   struct sctp_sock *sp = sctp_sk(sk);
+
+   SCTP_PR_SET_POLICY(sp->default_flags, info.pr_policy);
+   sp->default_timetolive = info.pr_value;
+   } else {
+   goto out;
+   }
+
+   retval = 0;
+
+out:
+   return retval;
+}
+
 /* API 6.2 setsockopt(), getsockopt()
  *
  * Applications use setsockopt() and getsockopt() to set or retrieve
@@ -3857,6 +3898,9 @@ static int sctp_setsockopt(struct sock *sk, int level, 
int optname,
case SCTP_PR_SUPPORTED:
retval = sctp_setsockopt_pr_supported(sk, optval, optlen);
break;
+   case SCTP_DEFAULT_PRINFO:
+   retval = sctp_setsockopt_default_prinfo(sk, optval, optlen);
+   break;
default:
retval = -ENOPROTOOPT;
break;
@@ -6243,6 +6287,49 @@ out:
return retval;
 }
 
+static int sctp_getsockopt_default_prinfo(struct sock *sk, int len,
+ char __user *optval,
+ int __user *optlen)
+{
+   struct sctp_default_prinfo info;
+   struct sctp_association *asoc;
+   int retval = -EFAULT;
+
+   if (len < sizeof(info)) {
+   retval = -EINVAL;
+   goto out;
+   }
+
+   len = sizeof(info);
+   if (copy_from_user(&info, optval, len))
+   goto out;
+
+   asoc = sctp_id2assoc(sk, info.pr_assoc_id);
+   if (asoc) {
+   info.pr_policy = SCTP_PR_

[PATCH net-next 6/6] sctp: implement prsctp PRIO policy

2016-07-09 Thread Xin Long
prsctp PRIO policy is a policy to abandon lower priority chunks when
asoc doesn't have enough snd buffer, so that the current chunk with
higher priority can be queued successfully.

Similar to TTL/RTX policy, we will set the priority of the chunk to
prsctp_param with sinfo->sinfo_timetolive in sctp_set_prsctp_policy().
So if PRIO policy is enabled, msg->expires_at won't work.

asoc->sent_cnt_removable will record how many chunks can be checked to
remove. If priority policy is enabled, when the chunk is queued into
the out_queue, we will increase sent_cnt_removable. When the chunk is
moved to abandon_queue or dequeue and free, we will decrease
sent_cnt_removable.

In sctp_sendmsg, we will check if there is enough snd buffer for current
msg and if sent_cnt_removable is not 0. Then try to abandon chunks in
sctp_prsctp_prune during sendmsg from the retransmit/transmitted queue, and
free chunks from out_queue in right order until the abandon+free size >
msg_len - sctp_wfree. For the abandon size, we have to wait until it
sends FORWARD TSN, receives the sack and the chunks are really freed.
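
The pruning loop can be pictured with the following user-space sketch.
It works on a plain array instead of the kernel lists, and struct
prio_chunk and prune_lower_priority() are hypothetical names. As in the
sent-queue hunk below, a chunk is only a candidate when its priority
value is numerically larger than that of the new message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct prio_chunk {
	unsigned long prio;	/* role of prsctp_param; larger = less important */
	int size;		/* bytes recovered if the chunk is abandoned */
	bool abandoned;
};

/* Abandon queued chunks whose priority value is larger than new_prio
 * until "needed" bytes are recovered. Returns the bytes still missing;
 * a result <= 0 means enough room was found.
 */
static int prune_lower_priority(struct prio_chunk *q, size_t n,
				unsigned long new_prio, int needed)
{
	assert(q != NULL || n == 0);

	for (size_t i = 0; i < n && needed > 0; i++) {
		if (q[i].abandoned || q[i].prio <= new_prio)
			continue;
		q[i].abandoned = true;
		needed -= q[i].size;
	}
	return needed;
}
```

The loop stops as soon as enough space has been recovered, so chunks
later in the queue are left alone even if they would also qualify.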

Signed-off-by: Xin Long 
---
 include/net/sctp/structs.h |  4 ++
 net/sctp/chunk.c   |  1 +
 net/sctp/outqueue.c| 99 ++
 net/sctp/sm_make_chunk.c   |  3 +-
 net/sctp/socket.c  |  3 ++
 5 files changed, 109 insertions(+), 1 deletion(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 6bcda71..8626bdd 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1084,6 +1084,8 @@ void sctp_retransmit(struct sctp_outq *, struct 
sctp_transport *,
 sctp_retransmit_reason_t);
 void sctp_retransmit_mark(struct sctp_outq *, struct sctp_transport *, __u8);
 int sctp_outq_uncork(struct sctp_outq *, gfp_t gfp);
+void sctp_prsctp_prune(struct sctp_association *asoc,
+  struct sctp_sndrcvinfo *sinfo, int msg_len);
 /* Uncork and flush an outqueue.  */
 static inline void sctp_outq_cork(struct sctp_outq *q)
 {
@@ -1864,6 +1866,8 @@ struct sctp_association {
 
struct sctp_priv_assoc_stats stats;
 
+   int sent_cnt_removable;
+
__u64 abandoned_unsent[SCTP_PR_INDEX(MAX) + 1];
__u64 abandoned_sent[SCTP_PR_INDEX(MAX) + 1];
 };
diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
index b3692b5..a55e547 100644
--- a/net/sctp/chunk.c
+++ b/net/sctp/chunk.c
@@ -360,6 +360,7 @@ int sctp_chunk_abandoned(struct sctp_chunk *chunk)
chunk->asoc->abandoned_sent[SCTP_PR_INDEX(RTX)]++;
return 1;
}
+   /* PRIO policy is processed by sendmsg, not here */
 
return 0;
 }
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 084718f..72e54a4 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -326,6 +326,9 @@ int sctp_outq_tail(struct sctp_outq *q, struct sctp_chunk 
*chunk, gfp_t gfp)
 
sctp_chunk_hold(chunk);
sctp_outq_tail_data(q, chunk);
+   if (chunk->asoc->prsctp_enable &&
+   SCTP_PR_PRIO_ENABLED(chunk->sinfo.sinfo_flags))
+   chunk->asoc->sent_cnt_removable++;
if (chunk->chunk_hdr->flags & SCTP_DATA_UNORDERED)
SCTP_INC_STATS(net, SCTP_MIB_OUTUNORDERCHUNKS);
else
@@ -372,6 +375,96 @@ static void sctp_insert_list(struct list_head *head, 
struct list_head *new)
list_add_tail(new, head);
 }
 
+static int sctp_prsctp_prune_sent(struct sctp_association *asoc,
+ struct sctp_sndrcvinfo *sinfo,
+ struct list_head *queue, int msg_len)
+{
+   struct sctp_chunk *chk, *temp;
+
+   list_for_each_entry_safe(chk, temp, queue, transmitted_list) {
+   if (!SCTP_PR_PRIO_ENABLED(chk->sinfo.sinfo_flags) ||
+   chk->prsctp_param <= sinfo->sinfo_timetolive)
+   continue;
+
+   list_del_init(&chk->transmitted_list);
+   sctp_insert_list(&asoc->outqueue.abandoned,
+&chk->transmitted_list);
+
+   asoc->sent_cnt_removable--;
+   asoc->abandoned_sent[SCTP_PR_INDEX(PRIO)]++;
+
+   if (!chk->tsn_gap_acked) {
+   if (chk->transport)
+   chk->transport->flight_size -=
+   sctp_data_size(chk);
+   asoc->outqueue.outstanding_bytes -= sctp_data_size(chk);
+   }
+
+   msg_len -= SCTP_DATA_SNDSIZE(chk) +
+  sizeof(struct sk_buff) +
+  sizeof(struct sctp_chunk);
+   if (msg_len <= 0)
+   break;
+   }
+
+   return msg_len;
+}
+
+static int sctp_prsctp_prune_unsent(struct sctp_association *asoc,
+ 

[PATCH net-next 0/6] sctp: implement rfc7496 in sctp

2016-07-09 Thread Xin Long
This patchset implements "Additional Policies for the Partially Reliable
Stream Control Transmission Protocol Extension" described in RFC7496.

The Partially Reliable SCTP (PR-SCTP) extension defined in [RFC3758]
provides a generic method for senders to abandon user messages. The
decision to abandon a user message is sender side only, and the exact
condition is called a "PR-SCTP policy". This patchset implements 3
policies:

 1. Timed Reliability:  This allows the sender to specify a timeout for
a user message after which the SCTP stack abandons the user message.

 2. Limited Retransmission Policy:  Allows limitation of the number of
retransmissions.

 3. Priority Policy:  Allows removal of lower-priority messages if space
for higher-priority messages is needed in the send buffer.

Patch 1-3 add some sockopts in sctp to set/get pr_sctp policy status.
Patch 4-6 implement these 3 policies one by one.

Xin Long (6):
  sctp: add SCTP_PR_SUPPORTED on sctp sockopt
  sctp: add SCTP_DEFAULT_PRINFO into sctp sockopt
  sctp: add SCTP_PR_ASSOC_STATUS on sctp sockopt
  sctp: implement prsctp TTL policy
  sctp: implement prsctp RTX policy
  sctp: implement prsctp PRIO policy

 include/net/sctp/structs.h |  23 -
 include/uapi/linux/sctp.h  |  42 
 net/sctp/associola.c   |   1 +
 net/sctp/chunk.c   |  25 -
 net/sctp/endpointola.c |   1 +
 net/sctp/output.c  |   2 +
 net/sctp/outqueue.c|  99 +++
 net/sctp/sm_make_chunk.c   |  27 +++--
 net/sctp/socket.c  | 240 -
 9 files changed, 447 insertions(+), 13 deletions(-)

-- 
2.1.0



[PATCH net-next 1/6] sctp: add SCTP_PR_SUPPORTED on sctp sockopt

2016-07-09 Thread Xin Long
According to section 4.5 of rfc7496, prsctp_enable should be per asoc.
We will add prsctp_enable to both asoc and ep, and replace the places
where it used net.sctp->prsctp_enable with asoc->prsctp_enable.

ep->prsctp_enable will be initialized with net.sctp->prsctp_enable, and
asoc->prsctp_enable will be initialized with ep->prsctp_enable. We can
also modify its value through sockopt SCTP_PR_SUPPORTED.
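
The inheritance chain (net -> ep -> asoc) can be sketched in user space
as below. The three structs and both init helpers are hypothetical
stand-ins, and the sketch assumes the sockopt changes the endpoint copy
of the flag, which the text above does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the per-net, per-endpoint, per-asoc state. */
struct net_sketch  { bool prsctp_enable; };
struct ep_sketch   { bool prsctp_enable; };
struct asoc_sketch { bool prsctp_enable; };

/* A new endpoint copies the per-net default. */
static void ep_init(struct ep_sketch *ep, const struct net_sketch *net)
{
	assert(ep && net);
	ep->prsctp_enable = net->prsctp_enable;
}

/* A new association copies whatever its endpoint holds at that moment. */
static void asoc_init(struct asoc_sketch *asoc, const struct ep_sketch *ep)
{
	assert(asoc && ep);
	asoc->prsctp_enable = ep->prsctp_enable;
}
```

Because each level takes a copy at init time, changing the endpoint
value later affects only associations created afterwards; the per-net
default and already existing associations keep their setting.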

Signed-off-by: Xin Long 
---
 include/net/sctp/structs.h |  6 ++--
 include/uapi/linux/sctp.h  |  1 +
 net/sctp/associola.c   |  1 +
 net/sctp/endpointola.c |  1 +
 net/sctp/sm_make_chunk.c   | 12 +++
 net/sctp/socket.c  | 80 ++
 6 files changed, 93 insertions(+), 8 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 83c5ec5..07115ca 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1256,7 +1256,8 @@ struct sctp_endpoint {
/* SCTP-AUTH: endpoint shared keys */
struct list_head endpoint_shared_keys;
__u16 active_key_id;
-   __u8  auth_enable;
+   __u8  auth_enable:1,
+ prsctp_enable:1;
 };
 
 /* Recover the outter endpoint structure. */
@@ -1848,7 +1849,8 @@ struct sctp_association {
__u16 active_key_id;
 
__u8 need_ecne:1,   /* Need to send an ECNE Chunk? */
-temp:1;/* Is it a temporary association? */
+temp:1,/* Is it a temporary association? */
+prsctp_enable:1;
 
struct sctp_priv_assoc_stats stats;
 };
diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index ce70fe6..aa08906 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -112,6 +112,7 @@ typedef __s32 sctp_assoc_t;
 #define SCTP_SOCKOPT_CONNECTX  110 /* CONNECTX requests. */
 #define SCTP_SOCKOPT_CONNECTX3 111 /* CONNECTX requests (updated) */
 #define SCTP_GET_ASSOC_STATS   112 /* Read only */
+#define SCTP_PR_SUPPORTED  113
 
 /* These are bit fields for msghdr->msg_flags.  See section 5.1.  */
 /* On user space Linux, these live in  as an enum.  */
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index e1849f3..1c23060 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -268,6 +268,7 @@ static struct sctp_association 
*sctp_association_init(struct sctp_association *a
goto fail_init;
 
asoc->active_key_id = ep->active_key_id;
+   asoc->prsctp_enable = ep->prsctp_enable;
 
/* Save the hmacs and chunks list into this association */
if (ep->auth_hmacs_list)
diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c
index 9d494e3..1f03065 100644
--- a/net/sctp/endpointola.c
+++ b/net/sctp/endpointola.c
@@ -163,6 +163,7 @@ static struct sctp_endpoint *sctp_endpoint_init(struct 
sctp_endpoint *ep,
 */
ep->auth_hmacs_list = auth_hmacs;
ep->auth_chunk_list = auth_chunks;
+   ep->prsctp_enable = net->sctp.prsctp_enable;
 
return ep;
 
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 56f364d..0e3045e 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -261,7 +261,7 @@ struct sctp_chunk *sctp_make_init(const struct 
sctp_association *asoc,
chunksize += WORD_ROUND(SCTP_SAT_LEN(num_types));
chunksize += sizeof(ecap_param);
 
-   if (net->sctp.prsctp_enable)
+   if (asoc->prsctp_enable)
chunksize += sizeof(prsctp_param);
 
/* ADDIP: Section 4.2.7:
@@ -355,7 +355,7 @@ struct sctp_chunk *sctp_make_init(const struct 
sctp_association *asoc,
sctp_addto_param(retval, num_ext, extensions);
}
 
-   if (net->sctp.prsctp_enable)
+   if (asoc->prsctp_enable)
sctp_addto_chunk(retval, sizeof(prsctp_param), &prsctp_param);
 
if (sp->adaptation_ind) {
@@ -2024,8 +2024,8 @@ static void sctp_process_ext_param(struct 
sctp_association *asoc,
for (i = 0; i < num_ext; i++) {
switch (param.ext->chunks[i]) {
case SCTP_CID_FWD_TSN:
-   if (net->sctp.prsctp_enable && 
!asoc->peer.prsctp_capable)
-   asoc->peer.prsctp_capable = 1;
+   if (asoc->prsctp_enable && !asoc->peer.prsctp_capable)
+   asoc->peer.prsctp_capable = 1;
break;
case SCTP_CID_AUTH:
/* if the peer reports AUTH, assume that he
@@ -2169,7 +2169,7 @@ static sctp_ierror_t sctp_verify_param(struct net *net,
break;
 
case SCTP_PARAM_FWD_TSN_SUPPORT:
-   if (net->sctp.prsctp_enable)
+   if (ep->prsctp_enable)
break;
goto fallthrough;
 
@@ -2653,7 +2653,7 @@ do_addr_param:
break;
 
case SCTP_PARAM_FWD_TSN_SUPPORT:
-   if (net->sctp.prsctp_enabl

Re: XDP seeking input from NIC hardware vendors

2016-07-09 Thread Jesper Dangaard Brouer
On Fri, 8 Jul 2016 18:51:07 +0100
Jakub Kicinski  wrote:

> On Fri, 8 Jul 2016 09:45:25 -0700, John Fastabend wrote:
> > The only distinction between VFs and queue groupings on my side is VFs
> > provide RSS where as queue groupings have to be selected explicitly.
> > In a programmable NIC world the distinction might be lost if a "RSS"
> > program can be loaded into the NIC to select queues but for existing
> > hardware the distinction is there.  
> 
> To do BPF RSS we need a way to select the queue which I think is all
> Jesper wanted.  So we will have to tackle the queue selection at some
> point.  The main obstacle with it for me is to define what queue
> selection means when program is not offloaded to HW...  Implementing
> queue selection on HW side is trivial.

Yes, I do see the problem of fallback, when the program's "filter" demux
cannot be offloaded to hardware.

First I thought it was a good idea to keep the "demux-filter" part of
the eBPF program, as software fallback can still apply this filter in
SW, and just mark the packets as not-zero-copy-safe.  But when HW
offloading is not possible, then packets can be delivered to every RX
queue, and SW would need to handle that, which is hard to keep transparent.


> > If you demux using a eBPF program or via a filter model like
> > flow_director or cls_{u32|flower} I think we can support both. And this
> > just depends on the programmability of the hardware. Note flow_director
> > and cls_{u32|flower} steering to VFs is already in place.  

Maybe we should keep HW demuxing as a separate setup step.

Today I can almost do what I want: by setting up ntuple filters, and (if
Alexei allows it) assign an application specific XDP eBPF program to a
specific RX queue.

 ethtool -K eth2 ntuple on
 ethtool -N eth2 flow-type udp4 dst-ip 192.168.254.1 dst-port 53 action 42

Then the XDP program can be attached to RX queue 42, and
promise/guarantee that it will consume all packets.  And then the
backing page-pool can allow zero-copy RX (and enable scrubbing when
refilling pool).


> Yes, for steering to VFs we could potentially reuse a lot of existing
> infrastructure.
> 
> > The question I have is should the "filter" part of the eBPF program
> > be a separate program from the XDP program and loaded using specific
> > semantics (e.g. "load_hardware_demux" ndo op) at the risk of building
> > a ever growing set of "ndo" ops. If you are running multiple XDP
> > programs on the same NIC hardware then I think this actually makes
> > sense otherwise how would the hardware and even software find the
> > "demux" logic. In this model there is a "demux" program that selects
> > a queue/VF and a program that runs on the netdev queues.  
> 
> I don't think we should enforce the separation here.  What we may want
> to do before forwarding to the VF can be much more complicated than
> pure demux/filtering (simple eg - pop VLAN/tunnel).  VF representative
> model works well here as fallback - if program could not be offloaded
> it will be run on the host and "trombone" packets via VFR into the VF.

That is an interesting idea.

> If we have a chain of BPF programs we can order them in increasing
> level of complexity/features required and then HW could transparently
> offload the first parts - the easier ones - leaving more complex
> processing on the host.

I'll try to keep out of the discussion of how to structure the BPF
program, as it is outside my "area".
 
> This should probably be paired with some sort of "skip-sw" flag to let
> user space enforce the HW offload on the fast path part.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net] udp: prevent bugcheck if filter truncates packet too much

2016-07-09 Thread Michal Kubecek
On Sat, Jul 09, 2016 at 11:48:49AM +0200, Daniel Borkmann wrote:
> On 07/09/2016 02:20 AM, Alexei Starovoitov wrote:
> >On Sat, Jul 09, 2016 at 01:31:40AM +0200, Eric Dumazet wrote:
> >>On Fri, 2016-07-08 at 17:52 +0200, Michal Kubecek wrote:
> >>>If socket filter truncates an udp packet below the length of UDP header
> >>>in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a
> >>>BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if
> >>>kernel is configured that way) can be easily enforced by an unprivileged
> >>>user which was reported as CVE-2016-6162. For a reproducer, see
> >>>http://seclists.org/oss-sec/2016/q3/8
> >>>
> >>>Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before 
> >>>queueing")
> >>>Reported-by: Marco Grassi 
> >>>Signed-off-by: Michal Kubecek 
> >>>---
> >>>  net/ipv4/udp.c | 2 ++
> >>>  net/ipv6/udp.c | 2 ++
> >>>  2 files changed, 4 insertions(+)
> >>>
> >>>diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> >>>index ca5e8ea29538..4aed8fc23d32 100644
> >>>--- a/net/ipv4/udp.c
> >>>+++ b/net/ipv4/udp.c
> >>>@@ -1583,6 +1583,8 @@ int udp_queue_rcv_skb(struct sock *sk, struct 
> >>>sk_buff *skb)
> >>>
> >>>   if (sk_filter(sk, skb))
> >>>   goto drop;
> >>>+  if (unlikely(skb->len < sizeof(struct udphdr)))
> >>>+  goto drop;
> >>>
> >>>   udp_csum_pull_header(skb);
> >>>   if (sk_rcvqueues_full(sk, sk->sk_rcvbuf)) {
> >>>diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
> >>>index 005dc82c2138..acc09705618b 100644
> >>>--- a/net/ipv6/udp.c
> >>>+++ b/net/ipv6/udp.c
> >>>@@ -620,6 +620,8 @@ int udpv6_queue_rcv_skb(struct sock *sk, struct 
> >>>sk_buff *skb)
> >>>
> >>>   if (sk_filter(sk, skb))
> >>>   goto drop;
> >>>+  if (unlikely(skb->len < sizeof(struct udphdr)))
> >>>+  goto drop;
> >>>
> >>>   udp_csum_pull_header(skb);
> >>>   if (sk_rcvqueues_full(sk, sk->sk_rcvbuf)) {
> >>
> >>
> >>Arg :(
> >>
> >>Acked-by: Eric Dumazet 
> >
> >this is incomplete fix. Please do not apply. See discussion at 
> >security@kernel
> 
> Ohh well, didn't see it earlier before starting the discussion at security@...
> 
> I'm okay if we take this for now as a quick band aid and find a better
> way how to deal with the underlying issue long-term so that it's
> /guaranteed/ that it doesn't bite us any further in such fragile ways.

Agreed. As rc7 is due in a day or two, rushing a complex and intrusive
solution in might be too risky.

Michal Kubecek



[iproute PATCH] ip-address.8: Document autojoin flag

2016-07-09 Thread Phil Sutter
Description copied from related kernel support commit message with a
little tailoring to fit.

While at it, fix font of non-terminal CONFFLAG-LIST in synopsis.

Signed-off-by: Phil Sutter 
---
 man/man8/ip-address.8.in | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/man/man8/ip-address.8.in b/man/man8/ip-address.8.in
index 7cb9271731339..ab59d270321d9 100644
--- a/man/man8/ip-address.8.in
+++ b/man/man8/ip-address.8.in
@@ -77,14 +77,15 @@ ip-address \- protocol address management
 .IR FLAG " := "
 .RB "[ " permanent " | " dynamic " | " secondary " | " primary " |"
 .RB [ - ] tentative " | [" - ] deprecated " | [" - ] dadfailed " |"
-.BR temporary " | " CONFFLAG-LIST " ]"
+.BR temporary " |"
+.IR CONFFLAG-LIST " ]"
 
 .ti -8
 .IR CONFFLAG-LIST " := [ "  CONFFLAG-LIST " ] " CONFFLAG
 
 .ti -8
 .IR CONFFLAG " := "
-.RB "[ " home " | " mngtmpaddr " | " nodad " | " noprefixroute " ]"
+.RB "[ " home " | " mngtmpaddr " | " nodad " | " noprefixroute " | " autojoin 
" ]"
 
 .ti -8
 .IR LIFETIME " := [ "
@@ -249,6 +250,22 @@ address, and don't search for one to delete when removing 
the address. Changing
 an address to add this flag will remove the automatically added prefix route,
 changing it to remove this flag will create the prefix route automatically.
 
+.TP
+.B autojoin
+Joining multicast group on ethernet level via "ip maddr" command would not work
+if we have an Ethernet switch that does igmp snooping since the switch would
+not replicate multicast packets on ports that did not have IGMP reports for the
+multicast addresses.
+
+Linux vxlan interfaces created via "ip link add vxlan" have the group option
+that enables them to do the required join.
+
+Using the
+.B autojoin
+flag when adding a multicast address enables similar functionality for
+openvswitch vxlan interfaces as well as other tunneling mechanisms that need to
+receive multicast traffic.
+
 .SS ip address delete - delete protocol address
 .B Arguments:
 coincide with the arguments of
-- 
2.8.2



Re: [PATCH net] udp: prevent bugcheck if filter truncates packet too much

2016-07-09 Thread Daniel Borkmann

On 07/09/2016 02:20 AM, Alexei Starovoitov wrote:

On Sat, Jul 09, 2016 at 01:31:40AM +0200, Eric Dumazet wrote:

On Fri, 2016-07-08 at 17:52 +0200, Michal Kubecek wrote:

If socket filter truncates an udp packet below the length of UDP header
in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a
BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if
kernel is configured that way) can be easily enforced by an unprivileged
user which was reported as CVE-2016-6162. For a reproducer, see
http://seclists.org/oss-sec/2016/q3/8

Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
Reported-by: Marco Grassi 
Signed-off-by: Michal Kubecek 
---
  net/ipv4/udp.c | 2 ++
  net/ipv6/udp.c | 2 ++
  2 files changed, 4 insertions(+)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ca5e8ea29538..4aed8fc23d32 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1583,6 +1583,8 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff 
*skb)

if (sk_filter(sk, skb))
goto drop;
+   if (unlikely(skb->len < sizeof(struct udphdr)))
+   goto drop;

udp_csum_pull_header(skb);
if (sk_rcvqueues_full(sk, sk->sk_rcvbuf)) {
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 005dc82c2138..acc09705618b 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -620,6 +620,8 @@ int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff 
*skb)

if (sk_filter(sk, skb))
goto drop;
+   if (unlikely(skb->len < sizeof(struct udphdr)))
+   goto drop;

udp_csum_pull_header(skb);
if (sk_rcvqueues_full(sk, sk->sk_rcvbuf)) {



Arg :(

Acked-by: Eric Dumazet 


this is incomplete fix. Please do not apply. See discussion at security@kernel


Ohh well, didn't see it earlier before starting the discussion at security@...

I'm okay if we take this for now as a quick band-aid and find a better way
to deal with the underlying issue long-term, so that it's /guaranteed/ that it
doesn't bite us any further in such fragile ways.
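
To illustrate the check discussed in this thread, here is a minimal userspace
sketch, not kernel code. `struct pkt`, `filter_trim()` and `udp_may_queue()`
are hypothetical names invented for the example; only the ordering (run the
filter, then check the remaining length, then strip the header) is taken from
the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A UDP header is 8 bytes: source port, destination port, length, checksum. */
#define UDP_HDR_LEN 8u

/* Hypothetical stand-in for an skb: only the length matters here. */
struct pkt {
    size_t len;
};

/* Models a socket filter that trims the packet to at most keep_len bytes. */
static void filter_trim(struct pkt *p, size_t keep_len)
{
    assert(p != NULL);
    if (p->len > keep_len)
        p->len = keep_len;
}

/*
 * Returns true if the packet may be queued, false if it must be dropped.
 * The length check sits between the filter and the header pull, so the
 * pull can never run on a packet shorter than the header.
 */
static bool udp_may_queue(struct pkt *p, size_t filter_keep_len)
{
    filter_trim(p, filter_keep_len);
    if (p->len < UDP_HDR_LEN)
        return false;           /* drop instead of pulling past the end */
    p->len -= UDP_HDR_LEN;      /* models pulling the UDP header */
    return true;
}
```

With a 100-byte packet and a filter that keeps only 4 bytes, `udp_may_queue()`
reports a drop and never attempts the pull. This mirrors the band-aid only; it
does not address the underlying issue mentioned above.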


[iproute PATCH 0/6] iplink: Improve documentation

2016-07-09 Thread Phil Sutter
This series improves documentation around the feature of 'ip link set'
to configure device type specific parameters. While doing so, I reviewed
used fonts in ip-link.8 and fixed a few bugs/inconsistencies.

Phil Sutter (6):
  iplink: List valid 'type' argument in ip link help text
  iplink: bond_slave: Add missing help functions
  ip-link.8: Extend type list in synopsis
  ip-link.8: Place 'ip link set' warning more prominently
  ip-link.8: Add slave type option descriptions
  ip-link.8: Fix font choices

 ip/iplink.c|   4 +-
 ip/iplink_bond_slave.c |  24 
 man/man8/ip-link.8.in  | 314 +++--
 3 files changed, 254 insertions(+), 88 deletions(-)

-- 
2.8.2



[iproute PATCH 4/6] ip-link.8: Place 'ip link set' warning more prominently

2016-07-09 Thread Phil Sutter
This moves the warning to the beginning of the section about 'ip link
set' which makes it still stand out after adding more text to its end.

Signed-off-by: Phil Sutter 
---
 man/man8/ip-link.8.in | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 4fc5f3c6257e5..2678d37df7478 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -1015,6 +1015,18 @@ specifies the type of the device.
 
 .SS ip link set - change device attributes
 
+.PP
+.B Warning:
+If multiple parameter changes are requested,
+.B ip
+aborts immediately after any of the changes have failed.
+This is the only case when
+.B ip
+can move the system to an unpredictable state. The solution
+is to avoid changing several parameters with one
+.B ip link set
+call.
+
 .TP
 .BI dev " DEVICE "
 .I DEVICE
@@ -1235,18 +1247,6 @@ set the IPv6 address generation mode
 .BR "link-netnsid "
 set peer netnsid for a cross-netns interface
 
-.PP
-.B Warning:
-If multiple parameter changes are requested,
-.B ip
-aborts immediately after any of the changes have failed.
-This is the only case when
-.B ip
-can move the system to an unpredictable state. The solution
-is to avoid changing several parameters with one
-.B ip link set
-call.
-
 .SS  ip link show - display device attributes
 
 .TP
-- 
2.8.2



[iproute PATCH 3/6] ip-link.8: Extend type list in synopsis

2016-07-09 Thread Phil Sutter
'ip link set' supports passing a type to set type-specific parameters.
Add this missing piece of information to the synopsis section.

Signed-off-by: Phil Sutter 
---
 man/man8/ip-link.8.in | 71 +--
 1 file changed, 40 insertions(+), 31 deletions(-)

diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index ad18f7555d0a5..4fc5f3c6257e5 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -40,35 +40,6 @@ ip-link \- network device configuration
 .RI "[ " ARGS " ]"
 
 .ti -8
-.IR TYPE " := [ "
-.BR bridge " | "
-.BR bond " | "
-.BR can " | "
-.BR dummy " | "
-.BR hsr " | "
-.BR ifb " | "
-.BR ipoib " |"
-.BR macvlan  " | "
-.BR macvtap  " | "
-.BR vcan " | "
-.BR veth " | "
-.BR vlan " | "
-.BR vxlan " |"
-.BR ip6tnl " |"
-.BR ipip " |"
-.BR sit " |"
-.BR gre " |"
-.BR gretap " |"
-.BR ip6gre " |"
-.BR ip6gretap " |"
-.BR vti " |"
-.BR nlmon " |"
-.BR ipvlan " |"
-.BR lowpan " |"
-.BR geneve " |"
-.BR vrf " ]"
-
-.ti -8
 .BR "ip link delete " {
 .IR DEVICE " | "
 .BI "group " GROUP
@@ -80,7 +51,12 @@ ip-link \- network device configuration
 .BR "ip link set " {
 .IR DEVICE " | "
 .BI "group " GROUP
-.RB "} [ { " up " | " down " } ]"
+}
+.br
+.RB "[ { " up " | " down " } ]"
+.br
+.RB "[ " type
+.IR "ETYPE TYPE_ARGS" " ]"
 .br
 .RB "[ " arp " { " on " | " off " } ]"
 .br
@@ -169,7 +145,7 @@ ip-link \- network device configuration
 .B master
 .IR DEVICE " ] ["
 .B type
-.IR TYPE " ]"
+.IR ETYPE " ]"
 .B vrf
 .IR NAME " ]"
 
@@ -177,6 +153,39 @@ ip-link \- network device configuration
 .B ip link help
 .RI "[ " TYPE " ]"
 
+.ti -8
+.IR TYPE " := [ "
+.BR bridge " | "
+.BR bond " | "
+.BR can " | "
+.BR dummy " | "
+.BR hsr " | "
+.BR ifb " | "
+.BR ipoib " |"
+.BR macvlan  " | "
+.BR macvtap  " | "
+.BR vcan " | "
+.BR veth " | "
+.BR vlan " | "
+.BR vxlan " |"
+.BR ip6tnl " |"
+.BR ipip " |"
+.BR sit " |"
+.BR gre " |"
+.BR gretap " |"
+.BR ip6gre " |"
+.BR ip6gretap " |"
+.BR vti " |"
+.BR nlmon " |"
+.BR ipvlan " |"
+.BR lowpan " |"
+.BR geneve " |"
+.BR vrf " ]"
+
+.ti -8
+.IR ETYPE " := [ " TYPE " |"
+.BR bridge_slave " | " bond_slave " ]"
+
 .SH "DESCRIPTION"
 .SS ip link add - add virtual link
 
-- 
2.8.2



[iproute PATCH 1/6] iplink: List valid 'type' argument in ip link help text

2016-07-09 Thread Phil Sutter
Signed-off-by: Phil Sutter 
---
 ip/iplink.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/ip/iplink.c b/ip/iplink.c
index f2a2e13cf0c5b..0e3cee6af8b6f 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -55,7 +55,9 @@ void iplink_usage(void)
fprintf(stderr, "   type TYPE [ ARGS ]\n");
fprintf(stderr, "   ip link delete { DEVICE | dev DEVICE | group DEVGROUP } type TYPE [ ARGS ]\n");
fprintf(stderr, "\n");
-   fprintf(stderr, "   ip link set { DEVICE | dev DEVICE | group DEVGROUP } [ { up | down } ]\n");
+   fprintf(stderr, "   ip link set { DEVICE | dev DEVICE | group DEVGROUP }\n");
+   fprintf(stderr, " [ { up | down } ]\n");
+   fprintf(stderr, " [ type TYPE ARGS ]\n");
} else
fprintf(stderr, "Usage: ip link set DEVICE [ { up | down } ]\n");
 
-- 
2.8.2



[iproute PATCH 2/6] iplink: bond_slave: Add missing help functions

2016-07-09 Thread Phil Sutter
Signed-off-by: Phil Sutter 
---
 ip/iplink_bond_slave.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/ip/iplink_bond_slave.c b/ip/iplink_bond_slave.c
index d67793237edfc..9c60dea8a2757 100644
--- a/ip/iplink_bond_slave.c
+++ b/ip/iplink_bond_slave.c
@@ -17,6 +17,16 @@
 #include "utils.h"
 #include "ip_common.h"
 
+static void print_explain(FILE *f)
+{
+   fprintf(f, "Usage: ... bond_slave [ queue_id ID ]\n");
+}
+
+static void explain(void)
+{
+   print_explain(stderr);
+}
+
 static const char *slave_states[] = {
[BOND_STATE_ACTIVE] = "ACTIVE",
[BOND_STATE_BACKUP] = "BACKUP",
@@ -99,6 +109,13 @@ static int bond_slave_parse_opt(struct link_util *lu, int argc, char **argv,
if (get_u16(&queue_id, *argv, 0))
invarg("queue_id is invalid", *argv);
addattr16(n, 1024, IFLA_BOND_SLAVE_QUEUE_ID, queue_id);
+   } else {
+   if (matches(*argv, "help") != 0)
+   fprintf(stderr,
+   "bond_slave: unknown option \"%s\"?\n",
+   *argv);
+   explain();
+   return -1;
}
argc--, argv++;
}
@@ -106,10 +123,17 @@ static int bond_slave_parse_opt(struct link_util *lu, int argc, char **argv,
return 0;
 }
 
+static void bond_slave_print_help(struct link_util *lu, int argc, char **argv,
+ FILE *f)
+{
+   print_explain(f);
+}
+
 struct link_util bond_slave_link_util = {
.id = "bond",
.maxattr= IFLA_BOND_SLAVE_MAX,
.print_opt  = bond_slave_print_opt,
.parse_opt  = bond_slave_parse_opt,
+   .print_help = bond_slave_print_help,
.slave  = true,
 };
-- 
2.8.2



[iproute PATCH 6/6] ip-link.8: Fix font choices

2016-07-09 Thread Phil Sutter
Signed-off-by: Phil Sutter 
---
 man/man8/ip-link.8.in | 92 ++-
 1 file changed, 47 insertions(+), 45 deletions(-)

diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index ada20fe210793..1644af0ef9f9f 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -304,7 +304,7 @@ the following additional arguments are supported:
 .BI "ip link add
 .BI link " DEVICE "
 .BI name " NAME "
-.BI type " vlan "
+.B "type vlan"
 [
 .BI protocol " VLAN_PROTO "
 ]
@@ -406,7 +406,7 @@ For a link of type
 the following additional arguments are supported:
 
 .BI "ip link add " DEVICE
-.BI type " vxlan " id " ID"
+.BI type " vxlan " id " VNI"
 [
 .BI dev " PHYS_DEV "
 .RB " ] [ { " group " | " remote " } "
@@ -425,27 +425,27 @@ the following additional arguments are supported:
 ] [
 .BI srcport " MIN MAX "
 ] [
-.I "[no]learning "
+.RB [ no ] learning
 ] [
-.I "[no]proxy "
+.RB [ no ] proxy
 ] [
-.I "[no]rsc "
+.RB [ no ] rsc
 ] [
-.I "[no]l2miss "
+.RB [ no ] l2miss
 ] [
-.I "[no]l3miss "
+.RB [ no ] l3miss
 ] [
-.I "[no]udpcsum "
+.RB [ no ] udpcsum
 ] [
-.I "[no]udp6zerocsumtx "
+.RB [ no ] udp6zerocsumtx
 ] [
-.I "[no]udp6zerocsumrx "
+.RB [ no ] udp6zerocsumrx
 ] [
 .BI ageing " SECONDS "
 ] [
 .BI maxaddress " NUMBER "
 ] [
-.RI "[no]external "
+.RB [ no ] external
 ] [
 .B gbp
 ] [
@@ -502,36 +502,36 @@ parameter.
 source ports to communicate to the remote VXLAN tunnel endpoint.
 
 .sp
-.I [no]learning
+.RB [ no ] learning
 - specifies if unknown source link layer addresses and IP addresses
 are entered into the VXLAN device forwarding database.
 
 .sp
-.I [no]rsc
+.RB [ no ] rsc
 - specifies if route short circuit is turned on.
 
 .sp
-.I [no]proxy
+.RB [ no ] proxy
 - specifies ARP proxy is turned on.
 
 .sp
-.I [no]l2miss
+.RB [ no ] l2miss
 - specifies if netlink LLADDR miss notifications are generated.
 
 .sp
-.I [no]l3miss
+.RB [ no ] l3miss
 - specifies if netlink IP ADDR miss notifications are generated.
 
 .sp
-.I [no]udpcsum
+.RB [ no ] udpcsum
 - specifies if UDP checksum is calculated for transmitted packets over IPv4.
 
 .sp
-.I [no]udp6zerocsumtx
+.RB [ no ] udp6zerocsumtx
 - skip UDP checksum calculation for transmitted packets over IPv6.
 
 .sp
-.I [no]udp6zerocsumrx
+.RB [ no ] udp6zerocsumrx
 - allow incoming UDP packets over IPv6 with zero checksum field.
 
 .sp
@@ -543,7 +543,7 @@ are entered into the VXLAN device forwarding database.
 - specifies the maximum number of FDB entries.
 
 .sp
-.I [no]external
+.RB [ no ] external
 - specifies whether an external control plane
 .RB "(e.g. " "ip route encap" )
 or the internal FDB should be used.
@@ -607,18 +607,18 @@ For a link of types
 the following additional arguments are supported:
 
 .BI "ip link add " DEVICE
-.BR type " { gre | ipip | sit } "
+.BR type " { " gre " | " ipip " | " sit " }"
 .BI " remote " ADDR " local " ADDR
 [
-.BR encap " { fou | gue | none } "
+.BR encap " { " fou " | " gue " | " none " }"
 ] [
-.BI "encap-sport { " PORT " | auto } "
+.BR encap-sport " { " \fIPORT " | " auto " }"
 ] [
 .BI "encap-dport " PORT
 ] [
-.I " [no]encap-csum "
+.RB [ no ] encap-csum
 ] [
-.I " [no]encap-remcsum "
+.RB [ no ] encap-remcsum
 ]
 
 .in +8
@@ -632,12 +632,12 @@ the following additional arguments are supported:
 It must be an address on another interface on this host.
 
 .sp
-.BR encap " { fou | gue | none } "
+.BR encap " { " fou " | " gue " | " none " }"
 - specifies type of secondary UDP encapsulation. "fou" indicates
 Foo-Over-UDP, "gue" indicates Generic UDP Encapsulation.
 
 .sp
-.BI "encap-sport { " PORT " | auto } "
+.BR encap-sport " { " \fIPORT " | " auto " }"
 - specifies the source port in UDP encapsulation.
 .IR PORT
 indicates the port by number, "auto"
@@ -646,12 +646,12 @@ indicates that the port number should be chosen automatically
 encapsulated packet).
 
 .sp
-.I [no]encap-csum
+.RB [ no ] encap-csum
 - specifies if UDP checksums are enabled in the secondary
 encapsulation.
 
 .sp
-.I [no]encap-remcsum
+.RB [ no ] encap-remcsum
 - specifies if Remote Checksum Offload is enabled. This is only
 applicable for Generic UDP Encapsulation.
 
@@ -664,13 +664,15 @@ For a link of type
 the following additional arguments are supported:
 
 .BI "ip link add " DEVICE
-.BI type " { ip6gre | ip6gretap }  " remote " ADDR " local " ADDR
+.BR type " { " ip6gre " | " ip6gretap " }"
+.BI remote " ADDR " local " ADDR"
 [
-.I "[i|o]seq]"
+.RB [ i | o ] seq
 ] [
-.I "[i|o]key" KEY
+.RB [ i | o ] key
+.I KEY
 ] [
-.I " [i|o]csum "
+.RB [ i | o ] csum
 ] [
 .BI hoplimit " TTL "
 ] [
@@ -696,7 +698,7 @@ the following additional arguments are supported:
 It must be an address on another interface on this host.
 
 .sp
-.BI  [i|o]seq
+.RB  [ i | o ] seq
 - serialize packets.
 The
 .B oseq
@@ -706,7 +708,7 @@ The
 flag requires that all input packets are serialized.
 
 .sp
-.BI  [i|o]key " KEY"
+.RB  [ i | o ] key " \fIKEY"
 - use keyed GRE with key
 .IR KEY ". "KEY
 is either a number 

[iproute PATCH 5/6] ip-link.8: Add slave type option descriptions

2016-07-09 Thread Phil Sutter
Signed-off-by: Phil Sutter 
---
 man/man8/ip-link.8.in | 129 ++
 1 file changed, 129 insertions(+)

diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 2678d37df7478..ada20fe210793 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -1247,6 +1247,135 @@ set the IPv6 address generation mode
 .BR "link-netnsid "
 set peer netnsid for a cross-netns interface
 
+.TP
+.BI type " ETYPE TYPE_ARGS"
+Change type-specific settings. For a list of supported types and arguments refer
+to the description of
+.B "ip link add"
+above. In addition to that, it is possible to manipulate settings to slave
+devices:
+
+.TP
+Bridge Slave Support
+For a link with master
+.B bridge
+the following additional arguments are supported:
+
+.B "ip link set type bridge_slave"
+[
+.BI state " STATE"
+] [
+.BI priority " PRIO"
+] [
+.BI cost " COST"
+] [
+.BR guard " { " on " | " off " }"
+] [
+.BR hairpin " { " on " | " off " }"
+] [
+.BR fastleave " { " on " | " off " }"
+] [
+.BR root_block " { " on " | " off " }"
+] [
+.BR learning " { " on " | " off " }"
+] [
+.BR flood " { " on " | " off " }"
+] [
+.BR proxy_arp " { " on " | " off " }"
+] [
+.BR proxy_arp_wifi " { " on " | " off " }"
+] [
+.BI mcast_router " MULTICAST_ROUTER"
+] [
+.BR mcast_fast_leave " { " on " | " off "} ]"
+
+.in +8
+.sp
+.BI state " STATE"
+- Set port state.
+.I STATE
+is a number representing the following states:
+.BR 0 " (disabled),"
+.BR 1 " (listening),"
+.BR 2 " (learning),"
+.BR 3 " (forwarding),"
+.BR 4 " (blocking)."
+
+.BI priority " PRIO"
+- set port priority (a 16bit unsigned value).
+
+.BI cost " COST"
+- set port cost (a 32bit unsigned value).
+
+.BR guard " { " on " | " off " }"
+- block incoming BPDU packets on this port.
+
+.BR hairpin " { " on " | " off " }"
+- enable hairpin mode on this port. This will allow incoming packets on this
+port to be reflected back.
+
+.BR fastleave " { " on " | " off " }"
+- enable multicast fast leave on this port.
+
+.BR root_block " { " on " | " off " }"
+- block this port from becoming the bridge's root port.
+
+.BR learning " { " on " | " off " }"
+- allow MAC address learning on this port.
+
+.BR flood " { " on " | " off " }"
+- open the flood gates on this port, i.e. forward all unicast frames to this
+port also. Requires
+.BR proxy_arp " and " proxy_arp_wifi
+to be turned off.
+
+.BR proxy_arp " { " on " | " off " }"
+- enable proxy ARP on this port.
+
+.BR proxy_arp_wifi " { " on " | " off " }"
+- enable proxy ARP on this port which meets extended requirements by IEEE
+802.11 and Hotspot 2.0 specifications.
+
+.BI mcast_router " MULTICAST_ROUTER"
+- configure this port for having multicast routers attached. A port with a
+multicast router will receive all multicast traffic.
+.I MULTICAST_ROUTER
+may be either
+.B 0
+to disable multicast routers on this port,
+.B 1
+to let the system detect the presence of routers (this is the default),
+.B 2
+to permanently enable multicast traffic forwarding on this port or
+.B 3
+to enable multicast routers temporarily on this port, not depending on incoming
+queries.
+
+.BR mcast_fast_leave " { " on " | " off " }"
+- this is a synonym to the
+.B fastleave
+option above.
+
+.in -8
+
+.TP
+Bonding Slave Support
+For a link with master
+.B bond
+the following additional arguments are supported:
+
+.B "ip link set type bond_slave"
+[
+.BI queue_id " ID"
+]
+
+.in +8
+.sp
+.BI queue_id " ID"
+- set the slave's queue ID (a 16bit unsigned value).
+
+.in -8
+
 .SS  ip link show - display device attributes
 
 .TP
-- 
2.8.2



Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, do segmentation even for non IPSKB_FORWARDED skbs

2016-07-09 Thread Florian Westphal
David Miller  wrote:
> From: Shmulik Ladkani 
> Date: Tue, 5 Jul 2016 17:05:41 +0300
> 
> > On Tue, 5 Jul 2016 15:03:27 +0200, f...@strlen.de wrote:
> >> (Or did I misunderstand this setup...?)
> > 
> > tap0 bridged with vxlan0.
> > route to vxlan0's remote peer is via eth0, configured with small mtu.
> 
> Florian, any more comments?

Sorry, I commented now.  But I'd really like to hear what vxlan experts
have to say about this; it seems that in the RFC (7348) universe endpoint
fragmentation is not supposed to happen :-/


Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, do segmentation even for non IPSKB_FORWARDED skbs

2016-07-09 Thread Florian Westphal
Shmulik Ladkani  wrote:

[ CC 
> On Tue, 5 Jul 2016 15:03:27 +0200, f...@strlen.de wrote:
> > > The expected behavior in such a setup would be segmenting the skb first,
> > > and then fragmenting each segment according to dst mtu, and finally
> > > passing the resulting fragments to ip_finish_output2.
> > > 
> > > 'ip_finish_output_gso' already supports this "Slowpath" behavior,
> > > but it is only considered if IPSKB_FORWARDED is set.
> > > 
> > > However in the bridged case, IPSKB_FORWARDED is off, and the "Slowpath"
> > > behavior is not considered.
> > 
> > I placed this test there under the assumption that L2 bridges have
> > the same MTU on all bridge ports, so we'd only need to consider routing
> > case.
> 
> In our setups we have no control of VM mtu (which affects gso_size of
> packets arriving from tap), and no control of vxlan's underlay mtu.

:-(

> > How does it work if e.g. a 1460-sized udp packet arrives on tap0?
> > Do we fragment (possibly ignoring DF?)
> 
> A *non* gso udp packet arriving on tap0 is bridged to vxlan0 (assuming
> vxlan mtu is sufficient), and the original DF of the inner packet is
> preserved.
> 
> The skb gets vxlan-udp encapsulated, with outer IP header having DF=0
> (unless tun_flags & TUNNEL_DONT_FRAGMENT), and then, if skb->len > mtu,
> fragmented normally at the ip_finish_output --> ip_fragment code path.

I see.

> So on the wire we have 2 frags of the vxlan datagram; they are reassembled
> at the recipient ip stack of vxlan termination. Inner packet preserved.
> Not ideal, but works.
> 
> The issue is with GSO skbs arriving from tap, which eventually generate
> segments larger than the mtu, which are not transmitted on eth0:
> 
>   tap0 rx:  super packet, gso_size from user's virtio_net_hdr
> ...
> vxlan0 tx:  encaps the super packet
>   ...
>   ip_finish_output
> ip_finish_output_gso
>   *NO* skb_gso_validate_mtu() <--- problem here

We don't do this for ipv6 either since we're expected to send PKTOOBIG
error.

> ip_finish_output2:  tx the encapsulated super packet on eth0
>   ...
>   validate_xmit_skb
> netif_needs_gso
>   skb_gso_segment: segments inner payload according to
>original gso_size,
>leads to vxlan datagrams larger than mtu
>
> > How does it work for non-ip protocols?
> 
> The specific problem is with vxlan (or any other udp based tunnel)
> encapsulated GSOed packets.

If I understand correctly you have vxlan stacked on top of eth0, and tap
and vxlan in a bridge.

... and eth mtu smaller than bridge mtu.

I think that this is "working" by pure accident, and that the better fix is
to set mtu values correctly so that when the vxlan header is added we don't
exceed what can be handled by the real device (yes, I know you have
no control over this).

I am worried about this patch, skb_gso_validate_mtu is more costly than
the ->flags & FORWARD check; everyone pays this extra cost.

What about setting IPCB FORWARD flag in iptunnel_xmit if
skb->skb_iif != 0... instead?

Yet another possibility would be to reduce gso_size but that violates
gro/gso symmetry...

[ I tried to check rfc but seems rfc7348 simply declares that
  endpoints are not allowed to fragment so problem solved :-/ ]
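
The size arithmetic behind this discussion can be sketched as follows. This is
an illustrative userspace calculation, not the kernel's
skb_gso_network_seglen() or skb_gso_validate_mtu(); the function names are
hypothetical, and the overhead assumes VXLAN over IPv4 with no IP options and
no VLAN tag on the inner frame.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Encapsulation overhead counted against the underlay MTU (assumed values). */
#define OUTER_IPV4_HDR 20u
#define OUTER_UDP_HDR   8u
#define VXLAN_HDR       8u
#define INNER_ETH_HDR  14u
#define ENCAP_OVERHEAD (OUTER_IPV4_HDR + OUTER_UDP_HDR + VXLAN_HDR + INNER_ETH_HDR)

/* Outer IP packet length of one segment after encapsulation. */
static size_t encap_seglen(size_t inner_hdr_len, size_t gso_size)
{
    return ENCAP_OVERHEAD + inner_hdr_len + gso_size;
}

/* True if a segment of this size does not fit the underlay MTU. */
static bool seg_exceeds_mtu(size_t seglen, size_t mtu)
{
    return seglen > mtu;
}

/* Number of IPv4 fragments needed to carry one encapsulated segment. */
static size_t frags_needed(size_t seglen, size_t mtu)
{
    assert(mtu >= OUTER_IPV4_HDR + 8u);
    assert(seglen >= OUTER_IPV4_HDR);
    if (seglen <= mtu)
        return 1;
    size_t payload = seglen - OUTER_IPV4_HDR;
    /* Fragment payloads other than the last must be multiples of 8 bytes. */
    size_t per_frag = (mtu - OUTER_IPV4_HDR) & ~(size_t)7;
    return (payload + per_frag - 1) / per_frag;
}
```

For example, an inner TCP segment with gso_size 1460 and 40 bytes of inner
IPv4 and TCP headers becomes a 1550-byte outer IP packet, which exceeds a
1500-byte underlay MTU and would need two fragments if fragmentation is
allowed at all.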


[PATCH net] ipv4: reject RTNH_F_LINKDOWN for incompatible routes

2016-07-09 Thread Julian Anastasov
Vegard Nossum reported a crash in fib_dump_info (fib_nhs==1)
when nh_dev = NULL. The problem happens when RTNH_F_LINKDOWN is
provided from user space for routes that do not use the flag; it was
caught with a netlink fuzzer.

RTNH_F_LINKDOWN should be used only for link routes, not for
local routes or for routes with error code. Do not complicate
fast path with more checks, reject the flag early when configured
for incompatible routes.

Reported-by: Vegard Nossum 
Fixes: 0eeb075fad73 ("net: ipv4 sysctl option to ignore routes when nexthop link is down")
Tested-by: Vegard Nossum 
Signed-off-by: Julian Anastasov 
Cc: Andy Gospodarek 
Cc: Dinesh Dutt 
Cc: Scott Feldman 
---
 net/ipv4/fib_semantics.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Note: works for all kernels: net, net-next, 4.4.14, 4.5.7, 4.6.3

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index d09173b..b642479 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1113,7 +1113,8 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
}
 
if (fib_props[cfg->fc_type].error) {
-   if (cfg->fc_gw || cfg->fc_oif || cfg->fc_mp)
+   if (cfg->fc_gw || cfg->fc_oif || cfg->fc_mp ||
+   (fi->fib_nh->nh_flags & RTNH_F_LINKDOWN))
goto err_inval;
goto link_it;
} else {
@@ -1136,7 +1137,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
struct fib_nh *nh = fi->fib_nh;
 
/* Local address is added. */
-   if (nhs != 1 || nh->nh_gw)
+   if (nhs != 1 || nh->nh_gw || (nh->nh_flags & RTNH_F_LINKDOWN))
goto err_inval;
nh->nh_scope = RT_SCOPE_NOWHERE;
nh->nh_dev = dev_get_by_index(net, fi->fib_nh->nh_oif);
-- 
1.9.3



DHCP via bridge in case of IPv4

2016-07-09 Thread Alexey Brodkin
Hello,

I was playing with quite a simple bridged setup on different boards with
very recent kernels (4.6.3 as of this writing) and found one interesting
behavior that I cannot yet understand, and googling didn't help here either.

My setup is pretty simple:
 -----------     -------------     -----------------------
| HOST      |   | "Dumb AP"   |   | Wireless client       |
| with DHCP |<->(eth0)  (wlan0)<->| attempting to         |
| server    |   |   \ br0 /   |   | get settings via DHCP |
 -----------     -------------     -----------------------

* HOST is my laptop with DHCP server that works for sure.
* "Dumb AP" is a separate board (I tried ARM-based Wandboard and ARC-based
  AXS10x boards but results are exactly the same) with wired (eth0) and wireless
  (wlan0) network controllers bridged together (br0). That "br0" bridge flawlessly
  gets its settings from DHCP server on host.
* Wireless client could be either a smartphone or another laptop etc. but
  what's important is that it should be configured to get network settings by
  DHCP as well.

So what happens is that "br0" always gets network settings from the DHCP server on HOST.
That's fine. But the wireless client only reliably gets settings from the DHCP server
if IPv6 is enabled on the "Dumb AP" board. If IPv6 is disabled I can see that the
wireless client sends "DHCP Discover", then the server replies with "DHCP Offer", but
that offer never reaches the wireless client.

Well, actually sometimes, very rarely, that offer may reach the wireless client, but
I cannot understand how to reproduce it reliably.

Still, it looks like enabling IPv6 fixes that issue.

So my question here is: why do I see that difference with IPv4 vs IPv6?

One sidenote:
  Somehow I figured out that in the case of IPv4 the so-called routing
  cache is absent (it was removed in Linux kernel 3.6) while with IPv6 it
  still exists. And assuming my hardware is sane and no data gets lost, I may
  think that it's really a routing problem and the missing routing cache might
  be an answer. Still, being a noob in networking stuff, I'd like to get a bit
  better explanation of the things I see.

All thoughts and comments are more than welcome.

Regards,
Alexey


Re: [PATCH net] tcp: make challenge acks less predictable

2016-07-09 Thread Eric Dumazet
On Fri, 2016-07-08 at 17:27 -0700, Yue Cao wrote:
> Hi Eric, 
> 
> 
> Thank you for the email. After rethinking the suggested patch, our
> side-channel attack might still work.
> 
> 
> The main idea behind the patch is to change challenge_count lifetime
> from 1s to a random value in the range [0.5s, 1.5s), which creates a
> time synchronization issue at the attacker's end. 
> 
> 
> In our modified attack, 
> 1. Instead of sending several packets throughout the 1s duration,
> attacker sends fewer packets in a short period (e.g. 0.1s, or even
> shorter). It is likely that this short period will be included in one
> challenge_count lifetime at the server’s end.
> 2. If this short period covers two challenge_counts’ lifetime or some
> rare case that attacker is not sure, attacker can repeat sending same
> packets after a short period (e.g. 1.5s) to confirm it. 
> 3. These packets should include one or more spoofed packets and 1005(a
> value bigger than 1001) packets to exhaust such side channel. 
> 
> 
> In summary, if the attacker receives less than 1000 packets from the
> server, it must be a good guess. If the attacker receives more than
> 1000 packets from the server, this short period covers two
> challenge_counts’ lifetime and the attacker has to repeat sending same
> packets after a short duration. If the attacker receives exactly 1000
> packets from the server, it is most likely a wrong guess. However, the
> attacker would better repeat sending packets to confirm it since these
> 1000 packets may be sent from two continuous challenge_counts’
> lifetime(though it’s a rare case).

OK so all we need is to vary the 1000 value a bit so that attacker can
not predict it, as Linus first did.

I will send a V2, thanks a lot !






Re: [PATCH v6 01/12] bpf: add XDP prog type for early driver filter

2016-07-09 Thread Jesper Dangaard Brouer
On Thu,  7 Jul 2016 19:15:13 -0700
Brenden Blanco  wrote:

> Add a new bpf prog type that is intended to run in early stages of the
> packet rx path. Only minimal packet metadata will be available, hence a
> new context type, struct xdp_md, is exposed to userspace. So far only
> expose the packet start and end pointers, and only in read mode.
> 
> An XDP program must return one of the well known enum values, all other
> return codes are reserved for future use. Unfortunately, this
> restriction is hard to enforce at verification time, so take the
> approach of warning at runtime when such programs are encountered. The
> driver can choose to implement unknown return codes however it wants,
> but must invoke the warning helper with the action value.

I believe we should define stronger semantics for unknown/future
return codes than the ones stated above:
 "driver can choose to implement unknown return codes however it wants"

The mlx4 driver implementation in:
 [PATCH v6 04/12] net/mlx4_en: add support for fast rx drop bpf program

On Thu,  7 Jul 2016 19:15:16 -0700 Brenden Blanco  wrote:

> + /* A bpf program gets first chance to drop the packet. It may
> +  * read bytes but not past the end of the frag.
> +  */
> + if (prog) {
> + struct xdp_buff xdp;
> + dma_addr_t dma;
> + u32 act;
> +
> + dma = be64_to_cpu(rx_desc->data[0].addr);
> + dma_sync_single_for_cpu(priv->ddev, dma,
> + priv->frag_info[0].frag_size,
> + DMA_FROM_DEVICE);
> +
> + xdp.data = page_address(frags[0].page) +
> + frags[0].page_offset;
> + xdp.data_end = xdp.data + length;
> +
> + act = bpf_prog_run_xdp(prog, &xdp);
> + switch (act) {
> + case XDP_PASS:
> + break;
> + default:
> + bpf_warn_invalid_xdp_action(act);
> + case XDP_DROP:
> + goto next;
> + }
> + }

Thus, mlx4's choice is to drop packets for unknown/future return codes.

I think this is the wrong choice.  I think the choice should be
XDP_PASS, to pass the packet up the stack.

I find "XDP_DROP" problematic because it happens so early in the driver
that we lose all possibility of debugging which packets get dropped.  We
get a single kernel log warning, but we cannot inspect the packets any
longer.  By defaulting to XDP_PASS, all the normal stack tools (e.g.
tcpdump) are available.


I can also imagine that defaulting to XDP_PASS can be an important
feature in the future.

In the future we will likely have features where XDP can "offload"
packet delivery from the normal stack (e.g. delivery into a VM).  On a
running production system you can then load your XDP program.  If the
driver was too old and defaulted to XDP_DROP, then you would lose your
service; if it instead defaulted to XDP_PASS, your service would survive,
falling back to normal delivery.

(For the VM delivery use-case, there will likely be a need for having a
fallback delivery method in place when the XDP program is not active,
in order to support VM migration).
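
The alternative proposed in this reply, where unknown return codes warn once
and then fall through to "pass", can be sketched like this. It is a userspace
model, not driver code: `enum ex_xdp_action`, `warn_unknown_action_once()` and
`ex_handle_action()` are hypothetical names, and the warning goes to stderr
instead of the kernel log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the two actions defined in the quoted patch. */
enum ex_xdp_action { EX_XDP_DROP = 0, EX_XDP_PASS = 1 };

static bool warned_once;

/* Emits a single warning, no matter how many unknown codes are seen. */
static void warn_unknown_action_once(unsigned int act)
{
    if (!warned_once) {
        warned_once = true;
        fprintf(stderr, "XDP program returned unknown value %u\n", act);
    }
}

/* Returns true if the packet should continue up the normal stack. */
static bool ex_handle_action(unsigned int act)
{
    switch (act) {
    case EX_XDP_DROP:
        return false;
    default:
        warn_unknown_action_once(act);
        /* fall through: unknown codes are treated as pass */
    case EX_XDP_PASS:
        return true;
    }
}
```

Compared with the quoted mlx4 hunk, only the position of the `default` label
changes: it sits above the pass case instead of the drop case, so packets with
unrecognized verdicts stay visible to tools further up the stack.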



> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c14ca1c..5b47ac3 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
[...]
>  
> +/* User return codes for XDP prog type.
> + * A valid XDP program must return one of these defined values. All other
> + * return codes are reserved for future use. Unknown return codes will result
> + * in driver-dependent behavior.
> + */
> +enum xdp_action {
> + XDP_DROP,
> + XDP_PASS,
> +};
> +
[...]
>  #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index e206c21..a8d67d0 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
[...]
> +void bpf_warn_invalid_xdp_action(int act)
> +{
> + WARN_ONCE(1, "\n"
> +  "*\n"
> +  "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> +  "**   **\n"
> +  "** XDP program returned unknown value %-10u **\n"
> +  "**   **\n"
> +  "** XDP programs must return a well-known return  **\n"
> +  "** value. Invalid return values will result in   **\n"
> +  "** undefined packet actions. **\n"
> +  "**   **\n"
> +  "**   NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE   **\n"
> +  "***

Re: [GIT PULL net-next] rxrpc: Improve conn/call lookup and fix call number generation [ver #3]

2016-07-09 Thread David Howells
David Miller  wrote:

> I'll pull, but this is not how I want you to operate.

Anyway, thanks for pulling it.

David