Re: [PATCH net] vmxnet3: fix tx data ring copy for variable size

2016-08-19 Thread David Miller
From: Shrikrishna Khare 
Date: Fri, 19 Aug 2016 10:33:42 -0700

> 'Commit 3c8b3efc061a ("vmxnet3: allow variable length transmit data ring
> buffer")' changed the size of the buffers in the tx data ring from a
> fixed size of 128 bytes to a variable size.
> 
> However, while copying data to the data ring, vmxnet3_copy_hdr continues
> to carry the old code that assumes fixed buffer size of 128. This patch
> fixes it by adding correct offset based on the actual data ring buffer
> size.
> 
> Signed-off-by: Guolin Yang 
> Signed-off-by: Shrikrishna Khare 

Applied.


Re: [PATCHv2,net-next] samples/bpf: Add tunnel set/get tests.

2016-08-19 Thread David Miller
From: William Tu 
Date: Fri, 19 Aug 2016 11:55:44 -0700

> The patch creates sample code exercising bpf_skb_{set,get}_tunnel_key,
> and bpf_skb_{set,get}_tunnel_opt for GRE, VXLAN, and GENEVE.  A native
> tunnel device is created in a namespace to interact with a lwtunnel
> device out of the namespace, with metadata enabled.  The bpf_skb_set_*
> program is attached to tc egress and bpf_skb_get_* is attached to egress
> qdisc.  A ping between two tunnels is used to verify correctness and
> the result of bpf_skb_get_* printed by bpf_trace_printk.
> 
> Signed-off-by: William Tu 

Applied, thanks.

Next time, propagate the ACKs you received into your commit message
when you post new versions.


Re: [PATCH net-next] hv_netvsc: Implement batching of receive completions

2016-08-19 Thread David Miller
From: Haiyang Zhang 
Date: Fri, 19 Aug 2016 14:47:09 -0700

> From: Haiyang Zhang 
> 
> The existing code uses busy retry when unable to send out receive
> completions due to a full ring buffer. It also gives up retrying after the
> limit is reached, which causes receive buffer slots not to be recycled.
> This patch implements batching of receive completions. It also prevents
> dropping receive completions due to full ring buffer.
> 
> Signed-off-by: Haiyang Zhang 
> Reviewed-by: Stephen Hemminger 

Applied, thanks.


Re: [net] ixgbe: Do not clear RAR entry when clearing VMDq for SAN MAC

2016-08-19 Thread David Miller
From: Jeff Kirsher 
Date: Fri, 19 Aug 2016 20:58:26 -0700

> From: Alexander Duyck 
> 
> The RAR entry for the SAN MAC address was being cleared when we were
> clearing the VMDq pool bits.  In order to prevent this we need to add
> an extra check to protect the SAN MAC from being cleared.
> 
> Fixes: 6e982aeae ("ixgbe: Clear stale pool mappings")
> Signed-off-by: Alexander Duyck 
> Tested-by: Andrew Bowers 
> Signed-off-by: Jeff Kirsher 

Applied, thanks.


Re: [RFC PATCH] openvswitch: use percpu flow stats

2016-08-19 Thread David Miller
From: Eric Dumazet 
Date: Fri, 19 Aug 2016 21:07:52 -0700

> Here is an example of a system I had in the past.
> 
> 8 cpus  (num_possible_cpus() returns 8)
> 
> Cpus were numbered : 0, 1, 2, 3, 8, 9, 10, 11  : (nr_cpu_ids = 12)
> 
> I am pretty sure they are still alive.
> 
> Since you want an array indexed by cpu numbers, you need to size it by
> nr_cpu_ids, otherwise array[11] will crash / access garbage.
> 
> This is why you find many nr_cpu_ids in the linux tree, not
> num_possible_cpus() to size arrays.

My bad, I misunderstood what nr_cpu_ids means.  You are absolutely right.


Re: [RFC PATCH] openvswitch: use percpu flow stats

2016-08-19 Thread Eric Dumazet
On Fri, 2016-08-19 at 18:09 -0700, David Miller wrote:
> From: Eric Dumazet 
> Date: Fri, 19 Aug 2016 12:56:56 -0700
> 
> > On Fri, 2016-08-19 at 16:47 -0300, Thadeu Lima de Souza Cascardo wrote:
> >> Instead of using flow stats per NUMA node, use it per CPU. When using
> >> megaflows, the stats lock can be a bottleneck in scalability.
> > 
> > ...
> > 
> >>  
> >>flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
> >> - + (nr_node_ids
> >> + + (num_possible_cpus()
> >>  * sizeof(struct flow_stats *)),
> >>   0, 0, NULL);
> >>if (flow_cache == NULL)
> > 
> > This is bogus.
> > 
> > Please use nr_cpu_ids instead of num_possible_cpus().
> 
> Then how would hot-plugged cpus be handled properly?
> 
> The only other way is you'd have to free all objects in the current
> flow_cache, destroy it, then make a new kmem_cache and reallocate
> all of the existing objects and hook them back up every time a cpu
> up event occurs.
> 
> That doesn't make any sense at all.
> 
> Therefore, the size of the objects coming out of "sw_flow" has
> to accommodate all possible cpu indexes.

Here is an example of a system I had in the past.

8 cpus  (num_possible_cpus() returns 8)

Cpus were numbered : 0, 1, 2, 3, 8, 9, 10, 11  : (nr_cpu_ids = 12)

I am pretty sure they are still alive.

Since you want an array indexed by cpu numbers, you need to size it by
nr_cpu_ids, otherwise array[11] will crash / access garbage.

This is why you find many nr_cpu_ids in the linux tree, not
num_possible_cpus() to size arrays.

# git grep -n nr_cpu_ids -- net
net/bridge/netfilter/ebtables.c:900:vmalloc(nr_cpu_ids * sizeof(*(newinfo->chainstack)));
net/bridge/netfilter/ebtables.c:1132:   countersize = COUNTER_OFFSET(tmp.nentries) * nr_cpu_ids;
net/bridge/netfilter/ebtables.c:1185:   countersize = COUNTER_OFFSET(repl->nentries) * nr_cpu_ids;
net/bridge/netfilter/ebtables.c:2200:   countersize = COUNTER_OFFSET(tmp.nentries) * nr_cpu_ids;
net/core/dev.c:3483:if (next_cpu < nr_cpu_ids) {
net/core/dev.c:3588: *   - Current CPU is unset (>= nr_cpu_ids).
net/core/dev.c:3596:(tcpu >= nr_cpu_ids || !cpu_online(tcpu) ||
net/core/dev.c:3603:if (tcpu < nr_cpu_ids && cpu_online(tcpu)) {
net/core/dev.c:3651:if (rflow->filter == filter_id && cpu < nr_cpu_ids &&
net/core/neighbour.c:2737:  for (cpu = *pos-1; cpu < nr_cpu_ids; ++cpu) {
net/core/neighbour.c:2751:  for (cpu = *pos; cpu < nr_cpu_ids; ++cpu) {
net/core/net-procfs.c:122:  while (*pos < nr_cpu_ids)
net/core/sysctl_net_core.c:70:  rps_cpu_mask = roundup_pow_of_two(nr_cpu_ids) - 1;
net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c:360:  for (cpu = *pos-1; cpu < nr_cpu_ids; ++cpu) {
net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c:375:  for (cpu = *pos; cpu < nr_cpu_ids; ++cpu) {
net/ipv4/route.c:254:   for (cpu = *pos-1; cpu < nr_cpu_ids; ++cpu) {
net/ipv4/route.c:267:   for (cpu = *pos; cpu < nr_cpu_ids; ++cpu) {
net/ipv6/icmp.c:918:kzalloc(nr_cpu_ids * sizeof(struct sock *), GFP_KERNEL);
net/netfilter/nf_conntrack_netlink.c:1265:  for (cpu = cb->args[0]; cpu < nr_cpu_ids; cpu++) {
net/netfilter/nf_conntrack_netlink.c:2003:  if (cb->args[0] == nr_cpu_ids)
net/netfilter/nf_conntrack_netlink.c:2006:  for (cpu = cb->args[0]; cpu < nr_cpu_ids; cpu++) {
net/netfilter/nf_conntrack_netlink.c:3211:  if (cb->args[0] == nr_cpu_ids)
net/netfilter/nf_conntrack_netlink.c:3214:  for (cpu = cb->args[0]; cpu < nr_cpu_ids; cpu++) {
net/netfilter/nf_conntrack_standalone.c:309:for (cpu = *pos-1; cpu < nr_cpu_ids; ++cpu) {
net/netfilter/nf_conntrack_standalone.c:324:for (cpu = *pos; cpu < nr_cpu_ids; ++cpu) {
net/netfilter/nf_synproxy_core.c:255:   for (cpu = *pos - 1; cpu < nr_cpu_ids; cpu++) {
net/netfilter/nf_synproxy_core.c:270:   for (cpu = *pos; cpu < nr_cpu_ids; cpu++) {
net/netfilter/x_tables.c:1064:  size = sizeof(void **) * nr_cpu_ids;
net/sunrpc/svc.c:167:   unsigned int maxpools = nr_cpu_ids;





[net] ixgbe: Do not clear RAR entry when clearing VMDq for SAN MAC

2016-08-19 Thread Jeff Kirsher
From: Alexander Duyck 

The RAR entry for the SAN MAC address was being cleared when we were
clearing the VMDq pool bits.  In order to prevent this we need to add
an extra check to protect the SAN MAC from being cleared.

Fixes: 6e982aeae ("ixgbe: Clear stale pool mappings")
Signed-off-by: Alexander Duyck 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_common.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
index b4217f3..c47b605 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_common.c
@@ -2958,8 +2958,10 @@ s32 ixgbe_clear_vmdq_generic(struct ixgbe_hw *hw, u32 rar, u32 vmdq)
}
 
/* was that the last pool using this rar? */
-   if (mpsar_lo == 0 && mpsar_hi == 0 && rar != 0)
+   if (mpsar_lo == 0 && mpsar_hi == 0 &&
+   rar != 0 && rar != hw->mac.san_mac_rar_index)
hw->mac.ops.clear_rar(hw, rar);
+
return 0;
 }
 
-- 
2.7.4



[PATCH iproute2] ss: fix build with musl libc

2016-08-19 Thread Gustavo Zacarias
UINT_MAX usage requires limits.h, so include it.

Signed-off-by: Gustavo Zacarias 
---
 misc/ss.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/misc/ss.c b/misc/ss.c
index e758f57..3b268d9 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <limits.h>
 
 #include "utils.h"
 #include "rt_names.h"
-- 
2.7.3



Re: [PATCH 4/5] driver core: set up ownership of class devices in sysfs

2016-08-19 Thread Stephen Hemminger
On Tue, 16 Aug 2016 15:33:14 -0700
Dmitry Torokhov  wrote:

>  static struct kobj_type device_ktype = {
>   .release= device_release,
>   .sysfs_ops  = &dev_sysfs_ops,
>   .namespace  = device_namespace,
> + .get_ownership  = device_get_ownership,
>  };
>  

OT - I wonder if kobj_type could be ro_after_init?


Re: [RFC PATCH] openvswitch: use percpu flow stats

2016-08-19 Thread David Miller
From: Eric Dumazet 
Date: Fri, 19 Aug 2016 12:56:56 -0700

> On Fri, 2016-08-19 at 16:47 -0300, Thadeu Lima de Souza Cascardo wrote:
>> Instead of using flow stats per NUMA node, use it per CPU. When using
>> megaflows, the stats lock can be a bottleneck in scalability.
> 
> ...
> 
>>  
>>  flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
>> -   + (nr_node_ids
>> +   + (num_possible_cpus()
>>* sizeof(struct flow_stats *)),
>> 0, 0, NULL);
>>  if (flow_cache == NULL)
> 
> This is bogus.
> 
> Please use nr_cpu_ids instead of num_possible_cpus().

Then how would hot-plugged cpus be handled properly?

The only other way is you'd have to free all objects in the current
flow_cache, destroy it, then make a new kmem_cache and reallocate
all of the existing objects and hook them back up every time a cpu
up event occurs.

That doesn't make any sense at all.

Therefore, the size of the objects coming out of "sw_flow" has
to accommodate all possible cpu indexes.


Re: [PATCH] net/irda: remove pointless assignment/check

2016-08-19 Thread David Miller
From: Vegard Nossum 
Date: Fri, 19 Aug 2016 18:08:57 +0200

> We've already set sk to sock->sk and dereferenced it, so if it's NULL
> we would have crashed already. Moreover, if it was NULL we would have
> crashed anyway when jumping to 'out' and trying to unlock the sock.
> Furthermore, if we had assigned a different value to 'sk' we would
> have been calling lock_sock() and release_sock() on different sockets.
> 
> My conclusion is that these two lines are complete nonsense and only
> serve to confuse the reader.
> 
> Signed-off-by: Vegard Nossum 

Applied.


Re: [patch net] mlxsw: spectrum_buffers: Fix pool value handling in mlxsw_sp_sb_tc_pool_bind_set

2016-08-19 Thread David Miller
From: Jiri Pirko 
Date: Fri, 19 Aug 2016 14:43:48 +0200

> From: Jiri Pirko 
> 
> Pool index has to be converted by get_pool helper to work correctly for
> egress pool. In mlxsw the egress pool index starts from 0.
> 
> Fixes: 0f433fa0ecc ("mlxsw: spectrum_buffers: Implement shared buffer configuration")
> Signed-off-by: Jiri Pirko 

Applied.


Re: [PATCH net-next] qed: utilize FW 8.10.10.0

2016-08-19 Thread David Miller

Applied.


Re: [PATCH 1/1] l2tp: Fix the connect status check in pppol2tp_getname

2016-08-19 Thread David Miller
From: f...@ikuai8.com
Date: Fri, 19 Aug 2016 13:36:23 +0800

> From: Gao Feng 
> 
> The sk->sk_state is a bit flag, so we need to use a bitwise check
> instead of a value check.
> 
> Signed-off-by: Gao Feng 

Applied.


Re: [net-next 0/4][pull request] 10GbE Intel Wired LAN Driver Updates 2016-08-18

2016-08-19 Thread David Miller
From: Jeff Kirsher 
Date: Thu, 18 Aug 2016 23:03:32 -0700

> This series contains updates to ixgbe and ixgbevf.

Pulled, thanks Jeff.


Re: [PATCH net v2] sctp: linearize early if it's not GSO

2016-08-19 Thread David Miller
From: Marcelo Ricardo Leitner 
Date: Thu, 18 Aug 2016 14:58:35 -0300

> Because otherwise when crc computation is still needed it's way more
> expensive than on a linear buffer to the point that it affects
> performance.
> 
> It's so expensive that netperf test gives a perf output as below:
 ...
> Fixes: 3acb50c18d8d ("sctp: delay as much as possible skb_linearize")
> Signed-off-by: Marcelo Ricardo Leitner 

Applied and queued up for -stable.


Re: [PATCH net-next 0/5] net: dsa: bcm_sf2: Platform conversion

2016-08-19 Thread David Miller
From: Florian Fainelli 
Date: Thu, 18 Aug 2016 15:30:11 -0700

> This patch series converts the bcm_sf2 driver from a traditional DSA driver
> into a platform_device driver and makes it use the new DSA binding that Andrew
> introduced in the latest merge window.
> 
> Prior attempts used to coerce the code in net/dsa/dsa2.c to accept the old
> binding, while really there is only one broken single user out there: bcm_sf2,
> so instead, just assume the new DT binding is deployed and use it accordingly.

Series applied, thanks Florian.


Re: [PATCH net-next 0/3] Fix mv88e6xxx wait function

2016-08-19 Thread David Miller
From: Andrew Lunn 
Date: Fri, 19 Aug 2016 00:01:54 +0200

> The mv88e6xxx wait function can be upset if the system has lots of
> other things to do and a sleep takes a lot longer than expected. Fix
> this by using a fixed number of iterations, rather than a fixed
> wallclock time.
> 
> With that change made, it is possible to consolidate another
> wait function.
> 
> A wait actually timing out should not happen and when it does, it
> means something serious is wrong. Make sure an error is logged,
> since not all callers will log an error.

Series applied, thanks.


Re: [PATCH net-next 0/2] PHY Kconfig and Makefile cleanup

2016-08-19 Thread David Miller
From: Andrew Lunn 
Date: Thu, 18 Aug 2016 23:56:04 +0200

> The Ethernet PHY directory has slowly been getting more entries.
> Split the entries in the Makefile and Kconfig into MDIO bus drivers
> and PHYs. Within these two groups, sort them. This should reduce merge
> conflicts and aid finding what one searches for.
> 
> The Kconfig text contains redundant "Driver for" and "Support for"
> which add little value, make the vendor less obvious, and defeat the
> shortcut key in the menu. Remove such text.

Series applied.


Re: [PATCH net] tcp: md5: remove tcp_md5_hash_header()

2016-08-19 Thread David Miller
From: Eric Dumazet 
Date: Thu, 18 Aug 2016 09:49:55 -0700

> From: Eric Dumazet 
> 
> After commit 19689e38eca5 ("tcp: md5: use kmalloc() backed scratch
> areas") this function is no longer used.
> 
> Signed-off-by: Eric Dumazet 

Applied.


Re: [PATCH net] net: ipv4: fix sparse error in fib_good_nh()

2016-08-19 Thread David Miller
From: Eric Dumazet 
Date: Thu, 18 Aug 2016 10:19:34 -0700

> From: Eric Dumazet 
> 
> Fixes following sparse errors :
> 
> net/ipv4/fib_semantics.c:1579:61: warning: incorrect type in argument 2
> (different base types)
> net/ipv4/fib_semantics.c:1579:61:expected unsigned int [unsigned]
> [usertype] key
> net/ipv4/fib_semantics.c:1579:61:got restricted __be32 const
> [usertype] nh_gw
> 
> 
> Fixes: a6db4494d218c ("net: ipv4: Consider failed nexthops in multipath routes")
> Signed-off-by: Eric Dumazet 

Applied.


Re: [PATCH net] udp: include addrconf.h

2016-08-19 Thread David Miller
From: Eric Dumazet 
Date: Thu, 18 Aug 2016 09:59:12 -0700

> From: Eric Dumazet 
> 
> Include ipv4_rcv_saddr_equal() definition to avoid this sparse error :
> 
> net/ipv4/udp.c:362:5: warning: symbol 'ipv4_rcv_saddr_equal' was not
> declared. Should it be static?
> 
> Signed-off-by: Eric Dumazet 

Applied.


Re: [PATCH net] sit: ipip6_valid_ip_proto() is static

2016-08-19 Thread David Miller
From: Eric Dumazet 
Date: Thu, 18 Aug 2016 10:03:55 -0700

> From: Eric Dumazet 
> 
> Fixes this sparse error :
> net/ipv6/sit.c:1129:6: warning: symbol 'ipip6_valid_ip_proto' was not
> declared. Should it be static?
> 
> 
> Fixes: 49dbe7ae2168b ("sit: support MPLS over IPv4")
> Signed-off-by: Eric Dumazet 

This was done in net-next already.


Re: [PATCH] ibmvnic: Handle backing device failover and reinitialization

2016-08-19 Thread David Miller
From: Thomas Falcon 
Date: Thu, 18 Aug 2016 11:37:51 -0500

> An upcoming feature of IBM VNIC protocol is the ability to configure
> redundant backing devices for a VNIC client. In case of a failure
> on the current backing device, the driver will receive a signal
> from the hypervisor indicating that a failover will occur. The driver
> will then wait for a message from the backing device before 
> establishing a new connection.
> 
> Signed-off-by: Thomas Falcon 

Applied.


Re: [PATCH 0/5] Make /sys/class/net per net namespace objects belong to container

2016-08-19 Thread David Miller
From: Dmitry Torokhov 
Date: Tue, 16 Aug 2016 15:33:10 -0700

> There are objects in /sys hierarchy (/sys/class/net/) that logically belong
> to a namespace/container. Unfortunately all sysfs objects start their life
> belonging to global root, and while we could change ownership manually,
> keeping tracks of all objects that come and go is cumbersome. It would
> be better if kernel created them using correct uid/gid from the beginning.
> 
> This series changes kernfs to allow creating object's with arbitrary
> uid/gid, adds get_ownership() callback to ktype structure so subsystems
> could supply their own logic (likely tied to namespace support) for
> determining ownership of kobjects, and adjusts sysfs code to make use of
> this information. Lastly net-sysfs is adjusted to make sure that objects in
> net namespace are owned by the root user from the owning user namespace.
> 
> Note that we do not adjust ownership of objects moved into a new namespace
> (as when moving a network device into a container) as userspace can easily
> do it.

I need some domain experts to review this series please.

Thank you.


Re: [PATCH net-next] net: hns: Add reset function support for RoCE driver

2016-08-19 Thread David Miller
From: Lijun Ou 
Date: Thu, 18 Aug 2016 20:32:52 +0800

> This patch adds a reset function for the RoCE driver. RoCE is a feature
> of hns. In the hip06 SoC, the RoCE reset process needs to configure the
> dsaf channel reset, port and sl map info. The reset function of RoCE is
> located in the dsaf module; we only call it in the RoCE driver when needed.
> 
> This patch is used to fix the conflict, please refer to this link:
>   https://www.spinics.net/lists/linux-rdma/msg39114.html
> 
> Signed-off-by: Wei Hu 
> Signed-off-by: Nenglong Zhao 
> Signed-off-by: Lijun Ou 
> Signed-off-by: Sheng Li 
> Reviewed-by: Yisen Zhuang 

Applied, thanks.


Re: [PATCH net 00/10] Mellanox 100G mlx5 fixes 2016-08-16

2016-08-19 Thread David Miller
From: Saeed Mahameed 
Date: Thu, 18 Aug 2016 21:09:01 +0300

> This series includes some bug fixes for mlx5e driver.

Series applied.

> For -stable of 4.6.y and 4.7.y:
> net/mlx5e: Use correct flow dissector key on flower offloading
> net/mlx5: Fix pci error recovery flow
> net/mlx5: Added missing check of msg length in verifying its signature

Queued up, thanks.


Re: [PATCH v1 1/2] pptp: Use macro and sizeof instead of literal number

2016-08-19 Thread Philp Prindeville

Inline...


On 08/17/2016 10:47 PM, f...@48lvckh6395k16k5.yundunddos.com wrote:

From: Gao Feng

Use existing macros like PPP_ADDRESS, SC_COMP_PROT and sizeof of fixed
variables instead of the original literal numbers to enhance readability.

BTW, the original pptp_rcv uses the literal number "12" as the param
of pskb_may_pull. Actually "12" is less than the size of struct
pptp_gre_header. Now using sizeof(*header) fixes this issue.

Signed-off-by: Gao Feng
---
  v1: Initial patch

  drivers/net/ppp/pptp.c | 23 ---
  1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ppp/pptp.c b/drivers/net/ppp/pptp.c
index 1951b10..48c3701 100644
--- a/drivers/net/ppp/pptp.c
+++ b/drivers/net/ppp/pptp.c
@@ -54,6 +54,8 @@ static struct proto pptp_sk_proto __read_mostly;
  static const struct ppp_channel_ops pptp_chan_ops;
  static const struct proto_ops pptp_ops;
  
+static const u8 fixed_ppphdr[2] = {PPP_ALLSTATIONS, PPP_UI};

+


The PPP header is actually 4 bytes; you're omitting 2 bytes of
protocol.  See PPP_HDRLEN.  Agree with Miller that we don't need this
mock header here.




  static struct pppox_sock *lookup_chan(u16 call_id, __be32 s_addr)
  {
struct pppox_sock *sock;
@@ -167,7 +169,7 @@ static int pptp_xmit(struct ppp_channel *chan, struct 
sk_buff *skb)
  
  	tdev = rt->dst.dev;
  
-	max_headroom = LL_RESERVED_SPACE(tdev) + sizeof(*iph) + sizeof(*hdr) + 2;
+   max_headroom = LL_RESERVED_SPACE(tdev) + sizeof(*iph) + sizeof(*hdr) + sizeof(fixed_ppphdr);


Where are the 2 bytes of PPP protocol field being accounted for?

  
  	if (skb_headroom(skb) < max_headroom || skb_cloned(skb) || skb_shared(skb)) {

		struct sk_buff *new_skb = skb_realloc_headroom(skb, max_headroom);
@@ -190,9 +192,9 @@ static int pptp_xmit(struct ppp_channel *chan, struct sk_buff *skb)
  
  	/* Put in the address/control bytes if necessary */

if ((opt->ppp_flags & SC_COMP_AC) == 0 || islcp) {
-   data = skb_push(skb, 2);
-   data[0] = PPP_ALLSTATIONS;
-   data[1] = PPP_UI;
+   data = skb_push(skb, sizeof(fixed_ppphdr));
+   data[0] = fixed_ppphdr[0];
+   data[1] = fixed_ppphdr[1];


I'd use:

PPP_ADDRESS(data) = PPP_ALLSTATIONS;
PPP_CONTROL(data) = PPP_UI;

here.


}
  
  	len = skb->len;

@@ -219,8 +221,7 @@ static int pptp_xmit(struct ppp_channel *chan, struct sk_buff *skb)
}
hdr->payload_len = htons(len);
  
-	/*	Push down and install the IP header. */

-
+   /* Push down and install the IP header. */
skb_reset_transport_header(skb);
skb_push(skb, sizeof(*iph));
skb_reset_network_header(skb);
@@ -319,14 +320,14 @@ static int pptp_rcv_core(struct sock *sk, struct sk_buff *skb)
  allow_packet:
skb_pull(skb, headersize);
  
-		if (payload[0] == PPP_ALLSTATIONS && payload[1] == PPP_UI) {

+   if (PPP_ADDRESS(payload) == PPP_ALLSTATIONS && PPP_CONTROL(payload) == PPP_UI) {
/* chop off address/control */
if (skb->len < 3)
goto drop;
-   skb_pull(skb, 2);
+   skb_pull(skb, sizeof(fixed_ppphdr));
}
  
-		if ((*skb->data) & 1) {

+   if ((*skb->data) & SC_COMP_PROT) {


Before the skb_pull(skb, 2) happens above, I would extract the protocol 
via PPP_CONTROL(data).



/* protocol is compressed */
skb_push(skb, 1)[0] = 0;
}
@@ -351,7 +352,7 @@ static int pptp_rcv(struct sk_buff *skb)
if (skb->pkt_type != PACKET_HOST)
goto drop;
  
-	if (!pskb_may_pull(skb, 12))

+   if (!pskb_may_pull(skb, sizeof(*header)))
goto drop;
  
  	iph = ip_hdr(skb);

@@ -468,7 +469,7 @@ static int pptp_connect(struct socket *sock, struct sockaddr *uservaddr,
ip_rt_put(rt);
po->chan.mtu -= PPTP_HEADER_OVERHEAD;
  
-	po->chan.hdrlen = 2 + sizeof(struct pptp_gre_header);

+   po->chan.hdrlen = sizeof(fixed_ppphdr) + sizeof(struct pptp_gre_header);
	error = ppp_register_channel(&po->chan);
if (error) {
pr_err("PPTP: failed to register PPP channel (%d)\n", error);




[PATCH 1/2] net: ethernet: renesas: ravb: use phydev from struct net_device

2016-08-19 Thread Philippe Reynes
The private structure contain a pointer to phydev, but the structure
net_device already contain such pointer. So we can remove the pointer
phy_dev in the private structure, and update the driver to use the
one contained in struct net_device.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/renesas/ravb.h  |1 -
 drivers/net/ethernet/renesas/ravb_main.c |   29 -
 2 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/renesas/ravb.h b/drivers/net/ethernet/renesas/ravb.h
index 4e5d5e9..f110966 100644
--- a/drivers/net/ethernet/renesas/ravb.h
+++ b/drivers/net/ethernet/renesas/ravb.h
@@ -1011,7 +1011,6 @@ struct ravb_private {
struct work_struct work;
/* MII transceiver section. */
struct mii_bus *mii_bus;/* MDIO bus control */
-   struct phy_device *phydev;  /* PHY device control */
int link;
phy_interface_t phy_interface;
int msg_enable;
diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index d4809ad..d21c9a4 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -942,7 +942,7 @@ out:
 static void ravb_adjust_link(struct net_device *ndev)
 {
struct ravb_private *priv = netdev_priv(ndev);
-   struct phy_device *phydev = priv->phydev;
+   struct phy_device *phydev = ndev->phydev;
bool new_state = false;
 
if (phydev->link) {
@@ -1032,22 +1032,19 @@ static int ravb_phy_init(struct net_device *ndev)
 
phy_attached_info(phydev);
 
-   priv->phydev = phydev;
-
return 0;
 }
 
 /* PHY control start function */
 static int ravb_phy_start(struct net_device *ndev)
 {
-   struct ravb_private *priv = netdev_priv(ndev);
int error;
 
error = ravb_phy_init(ndev);
if (error)
return error;
 
-   phy_start(priv->phydev);
+   phy_start(ndev->phydev);
 
return 0;
 }
@@ -1058,9 +1055,9 @@ static int ravb_get_settings(struct net_device *ndev, struct ethtool_cmd *ecmd)
int error = -ENODEV;
unsigned long flags;
 
-   if (priv->phydev) {
+   if (ndev->phydev) {
		spin_lock_irqsave(&priv->lock, flags);
-   error = phy_ethtool_gset(priv->phydev, ecmd);
+   error = phy_ethtool_gset(ndev->phydev, ecmd);
		spin_unlock_irqrestore(&priv->lock, flags);
}
 
@@ -1073,7 +1070,7 @@ static int ravb_set_settings(struct net_device *ndev, struct ethtool_cmd *ecmd)
unsigned long flags;
int error;
 
-   if (!priv->phydev)
+   if (!ndev->phydev)
return -ENODEV;
 
	spin_lock_irqsave(&priv->lock, flags);
@@ -1081,7 +1078,7 @@ static int ravb_set_settings(struct net_device *ndev, struct ethtool_cmd *ecmd)
/* Disable TX and RX */
ravb_rcv_snd_disable(ndev);
 
-   error = phy_ethtool_sset(priv->phydev, ecmd);
+   error = phy_ethtool_sset(ndev->phydev, ecmd);
if (error)
goto error_exit;
 
@@ -1110,9 +1107,9 @@ static int ravb_nway_reset(struct net_device *ndev)
int error = -ENODEV;
unsigned long flags;
 
-   if (priv->phydev) {
+   if (ndev->phydev) {
		spin_lock_irqsave(&priv->lock, flags);
-   error = phy_start_aneg(priv->phydev);
+   error = phy_start_aneg(ndev->phydev);
		spin_unlock_irqrestore(&priv->lock, flags);
}
 
@@ -1661,10 +1658,9 @@ static int ravb_close(struct net_device *ndev)
}
 
/* PHY disconnect */
-   if (priv->phydev) {
-   phy_stop(priv->phydev);
-   phy_disconnect(priv->phydev);
-   priv->phydev = NULL;
+   if (ndev->phydev) {
+   phy_stop(ndev->phydev);
+   phy_disconnect(ndev->phydev);
}
 
if (priv->chip_id != RCAR_GEN2) {
@@ -1753,8 +1749,7 @@ static int ravb_hwtstamp_set(struct net_device *ndev, struct ifreq *req)
 /* ioctl to device function */
 static int ravb_do_ioctl(struct net_device *ndev, struct ifreq *req, int cmd)
 {
-   struct ravb_private *priv = netdev_priv(ndev);
-   struct phy_device *phydev = priv->phydev;
+   struct phy_device *phydev = ndev->phydev;
 
if (!netif_running(ndev))
return -EINVAL;
-- 
1.7.4.4



[PATCH 2/2] net: ethernet: renesas: ravb: use new api ethtool_{get|set}_link_ksettings

2016-08-19 Thread Philippe Reynes
The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/renesas/ravb_main.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index d21c9a4..cad23ba 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -1049,7 +1049,8 @@ static int ravb_phy_start(struct net_device *ndev)
return 0;
 }
 
-static int ravb_get_settings(struct net_device *ndev, struct ethtool_cmd *ecmd)
+static int ravb_get_link_ksettings(struct net_device *ndev,
+  struct ethtool_link_ksettings *cmd)
 {
struct ravb_private *priv = netdev_priv(ndev);
int error = -ENODEV;
@@ -1057,14 +1058,15 @@ static int ravb_get_settings(struct net_device *ndev, struct ethtool_cmd *ecmd)
 
if (ndev->phydev) {
		spin_lock_irqsave(&priv->lock, flags);
-   error = phy_ethtool_gset(ndev->phydev, ecmd);
+   error = phy_ethtool_ksettings_get(ndev->phydev, cmd);
		spin_unlock_irqrestore(&priv->lock, flags);
}
 
return error;
 }
 
-static int ravb_set_settings(struct net_device *ndev, struct ethtool_cmd *ecmd)
+static int ravb_set_link_ksettings(struct net_device *ndev,
+  const struct ethtool_link_ksettings *cmd)
 {
struct ravb_private *priv = netdev_priv(ndev);
unsigned long flags;
@@ -1078,11 +1080,11 @@ static int ravb_set_settings(struct net_device *ndev, struct ethtool_cmd *ecmd)
/* Disable TX and RX */
ravb_rcv_snd_disable(ndev);
 
-   error = phy_ethtool_sset(ndev->phydev, ecmd);
+   error = phy_ethtool_ksettings_set(ndev->phydev, cmd);
if (error)
goto error_exit;
 
-   if (ecmd->duplex == DUPLEX_FULL)
+   if (cmd->base.duplex == DUPLEX_FULL)
priv->duplex = 1;
else
priv->duplex = 0;
@@ -1306,8 +1308,6 @@ static int ravb_get_ts_info(struct net_device *ndev,
 }
 
 static const struct ethtool_ops ravb_ethtool_ops = {
-   .get_settings   = ravb_get_settings,
-   .set_settings   = ravb_set_settings,
.nway_reset = ravb_nway_reset,
.get_msglevel   = ravb_get_msglevel,
.set_msglevel   = ravb_set_msglevel,
@@ -1318,6 +1318,8 @@ static const struct ethtool_ops ravb_ethtool_ops = {
.get_ringparam  = ravb_get_ringparam,
.set_ringparam  = ravb_set_ringparam,
.get_ts_info= ravb_get_ts_info,
+   .get_link_ksettings = ravb_get_link_ksettings,
+   .set_link_ksettings = ravb_set_link_ksettings,
 };
 
 static inline int ravb_hook_irq(unsigned int irq, irq_handler_t handler,
-- 
1.7.4.4



Re: [PATCH 1/1] ppp: Fix one deadlock issue of PPP when send frame

2016-08-19 Thread Guillaume Nault
On Thu, Aug 18, 2016 at 08:11:15AM -0600, Philp Prindeville wrote:
> 
> It seems unlikely to me that this sort of locking problem hasn't existed
> elsewhere before and that an entirely new mechanism for handling it is
> needed...  How are similar re-entrancy issues handled elsewhere?
>
Virtual devices like vxlan and geneve drop packets if the output device
(after encapsulation) is the same as the encapsulating device.
It's possible to bypass this check by stacking devices. In this case,
recursion occurs until __dev_queue_xmit() triggers the
XMIT_RECURSION_LIMIT. If the qdisc isn't noqueue so that recursion
isn't stopped by __dev_queue_xmit(), then it's possible to fill the
skb's headroom and the packet gets dropped after pskb_expand_head() fails.

In any case only the faulty path is affected and the system remains
stable.

It'd be nice to have PPP working this way, but this is made really
difficult by the number of locks used in the xmit path.
Even checking if the output interface is also the encapsulating one
isn't trivial. The L2TP layer doesn't access the upper interface. It
may not even be a PPP device in case of L2TPv3.


Re: [Feature Request] IP_BIND_ADDRESS_NO_PORT with port

2016-08-19 Thread Eric Dumazet
On Fri, 2016-08-19 at 23:41 +0300, Cristian Morales Vega wrote:
> I would like a socket option that lets me share a source port as long
> as the 4-tuples are unique in a UDP socket *and* specify which source
> port that is.
> It would be similar to IP_BIND_ADDRESS_NO_PORT. It would do exactly
> the same when you bind to port 0, but when you bind to a different
> port it would say "don't do anything with this port yet, but when I
> connect() please use the port as source port instead of a random
> ephemeral port".
> 
> The reason is the following. I would like to have a server receiving
> and sending UDP datagrams from multiple clients using a single port. I
> would like to use multiple sockets for sending, so I can use different
> SO_MAX_PACING_RATEs for each one. One socket per client/session.
> So there is a single "main socket" bind to the relevant port
> (equivalent to the one that accept()s in TCP), and when it receives a
> new session request it creates a new socket specific for that session
> (as accept() does). When creating that new socket I first need to
> bind() it to select the port, and then connect() it to make it receive
> only datagrams from the relevant session.
> The problem is that between the bind() and the connect() there is a
> small time in which the new socket can receive data that should go to
> the "main socket". It's not an unsolvable problem but makes the server
> code more complicated than it needs to be.

Yes, this is something I had on my plate for a while. This is indeed
something we can solve easily.




Re: [PATCH] gianfar: prevent fragmentation in DSA environments

2016-08-19 Thread Andrew Lunn
> Nice improvement.

Thanks

> But what's so special about 8?

When talking to Marvell Switch chips, there are two different headers
which can be added. The DSA header is 4 bytes, and the EDSA header is
8 bytes. However, if there is already a VLAN header, when using EDSA,
it will replace the VLAN header, so only 4 additional bytes are needed.
There are some very old Marvell switches which add a 4 byte
trailer. If you are talking to a Broadcom switch chip, it also has a 4
byte header.

> I thought only 4 bytes were missing :)

So the requirement is probably not currently for the newer Marvell
switch chips? But we should be looking forward and expect at some
point somebody wants to use the newer chips. I've got a Freescale
Vybrid board using the modern Marvell chips, but the FEC driver does
not have such a hard limit as this driver.

However, it does not seem as simple as that. A standard Ethernet frame
should have a maximum size of 1522 when including a VLAN header. Yet
the driver appears to be using 1536, which is this rounded up to
multiples of 64. So there is already 14 spare bytes in there. So there
must be something else going on here.

> At least 1536 is the default size of the MRBLR register, as specified
> in the h/w ref manual.  Is there some recommended standard size
> to accommodate most (if not all) headers, to refer to?

Not that I know of. These switch headers are proprietary.

Andrew


Re: [net-next PATCH] ethtool: Add support for SCTP verification tag in Rx NFC

2016-08-19 Thread Greg
On Fri, 2016-08-19 at 14:32 -0700, Alexander Duyck wrote:
> The i40e hardware has support for SCTP filtering via Rx NFC however the
> default configuration expects us to include the verification tag as a part
> of the filter.  In order to support that I need to be able to transfer that
> data through the ethtool interface via a new structure.
> 
> This patch adds a new structure to allow us to pass the verification tag
> for IPv4 or IPv6 SCTP traffic.
> 
> Signed-off-by: Alexander Duyck 
> ---
>  include/uapi/linux/ethtool.h |   50 +++---
>  1 file changed, 46 insertions(+), 4 deletions(-)
> 
> diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
> index b8f38e8..12ba8ac 100644
> --- a/include/uapi/linux/ethtool.h
> +++ b/include/uapi/linux/ethtool.h
> @@ -708,7 +708,7 @@ enum ethtool_flags {
>   * @pdst: Destination port
>   * @tos: Type-of-service
>   *
> - * This can be used to specify a TCP/IPv4, UDP/IPv4 or SCTP/IPv4 flow.
> + * This can be used to specify a TCP/IPv4 or UDP/IPv4 flow.
>   */
>  struct ethtool_tcpip4_spec {
>   __be32  ip4src;
> @@ -719,6 +719,27 @@ struct ethtool_tcpip4_spec {
>  };
>  
>  /**
> + * struct ethtool_sctpip4_spec - flow specification for SCTP/IPv4
> + * @ip4src: Source host
> + * @ip4dst: Destination host
> + * @psrc: Source port
> + * @pdst: Destination port
> + * @tos: Type-of-service
> + * @vtag: Verification tag
> + *
> + * This can be used to specify a SCTP/IPv4 flow.
> + */
> +struct ethtool_sctpip4_spec {
> + __be32  ip4src;
> + __be32  ip4dst;
> + __be16  psrc;
> + __be16  pdst;
> + __u8    tos;
> + /* 3 byte hole */
> + __be32  vtag;
> +};
> +
> +/**
>   * struct ethtool_ah_espip4_spec - flow specification for IPsec/IPv4
>   * @ip4src: Source host
>   * @ip4dst: Destination host
> @@ -762,7 +783,7 @@ struct ethtool_usrip4_spec {
>   * @pdst: Destination port
>   * @tclass: Traffic Class
>   *
> - * This can be used to specify a TCP/IPv6, UDP/IPv6 or SCTP/IPv6 flow.
> + * This can be used to specify a TCP/IPv6 or UDP/IPv6 flow.
>   */
>  struct ethtool_tcpip6_spec {
>   __be32  ip6src[4];
> @@ -773,6 +794,27 @@ struct ethtool_tcpip6_spec {
>  };
>  
>  /**
> + * struct ethtool_sctpip6_spec - flow specification for SCTP/IPv6
> + * @ip6src: Source host
> + * @ip6dst: Destination host
> + * @psrc: Source port
> + * @pdst: Destination port
> + * @tclass: Traffic Class
> + * @vtag: Verification tag
> + *
> + * This can be used to specify a SCTP/IPv6 flow.
> + */
> +struct ethtool_sctpip6_spec {
> + __be32  ip6src[4];
> + __be32  ip6dst[4];
> + __be16  psrc;
> + __be16  pdst;
> + __u8    tclass;
> + /* 3 byte hole */
> + __be32  vtag;
> +};
> +
> +/**
>   * struct ethtool_ah_espip6_spec - flow specification for IPsec/IPv6
>   * @ip6src: Source host
>   * @ip6dst: Destination host
> @@ -807,13 +849,13 @@ struct ethtool_usrip6_spec {
>  union ethtool_flow_union {
>   struct ethtool_tcpip4_spec  tcp_ip4_spec;
>   struct ethtool_tcpip4_spec  udp_ip4_spec;
> - struct ethtool_tcpip4_spec  sctp_ip4_spec;
> + struct ethtool_sctpip4_spec sctp_ip4_spec;
>   struct ethtool_ah_espip4_spec   ah_ip4_spec;
>   struct ethtool_ah_espip4_spec   esp_ip4_spec;
>   struct ethtool_usrip4_spec  usr_ip4_spec;
>   struct ethtool_tcpip6_spec  tcp_ip6_spec;
>   struct ethtool_tcpip6_spec  udp_ip6_spec;
> - struct ethtool_tcpip6_spec  sctp_ip6_spec;
> + struct ethtool_sctpip6_spec sctp_ip6_spec;
>   struct ethtool_ah_espip6_spec   ah_ip6_spec;
>   struct ethtool_ah_espip6_spec   esp_ip6_spec;
>   struct ethtool_usrip6_spec  usr_ip6_spec;
> 

Looks good to me.

Reviewed-by: Greg Rose 





Re: [PATCH v2 1/1] ppp: Fix one deadlock issue of PPP when send frame

2016-08-19 Thread Guillaume Nault
On Fri, Aug 19, 2016 at 06:52:58AM +0800, Feng Gao wrote:
> Hi Guillaume,
> 
> Thanks for your detailed analysis.
> Now I think it is a good solution to just drop the packet and print an
> error log, instead of my original solution that supports reentrancy. This
> solution will not bring any side effects.
> 
At least it should improve the situation.
I'm still uncomfortable with this approach because it only fixes part
of the problem. But I have nothing better to propose for now.


Re: [PATCH 2/2] net: sched: avoid duplicates in qdisc dump

2016-08-19 Thread Jiri Kosina
On Thu, 18 Aug 2016, Cong Wang wrote:

> Doesn't this mean we can now just remove the ingress case from 
> tc_dump_qdisc() and simply iterate the whole hash table?

That'd be indeed a nice cleanup, but for that we'll have to make sure that 
the top-level ingress entry makes it to the hashtable as well; that's not 
the case currently, and hence the special-casing (but I agree that it's in 
principle just a remnant of the old design and of a rather 1:1 
list->hashtable conversion, and there is no good reason for keeping it 
this way any more).

So a patch that'd add the ingress qdisc to the hashtable whenever it 
materializes should be enough to get rid of the "we have to walk two 
linked lists" heritage and the ingress special-casing for dumps.

Will look into it next week unless someone is faster.

Thanks,

-- 
Jiri Kosina
SUSE Labs



Re: [PATCH 1/1] ppp: Fix one deadlock issue of PPP when reentrant

2016-08-19 Thread Guillaume Nault
On Fri, Aug 19, 2016 at 11:16:41PM +0800, f...@ikuai8.com wrote:
> From: Gao Feng 
> 
> PPP channel holds one spinlock before sending a frame. But the skb may
> select the same PPP channel with a wrong route policy. As a result,
> the skb reaches the same channel path. It tries to get the same
> spinlock which is already held. Bang, the deadlock comes out.
> 
> Now add one lock owner to avoid it, like xmit_lock_owner of
> netdev_queue. Check the lock owner before trying to get the spinlock.
> If the current cpu is already the owner, it means ppp has found a
> reentrant call and returns directly. If it is not the owner and takes
> the spinlock successfully, it sets owner to the current CPU ID.
> 
> The following is the panic stack of 3.3.8. But the same issue
> should be in the upstream too.
> 
> [] ? _raw_spin_lock_bh+0x11/0x40
> [] ppp_unregister_channel+0x1347/0x2170 [ppp_generic]
> [] ? kmem_cache_free+0xa7/0xc0
> [] ppp_unregister_channel+0x1db7/0x2170 [ppp_generic]
> [] ppp_unregister_channel+0x2065/0x2170 [ppp_generic]
> [] dev_hard_start_xmit+0x4cd/0x620
> [] sch_direct_xmit+0x74/0x1d0
> [] dev_queue_xmit+0x1d/0x30
> [] neigh_direct_output+0xc/0x10
> [] ip_finish_output+0x25e/0x2b0
> [] ip_output+0x88/0x90
> [] ? __ip_local_out+0x9f/0xb0
> [] ip_local_out+0x24/0x30
> [] 0xa00b9744
> [] ppp_unregister_channel+0x20f8/0x2170 [ppp_generic]
> [] ppp_output_wakeup+0x122/0x11d0 [ppp_generic]
> [] vfs_write+0xb8/0x160
> [] sys_write+0x45/0x90
> [] system_call_fastpath+0x16/0x1b
> 
> The call flow is like this.
> ppp_write->ppp_channel_push->start_xmit->select inappropriate route
>  -> dev_hard_start_xmit->ppp_start_xmit->ppp_xmit_process->
> ppp_push. Now ppp_push tries to get the same spinlock which is held
> in ppp_channel_push.
> 
> Although the PPP deadlock is caused by an inappropriate route policy
> with L2TP, I think it is not acceptable that the PPP module could cause
> a kernel deadlock with a wrong route policy.
> 
> Signed-off-by: Gao Feng 
> ---
>  v3: Change the fix solution. Give up the send chance instead of recursive locking
>  v2: Fix recursive unlock issue
>  v1: Initial patch
>  
>  drivers/net/ppp/ppp_generic.c | 95 +--
>  1 file changed, 73 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
> index 70cfa06..b653f1f 100644
> --- a/drivers/net/ppp/ppp_generic.c
> +++ b/drivers/net/ppp/ppp_generic.c
> @@ -162,6 +162,46 @@ struct ppp {
>|SC_MULTILINK|SC_MP_SHORTSEQ|SC_MP_XSHORTSEQ \
>|SC_COMP_TCP|SC_REJ_COMP_TCP|SC_MUST_COMP)
>  
> +struct channel_lock {
> + spinlock_t lock;
> + int owner;
> +};
> +
> +static inline void ppp_channel_lock_init(struct channel_lock *cl)
No need for inline in .c files.

> +{
> + cl->owner = -1;
> + spin_lock_init(&cl->lock);
> +}
> +
> +static inline bool ppp_channel_lock_bh(struct channel_lock *cl)
> +{
> + int cpu;
> +
> + local_bh_disable();
> + cpu = smp_processor_id();
> + if (cpu == cl->owner) {
> + /* The CPU already holds this channel lock and sends. But the
> +  * channel is selected after inappropriate route. It causes the
> +  * channel to be re-entered, which is forbidden by the PPP module.
> +  */
> + if (net_ratelimit())
> + pr_err("PPP detects one recursive channel send\n");
> + local_bh_enable();
What about calling local_bh_enable() before logging the error?

> + return false;
> + }
> + spin_lock(&cl->lock);
> + cl->owner = cpu;
> +
> + return true;
> +}
> +
> +static inline void ppp_channel_unlock_bh(struct channel_lock *cl)
> +{
> + cl->owner = -1;
> + spin_unlock(&cl->lock);
> + local_bh_enable();
> +}
> +
>  /*
>   * Private data structure for each channel.
>   * This includes the data structure used for multilink.
> @@ -171,7 +211,7 @@ struct channel {
>   struct list_head list;  /* link in all/new_channels list */
>   struct ppp_channel *chan;   /* public channel data structure */
>   struct rw_semaphore chan_sem;   /* protects `chan' during chan ioctl */
> - spinlock_t  downl;  /* protects `chan', file.xq dequeue */
> + struct channel_lock downl;  /* protects `chan', file.xq dequeue */
>   struct ppp  *ppp;   /* ppp unit we're connected to */
>   struct net  *chan_net;  /* the net channel belongs to */
>   struct list_head clist; /* link in list of channels per unit */

> @@ -1645,6 +1687,7 @@ static int ppp_mp_explode(struct ppp *ppp, struct 
> sk_buff *skb)
>   struct channel *pch;
>   struct sk_buff *frag;
>   struct ppp_channel *chan;
> + bool locked;
>  
>   totspeed = 0; /*total bitrate of the bundle*/
>   nfree = 0; /* # channels which have no packet already queued */
> @@ -1735,17 +1778,21 @@ static int ppp_mp_explode(struct 

Re: [PATCH] gianfar: fix size of fragmented frames

2016-08-19 Thread Claudiu Manoil
Hi Zefir,

[sorry if the message indentation is wrong ... webmail]


From: Zefir Kurtisi 
Sent: Friday, August 19, 2016 8:11 PM
To: Claudiu Manoil
Subject: Re: [PATCH] gianfar: fix size of fragmented frames
    
Hi Claudiu,

On 08/19/2016 06:46 PM, Claudiu Manoil wrote:
>> -Original Message-
>> From: Zefir Kurtisi [mailto:zefir.kurt...@neratec.com]
>> Sent: Friday, August 19, 2016 12:16 PM
>> To: netdev@vger.kernel.org
>> Cc: claudiu.man...@freescale.com
>> Subject: [PATCH] gianfar: fix size of fragmented frames
>>
>> The eTSEC RxBD 'Data Length' field is context dependent:
>> for the last fragment it contains the full frame size,
>> while fragments contain the fragment size, which equals
>> the value written to register MRBLR.
>>
> 
> According to RM the last fragment has the whole packet length indeed,
> and this should apply to fragmented frames too:
> 
> " Data length, written by the eTSEC.
> Data length is the number of octets written by the eTSEC into this BD's data 
> buffer if L is cleared
> (the value is equal to MRBLR), or, if L is set, the length of the frame 
> including CRC, FCB (if
> RCTRL[PRSDEP] > 00), preamble (if MACCFG2[PreAmRxEn]=1), time stamp (if 
> RCTRL[TS] = 1)
> and any padding (RCTRL[PAL]). "
> 
>> This differentiation is missing in the gianfar driver,
>> which causes data corruption as soon as the hardware
>> starts to fragment receiving frames. As a result, the
>> size of fragmented frames is increased by
>> (nr_frags - 1) * MRBLR
>>
>> We first noticed this issue working with DSA, where a
>> 1540 octet frame is fragmented by the hardware and
>> reconstructed by the driver as 3076 octet frame.
>>
>> This patch fixes the problem by adjusting the size of
>> the last fragment.
>>
>> Signed-off-by: Zefir Kurtisi 
>> ---
>> drivers/net/ethernet/freescale/gianfar.c | 19 +--
>> 1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/freescale/gianfar.c
>> b/drivers/net/ethernet/freescale/gianfar.c
>> index d20935d..4187280 100644
>> --- a/drivers/net/ethernet/freescale/gianfar.c
>> +++ b/drivers/net/ethernet/freescale/gianfar.c
>> @@ -2922,17 +2922,24 @@ static bool gfar_add_rx_frag(struct gfar_rx_buff 
>> *rxb,
>> u32 lstatus,
>> {
>>   unsigned int size = lstatus & BD_LENGTH_MASK;
>>   struct page *page = rxb->page;
>> +    bool last = !!(lstatus & BD_LFLAG(RXBD_LAST));
>>
>>   /* Remove the FCS from the packet length */
>> -    if (likely(lstatus & BD_LFLAG(RXBD_LAST)))
>> +    if (last)
>>   size -= ETH_FCS_LEN;
>>
>> -    if (likely(first))
>> +    if (likely(first)) {
>>   skb_put(skb, size);
>> -    else
>> -    skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
>> -    rxb->page_offset + RXBUF_ALIGNMENT,
>> -    size, GFAR_RXB_TRUESIZE);
>> +    } else {
>> +    /* the last fragments' length contains the full frame length */
>> +    if (last)
>> +    size -= skb->len;
> 
> While I agree with your finding, I don't think this works for packets having 
> more
> than 2 buffers (head + 1 fragment).
> How's this supposed to work for a skb with 2 fragments, for instance?
> I think this needs more thoughtful consideration.
> 
In fact, this works and I tested it by setting GFAR_RXB_SIZE to 256. Receiving a
1000 byte frame then results in 4 fragments, 3*256 plus the last one sized 1000.
The driver then combines 256+256+256+232 (=1000-768) back to a 1000 bytes frame.

I don't see why it should not, because skb->len is exactly the size of the
fragments added before the last one, so subtracting them from the total frame 
size
should give the size of the last fragment, no?

[Claudiu]
Thanks for the details. I didn't have much time to look into it (I still 
don't), and you can
never be too careful with this kind of changes. But I agree with your 
assessment.
I'm surprised this didn't come out earlier, like a warning from the stack, 
since the
truesize can be easily exceeded in this case.

>> +
>> +    if (size > 0 && size <= GFAR_RXB_SIZE)
> 
> Why do you need this check?
> The h/w ensures that the buffers won't exceed GFAR_RXB_SIZE
> (which is MRBL), size 0 is also not possible.
> 
Do you question the first part? In cases where the last fragment consists of 
only
the FCS, adding a zero size fragment would falsify skb->truesize.

The second expression is fail-safe only - it should never happen given the
hardware specs - but if it did, not checking for a sane frame size might cause
serious trouble (not sure what skb_add_rx_frag() does with an insanely high size
value). If you prefer, I will leave it out in v2 of the patch.

[Claudiu]
Nice catch with the size > 0 part too.  The second part is debatable indeed, it 
may cause
confusion implying that size may exceed that limit, which is false, since the 
point of RX
S/G is that a buffer doesn't exceed 

Re: [PATCH 0/3] rhashtable: Get rid of raw table walkers part 1

2016-08-19 Thread David Miller
From: Herbert Xu 
Date: Thu, 18 Aug 2016 16:48:54 +0800

> This series starts the process of getting rid of all raw rhashtable
> walkers (e.g., using any of the rht_for_each helpers) from the
> kernel.
> 
> We need to do this before I can fix the resize kmalloc failure issue
> by using multi-layered tables.
> 
> We should do this anyway because almost all raw table walkers are
> already buggy in that they don't handle multiple rhashtables during
> a resize.

Series applied to net-next with v2 of patch #3.

> Dave/Tomas, please keep an eye out for any new patches that try
> to introduce raw table walkers and nack them.

Ok.


Re: [PATCH 2/3] net: stmmac: add a glue driver for the Amlogic Meson 8b / GXBB DWMAC

2016-08-19 Thread Kevin Hilman
Martin Blumenstingl  writes:

> The Ethernet controller available in Meson8b and GXBB SoCs is a Synopsys
> DesignWare MAC IP core which is already supported by the stmmac driver.
>
> In addition to the standard stmmac driver some Meson8b / GXBB specific
> registers have to be configured for the PHY clocks. These SoC specific
> registers are called PRG_ETHERNET_ADDR0 and PRG_ETHERNET_ADDR1 in the
> datasheet.
> These registers are not backwards compatible with those on Meson 6b,
> which is why a new glue driver is introduced. This worked for many
> boards because the bootloader programs the PRG_ETHERNET registers
> correctly. Additionally the meson6-dwmac driver only sets bit 1 of
> PRG_ETHERNET_ADDR0 which (according to the datasheet) is only used
> during reset.
>
> Currently all configuration values can be determined automatically,
> based on the configured phy-mode (which is mandatory for the stmmac
> driver). If required the tx-delay and the mux clock (so it supports
> the MPLL2 clock as well) can be made configurable in the future.
>
> Signed-off-by: Martin Blumenstingl 

[...]

> +static int meson8b_init_clk(struct meson8b_dwmac *dwmac)
> +{
> + struct clk_init_data init;
> + int i, ret;
> + struct device *dev = &dwmac->pdev->dev;
> + char clk_name[32];
> + const char *clk_div_parents[1];
> + const char *mux_parent_names[MUX_CLK_NUM_PARENTS];
> + unsigned int mux_parent_count = 0;
> + static struct clk_div_table clk_25m_div_table[] = {
> + { .val = 0, .div = 5 },
> + { .val = 1, .div = 10 },
> + { /* sentinel */ },
> + };
> +
> + /* get the mux parents from DT */
> + for (i = 0; i < MUX_CLK_NUM_PARENTS; i++) {
> + char name[16];
> +
> + snprintf(name, sizeof(name), "clkin%d", i);
> + dwmac->m250_mux_parent[i] = devm_clk_get(dev, name);
> + if (IS_ERR(dwmac->m250_mux_parent[i])) {
> + /* NOTE: the second clock (MP2) is unused on all known

nit: multi-line comment style  (c.f. Documentation/CodingStyle search
for "multi-line")

> +  * boards, thus we're making it optional here.
> +  */
> + if (i > 0)
> + continue;

IMO, this is a bit confusing.  I think this should either be coded to
deal with an optional "clkin1" (preferred), or it should be coded to
without a mux and clkin0 is directly the parent of the divider.  The
current way of always bailing out of this loop early is a bit confusing.

> + ret = PTR_ERR(dwmac->m250_mux_parent[i]);
> + if (ret != -EPROBE_DEFER)
> + dev_err(dev, "Missing clock %s\n", name);
> + dwmac->m250_mux_parent[i] = NULL;
> + return ret;
> + }
> +
> + mux_parent_names[i] =
> + __clk_get_name(dwmac->m250_mux_parent[i]);
> + mux_parent_count++;
> + }
> +
> + /* create the m250_mux */
> + snprintf(clk_name, sizeof(clk_name), "%s#m250_sel", dev_name(dev));
> + init.name = clk_name;
> + init.ops = &clk_mux_ops;
> + init.flags = CLK_IS_BASIC;
> + init.parent_names = mux_parent_names;
> + init.num_parents = mux_parent_count;
> +
> + dwmac->m250_mux.reg = dwmac->regs + PRG_ETH0;
> + dwmac->m250_mux.shift = PRG_ETH0_CLK_M250_SEL_SHIFT;
> + dwmac->m250_mux.mask = PRG_ETH0_CLK_M250_SEL_MASK;
> + dwmac->m250_mux.flags = 0;
> + dwmac->m250_mux.table = NULL;
> + dwmac->m250_mux.hw.init = &init;
> +
> + dwmac->m250_mux_clk = devm_clk_register(dev, &dwmac->m250_mux.hw);
> + if (WARN_ON(PTR_ERR_OR_ZERO(dwmac->m250_mux_clk)))
> + return PTR_ERR(dwmac->m250_mux_clk);
> +
> + /* create the m250_div */
> + snprintf(clk_name, sizeof(clk_name), "%s#m250_div", dev_name(dev));
> + init.name = devm_kstrdup(dev, clk_name, GFP_KERNEL);
> + init.ops = &clk_divider_ops;
> + init.flags = CLK_IS_BASIC | CLK_SET_RATE_PARENT;

hmm, with CLK_SET_RATE_PARENT that implies that it might switch the mux,
but it's hard-coded above so the mux only ever has one parent, so that
can never happen.

> + clk_div_parents[0] = __clk_get_name(dwmac->m250_mux_clk);
> + init.parent_names = clk_div_parents;
> + init.num_parents = ARRAY_SIZE(clk_div_parents);
> +
> + dwmac->m250_div.reg = dwmac->regs + PRG_ETH0;
> + dwmac->m250_div.shift = PRG_ETH0_CLK_M250_DIV_SHIFT;
> + dwmac->m250_div.width = PRG_ETH0_CLK_M250_DIV_WIDTH;
> + dwmac->m250_div.hw.init = &init;
> + dwmac->m250_div.flags = CLK_DIVIDER_ONE_BASED | CLK_DIVIDER_ALLOW_ZERO;
> +
> + dwmac->m250_div_clk = devm_clk_register(dev, &dwmac->m250_div.hw);
> + if (WARN_ON(PTR_ERR_OR_ZERO(dwmac->m250_div_clk)))
> + return PTR_ERR(dwmac->m250_div_clk);
> +
> + /* create the m25_div */
> + 

Re: [PATCH v1 1/1] l2tp: Use existing macros instead of literal number

2016-08-19 Thread David Miller
From: f...@ikuai8.com
Date: Thu, 18 Aug 2016 15:05:19 +0800

> @@ -138,6 +138,8 @@ static const struct ppp_channel_ops pppol2tp_chan_ops = {
>  
>  static const struct proto_ops pppol2tp_ops;
>  
> +static const unsigned char fixed_ppphdr[2] = {PPP_ALLSTATIONS, PPP_UI};
> +
>  /* Helpers to obtain tunnel/session contexts from sockets.
>   */
>  static inline struct l2tp_session *pppol2tp_sock_to_session(struct sock *sk)

Same problem as the pptp patch, I'm not applying this.


Re: [PATCH v1 1/2] pptp: Use macro and sizeof instead of literal number

2016-08-19 Thread David Miller
From: f...@ikuai8.com
Date: Thu, 18 Aug 2016 12:47:34 +0800

> @@ -54,6 +54,8 @@ static struct proto pptp_sk_proto __read_mostly;
>  static const struct ppp_channel_ops pptp_chan_ops;
>  static const struct proto_ops pptp_ops;
>  
> +static const u8 fixed_ppphdr[2] = {PPP_ALLSTATIONS, PPP_UI};
> +

This makes things much worse in my opinion, and at the cost
of two extra bytes in the data section as well.

I'm not applying this.


Re: [PATCH] gianfar: prevent fragmentation in DSA environments

2016-08-19 Thread Claudiu Manoil
Hi Andrew,


From: Andrew Lunn 
Sent: Friday, August 19, 2016 7:39 PM
To: Claudiu Manoil
Cc: Zefir Kurtisi; netdev@vger.kernel.org; claudiu.man...@freescale.com
Subject: Re: [PATCH] gianfar: prevent fragmentation in DSA environments
    
> >> -#define GFAR_RXB_SIZE 1536
> >> +/* prevent fragmentation by HW in DSA environments */
> >> +#define GFAR_RXB_SIZE (1536 + RXBUF_ALIGNMENT)
> >
> >Hi Zefir
> >
> >Using RXBUF_ALIGNMENT suggests this has something to do with
> >alignment, not extra headers.
> >
> >How about
> >
> >/* Prevent fragmentation by HW when using extra headers like DSA */
> >#define GFAR_RXB_SIZE (1536 + 8)
> >
> 
> MRBL (Maximum receive buffer length) must be multiple of 64.
> Please consult the hardware documentation at least before proposing
> changes to the driver.
 
Hi Claudiu

Thanks for pointing this out. This makes RXBUF_ALIGNMENT even worse
for understandability, since you say multiples of 64 is important, not
alignment.

How about

/* Prevent fragmentation by HW when using extra headers like DSA,
while respecting the multiple of 64 requirement. */

#define GFAR_RXB_SIZE roundup(1536 + 8, 64)

[Claudiu]
Nice improvement.
But what's so special about 8?  I thought only 4 bytes were missing :)
At least 1536 is the default size of the MRBLR register, as specified
in the h/w ref manual.  Is there some recommended standard size
to accommodate most (if not all) headers, to refer to?




[net-next PATCH] ethtool: Add support for SCTP verification tag in Rx NFC

2016-08-19 Thread Alexander Duyck
The i40e hardware has support for SCTP filtering via Rx NFC however the
default configuration expects us to include the verification tag as a part
of the filter.  In order to support that I need to be able to transfer that
data through the ethtool interface via a new structure.

This patch adds a new structure to allow us to pass the verification tag
for IPv4 or IPv6 SCTP traffic.

Signed-off-by: Alexander Duyck 
---
 include/uapi/linux/ethtool.h |   50 +++---
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index b8f38e8..12ba8ac 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -708,7 +708,7 @@ enum ethtool_flags {
  * @pdst: Destination port
  * @tos: Type-of-service
  *
- * This can be used to specify a TCP/IPv4, UDP/IPv4 or SCTP/IPv4 flow.
+ * This can be used to specify a TCP/IPv4 or UDP/IPv4 flow.
  */
 struct ethtool_tcpip4_spec {
__be32  ip4src;
@@ -719,6 +719,27 @@ struct ethtool_tcpip4_spec {
 };
 
 /**
+ * struct ethtool_sctpip4_spec - flow specification for SCTP/IPv4
+ * @ip4src: Source host
+ * @ip4dst: Destination host
+ * @psrc: Source port
+ * @pdst: Destination port
+ * @tos: Type-of-service
+ * @vtag: Verification tag
+ *
+ * This can be used to specify a SCTP/IPv4 flow.
+ */
+struct ethtool_sctpip4_spec {
+   __be32  ip4src;
+   __be32  ip4dst;
+   __be16  psrc;
+   __be16  pdst;
+   __u8    tos;
+   /* 3 byte hole */
+   __be32  vtag;
+};
+
+/**
  * struct ethtool_ah_espip4_spec - flow specification for IPsec/IPv4
  * @ip4src: Source host
  * @ip4dst: Destination host
@@ -762,7 +783,7 @@ struct ethtool_usrip4_spec {
  * @pdst: Destination port
  * @tclass: Traffic Class
  *
- * This can be used to specify a TCP/IPv6, UDP/IPv6 or SCTP/IPv6 flow.
+ * This can be used to specify a TCP/IPv6 or UDP/IPv6 flow.
  */
 struct ethtool_tcpip6_spec {
__be32  ip6src[4];
@@ -773,6 +794,27 @@ struct ethtool_tcpip6_spec {
 };
 
 /**
+ * struct ethtool_sctpip6_spec - flow specification for SCTP/IPv6
+ * @ip6src: Source host
+ * @ip6dst: Destination host
+ * @psrc: Source port
+ * @pdst: Destination port
+ * @tclass: Traffic Class
+ * @vtag: Verification tag
+ *
+ * This can be used to specify a SCTP/IPv6 flow.
+ */
+struct ethtool_sctpip6_spec {
+   __be32  ip6src[4];
+   __be32  ip6dst[4];
+   __be16  psrc;
+   __be16  pdst;
+   __u8    tclass;
+   /* 3 byte hole */
+   __be32  vtag;
+};
+
+/**
  * struct ethtool_ah_espip6_spec - flow specification for IPsec/IPv6
  * @ip6src: Source host
  * @ip6dst: Destination host
@@ -807,13 +849,13 @@ struct ethtool_usrip6_spec {
 union ethtool_flow_union {
struct ethtool_tcpip4_spec  tcp_ip4_spec;
struct ethtool_tcpip4_spec  udp_ip4_spec;
-   struct ethtool_tcpip4_spec  sctp_ip4_spec;
+   struct ethtool_sctpip4_spec sctp_ip4_spec;
struct ethtool_ah_espip4_spec   ah_ip4_spec;
struct ethtool_ah_espip4_spec   esp_ip4_spec;
struct ethtool_usrip4_spec  usr_ip4_spec;
struct ethtool_tcpip6_spec  tcp_ip6_spec;
struct ethtool_tcpip6_spec  udp_ip6_spec;
-   struct ethtool_tcpip6_spec  sctp_ip6_spec;
+   struct ethtool_sctpip6_spec sctp_ip6_spec;
struct ethtool_ah_espip6_spec   ah_ip6_spec;
struct ethtool_ah_espip6_spec   esp_ip6_spec;
struct ethtool_usrip6_spec  usr_ip6_spec;



Re: [PATCH v2 1/2] ipv6: save route expiry in RTA_EXPIRES if RTF_EXPIRES set

2016-08-19 Thread Eric Dumazet
On Fri, 2016-08-19 at 18:41 +0200, Andrew Yourtchenko wrote:
> This allows "ip -6 route save" to save the expiry for the routes
> that have it, such that it can be correctly restored later by
> "ip -6 route restore".
> 
> If a route has RTF_EXPIRES set, generate RTA_EXPIRES value which
> will be used to restore the flag and expiry value by already
> existing code in rtm_to_fib6_config.
> 
> The expiry was already being saved as part of RTA_CACHEINFO
> in rtnl_put_cacheinfo(), but adding code to generate RTF_EXPIRES upon save
> looked more appropriate than redundant cherrypicking from
> RTA_CACHEINFO upon restore.
> 
> Signed-off-by: Andrew Yourtchenko 
> ---
> Changes since v1 [1]: 
>  * fixed the indentation in a multiline function call
>as per David Miller's review
> 
> [1] v1: http://marc.info/?l=linux-netdev=147135597422280=2
> 
>  net/ipv6/route.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 4981755..f5b987d 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -3244,6 +3244,14 @@ static int rt6_fill_node(struct net *net,
>   if (rtnl_put_cacheinfo(skb, &rt->dst, 0, expires, rt->dst.error) < 0)
>   goto nla_put_failure;
>  
> + /* Can't rely on expires == 0. It is zero if no expires flag,
> +  * or if the timing is precisely at expiry. So, recheck the flag.
> +  */
> + if (rt->rt6i_flags & RTF_EXPIRES)
> + if (nla_put_u32(skb, RTA_EXPIRES,
> + expires > 0 ? expires / HZ : 0))
> + goto nla_put_failure;
> +

Why sending (kernel -> user) the value in second units, while the other
direction (user -> kernel) expects hz ?

iproute2$ git grep -n RTA_EXPIRES
include/linux/rtnetlink.h:389:  RTA_EXPIRES,
ip/iproute.c:930:   addattr32(&req.n, sizeof(req), RTA_EXPIRES, expires*hz);

kernel patch : (32bc201e1974976b7d3fea9a9b17bb7392ca6394)

--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c

@@ -2809,6 +2810,15 @@ static int rtm_to_fib6_config(struct sk_buff *skb, 
struct nlmsghdr *nlh,
if (tb[RTA_ENCAP_TYPE])
cfg->fc_encap_type = nla_get_u16(tb[RTA_ENCAP_TYPE]);
 
+   if (tb[RTA_EXPIRES]) {
+   unsigned long timeout = addrconf_timeout_fixup(nla_get_u32(tb[RTA_EXPIRES]), HZ);
+
+   if (addrconf_finite_timeout(timeout)) {
+   cfg->fc_expires = jiffies_to_clock_t(timeout * HZ);
+   cfg->fc_flags |= RTF_EXPIRES;
+   }
+   }
+





[Feature Request] IP_BIND_ADDRESS_NO_PORT with port

2016-08-19 Thread Cristian Morales Vega
I would like a socket option that lets me share a source port as long
as the 4-tuples are unique in a UDP socket *and* specify which source
port that is.
It would be similar to IP_BIND_ADDRESS_NO_PORT. It would do exactly
the same when you bind to port 0, but when you bind to a different
port it would say "don't do anything with this port yet, but when I
connect() please use the port as source port instead of a random
ephemeral port".

The reason is the following. I would like to have a server receiving
and sending UDP datagrams from multiple clients using a single port. I
would like to use multiple sockets for sending, so I can use different
SO_MAX_PACING_RATEs for each one. One socket per client/session.
So there is a single "main socket" bind to the relevant port
(equivalent to the one that accept()s in TCP), and when it receives a
new session request it creates a new socket specific for that session
(as accept() does). When creating that new socket I first need to
bind() it to select the port, and then connect() it to make it receive
only datagrams from the relevant session.
The problem is that between the bind() and the connect() there is a
small time in which the new socket can receive data that should go to
the "main socket". It's not an unsolvable problem but makes the server
code more complicated than it needs to be.


Re: [PATCH v1 1/1] pppoe: l2tp: the PPPOX_CONNECTED should be used with bit operation

2016-08-19 Thread Guillaume Nault
On Fri, Aug 19, 2016 at 06:58:58AM +0800, Feng Gao wrote:
> inline.
> 
> On Fri, Aug 19, 2016 at 1:44 AM, Guillaume Nault  wrote:
> > On Thu, Aug 18, 2016 at 09:59:03AM +0800, f...@ikuai8.com wrote:
> >> From: Gao Feng 
> >>
> >> There are some codes in pppoe and l2tp which use the PPPOX_CONNECTED
> >> as the value including assignment and condition check.
> >> They should keep consistent with other codes.
> >>
> >> Signed-off-by: Gao Feng 
> >> ---
> >>  v1: Initial Patch
> >>
> >>  drivers/net/ppp/pppoe.c | 2 +-
> >>  net/l2tp/l2tp_ppp.c | 4 ++--
> >>  2 files changed, 3 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
> >> index 4ddae81..684b773 100644
> >> --- a/drivers/net/ppp/pppoe.c
> >> +++ b/drivers/net/ppp/pppoe.c
> >> @@ -697,7 +697,7 @@ static int pppoe_connect(struct socket *sock, struct 
> >> sockaddr *uservaddr,
> >>   goto err_put;
> >>   }
> >>
> >> - sk->sk_state = PPPOX_CONNECTED;
> >> + sk->sk_state |= PPPOX_CONNECTED;
> >>
> > Using plain assignment makes it clear for the reader that other flags
> > are unset. I see no reason for changing this.
> 
> I get you. So I don't modify the PPPOX_DEAD assignment.
> But I am afraid if there is some case that the flag PPPOX_BOUND is set
> before PPPOX_CONNECTED . Then the assignment of "PPPOX_CONNECTED" will
> clear the PPPOX_BOUND flag.
> 
PPPOX_BOUND shouldn't be set here. If the socket hasn't been connected
before, it can't have the BOUND flag (pppox_ioctl(PPPIOCGCHAN) would
fail). But if it was connected, then it'd have to go through the
'/* Delete the old binding */' part of pppoe_connect() first, thus
resetting sk_state to PPPOX_NONE.


Re: [PATCH net-next 2/3] net: mpls: Fixups for GSO

2016-08-19 Thread Alexander Duyck
On Fri, Aug 19, 2016 at 10:09 AM, David Ahern  wrote:
> As reported by Lennert the MPLS GSO code is failing to properly segment
> large packets. There are a couple of problems:
>
> 1. the inner protocol is not set so the gso segment functions for inner
>protocol layers are not getting run, and
>
> 2. MPLS labels for packets that use the "native" (non-OVS) MPLS code
>are not properly accounted for in mpls_gso_segment.
>
> The MPLS GSO code was added for OVS. It is re-using skb_mac_gso_segment
> to call the gso segment functions for the higher layer protocols. That
> means skb_mac_gso_segment is called twice -- once with the network
> protocol set to MPLS and again with the network protocol set to the
> inner protocol.
>
> This patch sets the inner skb protocol addressing item 1 above and sets
> the network_header and inner_network_header to mark where the MPLS labels
> start and end. The MPLS code in OVS is also updated to set the two
> network markers.
>
> From there the MPLS GSO code uses the difference between the network
> header and the inner network header to know the size of the MPLS header
> that was pushed. It then pulls the MPLS header, resets the mac_len and
> protocol for the inner protocol and then calls skb_mac_gso_segment
> to segment the skb.
>
> After the inner protocol segmentation is done, the skb protocol
> is set to mpls for each segment and the network and mac headers
> restored.
>
> Reported-by: Lennert Buytenhek 
> Signed-off-by: David Ahern 
> ---
>  net/mpls/mpls_gso.c   | 38 +++---
>  net/mpls/mpls_iptunnel.c  |  4 
>  net/openvswitch/actions.c |  6 ++
>  3 files changed, 37 insertions(+), 11 deletions(-)
>
> diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c
> index 2055e57ed1c3..2aa4beaa0e4f 100644
> --- a/net/mpls/mpls_gso.c
> +++ b/net/mpls/mpls_gso.c
> @@ -23,32 +23,48 @@ static struct sk_buff *mpls_gso_segment(struct sk_buff 
> *skb,
>netdev_features_t features)
>  {
> struct sk_buff *segs = ERR_PTR(-EINVAL);
> +   u16 mac_offset = skb->mac_header;
> netdev_features_t mpls_features;
> +   u16 mac_len = skb->mac_len;
> __be16 mpls_protocol;
> +   int mpls_hlen;
> +
> +   skb_reset_network_header(skb);
> +   mpls_hlen = skb_inner_network_header(skb) - skb_network_header(skb);
> +   if (unlikely(!pskb_may_pull(skb, mpls_hlen)))
> +   goto out;
>
> /* Setup inner SKB. */
> mpls_protocol = skb->protocol;
> skb->protocol = skb->inner_protocol;
>
> -   /* Push back the mac header that skb_mac_gso_segment() has pulled.
> -* It will be re-pulled by the call to skb_mac_gso_segment() below
> -*/
> -   __skb_push(skb, skb->mac_len);
> +   __skb_pull(skb, mpls_hlen);
> +
> +   skb->mac_len = 0;
> +   skb_reset_mac_header(skb);
> +   skb_set_network_header(skb, skb_inner_network_offset(skb));

No need to set the network header.  Both IPv4 and IPv6 GSO paths will
reset the network header just like you did at the start.

> /* Segment inner packet. */
> mpls_features = skb->dev->mpls_features & features;
> segs = skb_mac_gso_segment(skb, mpls_features);
> +   if (IS_ERR_OR_NULL(segs)) {
> +   skb_gso_error_unwind(skb, mpls_protocol, mpls_hlen, 
> mac_offset,
> +mac_len);
> +   goto out;
> +   }
>
> +   skb = segs;

You could probably pull your math for mpls_hlen + mac_len out of the
loop below and just take care of adding mac_len to mpls_hlen up here
and store it off in mpls_hlen since it isn't used anywhere else.

> +   do {
> +   skb->mac_len = mac_len;
> +   skb->protocol = mpls_protocol;
>
> -   /* Restore outer protocol. */
> -   skb->protocol = mpls_protocol;
> +   __skb_push(skb, mpls_hlen + mac_len);
>
> -   /* Re-pull the mac header that the call to skb_mac_gso_segment()
> -* above pulled.  It will be re-pushed after returning
> -* skb_mac_gso_segment(), an indirect caller of this function.
> -*/
> -   __skb_pull(skb, skb->data - skb_mac_header(skb));

You need to store off the inner network header before you overwrite it
in the lines below.  Either skb_reset_inner_network_header before the
push, or skb_reset_inner_headers before you call the two lines below.

> +   skb_reset_mac_header(skb);
> +   skb_set_network_header(skb, mac_len);
> +   } while ((skb = skb->next));
>
> +out:
> return segs;
>  }
>
> diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
> index aed872cc05a6..cf52cf30ac4b 100644
> --- a/net/mpls/mpls_iptunnel.c
> +++ b/net/mpls/mpls_iptunnel.c
> @@ -90,7 +90,11 @@ static int mpls_xmit(struct sk_buff *skb)
> if (skb_cow(skb, hh_len + 

[PATCH] cfg80211: Add stub for cfg80211_get_station()

2016-08-19 Thread Linus Lüssing
This allows modules using this function (currently: batman-adv) to
compile even if cfg80211 is not built at all, thus relaxing
dependencies.

Signed-off-by: Linus Lüssing 
---
 include/net/cfg80211.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/net/cfg80211.h b/include/net/cfg80211.h
index 9c23f4d3..beb7610 100644
--- a/include/net/cfg80211.h
+++ b/include/net/cfg80211.h
@@ -1102,6 +1102,7 @@ struct station_info {
struct cfg80211_tid_stats pertid[IEEE80211_NUM_TIDS + 1];
 };
 
+#if IS_ENABLED(CONFIG_CFG80211)
 /**
  * cfg80211_get_station - retrieve information about a given station
  * @dev: the device where the station is supposed to be connected to
@@ -1114,6 +1115,14 @@ struct station_info {
  */
 int cfg80211_get_station(struct net_device *dev, const u8 *mac_addr,
 struct station_info *sinfo);
+#else
+static inline int cfg80211_get_station(struct net_device *dev,
+  const u8 *mac_addr,
+  struct station_info *sinfo)
+{
+   return -ENOENT;
+}
+#endif
 
 /**
  * enum monitor_flags - monitor flags
-- 
2.1.4



Re: [PATCH 0/1] HW checksum offload problem on Xilinx Zynq GEM

2016-08-19 Thread Tom Herbert
On Fri, Aug 19, 2016 at 12:28 PM, David Miller  wrote:
> From: Helmut Buchsbaum 
> Date: Fri, 19 Aug 2016 15:04:57 +0200
>
>> When working on upgrading the v3.x kernels of our embedded devices
>> to more recent 4.x kernels we noticed some of our proprietary networking
>> stuff is broken. Further investigations brought up an issue with small
>> UDP packets (data payload <= 2), which contained wrong UDP header
>> checksums.
>> We tracked this down to commit 85ff3d87bf2ef1fadcde8553628c60f79806fdb4
>> net/macb: add TX checksum offload feature. It turns out that Zynq's GEM
>> is obviously buggy regarding the UDP checksum calculation of such small
>> UDP packets as long as the UDP checksum field is != 0 *BEFORE* the
>> HW calculation. But since udp_send_skb() *ALWAYS* calculates the UDP header
>> checksum (unless disabled via socket option), this is the usual case.
>> Unfortunately it does not respect the net device feature setting which
>> would leave UDP checksum untouched when checksum offloading is enabled.
>
> Then simply clear the checksum field in the driver, or fix
> udp_send_skb() to do what you claim it is supposed to.
>
> That seems like a much more appropriate fix to me.

Fix the driver to do that. When offloading a UDP checksum the pseudo
header checksum is written into the checksum field-- that's the
interface that everyone else has been using without issue.


[PATCH net-next] hv_netvsc: Implement batching of receive completions

2016-08-19 Thread Haiyang Zhang
From: Haiyang Zhang 

The existing code uses busy retry when unable to send out receive
completions due to full ring buffer. It also gives up retrying after limit
is reached, and causes receive buffer slots not being recycled.
This patch implements batching of receive completions. It also prevents
dropping receive completions due to full ring buffer.

Signed-off-by: Haiyang Zhang 
Reviewed-by: Stephen Hemminger 
---
 drivers/net/hyperv/hyperv_net.h   |   17 
 drivers/net/hyperv/netvsc.c   |  170 ++---
 drivers/net/hyperv/rndis_filter.c |6 +-
 3 files changed, 160 insertions(+), 33 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index fa7b1e4..ce45d68 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -634,6 +634,20 @@ struct multi_send_data {
u32 count; /* counter of batched packets */
 };
 
+struct recv_comp_data {
+   u64 tid; /* transaction id */
+   u32 status;
+};
+
+/* Netvsc Receive Slots Max */
+#define NETVSC_RECVSLOT_MAX (NETVSC_RECEIVE_BUFFER_SIZE / ETH_DATA_LEN + 1)
+
+struct multi_recv_comp {
+   void *buf; /* queued receive completions */
+   u32 first; /* first data entry */
+   u32 next; /* next entry for writing */
+};
+
 struct netvsc_stats {
u64 packets;
u64 bytes;
@@ -736,6 +750,9 @@ struct netvsc_device {
u32 max_pkt; /* max number of pkt in one send, e.g. 8 */
u32 pkt_align; /* alignment bytes, e.g. 8 */
 
+   struct multi_recv_comp mrc[VRSS_CHANNEL_MAX];
+   atomic_t num_outstanding_recvs;
+
atomic_t open_cnt;
 };
 
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 8078bc2..b15edfc 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -74,6 +74,9 @@ static struct netvsc_device *alloc_net_device(void)
return NULL;
}
 
+   net_device->mrc[0].buf = vzalloc(NETVSC_RECVSLOT_MAX *
+sizeof(struct recv_comp_data));
+
	init_waitqueue_head(&net_device->wait_drain);
	net_device->destroy = false;
	atomic_set(&net_device->open_cnt, 0);
@@ -85,6 +88,11 @@ static struct netvsc_device *alloc_net_device(void)
 
 static void free_netvsc_device(struct netvsc_device *nvdev)
 {
+   int i;
+
+   for (i = 0; i < VRSS_CHANNEL_MAX; i++)
+   vfree(nvdev->mrc[i].buf);
+
kfree(nvdev->cb_buffer);
kfree(nvdev);
 }
@@ -107,7 +115,8 @@ static struct netvsc_device *get_inbound_net_device(struct 
hv_device *device)
goto get_in_err;
 
if (net_device->destroy &&
-   atomic_read(&net_device->num_outstanding_sends) == 0)
+   atomic_read(&net_device->num_outstanding_sends) == 0 &&
+   atomic_read(&net_device->num_outstanding_recvs) == 0)
net_device = NULL;
 
 get_in_err:
@@ -972,49 +981,121 @@ send_now:
return ret;
 }
 
-static void netvsc_send_recv_completion(struct hv_device *device,
-   struct vmbus_channel *channel,
-   struct netvsc_device *net_device,
-   u64 transaction_id, u32 status)
+static int netvsc_send_recv_completion(struct vmbus_channel *channel,
+  u64 transaction_id, u32 status)
 {
struct nvsp_message recvcompMessage;
-   int retries = 0;
int ret;
-   struct net_device *ndev = hv_get_drvdata(device);
 
recvcompMessage.hdr.msg_type =
NVSP_MSG1_TYPE_SEND_RNDIS_PKT_COMPLETE;
 
recvcompMessage.msg.v1_msg.send_rndis_pkt_complete.status = status;
 
-retry_send_cmplt:
/* Send the completion */
	ret = vmbus_sendpacket(channel, &recvcompMessage,
-  sizeof(struct nvsp_message), transaction_id,
-  VM_PKT_COMP, 0);
-   if (ret == 0) {
-   /* success */
-   /* no-op */
-   } else if (ret == -EAGAIN) {
-   /* no more room...wait a bit and attempt to retry 3 times */
-   retries++;
-   netdev_err(ndev, "unable to send receive completion pkt"
-   " (tid %llx)...retrying %d\n", transaction_id, retries);
-
-   if (retries < 4) {
-   udelay(100);
-   goto retry_send_cmplt;
-   } else {
-   netdev_err(ndev, "unable to send receive "
-   "completion pkt (tid %llx)...give up 
retrying\n",
-   transaction_id);
-   }
-   } else {
-   netdev_err(ndev, "unable to send receive "
-   "completion pkt - %llx\n", transaction_id);
+  sizeof(struct nvsp_message_header) + 

Re: [RFC PATCH] openvswitch: use percpu flow stats

2016-08-19 Thread Eric Dumazet
On Fri, 2016-08-19 at 16:47 -0300, Thadeu Lima de Souza Cascardo wrote:
> Instead of using flow stats per NUMA node, use it per CPU. When using
> megaflows, the stats lock can be a bottleneck in scalability.

...

>  
>   flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow)
> -+ (nr_node_ids
> ++ (num_possible_cpus()
> * sizeof(struct flow_stats *)),
>  0, 0, NULL);
>   if (flow_cache == NULL)

This is bogus.

Please use nr_cpu_ids instead of num_possible_cpus().





[RFC PATCH] openvswitch: use percpu flow stats

2016-08-19 Thread Thadeu Lima de Souza Cascardo
Instead of using flow stats per NUMA node, use it per CPU. When using
megaflows, the stats lock can be a bottleneck in scalability.

On a E5-2690 12-core system, usual throughput went from ~4Mpps to ~15Mpps
when forwarding between two 40GbE ports with a single flow configured on
the datapath.
---
 net/openvswitch/flow.c   | 39 +--
 net/openvswitch/flow.h   |  4 ++--
 net/openvswitch/flow_table.c | 21 +++--
 3 files changed, 34 insertions(+), 30 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 0ea128e..90ae248 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -72,32 +73,33 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 
tcp_flags,
 {
struct flow_stats *stats;
int node = numa_node_id();
+   int cpu = get_cpu();
int len = skb->len + (skb_vlan_tag_present(skb) ? VLAN_HLEN : 0);
 
-   stats = rcu_dereference(flow->stats[node]);
+   stats = rcu_dereference(flow->stats[cpu]);
 
-   /* Check if already have node-specific stats. */
+   /* Check if already have CPU-specific stats. */
if (likely(stats)) {
		spin_lock(&stats->lock);
/* Mark if we write on the pre-allocated stats. */
-   if (node == 0 && unlikely(flow->stats_last_writer != node))
-   flow->stats_last_writer = node;
+   if (cpu == 0 && unlikely(flow->stats_last_writer != cpu))
+   flow->stats_last_writer = cpu;
} else {
stats = rcu_dereference(flow->stats[0]); /* Pre-allocated. */
		spin_lock(&stats->lock);
 
-   /* If the current NUMA-node is the only writer on the
+   /* If the current CPU is the only writer on the
 * pre-allocated stats keep using them.
 */
-   if (unlikely(flow->stats_last_writer != node)) {
+   if (unlikely(flow->stats_last_writer != cpu)) {
/* A previous locker may have already allocated the
-* stats, so we need to check again.  If node-specific
+* stats, so we need to check again.  If CPU-specific
 * stats were already allocated, we update the pre-
 * allocated stats as we have already locked them.
 */
-   if (likely(flow->stats_last_writer != NUMA_NO_NODE)
-   && likely(!rcu_access_pointer(flow->stats[node]))) {
-   /* Try to allocate node-specific stats. */
+   if (likely(flow->stats_last_writer != -1)
+   && likely(!rcu_access_pointer(flow->stats[cpu]))) {
+   /* Try to allocate CPU-specific stats. */
struct flow_stats *new_stats;
 
new_stats =
@@ -114,12 +116,12 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 
tcp_flags,
new_stats->tcp_flags = tcp_flags;
				spin_lock_init(&new_stats->lock);
 
-   rcu_assign_pointer(flow->stats[node],
+   rcu_assign_pointer(flow->stats[cpu],
   new_stats);
goto unlock;
}
}
-   flow->stats_last_writer = node;
+   flow->stats_last_writer = cpu;
}
}
 
@@ -129,6 +131,7 @@ void ovs_flow_stats_update(struct sw_flow *flow, __be16 
tcp_flags,
stats->tcp_flags |= tcp_flags;
 unlock:
	spin_unlock(&stats->lock);
+   put_cpu();
 }
 
 /* Must be called with rcu_read_lock or ovs_mutex. */
@@ -136,14 +139,14 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
struct ovs_flow_stats *ovs_stats,
unsigned long *used, __be16 *tcp_flags)
 {
-   int node;
+   int cpu;
 
*used = 0;
*tcp_flags = 0;
memset(ovs_stats, 0, sizeof(*ovs_stats));
 
-   for_each_node(node) {
-   struct flow_stats *stats = 
rcu_dereference_ovsl(flow->stats[node]);
+   for_each_possible_cpu(cpu) {
+   struct flow_stats *stats = 
rcu_dereference_ovsl(flow->stats[cpu]);
 
if (stats) {
/* Local CPU may write on non-local stats, so we must
@@ -163,10 +166,10 @@ void ovs_flow_stats_get(const struct sw_flow *flow,
 /* Called with ovs_mutex. */
 void ovs_flow_stats_clear(struct sw_flow *flow)
 {
-   int node;
+   int cpu;
 
-   for_each_node(node) {
-   struct flow_stats *stats = 

[Patch net-next] net_sched: properly handle failure case of tcf_exts_init()

2016-08-19 Thread Cong Wang
After commit 22dc13c837c3 ("net_sched: convert tcf_exts from list to pointer array")
we do dynamic allocation in tcf_exts_init(), therefore we need
to handle the ENOMEM case properly.

Cc: Jamal Hadi Salim 
Signed-off-by: Cong Wang 
---
 include/net/pkt_cls.h   |  6 ++--
 net/sched/cls_basic.c   | 12 +--
 net/sched/cls_bpf.c | 27 +--
 net/sched/cls_cgroup.c  | 13 +--
 net/sched/cls_flow.c| 26 --
 net/sched/cls_flower.c  | 11 --
 net/sched/cls_fw.c  | 18 +++---
 net/sched/cls_route.c   | 14 +---
 net/sched/cls_rsvp.h| 17 +++---
 net/sched/cls_tcindex.c | 90 +++--
 net/sched/cls_u32.c | 21 
 11 files changed, 181 insertions(+), 74 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index c99508d..a459be5 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -69,17 +69,19 @@ struct tcf_exts {
int police;
 };
 
-static inline void tcf_exts_init(struct tcf_exts *exts, int action, int police)
+static inline int tcf_exts_init(struct tcf_exts *exts, int action, int police)
 {
 #ifdef CONFIG_NET_CLS_ACT
exts->type = 0;
exts->nr_actions = 0;
exts->actions = kcalloc(TCA_ACT_MAX_PRIO, sizeof(struct tc_action *),
GFP_KERNEL);
-   WARN_ON(!exts->actions); /* TODO: propagate the error to callers */
+   if (!exts->actions)
+   return -ENOMEM;
 #endif
exts->action = action;
exts->police = police;
+   return 0;
 }
 
 /**
diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index 0b8c3ac..eb219b7 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -138,10 +138,12 @@ static int basic_set_parms(struct net *net, struct 
tcf_proto *tp,
struct tcf_exts e;
struct tcf_ematch_tree t;
 
-   tcf_exts_init(&e, TCA_BASIC_ACT, TCA_BASIC_POLICE);
-   err = tcf_exts_validate(net, tp, tb, est, &e, ovr);
+   err = tcf_exts_init(&e, TCA_BASIC_ACT, TCA_BASIC_POLICE);
if (err < 0)
return err;
+   err = tcf_exts_validate(net, tp, tb, est, &e, ovr);
+   if (err < 0)
+   goto errout;
 
err = tcf_em_tree_validate(tp, tb[TCA_BASIC_EMATCHES], &t);
if (err < 0)
@@ -189,7 +191,10 @@ static int basic_change(struct net *net, struct sk_buff 
*in_skb,
if (!fnew)
return -ENOBUFS;
 
-   tcf_exts_init(&fnew->exts, TCA_BASIC_ACT, TCA_BASIC_POLICE);
+   err = tcf_exts_init(&fnew->exts, TCA_BASIC_ACT, TCA_BASIC_POLICE);
+   if (err < 0)
+   goto errout;
+
err = -EINVAL;
if (handle) {
fnew->handle = handle;
@@ -226,6 +231,7 @@ static int basic_change(struct net *net, struct sk_buff 
*in_skb,
 
return 0;
 errout:
+   tcf_exts_destroy(&fnew->exts);
kfree(fnew);
return err;
 }
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index c3002c2..4742f41 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -311,17 +311,19 @@ static int cls_bpf_modify_existing(struct net *net, 
struct tcf_proto *tp,
if ((!is_bpf && !is_ebpf) || (is_bpf && is_ebpf))
return -EINVAL;
 
-   tcf_exts_init(&e, TCA_BPF_ACT, TCA_BPF_POLICE);
-   ret = tcf_exts_validate(net, tp, tb, est, &e, ovr);
+   ret = tcf_exts_init(&e, TCA_BPF_ACT, TCA_BPF_POLICE);
if (ret < 0)
return ret;
+   ret = tcf_exts_validate(net, tp, tb, est, &e, ovr);
+   if (ret < 0)
+   goto errout;
 
if (tb[TCA_BPF_FLAGS]) {
u32 bpf_flags = nla_get_u32(tb[TCA_BPF_FLAGS]);
 
if (bpf_flags & ~TCA_BPF_FLAG_ACT_DIRECT) {
-   tcf_exts_destroy(&e);
-   return -EINVAL;
+   ret = -EINVAL;
+   goto errout;
}
 
have_exts = bpf_flags & TCA_BPF_FLAG_ACT_DIRECT;
@@ -331,10 +333,8 @@ static int cls_bpf_modify_existing(struct net *net, struct 
tcf_proto *tp,
 
ret = is_bpf ? cls_bpf_prog_from_ops(tb, prog) :
   cls_bpf_prog_from_efd(tb, prog, tp);
-   if (ret < 0) {
-   tcf_exts_destroy(&e);
-   return ret;
-   }
+   if (ret < 0)
+   goto errout;
 
if (tb[TCA_BPF_CLASSID]) {
prog->res.classid = nla_get_u32(tb[TCA_BPF_CLASSID]);
@@ -343,6 +343,10 @@ static int cls_bpf_modify_existing(struct net *net, struct 
tcf_proto *tp,
 
	tcf_exts_change(tp, &prog->exts, &e);
return 0;
+
+errout:
+   tcf_exts_destroy(&e);
+   return ret;
 }
 
 static u32 cls_bpf_grab_new_handle(struct tcf_proto *tp,
@@ -388,7 +392,9 @@ static int cls_bpf_change(struct net *net, struct sk_buff 
*in_skb,
if (!prog)
return -ENOBUFS;
 
-   tcf_exts_init(&prog->exts, TCA_BPF_ACT, TCA_BPF_POLICE);
+   ret = tcf_exts_init(&prog->exts, 

Re: [PATCH 0/1] HW checksum offload problem on Xilinx Zynq GEM

2016-08-19 Thread David Miller
From: Helmut Buchsbaum 
Date: Fri, 19 Aug 2016 15:04:57 +0200

> When working on upgrading the v3.x kernels of our embedded devices
> to more recent 4.x kernels we noticed some of our proprietary networking
> stuff is broken. Further investigations brought up an issue with small
> UDP packets (data payload <= 2), which contained wrong UDP header
> checksums.
> We tracked this down to commit 85ff3d87bf2ef1fadcde8553628c60f79806fdb4
> net/macb: add TX checksum offload feature. It turns out that Zynq's GEM
> is obviously buggy regarding the UDP checksum calculation of such small
> UDP packets as long as the UDP checksum field is != 0 *BEFORE* the
> HW calculation. But since udp_send_skb() *ALWAYS* calculates the UDP header
> checksum (unless disabled via socket option), this is the usual case.
> Unfortunately it does not respect the net device feature setting which
> would leave UDP checksum untouched when checksum offloading is enabled.

Then simply clear the checksum field in the driver, or fix
udp_send_skb() to do what you claim it is supposed to.

That seems like a much more appropriate fix to me.


Re: [PATCH 1/1] l2tp: Fix the connect status check in pppol2tp_getname

2016-08-19 Thread Guillaume Nault
On Fri, Aug 19, 2016 at 01:36:23PM +0800, f...@ikuai8.com wrote:
> From: Gao Feng 
> 
> The sk->sk_state is bits flag, so need use bit operation check
> instead of value check.
> 

Tested-by: Guillaume Nault 


Re: [PATCH] samples/bpf: Add tunnel set/get tests.

2016-08-19 Thread William Tu
Thanks, I've submitted a newer version.

On Thu, Aug 18, 2016 at 9:10 PM, David Miller  wrote:
> From: William Tu 
> Date: Tue, 16 Aug 2016 07:03:01 -0700
>
>> The patch creates sample code exercising bpf_skb_{set,get}_tunnel_key,
>> and bpf_skb_{set,get}_tunnel_opt for GRE, VXLAN, and GENEVE.  A native
>> tunnel device is created in a namespace to interact with a lwtunnel
>> device out of the namespace, with metadata enabled.  The bpf_skb_set_*
>> program is attached to tc egress and bpf_skb_get_* is attached to egress
>> qdisc.  A ping between two tunnels is used to verify correctness and
>> the result of bpf_skb_get_* printed by bpf_trace_printk.
>>
>> Signed-off-by: William Tu 
>
> Please respin, this doesn't apply cleanly to net-next.


[PATCHv2,net-next] samples/bpf: Add tunnel set/get tests.

2016-08-19 Thread William Tu
The patch creates sample code exercising bpf_skb_{set,get}_tunnel_key,
and bpf_skb_{set,get}_tunnel_opt for GRE, VXLAN, and GENEVE.  A native
tunnel device is created in a namespace to interact with a lwtunnel
device out of the namespace, with metadata enabled.  The bpf_skb_set_*
program is attached to tc egress and bpf_skb_get_* is attached to egress
qdisc.  A ping between two tunnels is used to verify correctness and
the result of bpf_skb_get_* printed by bpf_trace_printk.

Signed-off-by: William Tu 
---
 samples/bpf/Makefile   |   1 +
 samples/bpf/bpf_helpers.h  |   8 ++
 samples/bpf/tcbpf2_kern.c  | 191 +
 samples/bpf/test_tunnel_bpf.sh | 127 +++
 4 files changed, 327 insertions(+)
 create mode 100644 samples/bpf/tcbpf2_kern.c
 create mode 100755 samples/bpf/test_tunnel_bpf.sh

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index eb582c6..db3cb06 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -67,6 +67,7 @@ always += tracex6_kern.o
 always += test_probe_write_user_kern.o
 always += trace_output_kern.o
 always += tcbpf1_kern.o
+always += tcbpf2_kern.o
 always += lathist_kern.o
 always += offwaketime_kern.o
 always += spintest_kern.o
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 6f1672a..bbdf62a 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -47,6 +47,14 @@ static int (*bpf_probe_write_user)(void *dst, void *src, int 
size) =
(void *) BPF_FUNC_probe_write_user;
 static int (*bpf_current_task_under_cgroup)(void *map, int index) =
(void *) BPF_FUNC_current_task_under_cgroup;
+static int (*bpf_skb_get_tunnel_key)(void *ctx, void *key, int size, int 
flags) =
+   (void *) BPF_FUNC_skb_get_tunnel_key;
+static int (*bpf_skb_set_tunnel_key)(void *ctx, void *key, int size, int 
flags) =
+   (void *) BPF_FUNC_skb_set_tunnel_key;
+static int (*bpf_skb_get_tunnel_opt)(void *ctx, void *md, int size) =
+   (void *) BPF_FUNC_skb_get_tunnel_opt;
+static int (*bpf_skb_set_tunnel_opt)(void *ctx, void *md, int size) =
+   (void *) BPF_FUNC_skb_set_tunnel_opt;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
new file mode 100644
index 000..7a15289
--- /dev/null
+++ b/samples/bpf/tcbpf2_kern.c
@@ -0,0 +1,191 @@
+/* Copyright (c) 2016 VMware
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+#define ERROR(ret) do {\
+   char fmt[] = "ERROR line:%d ret:%d\n";\
+   bpf_trace_printk(fmt, sizeof(fmt), __LINE__, ret); \
+   } while(0)
+
+struct geneve_opt {
+   __be16  opt_class;
+   u8  type;
+   u8  length:5;
+   u8  r3:1;
+   u8  r2:1;
+   u8  r1:1;
+   u8  opt_data[8]; /* hard-coded to 8 byte */
+};
+
+struct vxlan_metadata {
+   u32 gbp;
+};
+
+SEC("gre_set_tunnel")
+int _gre_set_tunnel(struct __sk_buff *skb)
+{
+   int ret;
+   struct bpf_tunnel_key key;
+
+   __builtin_memset(&key, 0x0, sizeof(key));
+   key.remote_ipv4 = 0xac100164; /* 172.16.1.100 */
+   key.tunnel_id = 2;
+   key.tunnel_tos = 0;
+   key.tunnel_ttl = 64;
+
+   ret = bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 
BPF_F_ZERO_CSUM_TX);
+   if (ret < 0) {
+   ERROR(ret);
+   return TC_ACT_SHOT;
+   }
+
+   return TC_ACT_OK;
+}
+
+SEC("gre_get_tunnel")
+int _gre_get_tunnel(struct __sk_buff *skb)
+{
+   int ret;
+   struct bpf_tunnel_key key;
+   char fmt[] = "key %d remote ip 0x%x\n";
+
+   ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
+   if (ret < 0) {
+   ERROR(ret);
+   return TC_ACT_SHOT;
+   }
+
+   bpf_trace_printk(fmt, sizeof(fmt), key.tunnel_id, key.remote_ipv4);
+   return TC_ACT_OK;
+}
+
+SEC("vxlan_set_tunnel")
+int _vxlan_set_tunnel(struct __sk_buff *skb)
+{
+   int ret;
+   struct bpf_tunnel_key key;
+   struct vxlan_metadata md;
+
+   __builtin_memset(&key, 0x0, sizeof(key));
+   key.remote_ipv4 = 0xac100164; /* 172.16.1.100 */
+   key.tunnel_id = 2;
+   key.tunnel_tos = 0;
+   key.tunnel_ttl = 64;
+
+   ret = bpf_skb_set_tunnel_key(skb, &key, sizeof(key), 
BPF_F_ZERO_CSUM_TX);
+   if (ret < 0) {
+   ERROR(ret);
+   return TC_ACT_SHOT;
+   }
+
+   md.gbp = 0x800FF; /* Set VXLAN Group Policy extension */
+   ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));
+   if (ret < 0) {
+   ERROR(ret);
+   return TC_ACT_SHOT;
+   }
+
+   return 

Re: [PATCH v1 1/1] l2tp: Use existing macros instead of literal number

2016-08-19 Thread Guillaume Nault
On Fri, Aug 19, 2016 at 05:55:26PM +0800, f...@ikuai8.com wrote:
> From: Gao Feng 
> 
> 1. Use PPP_ALLSTATIONS/PPP_UI instead of literal 0xff/0x03;
> 2. Use one static const global fixed_ppphdr instead of two same
> static variable ppph in two different functions;
> 3. Use SEND_SHUTDOWN instead of literal 2;
> 
> Signed-off-by: Gao Feng 
> ---
>  v1: Initial patch
> 
v1 again?

BTW, no need to number your patch for a single patch series.
But you have to tell which tree the series is for.

So instead of [PATCH v1 1/1], you should send [PATCH v2 net-next].


Re: [PATCH v2 2/2] ipv6: fixup RTF_* flags when restoring RTPROT_RA route from rtnetlink

2016-08-19 Thread Sergei Shtylyov

Hello.

On 08/19/2016 07:41 PM, Andrew Yourtchenko wrote:


Fix the flags for RA-derived routes that were saved
via "ip -6 route save" and subsequently restored via
"ip -6 route restore", allowing the incoming router advertisements
to update them, rather than complain about inability to do so.

Upon the restore of RA-derived saved routes, set the RTF_ADDRCONF
to indicate that the source of the route was originally
a router advertisement, and set the RTF_DEFAULT or RTF_ROUTEINFO
flag depending on prefix length. This can be considered a
sister change of f0396f60d7c165018c9b203fb9b89fb224835578, in


   It's enough to specify 12 digits but you also need to specify the commit 
summary enclosed in ("").



the other direction.

Signed-off-by: Andrew Yourtchenko 
---
Changes since v1 [1]:
 * fixed the indentation of the basic blocks to be always a full TAB
   as per David Miller's review

[1] v1: http://marc.info/?l=linux-netdev=147135599322285=2

 net/ipv6/route.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f5b987d..60d95cd 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2769,6 +2769,16 @@ static int rtm_to_fib6_config(struct sk_buff *skb, 
struct nlmsghdr *nlh,
cfg->fc_protocol = rtm->rtm_protocol;
cfg->fc_type = rtm->rtm_type;

+   if (rtm->rtm_protocol == RTPROT_RA) {
+   /* RA-derived route: set flags accordingly. */
+   cfg->fc_flags |= RTF_ADDRCONF;
+   if (rtm->rtm_dst_len == 0) {
+   cfg->fc_flags |= RTF_DEFAULT;
+   } else {
+   cfg->fc_flags |= RTF_ROUTEINFO;
+   }


   {} not needed here and above.

[...]

MBR, Sergei



Re: [PATCH net 2/2] sctp: not copying duplicate addrs to the assoc's bind address list

2016-08-19 Thread Neil Horman
On Fri, Aug 19, 2016 at 07:30:22PM +0800, Xin Long wrote:
> From: Xin Long 
> 
> sctp.local_addr_list is a global address list that is supposed to include
> all the local addresses. sctp updates this list according to NETDEV_UP/
> NETDEV_DOWN notifications.
> 
> However, if multiple NICs have the same address, the global list will
> have duplicate addresses. Even if for one NIC, promote secondaries in
> __inet_del_ifa can also lead to accumulating duplicate addresses.
> 
> When sctp binds address 'ANY' and creates a connection, it copies all
> the addresses from global list into asoc's bind addr list, which makes
> sctp pack the duplicate addresses into INIT/INIT_ACK packets.
> 
> This patch is to filter the duplicate addresses when copying the addrs
> from global list in sctp_copy_local_addr_list and unpacking addr_param
> from cookie in sctp_raw_to_bind_addrs to asoc's bind addr list.
> 
> Signed-off-by: Xin Long 
> ---
>  net/sctp/bind_addr.c | 3 +++
>  net/sctp/protocol.c  | 3 +++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
> index 401c607..1ebc184 100644
> --- a/net/sctp/bind_addr.c
> +++ b/net/sctp/bind_addr.c
> @@ -292,6 +292,8 @@ int sctp_raw_to_bind_addrs(struct sctp_bind_addr *bp, 
> __u8 *raw_addr_list,
>   }
>  
>   af->from_addr_param(&addr, rawaddr, htons(port), 0);
> + if (sctp_bind_addr_state(bp, &addr) != -1)
> + goto next;
>   retval = sctp_add_bind_addr(bp, &addr, sizeof(addr),
>   SCTP_ADDR_SRC, gfp);
>   if (retval) {
> @@ -300,6 +302,7 @@ int sctp_raw_to_bind_addrs(struct sctp_bind_addr *bp, 
> __u8 *raw_addr_list,
>   break;
>   }
>  
> +next:
>   len = ntohs(param->length);
>   addrs_len -= len;
>   raw_addr_list += len;
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index da5d82b..616a942 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -220,6 +220,9 @@ int sctp_copy_local_addr_list(struct net *net, struct 
> sctp_bind_addr *bp,
>!(copy_flags & SCTP_ADDR6_PEERSUPP)))
>   continue;
>  
> + if (sctp_bind_addr_state(bp, &addr->a) != -1)
> + continue;
> +
>   error = sctp_add_bind_addr(bp, &addr->a, sizeof(addr->a),
>  SCTP_ADDR_SRC, GFP_ATOMIC);
>   if (error)
> -- 
> 2.1.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
Ah, I see what you're doing.  Ok, this makes some sense, at least on the receive
side, when you get a cookie unpacked and modify the remote peer's address list,
it makes sense to check for duplicates.  On the local side, however, instead
of checking it when the list gets copied, I'd check it when the master
list gets updated (in the NETDEV_UP event notifier for the local address list,
and the sctp_add_bind_addr function for the endpoint address list).  That way
you can keep that nested for loop out of the send path on the local system.

Neil



[PATCH net] vmxnet3: fix tx data ring copy for variable size

2016-08-19 Thread Shrikrishna Khare
Commit 3c8b3efc061a ("vmxnet3: allow variable length transmit data ring
buffer") changed the size of the buffers in the tx data ring from a
fixed size of 128 bytes to a variable size.

However, while copying data to the data ring, vmxnet3_copy_hdr continues
to carry the old code that assumes a fixed buffer size of 128 bytes. This
patch fixes it by adding the correct offset based on the actual data ring
buffer size.

Signed-off-by: Guolin Yang 
Signed-off-by: Shrikrishna Khare 
---
 drivers/net/vmxnet3/vmxnet3_drv.c | 4 +++-
 drivers/net/vmxnet3/vmxnet3_int.h | 4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c 
b/drivers/net/vmxnet3/vmxnet3_drv.c
index c68fe49..4244b9d 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -914,7 +914,9 @@ vmxnet3_copy_hdr(struct sk_buff *skb, struct 
vmxnet3_tx_queue *tq,
 {
struct Vmxnet3_TxDataDesc *tdd;
 
-   tdd = tq->data_ring.base + tq->tx_ring.next2fill;
+   tdd = (struct Vmxnet3_TxDataDesc *)((u8 *)tq->data_ring.base +
+   tq->tx_ring.next2fill *
+   tq->txdata_desc_size);
 
memcpy(tdd->data, skb->data, ctx->copy_size);
netdev_dbg(adapter->netdev,
diff --git a/drivers/net/vmxnet3/vmxnet3_int.h 
b/drivers/net/vmxnet3/vmxnet3_int.h
index 74fc030..7dc37a0 100644
--- a/drivers/net/vmxnet3/vmxnet3_int.h
+++ b/drivers/net/vmxnet3/vmxnet3_int.h
@@ -69,10 +69,10 @@
 /*
  * Version numbers
  */
-#define VMXNET3_DRIVER_VERSION_STRING   "1.4.9.0-k"
+#define VMXNET3_DRIVER_VERSION_STRING   "1.4.a.0-k"
 
 /* a 32-bit int, each byte encode a verion number in VMXNET3_DRIVER_VERSION */
-#define VMXNET3_DRIVER_VERSION_NUM  0x01040900
+#define VMXNET3_DRIVER_VERSION_NUM  0x01040a00
 
 #if defined(CONFIG_PCI_MSI)
/* RSS only makes sense if MSI-X is supported. */
-- 
2.8.2



Re: [PATCH] gianfar: prevent fragmentation in DSA environments

2016-08-19 Thread Andrew Lunn
> I don't fully understand why GFAR_RXB_SIZE has to be lower than
> that, preventing the HW from using all available memory - the freescale
> guys most probably do.

That will be because the cache invalidates in the DMA API are
expensive. There is no point invalidating the whole 2K if you never
make use of the top part. Just invalidate what is actually used.  I
optimised the Intel i210 driver recently along these lines, and my
receive performance went up 6%.

 Andrew


Re: [PATCH] gianfar: prevent fragmentation in DSA environments

2016-08-19 Thread Andrew Lunn
> >> -#define GFAR_RXB_SIZE 1536
> >> +/* prevent fragmentation by HW in DSA environments */
> >> +#define GFAR_RXB_SIZE (1536 + RXBUF_ALIGNMENT)
> >
> >Hi Zefir
> >
> >Using RXBUF_ALIGNMENT suggests this has something to do with
> >alignment, not extra headers.
> >
> >How about
> >
> >/* Prevent fragmentation by HW when using extra headers like DSA */
> >#define GFAR_RXB_SIZE (1536 + 8)
> >
> 
> MRBL (Maximum receive buffer length) must be a multiple of 64.
> Please consult the hardware documentation at least before proposing
> changes to the driver.
 
Hi Claudiu

Thanks for pointing this out. This makes RXBUF_ALIGNMENT even worse
for understandability, since you say multiples of 64 are important, not
alignment.

How about

/* Prevent fragmentation by HW when using extra headers like DSA,
while respecting the multiple of 64 requirement. */

#define GFAR_RXB_SIZE roundup(1536 + 8, 64)

Andrew


[PATCH net-next 2/3] net: mpls: Fixups for GSO

2016-08-19 Thread David Ahern
As reported by Lennert the MPLS GSO code is failing to properly segment
large packets. There are a couple of problems:

1. the inner protocol is not set so the gso segment functions for inner
   protocol layers are not getting run, and

2. MPLS labels for packets that use the "native" (non-OVS) MPLS code
   are not properly accounted for in mpls_gso_segment.

The MPLS GSO code was added for OVS. It is re-using skb_mac_gso_segment
to call the gso segment functions for the higher layer protocols. That
means skb_mac_gso_segment is called twice -- once with the network
protocol set to MPLS and again with the network protocol set to the
inner protocol.

This patch sets the inner skb protocol, addressing item 1 above, and sets
the network_header and inner_network_header to mark where the MPLS labels
start and end. The MPLS code in OVS is also updated to set the two
network markers.

From there the MPLS GSO code uses the difference between the network
header and the inner network header to know the size of the MPLS header
that was pushed. It then pulls the MPLS header, resets the mac_len and
protocol for the inner protocol and then calls skb_mac_gso_segment
to segment the skb.

After the inner protocol segmentation is done, the skb protocol
is set to MPLS for each segment and the network and mac headers
are restored.

Reported-by: Lennert Buytenhek 
Signed-off-by: David Ahern 
---
 net/mpls/mpls_gso.c   | 38 +++---
 net/mpls/mpls_iptunnel.c  |  4 
 net/openvswitch/actions.c |  6 ++
 3 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c
index 2055e57ed1c3..2aa4beaa0e4f 100644
--- a/net/mpls/mpls_gso.c
+++ b/net/mpls/mpls_gso.c
@@ -23,32 +23,48 @@ static struct sk_buff *mpls_gso_segment(struct sk_buff *skb,
   netdev_features_t features)
 {
struct sk_buff *segs = ERR_PTR(-EINVAL);
+   u16 mac_offset = skb->mac_header;
netdev_features_t mpls_features;
+   u16 mac_len = skb->mac_len;
__be16 mpls_protocol;
+   int mpls_hlen;
+
+   skb_reset_network_header(skb);
+   mpls_hlen = skb_inner_network_header(skb) - skb_network_header(skb);
+   if (unlikely(!pskb_may_pull(skb, mpls_hlen)))
+   goto out;
 
/* Setup inner SKB. */
mpls_protocol = skb->protocol;
skb->protocol = skb->inner_protocol;
 
-   /* Push back the mac header that skb_mac_gso_segment() has pulled.
-* It will be re-pulled by the call to skb_mac_gso_segment() below
-*/
-   __skb_push(skb, skb->mac_len);
+   __skb_pull(skb, mpls_hlen);
+
+   skb->mac_len = 0;
+   skb_reset_mac_header(skb);
+   skb_set_network_header(skb, skb_inner_network_offset(skb));
 
/* Segment inner packet. */
mpls_features = skb->dev->mpls_features & features;
segs = skb_mac_gso_segment(skb, mpls_features);
+   if (IS_ERR_OR_NULL(segs)) {
+   skb_gso_error_unwind(skb, mpls_protocol, mpls_hlen, mac_offset,
+mac_len);
+   goto out;
+   }
 
+   skb = segs;
+   do {
+   skb->mac_len = mac_len;
+   skb->protocol = mpls_protocol;
 
-   /* Restore outer protocol. */
-   skb->protocol = mpls_protocol;
+   __skb_push(skb, mpls_hlen + mac_len);
 
-   /* Re-pull the mac header that the call to skb_mac_gso_segment()
-* above pulled.  It will be re-pushed after returning
-* skb_mac_gso_segment(), an indirect caller of this function.
-*/
-   __skb_pull(skb, skb->data - skb_mac_header(skb));
+   skb_reset_mac_header(skb);
+   skb_set_network_header(skb, mac_len);
+   } while ((skb = skb->next));
 
+out:
return segs;
 }
 
diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
index aed872cc05a6..cf52cf30ac4b 100644
--- a/net/mpls/mpls_iptunnel.c
+++ b/net/mpls/mpls_iptunnel.c
@@ -90,7 +90,11 @@ static int mpls_xmit(struct sk_buff *skb)
if (skb_cow(skb, hh_len + new_header_size))
goto drop;
 
+   skb_set_inner_protocol(skb, skb->protocol);
+   skb_reset_inner_network_header(skb);
+
skb_push(skb, new_header_size);
+
skb_reset_network_header(skb);
 
skb->dev = out_dev;
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 1ecbd7715f6d..6d78f162a88b 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -167,6 +167,12 @@ static int push_mpls(struct sk_buff *skb, struct 
sw_flow_key *key,
skb->mac_len);
skb_reset_mac_header(skb);
 
+   /* for GSO: set MPLS as network header and encapsulated protocol
+* header as inner network header
+*/
+   skb_set_network_header(skb, skb->mac_len);
+   skb_set_inner_network_header(skb, skb->mac_len + MPLS_HLEN);

[PATCH net-next 1/3] net: lwtunnel: Handle fragmentation

2016-08-19 Thread David Ahern
From: Roopa Prabhu 

Today mpls iptunnel lwtunnel_output redirect expects the tunnel
output function to handle fragmentation. This is ok but can be
avoided if we did not do the mpls output redirect too early,
i.e. we could wait until ip fragmentation is done and then call
mpls output for each ip fragment.

To make this work we will need:
1) the lwtunnel state to carry encap headroom
2) and do the redirect to the encap output handler on the ip fragment
(essentially do the output redirect after fragmentation)

This patch adds tunnel headroom in lwtstate to make sure we
account for tunnel data in mtu calculations during fragmentation
and adds new xmit redirect handler to redirect to lwtunnel xmit func
after ip fragmentation.

This includes IPV6 and some mtu fixes and testing from David Ahern.

Signed-off-by: Roopa Prabhu 
Signed-off-by: David Ahern 
---
 include/net/lwtunnel.h   | 44 
 net/core/lwtunnel.c  | 35 +++
 net/ipv4/ip_output.c |  8 
 net/ipv4/route.c |  4 +++-
 net/ipv6/ip6_output.c|  8 
 net/ipv6/route.c |  4 +++-
 net/mpls/mpls_iptunnel.c |  9 +
 7 files changed, 106 insertions(+), 6 deletions(-)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index e9f116e29c22..ea3f80f58fd6 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -13,6 +13,13 @@
 /* lw tunnel state flags */
 #define LWTUNNEL_STATE_OUTPUT_REDIRECT BIT(0)
 #define LWTUNNEL_STATE_INPUT_REDIRECT  BIT(1)
+#define LWTUNNEL_STATE_XMIT_REDIRECT   BIT(2)
+
+enum {
+   LWTUNNEL_XMIT_DONE,
+   LWTUNNEL_XMIT_CONTINUE,
+};
+
 
 struct lwtunnel_state {
__u16   type;
@@ -21,6 +28,7 @@ struct lwtunnel_state {
int (*orig_output)(struct net *net, struct sock *sk, struct 
sk_buff *skb);
int (*orig_input)(struct sk_buff *);
int len;
+   __u16   headroom;
__u8data[0];
 };
 
@@ -34,6 +42,7 @@ struct lwtunnel_encap_ops {
  struct lwtunnel_state *lwtstate);
int (*get_encap_size)(struct lwtunnel_state *lwtstate);
int (*cmp_encap)(struct lwtunnel_state *a, struct lwtunnel_state *b);
+   int (*xmit)(struct sk_buff *skb);
 };
 
 #ifdef CONFIG_LWTUNNEL
@@ -75,6 +84,24 @@ static inline bool lwtunnel_input_redirect(struct 
lwtunnel_state *lwtstate)
 
return false;
 }
+
+static inline bool lwtunnel_xmit_redirect(struct lwtunnel_state *lwtstate)
+{
+   if (lwtstate && (lwtstate->flags & LWTUNNEL_STATE_XMIT_REDIRECT))
+   return true;
+
+   return false;
+}
+
+static inline unsigned int lwtunnel_headroom(struct lwtunnel_state *lwtstate,
+unsigned int mtu)
+{
+   if (lwtunnel_xmit_redirect(lwtstate) && lwtstate->headroom < mtu)
+   return lwtstate->headroom;
+
+   return 0;
+}
+
 int lwtunnel_encap_add_ops(const struct lwtunnel_encap_ops *op,
   unsigned int num);
 int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops *op,
@@ -90,6 +117,7 @@ struct lwtunnel_state *lwtunnel_state_alloc(int hdr_len);
 int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b);
 int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb);
 int lwtunnel_input(struct sk_buff *skb);
+int lwtunnel_xmit(struct sk_buff *skb);
 
 #else
 
@@ -117,6 +145,17 @@ static inline bool lwtunnel_input_redirect(struct 
lwtunnel_state *lwtstate)
return false;
 }
 
+static inline bool lwtunnel_xmit_redirect(struct lwtunnel_state *lwtstate)
+{
+   return false;
+}
+
+static inline unsigned int lwtunnel_headroom(struct lwtunnel_state *lwtstate,
+unsigned int mtu)
+{
+   return 0;
+}
+
 static inline int lwtunnel_encap_add_ops(const struct lwtunnel_encap_ops *op,
 unsigned int num)
 {
@@ -170,6 +209,11 @@ static inline int lwtunnel_input(struct sk_buff *skb)
return -EOPNOTSUPP;
 }
 
+static inline int lwtunnel_xmit(struct sk_buff *skb)
+{
+   return -EOPNOTSUPP;
+}
+
 #endif /* CONFIG_LWTUNNEL */
 
 #define MODULE_ALIAS_RTNL_LWT(encap_type) MODULE_ALIAS("rtnl-lwt-" 
__stringify(encap_type))
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 669ecc9f884e..e5f84c26ba1a 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -251,6 +251,41 @@ int lwtunnel_output(struct net *net, struct sock *sk, 
struct sk_buff *skb)
 }
 EXPORT_SYMBOL(lwtunnel_output);
 
+int lwtunnel_xmit(struct sk_buff *skb)
+{
+   struct dst_entry *dst = skb_dst(skb);
+   const struct lwtunnel_encap_ops *ops;
+   struct lwtunnel_state *lwtstate;
+   int ret = -EINVAL;
+
+   if (!dst)
+   goto drop;
+
+   lwtstate = dst->lwtstate;
+
+   if 

[PATCH net-next 3/3] net: veth: Set features for MPLS

2016-08-19 Thread David Ahern
veth does not really transmit packets; it only moves the skb from one
netdev to another, so gso and checksum are not really needed. Add
the features to mpls_features to get the same benefit and performance
with MPLS as without it.

Reported-by: Lennert Buytenhek 
Signed-off-by: David Ahern 
---
 drivers/net/veth.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f37a6e61d4ad..5db320a4d5cf 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -340,6 +340,7 @@ static void veth_setup(struct net_device *dev)
 
dev->hw_features = VETH_FEATURES;
dev->hw_enc_features = VETH_FEATURES;
+   dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
 }
 
 /*
-- 
2.1.4



[PATCH v3 net-next 0/3] net: mpls: fragmentation and gso fixes for locally originated traffic

2016-08-19 Thread David Ahern
This series fixes mtu and fragmentation for tunnels using lwtunnel
output redirect, and fixes GSO for MPLS for locally originated traffic
reported by Lennert Buytenhek.

A follow on series will address fragmentation and GSO for forwarded
MPLS traffic. Hardware offload of GSO with MPLS also needs to be
addressed.

v3
- updates to mpls_gso_segment per Alex's comments
- dropped skb->encapsulation = 1 from mpls_xmit per Alex's comment

v2
- consistent use of network_header in skb to fix GSO for MPLS
- update MPLS code in OVS to network_header and inner_network_header


David Ahern (2):
  net: mpls: Fixups for GSO
  net: veth: Set features for MPLS

Roopa Prabhu (1):
  net: lwtunnel: Handle fragmentation

 drivers/net/veth.c|  1 +
 include/net/lwtunnel.h| 44 
 net/core/lwtunnel.c   | 35 +++
 net/ipv4/ip_output.c  |  8 
 net/ipv4/route.c  |  4 +++-
 net/ipv6/ip6_output.c |  8 
 net/ipv6/route.c  |  4 +++-
 net/mpls/mpls_gso.c   | 38 +++---
 net/mpls/mpls_iptunnel.c  | 13 +
 net/openvswitch/actions.c |  6 ++
 10 files changed, 144 insertions(+), 17 deletions(-)

-- 
2.1.4



Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-19 Thread Thomas Graf
On 08/19/16 at 06:21pm, Pablo Neira Ayuso wrote:
> On Fri, Aug 19, 2016 at 12:35:14PM +0200, Daniel Mack wrote:
> > Also true. A cgroup can currently only hold one bpf program for each
> > direction, and they are supposed to be set from one controlling instance
> > in the system. However, it is possible to create subcgroups, and install
> > own programs in them, which will then be effective instead of the one in
> > the parent. They will, however, replace each other in runtime behavior,
> > and not be stacked. This is a fundamentally different approach than how
> > nf_tables works of course.
> 
> I see, this looks problematic indeed, thanks for confirming this.

What exactly is problematic? I think the proposed mechanism is very
clean in allowing sub groups to provide the entire program. This
allows for delegation. Different orchestrators can manage different
cgroups. It's different as Daniel stated. I don't see how this is
problematic though.

You brought up multiple tables which reflect the cumulative approach.
This sometimes works but has its issues as well. Users must be aware
of each other and anticipate what rules other users might inject
before or after their own tables. The very existence of firewalld which
aims at democratizing this collaboration proves this point.

So in that sense I would very much like for both models to be made
available to users. nftables+cgroups for a cumulative approach as
well as BPF+cgroups for the delegation approach.  I don't see why the
cgroups based filtering capability should not be made available to both.



RE: [PATCH] gianfar: fix size of fragmented frames

2016-08-19 Thread Claudiu Manoil
>-Original Message-
>From: Zefir Kurtisi [mailto:zefir.kurt...@neratec.com]
>Sent: Friday, August 19, 2016 12:16 PM
>To: netdev@vger.kernel.org
>Cc: claudiu.man...@freescale.com
>Subject: [PATCH] gianfar: fix size of fragmented frames
>
>The eTSEC RxBD 'Data Length' field is context depening:
>for the last fragment it contains the full frame size,
>while fragments contain the fragment size, which equals
>the value written to register MRBLR.
>

According to RM the last fragment has the whole packet length indeed,
and this should apply to fragmented frames too:

" Data length, written by the eTSEC.
Data length is the number of octets written by the eTSEC into this BD's data 
buffer if L is cleared
(the value is equal to MRBLR), or, if L is set, the length of the frame 
including CRC, FCB (if
RCTRL[PRSDEP > 00), preamble (if MACCFG2[PreAmRxEn]=1), time stamp (if 
RCTRL[TS] = 1)
and any padding (RCTRL[PAL]). "

>This differentiation is missing in the gianfar driver,
>which causes data corruption as soon as the hardware
>starts to fragment receiving frames. As a result, the
>size of fragmented frames is increased by
>(nr_frags - 1) * MRBLR
>
>We first noticed this issue working with DSA, where a
>1540 octet frame is fragmented by the hardware and
>reconstructed by the driver as 3076 octet frame.
>
>This patch fixes the problem by adjusting the size of
>the last fragment.
>
>Signed-off-by: Zefir Kurtisi 
>---
> drivers/net/ethernet/freescale/gianfar.c | 19 +--
> 1 file changed, 13 insertions(+), 6 deletions(-)
>
>diff --git a/drivers/net/ethernet/freescale/gianfar.c
>b/drivers/net/ethernet/freescale/gianfar.c
>index d20935d..4187280 100644
>--- a/drivers/net/ethernet/freescale/gianfar.c
>+++ b/drivers/net/ethernet/freescale/gianfar.c
>@@ -2922,17 +2922,24 @@ static bool gfar_add_rx_frag(struct gfar_rx_buff *rxb,
>u32 lstatus,
> {
>   unsigned int size = lstatus & BD_LENGTH_MASK;
>   struct page *page = rxb->page;
>+  bool last = !!(lstatus & BD_LFLAG(RXBD_LAST));
>
>   /* Remove the FCS from the packet length */
>-  if (likely(lstatus & BD_LFLAG(RXBD_LAST)))
>+  if (last)
>   size -= ETH_FCS_LEN;
>
>-  if (likely(first))
>+  if (likely(first)) {
>   skb_put(skb, size);
>-  else
>-  skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
>-  rxb->page_offset + RXBUF_ALIGNMENT,
>-  size, GFAR_RXB_TRUESIZE);
>+  } else {
>+  /* the last fragment's length contains the full frame length */
>+  if (last)
>+  size -= skb->len;

While I agree with your finding, I don't think this works for packets having 
more
than 2 buffers (head + 1 fragment).
How's this supposed to work for a skb with 2 fragments, for instance?
I think this needs more thoughtful consideration.

>+
>+  if (size > 0 && size <= GFAR_RXB_SIZE)

Why do you need this check?
The h/w ensures that the buffers won't exceed GFAR_RXB_SIZE
(which is MRBL), size 0 is also not possible.

>+  skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
>+  rxb->page_offset + RXBUF_ALIGNMENT,
>+  size, GFAR_RXB_TRUESIZE);
>+  }
>
>   /* try reuse page */
>   if (unlikely(page_count(page) != 1))
>--
>2.7.4



Re: [PATCH] gianfar: prevent fragmentation in DSA environments

2016-08-19 Thread Zefir Kurtisi
On 08/19/2016 04:59 PM, Andrew Lunn wrote:
> On Fri, Aug 19, 2016 at 11:16:14AM +0200, Zefir Kurtisi wrote:
>> The eTSEC register MRBLR defines the maximum space in
>> the RX buffers and is set to 1536 by gianfar. This
>> reasonably covers the common use case where the MTU
>> is kept at default 1500.
>>
>> Alas, if the eTSEC is attached to a DSA enabled switch,
>> the DSA header extension causes every maximum sized
>> frame to be fragmented by the hardware (1536 + 4).
>>
>> This patch increases the maximum RX buffer size by
>> RBUF_ALIGNMENT (64) octets. Since the driver uses a
>> half-page memory schema, this change does not
>> increase allocated memory but allows the hardware to
>> use 1600 bytes of the totally available 2048.
>>
>> Signed-off-by: Zefir Kurtisi 
>> ---
>>  drivers/net/ethernet/freescale/gianfar.h | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/freescale/gianfar.h 
>> b/drivers/net/ethernet/freescale/gianfar.h
>> index 373fd09..02b794b 100644
>> --- a/drivers/net/ethernet/freescale/gianfar.h
>> +++ b/drivers/net/ethernet/freescale/gianfar.h
>> @@ -100,7 +100,8 @@ extern const char gfar_driver_version[];
>>  #define DEFAULT_RX_LFC_THR  16
>>  #define DEFAULT_LFC_PTVVAL  4
>>  
>> -#define GFAR_RXB_SIZE 1536
>> +/* prevent fragmenation by HW in DSA environments */
>> +#define GFAR_RXB_SIZE (1536 + RXBUF_ALIGNMENT)
> 
> Hi Zefir
> 
> Using RXBUF_ALIGNMENT suggests this has something to do with
> alignment, not extra headers.
> 
> How about
> 
> /* Prevent fragmenation by HW when using extra headers like DSA */
> #define GFAR_RXB_SIZE (1536 + 8)
> 
>   Andrew
> 
Hi Andrew,

the MRBLR has to be written with multiples of 64 (lower 6 bits reserved as 0),
therefore (1536 + 64) is the next valid value to chose.

As noted in the commit message, this is an artificial limitation, since the RX
buffers are essentially 2048 bytes (GFAR_RXB_TRUESIZE) long. I don't fully
understand why GFAR_RXB_SIZE has to be lower than that, preventing
the HW from using all available memory - the freescale guys most probably do.


Cheers,
Zefir



[PATCH v2 1/2] ipv6: save route expiry in RTA_EXPIRES if RTF_EXPIRES set

2016-08-19 Thread Andrew Yourtchenko
This allows "ip -6 route save" to save the expiry for the routes
that have it, such that it can be correctly restored later by
"ip -6 route restore".

If a route has RTF_EXPIRES set, generate RTA_EXPIRES value which
will be used to restore the flag and expiry value by already
existing code in rtm_to_fib6_config.

The expiry was already being saved as part of RTA_CACHEINFO
in rtnl_put_cacheinfo(), but adding code to generate RTF_EXPIRES upon save
looked more appropriate than redundant cherrypicking from
RTA_CACHEINFO upon restore.

Signed-off-by: Andrew Yourtchenko 
---
Changes since v1 [1]: 
 * fixed the indentation in a multiline function call
   as per David Miller's review

[1] v1: http://marc.info/?l=linux-netdev&m=147135597422280&w=2

 net/ipv6/route.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 4981755..f5b987d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3244,6 +3244,14 @@ static int rt6_fill_node(struct net *net,
if (rtnl_put_cacheinfo(skb, >dst, 0, expires, rt->dst.error) < 0)
goto nla_put_failure;
 
+   /* Can't rely on expires == 0. It is zero if no expires flag,
+* or if the timing is precisely at expiry. So, recheck the flag.
+*/
+   if (rt->rt6i_flags & RTF_EXPIRES)
+   if (nla_put_u32(skb, RTA_EXPIRES,
+   expires > 0 ? expires / HZ : 0))
+   goto nla_put_failure;
+
if (nla_put_u8(skb, RTA_PREF, IPV6_EXTRACT_PREF(rt->rt6i_flags)))
goto nla_put_failure;
 
-- 
2.7.4



[PATCH v2 2/2] ipv6: fixup RTF_* flags when restoring RTPROT_RA route from rtnetlink

2016-08-19 Thread Andrew Yourtchenko
Fix the flags for RA-derived routes that were saved
via "ip -6 route save" and subsequently restored via
"ip -6 route restore", allowing the incoming router advertisements
to update them, rather than complaining about inability to do so.

Upon the restore of RA-derived saved routes, set the RTF_ADDRCONF
to indicate that the source of the route was originally
a router advertisement, and set the RTF_DEFAULT or RTF_ROUTEINFO
flag depending on prefix length. This can be considered a
sister change of f0396f60d7c165018c9b203fb9b89fb224835578, in
the other direction.

Signed-off-by: Andrew Yourtchenko 
---
Changes since v1 [1]:
 * fixed the indentation of the basic blocks to be always a full TAB
   as per David Miller's review

[1] v1: http://marc.info/?l=linux-netdev&m=147135599322285&w=2

 net/ipv6/route.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f5b987d..60d95cd 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2769,6 +2769,16 @@ static int rtm_to_fib6_config(struct sk_buff *skb, 
struct nlmsghdr *nlh,
cfg->fc_protocol = rtm->rtm_protocol;
cfg->fc_type = rtm->rtm_type;
 
+   if (rtm->rtm_protocol == RTPROT_RA) {
+   /* RA-derived route: set flags accordingly. */
+   cfg->fc_flags |= RTF_ADDRCONF;
+   if (rtm->rtm_dst_len == 0) {
+   cfg->fc_flags |= RTF_DEFAULT;
+   } else {
+   cfg->fc_flags |= RTF_ROUTEINFO;
+   }
+   }
+
if (rtm->rtm_type == RTN_UNREACHABLE ||
rtm->rtm_type == RTN_BLACKHOLE ||
rtm->rtm_type == RTN_PROHIBIT ||
-- 
2.7.4



[PATCH v2 0/2] ipv6: fix stuck RA-derived route in container after migration with criu

2016-08-19 Thread Andrew Yourtchenko
This patchset fixes the connectivity problem for containers
with RA-derived default route, after they were migrated using criu:
the default routes would lose their "expires" value and become
stuck forever. The corresponding criu issue with the discussion
is at https://github.com/xemul/criu/issues/177

The latter uses "ip -6 route save" to save the routes during
the migration, and "ip -6 route restore" during restore, so the problem
is easily reproducible even without criu.

There are two problems, hence two patches in this patchset:

1) the expiry time for the route is saved in
RTA_CACHEINFO via rtnl_put_cacheinfo, but the code in rtm_to_fib6_config
expects the RTA_EXPIRES. Rather than cherrypicking in the restore code path
from RTA_CACHEINFO, adding RTA_EXPIRES upon save seemed like
a better option.

2) the restored route, even with the properly restored expires,
does not have the correct RTF_* flags set (RTF_ADDRCONF|RTF_DEFAULT),
preventing the incoming router advertisements from updating it.
During the code review I noticed RTF_ROUTEINFO would also be lost
during save/restore. This can be viewed as an operation symmetric
to that done in f0396f60d7c165018c9b203fb9b89fb224835578.

Tested both net and net-next with the patches.

Changes since v1 [1]:
 * Fixed the indentation in both patches as per David Miller's review comments:
- basic block indented always by a TAB
- multiline function call second line to start at the first column
  after the opening parenthesis of the function call, using
  appropriate number of TABs (and SPC only at the end, if needed).

[1] v1: http://marc.info/?l=linux-netdev&m=147135599322286&w=2

Andrew Yourtchenko (2):
  ipv6: save route expiry in RTA_EXPIRES if RTF_EXPIRES set
  ipv6: fixup RTF_* flags when restoring RTPROT_RA route from rtnetlink

 net/ipv6/route.c | 18 ++
 1 file changed, 18 insertions(+)

-- 
2.7.4



Re: [v2 PATCH 3/3] netlink: Use rhashtable walk interface in diag dump

2016-08-19 Thread Herbert Xu
On Fri, Aug 19, 2016 at 06:32:37PM +0200, Thomas Graf wrote:
> On 08/19/16 at 04:21pm, Herbert Xu wrote:
> > This patch converts the diag dumping code to use the rhashtable
> > walk code instead of going through rhashtable by hand.  The lock
> > nl_table_lock is now only taken while we process the multicast
> > list as it's not needed for the rhashtable walk.
> > 
> > Signed-off-by: Herbert Xu 
> 
> LGTM. Curious, what did you use to test this? I've been struggling
> to find a good test case.

ss -A netlink

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 1/3] rhashtable: Remove GFP flag from rhashtable_walk_init

2016-08-19 Thread Herbert Xu
On Fri, Aug 19, 2016 at 06:25:04PM +0200, Thomas Graf wrote:
> On 08/18/16 at 04:50pm, Herbert Xu wrote:
> > +/* Obsolete function, do not use in new code. */
> > +static inline int rhashtable_walk_init(struct rhashtable *ht,
> > +  struct rhashtable_iter *iter, gfp_t gfp)
> > +{
> > +   rhashtable_walk_enter(ht, iter);
> > +   return 0;
> > +}
> 
> Converting the 5 users of rhashtable_walk_init() looks very straight
> forward, any reason to keep this around?

It is easy but it's needless churn, especially since we've just
converted all of them the other way.

Feel free to do a patch on top of this to get rid of them.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-19 Thread Thomas Graf
On 08/19/16 at 06:31pm, Pablo Neira Ayuso wrote:
> Why do you need global seccomp policies? The process knows better what
> he needs to place in his sandbox, so attaching this from the process
> itself makes more sense to me... Anyway, this reminds me of selinux.

Two different objectives. The point is to sandbox applications and
restrict their capabilities. It's not the application itself but the task
orchestration system around it that manages the policies.


Re: [PATCH 0/3] rhashtable: Get rid of raw table walkers part 1

2016-08-19 Thread Thomas Graf
On 08/18/16 at 04:48pm, Herbert Xu wrote:
> Dave/Tomas, please keep an eye out for any new patches that try
> to introduce raw table walkers and nack them.

Very nice work Herbert. I think we should move the rht_for_each_* macros
to a private header or lib/rhashtable.c to discourage usage altogether.


Re: [v2 PATCH 3/3] netlink: Use rhashtable walk interface in diag dump

2016-08-19 Thread Thomas Graf
On 08/19/16 at 04:21pm, Herbert Xu wrote:
> This patch converts the diag dumping code to use the rhashtable
> walk code instead of going through rhashtable by hand.  The lock
> nl_table_lock is now only taken while we process the multicast
> list as it's not needed for the rhashtable walk.
> 
> Signed-off-by: Herbert Xu 

LGTM. Curious, what did you use to test this? I've been struggling
to find a good test case.


Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-19 Thread Pablo Neira Ayuso
On Fri, Aug 19, 2016 at 01:20:25PM +0200, Daniel Borkmann wrote:
> On 08/19/2016 11:19 AM, Pablo Neira Ayuso wrote:
> [...]
> > * During the Netfilter Workshop, the main concern to add this new socket
> 
> Don't really know what was discussed exactly at NFWS, but ...

Slides are available here:

http://workshop.netfilter.org/2016/wiki/index.php/File_Cgroup-matches-NFWS2016.html

We missed you Daniel, I hope you can make it next year ;-)

> >ingress hook was that it is too specific. However this new hook in
> >the network stack looks way more specific since *it only
> >works for cgroups*.
> 
> ... why would that be something so overly specific? I don't think that "it
> only works for cgroups" would be a narrow use case. While the current
> sk_filter() BPF policies are set from application level, it makes sense to
> me to have an option for an entity that manages the cgroups to apply an
> external policy for networking side as well for participating processes.
> It seems like a useful extension to the current sk_filter() infrastructure
> iff we carve out the details properly and generic enough, and besides ...

This forces anyone to filter socket traffic from cgroups, so this
makes the networking infrastructure dependent on cgroups for no
reason, instead of simply using the cgroupv1, cgroupv2, or whatever
other information as yet another selector.

> [...]
> On 08/19/2016 12:35 PM, Daniel Mack wrote:
> [...]
> >So - I don't know. The whole 'eBPF in cgroups' idea was born because
> >through the discussions over the past months we had on all this, it
> >became clear to me that netfilter is not the right place for filtering
> >on local tasks. I agree the solution I am proposing in my patch set has
> >its downsides, mostly when it comes to transparency to users, but I
> >considered that acceptable. After all, we have eBPF users all over the
> >place in the kernel already, and seccomp, for instance, isn't any better
> >in that regard.
> 
> ... since you mention seccomp here as well, it would be another good fit
> as a program subtype to apply syscall policies for those participants on
> a cgroup level too, f.e. to disallow certain syscalls. It would be quite
> similar conceptually. So, fwiw, if this is being designed generic enough,
> the use cases would go much broader than that.

Why do you need global seccomp policies? The process knows better what
he needs to place in his sandbox, so attaching this from the process
itself makes more sense to me... Anyway, this reminds me of selinux.

Back to my main point, I would not expect we have to request sysadmins
to dump BPF bytecode to understand what global policy is being
enforced.


Re: [PATCH 2/3] MAINTAINERS: Add extra rhashtable maintainer

2016-08-19 Thread Thomas Graf
On 08/18/16 at 04:50pm, Herbert Xu wrote:
> As I'm working actively on rhashtable it helps if people CCed me
when they work on it.
> 
> Signed-off-by: Herbert Xu 

Acked-by: Thomas Graf 


Re: [PATCH 1/3] rhashtable: Remove GFP flag from rhashtable_walk_init

2016-08-19 Thread Thomas Graf
On 08/18/16 at 04:50pm, Herbert Xu wrote:
> +/* Obsolete function, do not use in new code. */
> +static inline int rhashtable_walk_init(struct rhashtable *ht,
> +struct rhashtable_iter *iter, gfp_t gfp)
> +{
> + rhashtable_walk_enter(ht, iter);
> + return 0;
> +}

Converting the 5 users of rhashtable_walk_init() looks very straight
forward, any reason to keep this around?


Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-19 Thread Pablo Neira Ayuso
Hi Daniel,

On Fri, Aug 19, 2016 at 12:35:14PM +0200, Daniel Mack wrote:
> Hi Pablo,
> 
> On 08/19/2016 11:19 AM, Pablo Neira Ayuso wrote:
> > On Wed, Aug 17, 2016 at 04:00:43PM +0200, Daniel Mack wrote:
> >> I'd appreciate some feedback on this. Pablo has some remaining concerns
> >> about this approach, and I'd like to continue the discussion we had
> >> off-list in the light of this patchset.
> > 
> > OK, I'm going to summarize them here below:
> > 
> > * This new hook
> 
> "This" refers to your alternative to my patch set, right?
>
> > allows us to enforce an *administrative filtering
> >   policy* that must be visible to anyone with CAP_NET_ADMIN. This is
> >   easy to display in nf_tables as you can list the ruleset via the nft
> >   userspace tool. Otherwise, in your approach if a misconfigured
> >   filtering policy causes connectivity problems, I don't see how the
> >   sysadmin is going to have an easy way to troubleshoot what is going on.
> 
> True. That's the downside of bpf.
> 
> > * Interaction with other software. As I could read from your patch,
> >   what you propose will detach any previous existing filter. So I
> >   don't see how you can attach multiple filtering policies from
> >   different processes that don't cooperate each other.
> 
> Also true. A cgroup can currently only hold one bpf program for each
> direction, and they are supposed to be set from one controlling instance
> in the system. However, it is possible to create subcgroups, and install
> own programs in them, which will then be effective instead of the one in
> the parent. They will, however, replace each other in runtime behavior,
> and not be stacked. This is a fundamentally different approach than how
> nf_tables works of course.

I see, this looks problematic indeed, thanks for confirming this.

> > In nf_tables
> >   this is easy since they can create their own tables so they keep their
> >   ruleset in separate spaces. If the interaction is not OK, again the
> >   sysadmin can very quickly debug this since the policies would be
> >   visible via nf_tables ruleset listing.
> 
> True. Debugging would be much easier that way.
> 
> > So what I'm proposing goes in the direction of using the nf_tables
> > infrastructure instead:
> > 
> > * Add a new socket family for nf_tables with an input hook at
> >   sk_filter(). This just requires the new netfilter hook there and
> >   the boiler plate code to allow creating tables for this new family.
> >   And then we get access to many of the existing features in
> >   nf_tables for free.
> 
> Yes. However, when I proposed more or less exactly that back in
> September last year ("NF_INET_LOCAL_SOCKET_IN"), the concern raised by
> you and Florian Westphal was that this type of decision making is out of
> scope for netfilter, mostly because
>
> a) whether a userspace process is running should not have any influence
> in the netfilter behavior (which it does, because the rules are not
> processed when the local socket cannot be determined)

We always have a socket at sk_filter(). This new socket family retains
this specific semantics.

> b) it is asymmetric, as it only exists for the input path

Unless strictly necessary, I would not add another earlier socket
egress hook if ipv4 and ipv6 LOCAL_OUT hook is enough for this.

> c) it's a change in netfilter paradigm, because rules for multicast
> receivers are run multiple times (once for each receiving task)

This is part of the semantics of this new socket family, so the user
is aware of this particular expensive multicast behaviour in this new
family.

> d) it was considered a sledgehammer solution for a something that very
> few people really need

I don't see why the current RFC is less heavyweight. I guess this was
true when the "hooks spread all over the transport layer" approach was
discussed, but not with the single sk_filter() hook we're discussing.

> I still think such a hook would be a good thing to have. As far as
> implementation goes, my patch set back then patched each of the
> protocols individually (ipv4, ipv6, dccp, sctp), while your idea to hook
> in to sk_filter sound much more reasonable.

That's it. Instead of having hooks spread all over the place per
protocol, the idea is to add a new socket family with a hook like
NF_SOCKET_INGRESS at sk_filter(). This new family encloses the new
socket-specific semantics, which differ from the ip, ip6 and inet
family semantics.

> If the opinions on the previously raised concerns have changed, I'm
> happy to revisit.
> 
> > * We can quickly find a verdict on the packet using using any combination
> >   of selectors through concatenations and maps in nf_tables. In
> >   nf_tables we can express the policy with a non-linear ruleset.
> 
> That's another interesting detail that was discussed on NFWS, yes. We
> need a way to dispatch incoming packets without walking a linear
> dispatcher list. In the eBPF approach, that's very easy because the
> cgroup is directly 

RE: [PATCH] gianfar: prevent fragmentation in DSA environments

2016-08-19 Thread Claudiu Manoil
>-Original Message-
>From: Andrew Lunn [mailto:and...@lunn.ch]
>Sent: Friday, August 19, 2016 5:59 PM
>To: Zefir Kurtisi 
>Cc: netdev@vger.kernel.org; claudiu.man...@freescale.com
>Subject: Re: [PATCH] gianfar: prevent fragmentation in DSA environments
>
>On Fri, Aug 19, 2016 at 11:16:14AM +0200, Zefir Kurtisi wrote:
>> The eTSEC register MRBLR defines the maximum space in
>> the RX buffers and is set to 1536 by gianfar. This
>> reasonably covers the common use case where the MTU
>> is kept at default 1500.
>>
>> Alas, if the eTSEC is attached to a DSA enabled switch,
>> the DSA header extension causes every maximum sized
>> frame to be fragmented by the hardware (1536 + 4).
>>
>> This patch increases the maximum RX buffer size by
>> RXBUF_ALIGNMENT (64) octets. Since the driver uses a
>> half-page memory schema, this change does not
>> increase allocated memory but allows the hardware to
>> use 1600 bytes of the totally available 2048.
>>
>> Signed-off-by: Zefir Kurtisi 
>> ---
>>  drivers/net/ethernet/freescale/gianfar.h | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/freescale/gianfar.h
>b/drivers/net/ethernet/freescale/gianfar.h
>> index 373fd09..02b794b 100644
>> --- a/drivers/net/ethernet/freescale/gianfar.h
>> +++ b/drivers/net/ethernet/freescale/gianfar.h
>> @@ -100,7 +100,8 @@ extern const char gfar_driver_version[];
>>  #define DEFAULT_RX_LFC_THR  16
>>  #define DEFAULT_LFC_PTVVAL  4
>>
>> -#define GFAR_RXB_SIZE 1536
>> +/* prevent fragmentation by HW in DSA environments */
>> +#define GFAR_RXB_SIZE (1536 + RXBUF_ALIGNMENT)
>
>Hi Zefir
>
>Using RXBUF_ALIGNMENT suggests this has something to do with
>alignment, not extra headers.
>
>How about
>
>/* Prevent fragmentation by HW when using extra headers like DSA */
>#define GFAR_RXB_SIZE (1536 + 8)
>

MRBL (Maximum receive buffer length) must be multiple of 64.
Please consult the hardware documentation at least before proposing
changes to the driver.

Claudiu


[PATCH] net/irda: remove pointless assignment/check

2016-08-19 Thread Vegard Nossum
We've already set sk to sock->sk and dereferenced it, so if it's NULL
we would have crashed already. Moreover, if it was NULL we would have
crashed anyway when jumping to 'out' and trying to unlock the sock.
Furthermore, if we had assigned a different value to 'sk' we would
have been calling lock_sock() and release_sock() on different sockets.

My conclusion is that these two lines are complete nonsense and only
serve to confuse the reader.

Signed-off-by: Vegard Nossum 
---
 net/irda/af_irda.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/irda/af_irda.c b/net/irda/af_irda.c
index 8d2f7c9..db63969 100644
--- a/net/irda/af_irda.c
+++ b/net/irda/af_irda.c
@@ -845,9 +845,6 @@ static int irda_accept(struct socket *sock, struct socket *newsock, int flags)
if (sock->state != SS_UNCONNECTED)
goto out;
 
-   if ((sk = sock->sk) == NULL)
-   goto out;
-
err = -EOPNOTSUPP;
if ((sk->sk_type != SOCK_STREAM) && (sk->sk_type != SOCK_SEQPACKET) &&
(sk->sk_type != SOCK_DGRAM))
-- 
2.10.0.rc0.1.g07c9292



Re: [RFC PATCH 0/5] Add eBPF hooks for cgroups

2016-08-19 Thread Alexei Starovoitov
On Fri, Aug 19, 2016 at 11:19:41AM +0200, Pablo Neira Ayuso wrote:
> Hi Daniel,
> 
> On Wed, Aug 17, 2016 at 04:00:43PM +0200, Daniel Mack wrote:
> > I'd appreciate some feedback on this. Pablo has some remaining concerns
> > about this approach, and I'd like to continue the discussion we had
> > off-list in the light of this patchset.
> 
> OK, I'm going to summarize them here below:
> 
> * This new hook allows us to enforce an *administrative filtering
>   policy* that must be visible to anyone with CAP_NET_ADMIN. This is
>   easy to display in nf_tables as you can list the ruleset via the nft
>   userspace tool. Otherwise, in your approach if a misconfigured
>   filtering policy causes connectivity problems, I don't see how the
>   sysadmin is going to have an easy way to troubleshoot what is going on.
> 
> * Interaction with other software. As I could read from your patch,
>   what you propose will detach any previous existing filter. So I
>   don't see how you can attach multiple filtering policies from
>   different processes that don't cooperate each other. In nf_tables
>   this is easy since they can create their own tables so they keep their
>   ruleset in separate spaces. If the interaction is not OK, again the
>   sysadmin can very quickly debug this since the policies would be
>   visible via nf_tables ruleset listing.
> 
> * During the Netfilter Workshop, the main concern to add this new socket
>   ingress hook was that it is too specific. However this new hook in
>   the network stack looks way more specific since *it only
>   works for cgroups*.
> 
> So what I'm proposing goes in the direction of using the nf_tables
> infrastructure instead:

Pablo, if you were proposing to do cgroups+nft as well as cgroups+bpf
we could have had much more productive discussion.
You were not participating in cgroup+bpf design and now bringing up
bogus points that make no sense to me. That's not helpful.
Please start another cgroups+nft thread and there we can discuss the
ways to do it cleanly without slowing down the stack.
netfilter hooks bloat the stack enough that some people compile them out.
If I were you, I'd focus on improving iptables/nft performance instead
of arguing about their coolness.

> Thanks for your patience on debating this!

I don't think you're sincere.



Re: [RFC PATCH 11/13] net: sched: pfifo_fast use alf_queue

2016-08-19 Thread John Fastabend
On 16-08-19 03:13 AM, Jesper Dangaard Brouer wrote:
> On Wed, 17 Aug 2016 12:38:10 -0700
> John Fastabend  wrote:
> 
>> This converts the pfifo_fast qdisc to use the alf_queue enqueue and
> ^
>> dequeue routines then sets the NOLOCK bit.
>>
>> This also removes the logic used to pick the next band to dequeue from
>> and instead just checks each alf_queue for packets from top priority
>^
>> to lowest. This might need to be a bit more clever but seems to work
>> for now.
> 
> You need to fix the description, as you are no longer using my
> alf_queue implementation but instead are using the skb_array/ptr_ring
> queue (by MST).
> 

Yep I forgot to change this even though iirc you may have had the same
comment in the last rev. Thanks! I'll get it fixed up this time.

.John


Re: [PATCH 1/1] netfilter: gre: Use the consistent GRE and PPTP struct instead of the structures defined in netfilter

2016-08-19 Thread Feng Gao
My email server reported that the previous copy of this patch failed to
send, so I sent it again.

I am sorry, if anyone receives duplicated ones.

Regards
Feng

On Fri, Aug 19, 2016 at 11:01 PM,   wrote:
> From: Gao Feng 
>
> There are two structures which define the GRE header and PPTP
> header. So it is unnecessary to define duplicated structures in
> netfilter again.
>
> Signed-off-by: Gao Feng 
> ---
>  v1: Intial patch
>
>  include/linux/netfilter/nf_conntrack_proto_gre.h | 63 
> +---
>  include/uapi/linux/if_tunnel.h   |  1 +
>  net/ipv4/netfilter/nf_nat_proto_gre.c| 15 +++---
>  net/netfilter/nf_conntrack_proto_gre.c   | 14 +++---
>  4 files changed, 19 insertions(+), 74 deletions(-)
>
> diff --git a/include/linux/netfilter/nf_conntrack_proto_gre.h 
> b/include/linux/netfilter/nf_conntrack_proto_gre.h
> index df78dc2..9c741da 100644
> --- a/include/linux/netfilter/nf_conntrack_proto_gre.h
> +++ b/include/linux/netfilter/nf_conntrack_proto_gre.h
> @@ -2,67 +2,8 @@
>  #define _CONNTRACK_PROTO_GRE_H
>  #include 
>
> -/* GRE PROTOCOL HEADER */
> -
> -/* GRE Version field */
> -#define GRE_VERSION_1701   0x0
> -#define GRE_VERSION_PPTP   0x1
> -
> -/* GRE Protocol field */
> -#define GRE_PROTOCOL_PPTP  0x880B
> -
> -/* GRE Flags */
> -#define GRE_FLAG_C 0x80
> -#define GRE_FLAG_R 0x40
> -#define GRE_FLAG_K 0x20
> -#define GRE_FLAG_S 0x10
> -#define GRE_FLAG_A 0x80
> -
> -#define GRE_IS_C(f)((f)&GRE_FLAG_C)
> -#define GRE_IS_R(f)((f)&GRE_FLAG_R)
> -#define GRE_IS_K(f)((f)&GRE_FLAG_K)
> -#define GRE_IS_S(f)((f)&GRE_FLAG_S)
> -#define GRE_IS_A(f)((f)&GRE_FLAG_A)
> -
> -/* GRE is a mess: Four different standards */
> -struct gre_hdr {
> -#if defined(__LITTLE_ENDIAN_BITFIELD)
> -   __u16   rec:3,
> -   srr:1,
> -   seq:1,
> -   key:1,
> -   routing:1,
> -   csum:1,
> -   version:3,
> -   reserved:4,
> -   ack:1;
> -#elif defined(__BIG_ENDIAN_BITFIELD)
> -   __u16   csum:1,
> -   routing:1,
> -   key:1,
> -   seq:1,
> -   srr:1,
> -   rec:3,
> -   ack:1,
> -   reserved:4,
> -   version:3;
> -#else
> -#error "Adjust your  defines"
> -#endif
> -   __be16  protocol;
> -};
> -
> -/* modified GRE header for PPTP */
> -struct gre_hdr_pptp {
> -   __u8   flags;   /* bitfield */
> -   __u8   version; /* should be GRE_VERSION_PPTP */
> -   __be16 protocol;/* should be GRE_PROTOCOL_PPTP */
> -   __be16 payload_len; /* size of ppp payload, not inc. gre header */
> -   __be16 call_id; /* peer's call_id for this session */
> -   __be32 seq; /* sequence number.  Present if S==1 */
> -   __be32 ack; /* seq number of highest packet received by */
> -   /*  sender in this session */
> -};
> +#include 
> +#include 
>
>  struct nf_ct_gre {
> unsigned int stream_timeout;
> diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
> index 361b9f0..1b27e2c 100644
> --- a/include/uapi/linux/if_tunnel.h
> +++ b/include/uapi/linux/if_tunnel.h
> @@ -36,6 +36,7 @@
>  #define GRE_IS_REC(f)  ((f) & GRE_REC)
>  #define GRE_IS_ACK(f)  ((f) & GRE_ACK)
>
> +#define GRE_VERSION_0  __cpu_to_be16(0x0000)
>  #define GRE_VERSION_1  __cpu_to_be16(0x0001)
>  #define GRE_PROTO_PPP  __cpu_to_be16(0x880b)
> #define GRE_PPTP_KEY_MASK  __cpu_to_be32(0x0000ffff)
> diff --git a/net/ipv4/netfilter/nf_nat_proto_gre.c 
> b/net/ipv4/netfilter/nf_nat_proto_gre.c
> index 9414923..afe81a8 100644
> --- a/net/ipv4/netfilter/nf_nat_proto_gre.c
> +++ b/net/ipv4/netfilter/nf_nat_proto_gre.c
> @@ -88,8 +88,9 @@ gre_manip_pkt(struct sk_buff *skb,
>   const struct nf_conntrack_tuple *tuple,
>   enum nf_nat_manip_type maniptype)
>  {
> -   const struct gre_hdr *greh;
> -   struct gre_hdr_pptp *pgreh;
> +   const struct gre_base_hdr *greh;
> +   struct pptp_gre_header *pgreh;
> +   u16 gre_ver;
>
> /* pgreh includes two optional 32bit fields which are not required
>  * to be there.  That's where the magic '8' comes from */
> @@ -97,18 +98,20 @@ gre_manip_pkt(struct sk_buff *skb,
> return false;
>
> greh = (void *)skb->data + hdroff;
> -   pgreh = (struct gre_hdr_pptp *)greh;
> +   pgreh = (struct pptp_gre_header *)greh;
>
> /* we only have destination manip of a packet, since 'source key'
>  * is not present in the packet itself */
> if (maniptype != NF_NAT_MANIP_DST)
> return true;
> -   switch (greh->version) {
> -   case GRE_VERSION_1701:
> +
> +   gre_ver = ntohs(greh->flags & GRE_VERSION);

[PATCH 1/1] ppp: Fix one deadlock issue of PPP when reentrant

2016-08-19 Thread fgao
From: Gao Feng 

The PPP channel holds a spinlock before sending a frame. But the skb may
be routed back to the same PPP channel by a wrong route policy. As a
result, the skb re-enters the same channel path and tries to take the
same spinlock that is already held. Bang, the deadlock comes out.

Now add a lock owner to avoid it, like the xmit_lock_owner of
netdev_queue. Check the lock owner before trying to get the spinlock.
If the current CPU is already the owner, it means ppp has detected a
reentrant send and returns directly. If it is not the owner and it takes
the spinlock successfully, it records the current CPU ID as the owner.

The following is the panic stack from 3.3.8, but the same issue
should exist upstream too.

[] ? _raw_spin_lock_bh+0x11/0x40
[] ppp_unregister_channel+0x1347/0x2170 [ppp_generic]
[] ? kmem_cache_free+0xa7/0xc0
[] ppp_unregister_channel+0x1db7/0x2170 [ppp_generic]
[] ppp_unregister_channel+0x2065/0x2170 [ppp_generic]
[] dev_hard_start_xmit+0x4cd/0x620
[] sch_direct_xmit+0x74/0x1d0
[] dev_queue_xmit+0x1d/0x30
[] neigh_direct_output+0xc/0x10
[] ip_finish_output+0x25e/0x2b0
[] ip_output+0x88/0x90
[] ? __ip_local_out+0x9f/0xb0
[] ip_local_out+0x24/0x30
[] 0xa00b9744
[] ppp_unregister_channel+0x20f8/0x2170 [ppp_generic]
[] ppp_output_wakeup+0x122/0x11d0 [ppp_generic]
[] vfs_write+0xb8/0x160
[] sys_write+0x45/0x90
[] system_call_fastpath+0x16/0x1b

The call flow is like this.
ppp_write->ppp_channel_push->start_xmit->select inappropriate route
 -> dev_hard_start_xmit->ppp_start_xmit->ppp_xmit_process->
ppp_push. Now ppp_push tries to get the same spinlock which is held
in ppp_channel_push.

Although the PPP deadlock is caused by an inappropriate route policy
with L2TP, I think it is not acceptable that the PPP module could cause
a kernel deadlock because of a wrong route policy.

Signed-off-by: Gao Feng 
---
 v3: Change the fix solution. Give up the send chance instead of taking a recursive lock
 v2: Fix recursive unlock issue
 v1: Initial patch
 
 drivers/net/ppp/ppp_generic.c | 95 +--
 1 file changed, 73 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ppp/ppp_generic.c b/drivers/net/ppp/ppp_generic.c
index 70cfa06..b653f1f 100644
--- a/drivers/net/ppp/ppp_generic.c
+++ b/drivers/net/ppp/ppp_generic.c
@@ -162,6 +162,46 @@ struct ppp {
 |SC_MULTILINK|SC_MP_SHORTSEQ|SC_MP_XSHORTSEQ \
 |SC_COMP_TCP|SC_REJ_COMP_TCP|SC_MUST_COMP)
 
+struct channel_lock {
+   spinlock_t lock;
+   int owner;
+};
+
+static inline void ppp_channel_lock_init(struct channel_lock *cl)
+{
+   cl->owner = -1;
+   spin_lock_init(&cl->lock);
+}
+
+static inline bool ppp_channel_lock_bh(struct channel_lock *cl)
+{
+   int cpu;
+
+   local_bh_disable();
+   cpu = smp_processor_id();
+   if (cpu == cl->owner) {
+   /* The CPU already holds this channel lock and is sending. But
+* an inappropriate route selected the same channel again,
+* re-entering it, which the PPP module forbids.
+*/
+   if (net_ratelimit())
+   pr_err("PPP detects one recursive channel send\n");
+   local_bh_enable();
+   return false;
+   }
+   spin_lock(&cl->lock);
+   cl->owner = cpu;
+
+   return true;
+}
+
+static inline void ppp_channel_unlock_bh(struct channel_lock *cl)
+{
+   cl->owner = -1;
+   spin_unlock(&cl->lock);
+   local_bh_enable();
+}
+
 /*
  * Private data structure for each channel.
  * This includes the data structure used for multilink.
@@ -171,7 +211,7 @@ struct channel {
struct list_head list;  /* link in all/new_channels list */
struct ppp_channel *chan;   /* public channel data structure */
struct rw_semaphore chan_sem;   /* protects `chan' during chan ioctl */
-   spinlock_t  downl;  /* protects `chan', file.xq dequeue */
+   struct channel_lock downl;  /* protects `chan', file.xq dequeue */
struct ppp  *ppp;   /* ppp unit we're connected to */
struct net  *chan_net;  /* the net channel belongs to */
struct list_head clist; /* link in list of channels per unit */
@@ -1587,9 +1627,7 @@ ppp_push(struct ppp *ppp)
	list = &ppp->channels;
if (list_empty(list)) {
/* nowhere to send the packet, just drop it */
-   ppp->xmit_pending = NULL;
-   kfree_skb(skb);
-   return;
+   goto drop;
}
 
if ((ppp->flags & SC_MULTILINK) == 0) {
@@ -1597,16 +1635,19 @@ ppp_push(struct ppp *ppp)
list = list->next;
pch = list_entry(list, struct channel, clist);
 
-   spin_lock_bh(&pch->downl);
+   if (unlikely(!ppp_channel_lock_bh(&pch->downl))) {
+   /* Fail to hold channel lock */
+   goto drop;
+   }
  

Re: [PATCH net 2/2] sctp: not copying duplicate addrs to the assoc's bind address list

2016-08-19 Thread Xin Long
> Under what valid use case will multiple interfaces have the same network
> address?
>
Hi, Neil.
I'm not sure of a specific valid use case.

The point is, do we trust that the sctp global addr list has no
duplicate addresses? On one user's computer, he found hundreds of
duplicate addresses in INIT_ACK packets.
(https://bugzilla.redhat.com/show_bug.cgi?id=1308362)

If we think NETDEV_UP/DOWN events in the addr chain always come
in pairs, I cannot explain how this issue happened.

But from the sctp end, we can avoid it with this patch. After all,
binding duplicate addresses is illegal, just like what
sctp_bind does.

what do you think ?


Re: [RFC PATCH] net: diag: support SOCK_DESTROY for UDP sockets

2016-08-19 Thread David Ahern
On 8/18/16 10:17 PM, Eric Dumazet wrote:
> Why are you iterating the whole table ?
> 
> Normally, udp_hashfn(net, ntohs(dport), udptable->mask) slot should be
> enough to find all sockets bound to dport.

I took a tangent after an early mistake with the socket lookup, and after
that I was focused on doing the right thing in udp_abort and did not
revisit the socket lookup. Reverted to __udp{4,6}_lib_lookup.


[PATCH 1/1] netfilter: gre: Use the consistent GRE and PPTP struct instead of the structures defined in netfilter

2016-08-19 Thread fgao
From: Gao Feng 

There are two structures which define the GRE header and PPTP
header. So it is unnecessary to define duplicated structures in
netfilter again.

Signed-off-by: Gao Feng 
---
 v1: Intial patch

 include/linux/netfilter/nf_conntrack_proto_gre.h | 63 +---
 include/uapi/linux/if_tunnel.h   |  1 +
 net/ipv4/netfilter/nf_nat_proto_gre.c| 15 +++---
 net/netfilter/nf_conntrack_proto_gre.c   | 14 +++---
 4 files changed, 19 insertions(+), 74 deletions(-)

diff --git a/include/linux/netfilter/nf_conntrack_proto_gre.h 
b/include/linux/netfilter/nf_conntrack_proto_gre.h
index df78dc2..9c741da 100644
--- a/include/linux/netfilter/nf_conntrack_proto_gre.h
+++ b/include/linux/netfilter/nf_conntrack_proto_gre.h
@@ -2,67 +2,8 @@
 #define _CONNTRACK_PROTO_GRE_H
 #include 
 
-/* GRE PROTOCOL HEADER */
-
-/* GRE Version field */
-#define GRE_VERSION_1701   0x0
-#define GRE_VERSION_PPTP   0x1
-
-/* GRE Protocol field */
-#define GRE_PROTOCOL_PPTP  0x880B
-
-/* GRE Flags */
-#define GRE_FLAG_C 0x80
-#define GRE_FLAG_R 0x40
-#define GRE_FLAG_K 0x20
-#define GRE_FLAG_S 0x10
-#define GRE_FLAG_A 0x80
-
-#define GRE_IS_C(f)((f)&GRE_FLAG_C)
-#define GRE_IS_R(f)((f)&GRE_FLAG_R)
-#define GRE_IS_K(f)((f)&GRE_FLAG_K)
-#define GRE_IS_S(f)((f)&GRE_FLAG_S)
-#define GRE_IS_A(f)((f)&GRE_FLAG_A)
-
-/* GRE is a mess: Four different standards */
-struct gre_hdr {
-#if defined(__LITTLE_ENDIAN_BITFIELD)
-   __u16   rec:3,
-   srr:1,
-   seq:1,
-   key:1,
-   routing:1,
-   csum:1,
-   version:3,
-   reserved:4,
-   ack:1;
-#elif defined(__BIG_ENDIAN_BITFIELD)
-   __u16   csum:1,
-   routing:1,
-   key:1,
-   seq:1,
-   srr:1,
-   rec:3,
-   ack:1,
-   reserved:4,
-   version:3;
-#else
-#error "Adjust your  defines"
-#endif
-   __be16  protocol;
-};
-
-/* modified GRE header for PPTP */
-struct gre_hdr_pptp {
-   __u8   flags;   /* bitfield */
-   __u8   version; /* should be GRE_VERSION_PPTP */
-   __be16 protocol;/* should be GRE_PROTOCOL_PPTP */
-   __be16 payload_len; /* size of ppp payload, not inc. gre header */
-   __be16 call_id; /* peer's call_id for this session */
-   __be32 seq; /* sequence number.  Present if S==1 */
-   __be32 ack; /* seq number of highest packet received by */
-   /*  sender in this session */
-};
+#include 
+#include 
 
 struct nf_ct_gre {
unsigned int stream_timeout;
diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
index 361b9f0..1b27e2c 100644
--- a/include/uapi/linux/if_tunnel.h
+++ b/include/uapi/linux/if_tunnel.h
@@ -36,6 +36,7 @@
 #define GRE_IS_REC(f)  ((f) & GRE_REC)
 #define GRE_IS_ACK(f)  ((f) & GRE_ACK)
 
+#define GRE_VERSION_0  __cpu_to_be16(0x0000)
 #define GRE_VERSION_1  __cpu_to_be16(0x0001)
 #define GRE_PROTO_PPP  __cpu_to_be16(0x880b)
 #define GRE_PPTP_KEY_MASK  __cpu_to_be32(0x0000ffff)
diff --git a/net/ipv4/netfilter/nf_nat_proto_gre.c 
b/net/ipv4/netfilter/nf_nat_proto_gre.c
index 9414923..afe81a8 100644
--- a/net/ipv4/netfilter/nf_nat_proto_gre.c
+++ b/net/ipv4/netfilter/nf_nat_proto_gre.c
@@ -88,8 +88,9 @@ gre_manip_pkt(struct sk_buff *skb,
  const struct nf_conntrack_tuple *tuple,
  enum nf_nat_manip_type maniptype)
 {
-   const struct gre_hdr *greh;
-   struct gre_hdr_pptp *pgreh;
+   const struct gre_base_hdr *greh;
+   struct pptp_gre_header *pgreh;
+   u16 gre_ver;
 
/* pgreh includes two optional 32bit fields which are not required
 * to be there.  That's where the magic '8' comes from */
@@ -97,18 +98,20 @@ gre_manip_pkt(struct sk_buff *skb,
return false;
 
greh = (void *)skb->data + hdroff;
-   pgreh = (struct gre_hdr_pptp *)greh;
+   pgreh = (struct pptp_gre_header *)greh;
 
/* we only have destination manip of a packet, since 'source key'
 * is not present in the packet itself */
if (maniptype != NF_NAT_MANIP_DST)
return true;
-   switch (greh->version) {
-   case GRE_VERSION_1701:
+
+   gre_ver = ntohs(greh->flags & GRE_VERSION);
+   switch (gre_ver) {
+   case GRE_VERSION_0:
/* We do not currently NAT any GREv0 packets.
 * Try to behave like "nf_nat_proto_unknown" */
break;
-   case GRE_VERSION_PPTP:
+   case GRE_VERSION_1:
pr_debug("call_id -> 0x%04x\n", ntohs(tuple->dst.u.gre.key));
pgreh->call_id = tuple->dst.u.gre.key;
break;
diff --git 

Re: [PATCH] gianfar: prevent fragmentation in DSA environments

2016-08-19 Thread Andrew Lunn
On Fri, Aug 19, 2016 at 11:16:14AM +0200, Zefir Kurtisi wrote:
> The eTSEC register MRBLR defines the maximum space in
> the RX buffers and is set to 1536 by gianfar. This
> reasonably covers the common use case where the MTU
> is kept at default 1500.
> 
> Alas, if the eTSEC is attached to a DSA enabled switch,
> the DSA header extension causes every maximum sized
> frame to be fragmented by the hardware (1536 + 4).
> 
> This patch increases the maximum RX buffer size by
> RXBUF_ALIGNMENT (64) octets. Since the driver uses a
> half-page memory schema, this change does not
> increase allocated memory but allows the hardware to
> use 1600 bytes of the totally available 2048.
> 
> Signed-off-by: Zefir Kurtisi 
> ---
>  drivers/net/ethernet/freescale/gianfar.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/freescale/gianfar.h 
> b/drivers/net/ethernet/freescale/gianfar.h
> index 373fd09..02b794b 100644
> --- a/drivers/net/ethernet/freescale/gianfar.h
> +++ b/drivers/net/ethernet/freescale/gianfar.h
> @@ -100,7 +100,8 @@ extern const char gfar_driver_version[];
>  #define DEFAULT_RX_LFC_THR  16
>  #define DEFAULT_LFC_PTVVAL  4
>  
> -#define GFAR_RXB_SIZE 1536
> +/* prevent fragmentation by HW in DSA environments */
> +#define GFAR_RXB_SIZE (1536 + RXBUF_ALIGNMENT)

Hi Zefir

Using RXBUF_ALIGNMENT suggests this has something to do with
alignment, not extra headers.

How about

/* Prevent fragmentation by HW when using extra headers like DSA */
#define GFAR_RXB_SIZE (1536 + 8)

Andrew


Re: [4.4-RT PATCH RFC/RFT] drivers: net: cpsw: mark rx/tx irq as IRQF_NO_THREAD

2016-08-19 Thread Grygorii Strashko
On 08/12/2016 11:32 AM, Sebastian Andrzej Siewior wrote:
> On 2016-08-11 19:15:40 [+0300], Grygorii Strashko wrote:
>> Mark CPSW Rx/Tx IRQs as IRQF_NO_THREAD and avoid double scheduling on -RT
>> where these IRQs are forced threaded:
>>  rx-irq
>>   |- schedule threaded rx-irq handler
>> ...
>>   |- threaded rx-irq handler -> cpsw_rx_interrupt()
>>  |- napi_schedule()
>>  |- __raise_softirq_irqoff()
>> |- wakeup_proper_softirq()
>> ...
>>   napi
> 
> This should not be the default path. The default should be napi running
> in the context of the threaded rx-irq handler once the handler is done.
> The wakeup_proper_softirq() part is only done if napi thinks that the
> callback functions runs for too long. So in *that* case you continue
> NAPI in the softirq-thread which runs at SCHED_OTHER.
> 
>> after:
>>  rx-irq
>>   |- cpsw_rx_interrupt()
>>  |- napi_schedule()
>>   |- irq_exit()
>>  |- invoke_softirq()
>> |- wakeup_softirqd()
>> ...
>>   napi
> 
> Since you schedule the softirq from an IRQ-off region / without a
> process context you force the softirq to run in the thread at
> SCHED_OTHER priority.
> 
>> And, as result, get benefits from the following improvements (tested
>> on am57xx-evm):
>>
>> 1) "[ 78.348599] NOHZ: local_softirq_pending 80" message will not be
>>seen any more. Now these warnings can be seen once iperf is started.
>># iperf -c $IPERFHOST -w 128K  -d -t 60
> 
> Do you also see "sched: RT throttling activated"? Because I don't see
> otherwise why this should pop up.
> 

I've collected a trace before the first occurrence of "NOHZ: local_softirq_pending 80"

 irq/354-4848400-85[000]90.639460: irq_handler_entry:irq=19 
name=arch_timer
   iperf-1284  [001]90.639474: softirq_raise:vec=1 
[action=TIMER]
   iperf-1284  [001]90.639486: irq_handler_exit: irq=19 
ret=handled
 irq/354-4848400-85[000]90.639488: softirq_raise:vec=7 
[action=SCHED]
   iperf-1284  [001]90.639490: sched_waking: 
comm=ksoftirqd/1 pid=21 prio=120 target_cpu=001
 irq/354-4848400-85[000]90.639492: softirq_raise:vec=1 
[action=TIMER]
   iperf-1284  [001]90.639499: sched_wakeup: ksoftirqd/1:21 
[120] success=1 CPU:001
   iperf-1284  [001]90.639504: sched_waking: 
comm=ktimersoftd/1 pid=20 prio=98 target_cpu=001
 irq/354-4848400-85[000]90.639505: irq_handler_exit: irq=19 
ret=handled
   iperf-1284  [001]90.639512: sched_wakeup: 
ktimersoftd/1:20 [98] success=1 CPU:001
   iperf-1284  [001]90.639526: sched_stat_runtime:   comm=iperf 
pid=1284 runtime=50752 [ns] vruntime=2105322850 [ns]
   iperf-1284  [001]90.639537: sched_switch: iperf:1284 
[120] R ==> irq/355-4848400:86 [49]
 irq/355-4848400-86[001]90.639545: softirq_raise:vec=3 
[action=NET_RX]
 irq/355-4848400-86[001]90.639556: softirq_entry:vec=3 
[action=NET_RX]
 irq/355-4848400-86[001]90.639589: softirq_exit: vec=3 
[action=NET_RX]
 irq/355-4848400-86[001]90.639614: sched_switch: 
irq/355-4848400:86 [49] S ==> ktimersoftd/1:20 [98]
   ktimersoftd/1-20[001]90.639625: softirq_entry:vec=1 
[action=TIMER]
   ktimersoftd/1-20[001]90.639637: sched_waking: 
comm=rcu_preempt pid=8 prio=98 target_cpu=001
   ktimersoftd/1-20[001]90.639646: sched_wakeup: rcu_preempt:8 
[98] success=1 CPU:001
   ktimersoftd/1-20[001]90.639663: softirq_exit: vec=1 
[action=TIMER]
   ktimersoftd/1-20[001]90.639679: sched_switch: 
ktimersoftd/1:20 [98] S ==> rcu_preempt:8 [98]
 rcu_preempt-8 [001]90.639722: sched_switch: rcu_preempt:8 
[98] S ==> ksoftirqd/1:21 [120]
 ksoftirqd/1-21[001]90.639740: sched_stat_runtime:   
comm=ksoftirqd/1 pid=21 runtime=25539 [ns] vruntime=29960463828 [ns]
 ksoftirqd/1-21[001]90.639750: sched_switch: ksoftirqd/1:21 
[120] S ==> iperf:1284 [120]
 irq/354-4848400-85[000]90.639878: irq_handler_entry:irq=355 
name=48484000.ethernet
 irq/354-4848400-85[000]90.639880: irq_handler_exit: irq=355 
ret=handled
 irq/354-4848400-85[000]90.639884: sched_waking: 
comm=irq/355-4848400 pid=86 prio=49 target_cpu=001
 irq/354-4848400-85[000]90.639896: sched_wakeup: 
irq/355-4848400:86 [49] success=1 CPU:001
   iperf-1284  [001]90.639903: sched_stat_runtime:   comm=iperf 
pid=1284 runtime=150466 [ns] vruntime=2105473316 [ns]
   iperf-1284  [001]90.639913: sched_switch: iperf:1284 
[120] R ==> irq/355-4848400:86 [49]
 irq/355-4848400-86[001]90.639921: softirq_raise:vec=3 
[action=NET_RX]
 irq/355-4848400-86[001]90.639932: softirq_entry:vec=3 
[action=NET_RX]
 irq/355-4848400-86[001]90.639958: sched_waking: 
comm=irq/354-4848400 

Re: [PATCH] net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for gre tunneled skbs

2016-08-19 Thread Shmulik Ladkani
Hi,

On Fri, 19 Aug 2016 11:20:40 +0200 Hannes Frederic Sowa 
 wrote:
> >> Maybe we can change our criteria in the following manner:
> >>  
> >> -  if (skb_iif && proto == IPPROTO_UDP) {
> >> +  if (skb_iif && !(df & htons(IP_DF))) {
> >>IPCB(skb)->flags |= IPSKB_FRAG_SEGS;
> >>
> >> This way, any tunnel explicitly providing DF is NOT allowed to
> >> further fragment the resulting segments (leading to tx segments being
> >> dropped).
> >> Others tunnels, that do not care (e.g. vxlan and geneve, and probably
> >> ovs vport-gre, or other ovs encap vports, in df_default=false mode),
> >> will behave same for gso and non-gso.
> >>
> 
> I am really not sure...
> 
> Probably we have no other choice.

Further diving into this, it seems the !IP_DF approach is more correct
than the IPPROTO_UDP approach (WRT packets/segments arriving from another
interface, that exceed the egress MTU):

vxlan/geneve:
  Both set df to zero.
  !IP_DF approach acts same as IPPROTO_UDP approach

vxlan/geneve in collect_md (e.g. OvS):
  They set df according to tun_flags & TUNNEL_DONT_FRAGMENT.
  IPPROTO_UDP approach:
IPSKB_FRAG_SEGS gets set unconditionally.
In case TUNNEL_DONT_FRAGMENT is requested, non-gso get dropped
due to IPSTATS_MIB_FRAGFAILS, whereas gso gets segmented+fragmented (!)
  !IP_DF approach:
Aligned, both non-gso and gso get dropped for TUNNEL_DONT_FRAGMENT.

ip_gre in collect_md (e.g. OvS):
  Sets df according to tun_flags & TUNNEL_DONT_FRAGMENT.
  IPPROTO_UDP approach:
IPSKB_FRAG_SEGS is never set.
Therefore in the case where df is NOT set, non-gso are fragged and
passed, whereas gso gets dropped (!)
  !IP_DF approach:
Non-gso vs gso aligned.

ip_gre in nopmtudisc:
  Will pass tnl_update_pmtu checks; Then, df inherited from inner_iph
  (or stays unset if IFLA_GRE_IGNORE_DF specified).
  IPPROTO_UDP approach:
IPSKB_FRAG_SEGS never set.
Therefore in the case where df is NOT set, non-gso are fragged and
passed, whereas gso gets dropped (!)
  !IP_DF approach:
Aligned.

ip_gre in fou/gue mode in nopmtudisc:
  Assuming they pass tnl_update_pmtu checks; Then, df inherited from
  inner_iph (or stays unset if IFLA_GRE_IGNORE_DF specified).
  IPPROTO_UDP approach:
IPSKB_FRAG_SEGS always gets set (since proto==IPPROTO_UDP).
In the case df is set, non-gso dropped by IPSTATS_MIB_FRAGFAILS,
whereas gso gets segmented+fragmented (!)
  !IP_DF approach: 
Aligned.

ip_gre in pmtudisc:
  Sets df to IP_DF.
  Non-gso will fail tnl_update_pmtu checks (gso should pass).
  IPPROTO_UDP approach:
IPSKB_FRAG_SEGS never set. This leads the gso skbs to be eventually
dropped. okay.
  !IP_DF approach:
IPSKB_FRAG_SEGS not set, since IP_DF is true.
This leads to gso skbs to be eventually dropped. okay.

(truely appreciate if you can review my above analysis)

Therefore using !(df & htons(IP_DF)) actually fixes some oversights of
our former proto==IPPROTO_UDP approach.

I'll send a patch.

Thanks
Shmulik


[PATCH net 1/2] sctp: reduce indent level in sctp_copy_local_addr_list

2016-08-19 Thread Xin Long
This patch reduces the indent level by using continue when the addr
is not allowed, and also drops the end_copy label by using break.

Signed-off-by: Xin Long 
---
 net/sctp/protocol.c | 37 +++--
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 7b523e3..da5d82b 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -205,26 +205,27 @@ int sctp_copy_local_addr_list(struct net *net, struct sctp_bind_addr *bp,
 	list_for_each_entry_rcu(addr, &net->sctp.local_addr_list, list) {
 		if (!addr->valid)
 			continue;
-		if (sctp_in_scope(net, &addr->a, scope)) {
-			/* Now that the address is in scope, check to see if
-			 * the address type is really supported by the local
-			 * sock as well as the remote peer.
-			 */
-			if ((((AF_INET == addr->a.sa.sa_family) &&
-			      (copy_flags & SCTP_ADDR4_PEERSUPP))) ||
-			    (((AF_INET6 == addr->a.sa.sa_family) &&
-			      (copy_flags & SCTP_ADDR6_ALLOWED) &&
-			      (copy_flags & SCTP_ADDR6_PEERSUPP)))) {
-				error = sctp_add_bind_addr(bp, &addr->a,
-							   sizeof(addr->a),
-							   SCTP_ADDR_SRC, GFP_ATOMIC);
-				if (error)
-					goto end_copy;
-			}
-		}
+		if (!sctp_in_scope(net, &addr->a, scope))
+			continue;
+
+		/* Now that the address is in scope, check to see if
+		 * the address type is really supported by the local
+		 * sock as well as the remote peer.
+		 */
+		if (addr->a.sa.sa_family == AF_INET &&
+		    !(copy_flags & SCTP_ADDR4_PEERSUPP))
+			continue;
+		if (addr->a.sa.sa_family == AF_INET6 &&
+		    (!(copy_flags & SCTP_ADDR6_ALLOWED) ||
+		     !(copy_flags & SCTP_ADDR6_PEERSUPP)))
+			continue;
+
+		error = sctp_add_bind_addr(bp, &addr->a, sizeof(addr->a),
+					   SCTP_ADDR_SRC, GFP_ATOMIC);
+		if (error)
+			break;
 	}
 
-end_copy:
 	rcu_read_unlock();
 	return error;
 }
-- 
2.1.0



RE: [PATCH 1/2] net: phy: Add error checks in the driver

2016-08-19 Thread Appana Durga Kedareswara Rao
Hi Andrew,

Thanks for the review...

> > -   mdiobus_write(phydev->mdio.bus, priv->addr,
> XILINX_GMII2RGMII_REG, val);
> > +   err = mdiobus_write(phydev->mdio.bus, priv->addr,
> XILINX_GMII2RGMII_REG,
> > +   val);
> > +   if (err < 0)
> > +   return err;
> >
> > return 0;
> 
> Do you need to assign err? Why not just

Ok will fix in the next version

Regards,
Kedar.

> 
>return mdiobus_write(phydev->mdio.bus, priv->addr,
> XILINX_GMII2RGMII_REG,
>   val);


Re: [PATCH net 2/2] sctp: not copying duplicate addrs to the assoc's bind address list

2016-08-19 Thread Neil Horman
On Fri, Aug 19, 2016 at 07:30:22PM +0800, Xin Long wrote:
> From: Xin Long 
> 
> sctp.local_addr_list is a global address list that is supposed to include
> all the local addresses. sctp updates this list according to NETDEV_UP/
> NETDEV_DOWN notifications.
> 
> However, if multiple NICs have the same address, the global list will
> have duplicate addresses. Even if for one NIC, promote secondaries in
> __inet_del_ifa can also lead to accumulating duplicate addresses.
> 
> When sctp binds address 'ANY' and creates a connection, it copies all
> the addresses from global list into asoc's bind addr list, which makes
> sctp pack the duplicate addresses into INIT/INIT_ACK packets.
> 
> This patch is to filter the duplicate addresses when copying the addrs
> from global list in sctp_copy_local_addr_list and unpacking addr_param
> from cookie in sctp_raw_to_bind_addrs to asoc's bind addr list.
> 
> Signed-off-by: Xin Long 


Under what valid use case will multiple interfaces have the same network
address?

Neil

> ---
>  net/sctp/bind_addr.c | 3 +++
>  net/sctp/protocol.c  | 3 +++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
> index 401c607..1ebc184 100644
> --- a/net/sctp/bind_addr.c
> +++ b/net/sctp/bind_addr.c
> @@ -292,6 +292,8 @@ int sctp_raw_to_bind_addrs(struct sctp_bind_addr *bp, __u8 *raw_addr_list,
>  		}
>  
>  		af->from_addr_param(&addr, rawaddr, htons(port), 0);
> +		if (sctp_bind_addr_state(bp, &addr) != -1)
> +			goto next;
>  		retval = sctp_add_bind_addr(bp, &addr, sizeof(addr),
>  					    SCTP_ADDR_SRC, gfp);
>  		if (retval) {
> @@ -300,6 +302,7 @@ int sctp_raw_to_bind_addrs(struct sctp_bind_addr *bp, __u8 *raw_addr_list,
>  			break;
>  		}
>  
> +next:
>  		len = ntohs(param->length);
>  		addrs_len -= len;
>  		raw_addr_list += len;
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index da5d82b..616a942 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -220,6 +220,9 @@ int sctp_copy_local_addr_list(struct net *net, struct sctp_bind_addr *bp,
>  			     !(copy_flags & SCTP_ADDR6_PEERSUPP)))
>  			continue;
>  
> +		if (sctp_bind_addr_state(bp, &addr->a) != -1)
> +			continue;
> +
>  		error = sctp_add_bind_addr(bp, &addr->a, sizeof(addr->a),
>  					   SCTP_ADDR_SRC, GFP_ATOMIC);
>  		if (error)
> -- 
> 2.1.0
> 
> 


Re: [PATCH 2/2] net: phy: Fix race conditions in the driver

2016-08-19 Thread Andrew Lunn
On Fri, Aug 19, 2016 at 06:18:11PM +0530, Kedareswara rao Appana wrote:
> This patch fixes the below race conditions in the driver.
> ---> Fix Oops after unloading the driver as a module
> ---> Use spin locks where relevant.
> ---> Take reference on the external phy to prevent issues
> when phy driver is unloaded.

Each one of these should be an individual patch.

> 
> Reported-by: Andrew Lunn 
> Signed-off-by: Kedareswara rao Appana 
> ---
>  drivers/net/phy/xilinx_gmii2rgmii.c | 32 +++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/xilinx_gmii2rgmii.c b/drivers/net/phy/xilinx_gmii2rgmii.c
> index 7336fd0..cdd9d95 100644
> --- a/drivers/net/phy/xilinx_gmii2rgmii.c
> +++ b/drivers/net/phy/xilinx_gmii2rgmii.c
> @@ -34,6 +34,7 @@ struct gmii2rgmii {
>   struct phy_driver *phy_drv;
>   struct phy_driver conv_phy_drv;
>   int addr;
> + spinlock_t phy_lock;

The phy already have a lock.

>  };
>  
>  static int xgmiitorgmii_read_status(struct phy_device *phydev)
> @@ -55,7 +56,7 @@ static int xgmiitorgmii_read_status(struct phy_device *phydev)
>   val |= BMCR_SPEED1000;
>   else if (phydev->speed == SPEED_100)
>   val |= BMCR_SPEED100;
> - else
> + else if (phydev->speed == SPEED_10)
>   val |= BMCR_SPEED10;
  
I said you want to return an error if the PHY is using a speed your
converter cannot support. I don't see an error being raised here.

>   err = mdiobus_write(phydev->mdio.bus, priv->addr, XILINX_GMII2RGMII_REG,
> @@ -71,6 +72,8 @@ int xgmiitorgmii_probe(struct mdio_device *mdiodev)
>   struct device *dev = &mdiodev->dev;
>   struct device_node *np = dev->of_node, *phy_node;
>   struct gmii2rgmii *priv;
> + struct mii_bus *bus;
> + unsigned long flags;
>  
>   priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
>   if (!priv)
> @@ -88,17 +91,43 @@ int xgmiitorgmii_probe(struct mdio_device *mdiodev)
>   return -EPROBE_DEFER;
>   }
>  
> + spin_lock_init(&priv->phy_lock);
> +
> + bus = priv->phy_dev->mdio.bus;
> + if (!try_module_get(bus->owner)) {
> + dev_err(dev, "failed to get the bus module\n");
> + return -EIO;
> + }

The comment at the top says you are taking a reference to the phy.
Here you take a reference to the mdio bus driver???

> +
> + get_device(&priv->phy_dev->mdio.dev);
> +
>   priv->addr = mdiodev->addr;
>   priv->phy_drv = priv->phy_dev->drv;
>   memcpy(&priv->conv_phy_drv, priv->phy_dev->drv,
>  sizeof(struct phy_driver));
>   priv->conv_phy_drv.read_status = xgmiitorgmii_read_status;
>   priv->phy_dev->priv = priv;
> + spin_lock_irqsave(&priv->phy_lock, flags);
>   priv->phy_dev->drv = &priv->conv_phy_drv;
> + spin_unlock_irqrestore(&priv->phy_lock, flags);

And how is this spinlock protecting anything? 

> +
> + dev_set_drvdata(dev, priv);
>  
>   return 0;
>  }
>  
> +static void xgmiitorgmii_remove(struct mdio_device *mdiodev)
> +{
> + struct gmii2rgmii *priv = dev_get_drvdata(&mdiodev->dev);
> + struct mii_bus *bus;
> +
> + bus = priv->phy_dev->mdio.bus;
> +
> + put_device(&priv->phy_dev->mdio.dev);
> + module_put(bus->owner);
> + phy_disconnect(priv->phy_dev);

Why are you disconnecting the phy?

> +}
> +
>  static const struct of_device_id xgmiitorgmii_of_match[] = {
>   { .compatible = "xlnx,gmii-to-rgmii-1.0" },
>   {},
> @@ -107,6 +136,7 @@ MODULE_DEVICE_TABLE(of, xgmiitorgmii_of_match);
>  
>  static struct mdio_driver xgmiitorgmii_driver = {
>   .probe  = xgmiitorgmii_probe,
> + .remove = xgmiitorgmii_remove,
>   .mdiodrv.driver = {
>   .name = "xgmiitorgmii",
>   .of_match_table = xgmiitorgmii_of_match,
> -- 
> 2.1.2
> 


Re: [PATCH net 1/1] net sched ife action: fix encoding to use real length

2016-08-19 Thread Sergei Shtylyov

Hello.

On 8/19/2016 1:17 PM, Jamal Hadi Salim wrote:

> From: Jamal Hadi Salim 
>
> Encoding of the metadata was using the padded length as opposed to
> the real length of the data which is a bug per specification.
> This has not been an issue todate because all metadatum specified

   To date? Metadata perhaps?

> so far has been 32 bit where aligned and data length are the same width.
> This also includes a bug fix for validating the length of a u16 field.
> But since there is no metadata of size u16 yes we are fine to include it
> here.
>
> While at it get rid of magic numbers
>
> Fixes: ef6980b6

   This tag has a standardized format, including 12-digit SHA1 and the commit
summary enclosed in ("").

> Signed-off-by: Jamal Hadi Salim 

[...]

MBR, Sergei


