Re: [patch net-next] net: sched: move block offload unbind after all chains are flushed

2017-11-03 Thread Jiri Pirko
Fri, Nov 03, 2017 at 07:47:44AM CET, da...@davemloft.net wrote:
>From: Jiri Pirko 
>Date: Thu,  2 Nov 2017 15:07:01 +0100
>
>> From: Jiri Pirko 
>> 
>> Currently, the offload unbind is done before the chains are flushed.
>> That causes the driver to unregister the block callback before it can
>> get all the callback calls done during the flush, leaving the offloaded
>> tps (tcf_proto instances) inside the HW. So fix the order to prevent
>> this situation and restore the original behaviour.
>> 
>> Reported-by: Alexander Duyck 
>> Reported-by: Jakub Kicinski 
>> Signed-off-by: Jiri Pirko 
>
>I'm pretty sure this was my fault during the merge, I had to pick
>whether to do it before or after the offload unbind and I picked the
>latter.

I also had it wrong in the original commit 
8c4083b30e56fc71b0e94c26374b32d95d5ea461

>
>Applied, thank you.

Thanks.
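
For readers following the ordering issue: the corrected teardown is simply
to flush the chains while the block callbacks are still registered, and
only then unbind the offload. A simplified sketch of the intent (not the
exact hunk from cls_api.c):

	list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
		tcf_chain_flush(chain);	/* callbacks still fire for each removed tp */
	tcf_block_offload_unbind(block, q, ei);	/* unregister only afterwards */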


Re: [PATCH net-next] tcp: tcp_fragment() should not assume rtx skbs

2017-11-03 Thread David Miller
From: Eric Dumazet 
Date: Thu, 02 Nov 2017 18:10:03 -0700

> From: Eric Dumazet 
> 
> While stress testing MTU probing, we had crashes in list_del() that we
> root-caused to the fact that tcp_fragment() is unconditionally inserting
> the freshly allocated skb into the tsorted_sent_queue list.
> 
> But this list is supposed to contain only skbs that were actually sent.
> This was mostly harmless until MTU probing was enabled.
> 
> Fortunately we can use the tcp_queue enum added later (but in the same
> Linux version) for the rtx rb-tree to fix the bug.
> 
> Fixes: e2080072ed2d ("tcp: new list for sent but unacked skbs for RACK recovery")
> Signed-off-by: Eric Dumazet 

Applied, thanks Eric.
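
For reference, the gist of the fix (a sketch, not the exact hunk) is to key
the tsorted-list insertion on which queue tcp_fragment() is operating on,
using the tcp_queue enum mentioned above:

	/* Only skbs sitting in the rtx queue belong on tsorted_sent_queue,
	 * so link the freshly allocated skb conditionally.
	 */
	if (tcp_queue == TCP_FRAG_IN_RTX_QUEUE)
		list_add(&buff->tcp_tsorted_anchor, &skb->tcp_tsorted_anchor);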


Re: [PATCH net] xfrm: defer daddr pointer assignment after spi parsing

2017-11-03 Thread Steffen Klassert
On Wed, Nov 01, 2017 at 08:30:49PM +0100, Florian Westphal wrote:
> syzbot reports:
> BUG: KASAN: use-after-free in __xfrm_state_lookup+0x695/0x6b0
> Read of size 4 at addr 8801d434e538 by task syzkaller647520/2991
> [..]
> __xfrm_state_lookup+0x695/0x6b0 net/xfrm/xfrm_state.c:833
> xfrm_state_lookup+0x8a/0x160 net/xfrm/xfrm_state.c:1592
> xfrm_input+0x8e5/0x22f0 net/xfrm/xfrm_input.c:302
> 
> The use-after-free is the ipv4 destination address, which points
> to an skb head area that has been reallocated:
>   pskb_expand_head+0x36b/0x1210 net/core/skbuff.c:1494
>   __pskb_pull_tail+0x14a/0x17c0 net/core/skbuff.c:1877
>   pskb_may_pull include/linux/skbuff.h:2102 [inline]
>   xfrm_parse_spi+0x3d3/0x4d0 net/xfrm/xfrm_input.c:170
>   xfrm_input+0xce2/0x22f0 net/xfrm/xfrm_input.c:291
> 
> So the real bug is that xfrm_parse_spi() uses pskb_may_pull(); for now,
> apply the smaller workaround of making xfrm_input() fetch daddr only
> after the SPI parsing is done.
> 
> Reported-by: syzbot 
> Signed-off-by: Florian Westphal 

Applied, thanks Florian!
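
As a generic illustration of the bug class worked around here (a sketch,
not the xfrm code itself; example_parse() is purely illustrative):
pskb_may_pull() may call pskb_expand_head() and move the skb head, so any
pointer taken into the old head before the pull is stale afterwards.

	static int example_parse(struct sk_buff *skb)
	{
		const struct iphdr *iph;

		if (!pskb_may_pull(skb, sizeof(*iph)))	/* may reallocate the head */
			return -EINVAL;

		iph = ip_hdr(skb);	/* safe: pointer derived after the pull */
		return iph->daddr ? 0 : -EINVAL;
	}

The fix follows the same ordering: xfrm_input() now fetches daddr only
after xfrm_parse_spi() has done its pulling.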


[PATCH net-next 1/2] bnxt_en: fix typo in bnxt_set_coalesce

2017-11-03 Thread Michael Chan
From: Andy Gospodarek 

Recent refactoring of the coalesce settings introduced a typo that prevents
receive settings from being set properly.

Fixes: 18775aa8a91f ("bnxt_en: Reorganize the coalescing parameters.")
Signed-off-by: Andy Gospodarek 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index 5cd1a50..7ce1d4b 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -84,7 +84,7 @@ static int bnxt_set_coalesce(struct net_device *dev,
hw_coal->coal_ticks_irq = coal->rx_coalesce_usecs_irq;
hw_coal->coal_bufs_irq = coal->rx_max_coalesced_frames_irq * mult;
 
-   hw_coal = &bp->rx_coal;
+   hw_coal = &bp->tx_coal;
mult = hw_coal->bufs_per_record;
hw_coal->coal_ticks = coal->tx_coalesce_usecs;
hw_coal->coal_bufs = coal->tx_max_coalesced_frames * mult;
-- 
1.8.3.1



[PATCH net-next 0/2] bnxt_en: Fix IRQ coalescing regressions.

2017-11-03 Thread Michael Chan
There was a typo and a missing guard rail against illegal values in the
recent code cleanup.  Both were reported by Andy Gospodarek.

Andy Gospodarek (1):
  bnxt_en: fix typo in bnxt_set_coalesce

Michael Chan (1):
  bnxt_en: Fix IRQ coalescing regression.

 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 6 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 2 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

-- 
1.8.3.1



[PATCH net-next 2/2] bnxt_en: Fix IRQ coalescing regression.

2017-11-03 Thread Michael Chan
The recent IRQ coalescing cleanup removed a guard rail for the max DMA
buffer coalescing value.  This is a 6-bit value and must not be 0.  We
already have a check for 0, but 64 is equivalent to 0 in that field and
will cause non-stop interrupts.  Fix it by adding the proper check.

Fixes: f8503969d27b ("bnxt_en: Refactor and simplify coalescing code.")
Reported-by: Andy Gospodarek 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index c3dfaa5..4e3d569 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -4548,9 +4548,13 @@ static void bnxt_hwrm_set_coal_params(struct bnxt_coal 
*hw_coal,
 
val = clamp_t(u16, hw_coal->coal_bufs, 1, max);
req->num_cmpl_aggr_int = cpu_to_le16(val);
+
+   /* This is a 6-bit value and must not be 0, or we'll get non stop IRQ */
+   val = min_t(u16, val, 63);
req->num_cmpl_dma_aggr = cpu_to_le16(val);
 
-   val = clamp_t(u16, hw_coal->coal_bufs_irq, 1, max);
+   /* This is a 6-bit value and must not be 0, or we'll get non stop IRQ */
+   val = clamp_t(u16, hw_coal->coal_bufs_irq, 1, 63);
req->num_cmpl_dma_aggr_during_int = cpu_to_le16(val);
 
tmr = BNXT_USEC_TO_COAL_TIMER(hw_coal->coal_ticks);
-- 
1.8.3.1
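
A standalone illustration (not driver code) of why 64 needs the clamp: in a
6-bit field the hardware effectively sees the value modulo 64, so a
requested 64 wraps to 0, i.e. "interrupt on every completion".

	#include <stdio.h>

	int main(void)
	{
		/* demonstrate the 6-bit wraparound described in the commit message */
		for (unsigned int val = 62; val <= 65; val++)
			printf("requested %2u -> 6-bit field %2u\n", val, val & 0x3f);
		return 0;
	}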



linux-next: manual merge of the akpm-current tree with the net-next tree

2017-11-03 Thread Stephen Rothwell
Hi Andrew,

Today's linux-next merge of the akpm-current tree got a conflict in:

  drivers/net/ethernet/netronome/nfp/nfp_net_common.c

between commits:

  5f0ca2fb71e2 ("nfp: handle page allocation failures")
  790a39917183 ("nfp: switch to dev_alloc_page()")
  16f50cda06ae ("nfp: use a counter instead of log message for allocation 
failures")

from the net-next tree and commit:

  0432d14c45bb ("mm: remove __GFP_COLD")

from the akpm-current tree.

I fixed it up (the former series removes the use of alloc_page() from
nfp_net_napi_alloc_one()) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell


[PATCH] net: sched: cls_u32: use bitwise & rather than logical && on n->flags

2017-11-03 Thread Colin King
From: Colin Ian King 

Currently n->flags is tested with the logical && operator rather than the
bitwise & operator. This is incorrect, since this should be a bit-flag
test. Fix it.

Detected by CoverityScan, CID#1460398 ("Logical vs. bitwise operator")

Fixes: 245dc5121a9b ("net: sched: cls_u32: call block callbacks for offload")
Signed-off-by: Colin Ian King 
---
 net/sched/cls_u32.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 86145867b424..2737b71854c9 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -572,7 +572,7 @@ static int u32_replace_hw_knode(struct tcf_proto *tp, 
struct tc_u_knode *n,
n->flags |= TCA_CLS_FLAGS_IN_HW;
}
 
-   if (skip_sw && !(n->flags && TCA_CLS_FLAGS_IN_HW))
+   if (skip_sw && !(n->flags & TCA_CLS_FLAGS_IN_HW))
return -EINVAL;
 
return 0;
-- 
2.14.1
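
A standalone demonstration of the difference (user-space C; the flag values
are local defines mirroring the uapi ones, for illustration only): with &&
the test merely checks that both operands are non-zero, so any node with a
non-empty flags field looks "in hardware" and the -EINVAL path for failed
skip_sw offloads is never taken.

	#include <stdio.h>

	#define TCA_CLS_FLAGS_SKIP_SW	(1 << 1)
	#define TCA_CLS_FLAGS_IN_HW	(1 << 2)

	int main(void)
	{
		unsigned int flags = TCA_CLS_FLAGS_SKIP_SW;	/* offload failed: IN_HW not set */

		printf("logical &&: %d (wrongly looks offloaded)\n",
		       !!(flags && TCA_CLS_FLAGS_IN_HW));
		printf("bitwise  &: %d (correctly not offloaded)\n",
		       !!(flags & TCA_CLS_FLAGS_IN_HW));
		return 0;
	}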



Re: [PATCH] net: mvpp2: add ethtool GOP statistics

2017-11-03 Thread Miquel RAYNAL
Hi Andrew,

Thanks for the review, I forgot to mention this is for net-next, I'll
fix the subject line when sending the v2.

> > +static struct mvpp2_ethtool_statistics mvpp2_ethtool_stats[] = {  
> 
> This can probably be const, and save a few bytes of RAM.

Absolutely.

> 
> > +   { MVPP2_MIB_GOOD_OCTETS_RCVD_LOW, "good_octets_received" },
> > +   { MVPP2_MIB_BAD_OCTETS_RCVD, "bad_octets_received" },
> > +   { MVPP2_MIB_CRC_ERRORS_SENT, "crc_errors_sent" },
> > +   { MVPP2_MIB_UNICAST_FRAMES_RCVD, "unicast_frames_received" },
> > +   { MVPP2_MIB_BROADCAST_FRAMES_RCVD, "broadcast_frames_received" },
> > +   { MVPP2_MIB_MULTICAST_FRAMES_RCVD, "multicast_frames_received" },
> > +   { MVPP2_MIB_FRAMES_64_OCTETS, "frames_64_octets" },
> > +   { MVPP2_MIB_FRAMES_65_TO_127_OCTETS, "frames_65_to_127_octet" },
> > +   { MVPP2_MIB_FRAMES_128_TO_255_OCTETS,

...

> > +static void mvpp2_ethtool_get_stats(struct net_device *dev,
> > +   struct ethtool_stats *stats, u64 *data)
> > +{
> > +   struct mvpp2_port *port = netdev_priv(dev);
> > +
> > +   /* Update statistics for all ports, copy only those actually needed */
> > +   mvpp2_gather_hw_statistics(&port->priv->stats_work.work);
> 
> Shouldn't there be some locking here? What if
> mvpp2_gather_hw_statistic is already running?

You are right, locking is needed when accessing the registers. I added
mutexes; please have a look at their implementation in the v2, as I am
not very familiar with them.
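
For reference, the kind of guard discussed here is a driver-level mutex
held for the duration of the register walk; a minimal sketch (the
gather_stats_lock field and the body are illustrative, not the actual v2
code):

	static void mvpp2_gather_hw_statistics(struct work_struct *work)
	{
		struct delayed_work *dwork = to_delayed_work(work);
		struct mvpp2 *priv = container_of(dwork, struct mvpp2, stats_work);

		mutex_lock(&priv->gather_stats_lock);	/* serialize with ethtool -S */
		/* ... read and accumulate the GOP/MIB counters for all ports ... */
		mutex_unlock(&priv->gather_stats_lock);
	}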

> 
> > @@ -7613,13 +7788,19 @@ static int mvpp2_port_probe(struct platform_device *pdev,
> > port->base = priv->iface_base + MVPP22_GMAC_BASE(port->gop_id);
> > }
> >  
> > -   /* Alloc per-cpu stats */
> > +   /* Alloc per-cpu and ethtool stats */
> > port->stats = netdev_alloc_pcpu_stats(struct mvpp2_pcpu_stats);
> > if (!port->stats) {
> > err = -ENOMEM;
> > goto err_free_irq;
> > }
> >  
> > +   port->ethtool_stats = kzalloc(sizeof(mvpp2_ethtool_stats), GFP_KERNEL);
> 
> devm_ to make the cleanup simpler?

Ok.

> 
> > +   /* This work recall himself within a delay. If the cancellation
> > +* returned a non-zero value, it means a work is still running. In
> > +* that case, use use the flush (returns when the running work will
> > +* be done) and cancel
> 
> One use is enough.
> 
> > +* the new work that was just submitted to the queue but not
> > +* started yet due to the delay.
> > +*/
> > +   if (!cancel_delayed_work(&priv->stats_work)) {
> > +   flush_workqueue(priv->stats_queue);
> > +   cancel_delayed_work(&priv->stats_work);
> > +   }  
> 
> Why is cancel_delayed_work_sync() not enough?

I did not know about the *_sync() version, thanks for pointing it out.


Thank you,
Miquèl
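
The simplification being suggested, as a sketch of the teardown path
(assuming the stats_work field from the quoted patch):

	/* cancel_delayed_work_sync() waits for a running instance to finish
	 * and cancels any pending resubmission, even if the work re-queues
	 * itself, so the flush + double-cancel dance above collapses to:
	 */
	cancel_delayed_work_sync(&priv->stats_work);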


Re: [PATCH net-next V2 1/3] tun: abstract flow steering logic

2017-11-03 Thread Willem de Bruijn
On Thu, Nov 2, 2017 at 12:51 PM, Jason Wang  wrote:
>
>
>> On 2017-11-02 11:45, Michael S. Tsirkin wrote:
>>
>> On Thu, Nov 02, 2017 at 11:43:48AM +0800, Jason Wang wrote:
>>>
>>>
>>> On 2017-11-02 09:11, Willem de Bruijn wrote:

 On Tue, Oct 31, 2017 at 7:32 PM, Jason Wang  wrote:
>
> tun now uses a flow-cache-based automatic queue steering method. This
> may not suffice for all use cases. To extend it to support more flow
> steering policies, this patch abstracts the flow steering logic into
> tun_steering_ops, so that we can declare and use different methods in
> the future.
> Signed-off-by: Jason Wang 
> ---
>drivers/net/tun.c | 85
> +--
>1 file changed, 63 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index ea29da9..bff6259 100644

 The previous RFC enabled support for multiple pluggable steering
 policies. But as all can be implemented in BPF and we only plan to
 support an eBPF policy besides the legacy one, this patch is no longer
 needed. We can save a few indirect function calls.
>>>
>>> But we should at least support two kinds of steering policy, so this is
>>> still needed?
>>>
>>> And I'm not quite sure we can implement all kinds of policies through BPF
>>> e.g RSS or we may want to offload the queue selection to underlayer
>>> switch
>>> or nic .
>>>
>>> Thanks
>>
>> I think a simple if condition is preferable for now, too. Let's wait
>> until we have some 3 or 4 of these.
>>
>
> That's a solution, but we may need an if in at least four places. If this
> is OK, I will do it in the next version.

That sounds good to me.

I only see three places that need to be modified, the callback sites
that this patch introduces. Strictly speaking, it's not even necessary
to forgo the rxhash operations, though that is a nice optimization.


[iproute2 PATCH] flower: Represent HW traffic classes as classid values

2017-11-03 Thread Amritha Nambiar
This patch was previously submitted as an RFC. Submitting it as
non-RFC now that the classid reservation scheme for hardware
traffic classes and the offloads to route packets to a hardware
traffic class have been accepted in net-next.

HW traffic classes 0 through 15 are represented using the
reserved classid values :ffe0 - :ffef.

Example:
Match Dst IPv4,Dst Port and route to TC1:
# tc filter add dev eth0 protocol ip parent :\
  prio 1 flower dst_ip 192.168.1.1/32\
  ip_proto udp dst_port 12000 skip_sw\
  hw_tc 1

# tc filter show dev eth0 parent :
filter pref 1 flower chain 0
filter pref 1 flower chain 0 handle 0x1 hw_tc 1
  eth_type ipv4
  ip_proto udp
  dst_ip 192.168.1.1
  dst_port 12000
  skip_sw
  in_hw

Signed-off-by: Amritha Nambiar 
---
 include/uapi/linux/pkt_sched.h |1 +
 tc/f_flower.c  |   33 +
 2 files changed, 30 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index e95b5c9..e7cc3d3 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -74,6 +74,7 @@ struct tc_estimator {
 #define TC_H_INGRESS(0xFFF1U)
 #define TC_H_CLSACTTC_H_INGRESS
 
+#define TC_H_MIN_PRIORITY  0xFFE0U
 #define TC_H_MIN_INGRESS   0xFFF2U
 #define TC_H_MIN_EGRESS0xFFF3U
 
diff --git a/tc/f_flower.c b/tc/f_flower.c
index b180210..a72c512 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -614,6 +614,25 @@ static int flower_parse_opt(struct filter_util *qu, char 
*handle,
return -1;
}
addattr_l(n, MAX_MSG, TCA_FLOWER_CLASSID, &handle, 4);
+   } else if (matches(*argv, "hw_tc") == 0) {
+   unsigned int handle;
+   __u32 tc;
+   char *end;
+
+   NEXT_ARG();
+   tc = strtoul(*argv, &end, 0);
+   if (*end) {
+   fprintf(stderr, "Illegal TC index\n");
+   return -1;
+   }
+   if (tc >= TC_QOPT_MAX_QUEUE) {
+   fprintf(stderr, "TC index exceeds max range\n");
+   return -1;
+   }
+   handle = TC_H_MAKE(TC_H_MAJ(t->tcm_parent),
+  TC_H_MIN(tc + TC_H_MIN_PRIORITY));
+   addattr_l(n, MAX_MSG, TCA_FLOWER_CLASSID, &handle,
+ sizeof(handle));
} else if (matches(*argv, "ip_flags") == 0) {
NEXT_ARG();
ret = flower_parse_matching_flags(*argv,
@@ -1187,10 +1206,16 @@ static int flower_print_opt(struct filter_util *qu, 
FILE *f,
fprintf(f, "handle 0x%x ", handle);
 
if (tb[TCA_FLOWER_CLASSID]) {
-   SPRINT_BUF(b1);
-   fprintf(f, "classid %s ",
-   
sprint_tc_classid(rta_getattr_u32(tb[TCA_FLOWER_CLASSID]),
- b1));
+   __u32 h = rta_getattr_u32(tb[TCA_FLOWER_CLASSID]);
+
+   if (TC_H_MIN(h) < TC_H_MIN_PRIORITY ||
+   TC_H_MIN(h) > (TC_H_MIN_PRIORITY + TC_QOPT_MAX_QUEUE - 1)) {
+   SPRINT_BUF(b1);
+   fprintf(f, "classid %s ", sprint_tc_classid(h, b1));
+   } else {
+   fprintf(f, "hw_tc %u ",
+   TC_H_MIN(h) - TC_H_MIN_PRIORITY);
+   }
}
 
if (tb[TCA_FLOWER_INDEV]) {
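
A quick standalone check of the mapping described above (user-space C; the
TC_H_* values are local defines mirroring pkt_sched.h, for illustration
only): hw_tc N corresponds to classid minor 0xffe0 + N.

	#include <stdio.h>

	#define TC_H_MIN_PRIORITY	0xFFE0U
	#define TC_H_MIN_MASK		0x0000FFFFU
	#define TC_H_MIN(h)		((h) & TC_H_MIN_MASK)

	int main(void)
	{
		unsigned int tc = 1;
		unsigned int minor = TC_H_MIN(tc + TC_H_MIN_PRIORITY);

		printf("hw_tc %u -> classid minor 0x%x\n", tc, minor);
		printf("classid minor 0x%x -> hw_tc %u\n", minor,
		       minor - TC_H_MIN_PRIORITY);
		return 0;
	}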



Re: [PATCH net-next V2 3/3] tun: add eBPF based queue selection method

2017-11-03 Thread Willem de Bruijn
On Tue, Oct 31, 2017 at 7:32 PM, Jason Wang  wrote:
> This patch introduces an eBPF based queue selection method built on
> the flow steering policy ops. Userspace can load an eBPF program
> through TUNSETSTEERINGEBPF. This gives much more flexibility compared
> to the simple but hard-coded policy in the kernel.
>
> Signed-off-by: Jason Wang 
> ---
> +static int tun_set_steering_ebpf(struct tun_struct *tun, void __user *data)
> +{
> +   struct bpf_prog *prog;
> +   u32 fd;
> +
> +   if (copy_from_user(&fd, data, sizeof(fd)))
> +   return -EFAULT;
> +
> +   prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER);

If the idea is to allow guests to pass BPF programs down to the host,
you may want to define a new program type that is more restrictive than
socket filter.

The external functions allowed for socket filters (sk_filter_func_proto)
are relatively few (compared to, say, clsact), but may still leak host
information to a guest. More importantly, guest security considerations
limit how we can extend socket filters later.
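
A rough user-space sketch of the interface under discussion; the
TUNSETSTEERINGEBPF ioctl, its request value and its semantics are taken
from the quoted patch and assumed here, not from a released uapi header:

	#include <sys/ioctl.h>
	#include <linux/ioctl.h>

	#ifndef TUNSETSTEERINGEBPF
	#define TUNSETSTEERINGEBPF _IOR('T', 224, int)	/* assumed value */
	#endif

	/* prog_fd refers to a loaded BPF_PROG_TYPE_SOCKET_FILTER program;
	 * per the patch description, the driver runs it per packet and uses
	 * its return value to pick the queue.
	 */
	static int tun_set_steering_prog(int tun_fd, int prog_fd)
	{
		return ioctl(tun_fd, TUNSETSTEERINGEBPF, &prog_fd);
	}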


[patch net-next 03/16] mlxsw: spectrum: Move mlxsw_sp_ipip_netdev_{s,d}addr{,4}()

2017-11-03 Thread Jiri Pirko
From: Petr Machata 

These functions ideologically belong to the IPIP module, and some
follow-up work will benefit from their presence there.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_ipip.c| 53 ++
 .../net/ethernet/mellanox/mlxsw/spectrum_ipip.h|  4 ++
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 53 --
 .../net/ethernet/mellanox/mlxsw/spectrum_router.h  |  7 ---
 4 files changed, 57 insertions(+), 60 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c
index 702fe94..8a9fbb6 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c
@@ -68,6 +68,59 @@ static u32 mlxsw_sp_ipip_netdev_okey(const struct net_device 
*ol_dev)
be32_to_cpu(tun->parms.o_key) : 0;
 }
 
+static __be32
+mlxsw_sp_ipip_netdev_saddr4(const struct net_device *ol_dev)
+{
+   struct ip_tunnel *tun = netdev_priv(ol_dev);
+
+   return tun->parms.iph.saddr;
+}
+
+union mlxsw_sp_l3addr
+mlxsw_sp_ipip_netdev_saddr(enum mlxsw_sp_l3proto proto,
+  const struct net_device *ol_dev)
+{
+   switch (proto) {
+   case MLXSW_SP_L3_PROTO_IPV4:
+   return (union mlxsw_sp_l3addr) {
+   .addr4 = mlxsw_sp_ipip_netdev_saddr4(ol_dev),
+   };
+   case MLXSW_SP_L3_PROTO_IPV6:
+   break;
+   }
+
+   WARN_ON(1);
+   return (union mlxsw_sp_l3addr) {
+   .addr4 = 0,
+   };
+}
+
+static __be32 mlxsw_sp_ipip_netdev_daddr4(const struct net_device *ol_dev)
+{
+   struct ip_tunnel *tun = netdev_priv(ol_dev);
+
+   return tun->parms.iph.daddr;
+}
+
+static union mlxsw_sp_l3addr
+mlxsw_sp_ipip_netdev_daddr(enum mlxsw_sp_l3proto proto,
+  const struct net_device *ol_dev)
+{
+   switch (proto) {
+   case MLXSW_SP_L3_PROTO_IPV4:
+   return (union mlxsw_sp_l3addr) {
+   .addr4 = mlxsw_sp_ipip_netdev_daddr4(ol_dev),
+   };
+   case MLXSW_SP_L3_PROTO_IPV6:
+   break;
+   }
+
+   WARN_ON(1);
+   return (union mlxsw_sp_l3addr) {
+   .addr4 = 0,
+   };
+}
+
 static int
 mlxsw_sp_ipip_nexthop_update_gre4(struct mlxsw_sp *mlxsw_sp, u32 adj_index,
  struct mlxsw_sp_ipip_entry *ipip_entry)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.h 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.h
index 6fb4912..87becd1 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.h
@@ -38,6 +38,10 @@
 #include "spectrum_router.h"
 #include 
 
+union mlxsw_sp_l3addr
+mlxsw_sp_ipip_netdev_saddr(enum mlxsw_sp_l3proto proto,
+  const struct net_device *ol_dev);
+
 enum mlxsw_sp_ipip_type {
MLXSW_SP_IPIP_TYPE_GRE4,
MLXSW_SP_IPIP_TYPE_MAX,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 97f062a..ec90c6b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1016,59 +1016,6 @@ mlxsw_sp_ipip_entry_dealloc(struct mlxsw_sp_ipip_entry 
*ipip_entry)
kfree(ipip_entry);
 }
 
-static __be32
-mlxsw_sp_ipip_netdev_saddr4(const struct net_device *ol_dev)
-{
-   struct ip_tunnel *tun = netdev_priv(ol_dev);
-
-   return tun->parms.iph.saddr;
-}
-
-union mlxsw_sp_l3addr
-mlxsw_sp_ipip_netdev_saddr(enum mlxsw_sp_l3proto proto,
-  const struct net_device *ol_dev)
-{
-   switch (proto) {
-   case MLXSW_SP_L3_PROTO_IPV4:
-   return (union mlxsw_sp_l3addr) {
-   .addr4 = mlxsw_sp_ipip_netdev_saddr4(ol_dev),
-   };
-   case MLXSW_SP_L3_PROTO_IPV6:
-   break;
-   };
-
-   WARN_ON(1);
-   return (union mlxsw_sp_l3addr) {
-   .addr4 = 0,
-   };
-}
-
-__be32 mlxsw_sp_ipip_netdev_daddr4(const struct net_device *ol_dev)
-{
-   struct ip_tunnel *tun = netdev_priv(ol_dev);
-
-   return tun->parms.iph.daddr;
-}
-
-union mlxsw_sp_l3addr
-mlxsw_sp_ipip_netdev_daddr(enum mlxsw_sp_l3proto proto,
-  const struct net_device *ol_dev)
-{
-   switch (proto) {
-   case MLXSW_SP_L3_PROTO_IPV4:
-   return (union mlxsw_sp_l3addr) {
-   .addr4 = mlxsw_sp_ipip_netdev_daddr4(ol_dev),
-   };
-   case MLXSW_SP_L3_PROTO_IPV6:
-   break;
-   };
-
-   WARN_ON(1);
-   return (union mlxsw_sp_l3addr) {
-   .addr4 = 0,
-   };
-}
-
 static bool mlxsw_sp_l3addr_eq(const union mlxsw_sp_l3addr *addr1,
   const union 

[patch net-next 14/16] mlxsw: spectrum: Handle NETDEV_CHANGE on L3 tunnels

2017-11-03 Thread Jiri Pirko
From: Petr Machata 

Changes to L3 tunnel netdevices (through `ip tunnel change' as well as
`ip link set') lead to NETDEV_CHANGE being generated on the tunnel
device. Because what is relevant for the tunnel in question depends on
the tunnel type, handling of the event is dispatched to the IPIP module
through a newly-added interface mlxsw_sp_ipip_ops.ol_netdev_change().

IPIP tunnels now remember the last set of tunnel parameters in struct
mlxsw_sp_ipip_entry.parms, and use it to figure out what exactly has
changed.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_ipip.c| 67 ++
 .../net/ethernet/mellanox/mlxsw/spectrum_ipip.h|  5 ++
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 42 +++---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.h  |  7 +++
 4 files changed, 114 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c
index 1850080..5f78fc5 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c
@@ -287,6 +287,72 @@ mlxsw_sp_ipip_ol_loopback_config_gre4(struct mlxsw_sp 
*mlxsw_sp,
};
 }
 
+static int
+mlxsw_sp_ipip_ol_netdev_change_gre4(struct mlxsw_sp *mlxsw_sp,
+   struct mlxsw_sp_ipip_entry *ipip_entry,
+   struct netlink_ext_ack *extack)
+{
+   union mlxsw_sp_l3addr old_saddr, new_saddr;
+   union mlxsw_sp_l3addr old_daddr, new_daddr;
+   struct ip_tunnel_parm new_parms;
+   bool update_tunnel = false;
+   bool update_decap = false;
+   bool update_nhs = false;
+   int err = 0;
+
+   new_parms = mlxsw_sp_ipip_netdev_parms(ipip_entry->ol_dev);
+
+   new_saddr = mlxsw_sp_ipip_parms_saddr(MLXSW_SP_L3_PROTO_IPV4,
+ new_parms);
+   old_saddr = mlxsw_sp_ipip_parms_saddr(MLXSW_SP_L3_PROTO_IPV4,
+ ipip_entry->parms);
+   new_daddr = mlxsw_sp_ipip_parms_daddr(MLXSW_SP_L3_PROTO_IPV4,
+ new_parms);
+   old_daddr = mlxsw_sp_ipip_parms_daddr(MLXSW_SP_L3_PROTO_IPV4,
+ ipip_entry->parms);
+
+   if (!mlxsw_sp_l3addr_eq(&new_saddr, &old_saddr)) {
+   u16 ul_tb_id = mlxsw_sp_ipip_dev_ul_tb_id(ipip_entry->ol_dev);
+
+   /* Since the local address has changed, if there is another
+* tunnel with a matching saddr, both need to be demoted.
+*/
+   if (mlxsw_sp_ipip_demote_tunnel_by_saddr(mlxsw_sp,
+MLXSW_SP_L3_PROTO_IPV4,
+new_saddr, ul_tb_id,
+ipip_entry)) {
+   mlxsw_sp_ipip_entry_demote_tunnel(mlxsw_sp, ipip_entry);
+   return 0;
+   }
+
+   update_tunnel = true;
+   } else if (mlxsw_sp_ipip_parms_okey(ipip_entry->parms) !=
+  mlxsw_sp_ipip_parms_okey(new_parms)) {
+   update_tunnel = true;
+   } else if (!mlxsw_sp_l3addr_eq(&new_daddr, &old_daddr)) {
+   update_nhs = true;
+   } else if (mlxsw_sp_ipip_parms_ikey(ipip_entry->parms) !=
+  mlxsw_sp_ipip_parms_ikey(new_parms)) {
+   update_decap = true;
+   }
+
+   if (update_tunnel)
+   err = __mlxsw_sp_ipip_entry_update_tunnel(mlxsw_sp, ipip_entry,
+ true, true, true,
+ extack);
+   else if (update_nhs)
+   err = __mlxsw_sp_ipip_entry_update_tunnel(mlxsw_sp, ipip_entry,
+ false, false, true,
+ extack);
+   else if (update_decap)
+   err = __mlxsw_sp_ipip_entry_update_tunnel(mlxsw_sp, ipip_entry,
+ false, false, false,
+ extack);
+
+   ipip_entry->parms = new_parms;
+   return err;
+}
+
 static const struct mlxsw_sp_ipip_ops mlxsw_sp_ipip_gre4_ops = {
.dev_type = ARPHRD_IPGRE,
.ul_proto = MLXSW_SP_L3_PROTO_IPV4,
@@ -294,6 +360,7 @@ static const struct mlxsw_sp_ipip_ops 
mlxsw_sp_ipip_gre4_ops = {
.fib_entry_op = mlxsw_sp_ipip_fib_entry_op_gre4,
.can_offload = mlxsw_sp_ipip_can_offload_gre4,
.ol_loopback_config = mlxsw_sp_ipip_ol_loopback_config_gre4,
+   .ol_netdev_change = mlxsw_sp_ipip_ol_netdev_change_gre4,
 };
 
 const struct mlxsw_sp_ipip_ops *mlxsw_sp_ipip_ops_arr[] = {
diff --git a/

[patch net-next 15/16] mlxsw: spectrum_ipip: Handle underlay device change

2017-11-03 Thread Jiri Pirko
From: Petr Machata 

When a bound device of an IP-in-IP tunnel changes, such as through
'ip tunnel change name $name dev $dev', the loopback backing the tunnel
needs to be recreated.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c
index 5f78fc5..7502e53 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c
@@ -326,8 +326,9 @@ mlxsw_sp_ipip_ol_netdev_change_gre4(struct mlxsw_sp 
*mlxsw_sp,
}
 
update_tunnel = true;
-   } else if (mlxsw_sp_ipip_parms_okey(ipip_entry->parms) !=
-  mlxsw_sp_ipip_parms_okey(new_parms)) {
+   } else if ((mlxsw_sp_ipip_parms_okey(ipip_entry->parms) !=
+   mlxsw_sp_ipip_parms_okey(new_parms)) ||
+  ipip_entry->parms.link != new_parms.link) {
update_tunnel = true;
} else if (!mlxsw_sp_l3addr_eq(&new_daddr, &old_daddr)) {
update_nhs = true;
-- 
2.9.5



[patch net-next 13/16] mlxsw: spectrum: Support IPIP underlay VRF migration

2017-11-03 Thread Jiri Pirko
From: Petr Machata 

When a bound device of a tunnel netdevice changes VRF, the loopback RIF
that backs the tunnel needs to be updated and existing encapsulating
routes need to be refreshed.

Note that several tunnels can share the same bound device, in which case
all the impacted tunnels need to be updated.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |   3 +
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |   7 ++
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 100 +
 3 files changed, 110 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 55bb366..63e5087 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -4545,6 +4545,9 @@ static int mlxsw_sp_netdevice_event(struct notifier_block 
*nb,
if (mlxsw_sp_netdev_is_ipip_ol(mlxsw_sp, dev))
err = mlxsw_sp_netdevice_ipip_ol_event(mlxsw_sp, dev,
   event, ptr);
+   else if (mlxsw_sp_netdev_is_ipip_ul(mlxsw_sp, dev))
+   err = mlxsw_sp_netdevice_ipip_ul_event(mlxsw_sp, dev,
+  event, ptr);
else if (event == NETDEV_CHANGEADDR || event == NETDEV_CHANGEMTU)
err = mlxsw_sp_netdevice_router_port_event(dev);
else if (mlxsw_sp_is_vrf_event(event, ptr))
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index 07cba52..47dd7e0 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -398,10 +398,17 @@ int mlxsw_sp_netdevice_vrf_event(struct net_device 
*l3_dev, unsigned long event,
 struct netdev_notifier_changeupper_info *info);
 bool mlxsw_sp_netdev_is_ipip_ol(const struct mlxsw_sp *mlxsw_sp,
const struct net_device *dev);
+bool mlxsw_sp_netdev_is_ipip_ul(const struct mlxsw_sp *mlxsw_sp,
+   const struct net_device *dev);
 int mlxsw_sp_netdevice_ipip_ol_event(struct mlxsw_sp *mlxsw_sp,
 struct net_device *l3_dev,
 unsigned long event,
 struct netdev_notifier_info *info);
+int
+mlxsw_sp_netdevice_ipip_ul_event(struct mlxsw_sp *mlxsw_sp,
+struct net_device *l3_dev,
+unsigned long event,
+struct netdev_notifier_info *info);
 void
 mlxsw_sp_port_vlan_router_leave(struct mlxsw_sp_port_vlan *mlxsw_sp_port_vlan);
 void mlxsw_sp_rif_destroy(struct mlxsw_sp_rif *rif);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 832bfa1..aa7b820 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1257,6 +1257,33 @@ mlxsw_sp_ipip_entry_find_by_ol_dev(struct mlxsw_sp 
*mlxsw_sp,
return NULL;
 }
 
+static struct mlxsw_sp_ipip_entry *
+mlxsw_sp_ipip_entry_find_by_ul_dev(const struct mlxsw_sp *mlxsw_sp,
+  const struct net_device *ul_dev,
+  struct mlxsw_sp_ipip_entry *start)
+{
+   struct mlxsw_sp_ipip_entry *ipip_entry;
+
+   ipip_entry = list_prepare_entry(start, &mlxsw_sp->router->ipip_list,
+   ipip_list_node);
+   list_for_each_entry_continue(ipip_entry, &mlxsw_sp->router->ipip_list,
+ipip_list_node) {
+   struct net_device *ipip_ul_dev =
+   __mlxsw_sp_ipip_netdev_ul_dev_get(ipip_entry->ol_dev);
+
+   if (ipip_ul_dev == ul_dev)
+   return ipip_entry;
+   }
+
+   return NULL;
+}
+
+bool mlxsw_sp_netdev_is_ipip_ul(const struct mlxsw_sp *mlxsw_sp,
+   const struct net_device *dev)
+{
+   return mlxsw_sp_ipip_entry_find_by_ul_dev(mlxsw_sp, dev, NULL);
+}
+
 static bool mlxsw_sp_netdevice_ipip_can_offload(struct mlxsw_sp *mlxsw_sp,
const struct net_device *ol_dev,
enum mlxsw_sp_ipip_type ipipt)
@@ -1434,6 +1461,16 @@ static int mlxsw_sp_netdevice_ipip_ol_vrf_event(struct 
mlxsw_sp *mlxsw_sp,
   true, false, false, extack);
 }
 
+static int
+mlxsw_sp_netdevice_ipip_ul_vrf_event(struct mlxsw_sp *mlxsw_sp,
+struct mlxsw_sp_ipip_entry *ipip_entry,
+struct net_device *ul_dev,
+struct netlink_ext_ack

[patch net-next 07/16] mlxsw: spectrum_router: Extract mlxsw_sp_ipip_entry_ol_up_event()

2017-11-03 Thread Jiri Pirko
From: Petr Machata 

The piece of logic to promote decap route, if any, is useful for generic
tunnel updates, not just for handling of NETDEV_UP events on tunnel
interfaces. Extract it to a separate function.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 23 +-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 2b05f9f..ce0d462 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1314,21 +1314,26 @@ static void 
mlxsw_sp_netdevice_ipip_ol_unreg_event(struct mlxsw_sp *mlxsw_sp,
mlxsw_sp_ipip_entry_destroy(mlxsw_sp, ipip_entry);
 }
 
+static void
+mlxsw_sp_ipip_entry_ol_up_event(struct mlxsw_sp *mlxsw_sp,
+   struct mlxsw_sp_ipip_entry *ipip_entry)
+{
+   struct mlxsw_sp_fib_entry *decap_fib_entry;
+
+   decap_fib_entry = mlxsw_sp_ipip_entry_find_decap(mlxsw_sp, ipip_entry);
+   if (decap_fib_entry)
+   mlxsw_sp_ipip_entry_promote_decap(mlxsw_sp, ipip_entry,
+ decap_fib_entry);
+}
+
 static void mlxsw_sp_netdevice_ipip_ol_up_event(struct mlxsw_sp *mlxsw_sp,
struct net_device *ol_dev)
 {
-   struct mlxsw_sp_fib_entry *decap_fib_entry;
struct mlxsw_sp_ipip_entry *ipip_entry;
 
ipip_entry = mlxsw_sp_ipip_entry_find_by_ol_dev(mlxsw_sp, ol_dev);
-   if (ipip_entry) {
-   decap_fib_entry = mlxsw_sp_ipip_entry_find_decap(mlxsw_sp,
-ipip_entry);
-   if (decap_fib_entry)
-   mlxsw_sp_ipip_entry_promote_decap(mlxsw_sp, ipip_entry,
- decap_fib_entry);
-   }
-
+   if (ipip_entry)
+   mlxsw_sp_ipip_entry_ol_up_event(mlxsw_sp, ipip_entry);
 }
 
 static void
-- 
2.9.5



[patch net-next 16/16] mlxsw: spectrum_router: Handle down of tunnel underlay

2017-11-03 Thread Jiri Pirko
From: Petr Machata 

When the bound device of a tunnel device is down, encapsulated packets
are not egressed anymore, but tunnel decap still works. Extend
mlxsw_sp_nexthop_rif_update() to take IFF_UP into consideration when
deciding whether a given next hop should be offloaded.

Because the new logic was added to mlxsw_sp_nexthop_rif_update(), this
fixes the case where a newly-added tunnel has a down bound device, which
would previously be fully offloaded. Now the down state of the bound
device is noted and next hops forwarding to such tunnel are not
offloaded.

In addition to that, notice NETDEV_UP and NETDEV_DOWN of a bound device
to force refresh of tunnel encap route offloads.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 57 +-
 1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index c1928561..e918784 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1467,6 +1467,28 @@ mlxsw_sp_netdevice_ipip_ul_vrf_event(struct mlxsw_sp 
*mlxsw_sp,
 }
 
 static int
+mlxsw_sp_netdevice_ipip_ul_up_event(struct mlxsw_sp *mlxsw_sp,
+   struct mlxsw_sp_ipip_entry *ipip_entry,
+   struct net_device *ul_dev)
+{
+   return __mlxsw_sp_ipip_entry_update_tunnel(mlxsw_sp, ipip_entry,
+  false, false, true, NULL);
+}
+
+static int
+mlxsw_sp_netdevice_ipip_ul_down_event(struct mlxsw_sp *mlxsw_sp,
+ struct mlxsw_sp_ipip_entry *ipip_entry,
+ struct net_device *ul_dev)
+{
+   /* A down underlay device causes encapsulated packets to not be
+* forwarded, but decap still works. So refresh next hops without
+* touching anything else.
+*/
+   return __mlxsw_sp_ipip_entry_update_tunnel(mlxsw_sp, ipip_entry,
+  false, false, true, NULL);
+}
+
+static int
 mlxsw_sp_netdevice_ipip_ol_change_event(struct mlxsw_sp *mlxsw_sp,
struct net_device *ol_dev,
struct netlink_ext_ack *extack)
@@ -1604,6 +1626,14 @@ __mlxsw_sp_netdevice_ipip_ul_event(struct mlxsw_sp 
*mlxsw_sp,
ul_dev,
extack);
break;
+
+   case NETDEV_UP:
+   return mlxsw_sp_netdevice_ipip_ul_up_event(mlxsw_sp, ipip_entry,
+  ul_dev);
+   case NETDEV_DOWN:
+   return mlxsw_sp_netdevice_ipip_ul_down_event(mlxsw_sp,
+ipip_entry,
+ul_dev);
}
return 0;
 }
@@ -3297,10 +3327,19 @@ static void mlxsw_sp_nexthop_neigh_fini(struct mlxsw_sp 
*mlxsw_sp,
neigh_release(n);
 }
 
+static bool mlxsw_sp_ipip_netdev_ul_up(struct net_device *ol_dev)
+{
+   struct net_device *ul_dev = __mlxsw_sp_ipip_netdev_ul_dev_get(ol_dev);
+
+   return ul_dev ? (ul_dev->flags & IFF_UP) : true;
+}
+
 static int mlxsw_sp_nexthop_ipip_init(struct mlxsw_sp *mlxsw_sp,
  struct mlxsw_sp_nexthop *nh,
  struct net_device *ol_dev)
 {
+   bool removing;
+
if (!nh->nh_grp->gateway || nh->ipip_entry)
return 0;
 
@@ -3308,7 +3347,8 @@ static int mlxsw_sp_nexthop_ipip_init(struct mlxsw_sp 
*mlxsw_sp,
if (!nh->ipip_entry)
return -ENOENT;
 
-   __mlxsw_sp_nexthop_neigh_update(nh, false);
+   removing = !mlxsw_sp_ipip_netdev_ul_up(ol_dev);
+   __mlxsw_sp_nexthop_neigh_update(nh, removing);
return 0;
 }
 
@@ -3476,9 +3516,22 @@ static void mlxsw_sp_nexthop_rif_update(struct mlxsw_sp 
*mlxsw_sp,
struct mlxsw_sp_rif *rif)
 {
struct mlxsw_sp_nexthop *nh;
+   bool removing;
 
list_for_each_entry(nh, &rif->nexthop_list, rif_list_node) {
-   __mlxsw_sp_nexthop_neigh_update(nh, false);
+   switch (nh->type) {
+   case MLXSW_SP_NEXTHOP_TYPE_ETH:
+   removing = false;
+   break;
+   case MLXSW_SP_NEXTHOP_TYPE_IPIP:
+   removing = !mlxsw_sp_ipip_netdev_ul_up(rif->dev);
+   break;
+   default:
+   WARN_ON(1);
+   continue;
+   }
+
+   __mlxsw_sp_nexthop_neigh_update(nh, removing);
mlxsw_sp_next

[patch net-next 09/16] mlxsw: spectrum_router: Extract __mlxsw_sp_ipip_entry_update_tunnel()

2017-11-03 Thread Jiri Pirko
From: Petr Machata 

The work that's done by mlxsw_sp_netdevice_ipip_ol_vrf_event() is a good
basis for a more versatile function that would take care of all sorts of
tunnel update requests: __mlxsw_sp_ipip_entry_update_tunnel(). Extract
that function. Factor out a helper, mlxsw_sp_ipip_entry_ol_lb_update(), as
well.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 74 ++
 1 file changed, 46 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index c4f1881..e2795b8 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1355,46 +1355,64 @@ static void 
mlxsw_sp_netdevice_ipip_ol_down_event(struct mlxsw_sp *mlxsw_sp,
mlxsw_sp_ipip_entry_ol_down_event(mlxsw_sp, ipip_entry);
 }
 
-static int mlxsw_sp_netdevice_ipip_ol_vrf_event(struct mlxsw_sp *mlxsw_sp,
-   struct net_device *ol_dev,
-   struct netlink_ext_ack *extack)
-{
-   struct mlxsw_sp_fib_entry *decap_fib_entry;
-   struct mlxsw_sp_ipip_entry *ipip_entry;
-   struct mlxsw_sp_rif_ipip_lb *lb_rif;
+static int
+mlxsw_sp_ipip_entry_ol_lb_update(struct mlxsw_sp *mlxsw_sp,
+struct mlxsw_sp_ipip_entry *ipip_entry,
+struct netlink_ext_ack *extack)
+{
+   struct mlxsw_sp_rif_ipip_lb *old_lb_rif = ipip_entry->ol_lb;
+   struct mlxsw_sp_rif_ipip_lb *new_lb_rif;
+
+   new_lb_rif = mlxsw_sp_ipip_ol_ipip_lb_create(mlxsw_sp,
+ipip_entry->ipipt,
+ipip_entry->ol_dev,
+extack);
+   if (IS_ERR(new_lb_rif))
+   return PTR_ERR(new_lb_rif);
+   ipip_entry->ol_lb = new_lb_rif;
+   mlxsw_sp_rif_destroy(&old_lb_rif->common);
 
-   ipip_entry = mlxsw_sp_ipip_entry_find_by_ol_dev(mlxsw_sp, ol_dev);
-   if (!ipip_entry)
-   return 0;
+   return 0;
+}
+
+int __mlxsw_sp_ipip_entry_update_tunnel(struct mlxsw_sp *mlxsw_sp,
+   struct mlxsw_sp_ipip_entry *ipip_entry,
+   struct netlink_ext_ack *extack)
+{
+   int err;
 
-   /* When a tunneling device is moved to a different VRF, we need to
-* update the backing loopback. Since RIFs can't be edited, we need to
-* destroy and recreate it. That might create a window of opportunity
-* where RALUE and RATR registers end up referencing a RIF that's
-* already gone. RATRs are handled by the RIF destroy, and to take care
+   /* RIFs can't be edited, so to update loopback, we need to destroy and
+* recreate it. That creates a window of opportunity where RALUE and
+* RATR registers end up referencing a RIF that's already gone. RATRs
+* are handled in mlxsw_sp_ipip_entry_ol_lb_update(), and to take care
 * of RALUE, demote the decap route back.
 */
if (ipip_entry->decap_fib_entry)
mlxsw_sp_ipip_entry_demote_decap(mlxsw_sp, ipip_entry);
 
-   lb_rif = mlxsw_sp_ipip_ol_ipip_lb_create(mlxsw_sp, ipip_entry->ipipt,
-ol_dev, extack);
-   if (IS_ERR(lb_rif))
-   return PTR_ERR(lb_rif);
-   mlxsw_sp_rif_destroy(&ipip_entry->ol_lb->common);
-   ipip_entry->ol_lb = lb_rif;
+   err = mlxsw_sp_ipip_entry_ol_lb_update(mlxsw_sp, ipip_entry, extack);
+   if (err)
+   return err;
 
-   if (ol_dev->flags & IFF_UP) {
-   decap_fib_entry = mlxsw_sp_ipip_entry_find_decap(mlxsw_sp,
-ipip_entry);
-   if (decap_fib_entry)
-   mlxsw_sp_ipip_entry_promote_decap(mlxsw_sp, ipip_entry,
- decap_fib_entry);
-   }
+   if (ipip_entry->ol_dev->flags & IFF_UP)
+   mlxsw_sp_ipip_entry_ol_up_event(mlxsw_sp, ipip_entry);
 
return 0;
 }
 
+static int mlxsw_sp_netdevice_ipip_ol_vrf_event(struct mlxsw_sp *mlxsw_sp,
+   struct net_device *ol_dev,
+   struct netlink_ext_ack *extack)
+{
+   struct mlxsw_sp_ipip_entry *ipip_entry =
+   mlxsw_sp_ipip_entry_find_by_ol_dev(mlxsw_sp, ol_dev);
+
+   if (!ipip_entry)
+   return 0;
+   return __mlxsw_sp_ipip_entry_update_tunnel(mlxsw_sp, ipip_entry,
+  extack);
+}
+
 int mlxsw_sp_netdevice_ipip_ol_event(struct mlxsw_sp *mlxsw_sp,
  

[patch net-next 12/16] mlxsw: spectrum_router: Onload conflicting tunnels

2017-11-03 Thread Jiri Pirko
From: Petr Machata 

The approach for offloading IP tunnels implemented currently by mlxsw
doesn't allow two tunnels that have the same local IP address in the
same (underlay) VRF. Previously, offloads were introduced on demand as
encap routes were formed. When such a route was created that would cause
offload of a conflicting tunnel, mlxsw_sp_ipip_entry_create() would
detect it and return -EEXIST, which would propagate up and cause FIB
abort.

Now, however, IPIP entries are created as soon as an offloadable netdevice
is created, and the failure prevents the creation of such a device.
Furthermore, if the driver is installed at a point where such
conflicting tunnels already exist, the failure actually prevents a
successful modprobe.

Furthermore, follow-up patches implement handling of NETDEV_CHANGE due
to the local address change. However, NETDEV_CHANGE can't be vetoed. The
failure merely means that the offloads weren't updated, but the change
in Linux configuration is not rolled back. It is thus desirable to have
a robust way of handling these conflicts, which can later be reused for
handling NETDEV_CHANGE as well.

To fix this, when a conflicting tunnel is created, instead of failing,
simply pull the old tunnel to slow path and reject offloading the
new one.

Introduce two functions: mlxsw_sp_ipip_entry_demote_tunnel() and
mlxsw_sp_ipip_demote_tunnel_by_saddr() to handle this. Make them both
public, because they will be useful later on in this patchset.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 73 +++---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.h  |  8 +++
 2 files changed, 60 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 897a384..832bfa1 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1159,24 +1159,7 @@ mlxsw_sp_ipip_entry_create(struct mlxsw_sp *mlxsw_sp,
   enum mlxsw_sp_ipip_type ipipt,
   struct net_device *ol_dev)
 {
-   u32 ul_tb_id = mlxsw_sp_ipip_dev_ul_tb_id(ol_dev);
-   struct mlxsw_sp_router *router = mlxsw_sp->router;
struct mlxsw_sp_ipip_entry *ipip_entry;
-   enum mlxsw_sp_l3proto ul_proto;
-   union mlxsw_sp_l3addr saddr;
-
-   /* The configuration where several tunnels have the same local address
-* in the same underlay table needs special treatment in the HW. That is
-* currently not implemented in the driver.
-*/
-   ul_proto = router->ipip_ops_arr[ipipt]->ul_proto;
-   saddr = mlxsw_sp_ipip_netdev_saddr(ul_proto, ol_dev);
-   list_for_each_entry(ipip_entry, &mlxsw_sp->router->ipip_list,
-   ipip_list_node) {
-   if (mlxsw_sp_ipip_entry_saddr_matches(mlxsw_sp, ul_proto, saddr,
- ul_tb_id, ipip_entry))
-   return ERR_PTR(-EEXIST);
-   }
 
ipip_entry = mlxsw_sp_ipip_entry_alloc(mlxsw_sp, ipipt, ol_dev);
if (IS_ERR(ipip_entry))
@@ -1292,14 +1275,24 @@ static int mlxsw_sp_netdevice_ipip_ol_reg_event(struct 
mlxsw_sp *mlxsw_sp,
struct net_device *ol_dev)
 {
struct mlxsw_sp_ipip_entry *ipip_entry;
+   enum mlxsw_sp_l3proto ul_proto;
enum mlxsw_sp_ipip_type ipipt;
+   union mlxsw_sp_l3addr saddr;
+   u32 ul_tb_id;
 
mlxsw_sp_netdev_ipip_type(mlxsw_sp, ol_dev, &ipipt);
if (mlxsw_sp_netdevice_ipip_can_offload(mlxsw_sp, ol_dev, ipipt)) {
-   ipip_entry = mlxsw_sp_ipip_entry_create(mlxsw_sp, ipipt,
-   ol_dev);
-   if (IS_ERR(ipip_entry))
-   return PTR_ERR(ipip_entry);
+   ul_tb_id = mlxsw_sp_ipip_dev_ul_tb_id(ol_dev);
+   ul_proto = mlxsw_sp->router->ipip_ops_arr[ipipt]->ul_proto;
+   saddr = mlxsw_sp_ipip_netdev_saddr(ul_proto, ol_dev);
+   if (!mlxsw_sp_ipip_demote_tunnel_by_saddr(mlxsw_sp, ul_proto,
+ saddr, ul_tb_id,
+ NULL)) {
+   ipip_entry = mlxsw_sp_ipip_entry_create(mlxsw_sp, ipipt,
+   ol_dev);
+   if (IS_ERR(ipip_entry))
+   return PTR_ERR(ipip_entry);
+   }
}
 
return 0;
@@ -1441,6 +1434,44 @@ static int mlxsw_sp_netdevice_ipip_ol_vrf_event(struct 
mlxsw_sp *mlxsw_sp,
   true, false, false, extack);
 }
 
+void mlxsw_sp_ipip_entry_demote_tunnel(struct mlxsw_sp *mlxsw_sp,
+  

[patch net-next 04/16] mlxsw: spectrum_ipip: Split accessor functions

2017-11-03 Thread Jiri Pirko
From: Petr Machata 

To implement NETDEV_CHANGE notifications on IP-in-IP tunnels, the
handler needs to figure out what actually changed, to understand how
exactly to update the offloads. It will do so by storing struct
ip_tunnel_parm with previous configuration, and comparing that to the
new version.

To facilitate these comparisons, extract the code that operates on
struct ip_tunnel_parm from the existing accessor functions, and make
those a thin wrapper that extracts tunnel parameters and dispatches.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_ipip.c| 100 ++---
 .../net/ethernet/mellanox/mlxsw/spectrum_ipip.h|   3 +
 2 files changed, 70 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c
index 8a9fbb6..1850080 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c
@@ -36,54 +36,49 @@
 
 #include "spectrum_ipip.h"
 
-static bool
-mlxsw_sp_ipip_netdev_has_ikey(const struct net_device *ol_dev)
+struct ip_tunnel_parm
+mlxsw_sp_ipip_netdev_parms(const struct net_device *ol_dev)
 {
struct ip_tunnel *tun = netdev_priv(ol_dev);
 
-   return !!(tun->parms.i_flags & TUNNEL_KEY);
+   return tun->parms;
 }
 
-static bool
-mlxsw_sp_ipip_netdev_has_okey(const struct net_device *ol_dev)
+static bool mlxsw_sp_ipip_parms_has_ikey(struct ip_tunnel_parm parms)
 {
-   struct ip_tunnel *tun = netdev_priv(ol_dev);
-
-   return !!(tun->parms.o_flags & TUNNEL_KEY);
+   return !!(parms.i_flags & TUNNEL_KEY);
 }
 
-static u32 mlxsw_sp_ipip_netdev_ikey(const struct net_device *ol_dev)
+static bool mlxsw_sp_ipip_parms_has_okey(struct ip_tunnel_parm parms)
 {
-   struct ip_tunnel *tun = netdev_priv(ol_dev);
-
-   return mlxsw_sp_ipip_netdev_has_ikey(ol_dev) ?
-   be32_to_cpu(tun->parms.i_key) : 0;
+   return !!(parms.o_flags & TUNNEL_KEY);
 }
 
-static u32 mlxsw_sp_ipip_netdev_okey(const struct net_device *ol_dev)
+static u32 mlxsw_sp_ipip_parms_ikey(struct ip_tunnel_parm parms)
 {
-   struct ip_tunnel *tun = netdev_priv(ol_dev);
-
-   return mlxsw_sp_ipip_netdev_has_okey(ol_dev) ?
-   be32_to_cpu(tun->parms.o_key) : 0;
+   return mlxsw_sp_ipip_parms_has_ikey(parms) ?
+   be32_to_cpu(parms.i_key) : 0;
 }
 
-static __be32
-mlxsw_sp_ipip_netdev_saddr4(const struct net_device *ol_dev)
+static u32 mlxsw_sp_ipip_parms_okey(struct ip_tunnel_parm parms)
 {
-   struct ip_tunnel *tun = netdev_priv(ol_dev);
+   return mlxsw_sp_ipip_parms_has_okey(parms) ?
+   be32_to_cpu(parms.o_key) : 0;
+}
 
-   return tun->parms.iph.saddr;
+static __be32 mlxsw_sp_ipip_parms_saddr4(struct ip_tunnel_parm parms)
+{
+   return parms.iph.saddr;
 }
 
-union mlxsw_sp_l3addr
-mlxsw_sp_ipip_netdev_saddr(enum mlxsw_sp_l3proto proto,
-  const struct net_device *ol_dev)
+static union mlxsw_sp_l3addr
+mlxsw_sp_ipip_parms_saddr(enum mlxsw_sp_l3proto proto,
+ struct ip_tunnel_parm parms)
 {
switch (proto) {
case MLXSW_SP_L3_PROTO_IPV4:
return (union mlxsw_sp_l3addr) {
-   .addr4 = mlxsw_sp_ipip_netdev_saddr4(ol_dev),
+   .addr4 = mlxsw_sp_ipip_parms_saddr4(parms),
};
case MLXSW_SP_L3_PROTO_IPV6:
break;
@@ -95,21 +90,19 @@ mlxsw_sp_ipip_netdev_saddr(enum mlxsw_sp_l3proto proto,
};
 }
 
-static __be32 mlxsw_sp_ipip_netdev_daddr4(const struct net_device *ol_dev)
+static __be32 mlxsw_sp_ipip_parms_daddr4(struct ip_tunnel_parm parms)
 {
-   struct ip_tunnel *tun = netdev_priv(ol_dev);
-
-   return tun->parms.iph.daddr;
+   return parms.iph.daddr;
 }
 
 static union mlxsw_sp_l3addr
-mlxsw_sp_ipip_netdev_daddr(enum mlxsw_sp_l3proto proto,
-  const struct net_device *ol_dev)
+mlxsw_sp_ipip_parms_daddr(enum mlxsw_sp_l3proto proto,
+ struct ip_tunnel_parm parms)
 {
switch (proto) {
case MLXSW_SP_L3_PROTO_IPV4:
return (union mlxsw_sp_l3addr) {
-   .addr4 = mlxsw_sp_ipip_netdev_daddr4(ol_dev),
+   .addr4 = mlxsw_sp_ipip_parms_daddr4(parms),
};
case MLXSW_SP_L3_PROTO_IPV6:
break;
@@ -121,6 +114,47 @@ mlxsw_sp_ipip_netdev_daddr(enum mlxsw_sp_l3proto proto,
};
 }
 
+static bool mlxsw_sp_ipip_netdev_has_ikey(const struct net_device *ol_dev)
+{
+   return mlxsw_sp_ipip_parms_has_ikey(mlxsw_sp_ipip_netdev_parms(ol_dev));
+}
+
+static bool mlxsw_sp_ipip_netdev_has_okey(const struct net_device *ol_dev)
+{
+   return mlxsw_sp_ipip_parms_has_okey(mlxsw_sp_ipip_netdev_parms(ol_dev));
+}
+
+static u32 mlxsw_sp_ipip_netdev_ikey(const struct net_device *ol_dev

[patch net-next 11/16] mlxsw: spectrum_router: Fix saddr deduction in mlxsw_sp_ipip_entry_create()

2017-11-03 Thread Jiri Pirko
From: Petr Machata 

When trying to determine whether there are other offloaded tunnels with
the same local address, mlxsw_sp_ipip_entry_create() should look for a
tunnel with matching UL protocol, matching saddr, in the same VRF.
However, instead of taking into account the UL protocol of the tunnel
netdevice (which mlxsw_sp_ipip_entry_saddr_matches() then compares to
the UL protocol of the inspected IPIP entry), it deduces the UL protocol
from the inspected IPIP entry (and that's then compared to itself).

This is currently immaterial, because only one tunnel type is offloaded,
and therefore the UL protocol always matches, but introducing support
for a tunnel with IPv6 underlay would uncover this error.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 1376a97..897a384 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1169,10 +1169,10 @@ mlxsw_sp_ipip_entry_create(struct mlxsw_sp *mlxsw_sp,
 * in the same underlay table needs special treatment in the HW. That is
 * currently not implemented in the driver.
 */
+   ul_proto = router->ipip_ops_arr[ipipt]->ul_proto;
+   saddr = mlxsw_sp_ipip_netdev_saddr(ul_proto, ol_dev);
list_for_each_entry(ipip_entry, &mlxsw_sp->router->ipip_list,
ipip_list_node) {
-   ul_proto = router->ipip_ops_arr[ipip_entry->ipipt]->ul_proto;
-   saddr = mlxsw_sp_ipip_netdev_saddr(ul_proto, ol_dev);
if (mlxsw_sp_ipip_entry_saddr_matches(mlxsw_sp, ul_proto, saddr,
  ul_tb_id, ipip_entry))
return ERR_PTR(-EEXIST);
-- 
2.9.5



[patch net-next 08/16] mlxsw: spectrum: Propagate extack for tunnel events

2017-11-03 Thread Jiri Pirko
From: Petr Machata 

The function mlxsw_sp_rif_create() takes an extack parameter. So far,
for creation of loopback interfaces, NULL was passed. For some events
however the extack can be extracted and passed along. So do that for
NETDEV_CHANGEUPPER handler.

Use the opportunity to update the type of info argument that
mlxsw_sp_netdevice_ipip_ol_event() takes. Follow-up patches will
introduce handling of more changes, and some of them carry an extack as
well, but in an info structure of a different type. Though not strictly
erroneous (the pointer could be cast whichever way), it makes no sense
to pretend the value is always of a certain type, when in fact it isn't.
So change the prototype of the above-mentioned function as well.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  9 +++
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 31 +-
 2 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index f01b5cb..07cba52 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -398,11 +398,10 @@ int mlxsw_sp_netdevice_vrf_event(struct net_device 
*l3_dev, unsigned long event,
 struct netdev_notifier_changeupper_info *info);
 bool mlxsw_sp_netdev_is_ipip_ol(const struct mlxsw_sp *mlxsw_sp,
const struct net_device *dev);
-int
-mlxsw_sp_netdevice_ipip_ol_event(struct mlxsw_sp *mlxsw_sp,
-struct net_device *l3_dev,
-unsigned long event,
-struct netdev_notifier_changeupper_info *info);
+int mlxsw_sp_netdevice_ipip_ol_event(struct mlxsw_sp *mlxsw_sp,
+struct net_device *l3_dev,
+unsigned long event,
+struct netdev_notifier_info *info);
 void
 mlxsw_sp_port_vlan_router_leave(struct mlxsw_sp_port_vlan *mlxsw_sp_port_vlan);
 void mlxsw_sp_rif_destroy(struct mlxsw_sp_rif *rif);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index ce0d462..c4f1881 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -961,7 +961,8 @@ mlxsw_sp_rif_create(struct mlxsw_sp *mlxsw_sp,
 static struct mlxsw_sp_rif_ipip_lb *
 mlxsw_sp_ipip_ol_ipip_lb_create(struct mlxsw_sp *mlxsw_sp,
enum mlxsw_sp_ipip_type ipipt,
-   struct net_device *ol_dev)
+   struct net_device *ol_dev,
+   struct netlink_ext_ack *extack)
 {
struct mlxsw_sp_rif_params_ipip_lb lb_params;
const struct mlxsw_sp_ipip_ops *ipip_ops;
@@ -974,7 +975,7 @@ mlxsw_sp_ipip_ol_ipip_lb_create(struct mlxsw_sp *mlxsw_sp,
.lb_config = ipip_ops->ol_loopback_config(mlxsw_sp, ol_dev),
};
 
-   rif = mlxsw_sp_rif_create(mlxsw_sp, &lb_params.common, NULL);
+   rif = mlxsw_sp_rif_create(mlxsw_sp, &lb_params.common, extack);
if (IS_ERR(rif))
return ERR_CAST(rif);
return container_of(rif, struct mlxsw_sp_rif_ipip_lb, common);
@@ -993,7 +994,7 @@ mlxsw_sp_ipip_entry_alloc(struct mlxsw_sp *mlxsw_sp,
return ERR_PTR(-ENOMEM);
 
ipip_entry->ol_lb = mlxsw_sp_ipip_ol_ipip_lb_create(mlxsw_sp, ipipt,
-   ol_dev);
+   ol_dev, NULL);
if (IS_ERR(ipip_entry->ol_lb)) {
ret = ERR_CAST(ipip_entry->ol_lb);
goto err_ol_ipip_lb_create;
@@ -1355,7 +1356,8 @@ static void mlxsw_sp_netdevice_ipip_ol_down_event(struct 
mlxsw_sp *mlxsw_sp,
 }
 
 static int mlxsw_sp_netdevice_ipip_ol_vrf_event(struct mlxsw_sp *mlxsw_sp,
-   struct net_device *ol_dev)
+   struct net_device *ol_dev,
+   struct netlink_ext_ack *extack)
 {
struct mlxsw_sp_fib_entry *decap_fib_entry;
struct mlxsw_sp_ipip_entry *ipip_entry;
@@ -1376,7 +1378,7 @@ static int mlxsw_sp_netdevice_ipip_ol_vrf_event(struct 
mlxsw_sp *mlxsw_sp,
mlxsw_sp_ipip_entry_demote_decap(mlxsw_sp, ipip_entry);
 
lb_rif = mlxsw_sp_ipip_ol_ipip_lb_create(mlxsw_sp, ipip_entry->ipipt,
-ol_dev);
+ol_dev, extack);
if (IS_ERR(lb_rif))
return PTR_ERR(lb_rif);
mlxsw_sp_rif_destroy(&ipip_entry->ol_lb->common);
@@ -1393,12 +1395,14 @@ static

[patch net-next 06/16] mlxsw: spectrum_router: Make mlxsw_sp_netdevice_ipip_ol_up_event() void

2017-11-03 Thread Jiri Pirko
From: Petr Machata 

This function only ever returns 0, so don't pretend it returns anything
useful and just make it void.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index da8fe7e..2b05f9f 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1314,8 +1314,8 @@ static void mlxsw_sp_netdevice_ipip_ol_unreg_event(struct 
mlxsw_sp *mlxsw_sp,
mlxsw_sp_ipip_entry_destroy(mlxsw_sp, ipip_entry);
 }
 
-static int mlxsw_sp_netdevice_ipip_ol_up_event(struct mlxsw_sp *mlxsw_sp,
-  struct net_device *ol_dev)
+static void mlxsw_sp_netdevice_ipip_ol_up_event(struct mlxsw_sp *mlxsw_sp,
+   struct net_device *ol_dev)
 {
struct mlxsw_sp_fib_entry *decap_fib_entry;
struct mlxsw_sp_ipip_entry *ipip_entry;
@@ -1329,7 +1329,6 @@ static int mlxsw_sp_netdevice_ipip_ol_up_event(struct 
mlxsw_sp *mlxsw_sp,
  decap_fib_entry);
}
 
-   return 0;
 }
 
 static void
@@ -1402,7 +1401,8 @@ mlxsw_sp_netdevice_ipip_ol_event(struct mlxsw_sp 
*mlxsw_sp,
mlxsw_sp_netdevice_ipip_ol_unreg_event(mlxsw_sp, ol_dev);
return 0;
case NETDEV_UP:
-   return mlxsw_sp_netdevice_ipip_ol_up_event(mlxsw_sp, ol_dev);
+   mlxsw_sp_netdevice_ipip_ol_up_event(mlxsw_sp, ol_dev);
+   return 0;
case NETDEV_DOWN:
mlxsw_sp_netdevice_ipip_ol_down_event(mlxsw_sp, ol_dev);
return 0;
-- 
2.9.5



[patch net-next 00/16] mlxsw: Handle changes in GRE configuration

2017-11-03 Thread Jiri Pirko
From: Jiri Pirko 

Petr says:

Until now, when an IP tunnel was offloaded by the mlxsw driver, the
offload was pretty much static, and changes in Linux configuration were
not reflected in the hardware. That led to discrepancies between traffic
flows in slow path and fast path. The work-around used to be to remove
all routes that forward to the netdevice and re-add them. This is
clearly suboptimal, but actually, as of the decap-only patchset, it's
not even enough anymore, and one needs to go all the way and simply drop
the tunnel and recreate it correctly.

With this patchset, the NETDEV_CHANGE events that are generated for
changes of up'd tunnel netdevices are captured and interpreted to
correctly reconfigure the HW in accordance with changes requested at the
software layer. In addition, NETDEV_CHANGEUPPER, NETDEV_UP and
NETDEV_DOWN are now handled not only for tunnel devices themselves, but
also for their bound devices. Each change is then translated to one or
more of the following updates to the HW configuration:

- refresh of offload of local route that corresponds to tunnel's local
  address
- refresh of the loopback RIF
- refresh of offloads of routes that forward to the changed tunnel
- removal of tunnel offloads

These tools are used to implement the following configuration changes:

- addition of a new offloadable tunnel with local address that conflicts
  with that of an already-offloaded tunnel (the existing tunnel is
  onloaded, the new one isn't offloaded)
- changes to TTL, TOS that make tunnel unsuitable for offloading
- changes to ikey, okey, remote
- changes to local, which when they cause conflict with another
  tunnel, lead to onloading of both newly-conflicting tunnels
- migration of a bound device of an offloaded tunnel device to a
  different VRF
- changes to what device is bound to a tunnel device (i.e. like what
  "ip tunnel change name g dev another" does)
- changes to up / down state of a bound device. A down bound device
  doesn't forward encapsulated traffic anymore, but decap still works.

This patchset starts with a suite of patches that adapt the existing
code base step by step to facilitate introduction of the offloading
code. The five substantial patches at the end then implement the changes
mentioned above.

Petr Machata (16):
  mlxsw: spectrum: Rename IPIP-related netdevice handlers
  mlxsw: spectrum_router: Extract mlxsw_sp_netdevice_ipip_can_offload()
  mlxsw: spectrum: Move mlxsw_sp_ipip_netdev_{s,d}addr{,4}()
  mlxsw: spectrum_ipip: Split accessor functions
  mlxsw: spectrum_router: Extract mlxsw_sp_ipip_entry_ol_down_event()
  mlxsw: spectrum_router: Make mlxsw_sp_netdevice_ipip_ol_up_event()
void
  mlxsw: spectrum_router: Extract mlxsw_sp_ipip_entry_ol_up_event()
  mlxsw: spectrum: Propagate extack for tunnel events
  mlxsw: spectrum_router: Extract __mlxsw_sp_ipip_entry_update_tunnel()
  mlxsw: spectrum_router: Generalize
__mlxsw_sp_ipip_entry_update_tunnel()
  mlxsw: spectrum_router: Fix saddr deduction in
mlxsw_sp_ipip_entry_create()
  mlxsw: spectrum_router: Onload conflicting tunnels
  mlxsw: spectrum: Support IPIP underlay VRF migration
  mlxsw: spectrum: Handle NETDEV_CHANGE on L3 tunnels
  mlxsw: spectrum_ipip: Handle underlay device change
  mlxsw: spectrum_router: Handle down of tunnel underlay

 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |   8 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  18 +-
 .../net/ethernet/mellanox/mlxsw/spectrum_ipip.c| 183 ++-
 .../net/ethernet/mellanox/mlxsw/spectrum_ipip.h|  12 +
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 539 +++--
 .../net/ethernet/mellanox/mlxsw/spectrum_router.h  |  29 +-
 6 files changed, 614 insertions(+), 175 deletions(-)

-- 
2.9.5



[patch net-next 10/16] mlxsw: spectrum_router: Generalize __mlxsw_sp_ipip_entry_update_tunnel()

2017-11-03 Thread Jiri Pirko
From: Petr Machata 

The work that needs to be done to update HW configuration in response to
changes is similar to what __mlxsw_sp_ipip_entry_update_tunnel() already
does, but with a number of twists: each change requires a different
subset of things to happen. Extend the function to support all these
uses, and allow finely-grained configuration of what should happen at
each call through a suite of function arguments.

Publish the updated function to allow use from the spectrum_ipip module.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 47 --
 .../net/ethernet/mellanox/mlxsw/spectrum_router.h  |  7 
 2 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index e2795b8..1376a97 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1355,9 +1355,12 @@ static void mlxsw_sp_netdevice_ipip_ol_down_event(struct mlxsw_sp *mlxsw_sp,
mlxsw_sp_ipip_entry_ol_down_event(mlxsw_sp, ipip_entry);
 }
 
+static void mlxsw_sp_nexthop_rif_update(struct mlxsw_sp *mlxsw_sp,
+   struct mlxsw_sp_rif *rif);
 static int
 mlxsw_sp_ipip_entry_ol_lb_update(struct mlxsw_sp *mlxsw_sp,
 struct mlxsw_sp_ipip_entry *ipip_entry,
+bool keep_encap,
 struct netlink_ext_ack *extack)
 {
struct mlxsw_sp_rif_ipip_lb *old_lb_rif = ipip_entry->ol_lb;
@@ -1370,13 +1373,32 @@ mlxsw_sp_ipip_entry_ol_lb_update(struct mlxsw_sp *mlxsw_sp,
if (IS_ERR(new_lb_rif))
return PTR_ERR(new_lb_rif);
ipip_entry->ol_lb = new_lb_rif;
+
+   if (keep_encap) {
+   list_splice_init(&old_lb_rif->common.nexthop_list,
+&new_lb_rif->common.nexthop_list);
+   mlxsw_sp_nexthop_rif_update(mlxsw_sp, &new_lb_rif->common);
+   }
+
mlxsw_sp_rif_destroy(&old_lb_rif->common);
 
return 0;
 }
 
+/**
+ * Update the offload related to an IPIP entry. This always updates decap, and
+ * in addition to that it also:
+ * @recreate_loopback: recreates the associated loopback RIF
+ * @keep_encap: updates next hops that use the tunnel netdevice. This is only
+ *  relevant when recreate_loopback is true.
+ * @update_nexthops: updates next hops, keeping the current loopback RIF. This
+ *   is only relevant when recreate_loopback is false.
+ */
 int __mlxsw_sp_ipip_entry_update_tunnel(struct mlxsw_sp *mlxsw_sp,
struct mlxsw_sp_ipip_entry *ipip_entry,
+   bool recreate_loopback,
+   bool keep_encap,
+   bool update_nexthops,
struct netlink_ext_ack *extack)
 {
int err;
@@ -1390,9 +1412,15 @@ int __mlxsw_sp_ipip_entry_update_tunnel(struct mlxsw_sp *mlxsw_sp,
if (ipip_entry->decap_fib_entry)
mlxsw_sp_ipip_entry_demote_decap(mlxsw_sp, ipip_entry);
 
-   err = mlxsw_sp_ipip_entry_ol_lb_update(mlxsw_sp, ipip_entry, extack);
-   if (err)
-   return err;
+   if (recreate_loopback) {
+   err = mlxsw_sp_ipip_entry_ol_lb_update(mlxsw_sp, ipip_entry,
+  keep_encap, extack);
+   if (err)
+   return err;
+   } else if (update_nexthops) {
+   mlxsw_sp_nexthop_rif_update(mlxsw_sp,
+   &ipip_entry->ol_lb->common);
+   }
 
if (ipip_entry->ol_dev->flags & IFF_UP)
mlxsw_sp_ipip_entry_ol_up_event(mlxsw_sp, ipip_entry);
@@ -1410,7 +1438,7 @@ static int mlxsw_sp_netdevice_ipip_ol_vrf_event(struct mlxsw_sp *mlxsw_sp,
if (!ipip_entry)
return 0;
return __mlxsw_sp_ipip_entry_update_tunnel(mlxsw_sp, ipip_entry,
-  extack);
+  true, false, false, extack);
 }
 
 int mlxsw_sp_netdevice_ipip_ol_event(struct mlxsw_sp *mlxsw_sp,
@@ -3285,6 +3313,17 @@ static void mlxsw_sp_nexthop4_event(struct mlxsw_sp *mlxsw_sp,
mlxsw_sp_nexthop_group_refresh(mlxsw_sp, nh->nh_grp);
 }
 
+static void mlxsw_sp_nexthop_rif_update(struct mlxsw_sp *mlxsw_sp,
+   struct mlxsw_sp_rif *rif)
+{
+   struct mlxsw_sp_nexthop *nh;
+
+   list_for_each_entry(nh, &rif->nexthop_list, rif_list_node) {
+   __mlxsw_sp_nexthop_neigh_update(nh, false);
+   mlxsw_sp_nexthop_group_refresh(mlxsw_sp, nh->nh_grp);
+   }
+}
+
 static void mlxsw_sp_next

[patch net-next 05/16] mlxsw: spectrum_router: Extract mlxsw_sp_ipip_entry_ol_down_event()

2017-11-03 Thread Jiri Pirko
From: Petr Machata 

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index ec90c6b..da8fe7e 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1332,14 +1332,22 @@ static int mlxsw_sp_netdevice_ipip_ol_up_event(struct mlxsw_sp *mlxsw_sp,
return 0;
 }
 
+static void
+mlxsw_sp_ipip_entry_ol_down_event(struct mlxsw_sp *mlxsw_sp,
+ struct mlxsw_sp_ipip_entry *ipip_entry)
+{
+   if (ipip_entry->decap_fib_entry)
+   mlxsw_sp_ipip_entry_demote_decap(mlxsw_sp, ipip_entry);
+}
+
 static void mlxsw_sp_netdevice_ipip_ol_down_event(struct mlxsw_sp *mlxsw_sp,
  struct net_device *ol_dev)
 {
struct mlxsw_sp_ipip_entry *ipip_entry;
 
ipip_entry = mlxsw_sp_ipip_entry_find_by_ol_dev(mlxsw_sp, ol_dev);
-   if (ipip_entry && ipip_entry->decap_fib_entry)
-   mlxsw_sp_ipip_entry_demote_decap(mlxsw_sp, ipip_entry);
+   if (ipip_entry)
+   mlxsw_sp_ipip_entry_ol_down_event(mlxsw_sp, ipip_entry);
 }
 
 static int mlxsw_sp_netdevice_ipip_ol_vrf_event(struct mlxsw_sp *mlxsw_sp,
-- 
2.9.5



[patch net-next 02/16] mlxsw: spectrum_router: Extract mlxsw_sp_netdevice_ipip_can_offload()

2017-11-03 Thread Jiri Pirko
From: Petr Machata 

Some of the code down the road needs this logic as well.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c| 20 +++-
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 9672933..97f062a 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1326,18 +1326,28 @@ mlxsw_sp_ipip_entry_find_by_ol_dev(struct mlxsw_sp *mlxsw_sp,
return NULL;
 }
 
+static bool mlxsw_sp_netdevice_ipip_can_offload(struct mlxsw_sp *mlxsw_sp,
+   const struct net_device *ol_dev,
+   enum mlxsw_sp_ipip_type ipipt)
+{
+   const struct mlxsw_sp_ipip_ops *ops
+   = mlxsw_sp->router->ipip_ops_arr[ipipt];
+
+   /* For deciding whether decap should be offloaded, we don't care about
+* overlay protocol, so ask whether either one is supported.
+*/
+   return ops->can_offload(mlxsw_sp, ol_dev, MLXSW_SP_L3_PROTO_IPV4) ||
+  ops->can_offload(mlxsw_sp, ol_dev, MLXSW_SP_L3_PROTO_IPV6);
+}
+
 static int mlxsw_sp_netdevice_ipip_ol_reg_event(struct mlxsw_sp *mlxsw_sp,
struct net_device *ol_dev)
 {
-   struct mlxsw_sp_router *router = mlxsw_sp->router;
struct mlxsw_sp_ipip_entry *ipip_entry;
enum mlxsw_sp_ipip_type ipipt;
 
mlxsw_sp_netdev_ipip_type(mlxsw_sp, ol_dev, &ipipt);
-   if (router->ipip_ops_arr[ipipt]->can_offload(mlxsw_sp, ol_dev,
-MLXSW_SP_L3_PROTO_IPV4) ||
-   router->ipip_ops_arr[ipipt]->can_offload(mlxsw_sp, ol_dev,
-MLXSW_SP_L3_PROTO_IPV6)) {
+   if (mlxsw_sp_netdevice_ipip_can_offload(mlxsw_sp, ol_dev, ipipt)) {
ipip_entry = mlxsw_sp_ipip_entry_create(mlxsw_sp, ipipt,
ol_dev);
if (IS_ERR(ipip_entry))
-- 
2.9.5



[patch net-next 01/16] mlxsw: spectrum: Rename IPIP-related netdevice handlers

2017-11-03 Thread Jiri Pirko
From: Petr Machata 

To distinguish between events related to tunnel device itself and its
bound device, rename a number of functions related to handling tunneling
netdevice events to include _ol_ (for "overlay") in the name. That
leaves room in the namespace for underlay-related functions, which would
have _ul_ in the name.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |  5 ++-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 12 +++---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 45 +++---
 3 files changed, 32 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 52f38b4..55bb366 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -4542,8 +4542,9 @@ static int mlxsw_sp_netdevice_event(struct notifier_block *nb,
int err = 0;
 
mlxsw_sp = container_of(nb, struct mlxsw_sp, netdevice_nb);
-   if (mlxsw_sp_netdev_is_ipip(mlxsw_sp, dev))
-   err = mlxsw_sp_netdevice_ipip_event(mlxsw_sp, dev, event, ptr);
+   if (mlxsw_sp_netdev_is_ipip_ol(mlxsw_sp, dev))
+   err = mlxsw_sp_netdevice_ipip_ol_event(mlxsw_sp, dev,
+  event, ptr);
else if (event == NETDEV_CHANGEADDR || event == NETDEV_CHANGEMTU)
err = mlxsw_sp_netdevice_router_port_event(dev);
else if (mlxsw_sp_is_vrf_event(event, ptr))
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index b2393bb8..f01b5cb 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -396,13 +396,13 @@ int mlxsw_sp_inet6addr_valid_event(struct notifier_block *unused,
   unsigned long event, void *ptr);
 int mlxsw_sp_netdevice_vrf_event(struct net_device *l3_dev, unsigned long event,
 struct netdev_notifier_changeupper_info *info);
-bool mlxsw_sp_netdev_is_ipip(const struct mlxsw_sp *mlxsw_sp,
-const struct net_device *dev);
+bool mlxsw_sp_netdev_is_ipip_ol(const struct mlxsw_sp *mlxsw_sp,
+   const struct net_device *dev);
 int
-mlxsw_sp_netdevice_ipip_event(struct mlxsw_sp *mlxsw_sp,
- struct net_device *l3_dev,
- unsigned long event,
- struct netdev_notifier_changeupper_info *info);
+mlxsw_sp_netdevice_ipip_ol_event(struct mlxsw_sp *mlxsw_sp,
+struct net_device *l3_dev,
+unsigned long event,
+struct netdev_notifier_changeupper_info *info);
 void
 mlxsw_sp_port_vlan_router_leave(struct mlxsw_sp_port_vlan *mlxsw_sp_port_vlan);
 void mlxsw_sp_rif_destroy(struct mlxsw_sp_rif *rif);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index d657f01..9672933 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1306,8 +1306,8 @@ static bool mlxsw_sp_netdev_ipip_type(const struct mlxsw_sp *mlxsw_sp,
return false;
 }
 
-bool mlxsw_sp_netdev_is_ipip(const struct mlxsw_sp *mlxsw_sp,
-const struct net_device *dev)
+bool mlxsw_sp_netdev_is_ipip_ol(const struct mlxsw_sp *mlxsw_sp,
+   const struct net_device *dev)
 {
return mlxsw_sp_netdev_ipip_type(mlxsw_sp, dev, NULL);
 }
@@ -1326,8 +1326,8 @@ mlxsw_sp_ipip_entry_find_by_ol_dev(struct mlxsw_sp *mlxsw_sp,
return NULL;
 }
 
-static int mlxsw_sp_netdevice_ipip_reg_event(struct mlxsw_sp *mlxsw_sp,
-struct net_device *ol_dev)
+static int mlxsw_sp_netdevice_ipip_ol_reg_event(struct mlxsw_sp *mlxsw_sp,
+   struct net_device *ol_dev)
 {
struct mlxsw_sp_router *router = mlxsw_sp->router;
struct mlxsw_sp_ipip_entry *ipip_entry;
@@ -1347,8 +1347,8 @@ static int mlxsw_sp_netdevice_ipip_reg_event(struct mlxsw_sp *mlxsw_sp,
return 0;
 }
 
-static void mlxsw_sp_netdevice_ipip_unreg_event(struct mlxsw_sp *mlxsw_sp,
-   struct net_device *ol_dev)
+static void mlxsw_sp_netdevice_ipip_ol_unreg_event(struct mlxsw_sp *mlxsw_sp,
+  struct net_device *ol_dev)
 {
struct mlxsw_sp_ipip_entry *ipip_entry;
 
@@ -1357,8 +1357,8 @@ static void mlxsw_sp_netdevice_ipip_unreg_event(struct mlxsw_sp *mlxsw_sp,
mlxsw_sp_ipip_entry_destroy(mlxsw_sp, ipip_entry);
 }
 
-static int mlxsw_sp_netdevice_ipip_up_ev

Re: [PATCH ipsec] xfrm: do unconditional template resolution before pcpu cache check

2017-11-03 Thread Steffen Klassert
On Thu, Nov 02, 2017 at 06:57:29PM -0400, Paul Moore wrote:
> On Thu, Nov 2, 2017 at 11:46 AM, Florian Westphal  wrote:
> > Stephen Smalley says:
> >  Since 4.14-rc1, the selinux-testsuite has been encountering sporadic
> >  failures during testing of labeled IPSEC. git bisect pointed to
> >  commit ec30d ("xfrm: add xdst pcpu cache").
> >  The xdst pcpu cache is only checking that the policies are the same,
> >  but does not validate that the policy, state, and flow match with respect
> >  to security context labeling.
> >  As a result, the wrong SA could be used and the receiver could end up
> >  performing permission checking and providing SO_PEERSEC or SCM_SECURITY
> >  values for the wrong security context.
> >
> > This fix makes it so that we always do the template resolution, and
> > then checks that the found states match those in the pcpu bundle.
> >
> > This has the disadvantage of doing a bit more work (lookup in state hash
> > table) if we can reuse the xdst entry (we only avoid xdst alloc/free)
> > but we don't add a lot of extra work in case we can't reuse.
> >
> > xfrm_pol_dead() check is removed, reasoning is that
> > xfrm_tmpl_resolve does all needed checks.
> >
> > Cc: Paul Moore 
> > Fixes: ec30d78c14a813db39a647b6a348b428 ("xfrm: add xdst pcpu cache")
> > Reported-by: Stephen Smalley 
> > Tested-by: Stephen Smalley 
> > Signed-off-by: Florian Westphal 
> > ---
> >  net/xfrm/xfrm_policy.c | 42 --
> >  1 file changed, 24 insertions(+), 18 deletions(-)
> 
> This looks reasonable and seems like probably the simplest approach to
> me.  I'm building a test kernel with it now, but considering the time
> of day here, I probably will not be able to test it until tomorrow
> morning; however it is important to note that Stephen did test this
> already so please don't wait on my test results - we are likely to be
> running the same tests anyway.
> 
> Acked-by: Paul Moore 

Patch applied, thanks everyone!


Re: [RFC PATCH 01/14] packet: introduce AF_PACKET V4 userspace API

2017-11-03 Thread Björn Töpel
2017-11-03 3:29 GMT+01:00 Willem de Bruijn :
 +/*
 + * struct tpacket_memreg_req is used in conjunction with PACKET_MEMREG
 + * to register user memory which should be used to store the packet
 + * data.
 + *
 + * There are some constraints for the memory being registered:
 + * - The memory area has to be memory page size aligned.
 + * - The frame size has to be a power of 2.
 + * - The frame size cannot be smaller than 2048B.
 + * - The frame size cannot be larger than the memory page size.
 + *
 + * Corollary: The number of frames that can be stored is
 + * len / frame_size.
 + *
 + */
 +struct tpacket_memreg_req {
 +   unsigned long   addr;   /* Start of packet data area */
 +   unsigned long   len;/* Length of packet data area */
 +   unsigned intframe_size; /* Frame size */
 +   unsigned intdata_headroom;  /* Frame head room */
 +};
>>>
>>> Existing packet sockets take a tpacket_req, allocate memory and let the
>>> user process mmap this. I understand that TPACKET_V4 distinguishes
>>> the descriptor from packet pools, but could both use the existing structs
>>> and logic (packet_mmap)? That would avoid introducing a lot of new code
>>> just for granting user pages to the kernel.
>>>
>>
>> We could certainly pass the "tpacket_memreg_req" fields as part of
>> descriptor ring setup ("tpacket_req4"), but we went with having the
>> memory register as a new separate setsockopt. Having it separated,
>> makes it easier to compare regions at the kernel side of things. "Is
>> this the same umem as another one?" If we go the path of passing the
>> range at descriptor ring setup, we need to handle all kind of
>> overlapping ranges to determine when a copy is needed or not, in those
>> cases where the packet buffer (i.e. umem) is shared between processes.
>
> That's not what I meant. Both descriptor rings and packet pools are
> memory regions. Packet sockets already have logic to allocate regions
> and make them available to userspace with mmap(). Packet v4 reuses
> that logic for its descriptor rings. Can it use the same for its packet
> pool? Why does the kernel map user memory, instead? That is a lot of
> non-trivial new logic.

Ah, got it. So, why do we register packet pool memory instead of
allocating it in the kernel and mapping *that* memory?

Actually, we started out with that approach, where the packet_mmap
call mapped Tx/Rx descriptor rings and the packet buffer region. We
later moved to this (register umem) approach, because it's more
flexible for user space, not having to use a AF_PACKET specific
allocator (i.e. continue to use regular mallocs, huge pages and such).

I agree that the memory register code is adding a lot of new logic,
but I believe it's worth the flexibility for user space. I'm looking
into if I can share the memory register logic from Infiniband/verbs
subsystem (drivers/infiniband/core/umem.c).


Björn


Re: [RFC PATCH 02/14] packet: implement PACKET_MEMREG setsockopt

2017-11-03 Thread Björn Töpel
2017-11-03 4:00 GMT+01:00 Willem de Bruijn :
> On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel  wrote:
>> From: Björn Töpel 
>>
>> Here, the PACKET_MEMREG setsockopt is implemented for the AF_PACKET
>> protocol family. PACKET_MEMREG allows the user to register memory
>> regions that can be used by AF_PACKET V4 as packet data buffers.
>>
>> Signed-off-by: Björn Töpel 
>> ---
>> +/*** V4 QUEUE OPERATIONS ***/
>> +
>> +/**
>> + * tp4q_umem_new - Creates a new umem (packet buffer)
>> + *
>> + * @addr: The address to the umem
>> + * @size: The size of the umem
>> + * @frame_size: The size of each frame, between 2K and PAGE_SIZE
>> + * @data_headroom: The desired data headroom before start of the packet
>> + *
>> + * Returns a pointer to the new umem or NULL for failure
>> + **/
>> +static inline struct tp4_umem *tp4q_umem_new(unsigned long addr, size_t 
>> size,
>> +unsigned int frame_size,
>> +unsigned int data_headroom)
>> +{
>> +   struct tp4_umem *umem;
>> +   unsigned int nframes;
>> +
>> +   if (frame_size < TP4_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
>> +   /* Strictly speaking we could support this, if:
>> +* - huge pages, or*
>> +* - using an IOMMU, or
>> +* - making sure the memory area is consecutive
>> +* but for now, we simply say "computer says no".
>> +*/
>> +   return ERR_PTR(-EINVAL);
>> +   }
>> +
>> +   if (!is_power_of_2(frame_size))
>> +   return ERR_PTR(-EINVAL);
>> +
>> +   if (!PAGE_ALIGNED(addr)) {
>> +   /* Memory area has to be page size aligned. For
>> +* simplicity, this might change.
>> +*/
>> +   return ERR_PTR(-EINVAL);
>> +   }
>> +
>> +   if ((addr + size) < addr)
>> +   return ERR_PTR(-EINVAL);
>> +
>> +   nframes = size / frame_size;
>> +   if (nframes == 0)
>> +   return ERR_PTR(-EINVAL);
>> +
>> +   data_headroom = ALIGN(data_headroom, 64);
>> +
>> +   if (frame_size - data_headroom - TP4_KERNEL_HEADROOM < 0)
>> +   return ERR_PTR(-EINVAL);
>
> signed comparison on unsigned int

Thanks, will address in next revision!


Re: [RFC PATCH 03/14] packet: enable AF_PACKET V4 rings

2017-11-03 Thread Björn Töpel
2017-11-03 5:16 GMT+01:00 Willem de Bruijn :
>> +/**
>> + * tp4q_enqueue_from_array - Enqueue entries from packet array to tp4 queue
>> + *
>> + * @a: Pointer to the packet array to enqueue from
>> + * @dcnt: Max number of entries to enqueue
>> + *
>> + * Returns 0 for success or an errno at failure
>> + **/
>> +static inline int tp4q_enqueue_from_array(struct tp4_packet_array *a,
>> + u32 dcnt)
>> +{
>> +   struct tp4_queue *q = a->tp4q;
>> +   unsigned int used_idx = q->used_idx;
>> +   struct tpacket4_desc *d = a->items;
>> +   int i;
>> +
>> +   if (q->num_free < dcnt)
>> +   return -ENOSPC;
>> +
>> +   q->num_free -= dcnt;
>
> perhaps annotate with a lockdep_is_held to document which lock
> ensures mutual exclusion on the ring. Different for tx and rx?
>

Good idea. I'll give that a try!

>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index b39be424ec0e..190598eb3461 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -189,6 +189,9 @@ static int packet_set_ring(struct sock *sk, union 
>> tpacket_req_u *req_u,
>>  #define BLOCK_O2PRIV(x)((x)->offset_to_priv)
>>  #define BLOCK_PRIV(x)  ((void *)((char *)(x) + BLOCK_O2PRIV(x)))
>>
>> +#define RX_RING 0
>> +#define TX_RING 1
>> +
>
> Not needed if using bool for tx_ring below. The test effectively already
> treats it as bool: does not explicitly test these constants.
>
>> +static void packet_clear_ring(struct sock *sk, int tx_ring)
>> +{
>> +   struct packet_sock *po = pkt_sk(sk);
>> +   struct packet_ring_buffer *rb;
>> +   union tpacket_req_u req_u;
>> +
>> +   rb = tx_ring ? &po->tx_ring : &po->rx_ring;
>
>
> I meant here.

Yup, I'll remove/clean this up.


Björn


RE: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-03 Thread Karlsson, Magnus


> -Original Message-
> From: Willem de Bruijn [mailto:willemdebruijn.ker...@gmail.com]
> Sent: Friday, November 3, 2017 5:35 AM
> To: Björn Töpel 
> Cc: Karlsson, Magnus ; Duyck, Alexander H
> ; Alexander Duyck
> ; John Fastabend
> ; Alexei Starovoitov ; Jesper
> Dangaard Brouer ; michael.lundkv...@ericsson.com;
> ravineet.si...@ericsson.com; Daniel Borkmann ;
> Network Development ; Topel, Bjorn
> ; Brandeburg, Jesse
> ; Singhai, Anjali ;
> Rosen, Rami ; Shaw, Jeffrey B
> ; Yigit, Ferruh ; Zhang,
> Qi Z 
> Subject: Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
> 
> On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel 
> wrote:
> > From: Björn Töpel 
> >
> > This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are
> > optimized for high performance packet processing and zero-copy
> > semantics. Throughput improvements can be up to 40x compared to V2
> and
> > V3 for the micro benchmarks included. Would be great to get your
> > feedback on it.
> >
> > The main difference between V4 and V2/V3 is that TX and RX descriptors
> > are separated from packet buffers.
> 
> Cool feature. I'm looking forward to the netdev talk. Aside from the inline
> comments in the patches, a few architecture questions.

Glad to hear. Are you going to Netdev in Seoul? If so, let us hook up
and discuss your comments in further detail. Some initial thoughts
below.

> Is TX support needed? Existing PACKET_TX_RING already sends out packets
> without copying directly from the tx_ring. Indirection through a descriptor
> ring is not helpful on TX if all packets still have to come from a 
> pre-registered
> packet pool. The patch set adds a lot of tx-only code and is complex enough
> without it.

That is correct, but what if the packet you are going to transmit came
in from the receive path and is already in the packet buffer? This
might happen if the application is examining/sniffing packets then
sending them out, or doing some modification to them. In that case we
avoid a copy in V4 since the packet is already in the packet
buffer. With V2 and V3, a copy from the RX ring to the TX ring would
be needed. In the PACKET_ZEROCOPY case, avoiding this copy increases
performance quite a lot.

> Can you use the existing PACKET_V2 format for the packet pool? The
> v4 format is nearly the same as V2. Using the same version might avoid some
> code duplication and simplify upgrading existing legacy code.
> Instead of continuing to add new versions whose behavior is implicit,
> perhaps we can add explicit mode PACKET_INDIRECT to PACKET_V2.

Interesting idea that I think is worth thinking more about. One
problem though with the V2 ring format model, and the current V4
format too by the way, when applied to a user-space allocated memory
is that they are symmetric, i.e. that user space and kernel space have
to produce and consume the same amount of entries (within the length
of the descriptor area). User space sends down a buffer entry that the
kernel fills in for RX for example. Symmetric queues do not work when
you have a shared packet buffer between two processes. (This is not a
requirement, but someone might do a mmap with MAP_SHARED for the
packet buffer and then fork off a child that then inherits this packet
buffer.) One of the processes might just receive packets, while the
other one is transmitting. Or if you have a veth link pair between two
processes and they have been set up to share packet buffer area. With
a symmetric queue you have to copy even if they share the same packet
buffer, but with an asymmetric queue, you do not and the driver only
needs to copy the packet buffer id between the TX desc ring of the
sender to the RX desc ring of the receiver, not the data. I think this
gives an indication that we need a new structure. Anyway, I like your
idea and I think it is worth thinking more about it. Let us have a
discussion about this at Netdev, if you are there.
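
To illustrate the point with a toy sketch (all names below are made up
for the example, this is not the V4 uapi): with descriptor-only rings
on top of a shared packet buffer, "forwarding" from one socket to
another only moves a frame index between rings, the packet data itself
is never copied:

struct toy_desc {
	unsigned int idx;	/* frame index into the shared buffer */
	unsigned int len;
};

struct toy_ring {
	struct toy_desc *descs;
	unsigned int size;	/* power of two */
	unsigned int prod;
	unsigned int cons;
};

/* Move one completed RX descriptor straight onto a TX ring. */
static int toy_forward(struct toy_ring *rx, struct toy_ring *tx)
{
	if (rx->cons == rx->prod)		/* RX empty */
		return -1;
	if (tx->prod - tx->cons == tx->size)	/* TX full */
		return -1;

	tx->descs[tx->prod & (tx->size - 1)] =
		rx->descs[rx->cons & (rx->size - 1)];
	tx->prod++;
	rx->cons++;
	return 0;
}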

> Finally, is it necessary to define a new descriptor ring format? Same for the
> packet array and frame set. The kernel already has a few, such as virtio for
> the first, skb_array/ptr_ring, even linux list for the second. These 
> containers
> add a lot of new boilerplate code. If new formats are absolutely necessary, at
> least we should consider making them generic (like skb_array and ptr_ring).
> But I'd like to understand first why, e.g., virtio cannot be used.

Agree with you. Good if we can use something existing. The descriptor
format of V4 was based on one of the first Virtio 1.1 proposals by
Michael Tsirkin (tools/virtio/ringtest/ring.c). Then we have diverged
somewhat due to performance reasons and Virtio 1.1 has done the same
but in another direction. We should take a look at the latest Virtio
1.1 proposal again and see what it offers. The reason we did not go
with Virtio 0.9 was for performance. Too many indirections, something
that the people behind Virtio 1.1 had identified too. With ptr_ring,
how do we deal with the pointers in the structure as this no

RE: [PATCH net-next v2 1/3] enic: reset fetch index

2017-11-03 Thread David Laight
From: Parvi Kaustubhi
> Sent: 01 November 2017 15:45
> Since we are allowing rx ring size modification, reset fetch index
> everytime. Otherwise it could have a stale value that can lead to a null
> pointer dereference.
> 
> Signed-off-by: Govindarajulu Varadarajan 
> Signed-off-by: Parvi Kaustubhi 
> ---
> v2: remove unused variable to fix build warning
> 
>  drivers/net/ethernet/cisco/enic/vnic_rq.c | 16 ++--
>  1 file changed, 2 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cisco/enic/vnic_rq.c 
> b/drivers/net/ethernet/cisco/enic/vnic_rq.c
> index 36bc2c7..f8aa326 100644
> --- a/drivers/net/ethernet/cisco/enic/vnic_rq.c
> +++ b/drivers/net/ethernet/cisco/enic/vnic_rq.c
> @@ -139,20 +139,8 @@ void vnic_rq_init(struct vnic_rq *rq, unsigned int 
> cq_index,
>   unsigned int error_interrupt_enable,
>   unsigned int error_interrupt_offset)
>  {
> - u32 fetch_index = 0;
> -
> - /* Use current fetch_index as the ring starting point */
> - fetch_index = ioread32(&rq->ctrl->fetch_index);
> -
> - if (fetch_index == 0x) { /* check for hardware gone  */
> - /* Hardware surprise removal: reset fetch_index */
> - fetch_index = 0;
> - }
> -
> - vnic_rq_init_start(rq, cq_index,
> - fetch_index, fetch_index,
> - error_interrupt_enable,
> - error_interrupt_offset);
> + vnic_rq_init_start(rq, cq_index, 0, 0, error_interrupt_enable,
> +error_interrupt_offset);

ISTM that the two '0' arguments can be removed.

David



Re: [PATCH] net: mvpp2: add ethtool GOP statistics

2017-11-03 Thread Miquel RAYNAL
Hi Florian,

> > +static u64 mvpp2_read_count(struct mvpp2_port *port, unsigned int
> > offset) +{
> > +   bool reg_is_64b =
> > +   (offset == MVPP2_MIB_GOOD_OCTETS_RCVD_LOW) ||
> > +   (offset == MVPP2_MIB_GOOD_OCTETS_SENT_LOW);  
> 
> This does not scale very well, put that in your statistics structure
> and define a member "reg_is_64b" there such that you can pass a
> pointer to one of these members here, and check, on per-counter basis
> whether this is needed or not.

I completely agree, I will use a third parameter in the statistics
structure as you suggest.

> 
> > +   void __iomem *base;
> > +   u64 val;
> > +
> > +   if (port->priv->hw_version == MVPP21)
> > +   base = port->priv->lms_base +
> > MVPP21_MIB_COUNTERS_OFFSET +
> > +  port->gop_id * MVPP21_MIB_COUNTERS_PORT_SZ;
> > +   else
> > +   base = port->priv->iface_base +
> > MVPP22_MIB_COUNTERS_OFFSET +
> > +  port->gop_id * MVPP22_MIB_COUNTERS_PORT_SZ;
> > +
> > +   val = readl(base + offset);
> > +   if (reg_is_64b)
> > +   val += (u64)readl(base + offset + 4) << 32;  
> 
> So the value gets latched when the higher part gets read last?

When counters are 64-bit, they use two 32-bit registers to store the
value. If the lower 32 bits are at address X, then the higher 32 bits
are at X+4.

> 
> > +
> > +   return val;
> > +}
> > +
> > +struct mvpp2_ethtool_statistics {
> > +   unsigned int offset;
> > +   const char string[ETH_GSTRING_LEN];  
> 
> Add your reg_is_64b member here too.

Of course, yes.

> 
> > +};
> > +
> > +/* Due to the fact that software statistics and hardware
> > statistics are, by
> > + * design, incremented at different moments in the chain of packet
> > processing,
> > + * it is very likely that incoming packets could have been dropped
> > after being
> > + * counted by hardware but before reaching software statistics
> > (most probably
> > + * multicast packets), and in the oppposite way, during
> > transmission, FCS bytes
> > + * are added in between as well as TSO skb will be split and
> > header bytes added.
> > + */  
> 
> OK, not sure what to make of that comment.

For me it is a problem when 'ifconfig eth0' and 'ethtool -S eth0' give
you totally different statistics. The comment explains why, as I asked
myself how the statistics could evolve so differently. Do you want me to
get rid of it? For now I have added this to make it clearer:

  * Hence, statistics gathered from userspace with ifconfig (software)
  * and ethtool (hardware) cannot be compared.
  */

> > +static void mvpp2_gather_hw_statistics(struct work_struct *work)
> > +{
> > +   struct delayed_work *del_work = to_delayed_work(work);
> > +   struct mvpp2 *priv = container_of(del_work, struct mvpp2,
> > stats_work);
> > +   struct mvpp2_port *port;
> > +   u64 *pstats;
> > +   int i, j;
> > +
> > +   for (i = 0; i < priv->port_count; i++) {
> > +   if (!priv->port_list[i])
> > +   continue;
> > +
> > +   port = priv->port_list[i];
> > +   pstats = port->ethtool_stats;  
> 
> port->ethtool_stats was allocated this way:
> 
> port->ethtool_stats = kzalloc(sizeof(mvpp2_ethtool_stats)) instead of
> ARRAY_SIZE(mvpp2_ethtool_stats) * sizeof(u64)
> 
> This is probably working right now because mvpp2_ethtool_stats is much
> bigger than ARRAY_SIZE(mvpp2_ethtool_stats) * sizeof(u64).

My mistake, sorry about that.
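
For v2 I will presumably switch to something like the following (same
error path as before, just the allocation corrected):

	port->ethtool_stats = kcalloc(ARRAY_SIZE(mvpp2_ethtool_stats),
				      sizeof(u64), GFP_KERNEL);
	if (!port->ethtool_stats) {
		err = -ENOMEM;
		goto err_free_stats;
	}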

> 
> 
> > +   for (j = 0; j < ARRAY_SIZE(mvpp2_ethtool_stats);
> > j++)
> > +   *pstats++ += mvpp2_read_count(
> > +   port,
> > mvpp2_ethtool_stats[j].offset);  
> 
> You might want to look into the helper functions from
> include/linux/u64_stats_sync.h to safely add a 32-bit quantity to a
> 64-bit quantity on 32-bit hosts.

I looked at the header, but I do not think this applies here because
once the first register in the list is read (at the lowest address),
counters are frozen until the last register is read (hardware is
still counting but will not update the registers we read until then).
So I guess there is no need for it here. What do you think?
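
Just so we are talking about the same thing, my understanding of the
u64_stats_sync pattern you refer to is roughly the sketch below. The
example_* structure and functions are made up for illustration; only
the u64_stats_* helpers come from include/linux/u64_stats_sync.h:

#include <linux/u64_stats_sync.h>

struct example_counter {
	u64 bytes;
	struct u64_stats_sync syncp;	/* u64_stats_init() at probe time */
};

/* Writer side: fold a 32-bit hardware counter into the 64-bit total. */
static void example_fold_hw(struct example_counter *c, u32 hw_val)
{
	u64_stats_update_begin(&c->syncp);
	c->bytes += hw_val;
	u64_stats_update_end(&c->syncp);
}

/* Reader side: get a consistent 64-bit snapshot, also on 32-bit hosts. */
static u64 example_read(struct example_counter *c)
{
	unsigned int start;
	u64 val;

	do {
		start = u64_stats_fetch_begin(&c->syncp);
		val = c->bytes;
	} while (u64_stats_fetch_retry(&c->syncp, start));

	return val;
}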


> >  static void mvpp2_port_reset(struct mvpp2_port *port)
> >  {
> > u32 val;
> > +   int i;  
> 
> unsigned int i

Ok.


> > @@ -7613,13 +7788,19 @@ static int mvpp2_port_probe(struct
> > platform_device *pdev, port->base = priv->iface_base +
> > MVPP22_GMAC_BASE(port->gop_id); }
> >  
> > -   /* Alloc per-cpu stats */
> > +   /* Alloc per-cpu and ethtool stats */
> > port->stats = netdev_alloc_pcpu_stats(struct
> > mvpp2_pcpu_stats); if (!port->stats) {
> > err = -ENOMEM;
> > goto err_free_irq;
> > }
> >  
> > +   port->ethtool_stats = kzalloc(sizeof(mvpp2_ethtool_stats),
> > GFP_KERNEL);
> > +   if (!port->ethtool_stats) {
> > +   err = -ENOMEM;
> > +   goto err_free_stats;
> > +   }  
> 
> Should not the above be kcalloc(sizeof(u64),
> ARRAY_SIZE(mvpp2_ethtool_

Re: [PATCH net-next] arp: Ignore packets with an all zero sender mac address

2017-11-03 Thread Eelco Chaudron

On 27/10/17 15:48, David Miller wrote:

From: Eelco Chaudron 
Date: Thu, 26 Oct 2017 10:37:01 +0200


Some applications/devices seem to forget their MAC address when
performing some kind of a failover which triggers (something that
looks like) a gratuitous ARP.

The ARP packet looks something like this:

   Address Resolution Protocol (reply)
   Hardware type: Ethernet (1)
   Protocol type: IPv4 (0x0800)
   Hardware size: 6
   Protocol size: 4
   Opcode: reply (2)
   Sender MAC address: 00:00:00:00:00:00
   Sender IP address: 10.0.0.1
   Target MAC address: 00:00:00:00:00:00
   Target IP address: 255.255.255.255

This will result in existing arp entries being overwritten with an all
zero mac address. Until the arp entry times out this host can no
longer initiate a connection to this device.

Checking for and ignoring invalid mac addresses will solve this
problem.

Signed-off-by: Eelco Chaudron 

I really have trouble justifying this fully.

I looked at a bunch of ARP implementations, and I see no special
checks about the link level address other than to make sure it isn't
"our" address.

Whatever is generating these weird ARP packets should be fixed
instead.
Looking for any mention of an all-zero MAC address being invalid, the
only reference I could find was in the original Xerox Wire
Specification. The IEEE specifications do not mention this at all, and
according to them, the all-zero address is a valid MA-L address assigned
to Xerox.


Looking at the packet more, it might be an attempt to do an UnARP (RFC
1868) where the implementation forgot to set the Hardware Address Length
to zero.


I'm sure adding an arptables entry can be used to solve this instead, in 
case the offending device cannot be fixed.


Please ignore this patch...

Cheers,

Eelco


[patch net-next v4 1/2] net: sched: introduce chain_head_change callback

2017-11-03 Thread Jiri Pirko
From: Jiri Pirko 

Add a callback that is to be called whenever head of the chain changes.
Also provide a callback for the default case when the caller gets a
block using non-extended getter.

Signed-off-by: Jiri Pirko 
---
 include/net/pkt_cls.h | 14 ++--
 include/net/sch_generic.h |  5 -
 net/sched/cls_api.c   | 54 ---
 net/sched/sch_ingress.c   | 36 +--
 4 files changed, 62 insertions(+), 47 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index d15c40c..505d4b7 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -26,6 +26,8 @@ enum tcf_block_binder_type {
 
 struct tcf_block_ext_info {
enum tcf_block_binder_type binder_type;
+   tcf_chain_head_change_t *chain_head_change;
+   void *chain_head_change_priv;
 };
 
 struct tcf_block_cb;
@@ -37,12 +39,10 @@ struct tcf_chain *tcf_chain_get(struct tcf_block *block, u32 chain_index,
 void tcf_chain_put(struct tcf_chain *chain);
 int tcf_block_get(struct tcf_block **p_block,
  struct tcf_proto __rcu **p_filter_chain, struct Qdisc *q);
-int tcf_block_get_ext(struct tcf_block **p_block,
- struct tcf_proto __rcu **p_filter_chain, struct Qdisc *q,
+int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
  struct tcf_block_ext_info *ei);
 void tcf_block_put(struct tcf_block *block);
-void tcf_block_put_ext(struct tcf_block *block,
-  struct tcf_proto __rcu **p_filter_chain, struct Qdisc *q,
+void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
   struct tcf_block_ext_info *ei);
 
 static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
@@ -82,8 +82,7 @@ int tcf_block_get(struct tcf_block **p_block,
 }
 
 static inline
-int tcf_block_get_ext(struct tcf_block **p_block,
- struct tcf_proto __rcu **p_filter_chain, struct Qdisc *q,
+int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
  struct tcf_block_ext_info *ei)
 {
return 0;
@@ -94,8 +93,7 @@ static inline void tcf_block_put(struct tcf_block *block)
 }
 
 static inline
-void tcf_block_put_ext(struct tcf_block *block,
-  struct tcf_proto __rcu **p_filter_chain, struct Qdisc *q,
+void tcf_block_put_ext(struct tcf_block *block, struct Qdisc *q,
   struct tcf_block_ext_info *ei)
 {
 }
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index c23e938..f230269 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -260,9 +260,12 @@ struct qdisc_skb_cb {
unsigned char   data[QDISC_CB_PRIV_LEN];
 };
 
+typedef void tcf_chain_head_change_t(struct tcf_proto *tp_head, void *priv);
+
 struct tcf_chain {
struct tcf_proto __rcu *filter_chain;
-   struct tcf_proto __rcu **p_filter_chain;
+   tcf_chain_head_change_t *chain_head_change;
+   void *chain_head_change_priv;
struct list_head list;
struct tcf_block *block;
u32 index; /* chain index */
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3364347..6416fe3 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -195,12 +195,19 @@ static struct tcf_chain *tcf_chain_create(struct tcf_block *block,
return chain;
 }
 
+static void tcf_chain_head_change(struct tcf_chain *chain,
+ struct tcf_proto *tp_head)
+{
+   if (chain->chain_head_change)
+   chain->chain_head_change(tp_head,
+chain->chain_head_change_priv);
+}
+
 static void tcf_chain_flush(struct tcf_chain *chain)
 {
struct tcf_proto *tp;
 
-   if (chain->p_filter_chain)
-   RCU_INIT_POINTER(*chain->p_filter_chain, NULL);
+   tcf_chain_head_change(chain, NULL);
while ((tp = rtnl_dereference(chain->filter_chain)) != NULL) {
RCU_INIT_POINTER(chain->filter_chain, tp->next);
tcf_chain_put(chain);
@@ -242,13 +249,6 @@ void tcf_chain_put(struct tcf_chain *chain)
 }
 EXPORT_SYMBOL(tcf_chain_put);
 
-static void
-tcf_chain_filter_chain_ptr_set(struct tcf_chain *chain,
-  struct tcf_proto __rcu **p_filter_chain)
-{
-   chain->p_filter_chain = p_filter_chain;
-}
-
 static void tcf_block_offload_cmd(struct tcf_block *block, struct Qdisc *q,
  struct tcf_block_ext_info *ei,
  enum tc_block_command command)
@@ -276,8 +276,7 @@ static void tcf_block_offload_unbind(struct tcf_block *block, struct Qdisc *q,
tcf_block_offload_cmd(block, q, ei, TC_BLOCK_UNBIND);
 }
 
-int tcf_block_get_ext(struct tcf_block **p_block,
- struct tcf_proto __rcu **p_filter_chain, struct Qdisc *q,
+int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
  struct tcf_block_

[patch net-next v4 2/2] net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath

2017-11-03 Thread Jiri Pirko
From: Jiri Pirko 

In sch_handle_egress and sch_handle_ingress tp->q is used only in order
to update stats. So stats and filter list are the only things that are
needed in clsact qdisc fastpath processing. Introduce new mini_Qdisc
struct to hold those items. Also, introduce a helper to swap the
mini_Qdisc structures in case filter list head changes.

This removes need for tp->q usage without added overhead.

Signed-off-by: Jiri Pirko 
---
 include/linux/netdevice.h |  9 ++---
 include/net/sch_generic.h | 32 
 net/core/dev.c| 21 +++--
 net/sched/sch_generic.c   | 46 ++
 net/sched/sch_ingress.c   | 19 ++-
 5 files changed, 109 insertions(+), 18 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5e02f79..7de7656 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1559,6 +1559,8 @@ enum netdev_priv_flags {
  *
  * @rx_handler:handler for received packets
  * @rx_handler_data:   XXX: need comments on this one
+ * @miniq_ingress: ingress/clsact qdisc specific data for
+ * ingress processing
  * @ingress_queue: XXX: need comments on this one
  * @broadcast: hw bcast address
  *
@@ -1576,7 +1578,8 @@ enum netdev_priv_flags {
  * @tx_global_lock:XXX: need comments on this one
  *
  * @xps_maps:  XXX: need comments on this one
- *
+ * @miniq_egress:  clsact qdisc specific data for
+ * egress processing
  * @watchdog_timeo:Represents the timeout that is used by
  * the watchdog (see dev_watchdog())
  * @watchdog_timer:List of timers
@@ -1795,7 +1798,7 @@ struct net_device {
void __rcu  *rx_handler_data;
 
 #ifdef CONFIG_NET_CLS_ACT
-   struct tcf_proto __rcu  *ingress_cl_list;
+   struct mini_Qdisc __rcu *miniq_ingress;
 #endif
struct netdev_queue __rcu *ingress_queue;
 #ifdef CONFIG_NETFILTER_INGRESS
@@ -1826,7 +1829,7 @@ struct net_device {
struct xps_dev_maps __rcu *xps_maps;
 #endif
 #ifdef CONFIG_NET_CLS_ACT
-   struct tcf_proto __rcu  *egress_cl_list;
+   struct mini_Qdisc __rcu *miniq_egress;
 #endif
 
/* These may be needed for future network-power-down code. */
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index f230269..c64e62c 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -904,4 +904,36 @@ static inline void psched_ratecfg_getrate(struct tc_ratespec *res,
res->linklayer = (r->linklayer & TC_LINKLAYER_MASK);
 }
 
+/* Mini Qdisc serves for specific needs of ingress/clsact Qdisc.
+ * The fast path only needs to access filter list and to update stats
+ */
+struct mini_Qdisc {
+   struct tcf_proto *filter_list;
+   struct gnet_stats_basic_cpu __percpu *cpu_bstats;
+   struct gnet_stats_queue __percpu *cpu_qstats;
+   struct rcu_head rcu;
+};
+
+static inline void mini_qdisc_bstats_cpu_update(struct mini_Qdisc *miniq,
+   const struct sk_buff *skb)
+{
+   bstats_cpu_update(this_cpu_ptr(miniq->cpu_bstats), skb);
+}
+
+static inline void mini_qdisc_qstats_cpu_drop(struct mini_Qdisc *miniq)
+{
+   this_cpu_inc(miniq->cpu_qstats->drops);
+}
+
+struct mini_Qdisc_pair {
+   struct mini_Qdisc miniq1;
+   struct mini_Qdisc miniq2;
+   struct mini_Qdisc __rcu **p_miniq;
+};
+
+void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
+ struct tcf_proto *tp_head);
+void mini_qdisc_pair_init(struct mini_Qdisc_pair *miniqp, struct Qdisc *qdisc,
+ struct mini_Qdisc __rcu **p_miniq);
+
 #endif
diff --git a/net/core/dev.c b/net/core/dev.c
index 24ac908..1423cf4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3274,22 +3274,22 @@ EXPORT_SYMBOL(dev_loopback_xmit);
 static struct sk_buff *
 sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 {
-   struct tcf_proto *cl = rcu_dereference_bh(dev->egress_cl_list);
+   struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
struct tcf_result cl_res;
 
-   if (!cl)
+   if (!miniq)
return skb;
 
/* qdisc_skb_cb(skb)->pkt_len was already set by the caller. */
-   qdisc_bstats_cpu_update(cl->q, skb);
+   mini_qdisc_bstats_cpu_update(miniq, skb);
 
-   switch (tcf_classify(skb, cl, &cl_res, false)) {
+   switch (tcf_classify(skb, miniq->filter_list, &cl_res, false)) {
case TC_ACT_OK:
case TC_ACT_RECLASSIFY:
skb->tc_index = TC_H_MIN(cl_res.classid);
break;
case TC_ACT_SHOT:
-   qdisc_qstats_cpu_drop(cl->q);
+   mini_qdisc_qstats_cpu_drop(miniq);
*ret = NET_XMIT_DROP;
  

[patch net-next v4 0/2] net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath

2017-11-03 Thread Jiri Pirko
From: Jiri Pirko 

This patchset's main patch is patch number 2. It carries the
description. Patch 1 is just a dependency.

---
v3->v4:
- rebased to be applicable on top of the current net-next
v2->v3:
- Using head change callback to replace miniq pointer every time tp head
  changes. This eliminates one rcu dereference and makes the claim "without
  added overhead" valid.
v1->v2:
- Use dev instead of skb->dev in sch_handle_egress as pointed out by Daniel
- Fixed synchronize_rcu_bh() in mini_qdisc_disable and commented

Jiri Pirko (2):
  net: sched: introduce chain_head_change callback
  net: core: introduce mini_Qdisc and eliminate usage of tp->q for
clsact fastpath

 include/linux/netdevice.h |  9 +---
 include/net/pkt_cls.h | 14 ++--
 include/net/sch_generic.h | 37 +++-
 net/core/dev.c| 21 +-
 net/sched/cls_api.c   | 54 ---
 net/sched/sch_generic.c   | 46 
 net/sched/sch_ingress.c   | 45 +--
 7 files changed, 166 insertions(+), 60 deletions(-)

-- 
2.9.5



Re: [RFC PATCH 07/14] packet: wire up zerocopy for AF_PACKET V4

2017-11-03 Thread Björn Töpel
2017-11-03 4:17 GMT+01:00 Willem de Bruijn :
> On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel  wrote:
>> From: Björn Töpel 
>>
>> This commits adds support for zerocopy mode. Note that zerocopy mode
>> requires that the network interface has been bound to the socket using
>> the bind syscall, and that the corresponding netdev implements the
>> AF_PACKET V4 ndos.
>>
>> Signed-off-by: Björn Töpel 
>> ---
>> +
>> +static void packet_v4_disable_zerocopy(struct net_device *dev,
>> +  struct tp4_netdev_parms *zc)
>> +{
>> +   struct tp4_netdev_parms params;
>> +
>> +   params = *zc;
>> +   params.command  = TP4_DISABLE;
>> +
>> +   (void)dev->netdev_ops->ndo_tp4_zerocopy(dev, ¶ms);
>
> Don't ignore error return codes.
>

Will fix!

>> +static int packet_v4_zerocopy(struct sock *sk, int qp)
>> +{
>> +   struct packet_sock *po = pkt_sk(sk);
>> +   struct socket *sock = sk->sk_socket;
>> +   struct tp4_netdev_parms *zc = NULL;
>> +   struct net_device *dev;
>> +   bool if_up;
>> +   int ret = 0;
>> +
>> +   /* Currently, only RAW sockets are supported.*/
>> +   if (sock->type != SOCK_RAW)
>> +   return -EINVAL;
>> +
>> +   rtnl_lock();
>> +   dev = packet_cached_dev_get(po);
>> +
>> +   /* Socket needs to be bound to an interface. */
>> +   if (!dev) {
>> +   rtnl_unlock();
>> +   return -EISCONN;
>> +   }
>> +
>> +   /* The device needs to have both the NDOs implemented. */
>> +   if (!(dev->netdev_ops->ndo_tp4_zerocopy &&
>> + dev->netdev_ops->ndo_tp4_xmit)) {
>> +   ret = -EOPNOTSUPP;
>> +   goto out_unlock;
>> +   }
>
> Inconsistent error handling with above test.
>

Will fix.

>> +
>> +   if (!(po->rx_ring.pg_vec && po->tx_ring.pg_vec)) {
>> +   ret = -EOPNOTSUPP;
>> +   goto out_unlock;
>> +   }
>
> A ring can be unmapped later with packet_set_ring. Should that operation
> fail if zerocopy is enabled? After that, it can also change version with
> PACKET_VERSION.
>

You're correct, I've missed this. I need to revisit the scenario when
a ring is unmapped, and recreated. Thanks for pointing this out.

>> +
>> +   if_up = dev->flags & IFF_UP;
>> +   zc = rtnl_dereference(po->zc);
>> +
>> +   /* Disable */
>> +   if (qp <= 0) {
>> +   if (!zc)
>> +   goto out_unlock;
>> +
>> +   packet_v4_disable_zerocopy(dev, zc);
>> +   rcu_assign_pointer(po->zc, NULL);
>> +
>> +   if (if_up) {
>> +   spin_lock(&po->bind_lock);
>> +   register_prot_hook(sk);
>> +   spin_unlock(&po->bind_lock);
>> +   }
>
> There have been a bunch of race conditions in this bind code. We need
> to be very careful with adding more states to the locking, especially when
> open coding in multiple locations, as this patch does. I counted at least
> four bind locations. See for instance also
> http://patchwork.ozlabs.org/patch/813945/
>

Yeah, the locking schemes in AF_PACKET is pretty convoluted. I'll
document and make the locking more explicit (and avoiding open coding
it).

>
>> +
>> +   goto out_unlock;
>> +   }
>> +
>> +   /* Enable */
>> +   if (!zc) {
>> +   zc = kzalloc(sizeof(*zc), GFP_KERNEL);
>> +   if (!zc) {
>> +   ret = -ENOMEM;
>> +   goto out_unlock;
>> +   }
>> +   }
>> +
>> +   if (zc->queue_pair >= 0)
>> +   packet_v4_disable_zerocopy(dev, zc);
>
> This calls disable even if zc was freshly allocated.
> Shoud be > 0?
>

Good catch. It should be > 0.

>>  static int packet_release(struct socket *sock)
>>  {
>> +   struct tp4_netdev_parms *zc;
>> struct sock *sk = sock->sk;
>> +   struct net_device *dev;
>> struct packet_sock *po;
>> struct packet_fanout *f;
>> struct net *net;
>> @@ -3337,6 +3541,20 @@ static int packet_release(struct socket *sock)
>> sock_prot_inuse_add(net, sk->sk_prot, -1);
>> preempt_enable();
>>
>> +   rtnl_lock();
>> +   zc = rtnl_dereference(po->zc);
>> +   dev = packet_cached_dev_get(po);
>> +   if (zc && dev)
>> +   packet_v4_disable_zerocopy(dev, zc);
>> +   if (dev)
>> +   dev_put(dev);
>> +   rtnl_unlock();
>> +
>> +   if (zc) {
>> +   synchronize_rcu();
>> +   kfree(zc);
>> +   }
>
> Please use a helper function for anything this complex.

Will fix.
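
Probably something like the following helper, called from
packet_release() (untested sketch, naming not final):

static void packet_v4_release_zc(struct sock *sk)
{
	struct packet_sock *po = pkt_sk(sk);
	struct tp4_netdev_parms *zc;
	struct net_device *dev;

	rtnl_lock();
	zc = rtnl_dereference(po->zc);
	dev = packet_cached_dev_get(po);
	if (zc && dev)
		packet_v4_disable_zerocopy(dev, zc);
	if (dev)
		dev_put(dev);
	rtnl_unlock();

	if (zc) {
		synchronize_rcu();
		kfree(zc);
	}
}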


Thanks,
Björn


[PATCH net-next 1/5] net: dsa: lan9303: Correct register names in comments

2017-11-03 Thread Egil Hjelmeland
Two comments refer to registers, but lack the LAN9303_ prefix.
Fix that.

Signed-off-by: Egil Hjelmeland 
---
 include/linux/dsa/lan9303.h | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/dsa/lan9303.h b/include/linux/dsa/lan9303.h
index 05d8d136baab..f48a85c377de 100644
--- a/include/linux/dsa/lan9303.h
+++ b/include/linux/dsa/lan9303.h
@@ -13,8 +13,8 @@ struct lan9303_phy_ops {
 #define LAN9303_NUM_ALR_RECORDS 512
 struct lan9303_alr_cache_entry {
u8  mac_addr[ETH_ALEN];
-   u8  port_map;   /* Bitmap of ports. Zero if unused entry */
-   u8  stp_override;   /* non zero if set ALR_DAT1_AGE_OVERRID */
+   u8  port_map; /* Bitmap of ports. Zero if unused entry */
+   u8  stp_override; /* non zero if set LAN9303_ALR_DAT1_AGE_OVERRID */
 };
 
 struct lan9303 {
@@ -28,7 +28,9 @@ struct lan9303 {
struct mutex indirect_mutex; /* protect indexed register access */
const struct lan9303_phy_ops *ops;
bool is_bridged; /* true if port 1 and 2 are bridged */
-   u32 swe_port_state; /* remember SWE_PORT_STATE while not bridged */
+
+   /* remember LAN9303_SWE_PORT_STATE while not bridged */
+   u32 swe_port_state;
/* LAN9303 do not offer reading specific ALR entry. Cache all
 * static entries in a flat table
 **/
-- 
2.11.0



[PATCH net-next 3/5] net: dsa: lan9303: Replace msleep(1) with usleep_range()

2017-11-03 Thread Egil Hjelmeland
Remove scripts/checkpatch.pl WARNING by replacing msleep(1) with usleep_range()

Signed-off-by: Egil Hjelmeland 
---
 drivers/net/dsa/lan9303-core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c
index c4afc8f1a66d..70ecd18a5e7d 100644
--- a/drivers/net/dsa/lan9303-core.c
+++ b/drivers/net/dsa/lan9303-core.c
@@ -284,7 +284,7 @@ static int lan9303_indirect_phy_wait_for_completion(struct lan9303 *chip)
}
if (!(reg & LAN9303_PMI_ACCESS_MII_BUSY))
return 0;
-   msleep(1);
+   usleep_range(1000, 2000);
}
 
return -EIO;
@@ -376,7 +376,7 @@ static int lan9303_switch_wait_for_completion(struct lan9303 *chip)
}
if (!(reg & LAN9303_SWITCH_CSR_CMD_BUSY))
return 0;
-   msleep(1);
+   usleep_range(1000, 2000);
}
 
return -EIO;
-- 
2.11.0



[PATCH net-next 0/5] net: dsa: lan9303: Linting

2017-11-03 Thread Egil Hjelmeland
This series contains no functional changes.
 - Correct some errors in comments and documentation.
Remove scripts/checkpatch.pl WARNINGs and most CHECKs:
 - Replace msleep(1) with usleep_range()
 - Remove unnecessary parentheses
 - Adjust indenting

Egil Hjelmeland (5):
  net: dsa: lan9303: Correct register names in comments
  net: dsa: lan9303: Fix syntax errors in device tree examples
  net: dsa: lan9303: Replace msleep(1) with usleep_range()
  net: dsa: lan9303: Remove unnecessary parentheses
  net: dsa: lan9303: Adjust indenting

 Documentation/devicetree/bindings/net/dsa/lan9303.txt | 4 ++--
 drivers/net/dsa/lan9303-core.c| 6 +++---
 drivers/net/dsa/lan9303_i2c.c | 2 +-
 drivers/net/dsa/lan9303_mdio.c| 2 +-
 include/linux/dsa/lan9303.h   | 8 +---
 net/dsa/tag_lan9303.c | 2 +-
 6 files changed, 13 insertions(+), 11 deletions(-)

-- 
2.11.0



[PATCH net-next 2/5] net: dsa: lan9303: Fix syntax errors in device tree examples

2017-11-03 Thread Egil Hjelmeland
Signed-off-by: Egil Hjelmeland 
---
 Documentation/devicetree/bindings/net/dsa/lan9303.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/dsa/lan9303.txt b/Documentation/devicetree/bindings/net/dsa/lan9303.txt
index 4448d063ddf6..464d6bf87605 100644
--- a/Documentation/devicetree/bindings/net/dsa/lan9303.txt
+++ b/Documentation/devicetree/bindings/net/dsa/lan9303.txt
@@ -52,7 +52,7 @@ I2C managed mode:
 
port@1 { /* external port 1 */
reg = <1>;
-   label = "lan1;
+   label = "lan1";
};
 
port@2 { /* external port 2 */
@@ -89,7 +89,7 @@ MDIO managed mode:
 
port@1 { /* external port 1 */
reg = <1>;
-   label = "lan1;
+   label = "lan1";
};
 
port@2 { /* external port 2 */
-- 
2.11.0



[PATCH net-next 5/5] net: dsa: lan9303: Adjust indenting

2017-11-03 Thread Egil Hjelmeland
Remove scripts/checkpatch.pl CHECKs by adjusting indenting.

Signed-off-by: Egil Hjelmeland 
---
 drivers/net/dsa/lan9303_i2c.c  | 2 +-
 drivers/net/dsa/lan9303_mdio.c | 2 +-
 net/dsa/tag_lan9303.c  | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/dsa/lan9303_i2c.c b/drivers/net/dsa/lan9303_i2c.c
index 24ec20f7f444..909a7e864246 100644
--- a/drivers/net/dsa/lan9303_i2c.c
+++ b/drivers/net/dsa/lan9303_i2c.c
@@ -50,7 +50,7 @@ static int lan9303_i2c_probe(struct i2c_client *client,
return -ENOMEM;
 
sw_dev->chip.regmap = devm_regmap_init_i2c(client,
-   &lan9303_i2c_regmap_config);
+  &lan9303_i2c_regmap_config);
if (IS_ERR(sw_dev->chip.regmap)) {
ret = PTR_ERR(sw_dev->chip.regmap);
dev_err(&client->dev, "Failed to allocate register map: %d\n",
diff --git a/drivers/net/dsa/lan9303_mdio.c b/drivers/net/dsa/lan9303_mdio.c
index 0bc56b9900f9..cc9c2ea1c4fe 100644
--- a/drivers/net/dsa/lan9303_mdio.c
+++ b/drivers/net/dsa/lan9303_mdio.c
@@ -116,7 +116,7 @@ static int lan9303_mdio_probe(struct mdio_device *mdiodev)
return -ENOMEM;
 
sw_dev->chip.regmap = devm_regmap_init(&mdiodev->dev, NULL, sw_dev,
-   &lan9303_mdio_regmap_config);
+  &lan9303_mdio_regmap_config);
if (IS_ERR(sw_dev->chip.regmap)) {
ret = PTR_ERR(sw_dev->chip.regmap);
dev_err(&mdiodev->dev, "regmap init failed: %d\n", ret);
diff --git a/net/dsa/tag_lan9303.c b/net/dsa/tag_lan9303.c
index e526c8967b98..5ba01fc3c6ba 100644
--- a/net/dsa/tag_lan9303.c
+++ b/net/dsa/tag_lan9303.c
@@ -88,7 +88,7 @@ static struct sk_buff *lan9303_xmit(struct sk_buff *skb, 
struct net_device *dev)
 }
 
 static struct sk_buff *lan9303_rcv(struct sk_buff *skb, struct net_device *dev,
-   struct packet_type *pt)
+  struct packet_type *pt)
 {
u16 *lan9303_tag;
unsigned int source_port;
-- 
2.11.0



[PATCH net-next 4/5] net: dsa: lan9303: Remove unnecessary parentheses

2017-11-03 Thread Egil Hjelmeland
Remove scripts/checkpatch.pl CHECKs by removing unnecessary parentheses

Signed-off-by: Egil Hjelmeland 
---
 drivers/net/dsa/lan9303-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c
index 70ecd18a5e7d..b9a95f542f65 100644
--- a/drivers/net/dsa/lan9303-core.c
+++ b/drivers/net/dsa/lan9303-core.c
@@ -483,7 +483,7 @@ static int lan9303_detect_phy_setup(struct lan9303 *chip)
return reg;
}
 
-   if ((reg != 0) && (reg != 0xffff))
+   if (reg != 0 && reg != 0xffff)
chip->phy_addr_sel_strap = 1;
else
chip->phy_addr_sel_strap = 0;
-- 
2.11.0



[PATCH net-next v2] net: mvpp2: add ethtool GOP statistics

2017-11-03 Thread Miquel Raynal
Add ethtool statistics support by reading the GOP statistics from the
hardware counters. Also implement a workqueue to gather the statistics
every second, as otherwise some 32-bit counters could overflow.

Suggested-by: Stefan Chulski 
Signed-off-by: Miquel Raynal 
---
 drivers/net/ethernet/marvell/mvpp2.c | 237 ++-
 1 file changed, 231 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvpp2.c 
b/drivers/net/ethernet/marvell/mvpp2.c
index 97efe4733661..ca46eca27981 100644
--- a/drivers/net/ethernet/marvell/mvpp2.c
+++ b/drivers/net/ethernet/marvell/mvpp2.c
@@ -38,6 +38,8 @@
 #include 
 #include 
 
+static DEFINE_IDA(engine_index_ida);
+
 /* RX Fifo Registers */
 #define MVPP2_RX_DATA_FIFO_SIZE_REG(port)  (0x00 + 4 * (port))
 #define MVPP2_RX_ATTR_FIFO_SIZE_REG(port)  (0x20 + 4 * (port))
@@ -769,6 +771,42 @@ enum mvpp2_bm_type {
MVPP2_BM_SWF_SHORT
 };
 
+/* GMAC MIB Counters register definitions */
+#define MVPP21_MIB_COUNTERS_OFFSET 0x1000
+#define MVPP21_MIB_COUNTERS_PORT_SZ 0x400
+#define MVPP22_MIB_COUNTERS_OFFSET 0x0
+#define MVPP22_MIB_COUNTERS_PORT_SZ 0x100
+
+#define MVPP2_MIB_GOOD_OCTETS_RCVD 0x0
+#define MVPP2_MIB_BAD_OCTETS_RCVD  0x8
+#define MVPP2_MIB_CRC_ERRORS_SENT  0xc
+#define MVPP2_MIB_UNICAST_FRAMES_RCVD  0x10
+#define MVPP2_MIB_BROADCAST_FRAMES_RCVD 0x18
+#define MVPP2_MIB_MULTICAST_FRAMES_RCVD 0x1c
+#define MVPP2_MIB_FRAMES_64_OCTETS 0x20
+#define MVPP2_MIB_FRAMES_65_TO_127_OCTETS  0x24
+#define MVPP2_MIB_FRAMES_128_TO_255_OCTETS 0x28
+#define MVPP2_MIB_FRAMES_256_TO_511_OCTETS 0x2c
+#define MVPP2_MIB_FRAMES_512_TO_1023_OCTETS 0x30
+#define MVPP2_MIB_FRAMES_1024_TO_MAX_OCTETS 0x34
+#define MVPP2_MIB_GOOD_OCTETS_SENT 0x38
+#define MVPP2_MIB_UNICAST_FRAMES_SENT  0x40
+#define MVPP2_MIB_MULTICAST_FRAMES_SENT 0x48
+#define MVPP2_MIB_BROADCAST_FRAMES_SENT 0x4c
+#define MVPP2_MIB_FC_SENT  0x54
+#define MVPP2_MIB_FC_RCVD  0x58
+#define MVPP2_MIB_RX_FIFO_OVERRUN  0x5c
+#define MVPP2_MIB_UNDERSIZE_RCVD   0x60
+#define MVPP2_MIB_FRAGMENTS_RCVD   0x64
+#define MVPP2_MIB_OVERSIZE_RCVD 0x68
+#define MVPP2_MIB_JABBER_RCVD  0x6c
+#define MVPP2_MIB_MAC_RCV_ERROR 0x70
+#define MVPP2_MIB_BAD_CRC_EVENT 0x74
+#define MVPP2_MIB_COLLISION 0x78
+#define MVPP2_MIB_LATE_COLLISION   0x7c
+
+#define MVPP2_MIB_COUNTERS_STATS_DELAY (1 * HZ)
+
 /* Definitions */
 
 /* Shared Packet Processor resources */
@@ -796,6 +834,7 @@ struct mvpp2 {
struct clk *axi_clk;
 
/* List of pointers to port structures */
+   int port_count;
struct mvpp2_port **port_list;
 
/* Aggregated TXQs */
@@ -817,6 +856,12 @@ struct mvpp2 {
 
/* Maximum number of RXQs per port */
unsigned int max_port_rxqs;
+
+   /* Workqueue to gather hardware statistics with its lock */
+   struct mutex gather_stats_lock;
+   struct delayed_work stats_work;
+   char queue_name[20];
+   struct workqueue_struct *stats_queue;
 };
 
 struct mvpp2_pcpu_stats {
@@ -879,6 +924,7 @@ struct mvpp2_port {
u16 tx_ring_size;
u16 rx_ring_size;
struct mvpp2_pcpu_stats __percpu *stats;
+   u64 *ethtool_stats;
 
phy_interface_t phy_interface;
struct device_node *phy_node;
@@ -4743,9 +4789,142 @@ static void mvpp2_port_loopback_set(struct mvpp2_port 
*port)
writel(val, port->base + MVPP2_GMAC_CTRL_1_REG);
 }
 
+struct mvpp2_ethtool_counter {
+   unsigned int offset;
+   const char string[ETH_GSTRING_LEN];
+   bool reg_is_64b;
+};
+
+static u64 mvpp2_read_count(struct mvpp2_port *port,
+   const struct mvpp2_ethtool_counter *counter)
+{
+   void __iomem *base;
+   u64 val;
+
+   if (port->priv->hw_version == MVPP21)
+   base = port->priv->lms_base + MVPP21_MIB_COUNTERS_OFFSET +
+  port->gop_id * MVPP21_MIB_COUNTERS_PORT_SZ;
+   else
+   base = port->priv->iface_base + MVPP22_MIB_COUNTERS_OFFSET +
+  port->gop_id * MVPP22_MIB_COUNTERS_PORT_SZ;
+
+   val = readl(base + counter->offset);
+   if (counter->reg_is_64b)
+   val += (u64)readl(base + counter->offset + 4) << 32;
+
+   return val;
+}
+
+/* Due to the fact that software statistics and hardware statistics are, by
+ * design, incremented at different moments in the chain of packet processing,
+ * it is very likely that incoming packets could have been dropped after being
+ * counted by hardware but before reaching software statistics (most probably
+ * multicast packets), and in the opposite way, during transmission

Re: [PATCH net-next v2] net: mvpp2: add ethtool GOP statistics

2017-11-03 Thread Miquel RAYNAL
On Fri,  3 Nov 2017 12:04:25 +0100
Miquel Raynal  wrote:

> Add ethtool statistics support by reading the GOP statistics from the
> hardware counters. Also implement a workqueue to gather the statistics
> every second or some 32-bit counters could overflow.
> 
> Suggested-by: Stefan Chulski 
> Signed-off-by: Miquel Raynal 
> ---

Changes since v1:
- Constified the ethtool array (with strings and offset) and added a
  third parameter to the repeated structure: if the register is 64-bit
  wide.
- Added locking inside the mvpp2_gather_stats() to avoid starting
  reading the registers while a read is already pending.
- Used devm_* allocators.
- Switched to cancel_delayed_work_sync().
- Fixed the size allocated to the statistics array.
- Added an index to the workqueue name, derived with IDA because the
  name shown in 'ps' is only 15 chars wide (so must be short) and there
  was no existing way to retrieve a unique number per engine instance
  (a rough sketch of the idea follows this list).
- Start gathering the statistics only in ndo_open(), stop in
  ndo_stop().
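
A rough sketch of that IDA-derived naming (the helper name and the 'priv'
pointer are assumptions for illustration, not the exact code in the patch):

static int mvpp2_alloc_stats_queue(struct mvpp2 *priv)
{
        int id = ida_simple_get(&engine_index_ida, 0, 0, GFP_KERNEL);

        if (id < 0)
                return id;
        /* Keep the name short so it stays readable in 'ps' (15 chars). */
        snprintf(priv->queue_name, sizeof(priv->queue_name),
                 "mvpp2_stats%d", id);
        priv->stats_queue = create_singlethread_workqueue(priv->queue_name);
        if (!priv->stats_queue) {
                ida_simple_remove(&engine_index_ida, id);
                return -ENOMEM;
        }
        return 0;
}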

Best regards,
Miquèl


Re: KASAN: use-after-free Read in refcount_inc_not_zero

2017-11-03 Thread Xin Long
On Fri, Nov 3, 2017 at 1:35 AM, syzbot

wrote:
> Hello,
>
> syzkaller hit the following crash on
> 73d3393ada4f70fa3df5639c8d438f2f034c0ecb
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
The log is too long to identify the issue; not sure if it's possible
to make it shorter.

Otherwise, let's see if it still exists after fixing the last use-after-free
issue; I suspect the two are related.

thanks.

>
>
>
>
> netlink: 1 bytes leftover after parsing attributes in process
> `syz-executor4'.
> ==
> BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:276
> [inline]
> BUG: KASAN: use-after-free in atomic_read arch/x86/include/asm/atomic.h:26
> [inline]
> BUG: KASAN: use-after-free in refcount_inc_not_zero+0x16e/0x180
> lib/refcount.c:119
> Read of size 4 at addr 8801c9de8ad8 by task syz-executor6/8757
>
> CPU: 0 PID: 8757 Comm: syz-executor6 Not tainted 4.14.0-rc5+ #138
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:52
>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>  __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
>  __read_once_size include/linux/compiler.h:276 [inline]
>  atomic_read arch/x86/include/asm/atomic.h:26 [inline]
>  refcount_inc_not_zero+0x16e/0x180 lib/refcount.c:119
>  refcount_inc+0x15/0x50 lib/refcount.c:152
>  sctp_association_hold+0x16/0x20 net/sctp/associola.c:875
>  sctp_generate_timeout_event+0x2b0/0x330 net/sctp/sm_sideeffect.c:297
>  sctp_generate_t1_init_event+0x1a/0x20 net/sctp/sm_sideeffect.c:330
>  call_timer_fn+0x233/0x830 kernel/time/timer.c:1281
>  expire_timers kernel/time/timer.c:1320 [inline]
>  __run_timers+0x7fd/0xb90 kernel/time/timer.c:1620
>  run_timer_softirq+0x4c/0xb0 kernel/time/timer.c:1646
>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:284
>  invoke_softirq kernel/softirq.c:364 [inline]
>  irq_exit+0x1cc/0x200 kernel/softirq.c:405
>  exiting_irq arch/x86/include/asm/apic.h:638 [inline]
>  smp_apic_timer_interrupt+0x177/0x710 arch/x86/kernel/apic/apic.c:1059
>  apic_timer_interrupt+0x9d/0xb0 arch/x86/entry/entry_64.S:770
>  
> RIP: 0010:copy_page+0x7/0x10 arch/x86/lib/copy_page_64.S:17
> RSP: 0018:8801d2656e48 EFLAGS: 00010286 ORIG_RAX: ff10
> RAX: 8801905c0100 RBX: 06290080 RCX: 0140
> RDX:  RSI: 88018a402600 RDI: 880172c02600
> RBP: 8801d2656f98 R08: 0002 R09: 
> R10:  R11: 0001 R12: 0002
> R13: 8800 R14: dc00 R15: ffa2
>  migrate_page+0x149/0x1f0 mm/migrate.c:751
>  move_to_new_page+0x3e1/0x8c0 mm/migrate.c:916
>  __unmap_and_move+0xad2/0x1190 mm/migrate.c:1087
>  unmap_and_move mm/migrate.c:1170 [inline]
>  migrate_pages+0x956/0x2610 mm/migrate.c:1404
>  do_mbind+0xa98/0xce0 mm/mempolicy.c:1239
>  SYSC_mbind mm/mempolicy.c:1341 [inline]
>  SyS_mbind+0x13b/0x150 mm/mempolicy.c:1323
>  entry_SYSCALL_64_fastpath+0x1f/0xbe
> RIP: 0033:0x452719
> RSP: 002b:7f37fadc7be8 EFLAGS: 0212 ORIG_RAX: 00ed
> RAX: ffda RBX: 007580d8 RCX: 00452719
> RDX:  RSI: 0080 RDI: 203b5000
> RBP: 0082 R08: 0001 R09: 0002
> R10: 20001ff8 R11: 0212 R12: 
> R13: 00a6f7ff R14: 7f37fadc89c0 R15: 0011
>
> Allocated by task 8763:
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>  kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3627
>  kmalloc include/linux/slab.h:493 [inline]
>  kzalloc include/linux/slab.h:666 [inline]
>  sctp_association_new+0x114/0x21e0 net/sctp/associola.c:309
>  sctp_sendmsg+0x128c/0x31f0 net/sctp/socket.c:1838
>  inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:762
>  sock_sendmsg_nosec net/socket.c:633 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:643
>  SYSC_sendto+0x352/0x5a0 net/socket.c:1750
>  SyS_sendto+0x40/0x50 net/socket.c:1718
>  entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> Freed by task 8776:
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
>  __cache_free mm/slab.c:3503 [inline]
>  kfree+0xca/0x250 mm/slab.c:3820
>  sctp_association_destroy net/sctp/associola.c:435 [inline]
>  sctp_association_put+0x21c/0x2f0 net/sctp/associola.c:884
>  sctp_association_free+

Re: KASAN: stack-out-of-bounds Read in xfrm_state_find (2)

2017-11-03 Thread Steffen Klassert
On Thu, Nov 02, 2017 at 01:25:28PM +0100, Florian Westphal wrote:
> Steffen Klassert  wrote:
> 
> > I'd propose to use the addresses from the template unconditionally,
> > like the (untested) patch below does.
> > 
> > Unfortunalely the reproducer does not work with my config,
> > sendto returns EAGAIN. Could anybody try this patch?
> 
> The reproducer no longer causes KASAN spew with your patch,
> but i don't have a test case that actually creates/uses a tunnel.

The patch passed my standard tests, so I intend to apply it
after a day in the ipsec/testing branch.


Re: [PATCH net-next] arp: Ignore packets with an all zero sender mac address

2017-11-03 Thread David Miller
From: Eelco Chaudron 
Date: Fri, 3 Nov 2017 11:39:04 +0100

> Looking for any mention of an all-zero MAC address being invalid,
> the only reference I could find was in the original Xerox Wire
> Specification. The IEEE specifications do not mention this at all, and
> according to them, the all-zero address is a valid MA-L address assigned
> to Xerox.
> 
> Looking at the packet more closely, it might be an attempt to do an unARP
> (RFC 1868) whose implementation forgot to set the Hardware Address Length
> to zero.
> 
> I'm sure adding an arptables entry can be used to solve this instead,
> in case the offending device cannot be fixed.
> 
> Please ignore this patch...

Ok, thanks for looking more deeply into this.


Re: [Patch net-next] net_sched: check NULL in tcf_block_put()

2017-11-03 Thread David Miller
From: Cong Wang 
Date: Thu,  2 Nov 2017 17:32:08 -0700

> Callers of tcf_block_put() could pass NULL so
> we can't use block->q before checking if block is
> NULL or not.
> 
> tcf_block_put_ext() callers are fine, it is always
> non-NULL.
> 
> Fixes: 8c4083b30e56 ("net: sched: add block bind/unbind notif. and extended 
> block_get/put")
> Reported-by: Dave Taht 
> Cc: Jiri Pirko 
> Signed-off-by: Cong Wang 

Applied, thanks Cong.


Re: [PATCH net-next 0/2] bnxt_en: Fix IRQ coalescing regressions.

2017-11-03 Thread David Miller
From: Michael Chan 
Date: Fri,  3 Nov 2017 03:32:37 -0400

> There was a typo and a missing guard-rail against illegal values in the
> recent code clean up.  Both were reported by Andy Gospodarek.

Series applied, thank you.


Re: [PATCH net-next 0/6] net: hns3: support set_link_ksettings and for nway_reset ethtool command

2017-11-03 Thread David Miller
From: Lipeng 
Date: Fri, 3 Nov 2017 12:18:24 +0800

> This patch set adds support for the set_link_ksettings and nway_reset
> ethtool commands and fixes some related ethtool bugs.
> 1, patch[4/6] adds support for ethtool_ops.set_link_ksettings.
> 2, patch[5/6] adds support for ethtool_ops.nway_reset.
> 3, patch[1/6,2/6,3/6,6/6] fix some bugs in getting port information via
>    the ethtool command (ethtool ethx).

Series applied, thank you.


Re: TCP connection closed without FIN or RST

2017-11-03 Thread Vitaly Davidovich
Hi Eric,

Ran a few more tests yesterday with packet captures, including a
capture on the client.  It turns out that the client stops ack'ing
entirely at some point in the conversation - the last advertised
client window is not even close to zero (it's actually ~348K).  So
there's complete radio silence from the client for some reason, even
though it does send back ACKs early on in the conversation.  So yes,
as far as the server is concerned, the client is completely gone and
tcp_retries2 rightfully breaches eventually once the server retrans go
unanswered long (and for sufficient times) enough.
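
(Rough arithmetic, assuming default settings on the server: with
tcp_retries2 = 15, the 200 ms RTO floor and the 120 s RTO cap, the
unanswered retransmission sequence corresponds to the documented
hypothetical timeout of about 924 seconds, i.e. roughly 15 minutes.)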

What's odd though is the packet capture on the client shows the server
retrans packets arriving, so it's not like the segments don't reach
the client.  I'll keep investigating, but if you (or anyone else
reading this) knows of circumstances that might cause this, I'd
appreciate any tips on where/what to look at.

Thanks

On Wed, Nov 1, 2017 at 7:06 PM, Eric Dumazet  wrote:
> On Wed, 2017-11-01 at 22:22 +, Vitaly Davidovich wrote:
>> Eric,
>>
>
>> Yes I agree.  However the thing I’m still puzzled about is the client
>> application is not reading/draining the recvq - ok, the client tcp
>> stack should start advertising a 0 window size.  Does a 0 window size
>> count against the tcp_retries2? Is that what you were alluding to in
>> your first reply?
>>
>
> Every time we receive a (valid) ACK, win 0 or not, the counter
> of attempts is cleared, giving the sender the opportunity to send 15
> more probes.
>>
>> If it *does* count towards the retries limit then a RST doesn’t seem
>> like a bad idea.  The client is responding with segments but the user
>> app there just isn’t draining the data.  Presumably that RST has a
>> good chance of reaching the client and then unblocking the read()
>> there with a peer reset error.  Or am I missing something?
>>
>>
>> If it doesn’t count towards the limit then I need to figure out why
>> the 0 window size segments weren’t being sent by the client.
>
> Yes please :)
>>
>>
>> I will try to double check that the client was indeed advertising 0
>> window size.  There’s nothing special about that machine - it’s a
>> 4.1.35 kernel as well.  I wouldn’t expect the tcp stack there to be
>> unresponsive just because the user app is sleeping.
>>
>
>
>


[PATCH] dpaa_eth: avoid uninitialized variable false-positive warning

2017-11-03 Thread Arnd Bergmann
We can now build this driver on ARM, so I ran into a randconfig build
warning that presumably had existed on powerpc already.

drivers/net/ethernet/freescale/dpaa/dpaa_eth.c: In function 'sg_fd_to_skb':
drivers/net/ethernet/freescale/dpaa/dpaa_eth.c:1712:18: error: 'skb' may be 
used uninitialized in this function [-Werror=maybe-uninitialized]

I'm slightly changing the logic here, to make it obvious to the
compiler that 'skb' is always initialized.

Signed-off-by: Arnd Bergmann 
---
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c 
b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index 969f6b12952e..ebc55b6a6349 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -1721,6 +1721,7 @@ static struct sk_buff *sg_fd_to_skb(const struct 
dpaa_priv *priv,
 
/* Iterate through the SGT entries and add data buffers to the skb */
sgt = vaddr + fd_off;
+   skb = NULL;
for (i = 0; i < DPAA_SGT_MAX_ENTRIES; i++) {
/* Extension bit is not supported */
WARN_ON(qm_sg_entry_is_ext(&sgt[i]));
@@ -1738,7 +1739,7 @@ static struct sk_buff *sg_fd_to_skb(const struct 
dpaa_priv *priv,
count_ptr = this_cpu_ptr(dpaa_bp->percpu_count);
dma_unmap_single(dpaa_bp->dev, sg_addr, dpaa_bp->size,
 DMA_FROM_DEVICE);
-   if (i == 0) {
+   if (!skb) {
sz = dpaa_bp->size +
SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
skb = build_skb(sg_vaddr, sz);
-- 
2.9.0



Re: [patch net-next v4 0/2] net: core: introduce mini_Qdisc and eliminate usage of tp->q for clsact fastpath

2017-11-03 Thread David Miller
From: Jiri Pirko 
Date: Fri,  3 Nov 2017 11:46:23 +0100

> This patchset's main patch is patch number 2. It carries the
> description. Patch 1 is just a dependency.

Series applied, thanks Jiri.


Re: TCP connection closed without FIN or RST

2017-11-03 Thread Eric Dumazet
On Fri, 2017-11-03 at 08:41 -0400, Vitaly Davidovich wrote:
> Hi Eric,
> 
> Ran a few more tests yesterday with packet captures, including a
> capture on the client.  It turns out that the client stops ack'ing
> entirely at some point in the conversation - the last advertised
> client window is not even close to zero (it's actually ~348K).  So
> there's complete radio silence from the client for some reason, even
> though it does send back ACKs early on in the conversation.  So yes,
> as far as the server is concerned, the client is completely gone and
> tcp_retries2 rightfully breaches eventually once the server retrans go
> unanswered long (and for sufficient times) enough.
> 
> What's odd though is the packet capture on the client shows the server
> retrans packets arriving, so it's not like the segments don't reach
> the client.  I'll keep investigating, but if you (or anyone else
> reading this) knows of circumstances that might cause this, I'd
> appreciate any tips on where/what to look at.


Might be a middle box issue ?  Like a firewall connection tracking
having some kind of timeout if nothing is sent on one direction ?

What output do you have from client side with :

ss -temoi dst 




Re: TCP connection closed without FIN or RST

2017-11-03 Thread Eric Dumazet
On Fri, 2017-11-03 at 06:00 -0700, Eric Dumazet wrote:
> On Fri, 2017-11-03 at 08:41 -0400, Vitaly Davidovich wrote:
> > Hi Eric,
> > 
> > Ran a few more tests yesterday with packet captures, including a
> > capture on the client.  It turns out that the client stops ack'ing
> > entirely at some point in the conversation - the last advertised
> > client window is not even close to zero (it's actually ~348K).  So
> > there's complete radio silence from the client for some reason, even
> > though it does send back ACKs early on in the conversation.  So yes,
> > as far as the server is concerned, the client is completely gone and
> > tcp_retries2 rightfully breaches eventually once the server retrans go
> > unanswered long (and for sufficient times) enough.
> > 
> > What's odd though is the packet capture on the client shows the server
> > retrans packets arriving, so it's not like the segments don't reach
> > the client.  I'll keep investigating, but if you (or anyone else
> > reading this) knows of circumstances that might cause this, I'd
> > appreciate any tips on where/what to look at.
> 
> 
> Might be a middle box issue ?  Like a firewall connection tracking
> having some kind of timeout if nothing is sent on one direction ?
> 
> What output do you have from client side with :
> 
> ss -temoi dst 

It also could be a wrapping issue on TCP timestamps.

You could try disabling tcp timestamps, and restart the TCP flow.

echo 0 >/proc/sys/net/ipv4/tcp_timestamps







[PATCH net-next] tcp: tcp_mtu_probing() cleanup

2017-11-03 Thread Eric Dumazet
From: Eric Dumazet 

Reduce one indentation level to make code more readable.
tcp_sync_mss() can be factorized.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_timer.c |   31 ++-
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 
035a1ef1f2d8462c1d19f364b599ffac538ef688..16df6dd44b988a128d97df3a7953437499a216e8
 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -107,26 +107,23 @@ static int tcp_orphan_retries(struct sock *sk, bool alive)
 
 static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
 {
-   struct net *net = sock_net(sk);
+   const struct net *net = sock_net(sk);
+   int mss;
 
/* Black hole detection */
-   if (net->ipv4.sysctl_tcp_mtu_probing) {
-   if (!icsk->icsk_mtup.enabled) {
-   icsk->icsk_mtup.enabled = 1;
-   icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
-   tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
-   } else {
-   struct net *net = sock_net(sk);
-   struct tcp_sock *tp = tcp_sk(sk);
-   int mss;
-
-   mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 
1;
-   mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
-   mss = max(mss, 68 - tp->tcp_header_len);
-   icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
-   tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
-   }
+   if (!net->ipv4.sysctl_tcp_mtu_probing)
+   return;
+
+   if (!icsk->icsk_mtup.enabled) {
+   icsk->icsk_mtup.enabled = 1;
+   icsk->icsk_mtup.probe_timestamp = tcp_jiffies32;
+   } else {
+   mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
+   mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
+   mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
+   icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
}
+   tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
 }
 
 




[PATCH net-next] tcp: do not clear again skb->csum in tcp_init_nondata_skb()

2017-11-03 Thread Eric Dumazet
From: Eric Dumazet 

tcp_init_nondata_skb() is fed with freshly allocated skbs.
They already have a cleared csum field, no need to clear it again.

This is based on Neal review on commit 3b11775033dc ("tcp: do not mangle
skb->cb[] in tcp_make_synack()"), noticing I did not clear skb->csum.

Signed-off-by: Eric Dumazet 
Reported-by: Neal Cardwell 
---
 net/ipv4/tcp_output.c |1 -
 1 file changed, 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 
822962ece2840824db3d89993f6889780cd2ab99..0b8062274ed2bf1c2781fe7e88f0de97f6ab6e55
 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -381,7 +381,6 @@ static void tcp_ecn_send(struct sock *sk, struct sk_buff 
*skb,
 static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
 {
skb->ip_summed = CHECKSUM_PARTIAL;
-   skb->csum = 0;
 
TCP_SKB_CB(skb)->tcp_flags = flags;
TCP_SKB_CB(skb)->sacked = 0;




Re: [4.14rc6] __tcp_select_window divide by zero.

2017-11-03 Thread Dave Jones
On Tue, Oct 24, 2017 at 09:00:30AM -0400, Dave Jones wrote:
 > divide error:  [#1] SMP KASAN
 > CPU: 0 PID: 31140 Comm: trinity-c12 Not tainted 4.14.0-rc6-think+ #1 
 > RIP: 0010:__tcp_select_window+0x21f/0x400
 > Call Trace:
 >  tcp_cleanup_rbuf+0x27d/0x2a0
 >  tcp_recvmsg+0x7a9/0x1430
 >  inet_recvmsg+0x10b/0x360
 >  sock_read_iter+0x19d/0x240
 >  do_iter_readv_writev+0x2e4/0x320
 >  do_iter_read+0x149/0x280
 >  vfs_readv+0x107/0x180
 >  do_readv+0xc0/0x1b0
 >  do_syscall_64+0x182/0x400
 >  entry_SYSCALL64_slow_path+0x25/0x25
 > Code: 41 5e 41 5f c3 48 8d bb 48 09 00 00 e8 4b 2b 30 ff 8b 83 48 09 00 00 
 > 89 ea 44 29 f2 39 c2 7d 08 39 c5 0f 8d 86 01 00 00 89 e8 99 <41> f7 fe 89 e8 
 > 29 d0 eb 8c 41 f7 df 48 89 c7 44 89 f9 d3 fd e8 
 > RIP: __tcp_select_window+0x21f/0x400 RSP: 8803df54f418
 > 
 >
 >if (window <= free_space - mss || window > free_space)
 >window = rounddown(free_space, mss);

I'm still hitting this fairly often, so I threw in a debug patch, and
when this happens..

[53182.361210] window: 0 free_space: 0 mss: 0

Any suggestions on what we should default the window size to be in
this situation to avoid the rounddown ?
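
One illustrative direction (just a sketch, not a tested or reviewed fix,
and the placement of the guard is an assumption): bail out before the
rounddown when mss is zero, since rounddown() ends up doing a modulo by
its second argument:

        if (unlikely(!mss))
                return window;  /* illustrative guard only */

        if (window <= free_space - mss || window > free_space)
                window = rounddown(free_space, mss);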


Dave



Re: TCP connection closed without FIN or RST

2017-11-03 Thread Vitaly Davidovich
On Fri, Nov 3, 2017 at 9:00 AM, Eric Dumazet  wrote:
> On Fri, 2017-11-03 at 08:41 -0400, Vitaly Davidovich wrote:
>> Hi Eric,
>>
>> Ran a few more tests yesterday with packet captures, including a
>> capture on the client.  It turns out that the client stops ack'ing
>> entirely at some point in the conversation - the last advertised
>> client window is not even close to zero (it's actually ~348K).  So
>> there's complete radio silence from the client for some reason, even
>> though it does send back ACKs early on in the conversation.  So yes,
>> as far as the server is concerned, the client is completely gone and
>> tcp_retries2 rightfully breaches eventually once the server retrans go
>> unanswered long (and for sufficient times) enough.
>>
>> What's odd though is the packet capture on the client shows the server
>> retrans packets arriving, so it's not like the segments don't reach
>> the client.  I'll keep investigating, but if you (or anyone else
>> reading this) knows of circumstances that might cause this, I'd
>> appreciate any tips on where/what to look at.
>
>
> Might be a middle box issue ?  Like a firewall connection tracking
> having some kind of timeout if nothing is sent on one direction ?
Yeah, that's certainly possible although I've not found evidence of
that yet, including asking sysadmins.  But it's definitely an avenue
I'm going to walk a bit further down.
>
> What output do you have from client side with :
>
> ss -temoi dst 
I snipped some irrelevant info, like IP addresses, uid, inode number, etc.

Client before it wakes up - the recvq has been at 125976 for the
entire time it's been sleeping (15 minutes):

State   Recv-Q Send-Q

ESTAB   125976 0



skmem:(r151040,rb15,t0,tb15,f512,w0,o0,bl0) ts
sack scalable wscale:0,11 rto:208 rtt:4.664/8.781 ato:40 mss:1448
cwnd:10 send 24.8Mbps rcv_rtt:321786 rcv_space:524140


While the server is on its last retrans timer, the client wakes up and
slurps up its recv buffer:

State   Recv-Q Send-Q

ESTAB   0  0

 skmem:(r0,rb15,t0,tb15,f151552,w0,o0,bl0) ts
sack scalable wscale:0,11 rto:208 rtt:4.664/8.781 ato:40 mss:1448
cwnd:10 send 24.8Mbps rcv_rtt:321786 rcv_space:524140



Here's the cmd output from the server right before the last retrans
timer expires and the socket is aborted.  Note that this output is
after the client has drained its recv queue (the output right above):


State   Recv-Q Send-Q

ESTAB   0  925272

timer:(on,14sec,15)

 skmem:(r0,rb10,t0,tb105,f2440,w947832,o0,bl0) ts sack
scalable wscale:11,0 rto:12 rtt:9.69/16.482 ato:40 mss:1448 cwnd:1
ssthresh:89 send 1.2Mbps unacked:99 retrans:1/15 lost:99 rcv_rtt:4
rcv_space:28960

Also worth noting the server's sendq has been at 925272 the entire time as well.


Does anything stand out here? I guess one thing that stands out to me
(but that could be due to my lack of in-depth knowledge of this) is
that the client rcv_space is significantly larger than the recvq.

Thanks Eric!

>
>


Re: TCP connection closed without FIN or RST

2017-11-03 Thread Vitaly Davidovich
On Fri, Nov 3, 2017 at 9:02 AM, Eric Dumazet  wrote:
> On Fri, 2017-11-03 at 06:00 -0700, Eric Dumazet wrote:
>> On Fri, 2017-11-03 at 08:41 -0400, Vitaly Davidovich wrote:
>> > Hi Eric,
>> >
>> > Ran a few more tests yesterday with packet captures, including a
>> > capture on the client.  It turns out that the client stops ack'ing
>> > entirely at some point in the conversation - the last advertised
>> > client window is not even close to zero (it's actually ~348K).  So
>> > there's complete radio silence from the client for some reason, even
>> > though it does send back ACKs early on in the conversation.  So yes,
>> > as far as the server is concerned, the client is completely gone and
>> > tcp_retries2 rightfully breaches eventually once the server retrans go
>> > unanswered long (and for sufficient times) enough.
>> >
>> > What's odd though is the packet capture on the client shows the server
>> > retrans packets arriving, so it's not like the segments don't reach
>> > the client.  I'll keep investigating, but if you (or anyone else
>> > reading this) knows of circumstances that might cause this, I'd
>> > appreciate any tips on where/what to look at.
>>
>>
>> Might be a middle box issue ?  Like a firewall connection tracking
>> having some kind of timeout if nothing is sent on one direction ?
>>
>> What output do you have from client side with :
>>
>> ss -temoi dst 
>
> It also could be a wrapping issue on TCP timestamps.
>
> You could try disabling tcp timestamps, and restart the TCP flow.
>
> echo 0 >/proc/sys/net/ipv4/tcp_timestamps
Ok, I will try to do that.  Thanks for the tip.
>
>
>
>
>


[PATCH] orinoco_usb: remove redundant pointer dev

2017-11-03 Thread Colin King
From: Colin Ian King 

The pointer dev is assigned but never read, hence it is redundant
and can be removed. Cleans up clang warning:

drivers/net/wireless/intersil/orinoco/orinoco_usb.c:1468:2: warning:
Value stored to 'dev' is never read

Signed-off-by: Colin Ian King 
---
 drivers/net/wireless/intersil/orinoco/orinoco_usb.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/wireless/intersil/orinoco/orinoco_usb.c 
b/drivers/net/wireless/intersil/orinoco/orinoco_usb.c
index 56f6e3b71f48..501180584b4b 100644
--- a/drivers/net/wireless/intersil/orinoco/orinoco_usb.c
+++ b/drivers/net/wireless/intersil/orinoco/orinoco_usb.c
@@ -1457,7 +1457,6 @@ static void ezusb_bulk_in_callback(struct urb *urb)
 
 static inline void ezusb_delete(struct ezusb_priv *upriv)
 {
-   struct net_device *dev;
struct list_head *item;
struct list_head *tmp_item;
unsigned long flags;
@@ -1465,7 +1464,6 @@ static inline void ezusb_delete(struct ezusb_priv *upriv)
BUG_ON(in_interrupt());
BUG_ON(!upriv);
 
-   dev = upriv->dev;
mutex_lock(&upriv->mtx);
 
upriv->udev = NULL; /* No timer will be rearmed from here */
-- 
2.14.1



Re: Page allocator bottleneck

2017-11-03 Thread Mel Gorman
On Thu, Nov 02, 2017 at 07:21:09PM +0200, Tariq Toukan wrote:
> 
> 
> On 18/09/2017 12:16 PM, Tariq Toukan wrote:
> > 
> > 
> > On 15/09/2017 1:23 PM, Mel Gorman wrote:
> > > On Thu, Sep 14, 2017 at 07:49:31PM +0300, Tariq Toukan wrote:
> > > > Insights: Major degradation between #1 and #2, not getting any
> > > > close to linerate! Degradation is fixed between #2 and #3. This is
> > > > because page allocator cannot stand the higher allocation rate. In
> > > > #2, we also see that the addition of rings (cores) reduces BW (!!),
> > > > as result of increasing congestion over shared resources.
> > > > 
> > > 
> > > Unfortunately, no surprises there.
> > > 
> > > > Congestion in this case is very clear. When monitored in perf
> > > > top: 85.58% [kernel] [k] queued_spin_lock_slowpath
> > > > 
> > > 
> > > While it's not proven, the most likely candidate is the zone lock
> > > and that should be confirmed using a call-graph profile. If so, then
> > > the suggestion to tune to the size of the per-cpu allocator would
> > > mitigate the problem.
> > > 
> > Indeed, I tuned the per-cpu allocator and bottleneck is released.
> > 
> 
> Hi all,
> 
> After leaving this task for a while doing other tasks, I got back to it now
> and see that the good behavior I observed earlier was not stable.
> 
> Recall: I work with a modified driver that allocates a page (4K) per packet
> (MTU=1500), in order to simulate the stress on page-allocator in 200Gbps
> NICs.
> 

There is almost nothing in the data that hasn't been discussed before. The
suggestion to free on a remote per-cpu list would be expensive as it would
require per-cpu lists to have a lock for safe remote access.  However,
I'd be curious if you could test the mm-pagealloc-irqpvec-v1r4 branch at
https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git .  It's an
unfinished prototype I worked on a few weeks ago. I was going to revisit
it in about a month's time when 4.15-rc1 was out. I'd be interested in seeing
if it has a positive gain in normal page allocations without destroying
the performance of interrupt and softirq allocation contexts. The
interrupt/softirq context testing is crucial as that is something that
hurt us before when trying to improve page allocator performance.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH net-next 0/5] net: dsa: lan9303: Linting

2017-11-03 Thread Vivien Didelot
Egil Hjelmeland  writes:

> This series contains no functional changes.
>  - Correct some errors in comments and documentation.
> Remove scripts/checkpatch.pl WARNINGs and most CHECKs:
>  - Replace msleep(1) with usleep_range()
>  - Remove unnecessary parentheses
>  - Adjust indenting
>
> Egil Hjelmeland (5):
>   net: dsa: lan9303: Correct register names in comments
>   net: dsa: lan9303: Fix syntax errors in device tree examples
>   net: dsa: lan9303: Replace msleep(1) with usleep_range()
>   net: dsa: lan9303: Remove unnecessary parentheses
>   net: dsa: lan9303: Adjust indenting
>
>  Documentation/devicetree/bindings/net/dsa/lan9303.txt | 4 ++--
>  drivers/net/dsa/lan9303-core.c| 6 +++---
>  drivers/net/dsa/lan9303_i2c.c | 2 +-
>  drivers/net/dsa/lan9303_mdio.c| 2 +-
>  include/linux/dsa/lan9303.h   | 8 +---
>  net/dsa/tag_lan9303.c | 2 +-
>  6 files changed, 13 insertions(+), 11 deletions(-)

Reviewed-by: Vivien Didelot 


Thanks!

Vivien


Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-03 Thread Willem de Bruijn
On Fri, Nov 3, 2017 at 7:13 PM, Karlsson, Magnus
 wrote:
>
>
>> -Original Message-
>> From: Willem de Bruijn [mailto:willemdebruijn.ker...@gmail.com]
>> Sent: Friday, November 3, 2017 5:35 AM
>> To: Björn Töpel 
>> Cc: Karlsson, Magnus ; Duyck, Alexander H
>> ; Alexander Duyck
>> ; John Fastabend
>> ; Alexei Starovoitov ; Jesper
>> Dangaard Brouer ; michael.lundkv...@ericsson.com;
>> ravineet.si...@ericsson.com; Daniel Borkmann ;
>> Network Development ; Topel, Bjorn
>> ; Brandeburg, Jesse
>> ; Singhai, Anjali ;
>> Rosen, Rami ; Shaw, Jeffrey B
>> ; Yigit, Ferruh ; Zhang,
>> Qi Z 
>> Subject: Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
>>
>> On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel 
>> wrote:
>> > From: Björn Töpel 
>> >
>> > This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are
>> > optimized for high performance packet processing and zero-copy
>> > semantics. Throughput improvements can be up to 40x compared to V2
>> and
>> > V3 for the micro benchmarks included. Would be great to get your
>> > feedback on it.
>> >
>> > The main difference between V4 and V2/V3 is that TX and RX descriptors
>> > are separated from packet buffers.
>>
>> Cool feature. I'm looking forward to the netdev talk. Aside from the inline
>> comments in the patches, a few architecture questions.
>
> Glad to hear. Are you going to Netdev in Seoul? If so, let us hook up
> and discuss your comments in further detail. Some initial thoughts
> below.

Sounds great. I'll be there.

>> Is TX support needed? Existing PACKET_TX_RING already sends out packets
>> without copying directly from the tx_ring. Indirection through a descriptor
>> ring is not helpful on TX if all packets still have to come from a 
>> pre-registered
>> packet pool. The patch set adds a lot of tx-only code and is complex enough
>> without it.
>
> That is correct, but what if the packet you are going to transmit came
> in from the receive path and is already in the packet buffer?

Oh, yes, of course. That is a common use case. I should have
thought of that.

> This
> might happen if the application is examining/sniffing packets then
> sending them out, or doing some modification to them. In that case we
> avoid a copy in V4 since the packet is already in the packet
> buffer. With V2 and V3, a copy from the RX ring to the TX ring would
> be needed. In the PACKET_ZEROCOPY case, avoiding this copy increases
> performance quite a lot.
>
>> Can you use the existing PACKET_V2 format for the packet pool? The
>> v4 format is nearly the same as V2. Using the same version might avoid some
>> code duplication and simplify upgrading existing legacy code.
>> Instead of continuing to add new versions whose behavior is implicit,
>> perhaps we can add explicit mode PACKET_INDIRECT to PACKET_V2.
>
> Interesting idea that I think is worth thinking more about. One
> problem though with the V2 ring format model, and the current V4
> format too by the way, when applied to a user-space allocated memory
> is that they are symmetric, i.e. that user space and kernel space have
> to produce and consume the same amount of entries (within the length
> of the descriptor area). User space sends down a buffer entry that the
> kernel fills in for RX for example. Symmetric queues do not work when
> you have a shared packet buffer between two processes. (This is not a
> requirement, but someone might do a mmap with MAP_SHARED for the
> packet buffer and then fork off a child that then inherits this packet
> buffer.) One of the processes might just receive packets, while the
> other one is transmitting. Or if you have a veth link pair between two
> processes and they have been set up to share packet buffer area. With
> a symmetric queue you have to copy even if they share the same packet
> buffer, but with an asymmetric queue, you do not and the driver only
> needs to copy the packet buffer id from the TX desc ring of the
> sender to the RX desc ring of the receiver, not the data. I think this
> gives an indication that we need a new structure. Anyway, I like your
> idea and I think it is worth thinking more about it. Let us have a
> discussion about this at Netdev, if you are there.

Okay. I don't quite understand the definition of symmetric here. At
least one problem that you describe, the veth pair, is solved by
introducing descriptor rings as a level of indirection, regardless of the
format of the frames in the packet ring (now, really, random access
packet pool).

>> Finally, is it necessary to define a new descriptor ring format? Same for the
>> packet array and frame set. The kernel already has a few, such as virtio for
>> the first, skb_array/ptr_ring, even linux list for the second. These 
>> containers
>> add a lot of new boilerplate code. If new formats are absolutely necessary, 
>> at
>> least we should consider making them generic (like skb_array and ptr_ring).
>> But I'd like to understand first why, e.g., virtio cannot be used.
>
> Agree with you. Good if we ca

Re: TCP connection closed without FIN or RST

2017-11-03 Thread Vitaly Davidovich
On Fri, Nov 3, 2017 at 9:39 AM, Vitaly Davidovich  wrote:
> On Fri, Nov 3, 2017 at 9:02 AM, Eric Dumazet  wrote:
>> On Fri, 2017-11-03 at 06:00 -0700, Eric Dumazet wrote:
>>> On Fri, 2017-11-03 at 08:41 -0400, Vitaly Davidovich wrote:
>>> > Hi Eric,
>>> >
>>> > Ran a few more tests yesterday with packet captures, including a
>>> > capture on the client.  It turns out that the client stops ack'ing
>>> > entirely at some point in the conversation - the last advertised
>>> > client window is not even close to zero (it's actually ~348K).  So
>>> > there's complete radio silence from the client for some reason, even
>>> > though it does send back ACKs early on in the conversation.  So yes,
>>> > as far as the server is concerned, the client is completely gone and
>>> > tcp_retries2 rightfully breaches eventually once the server retrans go
>>> > unanswered long (and for sufficient times) enough.
>>> >
>>> > What's odd though is the packet capture on the client shows the server
>>> > retrans packets arriving, so it's not like the segments don't reach
>>> > the client.  I'll keep investigating, but if you (or anyone else
>>> > reading this) knows of circumstances that might cause this, I'd
>>> > appreciate any tips on where/what to look at.
>>>
>>>
>>> Might be a middle box issue ?  Like a firewall connection tracking
>>> having some kind of timeout if nothing is sent on one direction ?
>>>
>>> What output do you have from client side with :
>>>
>>> ss -temoi dst 
>>
>> It also could be a wrapping issue on TCP timestamps.
>>
>> You could try disabling tcp timestamps, and restart the TCP flow.
>>
>> echo 0 >/proc/sys/net/ipv4/tcp_timestamps
> Ok, I will try to do that.  Thanks for the tip.
Tried with tcp_timestamps disabled on the client (didn't touch the
server), but that didn't change the outcome - same issue at the end.
>>
>>
>>
>>
>>


[PATCH] rtlwifi: remove redundant pointer tid_data

2017-11-03 Thread Colin King
From: Colin Ian King 

tid_data is assigned but never read, hence it is redundant
and can be removed. Cleans up clang warning:

drivers/net/wireless/realtek/rtlwifi/base.c:1581:2: warning: Value
stored to 'tid_data' is never read

Signed-off-by: Colin Ian King 
---
 drivers/net/wireless/realtek/rtlwifi/base.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/wireless/realtek/rtlwifi/base.c 
b/drivers/net/wireless/realtek/rtlwifi/base.c
index 7e3107f9e37f..cad2272ae21b 100644
--- a/drivers/net/wireless/realtek/rtlwifi/base.c
+++ b/drivers/net/wireless/realtek/rtlwifi/base.c
@@ -1630,7 +1630,6 @@ int rtl_tx_agg_stop(struct ieee80211_hw *hw, struct 
ieee80211_vif *vif,
struct ieee80211_sta *sta, u16 tid)
 {
struct rtl_priv *rtlpriv = rtl_priv(hw);
-   struct rtl_tid_data *tid_data;
struct rtl_sta_info *sta_entry = NULL;
 
if (sta == NULL)
@@ -1643,7 +1642,6 @@ int rtl_tx_agg_stop(struct ieee80211_hw *hw, struct 
ieee80211_vif *vif,
return -EINVAL;
 
sta_entry = (struct rtl_sta_info *)sta->drv_priv;
-   tid_data = &sta_entry->tids[tid];
sta_entry->tids[tid].agg.agg_state = RTL_AGG_STOP;
 
ieee80211_stop_tx_ba_cb_irqsafe(vif, sta->addr, tid);
-- 
2.14.1



Re: [PATCH net-next 4/5] net: dsa: lan9303: Remove unnecessary parentheses

2017-11-03 Thread Joe Perches
On Fri, 2017-11-03 at 11:55 +0100, Egil Hjelmeland wrote:
> Remove scripts/checkpatch.pl CHECKs by removing unnecessary parentheses
[]
> diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c
[]
> @@ -483,7 +483,7 @@ static int lan9303_detect_phy_setup(struct lan9303 *chip)
>   return reg;
>   }
>  
> - if ((reg != 0) && (reg != 0xffff))
> + if (reg != 0 && reg != 0xffff)
>   chip->phy_addr_sel_strap = 1;
>   else
>   chip->phy_addr_sel_strap = 0;

phy_addr_sel_strap is currently bool.

If this is to be changed, it should be set
true or false.

My preference would be:

chip->phy_addr_sel_strap = (reg != 0 && reg != 0xffff);

But perhaps its bool type should be converted
to int as this phy_addr_sel_strap is used as
int several times.

$ git grep phy_addr_sel_strap
drivers/net/dsa/lan9303-core.c: /* depending on the 'phy_addr_sel_strap' 
setting, the three phys are
drivers/net/dsa/lan9303-core.c:  * 'phy_addr_sel_strap' setting directly, so we 
need a test, which
drivers/net/dsa/lan9303-core.c:  * Special reg 18 of phy 3 reads as 0x, if 
'phy_addr_sel_strap' is 0
drivers/net/dsa/lan9303-core.c:  * 0x, which means 'phy_addr_sel_strap' is 
1 and the IDs are 1-2-3.
drivers/net/dsa/lan9303-core.c: chip->phy_addr_sel_strap = 1;
drivers/net/dsa/lan9303-core.c: chip->phy_addr_sel_strap = 0;
drivers/net/dsa/lan9303-core.c: chip->phy_addr_sel_strap ? "1-2-3" : 
"0-1-2");
drivers/net/dsa/lan9303-core.c: int phy_base = chip->phy_addr_sel_strap;
drivers/net/dsa/lan9303-core.c: int phy_base = chip->phy_addr_sel_strap;
drivers/net/dsa/lan9303-core.c: if (port == chip->phy_addr_sel_strap) {
drivers/net/dsa/lan9303-core.c: lan9303_phy_write(ds, 
chip->phy_addr_sel_strap + port,
drivers/net/dsa/lan9303-core.c: chip->ds->phys_mii_mask = 
chip->phy_addr_sel_strap ? 0xe : 0x7;
include/linux/dsa/lan9303.h:bool phy_addr_sel_strap;



pull-request: wireless-drivers-next 2017-11-03

2017-11-03 Thread Kalle Valo
Hi Dave,

another pull request to net-next for v4.15. I'm at the airport on my way
to Netdev 2.2, so please pay extra attention if I made any stupid
mistakes. And as always, please let me know if there are any problems.

If Linus does not release final v4.14 on Sunday, and gives us one more
week before the merge window, I'll try to send one more pull request
next week.

Kalle

The following changes since commit aec72f3392b1d598a979e89c4fdb131965ae0ab3:

  net-tun: fix panics at dismantle time (2017-10-20 13:31:26 +0100)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next.git 
tags/wireless-drivers-next-for-davem-2017-11-03

for you to fetch changes up to e226fb5affccca98c405de80527180224d93d251:

  Merge ath-next from 
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git (2017-11-02 
19:48:25 +0200)


wireless-drivers-next patches for 4.15

Mostly fixes this time, but also few new features.

Major changes:

wil6210

* remove ssid debugfs file

rsi

* add WOWLAN support for suspend, hibernate and shutdown states

ath10k

* add support for CCMP-256, GCMP and GCMP-256 ciphers on hardware
  where it's supported (QCA99x0 and QCA4019)


Amitkumar Karwar (2):
  rsi: move rsi_sdio_reinit_device() out of CONFIG_PM
  rsi: fix kbuild reported build errors with CONFIG_PM off

Anilkumar Kolli (3):
  ath10k: move ath10k_mac_tdls_vif*() functions
  ath10k: block offchannel operations if TDLS session is active
  ath10k: fix sending wmi cmd during the tdls teardown

Arnd Bergmann (2):
  brcmsmac: split up wlc_phy_workarounds_nphy
  brcmsmac: reindent split functions

Ben Greear (1):
  ath10k: store coverage-class in case firmware is not booted

Beni Lev (1):
  iwlwifi: mvm: allow reading UMAC error data from SMEM in A000 devices

Bhumika Goyal (1):
  ath10k: make ath10k_hw_ce_regs const

Brian Norris (2):
  ath10k: fix core PCI suspend when WoWLAN is supported but disabled
  ath10k: fix build errors with !CONFIG_PM

Christos Gkekas (2):
  ath10k: spectral: remove redundant check in write_file_spectral_count()
  ath9k: debug: Remove redundant check

Colin Ian King (1):
  ath9k: make const array reg_hole_list static, reduces object code size

Ganapathi Bhat (1):
  mwifiex: do not transmit in 11N rates when connected in TKIP security

Himanshu Jha (1):
  ath9k: remove cast to void pointer

Igor Mitsyanko (1):
  qtnfmac: advertise support of inactivity timeout

Joe Perches (1):
  bcma: Use bcma_debug and not pr_cont in MIPS driver

Johannes Berg (6):
  iwlwifi: mvm: allocate reorder buffer according to need
  iwlwifi: mvm: pass baid_data to iwl_mvm_release_frames()
  iwlwifi: pcie: remove set but not used variable tcph
  wil6210: remove wil6210_uapi.h from MAINTAINERS
  wil6210: remove SSID debugfs
  libertas: don't write wdev->ssid/_len

Kalle Valo (3):
  Merge tag 'iwlwifi-next-for-kalle-2017-10-18' of 
git://git.kernel.org/.../iwlwifi/iwlwifi-next
  Merge ath-next from git://git.kernel.org/.../kvalo/ath.git
  Merge ath-next from git://git.kernel.org/.../kvalo/ath.git

Karun Eagalapati (3):
  rsi: sdio: add WOWLAN support for S3 suspend state
  rsi: sdio: Add WOWLAN support for S4 hibernate state
  rsi: sdio: Add WOWLAN support for S5 shutdown state

Kees Cook (9):
  rtlwifi: Convert timers to use timer_setup()
  qtnfmac: Convert timers to use timer_setup()
  iwlegacy: Convert timers to use timer_setup()
  atmel: Convert timers to use timer_setup()
  cw1200: Convert timers to use timer_setup()
  drivers/wireless: rsi: Convert timers to use timer_setup()
  mwifiex: Convert timers to use timer_setup()
  libertas: Convert timers to use timer_setup()
  ath: Convert timers to use timer_setup()

Liad Kaufman (1):
  iwlwifi: mvm: add missing lq_color

Loic Poulain (1):
  wcn36xx: Disable 5GHz for wcn3620

Luca Coelho (3):
  iwlwifi: mvm: move umac_error_event_table validity check to where it's set
  iwlwifi: define minimum valid address for umac_error_event_table in cfg
  iwlwifi: pcie: sort IDs for the 9000 series for easier comparisons

Miaoqing Pan (1):
  ath9k: fix tx99 potential info leak

Nik Nyby (1):
  rtlwifi: rtl8821ae: Fix typo in variable name

Ping-Ke Shih (1):
  rtlwifi: Remove seq_number from rtl_tid_data

Rajkumar Manoharan (1):
  ath10k: add new cipher suite support

Sara Sharon (1):
  iwlwifi: mvm: remove duplicated fields in mvm reorder buffer

Sergey Matyukevich (4):
  qtnfmac: modify full Tx queue error reporting
  qtnfmac: enable registration of more mgmt frames
  qtnfmac: drop nonexistent function declaration
  qtnfmac: modify full Tx queue recovery

Thomas Meyer (1):
  ath9k: Use ARRAY_SIZE macro

Re: [PATCH v2 1/7] crypto:chelsio: Remove unused parameter

2017-11-03 Thread Herbert Xu
On Sun, Oct 08, 2017 at 01:37:18PM +0530, Harsh Jain wrote:
> From: Yeshaswi M R Gowda 
> 
> Remove unused parameter sent to latest fw.
> 
> Signed-off-by: Harsh Jain 

All applied.  Thanks.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v10 00/20] simplify crypto wait for async op

2017-11-03 Thread Herbert Xu
On Wed, Oct 18, 2017 at 08:00:32AM +0100, Gilad Ben-Yossef wrote:
> Many users of kernel async. crypto services have a pattern of
> starting an async. crypto op and than using a completion
> to wait for it to end.
> 
> This patch set simplifies this common use case in two ways:
> 
> First, by separating the return codes of the case where a
> request is queued to a backlog due to the provider being
> busy (-EBUSY) from the case the request has failed due
> to the provider being busy and backlogging is not enabled
> (-ENOSPC).
> 
> Next, this change is then built on to create a generic API
> to wait for an async. crypto operation to complete.
> 
> The end result is a smaller code base and an API that is
> easier to use and more difficult to get wrong.
> 
> The patch set was boot tested on x86_64 and arm64 which
> at the very least tests the crypto users via testmgr and
> tcrypt but I do note that I do not have access to some
> of the HW whose drivers are modified nor do I claim I was
> able to test all of the corner cases.
> 
> The patch set is based upon linux-next release tagged
> next-20171017.
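
For readers unfamiliar with the new helpers, the resulting call pattern
looks roughly like this (a minimal sketch assuming an ahash request that
was set up elsewhere; details may differ from the actual series):

        DECLARE_CRYPTO_WAIT(wait);

        ahash_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
                                   crypto_req_done, &wait);
        err = crypto_wait_req(crypto_ahash_digest(req), &wait);
        /* err is 0 on success; the -EBUSY/backlog and async completion
         * cases are handled inside crypto_wait_req().
         */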

All applied.  Thanks.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH] net: plip: mark expected switch fall-throughs

2017-11-03 Thread Gustavo A. R. Silva
In preparation for enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 114893
Addresses-Coverity-ID: 114894
Addresses-Coverity-ID: 114895
Addresses-Coverity-ID: 114896
Addresses-Coverity-ID: 114897
Addresses-Coverity-ID: 114898
Addresses-Coverity-ID: 114899
Addresses-Coverity-ID: 114900
Addresses-Coverity-ID: 114901
Addresses-Coverity-ID: 114902
Addresses-Coverity-ID: 114903
Addresses-Coverity-ID: 114904
Addresses-Coverity-ID: 114905
Signed-off-by: Gustavo A. R. Silva 
---
 drivers/net/plip/plip.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/net/plip/plip.c b/drivers/net/plip/plip.c
index 3c55ea3..feb92ec 100644
--- a/drivers/net/plip/plip.c
+++ b/drivers/net/plip/plip.c
@@ -502,6 +502,7 @@ plip_receive(unsigned short nibble_timeout, struct 
net_device *dev,
*data_p = (c0 >> 3) & 0x0f;
write_data (dev, 0x10); /* send ACK */
*ns_p = PLIP_NB_1;
+   /* fall through */
 
case PLIP_NB_1:
cx = nibble_timeout;
@@ -597,6 +598,7 @@ plip_receive_packet(struct net_device *dev, struct 
net_local *nl,
printk(KERN_DEBUG "%s: receive start\n", dev->name);
rcv->state = PLIP_PK_LENGTH_LSB;
rcv->nibble = PLIP_NB_BEGIN;
+   /* fall through */
 
case PLIP_PK_LENGTH_LSB:
if (snd->state != PLIP_PK_DONE) {
@@ -617,6 +619,7 @@ plip_receive_packet(struct net_device *dev, struct 
net_local *nl,
return TIMEOUT;
}
rcv->state = PLIP_PK_LENGTH_MSB;
+   /* fall through */
 
case PLIP_PK_LENGTH_MSB:
if (plip_receive(nibble_timeout, dev,
@@ -639,6 +642,7 @@ plip_receive_packet(struct net_device *dev, struct 
net_local *nl,
rcv->state = PLIP_PK_DATA;
rcv->byte = 0;
rcv->checksum = 0;
+   /* fall through */
 
case PLIP_PK_DATA:
lbuf = rcv->skb->data;
@@ -651,6 +655,7 @@ plip_receive_packet(struct net_device *dev, struct 
net_local *nl,
rcv->checksum += lbuf[--rcv->byte];
} while (rcv->byte);
rcv->state = PLIP_PK_CHECKSUM;
+   /* fall through */
 
case PLIP_PK_CHECKSUM:
if (plip_receive(nibble_timeout, dev,
@@ -663,6 +668,7 @@ plip_receive_packet(struct net_device *dev, struct 
net_local *nl,
return ERROR;
}
rcv->state = PLIP_PK_DONE;
+   /* fall through */
 
case PLIP_PK_DONE:
/* Inform the upper layer for the arrival of a packet. */
@@ -708,6 +714,7 @@ plip_send(unsigned short nibble_timeout, struct net_device 
*dev,
case PLIP_NB_BEGIN:
write_data (dev, data & 0x0f);
*ns_p = PLIP_NB_1;
+   /* fall through */
 
case PLIP_NB_1:
write_data (dev, 0x10 | (data & 0x0f));
@@ -722,6 +729,7 @@ plip_send(unsigned short nibble_timeout, struct net_device 
*dev,
}
write_data (dev, 0x10 | (data >> 4));
*ns_p = PLIP_NB_2;
+   /* fall through */
 
case PLIP_NB_2:
write_data (dev, (data >> 4));
@@ -810,6 +818,7 @@ plip_send_packet(struct net_device *dev, struct net_local 
*nl,
  &snd->nibble, snd->length.b.lsb))
return TIMEOUT;
snd->state = PLIP_PK_LENGTH_MSB;
+   /* fall through */
 
case PLIP_PK_LENGTH_MSB:
if (plip_send(nibble_timeout, dev,
@@ -818,6 +827,7 @@ plip_send_packet(struct net_device *dev, struct net_local 
*nl,
snd->state = PLIP_PK_DATA;
snd->byte = 0;
snd->checksum = 0;
+   /* fall through */
 
case PLIP_PK_DATA:
do {
@@ -829,6 +839,7 @@ plip_send_packet(struct net_device *dev, struct net_local 
*nl,
snd->checksum += lbuf[--snd->byte];
} while (snd->byte);
snd->state = PLIP_PK_CHECKSUM;
+   /* fall through */
 
case PLIP_PK_CHECKSUM:
if (plip_send(nibble_timeout, dev,
@@ -839,6 +850,7 @@ plip_send_packet(struct net_device *dev, struct net_local 
*nl,
dev_kfree_skb(snd->skb);
dev->stats.tx_packets++;
snd->state = PLIP_PK_DONE;
+   /* fall through */
 
case PLIP_PK_DONE:
/* Close the connection */
@@ -927,6 +939,7 @@ plip_interrupt(void *dev_id)
switch (nl->connection) {
case PLIP_CN_CLOSING:
netif_wake_queue (dev);
+   /* fall through */
case PLIP_CN_NONE:
case PLIP_CN_SEND:
rcv->state = PLIP_PK_TRIGGER;
-- 
2.7.4



Re: [PATCH 1/2] bpf: add a bpf_override_function helper

2017-11-03 Thread Josef Bacik
On Fri, Nov 03, 2017 at 12:12:13AM +0100, Daniel Borkmann wrote:
> Hi Josef,
> 
> one more issue I just noticed, see comment below:
> 
> On 11/02/2017 03:37 PM, Josef Bacik wrote:
> [...]
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index cdd78a7beaae..dfa44fd74bae 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -458,7 +458,8 @@ struct bpf_prog {
> > locked:1,   /* Program image locked? */
> > gpl_compatible:1, /* Is filter GPL compatible? 
> > */
> > cb_access:1,/* Is control block accessed? */
> > -   dst_needed:1;   /* Do we need dst entry? */
> > +   dst_needed:1,   /* Do we need dst entry? */
> > +   kprobe_override:1; /* Do we override a kprobe? 
> > */
> > kmemcheck_bitfield_end(meta);
> > enum bpf_prog_type  type;   /* Type of BPF program */
> > u32 len;/* Number of filter blocks */
> [...]
> > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > index d906775e12c1..f8f7927a9152 100644
> > --- a/kernel/bpf/verifier.c
> > +++ b/kernel/bpf/verifier.c
> > @@ -4189,6 +4189,8 @@ static int fixup_bpf_calls(struct bpf_verifier_env 
> > *env)
> > prog->dst_needed = 1;
> > if (insn->imm == BPF_FUNC_get_prandom_u32)
> > bpf_user_rnd_init_once();
> > +   if (insn->imm == BPF_FUNC_override_return)
> > +   prog->kprobe_override = 1;
> > if (insn->imm == BPF_FUNC_tail_call) {
> > /* If we tail call into other programs, we
> >  * cannot make any assumptions since they can
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 9660ee65fbef..0d7fce52391d 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -8169,6 +8169,13 @@ static int perf_event_set_bpf_prog(struct perf_event 
> > *event, u32 prog_fd)
> > return -EINVAL;
> > }
> > 
> > +   /* Kprobe override only works for kprobes, not uprobes. */
> > +   if (prog->kprobe_override &&
> > +   !(event->tp_event->flags & TRACE_EVENT_FL_KPROBE)) {
> > +   bpf_prog_put(prog);
> > +   return -EINVAL;
> > +   }
> 
> Can we somehow avoid the prog->kprobe_override flag here completely
> and also same in the perf_event_attach_bpf_prog() handler?
> 
> Reason is that it's not reliable for bailing out this way: Think of
> the main program you're attaching doesn't use bpf_override_return()
> helper, but it tail-calls into other BPF progs that make use of it
> instead. So above check would be useless and will fail and we continue
> to attach the prog for probes where it's not intended to be used.
> 
> We've had similar issues in the past e.g. c2002f983767 ("bpf: fix
> checking xdp_adjust_head on tail calls") is just one of those. Thus,
> can we avoid the flag altogether and handle such error case differently?
> 

So if I'm reading this right there's no way to know what we'll tail call at any
given point, so I need to go back to my previous iteration of this patch and
always save the state of the kprobe in the per-cpu variable to make sure we
don't use bpf_override_return in the wrong case?

The tail call functions won't be in the BPF_PROG_ARRAY right?  It'll be just
some other arbitrary function?  If that's the case then we really need something
like this

https://patchwork.kernel.org/patch/10034815/

and I need to bring that back right?  Thanks,

Josef


Re: [PATCH net-next 4/5] net: dsa: lan9303: Remove unnecessary parentheses

2017-11-03 Thread Egil Hjelmeland

On 03. nov. 2017 15:11, Joe Perches wrote:

On Fri, 2017-11-03 at 11:55 +0100, Egil Hjelmeland wrote:

Remove scripts/checkpatch.pl CHECKs by removing unnecessary parentheses

[]

diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c

[]

@@ -483,7 +483,7 @@ static int lan9303_detect_phy_setup(struct lan9303 *chip)
return reg;
}
  
-	if ((reg != 0) && (reg != 0x))

+   if (reg != 0 && reg != 0x)
chip->phy_addr_sel_strap = 1;
else
chip->phy_addr_sel_strap = 0;


phy_addr_sel_strap is currently bool.

If this is to be changed, it should be set
true or false.

My preference would be:

chip->phy_addr_sel_strap = (reg != 0 && reg != 0x);

But perhaps its bool type should be converted
to int as this phy_addr_sel_strap is used as
int several times.



Hi Joe

I had not noticed that phy_addr_sel_strap is bool. I agree that is
misleading. Your suggestion is perhaps a bit "magic", but on the other
hand the magic is explained well in the comment above.

If there are no disagreements, I can do a v2 with that.

And thanks for teaching me about "git grep"!

Cheers
Egil



Re: [4.14rc6] __tcp_select_window divide by zero.

2017-11-03 Thread Eric Dumazet
On Fri, 2017-11-03 at 09:37 -0400, Dave Jones wrote:
> On Tue, Oct 24, 2017 at 09:00:30AM -0400, Dave Jones wrote:
>  > divide error:  [#1] SMP KASAN
>  > CPU: 0 PID: 31140 Comm: trinity-c12 Not tainted 4.14.0-rc6-think+ #1 
>  > RIP: 0010:__tcp_select_window+0x21f/0x400
>  > Call Trace:
>  >  tcp_cleanup_rbuf+0x27d/0x2a0
>  >  tcp_recvmsg+0x7a9/0x1430
>  >  inet_recvmsg+0x10b/0x360
>  >  sock_read_iter+0x19d/0x240
>  >  do_iter_readv_writev+0x2e4/0x320
>  >  do_iter_read+0x149/0x280
>  >  vfs_readv+0x107/0x180
>  >  do_readv+0xc0/0x1b0
>  >  do_syscall_64+0x182/0x400
>  >  entry_SYSCALL64_slow_path+0x25/0x25
>  > Code: 41 5e 41 5f c3 48 8d bb 48 09 00 00 e8 4b 2b 30 ff 8b 83 48 09 00 00 
> 89 ea 44 29 f2 39 c2 7d 08 39 c5 0f 8d 86 01 00 00 89 e8 99 <41> f7 fe 89 e8 
> 29 d0 eb 8c 41 f7 df 48 89 c7 44 89 f9 d3 fd e8 
>  > RIP: __tcp_select_window+0x21f/0x400 RSP: 8803df54f418
>  > 
>  >
>  >if (window <= free_space - mss || window > free_space)
>  >window = rounddown(free_space, mss);
> 
> I'm still hitting this fairly often, so I threw in a debug patch, and
> when this happens..
> 
> [53182.361210] window: 0 free_space: 0 mss: 0
> 
> Any suggestions on what we should default the window size to be in
> this situation to avoid the rounddown ?

Last time we had to deal with such an issue, we fixed the root cause.

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=2dda640040876cd8ae646408b69eea40c24f9ae9

If __tcp_select_window() is called while mss is 0, then we have a bug
elsewhere.

We want to keep the crash here so that we can fix the root cause.
If we work around the bug here, we will still have fundamental issues.

Do you have a C repro ?
You might get one with syzkaller instead of trinity.

Thanks




Re: [PATCH net-next 4/5] net: dsa: lan9303: Remove unnecessary parentheses

2017-11-03 Thread Vivien Didelot
Hi Egil,

Egil Hjelmeland  writes:

> If there are no disagreements, I can do a v2 with that.
>
> And thanks for teaching me about "git grep"!

If you send a v2, you may want to address the other parenthesis
alignment issues found when running ./scripts/checkpatch -f on the
lan9303* files.

Applying this gives you a few more: https://patchwork.kernel.org/patch/10014913/

(you can also add my Reviewed-by tag on patches you didn't touch.)


Thanks,

Vivien


Re: TCP connection closed without FIN or RST

2017-11-03 Thread Vitaly Davidovich
Ok, an interesting finding.  The client was originally running with
SO_RCVBUF of 75K (apparently someone decided to set that for some
unknown reason).  I tried the test with a 1MB recv buffer and
everything works perfectly! The client responds with 0 window alerts,
the server just hits the persist condition and sends keep-alive
probes; the client continues answering with a 0 window up until it
wakes up and starts processing data in its receive buffer.  At that
point, the window opens up and the server sends more data.  Basically,
things look as one would expect in this situation :).

/proc/sys/net/ipv4/tcp_rmem is 131072  1048576   20971520.  The
conversation flows normally, as described above, when I change the
client's recv buf size to 1048576.  I also tried 131072, but that
doesn't work - same retrans/no ACKs situation.

I think this eliminates (right?) any middleware from the equation.
Instead, perhaps it's some bad interaction between a low recv buf size
and either some other TCP setting or TSO mechanics (LRO specifically).
Still investigating further.
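
For reference, the client-side knob being varied here is just the
standard SO_RCVBUF socket option. A minimal sketch of what the test
client does (the fd and error handling are illustrative, not the actual
test code):

#include <stdio.h>
#include <sys/socket.h>

/* Pin the receive buffer instead of letting tcp_rmem autotune it.
 * The kernel doubles the requested value internally, and setting
 * SO_RCVBUF disables receive-buffer autotuning for this socket.
 */
static int set_rcvbuf(int fd, int bytes)
{
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                       &bytes, sizeof(bytes)) < 0) {
                perror("setsockopt(SO_RCVBUF)");
                return -1;
        }
        return 0;
}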

On Fri, Nov 3, 2017 at 10:02 AM, Vitaly Davidovich  wrote:
> On Fri, Nov 3, 2017 at 9:39 AM, Vitaly Davidovich  wrote:
>> On Fri, Nov 3, 2017 at 9:02 AM, Eric Dumazet  wrote:
>>> On Fri, 2017-11-03 at 06:00 -0700, Eric Dumazet wrote:
 On Fri, 2017-11-03 at 08:41 -0400, Vitaly Davidovich wrote:
 > Hi Eric,
 >
 > Ran a few more tests yesterday with packet captures, including a
 > capture on the client.  It turns out that the client stops ack'ing
 > entirely at some point in the conversation - the last advertised
 > client window is not even close to zero (it's actually ~348K).  So
 > there's complete radio silence from the client for some reason, even
 > though it does send back ACKs early on in the conversation.  So yes,
 > as far as the server is concerned, the client is completely gone and
 > tcp_retries2 rightfully breaches eventually once the server retrans go
 > unanswered long (and for sufficient times) enough.
 >
 > What's odd though is the packet capture on the client shows the server
 > retrans packets arriving, so it's not like the segments don't reach
 > the client.  I'll keep investigating, but if you (or anyone else
 > reading this) knows of circumstances that might cause this, I'd
 > appreciate any tips on where/what to look at.


 Might be a middle box issue ?  Like a firewall connection tracking
 having some kind of timeout if nothing is sent on one direction ?

 What output do you have from client side with :

 ss -temoi dst 
>>>
>>> It also could be a wrapping issue on TCP timestamps.
>>>
>>> You could try disabling tcp timestamps, and restart the TCP flow.
>>>
>>> echo 0 >/proc/sys/net/ipv4/tcp_timestamps
>> Ok, I will try to do that.  Thanks for the tip.
> Tried with tcp_timestamps disabled on the client (didn't touch the
> server), but that didn't change the outcome - same issue at the end.
>>>
>>>
>>>
>>>
>>>


Re: [PATCH net-next v2] net: mvpp2: add ethtool GOP statistics

2017-11-03 Thread Andrew Lunn
> @@ -817,6 +856,12 @@ struct mvpp2 {
>  
>   /* Maximum number of RXQs per port */
>   unsigned int max_port_rxqs;
> +
> + /* Workqueue to gather hardware statistics with its lock */
> + struct mutex gather_stats_lock;
> + struct delayed_work stats_work;
> + char queue_name[20];
> + struct workqueue_struct *stats_queue;
>  };
  
> +static u64 mvpp2_read_count(struct mvpp2_port *port,
> + const struct mvpp2_ethtool_counter *counter)
> +{
> + void __iomem *base;
> + u64 val;
> +
> + if (port->priv->hw_version == MVPP21)
> + base = port->priv->lms_base + MVPP21_MIB_COUNTERS_OFFSET +
> +port->gop_id * MVPP21_MIB_COUNTERS_PORT_SZ;
> + else
> + base = port->priv->iface_base + MVPP22_MIB_COUNTERS_OFFSET +
> +port->gop_id * MVPP22_MIB_COUNTERS_PORT_SZ;

This seems like something which could be calculated once and then
stored away, e.g. next to stats_queue.
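
I mean something computed once at port probe time and stored away, e.g.
(sketch only - the 'stats_base' field name is made up):

        /* computed once instead of on every counter read */
        if (port->priv->hw_version == MVPP21)
                port->stats_base = port->priv->lms_base +
                                   MVPP21_MIB_COUNTERS_OFFSET +
                                   port->gop_id * MVPP21_MIB_COUNTERS_PORT_SZ;
        else
                port->stats_base = port->priv->iface_base +
                                   MVPP22_MIB_COUNTERS_OFFSET +
                                   port->gop_id * MVPP22_MIB_COUNTERS_PORT_SZ;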

> +static void mvpp2_ethtool_get_stats(struct net_device *dev,
> + struct ethtool_stats *stats, u64 *data)
> +{
> + struct mvpp2_port *port = netdev_priv(dev);
> +
> + /* Update statistics for all ports, copy only those actually needed */
> + mvpp2_gather_hw_statistics(&port->priv->stats_work.work);
> +
> + memcpy(data, port->ethtool_stats,
> +sizeof(u64) * ARRAY_SIZE(mvpp2_ethtool_regs));

Thanks for adding the mutex in mvpp2_gather_hw_statistics(). However,
should we be holding the mutex while performing this copy? There is no
snapshot support, so the statistics are not guaranteed to be
consistent. And since each statistic is a u64, it is possible the copy
will get the new lower 32 bits and the old upper 32 bits if the copy
happens while mvpp2_gather_hw_statistics() is running.
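
Something along these lines is what I have in mind (sketch only - it
assumes an unlocked helper, here called __mvpp2_gather_hw_statistics(),
that can run with gather_stats_lock already held):

        mutex_lock(&port->priv->gather_stats_lock);
        /* hypothetical unlocked variant of the gather routine */
        __mvpp2_gather_hw_statistics(port->priv);
        memcpy(data, port->ethtool_stats,
               sizeof(u64) * ARRAY_SIZE(mvpp2_ethtool_regs));
        mutex_unlock(&port->priv->gather_stats_lock);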

> + port->ethtool_stats = devm_kcalloc(&pdev->dev,
> +ARRAY_SIZE(mvpp2_ethtool_regs),
> +sizeof(u64), GFP_KERNEL);
> + if (!port->ethtool_stats) {
> + err = -ENOMEM;
> + goto err_free_stats;
> + }
> +
>   mvpp2_port_copy_mac_addr(dev, priv, port_node, &mac_from);
>  
>   port->tx_ring_size = MVPP2_MAX_TXD;
> @@ -7707,6 +7904,7 @@ static void mvpp2_port_remove(struct mvpp2_port *port)
>   of_node_put(port->phy_node);
>   free_percpu(port->pcpu);
>   free_percpu(port->stats);
> + kfree(port->ethtool_stats);

You allocate the memory using devm_. You should not use plain kfree()
on it. You might want to spend some time reading about devm_
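
The simplest fix is probably to drop the kfree() altogether, since
devm-allocated memory is released automatically when the device is
unbound. If an early release is really wanted, it would be the devm
variant instead (sketch - 'dev' stands for whatever struct device was
passed to devm_kcalloc()):

        devm_kfree(dev, port->ethtool_stats);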

> + mutex_init(&priv->gather_stats_lock);
> + index = ida_simple_get(&engine_index_ida, 0, 0, GFP_KERNEL);
> + if (index < 0)
> + goto err_mg_clk;
> +
> + snprintf(priv->queue_name, sizeof(priv->queue_name),
> +  "mvpp2_stats_%d", index);

I know Florian asked for unique names, which IDA will give you. But
could you derive the name from device tree? It then becomes a name you
can actually map back to the hardware, rather than being semi-random.
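
For example something like this (sketch only - queue_name would probably
need to be larger than 20 bytes to hold a full device name):

        snprintf(priv->queue_name, sizeof(priv->queue_name),
                 "mvpp2_stats_%s", dev_name(&pdev->dev));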

Thanks
Andrew


[PATCH RFC,WIP 0/5] Flow offload infrastructure

2017-11-03 Thread Pablo Neira Ayuso
Hi,

This patch series adds the flow offload infrastructure for Netfilter. It
adds a new 'nf_flow_offload' module that registers a hook at ingress.
Every packet that hits the flow table is forwarded to the
destination/gateway and netdevice that the flow table entry specifies. In
case of a flow table miss, the packet follows the classic forwarding path.

This flow table is populated via the new nftables VM action
'flow_offload', so the user can selectively specify what flows are
placed into the flow table. An example ruleset would look like this:

table inet x {
chain y {
type filter hook forward priority 0; policy accept;
ip protocol tcp flow offload counter
counter
}
}

The 'flow offload' action adds the flow entry once the flow is in the
established state, according to the connection tracking definition, i.e.
we have seen traffic in both directions. Therefore, only the initial
packets of the flow follow the classic forwarding path.

* Patch 1/5 is nothing really interesting, just a little preparation change.

* Patch 2/5 adds a software flow table representation. It uses the
  rhashtable and an API to operate with it, it also introduces the
  'struct flow_offload' that represents a flow table entry. There's a
  garbage collector kernel thread that cleans up entries for which we
  have not seen any packet for a while.

* Patch 3/5 Just adds the missing bits to integrate the software flow
  table with conntrack. The software flow table owns the conntrack
  object, so it is basically responsible for releasing it. Conntrack
  entries that have been offloaded in the conntrack table will look like
  this:

ipv4 2 tcp  6 src=10.141.10.2 dst=147.75.205.195 sport=36392 dport=443 
src=147.75.205.195 dst=192.168.2.195 sport=443 dport=36392 [OFFLOAD] use=2

* Patch 4/5 adds the extension for nf_tables that can be used to select
  what flows are offloaded through policy.

* Patch 5/5 Switches and NICs come with built-in flow tables; I've been
  observing out-of-tree patches in OpenWRT/LEDE to integrate this into
  Netfilter for a little while. This patch adds the ndo hooks to
  populate the hardware flow table, and uses a workqueue to do the
  configuration from user context - we need to hold the mdio mutex for
  this. There will be a little delay until packets follow the hardware
  path, so packets keep following the software flow table path for a
  little while until they start going through hardware.

I'm measuring here that the software flow table forwarding path is 2.5x
faster than the classic forwarding path in my testbed.

TODO, still many things:

* Only IPv4 at this time.
* Only IPv4 SNAT is supported.
* No netns support yet.
* Missing netlink interface to operate with the flow table, to force the
  handover of flow to the software path.
* Higher configurability: instead of registering the flow table
  unconditionally, add an interface to specify software flow table
  properties.
* No flow counters at this time.

This should serve a number of use cases where we can rely on this kernel
bypass. Packets that need fragmentation / PMTU / IP option handling /
... or any other specific handling should be passed up to the classic
forwarding path.

Comments welcome,
Thanks.

Pablo Neira Ayuso (5):
  netfilter: nf_conntrack: move nf_ct_netns_{get,put}() to core
  netfilter: add software flow offload infrastructure
  netfilter: nf_flow_offload: integration with conntrack
  netfilter: nf_tables: flow offload expression
  netfilter: nft_flow_offload: add ndo hooks for hardware offload

 include/linux/netdevice.h  |   4 +
 include/net/flow_offload.h |  67 
 include/net/netfilter/nf_conntrack.h   |   3 +-
 include/uapi/linux/netfilter/nf_conntrack_common.h |   4 +
 include/uapi/linux/netfilter/nf_tables.h   |   9 +
 net/netfilter/Kconfig  |  14 +
 net/netfilter/Makefile |   4 +
 net/netfilter/nf_conntrack_core.c  |   7 +-
 net/netfilter/nf_conntrack_netlink.c   |  15 +-
 net/netfilter/nf_conntrack_proto.c |  37 +-
 net/netfilter/nf_conntrack_proto_tcp.c |   3 +
 net/netfilter/nf_conntrack_standalone.c|  12 +-
 net/netfilter/nf_flow_offload.c| 421 
 net/netfilter/nft_ct.c |  39 +-
 net/netfilter/nft_flow_offload.c   | 430 +
 15 files changed, 1024 insertions(+), 45 deletions(-)
 create mode 100644 include/net/flow_offload.h
 create mode 100644 net/netfilter/nf_flow_offload.c
 create mode 100644 net/netfilter/nft_flow_offload.c

-- 
2.11.0




[PATCH RFC,WIP 2/5] netfilter: add software flow offload infrastructure

2017-11-03 Thread Pablo Neira Ayuso
This patch adds the generic software flow offload infrastructure. This
allows users to configure a fast path for established flows that will not
follow the classic forwarding path.

This adds a new hook at netfilter ingress for each existing interface.
For each packet that hits the hook, we look up an existing flow in the
table; if there is a hit, the packet is forwarded using the gateway and
interfaces that are cached in the flow table entry.

This comes with a kernel thread that releases flow table entries when no
packets have been seen on them for a while.

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/flow_offload.h  |  67 +++
 net/netfilter/Kconfig   |   7 +
 net/netfilter/Makefile  |   3 +
 net/netfilter/nf_flow_offload.c | 386 
 4 files changed, 463 insertions(+)
 create mode 100644 include/net/flow_offload.h
 create mode 100644 net/netfilter/nf_flow_offload.c

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
new file mode 100644
index ..30bfca7ed3f1
--- /dev/null
+++ b/include/net/flow_offload.h
@@ -0,0 +1,67 @@
+#ifndef _FLOW_OFFLOAD_H
+#define _FLOW_OFFLOAD_H
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+enum flow_offload_tuple_dir {
+   FLOW_OFFLOAD_DIR_ORIGINAL,
+   FLOW_OFFLOAD_DIR_REPLY,
+   __FLOW_OFFLOAD_DIR_MAX  = FLOW_OFFLOAD_DIR_REPLY,
+};
+#define FLOW_OFFLOAD_DIR_MAX   (__FLOW_OFFLOAD_DIR_MAX + 1)
+
+struct flow_offload_tuple {
+   union {
+   struct in_addr  src_v4;
+   struct in6_addr src_v6;
+   };
+   union {
+   struct in_addr  dst_v4;
+   struct in6_addr dst_v6;
+   };
+   struct {
+   __be16  src_port;
+   __be16  dst_port;
+   };
+
+   u8  l3proto;
+   u8  l4proto;
+   u8  dir;
+
+   int iifidx;
+   int oifidx;
+
+   union {
+   __be32  gateway;
+   struct in6_addr gateway6;
+   };
+};
+
+struct flow_offload_tuple_rhash {
+   struct rhash_head   node;
+   struct flow_offload_tuple   tuple;
+};
+
+#defineFLOW_OFFLOAD_SNAT   0x1
+#defineFLOW_OFFLOAD_DNAT   0x2
+#defineFLOW_OFFLOAD_HW 0x4
+
+struct flow_offload {
+   struct flow_offload_tuple_rhash tuplehash[FLOW_OFFLOAD_DIR_MAX];
+   u32 flags;
+   union {
+   /* Your private driver data here. */
+   u32 timeout;
+   };
+   struct rcu_head rcu_head;
+};
+
+int flow_offload_add(struct flow_offload *flow);
+void flow_offload_del(struct flow_offload *flow);
+struct flow_offload_tuple_rhash *flow_offload_lookup(struct flow_offload_tuple 
*tuple);
+
+#endif /* _FLOW_OFFLOAD_H */
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index e4a13cc8a2e7..f022ca91f49d 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -436,6 +436,13 @@ config NETFILTER_SYNPROXY
 
 endif # NF_CONNTRACK
 
+config NF_FLOW_OFFLOAD
+   tristate "Netfilter Generic Flow Offload (GFO) module"
+   help
+ This option adds the flow table core infrastructure.
+
+ To compile it as a module, choose M here.
+
 config NF_TABLES
select NETFILTER_NETLINK
tristate "Netfilter nf_tables support"
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index d3891c93edd6..518f54113e06 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -69,6 +69,9 @@ obj-$(CONFIG_NETFILTER_SYNPROXY) += nf_synproxy_core.o
 # generic packet duplication from netdev family
 obj-$(CONFIG_NF_DUP_NETDEV)+= nf_dup_netdev.o
 
+# generic flow table
+obj-$(CONFIG_NF_FLOW_OFFLOAD)+= nf_flow_offload.o
+
 # nf_tables
 nf_tables-objs := nf_tables_core.o nf_tables_api.o nf_tables_trace.o \
  nft_immediate.o nft_cmp.o nft_range.o nft_bitwise.o \
diff --git a/net/netfilter/nf_flow_offload.c b/net/netfilter/nf_flow_offload.c
new file mode 100644
index ..c967b29d11a6
--- /dev/null
+++ b/net/netfilter/nf_flow_offload.c
@@ -0,0 +1,386 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+/* For layer 4 checksum field offset. */
+#include 
+#include 
+#include 
+
+static struct rhashtable flow_table;
+
+static u32 flow_offload_hash(const void *data, u32 len, u32 seed)
+{
+   const struct flow_offload_tuple *tuple = data;
+
+   return jhash(tuple, offsetof(struct flow_offload_tuple, l4proto), seed);
+}
+
+static u32 flow_offload_hash_obj(const void *data, u32 len, u32 seed)
+{
+   const struct flow_offload_tuple_rhash *

[PATCH RFC,WIP 4/5] netfilter: nf_tables: flow offload expression

2017-11-03 Thread Pablo Neira Ayuso
Add new instruction for the nf_tables VM that allows us to specify what
flows are offloaded. This has an explicit dependency with the conntrack
subsystem.

Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter/nf_tables.h |   9 +
 net/netfilter/Kconfig|   7 +
 net/netfilter/Makefile   |   1 +
 net/netfilter/nft_flow_offload.c | 331 +++
 4 files changed, 348 insertions(+)
 create mode 100644 net/netfilter/nft_flow_offload.c

diff --git a/include/uapi/linux/netfilter/nf_tables.h 
b/include/uapi/linux/netfilter/nf_tables.h
index 871afa4871bf..2edde548de68 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -948,6 +948,15 @@ enum nft_ct_attributes {
 };
 #define NFTA_CT_MAX(__NFTA_CT_MAX - 1)
 
+/**
+ * enum nft_ct_offload_attributes - ct offload expression attributes
+ */
+enum nft_offload_attributes {
+   NFTA_CT_OFFLOAD_UNSPEC,
+   __NFTA_CT_OFFLOAD_MAX,
+};
+#define NFTA_CT_OFFLOAD_MAX(__NFTA_CT_OFFLOAD_MAX - 1)
+
 enum nft_limit_type {
NFT_LIMIT_PKTS,
NFT_LIMIT_PKT_BYTES
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index f022ca91f49d..0a5c33cfaeb8 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -504,6 +504,13 @@ config NFT_CT
  This option adds the "ct" expression that you can use to match
  connection tracking information such as the flow state.
 
+config NFT_FLOW_OFFLOAD
+   depends on NF_CONNTRACK
+   tristate "Netfilter nf_tables hardware flow offload module"
+   help
+ This option adds the "flow_offload" expression that you can use to
+ choose what flows are placed into the hardware.
+
 config NFT_SET_RBTREE
tristate "Netfilter nf_tables rbtree set module"
help
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 518f54113e06..801ce5c25e5d 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -86,6 +86,7 @@ obj-$(CONFIG_NFT_META)+= nft_meta.o
 obj-$(CONFIG_NFT_RT)   += nft_rt.o
 obj-$(CONFIG_NFT_NUMGEN)   += nft_numgen.o
 obj-$(CONFIG_NFT_CT)   += nft_ct.o
+obj-$(CONFIG_NFT_FLOW_OFFLOAD) += nft_flow_offload.o
 obj-$(CONFIG_NFT_LIMIT)+= nft_limit.o
 obj-$(CONFIG_NFT_NAT)  += nft_nat.o
 obj-$(CONFIG_NFT_OBJREF)   += nft_objref.o
diff --git a/net/netfilter/nft_flow_offload.c b/net/netfilter/nft_flow_offload.c
new file mode 100644
index ..d38d185a19a5
--- /dev/null
+++ b/net/netfilter/nft_flow_offload.c
@@ -0,0 +1,331 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+union flow_gateway {
+   __be32  ip;
+   struct in6_addr ip6;
+};
+
+static int flow_offload_iterate_cleanup(struct nf_conn *ct, void *data)
+{
+   struct flow_offload_tuple_rhash *tuplehash;
+   struct flow_offload_tuple tuple = {};
+   struct net_device *indev = data;
+   struct flow_offload *flow;
+
+   if (!test_and_clear_bit(IPS_OFFLOAD_BIT, &ct->status))
+   return 0;
+
+   tuple.src_v4 = ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.u3.in;
+   tuple.dst_v4 = ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.u3.in;
+   tuple.src_port = ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.u.tcp.port;
+   tuple.dst_port = ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.u.tcp.port;
+   tuple.l3proto = ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.src.l3num;
+   tuple.l4proto = ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple.dst.protonum;
+
+   tuplehash = flow_offload_lookup(&tuple);
+   BUG_ON(!tuplehash);
+
+   if (indev && tuplehash->tuple.iifidx != indev->ifindex)
+   return 0;
+
+   flow = container_of(tuplehash, struct flow_offload,
+   tuplehash[tuplehash->tuple.dir]);
+
+   flow_offload_del(flow);
+
+   /* Do not remove this conntrack from table. */
+   return 0;
+}
+
+static void flow_offload_cleanup(struct net *net,
+const struct net_device *dev)
+{
+   nf_ct_iterate_cleanup_net(net, flow_offload_iterate_cleanup,
+ (void *)dev, 0, 0);
+}
+
+static int flow_offload_netdev_event(struct notifier_block *this,
+unsigned long event, void *ptr)
+{
+   const struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+
+   if (event != NETDEV_DOWN)
+   return NOTIFY_DONE;
+
+   flow_offload_cleanup(dev_net(dev), dev);
+
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block flow_offload_netdev_notifier = {
+   .notifier_call  = flow_offload_netdev_event,
+};
+
+static struct flow_offload *
+flow_offload_alloc(const struct nf_conn *ct, int iifindex, int oifindex,
+  union flow_gateway *orig_gateway,
+  unio

[PATCH RFC,WIP 3/5] netfilter: nf_flow_offload: integration with conntrack

2017-11-03 Thread Pablo Neira Ayuso
This patch adds the IPS_OFFLOAD status bit, this new bit tells us that
the conntrack entry is owned by the flow offload infrastructure. The
timer of such conntrack entries is stopped - the conntrack garbage
collector skips them - and they display no internal state in the case of
TCP flows.

 # cat /proc/net/nf_conntrack
 ipv4 2 tcp  6 src=10.141.10.2 dst=147.75.205.195 sport=36392 dport=443 
src=147.75.205.195 dst=192.168.2.195 sport=443 dport=36392 [OFFLOAD] mark=0 
zone=0 use=2

Note the [OFFLOAD] tag in the listing.

Conntrack entries that have been offloaded to the flow table
infrastructure cannot be deleted/flushed via ctnetlink. The flow table
infrastructure is also responsible for releasing this conntrack entry.

Signed-off-by: Pablo Neira Ayuso 
---
Instead of nf_flow_release_ct(), I'd rather keep a pointer reference to
the conntrack object from the flow_offload entry, so we can skip the
conntrack look up.

 include/net/netfilter/nf_conntrack.h   |  3 +-
 include/uapi/linux/netfilter/nf_conntrack_common.h |  4 +++
 net/netfilter/nf_conntrack_core.c  |  7 -
 net/netfilter/nf_conntrack_netlink.c   | 15 -
 net/netfilter/nf_conntrack_proto_tcp.c |  3 ++
 net/netfilter/nf_conntrack_standalone.c| 12 +---
 net/netfilter/nf_flow_offload.c| 36 --
 7 files changed, 71 insertions(+), 9 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h 
b/include/net/netfilter/nf_conntrack.h
index 8f3bd30511de..9af4bb0c2f46 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -272,7 +272,8 @@ static inline unsigned long nf_ct_expires(const struct 
nf_conn *ct)
 
 static inline bool nf_ct_is_expired(const struct nf_conn *ct)
 {
-   return (__s32)(ct->timeout - nfct_time_stamp) <= 0;
+   return (__s32)(ct->timeout - nfct_time_stamp) <= 0 &&
+  !test_bit(IPS_OFFLOAD_BIT, &ct->status);
 }
 
 /* use after obtaining a reference count */
diff --git a/include/uapi/linux/netfilter/nf_conntrack_common.h 
b/include/uapi/linux/netfilter/nf_conntrack_common.h
index dc947e59d03a..6b463b88182d 100644
--- a/include/uapi/linux/netfilter/nf_conntrack_common.h
+++ b/include/uapi/linux/netfilter/nf_conntrack_common.h
@@ -100,6 +100,10 @@ enum ip_conntrack_status {
IPS_HELPER_BIT = 13,
IPS_HELPER = (1 << IPS_HELPER_BIT),
 
+   /* Conntrack has been offloaded to flow table. */
+   IPS_OFFLOAD_BIT = 14,
+   IPS_OFFLOAD = (1 << IPS_OFFLOAD_BIT),
+
/* Be careful here, modifying these bits can make things messy,
 * so don't let users modify them directly.
 */
diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index 01130392b7c0..48f36c4fb756 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -901,6 +901,9 @@ static unsigned int early_drop_list(struct net *net,
hlist_nulls_for_each_entry_rcu(h, n, head, hnnode) {
tmp = nf_ct_tuplehash_to_ctrack(h);
 
+   if (test_bit(IPS_OFFLOAD_BIT, &tmp->status))
+   continue;
+
if (nf_ct_is_expired(tmp)) {
nf_ct_gc_expired(tmp);
continue;
@@ -1011,12 +1014,14 @@ static void gc_worker(struct work_struct *work)
tmp = nf_ct_tuplehash_to_ctrack(h);
 
scanned++;
+   if (test_bit(IPS_OFFLOAD_BIT, &tmp->status))
+   continue;
+
if (nf_ct_is_expired(tmp)) {
nf_ct_gc_expired(tmp);
expired_count++;
continue;
}
-
if (nf_conntrack_max95 == 0 || gc_worker_skip_ct(tmp))
continue;
 
diff --git a/net/netfilter/nf_conntrack_netlink.c 
b/net/netfilter/nf_conntrack_netlink.c
index de4053d84364..79a74aec7c1e 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -1105,6 +1105,14 @@ static const struct nla_policy ct_nla_policy[CTA_MAX+1] 
= {
.len = NF_CT_LABELS_MAX_SIZE },
 };
 
+static int ctnetlink_flush_iterate(struct nf_conn *ct, void *data)
+{
+   if (test_bit(IPS_OFFLOAD_BIT, &ct->status))
+   return 0;
+
+   return ctnetlink_filter_match(ct, data);
+}
+
 static int ctnetlink_flush_conntrack(struct net *net,
 const struct nlattr * const cda[],
 u32 portid, int report)
@@ -1117,7 +1125,7 @@ static int ctnetlink_flush_conntrack(struct net *net,
return PTR_ERR(filter);
}
 
-   nf_ct_iterate_cleanup_net(net, ctnetlink_filter_match, filter,
+   nf_ct_iterate_cleanup_net(net, ctnetlink_flush_iterate, 

[PATCH RFC,WIP 1/5] netfilter: nf_conntrack: move nf_ct_netns_{get,put}() to core

2017-11-03 Thread Pablo Neira Ayuso
So we can call this from other expressions that need conntrack in place
to work.

Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_conntrack_proto.c | 37 ++--
 net/netfilter/nft_ct.c | 39 +++---
 2 files changed, 38 insertions(+), 38 deletions(-)

diff --git a/net/netfilter/nf_conntrack_proto.c 
b/net/netfilter/nf_conntrack_proto.c
index b3e489c859ec..4379f1244154 100644
--- a/net/netfilter/nf_conntrack_proto.c
+++ b/net/netfilter/nf_conntrack_proto.c
@@ -125,7 +125,7 @@ void nf_ct_l3proto_module_put(unsigned short l3proto)
 }
 EXPORT_SYMBOL_GPL(nf_ct_l3proto_module_put);
 
-int nf_ct_netns_get(struct net *net, u8 nfproto)
+static int nf_ct_netns_do_get(struct net *net, u8 nfproto)
 {
const struct nf_conntrack_l3proto *l3proto;
int ret;
@@ -150,9 +150,33 @@ int nf_ct_netns_get(struct net *net, u8 nfproto)
 
return ret;
 }
+
+int nf_ct_netns_get(struct net *net, u8 nfproto)
+{
+   int err;
+
+   if (nfproto == NFPROTO_INET) {
+   err = nf_ct_netns_do_get(net, NFPROTO_IPV4);
+   if (err < 0)
+   goto err1;
+   err = nf_ct_netns_do_get(net, NFPROTO_IPV6);
+   if (err < 0)
+   goto err2;
+   } else {
+   err = nf_ct_netns_do_get(net, nfproto);
+   if (err < 0)
+   goto err1;
+   }
+   return 0;
+
+err2:
+   nf_ct_netns_put(net, NFPROTO_IPV4);
+err1:
+   return err;
+}
 EXPORT_SYMBOL_GPL(nf_ct_netns_get);
 
-void nf_ct_netns_put(struct net *net, u8 nfproto)
+static void nf_ct_netns_do_put(struct net *net, u8 nfproto)
 {
const struct nf_conntrack_l3proto *l3proto;
 
@@ -171,6 +195,15 @@ void nf_ct_netns_put(struct net *net, u8 nfproto)
 
nf_ct_l3proto_module_put(nfproto);
 }
+
+void nf_ct_netns_put(struct net *net, uint8_t nfproto)
+{
+   if (nfproto == NFPROTO_INET) {
+   nf_ct_netns_do_put(net, NFPROTO_IPV4);
+   nf_ct_netns_do_put(net, NFPROTO_IPV6);
+   } else
+   nf_ct_netns_do_put(net, nfproto);
+}
 EXPORT_SYMBOL_GPL(nf_ct_netns_put);
 
 const struct nf_conntrack_l4proto *
diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index bd0975d7dd6f..2647b895f4b0 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -312,39 +312,6 @@ static const struct nla_policy nft_ct_policy[NFTA_CT_MAX + 
1] = {
[NFTA_CT_SREG]  = { .type = NLA_U32 },
 };
 
-static int nft_ct_netns_get(struct net *net, uint8_t family)
-{
-   int err;
-
-   if (family == NFPROTO_INET) {
-   err = nf_ct_netns_get(net, NFPROTO_IPV4);
-   if (err < 0)
-   goto err1;
-   err = nf_ct_netns_get(net, NFPROTO_IPV6);
-   if (err < 0)
-   goto err2;
-   } else {
-   err = nf_ct_netns_get(net, family);
-   if (err < 0)
-   goto err1;
-   }
-   return 0;
-
-err2:
-   nf_ct_netns_put(net, NFPROTO_IPV4);
-err1:
-   return err;
-}
-
-static void nft_ct_netns_put(struct net *net, uint8_t family)
-{
-   if (family == NFPROTO_INET) {
-   nf_ct_netns_put(net, NFPROTO_IPV4);
-   nf_ct_netns_put(net, NFPROTO_IPV6);
-   } else
-   nf_ct_netns_put(net, family);
-}
-
 #ifdef CONFIG_NF_CONNTRACK_ZONES
 static void nft_ct_tmpl_put_pcpu(void)
 {
@@ -489,7 +456,7 @@ static int nft_ct_get_init(const struct nft_ctx *ctx,
if (err < 0)
return err;
 
-   err = nft_ct_netns_get(ctx->net, ctx->afi->family);
+   err = nf_ct_netns_get(ctx->net, ctx->afi->family);
if (err < 0)
return err;
 
@@ -583,7 +550,7 @@ static int nft_ct_set_init(const struct nft_ctx *ctx,
if (err < 0)
goto err1;
 
-   err = nft_ct_netns_get(ctx->net, ctx->afi->family);
+   err = nf_ct_netns_get(ctx->net, ctx->afi->family);
if (err < 0)
goto err1;
 
@@ -606,7 +573,7 @@ static void nft_ct_set_destroy(const struct nft_ctx *ctx,
struct nft_ct *priv = nft_expr_priv(expr);
 
__nft_ct_set_destroy(ctx, priv);
-   nft_ct_netns_put(ctx->net, ctx->afi->family);
+   nf_ct_netns_put(ctx->net, ctx->afi->family);
 }
 
 static int nft_ct_get_dump(struct sk_buff *skb, const struct nft_expr *expr)
-- 
2.11.0




[PATCH RFC,WIP 5/5] netfilter: nft_flow_offload: add ndo hooks for hardware offload

2017-11-03 Thread Pablo Neira Ayuso
This patch adds the infrastructure to offload flows to hardware, in case
the NIC/switch comes with built-in flow table capabilities.

If the hardware has no flow tables, or its flow tables are limited in
terms of features, this falls back to the generic software flow table
implementation.

The software flow table aging thread skips entries that reside in
hardware, so the hardware is also responsible for releasing those flow
table entries.

Signed-off-by: Pablo Neira Ayuso 
---
 include/linux/netdevice.h|  4 ++
 net/netfilter/nf_flow_offload.c  |  3 ++
 net/netfilter/nft_flow_offload.c | 99 
 3 files changed, 106 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1..0787f53374b3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -826,6 +826,8 @@ struct xfrmdev_ops {
 };
 #endif
 
+struct flow_offload;
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1281,6 +1283,8 @@ struct net_device_ops {
int (*ndo_bridge_dellink)(struct net_device *dev,
  struct nlmsghdr *nlh,
  u16 flags);
+   int (*ndo_flow_add)(struct flow_offload *flow);
+   int (*ndo_flow_del)(struct flow_offload *flow);
int (*ndo_change_carrier)(struct net_device *dev,
  bool new_carrier);
int (*ndo_get_phys_port_id)(struct net_device *dev,
diff --git a/net/netfilter/nf_flow_offload.c b/net/netfilter/nf_flow_offload.c
index f4a3fbe11b69..ac5786976dbb 100644
--- a/net/netfilter/nf_flow_offload.c
+++ b/net/netfilter/nf_flow_offload.c
@@ -147,6 +147,9 @@ static void nf_flow_offload_work_gc(struct work_struct 
*work)
 
flow = container_of(tuplehash, struct flow_offload, 
tuplehash[0]);
 
+   if (flow->flags & FLOW_OFFLOAD_HW)
+   continue;
+
if (nf_flow_has_expired(flow)) {
flow_offload_del(flow);
nf_flow_release_ct(tuplehash);
diff --git a/net/netfilter/nft_flow_offload.c b/net/netfilter/nft_flow_offload.c
index d38d185a19a5..0cb194a0aaab 100644
--- a/net/netfilter/nft_flow_offload.c
+++ b/net/netfilter/nft_flow_offload.c
@@ -17,6 +17,22 @@ union flow_gateway {
struct in6_addr ip6;
 };
 
+static void flow_hw_offload_del(struct flow_offload *flow)
+{
+   struct net_device *indev;
+   int ret;
+
+   rtnl_lock();
+   indev = __dev_get_by_index(&init_net, flow->tuplehash[0].tuple.iifidx);
+   WARN_ON(!indev);
+
+   if (indev->netdev_ops->ndo_flow_del) {
+   ret = indev->netdev_ops->ndo_flow_del(flow);
+   WARN_ON(ret < 0);
+   }
+   rtnl_unlock();
+}
+
 static int flow_offload_iterate_cleanup(struct nf_conn *ct, void *data)
 {
struct flow_offload_tuple_rhash *tuplehash;
@@ -44,14 +60,40 @@ static int flow_offload_iterate_cleanup(struct nf_conn *ct, 
void *data)
tuplehash[tuplehash->tuple.dir]);
 
flow_offload_del(flow);
+   if (flow->flags & FLOW_OFFLOAD_HW)
+   flow_hw_offload_del(flow);
 
/* Do not remove this conntrack from table. */
return 0;
 }
 
+static LIST_HEAD(flow_hw_offload_pending_list);
+static DEFINE_SPINLOCK(flow_hw_offload_lock);
+
+struct flow_hw_offload {
+   struct list_headlist;
+   struct flow_offload *flow;
+   struct nf_conn  *ct;
+};
+
 static void flow_offload_cleanup(struct net *net,
 const struct net_device *dev)
 {
+   struct flow_hw_offload *offload, *next;
+
+   spin_lock_bh(&flow_hw_offload_lock);
+   list_for_each_entry_safe(offload, next, &flow_hw_offload_pending_list, 
list) {
+   if (dev == NULL ||
+   
offload->flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.iifidx == 
dev->ifindex ||
+   
offload->flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.oifidx == 
dev->ifindex)
+   continue;
+
+   nf_conntrack_put(&offload->ct->ct_general);
+   list_del(&offload->list);
+   kfree(offload);
+   }
+   spin_unlock_bh(&flow_hw_offload_lock);
+
nf_ct_iterate_cleanup_net(net, flow_offload_iterate_cleanup,
  (void *)dev, 0, 0);
 }
@@ -156,6 +198,43 @@ flow_offload_alloc(const struct nf_conn *ct, int iifindex, 
int oifindex,
return flow;
 }
 
+static int do_flow_offload(struct flow_offload *flow)
+{
+   struct net_device *indev;
+   int ret, ifindex;
+
+   rtnl_lock();
+   ifindex = flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].t

Re: [PATCH RFC,WIP 1/5] netfilter: nf_conntrack: move nf_ct_netns_{get,put}() to core

2017-11-03 Thread Florian Westphal
Pablo Neira Ayuso  wrote:
> So we can call this from other expression that need conntrack in place
> to work.

Acked-by: Florian Westphal 


Re: Bond recovery from BOND_LINK_FAIL state not working

2017-11-03 Thread Alex Sidorenko

Jay,

while the scenario you describe makes sense, it does not match what we
see in our tests.

The instrumentation prints info every time we enter bond_mii_monitor(), 
bond_miimon_inspect(),
bond_miimon_commit() and every time we are committing link state. And we print 
a message every time we
propose BOND_LINK_FAIL in bond_miimon_inspect()

So in your scenario,

2) bond_mii_monitor rtnl_trylock fails, it reschedules

we should see a bond_miimon_inspect() 'proposed BOND_LINK_FAIL' debugging
message without a matching 'commit' message. But what we see in reality is
that for each 'proposed' there is a 'commit' message.
(And we don't expect ens3f1 to really go down when the VC module is
rebooted - it is not connected to it.)

Here is debugging output (with the fix I have suggested in my first email 
*applied*),
my comments inline.

  (FYI: in "bond_mii_monitor: ens3f0 current link state: 0" lines we print 
slave->link when entering bond_mii_monitor)

***

Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link 
state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link 
state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
BOND_LINK_UP case on slave ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
BOND_LINK_UP case on slave ens3f1
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link 
state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link 
state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
BOND_LINK_UP case on slave ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
BOND_LINK_UP case on slave ens3f1

Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link 
state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link 
state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
BOND_LINK_UP case on slave ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: proposed 
BOND_LINK_FAIL on slave ens3f0
 /*FALLTHRU*/
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
BOND_LINK_FAIL case on slave ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: set 
new_link=BOND_LINK_DOWN on slave ens3f0

Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
BOND_LINK_UP case on slave ens3f1
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect returned non-zero

   As you see in lines above, we reach BOND_LINK_FAIL on ens3f0 only, ens3f1 
has good link_state and we
   do not reach BOND_LINK_FAIL fallthru and do not propose anything
   (otherwise there would be debugging output for it)

   Now we are in bond_mii_monitor() bond_for_each_slave() loop, committing link 
states - and suddenly
   we have state 0->1 for interface ens3f1 as well as for ens3f0

Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 0->1 for 
interface ens3f0, 0 Mbps full duplex
Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for interface 
ens3f0 now: 1
Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 0->1 for 
interface ens3f1, 2 Mbps full duplex
Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for interface 
ens3f1 now: 1

   Entering bond_miimon_commit()
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave 
ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_DOWN
Oct 31 09:09:25 SYDC1LNX kernel: bond0: link status definitely down for 
interface ens3f0, disabling it
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 2, up: 
false, slave->delay: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 1, up: 
true, slave->delay: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: no bestslave 
found, bond failure imminent
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave != 
curr_active_slave
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave is 
NULL, this is probably bad
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: old_active: 
ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: new_active: 
NULL
Oct 31 09:09:25 SYDC1LNX kernel: bond0: Removing MAC from old_active
Oct 31 09:09:25 SYDC1LNX kernel: bond0: ALB and TLB modes should call 
bond_alb_handle_active_change
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_set_carrier: turning carrier off
Oct 31 09:09:25 SYDC1LNX kernel: bond0: now running without any active 
interface!

Even though link state of ens3f1 has changed, we have BOND_LINK_NOCHANGE
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave 
ens3f1
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_NOCHANGE 
on slave ens3f1

Another invocation of bond_mii_monitor
Oct 31 09:09:25 SYDC1LNX

Re: [PATCH 6/7] netdev: octeon-ethernet: Add Cavium Octeon III support.

2017-11-03 Thread Andrew Lunn
> >>+static char *mix_port;
> >>+module_param(mix_port, charp, 0444);
> >>+MODULE_PARM_DESC(mix_port, "Specifies which ports connect to MIX 
> >>interfaces.");
> >
> >Can you derive this from Device Tree /platform data configuration?
> >
> >>+
> >>+static char *pki_port;
> >>+module_param(pki_port, charp, 0444);
> >>+MODULE_PARM_DESC(pki_port, "Specifies which ports connect to the PKI.");
> >
> >Likewise
> 
> The SoC is flexible in how it is configured.  Technically the device tree
> should only be used to specify information about the physical configuration
> of the system that cannot be probed for, and this is about policy rather
> than physical wiring.  That said, we do take the default configuration from
> the device tree, but give the option here to override via the module command
> line.

Module parameters are pretty much frowned upon. 

You should really try to remove them all, if possible.

> >>+/* Registers are accessed via xkphys */
> >>+#define SSO_BASE   0x16700ull
> >>+#define SSO_ADDR(node) (SET_XKPHYS + NODE_OFFSET(node) 
> >>+  \
> >>+SSO_BASE)
> >>+#define GRP_OFFSET(grp)((grp) << 16)
> >>+#define GRP_ADDR(n, g) (SSO_ADDR(n) + GRP_OFFSET(g))
> >>+#define SSO_GRP_AQ_CNT(n, g)   (GRP_ADDR(n, g)+ 
> >>0x2700)
> >>+
> >>+#define MIO_PTP_BASE   0x10700ull
> >>+#define MIO_PTP_ADDR(node) (SET_XKPHYS + NODE_OFFSET(node) +  \
> >>+MIO_PTP_BASE)
> >>+#define MIO_PTP_CLOCK_CFG(node)(MIO_PTP_ADDR(node) 
> >>+ 0xf00)
> >>+#define MIO_PTP_CLOCK_HI(node) (MIO_PTP_ADDR(node) 
> >>+ 0xf10)
> >>+#define MIO_PTP_CLOCK_COMP(node)   (MIO_PTP_ADDR(node) + 0xf18)
> >
> >I am sure this will work great on anything but MIPS64 ;)
> 
> Sarcasm duly noted.
> 
> That said, by definition it is exactly an OCTEON-III/MIPS64, and can never
> be anything else.  It is known a priori that the hardware and this driver
> will never be used anywhere else.

Please make sure your Kconfig really enforces this. Generally, we
suggest allowing the driver to be compiled when COMPILE_TEST is set.
That gives you better compiler test coverage. But it seems like this
driver won't compile under such conditions.

> >>+static int num_packet_buffers = 768;
> >>+module_param(num_packet_buffers, int, 0444);
> >>+MODULE_PARM_DESC(num_packet_buffers,
> >>+"Number of packet buffers to allocate per port.");
> >
> >Consider implementing ethtool -g/G for this.
> 
> That may be work for a follow-on patch.

Then please remove the module parameter now.
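
For reference, the ethtool -g/G interface mentioned above is just the
get/set_ringparam ethtool_ops pair. A rough sketch of the get side, with
made-up structure and field names for this driver:

static void octeon3_get_ringparam(struct net_device *netdev,
                                  struct ethtool_ringparam *ring)
{
        struct octeon3_ethernet *priv = netdev_priv(netdev); /* hypothetical */

        ring->rx_max_pending = 4096;                    /* hypothetical limit */
        ring->rx_pending = priv->num_packet_buffers;    /* hypothetical field */
}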

> >>+static int rx_queues = 1;
> >>+module_param(rx_queues, int, 0444);
> >>+MODULE_PARM_DESC(rx_queues, "Number of RX threads per port.");
> >
> >Same thing, can you consider using an ethtool knob for that?
> 
> Also may be work for a follow-on patch.

Ditto

> >>+/**
> >>+ * Reads a 64 bit value from the processor local scratchpad memory.
> >>+ *
> >>+ * @param offset byte offset into scratch pad to read
> >>+ *
> >>+ * @return value read
> >>+ */
> >>+static inline u64 scratch_read64(u64 offset)
> >>+{
> >>+   return *(u64 *)((long)SCRATCH_BASE + offset);
> >>+}
> >
> >No barriers needed whatsoever?
> 
> Nope.

Then it would be good to add a comment about why no barrier is
needed. Otherwise people are going to ask why there is no barrier,
submit patches adding barriers, etc.
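
Even a short comment template would do, something along these lines (the
actual hardware justification is yours to fill in):

        /*
         * No barrier or readq()/writeq() accessor is needed here because
         * <reason - e.g. the scratchpad is core-local and only ever
         * accessed by the local CPU>.
         */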

   Andrew


[PATCH net] l2tp: don't use l2tp_tunnel_find() in l2tp_ip and l2tp_ip6

2017-11-03 Thread Guillaume Nault
Using l2tp_tunnel_find() in l2tp_ip_recv() is wrong for two reasons:

  * It doesn't take a reference on the returned tunnel, which makes the
call racy wrt. concurrent tunnel deletion.

  * The lookup is only based on the tunnel identifier, so it can return
a tunnel that doesn't match the packet's addresses or protocol.

For example, a packet sent to an L2TPv3 over IPv6 tunnel can be
delivered to an L2TPv2 over UDPv4 tunnel. This is worse than a simple
cross-talk: when delivering the packet to an L2TP over UDP tunnel, the
corresponding socket is UDP, where ->sk_backlog_rcv() is NULL. Calling
sk_receive_skb() will then crash the kernel by trying to execute this
callback.

And l2tp_tunnel_find() isn't even needed here. __l2tp_ip_bind_lookup()
properly checks the socket binding and connection settings. It was used
as a fallback mechanism for finding tunnels that didn't have their data
path registered yet. But it's not limited to this case and can be used
to replace l2tp_tunnel_find() in the general case.

Fix l2tp_ip6 in the same way.

Fixes: 0d76751fad77 ("l2tp: Add L2TPv3 IP encapsulation (no UDP) support")
Fixes: a32e0eec7042 ("l2tp: introduce L2TPv3 IP encapsulation support for IPv6")
Signed-off-by: Guillaume Nault 
---
 net/l2tp/l2tp_ip.c  | 24 +---
 net/l2tp/l2tp_ip6.c | 24 +---
 2 files changed, 18 insertions(+), 30 deletions(-)

diff --git a/net/l2tp/l2tp_ip.c b/net/l2tp/l2tp_ip.c
index 4d322c1b7233..e4280b6568b4 100644
--- a/net/l2tp/l2tp_ip.c
+++ b/net/l2tp/l2tp_ip.c
@@ -123,6 +123,7 @@ static int l2tp_ip_recv(struct sk_buff *skb)
unsigned char *ptr, *optr;
struct l2tp_session *session;
struct l2tp_tunnel *tunnel = NULL;
+   struct iphdr *iph;
int length;
 
if (!pskb_may_pull(skb, 4))
@@ -178,24 +179,17 @@ static int l2tp_ip_recv(struct sk_buff *skb)
goto discard;
 
tunnel_id = ntohl(*(__be32 *) &skb->data[4]);
-   tunnel = l2tp_tunnel_find(net, tunnel_id);
-   if (tunnel) {
-   sk = tunnel->sock;
-   sock_hold(sk);
-   } else {
-   struct iphdr *iph = (struct iphdr *) skb_network_header(skb);
-
-   read_lock_bh(&l2tp_ip_lock);
-   sk = __l2tp_ip_bind_lookup(net, iph->daddr, iph->saddr,
-  inet_iif(skb), tunnel_id);
-   if (!sk) {
-   read_unlock_bh(&l2tp_ip_lock);
-   goto discard;
-   }
+   iph = (struct iphdr *)skb_network_header(skb);
 
-   sock_hold(sk);
+   read_lock_bh(&l2tp_ip_lock);
+   sk = __l2tp_ip_bind_lookup(net, iph->daddr, iph->saddr, inet_iif(skb),
+  tunnel_id);
+   if (!sk) {
read_unlock_bh(&l2tp_ip_lock);
+   goto discard;
}
+   sock_hold(sk);
+   read_unlock_bh(&l2tp_ip_lock);
 
if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
goto discard_put;
diff --git a/net/l2tp/l2tp_ip6.c b/net/l2tp/l2tp_ip6.c
index 88b397c30d86..8bcaa975b432 100644
--- a/net/l2tp/l2tp_ip6.c
+++ b/net/l2tp/l2tp_ip6.c
@@ -136,6 +136,7 @@ static int l2tp_ip6_recv(struct sk_buff *skb)
unsigned char *ptr, *optr;
struct l2tp_session *session;
struct l2tp_tunnel *tunnel = NULL;
+   struct ipv6hdr *iph;
int length;
 
if (!pskb_may_pull(skb, 4))
@@ -192,24 +193,17 @@ static int l2tp_ip6_recv(struct sk_buff *skb)
goto discard;
 
tunnel_id = ntohl(*(__be32 *) &skb->data[4]);
-   tunnel = l2tp_tunnel_find(net, tunnel_id);
-   if (tunnel) {
-   sk = tunnel->sock;
-   sock_hold(sk);
-   } else {
-   struct ipv6hdr *iph = ipv6_hdr(skb);
-
-   read_lock_bh(&l2tp_ip6_lock);
-   sk = __l2tp_ip6_bind_lookup(net, &iph->daddr, &iph->saddr,
-   inet6_iif(skb), tunnel_id);
-   if (!sk) {
-   read_unlock_bh(&l2tp_ip6_lock);
-   goto discard;
-   }
+   iph = ipv6_hdr(skb);
 
-   sock_hold(sk);
+   read_lock_bh(&l2tp_ip6_lock);
+   sk = __l2tp_ip6_bind_lookup(net, &iph->daddr, &iph->saddr,
+   inet6_iif(skb), tunnel_id);
+   if (!sk) {
read_unlock_bh(&l2tp_ip6_lock);
+   goto discard;
}
+   sock_hold(sk);
+   read_unlock_bh(&l2tp_ip6_lock);
 
if (!xfrm6_policy_check(sk, XFRM_POLICY_IN, skb))
goto discard_put;
-- 
2.15.0



[PATCH net-next v6 0/3] Incorporated all required changes

2017-11-03 Thread Manish Kurup
Hi everyone,

Modified the netronome drivers (flower action) to use the VLAN helper
functions instead of dereferencing the structure directly. This is
required for the VLAN action patch.

Could you please review?

Here're the changes:
v2: Fixed all helper functions to use RCU (rtnl_dereference) - Eric, Jamal
v2: Fixed indentation, extra line nits - Jamal, Jiri
v2: Moved rcu_head to the end of the struct - Jiri
v2: Re-formatted locals to reverse-christmas-tree - Jiri
v2: Removed mismatched spin_lock() - Cong
v2: Removed spin_lock_bh() in tcf_vlan_init, rtnl_dereference() should
suffice - Cong, Jiri
v4: Modified the nfp flower action code to use the VLAN helper functions
instead of referencing the structure directly. Isolated this into a
separate patch - Pieter Jansen
v5: Got rid of the unlikely() for the allocation case - Simon Horman
v6: Added cleanup functions for RCU alloc - Dave Miller

Acked-by: Jamal Hadi Salim 
Acked-by: Jiri Pirko 
Signed-off-by: Manish Kurup 

Manish Kurup (3):
  act_vlan: Change stats update to use per-core stats
  nfp flower action: Modified to use VLAN helper functions
  act_vlan: VLAN action rewrite to use RCU lock/unlock and update

 drivers/net/ethernet/netronome/nfp/flower/action.c |  5 +-
 include/net/tc_act/tc_vlan.h   | 46 +---
 net/sched/act_vlan.c   | 81 +++---
 3 files changed, 94 insertions(+), 38 deletions(-)

-- 
2.7.4



[PATCH net-next v6 1/3] act_vlan: Change stats update to use per-core stats

2017-11-03 Thread Manish Kurup
The VLAN action maintains one set of stats across all cores, and uses a
spinlock to synchronize updates to it from all of them. Change this to
use a per-CPU stats context instead.
This change will result in better performance.

Acked-by: Jamal Hadi Salim 
Acked-by: Jiri Pirko 
Signed-off-by: Manish Kurup 
---
 net/sched/act_vlan.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index 16eb067..b093bad 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -30,9 +30,10 @@ static int tcf_vlan(struct sk_buff *skb, const struct 
tc_action *a,
int err;
u16 tci;
 
-   spin_lock(&v->tcf_lock);
tcf_lastuse_update(&v->tcf_tm);
-   bstats_update(&v->tcf_bstats, skb);
+   bstats_cpu_update(this_cpu_ptr(v->common.cpu_bstats), skb);
+
+   spin_lock(&v->tcf_lock);
action = v->tcf_action;
 
/* Ensure 'data' points at mac_header prior calling vlan manipulating
@@ -85,7 +86,8 @@ static int tcf_vlan(struct sk_buff *skb, const struct 
tc_action *a,
 
 drop:
action = TC_ACT_SHOT;
-   v->tcf_qstats.drops++;
+   qstats_drop_inc(this_cpu_ptr(v->common.cpu_qstats));
+
 unlock:
if (skb_at_tc_ingress(skb))
skb_pull_rcsum(skb, skb->mac_len);
@@ -172,7 +174,7 @@ static int tcf_vlan_init(struct net *net, struct nlattr 
*nla,
 
if (!exists) {
ret = tcf_idr_create(tn, parm->index, est, a,
-&act_vlan_ops, bind, false);
+&act_vlan_ops, bind, true);
if (ret)
return ret;
 
-- 
2.7.4



[PATCH net-next v6 2/3] nfp flower action: Modified to use VLAN helper functions

2017-11-03 Thread Manish Kurup
Modified netronome nfp flower action to use VLAN helper functions instead
of accessing the structure directly.

Signed-off-by: Manish Kurup 
---
 drivers/net/ethernet/netronome/nfp/flower/action.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/action.c 
b/drivers/net/ethernet/netronome/nfp/flower/action.c
index de64ced..c1c595f 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/action.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/action.c
@@ -58,7 +58,6 @@ nfp_fl_push_vlan(struct nfp_fl_push_vlan *push_vlan,
 const struct tc_action *action)
 {
size_t act_size = sizeof(struct nfp_fl_push_vlan);
-   struct tcf_vlan *vlan = to_vlan(action);
u16 tmp_push_vlan_tci;
 
push_vlan->head.jump_id = NFP_FL_ACTION_OPCODE_PUSH_VLAN;
@@ -67,8 +66,8 @@ nfp_fl_push_vlan(struct nfp_fl_push_vlan *push_vlan,
push_vlan->vlan_tpid = tcf_vlan_push_proto(action);
 
tmp_push_vlan_tci =
-   FIELD_PREP(NFP_FL_PUSH_VLAN_PRIO, vlan->tcfv_push_prio) |
-   FIELD_PREP(NFP_FL_PUSH_VLAN_VID, vlan->tcfv_push_vid) |
+   FIELD_PREP(NFP_FL_PUSH_VLAN_PRIO, tcf_vlan_push_prio(action)) |
+   FIELD_PREP(NFP_FL_PUSH_VLAN_VID, tcf_vlan_push_vid(action)) |
NFP_FL_PUSH_VLAN_CFI;
push_vlan->vlan_tci = cpu_to_be16(tmp_push_vlan_tci);
 }
-- 
2.7.4



[PATCH net-next v6 3/3] act_vlan: VLAN action rewrite to use RCU lock/unlock and update

2017-11-03 Thread Manish Kurup
Using a spinlock in the VLAN action causes performance issues when the VLAN
action is used on multiple cores. Rewrote the VLAN action to use RCU read
locking for reads and updates instead.

Acked-by: Jamal Hadi Salim 
Acked-by: Jiri Pirko 
Signed-off-by: Manish Kurup 
---
 include/net/tc_act/tc_vlan.h | 46 +--
 net/sched/act_vlan.c | 75 ++--
 2 files changed, 88 insertions(+), 33 deletions(-)

diff --git a/include/net/tc_act/tc_vlan.h b/include/net/tc_act/tc_vlan.h
index c2090df..22ae260 100644
--- a/include/net/tc_act/tc_vlan.h
+++ b/include/net/tc_act/tc_vlan.h
@@ -13,12 +13,17 @@
 #include 
 #include 
 
+struct tcf_vlan_params {
+   int   tcfv_action;
+   u16   tcfv_push_vid;
+   __be16tcfv_push_proto;
+   u8tcfv_push_prio;
+   struct rcu_head   rcu;
+};
+
 struct tcf_vlan {
struct tc_actioncommon;
-   int tcfv_action;
-   u16 tcfv_push_vid;
-   __be16  tcfv_push_proto;
-   u8  tcfv_push_prio;
+   struct tcf_vlan_params __rcu *vlan_p;
 };
 #define to_vlan(a) ((struct tcf_vlan *)a)
 
@@ -33,22 +38,45 @@ static inline bool is_tcf_vlan(const struct tc_action *a)
 
 static inline u32 tcf_vlan_action(const struct tc_action *a)
 {
-   return to_vlan(a)->tcfv_action;
+   u32 tcfv_action;
+
+   rcu_read_lock();
+   tcfv_action = rcu_dereference(to_vlan(a)->vlan_p)->tcfv_action;
+   rcu_read_unlock();
+
+   return tcfv_action;
 }
 
 static inline u16 tcf_vlan_push_vid(const struct tc_action *a)
 {
-   return to_vlan(a)->tcfv_push_vid;
+   u16 tcfv_push_vid;
+
+   rcu_read_lock();
+   tcfv_push_vid = rcu_dereference(to_vlan(a)->vlan_p)->tcfv_push_vid;
+   rcu_read_unlock();
+
+   return tcfv_push_vid;
 }
 
 static inline __be16 tcf_vlan_push_proto(const struct tc_action *a)
 {
-   return to_vlan(a)->tcfv_push_proto;
+   __be16 tcfv_push_proto;
+
+   rcu_read_lock();
+   tcfv_push_proto = rcu_dereference(to_vlan(a)->vlan_p)->tcfv_push_proto;
+   rcu_read_unlock();
+
+   return tcfv_push_proto;
 }
 
 static inline u8 tcf_vlan_push_prio(const struct tc_action *a)
 {
-   return to_vlan(a)->tcfv_push_prio;
-}
+   u8 tcfv_push_prio;
 
+   rcu_read_lock();
+   tcfv_push_prio = rcu_dereference(to_vlan(a)->vlan_p)->tcfv_push_prio;
+   rcu_read_unlock();
+
+   return tcfv_push_prio;
+}
 #endif /* __NET_TC_VLAN_H */
diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
index b093bad..97f717a 100644
--- a/net/sched/act_vlan.c
+++ b/net/sched/act_vlan.c
@@ -26,6 +26,7 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a,
struct tcf_result *res)
 {
struct tcf_vlan *v = to_vlan(a);
+   struct tcf_vlan_params *p;
int action;
int err;
u16 tci;
@@ -33,24 +34,27 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a,
tcf_lastuse_update(&v->tcf_tm);
bstats_cpu_update(this_cpu_ptr(v->common.cpu_bstats), skb);
 
-   spin_lock(&v->tcf_lock);
-   action = v->tcf_action;
-
/* Ensure 'data' points at mac_header prior calling vlan manipulating
 * functions.
 */
if (skb_at_tc_ingress(skb))
skb_push_rcsum(skb, skb->mac_len);
 
-   switch (v->tcfv_action) {
+   rcu_read_lock();
+
+   action = READ_ONCE(v->tcf_action);
+
+   p = rcu_dereference(v->vlan_p);
+
+   switch (p->tcfv_action) {
case TCA_VLAN_ACT_POP:
err = skb_vlan_pop(skb);
if (err)
goto drop;
break;
case TCA_VLAN_ACT_PUSH:
-   err = skb_vlan_push(skb, v->tcfv_push_proto, v->tcfv_push_vid |
-   (v->tcfv_push_prio << VLAN_PRIO_SHIFT));
+   err = skb_vlan_push(skb, p->tcfv_push_proto, p->tcfv_push_vid |
+   (p->tcfv_push_prio << VLAN_PRIO_SHIFT));
if (err)
goto drop;
break;
@@ -69,14 +73,14 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a,
goto drop;
}
/* replace the vid */
-   tci = (tci & ~VLAN_VID_MASK) | v->tcfv_push_vid;
+   tci = (tci & ~VLAN_VID_MASK) | p->tcfv_push_vid;
/* replace prio bits, if tcfv_push_prio specified */
-   if (v->tcfv_push_prio) {
+   if (p->tcfv_push_prio) {
tci &= ~VLAN_PRIO_MASK;
-   tci |= v->tcfv_push_prio << VLAN_PRIO_SHIFT;
+   tci |= p->tcfv_push_prio << VLAN_PRIO_SHIFT;
}
/* put updated tci as hwaccel tag */
-   __vlan_hwaccel_put_tag(skb, v

Re: [PATCH net-next 0/6] net: hns3: support set_link_ksettings and for nway_reset ethtool command

2017-11-03 Thread Andrew Lunn
On Fri, Nov 03, 2017 at 12:18:24PM +0800, Lipeng wrote:
> This patch-set adds support for the set_link_ksettings and nway_reset
> ethtool commands and fixes some related ethtool bugs.
> 1, patch[4/6] adds support for ethtool_ops.set_link_ksettings.
> 2, patch[5/6] adds support for ethtool_ops.nway_reset.
> 3, patch[1/6,2/6,3/6,6/6] fix some bugs in getting port information by
>    the ethtool command (ethtool ethx).

Hi Lipeng

Do you want the fixes applied to net, and back ported to stable?

If so, you need to submit them separately, and against the correct
tree.

Andrew


RE: [PATCH net-next 0/6] net: hns3: support set_link_ksettings and for nway_reset ethtool command

2017-11-03 Thread Salil Mehta
Hi Andrew,

> -Original Message-
> From: Andrew Lunn [mailto:and...@lunn.ch]
> Sent: Friday, November 03, 2017 3:52 PM
> To: lipeng (Y)
> Cc: da...@davemloft.net; netdev@vger.kernel.org; linux-
> ker...@vger.kernel.org; Linuxarm; Salil Mehta
> Subject: Re: [PATCH net-next 0/6] net: hns3: support set_link_ksettings
> and for nway_reset ethtool command
> 
> On Fri, Nov 03, 2017 at 12:18:24PM +0800, Lipeng wrote:
> > This patch-set adds support for the set_link_ksettings and nway_reset
> > ethtool commands and fixes some related ethtool bugs.
> > 1, patch[4/6] adds support for ethtool_ops.set_link_ksettings.
> > 2, patch[5/6] adds support for ethtool_ops.nway_reset.
> > 3, patch[1/6,2/6,3/6,6/6] fix some bugs in getting port information by
> >    the ethtool command (ethtool ethx).
> 
> Hi Lipeng
> 
> Do you want the fixes applied to net, and back ported to stable?
> 
> If so, you need to submit them separately, and against the correct
> tree.
Yes, you are correct. This should have been submitted against the net
repo. But now that this patch-set has been accepted by Dave for
net-next, do you think we can still submit it again against Dave's net
repo?

Thanks
Salil

> 
>   Andrew


Re: TCP connection closed without FIN or RST

2017-11-03 Thread Eric Dumazet
On Fri, 2017-11-03 at 11:13 -0400, Vitaly Davidovich wrote:
> Ok, an interesting finding.  The client was originally running with
> SO_RCVBUF of 75K (apparently someone decided to set that for some
> unknown reason).  I tried the test with a 1MB recv buffer and
> everything works perfectly! The client responds with zero-window
> advertisements, the server just hits the persist condition and sends
> keep-alive probes; the client continues answering with a zero window
> up until it wakes up and starts processing data in its receive buffer.
> At that point, the window opens up and the server sends more data.
> Basically, things look as one would expect in this situation :).
> 
> /proc/sys/net/ipv4/tcp_rmem is 131072  1048576   20971520.  The
> conversation flows normally, as described above, when I change the
> client's recv buf size to 1048576.  I also tried 131072, but that
> doesn't work - same retrans/no ACKs situation.
> 
> I think this eliminates (right?) any middleboxes from the equation.
> Instead, perhaps it's some bad interaction between a low receive
> buffer size and either some other TCP setting or TSO mechanics (LRO
> specifically).  Still investigating further.
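
(For reference only -- not the reporter's actual code: a minimal sketch
of how a client might pin its receive buffer the way described above.
Setting SO_RCVBUF explicitly disables the kernel's receive-buffer
autotuning that tcp_rmem otherwise governs, and Linux doubles the
requested value to account for bookkeeping overhead.)

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Sketch: pin the receive buffer to 75 KB (the size mentioned above)
 * before connecting, then read back what the kernel actually granted.
 */
int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int rcvbuf = 75 * 1024;
	socklen_t len = sizeof(rcvbuf);

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0) {
		perror("setsockopt(SO_RCVBUF)");
		return 1;
	}

	getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
	printf("effective SO_RCVBUF: %d bytes\n", rcvbuf);	/* typically 2x the request */

	/* connect() and the usual recv() loop would follow here */
	return 0;
}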

Just in case, have you tried a more recent Linux kernel?

I would rather not spend time on a problem that might already be
fixed.




