Re: [PATCH net] tcp: fix possible deadlock in TCP stack vs BPF filter

2017-08-14 Thread David Miller
From: Eric Dumazet 
Date: Mon, 14 Aug 2017 17:44:43 -0700

> From: Eric Dumazet 
> 
> Filtering the ACK packet was not put at the right place.
> 
> At this place, we already allocated a child and put it
> into accept queue.
> 
> We absolutely need to call tcp_child_process() to release
> its spinlock, or we will deadlock at accept() or close() time.
> 
> Found by syzkaller team (Thanks a lot !)
> 
> Fixes: 8fac365f63c8 ("tcp: Add a tcp_filter hook before handle ack packet")
> Signed-off-by: Eric Dumazet 
> Reported-by: Dmitry Vyukov 

Applied, thanks.


Re: [PATCH net] dccp: purge write queue in dccp_destroy_sock()

2017-08-14 Thread David Miller
From: Eric Dumazet 
Date: Mon, 14 Aug 2017 14:10:25 -0700

> From: Eric Dumazet 
> 
> syzkaller reported that DCCP could have a non empty
> write queue at dismantle time.
> 
> WARNING: CPU: 1 PID: 2953 at net/core/stream.c:199 
> sk_stream_kill_queues+0x3ce/0x520 net/core/stream.c:199
> Kernel panic - not syncing: panic_on_warn set ...
 ...
> Signed-off-by: Eric Dumazet 
> Reported-by: Dmitry Vyukov 

Applied and queued up for -stable, thanks.


Re: [PATCH net] udp: fix linear skb reception with PEEK_OFF

2017-08-14 Thread David Miller
From: Paolo Abeni 
Date: Mon, 14 Aug 2017 21:31:38 +0200

> From: Al Viro 
> 
> copy_linear_skb() is broken; both of its callers actually
> expect 'len' to be the amount we are trying to copy,
> not the offset of the end.
> Fix it keeping the meanings of arguments in sync with what the
> callers (both of them) expect.
> Also restore a saner behavior on EFAULT (i.e. preserving
> the iov_iter position in case of failure):
> 
> The commit fd851ba9caa9 ("udp: harden copy_linear_skb()")
> avoids the more destructive effect of the buggy
> copy_linear_skb(), e.g. no more invalid memory access, but
> said function still behaves incorrectly: when peeking with
> offset it can fail with EINVAL instead of copying the
> appropriate amount of memory.
> 
> Reported-by: Sasha Levin 
> Fixes: b65ac44674dd ("udp: try to avoid 2 cache miss on dequeue")
> Fixes: fd851ba9caa9 ("udp: harden copy_linear_skb()")
> Signed-off-by: Al Viro 
> Acked-by: Paolo Abeni 
> Tested-by: Sasha Levin 
> ---
> This patch has been buried in a private email exchange for some
> time.
> I'm posting it on behalf of Al Viro, to avoid losing this merge
> window, since he is busy elsewhere.

Applied, thanks.
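
For readers following the thread, here is a minimal userspace sketch of the semantics the commit message describes: 'len' is the number of bytes to copy, 'off' is where the copy starts, and the destination position is restored on failure. It is not the kernel code. struct out_cursor, cursor_copy() and copy_linear_sketch() are hypothetical names, and a short copy into a too-small buffer stands in for a faulting user copy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical destination cursor standing in for the kernel's iov_iter. */
struct out_cursor {
	char *buf;
	size_t cap;
	size_t pos;
};

/* Copy up to len bytes; returns how many were copied (may be short). */
size_t cursor_copy(struct out_cursor *to, const char *src, size_t len)
{
	size_t room, n;

	assert(to->pos <= to->cap);
	room = to->cap - to->pos;
	n = len < room ? len : room;
	memcpy(to->buf + to->pos, src, n);
	to->pos += n;
	return n;
}

/*
 * len is the amount to copy, off is the starting offset in data
 * (not the offset of the end). On a short copy the cursor is rewound
 * so the caller sees an unchanged position.
 */
int copy_linear_sketch(const char *data, size_t len, size_t off,
		       struct out_cursor *to)
{
	size_t n = cursor_copy(to, data + off, len);

	if (n == len)
		return 0;
	to->pos -= n;
	return -EFAULT;
}
```

With this shape, a peek at offset 2 for 3 bytes copies exactly bytes 2..4 and advances the cursor by 3. A failed copy returns -EFAULT and leaves the cursor where it started.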


Re: [PATCH net-next] liquidio: fix issues with fw_type module parameter

2017-08-14 Thread David Miller
From: Felix Manlunas 
Date: Mon, 14 Aug 2017 12:17:56 -0700

> From: Derek Chickles 
> 
> The fw_type module parameter isn't showing up in the
> /sys/module/liquidio/parameters directory.  Fix it by setting the read
> permission bits for user, group, other in module_param_string().  Revise
> the description of fw_type.  Initialize the fw_type static char array with
> the default value to conform to the module parameter description.
> 
> Signed-off-by: Derek Chickles 
> Signed-off-by: Felix Manlunas 

Applied, thanks.


Re: [patch net-next 0/2] mlxsw: Add support for nexthop group consolidation for IPv6

2017-08-14 Thread David Miller
From: Jiri Pirko 
Date: Mon, 14 Aug 2017 21:09:18 +0200

> From: Jiri Pirko 
> 
> Arkadi says:
> 
> Due to limited ASIC resources the maximum number of routes is limited by
> the nexthop resource. In order to improve the routing scale nexthop
> consolidation should be performed.
> 
> In case of IPv4, the kernel does the consolidation of nexthops in the form
> of the fib_info struct. In that case, the driver uses the fib_info's
> address as a key for the internal nexthop group representative struct
> lookup. In case of IPv6, the kernel doesn't do consolidation, thus the
> driver should implement it by itself.
> 
> The hash value is calculated based on the nexthop set, by performing
> bitwise xor on the ifindexes of the nexthops, in a similar way to IPv4's
> kernel implementation. In case of collision a full match is performed
> between the sets, which includes address and ifindex comparison.
> 
> In order to use the same hash table in both cases (IPv4/6), the rhashtable
> is changed to operate on variable length key.

Series applied, thanks.
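
The hashing scheme in the cover letter above can be sketched in plain C as follows. This is only an illustration under stated assumptions, not the mlxsw code: struct nh_sketch and the nh_group_*() helpers are hypothetical names, the hash is a bare XOR of ifindexes with no further mixing, and the full comparison assumes a group contains no duplicate nexthops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one IPv6 nexthop: gateway address + ifindex. */
struct nh_sketch {
	uint8_t gw[16];
	int ifindex;
};

/* XOR of the ifindexes: cheap, and independent of nexthop order. */
uint32_t nh_group_hash(const struct nh_sketch *nhs, size_t count)
{
	uint32_t val = 0;
	size_t i;

	assert(nhs != NULL || count == 0);
	for (i = 0; i < count; i++)
		val ^= (uint32_t)nhs[i].ifindex;
	return val;
}

/* True if nh (address and ifindex) appears in the set. */
bool nh_group_has(const struct nh_sketch *nhs, size_t count,
		  const struct nh_sketch *nh)
{
	size_t i;

	for (i = 0; i < count; i++)
		if (nhs[i].ifindex == nh->ifindex &&
		    memcmp(nhs[i].gw, nh->gw, sizeof(nh->gw)) == 0)
			return true;
	return false;
}

/* Full match used on a hash collision; assumes no duplicate nexthops. */
bool nh_group_equal(const struct nh_sketch *a, size_t na,
		    const struct nh_sketch *b, size_t nb)
{
	size_t i;

	if (na != nb)
		return false;
	for (i = 0; i < na; i++)
		if (!nh_group_has(b, nb, &a[i]))
			return false;
	return true;
}
```

Two groups with ifindexes {3, 7} and {4, 0} hash to the same value (3 ^ 7 == 4 ^ 0), which is why the full address and ifindex comparison is still needed before two groups are treated as identical.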


Re: [PATCH V2 net-next 0/8] liquidio: adding support for ethtool --set-ring feature

2017-08-14 Thread David Miller
From: Felix Manlunas 
Date: Mon, 14 Aug 2017 12:00:34 -0700

> From: Intiyaz Basha 
> 
> Code reorganization is required for adding ethtool --set-ring feature.
> First seven patches are for code reorganization.  The last patch is for
> adding this feature.
> 
> Change Log:
> V1 -> V2
>  Only patch #8 was changed:  unnecessary parentheses were removed in two
>  if-statements in lio_ethtool_set_ringparam().

Series applied.


Re: [PATCH net] ipv6: release rt6->rt6i_idev properly during ifdown

2017-08-14 Thread David Miller
From: Wei Wang 
Date: Mon, 14 Aug 2017 10:44:59 -0700

> From: Wei Wang 
> 
> When a dst is created by addrconf_dst_alloc() for a host route or an
> anycast route, dst->dev points to loopback dev while rt6->rt6i_idev
> points to a real device.
> When the real device goes down, the current cleanup code only checks for
> dst->dev and assumes rt6->rt6i_idev->dev is the same. This causes the
> refcount leak on the real device in the above situation.
> This patch makes sure to always release the refcount taken on
> rt6->rt6i_idev during dst_dev_put().
> 
> Fixes: 587fea741134 ("ipv6: mark DST_NOGC and remove the operation of dst_free()")
> Reported-by: John Stultz 
> Tested-by: John Stultz 
> Tested-by: Martin KaFai Lau 
> Signed-off-by: Wei Wang 
> Signed-off-by: Martin KaFai Lau 

Applied, thank you.


Re: [PATCH net] af_key: do not use GFP_KERNEL in atomic contexts

2017-08-14 Thread David Miller
From: Eric Dumazet 
Date: Mon, 14 Aug 2017 10:16:45 -0700

> From: Eric Dumazet 
> 
> pfkey_broadcast() might be called from non-process contexts,
> so we cannot use GFP_KERNEL in these cases [1].
> 
> This patch partially reverts commit ba51b6be38c1 ("net: Fix RCU splat in
> af_key"), only keeping the GFP_ATOMIC forcing under rcu_read_lock()
> section.
> 
> [1] : syzkaller reported :
 ...
> Fixes: ba51b6be38c1 ("net: Fix RCU splat in af_key")
> Signed-off-by: Eric Dumazet 
> Reported-by: Dmitry Vyukov 

Applied and queued up for -stable, thanks.


Re: [PATCH net] tcp: ulp: avoid module refcnt leak in tcp_set_ulp

2017-08-14 Thread David Miller
From: Sabrina Dubroca 
Date: Mon, 14 Aug 2017 18:04:24 +0200

> __tcp_ulp_find_autoload returns tcp_ulp_ops after taking a reference on
> the module. Then, if ->init fails, tcp_set_ulp propagates the error but
> nothing releases that reference.
> 
> Fixes: 734942cc4ea6 ("tcp: ULP infrastructure")
> Signed-off-by: Sabrina Dubroca 

Applied, thanks.
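
The leak described above follows a common get/put pattern: a lookup takes a reference, so every error path after the lookup has to drop it. Below is a minimal userspace sketch of that pattern. It is not the tcp_ulp code; struct ulp_ops_sketch, ulp_find_get(), ulp_put() and set_ulp_sketch() are hypothetical names, and a plain integer stands in for the module reference count.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical ops structure; refcnt stands in for the module refcount. */
struct ulp_ops_sketch {
	int refcnt;
	int (*init)(void *sk);
};

/* Lookup that returns the ops with a reference held, or NULL. */
struct ulp_ops_sketch *ulp_find_get(struct ulp_ops_sketch *ops)
{
	if (ops)
		ops->refcnt++;
	return ops;
}

/* Drop a reference taken by ulp_find_get(). */
void ulp_put(struct ulp_ops_sketch *ops)
{
	assert(ops->refcnt > 0);
	ops->refcnt--;
}

/* Attach the ops; on init failure, release the reference before returning. */
int set_ulp_sketch(void *sk, struct ulp_ops_sketch *registered)
{
	struct ulp_ops_sketch *ops = ulp_find_get(registered);
	int err;

	if (!ops)
		return -ENOENT;

	err = ops->init(sk);
	if (err) {
		ulp_put(ops);	/* without this put, the reference leaks */
		return err;
	}
	return 0;
}

/* Two trivial init callbacks for exercising both paths. */
int init_ok(void *sk)
{
	(void)sk;
	return 0;
}

int init_fail(void *sk)
{
	(void)sk;
	return -ENOMEM;
}
```

On success the reference stays held for the lifetime of the attachment. On failure the count returns to its previous value, so the module can still be unloaded.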


Re: [PATCH v11 0/5] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-08-14 Thread David Miller
From: Ding Tianhong 
Date: Tue, 15 Aug 2017 11:23:22 +0800

> Some devices have problems with Transaction Layer Packets with the Relaxed
> Ordering Attribute set.  This patch set adds a new PCIe Device Flag,
> PCI_DEV_FLAGS_NO_RELAXED_ORDERING, a set of PCI Quirks to catch some known
> devices with Relaxed Ordering issues, and a use of this new flag by the
> cxgb4 driver to avoid using Relaxed Ordering with problematic Root Complex
> Ports.
 ...

Series applied, thanks.


Re: [PATCH net-next V2 3/3] tap: XDP support

2017-08-14 Thread Jason Wang



On 2017-08-15 00:01, Michael S. Tsirkin wrote:
> On Sat, Aug 12, 2017 at 10:48:49AM +0800, Jason Wang wrote:
> > On 2017-08-12 07:12, Jakub Kicinski wrote:
> > > On Fri, 11 Aug 2017 19:41:18 +0800, Jason Wang wrote:
> > > > This patch tries to implement XDP for tun. The implementation was
> > > > split into two parts:
> > > >
> > > > - fast path: small and no gso packet. We try to do XDP at page level
> > > >   before build_skb(). For XDP_TX, since creating/destroying queues
> > > >   were completely under control of userspace, it was implemented
> > > >   through generic XDP helper after skb has been built. This could be
> > > >   optimized in the future.
> > > > - slow path: big or gso packet. We try to do it after skb was created
> > > >   through generic XDP helpers.
> > > >
> > > > Tests were done through pktgen with small packets.
> > > >
> > > > xdp1 test shows ~41.1% improvement:
> > > >
> > > > Before: ~1.7Mpps
> > > > After:  ~2.3Mpps
> > > >
> > > > xdp_redirect to ixgbe shows ~60% improvement:
> > > >
> > > > Before: ~0.8Mpps
> > > > After:  ~1.38Mpps
> > > >
> > > > Suggested-by: Michael S. Tsirkin
> > > > Signed-off-by: Jason Wang
> > >
> > > Looks OK to me now :)
> > >
> > > Out of curiosity, you say the build_skb() is for "small packets", and it
> > > seems you are always reserving the 256B regardless of XDP being
> > > installed.  Does this have no performance impact on non-XDP case?
> >
> > I ran a test; only a difference of less than 1% was noticed, which I
> > think can be ignored.
> >
> > Thanks
>
> What did you test btw?

Pktgen

> The biggest issue would be with something like
> UDP with short packets.

Note that we do this only when sndbuf is INT_MAX, so this is probably
not an issue. The only thing that matters is more stress on the page
allocator, but according to the pktgen results it was small enough to
be ignored.

Thanks


Re: [PATCH net] openvswitch: fix skb_panic due to the incorrect actions attrlen

2017-08-14 Thread Pravin Shelar
On Sun, Aug 13, 2017 at 12:04 AM, Liping Zhang  wrote:
> From: Liping Zhang 
>
> For sw_flow_actions, the actions_len only represents the kernel part's
> size, and when we dump the actions to the userspace, we will do the
> conversions, so its true size may become bigger than the actions_len.
>
> But unfortunately, for OVS_PACKET_ATTR_ACTIONS, we use the actions_len
> to alloc the skbuff, so the user_skb's size may become insufficient and
> oops will happen like this:
>   skbuff: skb_over_panic: text:8148fabf len:1749 put:157 head:
>   881300f39000 data:881300f39000 tail:0x6d5 end:0x6c0 dev:
>   [ cut here ]
>   kernel BUG at net/core/skbuff.c:129!
>   [...]
>   Call Trace:
>
>[] skb_put+0x43/0x44
>[] skb_zerocopy+0x6c/0x1f4
>[] queue_userspace_packet+0x3a3/0x448 [openvswitch]
>[] ovs_dp_upcall+0x30/0x5c [openvswitch]
>[] output_userspace+0x132/0x158 [openvswitch]
>[] ? ip6_rcv_finish+0x74/0x77 [ipv6]
>[] do_execute_actions+0xcc1/0xdc8 [openvswitch]
>[] ovs_execute_actions+0x74/0x106 [openvswitch]
>[] ovs_dp_process_packet+0xe1/0xfd [openvswitch]
>[] ? key_extract+0x63c/0x8d5 [openvswitch]
>[] ovs_vport_receive+0xa1/0xc3 [openvswitch]
>   [...]
>
> Also we can see that the actions_len is much smaller than the orig_len:
>   crash> struct sw_flow_actions 0x8812f539d000
>   struct sw_flow_actions {
> rcu = {
>   next = 0x8812f5398800,
>   func = 0xe3b00035db32
> },
> orig_len = 1384,
> actions_len = 592,
> actions = 0x8812f539d01c
>   }
>
> So as a quick fix, use the orig_len instead of the actions_len to alloc
> the user_skb.
>
> Last, this oops happened on our system running a relatively old kernel, but
> the same risk still exists on the mainline, since we use the wrong
> actions_len from the beginning.
>
Thanks for fixing it.

> Fixes: ccea74457bbd ("openvswitch: include datapath actions with sampled-packet upcall to userspace")
> Cc: Neil McKee 
> Signed-off-by: Liping Zhang 
> ---
>  net/openvswitch/actions.c  | 39 +--
>  net/openvswitch/datapath.c |  2 +-
>  net/openvswitch/datapath.h |  1 +
>  3 files changed, 27 insertions(+), 15 deletions(-)
>
> diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
> index e4610676299b..799a22dfb89e 100644
> --- a/net/openvswitch/actions.c
> +++ b/net/openvswitch/actions.c
> @@ -48,6 +48,7 @@ struct deferred_action {
> struct sk_buff *skb;
> const struct nlattr *actions;
> int actions_len;
> +   int actions_attrlen;
>
Have you considered passing this value using struct ovs_skb_cb? That
would save passing this parameter in all these functions.


Re: [PATCH net-next V2 3/3] tap: XDP support

2017-08-14 Thread Jason Wang



On 2017-08-14 16:43, Daniel Borkmann wrote:
> On 08/11/2017 01:41 PM, Jason Wang wrote:
> > This patch tries to implement XDP for tun. The implementation was
> > split into two parts:
>
> [...]
> > @@ -1402,6 +1521,22 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
> >
> >       skb_reset_network_header(skb);
> >       skb_probe_transport_header(skb, 0);
> >
> > +    if (generic_xdp) {
> > +        struct bpf_prog *xdp_prog;
> > +        int ret;
> > +
> > +        rcu_read_lock();
> > +        xdp_prog = rcu_dereference(tun->xdp_prog);
>
> The name generic_xdp is a bit confusing in this context given this
> is 'native' XDP, perhaps above if (generic_xdp) should have a comment
> explaining semantics for tun and how it relates to actual generic xdp
> that sits at dev->xdp_prog, and gets run from netif_rx_ni(). Or just
> name the bool xdp_handle_gso with a comment that we let the generic
> XDP infrastructure deal with non-linear skbs instead of having to
> re-implement the do_xdp_generic() internals, plus a statement that
> the actual generic XDP comes a bit later in the path. That would at
> least make it more obvious to read, imho.

OK, since non-GSO packets (e.g. jumbo packets) may go this way too,
something like "xdp_handle_skb" is better. Will send a patch.

Thanks

> > +        if (xdp_prog) {
> > +            ret = do_xdp_generic(xdp_prog, skb);
> > +            if (ret != XDP_PASS) {
> > +                rcu_read_unlock();
> > +                return total_len;
> > +            }
> > +        }
> > +        rcu_read_unlock();
> > +    }
> > +
> >       rxhash = __skb_get_hash_symmetric(skb);
> >   #ifndef CONFIG_4KSTACKS
> >       tun_rx_batched(tun, tfile, skb, more);






Re: general protection fault in fib_dump_info

2017-08-14 Thread Eric Dumazet
On Tue, 2017-08-15 at 10:49 +0800, idaifish wrote:
> Syzkaller hit 'general protection fault in fib_dump_info' bug on
> commit 4.13-rc5..
> 
> Guilty file: net/ipv4/fib_semantics.c
> 
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault:  [#1] SMP KASAN
> Modules linked in:
> CPU: 0 PID: 2808 Comm: syz-executor0 Not tainted 4.13.0-rc5 #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> Ubuntu-1.8.2-1ubuntu1 04/01/2014
> task: 880078562700 task.stack: 88007811
> RIP: 0010:fib_dump_info+0x388/0x1170 net/ipv4/fib_semantics.c:1314
> RSP: 0018:880078117010 EFLAGS: 00010206
> RAX: dc00 RBX: 00fe RCX: 0002
> RDX: 0006 RSI: 880078117084 RDI: 0030
> RBP: 880078117268 R08: 000c R09: 8800780d80c8
> R10: 58d629b4 R11: 67fce681 R12: 
> R13: 8800784bd540 R14: 8800780d80b5 R15: 8800780d80a4
> FS:  022fa940() GS:88007fc0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 004387d0 CR3: 79135000 CR4: 06f0
> Call Trace:
>  inet_rtm_getroute+0xc89/0x1f50 net/ipv4/route.c:2766
>  rtnetlink_rcv_msg+0x288/0x680 net/core/rtnetlink.c:4217
>  netlink_rcv_skb+0x340/0x470 net/netlink/af_netlink.c:2397
>  rtnetlink_rcv+0x28/0x30 net/core/rtnetlink.c:4223
>  netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
>  netlink_unicast+0x4c4/0x6e0 net/netlink/af_netlink.c:1291
>  netlink_sendmsg+0x8c4/0xca0 net/netlink/af_netlink.c:1854
>  sock_sendmsg_nosec net/socket.c:633 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:643
>  ___sys_sendmsg+0x779/0x8d0 net/socket.c:2035
>  __sys_sendmsg+0xd1/0x170 net/socket.c:2069
>  SYSC_sendmsg net/socket.c:2080 [inline]
>  SyS_sendmsg+0x2d/0x50 net/socket.c:2076
>  entry_SYSCALL_64_fastpath+0x1a/0xa5
> RIP: 0033:0x4512e9
> RSP: 002b:7ffc75584cc8 EFLAGS: 0216 ORIG_RAX: 002e
> RAX: ffda RBX: 0002 RCX: 004512e9
> RDX:  RSI: 20f2cfc8 RDI: 0003
> RBP: 000e R08:  R09: 
> R10:  R11: 0216 R12: fffe
> R13: 00718000 R14: 20c44ff0 R15: 
> Code: 00 0f b6 8d ec fd ff ff 48 8b 85 f0 fd ff ff 88 48 17 48 8b 45
> 28 48 8d 78 30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f>
> b6 04 02 84 c0 74 08 3c 03 0f 8e cb 0c 00 00 48 8b 45 28 44
> RIP: fib_dump_info+0x388/0x1170 net/ipv4/fib_semantics.c:1314 RSP:
> 880078117010
> ---[ end trace 254a7af28348f88b ]---
> Kernel panic - not syncing: Fatal exception
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
> 
> -
> 
> .config and reproducer.prog are attached.  Unfortunately the extracted
> C program can't work.
> Maybe you can follow the instruction
> [https://github.com/google/syzkaller/blob/master/docs/executing_syzkaller_programs.md]
> to reproduce the bug.
> 
> 

Probably fixed by commit 2c87d63ac853550e734edfd45e1be5e5aa44fbcc
("ipv4: route: fix inet_rtm_getroute induced crash")
(In David Miller net tree)




[PATCH v11 4/5] net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-08-14 Thread Ding Tianhong
From: Casey Leedom 

The cxgb4 Ethernet driver now queries its PCIe configuration space to
determine whether it can send TLPs with the Relaxed Ordering Attribute set.

Remove enable_pcie_relaxed_ordering() so that the probe routine no longer
sets PCIe Capability Device Control[Relaxed Ordering Enable]. This makes
sure the driver will not send Relaxed Ordering TLPs to a Root Complex that
cannot handle them.

Signed-off-by: Casey Leedom 
Signed-off-by: Ding Tianhong 
Reviewed-by: Casey Leedom 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  |  1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 23 +--
 drivers/net/ethernet/chelsio/cxgb4/sge.c|  5 +++--
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index ef4be78..09ea62e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -529,6 +529,7 @@ enum { /* adapter flags */
USING_SOFT_PARAMS  = (1 << 6),
MASTER_PF  = (1 << 7),
FW_OFLD_CONN   = (1 << 9),
+   ROOT_NO_RELAXED_ORDERING = (1 << 10),
 };
 
 enum {
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index e403fa1..33bb867 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -4654,11 +4654,6 @@ static void print_port_info(const struct net_device *dev)
dev->name, adap->params.vpd.id, adap->name, buf);
 }
 
-static void enable_pcie_relaxed_ordering(struct pci_dev *dev)
-{
-   pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_DEVCTL_RELAX_EN);
-}
-
 /*
  * Free the following resources:
  * - memory used for tables
@@ -4908,7 +4903,6 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
}
 
pci_enable_pcie_error_reporting(pdev);
-   enable_pcie_relaxed_ordering(pdev);
pci_set_master(pdev);
pci_save_state(pdev);
 
@@ -4947,6 +4941,23 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
adapter->msg_enable = DFLT_MSG_ENABLE;
memset(adapter->chan_map, 0xff, sizeof(adapter->chan_map));
 
+   /* If possible, we use PCIe Relaxed Ordering Attribute to deliver
+* Ingress Packet Data to Free List Buffers in order to allow for
+* chipset performance optimizations between the Root Complex and
+* Memory Controllers.  (Messages to the associated Ingress Queue
+* notifying new Packet Placement in the Free Lists Buffers will be
+* sent without the Relaxed Ordering Attribute thus guaranteeing that
+* all preceding PCIe Transaction Layer Packets will be processed
+* first.)  But some Root Complexes have various issues with Upstream
+* Transaction Layer Packets with the Relaxed Ordering Attribute set.
+* PCIe devices under such Root Complexes will have the Relaxed
+* Ordering bit cleared in their configuration space, so we check our
+* PCIe configuration space to see if it's flagged with advice against
+* using Relaxed Ordering.
+*/
+   if (!pcie_relaxed_ordering_enabled(pdev))
+   adapter->flags |= ROOT_NO_RELAXED_ORDERING;
+
spin_lock_init(&adapter->stats_lock);
spin_lock_init(&adapter->tid_release_lock);
spin_lock_init(&adapter->win0_lock);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index ede1220..4ef68f6 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -2719,6 +2719,7 @@ int t4_sge_alloc_rxq(struct adapter *adap, struct sge_rspq *iq, bool fwevtq,
struct fw_iq_cmd c;
struct sge *s = &adap->sge;
struct port_info *pi = netdev_priv(dev);
+   int relaxed = !(adap->flags & ROOT_NO_RELAXED_ORDERING);
 
/* Size needs to be multiple of 16, including status entry. */
iq->size = roundup(iq->size, 16);
@@ -2772,8 +2773,8 @@ int t4_sge_alloc_rxq(struct adapter *adap, struct sge_rspq *iq, bool fwevtq,
 
flsz = fl->size / 8 + s->stat_len / sizeof(struct tx_desc);
c.iqns_to_fl0congen |= htonl(FW_IQ_CMD_FL0PACKEN_F |
-FW_IQ_CMD_FL0FETCHRO_F |
-FW_IQ_CMD_FL0DATARO_F |
+FW_IQ_CMD_FL0FETCHRO_V(relaxed) |
+FW_IQ_CMD_FL0DATARO_V(relaxed) |
 FW_IQ_CMD_FL0PADEN_F);
if (cong >= 0)
c.iqns_to_fl0congen |=
-- 
1.8.3.1
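
The flag handling in the patch above can be exercised in isolation with a small userspace sketch. This is only an illustration: ROOT_NO_RELAXED_ORDERING uses the value from the cxgb4.h hunk, but the SKETCH_* macros and their bit positions are made up for this example and do not reflect the real FW_IQ_CMD_* field layout.

```c
#include <assert.h>
#include <stdint.h>

/* Adapter flag value as added to cxgb4.h by the patch. */
#define ROOT_NO_RELAXED_ORDERING (1u << 10)

/* Hypothetical field positions, chosen only for this sketch. */
#define SKETCH_FL0FETCHRO_S 6
#define SKETCH_FL0DATARO_S  7
#define SKETCH_FL0FETCHRO_V(x) ((uint32_t)(x) << SKETCH_FL0FETCHRO_S)
#define SKETCH_FL0DATARO_V(x)  ((uint32_t)(x) << SKETCH_FL0DATARO_S)

/* Build the two Relaxed Ordering bits from the adapter flags. */
uint32_t fl_ro_bits_sketch(uint32_t adapter_flags)
{
	int relaxed = !(adapter_flags & ROOT_NO_RELAXED_ORDERING);

	/* The _V() style macros expect a 0/1 value to shift into place. */
	assert(relaxed == 0 || relaxed == 1);
	return SKETCH_FL0FETCHRO_V(relaxed) | SKETCH_FL0DATARO_V(relaxed);
}
```

The point of switching from the unconditional _F flags to _V(relaxed) is that both bits are set when the root complex is fine with Relaxed Ordering and both are clear when ROOT_NO_RELAXED_ORDERING is set.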




[PATCH v11 3/5] PCI: Disable Relaxed Ordering Attributes for AMD A1100

2017-08-14 Thread Ding Tianhong
Casey reported that the AMD ARM A1100 SoC has a bug in its PCIe
Root Port where Upstream Transaction Layer Packets with the Relaxed
Ordering Attribute clear are allowed to bypass earlier TLPs with
Relaxed Ordering set. This can cause data corruption, so we need
to disable the Relaxed Ordering Attribute for Upstream TLPs sent to
the Root Port.

Reported-and-suggested-by: Casey Leedom 
Signed-off-by: Casey Leedom 
Signed-off-by: Ding Tianhong 
Acked-by: Casey Leedom 
---
 drivers/pci/quirks.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 1272f7e..1407604 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4089,6 +4089,22 @@ static void quirk_relaxedordering_disable(struct pci_dev *dev)
  quirk_relaxedordering_disable);
 
 /*
+ * The AMD ARM A1100 (AKA "SEATTLE") SoC has a bug in its PCIe Root Complex
+ * where Upstream Transaction Layer Packets with the Relaxed Ordering
+ * Attribute clear are allowed to bypass earlier TLPs with Relaxed Ordering
+ * set.  This is a violation of the PCIe 3.0 Transaction Ordering Rules
+ * outlined in Section 2.4.1 (PCI Express(r) Base Specification Revision 3.0
+ * November 10, 2010).  As a result, on this platform we can't use Relaxed
+ * Ordering for Upstream TLPs.
+ */
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_AMD, 0x1a00, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_AMD, 0x1a01, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_AMD, 0x1a02, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+
+/*
  * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same
  * values for the Attribute as were supplied in the header of the
  * corresponding Request, except as explicitly allowed when IDO is used."
-- 
1.8.3.1




[PATCH v11 5/5] net/cxgb4vf: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-08-14 Thread Ding Tianhong
From: Casey Leedom 

The cxgb4vf Ethernet driver now queries its PCIe configuration space to
determine whether it can send TLPs with the Relaxed Ordering
Attribute set, just as the PF driver does.

Signed-off-by: Casey Leedom 
Signed-off-by: Ding Tianhong 
Reviewed-by: Casey Leedom 
---
 drivers/net/ethernet/chelsio/cxgb4vf/adapter.h  |  1 +
 drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c | 18 ++
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c  |  3 +++
 3 files changed, 22 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h b/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h
index 109bc63..08c6ddb 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/adapter.h
@@ -408,6 +408,7 @@ enum { /* adapter flags */
USING_MSI  = (1UL << 1),
USING_MSIX = (1UL << 2),
QUEUES_BOUND   = (1UL << 3),
+   ROOT_NO_RELAXED_ORDERING = (1UL << 4),
 };
 
 /*
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index ac7a150..2b85b87 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -2888,6 +2888,24 @@ static int cxgb4vf_pci_probe(struct pci_dev *pdev,
 */
adapter->name = pci_name(pdev);
adapter->msg_enable = DFLT_MSG_ENABLE;
+
+   /* If possible, we use PCIe Relaxed Ordering Attribute to deliver
+* Ingress Packet Data to Free List Buffers in order to allow for
+* chipset performance optimizations between the Root Complex and
+* Memory Controllers.  (Messages to the associated Ingress Queue
+* notifying new Packet Placement in the Free Lists Buffers will be
+* sent without the Relaxed Ordering Attribute thus guaranteeing that
+* all preceding PCIe Transaction Layer Packets will be processed
+* first.)  But some Root Complexes have various issues with Upstream
+* Transaction Layer Packets with the Relaxed Ordering Attribute set.
+* PCIe devices under such Root Complexes will have the Relaxed
+* Ordering bit cleared in their configuration space, so we check our
+* PCIe configuration space to see if it's flagged with advice against
+* using Relaxed Ordering.
+*/
+   if (!pcie_relaxed_ordering_enabled(pdev))
+   adapter->flags |= ROOT_NO_RELAXED_ORDERING;
+
err = adap_init0(adapter);
if (err)
goto err_unmap_bar;
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
index e37dde2..05498e7 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/sge.c
@@ -2205,6 +2205,7 @@ int t4vf_sge_alloc_rxq(struct adapter *adapter, struct sge_rspq *rspq,
struct port_info *pi = netdev_priv(dev);
struct fw_iq_cmd cmd, rpl;
int ret, iqandst, flsz = 0;
+   int relaxed = !(adapter->flags & ROOT_NO_RELAXED_ORDERING);
 
/*
 * If we're using MSI interrupts and we're not initializing the
@@ -2300,6 +2301,8 @@ int t4vf_sge_alloc_rxq(struct adapter *adapter, struct sge_rspq *rspq,
cpu_to_be32(
FW_IQ_CMD_FL0HOSTFCMODE_V(SGE_HOSTFCMODE_NONE) |
FW_IQ_CMD_FL0PACKEN_F |
+   FW_IQ_CMD_FL0FETCHRO_V(relaxed) |
+   FW_IQ_CMD_FL0DATARO_V(relaxed) |
FW_IQ_CMD_FL0PADEN_F);
 
/* In T6, for egress queue type FL there is internal overhead
-- 
1.8.3.1




[PATCH v11 2/5] PCI: Disable Relaxed Ordering for some Intel processors

2017-08-14 Thread Ding Tianhong
Section 3.9.1 of the Intel spec says:

3.9.1 Optimizing PCIe Performance for Accesses Toward Coherent Memory
  and Toward MMIO Regions (P2P)

In order to maximize performance for PCIe devices in the processors
listed in Table 3-6 below, the software should determine whether the
accesses are toward coherent memory (system memory) or toward MMIO
regions (P2P access to other devices). If the access is toward MMIO
region, then software can command HW to set the RO bit in the TLP
header, as this would allow hardware to achieve maximum throughput for
these types of accesses. For accesses toward coherent memory, software
can command HW to clear the RO bit in the TLP header (no RO), as this
would allow hardware to achieve maximum throughput for these types of
accesses.

Table 3-6. Intel Processor CPU RP Device IDs for Processors Optimizing
           PCIe Performance

Processor                          CPU RP Device IDs

Intel Xeon processors based on     6F01H-6F0EH
Broadwell microarchitecture

Intel Xeon processors based on     2F01H-2F0EH
Haswell microarchitecture

This means some Intel processors have performance issues when the Relaxed
Ordering Attribute is used, so disable Relaxed Ordering for these root ports.
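
As a compact restatement of Table 3-6, the two device ID ranges can be written as a single predicate. This is a sketch only: intel_rp_avoid_ro_sketch() is a hypothetical helper, and the patch itself registers one fixup entry per device ID rather than testing a range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a root port device ID falls in one of the Table 3-6 ranges. */
bool intel_rp_avoid_ro_sketch(uint16_t device_id)
{
	return (device_id >= 0x6f01 && device_id <= 0x6f0e) ||	/* Broadwell */
	       (device_id >= 0x2f01 && device_id <= 0x2f0e);	/* Haswell */
}
```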

Signed-off-by: Casey Leedom 
Signed-off-by: Ding Tianhong 
Acked-by: Alexander Duyck 
Acked-by: Ashok Raj 
---
 drivers/pci/quirks.c | 62 
 1 file changed, 62 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 61b59bf..1272f7e 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4027,6 +4027,68 @@ static void quirk_relaxedordering_disable(struct pci_dev *dev)
 }
 
 /*
+ * Intel Xeon processors based on Broadwell/Haswell microarchitecture Root
+ * Complex has a Flow Control Credit issue which can cause performance
+ * problems with Upstream Transaction Layer Packets with Relaxed Ordering set.
+ */
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f01, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f02, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f03, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f04, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f05, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f06, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f07, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f08, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f09, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f0a, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f0b, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f0c, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f0d, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f0e, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f01, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f02, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f03, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f04, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x2f05, PCI_CLASS_NOT_DEFINED, 8,
+ quirk_relaxedordering_disable);
+DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 

[PATCH v11 0/5] Add new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

2017-08-14 Thread Ding Tianhong
Some devices have problems with Transaction Layer Packets with the Relaxed
Ordering Attribute set.  This patch set adds a new PCIe Device Flag,
PCI_DEV_FLAGS_NO_RELAXED_ORDERING, a set of PCI Quirks to catch some known
devices with Relaxed Ordering issues, and a use of this new flag by the
cxgb4 driver to avoid using Relaxed Ordering with problematic Root Complex
Ports.

It's been years since I've submitted kernel.org patches; I apologise for the
almost certain submission errors.

v2: Alexander pointed out that v1 was only part of the whole solution:
some platforms with issues could use the new flag to indicate
that it is not safe to enable the relaxed ordering attribute, and then we
need to clear the relaxed ordering enable bits in the PCI configuration when
initializing the device. So add a new second patch that modifies the PCI
initialization code to clear the relaxed ordering enable bit in the
event that the root complex doesn't want relaxed ordering enabled.

The third patch is based on v1's second patch and is changed only
to query the relaxed ordering enable bit in the PCI configuration space
to allow the Chelsio NIC to send TLPs with the relaxed ordering attributes
set.

This version does not drop the defines for Intel drivers in favor of the
new way of checking whether to enable relaxed ordering, because that is not
the hardest part at the moment; we can fix it in the next patchset once
these patches reach their goal.

v3: Redesigned the logic of pci_configure_relaxed_ordering at configuration time.
If a PCIe device does not enable the relaxed ordering attribute by default,
we should not do anything in the PCIe configuration. Otherwise we
should check, via the PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag, whether any
of the devices above us do not support relaxed ordering; if the result
indicates that relaxed ordering is not supported, we should update our
device to disable relaxed ordering in configuration space. If the device
above us doesn't exist or isn't a PCIe device, we shouldn't do anything and
should skip updating relaxed ordering,
because we are probably running in a guest.

v4: Rename the functions pcie_get_relaxed_ordering and
pcie_disable_relaxed_ordering according to John's suggestion, modify the
description, and use true/false as the return value.

We shouldn't enable the relaxed ordering attribute for a PCIe device based
on the setting in the root complex configuration space, so fix it for cxgb4.

Fix some format issues.

v5: Removed unnecessary code from functions that only return a bool
value, and added the check for VF devices.

Base this patch set on 4.12-rc5.

v6: Fix the logic error in deciding whether to enable the relaxed ordering
attribute for cxgb4.

v7: The cxgb4 driver enables the PCIe Capability Device Control[Relaxed
Ordering Enable] bit in its PCI probe() routine, which breaks our current
solution on platforms that have problems when the relaxed ordering
attribute is enabled. Following the latest recommendations, remove
enable_pcie_relaxed_ordering(). This does not cover the Peer-to-Peer
case, but we agreed to leave that problem until it is actually triggered.

Base this patch set on the 4.12 release version.

v8: Change the second patch's title and description to make them more
reasonable, and add the Acked-by from Alex and Ashok.

Add a new patch to enable the Relaxed Ordering Attribute for the cxgb4vf
driver.

Base this patch set on 4.13-rc2.

v9: The document
(https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf)
indicates that Xeon processors based on the Broadwell/Haswell
microarchitecture have problems with the Relaxed Ordering Attribute
enabled, so add the whole list of Device IDs from Intel to the patch.

v10: Significant rework based on Bjorn's feedback: reorganize the first 2
 patches, split the Intel and AMD erratum SoCs into separate patches,
 rename pcie_relaxed_ordering_supported() to
 pcie_relaxed_ordering_enabled(), check only the root ports instead of
 every intervening switch, and update some commits.

v11: We shouldn't have an Intel engineer ack the AMD erratum patch; fix
 that funny mistake.

Casey Leedom (2):
  net/cxgb4: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag
  net/cxgb4vf: Use new PCI_DEV_FLAGS_NO_RELAXED_ORDERING flag

Ding Tianhong (3):
  PCI: Disable PCIe Relaxed Ordering if unsupported
  PCI: Disable Relaxed Ordering for some Intel processors
  PCI: Disable Relaxed Ordering Attributes for AMD A1100

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h |  1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c| 23 --
 drivers/net/ethernet/chelsio/cxgb4/sge.c   |  5 +-
 drivers/net/ethernet/chelsio/cxgb4vf/adapter.h |  1 +
 

[PATCH v11 1/5] PCI: Disable PCIe Relaxed Ordering if unsupported

2017-08-14 Thread Ding Tianhong
Bit 4 of the PCIe Device Control register indicates whether the device
is permitted to use relaxed ordering. On some platforms, using relaxed
ordering can cause performance issues or, due to errata, data corruption.
In such cases devices must avoid using relaxed ordering.

The patch adds a new flag PCI_DEV_FLAGS_NO_RELAXED_ORDERING to indicate that
Relaxed Ordering (RO) attribute should not be used for Transaction Layer
Packets (TLP) targeted towards these affected root complexes.

This patch checks if there is any node in the hierarchy that indicates that
using relaxed ordering is not safe. In such cases the patch turns off the
relaxed ordering by clearing the capability for this device.

Signed-off-by: Casey Leedom 
Signed-off-by: Ding Tianhong 
Acked-by: Ashok Raj 
Acked-by: Alexander Duyck 
Acked-by: Casey Leedom 
---
 drivers/pci/probe.c  | 43 +++
 drivers/pci/quirks.c | 11 +++
 include/linux/pci.h  |  3 +++
 3 files changed, 57 insertions(+)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index c31310d..779e646 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1762,6 +1762,48 @@ static void pci_configure_extended_tags(struct pci_dev *dev)
 PCI_EXP_DEVCTL_EXT_TAG);
 }
 
+/**
+ * pcie_relaxed_ordering_enabled - Probe for PCIe relaxed ordering enable
+ * @dev: PCI device to query
+ *
+ * Returns true if the device has enabled relaxed ordering attribute.
+ */
+bool pcie_relaxed_ordering_enabled(struct pci_dev *dev)
+{
+   u16 v;
+
+   pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &v);
+
+   return !!(v & PCI_EXP_DEVCTL_RELAX_EN);
+}
+EXPORT_SYMBOL(pcie_relaxed_ordering_enabled);
+
+static void pci_configure_relaxed_ordering(struct pci_dev *dev)
+{
+   struct pci_dev *root;
+
+   /* PCI_EXP_DEVCTL_RELAX_EN is RsvdP in VFs */
+   if (dev->is_virtfn)
+   return;
+
+   if (!pcie_relaxed_ordering_enabled(dev))
+   return;
+
+   /*
+* For now, we only deal with Relaxed Ordering issues with Root
+* Ports. Peer-to-Peer DMA is another can of worms.
+*/
+   root = pci_find_pcie_root_port(dev);
+   if (!root)
+   return;
+
+   if (root->dev_flags & PCI_DEV_FLAGS_NO_RELAXED_ORDERING) {
+   pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
+  PCI_EXP_DEVCTL_RELAX_EN);
+   dev_info(&dev->dev, "Disable Relaxed Ordering because the Root Port didn't support it\n");
+   }
+}
+
 static void pci_configure_device(struct pci_dev *dev)
 {
struct hotplug_params hpp;
@@ -1769,6 +1811,7 @@ static void pci_configure_device(struct pci_dev *dev)
 
pci_configure_mps(dev);
pci_configure_extended_tags(dev);
+   pci_configure_relaxed_ordering(dev);
 
memset(&hpp, 0, sizeof(hpp));
ret = pci_get_hp_params(dev, &hpp);
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 6967c6b..61b59bf 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -4016,6 +4016,17 @@ static void quirk_tw686x_class(struct pci_dev *pdev)
  quirk_tw686x_class);
 
 /*
+ * Some devices have problems with Transaction Layer Packets with the Relaxed
+ * Ordering Attribute set.  Such devices should mark themselves and other
+ * Device Drivers should check before sending TLPs with RO set.
+ */
+static void quirk_relaxedordering_disable(struct pci_dev *dev)
+{
+   dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
+   dev_info(&dev->dev, "Disable Relaxed Ordering Attributes to avoid PCIe Completion erratum\n");
+}
+
+/*
  * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same
  * values for the Attribute as were supplied in the header of the
  * corresponding Request, except as explicitly allowed when IDO is used."
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 4869e66..29606fb 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -188,6 +188,8 @@ enum pci_dev_flags {
 * the direct_complete optimization.
 */
PCI_DEV_FLAGS_NEEDS_RESUME = (__force pci_dev_flags_t) (1 << 11),
+   /* Don't use Relaxed Ordering for TLPs directed at this device */
+   PCI_DEV_FLAGS_NO_RELAXED_ORDERING = (__force pci_dev_flags_t) (1 << 12),
 };
 
 enum pci_irq_reroute_variant {
@@ -1125,6 +1127,7 @@ int pci_add_ext_cap_save_buffer(struct pci_dev *dev,
 void pci_pme_wakeup_bus(struct pci_bus *bus);
 void pci_d3cold_enable(struct pci_dev *dev);
 void pci_d3cold_disable(struct pci_dev *dev);
+bool pcie_relaxed_ordering_enabled(struct pci_dev *dev);
 
 /* PCI Virtual Channel */
 int pci_save_vc_state(struct pci_dev *dev);
-- 
1.8.3.1
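
The decision that pci_configure_relaxed_ordering() makes in the patch above can be modelled in user space. What follows is only a rough sketch under stated assumptions: struct model_dev, FLAG_NO_RELAXED_ORD and both helper functions are hypothetical stand-ins invented for illustration, not kernel APIs. The only values taken from the patch are the Relaxed Ordering Enable bit (bit 4 of Device Control) and the flag position (1 << 12).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Relaxed Ordering Enable is bit 4 of the PCIe Device Control register. */
#define DEVCTL_RELAX_EN      0x0010u
/* Hypothetical stand-in for PCI_DEV_FLAGS_NO_RELAXED_ORDERING (1 << 12). */
#define FLAG_NO_RELAXED_ORD  (1u << 12)

/* Hypothetical model of a device: only the fields this sketch needs. */
struct model_dev {
	uint16_t devctl;              /* Device Control register value */
	unsigned int flags;           /* quirk flags */
	bool is_virtfn;               /* true for an SR-IOV virtual function */
	struct model_dev *root_port;  /* NULL when no Root Port is visible */
};

/* True when the Relaxed Ordering Enable bit is set. */
static bool relaxed_ordering_enabled(const struct model_dev *dev)
{
	assert(dev != NULL);
	return (dev->devctl & DEVCTL_RELAX_EN) != 0;
}

/* Returns true if the enable bit was cleared, false if nothing changed. */
static bool configure_relaxed_ordering(struct model_dev *dev)
{
	assert(dev != NULL);

	if (dev->is_virtfn)                  /* bit is reserved in VFs */
		return false;
	if (!relaxed_ordering_enabled(dev))  /* nothing to disable */
		return false;
	if (dev->root_port == NULL)          /* e.g. running in a guest */
		return false;

	if (dev->root_port->flags & FLAG_NO_RELAXED_ORD) {
		dev->devctl &= (uint16_t)~DEVCTL_RELAX_EN;
		return true;
	}
	return false;
}
```

As in the patch, the model only ever clears the bit and never sets it, so a device that did not enable relaxed ordering is left alone.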




[PATCH v5 net 1/2] net: remove unnecessary rotation

2017-08-14 Thread Shaohua Li
From: Shaohua Li 

According to David Miller, the rotation doesn't really help avoid
security problems, so delete it.

Suggested-by: David Miller 
Signed-off-by: Shaohua Li 
---
 include/net/ipv6.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 6eac5cf..7548367 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -790,12 +790,6 @@ static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
 
hash = skb_get_hash_flowi6(skb, fl6);
 
-   /* Since this is being sent on the wire obfuscate hash a bit
-* to minimize possbility that any useful information to an
-* attacker is leaked. Only lower 20 bits are relevant.
-*/
-   rol32(hash, 16);
-
flowlabel = (__force __be32)hash & IPV6_FLOWLABEL_MASK;
 
if (net->ipv6.sysctl.flowlabel_state_ranges)
-- 
2.9.5
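
One detail visible in the removed hunk is that the statement calls rol32() without assigning the result. The user-space sketch below illustrates what that means for the resulting label. The rol32() here is a local copy of the common rotate-left idiom, FLOWLABEL_MASK is written in host byte order for simplicity, and the two label_* helpers are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* User-space copy of the usual 32-bit rotate-left idiom. */
static inline uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << (shift & 31)) | (word >> ((-shift) & 31));
}

/* Low 20 bits, the width of the IPv6 flow label (host byte order here). */
#define FLOWLABEL_MASK 0x000FFFFFu

/* Mirrors the shape of the removed statement: the result is discarded. */
static uint32_t label_discarding_rotation(uint32_t hash)
{
	(void)rol32(hash, 16);
	return hash & FLOWLABEL_MASK;
}

/* What an assigned rotation would have produced instead. */
static uint32_t label_with_rotation(uint32_t hash)
{
	hash = rol32(hash, 16);
	return hash & FLOWLABEL_MASK;
}
```

Because rol32() takes its argument by value, the discarded call leaves the hash unchanged, so in this model dropping the statement yields the same label as keeping it.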



[PATCH v5 net 2/2] net: fix tcp reset packet flowlabel for ipv6

2017-08-14 Thread Shaohua Li
From: Shaohua Li 

Please see below tcpdump output:
21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options 
[mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 40) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 
43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 
7], length 0
21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 30
21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 56) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903437], length 24
21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903437], length 0
21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options 
[nop,nop,TS val 2500904438 ecr 2500903438], length 24
21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload 
length: 20) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0

The tcp reset packet has a different flowlabel, which prevents our router
from correctly closing the tcp connection. We use the flowlabel for load
balancing, and routers in the path maintain connection state. So if the
flow label changes, the packet is routed through a different router. In
this case, the old router doesn't get the reset packet to close the tcp
connection.

The reason is the normal packet gets the skb->hash from sk->sk_txhash,
which is generated randomly.  ip6_make_flowlabel then uses the hash to
create a flowlabel. The reset packet doesn't get assigned a hash, so the
flowlabel is calculated with flowi6.

Since the user can't change a timewait sock's flowlabel, we create a fake
flowlabel for the timewait socket from the randomly generated hash
(sk->sk_txhash), then use it in the reset packet. In this way, the reset
packet will have the same flowlabel as normal packets.

This also fixes the flowlabel issue for reset packets when the user
configures a flowlabel, which was previously ignored.

Cc: Eric Dumazet 
Cc: Florent Fourcot 
Cc: Cong Wang 
Cc: Tom Herbert 
Signed-off-by: Shaohua Li 
---
 include/net/ipv6.h   | 42 +-
 net/ipv4/tcp_minisocks.c |  8 +++-
 net/ipv6/tcp_ipv6.c  | 18 +-
 3 files changed, 57 insertions(+), 11 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 7548367..fef395c 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -771,6 +771,30 @@ static inline void iph_to_flow_copy_v6addrs(struct flow_keys *flow,
 
 #define IP6_DEFAULT_AUTO_FLOW_LABELS   IP6_AUTO_FLOW_LABEL_OPTOUT
 
+static inline bool ip6_need_make_flowlabel(struct net *net, 

[PATCH v5 net 0/2] ipv6: fix flowlabel issue for reset packet

2017-08-14 Thread Shaohua Li
From: Shaohua Li 

Please see below tcpdump output:
21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 40) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[S], cksum 0x0529 (incorrect -> 0xf56c), seq 3282214508, win 43690, options 
[mss 65476,sackOK,TS val 2500903437 ecr 0,nop,wscale 7], length 0
21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 40) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[S.], cksum 0x0529 (incorrect -> 0x49ad), seq 1923801573, ack 3282214509, win 
43690, options [mss 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 
7], length 0
21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1bdf), seq 1, ack 1, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 62) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[P.], cksum 0x053f (incorrect -> 0xb8b1), seq 1:31, ack 1, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 30
21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[.], cksum 0x0521 (incorrect -> 0x1bc1), seq 1, ack 31, win 342, options 
[nop,nop,TS val 2500903437 ecr 2500903437], length 0
21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 56) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[P.], cksum 0x0539 (incorrect -> 0xb726), seq 1:25, ack 31, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903437], length 24
21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1ba7), seq 31, ack 25, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[F.], cksum 0x0521 (incorrect -> 0x1ba7), seq 25, ack 31, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903437], length 0
21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 32) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[.], cksum 0x0521 (incorrect -> 0x1ba6), seq 31, ack 26, win 342, options 
[nop,nop,TS val 2500903438 ecr 2500903438], length 0
21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) payload 
length: 56) fec0::5054:ff:fe12:3456.55804 > fec0::5054:ff:fe12:3456.: Flags 
[P.], cksum 0x0539 (incorrect -> 0xb324), seq 31:55, ack 26, win 342, options 
[nop,nop,TS val 2500904438 ecr 2500903438], length 24
21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) payload 
length: 20) fec0::5054:ff:fe12:3456. > fec0::5054:ff:fe12:3456.55804: Flags 
[R], cksum 0x0515 (incorrect -> 0x668c), seq 1923801599, win 0, length 0

The flowlabel of the reset packet (0xb34d5) and the flowlabel of normal
packets (0xd827f) are different. This prevents our router from correctly
closing the tcp connection. We use the flowlabel for load balancing, and
routers in the path maintain connection state. So if the flow label
changes, the packet is routed through a different router. In this case,
the old router doesn't get the reset packet to close the tcp connection.
The patches try to fix the issue.

Thanks,
Shaohua


Shaohua Li (2):
  net: remove unnecessary rotation
  net: fix tcp reset packet flowlabel for ipv6

 include/net/ipv6.h   | 48 +---
 net/ipv4/tcp_minisocks.c |  8 +++-
 net/ipv6/tcp_ipv6.c  | 18 +-
 3 files changed, 57 insertions(+), 17 deletions(-)

-- 
2.9.5
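
To make the failure mode concrete, here is a toy model of a balancer that picks a next hop from the flow label. It is an assumption-laden sketch: pick_path() and the modulo policy are hypothetical, and real routers may hash the label together with other header fields. It only shows why a reset carrying a different label can land on a router that holds no state for the connection.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical balancer: choose one of n_paths next hops from the flow label. */
static unsigned int pick_path(uint32_t flowlabel, unsigned int n_paths)
{
	assert(n_paths > 0);
	/* Only the low 20 bits form the IPv6 flow label. */
	return (flowlabel & 0x000FFFFFu) % n_paths;
}
```

With four paths in this model, the label used by the normal packets (0xd827f) selects path 3 while the label on the reset (0xb34d5) selects path 1, so the stateful router on path 3 never sees the reset.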



Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet

2017-08-14 Thread Shaohua Li
On Fri, Aug 11, 2017 at 06:00:20PM -0700, Tom Herbert wrote:
> On Thu, Aug 10, 2017 at 12:13 PM, Shaohua Li  wrote:
> > On Thu, Aug 10, 2017 at 11:30:51AM -0700, Tom Herbert wrote:
> >> On Thu, Aug 10, 2017 at 9:30 AM, Shaohua Li  wrote:
> >> > On Wed, Aug 09, 2017 at 09:40:08AM -0700, Tom Herbert wrote:
> >> >> On Mon, Jul 31, 2017 at 3:19 PM, Shaohua Li  wrote:
> >> >> > From: Shaohua Li 
> >> >> >
> >> >> > Please see below tcpdump output:
> >> >> > 21:00:48.109122 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) 
> >> >> > payload length: 40) fec0::5054:ff:fe12:3456.55804 > 
> >> >> > fec0::5054:ff:fe12:3456.: Flags [S], cksum 0x0529 (incorrect -> 
> >> >> > 0xf56c), seq 3282214508, win 43690, options [mss 65476,sackOK,TS val 
> >> >> > 2500903437 ecr 0,nop,wscale 7], length 0
> >> >> > 21:00:48.109381 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) 
> >> >> > payload length: 40) fec0::5054:ff:fe12:3456. > 
> >> >> > fec0::5054:ff:fe12:3456.55804: Flags [S.], cksum 0x0529 (incorrect -> 
> >> >> > 0x49ad), seq 1923801573, ack 3282214509, win 43690, options [mss 
> >> >> > 65476,sackOK,TS val 2500903437 ecr 2500903437,nop,wscale 7], length 0
> >> >> > 21:00:48.109548 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) 
> >> >> > payload length: 32) fec0::5054:ff:fe12:3456.55804 > 
> >> >> > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 
> >> >> > 0x1bdf), seq 1, ack 1, win 342, options [nop,nop,TS val 2500903437 
> >> >> > ecr 2500903437], length 0
> >> >> > 21:00:48.109823 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) 
> >> >> > payload length: 62) fec0::5054:ff:fe12:3456.55804 > 
> >> >> > fec0::5054:ff:fe12:3456.: Flags [P.], cksum 0x053f (incorrect -> 
> >> >> > 0xb8b1), seq 1:31, ack 1, win 342, options [nop,nop,TS val 2500903437 
> >> >> > ecr 2500903437], length 30
> >> >> > 21:00:48.109910 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) 
> >> >> > payload length: 32) fec0::5054:ff:fe12:3456. > 
> >> >> > fec0::5054:ff:fe12:3456.55804: Flags [.], cksum 0x0521 (incorrect -> 
> >> >> > 0x1bc1), seq 1, ack 31, win 342, options [nop,nop,TS val 2500903437 
> >> >> > ecr 2500903437], length 0
> >> >> > 21:00:48.110043 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) 
> >> >> > payload length: 56) fec0::5054:ff:fe12:3456. > 
> >> >> > fec0::5054:ff:fe12:3456.55804: Flags [P.], cksum 0x0539 (incorrect -> 
> >> >> > 0xb726), seq 1:25, ack 31, win 342, options [nop,nop,TS val 
> >> >> > 2500903438 ecr 2500903437], length 24
> >> >> > 21:00:48.110173 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) 
> >> >> > payload length: 32) fec0::5054:ff:fe12:3456.55804 > 
> >> >> > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 
> >> >> > 0x1ba7), seq 31, ack 25, win 342, options [nop,nop,TS val 2500903438 
> >> >> > ecr 2500903438], length 0
> >> >> > 21:00:48.110211 IP6 (flowlabel 0xd827f, hlim 64, next-header TCP (6) 
> >> >> > payload length: 32) fec0::5054:ff:fe12:3456. > 
> >> >> > fec0::5054:ff:fe12:3456.55804: Flags [F.], cksum 0x0521 (incorrect -> 
> >> >> > 0x1ba7), seq 25, ack 31, win 342, options [nop,nop,TS val 2500903438 
> >> >> > ecr 2500903437], length 0
> >> >> > 21:00:48.151099 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) 
> >> >> > payload length: 32) fec0::5054:ff:fe12:3456.55804 > 
> >> >> > fec0::5054:ff:fe12:3456.: Flags [.], cksum 0x0521 (incorrect -> 
> >> >> > 0x1ba6), seq 31, ack 26, win 342, options [nop,nop,TS val 2500903438 
> >> >> > ecr 2500903438], length 0
> >> >> > 21:00:49.110524 IP6 (flowlabel 0x43304, hlim 64, next-header TCP (6) 
> >> >> > payload length: 56) fec0::5054:ff:fe12:3456.55804 > 
> >> >> > fec0::5054:ff:fe12:3456.: Flags [P.], cksum 0x0539 (incorrect -> 
> >> >> > 0xb324), seq 31:55, ack 26, win 342, options [nop,nop,TS val 
> >> >> > 2500904438 ecr 2500903438], length 24
> >> >> > 21:00:49.110637 IP6 (flowlabel 0xb34d5, hlim 64, next-header TCP (6) 
> >> >> > payload length: 20) fec0::5054:ff:fe12:3456. > 
> >> >> > fec0::5054:ff:fe12:3456.55804: Flags [R], cksum 0x0515 (incorrect -> 
> >> >> > 0x668c), seq 1923801599, win 0, length 0
> >> >> >
> >> >> > The flowlabel of reset packet (0xb34d5) and flowlabel of normal packet
> >> >> > (0xd827f) are different. This causes our router doesn't correctly 
> >> >> > close tcp
> >> >> > connection. The patches try to fix the issue.
> >> >> >
> >> >> Shaohua,
> >> >>
> >> >> Can you give some more detail about what the router doesn't close the
> >> >> TCP connection means? I'm guessing the problem is either: 1) the
> >> >> router is maintaining connection state that includes the flow label in
> >> >> a connection tuple. 2) some router in the path is maintaining
> >> >> connection state, but when the flow label changes the flow's packet
> >> >> are routed through a different router that doesn't have a state for
> >> >> the flow it drops the packet. 

Re: [PATCH v2] sctp: fully initialize the IPv6 address in sctp_v6_to_addr()

2017-08-14 Thread David Miller
From: Marcelo Ricardo Leitner 
Date: Mon, 14 Aug 2017 22:58:14 -0300

> On Tue, Aug 15, 2017 at 10:43:59AM +0900, 吉藤英明 wrote:
>> > diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
>> > index 2a186b201ad2..a15d691829c6 100644
>> > --- a/net/sctp/ipv6.c
>> > +++ b/net/sctp/ipv6.c
>> > @@ -513,6 +513,8 @@ static void sctp_v6_to_addr(union sctp_addr *addr, struct in6_addr *saddr,
>> > addr->sa.sa_family = AF_INET6;
>> > addr->v6.sin6_port = port;
>> > addr->v6.sin6_addr = *saddr;
>> > +   addr->v6.sin6_flowinfo = 0;
>> > +   addr->v6.sin6_scope_id = 0;
>> 
>> Please set flowinfo between port and addr.
> 
> Why?

Store buffer compression.

You want to always initialize structure members in the order
they are in memory.

No, the compiler won't do this automatically.
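
The ordering being asked for can be sketched against the user-space struct sockaddr_in6, which is the type behind the v6 member of union sctp_addr. This is a sketch only: fill_v6() and layout_matches() are hypothetical helpers, and the member order is checked with offsetof() rather than assumed, since it is a property of the platform's headers.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Fill every member of a sockaddr_in6 in ascending address order. */
static void fill_v6(struct sockaddr_in6 *sa, in_port_t port,
		    const struct in6_addr *addr)
{
	assert(sa != NULL && addr != NULL);
	sa->sin6_family = AF_INET6;
	sa->sin6_port = port;
	sa->sin6_flowinfo = 0;   /* sits between port and addr */
	sa->sin6_addr = *addr;
	sa->sin6_scope_id = 0;
}

/* Returns 1 when the members are laid out in the order used above. */
static int layout_matches(void)
{
	return offsetof(struct sockaddr_in6, sin6_family) <
		       offsetof(struct sockaddr_in6, sin6_port) &&
	       offsetof(struct sockaddr_in6, sin6_port) <
		       offsetof(struct sockaddr_in6, sin6_flowinfo) &&
	       offsetof(struct sockaddr_in6, sin6_flowinfo) <
		       offsetof(struct sockaddr_in6, sin6_addr) &&
	       offsetof(struct sockaddr_in6, sin6_addr) <
		       offsetof(struct sockaddr_in6, sin6_scope_id);
}
```

Writing the stores in this order keeps them to adjacent, ascending addresses, which is the pattern the comment about store buffer compression refers to.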


linux-next: build failure after merge of the net-next tree

2017-08-14 Thread Stephen Rothwell
Hi all,

After merging the net-next tree, today's linux-next build (arm
multi_v7_defconfig) failed like this:

arch/arm/boot/dts/rk3228-evb.dtb: ERROR (phandle_references): Reference to non-existent node or label "phy0"

Caused by commit

  db40f15b53e4 ("ARM: dts: rk3228-evb: Enable the integrated PHY for gmac")

It's possible that the error is caused by an interaction with another
commit, but I could not find anything obvious.  I have reverted that
commit for today.

-- 
Cheers,
Stephen Rothwell


Re: [PATCH v2] sctp: fully initialize the IPv6 address in sctp_v6_to_addr()

2017-08-14 Thread Marcelo Ricardo Leitner
On Tue, Aug 15, 2017 at 10:43:59AM +0900, 吉藤英明 wrote:
> Hi,
> 
> 2017-08-15 3:43 GMT+09:00 Alexander Potapenko :
> > KMSAN reported use of uninitialized sctp_addr->v4.sin_addr.s_addr and
> > sctp_addr->v6.sin6_scope_id in sctp_v6_cmp_addr() (see below).
> > Make sure all fields of an IPv6 address are initialized, which
> > guarantees that the IPv4 fields are also initialized.
> >
> > ==
> >  BUG: KMSAN: use of uninitialized memory in sctp_v6_cmp_addr+0x8d4/0x9f0
> >  net/sctp/ipv6.c:517
> >  CPU: 2 PID: 31056 Comm: syz-executor1 Not tainted 4.11.0-rc5+ #2944
> >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
> >  01/01/2011
> >  Call Trace:
> >   dump_stack+0x172/0x1c0 lib/dump_stack.c:42
> >   is_logbuf_locked mm/kmsan/kmsan.c:59 [inline]
> >   kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:938
> >   native_save_fl arch/x86/include/asm/irqflags.h:18 [inline]
> >   arch_local_save_flags arch/x86/include/asm/irqflags.h:72 [inline]
> >   arch_local_irq_save arch/x86/include/asm/irqflags.h:113 [inline]
> >   __msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:467
> >   sctp_v6_cmp_addr+0x8d4/0x9f0 net/sctp/ipv6.c:517
> >   sctp_v6_get_dst+0x8c7/0x1630 net/sctp/ipv6.c:290
> >   sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
> >   sctp_assoc_add_peer+0x66d/0x16f0 net/sctp/associola.c:651
> >   sctp_sendmsg+0x35a5/0x4f90 net/sctp/socket.c:1871
> >   inet_sendmsg+0x498/0x670 net/ipv4/af_inet.c:762
> >   sock_sendmsg_nosec net/socket.c:633 [inline]
> >   sock_sendmsg net/socket.c:643 [inline]
> >   SYSC_sendto+0x608/0x710 net/socket.c:1696
> >   SyS_sendto+0x8a/0xb0 net/socket.c:1664
> >   entry_SYSCALL_64_fastpath+0x13/0x94
> >  RIP: 0033:0x44b479
> >  RSP: 002b:7f6213f21c08 EFLAGS: 0286 ORIG_RAX: 002c
> >  RAX: ffda RBX: 2000 RCX: 0044b479
> >  RDX: 0041 RSI: 20edd000 RDI: 0006
> >  RBP: 007080a8 R08: 20b85fe4 R09: 001c
> >  R10: 00040005 R11: 0286 R12: 
> >  R13: 3760 R14: 006e5820 R15: 00ff8000
> >  origin description: dst_saddr@sctp_v6_get_dst
> >  local variable created at:
> >   sk_fullsock include/net/sock.h:2321 [inline]
> >   inet6_sk include/linux/ipv6.h:309 [inline]
> >   sctp_v6_get_dst+0x91/0x1630 net/sctp/ipv6.c:241
> >   sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
> > ==
> >  BUG: KMSAN: use of uninitialized memory in sctp_v6_cmp_addr+0x8d4/0x9f0
> >  net/sctp/ipv6.c:517
> >  CPU: 2 PID: 31056 Comm: syz-executor1 Not tainted 4.11.0-rc5+ #2944
> >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
> >  01/01/2011
> >  Call Trace:
> >   dump_stack+0x172/0x1c0 lib/dump_stack.c:42
> >   is_logbuf_locked mm/kmsan/kmsan.c:59 [inline]
> >   kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:938
> >   native_save_fl arch/x86/include/asm/irqflags.h:18 [inline]
> >   arch_local_save_flags arch/x86/include/asm/irqflags.h:72 [inline]
> >   arch_local_irq_save arch/x86/include/asm/irqflags.h:113 [inline]
> >   __msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:467
> >   sctp_v6_cmp_addr+0x8d4/0x9f0 net/sctp/ipv6.c:517
> >   sctp_v6_get_dst+0x8c7/0x1630 net/sctp/ipv6.c:290
> >   sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
> >   sctp_assoc_add_peer+0x66d/0x16f0 net/sctp/associola.c:651
> >   sctp_sendmsg+0x35a5/0x4f90 net/sctp/socket.c:1871
> >   inet_sendmsg+0x498/0x670 net/ipv4/af_inet.c:762
> >   sock_sendmsg_nosec net/socket.c:633 [inline]
> >   sock_sendmsg net/socket.c:643 [inline]
> >   SYSC_sendto+0x608/0x710 net/socket.c:1696
> >   SyS_sendto+0x8a/0xb0 net/socket.c:1664
> >   entry_SYSCALL_64_fastpath+0x13/0x94
> >  RIP: 0033:0x44b479
> >  RSP: 002b:7f6213f21c08 EFLAGS: 0286 ORIG_RAX: 002c
> >  RAX: ffda RBX: 2000 RCX: 0044b479
> >  RDX: 0041 RSI: 20edd000 RDI: 0006
> >  RBP: 007080a8 R08: 20b85fe4 R09: 001c
> >  R10: 00040005 R11: 0286 R12: 
> >  R13: 3760 R14: 006e5820 R15: 00ff8000
> >  origin description: dst_saddr@sctp_v6_get_dst
> >  local variable created at:
> >   sk_fullsock include/net/sock.h:2321 [inline]
> >   inet6_sk include/linux/ipv6.h:309 [inline]
> >   sctp_v6_get_dst+0x91/0x1630 net/sctp/ipv6.c:241
> >   sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
> > ==
> >
> > Signed-off-by: Alexander Potapenko 
> > Reviewed-by: Xin Long 
> > ---
> > v2 is identical to v1, resending per request by Marcelo Ricardo Leitner.
> > ---
> >  net/sctp/ipv6.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git 

linux-next: manual merge of the net-next tree with the rockchip tree

2017-08-14 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  arch/arm64/boot/dts/rockchip/rk3328.dtsi

between commit:

  c60c0373a5e8 ("arm64: dts: rockchip: add usb2 nodes for RK3328 SoCs")

from the rockchip tree and commit:

  9c4cc910fe28 ("ARM64: dts: rockchip: Add gmac2phy node support for rk3328")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc arch/arm64/boot/dts/rockchip/rk3328.dtsi
index e6da0cee1241,d48bf5d9f8bd..
--- a/arch/arm64/boot/dts/rockchip/rk3328.dtsi
+++ b/arch/arm64/boot/dts/rockchip/rk3328.dtsi
@@@ -616,45 -426,43 +618,82 @@@
status = "disabled";
};
  
+   gmac2phy: ethernet@ff55 {
+   compatible = "rockchip,rk3328-gmac";
+   reg = <0x0 0xff55 0x0 0x1>;
+   rockchip,grf = <>;
+   interrupts = ;
+   interrupt-names = "macirq";
+   clocks = < SCLK_MAC2PHY_SRC>, < SCLK_MAC2PHY_RXTX>,
+< SCLK_MAC2PHY_RXTX>, < SCLK_MAC2PHY_REF>,
+< ACLK_MAC2PHY>, < PCLK_MAC2PHY>,
+< SCLK_MAC2PHY_OUT>;
+   clock-names = "stmmaceth", "mac_clk_rx",
+ "mac_clk_tx", "clk_mac_ref",
+ "aclk_mac", "pclk_mac",
+ "clk_macphy";
+   resets = < SRST_GMAC2PHY_A>, < SRST_MACPHY>;
+   reset-names = "stmmaceth", "mac-phy";
+   phy-mode = "rmii";
+   phy-handle = <>;
+   status = "disabled";
+ 
+   mdio {
+   compatible = "snps,dwmac-mdio";
+   #address-cells = <1>;
+   #size-cells = <0>;
+ 
+   phy: phy@0 {
+   compatible = "ethernet-phy-id1234.d400", "ethernet-phy-ieee802.3-c22";
+   reg = <0>;
+   clocks = < SCLK_MAC2PHY_OUT>;
+   resets = < SRST_MACPHY>;
+   pinctrl-names = "default";
+   pinctrl-0 = <_rxm1 _linkm1>;
+   phy-is-integrated;
+   };
+   };
+   };
+ 
 +  usb20_otg: usb@ff58 {
 +  compatible = "rockchip,rk3328-usb", "rockchip,rk3066-usb",
 +   "snps,dwc2";
 +  reg = <0x0 0xff58 0x0 0x4>;
 +  interrupts = ;
 +  clocks = < HCLK_OTG>;
 +  clock-names = "otg";
 +  dr_mode = "otg";
 +  g-np-tx-fifo-size = <16>;
 +  g-rx-fifo-size = <280>;
 +  g-tx-fifo-size = <256 128 128 64 32 16>;
 +  g-use-dma;
 +  phys = <_otg>;
 +  phy-names = "usb2-phy";
 +  status = "disabled";
 +  };
 +
 +  usb_host0_ehci: usb@ff5c {
 +  compatible = "generic-ehci";
 +  reg = <0x0 0xff5c 0x0 0x1>;
 +  interrupts = ;
 +  clocks = < HCLK_HOST0>, <>;
 +  clock-names = "usbhost", "utmi";
 +  phys = <_host>;
 +  phy-names = "usb";
 +  status = "disabled";
 +  };
 +
 +  usb_host0_ohci: usb@ff5d {
 +  compatible = "generic-ohci";
 +  reg = <0x0 0xff5d 0x0 0x1>;
 +  interrupts = ;
 +  clocks = < HCLK_HOST0>, <>;
 +  clock-names = "usbhost", "utmi";
 +  phys = <_host>;
 +  phy-names = "usb";
 +  status = "disabled";
 +  };
 +
gic: interrupt-controller@ff811000 {
compatible = "arm,gic-400";
#interrupt-cells = <3>;


Re: [PATCH v2] sctp: fully initialize the IPv6 address in sctp_v6_to_addr()

2017-08-14 Thread 吉藤英明
Hi,

2017-08-15 3:43 GMT+09:00 Alexander Potapenko :
> KMSAN reported use of uninitialized sctp_addr->v4.sin_addr.s_addr and
> sctp_addr->v6.sin6_scope_id in sctp_v6_cmp_addr() (see below).
> Make sure all fields of an IPv6 address are initialized, which
> guarantees that the IPv4 fields are also initialized.
>
> ==
>  BUG: KMSAN: use of uninitialized memory in sctp_v6_cmp_addr+0x8d4/0x9f0
>  net/sctp/ipv6.c:517
>  CPU: 2 PID: 31056 Comm: syz-executor1 Not tainted 4.11.0-rc5+ #2944
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
>  01/01/2011
>  Call Trace:
>   dump_stack+0x172/0x1c0 lib/dump_stack.c:42
>   is_logbuf_locked mm/kmsan/kmsan.c:59 [inline]
>   kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:938
>   native_save_fl arch/x86/include/asm/irqflags.h:18 [inline]
>   arch_local_save_flags arch/x86/include/asm/irqflags.h:72 [inline]
>   arch_local_irq_save arch/x86/include/asm/irqflags.h:113 [inline]
>   __msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:467
>   sctp_v6_cmp_addr+0x8d4/0x9f0 net/sctp/ipv6.c:517
>   sctp_v6_get_dst+0x8c7/0x1630 net/sctp/ipv6.c:290
>   sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
>   sctp_assoc_add_peer+0x66d/0x16f0 net/sctp/associola.c:651
>   sctp_sendmsg+0x35a5/0x4f90 net/sctp/socket.c:1871
>   inet_sendmsg+0x498/0x670 net/ipv4/af_inet.c:762
>   sock_sendmsg_nosec net/socket.c:633 [inline]
>   sock_sendmsg net/socket.c:643 [inline]
>   SYSC_sendto+0x608/0x710 net/socket.c:1696
>   SyS_sendto+0x8a/0xb0 net/socket.c:1664
>   entry_SYSCALL_64_fastpath+0x13/0x94
>  RIP: 0033:0x44b479
>  RSP: 002b:7f6213f21c08 EFLAGS: 0286 ORIG_RAX: 002c
>  RAX: ffda RBX: 2000 RCX: 0044b479
>  RDX: 0041 RSI: 20edd000 RDI: 0006
>  RBP: 007080a8 R08: 20b85fe4 R09: 001c
>  R10: 00040005 R11: 0286 R12: 
>  R13: 3760 R14: 006e5820 R15: 00ff8000
>  origin description: dst_saddr@sctp_v6_get_dst
>  local variable created at:
>   sk_fullsock include/net/sock.h:2321 [inline]
>   inet6_sk include/linux/ipv6.h:309 [inline]
>   sctp_v6_get_dst+0x91/0x1630 net/sctp/ipv6.c:241
>   sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
> ==
>
> Signed-off-by: Alexander Potapenko 
> Reviewed-by: Xin Long 
> ---
> v2 is identical to v1, resending per request by Marcelo Ricardo Leitner.
> ---
>  net/sctp/ipv6.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index 2a186b201ad2..a15d691829c6 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -513,6 +513,8 @@ static void sctp_v6_to_addr(union sctp_addr *addr, struct 
> in6_addr *saddr,
> addr->sa.sa_family = 

Re: [PATCH v10 3/5] PCI: Disable Relaxed Ordering Attributes for AMD A1100

2017-08-14 Thread Ding Tianhong


On 2017/8/15 1:19, Raj, Ashok wrote:
> On Mon, Aug 14, 2017 at 11:44:57PM +0800, Ding Tianhong wrote:
>> Casey reported that the AMD ARM A1100 SoC has a bug in its PCIe
>> Root Port where Upstream Transaction Layer Packets with the Relaxed
>> Ordering Attribute clear are allowed to bypass earlier TLPs with
>> Relaxed Ordering set. This can cause data corruption, so we need
>> to disable the Relaxed Ordering Attribute for Upstream TLPs to the
>> Root Port.
>>
>> Signed-off-by: Casey Leedom 
>> Signed-off-by: Ding Tianhong 
>> Acked-by: Alexander Duyck 
>> Acked-by: Ashok Raj 
> 
> I can't ack this patch :-).. must be someone from AMD. Please remove my
> signature from this.
> 

Sorry for the funny mistake :) I will fix it.

Ding

>> ---
>>  drivers/pci/quirks.c | 16 
>>  1 file changed, 16 insertions(+)
>>
>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>> index 1272f7e..1407604 100644
>> --- a/drivers/pci/quirks.c
>> +++ b/drivers/pci/quirks.c
>> @@ -4089,6 +4089,22 @@ static void quirk_relaxedordering_disable(struct pci_dev *dev)
>>quirk_relaxedordering_disable);
>>  
>>  /*
>> + * The AMD ARM A1100 (AKA "SEATTLE") SoC has a bug in its PCIe Root Complex
>> + * where Upstream Transaction Layer Packets with the Relaxed Ordering
>> + * Attribute clear are allowed to bypass earlier TLPs with Relaxed Ordering
>> + * set.  This is a violation of the PCIe 3.0 Transaction Ordering Rules
>> + * outlined in Section 2.4.1 (PCI Express(r) Base Specification Revision 3.0
>> + * November 10, 2010).  As a result, on this platform we can't use Relaxed
>> + * Ordering for Upstream TLPs.
>> + */
>> +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_AMD, 0x1a00, PCI_CLASS_NOT_DEFINED, 8,
>> +  quirk_relaxedordering_disable);
>> +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_AMD, 0x1a01, PCI_CLASS_NOT_DEFINED, 8,
>> +  quirk_relaxedordering_disable);
>> +DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_AMD, 0x1a02, PCI_CLASS_NOT_DEFINED, 8,
>> +  quirk_relaxedordering_disable);
>> +
>> +/*
>>   * Per PCIe r3.0, sec 2.2.9, "Completion headers must supply the same
>>   * values for the Attribute as were supplied in the header of the
>>   * corresponding Request, except as explicitly allowed when IDO is used."
>> -- 
>> 1.8.3.1
>>
>>
> 
> .
> 



Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs

2017-08-14 Thread Willem de Bruijn
On Mon, Aug 14, 2017 at 11:31 AM, Paolo Abeni  wrote:
> On Mon, 2017-08-14 at 11:03 -0400, Willem de Bruijn wrote:
>> > I'm actually surprised that only unix sockets can have negative values.  Is
>> > there a reason for that?  I had assumed that sk_set_peek_off would allow
>> > negative values as the code already has to support negative values due to
>> > what the initial value is.
>>
>> A negative initial value indicates that PEEK_OFF is disabled. It only
>> makes sense to peek from a positive offset from the start of the data.
>
> With the current code, user space can disable peeking with offset by
> setting a negative offset value after peeking with offset has
> been enabled. But only for UNIX sockets. I think the same should be
> allowed for UDP sockets.

Agreed. Reverting to no-offset should be allowed.

>> > > I'm wondering whether adding an explicit SOCK_PEEK_OFF/MSG_PEEK_OFF socket flag
>> > > would help simplify the code:
>>
>> The behavior needs to be bifurcated between peeking with
>> offset and without offset.
>>
>> When peeking with offset is enabled by setting SO_PEEK_OFF,
>> subsequent reads do move the offset, so the observed behavior
>> is correct.
>>
>> When sk->sk_peek_offset is negative, offset mode is disabled
>> and the same packet must be read twice.
>>
>> An explicit boolean flag to discern between the two may make
>> the code simpler to understand, not sure whether that is logically
>> required.
>
> Yes, an explicit PEEK_OFF flag is just to keep the code simpler, so
> that there is no need to add checks at every sk_peek_offset() call site
> and the relevant logic can be fully encapsulated under the MSG_PEEK
> branch in __skb_try_recv_from_queue(), I think/hope.
> It's not a functional requirement.

It is problematic that sk_peek_offset returns zero both on zero offset
and when peeking at offset is disabled.

It is not infeasible to fix that and fix up all callers, as Matthew's
patch does. But perhaps this patch is simpler to reason about. Thoughts?

+static inline bool sk_peek_at_offset(struct sock *sk)
+{
+   return READ_ONCE(sk->sk_peek_off) >= 0;
+}
+
 static inline int sk_peek_offset(struct sock *sk, int flags)
 {
if (unlikely(flags & MSG_PEEK)) {
diff --git a/net/core/datagram.c b/net/core/datagram.c
index ee5647bd91b3..30b53932af73 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -175,12 +175,14 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
*last = queue->prev;
skb_queue_walk(queue, skb) {
if (flags & MSG_PEEK) {
-   if (_off >= skb->len && (skb->len || _off ||
-skb->peeked)) {
+   if (_off >= skb->len && sk_peek_at_offset(sk) &&
+   (skb->len || _off || skb->peeked)) {
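
The ambiguity described above can be shown with a small userspace model. This is only a sketch: the `ex_` names and the one-field struct are hypothetical stand-ins, not kernel code, and the model assumes sk_peek_offset() returns the stored offset only when MSG_PEEK is set and the offset is non-negative.

```c
#include <stdbool.h>

/* Same value as MSG_PEEK on Linux; defined locally so the sketch stands alone. */
#define EX_MSG_PEEK 0x2

/* Hypothetical stand-in for the single struct sock field that matters here. */
struct ex_sock {
	int sk_peek_off;	/* negative: peeking at an offset is disabled */
};

/* Models sk_peek_offset(): a result of 0 means "offset 0" OR "disabled". */
static int ex_peek_offset(const struct ex_sock *sk, int flags)
{
	if ((flags & EX_MSG_PEEK) && sk->sk_peek_off >= 0)
		return sk->sk_peek_off;
	return 0;
}

/* Models the proposed sk_peek_at_offset() helper, which tells the two apart. */
static bool ex_peek_at_offset(const struct ex_sock *sk)
{
	return sk->sk_peek_off >= 0;
}
```

With sk_peek_off set to 0 and to -1, ex_peek_offset() returns 0 in both cases, while ex_peek_at_offset() returns true only for the first. That distinction is what the extra check in the MSG_PEEK branch relies on.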


Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-14 Thread Eric Dumazet
On Mon, 2017-08-14 at 18:07 -0700, Eric Dumazet wrote:

> Or try to hack the IFF_XMIT_DST_RELEASE flag on the vlan netdev.

Something like :

diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index 5e831de3103e2f7092c7fa15534def403bc62fb4..9472de846d5c0960996261cb2843032847fa4bf7 100644
--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -143,6 +143,7 @@ static int vlan_newlink(struct net *src_net, struct net_device *dev,
vlan->vlan_proto = proto;
vlan->vlan_id= nla_get_u16(data[IFLA_VLAN_ID]);
vlan->real_dev   = real_dev;
+   dev->priv_flags |= (real_dev->priv_flags & IFF_XMIT_DST_RELEASE);
vlan->flags  = VLAN_FLAG_REORDER_HDR;
 
err = vlan_check_real_dev(real_dev, vlan->vlan_proto, vlan->vlan_id);
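
The effect of the one-line hack above can be modeled outside the kernel. This is a rough sketch under stated assumptions: the `ex_` names, the struct, and the bit positions are hypothetical, and only the flag arithmetic mirrors the diff.

```c
/* Hypothetical bit positions, chosen only for this sketch. */
#define EX_IFF_XMIT_DST_RELEASE	(1u << 0)
#define EX_IFF_OTHER		(1u << 1)

/* Hypothetical stand-in for the priv_flags field of struct net_device. */
struct ex_netdev {
	unsigned int priv_flags;
};

/* Copy only the DST_RELEASE bit from the real device to the vlan device. */
static void ex_vlan_inherit_dst_release(struct ex_netdev *vlan,
					const struct ex_netdev *real)
{
	vlan->priv_flags |= real->priv_flags & EX_IFF_XMIT_DST_RELEASE;
}

/* Nonzero when the transmit path would take the skb_dst_drop() branch. */
static int ex_xmit_drops_dst(const struct ex_netdev *dev)
{
	return (dev->priv_flags & EX_IFF_XMIT_DST_RELEASE) != 0;
}
```

Masking with the single flag means no other private flags of the real device leak into the vlan device, and a real device without the flag leaves the vlan device on the skb_dst_force() path.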





Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-14 Thread Eric Dumazet
On Tue, 2017-08-15 at 02:45 +0200, Paweł Staszewski wrote:
> 
> On 2017-08-14 at 18:57, Paolo Abeni wrote:
> > On Mon, 2017-08-14 at 18:19 +0200, Jesper Dangaard Brouer wrote:
> >> The output (extracted below) didn't show who called 'do_raw_spin_lock',
> >> BUT it showed another interesting thing.  The kernel code
> >> __dev_queue_xmit() might create a route dst-cache problem for itself(?),
> >> as it will first call skb_dst_force() and then skb_dst_drop() when the
> >> packet is transmitted on a VLAN.
> >>
> >>   static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
> >>   {
> >>   [...]
> >>/* If device/qdisc don't need skb->dst, release it right now while
> >> * its hot in this cpu cache.
> >> */
> >>if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
> >>skb_dst_drop(skb);
> >>else
> >>skb_dst_force(skb);
> > I think that the high impact of the above code in this specific test is
> > mostly due to the following:
> >
> > - ingress packets with different RSS rx hash lands on different CPUs
> yes but isn't this normal ?
> everybody that wants to balance load over cores will try to use as many
> as possible :)
> With some limitations  ... best are 6 to 7 RSS queues - so need to use 6 
> to 7 cpu cores
> 
> > - but they use the same dst entry, since the destination IPs belong to
> > the same subnet
> typical for ddos - many sources one destination

Nobody hit this issue yet.

We usually change the kernel, given typical workloads.

In this case, we might need per cpu nh_rth_input

Or try to hack the IFF_XMIT_DST_RELEASE flag on the vlan netdev.





Re: [PATCH net-next v2] openvswitch: enable NSH support

2017-08-14 Thread Yang, Yi
On Tue, Aug 15, 2017 at 12:09:14AM +0800, Eric Garver wrote:
> On Thu, Aug 10, 2017 at 09:21:15PM +0800, Yi Yang wrote:
> 
> Hi Yi,
> 
> In general I'd like to echo Jiri's comments on the netlink attributes.
> I'd like to see the metadata separate.
> 
> I have a few other comments below.
> 
> Thanks.
> Eric.
> 
> [..]

Thanks Eric, I'm doing this and it is almost done, there is still an
issue to fix.

> > +{
> > +   return 4 * (ntohs(nsh->ver_flags_len) & NSH_LEN_MASK) >> NSH_LEN_SHIFT;
> 
> This is doing the multiplication before the shift. It works only because
> the shift is 0.
> 

Thank you for catching this, the right one:

((ntohs(nsh->ver_flags_len) & NSH_LEN_MASK) >> NSH_LEN_SHIFT) << 2

> > +static int push_nsh(struct sk_buff *skb, struct sw_flow_key *key,
> > +   const struct ovs_action_push_nsh *oapn)
> > +{
> > +   struct nsh_hdr *nsh;
> > +   size_t length = NSH_BASE_HDR_LEN + oapn->mdlen;
> > +   u8 next_proto;
> > +
> > +   if (key->mac_proto == MAC_PROTO_ETHERNET) {
> > +   next_proto = NSH_P_ETHERNET;
> > +   } else {
> > +   switch (ntohs(skb->protocol)) {
> > +   case ETH_P_IP:
> > +   next_proto = NSH_P_IPV4;
> > +   break;
> > +   case ETH_P_IPV6:
> > +   next_proto = NSH_P_IPV6;
> > +   break;
> > +   case ETH_P_NSH:
> > +   next_proto = NSH_P_NSH;
> > +   break;
> > +   default:
> > +   return -ENOTSUPP;
> > +   }
> > +   }
> > +
> >
> 
> I believe you need to validate that oapn->mdlen is a multiple of 4.
>

I'll add this check.

> > +   switch (nsh->md_type) {
> > +   case NSH_M_TYPE1:
> > +   nsh->md1 = *(struct nsh_md1_ctx *)oapn->metadata;
> > +   break;
> > +   case NSH_M_TYPE2: {
> > +   /* The MD2 metadata in oapn is already padded to 4 bytes. */
> > +   size_t len = DIV_ROUND_UP(oapn->mdlen, 4) * 4;
> > +
> > +   memcpy(nsh->md2, oapn->metadata, len);
> 
> I don't see any validation of oapn->mdlen. Normally this happens in
> __ovs_nla_copy_actions(). It will be made easier if you add a separate
> MD attribute as Jiri has suggested.
>

Got it, will include this in next version.
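
As a minimal sketch of the two length checks discussed in this review (the 4-byte alignment of mdlen and an upper bound on it), assuming a hypothetical limit: EX_MAX_MDLEN and the `ex_` helpers are illustrative names and are not taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical upper bound, chosen only for this sketch. */
#define EX_MAX_MDLEN 248

/* True when mdlen is within the bound and a multiple of 4 bytes. */
static bool ex_mdlen_is_valid(size_t mdlen)
{
	return mdlen <= EX_MAX_MDLEN && (mdlen & 3) == 0;
}

/* Round n up to the next multiple of 4, as DIV_ROUND_UP(n, 4) * 4 does. */
static size_t ex_pad4(size_t n)
{
	return (n + 3) / 4 * 4;
}
```

Rejecting unaligned lengths up front makes the rounding a no-op for accepted input: ex_pad4() changes nothing when ex_mdlen_is_valid() has already passed, so the copy length can never exceed what the sender supplied.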


Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-14 Thread Paweł Staszewski



On 2017-08-14 at 18:57, Paolo Abeni wrote:

On Mon, 2017-08-14 at 18:19 +0200, Jesper Dangaard Brouer wrote:

The output (extracted below) didn't show who called 'do_raw_spin_lock',
BUT it showed another interesting thing.  The kernel code
__dev_queue_xmit() might create a route dst-cache problem for itself(?),
as it will first call skb_dst_force() and then skb_dst_drop() when the
packet is transmitted on a VLAN.

  static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
  {
  [...]
/* If device/qdisc don't need skb->dst, release it right now while
 * its hot in this cpu cache.
 */
if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
skb_dst_drop(skb);
else
skb_dst_force(skb);

I think that the high impact of the above code in this specific test is
mostly due to the following:

- ingress packets with different RSS rx hash lands on different CPUs

yes but isn't this normal ?
everybody that wants to balance load over cores will try to use as many
as possible :)
With some limitations  ... best are 6 to 7 RSS queues - so need to use 6 
to 7 cpu cores



- but they use the same dst entry, since the destination IPs belong to
the same subnet

typical for ddos - many sources one destination



- the dst refcnt cacheline is contended between all the CPUs

Perhaps we can improve the situation by setting the IFF_XMIT_DST_RELEASE
flag for vlan if the underlying device does not have a (relevant)
classifier attached? (and clearing it as needed)

Paolo





[PATCH net] tcp: fix possible deadlock in TCP stack vs BPF filter

2017-08-14 Thread Eric Dumazet
From: Eric Dumazet 

Filtering the ACK packet was not put at the right place.

At this place, we already allocated a child and put it
into accept queue.

We absolutely need to call tcp_child_process() to release
its spinlock, or we will deadlock at accept() or close() time.

Found by syzkaller team (Thanks a lot !)

Fixes: 8fac365f63c8 ("tcp: Add a tcp_filter hook before handle ack packet")
Signed-off-by: Eric Dumazet 
Reported-by: Dmitry Vyukov 
Cc: Chenbo Feng 
---
 net/ipv4/tcp_ipv4.c |4 ++--
 net/ipv6/tcp_ipv6.c |4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a20e7f03d5f7..e9252c7df809 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1722,6 +1722,8 @@ int tcp_v4_rcv(struct sk_buff *skb)
 */
sock_hold(sk);
refcounted = true;
+   if (tcp_filter(sk, skb))
+   goto discard_and_relse;
nsk = tcp_check_req(sk, skb, req, false);
if (!nsk) {
reqsk_put(req);
@@ -1729,8 +1731,6 @@ int tcp_v4_rcv(struct sk_buff *skb)
}
if (nsk == sk) {
reqsk_put(req);
-   } else if (tcp_filter(sk, skb)) {
-   goto discard_and_relse;
} else if (tcp_child_process(sk, nsk, skb)) {
tcp_v4_send_reset(nsk, skb);
goto discard_and_relse;
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 2521690d62d6..206210125fd7 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1456,6 +1456,8 @@ static int tcp_v6_rcv(struct sk_buff *skb)
}
sock_hold(sk);
refcounted = true;
+   if (tcp_filter(sk, skb))
+   goto discard_and_relse;
nsk = tcp_check_req(sk, skb, req, false);
if (!nsk) {
reqsk_put(req);
@@ -1464,8 +1466,6 @@ static int tcp_v6_rcv(struct sk_buff *skb)
if (nsk == sk) {
reqsk_put(req);
tcp_v6_restore_cb(skb);
-   } else if (tcp_filter(sk, skb)) {
-   goto discard_and_relse;
} else if (tcp_child_process(sk, nsk, skb)) {
tcp_v6_send_reset(nsk, skb);
goto discard_and_relse;




Re: Kernel 4.13.0-rc4-next-20170811 - IP Routing / Forwarding performance vs Core/RSS number / HT on

2017-08-14 Thread Paweł Staszewski



On 2017-08-14 at 18:19, Jesper Dangaard Brouer wrote:

On Sun, 13 Aug 2017 18:58:58 +0200 Paweł Staszewski wrote:


To show the difference, below is a comparison of vlan/no-vlan traffic

10Mpps forwarded traffic with no-vlan vs 6.9Mpps with vlan

I'm trying to reproduce in my testlab (with ixgbe).  I do see, a
performance reduction of about 10-19% when I forward out a VLAN
interface.  This is larger than I expected, but still lower than what
you reported 30-40% slowdown.

[...]

OK, the Mellanox card arrived (MT27700, mlx5 driver).
Comparing Mellanox with and without vlans: 33% performance
degradation (less than with ixgbe, where I reach ~40% with the same settings)


Mellanox without TX traffic on vlan:
ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;11089305;709715520;8871553;567779392
1;16;64;11096292;710162688;11095566;710116224
2;16;64;11095770;710129280;11096799;710195136
3;16;64;11097199;710220736;11097702;710252928
4;16;64;11080984;567081856;11079662;709098368
5;16;64;11077696;708972544;11077039;708930496
6;16;64;11082991;709311424;8864802;567347328
7;16;64;11089596;709734144;8870927;709789184
8;16;64;11094043;710018752;11095391;710105024

Mellanox with TX traffic on vlan:
ID;CPU_CORES / RSS QUEUES;PKT_SIZE;PPS_RX;BPS_RX;PPS_TX;BPS_TX
0;16;64;7369914;471674496;7370281;471697980
1;16;64;7368896;471609408;7368043;471554752
2;16;64;7367577;471524864;7367759;471536576
3;16;64;7368744;377305344;7369391;471641024
4;16;64;7366824;471476736;7364330;471237120
5;16;64;7368352;471574528;7367239;471503296
6;16;64;7367459;471517376;7367806;471539584
7;16;64;7367190;471500160;7367988;471551232
8;16;64;7368023;471553472;7368076;471556864
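
The 33% figure can be checked against the two tables. The helper below is an illustrative sketch (ex_drop_percent is a hypothetical name) that computes the relative drop between a baseline and a reduced packet rate.

```c
/* Relative drop, in percent, from pps_base down to pps_vlan. */
static double ex_drop_percent(double pps_base, double pps_vlan)
{
	return (1.0 - pps_vlan / pps_base) * 100.0;
}
```

Taking the PPS_RX values of row ID 1 from each table, ex_drop_percent(11096292.0, 7368896.0) comes out at roughly 33.6%, which matches the quoted 33% degradation.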



ethtool settings for both tests:
ifc='enp175s0f0 enp175s0f1'
for i in $ifc
do
ip link set up dev $i
ethtool -A $i autoneg off rx off tx off
ethtool -G $i rx 128 tx 256
ip link set $i txqueuelen 1000
ethtool -C $i rx-usecs 25
ethtool -L $i combined 16
ethtool -K $i gro off tso off gso off sg on l2-fwd-offload off tx-nocache-copy off ntuple on

ethtool -N $i rx-flow-hash udp4 sdfn
done

and perf top:
   PerfTop:   83650 irqs/sec  kernel:99.7%  exact:  0.0% [4000Hz cycles],  (all, 56 CPUs)

---

14.25%  [kernel]   [k] dst_release
14.17%  [kernel]   [k] skb_dst_force
13.41%  [kernel]   [k] rt_cache_valid
11.47%  [kernel]   [k] ip_finish_output2
 7.01%  [kernel]   [k] do_raw_spin_lock
 5.07%  [kernel]   [k] page_frag_free
 3.47%  [mlx5_core][k] mlx5e_xmit
 2.88%  [kernel]   [k] fib_table_lookup
 2.43%  [mlx5_core][k] skb_from_cqe.isra.32
 1.97%  [kernel]   [k] virt_to_head_page
 1.81%  [mlx5_core][k] mlx5e_poll_tx_cq
 0.93%  [kernel]   [k] __dev_queue_xmit
 0.87%  [kernel]   [k] __build_skb
 0.84%  [kernel]   [k] ipt_do_table
 0.79%  [kernel]   [k] ip_rcv
 0.79%  [kernel]   [k] acpi_processor_ffh_cstate_enter
 0.78%  [kernel]   [k] netif_skb_features
 0.73%  [kernel]   [k] __netif_receive_skb_core
 0.52%  [kernel]   [k] dev_hard_start_xmit
 0.52%  [kernel]   [k] build_skb
 0.51%  [kernel]   [k] ip_route_input_rcu
 0.50%  [kernel]   [k] skb_unref
 0.49%  [kernel]   [k] ip_forward
 0.48%  [mlx5_core][k] mlx5_cqwq_get_cqe
 0.44%  [kernel]   [k] udp_v4_early_demux
 0.41%  [kernel]   [k] napi_consume_skb
 0.40%  [kernel]   [k] __local_bh_enable_ip
 0.39%  [kernel]   [k] ip_rcv_finish
 0.39%  [kernel]   [k] kmem_cache_alloc
 0.38%  [kernel]   [k] sch_direct_xmit
 0.33%  [kernel]   [k] validate_xmit_skb
 0.32%  [mlx5_core][k] mlx5e_free_rx_wqe_reuse
 0.29%  [kernel]   [k] netdev_pick_tx
 0.28%  [mlx5_core][k] mlx5e_build_rx_skb
 0.27%  [kernel]   [k] deliver_ptype_list_skb
 0.26%  [kernel]   [k] fib_validate_source
 0.26%  [mlx5_core][k] mlx5e_napi_poll
 0.26%  [mlx5_core][k] mlx5e_handle_rx_cqe
 0.26%  [mlx5_core][k] mlx5e_rx_cache_get
 0.25%  [kernel]   [k] eth_header
 0.23%  [kernel]   [k] skb_network_protocol
 0.20%  [kernel]   [k] nf_hook_slow
 0.20%  [kernel]   [k] vlan_passthru_hard_header
 0.20%  [kernel]   [k] vlan_dev_hard_start_xmit
 0.19%  [kernel]   [k] swiotlb_map_page
 0.18%  [kernel]   [k] compound_head
 0.18%  [kernel]   [k] neigh_connected_output
 0.18%  [mlx5_core][k] mlx5e_alloc_rx_wqe
 0.18%  [kernel]   [k] ip_output
 0.17%  [kernel]   [k] prefetch_freepointer.isra.70
 0.17%  [kernel]   [k] __slab_free
 0.16%  [kernel]   [k] eth_type_vlan
 

Re: [PATCH net-next 1/1 v3] drivers: net: rmnet: Initial implementation

2017-08-14 Thread Subash Abhinov Kasiviswanathan

+ */
+void rmnet_egress_handler(struct sk_buff *skb,
+ struct rmnet_logical_ep_conf_s *ep)
+{
+   struct rmnet_phys_ep_conf_s *config;
+   struct net_device *orig_dev;
+   int rc;
+
+   orig_dev = skb->dev;
+   skb->dev = ep->egress_dev;
+
+   config = (struct rmnet_phys_ep_conf_s *)
+   rcu_dereference(skb->dev->rx_handler_data);


This is certainly a misuse of dev->rx_handler_data. Dev private or a
function arg to carry the pointer around.



Hi Jiri

Sorry for the delay in posting a new series.
I have an additional query regarding this comment.

This dev (from skb->dev->rx_handler_data) corresponds to the real_dev to which
the rmnet devices are attached. I had earlier set up an rx_handler on this
real_dev netdevice in rmnet_associate_network_device(). Would it still be
incorrect to use rx_handler_data of real_dev to hold rmnet-specific config
information?

Bridge is similarly storing the bridge information on the real_dev
rx_handler_data and retrieving it through br_port_get_rcu(). I am using that
as a reference.

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

Re: [PATCH net] af_key: do not use GFP_KERNEL in atomic contexts

2017-08-14 Thread David Ahern
On 8/14/17 11:16 AM, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> pfkey_broadcast() might be called from non process contexts,
> we can not use GFP_KERNEL in these cases [1].
> 
> This patch partially reverts commit ba51b6be38c1 ("net: Fix RCU splat in
> af_key"), only keeping the GFP_ATOMIC forcing under rcu_read_lock()
> section.
> 
> [1] : syzkaller reported :
...
> Fixes: ba51b6be38c1 ("net: Fix RCU splat in af_key")
> Signed-off-by: Eric Dumazet 
> Reported-by: Dmitry Vyukov 
> Cc: David Ahern 
> ---
>  net/key/af_key.c |   48 -
>  1 file changed, 26 insertions(+), 22 deletions(-)
> 

Thanks for the fix, Eric.

Acked-by: David Ahern 


Re: [PATCH v2] sctp: fully initialize the IPv6 address in sctp_v6_to_addr()

2017-08-14 Thread Marcelo Ricardo Leitner
On Mon, Aug 14, 2017 at 08:43:04PM +0200, Alexander Potapenko wrote:
> KMSAN reported use of uninitialized sctp_addr->v4.sin_addr.s_addr and
> sctp_addr->v6.sin6_scope_id in sctp_v6_cmp_addr() (see below).
> Make sure all fields of an IPv6 address are initialized, which
> guarantees that the IPv4 fields are also initialized.
> 
> ==
>  BUG: KMSAN: use of uninitialized memory in sctp_v6_cmp_addr+0x8d4/0x9f0
>  net/sctp/ipv6.c:517
>  CPU: 2 PID: 31056 Comm: syz-executor1 Not tainted 4.11.0-rc5+ #2944
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
>  01/01/2011
>  Call Trace:
>   dump_stack+0x172/0x1c0 lib/dump_stack.c:42
>   is_logbuf_locked mm/kmsan/kmsan.c:59 [inline]
>   kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:938
>   native_save_fl arch/x86/include/asm/irqflags.h:18 [inline]
>   arch_local_save_flags arch/x86/include/asm/irqflags.h:72 [inline]
>   arch_local_irq_save arch/x86/include/asm/irqflags.h:113 [inline]
>   __msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:467
>   sctp_v6_cmp_addr+0x8d4/0x9f0 net/sctp/ipv6.c:517
>   sctp_v6_get_dst+0x8c7/0x1630 net/sctp/ipv6.c:290
>   sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
>   sctp_assoc_add_peer+0x66d/0x16f0 net/sctp/associola.c:651
>   sctp_sendmsg+0x35a5/0x4f90 net/sctp/socket.c:1871
>   inet_sendmsg+0x498/0x670 net/ipv4/af_inet.c:762
>   sock_sendmsg_nosec net/socket.c:633 [inline]
>   sock_sendmsg net/socket.c:643 [inline]
>   SYSC_sendto+0x608/0x710 net/socket.c:1696
>   SyS_sendto+0x8a/0xb0 net/socket.c:1664
>   entry_SYSCALL_64_fastpath+0x13/0x94
>  RIP: 0033:0x44b479
>  RSP: 002b:7f6213f21c08 EFLAGS: 0286 ORIG_RAX: 002c
>  RAX: ffda RBX: 2000 RCX: 0044b479
>  RDX: 0041 RSI: 20edd000 RDI: 0006
>  RBP: 007080a8 R08: 20b85fe4 R09: 001c
>  R10: 00040005 R11: 0286 R12: 
>  R13: 3760 R14: 006e5820 R15: 00ff8000
>  origin description: dst_saddr@sctp_v6_get_dst
>  local variable created at:
>   sk_fullsock include/net/sock.h:2321 [inline]
>   inet6_sk include/linux/ipv6.h:309 [inline]
>   sctp_v6_get_dst+0x91/0x1630 net/sctp/ipv6.c:241
>   sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
> ==
> 
> Signed-off-by: Alexander Potapenko 
> Reviewed-by: Xin Long 

Acked-by: Marcelo Ricardo Leitner 

> ---
> v2 is identical to v1, resending per request by Marcelo Ricardo Leitner.
> ---
>  net/sctp/ipv6.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index 2a186b201ad2..a15d691829c6 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -513,6 +513,8 @@ static void sctp_v6_to_addr(union sctp_addr 

Re: [PATCH net-next 01/11] net: dsa: legacy: assign dst->applied

2017-08-14 Thread Andrew Lunn
On Mon, Aug 14, 2017 at 06:22:32PM -0400, Vivien Didelot wrote:
> The "applied" boolean of the dsa_switch_tree is only set in the new
> bindings. This patch sets it in the legacy code as well.

What is missing here is: Why? 

I see the next patch does a WARN_ON(!applied). Why have a WARN_ON?

  Andrew


Re: [PATCH net-next 03/11] net: dsa: debugfs: add tree

2017-08-14 Thread Andrew Lunn
On Mon, Aug 14, 2017 at 06:22:34PM -0400, Vivien Didelot wrote:
> This commit adds the boilerplate to create a DSA-related debug
> filesystem entry as well as a "tree" file, containing the tree index.
> 
> # cat switch1/tree
> 0
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH net-next 11/11] net: dsa: debugfs: add port vlan

2017-08-14 Thread Andrew Lunn
On Mon, Aug 14, 2017 at 06:22:42PM -0400, Vivien Didelot wrote:
> Add a debug filesystem "vlan" entry to query a port's hardware VLAN
> entries through the .port_vlan_dump switch operation.
> 
> This is really convenient to query directly the hardware or inspect DSA
> or CPU links, since these ports are not exposed to userspace.
> 
> Here are the VLAN entries for a CPU port:
> 
> # cat port5/vlan
> vid 1
> vid 42  pvid
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH net-next 10/11] net: dsa: restore VLAN dump

2017-08-14 Thread Andrew Lunn
On Mon, Aug 14, 2017 at 06:22:41PM -0400, Vivien Didelot wrote:
> This commit defines a dsa_vlan_dump_cb_t callback, similar to the FDB
> dump callback and partly reverts commit a0b6b8c9fa3c ("net: dsa: Remove
> support for vlan dump from DSA's drivers") to restore the DSA drivers
> VLAN dump operations.
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH net-next 09/11] net: dsa: debugfs: add port mdb

2017-08-14 Thread Andrew Lunn
On Mon, Aug 14, 2017 at 06:22:40PM -0400, Vivien Didelot wrote:
> Add a debug filesystem "mdb" entry to query a port's hardware MDB
> entries through the .port_mdb_dump switch operation.
> 
> This is really convenient to query directly the hardware or inspect DSA
> or CPU links, since these ports are not exposed to userspace.
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH net-next 08/11] net: dsa: restore mdb dump

2017-08-14 Thread Andrew Lunn
On Mon, Aug 14, 2017 at 06:22:39PM -0400, Vivien Didelot wrote:
> The same dsa_fdb_dump_cb_t callback is used since there is no
> distinction to do between FDB and MDB entries at this layer.
> 
> Implement mv88e6xxx_port_mdb_dump so that multicast addresses associated
> to a switch port can be dumped.
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH net-next 07/11] net: dsa: debugfs: add port fdb

2017-08-14 Thread Andrew Lunn
On Mon, Aug 14, 2017 at 06:22:38PM -0400, Vivien Didelot wrote:
> Add a debug filesystem "fdb" entry to query a port's hardware FDB
> entries through the .port_fdb_dump switch operation.
> 
> This is really convenient to query directly the hardware or inspect DSA
> or CPU links, since these ports are not exposed to userspace.
> 
> # cat port1/fdb
> vid 0  12:34:56:78:90:ab  static  unicast
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH net-next 06/11] net: dsa: debugfs: add port registers

2017-08-14 Thread Andrew Lunn
On Mon, Aug 14, 2017 at 06:22:37PM -0400, Vivien Didelot wrote:
> Add a debug filesystem "regs" entry to query a port's hardware registers
> through the .get_regs_len and .get_regs switch operations.
> 
> This is very convenient because it allows one to dump the registers of
> DSA links, which are not exposed to userspace.
> 
> Here are the registers of a zii-rev-b CPU and DSA ports:
> 
> # pr -mt switch0/port{5,6}/regs
>  0: 4e07   0: 4d04
>  1: 403e   1: 003d
>  2:    2: 
>  3: 3521   3: 3521
>  4: 0533   4: 373f
>  5: 8000   5: 
>  6: 005f   6: 003f
>  7: 002a   7: 002a
>  8: 2080   8: 2080
>  9: 0001   9: 0001
> 10:   10: 
> 11: 0020  11: 
> 12:   12: 
> 13:   13: 
> 14:   14: 
> 15: 9100  15: dada
> 16:   16: 
> 17:   17: 
> 18:   18: 
> 19:   19: 00d8
> 20:   20: 
> 21:   21: 
> 22: 0022  22: 
> 23:   23: 
> 24: 3210  24: 3210
> 25: 7654  25: 7654
> 26:   26: 
> 27: 8000  27: 8000
> 28:   28: 
> 29:   29: 
> 30:   30: 
> 31:   31: 
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH net-next 05/11] net: dsa: debugfs: add port stats

2017-08-14 Thread Andrew Lunn
On Mon, Aug 14, 2017 at 06:22:36PM -0400, Vivien Didelot wrote:
> Add a debug filesystem "stats" entry to query a port's hardware
> statistics through the DSA switch .get_sset_count, .get_strings and
> .get_ethtool_stats operations.
> 
> This allows one to get statistics about DSA links interconnecting
> switches, which is very convenient because this kind of port is not
> exposed to userspace.
> 
> Here are the stats of a zii-rev-b DSA and CPU ports:
> 
> # pr -mt switch0/port{5,6}/stats
> in_good_octets  : 0 in_good_octets  : 13824
> in_bad_octets   : 0 in_bad_octets   : 0
> in_unicast  : 0 in_unicast  : 0
> in_broadcasts   : 0 in_broadcasts   : 216
> in_multicasts   : 0 in_multicasts   : 0
> in_pause: 0 in_pause: 0
> in_undersize: 0 in_undersize: 0
> in_fragments: 0 in_fragments: 0
> in_oversize : 0 in_oversize : 0
> in_jabber   : 0 in_jabber   : 0
> in_rx_error : 0 in_rx_error : 0
> in_fcs_error: 0 in_fcs_error: 0
> out_octets  : 9216  out_octets  : 0
> out_unicast : 0 out_unicast : 0
> out_broadcasts  : 144   out_broadcasts  : 0
> out_multicasts  : 0 out_multicasts  : 0
> out_pause   : 0 out_pause   : 0
> excessive   : 0 excessive   : 0
> collisions  : 0 collisions  : 0
> deferred: 0 deferred: 0
> single  : 0 single  : 0
> multiple: 0 multiple: 0
> out_fcs_error   : 0 out_fcs_error   : 0
> late: 0 late: 0
> hist_64bytes: 0 hist_64bytes: 0
> hist_65_127bytes: 0 hist_65_127bytes: 0
> hist_128_255bytes   : 0 hist_128_255bytes   : 0
> hist_256_511bytes   : 0 hist_256_511bytes   : 0
> hist_512_1023bytes  : 0 hist_512_1023bytes  : 0
> hist_1024_max_bytes : 0 hist_1024_max_bytes : 0
> sw_in_discards  : 0 sw_in_discards  : 0
> sw_in_filtered  : 0 sw_in_filtered  : 0
> sw_out_filtered : 0 sw_out_filtered : 216
> 
> Signed-off-by: Vivien Didelot 

Reviewed-by: Andrew Lunn 

Andrew


Re: [PATCH net-next 04/11] net: dsa: debugfs: add tag_protocol

2017-08-14 Thread Andrew Lunn
On Mon, Aug 14, 2017 at 06:22:35PM -0400, Vivien Didelot wrote:
> Add a debug filesystem "tag_protocol" entry to query the switch tagging
> protocol through the .get_tag_protocol operation.
> 
> # cat switch1/tag_protocol
> EDSA
> 
> Signed-off-by: Vivien Didelot 
> ---
>  net/dsa/debugfs.c | 54 ++
>  1 file changed, 54 insertions(+)
> 
> diff --git a/net/dsa/debugfs.c b/net/dsa/debugfs.c
> index 5607efdb924d..30a732e86161 100644
> --- a/net/dsa/debugfs.c
> +++ b/net/dsa/debugfs.c
> @@ -109,6 +109,55 @@ static int dsa_debugfs_create_file(struct dsa_switch 
> *ds, struct dentry *dir,
>   return 0;
>  }
>  
> +static int dsa_debugfs_tag_protocol_read(struct dsa_switch *ds, int id,
> +  struct seq_file *seq)
> +{
> + enum dsa_tag_protocol proto;
> +
> + if (!ds->ops->get_tag_protocol)
> + return -EOPNOTSUPP;
> +
> + proto = ds->ops->get_tag_protocol(ds);
> +
> + switch (proto) {
> + case DSA_TAG_PROTO_NONE:
> + seq_puts(seq, "NONE\n");
> + break;
> + case DSA_TAG_PROTO_BRCM:
> + seq_puts(seq, "BRCM\n");
> + break;
> + case DSA_TAG_PROTO_DSA:
> + seq_puts(seq, "DSA\n");
> + break;
> + case DSA_TAG_PROTO_EDSA:
> + seq_puts(seq, "EDSA\n");
> + break;
> + case DSA_TAG_PROTO_KSZ:
> + seq_puts(seq, "KSZ\n");
> + break;
> + case DSA_TAG_PROTO_LAN9303:
> + seq_puts(seq, "LAN9303\n");
> + break;
> + case DSA_TAG_PROTO_MTK:
> + seq_puts(seq, "MTK\n");
> + break;
> + case DSA_TAG_PROTO_QCA:
> + seq_puts(seq, "QCA\n");
> + break;
> + case DSA_TAG_PROTO_TRAILER:
> + seq_puts(seq, "TRAILER\n");
> + break;
> + default:
> + return -EINVAL;
> + }

Hi Vivien

Minor nitpick. Rather than returning -EINVAL, how about:

seq_printf(seq, "Unknown - Please fix %s\n", __func__);
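
A minimal user-space sketch of the fallback suggested here, not the kernel code itself. It assumes a cut-down enum (tag_proto, a hypothetical stand-in for enum dsa_tag_protocol) and uses snprintf() in place of the seq_file helpers. A formatted print is used because the message carries a %s argument.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a few values of enum dsa_tag_protocol. */
enum tag_proto { TAG_PROTO_NONE, TAG_PROTO_DSA, TAG_PROTO_EDSA };

/*
 * Write the protocol name into buf. An unrecognized value produces a
 * visible hint instead of an error code. Returns what snprintf() returns.
 */
static int tag_proto_show(char *buf, size_t len, int proto)
{
	assert(buf && len);

	switch (proto) {
	case TAG_PROTO_NONE:
		return snprintf(buf, len, "NONE\n");
	case TAG_PROTO_DSA:
		return snprintf(buf, len, "DSA\n");
	case TAG_PROTO_EDSA:
		return snprintf(buf, len, "EDSA\n");
	default:
		return snprintf(buf, len, "Unknown - Please fix %s\n", __func__);
	}
}
```

With this shape, reading the file always succeeds, and a missing case shows up in the output together with the name of the function that needs updating.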

Reviewed-by: Andrew Lunn 

Andrew


[PATCH net-next 05/11] net: dsa: debugfs: add port stats

2017-08-14 Thread Vivien Didelot
Add a debug filesystem "stats" entry to query a port's hardware
statistics through the DSA switch .get_sset_count, .get_strings and
.get_ethtool_stats operations.

This allows one to get statistics about DSA links interconnecting
switches, which is very convenient because this kind of port is not
exposed to userspace.

Here are the stats of a zii-rev-b DSA and CPU ports:

# pr -mt switch0/port{5,6}/stats
in_good_octets  : 0 in_good_octets  : 13824
in_bad_octets   : 0 in_bad_octets   : 0
in_unicast  : 0 in_unicast  : 0
in_broadcasts   : 0 in_broadcasts   : 216
in_multicasts   : 0 in_multicasts   : 0
in_pause: 0 in_pause: 0
in_undersize: 0 in_undersize: 0
in_fragments: 0 in_fragments: 0
in_oversize : 0 in_oversize : 0
in_jabber   : 0 in_jabber   : 0
in_rx_error : 0 in_rx_error : 0
in_fcs_error: 0 in_fcs_error: 0
out_octets  : 9216  out_octets  : 0
out_unicast : 0 out_unicast : 0
out_broadcasts  : 144   out_broadcasts  : 0
out_multicasts  : 0 out_multicasts  : 0
out_pause   : 0 out_pause   : 0
excessive   : 0 excessive   : 0
collisions  : 0 collisions  : 0
deferred: 0 deferred: 0
single  : 0 single  : 0
multiple: 0 multiple: 0
out_fcs_error   : 0 out_fcs_error   : 0
late: 0 late: 0
hist_64bytes: 0 hist_64bytes: 0
hist_65_127bytes: 0 hist_65_127bytes: 0
hist_128_255bytes   : 0 hist_128_255bytes   : 0
hist_256_511bytes   : 0 hist_256_511bytes   : 0
hist_512_1023bytes  : 0 hist_512_1023bytes  : 0
hist_1024_max_bytes : 0 hist_1024_max_bytes : 0
sw_in_discards  : 0 sw_in_discards  : 0
sw_in_filtered  : 0 sw_in_filtered  : 0
sw_out_filtered : 0 sw_out_filtered : 216

Signed-off-by: Vivien Didelot 
---
 net/dsa/debugfs.c | 43 +++
 1 file changed, 43 insertions(+)

diff --git a/net/dsa/debugfs.c b/net/dsa/debugfs.c
index 30a732e86161..5f91b4423404 100644
--- a/net/dsa/debugfs.c
+++ b/net/dsa/debugfs.c
@@ -109,6 +109,43 @@ static int dsa_debugfs_create_file(struct dsa_switch *ds, 
struct dentry *dir,
return 0;
 }
 
+static void dsa_debugfs_stats_read_count(struct dsa_switch *ds, int id,
+struct seq_file *seq, int count)
+{
+   u8 strings[count * ETH_GSTRING_LEN];
+   u64 stats[count];
+   int i;
+
+   ds->ops->get_strings(ds, id, strings);
+   ds->ops->get_ethtool_stats(ds, id, stats);
+
+   for (i = 0; i < count; i++)
+   seq_printf(seq, "%-20s: %lld\n", strings + i * ETH_GSTRING_LEN,
+  stats[i]);
+}
+
+static int dsa_debugfs_stats_read(struct dsa_switch *ds, int id,
+ struct seq_file *seq)
+{
+   int count;
+
+   if (!ds->ops->get_sset_count || !ds->ops->get_strings ||
+   !ds->ops->get_ethtool_stats)
+   return -EOPNOTSUPP;
+
+   count = ds->ops->get_sset_count(ds);
+   if (count < 0)
+   return count;
+
+   dsa_debugfs_stats_read_count(ds, id, seq, count);
+
+   return 0;
+}
+
+static const struct dsa_debugfs_ops dsa_debugfs_stats_ops = {
+   .read = dsa_debugfs_stats_read,
+};
+
 static int dsa_debugfs_tag_protocol_read(struct dsa_switch *ds, int id,
 struct seq_file *seq)
 {
@@ -174,6 +211,7 @@ static int dsa_debugfs_create_port(struct dsa_switch *ds, 
int port)
 {
struct dentry *dir;
char name[32];
+   int err;
 
snprintf(name, sizeof(name), DSA_PORT_FMT, port);
 
@@ -181,6 +219,11 @@ static int dsa_debugfs_create_port(struct dsa_switch *ds, 
int port)
if (IS_ERR_OR_NULL(dir))
return -EFAULT;
 
+   err = dsa_debugfs_create_file(ds, dir, "stats", port,
+ &dsa_debugfs_stats_ops);
+   if (err)
+   return err;
+
return 0;
 }
 
-- 
2.14.0
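
The per-counter output of the "stats" entry can be illustrated in user space. This is only a sketch under stated assumptions: GSTRING_LEN is a local name assumed to mirror the kernel's ETH_GSTRING_LEN (32), stats_format_line() is a hypothetical helper, and snprintf() stands in for seq_printf(). The counter is printed with %llu here because the value is unsigned.

```c
#include <assert.h>
#include <stdio.h>

#define GSTRING_LEN 32	/* assumed to mirror the kernel's ETH_GSTRING_LEN */

/*
 * Format line i as "name: value". The names live in a flat array of
 * fixed-width, NUL-terminated slots of GSTRING_LEN bytes each, so the
 * i-th name starts at offset i * GSTRING_LEN.
 */
static int stats_format_line(char *buf, size_t len, const char *names,
			     const unsigned long long *values, int i)
{
	assert(buf && names && values);

	return snprintf(buf, len, "%-20s: %llu\n",
			names + (size_t)i * GSTRING_LEN, values[i]);
}
```

The left-justified 20-character field is what lines the colons up in the sample output.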



[PATCH net-next 06/11] net: dsa: debugfs: add port registers

2017-08-14 Thread Vivien Didelot
Add a debug filesystem "regs" entry to query a port's hardware registers
through the .get_regs_len and .get_regs switch operations.

This is very convenient because it allows one to dump the registers of
DSA links, which are not exposed to userspace.

Here are the registers of a zii-rev-b CPU and DSA ports:

# pr -mt switch0/port{5,6}/regs
 0: 4e07    0: 4d04
 1: 403e    1: 003d
 2:         2:
 3: 3521    3: 3521
 4: 0533    4: 373f
 5: 8000    5:
 6: 005f    6: 003f
 7: 002a    7: 002a
 8: 2080    8: 2080
 9: 0001    9: 0001
10:        10:
11: 0020   11:
12:        12:
13:        13:
14:        14:
15: 9100   15: dada
16:        16:
17:        17:
18:        18:
19:        19: 00d8
20:        20:
21:        21:
22: 0022   22:
23:        23:
24: 3210   24: 3210
25: 7654   25: 7654
26:        26:
27: 8000   27: 8000
28:        28:
29:        29:
30:        30:
31:        31:

Signed-off-by: Vivien Didelot 
---
 net/dsa/debugfs.c | 39 +++
 1 file changed, 39 insertions(+)

diff --git a/net/dsa/debugfs.c b/net/dsa/debugfs.c
index 5f91b4423404..012fcf466cc1 100644
--- a/net/dsa/debugfs.c
+++ b/net/dsa/debugfs.c
@@ -109,6 +109,40 @@ static int dsa_debugfs_create_file(struct dsa_switch *ds, 
struct dentry *dir,
return 0;
 }
 
+static void dsa_debugfs_regs_read_count(struct dsa_switch *ds, int id,
+   struct seq_file *seq, int count)
+{
+   u16 data[count * ETH_GSTRING_LEN];
+   struct ethtool_regs regs;
+   int i;
+
+   ds->ops->get_regs(ds, id, &regs, data);
+
+   for (i = 0; i < count / 2; i++)
+   seq_printf(seq, "%2d: %04x\n", i, data[i]);
+}
+
+static int dsa_debugfs_regs_read(struct dsa_switch *ds, int id,
+struct seq_file *seq)
+{
+   int count;
+
+   if (!ds->ops->get_regs_len || !ds->ops->get_regs)
+   return -EOPNOTSUPP;
+
+   count = ds->ops->get_regs_len(ds, id);
+   if (count < 0)
+   return count;
+
+   dsa_debugfs_regs_read_count(ds, id, seq, count);
+
+   return 0;
+}
+
+static const struct dsa_debugfs_ops dsa_debugfs_regs_ops = {
+   .read = dsa_debugfs_regs_read,
+};
+
 static void dsa_debugfs_stats_read_count(struct dsa_switch *ds, int id,
 struct seq_file *seq, int count)
 {
@@ -219,6 +253,11 @@ static int dsa_debugfs_create_port(struct dsa_switch *ds, 
int port)
if (IS_ERR_OR_NULL(dir))
return -EFAULT;
 
+   err = dsa_debugfs_create_file(ds, dir, "regs", port,
+ &dsa_debugfs_regs_ops);
+   if (err)
+   return err;
+
err = dsa_debugfs_create_file(ds, dir, "stats", port,
  &dsa_debugfs_stats_ops);
if (err)
-- 
2.14.0
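
The relationship between the byte length returned by .get_regs_len and the number of 16-bit registers printed can be sketched in user space. The helper names below (regs_count, regs_format_line) are hypothetical, and snprintf() stands in for seq_printf(); this is an illustration under those assumptions, not the patch's code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Number of 16-bit registers contained in a dump of len_bytes bytes. */
static int regs_count(int len_bytes)
{
	return len_bytes / (int)sizeof(uint16_t);
}

/* Format register i using the same "%2d: %04x" layout as the sample dump. */
static int regs_format_line(char *buf, size_t len, const uint16_t *regs, int i)
{
	assert(buf && regs);

	return snprintf(buf, len, "%2d: %04x\n", i, (unsigned int)regs[i]);
}
```

A 64-byte dump therefore yields 32 lines, numbered 0 to 31, which matches the shape of the listing in the commit message.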



[PATCH net-next 01/11] net: dsa: legacy: assign dst->applied

2017-08-14 Thread Vivien Didelot
The "applied" boolean of the dsa_switch_tree is only set in the new
bindings. This patch sets it in the legacy code as well.

Signed-off-by: Vivien Didelot 
---
 net/dsa/legacy.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/dsa/legacy.c b/net/dsa/legacy.c
index 91e6f7981d39..a6a0849483d1 100644
--- a/net/dsa/legacy.c
+++ b/net/dsa/legacy.c
@@ -605,6 +605,7 @@ static int dsa_setup_dst(struct dsa_switch_tree *dst, 
struct net_device *dev,
 */
wmb();
dev->dsa_ptr = dst;
+   dst->applied = true;
 
return 0;
 }
@@ -689,6 +690,8 @@ static void dsa_remove_dst(struct dsa_switch_tree *dst)
dsa_cpu_port_ethtool_restore(dst->cpu_dp);
 
dev_put(dst->cpu_dp->netdev);
+
+   dst->applied = false;
 }
 
 static int dsa_remove(struct platform_device *pdev)
-- 
2.14.0



[PATCH net-next 02/11] net: dsa: add debugfs interface

2017-08-14 Thread Vivien Didelot
This commit adds a DEBUG_FS dependent DSA core file creating a generic
debug filesystem interface for the DSA switch devices.

The interface can be mounted with:

# mount -t debugfs none /sys/kernel/debug

The dsa directory contains one directory per switch chip:

# cd /sys/kernel/debug/dsa/
# ls
switch0  switch1  switch2

Each chip directory contains one directory per port:

# ls -l switch0/
drwxr-xr-x 2 root root 0 Jan  1 00:00 port0
drwxr-xr-x 2 root root 0 Jan  1 00:00 port1
drwxr-xr-x 2 root root 0 Jan  1 00:00 port2
drwxr-xr-x 2 root root 0 Jan  1 00:00 port5
drwxr-xr-x 2 root root 0 Jan  1 00:00 port6

Future patches will add entry files to these directories.
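
The directory layout described above follows from two format strings. As a small user-space sketch, assuming the DSA_SWITCH_FMT and DSA_PORT_FMT values from the patch and a hypothetical helper port_dir_path(), the relative path of a port directory can be built like this:

```c
#include <assert.h>
#include <stdio.h>

#define DSA_SWITCH_FMT "switch%d"
#define DSA_PORT_FMT   "port%d"

/* Build the path of a port directory relative to the dsa debugfs root. */
static int port_dir_path(char *buf, size_t len, int sw, int port)
{
	assert(buf);

	return snprintf(buf, len, DSA_SWITCH_FMT "/" DSA_PORT_FMT, sw, port);
}
```

For switch 0 and port 5 this gives switch0/port5, under /sys/kernel/debug/dsa/ once debugfs is mounted there.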

Signed-off-by: Vivien Didelot 
---
 include/net/dsa.h  |   7 
 net/dsa/Kconfig|  14 +++
 net/dsa/Makefile   |   1 +
 net/dsa/debugfs.c  | 121 +
 net/dsa/dsa.c  |   3 ++
 net/dsa/dsa2.c |   4 ++
 net/dsa/dsa_priv.h |  13 ++
 net/dsa/legacy.c   |   4 ++
 8 files changed, 167 insertions(+)
 create mode 100644 net/dsa/debugfs.c

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 7f46b521313e..4ef5d38755d9 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -212,6 +212,13 @@ struct dsa_switch {
 */
void *priv;
 
+#ifdef CONFIG_NET_DSA_DEBUGFS
+   /*
+* Debugfs interface.
+*/
+   struct dentry *debugfs_dir;
+#endif
+
/*
 * Configuration data for this switch.
 */
diff --git a/net/dsa/Kconfig b/net/dsa/Kconfig
index cc5f8f971689..0f05a1e59dd2 100644
--- a/net/dsa/Kconfig
+++ b/net/dsa/Kconfig
@@ -15,6 +15,20 @@ config NET_DSA
 
 if NET_DSA
 
+config NET_DSA_DEBUGFS
+   bool "Distributed Switch Architecture debugfs interface"
+   depends on DEBUG_FS
+   ---help---
+ Enable creation of debugfs files for the DSA core.
+
+ These debugfs files provide per-switch information, such as the tag
+ protocol in use and ports connectivity. They also allow querying the
+ hardware directly through the switch operations for debugging instead
+ of going through the bridge, switchdev and DSA layers.
+
+ This is also a way to inspect the stats and FDB, MDB or VLAN entries
+ of CPU and DSA links, since they are not exposed to userspace.
+
 # tagging formats
 config NET_DSA_TAG_BRCM
bool
diff --git a/net/dsa/Makefile b/net/dsa/Makefile
index fcce25da937c..7f60c6dfaffb 100644
--- a/net/dsa/Makefile
+++ b/net/dsa/Makefile
@@ -1,6 +1,7 @@
 # the core
 obj-$(CONFIG_NET_DSA) += dsa_core.o
 dsa_core-y += dsa.o dsa2.o legacy.o port.o slave.o switch.o
+dsa_core-$(CONFIG_NET_DSA_DEBUGFS) += debugfs.o
 
 # tagging formats
 dsa_core-$(CONFIG_NET_DSA_TAG_BRCM) += tag_brcm.o
diff --git a/net/dsa/debugfs.c b/net/dsa/debugfs.c
new file mode 100644
index ..68caf5a2c0c3
--- /dev/null
+++ b/net/dsa/debugfs.c
@@ -0,0 +1,121 @@
+/*
+ * net/dsa/debugfs.c - DSA debugfs interface
+ * Copyright (c) 2017 Savoir-faire Linux, Inc.
+ * Vivien Didelot 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/debugfs.h>
+
+#include "dsa_priv.h"
+
+#define DSA_SWITCH_FMT "switch%d"
+#define DSA_PORT_FMT   "port%d"
+
+/* DSA module debugfs directory */
+static struct dentry *dsa_debugfs_dir;
+
+static int dsa_debugfs_create_port(struct dsa_switch *ds, int port)
+{
+   struct dentry *dir;
+   char name[32];
+
+   snprintf(name, sizeof(name), DSA_PORT_FMT, port);
+
+   dir = debugfs_create_dir(name, ds->debugfs_dir);
+   if (IS_ERR_OR_NULL(dir))
+   return -EFAULT;
+
+   return 0;
+}
+
+static int dsa_debugfs_create_switch(struct dsa_switch *ds)
+{
+   char name[32];
+   int i;
+
+   /* skip if there is no debugfs support */
+   if (!dsa_debugfs_dir)
+   return 0;
+
+   snprintf(name, sizeof(name), DSA_SWITCH_FMT, ds->index);
+
+   ds->debugfs_dir = debugfs_create_dir(name, dsa_debugfs_dir);
+   if (IS_ERR_OR_NULL(ds->debugfs_dir))
+   return -EFAULT;
+
+   for (i = 0; i < ds->num_ports; i++) {
+   if (!ds->ports[i].dn)
+   continue;
+
+   err = dsa_debugfs_create_port(ds, i);
+   if (err)
+   return err;
+   }
+
+   return 0;
+}
+
+static void dsa_debugfs_destroy_switch(struct dsa_switch *ds)
+{
+   /* handles NULL */
+   debugfs_remove_recursive(ds->debugfs_dir);
+}
+
+void dsa_debugfs_create_tree(struct dsa_switch_tree *dst)
+{
+   struct dsa_switch *ds;
+   int i, err;
+
+   WARN_ON(!dst->applied);
+
+   for (i = 0; i < DSA_MAX_SWITCHES; i++) {
+   

[PATCH net-next 07/11] net: dsa: debugfs: add port fdb

2017-08-14 Thread Vivien Didelot
Add a debug filesystem "fdb" entry to query a port's hardware FDB
entries through the .port_fdb_dump switch operation.

This is really convenient for querying the hardware directly or inspecting
DSA or CPU links, since these ports are not exposed to userspace.

# cat port1/fdb
vid 012:34:56:78:90:abstaticunicast

Signed-off-by: Vivien Didelot 
---
 net/dsa/debugfs.c | 36 
 1 file changed, 36 insertions(+)

diff --git a/net/dsa/debugfs.c b/net/dsa/debugfs.c
index 012fcf466cc1..8204c62dc9c1 100644
--- a/net/dsa/debugfs.c
+++ b/net/dsa/debugfs.c
@@ -10,6 +10,7 @@
  */
 
 #include <linux/debugfs.h>
+#include <linux/etherdevice.h>
 #include <linux/seq_file.h>
 
 #include "dsa_priv.h"
@@ -109,6 +110,36 @@ static int dsa_debugfs_create_file(struct dsa_switch *ds, 
struct dentry *dir,
return 0;
 }
 
+static int dsa_debugfs_fdb_dump_cb(const unsigned char *addr, u16 vid,
+  bool is_static, void *data)
+{
+   struct seq_file *seq = data;
+   int i;
+
+   seq_printf(seq, "vid %d", vid);
+   for (i = 0; i < ETH_ALEN; i++)
+   seq_printf(seq, "%s%02x", i ? ":" : "", addr[i]);
+   seq_printf(seq, "%s", is_static ? "static" : "dynamic");
+   seq_printf(seq, "%s", is_unicast_ether_addr(addr) ?
+  "unicast" : "multicast");
+   seq_puts(seq, "\n");
+
+   return 0;
+}
+
+static int dsa_debugfs_fdb_read(struct dsa_switch *ds, int id,
+   struct seq_file *seq)
+{
+   if (!ds->ops->port_fdb_dump)
+   return -EOPNOTSUPP;
+
+   return ds->ops->port_fdb_dump(ds, id, dsa_debugfs_fdb_dump_cb, seq);
+}
+
+static const struct dsa_debugfs_ops dsa_debugfs_fdb_ops = {
+   .read = dsa_debugfs_fdb_read,
+};
+
 static void dsa_debugfs_regs_read_count(struct dsa_switch *ds, int id,
struct seq_file *seq, int count)
 {
@@ -253,6 +284,11 @@ static int dsa_debugfs_create_port(struct dsa_switch *ds, 
int port)
if (IS_ERR_OR_NULL(dir))
return -EFAULT;
 
+   err = dsa_debugfs_create_file(ds, dir, "fdb", port,
+ &dsa_debugfs_fdb_ops);
+   if (err)
+   return err;
+
err = dsa_debugfs_create_file(ds, dir, "regs", port,
  &dsa_debugfs_regs_ops);
if (err)
-- 
2.14.0
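
What the FDB dump callback prints per entry can be sketched in user space. Everything below is illustrative: MAC_LEN is a local name with the same value as the kernel's ETH_ALEN, mac_is_multicast() and fdb_format_entry() are hypothetical helpers, snprintf() stands in for seq_printf(), and the single-space field separators are a choice made for this sketch only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define MAC_LEN 6	/* same value as the kernel's ETH_ALEN */

/* An Ethernet address is multicast when the low bit of its first octet is set. */
static bool mac_is_multicast(const unsigned char *addr)
{
	return addr[0] & 0x01;
}

/* Format one entry: VLAN ID, MAC address, static/dynamic, unicast/multicast. */
static int fdb_format_entry(char *buf, size_t len, const unsigned char *addr,
			    unsigned int vid, bool is_static)
{
	assert(buf && addr);

	return snprintf(buf, len, "vid %u %02x:%02x:%02x:%02x:%02x:%02x %s %s\n",
			vid, addr[0], addr[1], addr[2], addr[3], addr[4], addr[5],
			is_static ? "static" : "dynamic",
			mac_is_multicast(addr) ? "multicast" : "unicast");
}
```

The same formatter works for MDB entries, since only the address itself decides whether "unicast" or "multicast" is printed.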



[PATCH net-next 08/11] net: dsa: restore mdb dump

2017-08-14 Thread Vivien Didelot
The same dsa_fdb_dump_cb_t callback is used since there is no
distinction to make between FDB and MDB entries at this layer.

Implement mv88e6xxx_port_mdb_dump so that multicast addresses associated
with a switch port can be dumped.
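
The idea of sharing one database walk between the FDB and MDB dumps can be sketched as two small predicates. This is a user-space illustration only: the state codes are hypothetical placeholders for the driver's ATU state constants, and entry_wanted() and entry_is_static() are made-up names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical placeholder codes, not the driver's real ATU state values. */
enum { STATE_UC_STATIC = 1, STATE_MC_STATIC = 2 };

/* Keep an entry only if its address kind matches the requested dump. */
static bool entry_wanted(const unsigned char *mac, bool mc)
{
	bool is_mc;

	assert(mac);
	is_mc = mac[0] & 0x01;	/* low bit of first octet marks multicast */

	return is_mc == mc;
}

/*
 * An entry is static when its state equals the static code of the
 * matching kind. The parentheses matter: without them, == binds
 * tighter than ?: and the comparison would be against mc itself.
 */
static bool entry_is_static(int state, bool mc)
{
	return state == (mc ? STATE_MC_STATIC : STATE_UC_STATIC);
}
```

An FDB dump would pass mc = false and an MDB dump mc = true, so a single loop serves both.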

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/mv88e6xxx/chip.c | 33 +
 include/net/dsa.h|  3 +++
 2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 918d8f0fe091..2ad32734b6f6 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -1380,7 +1380,7 @@ static int mv88e6xxx_port_fdb_del(struct dsa_switch *ds, 
int port,
 }
 
 static int mv88e6xxx_port_db_dump_fid(struct mv88e6xxx_chip *chip,
- u16 fid, u16 vid, int port,
+ u16 fid, u16 vid, int port, bool mc,
  dsa_fdb_dump_cb_t *cb, void *data)
 {
struct mv88e6xxx_atu_entry addr;
@@ -1401,11 +1401,14 @@ static int mv88e6xxx_port_db_dump_fid(struct 
mv88e6xxx_chip *chip,
if (addr.trunk || (addr.portvec & BIT(port)) == 0)
continue;
 
-   if (!is_unicast_ether_addr(addr.mac))
+   if ((is_unicast_ether_addr(addr.mac) && mc) ||
+   (is_multicast_ether_addr(addr.mac) && !mc))
continue;
 
-   is_static = (addr.state ==
-MV88E6XXX_G1_ATU_DATA_STATE_UC_STATIC);
+   is_static = addr.state == (mc ?
+   MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC :
+   MV88E6XXX_G1_ATU_DATA_STATE_UC_STATIC);
+
err = cb(addr.mac, vid, is_static, data);
if (err)
return err;
@@ -1415,7 +1418,7 @@ static int mv88e6xxx_port_db_dump_fid(struct 
mv88e6xxx_chip *chip,
 }
 
 static int mv88e6xxx_port_db_dump(struct mv88e6xxx_chip *chip, int port,
- dsa_fdb_dump_cb_t *cb, void *data)
+ bool mc, dsa_fdb_dump_cb_t *cb, void *data)
 {
struct mv88e6xxx_vtu_entry vlan = {
.vid = chip->info->max_vid,
@@ -1428,7 +1431,7 @@ static int mv88e6xxx_port_db_dump(struct mv88e6xxx_chip 
*chip, int port,
if (err)
return err;
 
-   err = mv88e6xxx_port_db_dump_fid(chip, fid, 0, port, cb, data);
+   err = mv88e6xxx_port_db_dump_fid(chip, fid, 0, port, mc, cb, data);
if (err)
return err;
 
@@ -1442,7 +1445,7 @@ static int mv88e6xxx_port_db_dump(struct mv88e6xxx_chip 
*chip, int port,
break;
 
err = mv88e6xxx_port_db_dump_fid(chip, vlan.fid, vlan.vid, port,
-cb, data);
+mc, cb, data);
if (err)
return err;
} while (vlan.vid < chip->info->max_vid);
@@ -1457,7 +1460,7 @@ static int mv88e6xxx_port_fdb_dump(struct dsa_switch *ds, 
int port,
int err;
 
mutex_lock(&chip->reg_lock);
-   err = mv88e6xxx_port_db_dump(chip, port, cb, data);
+   err = mv88e6xxx_port_db_dump(chip, port, false, cb, data);
mutex_unlock(&chip->reg_lock);
 
return err;
@@ -3777,6 +3780,19 @@ static int mv88e6xxx_port_mdb_del(struct dsa_switch *ds, 
int port,
return err;
 }
 
+static int mv88e6xxx_port_mdb_dump(struct dsa_switch *ds, int port,
+  dsa_fdb_dump_cb_t *cb, void *data)
+{
+   struct mv88e6xxx_chip *chip = ds->priv;
+   int err;
+
+   mutex_lock(&chip->reg_lock);
+   err = mv88e6xxx_port_db_dump(chip, port, true, cb, data);
+   mutex_unlock(&chip->reg_lock);
+
+   return err;
+}
+
 static const struct dsa_switch_ops mv88e6xxx_switch_ops = {
.probe  = mv88e6xxx_drv_probe,
.get_tag_protocol   = mv88e6xxx_get_tag_protocol,
@@ -3810,6 +3826,7 @@ static const struct dsa_switch_ops mv88e6xxx_switch_ops = 
{
.port_mdb_prepare   = mv88e6xxx_port_mdb_prepare,
.port_mdb_add   = mv88e6xxx_port_mdb_add,
.port_mdb_del   = mv88e6xxx_port_mdb_del,
+   .port_mdb_dump  = mv88e6xxx_port_mdb_dump,
.crosschip_bridge_join  = mv88e6xxx_crosschip_bridge_join,
.crosschip_bridge_leave = mv88e6xxx_crosschip_bridge_leave,
 };
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 4ef5d38755d9..1fd419031948 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -288,6 +288,7 @@ static inline u8 dsa_upstream_port(struct dsa_switch *ds)
return ds->rtable[dst->cpu_dp->ds->index];
 }
 
+/* FDB (and MDB) dump callback */
 typedef int dsa_fdb_dump_cb_t(const unsigned char *addr, u16 vid,
  bool is_static, void *data);
 struct 

[PATCH net-next 10/11] net: dsa: restore VLAN dump

2017-08-14 Thread Vivien Didelot
This commit defines a dsa_vlan_dump_cb_t callback, similar to the FDB
dump callback, and partly reverts commit a0b6b8c9fa3c ("net: dsa: Remove
support for vlan dump from DSA's drivers") to restore the DSA drivers'
VLAN dump operations.
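
The callback-driven dump pattern can be sketched in user space as follows. The callback signature is modelled on how dsa_vlan_dump_cb_t is used in this series; the table layout (struct vlan_entry with membership and untag bitmasks) and the function names are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Callback type modelled on the dsa_vlan_dump_cb_t usage in this series. */
typedef int vlan_dump_cb_t(uint16_t vid, bool pvid, bool untagged, void *data);

/* Hypothetical software VLAN table entry: per-port bitmasks. */
struct vlan_entry {
	uint16_t members;
	uint16_t untag;
};

/*
 * Walk the table and report every VLAN that includes the port.
 * Stops and returns the callback's error code on the first failure.
 */
static int vlan_dump(const struct vlan_entry *tbl, uint16_t num, int port,
		     uint16_t port_pvid, vlan_dump_cb_t *cb, void *data)
{
	uint16_t vid;
	int err = 0;

	assert(tbl && cb);

	for (vid = 0; vid < num; vid++) {
		if (!(tbl[vid].members & (1u << port)))
			continue;

		err = cb(vid, vid == port_pvid,
			 (tbl[vid].untag & (1u << port)) != 0, data);
		if (err)
			break;
	}

	return err;
}

/* Example callback: count the VLANs reported for the port. */
static int count_cb(uint16_t vid, bool pvid, bool untagged, void *data)
{
	(void)vid;
	(void)pvid;
	(void)untagged;
	(*(int *)data)++;

	return 0;
}
```

Passing the consumer as a callback with an opaque data pointer keeps the walk independent of the output format, so the same driver operation can feed a debugfs file or any other consumer.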

Signed-off-by: Vivien Didelot 
---
 drivers/net/dsa/b53/b53_common.c   | 41 
 drivers/net/dsa/b53/b53_priv.h |  2 ++
 drivers/net/dsa/bcm_sf2.c  |  1 +
 drivers/net/dsa/dsa_loop.c | 38 ++
 drivers/net/dsa/microchip/ksz_common.c | 41 
 drivers/net/dsa/mv88e6xxx/chip.c   | 49 ++
 include/net/dsa.h  |  5 
 7 files changed, 177 insertions(+)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 274f3679f33d..be0c5fa8bd9b 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -1053,6 +1053,46 @@ int b53_vlan_del(struct dsa_switch *ds, int port,
 }
 EXPORT_SYMBOL(b53_vlan_del);
 
+int b53_vlan_dump(struct dsa_switch *ds, int port, dsa_vlan_dump_cb_t *cb,
+ void *data)
+{
+   struct b53_device *dev = ds->priv;
+   u16 vid, vid_start = 0, pvid;
+   struct b53_vlan *vl;
+   bool untagged;
+   int err = 0;
+
+   if (is5325(dev) || is5365(dev))
+   vid_start = 1;
+
+   b53_read16(dev, B53_VLAN_PAGE, B53_VLAN_PORT_DEF_TAG(port), &pvid);
+
+   /* Use our software cache for dumps, since we do not have any HW
+* operation returning only the used/valid VLANs
+*/
+   for (vid = vid_start; vid < dev->num_vlans; vid++) {
+   vl = &dev->vlans[vid];
+
+   if (!vl->valid)
+   continue;
+
+   if (!(vl->members & BIT(port)))
+   continue;
+
+   untagged = false;
+
+   if (vl->untag & BIT(port))
+   untagged = true;
+
+   err = cb(vid, pvid == vid, untagged, data);
+   if (err)
+   break;
+   }
+
+   return err;
+}
+EXPORT_SYMBOL(b53_vlan_dump);
+
 /* Address Resolution Logic routines */
 static int b53_arl_op_wait(struct b53_device *dev)
 {
@@ -1503,6 +1543,7 @@ static const struct dsa_switch_ops b53_switch_ops = {
.port_vlan_prepare  = b53_vlan_prepare,
.port_vlan_add  = b53_vlan_add,
.port_vlan_del  = b53_vlan_del,
+   .port_vlan_dump = b53_vlan_dump,
.port_fdb_dump  = b53_fdb_dump,
.port_fdb_add   = b53_fdb_add,
.port_fdb_del   = b53_fdb_del,
diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h
index 01bd8cbe9a3f..2b3e59d80fdb 100644
--- a/drivers/net/dsa/b53/b53_priv.h
+++ b/drivers/net/dsa/b53/b53_priv.h
@@ -393,6 +393,8 @@ void b53_vlan_add(struct dsa_switch *ds, int port,
  struct switchdev_trans *trans);
 int b53_vlan_del(struct dsa_switch *ds, int port,
 const struct switchdev_obj_port_vlan *vlan);
+int b53_vlan_dump(struct dsa_switch *ds, int port, dsa_vlan_dump_cb_t *cb,
+ void *data);
 int b53_fdb_add(struct dsa_switch *ds, int port,
const unsigned char *addr, u16 vid);
 int b53_fdb_del(struct dsa_switch *ds, int port,
diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index bbcb4053e04e..1907b27297c3 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -1021,6 +1021,7 @@ static const struct dsa_switch_ops bcm_sf2_ops = {
.port_vlan_prepare  = b53_vlan_prepare,
.port_vlan_add  = b53_vlan_add,
.port_vlan_del  = b53_vlan_del,
+   .port_vlan_dump = b53_vlan_dump,
.port_fdb_dump  = b53_fdb_dump,
.port_fdb_add   = b53_fdb_add,
.port_fdb_del   = b53_fdb_del,
diff --git a/drivers/net/dsa/dsa_loop.c b/drivers/net/dsa/dsa_loop.c
index 7819a9fe8321..0407533f725f 100644
--- a/drivers/net/dsa/dsa_loop.c
+++ b/drivers/net/dsa/dsa_loop.c
@@ -257,6 +257,43 @@ static int dsa_loop_port_vlan_del(struct dsa_switch *ds, 
int port,
return 0;
 }
 
+static int dsa_loop_port_vlan_dump(struct dsa_switch *ds, int port,
+  dsa_vlan_dump_cb_t *cb, void *data)
+{
+   struct dsa_loop_priv *ps = ds->priv;
+   struct mii_bus *bus = ps->bus;
+   struct dsa_loop_vlan *vl;
+   u16 vid, vid_start = 0;
+   bool pvid, untagged;
+   int err = 0;
+
+   dev_dbg(ds->dev, "%s\n", __func__);
+
+   /* Just do a sleeping operation to make lockdep checks effective */
+   mdiobus_read(bus, ps->port_base + port, MII_BMSR);
+
+   for (vid = vid_start; vid < DSA_LOOP_VLANS; vid++) {
+   vl = &ps->vlans[vid];
+
+   if (!(vl->members & BIT(port)))
+   continue;
+
+   

[PATCH net-next 11/11] net: dsa: debugfs: add port vlan

2017-08-14 Thread Vivien Didelot
Add a debug filesystem "vlan" entry to query a port's hardware VLAN
entries through the .port_vlan_dump switch operation.

This is really convenient for querying the hardware directly or inspecting
DSA or CPU links, since these ports are not exposed to userspace.

Here are the VLAN entries for a CPU port:

# cat port5/vlan
vid 1
vid 42  pvid

Signed-off-by: Vivien Didelot 
---
 net/dsa/debugfs.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/net/dsa/debugfs.c b/net/dsa/debugfs.c
index 98c5068d20da..b00942368d29 100644
--- a/net/dsa/debugfs.c
+++ b/net/dsa/debugfs.c
@@ -286,6 +286,34 @@ static const struct dsa_debugfs_ops dsa_debugfs_tree_ops = 
{
.read = dsa_debugfs_tree_read,
 };
 
+static int dsa_debugfs_vlan_dump_cb(u16 vid, bool pvid, bool untagged,
+   void *data)
+{
+   struct seq_file *seq = data;
+
+   seq_printf(seq, "vid %d", vid);
+   if (pvid)
+   seq_puts(seq, "  pvid");
+   if (untagged)
+   seq_puts(seq, "  untagged");
+   seq_puts(seq, "\n");
+
+   return 0;
+}
+
+static int dsa_debugfs_vlan_read(struct dsa_switch *ds, int id,
+struct seq_file *seq)
+{
+   if (!ds->ops->port_vlan_dump)
+   return -EOPNOTSUPP;
+
+   return ds->ops->port_vlan_dump(ds, id, dsa_debugfs_vlan_dump_cb, seq);
+}
+
+static const struct dsa_debugfs_ops dsa_debugfs_vlan_ops = {
+   .read = dsa_debugfs_vlan_read,
+};
+
 static int dsa_debugfs_create_port(struct dsa_switch *ds, int port)
 {
struct dentry *dir;
@@ -318,6 +346,11 @@ static int dsa_debugfs_create_port(struct dsa_switch *ds, 
int port)
if (err)
return err;
 
+   err = dsa_debugfs_create_file(ds, dir, "vlan", port,
+ &dsa_debugfs_vlan_ops);
+   if (err)
+   return err;
+
return 0;
 }
 
-- 
2.14.0



[PATCH net-next 09/11] net: dsa: debugfs: add port mdb

2017-08-14 Thread Vivien Didelot
Add a debug filesystem "mdb" entry to query a port's hardware MDB
entries through the .port_mdb_dump switch operation.

This is really convenient for querying the hardware directly or inspecting
DSA or CPU links, since these ports are not exposed to userspace.

Signed-off-by: Vivien Didelot 
---
 net/dsa/debugfs.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/net/dsa/debugfs.c b/net/dsa/debugfs.c
index 8204c62dc9c1..98c5068d20da 100644
--- a/net/dsa/debugfs.c
+++ b/net/dsa/debugfs.c
@@ -140,6 +140,20 @@ static const struct dsa_debugfs_ops dsa_debugfs_fdb_ops = {
.read = dsa_debugfs_fdb_read,
 };
 
+static int dsa_debugfs_mdb_read(struct dsa_switch *ds, int id,
+   struct seq_file *seq)
+{
+   if (!ds->ops->port_mdb_dump)
+   return -EOPNOTSUPP;
+
+   /* same callback as for FDB dump */
+   return ds->ops->port_mdb_dump(ds, id, dsa_debugfs_fdb_dump_cb, seq);
+}
+
+static const struct dsa_debugfs_ops dsa_debugfs_mdb_ops = {
+   .read = dsa_debugfs_mdb_read,
+};
+
 static void dsa_debugfs_regs_read_count(struct dsa_switch *ds, int id,
struct seq_file *seq, int count)
 {
@@ -289,6 +303,11 @@ static int dsa_debugfs_create_port(struct dsa_switch *ds, 
int port)
if (err)
return err;
 
+   err = dsa_debugfs_create_file(ds, dir, "mdb", port,
+ &dsa_debugfs_mdb_ops);
+   if (err)
+   return err;
+
err = dsa_debugfs_create_file(ds, dir, "regs", port,
  &dsa_debugfs_regs_ops);
if (err)
-- 
2.14.0



[PATCH net-next 04/11] net: dsa: debugfs: add tag_protocol

2017-08-14 Thread Vivien Didelot
Add a debug filesystem "tag_protocol" entry to query the switch tagging
protocol through the .get_tag_protocol operation.

# cat switch1/tag_protocol
EDSA

Signed-off-by: Vivien Didelot 
---
 net/dsa/debugfs.c | 54 ++
 1 file changed, 54 insertions(+)

diff --git a/net/dsa/debugfs.c b/net/dsa/debugfs.c
index 5607efdb924d..30a732e86161 100644
--- a/net/dsa/debugfs.c
+++ b/net/dsa/debugfs.c
@@ -109,6 +109,55 @@ static int dsa_debugfs_create_file(struct dsa_switch *ds, 
struct dentry *dir,
return 0;
 }
 
+static int dsa_debugfs_tag_protocol_read(struct dsa_switch *ds, int id,
+struct seq_file *seq)
+{
+   enum dsa_tag_protocol proto;
+
+   if (!ds->ops->get_tag_protocol)
+   return -EOPNOTSUPP;
+
+   proto = ds->ops->get_tag_protocol(ds);
+
+   switch (proto) {
+   case DSA_TAG_PROTO_NONE:
+   seq_puts(seq, "NONE\n");
+   break;
+   case DSA_TAG_PROTO_BRCM:
+   seq_puts(seq, "BRCM\n");
+   break;
+   case DSA_TAG_PROTO_DSA:
+   seq_puts(seq, "DSA\n");
+   break;
+   case DSA_TAG_PROTO_EDSA:
+   seq_puts(seq, "EDSA\n");
+   break;
+   case DSA_TAG_PROTO_KSZ:
+   seq_puts(seq, "KSZ\n");
+   break;
+   case DSA_TAG_PROTO_LAN9303:
+   seq_puts(seq, "LAN9303\n");
+   break;
+   case DSA_TAG_PROTO_MTK:
+   seq_puts(seq, "MTK\n");
+   break;
+   case DSA_TAG_PROTO_QCA:
+   seq_puts(seq, "QCA\n");
+   break;
+   case DSA_TAG_PROTO_TRAILER:
+   seq_puts(seq, "TRAILER\n");
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static const struct dsa_debugfs_ops dsa_debugfs_tag_protocol_ops = {
+   .read = dsa_debugfs_tag_protocol_read,
+};
+
 static int dsa_debugfs_tree_read(struct dsa_switch *ds, int id,
 struct seq_file *seq)
 {
@@ -151,6 +200,11 @@ static int dsa_debugfs_create_switch(struct dsa_switch *ds)
if (IS_ERR_OR_NULL(ds->debugfs_dir))
return -EFAULT;
 
+   err = dsa_debugfs_create_file(ds, ds->debugfs_dir, "tag_protocol", -1,
+ &dsa_debugfs_tag_protocol_ops);
+   if (err)
+   return err;
+
err = dsa_debugfs_create_file(ds, ds->debugfs_dir, "tree", -1,
  &dsa_debugfs_tree_ops);
if (err)
-- 
2.14.0



[PATCH net-next 03/11] net: dsa: debugfs: add tree

2017-08-14 Thread Vivien Didelot
This commit adds the boilerplate to create a DSA-related debug
filesystem entry as well as a "tree" file, containing the tree index.

# cat switch1/tree
0
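
The boilerplate derives each file's permissions from the hooks its ops table provides. A user-space sketch of that idea follows; struct file_ops, ops_mode() and dummy_read() are hypothetical names, the 0444 read bits come from the patch, and the 0200 write bits are an assumption since the relevant line is not visible here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down ops table: a NULL hook means "not supported". */
struct file_ops {
	int (*read)(void *ctx);
	int (*write)(void *ctx, const char *buf);
};

/*
 * Derive permission bits from the hooks that are present.
 * 0444 for readers is taken from the patch; 0200 for writers is assumed.
 */
static unsigned int ops_mode(const struct file_ops *ops)
{
	unsigned int mode = 0;

	assert(ops);

	if (ops->read)
		mode |= 0444;
	if (ops->write)
		mode |= 0200;

	return mode;
}

/* Example read hook used to populate an ops table. */
static int dummy_read(void *ctx)
{
	(void)ctx;
	return 0;
}
```

A read-only entry such as "tree" would end up with mode 0444, and an ops table with no hooks at all would yield a file nobody can open.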

Signed-off-by: Vivien Didelot 
---
 net/dsa/debugfs.c | 108 ++
 1 file changed, 108 insertions(+)

diff --git a/net/dsa/debugfs.c b/net/dsa/debugfs.c
index 68caf5a2c0c3..5607efdb924d 100644
--- a/net/dsa/debugfs.c
+++ b/net/dsa/debugfs.c
@@ -10,6 +10,7 @@
  */
 
 #include <linux/debugfs.h>
+#include <linux/seq_file.h>
 
 #include "dsa_priv.h"
 
@@ -19,6 +20,107 @@
 /* DSA module debugfs directory */
 static struct dentry *dsa_debugfs_dir;
 
+struct dsa_debugfs_ops {
+   int (*read)(struct dsa_switch *ds, int id, struct seq_file *seq);
+   int (*write)(struct dsa_switch *ds, int id, char *buf);
+};
+
+struct dsa_debugfs_priv {
+   const struct dsa_debugfs_ops *ops;
+   struct dsa_switch *ds;
+   int id;
+};
+
+static int dsa_debugfs_show(struct seq_file *seq, void *p)
+{
+   struct dsa_debugfs_priv *priv = seq->private;
+   struct dsa_switch *ds = priv->ds;
+
+   /* Somehow file mode is bypassed... Double check here */
+   if (!priv->ops->read)
+   return -EOPNOTSUPP;
+
+   return priv->ops->read(ds, priv->id, seq);
+}
+
+static ssize_t dsa_debugfs_write(struct file *file, const char __user 
*user_buf,
+size_t count, loff_t *ppos)
+{
+   struct seq_file *seq = file->private_data;
+   struct dsa_debugfs_priv *priv = seq->private;
+   struct dsa_switch *ds = priv->ds;
+   char buf[count + 1];
+   int err;
+
+   /* Somehow file mode is bypassed... Double check here */
+   if (!priv->ops->write)
+   return -EOPNOTSUPP;
+
+   if (copy_from_user(buf, user_buf, count))
+   return -EFAULT;
+
+   buf[count] = '\0';
+
+   err = priv->ops->write(ds, priv->id, buf);
+
+   return err ? err : count;
+}
+
+static int dsa_debugfs_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, dsa_debugfs_show, inode->i_private);
+}
+
+static const struct file_operations dsa_debugfs_fops = {
+   .open = dsa_debugfs_open,
+   .read = seq_read,
+   .write = dsa_debugfs_write,
+   .llseek = no_llseek,
+   .release = single_release,
+   .owner = THIS_MODULE,
+};
+
+static int dsa_debugfs_create_file(struct dsa_switch *ds, struct dentry *dir,
+  char *name, int id,
+  const struct dsa_debugfs_ops *ops)
+{
+   struct dsa_debugfs_priv *priv;
+   struct dentry *entry;
+   umode_t mode;
+
+   priv = devm_kzalloc(ds->dev, sizeof(*priv), GFP_KERNEL);
+   if (!priv)
+   return -ENOMEM;
+
+   priv->ops = ops;
+   priv->ds = ds;
+   priv->id = id;
+
+   mode = 0;
+   if (ops->read)
+   mode |= 0444;
+   if (ops->write)
+   mode |= 0200;
+
+   entry = debugfs_create_file(name, mode, dir, priv, &dsa_debugfs_fops);
+   if (IS_ERR_OR_NULL(entry))
+   return -EFAULT;
+
+   return 0;
+}
+
+static int dsa_debugfs_tree_read(struct dsa_switch *ds, int id,
+struct seq_file *seq)
+{
+   seq_printf(seq, "%d\n", ds->dst->tree);
+
+   return 0;
+}
+
+static const struct dsa_debugfs_ops dsa_debugfs_tree_ops = {
+   .read = dsa_debugfs_tree_read,
+};
+
 static int dsa_debugfs_create_port(struct dsa_switch *ds, int port)
 {
struct dentry *dir;
@@ -36,6 +138,7 @@ static int dsa_debugfs_create_port(struct dsa_switch *ds, 
int port)
 static int dsa_debugfs_create_switch(struct dsa_switch *ds)
 {
char name[32];
+   int err;
int i;
 
/* skip if there is no debugfs support */
@@ -48,6 +151,11 @@ static int dsa_debugfs_create_switch(struct dsa_switch *ds)
if (IS_ERR_OR_NULL(ds->debugfs_dir))
return -EFAULT;
 
+   err = dsa_debugfs_create_file(ds, ds->debugfs_dir, "tree", -1,
+ &dsa_debugfs_tree_ops);
+   if (err)
+   return err;
+
for (i = 0; i < ds->num_ports; i++) {
if (!ds->ports[i].dn)
continue;
-- 
2.14.0



[PATCH net-next 00/11] net: dsa: add generic debugfs interface

2017-08-14 Thread Vivien Didelot
This patch series adds a generic debugfs interface for the DSA
framework, so that all switch devices benefit from it, e.g. Marvell,
Broadcom, Microchip or any other DSA driver.

This is really convenient for debugging, especially CPU ports and DSA
links which are not exposed to userspace as net device. This interface
is currently the only way to easily inspect the hardware for such ports.

With the patch series, any switch device user is able to query the
hardware for the supported tagging protocol, the ports stats and
registers, as well as their FDB, MDB and VLAN entries.

This support is only compiled if CONFIG_DEBUG_FS is enabled. Below is
an example of usage of this interface on a multi-chip switch fabric:

# mount -t debugfs none /sys/kernel/debug
# cd /sys/kernel/debug/dsa/
# ls
switch0  switch1 switch2
# ls -l switch0/
drwxr-xr-x 2 root root 0 Jan  1 00:00 port0
drwxr-xr-x 2 root root 0 Jan  1 00:00 port1
drwxr-xr-x 2 root root 0 Jan  1 00:00 port2
drwxr-xr-x 2 root root 0 Jan  1 00:00 port5
drwxr-xr-x 2 root root 0 Jan  1 00:00 port6
-r--r--r-- 1 root root 0 Jan  1 00:00 tag_protocol
-r--r--r-- 1 root root 0 Jan  1 00:00 tree
# ls -l switch0/port6
-r--r--r-- 1 root root 0 Jan  1 00:00 fdb
-r--r--r-- 1 root root 0 Jan  1 00:00 mdb
-r--r--r-- 1 root root 0 Jan  1 00:00 regs
-r--r--r-- 1 root root 0 Jan  1 00:00 stats
-r--r--r-- 1 root root 0 Jan  1 00:00 vlan
# cat switch0/port2/vlan
vid 42  pvid  untagged
# cat switch0/port1/fdb
vid 0  12:34:56:78:90:ab  static  unicast
# pr -mt switch0/port{5,6}/stats
in_good_octets  : 0 in_good_octets  : 13824
in_bad_octets   : 0 in_bad_octets   : 0
in_unicast  : 0 in_unicast  : 0
in_broadcasts   : 0 in_broadcasts   : 216
in_multicasts   : 0 in_multicasts   : 0
in_pause: 0 in_pause: 0
in_undersize: 0 in_undersize: 0
...
# pr -mt switch0/port{5,6}/regs
 0: 4e07 0: 4d04
 1: 403e 1: 003d
 2:  2: 
 3: 3521 3: 3521
 4: 0533 4: 373f
 5: 8000 5: 
 6: 005f 6: 003f
 7: 002a 7: 002a
...

where switch0 port5 and port6 are CPU and DSA ports of a ZII Rev B.

Vivien Didelot (11):
  net: dsa: legacy: assign dst->applied
  net: dsa: add debugfs interface
  net: dsa: debugfs: add tree
  net: dsa: debugfs: add tag_protocol
  net: dsa: debugfs: add port stats
  net: dsa: debugfs: add port registers
  net: dsa: debugfs: add port fdb
  net: dsa: restore mdb dump
  net: dsa: debugfs: add port mdb
  net: dsa: restore VLAN dump
  net: dsa: debugfs: add port vlan

 drivers/net/dsa/b53/b53_common.c   |  41 +++
 drivers/net/dsa/b53/b53_priv.h |   2 +
 drivers/net/dsa/bcm_sf2.c  |   1 +
 drivers/net/dsa/dsa_loop.c |  38 +++
 drivers/net/dsa/microchip/ksz_common.c |  41 +++
 drivers/net/dsa/mv88e6xxx/chip.c   |  82 +-
 include/net/dsa.h  |  15 ++
 net/dsa/Kconfig|  14 +
 net/dsa/Makefile   |   1 +
 net/dsa/debugfs.c  | 453 +
 net/dsa/dsa.c  |   3 +
 net/dsa/dsa2.c |   4 +
 net/dsa/dsa_priv.h |  13 +
 net/dsa/legacy.c   |   7 +
 14 files changed, 707 insertions(+), 8 deletions(-)
 create mode 100644 net/dsa/debugfs.c

-- 
2.14.0



RE: [net-next 08/15] i40e/i40evf: organize and re-number feature flags

2017-08-14 Thread Keller, Jacob E


> -Original Message-
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Saturday, August 12, 2017 1:04 PM
> To: Kirsher, Jeffrey T 
> Cc: Keller, Jacob E ; netdev@vger.kernel.org;
> nhor...@redhat.com; sassm...@redhat.com; jogre...@redhat.com
> Subject: Re: [net-next 08/15] i40e/i40evf: organize and re-number feature 
> flags
> 
> From: Jeff Kirsher 
> Date: Sat, 12 Aug 2017 04:08:41 -0700
> 
> > Also ensure that the flags variable is actually a u64 to guarantee
> > 64bits of space on all architectures.
> 
> Why?  You don't need 64-bits, you only need 27.
> 
> This will be unnecessarily expensive on 32-bit platforms.
> 
> Please don't do this.

I suppose a better method would be to switch to using a declare_bitmap instead, 
so that it automatically sizes based on the number of flags we have. The reason 
we chose 64bits is because we will add flags in the future, as we originally 
had more than 32 flags prior to this patch until we moved some into a separate 
field.

But now that I think about it, using DECLARE_BITMAP makes more sense, though 
it's a bit more invasive of the code.

Thanks,
Jake
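
To illustrate the DECLARE_BITMAP idea mentioned above, here is a minimal
userspace sketch. The macros are stand-ins that mirror the kernel macros of
the same name, and the demo_* flag names and helpers are hypothetical; they
are not taken from the i40e driver. The point is only that the storage is
sized from the flag count, so adding a flag past bit 31 needs no type change.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel macros of the same name. */
#define BITS_PER_LONG      (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n)   (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)
#define DECLARE_BITMAP(name, bits) unsigned long name[BITS_TO_LONGS(bits)]

/* Hypothetical flag numbers (bit indices, not masks). */
enum demo_flag {
	DEMO_FLAG_MSIX_ENABLED  = 0,
	DEMO_FLAG_RSS_ENABLED   = 1,
	DEMO_FLAG_LATE_ADDITION = 39,	/* would not fit in a 32-bit word */
	DEMO_FLAG_COUNT,		/* must stay last */
};

struct demo_pf {
	/* Grows by whole longs as DEMO_FLAG_COUNT grows. */
	DECLARE_BITMAP(flags, DEMO_FLAG_COUNT);
};

/* Non-atomic helpers, loosely modelled on __set_bit() and friends. */
static inline void demo_set_bit(unsigned int nr, unsigned long *map)
{
	map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

static inline void demo_clear_bit(unsigned int nr, unsigned long *map)
{
	map[nr / BITS_PER_LONG] &= ~(1UL << (nr % BITS_PER_LONG));
}

static inline bool demo_test_bit(unsigned int nr, const unsigned long *map)
{
	return (map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL;
}
```

The cost David points out does not fully go away: on a 32-bit platform a
40-flag bitmap still occupies two longs, but only once the flag count
actually exceeds 32.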


Re: [net-next 15/15] i40e: synchronize nvmupdate command and adminq subtask

2017-08-14 Thread Shannon Nelson
On Sat, Aug 12, 2017 at 4:08 AM, Jeff Kirsher
 wrote:
> From: Sudheer Mogilappagari 
>
> During NVM update, state machine gets into unrecoverable state because
> i40e_clean_adminq_subtask can get scheduled after the admin queue
> command but before other state variables are updated. This causes
> incorrect input to i40e_nvmupd_check_wait_event and state transitions
> don't happen.
>
> This issue existed before but surfaced after commit 373149fc99a0
> ("i40e: Decrease the scope of rtnl lock")
>
> This fix adds locking around admin queue command and update of
> state variables so that adminq_subtask will have accurate information
> whenever it gets scheduled.
>
> Signed-off-by: Sudheer Mogilappagari 
> Signed-off-by: Jeff Kirsher 
> ---
>  drivers/net/ethernet/intel/i40e/i40e_nvm.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_nvm.c 
> b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
> index 6fdecd70dcbc..2cf7db2dc7cd 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_nvm.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
> @@ -753,6 +753,11 @@ i40e_status i40e_nvmupd_command(struct i40e_hw *hw,
> hw->nvmupd_state = I40E_NVMUPD_STATE_INIT;
> }
>
> +   /* Acquire lock to prevent race condition where adminq_task
> +* can execute after i40e_nvmupd_nvm_read/write but before state
> +* variables (nvm_wait_opcode, nvm_release_on_done) are updated
> +*/
> +   mutex_lock(&hw->aq.arq_mutex);
> switch (hw->nvmupd_state) {
> case I40E_NVMUPD_STATE_INIT:
> status = i40e_nvmupd_state_init(hw, cmd, bytes, perrno);
> @@ -788,6 +793,7 @@ i40e_status i40e_nvmupd_command(struct i40e_hw *hw,
> *perrno = -ESRCH;
> break;
> }

Perhaps I missed a patch somewhere, but I think there is still a
return statement in the middle of this switch() (INIT_WAIT and
WRITE_WAIT) that means you can leave the mutex locked.  I thought I
had seen a newer version of this patch that had this fixed

sln

> +   mutex_unlock(&hw->aq.arq_mutex);
> return status;
>  }
>
> --
> 2.14.0
>



-- 
==
Mr. Shannon Nelson Parents can't afford to be squeamish.
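
The concern raised in this review, a return statement inside the switch
leaving the mutex held, is usually avoided by routing every exit through a
single unlock label. Below is a minimal userspace sketch of that pattern.
Everything named demo_* is hypothetical, the toy lock only tracks whether it
is held, and the -EBUSY value is an assumption made for illustration; this is
not the i40e code.

```c
#include <assert.h>
#include <errno.h>

/* Toy lock that only records whether it is held; stands in for a mutex. */
struct demo_lock {
	int held;
};

static inline void demo_lock_acquire(struct demo_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

static inline void demo_lock_release(struct demo_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Hypothetical states, loosely echoing the ones named in the review. */
enum demo_state {
	DEMO_STATE_INIT,
	DEMO_STATE_READING,
	DEMO_STATE_INIT_WAIT,
	DEMO_STATE_WRITE_WAIT,
};

static int demo_command(struct demo_lock *lock, enum demo_state state,
			int *perrno)
{
	int status = 0;

	demo_lock_acquire(lock);
	switch (state) {
	case DEMO_STATE_INIT:
	case DEMO_STATE_READING:
		*perrno = 0;
		break;
	case DEMO_STATE_INIT_WAIT:
	case DEMO_STATE_WRITE_WAIT:
		/* Early exit, but via the unlock label instead of a bare return. */
		*perrno = -EBUSY;
		status = -1;
		goto unlock;
	default:
		*perrno = -ESRCH;
		status = -1;
		break;
	}
unlock:
	demo_lock_release(lock);
	return status;
}
```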


Re: [PATCH] net/sched: reset block pointer in tcf_block_put()

2017-08-14 Thread Cong Wang
On Mon, Aug 14, 2017 at 5:59 AM, Konstantin Khlebnikov
 wrote:
>
> This should work, I suppose.
>
> But this approach requires careful review for all qdisc, mine is completely
> mechanical.

Well, we don't have many classful qdisc's. Your patch actually
touches more qdisc's than mine, because you change an API, so
it is slightly harder to backport. ;)


[PATCH net] dccp: purge write queue in dccp_destroy_sock()

2017-08-14 Thread Eric Dumazet
From: Eric Dumazet 

syzkaller reported that DCCP could have a non empty
write queue at dismantle time.

WARNING: CPU: 1 PID: 2953 at net/core/stream.c:199 
sk_stream_kill_queues+0x3ce/0x520 net/core/stream.c:199
Kernel panic - not syncing: panic_on_warn set ...

CPU: 1 PID: 2953 Comm: syz-executor0 Not tainted 4.13.0-rc4+ #2
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:52
 panic+0x1e4/0x417 kernel/panic.c:180
 __warn+0x1c4/0x1d9 kernel/panic.c:541
 report_bug+0x211/0x2d0 lib/bug.c:183
 fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:190
 do_trap_no_signal arch/x86/kernel/traps.c:224 [inline]
 do_trap+0x260/0x390 arch/x86/kernel/traps.c:273
 do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:310
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:323
 invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:846
RIP: 0010:sk_stream_kill_queues+0x3ce/0x520 net/core/stream.c:199
RSP: 0018:8801d182f108 EFLAGS: 00010297
RAX: 8801d1144140 RBX: 8801d13cb280 RCX: 
RDX:  RSI: 85137b00 RDI: 8801d13cb280
RBP: 8801d182f148 R08: 0001 R09: 
R10:  R11:  R12: 8801d13cb4d0
R13: 8801d13cb3b8 R14: 8801d13cb300 R15: 8801d13cb3b8
 inet_csk_destroy_sock+0x175/0x3f0 net/ipv4/inet_connection_sock.c:835
 dccp_close+0x84d/0xc10 net/dccp/proto.c:1067
 inet_release+0xed/0x1c0 net/ipv4/af_inet.c:425
 sock_release+0x8d/0x1e0 net/socket.c:597
 sock_close+0x16/0x20 net/socket.c:1126
 __fput+0x327/0x7e0 fs/file_table.c:210
 fput+0x15/0x20 fs/file_table.c:246
 task_work_run+0x18a/0x260 kernel/task_work.c:116
 exit_task_work include/linux/task_work.h:21 [inline]
 do_exit+0xa32/0x1b10 kernel/exit.c:865
 do_group_exit+0x149/0x400 kernel/exit.c:969
 get_signal+0x7e8/0x17e0 kernel/signal.c:2330
 do_signal+0x94/0x1ee0 arch/x86/kernel/signal.c:808
 exit_to_usermode_loop+0x21c/0x2d0 arch/x86/entry/common.c:157
 prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
 syscall_return_slowpath+0x3a7/0x450 arch/x86/entry/common.c:263

Signed-off-by: Eric Dumazet 
Reported-by: Dmitry Vyukov 
---
 net/dccp/proto.c |5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 9fe25bf63296..86bc40ba6ba5 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -201,10 +201,7 @@ void dccp_destroy_sock(struct sock *sk)
 {
struct dccp_sock *dp = dccp_sk(sk);
 
-   /*
-* DCCP doesn't use sk_write_queue, just sk_send_head
-* for retransmissions
-*/
+   __skb_queue_purge(&sk->sk_write_queue);
if (sk->sk_send_head != NULL) {
kfree_skb(sk->sk_send_head);
sk->sk_send_head = NULL;




Re: [PATCH net] udp: fix linear skb reception with PEEK_OFF

2017-08-14 Thread Eric Dumazet
On Mon, 2017-08-14 at 21:31 +0200, Paolo Abeni wrote:
> From: Al Viro 
> 
> copy_linear_skb() is broken; both of its callers actually
> expect 'len' to be the amount we are trying to copy,
> not the offset of the end.
> Fix it keeping the meanings of arguments in sync with what the
> callers (both of them) expect.
> Also restore a saner behavior on EFAULT (i.e. preserving
> the iov_iter position in case of failure):
> 
> The commit fd851ba9caa9 ("udp: harden copy_linear_skb()")
> avoids the more destructive effect of the buggy
> copy_linear_skb(), e.g. no more invalid memory access, but
> said function still behaves incorrectly: when peeking with
> offset it can fail with EINVAL instead of copying the
> appropriate amount of memory.
> 
> Reported-by: Sasha Levin 
> Fixes: b65ac44674dd ("udp: try to avoid 2 cache miss on dequeue")
> Fixes: fd851ba9caa9 ("udp: harden copy_linear_skb()")
> Signed-off-by: Al Viro 
> Acked-by: Paolo Abeni 
> Tested-by: Sasha Levin 
> ---

Oh well.

Acked-by: Eric Dumazet 




Re: ipv4: distinguish EHOSTUNREACH from the ENETUNREACH

2017-08-14 Thread Daniel Walker

Hi,


It seems like commit cd0f0b is trying to add back these two error 
values into ip_route_input_slow(). However, if you follow the code path 
further down you get to the two exit points of this function,


in net/ipv4/route.c:ip_route_input_slow()

if (rt_cache_valid(rth)) {
skb_dst_set_noref(skb, &rth->dst);
err = 0;
goto out;
}

and

skb_dst_set(skb, &rth->dst);
err = 0;
goto out;

Both of these set the "err" variable to 0. This effectively destroys the 
return value which the patch seems to be adding. Am I missing something 
here?



Thanks,

Daniel



Re: [iproute PATCH 51/51] lib/bpf: Check return value of write()

2017-08-14 Thread Daniel Borkmann

On 08/14/2017 07:25 PM, Phil Sutter wrote:
[...]

But I really think we shouldn't make such a fuss about it - writing to
stderr either always works or we're in trouble everywhere. This patch
was merely to shut gcc up, so no need to waste much energy on a scenario
which won't happen anyway.


Yup, fair enough, makes sense.

Acked-by: Daniel Borkmann 


Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs

2017-08-14 Thread Willem de Bruijn
On Mon, Aug 14, 2017 at 3:15 PM, Thiago Macieira
 wrote:
> On Monday, 14 August 2017 12:03:16 PDT Willem de Bruijn wrote:
>> On Mon, Aug 14, 2017 at 2:58 PM, Thiago Macieira
>>
>>  wrote:
>> > On Monday, 14 August 2017 11:46:42 PDT Willem de Bruijn wrote:
>> >> > By the way, what were the usecases for the peek offset feature?
>> >>
>> >> The idea was to be able to peek at application headers of upper
>> >> layer protocols and multiplex messages among threads. It proved
>> >> so complex even for UDP that we did not attempt the same feature
>> >> for TCP. Also, KCM implements demultiplexing using eBPF today.
>> >
>> > Interesting, but how would userspace coordinate like that? Suppose
>> > multiple
>> > threads are woken up by a datagram being received
>>
>> This assumes a separate listener thread and worker threadpool.
>
> The listener thread still needs to synchronise with the worker that got
> activated and wait for it to recv from the socket before the listener thread
> can go back to poll().
>
> If we are really talking about threads in the same process, it might be easier
> for the listener to just read the datagram anyway and pass it on to the
> worker. That way, it can proceed immediately to the next datagram and not have
> to wait for the possibly slow worker.
>
> If it is a separate process, then I don't see another way and this might be
> necessary.
>
> By the way, what does recv with MSG_PEEK | MSG_TRUNC return? Is it the full
> datagram's size or is it the size minus the peek offset?

udp_recvmsg returns ulen if the flag is passed.

if (flags & MSG_TRUNC)
err = ulen;

This is computed earlier in the function as udp payload length and not
modified after.

ulen = udp_skb_len(skb);
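
For anyone who wants to check this from userspace, here is a small sketch. It
sends one datagram over loopback and peeks at it with a deliberately short
buffer. The helper name demo_peek_trunc is made up for illustration, no peek
offset is set, and the behaviour relied on is the Linux-specific meaning of
MSG_TRUNC as a recv() flag on UDP sockets.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send a payload_len-byte datagram to a loopback UDP socket, then peek at
 * it with a buf_len-byte buffer. Returns what recv() reports, or -1 on error.
 */
static ssize_t demo_peek_trunc(size_t payload_len, size_t buf_len)
{
	char payload[1024];
	char buf[1024];
	struct sockaddr_in addr;
	socklen_t alen = sizeof(addr);
	ssize_t ret = -1;
	int rx = -1, tx = -1;

	if (payload_len > sizeof(payload) || buf_len > sizeof(buf))
		return -1;
	memset(payload, 'x', sizeof(payload));

	rx = socket(AF_INET, SOCK_DGRAM, 0);
	tx = socket(AF_INET, SOCK_DGRAM, 0);
	if (rx < 0 || tx < 0)
		goto out;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;	/* let the kernel pick a free port */
	if (bind(rx, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    getsockname(rx, (struct sockaddr *)&addr, &alen) < 0)
		goto out;

	if (sendto(tx, payload, payload_len, 0,
		   (struct sockaddr *)&addr, sizeof(addr)) != (ssize_t)payload_len)
		goto out;

	/* MSG_PEEK leaves the datagram queued; MSG_TRUNC asks for its real length. */
	ret = recv(rx, buf, buf_len, MSG_PEEK | MSG_TRUNC);
out:
	if (rx >= 0)
		close(rx);
	if (tx >= 0)
		close(tx);
	return ret;
}
```

With a 100-byte datagram and a 10-byte buffer, the call should report 100
rather than 10, which matches the ulen assignment quoted above.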


[PATCH net] udp: fix linear skb reception with PEEK_OFF

2017-08-14 Thread Paolo Abeni
From: Al Viro 

copy_linear_skb() is broken; both of its callers actually
expect 'len' to be the amount we are trying to copy,
not the offset of the end.
Fix it keeping the meanings of arguments in sync with what the
callers (both of them) expect.
Also restore a saner behavior on EFAULT (i.e. preserving
the iov_iter position in case of failure):

The commit fd851ba9caa9 ("udp: harden copy_linear_skb()")
avoids the more destructive effect of the buggy
copy_linear_skb(), e.g. no more invalid memory access, but
said function still behaves incorrectly: when peeking with
offset it can fail with EINVAL instead of copying the
appropriate amount of memory.

Reported-by: Sasha Levin 
Fixes: b65ac44674dd ("udp: try to avoid 2 cache miss on dequeue")
Fixes: fd851ba9caa9 ("udp: harden copy_linear_skb()")
Signed-off-by: Al Viro 
Acked-by: Paolo Abeni 
Tested-by: Sasha Levin 
---
This patch has been buried in a private email exchange for some
time.
I'm posting it on behalf of Al Viro, to avoid losing this merge
window, since he is busy elsewhere.
---
 include/net/udp.h | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index e9b1d1eacb59..586de4b811b5 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -366,14 +366,13 @@ static inline bool udp_skb_is_linear(struct sk_buff *skb)
 static inline int copy_linear_skb(struct sk_buff *skb, int len, int off,
  struct iov_iter *to)
 {
-   int n, copy = len - off;
+   int n;
 
-   if (copy < 0)
-   return -EINVAL;
-   n = copy_to_iter(skb->data + off, copy, to);
-   if (n == copy)
+   n = copy_to_iter(skb->data + off, len, to);
+   if (n == len)
return 0;
 
+   iov_iter_revert(to, n);
return -EFAULT;
 }
 
-- 
2.13.5
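
The difference between the two readings of 'len' described in the commit
message can be seen in a small userspace sketch. Both demo_* helpers are
hypothetical and operate on a plain byte array instead of an skb and an
iov_iter. Unlike copy_linear_skb(), they return the number of bytes copied so
the difference is visible.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Pre-fix reading: 'len' is taken as the end offset, so len - off bytes move. */
static int demo_copy_end_offset(const char *data, int len, int off, char *to)
{
	int copy = len - off;

	if (copy < 0)
		return -EINVAL;
	memcpy(to, data + off, copy);
	return copy;
}

/* Post-fix reading: 'len' is the amount to copy, starting at 'off'. */
static int demo_copy_amount(const char *data, int len, int off, char *to)
{
	memcpy(to, data + off, len);
	return len;
}
```

Peeking 3 bytes at offset 4 of "0123456789" fails with -EINVAL under the first
reading and yields "456" under the second, which is what the callers expect.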



Re: [Intel-wired-lan] [PATCH 6/6] [net-next]net: i40e: Enable cloud filters in i40e via tc/flower classifier

2017-08-14 Thread Nambiar, Amritha
On 8/1/2017 12:16 PM, Shannon Nelson wrote:
> On 7/31/2017 5:38 PM, Amritha Nambiar wrote:
>> This patch enables tc-flower based hardware offloads. tc/flower
>> filter provided by the kernel is configured as driver specific
>> cloud filter. The patch implements functions and admin queue
>> commands needed to support cloud filters in the driver and
>> adds cloud filters to configure these tc-flower filters.
>>
>> The only action supported is to redirect packets to a traffic class
>> on the same device.
>>
>> # tc qdisc add dev eth0 ingress
>> # ethtool -K eth0 hw-tc-offload on
>>
>> # tc filter add dev eth0 protocol ip parent :\
>>prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw indev eth0\
>>action mirred ingress redirect dev eth0 tc 0
>>
>> # tc filter add dev eth0 protocol ip parent :\
>>prio 2 flower dst_ip 192.168.3.5/32\
>>ip_proto udp dst_port 25 skip_sw indev eth0\
>>action mirred ingress redirect dev eth0 tc 1
>>
>> # tc filter add dev eth0 protocol ipv6 parent :\
>>prio 3 flower dst_ip fe8::200:1\
>>ip_proto udp dst_port 66 skip_sw indev eth0\
>>action mirred ingress redirect dev eth0 tc 2
>>
>> Delete tc flower filter:
>> Example:
>>
>> # tc filter del dev eth0 parent : prio 3 handle 0x1 flower
>> # tc filter del dev eth0 parent :
>>
>> Flow Director Sideband is disabled while configuring cloud filters
>> via tc-flower.
> 
> Only while configuring, or the whole time there is a cloud filter?  This 
> is unclear here.

The entire time cloud filters exists. Will make the comment clearer in v2.

> 
>>
>> Unsupported matches when cloud filters are added using enhanced
>> big buffer cloud filter mode of underlying switch include:
>> 1. source port and source IP
>> 2. Combined MAC address and IP fields.
>> 3. Not specfying L4 port
> 
> s/specfying/specifying/

Will fix in v2.

> 
>>
>> These filter matches can however be used to redirect traffic to
>> the main VSI (tc 0) which does not require the enhanced big buffer
>> cloud filter support.
>>
>> Signed-off-by: Amritha Nambiar 
>> Signed-off-by: Kiran Patil 
>> ---
>>   drivers/net/ethernet/intel/i40e/i40e.h   |   46 +
>>   drivers/net/ethernet/intel/i40e/i40e_common.c|  180 
>>   drivers/net/ethernet/intel/i40e/i40e_main.c  |  952 
>> ++
>>   drivers/net/ethernet/intel/i40e/i40e_prototype.h |   17
>>   4 files changed, 1193 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
>> b/drivers/net/ethernet/intel/i40e/i40e.h
>> index 5c0cad5..7288265 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>> @@ -55,6 +55,8 @@
>>   #include 
>>   #include 
>>   #include 
>> +#include 
>> +#include 
>>   #include "i40e_type.h"
>>   #include "i40e_prototype.h"
>>   #include "i40e_client.h"
>> @@ -252,10 +254,51 @@ struct i40e_fdir_filter {
>>  u32 fd_id;
>>   };
>>   
>> +#define I40E_CLOUD_FIELD_OMAC   0x01
>> +#define I40E_CLOUD_FIELD_IMAC   0x02
>> +#define I40E_CLOUD_FIELD_IVLAN  0x04
>> +#define I40E_CLOUD_FIELD_TEN_ID 0x08
>> +#define I40E_CLOUD_FIELD_IIP0x10
>> +
>> +#define I40E_CLOUD_FILTER_FLAGS_OMACI40E_CLOUD_FIELD_OMAC
>> +#define I40E_CLOUD_FILTER_FLAGS_IMACI40E_CLOUD_FIELD_IMAC
>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN  (I40E_CLOUD_FIELD_IMAC | \
>> + I40E_CLOUD_FIELD_IVLAN)
>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC_TEN_ID (I40E_CLOUD_FIELD_IMAC | \
>> + I40E_CLOUD_FIELD_TEN_ID)
>> +#define I40E_CLOUD_FILTER_FLAGS_OMAC_TEN_ID_IMAC (I40E_CLOUD_FIELD_OMAC | \
>> +  I40E_CLOUD_FIELD_IMAC | \
>> +  I40E_CLOUD_FIELD_TEN_ID)
>> +#define I40E_CLOUD_FILTER_FLAGS_IMAC_IVLAN_TEN_ID (I40E_CLOUD_FIELD_IMAC | \
>> +   I40E_CLOUD_FIELD_IVLAN | \
>> +   I40E_CLOUD_FIELD_TEN_ID)
>> +#define I40E_CLOUD_FILTER_FLAGS_IIP I40E_CLOUD_FIELD_IIP
>> +
>>   struct i40e_cloud_filter {
>>  struct hlist_node cloud_node;
>>  /* cloud filter input set follows */
>>  unsigned long cookie;
>> +u8 dst_mac[ETH_ALEN];
>> +u8 src_mac[ETH_ALEN];
>> +__be16 vlan_id;
>> +__be32 dst_ip[4];
>> +__be32 src_ip[4];
>> +u8 dst_ipv6[16];
>> +u8 src_ipv6[16];
>> +__be16 dst_port;
>> +__be16 src_port;
>> +/* matter only when IP based filtering is set */
>> +bool is_ipv6;
>> +/* IPPROTO value */
>> +u8 ip_proto;
>> +/* L4 port type: src or destination port */
>> +#define I40E_CLOUD_FILTER_PORT_SRC  0x01
>> +#define I40E_CLOUD_FILTER_PORT_DEST 0x02
>> +u8 port_type;
>> +u32 tenant_id;
>> +u8 flags;
>> +#define I40E_CLOUD_TNL_TYPE_NONE0xff
>> +u8 tunnel_type;
>>  /* filter 

[PATCH net-next] liquidio: fix issues with fw_type module parameter

2017-08-14 Thread Felix Manlunas
From: Derek Chickles 

The fw_type module parameter isn't showing up in the
/sys/module/liquidio/parameters directory.  Fix it by setting the read
permission bits for user, group, other in module_param_string().  Revise
the description of fw_type.  Initialize the fw_type static char array with
the default value to conform to the module parameter description.

Signed-off-by: Derek Chickles 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_main.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index cbd6287..b8ba2c2 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -59,9 +59,9 @@ static int debug = -1;
 module_param(debug, int, 0644);
 MODULE_PARM_DESC(debug, "NETIF_MSG debug bits");
 
-static char fw_type[LIO_MAX_FW_TYPE_LEN];
-module_param_string(fw_type, fw_type, sizeof(fw_type), );
-MODULE_PARM_DESC(fw_type, "Type of firmware to be loaded. Default \"nic\"");
+static char fw_type[LIO_MAX_FW_TYPE_LEN] = LIO_FW_NAME_TYPE_NIC;
+module_param_string(fw_type, fw_type, sizeof(fw_type), 0444);
+MODULE_PARM_DESC(fw_type, "Type of firmware to be loaded. Default \"nic\".  
Use \"none\" to load firmware from flash.");
 
 static u32 console_bitmask;
 module_param(console_bitmask, int, 0644);


Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs

2017-08-14 Thread Thiago Macieira
On Monday, 14 August 2017 12:03:16 PDT Willem de Bruijn wrote:
> On Mon, Aug 14, 2017 at 2:58 PM, Thiago Macieira
> 
>  wrote:
> > On Monday, 14 August 2017 11:46:42 PDT Willem de Bruijn wrote:
> >> > By the way, what were the usecases for the peek offset feature?
> >> 
> >> The idea was to be able to peek at application headers of upper
> >> layer protocols and multiplex messages among threads. It proved
> >> so complex even for UDP that we did not attempt the same feature
> >> for TCP. Also, KCM implements demultiplexing using eBPF today.
> > 
> > Interesting, but how would userspace coordinate like that? Suppose
> > multiple
> > threads are woken up by a datagram being received
> 
> This assumes a separate listener thread and worker threadpool.

The listener thread still needs to synchronise with the worker that got 
activated and wait for it to recv from the socket before the listener thread 
can go back to poll().

If we are really talking about threads in the same process, it might be easier 
for the listener to just read the datagram anyway and pass it on to the 
worker. That way, it can proceed immediately to the next datagram and not have 
to wait for the possibly slow worker.

If it is a separate process, then I don't see another way and this might be 
necessary.

By the way, what does recv with MSG_PEEK | MSG_TRUNC return? Is it the full 
datagram's size or is it the size minus the peek offset?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center



[patch net-next 2/2] mlxsw: spectrum_router: Add support for nexthop group consolidation for IPv6

2017-08-14 Thread Jiri Pirko
From: Arkadi Sharshevsky 

Due to limited ASIC resources the maximum number of routes is limited by
the nexthop resource. In order to improve the routing scale nexthop
consolidation should be performed.

This patch adds support for IPv6 neighbor consolidation. The hash value
is calculated based on the nexthop set, by performing bitwise xor on the
ifindexes of the nexthops, in a similar way to IPv4's kernel implementation.
In case of collision a full match is performed between the sets which
include address and ifindex comparison.

Non gateway nexthop groups are not inserted to the hash table due to
lack of nexthop device (ifindex).

Signed-off-by: Arkadi Sharshevsky 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 150 +++--
 1 file changed, 141 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 5100429..16676ff 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1509,6 +1509,7 @@ struct mlxsw_sp_nexthop {
struct rhash_head ht_node;
struct mlxsw_sp_nexthop_key key;
unsigned char gw_addr[sizeof(struct in6_addr)];
+   int ifindex;
struct mlxsw_sp_rif *rif;
u8 should_offload:1, /* set indicates this neigh is connected and
  * should be put to KVD linear area of this group.
@@ -1543,24 +1544,115 @@ mlxsw_sp_nexthop4_group_fi(const struct 
mlxsw_sp_nexthop_group *nh_grp)
 }
 
 struct mlxsw_sp_nexthop_group_cmp_arg {
-   struct fib_info *fi;
+   enum mlxsw_sp_l3proto proto;
+   union {
+   struct fib_info *fi;
+   struct mlxsw_sp_fib6_entry *fib6_entry;
+   };
 };
 
+static bool
+mlxsw_sp_nexthop6_group_has_nexthop(const struct mlxsw_sp_nexthop_group 
*nh_grp,
+   const struct in6_addr *gw, int ifindex)
+{
+   int i;
+
+   for (i = 0; i < nh_grp->count; i++) {
+   const struct mlxsw_sp_nexthop *nh;
+
+   nh = &nh_grp->nexthops[i];
+   if (nh->ifindex == ifindex &&
+   ipv6_addr_equal(gw, (struct in6_addr *) nh->gw_addr))
+   return true;
+   }
+
+   return false;
+}
+
+static bool
+mlxsw_sp_nexthop6_group_cmp(const struct mlxsw_sp_nexthop_group *nh_grp,
+   const struct mlxsw_sp_fib6_entry *fib6_entry)
+{
+   struct mlxsw_sp_rt6 *mlxsw_sp_rt6;
+
+   if (nh_grp->count != fib6_entry->nrt6)
+   return false;
+
+   list_for_each_entry(mlxsw_sp_rt6, &fib6_entry->rt6_list, list) {
+   struct in6_addr *gw;
+   int ifindex;
+
+   ifindex = mlxsw_sp_rt6->rt->dst.dev->ifindex;
+   gw = &mlxsw_sp_rt6->rt->rt6i_gateway;
+   if (!mlxsw_sp_nexthop6_group_has_nexthop(nh_grp, gw, ifindex))
+   return false;
+   }
+
+   return true;
+}
+
 static int
 mlxsw_sp_nexthop_group_cmp(struct rhashtable_compare_arg *arg, const void *ptr)
 {
const struct mlxsw_sp_nexthop_group_cmp_arg *cmp_arg = arg->key;
const struct mlxsw_sp_nexthop_group *nh_grp = ptr;
 
-   return cmp_arg->fi != mlxsw_sp_nexthop4_group_fi(nh_grp);
+   switch (cmp_arg->proto) {
+   case MLXSW_SP_L3_PROTO_IPV4:
+   return cmp_arg->fi != mlxsw_sp_nexthop4_group_fi(nh_grp);
+   case MLXSW_SP_L3_PROTO_IPV6:
+   return !mlxsw_sp_nexthop6_group_cmp(nh_grp,
+   cmp_arg->fib6_entry);
+   default:
+   WARN_ON(1);
+   return 1;
+   }
+}
+
+static int
+mlxsw_sp_nexthop_group_type(const struct mlxsw_sp_nexthop_group *nh_grp)
+{
+   return nh_grp->neigh_tbl->family;
 }
 
 static u32 mlxsw_sp_nexthop_group_hash_obj(const void *data, u32 len, u32 seed)
 {
const struct mlxsw_sp_nexthop_group *nh_grp = data;
-   struct fib_info *fi = mlxsw_sp_nexthop4_group_fi(nh_grp);
+   const struct mlxsw_sp_nexthop *nh;
+   struct fib_info *fi;
+   unsigned int val;
+   int i;
+
+   switch (mlxsw_sp_nexthop_group_type(nh_grp)) {
+   case AF_INET:
+   fi = mlxsw_sp_nexthop4_group_fi(nh_grp);
+   return jhash(&fi, sizeof(fi), seed);
+   case AF_INET6:
+   val = nh_grp->count;
+   for (i = 0; i < nh_grp->count; i++) {
+   nh = &nh_grp->nexthops[i];
+   val ^= nh->ifindex;
+   }
+   return jhash(&val, sizeof(val), seed);
+   default:
+   WARN_ON(1);
+   return 0;
+   }
+}
+
+static u32
+mlxsw_sp_nexthop6_group_hash(struct mlxsw_sp_fib6_entry *fib6_entry, u32 seed)
+{
+   unsigned int val = 

[patch net-next 0/2] mlxsw: Add support for nexthop group consolidation for IPv6

2017-08-14 Thread Jiri Pirko
From: Jiri Pirko 

Arkadi says:

Due to limited ASIC resources, the maximum number of routes is bounded by
the nexthop resource. To improve routing scale, nexthops should be
consolidated.

In the IPv4 case, the kernel consolidates nexthops in the form of the
fib_info struct, and the driver uses the fib_info's address as the key
for looking up its internal nexthop group struct. In the IPv6 case, the
kernel doesn't do consolidation, so the driver has to implement it itself.

The hash value is calculated from the nexthop set by XORing the ifindexes
of the nexthops, similar to the kernel's IPv4 implementation. On a
collision, a full match is performed between the sets, comparing both
address and ifindex.

In order to use the same hash table in both cases (IPv4/6), the rhashtable
is changed to operate on a variable-length key.
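
The scheme described above can be sketched in userspace C. This is only an
illustration, not the driver code: struct nh_sketch and the helper names are
hypothetical, the value returned by nh_set_hash_input() is what would be fed
to a real hash function, and the full match assumes a set holds no duplicate
nexthops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one IPv6 nexthop: gateway address plus ifindex. */
struct nh_sketch {
    uint8_t gw[16];
    int ifindex;
};

/* Order-independent hash input: the set size XORed with every ifindex. */
static unsigned int nh_set_hash_input(const struct nh_sketch *nhs, size_t count)
{
    unsigned int val = (unsigned int)count;
    size_t i;

    for (i = 0; i < count; i++)
        val ^= (unsigned int)nhs[i].ifindex;
    return val;
}

/* True if the set contains a nexthop with the same gateway and ifindex. */
static bool nh_set_has(const struct nh_sketch *nhs, size_t count,
                       const struct nh_sketch *nh)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (nhs[i].ifindex == nh->ifindex &&
            !memcmp(nhs[i].gw, nh->gw, sizeof(nh->gw)))
            return true;
    }
    return false;
}

/* Full match for colliding hash inputs: same size, every member of a in b. */
static bool nh_set_equal(const struct nh_sketch *a, size_t a_count,
                         const struct nh_sketch *b, size_t b_count)
{
    size_t i;

    if (a_count != b_count)
        return false;
    for (i = 0; i < a_count; i++) {
        if (!nh_set_has(b, b_count, &a[i]))
            return false;
    }
    return true;
}
```

Because XOR only looks at ifindexes, two sets that share interfaces but differ
in gateway address produce the same hash input, which is why the full match on
address and ifindex is still needed.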

Arkadi Sharshevsky (2):
  mlxsw: spectrum_router: Prepare nexthop group's hash table for IPv6
  mlxsw: spectrum_router: Add support for nexthop group consolidation
for IPv6

 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 209 ++---
 1 file changed, 188 insertions(+), 21 deletions(-)

-- 
2.9.3



[patch net-next 1/2] mlxsw: spectrum_router: Prepare nexthop group's hash table for IPv6

2017-08-14 Thread Jiri Pirko
From: Arkadi Sharshevsky 

This patch prepares for the introduction of IPv6 nexthop group
consolidation. Currently the nexthop group hash table is used only by
IPv4 and uses a fixed key size. The table is changed so that it can
also support IPv6's variable-length key.
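
As a rough illustration of the pattern (a userspace sketch, not the kernel
rhashtable), the table below drops the fixed key offset and length and instead
relies on three callbacks: one hashing the lookup key, one hashing a stored
object, and one comparing the two. All names here (struct group, struct
cmp_arg, table_insert, table_lookup) are hypothetical, and the pointer hash is
an arbitrary stand-in for jhash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 8

/* Hypothetical object stored in the table. */
struct group {
    struct group *next;
    const void *priv;   /* identity that lookups are matched against */
};

/* Hypothetical lookup key; it is not embedded in the stored object. */
struct cmp_arg {
    const void *fi;
};

struct table {
    struct group *buckets[NBUCKETS];
};

/* Arbitrary pointer hash, standing in for a real hash function. */
static uint32_t hash_ptr(const void *p)
{
    return (uint32_t)(((uintptr_t)p * 2654435761u) >> 7);
}

/* Hash of a lookup key. */
static uint32_t key_hash(const struct cmp_arg *arg)
{
    return hash_ptr(arg->fi);
}

/* Hash of a stored object; must agree with key_hash() for a matching key. */
static uint32_t obj_hash(const struct group *g)
{
    return hash_ptr(g->priv);
}

/* Returns 0 on match, non-zero otherwise. */
static int obj_cmp(const struct cmp_arg *arg, const struct group *g)
{
    return arg->fi != g->priv;
}

static void table_insert(struct table *t, struct group *g)
{
    uint32_t b = obj_hash(g) % NBUCKETS;

    g->next = t->buckets[b];
    t->buckets[b] = g;
}

static struct group *table_lookup(const struct table *t,
                                  const struct cmp_arg *arg)
{
    struct group *g;

    for (g = t->buckets[key_hash(arg) % NBUCKETS]; g; g = g->next) {
        if (!obj_cmp(arg, g))
            return g;
    }
    return NULL;
}
```

The point of the split is that the key passed to a lookup and the object in
the table may have different layouts, as long as the two hash callbacks agree
for matching pairs.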

Signed-off-by: Arkadi Sharshevsky 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 69 --
 1 file changed, 52 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 3d9be36..5100429 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -1522,15 +1522,11 @@ struct mlxsw_sp_nexthop {
struct mlxsw_sp_neigh_entry *neigh_entry;
 };
 
-struct mlxsw_sp_nexthop_group_key {
-   struct fib_info *fi;
-};
-
 struct mlxsw_sp_nexthop_group {
+   void *priv;
struct rhash_head ht_node;
struct list_head fib_list; /* list of fib entries that use this group */
struct neigh_table *neigh_tbl;
-   struct mlxsw_sp_nexthop_group_key key;
u8 adj_index_valid:1,
   gateway:1; /* routes using the group use a gateway */
u32 adj_index;
@@ -1540,10 +1536,46 @@ struct mlxsw_sp_nexthop_group {
 #define nh_rif nexthops[0].rif
 };
 
+static struct fib_info *
+mlxsw_sp_nexthop4_group_fi(const struct mlxsw_sp_nexthop_group *nh_grp)
+{
+   return nh_grp->priv;
+}
+
+struct mlxsw_sp_nexthop_group_cmp_arg {
+   struct fib_info *fi;
+};
+
+static int
+mlxsw_sp_nexthop_group_cmp(struct rhashtable_compare_arg *arg, const void *ptr)
+{
+   const struct mlxsw_sp_nexthop_group_cmp_arg *cmp_arg = arg->key;
+   const struct mlxsw_sp_nexthop_group *nh_grp = ptr;
+
+   return cmp_arg->fi != mlxsw_sp_nexthop4_group_fi(nh_grp);
+}
+
+static u32 mlxsw_sp_nexthop_group_hash_obj(const void *data, u32 len, u32 seed)
+{
+   const struct mlxsw_sp_nexthop_group *nh_grp = data;
+   struct fib_info *fi = mlxsw_sp_nexthop4_group_fi(nh_grp);
+
+   return jhash(&fi, sizeof(fi), seed);
+}
+
+static u32
+mlxsw_sp_nexthop_group_hash(const void *data, u32 len, u32 seed)
+{
+   const struct mlxsw_sp_nexthop_group_cmp_arg *cmp_arg = data;
+
+   return jhash(&cmp_arg->fi, sizeof(cmp_arg->fi), seed);
+}
+
 static const struct rhashtable_params mlxsw_sp_nexthop_group_ht_params = {
-   .key_offset = offsetof(struct mlxsw_sp_nexthop_group, key),
.head_offset = offsetof(struct mlxsw_sp_nexthop_group, ht_node),
-   .key_len = sizeof(struct mlxsw_sp_nexthop_group_key),
+   .hashfn  = mlxsw_sp_nexthop_group_hash,
+   .obj_hashfn  = mlxsw_sp_nexthop_group_hash_obj,
+   .obj_cmpfn   = mlxsw_sp_nexthop_group_cmp,
 };
 
 static int mlxsw_sp_nexthop_group_insert(struct mlxsw_sp *mlxsw_sp,
@@ -1563,10 +1595,14 @@ static void mlxsw_sp_nexthop_group_remove(struct mlxsw_sp *mlxsw_sp,
 }
 
 static struct mlxsw_sp_nexthop_group *
-mlxsw_sp_nexthop_group_lookup(struct mlxsw_sp *mlxsw_sp,
- struct mlxsw_sp_nexthop_group_key key)
+mlxsw_sp_nexthop4_group_lookup(struct mlxsw_sp *mlxsw_sp,
+  struct fib_info *fi)
 {
-   return rhashtable_lookup_fast(&mlxsw_sp->router->nexthop_group_ht, &key,
+   struct mlxsw_sp_nexthop_group_cmp_arg cmp_arg;
+
+   cmp_arg.fi = fi;
+   return rhashtable_lookup_fast(&mlxsw_sp->router->nexthop_group_ht,
+ &cmp_arg,
  mlxsw_sp_nexthop_group_ht_params);
 }
 
@@ -2063,12 +2099,12 @@ mlxsw_sp_nexthop4_group_create(struct mlxsw_sp *mlxsw_sp, struct fib_info *fi)
nh_grp = kzalloc(alloc_size, GFP_KERNEL);
if (!nh_grp)
return ERR_PTR(-ENOMEM);
+   nh_grp->priv = fi;
INIT_LIST_HEAD(&nh_grp->fib_list);
nh_grp->neigh_tbl = &arp_tbl;
 
nh_grp->gateway = fi->fib_nh->nh_scope == RT_SCOPE_LINK;
nh_grp->count = fi->fib_nhs;
-   nh_grp->key.fi = fi;
fib_info_hold(fi);
for (i = 0; i < nh_grp->count; i++) {
nh = &nh_grp->nexthops[i];
@@ -2089,7 +2125,7 @@ mlxsw_sp_nexthop4_group_create(struct mlxsw_sp *mlxsw_sp, struct fib_info *fi)
nh = &nh_grp->nexthops[i];
mlxsw_sp_nexthop4_fini(mlxsw_sp, nh);
}
-   fib_info_put(nh_grp->key.fi);
+   fib_info_put(fi);
kfree(nh_grp);
return ERR_PTR(err);
 }
@@ -2108,7 +2144,7 @@ mlxsw_sp_nexthop4_group_destroy(struct mlxsw_sp *mlxsw_sp,
}
mlxsw_sp_nexthop_group_refresh(mlxsw_sp, nh_grp);
WARN_ON_ONCE(nh_grp->adj_index_valid);
-   fib_info_put(nh_grp->key.fi);
+   fib_info_put(mlxsw_sp_nexthop4_group_fi(nh_grp));
kfree(nh_grp);
 }
 
@@ -2116,11 +2152,9 @@ static int 

Re: [Intel-wired-lan] [PATCH 5/6] [net-next]net: i40e: Clean up of cloud filters

2017-08-14 Thread Nambiar, Amritha
On 8/1/2017 12:16 PM, Shannon Nelson wrote:
> On 7/31/2017 5:38 PM, Amritha Nambiar wrote:
>> Introduce the cloud filter datastructure and cleanup of cloud
>> filters associated with the device.
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>>   drivers/net/ethernet/intel/i40e/i40e.h  |   11 +++
>>   drivers/net/ethernet/intel/i40e/i40e_main.c |   27 
>> +++
>>   2 files changed, 38 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e.h 
>> b/drivers/net/ethernet/intel/i40e/i40e.h
>> index 1391e5d..5c0cad5 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e.h
>> @@ -252,6 +252,14 @@ struct i40e_fdir_filter {
>>  u32 fd_id;
>>   };
>>   
>> +struct i40e_cloud_filter {
>> +struct hlist_node cloud_node;
>> +/* cloud filter input set follows */
>> +unsigned long cookie;
>> +/* filter control */
>> +u16 seid;
>> +};
> 
> This would be cleaner and more readable with the field comments off to 
> the side rather than in line with the fields.

Will fix in the next version of the series.

> 
>> +
>>   #define I40E_ETH_P_LLDP0x88cc
>>   
>>   #define I40E_DCB_PRIO_TYPE_STRICT  0
>> @@ -419,6 +427,9 @@ struct i40e_pf {
>>  struct i40e_udp_port_config udp_ports[I40E_MAX_PF_UDP_OFFLOAD_PORTS];
>>  u16 pending_udp_bitmap;
>>   
>> +struct hlist_head cloud_filter_list;
>> +u16 num_cloud_filters;
>> +
>>  enum i40e_interrupt_policy int_policy;
>>  u16 rx_itr_default;
>>  u16 tx_itr_default;
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
>> b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> index f74..93f6fe2 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_main.c
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
>> @@ -6928,6 +6928,29 @@ static void i40e_fdir_filter_exit(struct i40e_pf *pf)
>>   }
>>   
>>   /**
>> + * i40e_cloud_filter_exit - Cleans up the Cloud Filters
>> + * @pf: Pointer to PF
>> + *
>> + * This function destroys the hlist where all the Cloud Filters
>> + * filters were saved.
>> + **/
>> +static void i40e_cloud_filter_exit(struct i40e_pf *pf)
>> +{
>> +struct i40e_cloud_filter *cfilter;
>> +struct hlist_node *node;
>> +
>> +if (hlist_empty(&pf->cloud_filter_list))
>> +return;
> 
> Is this check really necessary?  Doesn't hlist_for_each_entry_safe() 
> check for this?

That's right. Will fix in the next version of the series.
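
For illustration only, the point can be shown with a userspace sketch. It
uses a plain singly linked list as a stand-in for the kernel hlist, and every
name in it (struct filter_sketch, filter_add, filter_list_exit) is
hypothetical. A loop that saves the next pointer before freeing the current
node simply runs zero times on an empty list, so no separate emptiness check
is needed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cloud filter entry. */
struct filter_sketch {
    struct filter_sketch *next;
    unsigned long cookie;
};

/* Hypothetical stand-in for the list head and counter kept in the PF. */
struct filter_list {
    struct filter_sketch *first;
    unsigned int count;
};

/* Prepend a new filter; returns 0 on success, -1 on allocation failure. */
static int filter_add(struct filter_list *list, unsigned long cookie)
{
    struct filter_sketch *f = malloc(sizeof(*f));

    if (!f)
        return -1;
    f->cookie = cookie;
    f->next = list->first;
    list->first = f;
    list->count++;
    return 0;
}

/* Free every filter and return how many were freed. */
static unsigned int filter_list_exit(struct filter_list *list)
{
    struct filter_sketch *f, *next;
    unsigned int freed = 0;

    for (f = list->first; f; f = next) {
        next = f->next;     /* save the successor before freeing f */
        free(f);
        freed++;
    }
    list->first = NULL;
    list->count = 0;
    return freed;
}
```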

> 
>> +
>> +hlist_for_each_entry_safe(cfilter, node,
>> +  &pf->cloud_filter_list, cloud_node) {
>> +hlist_del(&cfilter->cloud_node);
>> +kfree(cfilter);
>> +}
>> +pf->num_cloud_filters = 0;
>> +}
>> +
>> +/**
>>* i40e_close - Disables a network interface
>>* @netdev: network interface device structure
>>*
>> @@ -12137,6 +12160,7 @@ static int i40e_setup_pf_switch(struct i40e_pf *pf, 
>> bool reinit)
>>  vsi = i40e_vsi_reinit_setup(pf->vsi[pf->lan_vsi]);
>>  if (!vsi) {
>>  dev_info(&pf->pdev->dev, "setup of MAIN VSI failed\n");
>> +i40e_cloud_filter_exit(pf);
>>  i40e_fdir_teardown(pf);
>>  return -EAGAIN;
>>  }
>> @@ -12961,6 +12985,8 @@ static void i40e_remove(struct pci_dev *pdev)
>>  if (pf->vsi[pf->lan_vsi])
>>  i40e_vsi_release(pf->vsi[pf->lan_vsi]);
>>   
>> +i40e_cloud_filter_exit(pf);
>> +
>>  /* remove attached clients */
>>  if (pf->flags & I40E_FLAG_IWARP_ENABLED) {
>>  ret_code = i40e_lan_del_device(pf);
>> @@ -13170,6 +13196,7 @@ static void i40e_shutdown(struct pci_dev *pdev)
>>   
>>  del_timer_sync(&pf->service_timer);
>>  cancel_work_sync(&pf->service_task);
>> +i40e_cloud_filter_exit(pf);
>>  i40e_fdir_teardown(pf);
>>   
>>  /* Client close must be called explicitly here because the timer
>>
>> ___
>> Intel-wired-lan mailing list
>> intel-wired-...@osuosl.org
>> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
>>


Re: [PATCH net] ipv6: release rt6->rt6i_idev properly during ifdown

2017-08-14 Thread David Ahern
On 8/14/17 11:44 AM, Wei Wang wrote:
> From: Wei Wang 
> 
> When a dst is created by addrconf_dst_alloc() for a host route or an
> anycast route, dst->dev points to loopback dev while rt6->rt6i_idev
> points to a real device.
> When the real device goes down, the current cleanup code only checks for
> dst->dev and assumes rt6->rt6i_idev->dev is the same. This causes the
> refcount leak on the real device in the above situation.
> This patch makes sure to always release the refcount taken on
> rt6->rt6i_idev during dst_dev_put().
> 
> Fixes: 587fea741134 ("ipv6: mark DST_NOGC and remove the operation of
> dst_free()")
> Reported-by: John Stultz 
> Tested-by: John Stultz 
> Tested-by: Martin KaFai Lau 
> Signed-off-by: Wei Wang 
> Signed-off-by: Martin KaFai Lau 
> ---
>  net/ipv6/route.c | 13 +
>  1 file changed, 5 insertions(+), 8 deletions(-)

Acked-by: David Ahern 


Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs

2017-08-14 Thread Willem de Bruijn
On Mon, Aug 14, 2017 at 2:58 PM, Thiago Macieira
 wrote:
> On Monday, 14 August 2017 11:46:42 PDT Willem de Bruijn wrote:
>> > By the way, what were the usecases for the peek offset feature?
>>
>> The idea was to be able to peek at application headers of upper
>> layer protocols and multiplex messages among threads. It proved
>> so complex even for UDP that we did not attempt the same feature
>> for TCP. Also, KCM implements demultiplexing using eBPF today.
>
> Interesting, but how would userspace coordinate like that? Suppose multiple
> threads are woken up by a datagram being received

This assumes a separate listener thread and worker threadpool.


[PATCH V2 net-next 7/8] liquidio: moved liquidio_setup_io_queues to lio_core.c

2017-08-14 Thread Felix Manlunas
From: Intiyaz Basha 

Moving common liquidio_setup_io_queues to lio_core.c

Signed-off-by: Intiyaz Basha 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_core.c| 119 -
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 109 +--
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c |  93 +---
 .../net/ethernet/cavium/liquidio/octeon_network.h  |  13 +--
 4 files changed, 118 insertions(+), 216 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_core.c b/drivers/net/ethernet/cavium/liquidio/lio_core.c
index 2030c25..d20d0eb 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_core.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_core.c
@@ -406,8 +406,8 @@ static void lio_update_txq_status(struct octeon_device *oct, int iq_num)
  * @param desc_size size of each descriptor
  * @param app_ctx application context
  */
-int octeon_setup_droq(struct octeon_device *oct, int q_no, int num_descs,
- int desc_size, void *app_ctx)
+static int octeon_setup_droq(struct octeon_device *oct, int q_no, int num_descs,
+int desc_size, void *app_ctx)
 {
int ret_val;
 
@@ -441,7 +441,7 @@ int octeon_setup_droq(struct octeon_device *oct, int q_no, int num_descs,
  * @param param- additional control data with the packet
  * @param arg  - farg registered in droq_ops
  */
-void
+static void
 liquidio_push_packet(u32 octeon_id __attribute__((unused)),
 void *skbuff,
 u32 len,
@@ -599,7 +599,7 @@ static void napi_schedule_wrapper(void *param)
  * \brief callback when receive interrupt occurs and we are in NAPI mode
  * @param arg pointer to octeon output queue
  */
-void liquidio_napi_drv_callback(void *arg)
+static void liquidio_napi_drv_callback(void *arg)
 {
struct octeon_device *oct;
struct octeon_droq *droq = arg;
@@ -626,7 +626,7 @@ void liquidio_napi_drv_callback(void *arg)
  * @param napi NAPI structure
  * @param budget maximum number of items to process
  */
-int liquidio_napi_poll(struct napi_struct *napi, int budget)
+static int liquidio_napi_poll(struct napi_struct *napi, int budget)
 {
struct octeon_instr_queue *iq;
struct octeon_device *oct;
@@ -679,3 +679,112 @@ int liquidio_napi_poll(struct napi_struct *napi, int budget)
 
return (!tx_done) ? (budget) : (work_done);
 }
+
+/**
+ * \brief Setup input and output queues
+ * @param octeon_dev octeon device
+ * @param ifidx Interface index
+ *
+ * Note: Queues are with respect to the octeon device. Thus
+ * an input queue is for egress packets, and output queues
+ * are for ingress packets.
+ */
+int liquidio_setup_io_queues(struct octeon_device *octeon_dev, int ifidx)
+{
+   struct octeon_droq_ops droq_ops;
+   struct net_device *netdev;
+   struct octeon_droq *droq;
+   struct napi_struct *napi;
+   int cpu_id_modulus;
+   int num_tx_descs;
+   struct lio *lio;
+   int retval = 0;
+   int q, q_no;
+   int cpu_id;
+
+   netdev = octeon_dev->props[ifidx].netdev;
+
+   lio = GET_LIO(netdev);
+
+   memset(&droq_ops, 0, sizeof(struct octeon_droq_ops));
+
+   droq_ops.fptr = liquidio_push_packet;
+   droq_ops.farg = netdev;
+
+   droq_ops.poll_mode = 1;
+   droq_ops.napi_fn = liquidio_napi_drv_callback;
+   cpu_id = 0;
+   cpu_id_modulus = num_present_cpus();
+
+   /* set up DROQs. */
+   for (q = 0; q < lio->linfo.num_rxpciq; q++) {
+   q_no = lio->linfo.rxpciq[q].s.q_no;
+   dev_dbg(&octeon_dev->pci_dev->dev,
+   "%s index:%d linfo.rxpciq.s.q_no:%d\n",
+   __func__, q, q_no);
+   retval = octeon_setup_droq(
+   octeon_dev, q_no,
+   CFG_GET_NUM_RX_DESCS_NIC_IF(octeon_get_conf(octeon_dev),
+   lio->ifidx),
+   CFG_GET_NUM_RX_BUF_SIZE_NIC_IF(octeon_get_conf(octeon_dev),
+  lio->ifidx),
+   NULL);
+   if (retval) {
+   dev_err(&octeon_dev->pci_dev->dev,
+   "%s : Runtime DROQ(RxQ) creation failed.\n",
+   __func__);
+   return 1;
+   }
+
+   droq = octeon_dev->droq[q_no];
+   napi = &droq->napi;
+   dev_dbg(&octeon_dev->pci_dev->dev, "netif_napi_add netdev:%llx oct:%llx\n",
+   (u64)netdev, (u64)octeon_dev);
+   netif_napi_add(netdev, napi, liquidio_napi_poll, 64);
+
+   /* designate a CPU for this droq */
+   droq->cpu_id = cpu_id;
+   cpu_id++;
+   if (cpu_id >= cpu_id_modulus)
+   cpu_id = 0;
+
+   

[PATCH V2 net-next 4/8] liquidio: moved liquidio_push_packet to lio_core.c

2017-08-14 Thread Felix Manlunas
From: Intiyaz Basha 

Moving common liquidio_push_packet to lio_core.c

Signed-off-by: Intiyaz Basha 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_core.c| 149 +
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 147 
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c | 128 --
 .../net/ethernet/cavium/liquidio/octeon_network.h  |   7 +
 4 files changed, 156 insertions(+), 275 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_core.c b/drivers/net/ethernet/cavium/liquidio/lio_core.c
index 90583ce..b0b246e 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_core.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_core.c
@@ -432,3 +432,152 @@ int octeon_setup_droq(struct octeon_device *oct, int q_no, int num_descs,
 
return ret_val;
 }
+
+/** Routine to push packets arriving on Octeon interface upto network layer.
+ * @param oct_id   - octeon device id.
+ * @param skbuff   - skbuff struct to be passed to network layer.
+ * @param len  - size of total data received.
+ * @param rh   - Control header associated with the packet
+ * @param param- additional control data with the packet
+ * @param arg  - farg registered in droq_ops
+ */
+void
+liquidio_push_packet(u32 octeon_id __attribute__((unused)),
+void *skbuff,
+u32 len,
+union octeon_rh *rh,
+void *param,
+void *arg)
+{
+   struct net_device *netdev = (struct net_device *)arg;
+   struct octeon_droq *droq =
+   container_of(param, struct octeon_droq, napi);
+   struct sk_buff *skb = (struct sk_buff *)skbuff;
+   struct skb_shared_hwtstamps *shhwtstamps;
+   struct napi_struct *napi = param;
+   u16 vtag = 0;
+   u32 r_dh_off;
+   u64 ns;
+
+   if (netdev) {
+   struct lio *lio = GET_LIO(netdev);
+   struct octeon_device *oct = lio->oct_dev;
+   int packet_was_received;
+
+   /* Do not proceed if the interface is not in RUNNING state. */
+   if (!ifstate_check(lio, LIO_IFSTATE_RUNNING)) {
+   recv_buffer_free(skb);
+   droq->stats.rx_dropped++;
+   return;
+   }
+
+   skb->dev = netdev;
+
+   skb_record_rx_queue(skb, droq->q_no);
+   if (likely(len > MIN_SKB_SIZE)) {
+   struct octeon_skb_page_info *pg_info;
+   unsigned char *va;
+
+   pg_info = ((struct octeon_skb_page_info *)(skb->cb));
+   if (pg_info->page) {
+   /* For Paged allocation use the frags */
+   va = page_address(pg_info->page) +
+   pg_info->page_offset;
+   memcpy(skb->data, va, MIN_SKB_SIZE);
+   skb_put(skb, MIN_SKB_SIZE);
+   skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
+   pg_info->page,
+   pg_info->page_offset +
+   MIN_SKB_SIZE,
+   len - MIN_SKB_SIZE,
+   LIO_RXBUFFER_SZ);
+   }
+   } else {
+   struct octeon_skb_page_info *pg_info =
+   ((struct octeon_skb_page_info *)(skb->cb));
+   skb_copy_to_linear_data(skb, page_address(pg_info->page)
+   + pg_info->page_offset, len);
+   skb_put(skb, len);
+   put_page(pg_info->page);
+   }
+
+   r_dh_off = (rh->r_dh.len - 1) * BYTES_PER_DHLEN_UNIT;
+
+   if (oct->ptp_enable) {
+   if (rh->r_dh.has_hwtstamp) {
+   /* timestamp is included from the hardware at
+* the beginning of the packet.
+*/
+   if (ifstate_check
+   (lio,
+LIO_IFSTATE_RX_TIMESTAMP_ENABLED)) {
+   /* Nanoseconds are in the first 64-bits
+* of the packet.
+*/
+   memcpy(&ns, (skb->data + r_dh_off),
+  sizeof(ns));
+   r_dh_off -= BYTES_PER_DHLEN_UNIT;
+   shhwtstamps = skb_hwtstamps(skb);
+ 

[PATCH V2 net-next 3/8] liquidio: moved octeon_setup_droq to lio_core.c

2017-08-14 Thread Felix Manlunas
From: Intiyaz Basha 

Moving common octeon_setup_droq to lio_core.c

Signed-off-by: Intiyaz Basha 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_core.c| 35 
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 37 --
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c | 35 
 .../net/ethernet/cavium/liquidio/octeon_network.h  |  2 ++
 4 files changed, 37 insertions(+), 72 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_core.c b/drivers/net/ethernet/cavium/liquidio/lio_core.c
index b55ab75..90583ce 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_core.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_core.c
@@ -397,3 +397,38 @@ void lio_update_txq_status(struct octeon_device *oct, int iq_num)
netif_wake_queue(netdev);
}
 }
+
+/**
+ * \brief Setup output queue
+ * @param oct octeon device
+ * @param q_no which queue
+ * @param num_descs how many descriptors
+ * @param desc_size size of each descriptor
+ * @param app_ctx application context
+ */
+int octeon_setup_droq(struct octeon_device *oct, int q_no, int num_descs,
+ int desc_size, void *app_ctx)
+{
+   int ret_val;
+
+   dev_dbg(&oct->pci_dev->dev, "Creating Droq: %d\n", q_no);
+   /* droq creation and local register settings. */
+   ret_val = octeon_create_droq(oct, q_no, num_descs, desc_size, app_ctx);
+   if (ret_val < 0)
+   return ret_val;
+
+   if (ret_val == 1) {
+   dev_dbg(&oct->pci_dev->dev, "Using default droq %d\n", q_no);
+   return 0;
+   }
+
+   /* Enable the droq queues */
+   octeon_set_droq_pkt_op(oct, q_no, 1);
+
+   /* Send Credit for Octeon Output queues. Credits are always
+* sent after the output queue is enabled.
+*/
+   writel(oct->droq[q_no]->max_count, oct->droq[q_no]->pkts_credit_reg);
+
+   return ret_val;
+}
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index ba1b493..a814d58 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -2196,43 +2196,6 @@ static int load_firmware(struct octeon_device *oct)
 }
 
 /**
- * \brief Setup output queue
- * @param oct octeon device
- * @param q_no which queue
- * @param num_descs how many descriptors
- * @param desc_size size of each descriptor
- * @param app_ctx application context
- */
-static int octeon_setup_droq(struct octeon_device *oct, int q_no, int num_descs,
-int desc_size, void *app_ctx)
-{
-   int ret_val = 0;
-
-   dev_dbg(&oct->pci_dev->dev, "Creating Droq: %d\n", q_no);
-   /* droq creation and local register settings. */
-   ret_val = octeon_create_droq(oct, q_no, num_descs, desc_size, app_ctx);
-   if (ret_val < 0)
-   return ret_val;
-
-   if (ret_val == 1) {
-   dev_dbg(&oct->pci_dev->dev, "Using default droq %d\n", q_no);
-   return 0;
-   }
-   /* tasklet creation for the droq */
-
-   /* Enable the droq queues */
-   octeon_set_droq_pkt_op(oct, q_no, 1);
-
-   /* Send Credit for Octeon Output queues. Credits are always
-* sent after the output queue is enabled.
-*/
-   writel(oct->droq[q_no]->max_count,
-  oct->droq[q_no]->pkts_credit_reg);
-
-   return ret_val;
-}
-
-/**
  * \brief Callback for getting interface configuration
  * @param status status of request
  * @param buf pointer to resp structure
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
index dd0265a..a6efd75 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
@@ -1345,41 +1345,6 @@ static void free_netsgbuf_with_resp(void *buf)
 }
 
 /**
- * \brief Setup output queue
- * @param oct octeon device
- * @param q_no which queue
- * @param num_descs how many descriptors
- * @param desc_size size of each descriptor
- * @param app_ctx application context
- */
-static int octeon_setup_droq(struct octeon_device *oct, int q_no, int num_descs,
-int desc_size, void *app_ctx)
-{
-   int ret_val;
-
-   dev_dbg(&oct->pci_dev->dev, "Creating Droq: %d\n", q_no);
-   /* droq creation and local register settings. */
-   ret_val = octeon_create_droq(oct, q_no, num_descs, desc_size, app_ctx);
-   if (ret_val < 0)
-   return ret_val;
-
-   if (ret_val == 1) {
-   dev_dbg(&oct->pci_dev->dev, "Using default droq %d\n", q_no);
-   return 0;
-   }
-
-   /* Enable the droq queues */
-   octeon_set_droq_pkt_op(oct, q_no, 1);
-
-   /* Send Credit for Octeon Output queues. Credits are always
-* sent after the 

[PATCH V2 net-next 5/8] liquidio: moved liquidio_napi_drv_callback to lio_core.c

2017-08-14 Thread Felix Manlunas
From: Intiyaz Basha 

Moving common liquidio_napi_drv_callback to lio_core.c

Signed-off-by: Intiyaz Basha 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_core.c| 39 ++
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 38 -
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c | 13 +---
 .../net/ethernet/cavium/liquidio/octeon_network.h  |  1 +
 4 files changed, 41 insertions(+), 50 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_core.c b/drivers/net/ethernet/cavium/liquidio/lio_core.c
index b0b246e..8cba927 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_core.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_core.c
@@ -581,3 +581,42 @@ liquidio_push_packet(u32 octeon_id __attribute__((unused)),
recv_buffer_free(skb);
}
 }
+
+/**
+ * \brief wrapper for calling napi_schedule
+ * @param param parameters to pass to napi_schedule
+ *
+ * Used when scheduling on different CPUs
+ */
+static void napi_schedule_wrapper(void *param)
+{
+   struct napi_struct *napi = param;
+
+   napi_schedule(napi);
+}
+
+/**
+ * \brief callback when receive interrupt occurs and we are in NAPI mode
+ * @param arg pointer to octeon output queue
+ */
+void liquidio_napi_drv_callback(void *arg)
+{
+   struct octeon_device *oct;
+   struct octeon_droq *droq = arg;
+   int this_cpu = smp_processor_id();
+
+   oct = droq->oct_dev;
+
+   if (OCTEON_CN23XX_PF(oct) || OCTEON_CN23XX_VF(oct) ||
+   droq->cpu_id == this_cpu) {
+   napi_schedule_irqoff(&droq->napi);
+   } else {
+   struct call_single_data *csd = &droq->csd;
+
+   csd->func = napi_schedule_wrapper;
+   csd->info = &droq->napi;
+   csd->flags = 0;
+
+   smp_call_function_single_async(droq->cpu_id, csd);
+   }
+}
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 68a94c4..4241949 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -2229,44 +2229,6 @@ static void if_cfg_callback(struct octeon_device *oct,
 }
 
 /**
- * \brief wrapper for calling napi_schedule
- * @param param parameters to pass to napi_schedule
- *
- * Used when scheduling on different CPUs
- */
-static void napi_schedule_wrapper(void *param)
-{
-   struct napi_struct *napi = param;
-
-   napi_schedule(napi);
-}
-
-/**
- * \brief callback when receive interrupt occurs and we are in NAPI mode
- * @param arg pointer to octeon output queue
- */
-static void liquidio_napi_drv_callback(void *arg)
-{
-   struct octeon_device *oct;
-   struct octeon_droq *droq = arg;
-   int this_cpu = smp_processor_id();
-
-   oct = droq->oct_dev;
-
-   if (OCTEON_CN23XX_PF(oct) || droq->cpu_id == this_cpu) {
-   napi_schedule_irqoff(&droq->napi);
-   } else {
-   struct call_single_data *csd = &droq->csd;
-
-   csd->func = napi_schedule_wrapper;
-   csd->info = &droq->napi;
-   csd->flags = 0;
-
-   smp_call_function_single_async(droq->cpu_id, csd);
-   }
-}
-
-/**
  * \brief Entry point for NAPI polling
  * @param napi NAPI structure
  * @param budget maximum number of items to process
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
index 013a861..2663bd6 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
@@ -1377,17 +1377,6 @@ static void if_cfg_callback(struct octeon_device *oct,
 }
 
 /**
- * \brief callback when receive interrupt occurs and we are in NAPI mode
- * @param arg pointer to octeon output queue
- */
-static void liquidio_vf_napi_drv_callback(void *arg)
-{
-   struct octeon_droq *droq = arg;
-
-   napi_schedule_irqoff(&droq->napi);
-}
-
-/**
  * \brief Entry point for NAPI polling
  * @param napi NAPI structure
  * @param budget maximum number of items to process
@@ -1473,7 +1462,7 @@ static int setup_io_queues(struct octeon_device *octeon_dev, int ifidx)
droq_ops.farg = netdev;
 
droq_ops.poll_mode = 1;
-   droq_ops.napi_fn = liquidio_vf_napi_drv_callback;
+   droq_ops.napi_fn = liquidio_napi_drv_callback;
cpu_id = 0;
cpu_id_modulus = num_present_cpus();
 
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_network.h b/drivers/net/ethernet/cavium/liquidio/octeon_network.h
index 5d78fd6..076fdfc 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_network.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_network.h
@@ -484,4 +484,5 @@ liquidio_push_packet(u32 octeon_id __attribute__((unused)),
 union octeon_rh *rh,
 void *param,
 void *arg);
+void 

[PATCH V2 net-next 8/8] liquidio: added support for ethtool --set-ring feature

2017-08-14 Thread Felix Manlunas
From: Intiyaz Basha 

Add support for the ethtool --set-ring feature.

Signed-off-by: Intiyaz Basha 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_ethtool.c | 131 +
 drivers/net/ethernet/cavium/liquidio/lio_main.c|   6 +-
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c |   6 +-
 .../net/ethernet/cavium/liquidio/octeon_config.h   |  13 +-
 .../net/ethernet/cavium/liquidio/octeon_device.c   |  14 +--
 .../net/ethernet/cavium/liquidio/octeon_network.h  |   1 +
 6 files changed, 160 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c b/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
index b78e296..5ef595d 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_ethtool.c
@@ -642,6 +642,9 @@ lio_ethtool_get_ringparam(struct net_device *netdev,
u32 tx_max_pending = 0, rx_max_pending = 0, tx_pending = 0,
rx_pending = 0;
 
+   if (ifstate_check(lio, LIO_IFSTATE_RESETTING))
+   return;
+
if (OCTEON_CN6XXX(oct)) {
struct octeon_config *conf6x = CHIP_CONF(oct, cn6xxx);
 
@@ -666,6 +669,126 @@ lio_ethtool_get_ringparam(struct net_device *netdev,
ering->rx_jumbo_max_pending = 0;
 }
 
+static int lio_reset_queues(struct net_device *netdev)
+{
+   struct lio *lio = GET_LIO(netdev);
+   struct octeon_device *oct = lio->oct_dev;
+   struct napi_struct *napi, *n;
+   int i;
+
+   dev_dbg(&oct->pci_dev->dev, "%s:%d ifidx %d\n",
+   __func__, __LINE__, lio->ifidx);
+
+   if (wait_for_pending_requests(oct))
+   dev_err(&oct->pci_dev->dev, "There were pending requests\n");
+
+   if (lio_wait_for_instr_fetch(oct))
+   dev_err(&oct->pci_dev->dev, "IQ had pending instructions\n");
+
+   if (octeon_set_io_queues_off(oct)) {
+   dev_err(&oct->pci_dev->dev, "setting io queues off failed\n");
+   return -1;
+   }
+
+   /* Disable the input and output queues now. No more packets will
+* arrive from Octeon.
+*/
+   oct->fn_list.disable_io_queues(oct);
+   /* Delete NAPI */
+   list_for_each_entry_safe(napi, n, &netdev->napi_list, dev_list)
+   netif_napi_del(napi);
+
+   for (i = 0; i < MAX_OCTEON_OUTPUT_QUEUES(oct); i++) {
+   if (!(oct->io_qmask.oq & BIT_ULL(i)))
+   continue;
+   octeon_delete_droq(oct, i);
+   }
+
+   for (i = 0; i < MAX_OCTEON_INSTR_QUEUES(oct); i++) {
+   if (!(oct->io_qmask.iq & BIT_ULL(i)))
+   continue;
+   octeon_delete_instr_queue(oct, i);
+   }
+
+   if (oct->fn_list.setup_device_regs(oct)) {
+   dev_err(&oct->pci_dev->dev, "Failed to configure device registers\n");
+   return -1;
+   }
+
+   if (liquidio_setup_io_queues(oct, 0)) {
+   dev_err(&oct->pci_dev->dev, "IO queues initialization failed\n");
+   return -1;
+   }
+
+   /* Enable the input and output queues for this Octeon device */
+   if (oct->fn_list.enable_io_queues(oct)) {
+   dev_err(&oct->pci_dev->dev, "Failed to enable input/output queues");
+   return -1;
+   }
+
+   return 0;
+}
+
+static int lio_ethtool_set_ringparam(struct net_device *netdev,
+struct ethtool_ringparam *ering)
+{
+   u32 rx_count, tx_count, rx_count_old, tx_count_old;
+   struct lio *lio = GET_LIO(netdev);
+   struct octeon_device *oct = lio->oct_dev;
+   int stopped = 0;
+
+   if (!OCTEON_CN23XX_PF(oct) && !OCTEON_CN23XX_VF(oct))
+   return -EINVAL;
+
+   if (ering->rx_mini_pending || ering->rx_jumbo_pending)
+   return -EINVAL;
+
+   rx_count = clamp_t(u32, ering->rx_pending, CN23XX_MIN_OQ_DESCRIPTORS,
+  CN23XX_MAX_OQ_DESCRIPTORS);
+   tx_count = clamp_t(u32, ering->tx_pending, CN23XX_MIN_IQ_DESCRIPTORS,
+  CN23XX_MAX_IQ_DESCRIPTORS);
+
+   rx_count_old = oct->droq[0]->max_count;
+   tx_count_old = oct->instr_queue[0]->max_count;
+
+   if (rx_count == rx_count_old && tx_count == tx_count_old)
+   return 0;
+
+   ifstate_set(lio, LIO_IFSTATE_RESETTING);
+
+   if (netif_running(netdev)) {
+   netdev->netdev_ops->ndo_stop(netdev);
+   stopped = 1;
+   }
+
+   /* Change RX/TX DESCS  count */
+   if (tx_count != tx_count_old)
+   CFG_SET_NUM_TX_DESCS_NIC_IF(octeon_get_conf(oct), lio->ifidx,
+   tx_count);
+   if (rx_count != rx_count_old)
+   CFG_SET_NUM_RX_DESCS_NIC_IF(octeon_get_conf(oct), lio->ifidx,
+   rx_count);
+
+   if (lio_reset_queues(netdev))
+  
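
The sizing logic visible in lio_ethtool_set_ringparam() above (clamp the request, then return early if nothing changed) can be sketched in user space as follows. This is only an illustration: DEMO_MIN_DESCS, DEMO_MAX_DESCS, clamp_u32() and ring_resize_needed() are hypothetical names, and the limits are made up because the real CN23XX_* values are not shown in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limits; the real CN23XX_* values are not shown in the patch. */
#define DEMO_MIN_DESCS 128u
#define DEMO_MAX_DESCS 2048u

/* Clamp a requested ring size into [lo, hi], in the spirit of clamp_t(u32, ...). */
static uint32_t clamp_u32(uint32_t val, uint32_t lo, uint32_t hi)
{
	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}

/* Return true when the clamped request differs from the current ring sizes,
 * i.e. when the queues would have to be torn down and rebuilt.
 */
static bool ring_resize_needed(uint32_t rx_req, uint32_t tx_req,
			       uint32_t rx_cur, uint32_t tx_cur)
{
	uint32_t rx = clamp_u32(rx_req, DEMO_MIN_DESCS, DEMO_MAX_DESCS);
	uint32_t tx = clamp_u32(tx_req, DEMO_MIN_DESCS, DEMO_MAX_DESCS);

	return rx != rx_cur || tx != tx_cur;
}
```

Because the comparison happens after clamping, an out-of-range request that clamps to the current size is treated as "no change" and avoids an interface reset.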

[PATCH V2 net-next 6/8] liquidio: moved liquidio_napi_poll to lio_core.c

2017-08-14 Thread Felix Manlunas
From: Intiyaz Basha 

Move the common function liquidio_napi_poll() to lio_core.c.

Signed-off-by: Intiyaz Basha 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_core.c| 61 +-
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 52 --
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c | 54 ---
 .../net/ethernet/cavium/liquidio/octeon_network.h  |  2 +-
 4 files changed, 61 insertions(+), 108 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_core.c 
b/drivers/net/ethernet/cavium/liquidio/lio_core.c
index 8cba927..2030c25 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_core.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_core.c
@@ -366,7 +366,7 @@ void cleanup_rx_oom_poll_fn(struct net_device *netdev)
 }
 
 /* Runs in interrupt context. */
-void lio_update_txq_status(struct octeon_device *oct, int iq_num)
+static void lio_update_txq_status(struct octeon_device *oct, int iq_num)
 {
struct octeon_instr_queue *iq = oct->instr_queue[iq_num];
struct net_device *netdev;
@@ -620,3 +620,62 @@ void liquidio_napi_drv_callback(void *arg)
smp_call_function_single_async(droq->cpu_id, csd);
}
 }
+
+/**
+ * \brief Entry point for NAPI polling
+ * @param napi NAPI structure
+ * @param budget maximum number of items to process
+ */
+int liquidio_napi_poll(struct napi_struct *napi, int budget)
+{
+   struct octeon_instr_queue *iq;
+   struct octeon_device *oct;
+   struct octeon_droq *droq;
+   int tx_done = 0, iq_no;
+   int work_done;
+
+   droq = container_of(napi, struct octeon_droq, napi);
+   oct = droq->oct_dev;
+   iq_no = droq->q_no;
+
+   /* Handle Droq descriptors */
+   work_done = octeon_process_droq_poll_cmd(oct, droq->q_no,
+POLL_EVENT_PROCESS_PKTS,
+budget);
+
+   /* Flush the instruction queue */
+   iq = oct->instr_queue[iq_no];
+   if (iq) {
+   /* TODO: move this check to inside octeon_flush_iq,
+* once check_db_timeout is removed
+*/
+   if (atomic_read(&iq->instr_pending))
+   /* Process iq buffers with in the budget limits */
+   tx_done = octeon_flush_iq(oct, iq, budget);
+   else
+   tx_done = 1;
+   /* Update iq read-index rather than waiting for next interrupt.
+* Return back if tx_done is false.
+*/
+   /* sub-queue status update */
+   lio_update_txq_status(oct, iq_no);
+   } else {
+   dev_err(&oct->pci_dev->dev, "%s:  iq (%d) num invalid\n",
+   __func__, iq_no);
+   }
+
+#define MAX_REG_CNT  200U
+   /* force enable interrupt if reg cnts are high to avoid wraparound */
+   if (((work_done < budget) && (tx_done)) ||
+   (iq->pkt_in_done >= MAX_REG_CNT) ||
+   (droq->pkt_count >= MAX_REG_CNT)) {
+   tx_done = 1;
+   napi_complete_done(napi, work_done);
+
+   octeon_process_droq_poll_cmd(droq->oct_dev, droq->q_no,
+POLL_EVENT_ENABLE_INTR, 0);
+   return 0;
+   }
+
+   return (!tx_done) ? (budget) : (work_done);
+}
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 4241949..b00d199 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -2229,58 +2229,6 @@ static void if_cfg_callback(struct octeon_device *oct,
 }
 
 /**
- * \brief Entry point for NAPI polling
- * @param napi NAPI structure
- * @param budget maximum number of items to process
- */
-static int liquidio_napi_poll(struct napi_struct *napi, int budget)
-{
-   struct octeon_droq *droq;
-   int work_done;
-   int tx_done = 0, iq_no;
-   struct octeon_instr_queue *iq;
-   struct octeon_device *oct;
-
-   droq = container_of(napi, struct octeon_droq, napi);
-   oct = droq->oct_dev;
-   iq_no = droq->q_no;
-   /* Handle Droq descriptors */
-   work_done = octeon_process_droq_poll_cmd(oct, droq->q_no,
-POLL_EVENT_PROCESS_PKTS,
-budget);
-
-   /* Flush the instruction queue */
-   iq = oct->instr_queue[iq_no];
-   if (iq) {
-   if (atomic_read(&iq->instr_pending))
-   /* Process iq buffers with in the budget limits */
-   tx_done = octeon_flush_iq(oct, iq, budget);
-   else
-   tx_done = 1;
-   /* Update iq read-index rather than waiting for next interrupt.
-
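
The tail of liquidio_napi_poll() in the hunk added to lio_core.c decides whether polling is finished and what to return to the NAPI core. A minimal user-space sketch of just that decision is below. The names struct poll_decision and decide_poll() are hypothetical; the threshold of 200 and the return expression are taken from the patch.

```c
#include <stdbool.h>

#define MAX_REG_CNT 200U

/* Hypothetical container for the outcome of one poll pass. */
struct poll_decision {
	bool complete;	/* true: leave polling mode and re-enable interrupts */
	int ret;	/* value handed back to the NAPI core */
};

/* Mirror the final condition of the poll routine: finish when RX ran under
 * budget and TX is done, or when either hardware counter is high enough
 * that interrupts must be re-enabled to avoid wraparound.
 */
static struct poll_decision decide_poll(int work_done, int budget, bool tx_done,
					unsigned int pkt_in_done,
					unsigned int pkt_count)
{
	struct poll_decision d;

	if ((work_done < budget && tx_done) ||
	    pkt_in_done >= MAX_REG_CNT || pkt_count >= MAX_REG_CNT) {
		d.complete = true;
		d.ret = 0;
		return d;
	}

	/* Unfinished TX work reports the full budget so polling continues. */
	d.complete = false;
	d.ret = tx_done ? work_done : budget;
	return d;
}
```

Returning the full budget when TX cleanup is incomplete keeps the queue scheduled even if few RX packets were processed.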

[PATCH V2 net-next 2/8] liquidio: moved update_txq_status to lio_core.c

2017-08-14 Thread Felix Manlunas
From: Intiyaz Basha 

Move the common function update_txq_status() to lio_core.c, renaming it
lio_update_txq_status().

Signed-off-by: Intiyaz Basha 
Signed-off-by: Felix Manlunas 
---
 drivers/net/ethernet/cavium/liquidio/lio_core.c| 33 
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 35 +-
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c | 26 +---
 .../net/ethernet/cavium/liquidio/octeon_network.h  |  1 +
 4 files changed, 36 insertions(+), 59 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_core.c 
b/drivers/net/ethernet/cavium/liquidio/lio_core.c
index adde774..b55ab75 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_core.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_core.c
@@ -364,3 +364,36 @@ void cleanup_rx_oom_poll_fn(struct net_device *netdev)
destroy_workqueue(lio->rxq_status_wq.wq);
}
 }
+
+/* Runs in interrupt context. */
+void lio_update_txq_status(struct octeon_device *oct, int iq_num)
+{
+   struct octeon_instr_queue *iq = oct->instr_queue[iq_num];
+   struct net_device *netdev;
+   struct lio *lio;
+
+   netdev = oct->props[iq->ifidx].netdev;
+
+   /* This is needed because the first IQ does not have
+* a netdev associated with it.
+*/
+   if (!netdev)
+   return;
+
+   lio = GET_LIO(netdev);
+   if (netif_is_multiqueue(netdev)) {
+   if (__netif_subqueue_stopped(netdev, iq->q_index) &&
+   lio->linfo.link.s.link_up &&
+   (!octnet_iq_is_full(oct, iq_num))) {
+   netif_wake_subqueue(netdev, iq->q_index);
+   INCR_INSTRQUEUE_PKT_COUNT(lio->oct_dev, iq_num,
+ tx_restart, 1);
+   }
+   } else if (netif_queue_stopped(netdev) &&
+  lio->linfo.link.s.link_up &&
+  (!octnet_iq_is_full(oct, lio->txq))) {
+   INCR_INSTRQUEUE_PKT_COUNT(lio->oct_dev, lio->txq,
+ tx_restart, 1);
+   netif_wake_queue(netdev);
+   }
+}
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index b20d13f..ba1b493 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -903,39 +903,6 @@ static inline void update_link_status(struct net_device 
*netdev,
}
 }
 
-/* Runs in interrupt context. */
-static void update_txq_status(struct octeon_device *oct, int iq_num)
-{
-   struct net_device *netdev;
-   struct lio *lio;
-   struct octeon_instr_queue *iq = oct->instr_queue[iq_num];
-
-   netdev = oct->props[iq->ifidx].netdev;
-
-   /* This is needed because the first IQ does not have
-* a netdev associated with it.
-*/
-   if (!netdev)
-   return;
-
-   lio = GET_LIO(netdev);
-   if (netif_is_multiqueue(netdev)) {
-   if (__netif_subqueue_stopped(netdev, iq->q_index) &&
-   lio->linfo.link.s.link_up &&
-   (!octnet_iq_is_full(oct, iq_num))) {
-   INCR_INSTRQUEUE_PKT_COUNT(lio->oct_dev, iq_num,
- tx_restart, 1);
-   netif_wake_subqueue(netdev, iq->q_index);
-   }
-   } else if (netif_queue_stopped(netdev) &&
-  lio->linfo.link.s.link_up &&
-  (!octnet_iq_is_full(oct, lio->txq))) {
-   INCR_INSTRQUEUE_PKT_COUNT(lio->oct_dev,
- lio->txq, tx_restart, 1);
-   netif_wake_queue(netdev);
-   }
-}
-
 static
 int liquidio_schedule_msix_droq_pkt_handler(struct octeon_droq *droq, u64 ret)
 {
@@ -2515,7 +2482,7 @@ static int liquidio_napi_poll(struct napi_struct *napi, 
int budget)
/* Update iq read-index rather than waiting for next interrupt.
 * Return back if tx_done is false.
 */
-   update_txq_status(oct, iq_no);
+   lio_update_txq_status(oct, iq_no);
} else {
dev_err(&oct->pci_dev->dev, "%s:  iq (%d) num invalid\n",
__func__, iq_no);
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
index 17623ed..dd0265a 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
@@ -647,30 +647,6 @@ static void update_link_status(struct net_device *netdev,
}
 }
 
-static void update_txq_status(struct octeon_device *oct, int iq_num)
-{
-   struct octeon_instr_queue *iq = oct->instr_queue[iq_num];
-   struct net_device *netdev;
-   struct lio *lio;
-
-   netdev = oct->props[iq->ifidx].netdev;
-   lio = 

[PATCH V2 net-next 1/8] liquidio: moved wait_for_pending_requests to octeon_network.h

2017-08-14 Thread Felix Manlunas
From: Intiyaz Basha 

Move the common function wait_for_pending_requests() to octeon_network.h.

Signed-off-by: Intiyaz Basha 
Signed-off-by: Felix Manlunas 
---
 .../ethernet/cavium/liquidio/cn23xx_vf_device.h|  2 --
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 26 
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c | 28 +-
 .../net/ethernet/cavium/liquidio/octeon_device.h   |  2 ++
 .../net/ethernet/cavium/liquidio/octeon_network.h  | 26 
 5 files changed, 29 insertions(+), 55 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.h 
b/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.h
index 3f98c73..2d06097 100644
--- a/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.h
+++ b/drivers/net/ethernet/cavium/liquidio/cn23xx_vf_device.h
@@ -36,8 +36,6 @@ struct octeon_cn23xx_vf {
 
 #define CN23XX_MAILBOX_MSGPARAM_SIZE   6
 
-#define MAX_VF_IP_OP_PENDING_PKT_COUNT 100
-
 void cn23xx_vf_ask_pf_to_do_flr(struct octeon_device *oct);
 
 int cn23xx_octeon_pfvf_handshake(struct octeon_device *oct);
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 8bf6dfc..b20d13f 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -273,32 +273,6 @@ static void force_io_queues_off(struct octeon_device *oct)
 }
 
 /**
- * \brief wait for all pending requests to complete
- * @param oct Pointer to Octeon device
- *
- * Called during shutdown sequence
- */
-static int wait_for_pending_requests(struct octeon_device *oct)
-{
-   int i, pcount = 0;
-
-   for (i = 0; i < 100; i++) {
-   pcount =
-   atomic_read(&oct->response_list
-   [OCTEON_ORDERED_SC_LIST].pending_req_count);
-   if (pcount)
-   schedule_timeout_uninterruptible(HZ / 10);
-   else
-   break;
-   }
-
-   if (pcount)
-   return 1;
-
-   return 0;
-}
-
-/**
  * \brief Cause device to go quiet so it can be safely removed/reset/etc
  * @param oct Pointer to Octeon device
  */
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
index c6f52f2..17623ed 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
@@ -123,7 +123,7 @@ static int lio_wait_for_oq_pkts(struct octeon_device *oct)
 {
struct octeon_device_priv *oct_priv =
(struct octeon_device_priv *)oct->priv;
-   int retry = MAX_VF_IP_OP_PENDING_PKT_COUNT;
+   int retry = MAX_IO_PENDING_PKT_COUNT;
int pkt_cnt = 0, pending_pkts;
int i;
 
@@ -148,32 +148,6 @@ static int lio_wait_for_oq_pkts(struct octeon_device *oct)
 }
 
 /**
- * \brief wait for all pending requests to complete
- * @param oct Pointer to Octeon device
- *
- * Called during shutdown sequence
- */
-static int wait_for_pending_requests(struct octeon_device *oct)
-{
-   int i, pcount = 0;
-
-   for (i = 0; i < MAX_VF_IP_OP_PENDING_PKT_COUNT; i++) {
-   pcount = atomic_read(
-   &oct->response_list[OCTEON_ORDERED_SC_LIST]
-.pending_req_count);
-   if (pcount)
-   schedule_timeout_uninterruptible(HZ / 10);
-   else
-   break;
-   }
-
-   if (pcount)
-   return 1;
-
-   return 0;
-}
-
-/**
  * \brief Cause device to go quiet so it can be safely removed/reset/etc
  * @param oct Pointer to Octeon device
  */
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_device.h 
b/drivers/net/ethernet/cavium/liquidio/octeon_device.h
index b014e6a..0ad58f9 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_device.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_device.h
@@ -568,6 +568,8 @@ struct octeon_device {
 #define CHIP_CONF(oct, TYPE) \
(((struct octeon_ ## TYPE  *)((oct)->chip))->conf)
 
+#define MAX_IO_PENDING_PKT_COUNT 100
+
 /*-- Function Prototypes --*/
 
 /** Initialize device list memory */
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_network.h 
b/drivers/net/ethernet/cavium/liquidio/octeon_network.h
index ec8504b..043f6e6 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_network.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_network.h
@@ -448,4 +448,30 @@ static inline void ifstate_reset(struct lio *lio, int state_flag)
atomic_set(&lio->ifstate, (atomic_read(&lio->ifstate) & ~(state_flag)));
 }
 
+/**
+ * \brief wait for all pending requests to complete
+ * @param oct Pointer to Octeon device
+ *
+ * Called during shutdown sequence
+ */
+static inline int 
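
The removed copies of wait_for_pending_requests() both follow the same pattern: sample a pending counter a bounded number of times, pause between samples, and report whether anything is still outstanding. A rough user-space sketch of that pattern is below. wait_for_drain(), demo_pending() and the callback parameters are hypothetical; the kernel version reads an atomic counter and sleeps with schedule_timeout_uninterruptible(HZ / 10) instead.

```c
#include <stddef.h>

/* Sample pending(ctx) up to max_tries times, calling pause_fn (if not NULL)
 * between samples. Returns 1 if requests are still pending at the end,
 * 0 if the count dropped to zero.
 */
static int wait_for_drain(int (*pending)(void *ctx), void (*pause_fn)(void),
			  void *ctx, int max_tries)
{
	int i, pcount = 0;

	for (i = 0; i < max_tries; i++) {
		pcount = pending(ctx);
		if (!pcount)
			break;
		if (pause_fn)
			pause_fn();
	}

	return pcount ? 1 : 0;
}

/* Demo source: a counter that drops by one each time it is sampled. */
static int demo_pending(void *ctx)
{
	int *left = ctx;

	if (*left > 0)
		(*left)--;
	return *left;
}
```

The bound matters during shutdown: a device that never completes its requests cannot stall the caller forever, and the nonzero return lets the caller decide how to proceed.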

[PATCH V2 net-next 0/8] liquidio: adding support for ethtool --set-ring feature

2017-08-14 Thread Felix Manlunas
From: Intiyaz Basha 

Adding the ethtool --set-ring feature requires some code reorganization.
The first seven patches reorganize the code; the last patch adds the
feature.

Change Log:
V1 -> V2
 Only patch #8 was changed:  unnecessary parentheses were removed in two
 if-statements in lio_ethtool_set_ringparam().

Intiyaz Basha (8):
  liquidio: moved wait_for_pending_requests to octeon_network.h
  liquidio: moved update_txq_status to lio_core.c
  liquidio: moved octeon_setup_droq to lio_core.c
  liquidio: moved liquidio_push_packet to lio_core.c
  liquidio: moved liquidio_napi_drv_callback to lio_core.c
  liquidio: moved liquidio_napi_poll to lio_core.c
  liquidio: moved liquidio_setup_io_queues to lio_core.c
  liquidio: added support for ethtool --set-ring feature

 .../ethernet/cavium/liquidio/cn23xx_vf_device.h|   2 -
 drivers/net/ethernet/cavium/liquidio/lio_core.c| 424 +++
 drivers/net/ethernet/cavium/liquidio/lio_ethtool.c | 131 ++
 drivers/net/ethernet/cavium/liquidio/lio_main.c| 448 +
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c | 379 +
 .../net/ethernet/cavium/liquidio/octeon_config.h   |  13 +-
 .../net/ethernet/cavium/liquidio/octeon_device.c   |  14 +-
 .../net/ethernet/cavium/liquidio/octeon_device.h   |   2 +
 .../net/ethernet/cavium/liquidio/octeon_network.h  |  29 ++
 9 files changed, 617 insertions(+), 825 deletions(-)

-- 
2.9.0



Re: [Intel-wired-lan] [PATCH 4/6] [net-next]net: i40e: Admin queue definitions for cloud filters

2017-08-14 Thread Nambiar, Amritha
On 8/1/2017 12:16 PM, Shannon Nelson wrote:
> On 7/31/2017 5:37 PM, Amritha Nambiar wrote:
>> Add new admin queue definitions and extended fields for cloud
>> filter support. Define big buffer for extended general fields
>> in Add/Remove Cloud filters command.
>>
>> Signed-off-by: Amritha Nambiar 
>> Signed-off-by: Kiran Patil 
>> Signed-off-by: Store Laura 
>> Signed-off-by: Iremonger Bernard 
>> Signed-off-by: Jingjing Wu 
>> ---
>>   drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h |   98 
>> +
>>   1 file changed, 97 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h 
>> b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
>> index 8bba04c..9f14305 100644
>> --- a/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
>> +++ b/drivers/net/ethernet/intel/i40e/i40e_adminq_cmd.h
> 
> I see that you're changing the i40e version of this, but not the i40evf 
> version.  I understand that these changes are not useful for the VF, but 
> are you no longer trying to keep the AdminQ definitions consistent 
> between the two?

I will add these definitions to the VF as well for consistency in the
next version.

> 
>> @@ -1358,7 +1358,9 @@ struct i40e_aqc_add_remove_cloud_filters {
>>   #define I40E_AQC_ADD_CLOUD_CMD_SEID_NUM_SHIFT  0
>>   #define I40E_AQC_ADD_CLOUD_CMD_SEID_NUM_MASK   (0x3FF << \
>>  I40E_AQC_ADD_CLOUD_CMD_SEID_NUM_SHIFT)
>> -u8  reserved2[4];
>> +u8  big_buffer_flag;
>> +#define I40E_AQC_ADD_REM_CLOUD_CMD_BIG_BUFFER   1
>> +u8  reserved2[3];
>>  __le32  addr_high;
>>  __le32  addr_low;
>>   };
>> @@ -1395,6 +1397,13 @@ struct i40e_aqc_add_remove_cloud_filters_element_data 
>> {
>>   #define I40E_AQC_ADD_CLOUD_FILTER_IMAC 0x000A
>>   #define I40E_AQC_ADD_CLOUD_FILTER_OMAC_TEN_ID_IMAC 0x000B
>>   #define I40E_AQC_ADD_CLOUD_FILTER_IIP  0x000C
>> +/* 0x0010 to 0x0017 is for custom filters */
>> +/* flag to be used when adding cloud filter: IP + L4 Port */
>> +#define I40E_AQC_ADD_CLOUD_FILTER_IP_PORT   0x0010
>> +/* flag to be used when adding cloud filter: Dest MAC + L4 Port */
>> +#define I40E_AQC_ADD_CLOUD_FILTER_MAC_PORT  0x0011
>> +/* flag to be used when adding cloud filter: Dest MAC + VLAN + L4 Port */
>> +#define I40E_AQC_ADD_CLOUD_FILTER_MAC_VLAN_PORT 0x0012
>>   
>>   #define I40E_AQC_ADD_CLOUD_FLAGS_TO_QUEUE  0x0080
>>   #define I40E_AQC_ADD_CLOUD_VNK_SHIFT   6
>> @@ -1429,6 +1438,45 @@ struct i40e_aqc_add_remove_cloud_filters_element_data 
>> {
>>  u8  response_reserved[7];
>>   };
> 
> I know you didn't add this struct, but where's the I40E_CHECK_STRUCT_LEN 
> check?

Will add all the needed I40E_CHECK_STRUCT_LEN check in the next version
of the series.

> 
>>   
>> +/* i40e_aqc_add_remove_cloud_filters_element_big_data is used when
>> + * I40E_AQC_ADD_REM_CLOUD_CMD_BIG_BUFFER flag is set.
>> + */
>> +struct i40e_aqc_add_remove_cloud_filters_element_big_data {
>> +struct i40e_aqc_add_remove_cloud_filters_element_data element;
>> +u16 general_fields[32];
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X10_WORD0 0
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X10_WORD1 1
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X10_WORD2 2
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X11_WORD0 3
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X11_WORD1 4
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X11_WORD2 5
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X12_WORD0 6
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X12_WORD1 7
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X12_WORD2 8
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X13_WORD0 9
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X13_WORD1 10
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X13_WORD2 11
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X14_WORD0 12
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X14_WORD1 13
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X14_WORD2 14
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD0 15
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD1 16
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD2 17
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD3 18
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD4 19
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD5 20
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD6 21
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X16_WORD7 22
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD0 23
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD1 24
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD2 25
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD3 26
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD4 27
>> +#define I40E_AQC_ADD_CLOUD_FV_FLU_0X17_WORD5  

Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs

2017-08-14 Thread Thiago Macieira
On Monday, 14 August 2017 11:46:42 PDT Willem de Bruijn wrote:
> > By the way, what were the usecases for the peek offset feature?
> 
> The idea was to be able to peek at application headers of upper
> layer protocols and multiplex messages among threads. It proved
> so complex even for UDP that we did not attempt the same feature
> for TCP. Also, KCM implements demultiplexing using eBPF today.

Interesting, but how would userspace coordinate like that? Suppose multiple 
threads are woken up by a datagram being received, they peek at a certain 
offset shared among them all to see which one reads. Suppose that thread is 
slow or blocked and, while it's getting its act together, another datagram 
arrives.

Because of that, the other threads can't disable their polling. They will 
continually be woken up by the kernel if they go back to poll/select. Even 
with epoll, there's no new edge trigger since the event is already at level.

How will they avoid busy-waiting? And won't this secondary coordination 
obviate the need for offset peeking?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center



Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs

2017-08-14 Thread Willem de Bruijn
On Mon, Aug 14, 2017 at 2:33 PM, Thiago Macieira
 wrote:
> On Monday, 14 August 2017 11:25:23 PDT Willem de Bruijn wrote:
>> > Do applications using SOCK_DGRAM rely on the behaviour of skipping over
>> > datagrams that are too short?
>>
>> It is established behavior. It cannot be ruled out that an application
>> somewhere depends on it.
>
> Understood.
>
> By the way, what were the usecases for the peek offset feature?

The idea was to be able to peek at application headers of upper
layer protocols and multiplex messages among threads. It proved
so complex even for UDP that we did not attempt the same feature
for TCP. Also, KCM implements demultiplexing using eBPF today.

> Also, do they apply to non-peeking recv?

Reading at an offset is not implemented. Especially for RPC over
TCP, reading at an offset could make sense. Say, if a message
is received completely, but it is head-of-line blocked behind
another that has a hole. Here, too, KCM is an alternative.


[PATCH v2] sctp: fully initialize the IPv6 address in sctp_v6_to_addr()

2017-08-14 Thread Alexander Potapenko
KMSAN reported use of uninitialized sctp_addr->v4.sin_addr.s_addr and
sctp_addr->v6.sin6_scope_id in sctp_v6_cmp_addr() (see below).
Make sure all fields of an IPv6 address are initialized, which
guarantees that the IPv4 fields are also initialized.

==
 BUG: KMSAN: use of uninitialized memory in sctp_v6_cmp_addr+0x8d4/0x9f0
 net/sctp/ipv6.c:517
 CPU: 2 PID: 31056 Comm: syz-executor1 Not tainted 4.11.0-rc5+ #2944
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
 01/01/2011
 Call Trace:
  dump_stack+0x172/0x1c0 lib/dump_stack.c:42
  is_logbuf_locked mm/kmsan/kmsan.c:59 [inline]
  kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:938
  native_save_fl arch/x86/include/asm/irqflags.h:18 [inline]
  arch_local_save_flags arch/x86/include/asm/irqflags.h:72 [inline]
  arch_local_irq_save arch/x86/include/asm/irqflags.h:113 [inline]
  __msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:467
  sctp_v6_cmp_addr+0x8d4/0x9f0 net/sctp/ipv6.c:517
  sctp_v6_get_dst+0x8c7/0x1630 net/sctp/ipv6.c:290
  sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
  sctp_assoc_add_peer+0x66d/0x16f0 net/sctp/associola.c:651
  sctp_sendmsg+0x35a5/0x4f90 net/sctp/socket.c:1871
  inet_sendmsg+0x498/0x670 net/ipv4/af_inet.c:762
  sock_sendmsg_nosec net/socket.c:633 [inline]
  sock_sendmsg net/socket.c:643 [inline]
  SYSC_sendto+0x608/0x710 net/socket.c:1696
  SyS_sendto+0x8a/0xb0 net/socket.c:1664
  entry_SYSCALL_64_fastpath+0x13/0x94
 RIP: 0033:0x44b479
 RSP: 002b:7f6213f21c08 EFLAGS: 0286 ORIG_RAX: 002c
 RAX: ffda RBX: 2000 RCX: 0044b479
 RDX: 0041 RSI: 20edd000 RDI: 0006
 RBP: 007080a8 R08: 20b85fe4 R09: 001c
 R10: 00040005 R11: 0286 R12: 
 R13: 3760 R14: 006e5820 R15: 00ff8000
 origin description: dst_saddr@sctp_v6_get_dst
 local variable created at:
  sk_fullsock include/net/sock.h:2321 [inline]
  inet6_sk include/linux/ipv6.h:309 [inline]
  sctp_v6_get_dst+0x91/0x1630 net/sctp/ipv6.c:241
  sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
==
 BUG: KMSAN: use of uninitialized memory in sctp_v6_cmp_addr+0x8d4/0x9f0
 net/sctp/ipv6.c:517
 CPU: 2 PID: 31056 Comm: syz-executor1 Not tainted 4.11.0-rc5+ #2944
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
 01/01/2011
 Call Trace:
  dump_stack+0x172/0x1c0 lib/dump_stack.c:42
  is_logbuf_locked mm/kmsan/kmsan.c:59 [inline]
  kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:938
  native_save_fl arch/x86/include/asm/irqflags.h:18 [inline]
  arch_local_save_flags arch/x86/include/asm/irqflags.h:72 [inline]
  arch_local_irq_save arch/x86/include/asm/irqflags.h:113 [inline]
  __msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:467
  sctp_v6_cmp_addr+0x8d4/0x9f0 net/sctp/ipv6.c:517
  sctp_v6_get_dst+0x8c7/0x1630 net/sctp/ipv6.c:290
  sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
  sctp_assoc_add_peer+0x66d/0x16f0 net/sctp/associola.c:651
  sctp_sendmsg+0x35a5/0x4f90 net/sctp/socket.c:1871
  inet_sendmsg+0x498/0x670 net/ipv4/af_inet.c:762
  sock_sendmsg_nosec net/socket.c:633 [inline]
  sock_sendmsg net/socket.c:643 [inline]
  SYSC_sendto+0x608/0x710 net/socket.c:1696
  SyS_sendto+0x8a/0xb0 net/socket.c:1664
  entry_SYSCALL_64_fastpath+0x13/0x94
 RIP: 0033:0x44b479
 RSP: 002b:7f6213f21c08 EFLAGS: 0286 ORIG_RAX: 002c
 RAX: ffda RBX: 2000 RCX: 0044b479
 RDX: 0041 RSI: 20edd000 RDI: 0006
 RBP: 007080a8 R08: 20b85fe4 R09: 001c
 R10: 00040005 R11: 0286 R12: 
 R13: 3760 R14: 006e5820 R15: 00ff8000
 origin description: dst_saddr@sctp_v6_get_dst
 local variable created at:
  sk_fullsock include/net/sock.h:2321 [inline]
  inet6_sk include/linux/ipv6.h:309 [inline]
  sctp_v6_get_dst+0x91/0x1630 net/sctp/ipv6.c:241
  sctp_transport_route+0x101/0x570 net/sctp/transport.c:292
==

Signed-off-by: Alexander Potapenko 
Reviewed-by: Xin Long 
---
v2 is identical to v1, resending per request by Marcelo Ricardo Leitner.
---
 net/sctp/ipv6.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 2a186b201ad2..a15d691829c6 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -513,6 +513,8 @@ static void sctp_v6_to_addr(union sctp_addr *addr, struct 
in6_addr *saddr,
addr->sa.sa_family = AF_INET6;
addr->v6.sin6_port = port;
addr->v6.sin6_addr = *saddr;
+   addr->v6.sin6_flowinfo = 0;
+   addr->v6.sin6_scope_id = 0;
 }
 
 /* Compare addresses exactly.
-- 
2.14.0.434.g98096fd7a8-goog
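
The same idea can be shown in user space with struct sockaddr_in6: when an address is built field by field in storage that is not zeroed, every member has to be assigned, or a later whole-address comparison may read stale bytes. This is only a sketch; fill_v6_addr() is a hypothetical helper, not the kernel's sctp_v6_to_addr().

```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Build an IPv6 socket address, assigning every member explicitly so that
 * no field keeps whatever was previously in the caller's storage.
 */
static void fill_v6_addr(struct sockaddr_in6 *addr,
			 const struct in6_addr *saddr, uint16_t port)
{
	addr->sin6_family = AF_INET6;
	addr->sin6_port = port;
	addr->sin6_addr = *saddr;
	addr->sin6_flowinfo = 0;
	addr->sin6_scope_id = 0;
}
```

An alternative is to memset() the whole structure first; assigning the two remaining members is the narrower change and matches what the patch does.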



Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs

2017-08-14 Thread Thiago Macieira
On Monday, 14 August 2017 11:25:23 PDT Willem de Bruijn wrote:
> > Do applications using SOCK_DGRAM rely on the behaviour of skipping over
> > datagrams that are too short?
> 
> It is established behavior. It cannot be ruled out that an application
> somewhere depends on it.

Understood.

By the way, what were the usecases for the peek offset feature?

Also, do they apply to non-peeking recv?

-- 
Thiago Macieira - thiago.macieira (AT) intel.com
  Software Architect - Intel Open Source Technology Center



Re: [PATCH net] datagram: When peeking datagrams with offset < 0 don't skip empty skbs

2017-08-14 Thread Willem de Bruijn
On Mon, Aug 14, 2017 at 1:02 PM, Thiago Macieira
 wrote:
> On Monday, 14 August 2017 09:33:48 PDT Willem de Bruijn wrote:
>> > But here's a question: if the peek offset is equal to the length, should
>> > the reading return an empty datagram? This would indicate to the caller
>> > that there was a datagram there, which was skipped over.
>>
>> In the general case, no, it should read at the offset, which is the next
>> skb.
>
> I beg to differ. In this particular case, we are talking about datagrams. If
> it were stream sockets, I would agree with you: just skip to the next. But in
> datagrams, the same way you do return zero-sized ones, I would return an empty
> one if you peeked at or past the end.

It can be argued either way. I would not change it in the scope of
this bug.

>
>> Since we only need to change no-offset semantics to fix this bug,
>> I would not change this behavior, which is also expected by some
>> applications by now.
>
> Do applications using SOCK_DGRAM rely on the behaviour of skipping over
> datagrams that are too short?

It is established behavior. It cannot be ruled out that an application
somewhere depends on it.


Re: [net 1/1] tipc: avoid inheriting msg_non_seq flag when message is returned

2017-08-14 Thread David Miller
From: Jon Maloy 
Date: Mon, 14 Aug 2017 18:28:49 +0200

> In the function msg_reverse(), we reverse the header while trying to
> reuse the original buffer whenever possible. Those rejected/returned
> messages are always transmitted as unicast, but the msg_non_seq field
> is not explicitly set to zero as it should be.
> 
> We have seen cases where multicast senders set the message type to
> "NOT dest_droppable", meaning that a multicast message shorter than
> one MTU will be returned, e.g., during receive buffer overflow, by
> reusing the original buffer. This has the effect that even the
> 'msg_non_seq' field is inadvertently inherited by the rejected message,
> although it is now sent as a unicast message. This again leads the
> receiving unicast link endpoint to steer the packet toward the broadcast
> link receive function, where it is dropped. The affected unicast link is
> thereafter (after 100 failed retransmissions) declared 'stale' and
> reset.
> 
> We fix this by unconditionally setting the 'msg_non_seq' flag to zero
> for all rejected/returned messages.
> 
> Reported-by: Canh Duc Luu 
> Signed-off-by: Jon Maloy 

Also applied, thanks again.
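
A toy model of the fix described in the quoted commit message: when a header is reversed in place for return to the sender, any flag that only makes sense for the original multicast transmission has to be cleared explicitly, because reusing the buffer carries it over. struct demo_hdr and demo_reverse() are hypothetical and do not reflect the real TIPC header layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified message header. */
struct demo_hdr {
	uint32_t src;
	uint32_t dst;
	bool non_seq;	/* set for broadcast/multicast transmissions */
};

/* Reverse a header in place for return to the sender. The returned message
 * always travels as unicast, so non_seq is cleared unconditionally.
 */
static void demo_reverse(struct demo_hdr *h)
{
	uint32_t tmp = h->src;

	h->src = h->dst;
	h->dst = tmp;
	h->non_seq = false;
}
```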


Re: [net 1/1] tipc: accept PACKET_MULTICAST packets

2017-08-14 Thread David Miller
From: Jon Maloy 
Date: Mon, 14 Aug 2017 17:55:56 +0200

> On L2 bearers, the TIPC broadcast function is sending out packets using
> the corresponding L2 broadcast address. At reception, we filter such
> packets under the assumption that they will also be delivered as
> broadcast packets.
> 
> This assumption doesn't always hold true. Under high load, we have seen
> that a switch may convert the destination address and deliver the packet
> as a PACKET_MULTICAST, something leading to inadvertently dropped
> packets and a stale and reset broadcast link.
> 
> We fix this by extending the reception filtering to accept packets of
> type PACKET_MULTICAST.
> 
> Signed-off-by: Jon Maloy 

Applied, thanks Jon.
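
The widened receive filter can be pictured as a check on the packet type. In the sketch below the DEMO_PACKET_* values mirror the PACKET_* constants in linux/if_packet.h, but demo_accept() is hypothetical, and writing the filter as a single range comparison is an assumption about how such a check might look, not a quote of the patch.

```c
#include <stdbool.h>

/* Values mirror PACKET_HOST .. PACKET_OTHERHOST in linux/if_packet.h. */
enum {
	DEMO_PACKET_HOST = 0,
	DEMO_PACKET_BROADCAST = 1,
	DEMO_PACKET_MULTICAST = 2,
	DEMO_PACKET_OTHERHOST = 3,
};

/* Accept frames addressed to this host, broadcast frames and, after the
 * fix, multicast frames; frames for other hosts are still dropped.
 */
static bool demo_accept(int pkt_type)
{
	return pkt_type <= DEMO_PACKET_MULTICAST;
}
```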


Re: [patch v2 0/2] Enable Mellanox switch device in I2C mode

2017-08-14 Thread David Miller
From: Ohad Oz 
Date: Mon, 14 Aug 2017 15:38:20 +

> The following patch set updates global to Mellanox Kconfig files to support
> configuration of Mellanox Switch (mlxsw) without PCI and with I2C only.

Series applied to net-next.


Re: [PATCH net-next v3] net: phy: Use tab for indentation in Kconfig

2017-08-14 Thread David Miller
From: Michal Simek 
Date: Mon, 14 Aug 2017 15:43:00 +0200

> Use tabs instead of spaces for indentation.
> 
> Signed-off-by: Michal Simek 
> Reviewed-by: Andrew Lunn 

Applied, thanks.

