Re: KMSAN: uninit-value in __netif_receive_skb_core

2018-04-12 Thread syzbot
syzbot has found reproducer for the following crash on  
https://github.com/google/kmsan.git/master commit

35ff515e4bda2646f6c881d33951c306ea9c282a (Tue Apr 10 08:59:43 2018 +)
Merge pull request #11 from parkerduckworth/readme
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=b202b7208664142954fa


So far this crash happened 3 times on  
https://github.com/google/kmsan.git/master.

C reproducer: https://syzkaller.appspot.com/x/repro.c?id=455991623680
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=4590273065648128
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=4631921027973120
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=6627248707860932248

compiler: clang version 7.0.0 (trunk 329391)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+b202b720866414295...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed.

==
BUG: KMSAN: uninit-value in __read_once_size include/linux/compiler.h:197 [inline]
BUG: KMSAN: uninit-value in deliver_ptype_list_skb net/core/dev.c:1908 [inline]
BUG: KMSAN: uninit-value in __netif_receive_skb_core+0x4630/0x4a80 net/core/dev.c:4545

CPU: 0 PID: 3514 Comm: syzkaller031167 Not tainted 4.16.0+ #83
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Call Trace:
 
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 __read_once_size include/linux/compiler.h:197 [inline]
 deliver_ptype_list_skb net/core/dev.c:1908 [inline]
 __netif_receive_skb_core+0x4630/0x4a80 net/core/dev.c:4545
 __netif_receive_skb net/core/dev.c:4627 [inline]
 process_backlog+0x62d/0xe20 net/core/dev.c:5307
 napi_poll net/core/dev.c:5705 [inline]
 net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
 __do_softirq+0x56d/0x93d kernel/softirq.c:285
 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1040
 
 do_softirq kernel/softirq.c:329 [inline]
 __local_bh_enable_ip+0x114/0x140 kernel/softirq.c:182
 local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
 rcu_read_unlock_bh include/linux/rcupdate.h:726 [inline]
 __dev_queue_xmit+0x2a31/0x2b60 net/core/dev.c:3584
 dev_queue_xmit+0x4b/0x60 net/core/dev.c:3590
 packet_snd net/packet/af_packet.c:2944 [inline]
 packet_sendmsg+0x7c57/0x8a10 net/packet/af_packet.c:2969
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 sock_write_iter+0x3b9/0x470 net/socket.c:909
 do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
 do_iter_write+0x30d/0xd40 fs/read_write.c:932
 vfs_writev fs/read_write.c:977 [inline]
 do_writev+0x3c9/0x830 fs/read_write.c:1012
 SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
 SyS_writev+0x56/0x80 fs/read_write.c:1082
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x43ffb9
RSP: 002b:7ffd42187708 EFLAGS: 0217 ORIG_RAX: 0014
RAX: ffda RBX: 004002c8 RCX: 0043ffb9
RDX: 0001 RSI: 200010c0 RDI: 0003
RBP: 006ca018 R08: 004002c8 R09: 004002c8
R10: 004002c8 R11: 0217 R12: 004018e0
R13: 00401970 R14:  R15: 

Uninit was stored to memory at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_save_stack mm/kmsan/kmsan.c:293 [inline]
 kmsan_internal_chain_origin+0x12b/0x210 mm/kmsan/kmsan.c:684
 __msan_chain_origin+0x69/0xc0 mm/kmsan/kmsan_instr.c:521
 skb_vlan_untag+0x950/0xee0 include/linux/if_vlan.h:597
 __netif_receive_skb_core+0x70a/0x4a80 net/core/dev.c:4460
 __netif_receive_skb net/core/dev.c:4627 [inline]
 process_backlog+0x62d/0xe20 net/core/dev.c:5307
 napi_poll net/core/dev.c:5705 [inline]
 net_rx_action+0x7c1/0x1a70 net/core/dev.c:5771
 __do_softirq+0x56d/0x93d kernel/softirq.c:285
Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
 kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
 slab_post_alloc_hook mm/slab.h:445 [inline]
 slab_alloc_node mm/slub.c:2737 [inline]
 __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
 __kmalloc_reserve net/core/skbuff.c:138 [inline]
 __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
 alloc_skb include/linux/skbuff.h:984 [inline]
 alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234
 sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085
 packet_alloc_skb net/packet/af_packet.c:2803 [inline]
 packet_snd net/packet/af_packet.c:2894 [inline]
 packet_sendmsg+0x6444/0x8a10 net/packet/af_packet.c:2969
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg net/socket.c:640 [inline]
 sock_write_iter+0x3b9/0x470 

Re: iproute2-4.16.0 no longer accepts routes via fe80::1

2018-04-12 Thread David Ahern
On 4/12/18 6:41 AM, Thomas Deutschmann wrote:
> Hi,
> 
> well, it isn't just "fe80::1", it is any IPv6 address which
> will be rejected if not called with "-6". I ran a bisect:
> 
>> git bisect start
>> # good: [50b8a842e8c098cddb213f5b3076526df88826e8] v4.15.0
>> git bisect good 50b8a842e8c098cddb213f5b3076526df88826e8
>> # bad: [4b6c4177ee66421770f0bbcc765c29135e44d921] v4.16.0
>> git bisect bad 4b6c4177ee66421770f0bbcc765c29135e44d921
>> # bad: [5f4892e2c8d4fb22118713e0c83290b352fe0e34] rdma: Make visible the number of arguments
>> git bisect bad 5f4892e2c8d4fb22118713e0c83290b352fe0e34
>> # good: [8c75f69411bc8c3affe5d173afcf981d15f5da15] Merge branch 'master' into net-next
>> git bisect good 8c75f69411bc8c3affe5d173afcf981d15f5da15
>> # bad: [27c523e209ab956ff269afec68c6e744e7f5edb6] utils: Introduce get_addr_rta() and inet_addr_match_rta()
>> git bisect bad 27c523e209ab956ff269afec68c6e744e7f5edb6
>> # bad: [d0bcedd549566a87354aa804df3be6be80681ee9] tc: introduce tc_qdisc_block_exists helper
>> git bisect bad d0bcedd549566a87354aa804df3be6be80681ee9
>> # bad: [6c4b672738acf680ee98c10e79a52a8dede5f9a6] iplink_geneve: Get rid of inet_get_addr()
>> git bisect bad 6c4b672738acf680ee98c10e79a52a8dede5f9a6
>> # bad: [93fa12418dc6f5943692250244be303bb162175b] utils: Always specify family and ->bytelen in get_prefix_1()
>> git bisect bad 93fa12418dc6f5943692250244be303bb162175b
>> # good: [f2522007d8fee924cb098b4afc8af16f2b25829f] utils: Always specify family for address in get_addr_1()
>> git bisect good f2522007d8fee924cb098b4afc8af16f2b25829f
>> # first bad commit: [93fa12418dc6f5943692250244be303bb162175b] utils: Always specify family and ->bytelen in get_prefix_1()
> 
>> From 93fa12418dc6f5943692250244be303bb162175b Mon Sep 17 00:00:00 2001
>> From: Serhey Popovych
>> Date: Thu, 18 Jan 2018 20:13:43 +0200
>> Subject: utils: Always specify family and ->bytelen in get_prefix_1()
>>
>> Handle default/all/any special case in get_addr_1() to setup
>> ->family and ->bytelen correctly.
>>
>> Make get_addr_1() return ->bitlen == -2 instead of -1 to
>> distinguish default/all/any special case from the rest:
>> it is safe because all callers check ->bitlen < 0, not
>> explicit value -1.
>>
>> Reduce indentation by one level and get rid of goto/label
>> to make code more readable.
>>
>> Signed-off-by: Serhey Popovych
>> Signed-off-by: David Ahern
> 
> https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/commit/?id=93fa12418dc6f5943692250244be303bb162175b
> 
> So was this an intended behavior change? I.e. this will require
> updates for various user space tools/network configuration scripts
> which rely on the ip utility's ability to auto-detect the inet family,
> which was "supported" (at least working) until 4.16.0...
> 
> 

Not intentional. Serhey, please take a look.
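
For readers unfamiliar with the auto-detection being discussed, here is a minimal sketch of the idea. It is not iproute2's get_addr_1()/get_prefix_1() code; the helper name detect_family() is hypothetical, and it only assumes that, when no family was requested (AF_UNSPEC), the family can be guessed from the presence of ':' in the string and then confirmed with inet_pton().

```c
#include <arpa/inet.h>
#include <assert.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical helper, not part of iproute2: return the address family
 * of a textual address, or -1 if it cannot be parsed. `hint` is the
 * family requested by the caller (AF_UNSPEC when no -4/-6 was given).
 */
static int detect_family(const char *text, int hint)
{
	unsigned char buf[sizeof(struct in6_addr)];
	int family = hint;

	assert(text != NULL);

	/* No explicit family: guess from the textual form. */
	if (family == AF_UNSPEC)
		family = strchr(text, ':') ? AF_INET6 : AF_INET;

	if (family != AF_INET && family != AF_INET6)
		return -1;

	/* inet_pton() returns 1 only if the string is valid for `family`. */
	return inet_pton(family, text, buf) == 1 ? family : -1;
}
```

With this kind of fallback, "fe80::1" is accepted without "-6", which is the behavior the reporter says was lost in 4.16.0.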


Re: [PATCH v5 05/14] PCI: Add pcie_print_link_status() to log link speed and whether it's limited

2018-04-12 Thread Jakub Kicinski
On Fri, 30 Mar 2018 16:05:18 -0500, Bjorn Helgaas wrote:
> + if (bw_avail >= bw_cap)
> + pci_info(dev, "%d Mb/s available bandwidth (%s x%d link)\n",
> +  bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);
> + else
> + pci_info(dev, "%d Mb/s available bandwidth, limited by %s x%d link at %s (capable of %d Mb/s with %s x%d link)\n",
> +  bw_avail, PCIE_SPEED2STR(speed), width,
> +  limiting_dev ? pci_name(limiting_dev) : "",
> +  bw_cap, PCIE_SPEED2STR(speed_cap), width_cap);

I was just looking at using this new function to print PCIe BW for a
NIC, but I'm slightly worried that there is nothing in the message that
says PCIe...  For a NIC some people may interpret the bandwidth as NIC
bandwidth:

[   39.839989] nfp :04:00.0: Netronome Flow Processor NFP4000/NFP6000 PCIe Card Probe
[   39.848943] nfp :04:00.0: 63.008 Gb/s available bandwidth (8 GT/s x8 link)
[   39.857146] nfp :04:00.0: RESERVED BARs: 0.0: General/MSI-X SRAM, 0.1: PCIe XPB/MSI-X PBA, 0.4: Explicit0, 0.5: Explicit1, fre4

It's not a 63Gbps NIC...  I'm sorry if this was discussed before and I
didn't find it.  Would it make sense to add the "PCIe: " prefix to the
message like bnx2x used to do?  Like:

nfp :04:00.0: PCIe: 63.008 Gb/s available bandwidth (8 GT/s x8 link)

Sorry for a very late comment.
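
As a side note on where the 63.008 Gb/s figure in the log comes from, the arithmetic can be sketched as follows. This assumes the usual 128b/130b encoding overhead for 8 GT/s links and integer truncation of the per-lane rate; the helper name is hypothetical and this is not the kernel's implementation.

```c
#include <assert.h>

/*
 * Hypothetical helper: usable bandwidth in Mb/s of a PCIe link running
 * at 8 GT/s per lane, assuming 128b/130b encoding. Integer division
 * truncates the per-lane rate (8000 * 128 / 130 = 7876).
 */
static unsigned int pcie_8gt_bandwidth_mbps(unsigned int lanes)
{
	const unsigned int raw_mtps = 8000;	/* 8 GT/s expressed in MT/s */
	unsigned int per_lane = raw_mtps * 128 / 130;

	assert(lanes >= 1 && lanes <= 32);
	return per_lane * lanes;
}
```

For an x8 link this gives 63008 Mb/s, i.e. the 63.008 Gb/s printed above, which is PCIe link bandwidth and not the NIC's line rate.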


Re: [RFC v2] virtio: support packed ring

2018-04-12 Thread Jason Wang



On 2018-04-01 22:12, Tiwei Bie wrote:

Hello everyone,

This RFC implements packed ring support for virtio driver.

The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
Minor changes are needed for the vhost code, e.g. to kick the guest.

TODO:
- Refinements and bug fixes;
- Split into small patches;
- Test indirect descriptor support;
- Test/fix event suppression support;
- Test devices other than net;

RFC v1 -> RFC v2:
- Add indirect descriptor support - compile test only;
- Add event suppression support - compile test only;
- Move vring_packed_init() out of uapi (Jason, MST);
- Merge two loops into one in virtqueue_add_packed() (Jason);
- Split vring_unmap_one() for packed ring and split ring (Jason);
- Avoid using '%' operator (Jason);
- Rename free_head -> next_avail_idx (Jason);
- Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
- Some other refinements and bug fixes;

Thanks!
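
As a rough illustration of the "avoid using '%' operator" and wrap counter items in the changelog above, the following sketch advances a ring index without a modulo and flips a one-bit wrap counter on wrap-around. The struct and function names are hypothetical and heavily simplified; this is not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified driver-side state for a packed ring. */
struct ring_state {
	unsigned int num;		/* ring size (number of descriptors) */
	unsigned int next_avail_idx;	/* index of the next avail descriptor */
	uint8_t wrap_counter;		/* flips each time the index wraps */
};

/* Advance by one descriptor, using compare-and-reset instead of '%'. */
static void ring_advance(struct ring_state *r)
{
	assert(r->num > 0 && r->next_avail_idx < r->num);

	if (++r->next_avail_idx >= r->num) {
		r->next_avail_idx = 0;
		r->wrap_counter ^= 1;
	}
}
```

The compare-and-reset form works for any ring size, not only powers of two, and avoids a division on the hot path.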

Signed-off-by: Tiwei Bie 
---
  drivers/virtio/virtio_ring.c   | 1094 +---
  include/linux/virtio_ring.h|8 +-
  include/uapi/linux/virtio_config.h |   12 +-
  include/uapi/linux/virtio_ring.h   |   61 ++
  4 files changed, 980 insertions(+), 195 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..0515dca34d77 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -58,14 +58,15 @@
  
  struct vring_desc_state {

void *data; /* Data for callback. */
-   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+   void *indir_desc;   /* Indirect descriptor, if any. */
+   int num;/* Descriptor list length. */
  };
  
  struct vring_virtqueue {

struct virtqueue vq;
  
-	/* Actual memory layout for this queue */

-   struct vring vring;
+   /* Is this a packed ring? */
+   bool packed;
  
  	/* Can we use weak barriers? */

bool weak_barriers;
@@ -79,19 +80,45 @@ struct vring_virtqueue {
/* Host publishes avail event idx */
bool event;
  
-	/* Head of free buffer list. */

-   unsigned int free_head;
/* Number we've added since last sync. */
unsigned int num_added;
  
  	/* Last used index we've seen. */

u16 last_used_idx;
  
-	/* Last written value to avail->flags */

-   u16 avail_flags_shadow;
+   union {
+   /* Available for split ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring vring;
  
-	/* Last written value to avail->idx in guest byte order */

-   u16 avail_idx_shadow;
+   /* Head of free buffer list. */
+   unsigned int free_head;
+
+   /* Last written value to avail->flags */
+   u16 avail_flags_shadow;
+
+   /* Last written value to avail->idx in
+* guest byte order. */
+   u16 avail_idx_shadow;
+   };
+
+   /* Available for packed ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring_packed vring_packed;
+
+   /* Driver ring wrap counter. */
+   u8 wrap_counter;
+
+   /* Index of the next avail descriptor. */
+   unsigned int next_avail_idx;
+
+   /* Last written value to driver->flags in
+* guest byte order. */
+   u16 event_flags_shadow;
+   };
+   };
  
  	/* How to notify other side. FIXME: commonalize hcalls! */

bool (*notify)(struct virtqueue *vq);
@@ -201,8 +228,33 @@ static dma_addr_t vring_map_single(const struct 
vring_virtqueue *vq,
  cpu_addr, size, direction);
  }
  
-static void vring_unmap_one(const struct vring_virtqueue *vq,

-   struct vring_desc *desc)
+static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+ struct vring_desc *desc)
+{
+   u16 flags;
+
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return;
+
+   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  

Re: [PATCH] iscsi: respond to netlink with unicast when appropriate

2018-04-12 Thread Martin K. Petersen

Chris,

> Instead of always multicasting responses, send a unicast netlink message
> directed at the correct pid.  This will be needed if we ever want to
> support multiple userspace processes interacting with the kernel over
> iSCSI netlink simultaneously.  Limitations can currently be seen if you
> attempt to run multiple iscsistart commands in parallel.
>
> We've fixed up the userspace issues in iscsistart that prevented
> multiple instances from running, so now attempts to speed up booting by
> bringing up multiple iscsi sessions at once in the initramfs are just
> running into misrouted responses that this fixes.

Applied to 4.17/scsi-fixes. Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: v6/sit tunnels and VRFs

2018-04-12 Thread David Ahern
On 4/12/18 10:54 AM, Jeff Barnhill wrote:
> Hi David,
> 
> In the slides referenced, you recommend adding an "unreachable
> default" route to the end of each VRF route table.  In my testing (for
> v4) this results in a change to fib lookup failures such that instead
> of ENETUNREACH being returned, EHOSTUNREACH is returned since the fib
> finds the unreachable route, versus failing to find a route
> altogether.
> 
> Have the implications of this been considered?  I don't see a
> clean/easy way to achieve the old behavior without affecting non-VRF
> routing (eg. remove the unreachable route and delete the non-VRF
> rules).  I'm guessing that programmatically, it may not make much
> difference, ie. lookup fails, but for debugging or to a user looking
> at it, the difference matters.  Do you (or anyone else) have any
> thoughts on this?

We have recommended moving the local table down in the FIB rules:

# ip ru ls
1000:   from all lookup [l3mdev-table]
32765:  from all lookup local
32766:  from all lookup main
32767:  from all lookup default

and adding a default route to VRF tables:

# ip ro ls vrf red
unreachable default  metric 4278198272
172.16.2.0/24  proto bgp  metric 20
nexthop via 169.254.0.1  dev swp3 weight 1 onlink
nexthop via 169.254.0.1  dev swp4 weight 1 onlink

# ip -6 ro ls vrf red
2001:db8:2::/64  proto bgp  metric 20
nexthop via fe80::202:ff:fe00:e  dev swp3 weight 1
nexthop via fe80::202:ff:fe00:f  dev swp4 weight 1
anycast fe80:: dev lo  proto kernel  metric 0  pref medium
anycast fe80:: dev lo  proto kernel  metric 0  pref medium
fe80::/64 dev swp3  proto kernel  metric 256  pref medium
fe80::/64 dev swp4  proto kernel  metric 256  pref medium
ff00::/8 dev swp3  metric 256  pref medium
ff00::/8 dev swp4  metric 256  pref medium
unreachable default dev lo  metric 4278198272  error -101 pref medium

Over the last 2 years we have not seen any negative side effects from this,
and it is what you want for proper VRF separation.

Without a default route, a lookup that misses in the VRF table proceeds to
the next fib rule, which means a lookup in the next table; barring other PBR
rules, that is the main table. That leads to wrong lookups.

Here is an example:
  ip netns add foo
  ip netns exec foo bash
  ip li set lo up
  ip li add red type vrf table 123
  ip li set red up
  ip li add dummy1 type dummy
  ip addr add 10.100.1.1/24 dev dummy1
  ip li set dummy1 master red
  ip li set dummy1 up
  ip li add dummy2 type dummy
  ip addr add 10.100.1.1/24 dev dummy2
  ip li set dummy2 up
  ip ro get 10.100.2.2
  ip ro get 10.100.2.2 vrf red

# ip ru ls
0:  from all lookup local
1000:   from all lookup [l3mdev-table]
32766:  from all lookup main
32767:  from all lookup default

# ip ro ls
10.100.1.0/24 dev dummy2 proto kernel scope link src 10.100.1.1
10.100.2.0/24 via 10.100.1.2 dev dummy2

# ip ro ls vrf red
10.100.1.0/24 dev dummy1 proto kernel scope link src 10.100.1.1

That's the setup. What happens on route lookups?
# ip ro get vrf red 10.100.2.1
10.100.2.1 via 10.100.1.2 dev dummy2 src 10.100.1.1 uid 0
cache

which is clearly wrong. Let's look at the lookup sequence

# perf record -e fib:* ip ro get vrf red 10.100.2.1
10.100.2.1 via 10.100.1.2 dev dummy2 src 10.100.1.1 uid 0
cache
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.003 MB perf.data (4 samples) ]

#  perf script --fields trace:trace
table 255 oif 2 iif 1 src 0.0.0.0 dst 10.100.2.1 tos 0 scope 0 flags 4
table 123 oif 2 iif 1 src 0.0.0.0 dst 10.100.2.1 tos 0 scope 0 flags 4
table 254 oif 2 iif 1 src 0.0.0.0 dst 10.100.2.1 tos 0 scope 0 flags 4
nexthop dev dummy2 oif 4 src 10.100.1.1

The first one is because I did not move the local table down.
The second one is the correct VRF lookup.
The third one is the continuation to the next table - the main table.

Adding a default route:
# ip ro add vrf red unreachable default

And the lookup is proper:
# ip ro get vrf red 10.100.2.1
RTNETLINK answers: No route to host
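
The fall-through behavior described in this message can be modeled with a small user-space sketch. It is only a toy model under stated assumptions, not kernel code: the types and the rule_walk() function are hypothetical, tables are plain arrays, and the error values follow what was reported for IPv4 in this thread (EHOSTUNREACH when an unreachable route matches, ENETUNREACH when nothing matches at all).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum route_type { RT_UNICAST, RT_UNREACHABLE };

struct route {
	uint32_t prefix;	/* host byte order */
	unsigned int len;	/* prefix length, 0 = default route */
	enum route_type type;
};

struct table {
	const struct route *routes;
	size_t count;
};

/* Longest-prefix match within one table; NULL if nothing matches. */
static const struct route *table_lookup(const struct table *t, uint32_t dst)
{
	const struct route *best = NULL;
	size_t i;

	for (i = 0; i < t->count; i++) {
		const struct route *r = &t->routes[i];
		uint32_t mask = r->len ? ~(uint32_t)0 << (32 - r->len) : 0;

		if ((dst & mask) != (r->prefix & mask))
			continue;
		if (!best || r->len > best->len)
			best = r;
	}
	return best;
}

/*
 * Walk the tables in rule order. A miss falls through to the next
 * table; a match ends the walk. Returns 0 for a usable route,
 * -EHOSTUNREACH for an unreachable route, -ENETUNREACH if every
 * table misses.
 */
static int rule_walk(const struct table *const *tables, size_t n, uint32_t dst)
{
	size_t i;

	assert(tables != NULL);

	for (i = 0; i < n; i++) {
		const struct route *r = table_lookup(tables[i], dst);

		if (!r)
			continue;
		return r->type == RT_UNREACHABLE ? -EHOSTUNREACH : 0;
	}
	return -ENETUNREACH;
}
```

In this model, a VRF table without a default entry lets a lookup for 10.100.2.1 fall through to the main table, while adding an unreachable default as the last resort in the VRF table stops the walk there.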


Re: [PATCH net] net: dsa: mv88e6xxx: Fix receive time stamp race condition.

2018-04-12 Thread David Miller
From: Richard Cochran 
Date: Thu, 12 Apr 2018 10:35:40 -0700

> On Mon, Apr 09, 2018 at 07:19:31AM -0700, Richard Cochran wrote:
>> Dave, please hold off on this patch.  I am seeing new problems in my
>> testing with this applied.  I still need to get to the bottom of
>> this.
> 
> Looks like the new problems are a HW/board glitch.
> 
> The patch is good to go.

Ok, applied and queued up for -stable.

Thanks.


Re: [PATCHv2 net] sctp: do not check port in sctp_inet6_cmp_addr

2018-04-12 Thread David Miller
From: Xin Long 
Date: Thu, 12 Apr 2018 14:24:31 +0800

> pf->cmp_addr() is called before binding a v6 address to the sock. It
> should not check ports, like in sctp_inet_cmp_addr.
> 
> But sctp_inet6_cmp_addr checks the addr by invoking af(6)->cmp_addr,
> sctp_v6_cmp_addr where it also compares the ports.
> 
> This would allow setsockopt(SCTP_SOCKOPT_BINDX_ADD) to bind
> multiple duplicated IPv6 addresses after Commit 40b4f0fd74e4 ("sctp:
> lack the check for ports in sctp_v6_cmp_addr").
> 
> This patch removes the af->cmp_addr call from sctp_inet6_cmp_addr,
> and instead does the proper check for both v6 addrs and v4mapped addrs.
> 
> v1->v2:
>   - define __sctp_v6_cmp_addr to do the common address comparison
> used for both pf and af v6 cmp_addr.
> 
> Fixes: 40b4f0fd74e4 ("sctp: lack the check for ports in sctp_v6_cmp_addr")
> Reported-by: Jianwen Ji 
> Signed-off-by: Xin Long 

Applied and queued up for -stable.


Re: [PATCH v2 net] net: fix deadlock while clearing neighbor proxy table

2018-04-12 Thread David Miller
From: Wolfgang Bumiller 
Date: Thu, 12 Apr 2018 10:46:55 +0200

> When coming from ndisc_netdev_event() in net/ipv6/ndisc.c,
> neigh_ifdown() is called with &nd_tbl, locking this while
> clearing the proxy neighbor entries when eg. deleting an
> interface. Calling the table's pndisc_destructor() with the
> lock still held, however, can cause a deadlock: When a
> multicast listener is available an IGMP packet of type
> ICMPV6_MGM_REDUCTION may be sent out. When reaching
> ip6_finish_output2(), if no neighbor entry for the target
> address is found, __neigh_create() is called with &nd_tbl,
> which it'll want to lock.
> 
> Move the elements into their own list, then unlock the table
> and perform the destruction.
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199289
> Fixes: 6fd6ce2056de ("ipv6: Do not depend on rt->n in ip6_finish_output2().")
> Signed-off-by: Wolfgang Bumiller 
> ---
> Changes to v1:
>   * Renamed 'pneigh_ifdown' to 'pneigh_ifdown_and_unlock'.

Applied and queued up for -stable.

> Btw. I'm not actually sure how much sense the Fixes tag makes as the
> commit itself isn't wrong, it just happens to be the most easily
> triggerable code path there (and I can't definitively rule out others,
> given that the "sending something over the network with the lock held
> will deadlock" comment at the top of the file is also from the initial
> commit I'd expect other backtraces to be possible from out of that
> function) and the other affected lines are mostly the initial git
> commit...

Understood.


Re: [PATCH net 0/4] nfp: improve signal handing on FW waits and flower control message processing

2018-04-12 Thread David Miller
From: Jakub Kicinski 
Date: Wed, 11 Apr 2018 16:47:34 -0700

> The first part of this set aims to improve handling of interrupted
> waits.  Patch 1 makes waiting for management FW responses
> uninterruptible while patch 2 adds a message when signal arrives
> while waiting for an NFP mutex.  We can't interrupt execution of
> FW commands so uninterruptible sleep seems reasonable there.
> Exiting a wait for a mutex should be clean and have no side effects,
> so we allow it to be aborted.  Note that both waits have rather
> large timeouts (tens of seconds).
> 
> Patches 3 and 4 improve flower offload operation under heavy load.
> Currently there is no cap on the number of queued FW notifications.
> Some of the notifications have to be processed from a workqueue,
> which may lead to a very large number of messages getting queued
> if the workqueue never gets a chance to run.  Pieter puts a limit
> on the number of queued messages, tries to drop messages we ignore
> without queuing them, and processes more important messages first.

Series applied, thanks Jakub.


Re: [net 1/1] tipc: fix missing initializer in tipc_sendmsg()

2018-04-12 Thread David Miller
From: Jon Maloy 
Date: Thu, 12 Apr 2018 01:15:48 +0200

> The stack variable 'dnode' in __tipc_sendmsg() may theoretically
> end up in tipc_node_get_mtu() as an uninitialized variable.
> 
> We fix this by initializing the variable at declaration. We also add
> a default else clause to the two conditional ones already there, so
> that we never end up in the named function if the given address
> type is illegal.
> 
> Reported-by: syzbot+b0975ce9355b347c1...@syzkaller.appspotmail.com
> Signed-off-by: Jon Maloy 

Applied, thanks Jon.


Re: [PATCH net] strparser: Fix incorrect strp->need_bytes value.

2018-04-12 Thread David Miller
From: Doron Roberts-Kedes 
Date: Wed, 11 Apr 2018 15:05:16 -0700

> strp_data_ready resets strp->need_bytes to 0 if strp_peek_len indicates
> that the remainder of the message has been received. However,
> do_strp_work does not reset strp->need_bytes to 0. If do_strp_work
> completes a partial message, the value of strp->need_bytes will continue
> to reflect the needed bytes of the previous message, causing
> future invocations of strp_data_ready to return early if
> strp->need_bytes is less than strp_peek_len. Resetting strp->need_bytes
> to 0 in __strp_recv on handing a full message to the upper layer solves
> this problem.
> 
> __strp_recv also calculates strp->need_bytes using stm->accum_len before
> stm->accum_len has been incremented by cand_len. This can cause
> strp->need_bytes to be equal to the full length of the message instead
> of the full length minus the accumulated length. This, in turn, causes
> strp_data_ready to return early, even when there is sufficient data to
> complete the partial message. Incrementing stm->accum_len before using
> it to calculate strp->need_bytes solves this problem.
> 
> Found while testing net/tls_sw recv path.
> 
> Fixes: 43a0c6751a322847 ("strparser: Stream parser for messages")
> Signed-off-by: Doron Roberts-Kedes 

Applied and queued up for -stable.
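
The ordering issue described in the second paragraph of the commit message can be illustrated with a small sketch. The struct and function below are hypothetical simplifications, not the strparser code; they only show that the accumulated length has to be updated before the remaining byte count is derived from it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified per-message state; not the kernel's struct. */
struct msg_state {
	size_t full_len;	/* total length of the message being parsed */
	size_t accum_len;	/* bytes accumulated so far */
};

/*
 * Account for cand_len newly received bytes and return how many bytes
 * are still needed. accum_len is updated first, so the result is the
 * remainder of the message rather than its full length; a completed
 * message yields 0.
 */
static size_t msg_account(struct msg_state *m, size_t cand_len)
{
	assert(m->accum_len + cand_len <= m->full_len);

	m->accum_len += cand_len;
	return m->full_len - m->accum_len;
}
```

Computing the remainder before the increment would report the full message length on the first partial read, which matches the early-return symptom described above.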


Re: [PATCH] selftests: net: add in_netns.sh to TEST_PROGS

2018-04-12 Thread David Miller
From: Anders Roxell 
Date: Wed, 11 Apr 2018 17:17:34 +0200

> Script in_netns.sh isn't installed.
> 
> running psock_fanout test
> 
> ./run_afpackettests: line 12: ./in_netns.sh: No such file or directory
> [FAIL]
> 
> running psock_tpacket test
> 
> ./run_afpackettests: line 22: ./in_netns.sh: No such file or directory
> [FAIL]
> 
> This patch adds in_netns.sh to TEST_PROGS so that it gets installed.
> 
> Fixes: cc30c93fa020 ("selftests/net: ignore background traffic in psock_fanout")
> Signed-off-by: Anders Roxell 

Applied, thanks.


Re: [PATCH v2 net 0/2] ibmvnic: Fix parameter change request handling

2018-04-12 Thread David Miller
From: Nathan Fontenot 
Date: Wed, 11 Apr 2018 10:09:26 -0500

> When updating parameters for the ibmvnic driver there is a possibility
> of entering an infinite loop if a return value other than a partial
> success is received from sending the login CRQ.
> 
> Also, a deadlock can occur on the rtnl lock if netdev_notify_peers()
> is called during driver reset for a parameter change reset.
> 
> This patch set corrects both of these issues by updating the return
> code handling in ibmvnic_login() and guarding against calling
> netdev_notify_peers() for parameter change requests.
> 
> -Nathan
> 
> Updates for V2: Correct spelling mistakes in commit messages.

Series applied, thanks.


Re: [PATCH net] net: validate attribute sizes in neigh_dump_table()

2018-04-12 Thread David Miller
From: Eric Dumazet 
Date: Wed, 11 Apr 2018 14:46:00 -0700

> Since neigh_dump_table() calls nlmsg_parse() without giving policy
> constraints, attributes can have arbitrary size that we must validate.
> 
> Reported by syzbot/KMSAN :
 ...
> Fixes: 21fdd092acc7 ("net: Add support for filtering neigh dump by master device")
> Signed-off-by: Eric Dumazet 
> Cc: David Ahern 
> Reported-by: syzbot 

Applied and queued up for -stable.
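
The general pattern behind this kind of fix, checking an attribute's length before reading a fixed-size value out of it, can be sketched as below. The struct attr type and attr_get_u32() are hypothetical stand-ins for illustration only; they are not the netlink API and not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified attribute: a length and a payload pointer. */
struct attr {
	uint16_t len;		/* payload length in bytes */
	const void *data;
};

/*
 * Read a u32 payload only if the attribute has exactly the right size.
 * Returns 0 on success, -1 if the size does not match, in which case
 * *out is left untouched.
 */
static int attr_get_u32(const struct attr *a, uint32_t *out)
{
	assert(a != NULL && out != NULL);

	if (a->len != sizeof(*out) || a->data == NULL)
		return -1;
	memcpy(out, a->data, sizeof(*out));
	return 0;
}
```

Without such a check, a short attribute supplied by user space would make the reader consume bytes that were never written, which is the class of uninit-value report KMSAN produces.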


Re: [PATCH net] tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets

2018-04-12 Thread David Miller
From: Eric Dumazet 
Date: Wed, 11 Apr 2018 14:36:28 -0700

> syzbot/KMSAN reported an uninit-value in tcp_parse_options() [1]
> 
> I believe this was caused by a TCP_MD5SIG being set on live
> flow.
> 
> This is highly unexpected, since TCP option space is limited.
> 
> For instance, presence of TCP MD5 option automatically disables
> TCP TimeStamp option at SYN/SYNACK time, which we can not do
> once flow has been established.
> 
> Really, adding/deleting an MD5 key only makes sense on sockets
> in CLOSE or LISTEN state.
 ...
> Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.")
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 

Applied and queued up for -stable.


Re: [net 1/1] tipc: fix unbalanced reference counter

2018-04-12 Thread David Miller
From: Jon Maloy 
Date: Wed, 11 Apr 2018 22:52:09 +0200

> When a topology subscription is created, we may encounter (or KASAN
> may provoke) a failure to create a corresponding service instance in
> the binding table. Instead of letting the tipc_nametbl_subscribe()
> report the failure back to the caller, the function just makes a warning
> printout and returns, without incrementing the subscription reference
> counter as expected by the caller.
> 
> This makes the caller believe that the subscription was successful, so
> it will at a later moment try to unsubscribe the item. This involves
> a sub_put() call. Since the reference counter never was incremented
> in the first place, we get a premature delete of the subscription item,
> followed by a "use-after-free" warning.
> 
> We fix this by adding a return value to tipc_nametbl_subscribe() and
> make the caller aware of the failure to subscribe.
> 
> This bug seems to always have been around, but this fix only applies
> back to the commit shown below. Given the low risk of this happening
> we believe this to be sufficient.
> 
> Fixes: commit 218527fe27ad ("tipc: replace name table service range
> array with rb tree")
> Reported-by: syzbot+aa245f26d42b8305d...@syzkaller.appspotmail.com
> 
> Signed-off-by: Jon Maloy 

Applied and queued up for -stable.


Re: [PATCH v2 net] lan78xx: PHY DSP registers initialization to address EEE link drop issues with long cables

2018-04-12 Thread David Miller
From: Raghuram Chary J 
Date: Wed, 11 Apr 2018 20:36:36 +0530

> The patch is to configure DSP registers of PHY device
> to handle Gbe-EEE failures with >40m cable length.
> 
> Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
> Signed-off-by: Raghuram Chary J 

Applied.


Re: [PATCH] net: stmmac: fix missing support for 802.1AD tag on reception

2018-04-12 Thread David Miller
From: Elad Nachman 
Date: Wed, 11 Apr 2018 15:07:40 +

> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c2018-04-11 
> 17:04:00.586057300 +0300
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c2018-04-11 
> 17:05:33.601992400 +0300
> @@ -3293,17 +3293,19 @@ dma_map_err:
> 
>  static void stmmac_rx_vlan(struct net_device *dev, struct sk_buff *skb)
>  {
> -struct ethhdr *ehdr;
> +struct vlan_ethhdr *veth;
>  u16 vlanid;
> +__be16 vlan_proto;

This patch has been mangled by your email client.


Re: [PATCHv2] mISDN: Remove VLAs

2018-04-12 Thread David Miller
From: Laura Abbott 
Date: Tue, 10 Apr 2018 18:04:29 -0700

> There's an ongoing effort to remove VLAs[1] from the kernel to eventually
> turn on -Wvla. Remove the VLAs from the mISDN code by switching to using
> kstrdup in one place and using an upper bound in another.
> 
> Signed-off-by: Laura Abbott 
> ---
> v2: Switch to a tighter upper bound so we are allocating a more
> reasonable amount on the stack (300). This is based on previous checks
> against this value.

Applied.


Re: [PATCH] net/tls: Remove VLA usage

2018-04-12 Thread David Miller
From: Kees Cook 
Date: Tue, 10 Apr 2018 17:52:34 -0700

> In the quest to remove VLAs from the kernel[1], this replaces the VLA
> size with the only possible size used in the code, and adds a mechanism
> to double-check future IV sizes.
> 
> [1] 
> https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qpxydaacu1rq...@mail.gmail.com
> 
> Signed-off-by: Kees Cook 

Applied.


Re: [bpf PATCH 0/3] BPF, a couple sockmap fixes

2018-04-12 Thread Daniel Borkmann
On 04/12/2018 01:56 AM, John Fastabend wrote:
> While testing sockmap with more programs (besides our test programs)
> I found a couple issues.
> 
> The attached series fixes three issues: pinned maps were not
> working correctly, blocking sockets returned zero, and an error
> path taken when the sock hit an out-of-memory condition resulted
> in a double page_put() while doing ingress redirects.
> 
> See individual patches for more details.

Applied to bpf tree, thanks John!


Re: [RFC bpf-next v2 4/8] bpf: add documentation for eBPF helpers (23-32)

2018-04-12 Thread Alexei Starovoitov
On Tue, Apr 10, 2018 at 03:41:53PM +0100, Quentin Monnet wrote:
> Add documentation for eBPF helper functions to bpf.h user header file.
> This documentation can be parsed with the Python script provided in
> another commit of the patch series, in order to provide a RST document
> that can later be converted into a man page.
> 
> The objective is to make the documentation easily understandable and
> accessible to all eBPF developers, including beginners.
> 
> This patch contains descriptions for the following helper functions, all
> written by Daniel:
> 
> - bpf_get_prandom_u32()
> - bpf_get_smp_processor_id()
> - bpf_get_cgroup_classid()
> - bpf_get_route_realm()
> - bpf_skb_load_bytes()
> - bpf_csum_diff()
> - bpf_skb_get_tunnel_opt()
> - bpf_skb_set_tunnel_opt()
> - bpf_skb_change_proto()
> - bpf_skb_change_type()
> 
> Cc: Daniel Borkmann 
> Signed-off-by: Quentin Monnet 
> ---
>  include/uapi/linux/bpf.h | 125 +++
>  1 file changed, 125 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f3ea8824efbc..d147d9dd6a83 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -473,6 +473,14 @@ union bpf_attr {
>   *   The number of bytes written to the buffer, or a negative error
>   *   in case of failure.
>   *
> + * u32 bpf_prandom_u32(void)
> + *   Return
> + *   A random 32-bit unsigned value.

there is no such helper.
It's called bpf_get_prandom_u32().
I'd also add a note that it's using its own random state and cannot be
used to infer seed of other random functions in the kernel.

> + *
> + * u32 bpf_get_smp_processor_id(void)
> + *   Return
> + *   The SMP (Symmetric multiprocessing) processor id.

probably worth adding a note to explain that all bpf programs run
with preemption disabled, so processor id is stable for the run of the program.

> + *
>   * int bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from, u32 len, u64 flags)
>   *   Description
>   *   Store *len* bytes from address *from* into the packet
> @@ -604,6 +612,13 @@ union bpf_attr {
>   *   Return
>   *   0 on success, or a negative error in case of failure.
>   *
> + * u32 bpf_get_cgroup_classid(struct sk_buff *skb)
> + *   Description
> + *   Retrieve the classid for the current task, i.e. for the
> + *   net_cls (network classifier) cgroup to which *skb* belongs.

please add that kernel should be configured with CONFIG_NET_CLS_CGROUP=y|m
and mention Documentation/cgroup-v1/net_cls.txt
Otherwise 'network classifier' is way too generic.
I'd also mention that placing a task into net_cls controller
disables all of cgroup-bpf.

> + *   Return
> + *   The classid, or 0 for the default unconfigured classid.
> + *
>   * int bpf_skb_vlan_push(struct sk_buff *skb, __be16 vlan_proto, u16 vlan_tci)
>   *   Description
>   *   Push a *vlan_tci* (VLAN tag control information) of protocol
> @@ -703,6 +718,14 @@ union bpf_attr {
>   *   are **TC_ACT_REDIRECT** on success or **TC_ACT_SHOT** on
>   *   error.
>   *
> + * u32 bpf_get_route_realm(struct sk_buff *skb)
> + *   Description
> + *   Retrieve the realm of the route, that is to say the
> + *   **tclassid** field of the destination for the *skb*.

Similarly this only works if CONFIG_IP_ROUTE_CLASSID is on.

> + *   Return
> + *   The realm of the route for the packet associated to *skb*, or 0
> + *   if none was found.
> + *
>   * int bpf_perf_event_output(struct pt_reg *ctx, struct bpf_map *map, u64 flags, void *data, u64 size)
>   *   Description
>   *   Write perf raw sample into a perf event held by *map* of type
> @@ -779,6 +802,21 @@ union bpf_attr {
>   *   Return
>   *   0 on success, or a negative error in case of failure.
>   *
> + * int bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len)
> + *   Description
> + *   This helper was provided as an easy way to load data from a
> + *   packet. It can be used to load *len* bytes from *offset* from
> + *   the packet associated to *skb*, into the buffer pointed by
> + *   *to*.
> + *
> + *   Since Linux 4.7, this helper is deprecated in favor of
> + *   "direct packet access", enabling packet data to be manipulated
> + *   with *skb*\ **->data** and *skb*\ **->data_end** pointing
> + *   respectively to the first byte of packet data and to the byte
> + *   after the last byte of packet data.

I wouldn't call it deprecated.
It's still useful when programmer wants to read large quantities of
data from the packet

> + *   Return
> + *   0 on success, or a negative error in case of failure.
> + *
>   * int bpf_get_stackid(struct pt_reg *ctx, struct bpf_map *map, u64 flags)
>   *   Description
>   *   
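
To make the bpf_skb_load_bytes() versus direct packet access distinction discussed above concrete, here is a user-space model, not BPF code. struct demo_pkt, demo_load_bytes() and demo_read_u8() are invented for illustration, and the -EFAULT return value is an assumption of this sketch. It shows a bounds-checked copy in the spirit of the helper next to a single-byte direct access guarded by a data/data_end comparison.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct demo_pkt {
    const uint8_t *data;      /* first byte of packet data */
    const uint8_t *data_end;  /* one past the last byte of packet data */
};

/* Copy len bytes starting at offset into to.
 * Returns 0 on success, -EFAULT if the range is outside the packet. */
static int demo_load_bytes(const struct demo_pkt *pkt, uint32_t offset,
                           void *to, uint32_t len)
{
    size_t avail = (size_t)(pkt->data_end - pkt->data);

    if (offset > avail || len > avail - offset)
        return -EFAULT;
    memcpy(to, pkt->data + offset, len);
    return 0;
}

/* Direct access: compare against data_end before dereferencing. */
static int demo_read_u8(const struct demo_pkt *pkt, uint32_t offset,
                        uint8_t *val)
{
    if ((size_t)(pkt->data_end - pkt->data) <= offset)
        return -EFAULT;
    *val = pkt->data[offset];
    return 0;
}
```

The copy variant stays convenient when a large range is wanted in one call, which is the case raised in the review comment.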

[PATCH ipsec-next] selftests: add xfrm state-policy-monitor to rtnetlink.sh

2018-04-12 Thread Shannon Nelson
Add a simple set of tests for the IPsec xfrm commands.

Signed-off-by: Shannon Nelson 
---
 tools/testing/selftests/net/rtnetlink.sh | 103 +++
 1 file changed, 103 insertions(+)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index e6f4852..760faef 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -502,6 +502,108 @@ kci_test_macsec()
echo "PASS: macsec"
 }
 
+#---
+# Example commands
+#   ip x s add proto esp src 14.0.0.52 dst 14.0.0.70 \
+#spi 0x07 mode transport reqid 0x07 replay-window 32 \
+#aead 'rfc4106(gcm(aes))' 1234567890123456dcba 128 \
+#sel src 14.0.0.52/24 dst 14.0.0.70/24
+#   ip x p add dir out src 14.0.0.52/24 dst 14.0.0.70/24 \
+#tmpl proto esp src 14.0.0.52 dst 14.0.0.70 \
+#spi 0x07 mode transport reqid 0x07
+#
+# Subcommands not tested
+#ip x s update
+#ip x s allocspi
+#ip x s deleteall
+#ip x p update
+#ip x p deleteall
+#ip x p set
+#---
+kci_test_ipsec()
+{
+   srcip="14.0.0.52"
+   dstip="14.0.0.70"
+   algo="aead rfc4106(gcm(aes)) 0x3132333435363738393031323334353664636261 128"
+
+   # flush to be sure there's nothing configured
+   ip x s flush ; ip x p flush
+   check_err $?
+
+   # start the monitor in the background
+   tmpfile=`mktemp ipsectestXXX`
+   ip x m > $tmpfile &
+   mpid=$!
+   sleep 0.2
+
+   ipsecid="proto esp src $srcip dst $dstip spi 0x07"
+   ip x s add $ipsecid \
+mode transport reqid 0x07 replay-window 32 \
+$algo sel src $srcip/24 dst $dstip/24
+   check_err $?
+
+   lines=`ip x s list | grep $srcip | grep $dstip | wc -l`
+   test $lines -eq 2
+   check_err $?
+
+   ip x s count | grep -q "SAD count 1"
+   check_err $?
+
+   lines=`ip x s get $ipsecid | grep $srcip | grep $dstip | wc -l`
+   test $lines -eq 2
+   check_err $?
+
+   ip x s delete $ipsecid
+   check_err $?
+
+   lines=`ip x s list | wc -l`
+   test $lines -eq 0
+   check_err $?
+
+   ipsecsel="dir out src $srcip/24 dst $dstip/24"
+   ip x p add $ipsecsel \
+   tmpl proto esp src $srcip dst $dstip \
+   spi 0x07 mode transport reqid 0x07
+   check_err $?
+
+   lines=`ip x p list | grep $srcip | grep $dstip | wc -l`
+   test $lines -eq 2
+   check_err $?
+
+   ip x p count | grep -q "SPD IN  0 OUT 1 FWD 0"
+   check_err $?
+
+   lines=`ip x p get $ipsecsel | grep $srcip | grep $dstip | wc -l`
+   test $lines -eq 2
+   check_err $?
+
+   ip x p delete $ipsecsel
+   check_err $?
+
+   lines=`ip x p list | wc -l`
+   test $lines -eq 0
+   check_err $?
+
+   # check the monitor results
+   kill $mpid
+   lines=`wc -l $tmpfile | cut "-d " -f1`
+   test $lines -eq 20
+   check_err $?
+   rm -rf $tmpfile
+
+   # clean up any leftovers
+   ip x s flush
+   check_err $?
+   ip x p flush
+   check_err $?
+
+   if [ $ret -ne 0 ]; then
+   echo "FAIL: ipsec"
+   return 1
+   fi
+   echo "PASS: ipsec"
+}
+
 kci_test_gretap()
 {
testns="testns"
@@ -755,6 +857,7 @@ kci_test_rtnl()
kci_test_vrf
kci_test_encap
kci_test_macsec
+   kci_test_ipsec
 
kci_del_dummy
 }
-- 
2.7.4
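
A note on the key material in the test above: the hex string in $algo appears to be the byte-wise ASCII encoding of the text key shown in the "Example commands" comment. The sketch below checks that relationship; demo_key_to_hex() is a hypothetical helper written for this note and is not part of iproute2 or the selftest.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "0x" followed by the two-digit hex value of each byte of key.
 * Returns 0 on success, -1 if out is too small. */
static int demo_key_to_hex(const char *key, char *out, size_t out_len)
{
    size_t n = strlen(key);
    size_t i;

    if (out_len < 2 + 2 * n + 1)
        return -1;
    strcpy(out, "0x");
    for (i = 0; i < n; i++)
        snprintf(out + 2 + 2 * i, 3, "%02x", (unsigned char)key[i]);
    return 0;
}
```

Feeding it the text key "1234567890123456dcba" gives the same value as the hex literal used in kci_test_ipsec().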



Re: [PATCH] net/mlx5: remove some extraneous spaces in indentations

2018-04-12 Thread Saeed Mahameed
On Mon, 2018-04-09 at 13:43 +0100, Colin King wrote:
> From: Colin Ian King 
> 
> There are several lines where there is an extraneous space causing
> indentation misalignment. Remove them.
> 
> Cleans up Coccinelle warnings:
> 
> ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:409:3-18: code aligned
> with following code on line 410
> ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:415:3-18: code aligned
> with following code on line 416
> ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:421:3-18: code aligned
> with following code on line 422
> 
> Signed-off-by: Colin Ian King 
> 

Applied to mlx5-next, Thanks Colin!



Re: linux-next on x60: network manager often complains "network is disabled" after resume

2018-04-12 Thread Pavel Machek
On Mon 2018-03-19 18:33:56, Pavel Machek wrote:
> On Mon 2018-03-19 10:40:08, Dan Williams wrote:
> > On Mon, 2018-03-19 at 10:21 +0100, Pavel Machek wrote:
> > > On Mon 2018-03-19 05:17:45, Woody Suwalski wrote:
> > > > Pavel Machek wrote:
> > > > > Hi!
> > > > > 
> > > > > With recent linux-next, after resume networkmanager often claims
> > > > > that
> > > > > "network is disabled". Sometimes suspend/resume clears that.
> > > > > 
> > > > > Any ideas? Does it work for you?
> > > > >   
> > > > > Pavel
> > > > 
> > > > Tried the 4.16-rc6 with nm 1.4.4 - I do not see the issue.
> > > 
> > > Thanks for testing... but yes, 4.16 should be ok. If not fixed,
> > > problem will appear in 4.17-rc1.
> > 
> > Where does the complaint occur?  In the GUI, or with nmcli, or
> > somewhere else?  Also, what's the output of "nmcli dev" after resume?
> 
> In the GUI. I click in place where I'd select access point, and menu
> does not show up, telling me that "network is disabled".

Ouch, and the bug has now crept into mainline... and it happens on the
X220, too. With ethernet, the bug is harder to see: the "network is
disabled" message and the "no network" icon are there, but ethernet
still works.

Best regards,
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




Re: SRIOV switchdev mode BoF minutes

2018-04-12 Thread Samudrala, Sridhar

On 4/12/2018 1:20 PM, Or Gerlitz wrote:
> On Thu, Apr 12, 2018 at 8:05 PM, Samudrala, Sridhar wrote:
>> On 11/12/2017 11:49 AM, Or Gerlitz wrote:
>>>
>>> Hi Dave and all,
>>>
>>> During and after the BoF on SRIOV switchdev mode, we came into a
>>> consensus among the developers from four different HW vendors (CC
>>> audience) that a correct thing to do would be to disallow any new
>>> extensions to the legacy mode.
>>>
>>> The idea is to put focus on the new mode and not add new UAPIs and
>>> kernel code which was turned to be a wrong design which does not allow
>>> for properly offloading a kernel switching SW model to e-switch HW.
>>>
>>> We also had a good session the day after regarding alignment for the
>>> representation model of the uplink (physical port) and PF/s.
>>>
>>> The VF representor netdevs  exist for all drivers that support the new
>>> mode but the representation for the uplink and PF wasn't the same for
>>> all. The decision was to represent the uplink and PFs vports in the
>>> same manner done for VFs, using rep netdevs. This alignment would
>>> provide a more strict and clear view of the kernel model for e-switch
>>> to users and upper layer control plane SW.
>>>
>> I don't see any changes in the Mellanox/other drivers to move to this new
>> model to enable the uplink and PF port representors, any updates?
>
> Yeah, I have worked on that but didn't get to finalize the upstreaming
> so far.  I have resumed
> the work and plan uplink rep in mlx5 to replace the PF being uplink rep for 4.18
>
>> It would be really nice to highlight the pros and cons of the old versus the
>> new model.
>>
>> We are looking into adding switchdev support for our new 100Gb ice driver
>> and could use some feedback on the direction we should be taking.
>
> good news.
>
> The uplink rep is clear cut that needs to be a rep device representing
> the uplink just like vf
> rep represents the vport toward the vf - please just do it correct
> from the beginning

Having an uplink rep will definitely help implement the slow path with
flat/vlan network scenarios by not having to add PF to the bridge.

But how do they help with a vxlan overlay scenario? In case of overlays, the
slow path has to go via vxlan -> ip stack -> pf?

What about pf-rep?



Re: SRIOV switchdev mode BoF minutes

2018-04-12 Thread Or Gerlitz
On Thu, Apr 12, 2018 at 8:05 PM, Samudrala, Sridhar
 wrote:
> On 11/12/2017 11:49 AM, Or Gerlitz wrote:
>>
>> Hi Dave and all,
>>
>> During and after the BoF on SRIOV switchdev mode, we came into a
>> consensus among the developers from four different HW vendors (CC
>> audience) that a correct thing to do would be to disallow any new
>> extensions to the legacy mode.
>>
>> The idea is to put focus on the new mode and not add new UAPIs and
>> kernel code which was turned to be a wrong design which does not allow
>> for properly offloading a kernel switching SW model to e-switch HW.
>>
>> We also had a good session the day after regarding alignment for the
>> representation model of the uplink (physical port) and PF/s.
>>
>> The VF representor netdevs  exist for all drivers that support the new
>> mode but the representation for the uplink and PF wasn't the same for
>> all. The decision was to represent the uplink and PFs vports in the
>> same manner done for VFs, using rep netdevs. This alignment would
>> provide a more strict and clear view of the kernel model for e-switch
>> to users and upper layer control plane SW.
>>
> I don't see any changes in the Mellanox/other drivers to move to this new
> model to enable the uplink and PF port representors, any updates?

Yeah, I have worked on that but didn't get to finalize the upstreaming
so far. I have resumed the work and plan to have an uplink rep in mlx5
replace the PF acting as the uplink rep for 4.18.

> It would be really nice to highlight the pros and cons of the old versus the
> new model.
>
> We are looking into adding switchdev support for our new 100Gb ice driver
> and could use some feedback on the direction we should be taking.

good news.

The uplink rep is clear cut: it needs to be a rep device representing
the uplink, just like a vf rep represents the vport toward the vf -
please just do it correctly from the beginning

I can spare


Re: [PATCH] net: ethernet: ti: cpsw: fix tx vlan priority mapping

2018-04-12 Thread Grygorii Strashko


On 04/12/2018 09:25 AM, Ivan Khoronzhuk wrote:
> The CPDMA_TX_PRIORITY_MAP in real is vlan pcp field priority mapping
> register and basically replaces vlan pcp field for tagged packets.
> So, set it to be 1:1 mapping.

"Otherwise, it will cause unexpected change of egress vlan tagged packets,
 like prio 2 -> prio 5"


> 
> Signed-off-by: Ivan Khoronzhuk 
> ---
> Based on net/master

Fixes: e05107e6b747 ("net: ethernet: ti: cpsw: add multi queue support")

Reviewed-by: Grygorii Strashko  

> 
>   drivers/net/ethernet/ti/cpsw.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
> index 3037127..74f8284 100644
> --- a/drivers/net/ethernet/ti/cpsw.c
> +++ b/drivers/net/ethernet/ti/cpsw.c
> @@ -129,7 +129,7 @@ do {  
> \
>   
>   #define RX_PRIORITY_MAPPING 0x76543210
>   #define TX_PRIORITY_MAPPING 0x33221100
> -#define CPDMA_TX_PRIORITY_MAP0x01234567
> +#define CPDMA_TX_PRIORITY_MAP0x76543210
>   
>   #define CPSW_VLAN_AWARE BIT(1)
>   #define CPSW_RX_VLAN_ENCAP  BIT(2)
> 

-- 
regards,
-grygorii
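
A small sketch of why 0x76543210 is the 1:1 value. It assumes that nibble n of the map register holds the output priority for input priority n, which is inferred from the "prio 2 -> prio 5" example above rather than taken from the hardware manual. demo_map_prio() is a hypothetical helper, not a function in the cpsw driver.

```c
#include <assert.h>
#include <stdint.h>

/* Look up the priority that a 4-bit-per-entry map assigns to pcp.
 * Assumption: entry n lives in bits [4n+3 : 4n]. */
static unsigned int demo_map_prio(uint32_t map, unsigned int pcp)
{
    assert(pcp < 8);
    return (map >> (4 * pcp)) & 0xf;
}
```

Under that assumption the old value 0x01234567 turns priority 2 into 5, while 0x76543210 leaves every priority unchanged.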


Re: [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)

2018-04-12 Thread Julian Anastasov

Hello,

On Thu, 12 Apr 2018, Stephen Suryaputra wrote:

> Thanks for the feedbacks. Please see the detail below:
> 
> On Wed, Apr 11, 2018 at 3:37 PM, Julian Anastasov  wrote:
> [snip]
> >> - __IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
> >> + __IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_INHDRERRORS);
> >
> > May be skb->dev if we want to account it to the
> > input device.
> >
> Yes. I'm about to make change it but see the next one.
> 
> [snip]
> >> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> >> index 4527921..32bd3af 100644
> >> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> >> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> >> @@ -286,7 +286,7 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
> >>   {
> >>   if (ip_hdr(skb)->ttl <= 1) {
> >>   /* Tell the sender its packet died... */
> >> - __IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
> >> + __IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_INHDRERRORS);
> >
> > At this point, skb_dst(skb) can be:
> >
> > - input route at LOCAL_IN => dst->dev is "lo", skb->dev = input_device
> > - output route at LOCAL_OUT => dst->dev is output_device, skb->dev = NULL
> >
> > We should see this error on LOCAL_IN but better to be
> > safe: use 'skb->dev ? : skb_dst(skb)->dev' instead of just
> > 'skb_dst(skb)->dev'.
> >
> This follows v6 implementation in the same function:
> 
> #ifdef CONFIG_IP_VS_IPV6
> if (skb_af == AF_INET6) {
> struct dst_entry *dst = skb_dst(skb);
> 
> /* check and decrement ttl */
> if (ipv6_hdr(skb)->hop_limit <= 1) {
> /* Force OUTPUT device used as source address */

It looks like IPVS copied it from ip6_forward() but in
IPVS context it has its reason: we want ICMP to exit with
saddr=Virtual_IP. And we are at LOCAL_IN where there is no
output device like in ip6_forward(FORWARD) to use its source
address.

So, IPVS is special (both input and output path) and needs:

IPv4: skb->dev ? : skb_dst(skb)->dev
IPv6 needs fix for IPVS stats in decrement_ttl:

idev = skb->dev ? __in6_dev_get(skb->dev) : ip6_dst_idev(dst);
...
__IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);

Otherwise, stats will go to "lo" if ip6_dst_idev
is used for local input route.

So, for accounting on input IPv4 path skb->dev should be
used, while for IPv6 some sites may prefer to feed icmpv6_send()
with output dst->dev as device containing the source address (skb->dev).
But this is unrelated to the stats.

> skb->dev = dst->dev;
> icmpv6_send(skb, ICMPV6_TIME_EXCEED,
> ICMPV6_EXC_HOPLIMIT, 0);
> __IP6_INC_STATS(net, ip6_dst_idev(dst),
> IPSTATS_MIB_INHDRERRORS);
> 
> return false;
> }
> 
> /* don't propagate ttl change to cloned packets */
> if (!skb_make_writable(skb, sizeof(struct ipv6hdr)))
> return false;
> 
> ipv6_hdr(skb)->hop_limit--;
> } else
> #endif
> 
> [snip]
> >
> > The patch probably has other errors, for example,
> > using rt->dst.dev (lo) when rt->dst.error != 0 in ip_error,
> > may be 'dev' should be used instead...
> 
> Same also here. Examples are ip6_forward and ip6_pkt_drop.
> 
> I think it's better be counted in the input device for them also. Thoughts?

I think so. ipv6_rcv() works with idev = __in6_dev_get(skb->dev)
but I don't know IPv6 well and whether ip6_dst_idev(skb_dst(skb))
is correct usage for input path. It should be correct for output
path, though.

Regards

--
Julian Anastasov 
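
The device selection suggested above for the IPv4 IPVS case can be modelled in user space as follows. The demo_* structures are stand-ins invented for this sketch and carry only the fields needed here; they are not the kernel's sk_buff, dst_entry or net_device.

```c
#include <assert.h>
#include <stddef.h>

struct demo_dev { const char *name; };
struct demo_dst { struct demo_dev *dev; };
struct demo_skb {
    struct demo_dev *dev;   /* input device, NULL on LOCAL_OUT */
    struct demo_dst *dst;   /* attached route */
};

/* Prefer the input device; fall back to the route's device.
 * Plain C spelling of 'skb->dev ? : skb_dst(skb)->dev'. */
static struct demo_dev *demo_stats_dev(const struct demo_skb *skb)
{
    assert(skb->dst != NULL);
    return skb->dev ? skb->dev : skb->dst->dev;
}
```

With an input route at LOCAL_IN the counter lands on the receiving interface instead of "lo", and at LOCAL_OUT it lands on the output device.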


[PATCH net-next] ipv6: provide Kconfig switch to disable accept_ra by default

2018-04-12 Thread Matthias Schiffer
Many distributions and users prefer to handle router advertisements in
userspace; one example is OpenWrt, which includes a combined RA and DHCPv6
client. For such configurations, accept_ra should not be enabled by
default.

As setting net.ipv6.conf.default.accept_ra via sysctl.conf or similar
facilities may be too late to catch all interfaces and common sysctl.conf
tools do not allow setting an option for all existing interfaces, this
patch provides a Kconfig option to control the default value of
default.accept_ra.

Using default.accept_ra is preferable to all.accept_ra for our usecase,
as disabling all.accept_ra would preclude users from explicitly enabling
accept_ra on individual interfaces.

Signed-off-by: Matthias Schiffer 
---
 net/ipv6/Kconfig| 12 
 net/ipv6/addrconf.c |  2 +-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index 6794ddf0547c..0f453110f288 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -20,6 +20,18 @@ menuconfig IPV6
 
 if IPV6
 
+config IPV6_ACCEPT_RA_DEFAULT
+   bool "IPv6: Accept router advertisements by default"
+   default y
+   help
+ The kernel can internally handle IPv6 router advertisements for
+ stateless address autoconfiguration (SLAAC) and route configuration,
+ which can be configured in detail and per-interface using a number of
+ sysctl options. This option controls the default value of
+ net.ipv6.conf.default.accept_ra.
+
+ If unsure, say Y.
+
 config IPV6_ROUTER_PREF
bool "IPv6: Router Preference (RFC 4191) support"
---help---
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 78cef00c9596..ec066cd742db 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -245,7 +245,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = {
.forwarding = 0,
.hop_limit  = IPV6_DEFAULT_HOPLIMIT,
.mtu6   = IPV6_MIN_MTU,
-   .accept_ra  = 1,
+   .accept_ra  = IS_ENABLED(CONFIG_IPV6_ACCEPT_RA_DEFAULT),
.accept_redirects   = 1,
.autoconf   = 1,
.force_mld_version  = 0,
-- 
2.17.0
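
For readers unfamiliar with how a Kconfig bool can end up as a 0/1 initializer, the following is a simplified user-space re-creation of the preprocessor trick. All demo_* and CONFIG_DEMO_* names are invented for this sketch, and it only models an option set to y; the kernel's IS_ENABLED() also accounts for =m, which this sketch leaves out.

```c
#include <assert.h>

/* "Is this macro defined to 1?" evaluated at preprocessing time. */
#define DEMO_PLACEHOLDER_1 0,
#define demo_take_second(ignored, val, ...) val
#define demo_is_defined(x)    demo_is_defined_(x)
#define demo_is_defined_(val) demo_is_defined__(DEMO_PLACEHOLDER_##val)
#define demo_is_defined__(arg1_or_junk) demo_take_second(arg1_or_junk 1, 0, 0)

/* Pretend Kconfig output: one option set to y, the other left unset. */
#define CONFIG_DEMO_ACCEPT_RA_DEFAULT 1
/* CONFIG_DEMO_OTHER is intentionally not defined. */

struct demo_devconf { int accept_ra; };

/* The default becomes 1 or 0 depending on the configuration. */
static const struct demo_devconf demo_dflt = {
    .accept_ra = demo_is_defined(CONFIG_DEMO_ACCEPT_RA_DEFAULT),
};
```

When the option is defined to 1, the placeholder expands to "0," and shifts the literal 1 into the second argument slot; otherwise the junk token stays in the first slot and the second argument is 0.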



Re: [RFC net-next 1/2] net: net-porcfs: Reduce rcu lock critical section

2018-04-12 Thread Saeed Mahameed
On Wed, 2018-04-11 at 19:59 -0700, Eric Dumazet wrote:
> 
> On 04/11/2018 04:47 PM, Saeed Mahameed wrote:
> > 
> > Well if we allow devices to access HW counters via FW command
> > interfaces in ndo_get_stats and by testing mlx5 where we query up
> > to 5
> > hw registers, it could take 100us, still this is way smaller than
> > 10sec
> >  :) and it is really a nice rate to fetch HW stats on demand.
> 
> If hardware stats are slower than software ones, maybe it is time to
> use software stats,
> instead of changing the whole stack ?
> 

We already have SW stats for [rx/tx]_[packets/bytes] but for
[rx/tx]_[error/drop] etc .. they can only be grabbed from HW.

We don't want to report only partial counters to get_stats ndo just to
avoid sleeping.

> There are very few devices drivers having issues like that.
> 



[net] xfrm: allow to release xfrm_state with flush

2018-04-12 Thread Jacek Kalwas
A call to flush SAs doesn't release an xfrm_state if there was traffic
associated with that state and the state was already deleted.

This patch calls xfrm_policy_cache_flush() regardless of whether any
states were actually deleted in xfrm_state_flush().

Signed-off-by: Jacek Kalwas 
---
 net/xfrm/xfrm_state.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index f9d2f2233f09..7d3d6a12a14f 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -734,10 +734,10 @@ int xfrm_state_flush(struct net *net, u8 proto, bool task_valid)
}
 out:
spin_unlock_bh(&net->xfrm.xfrm_state_lock);
-   if (cnt) {
+   if (cnt)
err = 0;
-   xfrm_policy_cache_flush();
-   }
+
+   xfrm_policy_cache_flush();
return err;
 }
 EXPORT_SYMBOL(xfrm_state_flush);
-- 
2.14.3



Intel Technology Poland sp. z o.o.
ul. Slowackiego 173 | 80-298 Gdansk | Sad Rejonowy Gdansk Polnoc | VII Wydzial 
Gospodarczy Krajowego Rejestru Sadowego - KRS 101882 | NIP 957-07-52-316 | 
Kapital zakladowy 200.000 PLN.

Ta wiadomosc wraz z zalacznikami jest przeznaczona dla okreslonego adresata i 
moze zawierac informacje poufne. W razie przypadkowego otrzymania tej 
wiadomosci, prosimy o powiadomienie nadawcy oraz trwale jej usuniecie; 
jakiekolwiek
przegladanie lub rozpowszechnianie jest zabronione.
This e-mail and any attachments may contain confidential material for the sole 
use of the intended recipient(s). If you are not the intended recipient, please 
contact the sender and delete all copies; any review or distribution by
others is strictly prohibited.



[net] xfrm: cover crypto status in xfrm_input

2018-04-12 Thread Jacek Kalwas
Status checking in xfrm_input() doesn't cover CRYPTO_GENERIC_ERROR and
CRYPTO_INVALID_PACKET_SYNTAX.

This patch adds an explicit check for CRYPTO_INVALID_PACKET_SYNTAX and
treats CRYPTO_GENERIC_ERROR as a status matching LINUX_MIB_XFRMINERROR.

Signed-off-by: Jacek Kalwas 
---
 net/xfrm/xfrm_input.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 352abca2605f..08d70ea774f9 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -285,7 +285,12 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
goto drop;
}
 
-   XFRM_INC_STATS(net, LINUX_MIB_XFRMINBUFFERERROR);
+   if (xo->status & CRYPTO_INVALID_PACKET_SYNTAX) {
+   XFRM_INC_STATS(net, LINUX_MIB_XFRMINBUFFERERROR);
+   goto drop;
+   }
+
+   XFRM_INC_STATS(net, LINUX_MIB_XFRMINERROR);
goto drop;
}
 
-- 
2.14.3
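
The counter selection this patch introduces can be summarised with the sketch below. It models only the hunk shown above; the flag values and the demo_* names are illustrative placeholders, not the definitions from include/net/xfrm.h, and the earlier status checks in xfrm_input() are not modelled.

```c
#include <assert.h>

/* Illustrative flag values only; the kernel's own values may differ. */
#define DEMO_CRYPTO_GENERIC_ERROR          0x02
#define DEMO_CRYPTO_INVALID_PACKET_SYNTAX  0x40

enum demo_mib { DEMO_MIB_INBUFFERERROR, DEMO_MIB_INERROR };

/* Pick the counter to bump for a failed offload status. */
static enum demo_mib demo_pick_counter(unsigned int status)
{
    if (status & DEMO_CRYPTO_INVALID_PACKET_SYNTAX)
        return DEMO_MIB_INBUFFERERROR;
    /* Everything else, including the generic error, is a plain input error. */
    return DEMO_MIB_INERROR;
}
```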






[net] udp: enable UDP checksum offload for ESP

2018-04-12 Thread Jacek Kalwas
When the NIC supports ESP TX checksum offload, skb->ip_summed is still
not set to CHECKSUM_PARTIAL, so the checksum is calculated in software.

This fix enables ESP TX checksum offload for UDP by extending the
condition with a check for NETIF_F_HW_ESP_TX_CSUM.

Signed-off-by: Jacek Kalwas 
---
 net/ipv4/ip_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 4c11b810a447..a2dfb5a9ba76 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -907,7 +907,7 @@ static int __ip_append_data(struct sock *sk,
length + fragheaderlen <= mtu &&
rt->dst.dev->features & (NETIF_F_HW_CSUM | NETIF_F_IP_CSUM) &&
!(flags & MSG_MORE) &&
-   !exthdrlen)
+   (!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
csummode = CHECKSUM_PARTIAL;
 
cork->length += length;
-- 
2.14.3
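
The effect of the changed condition can be seen in a reduced model. The sketch keeps only the device-feature clause and the exthdrlen clause; the other clauses of the real condition in __ip_append_data() are left out, and the DEMO_F_* bit values are placeholders rather than the kernel's NETIF_F_* definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit positions, not the kernel's NETIF_F_* values. */
#define DEMO_F_HW_CSUM        (1u << 0)
#define DEMO_F_IP_CSUM        (1u << 1)
#define DEMO_F_HW_ESP_TX_CSUM (1u << 2)

/* True when checksumming may be left to the device in this model. */
static bool demo_can_offload(unsigned int features, unsigned int exthdrlen)
{
    bool has_csum = features & (DEMO_F_HW_CSUM | DEMO_F_IP_CSUM);
    /* New clause: an extension header is fine if ESP TX csum is offloaded. */
    bool ext_ok = !exthdrlen || (features & DEMO_F_HW_ESP_TX_CSUM);

    return has_csum && ext_ok;
}
```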






[PATCH net 2/3] l2tp: hold reference on tunnels printed in pppol2tp proc file

2018-04-12 Thread Guillaume Nault
Use l2tp_tunnel_get_nth() instead of l2tp_tunnel_find_nth(), to be safe
against concurrent tunnel deletion.

Unlike sessions, we can't drop the reference held on tunnels in
pppol2tp_seq_show(). Tunnels are reused across several calls to
pppol2tp_seq_start() when iterating over sessions. These iterations
need the tunnel for accessing the next session. Therefore the only safe
moment for dropping the reference is just before searching for the next
tunnel.

Normally, the last invocation of pppol2tp_next_tunnel() doesn't find
any new tunnel, so it drops the last tunnel without taking any new
reference. However, in case of error, pppol2tp_seq_stop() is called
directly, so we have to drop the reference there.

Fixes: fd558d186df2 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault 
---
 net/l2tp/l2tp_ppp.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/net/l2tp/l2tp_ppp.c b/net/l2tp/l2tp_ppp.c
index 896bbca9bdaa..7d0c963680e6 100644
--- a/net/l2tp/l2tp_ppp.c
+++ b/net/l2tp/l2tp_ppp.c
@@ -1551,16 +1551,19 @@ struct pppol2tp_seq_data {
 
 static void pppol2tp_next_tunnel(struct net *net, struct pppol2tp_seq_data *pd)
 {
+   /* Drop reference taken during previous invocation */
+   if (pd->tunnel)
+   l2tp_tunnel_dec_refcount(pd->tunnel);
+
for (;;) {
-   pd->tunnel = l2tp_tunnel_find_nth(net, pd->tunnel_idx);
+   pd->tunnel = l2tp_tunnel_get_nth(net, pd->tunnel_idx);
pd->tunnel_idx++;
 
-   if (pd->tunnel == NULL)
-   break;
+   /* Only accept L2TPv2 tunnels */
+   if (!pd->tunnel || pd->tunnel->version == 2)
+   return;
 
-   /* Ignore L2TPv3 tunnels */
-   if (pd->tunnel->version < 3)
-   break;
+   l2tp_tunnel_dec_refcount(pd->tunnel);
}
 }
 
@@ -1609,7 +1612,14 @@ static void *pppol2tp_seq_next(struct seq_file *m, void *v, loff_t *pos)
 
 static void pppol2tp_seq_stop(struct seq_file *p, void *v)
 {
-   /* nothing to do */
+   struct pppol2tp_seq_data *pd = v;
+
+   if (!pd || pd == SEQ_START_TOKEN)
+   return;
+
+   /* Drop reference taken by last invocation of pppol2tp_next_tunnel() */
+   if (pd->tunnel)
+   l2tp_tunnel_dec_refcount(pd->tunnel);
 }
 
 static void pppol2tp_seq_tunnel_show(struct seq_file *m, void *v)
-- 
2.17.0
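
The reference handling described in this commit message can be illustrated with a user-space model. Everything named demo_* is invented for the sketch: an array stands in for the per-net tunnel list, a plain int stands in for the refcount, and there is no locking or RCU. It only shows the ordering of "drop previous, take next" and the extra drop on the stop path.

```c
#include <assert.h>
#include <stddef.h>

struct demo_tunnel { int version; int refcount; };

struct demo_seq {
    struct demo_tunnel *tunnels;  /* stand-in for the per-net tunnel list */
    int ntunnels;
    int tunnel_idx;
    struct demo_tunnel *tunnel;   /* currently held tunnel, or NULL */
};

/* Like a "get" lookup: returns the nth tunnel with a reference taken. */
static struct demo_tunnel *demo_get_nth(struct demo_seq *pd, int nth)
{
    if (nth < 0 || nth >= pd->ntunnels)
        return NULL;
    pd->tunnels[nth].refcount++;
    return &pd->tunnels[nth];
}

static void demo_put(struct demo_tunnel *t)
{
    assert(t->refcount > 0);
    t->refcount--;
}

/* Drop the previous reference, then hold the next version-2 tunnel. */
static void demo_next_tunnel(struct demo_seq *pd)
{
    if (pd->tunnel)
        demo_put(pd->tunnel);

    for (;;) {
        pd->tunnel = demo_get_nth(pd, pd->tunnel_idx);
        pd->tunnel_idx++;
        if (!pd->tunnel || pd->tunnel->version == 2)
            return;
        demo_put(pd->tunnel);   /* skipped tunnel: release immediately */
    }
}

/* Stop path: release whatever is still held. */
static void demo_stop(struct demo_seq *pd)
{
    if (pd->tunnel) {
        demo_put(pd->tunnel);
        pd->tunnel = NULL;
    }
}
```

At any point the iterator holds at most one reference, and every tunnel's count returns to its starting value once iteration ends or is stopped early.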



[PATCH net 3/3] l2tp: hold reference on tunnels printed in l2tp/tunnels debugfs file

2018-04-12 Thread Guillaume Nault
Use l2tp_tunnel_get_nth() instead of l2tp_tunnel_find_nth(), to be safe
against concurrent tunnel deletion.

Use the same mechanism as in l2tp_ppp.c for dropping the reference
taken by l2tp_tunnel_get_nth(). That is, drop the reference just
before looking up the next tunnel. In case of error, drop the last
accessed tunnel in l2tp_dfs_seq_stop().

That was the last use of l2tp_tunnel_find_nth().

Fixes: 0ad6614048cf ("l2tp: Add debugfs files for dumping l2tp debug info")
Signed-off-by: Guillaume Nault 
---
 net/l2tp/l2tp_core.c| 20 
 net/l2tp/l2tp_core.h|  1 -
 net/l2tp/l2tp_debugfs.c | 15 +--
 3 files changed, 13 insertions(+), 23 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index c8c4183f0f37..40261cb68e83 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -355,26 +355,6 @@ int l2tp_session_register(struct l2tp_session *session,
 }
 EXPORT_SYMBOL_GPL(l2tp_session_register);
 
-struct l2tp_tunnel *l2tp_tunnel_find_nth(const struct net *net, int nth)
-{
-   struct l2tp_net *pn = l2tp_pernet(net);
-   struct l2tp_tunnel *tunnel;
-   int count = 0;
-
-   rcu_read_lock_bh();
-   list_for_each_entry_rcu(tunnel, &pn->l2tp_tunnel_list, list) {
-   if (++count > nth) {
-   rcu_read_unlock_bh();
-   return tunnel;
-   }
-   }
-
-   rcu_read_unlock_bh();
-
-   return NULL;
-}
-EXPORT_SYMBOL_GPL(l2tp_tunnel_find_nth);
-
 /*
  * Receive data handling
  */
diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index e4896413b2b6..c199020f8a8a 100644
--- a/net/l2tp/l2tp_core.h
+++ b/net/l2tp/l2tp_core.h
@@ -222,7 +222,6 @@ struct l2tp_session *l2tp_session_get(const struct net *net,
 struct l2tp_session *l2tp_session_get_nth(struct l2tp_tunnel *tunnel, int nth);
 struct l2tp_session *l2tp_session_get_by_ifname(const struct net *net,
const char *ifname);
-struct l2tp_tunnel *l2tp_tunnel_find_nth(const struct net *net, int nth);
 
 int l2tp_tunnel_create(struct net *net, int fd, int version, u32 tunnel_id,
   u32 peer_tunnel_id, struct l2tp_tunnel_cfg *cfg,
diff --git a/net/l2tp/l2tp_debugfs.c b/net/l2tp/l2tp_debugfs.c
index 72e713da4733..b8f9d45bfeb1 100644
--- a/net/l2tp/l2tp_debugfs.c
+++ b/net/l2tp/l2tp_debugfs.c
@@ -47,7 +47,11 @@ struct l2tp_dfs_seq_data {
 
 static void l2tp_dfs_next_tunnel(struct l2tp_dfs_seq_data *pd)
 {
-   pd->tunnel = l2tp_tunnel_find_nth(pd->net, pd->tunnel_idx);
+   /* Drop reference taken during previous invocation */
+   if (pd->tunnel)
+   l2tp_tunnel_dec_refcount(pd->tunnel);
+
+   pd->tunnel = l2tp_tunnel_get_nth(pd->net, pd->tunnel_idx);
pd->tunnel_idx++;
 }
 
@@ -96,7 +100,14 @@ static void *l2tp_dfs_seq_next(struct seq_file *m, void *v, loff_t *pos)
 
 static void l2tp_dfs_seq_stop(struct seq_file *p, void *v)
 {
-   /* nothing to do */
+   struct l2tp_dfs_seq_data *pd = v;
+
+   if (!pd || pd == SEQ_START_TOKEN)
+   return;
+
+   /* Drop reference taken by last invocation of l2tp_dfs_next_tunnel() */
+   if (pd->tunnel)
+   l2tp_tunnel_dec_refcount(pd->tunnel);
 }
 
 static void l2tp_dfs_seq_tunnel_show(struct seq_file *m, void *v)
-- 
2.17.0



[PATCH net 0/3] l2tp: remove unsafe calls to l2tp_tunnel_find_nth()

2018-04-12 Thread Guillaume Nault
Using l2tp_tunnel_find_nth() is racy, because the returned tunnel can
go away as soon as this function returns. This series introduces
l2tp_tunnel_get_nth() as a safe replacement to fix these races.

With this series, all unsafe tunnel/session lookups are finally gone.

Guillaume Nault (3):
  l2tp: hold reference on tunnels in netlink dumps
  l2tp: hold reference on tunnels printed in pppol2tp proc file
  l2tp: hold reference on tunnels printed in l2tp/tunnels debugfs file

 net/l2tp/l2tp_core.c| 40 
 net/l2tp/l2tp_core.h|  3 ++-
 net/l2tp/l2tp_debugfs.c | 15 +--
 net/l2tp/l2tp_netlink.c | 11 ---
 net/l2tp/l2tp_ppp.c | 24 +---
 5 files changed, 60 insertions(+), 33 deletions(-)

-- 
2.17.0
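
The difference between the racy lookup and its replacement can be sketched
in userspace C. This is only an illustration under simplifying assumptions:
a plain singly linked list and a non-atomic counter stand in for the RCU
list and the kernel refcount, and the names struct tunnel, tunnel_add(),
tunnel_get_nth() and tunnel_put() are hypothetical, not taken from the
patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct l2tp_tunnel. */
struct tunnel {
	int id;
	int refcount;		/* the kernel uses an atomic counter here */
	struct tunnel *next;
};

/* Allocate a tunnel holding one reference (the list's own) and link it. */
static struct tunnel *tunnel_add(struct tunnel **head, int id)
{
	struct tunnel *t = malloc(sizeof(*t));

	if (!t)
		return NULL;
	t->id = id;
	t->refcount = 1;
	t->next = *head;
	*head = t;
	return t;
}

/* Return the nth tunnel (0-based) with an extra reference held, or NULL. */
static struct tunnel *tunnel_get_nth(struct tunnel *head, int nth)
{
	int count = 0;

	for (struct tunnel *t = head; t; t = t->next) {
		if (++count > nth) {
			t->refcount++;	/* taken before the lookup returns */
			return t;
		}
	}
	return NULL;
}

/*
 * Drop a reference. The tunnel is freed when the last reference goes away;
 * unlinking it from the list first is the owner's job and is not modeled.
 */
static void tunnel_put(struct tunnel *t)
{
	assert(t->refcount > 0);
	if (--t->refcount == 0)
		free(t);
}
```

A caller pairs every successful tunnel_get_nth() with exactly one
tunnel_put(), which is the same discipline the converted dump loops follow
with l2tp_tunnel_dec_refcount().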



[PATCH net 1/3] l2tp: hold reference on tunnels in netlink dumps

2018-04-12 Thread Guillaume Nault
l2tp_tunnel_find_nth() is unsafe: no reference is held on the returned
tunnel, therefore it can be freed whenever the caller uses it.
This patch defines l2tp_tunnel_get_nth() which works similarly, but
also takes a reference on the returned tunnel. The caller then has to
drop it after it stops using the tunnel.

Convert netlink dumps to make them safe against concurrent tunnel
deletion.

Fixes: 309795f4bec2 ("l2tp: Add netlink control API for L2TP")
Signed-off-by: Guillaume Nault 
---
 net/l2tp/l2tp_core.c| 20 
 net/l2tp/l2tp_core.h|  2 ++
 net/l2tp/l2tp_netlink.c | 11 ---
 3 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index 0fbd3ee26165..c8c4183f0f37 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -183,6 +183,26 @@ struct l2tp_tunnel *l2tp_tunnel_get(const struct net *net, u32 tunnel_id)
 }
 EXPORT_SYMBOL_GPL(l2tp_tunnel_get);
 
+struct l2tp_tunnel *l2tp_tunnel_get_nth(const struct net *net, int nth)
+{
+   const struct l2tp_net *pn = l2tp_pernet(net);
+   struct l2tp_tunnel *tunnel;
+   int count = 0;
+
+   rcu_read_lock_bh();
+   list_for_each_entry_rcu(tunnel, &pn->l2tp_tunnel_list, list) {
+   if (++count > nth) {
+   l2tp_tunnel_inc_refcount(tunnel);
+   rcu_read_unlock_bh();
+   return tunnel;
+   }
+   }
+   rcu_read_unlock_bh();
+
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(l2tp_tunnel_get_nth);
+
 /* Lookup a session. A new reference is held on the returned session. */
 struct l2tp_session *l2tp_session_get(const struct net *net,
  struct l2tp_tunnel *tunnel,
diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index ba33cbec71eb..e4896413b2b6 100644
--- a/net/l2tp/l2tp_core.h
+++ b/net/l2tp/l2tp_core.h
@@ -212,6 +212,8 @@ static inline void *l2tp_session_priv(struct l2tp_session *session)
 }
 
 struct l2tp_tunnel *l2tp_tunnel_get(const struct net *net, u32 tunnel_id);
+struct l2tp_tunnel *l2tp_tunnel_get_nth(const struct net *net, int nth);
+
 void l2tp_tunnel_free(struct l2tp_tunnel *tunnel);
 
 struct l2tp_session *l2tp_session_get(const struct net *net,
diff --git a/net/l2tp/l2tp_netlink.c b/net/l2tp/l2tp_netlink.c
index b05dbd9ffcb2..6616c9fd292f 100644
--- a/net/l2tp/l2tp_netlink.c
+++ b/net/l2tp/l2tp_netlink.c
@@ -487,14 +487,17 @@ static int l2tp_nl_cmd_tunnel_dump(struct sk_buff *skb, struct netlink_callback
struct net *net = sock_net(skb->sk);
 
for (;;) {
-   tunnel = l2tp_tunnel_find_nth(net, ti);
+   tunnel = l2tp_tunnel_get_nth(net, ti);
if (tunnel == NULL)
goto out;
 
if (l2tp_nl_tunnel_send(skb, NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq, NLM_F_MULTI,
-   tunnel, L2TP_CMD_TUNNEL_GET) < 0)
+   tunnel, L2TP_CMD_TUNNEL_GET) < 0) {
+   l2tp_tunnel_dec_refcount(tunnel);
goto out;
+   }
+   l2tp_tunnel_dec_refcount(tunnel);
 
ti++;
}
@@ -848,7 +851,7 @@ static int l2tp_nl_cmd_session_dump(struct sk_buff *skb, struct netlink_callback
 
for (;;) {
if (tunnel == NULL) {
-   tunnel = l2tp_tunnel_find_nth(net, ti);
+   tunnel = l2tp_tunnel_get_nth(net, ti);
if (tunnel == NULL)
goto out;
}
@@ -856,6 +859,7 @@ static int l2tp_nl_cmd_session_dump(struct sk_buff *skb, struct netlink_callback
session = l2tp_session_get_nth(tunnel, si);
if (session == NULL) {
ti++;
+   l2tp_tunnel_dec_refcount(tunnel);
tunnel = NULL;
si = 0;
continue;
@@ -865,6 +869,7 @@ static int l2tp_nl_cmd_session_dump(struct sk_buff *skb, struct netlink_callback
 cb->nlh->nlmsg_seq, NLM_F_MULTI,
 session, L2TP_CMD_SESSION_GET) < 0) {
l2tp_session_dec_refcount(session);
+   l2tp_tunnel_dec_refcount(tunnel);
break;
}
l2tp_session_dec_refcount(session);
-- 
2.17.0



Re: XDP_TX for virtio_net not working in recent kernel?

2018-04-12 Thread Kimitoshi Takahashi

Hello Jason,

The patch fixed the issue.
Thank you very much!


On 04/11/2018 10:50 AM, Jason Wang wrote:


I suspect a kick is missed in the XDP_TX case.

Could you please try the attached patch to see if it fixes the issue?

Thanks


--
Kimitoshi Takahashi


Re: [PATCH net] net: dsa: mv88e6xxx: Fix receive time stamp race condition.

2018-04-12 Thread Richard Cochran
On Mon, Apr 09, 2018 at 07:19:31AM -0700, Richard Cochran wrote:
> Dave, please hold off on this patch.  I am seeing new problems in my
> testing with this applied.  I still need to get to the bottom of
> this.

Looks like the new problems are a HW/board glitch.

The patch is good to go.
 
Thanks,
Richard


[RFC net-next] ipv6: send netlink notifications for manually configured addresses

2018-04-12 Thread Lorenzo Bianconi
Send a netlink notification when userspace adds a manually configured
address if DAD is enabled and optimistic flag isn't set.
Moreover send RTM_DELADDR notifications for tentative addresses.

Some userspace applications (e.g. NetworkManager) are interested in
addr netlink events even while the address is still in the tentative
state; however, no events are sent until the DAD process has completed.
If the address is added and immediately removed, userspace listeners
are not notified at all. This behaviour can be easily reproduced by using
veth interfaces:

$ ip -b - <<EOF
> link add dev vm1 type veth peer name vm2
> link set dev vm1 up
> link set dev vm2 up
> addr add 2001:db8:a:b:1:2:3:4/64 dev vm1
> addr del 2001:db8:a:b:1:2:3:4/64 dev vm1
EOF

This patch reverts the behaviour introduced by the commit f784ad3d79e5
("ipv6: do not send RTM_DELADDR for tentative addresses")

Suggested-by: Thomas Haller 
Signed-off-by: Lorenzo Bianconi 
---
 net/ipv6/addrconf.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 78cef00c9596..dffa38004c13 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2902,6 +2902,11 @@ static int inet6_addr_add(struct net *net, int ifindex,
  expires, flags);
}
 
+   /* Send a netlink notification if DAD is enabled and
+* optimistic flag is not set
+*/
+   if (!(ifp->flags & (IFA_F_OPTIMISTIC | IFA_F_NODAD)))
+   ipv6_ifa_notify(0, ifp);
/*
 * Note that section 3.1 of RFC 4429 indicates
 * that the Optimistic flag should not be set for
@@ -5029,14 +5034,6 @@ static void inet6_ifa_notify(int event, struct inet6_ifaddr *ifa)
struct net *net = dev_net(ifa->idev->dev);
int err = -ENOBUFS;
 
-   /* Don't send DELADDR notification for TENTATIVE address,
-* since NEWADDR notification is sent only after removing
-* TENTATIVE flag, if DAD has not failed.
-*/
-   if (ifa->flags & IFA_F_TENTATIVE && !(ifa->flags & IFA_F_DADFAILED) &&
-   event == RTM_DELADDR)
-   return;
-
skb = nlmsg_new(inet6_ifaddr_msgsize(), GFP_ATOMIC);
if (!skb)
goto errout;
-- 
2.14.3



Re: SRIOV switchdev mode BoF minutes

2018-04-12 Thread Samudrala, Sridhar

On 11/12/2017 11:49 AM, Or Gerlitz wrote:

Hi Dave and all,

During and after the BoF on SRIOV switchdev mode, we came to a
consensus among the developers from four different HW vendors (CC
audience) that a correct thing to do would be to disallow any new
extensions to the legacy mode.

The idea is to put focus on the new mode and not add new UAPIs and
kernel code for what turned out to be a wrong design, one that does not
allow for properly offloading a kernel switching SW model to e-switch HW.

We also had a good session the day after regarding alignment for the
representation model of the uplink (physical port) and PF/s.

The VF representor netdevs  exist for all drivers that support the new
mode but the representation for the uplink and PF wasn't the same for
all. The decision was to represent the uplink and PFs vports in the
same manner done for VFs, using rep netdevs. This alignment would
provide a more strict and clear view of the kernel model for e-switch
to users and upper layer control plane SW.


I don't see any changes in the Mellanox or other drivers to move to this
new model and enable the uplink and PF port representors. Any updates?

It would be really nice to highlight the pros and cons of the old versus the
new model.

We are looking into adding switchdev support for our new 100Gb ice driver
and could use some feedback on the direction we should be taking.

Thanks
Sridhar



Re: [Cluster-devel] [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek

2018-04-12 Thread Bob Peterson
- Original Message -
> Here's a second version of the patch (now a patch set) to eliminate
> rhashtable_walk_peek in gfs2.
> 
> The first patch introduces lockref_put_not_zero, the inverse of
> lockref_get_not_zero.
> 
> The second patch eliminates rhashtable_walk_peek in gfs2.  In
> gfs2_glock_iter_next, the new lockref function from patch one is used to
> drop a lockref count as long as the count doesn't drop to zero.  This is
> almost always the case; if there is a risk of dropping the last
> reference, we must defer that to a work queue because dropping the last
> reference may sleep.
> 
> Thanks,
> Andreas
> 
> Andreas Gruenbacher (2):
>   lockref: Add lockref_put_not_zero
>   gfs2: Stop using rhashtable_walk_peek
> 
>  fs/gfs2/glock.c | 47 ---
>  include/linux/lockref.h |  1 +
>  lib/lockref.c   | 28 
>  3 files changed, 57 insertions(+), 19 deletions(-)
> 
> --
> 2.14.3

Hi,

Thanks. These two patches are now pushed to the for-next branch of the 
linux-gfs2 tree:

https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/?h=for-next&id=450b1f6f56350c630e795f240dc5a77aa8aa2419
https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git/commit/?h=for-next&id=3fd5d3ad35dc44aaf0f28d60cc0eb75887bff54d

Regards,

Bob Peterson
Red Hat File Systems
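
The "put unless it is the last reference" idea from the quoted cover letter
can be sketched with C11 atomics. This is a userspace analogue only, written
under the assumption that a bare atomic counter is an acceptable stand-in
for a lockref; put_not_zero() is a hypothetical name and the code is not the
kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Decrement the count unless that would drop the last reference.
 * Returns true if the count was decremented, false if the caller still
 * holds the reference and must release it some other way.
 */
static bool put_not_zero(atomic_int *count)
{
	int old = atomic_load(count);

	assert(old >= 0);	/* a negative count would indicate a bug */
	while (old > 1) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return true;
	}
	return false;
}
```

When the function returns false, the final put is the part that may sleep,
which is why the description above defers it to a work queue.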


[PATCH] NFC: fix attrs checks in netlink interface

2018-04-12 Thread Andrey Konovalov
nfc_genl_deactivate_target() relies on the NFC_ATTR_TARGET_INDEX
attribute being present, but doesn't check whether it is actually
provided by the user. Same goes for nfc_genl_fw_download() and
NFC_ATTR_FIRMWARE_NAME.

This patch adds appropriate checks.

Found with syzkaller.

Signed-off-by: Andrey Konovalov 
---
 net/nfc/netlink.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/nfc/netlink.c b/net/nfc/netlink.c
index f018eafc2a0d..58adfb0c90f6 100644
--- a/net/nfc/netlink.c
+++ b/net/nfc/netlink.c
@@ -936,7 +936,8 @@ static int nfc_genl_deactivate_target(struct sk_buff *skb,
u32 device_idx, target_idx;
int rc;
 
-   if (!info->attrs[NFC_ATTR_DEVICE_INDEX])
+   if (!info->attrs[NFC_ATTR_DEVICE_INDEX] ||
+   !info->attrs[NFC_ATTR_TARGET_INDEX])
return -EINVAL;
 
device_idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]);
@@ -1245,7 +1246,8 @@ static int nfc_genl_fw_download(struct sk_buff *skb, struct genl_info *info)
u32 idx;
char firmware_name[NFC_FIRMWARE_NAME_MAXSIZE + 1];
 
-   if (!info->attrs[NFC_ATTR_DEVICE_INDEX])
+   if (!info->attrs[NFC_ATTR_DEVICE_INDEX] ||
+   !info->attrs[NFC_ATTR_FIRMWARE_NAME])
return -EINVAL;
 
idx = nla_get_u32(info->attrs[NFC_ATTR_DEVICE_INDEX]);
-- 
2.17.0.484.g0c8726318c-goog
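
The pattern the patch applies (verify that every attribute a handler is
about to dereference was actually supplied) can be sketched generically.
The attribute indices and the check_required() helper below are
hypothetical and exist only for illustration; the real handlers test
info->attrs[] directly, as shown in the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical attribute indices; the real ones come from the NFC UAPI. */
enum {
	ATTR_DEVICE_INDEX,
	ATTR_TARGET_INDEX,
	ATTR_FIRMWARE_NAME,
	ATTR_MAX
};

/* Return 0 if every required attribute is present, -EINVAL otherwise. */
static int check_required(const void *const *attrs,
			  const int *required, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		assert(required[i] >= 0 && required[i] < ATTR_MAX);
		if (!attrs[required[i]])
			return -EINVAL;	/* missing: do not dereference */
	}
	return 0;
}
```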



Re: [PATCH 08/15] ASoC: pxa: remove the dmaengine compat need

2018-04-12 Thread Robert Jarzmik
Mark Brown  writes:

> On Mon, Apr 02, 2018 at 04:26:49PM +0200, Robert Jarzmik wrote:
>> As the pxa architecture switched towards the dmaengine slave map, the
>> old compatibility mechanism to acquire the dma requestor line number and
>> priority are not needed anymore.
>
> Acked-by: Mark Brown 
>
> If there's no dependency I'm happy to take this for 4.18.
Thanks for the ack.

The patches 1 and 2 are the dependency here, so I'd rather push it through my
tree once the review is complete.

Cheers.

-- 
Robert


Re: v6/sit tunnels and VRFs

2018-04-12 Thread Jeff Barnhill
Hi David,

In the slides referenced, you recommend adding an "unreachable
default" route to the end of each VRF route table.  In my testing (for
v4) this results in a change to fib lookup failures such that instead
of ENETUNREACH being returned, EHOSTUNREACH is returned since the fib
finds the unreachable route, versus failing to find a route
altogether.

Have the implications of this been considered?  I don't see a
clean/easy way to achieve the old behavior without affecting non-VRF
routing (eg. remove the unreachable route and delete the non-VRF
rules).  I'm guessing that programmatically, it may not make much
difference, ie. lookup fails, but for debugging or to a user looking
at it, the difference matters.  Do you (or anyone else) have any
thoughts on this?

Thanks,
Jeff
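
To make the errno difference concrete, here is a toy single-table lookup.
It is a sketch under simplifying assumptions: one table, a linear
longest-prefix match, and hypothetical names (struct route, table_lookup());
the kernel's FIB and its rule processing are far more involved.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum route_type { RT_UNICAST, RT_UNREACHABLE };

struct route {
	uint32_t prefix;	/* host byte order */
	int plen;		/* prefix length, 0..32 */
	enum route_type type;
};

static uint32_t prefix_mask(int plen)
{
	assert(plen >= 0 && plen <= 32);
	return plen ? UINT32_MAX << (32 - plen) : 0;
}

/*
 * Longest-prefix match over one table.
 * Returns 0 for a usable route, -EHOSTUNREACH if the best match is an
 * unreachable route, and -ENETUNREACH if nothing matches at all.
 */
static int table_lookup(const struct route *tbl, size_t n, uint32_t dst)
{
	const struct route *best = NULL;

	for (size_t i = 0; i < n; i++) {
		if ((dst & prefix_mask(tbl[i].plen)) != tbl[i].prefix)
			continue;
		if (!best || tbl[i].plen > best->plen)
			best = &tbl[i];
	}
	if (!best)
		return -ENETUNREACH;
	return best->type == RT_UNREACHABLE ? -EHOSTUNREACH : 0;
}
```

With only 192.168.210.0/24 in the table, a lookup for 10.0.0.1 fails with
-ENETUNREACH; once a 0.0.0.0/0 unreachable entry is appended, the same
lookup returns -EHOSTUNREACH instead.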


On Sun, Oct 29, 2017 at 11:48 AM, David Ahern  wrote:
> On 10/27/17 8:43 PM, Jeff Barnhill wrote:
>> ping v4 loopback...
>>
>> jeff@VM2:~$ ip route list vrf myvrf
>> 127.0.0.0/8 dev myvrf proto kernel scope link src 127.0.0.1
>> 192.168.200.0/24 via 192.168.210.3 dev enp0s8
>> 192.168.210.0/24 dev enp0s8 proto kernel scope link src 192.168.210.2
>>
>> Lookups shown in perf script were for table 255.  Is it necessary to
>> put the l3mdev table first?  If I re-order the tables, it starts
>> working:
>
> Yes, we advise moving the local table down to avoid false hits (e.g.,
> duplicate addresses like this between the default VRF and another VRF).
>
> I covered that and a few other things at OSS 2017. Latest VRF slides for
> users:
>   http://schd.ws/hosted_files/ossna2017/fe/vrf-tutorial-oss.pdf


Re: [PATCH v2] net: dsa: b53: Using sleep-able operations in b53_switch_reset_gpio

2018-04-12 Thread Florian Fainelli


On 04/11/2018 06:48 PM, Jia-Ju Bai wrote:
> b53_switch_reset_gpio() is never called in atomic context.
> 
> The call chain ending up at b53_switch_reset_gpio() is:
> [1] b53_switch_reset_gpio() <- b53_switch_reset() <-
> b53_reset_switch() <- b53_setup()
> 
> b53_switch_reset_gpio() is set as ".setup" in struct dsa_switch_ops.
> This function is not called in atomic context.
> 
> Despite never getting called from atomic context, b53_switch_reset_gpio()
> calls non-sleep operations mdelay() and gpio_set_value().
> They are not necessary and can be replaced with msleep() 
> and gpio_set_value_cansleep().
> 
> This is found by a static analysis tool named DCNS written by myself.
> And I also manually check it.
> 
> Signed-off-by: Jia-Ju Bai 

Acked-by: Florian Fainelli 

-- 
Florian


Re: TCP one-by-one acking - RFC interpretation question

2018-04-12 Thread Yuchung Cheng
On Wed, Apr 11, 2018 at 5:06 AM, Michal Kubecek  wrote:
> On Wed, Apr 11, 2018 at 12:58:37PM +0200, Michal Kubecek wrote:
>> There is something else I don't understand, though. In the case of
>> acking previously sacked and never retransmitted segment,
>> tcp_clean_rtx_queue() calculates the parameters for tcp_ack_update_rtt()
>> using
>>
>> if (sack->first_sackt.v64) {
>> sack_rtt_us = skb_mstamp_us_delta(&now,
>> &sack->first_sackt);
>> ca_rtt_us = skb_mstamp_us_delta(&now,
>> &sack->last_sackt);
>> }
>>
>> (in 4.4; mainline code replaces &now with tp->tcp_mstamp). If I read the
>> code correctly, both sack->first_sackt and sack->last_sackt contain
>> timestamps of initial segment transmission. This would mean we use the
>> time difference between the initial transmission and now (i.e. including
>> the RTO of the lost packet).
>>
>> IMHO we should take the actual round trip time instead, i.e. the
>> difference between the original transmission and the time the packet
>> sacked (first time). It seems we have been doing this before commit
>> 31231a8a8730 ("tcp: improve RTT from SACK for CC").
>
> Sorry for the noise, this was my misunderstanding, the first_sackt and
> last_sackt values are only taken from segments newly sacked by ack
> received right now, not those which were already sacked before.
>
> The actual problem and unrealistic RTT measurements come from another
> RFC violation I didn't mention before: the NAS doesn't follow RFC 2018
> section 4 rule for ordering of SACK blocks. Rather than sending the
> three most recently received out-of-order blocks as SACK blocks, it
> simply sends the first three ordered by sequence number. In the earlier
> example (odd packets were received, even lost)
>
>ACK SAK SAK SAK
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     |   1   |   2   |   3   |   4   |   5   |   6   |   7   |   8   |   9   |
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>   34273   35701   37129   38557   39985   41413   42841   44269   45697   47125
>
> it responds to retransmitted segment 2 by
>
>   1. ACK 37129, SACK 37129-38557 39985-41413 42841-44269
>   2. ACK 38557, SACK 39985-41413 42841-44269 45697-47125
>
> This new SACK block 45697-47125 has not been retransmitted and as it
> wasn't sacked before, it is considered newly sacked. Therefore it gets
> processed and its deemed RTT (time since its original transmit time)
> "poisons" the RTT calculation, leading to RTO spiraling up.
>
> Thus if we want to work around the NAS behaviour, we would need to
> recognize such new SACK block as "not really new" and ignore it for
> first_sackt/last_sackt. I'm not sure if it's possible without
> misinterpreting actually delayed out of order packets. Of course, it is
> not clear if it's worth the effort to work around so severely broken TCP
> implementations (two obvious RFC violations, even if we don't count the
> one-by-one acking).
Right. Not much we (sender) can do if the receiver is not reporting
the delivery status correctly. This also negatively impacts TCP
congestion control (Cubic, Reno, BBR, CDG etc) because we've changed
it to increase/decrease cwnd based on both inorder and out-of-order
delivery.

We're close to publishing our internal packetdrill tests. Hopefully they
can be used to test these poor implementations.
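
For readers following the RTT discussion, the effect of a late-reported SACK
block can be shown with a deliberately simplified model. Everything here is
an assumption made for illustration: struct seg, sack_rtt_sample() and the
timestamps are hypothetical, sequence wraparound is ignored, and only the
"earliest newly sacked, never retransmitted segment" part of the real logic
is modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct seg {
	uint32_t start, end;	/* [start, end) in sequence space */
	int64_t xmit_us;	/* time of the original transmission */
	int sacked;		/* reported by an earlier SACK block? */
	int retrans;		/* has this segment been retransmitted? */
};

/*
 * Mark the segments covered by SACK block [blk_start, blk_end) and return
 * an RTT sample in microseconds, measured from the earliest-sent segment
 * that is newly sacked and was never retransmitted. Returns -1 if the
 * block yields no sample.
 */
static int64_t sack_rtt_sample(struct seg *segs, size_t n,
			       uint32_t blk_start, uint32_t blk_end,
			       int64_t now_us)
{
	int64_t first_xmit = -1;

	for (size_t i = 0; i < n; i++) {
		struct seg *s = &segs[i];

		if (s->start < blk_start || s->end > blk_end)
			continue;	/* not covered by this block */
		if (s->sacked)
			continue;	/* already known, not "new" */
		s->sacked = 1;
		if (s->retrans)
			continue;	/* ambiguous, no sample */
		if (first_xmit < 0 || s->xmit_us < first_xmit)
			first_xmit = s->xmit_us;
	}
	if (first_xmit < 0)
		return -1;
	assert(now_us >= first_xmit);
	return now_us - first_xmit;
}
```

If the block 45697-47125 was sent at t = 200 us but the receiver first
reports it a full second later, the sample is 999800 us regardless of when
the data actually arrived, which is the "poisoning" described in the quoted
text.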

>
> Michal Kubecek


Re: [PATCH net] tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets

2018-04-12 Thread Yuchung Cheng
On Wed, Apr 11, 2018 at 2:36 PM, Eric Dumazet  wrote:
>
> syzbot/KMSAN reported an uninit-value in tcp_parse_options() [1]
>
> I believe this was caused by a TCP_MD5SIG being set on live
> flow.
>
> This is highly unexpected, since TCP option space is limited.
>
> For instance, presence of TCP MD5 option automatically disables
> TCP TimeStamp option at SYN/SYNACK time, which we can not do
> once flow has been established.
>
> Really, adding/deleting an MD5 key only makes sense on sockets
> in CLOSE or LISTEN state.
>
> [1]
> BUG: KMSAN: uninit-value in tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
> CPU: 1 PID: 6177 Comm: syzkaller192004 Not tainted 4.16.0+ #83
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
>  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
>  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
>  tcp_parse_options+0xd74/0x1a30 net/ipv4/tcp_input.c:3720
>  tcp_fast_parse_options net/ipv4/tcp_input.c:3858 [inline]
>  tcp_validate_incoming+0x4f1/0x2790 net/ipv4/tcp_input.c:5184
>  tcp_rcv_established+0xf60/0x2bb0 net/ipv4/tcp_input.c:5453
>  tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
>  sk_backlog_rcv include/net/sock.h:908 [inline]
>  __release_sock+0x2d6/0x680 net/core/sock.c:2271
>  release_sock+0x97/0x2a0 net/core/sock.c:2786
>  tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
>  inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
>  sock_sendmsg_nosec net/socket.c:630 [inline]
>  sock_sendmsg net/socket.c:640 [inline]
>  SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
>  SyS_sendto+0x8a/0xb0 net/socket.c:1715
>  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> RIP: 0033:0x448fe9
> RSP: 002b:7fd472c64d38 EFLAGS: 0216 ORIG_RAX: 002c
> RAX: ffda RBX: 006e5a30 RCX: 00448fe9
> RDX: 029f RSI: 20a88f88 RDI: 0004
> RBP: 006e5a34 R08: 20e68000 R09: 0010
> R10: 27fd R11: 0216 R12: 
> R13: 7fff074899ef R14: 7fd472c659c0 R15: 0009
>
> Uninit was created at:
>  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
>  kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
>  kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
>  kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
>  slab_post_alloc_hook mm/slab.h:445 [inline]
>  slab_alloc_node mm/slub.c:2737 [inline]
>  __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
>  __kmalloc_reserve net/core/skbuff.c:138 [inline]
>  __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
>  alloc_skb include/linux/skbuff.h:984 [inline]
>  tcp_send_ack+0x18c/0x910 net/ipv4/tcp_output.c:3624
>  __tcp_ack_snd_check net/ipv4/tcp_input.c:5040 [inline]
>  tcp_ack_snd_check net/ipv4/tcp_input.c:5053 [inline]
>  tcp_rcv_established+0x2103/0x2bb0 net/ipv4/tcp_input.c:5469
>  tcp_v4_do_rcv+0x6cd/0xd90 net/ipv4/tcp_ipv4.c:1469
>  sk_backlog_rcv include/net/sock.h:908 [inline]
>  __release_sock+0x2d6/0x680 net/core/sock.c:2271
>  release_sock+0x97/0x2a0 net/core/sock.c:2786
>  tcp_sendmsg+0xd6/0x100 net/ipv4/tcp.c:1464
>  inet_sendmsg+0x48d/0x740 net/ipv4/af_inet.c:764
>  sock_sendmsg_nosec net/socket.c:630 [inline]
>  sock_sendmsg net/socket.c:640 [inline]
>  SYSC_sendto+0x6c3/0x7e0 net/socket.c:1747
>  SyS_sendto+0x8a/0xb0 net/socket.c:1715
>  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>
> Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.")
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 
> ---
Acked-by: Yuchung Cheng 

Thanks for the fix!
>  net/ipv4/tcp.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index bccc4c2700870b8c7ff592a6bd27acebd9bc6471..4fa3f812b9ff8954a9b6a018c648ff12ab995721 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2813,8 +2813,10 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>  #ifdef CONFIG_TCP_MD5SIG
> case TCP_MD5SIG:
> case TCP_MD5SIG_EXT:
> -   /* Read the IP->Key mappings from userspace */
> -   err = tp->af_specific->md5_parse(sk, optname, optval, optlen);
> +   if ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
> +   err = tp->af_specific->md5_parse(sk, optname, optval, optlen);
> +   else
> +   err = -EINVAL;
> break;
>  #endif
> case TCP_USER_TIMEOUT:
> --
> 2.17.0.484.g0c8726318c-goog
>
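
The state test in the quoted hunk relies on the usual "one bit per state"
idiom. A standalone sketch follows; the three state values match the
kernel's include/net/tcp_states.h, while md5_key_allowed() is a
hypothetical helper used only for illustration.

```c
#include <assert.h>
#include <errno.h>

/* Subset of the kernel's TCP state numbers. */
enum {
	TCP_ESTABLISHED = 1,
	TCP_CLOSE = 7,
	TCP_LISTEN = 10
};

#define TCPF_CLOSE	(1 << TCP_CLOSE)
#define TCPF_LISTEN	(1 << TCP_LISTEN)

/* Return 0 if an MD5 key may be changed in this state, -EINVAL otherwise. */
static int md5_key_allowed(int sk_state)
{
	assert(sk_state > 0 && sk_state < 31);
	if ((1 << sk_state) & (TCPF_CLOSE | TCPF_LISTEN))
		return 0;
	return -EINVAL;
}
```

Shifting 1 by the state number turns a membership test over several states
into a single AND against a mask, so adding another permitted state only
means OR-ing in one more flag.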


Re: [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)

2018-04-12 Thread Stephen Suryaputra
Thanks for the feedback. Please see the details below:

On Wed, Apr 11, 2018 at 3:37 PM, Julian Anastasov  wrote:
[snip]
>> - __IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
>> + __IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_INHDRERRORS);
>
> May be skb->dev if we want to account it to the
> input device.
>
Yes. I was about to change it, but see the next one.

[snip]
>> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
>> index 4527921..32bd3af 100644
>> --- a/net/netfilter/ipvs/ip_vs_xmit.c
>> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
>> @@ -286,7 +286,7 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
>>   {
>>   if (ip_hdr(skb)->ttl <= 1) {
>>   /* Tell the sender its packet died... */
>> - __IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
>> + __IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_INHDRERRORS);
>
> At this point, skb_dst(skb) can be:
>
> - input route at LOCAL_IN => dst->dev is "lo", skb->dev = input_device
> - output route at LOCAL_OUT => dst->dev is output_device, skb->dev = NULL
>
> We should see this error on LOCAL_IN but better to be
> safe: use 'skb->dev ? : skb_dst(skb)->dev' instead of just
> 'skb_dst(skb)->dev'.
>
This follows v6 implementation in the same function:

#ifdef CONFIG_IP_VS_IPV6
if (skb_af == AF_INET6) {
struct dst_entry *dst = skb_dst(skb);

/* check and decrement ttl */
if (ipv6_hdr(skb)->hop_limit <= 1) {
/* Force OUTPUT device used as source address */
skb->dev = dst->dev;
icmpv6_send(skb, ICMPV6_TIME_EXCEED,
ICMPV6_EXC_HOPLIMIT, 0);
__IP6_INC_STATS(net, ip6_dst_idev(dst),
IPSTATS_MIB_INHDRERRORS);

return false;
}

/* don't propagate ttl change to cloned packets */
if (!skb_make_writable(skb, sizeof(struct ipv6hdr)))
return false;

ipv6_hdr(skb)->hop_limit--;
} else
#endif

[snip]
>
> The patch probably has other errors, for example,
> using rt->dst.dev (lo) when rt->dst.error != 0 in ip_error,
> may be 'dev' should be used instead...

The same applies here. Examples are ip6_forward and ip6_pkt_drop.

I think it would be better to count these against the input device as well. Thoughts?

Regards,
Stephen.


Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel

2018-04-12 Thread David Miller
From: Edward Cree 
Date: Thu, 12 Apr 2018 16:24:46 +0100

> This code is not handling expiration of old ARFS filters, it's inserting
>  new ones.

Then simply make the work process a queue, and add entries to the queue
here if the work is already scheduled.

Is there a reason why that wouldn't work?


Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-12 Thread Jesper Dangaard Brouer
On Thu, 12 Apr 2018 16:56:53 +0200 Christoph Hellwig  wrote:

> On Thu, Apr 12, 2018 at 04:51:23PM +0200, Christoph Hellwig wrote:
> > On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:  
> > > ---
> > > Implement support for keeping the DMA mapping through the XDP return
> > > call, to remove RX map/unmap calls.  Implement bulking for XDP
> > > ndo_xdp_xmit and XDP return frame API.  Bulking allows to perform DMA
> > > bulking via scatter-gather DMA calls, XDP TX needs it for DMA
> > > map+unmap. The driver RX DMA-sync (to CPU) per packet calls are harder
> > > to mitigate (via bulk technique). Ask DMA maintainer for a common
> > > case direct call for swiotlb DMA sync call ;-)  
> > 
> > Why do you even end up in swiotlb code?  Once you bounce buffer your
> > performance is toast anyway..  
> 
> I guess that is because x86 selects it as the default as soon as
> we have more than 4G memory. 

I was also confused about why I ended up using SWIOTLB (SoftWare IO-TLB);
that might explain it. And I'm not hitting the bounce-buffer case.

How do I control which DMA engine I use? (So, I can play a little)


> That should be solveable fairly easily with the per-device dma ops,
> though.

I didn't understand this part.

I wanted to ask your opinion on a hackish idea I have: how to detect
whether I can reuse the RX-DMA map address for a TX-DMA operation on
another device (still/only calling sync_single_for_device).

With XDP_REDIRECT we are redirecting between net_device's. Usually
we keep the RX-DMA mapping as we recycle the page. On the redirect to
TX-device (via ndo_xdp_xmit) we do a new DMA map+unmap for TX.  The
question is how to avoid this mapping(?).  In some cases, with some DMA
engines (or lack of) I guess the DMA address is actually the same as
the RX-DMA mapping dma_addr_t already known, right?  For those cases,
would it be possible to just (re)use that address for TX?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH 08/15] ASoC: pxa: remove the dmaengine compat need

2018-04-12 Thread Mark Brown
On Mon, Apr 02, 2018 at 04:26:49PM +0200, Robert Jarzmik wrote:
> As the pxa architecture switched towards the dmaengine slave map, the
> old compatibility mechanism to acquire the dma requestor line number and
> priority are not needed anymore.

Acked-by: Mark Brown 

If there's no dependency I'm happy to take this for 4.18.




Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel

2018-04-12 Thread Edward Cree
On 12/04/18 16:11, David Miller wrote:
> From: Edward Cree 
> Date: Thu, 12 Apr 2018 15:02:50 +0100
>
>> A misconfigured system (e.g. with all interrupts affinitised to all CPUs)
>>  may produce a storm of ARFS steering events.  With the existing sfc ARFS
>>  implementation, that could create a backlog of workitems that grinds the
>>  system to a halt.  To prevent this, limit the number of workitems that
>>  may be in flight for a given SFC device to 8 (EFX_RPS_MAX_IN_FLIGHT), and
>>  return EBUSY from our ndo_rx_flow_steer method if the limit is reached.
>> Given this limit, also store the workitems in an array of slots within the
>>  struct efx_nic, rather than dynamically allocating for each request.
>>
>> Signed-off-by: Edward Cree 
> I don't think this behavior is all that great.
>
> If you really have to queue up these operations because they take a long
> time, I think it is better to enter a synchronous mode and sleep once
> you hit this in-flight limit of 8.
I don't think we can sleep at this point, ndo_rx_flow_steer is called from
 the RX path (netif_receive_skb_internal() -> get_rps_cpu() ->
 set_rps_cpu()).

> Either that or make the expiration work smarter when it has lots of events
> to process.
I'm afraid I don't understand what you mean here.
This code is not handling expiration of old ARFS filters, it's inserting
 new ones.

-Ed
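
The slot scheme described in the quoted commit message can be sketched as
follows. This is a single-threaded illustration under stated assumptions:
struct nic, steer_submit() and steer_complete() are hypothetical names, and
a real driver would need atomic operations or locking around the in_use
flag, which the sketch leaves out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MAX_IN_FLIGHT 8	/* the limit named EFX_RPS_MAX_IN_FLIGHT above */

struct steer_req {
	bool in_use;
	unsigned int flow_id;
};

/* Slots are embedded in the device struct, so nothing is allocated. */
struct nic {
	struct steer_req slots[MAX_IN_FLIGHT];
};

/* Claim a free slot. Returns its index, or -EBUSY when all are in flight. */
static int steer_submit(struct nic *nic, unsigned int flow_id)
{
	for (int i = 0; i < MAX_IN_FLIGHT; i++) {
		if (!nic->slots[i].in_use) {
			nic->slots[i].in_use = true;
			nic->slots[i].flow_id = flow_id;
			return i;
		}
	}
	return -EBUSY;
}

/* Called when the deferred work for a slot has finished. */
static void steer_complete(struct nic *nic, int slot)
{
	assert(slot >= 0 && slot < MAX_IN_FLIGHT);
	assert(nic->slots[slot].in_use);
	nic->slots[slot].in_use = false;
}
```

Because submission never sleeps and never allocates, it is safe to call from
the receive path; the cost is that a ninth concurrent request is rejected
rather than queued.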


Re: [linux-sunxi] Re: [PATCH 3/5] net: stmmac: dwmac-sun8i: Allow getting syscon regmap from device

2018-04-12 Thread Chen-Yu Tsai
On Thu, Apr 12, 2018 at 11:11 PM, Icenowy Zheng  wrote:
>
>
> On April 12, 2018 at 10:56:28 PM GMT+08:00, Maxime Ripard wrote:
>>On Wed, Apr 11, 2018 at 10:16:39PM +0800, Icenowy Zheng wrote:
>>> From: Chen-Yu Tsai 
>>>
>>> On the Allwinner R40 SoC, the "GMAC clock" register is in the CCU
>>> address space; on the A64 SoC this register is in the SRAM controller
>>> address space, and with a different offset.
>>>
>>> To access the register from another device and hide the internal
>>> difference between the device, let it register a regmap named
>>> "emac-clock". We can then get the device from the phandle, and
>>> retrieve the regmap with dev_get_regmap(); in this situation the
>>> regmap_field will be set up to access the only register in the regmap.
>>>
>>> Signed-off-by: Chen-Yu Tsai 
>>> [Icenowy: change to use regmaps with single register, change commit
>>>  message]
>>> Signed-off-by: Icenowy Zheng 
>>> ---
>>>  drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 48 ++-
>>>  1 file changed, 46 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>>> index 1037f6c78bca..b61210c0d415 100644
>>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>>> @@ -85,6 +85,13 @@ const struct reg_field old_syscon_reg_field = {
>>>  .msb = 31,
>>>  };
>>>
>>> +/* Specially exported regmap which contains only EMAC register */
>>> +const struct reg_field single_reg_field = {
>>> +.reg = 0,
>>> +.lsb = 0,
>>> +.msb = 31,
>>> +};
>>> +
>>
>>I'm not sure this would be wise. If we ever need some other register
>>exported through the regmap, will have to change all the calling sites
>>everywhere in the kernel, which will be a pain and will break
>>bisectability.
>
> In this situation the register can be exported as another
>  regmap. Currently the code will access a regmap with name
> "emac-clock" for this register.
>
>>
>>Chen-Yu's (or was it yours?) initial solution with a custom writeable
>>hook only allowing a single register seemed like a better one.
>
> But I remember you mentioned that you want it to hide the
> difference inside the device.

Hi,

The idea is that a device can export multiple regmaps. This one,
the one named "gmac" (in my soon to come v2) or "emac-clock" here,
is but one of many possible regmaps, and it only exports the register
needed by the GMAC/EMAC. IMHO it is highly unlikely the same piece of
hardware would need a second register from the same device. A more
likely situation would be it needs another register from a different
device, like on the H6 where the "internal PHY" is not so internal
from a system bus point of view.

ChenYu
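
For illustration only (not part of the patch): a minimal user-space sketch of the "one device, several named regmaps" idea described above. The struct and function names are hypothetical; in the kernel the lookup would go through dev_get_regmap(dev, name), which this model only loosely mirrors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* User-space model only: one exported "regmap" is a name plus the single
 * register it exposes. All names here are hypothetical. */
struct model_regmap {
	const char *name;
	uint32_t *reg;		/* the only register visible through this map */
};

struct model_device {
	const struct model_regmap *maps;
	size_t nr_maps;
};

/* Look a regmap up by name, loosely mirroring what dev_get_regmap(dev, name)
 * does for a real struct device. Returns NULL if no such map is exported. */
static const struct model_regmap *
model_get_regmap(const struct model_device *dev, const char *name)
{
	assert(dev != NULL && name != NULL);
	for (size_t i = 0; i < dev->nr_maps; i++)
		if (strcmp(dev->maps[i].name, name) == 0)
			return &dev->maps[i];
	return NULL;
}
```

In this model a consumer that asks for "emac-clock" can only reach that one register, and a second consumer needing a different register would be given a differently named map rather than a wider one.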


RE: [PATCH 3/4] lan78xx: Read LED modes from Device Tree

2018-04-12 Thread Woojung.Huh
> > @@ -2097,6 +2098,25 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
> > (void)lan78xx_set_eee(dev->net, );
> > }
> >
> > +   if (!of_property_read_u32_array(dev->udev->dev.of_node,
> > +   "microchip,led-modes",
> > +   led_modes, ARRAY_SIZE(led_modes))) {
> > +   u32 reg;
> > +   int i;
> > +
> > +   reg = phy_read(phydev, 0x1d);
> > +   for (i = 0; i < ARRAY_SIZE(led_modes); i++) {
> > +   reg &= ~(0xf << (i * 4));
> > +   reg |= (led_modes[i] & 0xf) << (i * 4);
> > +   }
> > +   (void)phy_write(phydev, 0x1d, reg);
> 
> Poking PHY registers directly from the MAC driver is not always a good
> idea. This MAC driver does that in a few places :-(
Agree but, some are for workaround unfortunately.

> What do we know about the PHY? It is built into the device or is it
> external? If it is external, how do you know the LED register is at
> 0x1d?
This register is not defined in include/linux/microchipphy.h. :(
Also agree that these parts should be applied to the internal PHY only.



RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap

2018-04-12 Thread Karlsson, Magnus


> -Original Message-
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Thursday, April 12, 2018 4:05 PM
> To: Karlsson, Magnus 
> Cc: Björn Töpel ; Duyck, Alexander H
> ; alexander.du...@gmail.com;
> john.fastab...@gmail.com; a...@fb.com; bro...@redhat.com;
> willemdebruijn.ker...@gmail.com; dan...@iogearbox.net;
> netdev@vger.kernel.org; michael.lundkv...@ericsson.com; Brandeburg,
> Jesse ; Singhai, Anjali
> ; Zhang, Qi Z ;
> ravineet.si...@ericsson.com
> Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> mmap
> 
> On Thu, Apr 12, 2018 at 07:38:25AM +, Karlsson, Magnus wrote:
> > I think you are definitely right in that there are ways in which we
> > can improve performance here. That said, the current queue performs
> > slightly better than the previous one we had that was more or less a
> > copy of one of your first virtio 1.1 proposals from little over a year
> > ago. It had bidirectional queues and a valid flag in the descriptor
> > itself. The reason we abandoned this was not poor performance (it was
> > good), but a need to go to unidirectional queues. Maybe I should have
> > only changed that aspect and kept the valid flag.
> 
> Is there a summary about unidirectional queues anywhere?  I'm curious to
> know whether there are any lessons here to be learned for virtio or ptr_ring.

I did a quick hack in which I used your ptr_ring for the fill queue instead
of our head/tail based one. In the corner cases (usually empty or usually
full), there is basically no difference. But for the case when the queue is
always half full, the ptr_ring implementation boosts the performance from
5.6 to 5.7 Mpps (as there is no cache line bouncing in this case) on my
system (slower than Björn's that was used for the numbers in the RFC).

So I think this should be implemented properly so we can get some real
numbers. Especially since 0.1 Mpps with copies will likely become much more
with zero-copy as we are really chasing cycles there. We will get back a
better evaluation in a few days.

Thanks: Magnus

> --
> MST
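
For illustration only: a minimal single-threaded sketch of a ptr_ring-style queue, as one reading of why the half-full case avoids cache line bouncing. A NULL slot means "empty", so the producer and consumer each keep a private index and never read the other's, unlike a head/tail design where both sides read both indices. The names and the ring size are hypothetical, and a real implementation would need memory barriers.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8	/* hypothetical size, chosen for illustration */

/* Single-threaded model of a ptr_ring-style queue: a NULL slot means
 * "empty", so producer and consumer each keep a private index and never
 * read the other's. A real implementation needs memory barriers. */
struct model_ring {
	void *slot[RING_SIZE];
	size_t prod;	/* touched only by the producer */
	size_t cons;	/* touched only by the consumer */
};

/* Returns 0 on success, -1 if the ring is full. */
static int model_ring_produce(struct model_ring *r, void *ptr)
{
	assert(r != NULL && ptr != NULL);
	if (r->slot[r->prod] != NULL)
		return -1;
	r->slot[r->prod] = ptr;
	r->prod = (r->prod + 1) % RING_SIZE;
	return 0;
}

/* Returns the oldest entry, or NULL if the ring is empty. */
static void *model_ring_consume(struct model_ring *r)
{
	void *ptr;

	assert(r != NULL);
	ptr = r->slot[r->cons];
	if (ptr == NULL)
		return NULL;
	r->slot[r->cons] = NULL;
	r->cons = (r->cons + 1) % RING_SIZE;
	return ptr;
}
```

When the ring is about half full, the slot being produced and the slot being consumed sit far apart, so in this model the two sides rarely touch the same cache line.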


Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-04-12 Thread Miroslav Lichvar
On Thu, Apr 12, 2018 at 08:03:49AM -0700, Richard Cochran wrote:
> On Wed, Apr 11, 2018 at 04:38:44PM -0700, Jesus Sanchez-Palencia wrote:
> > Just breaking this down a bit, yes, TAI is the network time base, and the
> > NICs PTP clock use that because PTP is (commonly) based on TAI. After the
> > PHCs have been synchronized over the network (e.g. with ptp4l), my
> > understanding is that if applications want to use the clockid_t CLOCK_TAI
> > as a network clock reference it's required that something (i.e. phc2sys)
> > is synchronizing the PHCs and the system clock, and also that something
> > calls adjtime to apply the TAI vs UTC offset to CLOCK_TAI.
> 
> Yes.  I haven't seen any distro that sets the TAI-UTC offset after
> boot, nor are there any user space tools for this.  The kernel is
> ready, though.

FWIW, the default NTP configuration in Fedora sets the kernel TAI-UTC
offset.

> > I was thinking about the full offload use-cases, thus when no scheduling
> > is happening inside the qdiscs. Applications could just read the time
> > from the PHC clocks directly without having to rely on any of the above.
> > On this case, userspace would use DYNAMIC_CLOCK just to flag that this is
> > the case, but I must admit it's not clear to me how common of a use-case
> > that is, or even if it makes sense.
> 
> 1588 allows only two timescales, TAI and ARB-itrary.  Although it
> doesn't make too much sense to use ARB, still people will do strange
> things.  Probably some people use UTC.  I am not advocating supporting
> alternate timescales, just pointing out the possibility.

There is also the possibility that the NIC clock is not synchronized
to anything. For synchronization of the system clock it's easier to
leave it free running and only track its phase/frequency offset to
allow conversion between the PHC and system time.

-- 
Miroslav Lichvar


Re: [PATCH 2/4] lan78xx: Read initial EEE setting from Device Tree

2018-04-12 Thread Phil Elwell
Andrew,

On 12/04/2018 15:16, Andrew Lunn wrote:
> On Thu, Apr 12, 2018 at 02:55:34PM +0100, Phil Elwell wrote:
>> Add two new Device Tree properties:
>> * microchip,eee-enabled  - a boolean to enable EEE
>> * microchip,tx-lpi-timer - time in microseconds to wait after TX goes
>>idle before entering the low power state
>>(default 600)
> 
> Hi Phil
> 
> This looks wrong.
> 
> What should happen is that the MAC driver calls phy_init_eee() to find
> out if the PHY supports EEE. There should be no need to look in device
> tree.

If the driver should be calling phy_init_eee to initialise EEE operation then
I'm fine with that (although I notice that the TI cpsw calls
phy_ethtool_set_eee but I don't see it calling phy_init_eee). However, it
sounds like I need to keep my DT toggle of the EEE enablement and parameters
downstream.

Phil


Re: [PATCH 3/5] net: stmmac: dwmac-sun8i: Allow getting syscon regmap from device

2018-04-12 Thread Icenowy Zheng


On 12 April 2018 at 22:56:28 GMT+08:00, Maxime Ripard wrote:
>On Wed, Apr 11, 2018 at 10:16:39PM +0800, Icenowy Zheng wrote:
>> From: Chen-Yu Tsai 
>> 
>> On the Allwinner R40 SoC, the "GMAC clock" register is in the CCU
>> address space; on the A64 SoC this register is in the SRAM controller
>> address space, and with a different offset.
>> 
>> To access the register from another device and hide the internal
>> difference between the device, let it register a regmap named
>> "emac-clock". We can then get the device from the phandle, and
>> retrieve the regmap with dev_get_regmap(); in this situation the
>> regmap_field will be set up to access the only register in the
>regmap.
>> 
>> Signed-off-by: Chen-Yu Tsai 
>> [Icenowy: change to use regmaps with single register, change commit
>>  message]
>> Signed-off-by: Icenowy Zheng 
>> ---
>>  drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 48
>++-
>>  1 file changed, 46 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>> index 1037f6c78bca..b61210c0d415 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
>> @@ -85,6 +85,13 @@ const struct reg_field old_syscon_reg_field = {
>>  .msb = 31,
>>  };
>>  
>> +/* Specially exported regmap which contains only EMAC register */
>> +const struct reg_field single_reg_field = {
>> +.reg = 0,
>> +.lsb = 0,
>> +.msb = 31,
>> +};
>> +
>
>I'm not sure this would be wise. If we ever need some other register
>exported through the regmap, will have to change all the calling sites
>everywhere in the kernel, which will be a pain and will break
>bisectability.

In this situation the register can be exported as another
 regmap. Currently the code will access a regmap with name
"emac-clock" for this register.

>
>Chen-Yu's (or was it yours?) initial solution with a custom writeable
>hook only allowing a single register seemed like a better one.

But I remember you mentioned that you want it to hide the
difference inside the device.

>
>Maxime


Re: [PATCH net 2/2] sfc: limit ARFS workitems in flight per channel

2018-04-12 Thread David Miller
From: Edward Cree 
Date: Thu, 12 Apr 2018 15:02:50 +0100

> A misconfigured system (e.g. with all interrupts affinitised to all CPUs)
>  may produce a storm of ARFS steering events.  With the existing sfc ARFS
>  implementation, that could create a backlog of workitems that grinds the
>  system to a halt.  To prevent this, limit the number of workitems that
>  may be in flight for a given SFC device to 8 (EFX_RPS_MAX_IN_FLIGHT), and
>  return EBUSY from our ndo_rx_flow_steer method if the limit is reached.
> Given this limit, also store the workitems in an array of slots within the
>  struct efx_nic, rather than dynamically allocating for each request.
> 
> Signed-off-by: Edward Cree 

I don't think this behavior is all that great.

If you really have to queue up these operations because they take a long
time, I think it is better to enter a synchronous mode and sleep once
you hit this in-flight limit of 8.

Either that or make the expiration work smarter when it has lots of events
to process.
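
For illustration only: a user-space sketch of the bounded-slot scheme the patch description outlines, i.e. a fixed array of work item slots and -EBUSY once all are in flight. Only EFX_RPS_MAX_IN_FLIGHT and its value of 8 come from the description; the struct and function names are hypothetical and this is not the sfc driver's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define EFX_RPS_MAX_IN_FLIGHT 8	/* limit named in the patch description */

/* Hypothetical model: a fixed array of slots replaces per-request
 * allocation. */
struct model_arfs {
	bool in_flight[EFX_RPS_MAX_IN_FLIGHT];
};

/* Claim a free slot; returns its index, or -EBUSY when all are in flight. */
static int model_arfs_claim(struct model_arfs *a)
{
	assert(a != NULL);
	for (int i = 0; i < EFX_RPS_MAX_IN_FLIGHT; i++) {
		if (!a->in_flight[i]) {
			a->in_flight[i] = true;
			return i;
		}
	}
	return -EBUSY;
}

/* Called when the work item completes, freeing the slot for reuse. */
static void model_arfs_release(struct model_arfs *a, int slot)
{
	assert(a != NULL && slot >= 0 && slot < EFX_RPS_MAX_IN_FLIGHT);
	a->in_flight[slot] = false;
}
```

The alternative suggested in the reply would replace the -EBUSY return with a sleep until a slot is released.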



Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-04-12 Thread Richard Cochran
On Wed, Apr 11, 2018 at 04:38:44PM -0700, Jesus Sanchez-Palencia wrote:
> Just breaking this down a bit, yes, TAI is the network time base, and the NICs
> PTP clock use that because PTP is (commonly) based on TAI. After the PHCs have
> been synchronized over the network (e.g. with ptp4l), my understanding is that
> if applications want to use the clockid_t CLOCK_TAI as a network clock
> reference it's required that something (i.e. phc2sys) is synchronizing the
> PHCs and the system clock, and also that something calls adjtime to apply
> the TAI vs UTC offset to CLOCK_TAI.

Yes.  I haven't seen any distro that sets the TAI-UTC offset after
boot, nor are there any user space tools for this.  The kernel is
ready, though.

> I was thinking about the full offload use-cases, thus when no scheduling is
> happening inside the qdiscs. Applications could just read the time from the
> PHC clocks directly without having to rely on any of the above. On this
> case, userspace would use DYNAMIC_CLOCK just to flag that this is the case,
> but I must admit it's not clear to me how common of a use-case that is, or
> even if it makes sense.

1588 allows only two timescales, TAI and ARB-itrary.  Although it
doesn't make too much sense to use ARB, still people will do strange
things.  Probably some people use UTC.  I am not advocating supporting
alternate timescales, just pointing out the possibility.

Thanks,
Richard
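
For illustration only: a small Linux user-space sketch for checking whether the kernel's TAI-UTC offset discussed above has been set. It queries the offset with adjtimex() in read-only mode (modes == 0) and compares it with the difference between CLOCK_TAI and CLOCK_REALTIME. The function names are hypothetical. Setting the offset is a separate, privileged adjtimex() call with ADJ_TAI and is not shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/timex.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11	/* Linux value, for older libc headers */
#endif

/* Read the kernel's current TAI-UTC offset in seconds without changing it.
 * Returns 0 on success, -1 if adjtimex() fails. */
static int read_tai_offset(int *offset)
{
	struct timex tx = { .modes = 0 };	/* modes == 0: query only */

	assert(offset != NULL);
	if (adjtimex(&tx) == -1)
		return -1;
	*offset = tx.tai;
	return 0;
}

/* CLOCK_TAI minus CLOCK_REALTIME in whole seconds, rounded to nearest.
 * Returns 0 on success, -1 if either clock cannot be read. */
static int clock_tai_minus_utc(long *seconds)
{
	struct timespec utc, tai;
	long long d_ns;

	assert(seconds != NULL);
	if (clock_gettime(CLOCK_REALTIME, &utc) != 0 ||
	    clock_gettime(CLOCK_TAI, &tai) != 0)
		return -1;
	d_ns = (long long)(tai.tv_sec - utc.tv_sec) * 1000000000LL +
	       (tai.tv_nsec - utc.tv_nsec);
	d_ns += (d_ns >= 0) ? 500000000LL : -500000000LL;
	*seconds = (long)(d_ns / 1000000000LL);
	return 0;
}
```

An offset of 0 would suggest that nothing has set it since boot, in which case CLOCK_TAI simply tracks CLOCK_REALTIME.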



Re: [PATCH 5/5] arm64: allwinner: a64: add SRAM controller device tree node

2018-04-12 Thread Maxime Ripard
Hi,

On Wed, Apr 11, 2018 at 10:16:41PM +0800, Icenowy Zheng wrote:
> Allwinner A64 has a SRAM controller, and in the device tree currently
> we have a syscon node to enable EMAC driver to access the EMAC clock
> register. As SRAM controller driver can now export regmap for this
> register, replace the syscon node to the SRAM controller device node,
> and let EMAC driver to acquire its EMAC clock regmap.
> 
> Signed-off-by: Icenowy Zheng 
> ---
>  arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi | 23 +++
>  1 file changed, 19 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi 
> b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
> index 1b2ef28c42bd..1c37659d9d41 100644
> --- a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
> +++ b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
> @@ -168,10 +168,25 @@
>   #size-cells = <1>;
>   ranges;
>  
> - syscon: syscon@1c0 {
> - compatible = "allwinner,sun50i-a64-system-controller",
> - "syscon";
> + sram_controller: sram-controller@1c0 {
> + compatible = "allwinner,sun50i-a64-sram-controller";
>   reg = <0x01c0 0x1000>;
> + #address-cells = <1>;
> + #size-cells = <1>;
> + ranges;
> +
> + sram_c: sram@18000 {
> + compatible = "mmio-sram";
> + reg = <0x00018000 0x28000>;
> + #address-cells = <1>;
> + #size-cells = <1>;
> + ranges = <0 0x00018000 0x28000>;
> +
> + de2_sram: sram-section@0 {
> + compatible = "allwinner,sun50i-a64-sram-c";
> + reg = <0x 0x28000>;
> + };
> + };

That doesn't look related at all to what's being discussed here, so
you'd rather add it as part of your DE2-enablement serie (or amend
your commit log to say why this is important to do it in this patch).

Maxime

-- 
Maxime Ripard, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com


Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-12 Thread Christoph Hellwig
On Thu, Apr 12, 2018 at 04:51:23PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:
> > ---
> > Implement support for keeping the DMA mapping through the XDP return
> > call, to remove RX map/unmap calls.  Implement bulking for XDP
> > ndo_xdp_xmit and XDP return frame API.  Bulking allows to perform DMA
> > bulking via scatter-gather DMA calls, XDP TX need it for DMA
> > map+unmap. The driver RX DMA-sync (to CPU) per packet calls are harder
> > to mitigate (via bulk technique). Ask DMA maintainer for a common
> > case direct call for swiotlb DMA sync call ;-)
> 
> Why do you even end up in swiotlb code?  Once you bounce buffer your
> performance is toast anyway..

I guess that is because x86 selects it as the default as soon as
we have more than 4G memory. That should be solveable fairly easily
with the per-device dma ops, though.


Re: [PATCH 3/5] net: stmmac: dwmac-sun8i: Allow getting syscon regmap from device

2018-04-12 Thread Maxime Ripard
On Wed, Apr 11, 2018 at 10:16:39PM +0800, Icenowy Zheng wrote:
> From: Chen-Yu Tsai 
> 
> On the Allwinner R40 SoC, the "GMAC clock" register is in the CCU
> address space; on the A64 SoC this register is in the SRAM controller
> address space, and with a different offset.
> 
> To access the register from another device and hide the internal
> difference between the device, let it register a regmap named
> "emac-clock". We can then get the device from the phandle, and
> retrieve the regmap with dev_get_regmap(); in this situation the
> regmap_field will be set up to access the only register in the regmap.
> 
> Signed-off-by: Chen-Yu Tsai 
> [Icenowy: change to use regmaps with single register, change commit
>  message]
> Signed-off-by: Icenowy Zheng 
> ---
>  drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 48 
> ++-
>  1 file changed, 46 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c 
> b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
> index 1037f6c78bca..b61210c0d415 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c
> @@ -85,6 +85,13 @@ const struct reg_field old_syscon_reg_field = {
>   .msb = 31,
>  };
>  
> +/* Specially exported regmap which contains only EMAC register */
> +const struct reg_field single_reg_field = {
> + .reg = 0,
> + .lsb = 0,
> + .msb = 31,
> +};
> +

I'm not sure this would be wise. If we ever need some other register
exported through the regmap, will have to change all the calling sites
everywhere in the kernel, which will be a pain and will break
bisectability.

Chen-Yu's (or was it yours?) initial solution with a custom writeable
hook only allowing a single register seemed like a better one.

Maxime

-- 
Maxime Ripard, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com


Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-12 Thread Christoph Hellwig
On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:
> ---
> Implement support for keeping the DMA mapping through the XDP return
> call, to remove RX map/unmap calls.  Implement bulking for XDP
> ndo_xdp_xmit and XDP return frame API.  Bulking allows to perform DMA
> bulking via scatter-gather DMA calls, XDP TX need it for DMA
> map+unmap. The driver RX DMA-sync (to CPU) per packet calls are harder
> to mitigate (via bulk technique). Ask DMA maintainer for a common
> case direct call for swiotlb DMA sync call ;-)

Why do you even end up in swiotlb code?  Once you bounce buffer your
performance is toast anyway..


Re: [PATCH 3/4] lan78xx: Read LED modes from Device Tree

2018-04-12 Thread Andrew Lunn
> @@ -2097,6 +2098,25 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
>   (void)lan78xx_set_eee(dev->net, );
>   }
>  
> + if (!of_property_read_u32_array(dev->udev->dev.of_node,
> + "microchip,led-modes",
> + led_modes, ARRAY_SIZE(led_modes))) {
> + u32 reg;
> + int i;
> +
> + reg = phy_read(phydev, 0x1d);
> + for (i = 0; i < ARRAY_SIZE(led_modes); i++) {
> + reg &= ~(0xf << (i * 4));
> + reg |= (led_modes[i] & 0xf) << (i * 4);
> + }
> + (void)phy_write(phydev, 0x1d, reg);

Poking PHY registers directly from the MAC driver is not always a good
idea. This MAC driver does that in a few places :-(

What do we know about the PHY? It is built into the device or is it
external? If it is external, how do you know the LED register is at
0x1d?

The safest place to do this is in the PHY driver, and place these OF
properties into the PHY node.

   Andrew


Re: [PATCH 4/4] dt-bindings: Document the DT bindings for lan78xx

2018-04-12 Thread Phil Elwell
Hi Andrew,

On 12/04/2018 15:30, Andrew Lunn wrote:
> On Thu, Apr 12, 2018 at 02:55:36PM +0100, Phil Elwell wrote:
>> The Microchip LAN78XX family of devices are Ethernet controllers with
>> a USB interface. Despite being discoverable devices it can be useful to
>> be able to configure them from Device Tree, particularly in low-cost
>> applications without an EEPROM or programmed OTP.
> 
> It would be good to document what happens when there is an EEPROM. Is
> OF used in preference to the EEPROM?

Yes it is. I'll mention it in V2.

Phil


Re: [PATCH 3/4] lan78xx: Read LED modes from Device Tree

2018-04-12 Thread Phil Elwell
Hi Andrew,

On 12/04/2018 15:26, Andrew Lunn wrote:
> On Thu, Apr 12, 2018 at 02:55:35PM +0100, Phil Elwell wrote:
>> Add support for DT property "microchip,led-modes", a vector of two
>> cells (u32s) in the range 0-15, each of which sets the mode for one
>> of the two LEDs. Some possible values are:
>>
>> 0=link/activity          1=link1000/activity
>> 2=link100/activity       3=link10/activity
>> 4=link100/1000/activity  5=link10/1000/activity
>> 6=link10/100/activity    14=off    15=on
>>
>> Also use the presence of the DT property to indicate that the
>> LEDs should be enabled - necessary in the event that no valid OTP
>> or EEPROM is available.
> 
> I'm not a fan of this, but at the moment, we don't have anything
> better.
> 
> Please follow what mscc does, add a header file for the LED settings.

Good idea.

> 
>>
>> Signed-off-by: Phil Elwell 
>> ---
>>  drivers/net/usb/lan78xx.c | 20 
>>  1 file changed, 20 insertions(+)
>>
>> diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
>> index d98397b..ffb483d 100644
>> --- a/drivers/net/usb/lan78xx.c
>> +++ b/drivers/net/usb/lan78xx.c
>> @@ -2008,6 +2008,7 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
>>  {
>>  int ret;
>>  u32 mii_adv;
>> +u32 led_modes[2];
>>  struct phy_device *phydev;
>>  
>>  phydev = phy_find_first(dev->mdiobus);
>> @@ -2097,6 +2098,25 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
>>  (void)lan78xx_set_eee(dev->net, );
>>  }
>>  
>> +if (!of_property_read_u32_array(dev->udev->dev.of_node,
>> +"microchip,led-modes",
>> +led_modes, ARRAY_SIZE(led_modes))) {
>> +u32 reg;
>> +int i;
>> +
>> +reg = phy_read(phydev, 0x1d);
>> +for (i = 0; i < ARRAY_SIZE(led_modes); i++) {
>> +reg &= ~(0xf << (i * 4));
>> +reg |= (led_modes[i] & 0xf) << (i * 4);
>> +}
> 
> Please add range checks for led_modes[i] and return -EINVAL if the
> check fails.

Will do.

Thanks,

Phil


Re: [PATCH 4/4] dt-bindings: Document the DT bindings for lan78xx

2018-04-12 Thread Andrew Lunn
On Thu, Apr 12, 2018 at 02:55:36PM +0100, Phil Elwell wrote:
> The Microchip LAN78XX family of devices are Ethernet controllers with
> a USB interface. Despite being discoverable devices it can be useful to
> be able to configure them from Device Tree, particularly in low-cost
> applications without an EEPROM or programmed OTP.

It would be good to document what happens when there is an EEPROM. Is
OF used in preference to the EEPROM?
   
Andrew


Re: [PATCH 3/4] lan78xx: Read LED modes from Device Tree

2018-04-12 Thread Andrew Lunn
On Thu, Apr 12, 2018 at 02:55:35PM +0100, Phil Elwell wrote:
> Add support for DT property "microchip,led-modes", a vector of two
> cells (u32s) in the range 0-15, each of which sets the mode for one
> of the two LEDs. Some possible values are:
> 
> 0=link/activity          1=link1000/activity
> 2=link100/activity       3=link10/activity
> 4=link100/1000/activity  5=link10/1000/activity
> 6=link10/100/activity    14=off    15=on
> 
> Also use the presence of the DT property to indicate that the
> LEDs should be enabled - necessary in the event that no valid OTP
> or EEPROM is available.

I'm not a fan of this, but at the moment, we don't have anything
better.

Please follow what mscc does, add a header file for the LED settings.

   Andrew

> 
> Signed-off-by: Phil Elwell 
> ---
>  drivers/net/usb/lan78xx.c | 20 
>  1 file changed, 20 insertions(+)
> 
> diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
> index d98397b..ffb483d 100644
> --- a/drivers/net/usb/lan78xx.c
> +++ b/drivers/net/usb/lan78xx.c
> @@ -2008,6 +2008,7 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
>  {
>   int ret;
>   u32 mii_adv;
> + u32 led_modes[2];
>   struct phy_device *phydev;
>  
>   phydev = phy_find_first(dev->mdiobus);
> @@ -2097,6 +2098,25 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
>   (void)lan78xx_set_eee(dev->net, );
>   }
>  
> + if (!of_property_read_u32_array(dev->udev->dev.of_node,
> + "microchip,led-modes",
> + led_modes, ARRAY_SIZE(led_modes))) {
> + u32 reg;
> + int i;
> +
> + reg = phy_read(phydev, 0x1d);
> + for (i = 0; i < ARRAY_SIZE(led_modes); i++) {
> + reg &= ~(0xf << (i * 4));
> + reg |= (led_modes[i] & 0xf) << (i * 4);
> + }

Please add range checks for led_modes[i] and return -EINVAL if the
check fails.

  Andrew
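
For illustration only: a user-space sketch of the range check requested above, rejecting an out-of-range LED mode with -EINVAL instead of silently masking it with 0xf. The helper name and the MODEL_ constant are hypothetical; the nibble layout follows the loop in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_LED_MODE_MAX 15u	/* modes are 4-bit values, 0-15, per the patch */

/* Pack n LED modes into successive nibbles of reg, rejecting any mode
 * outside 0-15 instead of silently masking it. Hypothetical helper; *out
 * is written only on success. */
static int model_pack_led_modes(uint32_t reg, const uint32_t *modes, size_t n,
				uint32_t *out)
{
	assert(modes != NULL && out != NULL && n <= 8);
	for (size_t i = 0; i < n; i++) {
		if (modes[i] > MODEL_LED_MODE_MAX)
			return -EINVAL;
		reg &= ~(UINT32_C(0xf) << (i * 4));
		reg |= modes[i] << (i * 4);
	}
	*out = reg;
	return 0;
}
```

In a driver the check would sit before the PHY register is written, so an invalid Device Tree value leaves the hardware untouched.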


[PATCH] net: ethernet: ti: cpsw: fix tx vlan priority mapping

2018-04-12 Thread Ivan Khoronzhuk
The CPDMA_TX_PRIORITY_MAP is in fact the VLAN PCP field priority mapping
register and basically replaces the VLAN PCP field for tagged packets.
So, set it to be a 1:1 mapping.

Signed-off-by: Ivan Khoronzhuk 
---
Based on net/master

 drivers/net/ethernet/ti/cpsw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 3037127..74f8284 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -129,7 +129,7 @@ do {								\
 
 #define RX_PRIORITY_MAPPING	0x76543210
 #define TX_PRIORITY_MAPPING	0x33221100
-#define CPDMA_TX_PRIORITY_MAP	0x01234567
+#define CPDMA_TX_PRIORITY_MAP	0x76543210
 
 #define CPSW_VLAN_AWARE		BIT(1)
 #define CPSW_RX_VLAN_ENCAP	BIT(2)
-- 
2.7.4
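
For illustration only: a user-space sketch of why 0x76543210 is a 1:1 mapping while 0x01234567 reverses the priorities. It assumes that nibble i of the register holds the value that PCP i is translated to; that layout is an assumption here and should be checked against the CPSW documentation. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: nibble i of the map holds the priority that PCP value i is
 * translated to. Check the CPSW documentation before relying on this. */
static unsigned int model_map_pcp(uint32_t map, unsigned int pcp)
{
	assert(pcp <= 7);
	return (map >> (pcp * 4)) & 0xf;
}
```

Under that assumption the old value sends PCP 0 to 7 and PCP 7 to 0, whereas the new value leaves every PCP unchanged.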



Re: [PATCH 1/4] lan78xx: Read MAC address from DT if present

2018-04-12 Thread Phil Elwell
Hi Andrew,

On 12/04/2018 15:06, Andrew Lunn wrote:
> Hi Phil
> 
>> -ret = lan78xx_write_reg(dev, RX_ADDRL, addr_lo);
>> -ret = lan78xx_write_reg(dev, RX_ADDRH, addr_hi);
>> +mac_addr = of_get_mac_address(dev->udev->dev.of_node);
> 
> It might be better to use the higher level eth_platform_get_mac_address().

OK - I'll take your word for it.

Phil


Re: [PATCH 4/4] dt-bindings: Document the DT bindings for lan78xx

2018-04-12 Thread Andrew Lunn
On Thu, Apr 12, 2018 at 03:10:57PM +0100, Phil Elwell wrote:
> Hi Andrew,
> 
> On 12/04/2018 15:04, Andrew Lunn wrote:
> > On Thu, Apr 12, 2018 at 02:55:36PM +0100, Phil Elwell wrote:
> >> The Microchip LAN78XX family of devices are Ethernet controllers with
> >> a USB interface. Despite being discoverable devices it can be useful to
> >> be able to configure them from Device Tree, particularly in low-cost
> >> applications without an EEPROM or programmed OTP.
> >>
> >> Document the supported properties in a bindings file, adding it to
> >> MAINTAINERS at the same time.
> > 
> > Hi Phil
> > 
> > How you link an OF node to a USB device is not obvious. Could you
> > please include either a pointer to some binding documentation, or make
> > your example show it.
> 
> Thanks for the feedback. Would you consider this (lifted from the Pi 3B+
> Device Tree) a sufficient example?

Yes, this is good.

Thanks
 Andrew


Re: [PATCH 2/4] lan78xx: Read initial EEE setting from Device Tree

2018-04-12 Thread Andrew Lunn
On Thu, Apr 12, 2018 at 02:55:34PM +0100, Phil Elwell wrote:
> Add two new Device Tree properties:
> * microchip,eee-enabled  - a boolean to enable EEE
> * microchip,tx-lpi-timer - time in microseconds to wait after TX goes
>idle before entering the low power state
>(default 600)

Hi Phil

This looks wrong.

What should happen is that the MAC driver calls phy_init_eee() to find
out if the PHY supports EEE. There should be no need to look in device
tree.

Andrew


Re: [RFC PATCH v2 00/14] Introducing AF_XDP support

2018-04-12 Thread Björn Töpel
2018-04-11 20:43 GMT+02:00 Alexei Starovoitov :
> On 4/11/18 5:17 AM, Björn Töpel wrote:
>>
>>
>> In the current RFC you are required to create both an Rx and Tx
>> queue to bind the socket, which is just weird for your "Rx on one
>> device, Tx to another" scenario. I'll fix that in the next RFC.
>
> I would defer on adding new features until the key functionality
> lands.  imo it's in good shape and I would submit it without RFC tag
> as soon as net-next reopens.

Yes, makes sense. We're doing some ptr_ring-like vs head/tail
measurements, and depending on the result we'll send out a proper
patch when net-next is open again.

What tree should we target -- bpf-next or net-next?


Thanks!
Björn


Re: [PATCH 4/4] dt-bindings: Document the DT bindings for lan78xx

2018-04-12 Thread Phil Elwell
Hi Andrew,

On 12/04/2018 15:04, Andrew Lunn wrote:
> On Thu, Apr 12, 2018 at 02:55:36PM +0100, Phil Elwell wrote:
>> The Microchip LAN78XX family of devices are Ethernet controllers with
>> a USB interface. Despite being discoverable devices it can be useful to
>> be able to configure them from Device Tree, particularly in low-cost
>> applications without an EEPROM or programmed OTP.
>>
>> Document the supported properties in a bindings file, adding it to
>> MAINTAINERS at the same time.
> 
> Hi Phil
> 
> How you link an OF node to a USB device is not obvious. Could you
> please include either a pointer to some binding documentation, or make
> your example show it.

Thanks for the feedback. Would you consider this (lifted from the Pi 3B+
Device Tree) a sufficient example?

 {
	usb1@1 {
		compatible = "usb424,2514";
		reg = <1>;
		#address-cells = <1>;
		#size-cells = <0>;

		usb1_1@1 {
			compatible = "usb424,2514";
			reg = <1>;
			#address-cells = <1>;
			#size-cells = <0>;

			ethernet: usbether@1 {
				compatible = "usb424,7800";
				reg = <1>;
				microchip,eee-enabled;
				microchip,tx-lpi-timer = <600>; /* non-aggressive */
				/*
				 * led0 = 1:link1000/activity
				 * led1 = 6:link10/100/activity
				 */
				microchip,led-modes = <1 6>;
			};
		};
	};
};

Phil


Re: [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)

2018-04-12 Thread kbuild test robot
Hi Stephen,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Stephen-Suryaputra/Per-interface-IPv4-stats-CONFIG_IP_IFSTATS_TABLE/20180412-181719
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

   net/ipv4/proc.c:414:28: sparse: Variable length array is used.
>> net/ipv4/proc.c:499:43: sparse: incorrect type in argument 1 (different 
>> address spaces) @@expected void [noderef] *mib @@got vvoid 
>> [noderef] *mib @@
   net/ipv4/proc.c:499:43:expected void [noderef] *mib
   net/ipv4/proc.c:499:43:got void [noderef] **pcpumib
>> net/ipv4/proc.c:532:34: sparse: cast removes address space of expression
   net/ipv4/proc.c:534:34: sparse: cast removes address space of expression

vim +499 net/ipv4/proc.c

   411  
   412  static int snmp_seq_show_tcp_udp(struct seq_file *seq, void *v)
   413  {
 > 414  unsigned long buff[TCPUDP_MIB_MAX];
   415  struct net *net = seq->private;
   416  int i;
   417  
   418  memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
   419  
   420  seq_puts(seq, "\nTcp:");
   421  for (i = 0; snmp4_tcp_list[i].name; i++)
   422  seq_printf(seq, " %s", snmp4_tcp_list[i].name);
   423  
   424  seq_puts(seq, "\nTcp:");
   425  snmp_get_cpu_field_batch(buff, snmp4_tcp_list,
   426   net->mib.tcp_statistics);
   427  for (i = 0; snmp4_tcp_list[i].name; i++) {
   428  /* MaxConn field is signed, RFC 2012 */
   429  if (snmp4_tcp_list[i].entry == TCP_MIB_MAXCONN)
   430  seq_printf(seq, " %ld", buff[i]);
   431  else
   432  seq_printf(seq, " %lu", buff[i]);
   433  }
   434  
   435  memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
   436  
   437  snmp_get_cpu_field_batch(buff, snmp4_udp_list,
   438   net->mib.udp_statistics);
   439  seq_puts(seq, "\nUdp:");
   440  for (i = 0; snmp4_udp_list[i].name; i++)
   441  seq_printf(seq, " %s", snmp4_udp_list[i].name);
   442  seq_puts(seq, "\nUdp:");
   443  for (i = 0; snmp4_udp_list[i].name; i++)
   444  seq_printf(seq, " %lu", buff[i]);
   445  
   446  memset(buff, 0, TCPUDP_MIB_MAX * sizeof(unsigned long));
   447  
   448  /* the UDP and UDP-Lite MIBs are the same */
   449  seq_puts(seq, "\nUdpLite:");
   450  snmp_get_cpu_field_batch(buff, snmp4_udp_list,
   451   net->mib.udplite_statistics);
   452  for (i = 0; snmp4_udp_list[i].name; i++)
   453  seq_printf(seq, " %s", snmp4_udp_list[i].name);
   454  seq_puts(seq, "\nUdpLite:");
   455  for (i = 0; snmp4_udp_list[i].name; i++)
   456  seq_printf(seq, " %lu", buff[i]);
   457  
   458  seq_putc(seq, '\n');
   459  return 0;
   460  }
   461  
   462  static int snmp_seq_show(struct seq_file *seq, void *v)
   463  {
   464  snmp_seq_show_ipstats(seq, v);
   465  
   466  icmp_put(seq);  /* RFC 2011 compatibility */
   467  icmpmsg_put(seq);
   468  
   469  snmp_seq_show_tcp_udp(seq, v);
   470  
   471  return 0;
   472  }
   473  
   474  static int snmp_seq_open(struct inode *inode, struct file *file)
   475  {
   476  return single_open_net(inode, file, snmp_seq_show);
   477  }
   478  
   479  static const struct file_operations snmp_seq_fops = {
   480  .open= snmp_seq_open,
   481  .read= seq_read,
   482  .llseek  = seq_lseek,
   483  .release = single_release_net,
   484  };
   485  
   486  
   487  #ifdef CONFIG_IP_IFSTATS_TABLE
   488  static void snmp_seq_show_item(struct seq_file *seq, void __percpu 
**pcpumib,
   489 atomic_long_t *smib,
   490 const struct snmp_mib *itemlist,
   491 char *prefix)
   492  {
   493  char name[32];
   494  int i;
   495  unsigned long val;
   496  
   497  for (i = 0; itemlist[i].name; i++) {
   498  val = pcpumib ?
 > 499  snmp_fold_field64(pcpumib, itemlist[i].entry,
   500offsetof(struct ipstats_mib, 
syncp)) :
   501  atomic_long_read(smib + itemlist[i].entry);
   502  snprintf(name, sizeof(name), "%s%s&quo

Re: [PATCH 1/4] lan78xx: Read MAC address from DT if present

2018-04-12 Thread Andrew Lunn
Hi Phil

> - ret = lan78xx_write_reg(dev, RX_ADDRL, addr_lo);
> - ret = lan78xx_write_reg(dev, RX_ADDRH, addr_hi);
> + mac_addr = of_get_mac_address(dev->udev->dev.of_node);

It might be better to use the higher level eth_platform_get_mac_address().

   Andrew


Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap

2018-04-12 Thread Michael S. Tsirkin
On Thu, Apr 12, 2018 at 07:38:25AM +, Karlsson, Magnus wrote:
> I think you are definitely right in that there are ways in which
> we can improve performance here. That said, the current queue
> performs slightly better than the previous one we had that was
> more or less a copy of one of your first virtio 1.1 proposals
> from little over a year ago. It had bidirectional queues and a
> valid flag in the descriptor itself. The reason we abandoned this
> was not poor performance (it was good), but a need to go to
> unidirectional queues. Maybe I should have only changed that
> aspect and kept the valid flag.

Is there a summary about unidirectional queues anywhere?  I'm curious to
know whether there are any lessons here to be learned for virtio
or ptr_ring.

-- 
MST


Re: [PATCH 4/4] dt-bindings: Document the DT bindings for lan78xx

2018-04-12 Thread Andrew Lunn
On Thu, Apr 12, 2018 at 02:55:36PM +0100, Phil Elwell wrote:
> The Microchip LAN78XX family of devices are Ethernet controllers with
> a USB interface. Despite being discoverable devices it can be useful to
> be able to configure them from Device Tree, particularly in low-cost
> applications without an EEPROM or programmed OTP.
> 
> Document the supported properties in a bindings file, adding it to
> MAINTAINERS at the same time.

Hi Phil

How you link an OF node to a USB device is not obvious. Could you
please include either a pointer to some binding documentation, or make
your example show it.

Thanks
Andrew


[PATCH net 2/2] sfc: limit ARFS workitems in flight per channel

2018-04-12 Thread Edward Cree
A misconfigured system (e.g. with all interrupts affinitised to all CPUs)
 may produce a storm of ARFS steering events.  With the existing sfc ARFS
 implementation, that could create a backlog of workitems that grinds the
 system to a halt.  To prevent this, limit the number of workitems that
 may be in flight for a given SFC device to 8 (EFX_RPS_MAX_IN_FLIGHT), and
 return EBUSY from our ndo_rx_flow_steer method if the limit is reached.
Given this limit, also store the workitems in an array of slots within the
 struct efx_nic, rather than dynamically allocating for each request.
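
The slot scheme described above can be sketched in user-space C. This is
an illustration only, not the driver code: it uses C11 atomics in place
of the kernel's test_and_set_bit()/clear_bit(), and the names slot_map,
slot_claim() and slot_release() are made up for the example. Only the
limit of 8 and the EBUSY return come from the description.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

#define MAX_IN_FLIGHT 8	/* mirrors EFX_RPS_MAX_IN_FLIGHT */

/* One bit per slot; a set bit means the slot is in flight. */
static _Atomic unsigned long slot_map;

/* Claim the first free slot. Returns its index, or -EBUSY if all
 * MAX_IN_FLIGHT slots are already taken.
 */
static int slot_claim(void)
{
	for (int i = 0; i < MAX_IN_FLIGHT; i++) {
		unsigned long bit = 1UL << i;

		/* Atomic test-and-set: we own the slot only if the bit
		 * was clear before our OR.
		 */
		if (!(atomic_fetch_or(&slot_map, bit) & bit))
			return i;
	}
	return -EBUSY;
}

/* Give a slot back once its work item has finished. */
static void slot_release(int idx)
{
	assert(idx >= 0 && idx < MAX_IN_FLIGHT);
	atomic_fetch_and(&slot_map, ~(1UL << idx));
}
```

Because the slots live in a fixed array, the backlog is bounded by
construction and no allocation is needed per request.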

Signed-off-by: Edward Cree 
---
 drivers/net/ethernet/sfc/net_driver.h | 25 +++
 drivers/net/ethernet/sfc/rx.c | 58 ++-
 2 files changed, 55 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/sfc/net_driver.h 
b/drivers/net/ethernet/sfc/net_driver.h
index 5e379a83c729..eea3808b3f25 100644
--- a/drivers/net/ethernet/sfc/net_driver.h
+++ b/drivers/net/ethernet/sfc/net_driver.h
@@ -733,6 +733,27 @@ struct efx_rss_context {
u32 rx_indir_table[128];
 };
 
+#ifdef CONFIG_RFS_ACCEL
+/**
+ * struct efx_async_filter_insertion - Request to asynchronously insert a 
filter
+ * @net_dev: Reference to the netdevice
+ * @spec: The filter to insert
+ * @work: Workitem for this request
+ * @rxq_index: Identifies the channel for which this request was made
+ * @flow_id: Identifies the kernel-side flow for which this request was made
+ */
+struct efx_async_filter_insertion {
+   struct net_device *net_dev;
+   struct efx_filter_spec spec;
+   struct work_struct work;
+   u16 rxq_index;
+   u32 flow_id;
+};
+
+/* Maximum number of ARFS workitems that may be in flight on an efx_nic */
+#define EFX_RPS_MAX_IN_FLIGHT  8
+#endif /* CONFIG_RFS_ACCEL */
+
 /**
  * struct efx_nic - an Efx NIC
  * @name: Device name (net device name or bus id before net device registered)
@@ -850,6 +871,8 @@ struct efx_rss_context {
  * @rps_expire_channel: Next channel to check for expiry
  * @rps_expire_index: Next index to check for expiry in
  * @rps_expire_channel's @rps_flow_id
+ * @rps_slot_map: bitmap of in-flight entries in @rps_slot
+ * @rps_slot: array of ARFS insertion requests for efx_filter_rfs_work()
  * @active_queues: Count of RX and TX queues that haven't been flushed and 
drained.
  * @rxq_flush_pending: Count of number of receive queues that need to be 
flushed.
  * Decremented when the efx_flush_rx_queue() is called.
@@ -1004,6 +1027,8 @@ struct efx_nic {
struct mutex rps_mutex;
unsigned int rps_expire_channel;
unsigned int rps_expire_index;
+   unsigned long rps_slot_map;
+   struct efx_async_filter_insertion rps_slot[EFX_RPS_MAX_IN_FLIGHT];
 #endif
 
atomic_t active_queues;
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 13b0eb71dbf3..9c593c661cbf 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -827,28 +827,13 @@ MODULE_PARM_DESC(rx_refill_threshold,
 
 #ifdef CONFIG_RFS_ACCEL
 
-/**
- * struct efx_async_filter_insertion - Request to asynchronously insert a 
filter
- * @net_dev: Reference to the netdevice
- * @spec: The filter to insert
- * @work: Workitem for this request
- * @rxq_index: Identifies the channel for which this request was made
- * @flow_id: Identifies the kernel-side flow for which this request was made
- */
-struct efx_async_filter_insertion {
-   struct net_device *net_dev;
-   struct efx_filter_spec spec;
-   struct work_struct work;
-   u16 rxq_index;
-   u32 flow_id;
-};
-
 static void efx_filter_rfs_work(struct work_struct *data)
 {
struct efx_async_filter_insertion *req = container_of(data, struct 
efx_async_filter_insertion,
  work);
struct efx_nic *efx = netdev_priv(req->net_dev);
struct efx_channel *channel = efx_get_channel(efx, req->rxq_index);
+   int slot_idx = req - efx->rps_slot;
int rc;
 
rc = efx->type->filter_insert(efx, &req->spec, true);
@@ -878,8 +863,8 @@ static void efx_filter_rfs_work(struct work_struct *data)
}
 
/* Release references */
+   clear_bit(slot_idx, &efx->rps_slot_map);
dev_put(req->net_dev);
-   kfree(req);
 }
 
 int efx_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
@@ -888,22 +873,36 @@ int efx_filter_rfs(struct net_device *net_dev, const 
struct sk_buff *skb,
struct efx_nic *efx = netdev_priv(net_dev);
struct efx_async_filter_insertion *req;
struct flow_keys fk;
+   int slot_idx;
+   int rc;
 
-   if (flow_id == RPS_FLOW_ID_INVALID)
-   return -EINVAL;
+   /* find a free slot */
+   for (slot_idx = 0; slot_idx < EFX_RPS_MAX_IN_FLIGHT; slot_idx++)
+   if (!test_and_set_bit(slot_idx, &efx->rps_slot_map))
+   break;
+   

[PATCH net 1/2] sfc: insert ARFS filters with replace_equal=true

2018-04-12 Thread Edward Cree
Necessary to allow redirecting a flow when the application moves.

Fixes: 3af0f34290f6 ("sfc: replace asynchronous filter operations")
Signed-off-by: Edward Cree 
---
 drivers/net/ethernet/sfc/rx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 95682831484e..13b0eb71dbf3 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -851,7 +851,7 @@ static void efx_filter_rfs_work(struct work_struct *data)
struct efx_channel *channel = efx_get_channel(efx, req->rxq_index);
int rc;
 
-   rc = efx->type->filter_insert(efx, &req->spec, false);
+   rc = efx->type->filter_insert(efx, &req->spec, true);
if (rc >= 0) {
/* Remember this so we can check whether to expire the filter
 * later.



[PATCH net 0/2] sfc: couple of ARFS fixes

2018-04-12 Thread Edward Cree
Two issues introduced by my recent asynchronous filter handling changes:
1. The old filter_rfs_insert would replace a matching filter of equal
   priority; we need to pass the appropriate argument to filter_insert to
   make it do the same.
2. It's possible to cause the kernel to hammer ndo_rx_flow_steer very
   hard, so make sure we don't build up too huge a backlog of workitems.

Possibly it would be better to fix #2 on the kernel side; I think the way
 to do that would be to maintain a forward (as well as reverse) queue-to-
 cpu map and replace the set_rps_cpu() check
if (rxq_index == skb_get_rx_queue(skb))
 with something like (pseudocode)
if (irqaffinity of queue[skb_get_rx_queue(skb)] includes next_cpu)
 but I'm not sure whether it's right or even necessary, and in any case
 it's not a regression in 4.17 so isn't 'net' material.
(There's also the issue that we come up in the bad configuration by
 default, but that too is a problem for another time.)
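
A rough user-space sketch of the pseudocode above, for illustration
only. It assumes a hypothetical per-queue affinity table
(queue_affinity[], one CPU bitmask per RX queue, at most 64 CPUs); all
function names here are invented for the example and none of them exist
in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical forward map: bit N of queue_affinity[q] is set when the
 * IRQ for RX queue q may be delivered to CPU N.
 */
static bool queue_irq_includes_cpu(const uint64_t *queue_affinity,
				   unsigned int rx_queue,
				   unsigned int next_cpu)
{
	assert(queue_affinity != 0);
	if (next_cpu >= 64)
		return false;
	return (queue_affinity[rx_queue] >> next_cpu) & 1;
}

/* Current style of check: only an exact queue match suppresses a
 * steering request.
 */
static bool needs_steer_exact(unsigned int rxq_index,
			      unsigned int skb_rx_queue)
{
	return rxq_index != skb_rx_queue;
}

/* Suggested style of check: no request if the queue the packet arrived
 * on can already interrupt the CPU that wants the flow.
 */
static bool needs_steer_affinity(const uint64_t *queue_affinity,
				 unsigned int skb_rx_queue,
				 unsigned int next_cpu)
{
	return !queue_irq_includes_cpu(queue_affinity, skb_rx_queue,
				       next_cpu);
}
```

With every queue affinitised to every CPU, the exact-match check keeps
asking for steering while the affinity check stays quiet.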

Edward Cree (2):
  sfc: insert ARFS filters with replace_equal=true
  sfc: limit ARFS workitems in flight per channel

 drivers/net/ethernet/sfc/net_driver.h | 25 +++
 drivers/net/ethernet/sfc/rx.c | 60 ++-
 2 files changed, 56 insertions(+), 29 deletions(-)



Re: [PATCHv2 net] sctp: do not check port in sctp_inet6_cmp_addr

2018-04-12 Thread Neil Horman
On Thu, Apr 12, 2018 at 02:24:31PM +0800, Xin Long wrote:
> pf->cmp_addr() is called before binding a v6 address to the sock. It
> should not check ports, like in sctp_inet_cmp_addr.
> 
> But sctp_inet6_cmp_addr checks the addr by invoking af(6)->cmp_addr,
> sctp_v6_cmp_addr where it also compares the ports.
> 
> This would cause that setsockopt(SCTP_SOCKOPT_BINDX_ADD) could bind
> multiple duplicated IPv6 addresses after Commit 40b4f0fd74e4 ("sctp:
> lack the check for ports in sctp_v6_cmp_addr").
> 
> This patch is to remove af->cmp_addr called in sctp_inet6_cmp_addr,
> but do the proper check for both v6 addrs and v4mapped addrs.
> 
> v1->v2:
>   - define __sctp_v6_cmp_addr to do the common address comparison
> used for both pf and af v6 cmp_addr.
> 
> Fixes: 40b4f0fd74e4 ("sctp: lack the check for ports in sctp_v6_cmp_addr")
> Reported-by: Jianwen Ji 
> Signed-off-by: Xin Long 
> ---
>  net/sctp/ipv6.c | 60 
> -
>  1 file changed, 30 insertions(+), 30 deletions(-)
> 
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index f1fc48e..09aba03 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -521,46 +521,49 @@ static void sctp_v6_to_addr(union sctp_addr *addr, 
> struct in6_addr *saddr,
>   addr->v6.sin6_scope_id = 0;
>  }
>  
> -/* Compare addresses exactly.
> - * v4-mapped-v6 is also in consideration.
> - */
> -static int sctp_v6_cmp_addr(const union sctp_addr *addr1,
> - const union sctp_addr *addr2)
> +static int __sctp_v6_cmp_addr(const union sctp_addr *addr1,
> +   const union sctp_addr *addr2)
>  {
>   if (addr1->sa.sa_family != addr2->sa.sa_family) {
>   if (addr1->sa.sa_family == AF_INET &&
>   addr2->sa.sa_family == AF_INET6 &&
> - ipv6_addr_v4mapped(&addr2->v6.sin6_addr)) {
> - if (addr2->v6.sin6_port == addr1->v4.sin_port &&
> - addr2->v6.sin6_addr.s6_addr32[3] ==
> - addr1->v4.sin_addr.s_addr)
> - return 1;
> - }
> + ipv6_addr_v4mapped(&addr2->v6.sin6_addr) &&
> + addr2->v6.sin6_addr.s6_addr32[3] ==
> + addr1->v4.sin_addr.s_addr)
> + return 1;
> +
>   if (addr2->sa.sa_family == AF_INET &&
>   addr1->sa.sa_family == AF_INET6 &&
> - ipv6_addr_v4mapped(&addr1->v6.sin6_addr)) {
> - if (addr1->v6.sin6_port == addr2->v4.sin_port &&
> - addr1->v6.sin6_addr.s6_addr32[3] ==
> - addr2->v4.sin_addr.s_addr)
> - return 1;
> - }
> + ipv6_addr_v4mapped(&addr1->v6.sin6_addr) &&
> + addr1->v6.sin6_addr.s6_addr32[3] ==
> + addr2->v4.sin_addr.s_addr)
> + return 1;
> +
>   return 0;
>   }
> - if (addr1->v6.sin6_port != addr2->v6.sin6_port)
> - return 0;
> +
>   if (!ipv6_addr_equal(&addr1->v6.sin6_addr, &addr2->v6.sin6_addr))
>   return 0;
> +
>   /* If this is a linklocal address, compare the scope_id. */
> - if (ipv6_addr_type(&addr1->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL) {
> - if (addr1->v6.sin6_scope_id && addr2->v6.sin6_scope_id &&
> - (addr1->v6.sin6_scope_id != addr2->v6.sin6_scope_id)) {
> - return 0;
> - }
> - }
> + if ((ipv6_addr_type(&addr1->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL) &&
> + addr1->v6.sin6_scope_id && addr2->v6.sin6_scope_id &&
> + addr1->v6.sin6_scope_id != addr2->v6.sin6_scope_id)
> + return 0;
>  
>   return 1;
>  }
>  
> +/* Compare addresses exactly.
> + * v4-mapped-v6 is also in consideration.
> + */
> +static int sctp_v6_cmp_addr(const union sctp_addr *addr1,
> + const union sctp_addr *addr2)
> +{
> + return __sctp_v6_cmp_addr(addr1, addr2) &&
> +addr1->v6.sin6_port == addr2->v6.sin6_port;
> +}
> +
>  /* Initialize addr struct to INADDR_ANY. */
>  static void sctp_v6_inaddr_any(union sctp_addr *addr, __be16 port)
>  {
> @@ -846,8 +849,8 @@ static int sctp_inet6_cmp_addr(const union sctp_addr 
> *addr1,
>  const union sctp_addr *addr2,
>  struct sctp_sock *opt)
>  {
> - struct sctp_af *af1, *af2;
>   struct sock *sk = sctp_opt2sk(opt);
> + struct sctp_af *af1, *af2;
>  
>   af1 = sctp_get_af_specific(addr1->sa.sa_family);
>   af2 = sctp_get_af_specific(addr2->sa.sa_family);
> @@ -863,10 +866,7 @@ static int sctp_inet6_cmp_addr(const union sctp_addr 
> *addr1,
>   if (sctp_is_any(sk, addr1) || sctp_is_any(sk, addr2))
>   return 1;
>  
> - if (addr1->sa.sa_family != addr2->sa.sa_family)
> - return 0;
> -
> - return af1->cmp_addr(addr1, addr2);
> + 

[PATCH 0/4] lan78xx: Read configuration from Device Tree

2018-04-12 Thread Phil Elwell
The Microchip LAN78XX family of devices are Ethernet controllers with
a USB interface. Despite being discoverable devices it can be useful to
be able to configure them from Device Tree, particularly in low-cost
applications without an EEPROM or programmed OTP.

This patch set adds support for reading the MAC address, EEE setting
and LED modes from Device Tree.

Phil Elwell (4):
  lan78xx: Read MAC address from DT if present
  lan78xx: Read initial EEE setting from Device Tree
  lan78xx: Read LED modes from Device Tree
  dt-bindings: Document the DT bindings for lan78xx

 .../devicetree/bindings/net/microchip,lan78xx.txt  | 44 
 MAINTAINERS|  1 +
 drivers/net/usb/lan78xx.c  | 81 --
 3 files changed, 105 insertions(+), 21 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/net/microchip,lan78xx.txt

-- 
2.7.4



[PATCH 2/4] lan78xx: Read initial EEE setting from Device Tree

2018-04-12 Thread Phil Elwell
Add two new Device Tree properties:
* microchip,eee-enabled  - a boolean to enable EEE
* microchip,tx-lpi-timer - time in microseconds to wait after TX goes
   idle before entering the low power state
   (default 600)

Signed-off-by: Phil Elwell 
---
 drivers/net/usb/lan78xx.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index d2727b5..d98397b 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -2080,6 +2080,23 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
mii_adv = (u32)mii_advertise_flowctrl(dev->fc_request_control);
phydev->advertising |= mii_adv_to_ethtool_adv_t(mii_adv);
 
+   if (of_property_read_bool(dev->udev->dev.of_node,
+ "microchip,eee-enabled")) {
+   struct ethtool_eee edata;
+
+   memset(&edata, 0, sizeof(edata));
+   edata.cmd = ETHTOOL_SEEE;
+   edata.advertised = ADVERTISED_1000baseT_Full |
+  ADVERTISED_100baseT_Full;
+   edata.eee_enabled = true;
+   edata.tx_lpi_enabled = true;
+   if (of_property_read_u32(dev->udev->dev.of_node,
+"microchip,tx-lpi-timer",
+&edata.tx_lpi_timer))
+   edata.tx_lpi_timer = 600; /* non-aggressive */
+   (void)lan78xx_set_eee(dev->net, &edata);
+   }
+
genphy_config_aneg(phydev);
 
dev->fc_autoneg = phydev->autoneg;
-- 
2.7.4



[PATCH 4/4] dt-bindings: Document the DT bindings for lan78xx

2018-04-12 Thread Phil Elwell
The Microchip LAN78XX family of devices are Ethernet controllers with
a USB interface. Despite being discoverable devices it can be useful to
be able to configure them from Device Tree, particularly in low-cost
applications without an EEPROM or programmed OTP.

Document the supported properties in a bindings file, adding it to
MAINTAINERS at the same time.

Signed-off-by: Phil Elwell 
---
 .../devicetree/bindings/net/microchip,lan78xx.txt  | 44 ++
 MAINTAINERS|  1 +
 2 files changed, 45 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/microchip,lan78xx.txt

diff --git a/Documentation/devicetree/bindings/net/microchip,lan78xx.txt 
b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt
new file mode 100644
index 000..e7d7850
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/microchip,lan78xx.txt
@@ -0,0 +1,44 @@
+Microchip LAN78xx Gigabit Ethernet controller
+
+The LAN78XX devices are usually configured by programming their OTP or with
+an external EEPROM, but some platforms (e.g. Raspberry Pi 3 B+) have neither.
+
+Please refer to ethernet.txt for a description of common Ethernet bindings.
+
+Optional properties:
+- microchip,eee-enabled: if present, enable Energy Efficient Ethernet support;
+- microchip,led-modes: a two-element vector, with each element configuring
+  the operating mode of an LED. The values supported by the device are;
+  0: Link/Activity
+  1: Link1000/Activity
+  2: Link100/Activity
+  3: Link10/Activity
+  4: Link100/1000/Activity
+  5: Link10/1000/Activity
+  6: Link10/100/Activity
+  7: RESERVED
+  8: Duplex/Collision
+  9: Collision
+  10: Activity
+  11: RESERVED
+  12: Auto-negotiation Fault
+  13: RESERVED
+  14: Off
+  15: On
+- microchip,tx-lpi-timer: the delay (in microseconds) between the TX fifo
+  becoming empty and invoking Low Power Idles (default 600).
+
+Example:
+
+   /* Standard configuration for a Raspberry Pi 3 B+ */
+   ethernet: usbether@1 {
+   compatible = "usb424,7800";
+   reg = <1>;
+   microchip,eee-enabled;
+   microchip,tx-lpi-timer = <600>;
+   /*
+* led0 = 1:link1000/activity
+* led1 = 6:link10/100/activity
+*/
+   microchip,led-modes = <1 6>;
+   };
diff --git a/MAINTAINERS b/MAINTAINERS
index 2328eed..b637aad 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14482,6 +14482,7 @@ M:  Microchip Linux Driver Support 

 L: netdev@vger.kernel.org
 S: Maintained
 F: drivers/net/usb/lan78xx.*
+F: Documentation/devicetree/bindings/net/microchip,lan78xx.txt
 
 USB MASS STORAGE DRIVER
 M: Alan Stern 
-- 
2.7.4



[PATCH 3/4] lan78xx: Read LED modes from Device Tree

2018-04-12 Thread Phil Elwell
Add support for DT property "microchip,led-modes", a vector of two
cells (u32s) in the range 0-15, each of which sets the mode for one
of the two LEDs. Some possible values are:

0=link/activity  1=link1000/activity
2=link100/activity   3=link10/activity
4=link100/1000/activity  5=link10/1000/activity
6=link10/100/activity    14=off    15=on

Also use the presence of the DT property to indicate that the
LEDs should be enabled - necessary in the event that no valid OTP
or EEPROM is available.
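
As an illustration of the packing this implies, each LED takes one
4-bit field of the register value, LED0 in bits 3:0 and LED1 in bits
7:4. The sketch below is stand-alone user-space C; the helper name
pack_led_modes() is hypothetical and only the nibble layout follows the
description.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replace the low n nibbles of reg with modes[0..n-1], one 4-bit mode
 * per LED, leaving all other bits of reg untouched.
 */
static uint32_t pack_led_modes(uint32_t reg, const uint32_t *modes,
			       size_t n)
{
	assert(modes != NULL && n <= 8);	/* 8 nibbles in 32 bits */

	for (size_t i = 0; i < n; i++) {
		reg &= ~(UINT32_C(0xf) << (i * 4));
		reg |= (modes[i] & UINT32_C(0xf)) << (i * 4);
	}
	return reg;
}
```

For the <1 6> pair (link1000/activity and link10/100/activity) this
yields 0x61 in the low byte.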

Signed-off-by: Phil Elwell 
---
 drivers/net/usb/lan78xx.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index d98397b..ffb483d 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -2008,6 +2008,7 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
 {
int ret;
u32 mii_adv;
+   u32 led_modes[2];
struct phy_device *phydev;
 
phydev = phy_find_first(dev->mdiobus);
@@ -2097,6 +2098,25 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
(void)lan78xx_set_eee(dev->net, &edata);
}
 
+   if (!of_property_read_u32_array(dev->udev->dev.of_node,
+   "microchip,led-modes",
+   led_modes, ARRAY_SIZE(led_modes))) {
+   u32 reg;
+   int i;
+
+   reg = phy_read(phydev, 0x1d);
+   for (i = 0; i < ARRAY_SIZE(led_modes); i++) {
+   reg &= ~(0xf << (i * 4));
+   reg |= (led_modes[i] & 0xf) << (i * 4);
+   }
+   (void)phy_write(phydev, 0x1d, reg);
+
+   /* Ensure the LEDs are enabled */
+   lan78xx_read_reg(dev, HW_CFG, &reg);
+   reg |= HW_CFG_LED0_EN_ | HW_CFG_LED1_EN_;
+   lan78xx_write_reg(dev, HW_CFG, reg);
+   }
+
genphy_config_aneg(phydev);
 
dev->fc_autoneg = phydev->autoneg;
-- 
2.7.4



[PATCH 1/4] lan78xx: Read MAC address from DT if present

2018-04-12 Thread Phil Elwell
There is a standard mechanism for locating and using a MAC address from
the Device Tree. Use this facility in the lan78xx driver to support
applications without programmed EEPROM or OTP. At the same time,
regularise the handling of the different address sources.
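
The source ordering and the register packing can be sketched in
stand-alone C as follows. This is not the driver code: mac_is_valid(),
mac_pick() and mac_to_regs() are hypothetical helpers, with
mac_is_valid() standing in for is_valid_ether_addr() (rejects all-zero
and multicast addresses). A NULL result from mac_pick() means "fall
back to a random address".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Valid unicast MAC: not all zeros and multicast bit clear. */
static bool mac_is_valid(const uint8_t addr[6])
{
	uint8_t any = 0;

	for (int i = 0; i < 6; i++)
		any |= addr[i];
	return any != 0 && !(addr[0] & 0x01);
}

/* Prefer the Device Tree address, then EEPROM/OTP; NULL if neither is
 * usable.
 */
static const uint8_t *mac_pick(const uint8_t *dt, const uint8_t *eeprom)
{
	if (dt && mac_is_valid(dt))
		return dt;
	if (eeprom && mac_is_valid(eeprom))
		return eeprom;
	return NULL;
}

/* Split six address bytes into the low (4 byte) and high (2 byte)
 * register values, least significant byte first.
 */
static void mac_to_regs(const uint8_t addr[6], uint32_t *lo, uint32_t *hi)
{
	assert(lo != NULL && hi != NULL);
	*lo = (uint32_t)addr[0] | ((uint32_t)addr[1] << 8) |
	      ((uint32_t)addr[2] << 16) | ((uint32_t)addr[3] << 24);
	*hi = (uint32_t)addr[4] | ((uint32_t)addr[5] << 8);
}
```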

Signed-off-by: Phil Elwell 
---
 drivers/net/usb/lan78xx.c | 44 +++-
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 55a78eb..d2727b5 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "lan78xx.h"
 
 #define DRIVER_AUTHOR  "WOOJUNG HUH "
@@ -1651,34 +1652,35 @@ static void lan78xx_init_mac_address(struct lan78xx_net 
*dev)
addr[5] = (addr_hi >> 8) & 0xFF;
 
if (!is_valid_ether_addr(addr)) {
-   /* reading mac address from EEPROM or OTP */
-   if ((lan78xx_read_eeprom(dev, EEPROM_MAC_OFFSET, ETH_ALEN,
-addr) == 0) ||
-   (lan78xx_read_otp(dev, EEPROM_MAC_OFFSET, ETH_ALEN,
- addr) == 0)) {
-   if (is_valid_ether_addr(addr)) {
-   /* eeprom values are valid so use them */
-   netif_dbg(dev, ifup, dev->net,
- "MAC address read from EEPROM");
-   } else {
-   /* generate random MAC */
-   random_ether_addr(addr);
-   netif_dbg(dev, ifup, dev->net,
- "MAC address set to random addr");
-   }
+   const u8 *mac_addr;
 
-   addr_lo = addr[0] | (addr[1] << 8) |
- (addr[2] << 16) | (addr[3] << 24);
-   addr_hi = addr[4] | (addr[5] << 8);
-
-   ret = lan78xx_write_reg(dev, RX_ADDRL, addr_lo);
-   ret = lan78xx_write_reg(dev, RX_ADDRH, addr_hi);
+   mac_addr = of_get_mac_address(dev->udev->dev.of_node);
+   if (mac_addr) {
+   /* valid address present in Device Tree */
+   ether_addr_copy(addr, mac_addr);
+   netif_dbg(dev, ifup, dev->net,
+ "MAC address read from Device Tree");
+   } else if (((lan78xx_read_eeprom(dev, EEPROM_MAC_OFFSET,
+ETH_ALEN, addr) == 0) ||
+   (lan78xx_read_otp(dev, EEPROM_MAC_OFFSET,
+ ETH_ALEN, addr) == 0)) &&
+  is_valid_ether_addr(addr)) {
+   /* eeprom values are valid so use them */
+   netif_dbg(dev, ifup, dev->net,
+ "MAC address read from EEPROM");
} else {
/* generate random MAC */
random_ether_addr(addr);
netif_dbg(dev, ifup, dev->net,
  "MAC address set to random addr");
}
+
+   addr_lo = addr[0] | (addr[1] << 8) |
+ (addr[2] << 16) | (addr[3] << 24);
+   addr_hi = addr[4] | (addr[5] << 8);
+
+   ret = lan78xx_write_reg(dev, RX_ADDRL, addr_lo);
+   ret = lan78xx_write_reg(dev, RX_ADDRH, addr_hi);
}
 
ret = lan78xx_write_reg(dev, MAF_LO(0), addr_lo);
-- 
2.7.4



XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-12 Thread Jesper Dangaard Brouer
Heads-up XDP performance nerds!

I got an unpleasant surprise when I updated my GCC compiler (to support
the option -mindirect-branch=thunk-extern).  My XDP redirect
performance numbers were cut in half, from approx 13Mpps to 6Mpps
(single CPU core).  I've identified the issue, which is caused by
kernel CONFIG_RETPOLINE, which only takes effect when the GCC compiler
supports it.  This is the mitigation for Spectre variant 2
(CVE-2017-5715), related to indirect (function call) branches.

XDP_REDIRECT itself has only two primary (per packet) indirect
function calls, ndo_xdp_xmit and invoking bpf_prog, plus any
map_lookup_elem calls in the bpf_prog.  I PoC implemented bulking for
ndo_xdp_xmit, which helped, but not enough. The real root-cause is all
the DMA API calls, which use function pointers extensively.


Mitigation plan
---
Implement support for keeping the DMA mapping through the XDP return
call, to remove RX map/unmap calls.  Implement bulking for XDP
ndo_xdp_xmit and the XDP return frame API.  Bulking allows DMA
bulking via scatter-gather DMA calls; XDP TX needs it for DMA
map+unmap. The driver RX DMA-sync (to CPU) per packet calls are harder
to mitigate (via bulk technique). Ask DMA maintainer for a common
case direct call for swiotlb DMA sync call ;-)
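
To illustrate what a "common case direct call" could look like, here is
a minimal user-space sketch. It is not the DMA API: sync_fn, sync_nop(),
sync_bounce() and sync_for_cpu() are invented names, and the counters
exist only so the example can be checked. The idea is to compare the
function pointer against the expected common target and call that
target directly, keeping the indirect call as the fallback.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*sync_fn)(void *buf, size_t len);

static unsigned long nop_calls, bounce_calls;

/* Common case: nothing to do (stands in for a no-op sync). */
static void sync_nop(void *buf, size_t len)
{
	(void)buf;
	(void)len;
	nop_calls++;
}

/* Uncommon case: some real work would happen here. */
static void sync_bounce(void *buf, size_t len)
{
	(void)buf;
	(void)len;
	bounce_calls++;
}

static inline void sync_for_cpu(sync_fn fn, void *buf, size_t len)
{
	assert(fn != NULL);
	if (fn == sync_nop)
		sync_nop(buf, len);	/* direct call, no thunk needed */
	else
		fn(buf, len);		/* indirect call via the pointer */
}
```

The pointer comparison is a normal conditional branch, so in a
retpoline build only the fallback path pays for the indirect-call
thunk.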

Root-cause verification
---
I have verified that indirect DMA calls are the root-cause, by
removing the DMA sync calls from the code (as they do nothing for
swiotlb), and manually inlining the DMA map calls (basically calling
phys_to_dma(dev, page_to_phys(page)) + offset). For my ixgbe test,
performance "returned" to 11Mpps.

Perf reports
---
It is not easy to diagnose via the perf event tool. I'm coordinating with
ACME to make it easier to pinpoint the hotspots.  Look out for symbols:
__x86_indirect_thunk_r10, __indirect_thunk_start, __x86_indirect_thunk_rdx
etc.  Be aware that they might not be super high in perf top, but they
stop CPU speculation.  Thus, instead use perf-stat and see the
negative effect on 'insn per cycle'.


If you want to understand retpoline at the ASM level, read this:
 https://support.google.com/faqs/answer/7625886

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [bpf-next PATCH 1/4] bpf: sockmap, free memory on sock close with cork data

2018-04-12 Thread Simon Horman
On Sun, Apr 01, 2018 at 08:00:54AM -0700, John Fastabend wrote:
> If a socket with pending cork data is closed we do not return the
> memory to the socket until the garbage collector free's the psock
> structure. The garbage collector though can run after the sock has
> completed its close operation. If this ordering happens the sock code
> will through a WARN_ON because there is still outstanding memory

s/through/throw/ ?

> accounted to the sock.
> 
> To resolve this ensure we return memory to the sock when a socket
> is closed.
> 
> Signed-off-by: John Fastabend 
> Fixes: 91843d540a13 ("bpf: sockmap, add msg_cork_bytes() helper")
> ---
>  kernel/bpf/sockmap.c |6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
> index d2bda5a..8ddf326 100644
> --- a/kernel/bpf/sockmap.c
> +++ b/kernel/bpf/sockmap.c
> @@ -211,6 +211,12 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
>   close_fun = psock->save_close;
>  
>   write_lock_bh(&sk->sk_callback_lock);
> + if (psock->cork) {
> + free_start_sg(psock->sock, psock->cork);
> + kfree(psock->cork);
> + psock->cork = NULL;
> + }
> +
>   list_for_each_entry_safe(md, mtmp, &psock->ingress, list) {
>   list_del(&md->list);
>   free_start_sg(psock->sock, md);
> 


Re: iproute2-4.16.0 no longer accepts routes via fe80::1

2018-04-12 Thread Thomas Deutschmann
Hi,

well, it isn't just "fe80::1", it is any IPv6 address which
will be rejected if not called with "-6". I ran a bisect:

> git bisect start
> # good: [50b8a842e8c098cddb213f5b3076526df88826e8] v4.15.0
> git bisect good 50b8a842e8c098cddb213f5b3076526df88826e8
> # bad: [4b6c4177ee66421770f0bbcc765c29135e44d921] v4.16.0
> git bisect bad 4b6c4177ee66421770f0bbcc765c29135e44d921
> # bad: [5f4892e2c8d4fb22118713e0c83290b352fe0e34] rdma: Make visible the 
> number of arguments
> git bisect bad 5f4892e2c8d4fb22118713e0c83290b352fe0e34
> # good: [8c75f69411bc8c3affe5d173afcf981d15f5da15] Merge branch 'master' into 
> net-next
> git bisect good 8c75f69411bc8c3affe5d173afcf981d15f5da15
> # bad: [27c523e209ab956ff269afec68c6e744e7f5edb6] utils: Introduce 
> get_addr_rta() and inet_addr_match_rta()
> git bisect bad 27c523e209ab956ff269afec68c6e744e7f5edb6
> # bad: [d0bcedd549566a87354aa804df3be6be80681ee9] tc: introduce 
> tc_qdisc_block_exists helper
> git bisect bad d0bcedd549566a87354aa804df3be6be80681ee9
> # bad: [6c4b672738acf680ee98c10e79a52a8dede5f9a6] iplink_geneve: Get rid of 
> inet_get_addr()
> git bisect bad 6c4b672738acf680ee98c10e79a52a8dede5f9a6
> # bad: [93fa12418dc6f5943692250244be303bb162175b] utils: Always specify 
> family and ->bytelen in get_prefix_1()
> git bisect bad 93fa12418dc6f5943692250244be303bb162175b
> # good: [f2522007d8fee924cb098b4afc8af16f2b25829f] utils: Always specify 
> family for address in get_addr_1()
> git bisect good f2522007d8fee924cb098b4afc8af16f2b25829f
> # first bad commit: [93fa12418dc6f5943692250244be303bb162175b] utils: Always 
> specify family and ->bytelen in get_prefix_1()

> From 93fa12418dc6f5943692250244be303bb162175b Mon Sep 17 00:00:00 2001
> From: Serhey Popovych
> Date: Thu, 18 Jan 2018 20:13:43 +0200
> Subject: utils: Always specify family and ->bytelen in get_prefix_1()
> 
> Handle default/all/any special case in get_addr_1() to setup
> ->family and ->bytelen correctly.
> 
> Make get_addr_1() return ->bitlen == -2 instead of -1 to
> distinguish default/all/any special case from the rest:
> it is safe because all callers check ->bitlen < 0, not
> explicit value -1.
> 
> Reduce intendation by one level and get rid of goto/label
> to make code more readable.
> 
> Signed-off-by: Serhey Popovych
> Signed-off-by: David Ahern

https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/commit/?id=93fa12418dc6f5943692250244be303bb162175b

So was this an intended behavior change? I.e. this will require
updates for various user space tools/network configuration scripts
which rely on the ip utility's auto-detection of the inet family,
which was "supported" (at least working) until 4.16.0...
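
For reference, the kind of family auto-detection such scripts depend on
can be sketched with plain inet_pton(). This is only an illustration of
the behaviour being discussed, not how iproute2 implements it;
detect_family() is a hypothetical helper.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Guess the address family from the textual form alone: try IPv4
 * first, then IPv6, and report AF_UNSPEC if neither parses.
 */
static int detect_family(const char *text)
{
	unsigned char buf[sizeof(struct in6_addr)];

	assert(text != NULL);
	if (inet_pton(AF_INET, text, buf) == 1)
		return AF_INET;
	if (inet_pton(AF_INET6, text, buf) == 1)
		return AF_INET6;
	return AF_UNSPEC;
}
```

With this approach "fe80::1" resolves to AF_INET6 without the caller
having to pass "-6".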


-- 
Regards,
Thomas Deutschmann / Gentoo Linux Developer
C4DD 695F A713 8F24 2AA1 5638 5849 7EE5 1D5D 74A5



Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-04-12 Thread Dmitry Vyukov
On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
 wrote:
> On 20.02.2018 18:26, Neil Horman wrote:
>>
>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>>>
>>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
>>>  wrote:

>>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>>>
>>>>> Is this meant to be fixed already? I am still seeing this on the
>>>>> latest upstream tree.
>>>>>
>>>>
>>>> These two commits are in v4.16-rc1:
>>>>
>>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>>> Author: Tommi Rantala 
>>>> Date:   Mon Feb 5 21:48:14 2018 +0200
>>>>
>>>>  sctp: fix dst refcnt leak in sctp_v4_get_dst
>>>> ...
>>>>  Fixes: 410f03831 ("sctp: add routing output fallback")
>>>>  Fixes: 0ca50d12f ("sctp: fix src address selection if using
>>>> secondary
>>>> addresses")
>>>>
>>>>
>>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>>> Author: Alexey Kodanev 
>>>> Date:   Mon Feb 5 15:10:35 2018 +0300
>>>>
>>>>  sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>>> ...
>>>>  Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using
>>>> secondary
>>>> addresses for ipv6")
>>>>
>>>>
>>>> I guess we missed something if it's still reproducible.
>>>>
>>>> I can check it later this week, unless someone else beat me to it.
>>>
>>>
>>> Hi Tommi,
>>>
>>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>>> another one then. But I am still seeing these:
>>>
>>> [   58.799130] unregister_netdevice: waiting for lo to become free.
>>> Usage count = 4
>>> [   60.847138] unregister_netdevice: waiting for lo to become free.
>>> Usage count = 4
>>> [   62.895093] unregister_netdevice: waiting for lo to become free.
>>> Usage count = 4
>>> [   64.943103] unregister_netdevice: waiting for lo to become free.
>>> Usage count = 4
>>>
>>> on upstream tree pulled ~12 hours ago.
>>>
>> Can you write a systemtap script to probe dev_hold, and dev_put, printing
>> out a
>> backtrace if the device name matches "lo".  That should tell us
>> definitively if
>> the problem is in the same location or not
>
>
> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
> sent in the first email in this thread:
>
> With 4.16-rc2 unable to reproduce.
>
> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
> lo to become free. Usage count = 3"
>
> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
> cherry-picked on top, unable to reproduce.
>
>
> Is syzkaller doing something else now to trigger the bug...?
> Can you still trigger the bug with the same reproducer?

Hi Neil, Tommi,

Reviving this old thread about "unregister_netdevice: waiting for lo
to become free. Usage count = 3" hangs.
I still have not had time to dive deep into what happens there (too
many bugs coming from syzbot). But this still actively happens and I
suspect it accounts for a significant portion of various hang reports,
which are quite unpleasant.

One idea that could make it all simpler:

Is this wait loop in netdev_wait_allrefs() supposed to wait for any
prolonged periods of time under any non-buggy conditions? E.g. more
than 1-2 minutes?
If it is only supposed to wait briefly for things that are already
supposed to be shutting down, and we add a WARNING there after some
timeout, then syzbot will report all the info on how/when it happens,
hopefully extracting reproducers, and all the nice things.
But this WARNING should not have any false positives under any
realistic conditions (e.g. waiting for arrival of remote packets with
large timeouts).

Looking at some task hung reports, it seems that this code holds some
mutexes, takes a workqueue thread and prevents any progress with
destruction of other devices (and net namespace creation/destruction),
so I guess it should not wait indefinitely?
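
To make the idea concrete, here is a minimal userspace sketch of a refcount wait loop that gives up and warns after a fixed timeout. It is not the kernel code: struct fake_dev, wait_allrefs(), the release callbacks, the 250 ms poll interval and the 2 minute limit are all hypothetical names and values chosen for illustration, and time is simulated rather than slept.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a net device; only the refcount matters here. */
struct fake_dev {
	const char *name;
	int refcnt;
};

/* Assumed upper bound for a non-buggy teardown, in milliseconds. */
#define WAIT_WARN_TIMEOUT_MS 120000L
/* Assumed interval between refcount checks, in milliseconds. */
#define WAIT_POLL_MS 250L

/*
 * Poll until dev->refcnt drops to zero. release() models other code
 * dropping its references and is called once per poll interval.
 * Returns true if the device became free, false if the timeout expired;
 * the timeout branch is where a WARNING would be emitted.
 */
static bool wait_allrefs(struct fake_dev *dev,
			 void (*release)(struct fake_dev *),
			 long *waited_ms)
{
	long elapsed = 0;

	assert(dev && release && waited_ms);

	while (dev->refcnt > 0) {
		if (elapsed >= WAIT_WARN_TIMEOUT_MS) {
			fprintf(stderr,
				"WARNING: %s still has %d refs after %ld ms\n",
				dev->name, dev->refcnt, elapsed);
			*waited_ms = elapsed;
			return false;
		}
		release(dev);
		elapsed += WAIT_POLL_MS; /* simulated time, no real sleep */
	}

	*waited_ms = elapsed;
	return true;
}

/* Well-behaved case: one reference is dropped per poll interval. */
static void release_one(struct fake_dev *dev)
{
	if (dev->refcnt > 0)
		dev->refcnt--;
}

/* Leaked-reference case: nothing ever drops the reference. */
static void release_none(struct fake_dev *dev)
{
	(void)dev;
}
```

With release_one() the loop finishes after a few polls; with release_none() it hits the timeout branch once and returns, which is the point where a report with a stack trace would be useful. Whether any legitimate path needs longer than the chosen limit is exactly the open question above.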


Re: [PATCH] net: ieee802154: atusb: Replace GFP_ATOMIC with GFP_KERNEL in atusb_probe

2018-04-12 Thread Stefan Schmidt
Hello.


On 04/11/2018 04:14 AM, Jia-Ju Bai wrote:
> atusb_probe() is never called in atomic context.
> This function is only set as ".probe" in struct usb_driver.
>
> Despite never getting called from atomic context,
> atusb_probe() calls usb_alloc_urb() with GFP_ATOMIC,
> which does not sleep for allocation.
> GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
> which can sleep and improve the possibility of successful allocation.
>
> This was found by a static analysis tool named DCNS, written by myself.
> I also checked it manually.
>
> Signed-off-by: Jia-Ju Bai 
> ---
>  drivers/net/ieee802154/atusb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ieee802154/atusb.c b/drivers/net/ieee802154/atusb.c
> index ef68851..ab6a505 100644
> --- a/drivers/net/ieee802154/atusb.c
> +++ b/drivers/net/ieee802154/atusb.c
> @@ -789,7 +789,7 @@ static int atusb_probe(struct usb_interface *interface,
>   atusb->tx_dr.bRequest = ATUSB_TX;
>   atusb->tx_dr.wValue = cpu_to_le16(0);
>  
> - atusb->tx_urb = usb_alloc_urb(0, GFP_ATOMIC);
> + atusb->tx_urb = usb_alloc_urb(0, GFP_KERNEL);
>   if (!atusb->tx_urb)
>   goto fail;
>  

This patch has been applied to the wpan tree and will be
part of the next pull request to net. Thanks!

regards
Stefan Schmidt


RE: [RFC PATCH v2 14/14] samples/bpf: sample application for AF_XDP sockets

2018-04-12 Thread Karlsson, Magnus


> -----Original Message-----
> From: Jesper Dangaard Brouer [mailto:bro...@redhat.com]
> Sent: Thursday, April 12, 2018 1:05 PM
> To: Björn Töpel 
> Cc: Karlsson, Magnus ; Duyck, Alexander H
> ; alexander.du...@gmail.com;
> john.fastab...@gmail.com; a...@fb.com;
> willemdebruijn.ker...@gmail.com; dan...@iogearbox.net;
> netdev@vger.kernel.org; michael.lundkv...@ericsson.com; Brandeburg,
> Jesse ; Singhai, Anjali
> ; Zhang, Qi Z ;
> ravineet.si...@ericsson.com; Topel, Bjorn ;
> bro...@redhat.com
> Subject: Re: [RFC PATCH v2 14/14] samples/bpf: sample application for
> AF_XDP sockets
> 
> On Tue, 27 Mar 2018 18:59:19 +0200
> Björn Töpel  wrote:
> 
> > +static void dump_stats(void)
> > +{
> > +   unsigned long stop_time = get_nsecs();
> > +   long dt = stop_time - start_time;
> > +   int i;
> > +
> > +   for (i = 0; i < num_socks; i++) {
> > +   double rx_pps = xsks[i]->rx_npkts * 1000000000. / dt;
> > +   double tx_pps = xsks[i]->tx_npkts * 1000000000. / dt;
> > +   char *fmt = "%-15s %'-11.0f %'-11lu\n";
> > +
> > +   printf("\n sock%d@", i);
> > +   print_benchmark(false);
> > +   printf("\n");
> > +
> > +   printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts",
> > +  dt / 1000000000.);
> > +   printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts);
> > +   printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts);
> > +   }
> > +}
> > +
> > +static void *poller(void *arg)
> > +{
> > +   (void)arg;
> > +   for (;;) {
> > +   sleep(1);
> > +   dump_stats();
> > +   }
> > +
> > +   return NULL;
> > +}
> 
> You are printing the "pps" (packets per sec) as an average over the entire
> test run... could you please change that to, at least also, have a more
> up-to-date value, such as the rate since the last measurement?
> 
> The problem is that when you start the test, the first reading will be too
> low, and it takes time to average out/up. For ixgbe, first reading will be
> zero, because it does a link-down+up, which stops my pktgen.
> 
> The second annoyance is that I like to change system/kernel settings during
> the run, and observe the effect. E.g. change CPU sleep states (via tuned-adm)
> during the test-run to see the effect, which I cannot with this long
> average.

Good points. Will fix.

/Magnus

> 
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC PATCH v2 14/14] samples/bpf: sample application for AF_XDP sockets

2018-04-12 Thread Jesper Dangaard Brouer
On Tue, 27 Mar 2018 18:59:19 +0200
Björn Töpel  wrote:

> +static void dump_stats(void)
> +{
> + unsigned long stop_time = get_nsecs();
> + long dt = stop_time - start_time;
> + int i;
> +
> + for (i = 0; i < num_socks; i++) {
> + double rx_pps = xsks[i]->rx_npkts * 1000000000. / dt;
> + double tx_pps = xsks[i]->tx_npkts * 1000000000. / dt;
> + char *fmt = "%-15s %'-11.0f %'-11lu\n";
> +
> + printf("\n sock%d@", i);
> + print_benchmark(false);
> + printf("\n");
> +
> + printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts",
> +dt / 1000000000.);
> + printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts);
> + printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts);
> + }
> +}
> +
> +static void *poller(void *arg)
> +{
> + (void)arg;
> + for (;;) {
> + sleep(1);
> + dump_stats();
> + }
> +
> + return NULL;
> +}

You are printing the "pps" (packets per sec) as an average over the
entire test run... could you please change that to, at least also, have
a more up-to-date value, such as the rate since the last measurement?

The problem is that when you start the test, the first reading will be
too low, and it takes time to average out/up. For ixgbe, first reading
will be zero, because it does a link-down+up, which stops my pktgen.

The second annoyance is that I like to change system/kernel settings
during the run, and observe the effect. E.g. change CPU sleep states
(via tuned-adm) during the test-run to see the effect, which I cannot
with this long average.
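
A minimal sketch of the per-interval rate, in case it helps. It assumes the timestamps are in nanoseconds (as get_nsecs() suggests); struct rate_state and interval_pps() are hypothetical names, not part of the patch.

```c
#include <assert.h>

/* Hypothetical per-socket snapshot of the previous reading. */
struct rate_state {
	unsigned long long prev_npkts;
	unsigned long long prev_nsecs;
};

/* Assumes timestamps are in nanoseconds. */
#define NSEC_PER_SEC 1000000000.

/*
 * Return packets per second since the previous call and update the
 * snapshot. Returns 0 if no time has passed since the last reading.
 */
static double interval_pps(struct rate_state *st,
			   unsigned long long npkts,
			   unsigned long long now_nsecs)
{
	unsigned long long dpkts, dt;
	double pps;

	assert(st);

	dpkts = npkts - st->prev_npkts;
	dt = now_nsecs - st->prev_nsecs;
	pps = dt ? dpkts * NSEC_PER_SEC / dt : 0.0;

	/* Remember this reading so the next call covers only the new interval. */
	st->prev_npkts = npkts;
	st->prev_nsecs = now_nsecs;
	return pps;
}
```

dump_stats() could keep one such state per socket and direction and print interval_pps() next to (or instead of) the run-long average. A zero first reading after link-down+up then no longer drags down later readings.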

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


[PATCH] selftests: add headers_install to lib.mk

2018-04-12 Thread Anders Roxell
If the kernel headers aren't installed, we can't build all the tests.
Add a new make target 'khdr' to lib.mk that generates the kernel
headers; it is picked up by every test-dir Makefile that includes
lib.mk. If the test dir in turn has its own sub-dirs, top_srcdir
needs to be set to the Linux root dir to be able to generate the
kernel headers.

Signed-off-by: Anders Roxell 
Reviewed-by: Fathi Boudra 
---
 tools/testing/selftests/android/Makefile  | 2 +-
 tools/testing/selftests/android/ion/Makefile  | 1 +
 tools/testing/selftests/bpf/Makefile  | 5 ++---
 tools/testing/selftests/futex/functional/Makefile | 1 +
 tools/testing/selftests/gpio/Makefile | 4 
 tools/testing/selftests/kvm/Makefile  | 6 +-
 tools/testing/selftests/lib.mk| 8 
 tools/testing/selftests/vm/Makefile   | 3 ---
 8 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/android/Makefile 
b/tools/testing/selftests/android/Makefile
index f6304d2be90c..087390bbad68 100644
--- a/tools/testing/selftests/android/Makefile
+++ b/tools/testing/selftests/android/Makefile
@@ -6,7 +6,7 @@ TEST_PROGS := run.sh
 
 include ../lib.mk
 
-all:
+all: khdr
@for DIR in $(SUBDIRS); do  \
BUILD_TARGET=$(OUTPUT)/$$DIR;   \
mkdir $$BUILD_TARGET  -p;   \
diff --git a/tools/testing/selftests/android/ion/Makefile 
b/tools/testing/selftests/android/ion/Makefile
index e03695287f76..14ecd9805748 100644
--- a/tools/testing/selftests/android/ion/Makefile
+++ b/tools/testing/selftests/android/ion/Makefile
@@ -11,6 +11,7 @@ $(TEST_GEN_FILES): ipcsocket.c ionutils.c
 TEST_PROGS := ion_test.sh
 
 include ../../lib.mk
+top_srcdir = ../../../../../
 
 $(OUTPUT)/ionapp_export: ionapp_export.c ipcsocket.c ionutils.c
 $(OUTPUT)/ionapp_import: ionapp_import.c ipcsocket.c ionutils.c
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 0a315ddabbf4..cc611a284087 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -16,9 +16,8 @@ LDLIBS += -lcap -lelf -lrt -lpthread
 TEST_CUSTOM_PROGS = $(OUTPUT)/urandom_read
 all: $(TEST_CUSTOM_PROGS)
 
-$(TEST_CUSTOM_PROGS): urandom_read
-
-urandom_read: urandom_read.c
+$(TEST_CUSTOM_PROGS):| khdr
+$(TEST_CUSTOM_PROGS): urandom_read.c
$(CC) -o $(TEST_CUSTOM_PROGS) -static $<
 
 # Order correspond to 'make run_tests' order
diff --git a/tools/testing/selftests/futex/functional/Makefile 
b/tools/testing/selftests/futex/functional/Makefile
index ff8feca49746..9f602fb40241 100644
--- a/tools/testing/selftests/futex/functional/Makefile
+++ b/tools/testing/selftests/futex/functional/Makefile
@@ -19,5 +19,6 @@ TEST_GEN_FILES := \
 TEST_PROGS := run.sh
 
 include ../../lib.mk
+top_srcdir = ../../../../../
 
 $(TEST_GEN_FILES): $(HEADERS)
diff --git a/tools/testing/selftests/gpio/Makefile 
b/tools/testing/selftests/gpio/Makefile
index 1bbb47565c55..768b2be010db 100644
--- a/tools/testing/selftests/gpio/Makefile
+++ b/tools/testing/selftests/gpio/Makefile
@@ -25,7 +25,3 @@ $(BINARIES): ../../../gpio/gpio-utils.o 
../../../../usr/include/linux/gpio.h
 
 ../../../gpio/gpio-utils.o:
make ARCH=$(ARCH) CROSS_COMPILE=$(CROSS_COMPILE) -C ../../../gpio
-
-../../../../usr/include/linux/gpio.h:
-   make -C ../../../.. headers_install INSTALL_HDR_PATH=$(shell 
pwd)/../../../../usr/
-
diff --git a/tools/testing/selftests/kvm/Makefile 
b/tools/testing/selftests/kvm/Makefile
index dc44de904797..ba03ce334212 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -31,9 +31,5 @@ $(LIBKVM_OBJ): $(OUTPUT)/%.o: %.c
 $(OUTPUT)/libkvm.a: $(LIBKVM_OBJ)
$(AR) crs $@ $^
 
-$(LINUX_HDR_PATH):
-   make -C $(top_srcdir) headers_install
-
-all: $(STATIC_LIBS) $(LINUX_HDR_PATH)
+all: $(STATIC_LIBS)
 $(TEST_GEN_PROGS): $(STATIC_LIBS)
-$(TEST_GEN_PROGS) $(LIBKVM_OBJ): | $(LINUX_HDR_PATH)
diff --git a/tools/testing/selftests/lib.mk b/tools/testing/selftests/lib.mk
index 195e9d4739a9..e0bfbc5b1f1f 100644
--- a/tools/testing/selftests/lib.mk
+++ b/tools/testing/selftests/lib.mk
@@ -16,8 +16,16 @@ TEST_GEN_PROGS := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS))
 TEST_GEN_PROGS_EXTENDED := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_PROGS_EXTENDED))
 TEST_GEN_FILES := $(patsubst %,$(OUTPUT)/%,$(TEST_GEN_FILES))
 
+top_srcdir ?= ../../../../
+
 all: $(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES)
 
+.PHONY: khdr
+khdr:
+   make -C $(top_srcdir) headers_install
+
+$(TEST_GEN_PROGS) $(TEST_GEN_PROGS_EXTENDED) $(TEST_GEN_FILES):| khdr
+
 .ONESHELL:
 define RUN_TESTS
@export KSFT_TAP_LEVEL=`echo 1`;
diff --git a/tools/testing/selftests/vm/Makefile 
b/tools/testing/selftests/vm/Makefile
index fdefa2295ddc..1e34a40745ef 100644
--- a/tools/testing/selftests/vm/Makefile
+++ 

Re: WARNING in kobject_add_internal

2018-04-12 Thread Dmitry Vyukov
On Thu, Apr 12, 2018 at 12:04 PM, Dmitry Vyukov  wrote:
> On Thu, Apr 12, 2018 at 2:29 AM, Yuan, Linyu (NSB - CN/Shanghai)
>  wrote:
>> Hi,
>>
>> I have a question,
>> "can syzbot auto test each tree with newest changeset" ?
>
> Hi Yuan,
>
> Please elaborate.
> What trees? What newest changeset? Test against what criteria?

+syzkaller mailing list

>>> -----Original Message-----
>>> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
>>> On Behalf Of Dmitry Vyukov
>>> Sent: Wednesday, April 11, 2018 10:58 PM
>>> To: syzbot
>>> Cc: bri...@lists.linux-foundation.org; David Miller; Greg Kroah-Hartman;
>>> LKML; netdev; stephen hemminger; syzkaller-bugs
>>> Subject: Re: WARNING in kobject_add_internal
>>>
>>> On Fri, Jan 5, 2018 at 10:41 PM, syzbot wrote:
>>> > syzkaller has found reproducer for the following crash on
>>> > 89876f275e8d562912d9c238cd888b52065cf25c
>>> > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>>> > compiler: gcc (GCC) 7.1.1 20170620
>>> > .config is attached
>>> > Raw console output is attached.
>>> > C reproducer is attached
>>> > syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>>> > for information about syzkaller reproducers
>>> >
>>> >
>>> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>> > Reported-by:
>>> > syzbot+e204ced820ef739d71ef5438f5e1976a874abc8d@syzkaller.appspotmail.com
>>> > It will help syzbot understand when the bug is fixed.
>>>
>>> #syz dup: WARNING: kobject bug in device_add
>>>
>>> > ------------[ cut here ]------------
>>> > kobject_add_internal failed for   (error: -12 parent: net)
>>> > WARNING: CPU: 1 PID: 3494 at lib/kobject.c:244
>>> > kobject_add_internal+0x3f6/0xbc0 lib/kobject.c:242
>>> > Kernel panic - not syncing: panic_on_warn set ...
>>> >
>>> > CPU: 1 PID: 3494 Comm: syzkaller425998 Not tainted 4.15.0-rc6+ #249
>>> > Hardware name: Google Google Compute Engine/Google Compute Engine,
>>> BIOS
>>> > Google 01/01/2011
>>> > Call Trace:
>>> >  __dump_stack lib/dump_stack.c:17 [inline]
>>> >  dump_stack+0x194/0x257 lib/dump_stack.c:53
>>> >  panic+0x1e4/0x41c kernel/panic.c:183
>>> >  __warn+0x1dc/0x200 kernel/panic.c:547
>>> >  report_bug+0x211/0x2d0 lib/bug.c:184
>>> >  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
>>> >  fixup_bug arch/x86/kernel/traps.c:247 [inline]
>>> >  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
>>> >  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>>> >  invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1079
>>> > RIP: 0010:kobject_add_internal+0x3f6/0xbc0 lib/kobject.c:242
>>> > RSP: 0018:8801c53c76f0 EFLAGS: 00010286
>>> > RAX: dc08 RBX: 8801bf5a88d8 RCX: 8159da9e
>>> > RDX:  RSI: 110038a78e99 RDI: 8801c53c73f8
>>> > RBP: 8801c53c77e8 R08: 110038a78e5b R09: 
>>> > R10: 8801c53c74b0 R11:  R12: 110038a78ee4
>>> > R13: fff4 R14: 8801d8359a80 R15: 86201980
>>> >  kobject_add_varg lib/kobject.c:366 [inline]
>>> >  kobject_add+0x132/0x1f0 lib/kobject.c:411
>>> >  device_add+0x35d/0x1650 drivers/base/core.c:1787
>>> >  netdev_register_kobject+0x183/0x360 net/core/net-sysfs.c:1604
>>> >  register_netdevice+0xb2b/0x1010 net/core/dev.c:7698
>>> >  tun_set_iff drivers/net/tun.c:2319 [inline]
>>> >  __tun_chr_ioctl+0x1d89/0x3dd0 drivers/net/tun.c:2524
>>> >  tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2773
>>> >  vfs_ioctl fs/ioctl.c:46 [inline]
>>> >  do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
>>> >  SYSC_ioctl fs/ioctl.c:701 [inline]
>>> >  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
>>> >  entry_SYSCALL_64_fastpath+0x23/0x9a
>>> > RIP: 0033:0x444fc9
>>> > RSP: 002b:7fff53389dc8 EFLAGS: 0246 ORIG_RAX:
>>> 0010
>>> > RAX: ffda RBX: 0001 RCX: 00444fc9
>>> > RDX: 20533000 RSI: 400454ca RDI: 0004
>>> > RBP: 0005 R08: 0002 R09: 006f3131
>>> > R10:  R11: 0246 R12: 00402500
>>> > R13: 00402590 R14:  R15: 
>>> >
>>> > Dumping ftrace buffer:
>>> >(ftrace buffer empty)
>>> > Kernel Offset: disabled
>>> > Rebooting in 86400 seconds..
>>> >

