date:20180406

Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-06 Thread Siwei Liu

(put discussions back on list which got accidentally removed)

On Tue, Apr 3, 2018 at 4:08 PM, Stephen Hemminger
 wrote:
> On Tue, 3 Apr 2018 12:04:38 -0700
> Siwei Liu  wrote:
>
>> On Tue, Apr 3, 2018 at 10:35 AM, Stephen Hemminger
>>  wrote:
>> > On Sun,  1 Apr 2018 05:13:09 -0400
>> > Si-Wei Liu  wrote:
>> >
>> >> A hidden netdevice is not visible to userspace, so
>> >> typical network utilities, e.g. ip, ifconfig et al.,
>> >> cannot sense its existence or configure it. Internally,
>> >> a hidden netdev may be associated with an upper-level netdev
>> >> that userspace has access to. Although userspace cannot
>> >> manipulate the lower netdev directly, the user may control
>> >> or configure the underlying hidden device through the
>> >> upper-level netdev. For identification purposes, the
>> >> kobject for a hidden netdev is still present in the sysfs
>> >> hierarchy; however, no uevent message will be generated
>> >> when the sysfs entry is created, modified or destroyed.
>> >>
>> >> To that end, a separate namescope needs to be carved
>> >> out for IFF_HIDDEN netdevs. As of now, a netdev name that
>> >> starts with a colon, i.e. ':', is invalid in userspace,
>> >> since socket ioctls such as SIOCGIFCONF use ':' as the
>> >> separator for ifname. The unused namescope starting
>> >> with ':' can therefore be used as the namescope for
>> >> the kernel-only IFF_HIDDEN netdevs.
>> >>
>> >> Signed-off-by: Si-Wei Liu 
>> >> ---
>> >
>> > I understand the use case. I proposed using . as a prefix before
>> > but that ran into resistance. Using colon seems worse.
>>
>> Using dot (.) can't be good because it would cause namespace collisions
>> and thus break apps when you hide the device. Imagine a user really
>> wants to add a link with the same name as the hidden one, and it starts
>> with a dot. It would fail, and users wouldn't know it's just because the
>> name starts with a dot. IMHO users should be agnostic of (the namespace
>> of) hidden devices altogether if what they pick is a valid name.
>>
>> ":" is an invalid prefix for userspace, so there's no such problem if
>> it is used to construct the namescope for hidden devices.
>>
>> However, technically, just as I alluded to in the earlier reply,
>> it might really be more consistent to put this under a separate namespace
>> rather than fiddling with a name prefix. But I am just not sure if that
>> is too big a hammer, and would like to gather enough feedback and attention
>> before going that way too quickly.
>>
>>
>> >
>> > Rather than playing with names and all the issues that can cause,
>> > why not make it an attribute flag of the device in netlink.
>>
>> An attribute flag doesn't help. It's a matter of namespace.
>>
>> Regards,
>> -Siwei
>
> In Vyatta, we used names like ".isatap" for devices that would clutter up
> the user experience. They are naturally not visible by simple scans of
> /sys/class/net, and there was a patch to ignore them in iproute2.
> It was a hack which worked but not really worth upstreaming.
>
> The question is: if this is a security feature, then it needs to be more

I don't expect the namespace to be a security aspect of the feature, but
rather a way to let old, unmodified userspace work with a new
feature. And we're going to add an API to expose the netdev info for the
invisible IFF_AUTO_MANAGED links anyway. To be honest, we don't need to
make it secure and keep everything hidden in the dark.

> robust than just name prefix. Plus it took years to handle network
> namespaces everywhere; this kind of flag would start same problems.
>
> Network namespaces work but have the problem namespaces only weakly
> support hierarchy and nesting. I prefer the namespace approach
> because it fits better and has less impact.

Great, thanks!

-Siwei
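
The name-prefix argument in this thread can be made concrete with a small userspace sketch. It is loosely modelled on the checks the kernel applies to interface names (no '/', ':' or whitespace, not "." or "..", shorter than IFNAMSIZ); the helper names and the SKETCH_IFNAMSIZ constant are hypothetical and are not part of the posted patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_IFNAMSIZ 16 /* assumed to match the kernel's IFNAMSIZ */

/* Hypothetical check for names userspace may pick: rejects empty or
 * overlong names, "." and "..", and anything containing '/', ':' or
 * whitespace. */
static bool sketch_user_valid_ifname(const char *name)
{
	assert(name != NULL);
	if (*name == '\0' || strlen(name) >= SKETCH_IFNAMSIZ)
		return false;
	if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
		return false;
	for (; *name; name++) {
		if (*name == '/' || *name == ':' ||
		    isspace((unsigned char)*name))
			return false;
	}
	return true;
}

/* Hypothetical kernel-only scope: a leading ':' followed by a name that
 * would otherwise be valid. Such a name can never collide with a name
 * accepted by sketch_user_valid_ifname(). */
static bool sketch_hidden_ifname(const char *name)
{
	assert(name != NULL);
	return strlen(name) < SKETCH_IFNAMSIZ && name[0] == ':' &&
	       sketch_user_valid_ifname(name + 1);
}
```

Under these assumptions ".isatap" is still a name a user could legitimately choose, which is the collision described above for the dot prefix, while ":eth0" is rejected for userspace and so stays free for hidden devices.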

kernel BUG at drivers/vhost/vhost.c:LINE! (2)

2018-04-06 Thread syzbot


Hello,

syzbot hit the following crash on upstream commit
38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +)
Merge tag 'armsoc-drivers' of  
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=65a84dde0214b0387ccd


So far this crash happened 4 times on upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6586748079439872
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=5974272052822016
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=6224632407392256
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-5813481738265533882

compiler: gcc (GCC) 8.0.1 20180301 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+65a84dde0214b0387...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for details.

If you forward the report, please keep this part and the footer.

[ cut here ]
kernel BUG at drivers/vhost/vhost.c:1652!
invalid opcode:  [#1] SMP KASAN
[ cut here ]
Dumping ftrace buffer:
kernel BUG at drivers/vhost/vhost.c:1652!
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 4461 Comm: syzkaller684218 Not tainted 4.16.0+ #3
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

RIP: 0010:set_bit_to_user drivers/vhost/vhost.c:1652 [inline]
RIP: 0010:log_write+0x42a/0x4d0 drivers/vhost/vhost.c:1676
RSP: 0018:8801b256f920 EFLAGS: 00010293
RAX: 8801adc9e2c0 RBX: dc00 RCX: 85924a0f
RDX:  RSI: 85924cea RDI: 0005
RBP: 8801b256fa58 R08: 8801adc9e2c0 R09: ed003962412d
R10: 8801b256fad8 R11: 8801cb12096f R12: 0001
R13: ed00364adf36 R14:  R15: 8801b256fa30
FS:  7fdf24b19700() GS:8801db10() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 20bf6000 CR3: 0001ae6a7000 CR4: 001406e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 vhost_update_used_flags+0x3af/0x4a0 drivers/vhost/vhost.c:1723
 vhost_vq_init_access+0x117/0x590 drivers/vhost/vhost.c:1763
 vhost_vsock_start drivers/vhost/vsock.c:446 [inline]
 vhost_vsock_dev_ioctl+0x751/0x920 drivers/vhost/vsock.c:678
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:500 [inline]
 do_vfs_ioctl+0x1cf/0x1650 fs/ioctl.c:684
 ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
 SYSC_ioctl fs/ioctl.c:708 [inline]
 SyS_ioctl+0x24/0x30 fs/ioctl.c:706
 do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x4456c9
RSP: 002b:7fdf24b18da8 EFLAGS: 0297 ORIG_RAX: 0010
RAX: ffda RBX: 006dac24 RCX: 004456c9
RDX: 20f82ffc RSI: 4004af61 RDI: 001b
RBP: 006dac20 R08:  R09: 
R10:  R11: 0297 R12: 6b636f73762d7473
R13: 6f68762f7665642f R14: fffc R15: 0007
Code: e8 7c 5e e4 fb 4c 89 ef e8 e4 16 06 fc 48 8d 85 58 ff ff ff 48 c1 e8  
03 c6 04 18 f8 e9 46 ff ff ff 45 31 f6 eb 91 e8 56 5e e4 fb <0f> 0b e8 4f  
5e e4 fb 48 c7 c6 a0 a3 24 88 4c 89 ef e8 60 b6 10
RIP: set_bit_to_user drivers/vhost/vhost.c:1652 [inline] RSP:  
8801b256f920

RIP: log_write+0x42a/0x4d0 drivers/vhost/vhost.c:1676 RSP: 8801b256f920
invalid opcode:  [#2] SMP KASAN
---[ end trace 0d0ff45aa44d8a23 ]---
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkal...@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug report.

Note: all commands must start from beginning of the line in the email body.

Re: [PATCH net 2/2] net: systemport: Fix sparse warnings in bcm_sysport_insert_tsb()

2018-04-06 Thread Sasha Levin

Hi,

[This is an automated email]

This commit has been processed because it contains a "Fixes:" tag,
fixing commit: 80105befdb4b net: systemport: add Broadcom SYSTEMPORT Ethernet 
MAC driver.

The bot has also determined it's probably a bug fixing patch. (score: 50.4075)

The bot has tested the following trees: v4.16, v4.15.15, v4.14.32, v4.9.92, 
v4.4.126.

v4.16: Build OK!
v4.15.15: Build OK!
v4.14.32: Build OK!
v4.9.92: Build OK!
v4.4.126: Build OK!

--
Thanks,
Sasha

Re: [PATCH net-next 6/6] netdevsim: Add simple FIB resource controller via devlink

2018-04-06 Thread David Ahern

On 4/5/18 11:52 PM, Jiri Pirko wrote:
> Thu, Apr 05, 2018 at 11:06:41PM CEST, d...@cumulusnetworks.com wrote:
>> On 4/5/18 2:10 PM, David Ahern wrote:
>>>
>>> The ASIC here is the kernel tables in a namespace. It does not make
>>> sense to have 2 devlink instances for a single namespace.
>>
>> I put this example controller in netdevsim per a suggestion from Ido.
>> The netdevsim seemed like a good idea given that module's intention --
>> testing network facilities. Perhaps I should have done this as a
>> completely standalone module ...
>>
>> The intention is to treat the kernel's tables *per namespace* as a
>> standalone entity that can be managed very similar to ASIC resources.
> 
> So you say you want to treat a namespace as an ASIC? That sounds very
> odd to me :/

Why? The kernel has forwarding tables, acl's, etc just like the ASIC,
and each namespace is a separate set of tables.

If you think about it, userspace "programs" the kernel just like mlxsw
and userspace SDKs "program" an asic.

>> Given that I can add a resource controller module
>> (drivers/net/kern_res_mgr.c?) that creates a 'struct device' per network
>> namespace with a devlink instance. In this case the device would very
>> much be tied to the namespace 1:1.
> 
> That sounds more reasonable and accurate, yet still odd. You would not
> have any netdevices there? Any ports?
> 

Sure, whatever ports are assigned to or created in the namespace.

Nothing about the devlink API says it has to be a real h/w device.
Nothing about the devlink API says it can only be used for real h/w that
has ports represented by netdevices that the devlink instance somehow
has "control" over.

As the netdevsim demo shows, I can build an L3 resource controller for
the kernel tables using just the devlink API and the in-kernel notifiers.

Re: [PATCH iproute2-next v1] rdma: Print net device name and index for RDMA device

2018-04-06 Thread David Ahern

On 4/2/18 10:29 PM, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> RDMA devices operating in RoCE and iWARP modes have a net device
> underneath. Present its name in regular output and its net index
> in detailed mode.
> 
> [root@nps ~]# rdma link show mlx5_3/1
> 4/1: mlx5_3/1: state ACTIVE physical_state LINK_UP netdev ens7
> [root@nps ~]# rdma link show mlx5_3/1 -d
> 4/1: mlx5_3/1: state ACTIVE physical_state LINK_UP netdev ens7 netdev_index 7
> caps: 
> 
> Signed-off-by: Leon Romanovsky 
> Reviewed-by: Steve Wise 
> ---
>  Changes v0->v1:
>   * Resend after commit 29122c1aae35 ("rdma: update rdma_netlink.h")
> which updated relevant netlink attributes.
>   * Added Steve's ROB
> ---
>  rdma/include/uapi/rdma/rdma_netlink.h |  4 
>  rdma/link.c   | 21 +
>  rdma/utils.c  |  2 ++
>  3 files changed, 27 insertions(+)
> 

applied to iproute2-next

[PATCH v4] net: thunderx: rework mac addresses list to u64 array

2018-04-06 Thread Vadim Lomovtsev

From: Vadim Lomovtsev 

It is too expensive to pass u64 values via a linked list; instead,
allocate an array for them sized by the number of MAC addresses from the netdev.

This removes multiple kmalloc() calls, avoids memory
fragmentation, and allows a single NULL check on the kmalloc()
return value to prevent a potential NULL pointer dereference.

Addresses-Coverity-ID: 1467429 ("Dereference null return value")
Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF")
Reported-by: Dan Carpenter 
Signed-off-by: Vadim Lomovtsev 
---
Changes from v1 to v2:
 - C99 syntax: update xcast_addr_list struct field mc[0] -> mc[];
Changes from v2 to v3:
 - update commit description with 'Reported-by: Dan Carpenter';
 - update size calculations for mc list to offsetof() call
   instead of explicit arithmetic;
Changes from v3 to v4:
 - change loop control variable type from u8 to int, accordingly
   to mc_count size;
---
 drivers/net/ethernet/cavium/thunder/nic.h|  7 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 +---
 2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h
index 5fc46c5..448d1fa 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -265,14 +265,9 @@ struct nicvf_drv_stats {
 
 struct cavium_ptp;
 
-struct xcast_addr {
-   struct list_head list;
-   u64  addr;
-};
-
 struct xcast_addr_list {
-   struct list_head list;
int  count;
+   u64  mc[];
 };
 
 struct nicvf_work {
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 1e9a31f..6bd5658 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
  work.work);
struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
union nic_mbx mbx = {};
-   struct xcast_addr *xaddr, *next;
+   int idx = 0;
 
if (!vf_work)
return;
@@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
/* check if we have any specific MACs to be added to PF DMAC filter */
if (vf_work->mc) {
/* now go through kernel list of MACs and add them one by one */
-   list_for_each_entry_safe(xaddr, next,
-                            &vf_work->mc->list, list) {
+   for (idx = 0; idx < vf_work->mc->count; idx++) {
mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST;
-   mbx.xcast.data.mac = xaddr->addr;
+   mbx.xcast.data.mac = vf_work->mc->mc[idx];
nicvf_send_msg_to_pf(nic, &mbx);
-
-   /* after receiving ACK from PF release memory */
-   list_del(&xaddr->list);
-   kfree(xaddr);
-   vf_work->mc->count--;
}
kfree(vf_work->mc);
}
@@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
mode |= BGX_XCAST_MCAST_FILTER;
/* here we need to copy mc addrs */
if (netdev_mc_count(netdev)) {
-   struct xcast_addr *xaddr;
-
-   mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
-   INIT_LIST_HEAD(&mc_list->list);
+   mc_list = kmalloc(offsetof(typeof(*mc_list),
+                              mc[netdev_mc_count(netdev)]),
+                     GFP_ATOMIC);
+   if (unlikely(!mc_list))
+   return;
+   mc_list->count = 0;
netdev_hw_addr_list_for_each(ha, &netdev->mc) {
-   xaddr = kmalloc(sizeof(*xaddr),
-   GFP_ATOMIC);
-   xaddr->addr =
+   mc_list->mc[mc_list->count] =
ether_addr_to_u64(ha->addr);
-   list_add_tail(&xaddr->list,
-                 &mc_list->list);
mc_list->count++;
}
}
-- 
1.8.3.1
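
The single-allocation layout this patch moves to can be sketched in portable userspace C. This is only an illustration under stated assumptions: `struct addr_list`, `addr_list_size()` and `addr_list_from()` are hypothetical names, malloc() stands in for kmalloc(), and the size is computed as offsetof() plus n elements rather than with the GNU-style `offsetof(type, mc[n])` form used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of the reworked xcast_addr_list. */
struct addr_list {
	int count;
	uint64_t mc[]; /* C99 flexible array member */
};

/* Bytes needed for a list holding n addresses. */
static size_t addr_list_size(size_t n)
{
	return offsetof(struct addr_list, mc) + n * sizeof(uint64_t);
}

/* One allocation and one NULL check, instead of one allocation per
 * address as with the old linked list. Returns NULL on failure. */
static struct addr_list *addr_list_from(const uint64_t *src, size_t n)
{
	struct addr_list *l;
	size_t i;

	assert(src != NULL || n == 0);
	l = malloc(addr_list_size(n));
	if (!l)
		return NULL;
	l->count = 0;
	for (i = 0; i < n; i++)
		l->mc[l->count++] = src[i];
	return l;
}
```

A caller would release the whole list with a single free(), mirroring the single kfree(vf_work->mc) left in nicvf_set_rx_mode_task().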

Re: [PATCH net] net: dsa: b53: Fix sparse warnings in b53_mmap.c

2018-04-06 Thread Sasha Levin

Hi,

[This is an automated email]

This commit has been processed because it contains a "Fixes:" tag,
fixing commit: 967dd82ffc52 net: dsa: b53: Add support for Broadcom RoboSwitch.

The bot has also determined it's probably a bug fixing patch. (score: 8.8847)

The bot has tested the following trees: v4.16, v4.15.15, v4.14.32, v4.9.92.

v4.16: Build OK!
v4.15.15: Build OK!
v4.14.32: Build OK!
v4.9.92: Build OK!

--
Thanks,
Sasha

Re: [PATCH net v6 4/4] ipv6: udp: set dst cache for a connected sk if current not valid

2018-04-06 Thread Sasha Levin

Hi,

[This is an automated email]

This commit has been processed because it contains a "Fixes:" tag,
fixing commit: 33c162a980fe ipv6: datagram: Update dst cache of a connected 
datagram sk during pmtu update.

The bot has tested the following trees: v4.16, v4.15.15, v4.14.32, v4.9.92.

v4.16: Failed to apply! Possible dependencies:
96818159c3c0 ("ipv6: allow to cache dst for a connected sk in 
ip6_sk_dst_lookup_flow()")

v4.15.15: Failed to apply! Possible dependencies:
96818159c3c0 ("ipv6: allow to cache dst for a connected sk in 
ip6_sk_dst_lookup_flow()")

v4.14.32: Failed to apply! Possible dependencies:
96818159c3c0 ("ipv6: allow to cache dst for a connected sk in 
ip6_sk_dst_lookup_flow()")

v4.9.92: Failed to apply! Possible dependencies:
96818159c3c0 ("ipv6: allow to cache dst for a connected sk in 
ip6_sk_dst_lookup_flow()")


--
Thanks,
Sasha

Re: [PATCH net 1/2] net: bcmgenet: Fix sparse warnings in bcmgenet_put_tx_csum()

2018-04-06 Thread Sasha Levin

Hi,

[This is an automated email]

This commit has been processed because it contains a "Fixes:" tag,
fixing commit: 1c1008c793fa net: bcmgenet: add main driver file.

The bot has also determined it's probably a bug fixing patch. (score: 49.2621)

The bot has tested the following trees: v4.16, v4.15.15, v4.14.32, v4.9.92, 
v4.4.126.

v4.16: Build OK!
v4.15.15: Build OK!
v4.14.32: Build OK!
v4.9.92: Build OK!
v4.4.126: Build OK!

--
Thanks,
Sasha

[RFC PATCH bpf-next 5/6] samples/bpf: add a test for bpf_get_stack helper

2018-04-06 Thread Yonghong Song

The test attaches a kprobe program to the kernel function sys_write.
It tests getting the stack for user space, for kernel space, and for
user space with a build_id request. It also tests getting the user
and kernel stacks into the same buffer with back-to-back
bpf_get_stack helper calls.

Whenever the kernel stack is available, the user space
application will check to ensure that sys_write/SyS_write
is part of the stack.

Signed-off-by: Yonghong Song 
---
 samples/bpf/Makefile   |   4 +
 samples/bpf/trace_get_stack_kern.c |  80 
 samples/bpf/trace_get_stack_user.c | 150 +
 3 files changed, 234 insertions(+)
 create mode 100644 samples/bpf/trace_get_stack_kern.c
 create mode 100644 samples/bpf/trace_get_stack_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 4d6a6ed..94e7b10 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -44,6 +44,7 @@ hostprogs-y += xdp_monitor
 hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 hostprogs-y += cpustat
+hostprogs-y += trace_get_stack
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
@@ -95,6 +96,7 @@ xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
 xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
 cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
+trace_get_stack-objs := bpf_load.o $(LIBBPF) trace_get_stack_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -148,6 +150,7 @@ always += xdp_rxq_info_kern.o
 always += xdp2skb_meta_kern.o
 always += syscall_tp_kern.o
 always += cpustat_kern.o
+always += trace_get_stack_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -193,6 +196,7 @@ HOSTLOADLIBES_xdp_monitor += -lelf
 HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
 HOSTLOADLIBES_cpustat += -lelf
+HOSTLOADLIBES_trace_get_stack += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/trace_get_stack_kern.c b/samples/bpf/trace_get_stack_kern.c
new file mode 100644
index 000..c7cc7b1
--- /dev/null
+++ b/samples/bpf/trace_get_stack_kern.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+/* Permit pretty deep stack traces */
+#define MAX_STACK 100
+struct stack_trace_t {
+   int pid;
+   int kern_stack_size;
+   int user_stack_size;
+   int user_stack_buildid_size;
+   u64 kern_stack[MAX_STACK];
+   u64 user_stack[MAX_STACK];
+   struct bpf_stack_build_id user_stack_buildid[MAX_STACK];
+};
+
+struct bpf_map_def SEC("maps") perfmap = {
+   .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+   .key_size = sizeof(int),
+   .value_size = sizeof(u32),
+   .max_entries = 2,
+};
+
+struct bpf_map_def SEC("maps") stackdata_map = {
+   .type = BPF_MAP_TYPE_PERCPU_ARRAY,
+   .key_size = sizeof(u32),
+   .value_size = sizeof(struct stack_trace_t),
+   .max_entries = 1,
+};
+
+SEC("kprobe/sys_write")
+int bpf_prog1(struct pt_regs *ctx)
+{
+   int max_len, max_buildid_len, usize, ksize, total_size;
+   struct stack_trace_t *data;
+   void *raw_data;
+   u32 key = 0;
+
+   data = bpf_map_lookup_elem(&stackdata_map, &key);
+   if (!data)
+   return 0;
+
+   max_len = MAX_STACK * sizeof(u64);
+   max_buildid_len = MAX_STACK * sizeof(struct bpf_stack_build_id);
+   data->pid = bpf_get_current_pid_tgid();
+   data->kern_stack_size = bpf_get_stack(ctx, data->kern_stack,
+ max_len, 0);
+   data->user_stack_size = bpf_get_stack(ctx, data->user_stack, max_len,
+   BPF_F_USER_STACK);
+   data->user_stack_buildid_size = bpf_get_stack(
+   ctx, data->user_stack_buildid, max_buildid_len,
+   BPF_F_USER_STACK | BPF_F_USER_BUILD_ID);
+   bpf_perf_event_output(ctx, &perfmap, 0, data, sizeof(*data));
+
+   /* write both kernel and user stacks to the same buffer */
+   raw_data = (void *)data;
+   usize = bpf_get_stack(ctx, raw_data, max_len, BPF_F_USER_STACK);
+   if (usize < 0)
+   return 0;
+
+   ksize = 0;
+   if (usize < max_len) {
+   ksize = bpf_get_stack(ctx, raw_data + usize, max_len - usize,
+ 0);
+   if (ksize < 0)
+   return 0;
+   }
+   total_size = (usize < max_len ? usize : 0) +
+(ksize < max_len ? ksize : 0);
+   if (total_size > 0 && total_size < max_len)
+   bpf_perf_event_output(ctx, &perfmap, 0, raw_data, total_size);
+
+   return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version
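
The back-to-back fill used in the sample above can be sketched in plain userspace C. Everything here is an assumption made for illustration: `fake_get_stack()` is a hypothetical stand-in that copies whole 8-byte frames and returns the number of bytes written (or -1), which may not match the helper's real truncation behaviour, and `fill_user_then_kernel()` is a hypothetical wrapper around the two calls.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for bpf_get_stack(): copies as many whole
 * frames as fit in `size` bytes, returns bytes written or -1. */
static int fake_get_stack(void *buf, int size,
			  const uint64_t *frames, int nframes)
{
	int fit;

	if (size < 0 || nframes < 0)
		return -1;
	fit = size / (int)sizeof(uint64_t);
	if (fit > nframes)
		fit = nframes;
	if (fit > 0)
		memcpy(buf, frames, (size_t)fit * sizeof(uint64_t));
	return fit * (int)sizeof(uint64_t);
}

/* Write the user stack first, then the kernel stack right behind it in
 * the same buffer. Returns total bytes written or -1 on error. */
static int fill_user_then_kernel(void *buf, int max_len,
				 const uint64_t *user, int nuser,
				 const uint64_t *kern, int nkern)
{
	int usize, ksize = 0;

	assert(buf != NULL);
	usize = fake_get_stack(buf, max_len, user, nuser);
	if (usize < 0)
		return -1;
	if (usize < max_len) {
		/* Remaining space starts right after the user frames. */
		ksize = fake_get_stack((char *)buf + usize, max_len - usize,
				       kern, nkern);
		if (ksize < 0)
			return -1;
	}
	return usize + ksize;
}
```

The second call only runs when the first one left room, which is the same guard the sample applies before computing total_size.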

[RFC PATCH bpf-next 3/6] tools/bpf: add bpf_get_stack helper to tools headers

2018-04-06 Thread Yonghong Song

Signed-off-by: Yonghong Song 
---
 tools/include/uapi/linux/bpf.h| 17 +++--
 tools/testing/selftests/bpf/bpf_helpers.h |  2 ++
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9d07465..3930463 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -517,6 +517,17 @@ union bpf_attr {
  * other bits - reserved
  * Return: >= 0 stackid on success or negative error
  *
+ * int bpf_get_stack(ctx, buf, size, flags)
+ * walk user or kernel stack and store the ips in buf
+ * @ctx: struct pt_regs*
+ * @buf: user buffer to fill stack
+ * @size: the buf size
+ * @flags: bits 0-7 - number of stack frames to skip
+ * bit 8 - collect user stack instead of kernel
+ * bit 11 - get build-id as well if user stack
+ * other bits - reserved
+ * Return: >= 0 size copied on success or negative error
+ *
  * s64 bpf_csum_diff(from, from_size, to, to_size, seed)
  * calculate csum diff
  * @from: raw from buffer
@@ -821,7 +832,8 @@ union bpf_attr {
FN(msg_apply_bytes),\
FN(msg_cork_bytes), \
FN(msg_pull_data),  \
-   FN(bind),
+   FN(bind),   \
+   FN(get_stack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -855,11 +867,12 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
 #define BPF_F_TUNINFO_IPV6 (1ULL << 0)
 
-/* BPF_FUNC_get_stackid flags. */
+/* BPF_FUNC_get_stackid and BPF_FUNC_get_stack flags. */
 #define BPF_F_SKIP_FIELD_MASK  0xffULL
 #define BPF_F_USER_STACK   (1ULL << 8)
 #define BPF_F_FAST_STACK_CMP   (1ULL << 9)
 #define BPF_F_REUSE_STACKID(1ULL << 10)
+#define BPF_F_USER_BUILD_ID(1ULL << 11)
 
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX (1ULL << 1)
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index d8223d9..acaed02 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -96,6 +96,8 @@ static int (*bpf_msg_pull_data)(void *ctx, int start, int end, int flags) =
(void *) BPF_FUNC_msg_pull_data;
 static int (*bpf_bind)(void *ctx, void *addr, int addr_len) =
(void *) BPF_FUNC_bind;
+static int (*bpf_get_stack)(void *ctx, void *buf, int size, int flags) =
+   (void *) BPF_FUNC_get_stack;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
-- 
2.9.5
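
The flag layout documented in this patch (bits 0-7 skip count, bit 8 user stack, bit 11 build-id) can be illustrated with a small sketch. The three macros are copied from the patch; `struct stack_req`, `make_stack_flags()` and `decode_stack_flags()` are hypothetical helpers written for this illustration and are not part of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in the patch above. */
#define BPF_F_SKIP_FIELD_MASK	0xffULL
#define BPF_F_USER_STACK	(1ULL << 8)
#define BPF_F_USER_BUILD_ID	(1ULL << 11)

/* Hypothetical decoded view of the flags argument. */
struct stack_req {
	uint32_t skip;		/* frames to skip, bits 0-7 */
	bool user;		/* bit 8: user stack instead of kernel */
	bool user_build_id;	/* bit 11: build-id for user stack */
};

/* Hypothetical helper: compose a flags word for bpf_get_stack(). */
static uint64_t make_stack_flags(uint32_t skip, bool user, bool build_id)
{
	uint64_t flags;

	assert(skip <= BPF_F_SKIP_FIELD_MASK);
	flags = skip & BPF_F_SKIP_FIELD_MASK;
	if (user)
		flags |= BPF_F_USER_STACK;
	if (build_id)
		flags |= BPF_F_USER_BUILD_ID;
	return flags;
}

/* Hypothetical helper: split a flags word back into its fields. */
static struct stack_req decode_stack_flags(uint64_t flags)
{
	struct stack_req req;

	req.skip = (uint32_t)(flags & BPF_F_SKIP_FIELD_MASK);
	req.user = (flags & BPF_F_USER_STACK) != 0;
	req.user_build_id = (flags & BPF_F_USER_BUILD_ID) != 0;
	return req;
}
```

For example, skipping 3 frames of a user stack with build-ids requested yields 0x903, the same combination style the samples/bpf test passes as `BPF_F_USER_STACK | BPF_F_USER_BUILD_ID`.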

[RFC PATCH bpf-next 2/6] bpf: add bpf_get_stack helper

2018-04-06 Thread Yonghong Song

Currently, stackmap and the bpf_get_stackid helper are provided
for bpf programs to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table,
so some stack traces are missing from the user's perspective.

This patch implements a new helper, bpf_get_stack, which
sends stack traces directly to the bpf program. The bpf program
is able to see all stack traces, and can then do in-kernel
processing or send stack traces to user space through a
shared map or bpf_perf_event_output.

Signed-off-by: Yonghong Song 
---
 include/linux/bpf.h  |  1 +
 include/linux/filter.h   |  3 ++-
 include/uapi/linux/bpf.h | 17 +--
 kernel/bpf/stackmap.c| 56 
 kernel/bpf/syscall.c | 12 ++-
 kernel/bpf/verifier.c|  3 +++
 kernel/trace/bpf_trace.c | 50 +-
 7 files changed, 137 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 95a7abd..72ccb9a 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -676,6 +676,7 @@ extern const struct bpf_func_proto bpf_get_current_comm_proto;
 extern const struct bpf_func_proto bpf_skb_vlan_push_proto;
 extern const struct bpf_func_proto bpf_skb_vlan_pop_proto;
 extern const struct bpf_func_proto bpf_get_stackid_proto;
+extern const struct bpf_func_proto bpf_get_stack_proto;
 extern const struct bpf_func_proto bpf_sock_map_update_proto;
 
 /* Shared helpers among cBPF and eBPF. */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index fc4e8f9..9b64f63 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -467,7 +467,8 @@ struct bpf_prog {
dst_needed:1,   /* Do we need dst entry? */
blinded:1,  /* Was blinded */
is_func:1,  /* program is a bpf function */
-   kprobe_override:1; /* Do we override a kprobe? */
+   kprobe_override:1, /* Do we override a kprobe? */
+   need_callchain_buf:1; /* Needs callchain buffer? */
enum bpf_prog_type  type;   /* Type of BPF program */
enum bpf_attach_typeexpected_attach_type; /* For some prog types */
u32 len;/* Number of filter blocks */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c5ec897..a4ff5b7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -517,6 +517,17 @@ union bpf_attr {
  * other bits - reserved
  * Return: >= 0 stackid on success or negative error
  *
+ * int bpf_get_stack(ctx, buf, size, flags)
+ * walk user or kernel stack and store the ips in buf
+ * @ctx: struct pt_regs*
+ * @buf: user buffer to fill stack
+ * @size: the buf size
+ * @flags: bits 0-7 - number of stack frames to skip
+ * bit 8 - collect user stack instead of kernel
+ * bit 11 - get build-id as well if user stack
+ * other bits - reserved
+ * Return: >= 0 size copied on success or negative error
+ *
  * s64 bpf_csum_diff(from, from_size, to, to_size, seed)
  * calculate csum diff
  * @from: raw from buffer
@@ -821,7 +832,8 @@ union bpf_attr {
FN(msg_apply_bytes),\
FN(msg_cork_bytes), \
FN(msg_pull_data),  \
-   FN(bind),
+   FN(bind),   \
+   FN(get_stack),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -855,11 +867,12 @@ enum bpf_func_id {
 /* BPF_FUNC_skb_set_tunnel_key and BPF_FUNC_skb_get_tunnel_key flags. */
 #define BPF_F_TUNINFO_IPV6 (1ULL << 0)
 
-/* BPF_FUNC_get_stackid flags. */
+/* BPF_FUNC_get_stackid and BPF_FUNC_get_stack flags. */
 #define BPF_F_SKIP_FIELD_MASK  0xffULL
 #define BPF_F_USER_STACK   (1ULL << 8)
 #define BPF_F_FAST_STACK_CMP   (1ULL << 9)
 #define BPF_F_REUSE_STACKID(1ULL << 10)
+#define BPF_F_USER_BUILD_ID(1ULL << 11)
 
 /* BPF_FUNC_skb_set_tunnel_key flags. */
 #define BPF_F_ZERO_CSUM_TX (1ULL << 1)
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 04f6ec1..371c72e 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -402,6 +402,62 @@ const struct bpf_func_proto bpf_get_stackid_proto = {
.arg3_type  = ARG_ANYTHING,
 };
 
+BPF_CALL_4(bpf_get_stack, struct pt_regs *, regs, void *, buf, u32, size,
+  u64, flags)
+{
+   u32 init_nr, trace_nr, copy_len, elem_size, num_elem;
+   bool user_build_id = flags & BPF_F_USER_BUILD_ID;
+   u32 skip = flags & BPF_F_SKIP_FIELD_MASK;
+   bool user = flags & BPF_F_USER_STACK;
+   struct perf_callchain_entry *trace;
+   bool kernel = !user;
+

[RFC PATCH bpf-next 0/6] bpf: add bpf_get_stack_helper

2018-04-06 Thread Yonghong Song

Currently, stackmap and the bpf_get_stackid helper are provided
for bpf programs to get the stack trace. This approach has
a limitation though. If two stack traces have the same hash,
only one will get stored in the stackmap table,
so some stack traces are missing from the user's perspective.

This patch set implements a new helper, bpf_get_stack, which
sends stack traces directly to the bpf program. The bpf program
is able to see all stack traces, and can then do in-kernel
processing or send stack traces to user space through a
shared map or bpf_perf_event_output.

Patches #1 and #2 implemented the core kernel support.
Patch #3 synced the new helper to tools headers.
Patches #4 and #5 added a test in samples/bpf by attaching
to a kprobe, and Patch #6 added a test in tools/bpf by
attaching to a tracepoint.

Yonghong Song (6):
  bpf: change prototype for stack_map_get_build_id_offset
  bpf: add bpf_get_stack helper
  tools/bpf: add bpf_get_stack helper to tools headers
  samples/bpf: move common-purpose perf_event functions to bpf_load.c
  samples/bpf: add a test for bpf_get_stack helper
  tools/bpf: add a test case for bpf_get_stack helper

 include/linux/bpf.h   |   1 +
 include/linux/filter.h|   3 +-
 include/uapi/linux/bpf.h  |  17 ++-
 kernel/bpf/stackmap.c |  69 --
 kernel/bpf/syscall.c  |  12 +-
 kernel/bpf/verifier.c |   3 +
 kernel/trace/bpf_trace.c  |  50 +++-
 samples/bpf/Makefile  |   4 +
 samples/bpf/bpf_load.c| 104 +++
 samples/bpf/bpf_load.h|   5 +
 samples/bpf/trace_get_stack_kern.c|  80 
 samples/bpf/trace_get_stack_user.c| 150 ++
 samples/bpf/trace_output_user.c   | 113 ++--
 tools/include/uapi/linux/bpf.h|  17 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |   2 +
 tools/testing/selftests/bpf/test_progs.c  |  41 +-
 tools/testing/selftests/bpf/test_stacktrace_map.c |  20 ++-
 17 files changed, 568 insertions(+), 123 deletions(-)
 create mode 100644 samples/bpf/trace_get_stack_kern.c
 create mode 100644 samples/bpf/trace_get_stack_user.c

-- 
2.9.5
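
The limitation described in this cover letter can be illustrated with a deliberately tiny toy model. This is not the kernel's stackmap: the bucket count, trace depth, XOR hash and the `toy_*` names are all assumptions chosen only to show how two different traces that land in the same bucket leave one of them unrecorded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_NBUCKETS 4
#define TOY_DEPTH 2

/* One slot per bucket, holding a single fixed-depth trace. */
struct toy_bucket {
	bool used;
	uint64_t ips[TOY_DEPTH];
};

/* Toy hash: XOR of the two instruction pointers (assumption). */
static uint32_t toy_hash(const uint64_t *ips)
{
	return (uint32_t)(ips[0] ^ ips[1]);
}

/* Toy analogue of a stackid lookup: returns the bucket id, or -1 when
 * a different trace already occupies the bucket, in which case the new
 * trace is not stored at all. */
static int toy_get_stackid(struct toy_bucket *map, const uint64_t *ips)
{
	uint32_t id = toy_hash(ips) % TOY_NBUCKETS;

	assert(map != NULL && ips != NULL);
	if (map[id].used &&
	    memcmp(map[id].ips, ips, sizeof(map[id].ips)) != 0)
		return -1;
	memcpy(map[id].ips, ips, sizeof(map[id].ips));
	map[id].used = true;
	return (int)id;
}
```

In this model the traces {1, 2} and {2, 1} hash to the same bucket, so the second one is lost to the reader of the map; a helper that copies the trace into a caller-supplied buffer has no such table to collide in.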

[RFC PATCH bpf-next 6/6] tools/bpf: add a test case for bpf_get_stack helper

2018-04-06 Thread Yonghong Song

The test_stacktrace_map is enhanced to call bpf_get_stack
in the helper to get the stack trace as well.
The stack traces from bpf_get_stack and bpf_get_stackid
are compared to ensure that, for the same stack as
represented by the same hash, their ip addresses
are the same.

Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/test_progs.c  | 41 ++-
 tools/testing/selftests/bpf/test_stacktrace_map.c | 20 +--
 2 files changed, 57 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index faadbe2..8aa2844 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -865,9 +865,39 @@ static int compare_map_keys(int map1_fd, int map2_fd)
return 0;
 }
 
+static int compare_stack_ips(int smap_fd, int amap_fd)
+{
+   int max_len = PERF_MAX_STACK_DEPTH * sizeof(__u64);
+   __u32 key, next_key, *cur_key_p, *next_key_p;
+   char val_buf1[max_len], val_buf2[max_len];
+   int i, err;
+
+   cur_key_p = NULL;
+   next_key_p = &key;
+   while (bpf_map_get_next_key(smap_fd, cur_key_p, next_key_p) == 0) {
+   err = bpf_map_lookup_elem(smap_fd, next_key_p, val_buf1);
+   if (err)
+   return err;
+   err = bpf_map_lookup_elem(amap_fd, next_key_p, val_buf2);
+   if (err)
+   return err;
+   for (i = 0; i < max_len; i++) {
+   if (val_buf1[i] != val_buf2[i])
+   return -1;
+   }
+   key = *next_key_p;
+   cur_key_p = &key;
+   next_key_p = &next_key;
+   }
+   if (errno != ENOENT)
+   return -1;
+
+   return 0;
+}
+
 static void test_stacktrace_map()
 {
-   int control_map_fd, stackid_hmap_fd, stackmap_fd;
+   int control_map_fd, stackid_hmap_fd, stackmap_fd, stack_amap_fd;
const char *file = "./test_stacktrace_map.o";
int bytes, efd, err, pmu_fd, prog_fd;
struct perf_event_attr attr = {};
@@ -925,6 +955,10 @@ static void test_stacktrace_map()
if (stackmap_fd < 0)
goto disable_pmu;
 
+   stack_amap_fd = bpf_find_map(__func__, obj, "stack_amap");
+   if (stack_amap_fd < 0)
+   goto disable_pmu;
+
/* give some time for bpf program run */
sleep(1);
 
@@ -946,6 +980,11 @@ static void test_stacktrace_map()
  "err %d errno %d\n", err, errno))
goto disable_pmu_noerr;
 
+   err = compare_stack_ips(stackmap_fd, stack_amap_fd);
+   if (CHECK(err, "compare_stack_ips stackmap vs. stack_amap",
+ "err %d errno %d\n", err, errno))
+   goto disable_pmu_noerr;
+
goto disable_pmu_noerr;
 disable_pmu:
error_cnt++;
diff --git a/tools/testing/selftests/bpf/test_stacktrace_map.c b/tools/testing/selftests/bpf/test_stacktrace_map.c
index 76d85c5d..f83c7b6 100644
--- a/tools/testing/selftests/bpf/test_stacktrace_map.c
+++ b/tools/testing/selftests/bpf/test_stacktrace_map.c
@@ -19,14 +19,21 @@ struct bpf_map_def SEC("maps") stackid_hmap = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(__u32),
.value_size = sizeof(__u32),
-   .max_entries = 1,
+   .max_entries = 16384,
 };
 
 struct bpf_map_def SEC("maps") stackmap = {
.type = BPF_MAP_TYPE_STACK_TRACE,
.key_size = sizeof(__u32),
.value_size = sizeof(__u64) * PERF_MAX_STACK_DEPTH,
-   .max_entries = 1,
+   .max_entries = 16384,
+};
+
+struct bpf_map_def SEC("maps") stack_amap = {
+   .type = BPF_MAP_TYPE_ARRAY,
+   .key_size = sizeof(__u32),
+   .value_size = sizeof(__u64) * PERF_MAX_STACK_DEPTH,
+   .max_entries = 16384,
 };
 
 /* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
@@ -44,7 +51,10 @@ struct sched_switch_args {
 SEC("tracepoint/sched/sched_switch")
 int oncpu(struct sched_switch_args *ctx)
 {
+   __u32 max_len = PERF_MAX_STACK_DEPTH * sizeof(__u64);
__u32 key = 0, val = 0, *value_p;
+   void *stack_p;
+
 
value_p = bpf_map_lookup_elem(&control_map, &key);
if (value_p && *value_p)
@@ -52,8 +62,12 @@ int oncpu(struct sched_switch_args *ctx)
 
/* The size of stackmap and stackid_hmap should be the same */
key = bpf_get_stackid(ctx, &stackmap, 0);
-   if ((int)key >= 0)
+   if ((int)key >= 0) {
bpf_map_update_elem(&stackid_hmap, &key, &val, 0);
+   stack_p = bpf_map_lookup_elem(&stack_amap, &key);
+   if (stack_p)
+   bpf_get_stack(ctx, stack_p, max_len, 0);
+   }
 
return 0;
 }
-- 
2.9.5

[RFC PATCH bpf-next 1/6] bpf: change prototype for stack_map_get_build_id_offset

2018-04-06 Thread Yonghong Song

This patch introduces no functional change. The function prototype
is changed so that the same function can be reused later.

Signed-off-by: Yonghong Song 
---
 kernel/bpf/stackmap.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 57eeb12..04f6ec1 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -262,16 +262,11 @@ static int stack_map_get_build_id(struct vm_area_struct *vma,
return ret;
 }
 
-static void stack_map_get_build_id_offset(struct bpf_map *map,
- struct stack_map_bucket *bucket,
+static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
  u64 *ips, u32 trace_nr, bool user)
 {
int i;
struct vm_area_struct *vma;
-   struct bpf_stack_build_id *id_offs;
-
-   bucket->nr = trace_nr;
-   id_offs = (struct bpf_stack_build_id *)bucket->data;
 
/*
 * We cannot do up_read() in nmi context, so build_id lookup is
@@ -361,8 +356,10 @@ BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
pcpu_freelist_pop(&smap->freelist);
if (unlikely(!new_bucket))
return -ENOMEM;
-   stack_map_get_build_id_offset(map, new_bucket, ips,
- trace_nr, user);
+   new_bucket->nr = trace_nr;
+   stack_map_get_build_id_offset(
+   (struct bpf_stack_build_id *)new_bucket->data,
+   ips, trace_nr, user);
trace_len = trace_nr * sizeof(struct bpf_stack_build_id);
if (hash_matches && bucket->nr == trace_nr &&
memcmp(bucket->data, new_bucket->data, trace_len) == 0) {
-- 
2.9.5

[RFC PATCH bpf-next 4/6] samples/bpf: move common-purpose perf_event functions to bpf_load.c

2018-04-06 Thread Yonghong Song

There is no functional change in this patch. The common-purpose
perf_event functions are moved from trace_output_user.c to bpf_load.c
so that these functions can be reused later.
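
One detail of the moved perf_event_read() is that a record may wrap past the end of the ring buffer and has to be stitched together before it is handed to the callback. A minimal sketch of that copy step, using a hypothetical helper `ring_copy` and no perf-specific types:

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy `len` bytes starting at offset `pos` of a ring buffer of `size`
 * bytes into `out`, joining the two pieces if the record wraps.
 * Returns the offset just past the record.
 * Assumes pos < size and len <= size.
 */
static size_t ring_copy(const char *ring, size_t size, size_t pos,
                        size_t len, char *out)
{
    size_t first = size - pos;  /* bytes available before the end */

    if (len <= first) {
        memcpy(out, ring + pos, len);
    } else {
        memcpy(out, ring + pos, first);
        memcpy(out + first, ring, len - first);
    }
    return (pos + len) % size;
}
```

The patch itself only copies in the wrapped case and otherwise reads the record in place; the sketch always copies to keep the example short.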

Signed-off-by: Yonghong Song 
---
 samples/bpf/bpf_load.c  | 104 
 samples/bpf/bpf_load.h  |   5 ++
 samples/bpf/trace_output_user.c | 113 
 3 files changed, 118 insertions(+), 104 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index bebe418..62aa5cc 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -713,3 +713,107 @@ struct ksym *ksym_search(long key)
return &syms[0];
 }
 
+static int page_size;
+static int page_cnt = 8;
+static volatile struct perf_event_mmap_page *header;
+
+static int perf_event_mmap(int fd)
+{
+   void *base;
+   int mmap_size;
+
+   page_size = getpagesize();
+   mmap_size = page_size * (page_cnt + 1);
+
+   base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+   if (base == MAP_FAILED) {
+   printf("mmap err\n");
+   return -1;
+   }
+
+   header = base;
+   return 0;
+}
+
+static int perf_event_poll(int fd)
+{
+   struct pollfd pfd = { .fd = fd, .events = POLLIN };
+
+   return poll(&pfd, 1, 1000);
+}
+
+struct perf_event_sample {
+   struct perf_event_header header;
+   __u32 size;
+   char data[];
+};
+
+static void perf_event_read(perf_event_print_fn fn)
+{
+   __u64 data_tail = header->data_tail;
+   __u64 data_head = header->data_head;
+   __u64 buffer_size = page_cnt * page_size;
+   void *base, *begin, *end;
+   char buf[256];
+
+   asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
+   if (data_head == data_tail)
+   return;
+
+   base = ((char *)header) + page_size;
+
+   begin = base + data_tail % buffer_size;
+   end = base + data_head % buffer_size;
+
+   while (begin != end) {
+   struct perf_event_sample *e;
+
+   e = begin;
+   if (begin + e->header.size > base + buffer_size) {
+   long len = base + buffer_size - begin;
+
+   assert(len < e->header.size);
+   memcpy(buf, begin, len);
+   memcpy(buf + len, base, e->header.size - len);
+   e = (void *) buf;
+   begin = base + e->header.size - len;
+   } else if (begin + e->header.size == base + buffer_size) {
+   begin = base;
+   } else {
+   begin += e->header.size;
+   }
+
+   if (e->header.type == PERF_RECORD_SAMPLE) {
+   fn(e->data, e->size);
+   } else if (e->header.type == PERF_RECORD_LOST) {
+   struct {
+   struct perf_event_header header;
+   __u64 id;
+   __u64 lost;
+   } *lost = (void *) e;
+   printf("lost %lld events\n", lost->lost);
+   } else {
+   printf("unknown event type=%d size=%d\n",
+  e->header.type, e->header.size);
+   }
+   }
+
+   __sync_synchronize(); /* smp_mb() */
+   header->data_tail = data_head;
+}
+
+int perf_event_poller(int fd, perf_event_exec_fn exec_fn,
+ perf_event_print_fn output_fn)
+{
+   if (perf_event_mmap(fd) < 0)
+   return 1;
+
+   exec_fn();
+
+   for (;;) {
+   perf_event_poll(fd);
+   perf_event_read(output_fn);
+   }
+
+   return 0;
+}
diff --git a/samples/bpf/bpf_load.h b/samples/bpf/bpf_load.h
index 453c200..d618750 100644
--- a/samples/bpf/bpf_load.h
+++ b/samples/bpf/bpf_load.h
@@ -62,4 +62,9 @@ struct ksym {
 int load_kallsyms(void);
 struct ksym *ksym_search(long key);
 int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags);
+
+typedef void (*perf_event_exec_fn)(void);
+typedef void (*perf_event_print_fn)(void *data, int size);
+int perf_event_poller(int fd, perf_event_exec_fn exec_fn,
+ perf_event_print_fn output_fn);
 #endif
diff --git a/samples/bpf/trace_output_user.c b/samples/bpf/trace_output_user.c
index ccca1e3..3d3991f 100644
--- a/samples/bpf/trace_output_user.c
+++ b/samples/bpf/trace_output_user.c
@@ -24,97 +24,6 @@
 
 static int pmu_fd;
 
-int page_size;
-int page_cnt = 8;
-volatile struct perf_event_mmap_page *header;
-
-typedef void (*print_fn)(void *data, int size);
-
-static int perf_event_mmap(int fd)
-{
-   void *base;
-   int mmap_size;
-
-   page_size = getpagesize();
-   mmap_size = page_size * (page_cnt + 1);
-
-   base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
-   if (base == MAP_FAILED) {
-

Re: [RFC PATCH net-next v5 2/4] net: Introduce generic bypass module

2018-04-06 Thread Samudrala, Sridhar


On 4/6/2018 5:57 AM, Jiri Pirko wrote:

Thu, Apr 05, 2018 at 11:08:21PM CEST, sridhar.samudr...@intel.com wrote:

This provides a generic interface for paravirtual drivers to listen
for netdev register/unregister/link change events from pci ethernet
devices with the same MAC and takeover their datapath. The notifier and
event handling code is based on the existing netvsc implementation. A
paravirtual driver can use this module by registering a set of ops and
each instance of the device when it is probed.

Signed-off-by: Sridhar Samudrala 
---
include/net/bypass.h |  80 ++
net/Kconfig  |  18 +++
net/core/Makefile|   1 +
net/core/bypass.c| 406 +++
4 files changed, 505 insertions(+)
create mode 100644 include/net/bypass.h
create mode 100644 net/core/bypass.c

diff --git a/include/net/bypass.h b/include/net/bypass.h
new file mode 100644
index 000000000000..e2dd122f951a
--- /dev/null
+++ b/include/net/bypass.h
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+#ifndef _NET_BYPASS_H
+#define _NET_BYPASS_H
+
+#include 
+
+struct bypass_ops {

Perhaps "net_bypass_" would be better prefix for this module structs
and functions. No strong opinion though.



+   int (*register_child)(struct net_device *bypass_netdev,
+ struct net_device *child_netdev);

We have master/slave upper/lower netdevices. This adds "child". Consider
using some existing names. Not sure if possible without loss of meaning.


OK. will change this to register_slave()





+   int (*join_child)(struct net_device *bypass_netdev,
+ struct net_device *child_netdev);
+   int (*unregister_child)(struct net_device *bypass_netdev,
+   struct net_device *child_netdev);
+   int (*release_child)(struct net_device *bypass_netdev,
+struct net_device *child_netdev);
+   int (*update_link)(struct net_device *bypass_netdev,
+  struct net_device *child_netdev);
+   rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
+};
+
+struct bypass_instance {
+   struct list_head list;
+   struct net_device __rcu *bypass_netdev;
+   struct bypass *bypass;
+};
+
+struct bypass {
+   struct list_head list;
+   const struct bypass_ops *ops;
+   const struct net_device_ops *netdev_ops;
+   struct list_head instance_list;
+   struct mutex lock;
+};
+
+#if IS_ENABLED(CONFIG_NET_BYPASS)
+
+struct bypass *bypass_register_driver(const struct bypass_ops *ops,
+ const struct net_device_ops *netdev_ops);
+void bypass_unregister_driver(struct bypass *bypass);
+
+int bypass_register_instance(struct bypass *bypass, struct net_device *dev);
+int bypass_unregister_instance(struct bypass *bypass, struct net_device *dev);
+
+int bypass_unregister_child(struct net_device *child_netdev);
+
+#else
+
+static inline
+struct bypass *bypass_register_driver(const struct bypass_ops *ops,
+ const struct net_device_ops *netdev_ops)
+{
+   return NULL;
+}
+
+static inline void bypass_unregister_driver(struct bypass *bypass)
+{
+}
+
+static inline int bypass_register_instance(struct bypass *bypass,
+  struct net_device *dev)
+{
+   return 0;
+}
+
+static inline int bypass_unregister_instance(struct bypass *bypass,
+struct net_device *dev)
+{
+   return 0;
+}
+
+static inline int bypass_unregister_child(struct net_device *child_netdev)
+{
+   return 0;
+}
+
+#endif
+
+#endif /* _NET_BYPASS_H */
diff --git a/net/Kconfig b/net/Kconfig
index 0428f12c25c2..994445f4a96a 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
  on MAY_USE_DEVLINK to ensure they do not cause link errors when
  devlink is a loadable module and the driver using it is built-in.

+config NET_BYPASS
+   tristate "Bypass interface"
+   ---help---
+ This provides a generic interface for paravirtual drivers to listen
+ for netdev register/unregister/link change events from pci ethernet
+ devices with the same MAC and takeover their datapath. This also
+ enables live migration of a VM with direct attached VF by failing
+ over to the paravirtual datapath when the VF is unplugged.
+
+config MAY_USE_BYPASS
+   tristate
+   default m if NET_BYPASS=m
+   default y if NET_BYPASS=y || NET_BYPASS=n
+   help
+ Drivers using the bypass infrastructure should have a dependency
+ on MAY_USE_BYPASS to ensure they do not cause link errors when
+ bypass is a loadable module and the driver using it is built-in.
+
endif   # if NET

# Used by archs to tell that they support BPF JIT compiler plus which

RE: [Intel-wired-lan] [next-queue PATCH v6 08/10] igb: Add MAC address support for ethtool nftuple filters

2018-04-06 Thread Brown, Aaron F

> From: Gomes, Vinicius
> Sent: Thursday, April 5, 2018 11:00 AM
> To: Brown, Aaron F ; intel-wired-
> l...@lists.osuosl.org
> Cc: netdev@vger.kernel.org; Sanchez-Palencia, Jesus  palen...@intel.com>
> Subject: RE: [Intel-wired-lan] [next-queue PATCH v6 08/10] igb: Add MAC
> address support for ethtool nftuple filters
> 
> Hi,
> 
> "Brown, Aaron F"  writes:
> 
> >> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> >> Behalf Of Vinicius Costa Gomes
> >> Sent: Thursday, March 29, 2018 2:08 PM
> >> To: intel-wired-...@lists.osuosl.org
> >> Cc: netdev@vger.kernel.org; Sanchez-Palencia, Jesus  >> palen...@intel.com>
> >> Subject: [Intel-wired-lan] [next-queue PATCH v6 08/10] igb: Add MAC
> >> address support for ethtool nftuple filters
> >>
> >> This adds the capability of configuring the queue steering of arriving
> >> packets based on their source and destination MAC addresses.
> >>
> >> In practical terms this adds support for the following use cases,
> >> characterized by these examples:
> >>
> >> $ ethtool -N eth0 flow-type ether dst aa:aa:aa:aa:aa:aa action 0
> >> (this will direct packets with destination address "aa:aa:aa:aa:aa:aa"
> >> to the RX queue 0)
> >
> > This is now working for me, testing with the dst MAC being the MAC on the
> > i210.  I set the filter and all the traffic to the destination MAC address
> > gets routed to the chosen RX queue.
> >
> >> $ ethtool -N eth0 flow-type ether src 44:44:44:44:44:44 action 3
> >> (this will direct packets with source address "44:44:44:44:44:44" to
> >> the RX queue 3)

Since this apparently does not work without refining the filter down to an 
ethertype I would like to see this example touched up to include the proto 
keyword.

> >
> > However, I am still not getting the raw ethernet source filter to
> > work.  Even back to back with no other system to "confuse" the stream,
> > I set the filter so the source MAC is the same as the MAC on the link
> > partner, send traffic and the traffic bounces around the queues as if
> > the filter is not set.
> 
> It seems there is at least a documentation issue in the i210 datasheet,
> steering (placing traffic into a specific queue) by source address
> doesn't work, filtering (accepting the traffic based on some rule) does
> work. I pointed this out in the cover letter of v5 as a known issue, but
> forgot to repeat it for v6, sorry about the confusion.

Yes, I recall that now.  I don't think I quite understood the implication at
the time, but after trying it out, that makes perfect sense given what I am
seeing.

> 
> But only the filtering part is useful, I think, it enables cases like
> this:
> 
> $ ethtool -N enp2s0 flow-type ether src 68:05:ca:4a:c9:73 proto 0x22f0 action 3

Ok, yes, this works.  If I tack on the proto keyword I can filter on whatever 
ethertype I choose and it seems to direct to the queue as expected.

> 
> I added that note in the hope that someone else would have an stronger
> opinion about what to do.

I don't have a strong opinion beyond my preference for an ideal world where
everything works :)  If the part simply cannot filter on the src address alone,
without the protocol, I would ideally prefer that an attempt in ethtool to set
a filter on the src address alone returns an error, WHILE still allowing the
filter to be set on an ethertype when the proto keyword is given.  If ethtool
does not allow that fine a grain of control, then I think the way it is now is
good: I'd rather have the annoyance of being able to set a filter that does
nothing than not be able to set the more specific filter at all.
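
The behaviour I have in mind, rejecting a source-only rule while accepting one that also names an ethertype, might look roughly like the check below. This is a standalone sketch and not igb driver code: `struct mac_filter_req` and `validate_mac_filter` are hypothetical names.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of a requested filter (not an igb structure). */
struct mac_filter_req {
    bool match_src;     /* rule matches on source MAC */
    bool match_dst;     /* rule matches on destination MAC */
    bool match_proto;   /* rule also matches on ethertype */
    uint16_t proto;     /* ethertype, valid when match_proto is set */
};

/* Return 0 if the rule can steer traffic, -EINVAL if it would be a no-op. */
static int validate_mac_filter(const struct mac_filter_req *req)
{
    /* Source-address steering only takes effect together with an ethertype. */
    if (req->match_src && !req->match_proto)
        return -EINVAL;
    return 0;
}
```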

> 
> Anyway, my plan for now will be to document this better and turn the
> case that only the source address is specified into an error.
> 
> >
> >>
> >> Signed-off-by: Vinicius Costa Gomes 
> >> ---
> >>  drivers/net/ethernet/intel/igb/igb_ethtool.c | 35
> >> 
> >>  1 file changed, 31 insertions(+), 4 deletions(-)
> 
> 
> Cheers,
> --
> Vinicius

Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-06 Thread Samudrala, Sridhar


On 4/6/2018 5:48 AM, Jiri Pirko wrote:

Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudr...@intel.com wrote:






+
+static void virtnet_bypass_set_rx_mode(struct net_device *dev)
+{
+   struct virtnet_bypass_info *vbi = netdev_priv(dev);
+   struct net_device *child_netdev;
+
+   rcu_read_lock();
+
+   child_netdev = rcu_dereference(vbi->active_netdev);
+   if (child_netdev) {
+   dev_uc_sync_multiple(child_netdev, dev);
+   dev_mc_sync_multiple(child_netdev, dev);
+   }
+
+   child_netdev = rcu_dereference(vbi->backup_netdev);
+   if (child_netdev) {
+   dev_uc_sync_multiple(child_netdev, dev);
+   dev_mc_sync_multiple(child_netdev, dev);
+   }
+
+   rcu_read_unlock();
+}

This should be moved to bypass module.


Sure. All these bypass ndo_ops can be moved to the bypass module, and any
paravirtual driver that wants to go with the 3-netdev model can reuse these
functions.






+
+static const struct net_device_ops virtnet_bypass_netdev_ops = {
+   .ndo_open   = virtnet_bypass_open,
+   .ndo_stop   = virtnet_bypass_close,
+   .ndo_start_xmit = virtnet_bypass_start_xmit,
+   .ndo_select_queue   = virtnet_bypass_select_queue,
+   .ndo_get_stats64= virtnet_bypass_get_stats,
+   .ndo_change_mtu = virtnet_bypass_change_mtu,
+   .ndo_set_rx_mode= virtnet_bypass_set_rx_mode,
+   .ndo_validate_addr  = eth_validate_addr,
+   .ndo_features_check = passthru_features_check,
+};
+
+static int
+virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev,
+ struct ethtool_link_ksettings *cmd)
+{
+   struct virtnet_bypass_info *vbi = netdev_priv(dev);
+   struct net_device *child_netdev;
+
+   child_netdev = rtnl_dereference(vbi->active_netdev);
+   if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
+   child_netdev = rtnl_dereference(vbi->backup_netdev);
+   if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) {
+   cmd->base.duplex = DUPLEX_UNKNOWN;
+   cmd->base.port = PORT_OTHER;
+   cmd->base.speed = SPEED_UNKNOWN;
+
+   return 0;
+   }
+   }
+
+   return __ethtool_get_link_ksettings(child_netdev, cmd);
+}
+
+#define BYPASS_DRV_NAME "virtnet_bypass"
+#define BYPASS_DRV_VERSION "0.1"
+
+static void virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev,
+  struct ethtool_drvinfo *drvinfo)
+{
+   strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver));
+   strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version));
+}
+
+static const struct ethtool_ops virtnet_bypass_ethtool_ops = {
+   .get_drvinfo= virtnet_bypass_ethtool_get_drvinfo,
+   .get_link   = ethtool_op_get_link,
+   .get_link_ksettings = virtnet_bypass_ethtool_get_link_ksettings,
+};
+
+static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
+struct net_device *child_netdev)
+{
+   struct virtnet_bypass_info *vbi;
+   bool backup;
+
+   vbi = netdev_priv(bypass_netdev);
+   backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
+   if (backup ? rtnl_dereference(vbi->backup_netdev) :
+   rtnl_dereference(vbi->active_netdev)) {
+   netdev_info(bypass_netdev,
+   "%s attempting to join bypass dev when %s already present\n",
+   child_netdev->name, backup ? "backup" : "active");

Bypass module should check if there is already some other netdev
enslaved and refuse right there.


This will work for virtio-net with the 3-netdev model, but for netvsc this
check has to be done by the driver, as its bypass_netdev is the same as the
backup_netdev.
I will add a flag, set when registering with the bypass module, to indicate
whether the driver uses a 2-netdev or a 3-netdev model. Based on that flag,
this check can be done in the bypass module for the 3-netdev scenario.
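
A rough sketch of how such a flag-based check could look is below. All names (`struct bypass_slots`, `three_netdev`, `bypass_may_join`) are hypothetical and exist only for illustration; nothing here is part of the posted patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-instance state kept by the bypass module. */
struct bypass_slots {
    bool three_netdev;      /* flag supplied by the driver at registration */
    const void *active;     /* currently joined VF netdev, or NULL */
    const void *backup;     /* currently joined backup netdev, or NULL */
};

/*
 * Decide whether a new child may join. In the 2-netdev case the module
 * does not check and leaves the decision to the driver.
 */
static int bypass_may_join(const struct bypass_slots *s, bool is_backup)
{
    const void *slot = is_backup ? s->backup : s->active;

    if (s->three_netdev && slot != NULL)
        return -EEXIST;
    return 0;
}
```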





The active/backup terminology is quite confusing. From the bonding world
that means active is the one which is currently used for tx of the
packets. And it depends on link and other things what netdev is declared
active. However here, it is different. Backup is always the virtio_net
instance even when it is active. Odd. Please change the terminology.
For "active" I suggest to use name "stolen".


I am not too happy with the 'stolen' name. Will see if I can come up with a
better name.




*** Also, the 2 slave netdev pointers should be stored in the bypass
module instance, not in the drivers.


Will move virtnet_bypass_info struct to bypass.h






+   return -EEXIST;
+   }
+
+   dev_hold(child_netdev);
+
+   if (backup) {
+

[Patch net] net_sched: fix a missing idr_remove() in u32_delete_key()

2018-04-06 Thread Cong Wang

When we delete a u32 key via u32_delete_key(), we forget to
call idr_remove() to remove its handle from IDR.

Fixes: e7614370d6f0 ("net_sched: use idr to allocate u32 filter handles")
Reported-by: Marcin Kabiesz 
Cc: Linus Torvalds 
Cc: Jamal Hadi Salim 
Signed-off-by: Cong Wang 
---
 net/sched/cls_u32.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index ed8b6a24b9e9..bac47b5d18fd 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -489,6 +489,7 @@ static int u32_delete_key(struct tcf_proto *tp, struct tc_u_knode *key)
RCU_INIT_POINTER(*kp, key->next);
 
tcf_unbind_filter(tp, &key->res);
+   idr_remove(&ht->handle_idr, key->handle);
tcf_exts_get_net(&key->exts);
call_rcu(&key->rcu, u32_delete_key_freepf_rcu);
return 0;
-- 
2.13.0

Fwd: Problem with the kernel 4.15 - cutting the band (tc)

2018-04-06 Thread Linus Torvalds

Forwarding a report about what looks like a regression between 4.14 and 4.15.

New ENOSPC issue? I don't even know where to start guessing where to look.

Help me, Davem-Wan Kenobi, you are my only hope.

(But adding netdev just in case somebody else goes "That's obviously Xyz")

  Linus

-- Forwarded message --
From: Marcin Kabiesz 
Date: Thu, Apr 5, 2018 at 10:38 AM
Subject: Problem with the kernel 4.15 - cutting the band (tc)


Hello,
I have a problem with bandwidth limiting (tc) on kernel 4.15. On versions
before 4.15, e.g. 4.14, this problem does not occur.

uname -a: Linux router 4.14.15 #1 SMP x86_64 Intel Xeon E3-1230 v6
command to reproduce:

tc qdisc add dev ifb0 root handle 1: htb r2q 2
tc class add dev ifb0 parent 1: classid 1:1 htb rate 10gbit ceil 10gbit quantum 16000
tc filter add dev ifb0 parent 1: prio 5 handle 1: protocol all u32 divisor 256
tc filter add dev ifb0 protocol all parent 1: prio 5 u32 ht 800:: match ip dst 0.0.0.0/0 hashkey mask 0x00ff at 16 link 1:
tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32 ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32 ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32

This is OK: no errors or warnings, and nothing in the dmesg log.

uname -a: Linux router 4.15.8 #1 SMP x86_64 Intel Xeon E3-1230 v6
(4.15.14 shows the same effect)
command to reproduce:

tc qdisc add dev ifb0 root handle 1: htb r2q 2
tc class add dev ifb0 parent 1: classid 1:1 htb rate 10gbit ceil 10gbit quantum 16000
tc filter add dev ifb0 parent 1: prio 5 handle 1: protocol all u32 divisor 256
tc filter add dev ifb0 protocol all parent 1: prio 5 u32 ht 800:: match ip dst 0.0.0.0/0 hashkey mask 0x00ff at 16 link 1:
tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32 ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32 ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
RTNETLINK answers: No space left on device
We have an error talking to the kernel

This is not OK: the error above is returned, and there is nothing in the dmesg log.

Best Regards
Please forgive my English
Marcin Kabiesz

Re: Problem with the kernel 4.15 - cutting the band (tc)

2018-04-06 Thread Cong Wang

On Fri, Apr 6, 2018 at 2:56 PM, Linus Torvalds
 wrote:
> Forwarding a report about what looks like a regression between 4.14 and 4.15.
>
> New ENOSPC issue? I don't even knew where to start guessing where to look.
>
> Help me, Davem-Wan Kenobi, you are my only hope.
>
> (But adding netdev just in case somebody else goes "That's obviously Xyz")
>
>   Linus
>
> -- Forwarded message --
> From: Marcin Kabiesz 
> Date: Thu, Apr 5, 2018 at 10:38 AM
> Subject: Problem with the kernel 4.15 - cutting the band (tc)
>
>
> Hello,
> I have a problem with bandwidth cutting on kernel 4.15. On the version
> up to 4.15, i.e. 4.14, this problem does not occur.
>
> uname -a: Linux router 4.14.15 #1 SMP x86_64 Intel Xeon E3-1230 v6
> command to reproduce:
>
> tc qdisc add dev ifb0 root handle 1: htb r2q 2
> tc class add dev ifb0 parent 1: classid 1:1 htb rate 10gbit ceil
> 10gbit quantum 16000
> tc filter add dev ifb0 parent 1: prio 5 handle 1: protocol all u32 divisor 256
> tc filter add dev ifb0 protocol all parent 1: prio 5 u32 ht 800::
> match ip dst 0.0.0.0/0 hashkey mask 0x00ff at 16 link 1:
> tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
> ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
> tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
> tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
> ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
> tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
>
> This ok, no error/warnings and dmesg log.
>
> uname -a: Linux router 4.15.8 #1 SMP x86_64 Intel Xeon E3-1230 v6 (or
> 4.15.14 this same effect)
> command to reproduce:
>
> tc qdisc add dev ifb0 root handle 1: htb r2q 2
> tc class add dev ifb0 parent 1: classid 1:1 htb rate 10gbit ceil
> 10gbit quantum 16000
> tc filter add dev ifb0 parent 1: prio 5 handle 1: protocol all u32 divisor 256
> tc filter add dev ifb0 protocol all parent 1: prio 5 u32 ht 800::
> match ip dst 0.0.0.0/0 hashkey mask 0x00ff at 16 link 1:
> tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
> ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
> tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
> tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
> ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
> RTNETLINK answers: No space left on device
> We have an error talking to the kernel
>
> This not ok, on error/warnings and no dmesg log.

We forgot to call idr_remove() when deleting u32 key...

I am cooking a fix now.

Thanks!
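
To illustrate why a forgotten removal shows up as "No space left on device", here is a toy handle allocator. It is a userspace sketch with hypothetical names (`handle_pool`, `pool_alloc`, `pool_remove`), not the kernel IDR API: asking for a specific handle that is still marked as in use fails with -ENOSPC.

```c
#include <errno.h>
#include <stdbool.h>

#define POOL_SIZE 16    /* arbitrary size for the illustration */

struct handle_pool {
    bool used[POOL_SIZE];
};

/* Reserve a specific handle; fail with -ENOSPC if it is still taken. */
static int pool_alloc(struct handle_pool *p, unsigned int handle)
{
    if (handle >= POOL_SIZE || p->used[handle])
        return -ENOSPC;
    p->used[handle] = true;
    return 0;
}

/* Release a handle so that it can be allocated again. */
static void pool_remove(struct handle_pool *p, unsigned int handle)
{
    if (handle < POOL_SIZE)
        p->used[handle] = false;
}
```

In the report, the second "tc filter add" of the same handle corresponds to calling pool_alloc() again without an intervening pool_remove().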

RE: [Intel-wired-lan] [next-queue PATCH v6 10/10] igb: Add support for adding offloaded clsflower filters

2018-04-06 Thread Brown, Aaron F

> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Vinicius Costa Gomes
> Sent: Thursday, March 29, 2018 2:08 PM
> To: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org; Sanchez-Palencia, Jesus  palen...@intel.com>
> Subject: [Intel-wired-lan] [next-queue PATCH v6 10/10] igb: Add support for
> adding offloaded clsflower filters
> 
> This allows filters added by tc-flower and specifying MAC addresses,
> Ethernet types, and the VLAN priority field, to be offloaded to the
> controller.

Can I get a brief explanation of how to enable this?  I'm currently happy with
this patch series from a regression perspective, but I am personally a bit, umm,
challenged with tc in general and would like to put it through its paces a bit.
If it can be done in a one- or two-liner, I think it would be a good addition to
the patch description.

> 
> This reuses most of the infrastructure used by ethtool, but clsflower
> filters are kept in a separated list, so they are invisible to
> ethtool.
> 
> Signed-off-by: Vinicius Costa Gomes 
> ---
>  drivers/net/ethernet/intel/igb/igb.h  |   2 +
>  drivers/net/ethernet/intel/igb/igb_main.c | 188
> +-
>  2 files changed, 188 insertions(+), 2 deletions(-)

Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-06 Thread Siwei Liu

On Wed, Apr 4, 2018 at 10:37 AM, David Miller  wrote:
> From: David Ahern 
> Date: Wed, 4 Apr 2018 11:21:54 -0600
>
>> It is a netdev so there is no reason to have a separate ip command to
>> inspect it. 'ip link' is the right place.
>
> I agree on this.

I'm completely fine with having an API for inspection purposes. The thing
is that we'd perhaps need to go for the namespace approach: I think
everyone seems to agree not to fiddle with the ":" prefix, but rather to
have a new class of network subsystem under /sys/class, so a separate
device namespace, e.g. /sys/class/net-kernel, is needed for those
auto-managed lower netdevs.

And I assume everyone here understands the use case for live migration
(in the context of providing cloud service) is very different, and we
have to hide the netdevs. If not, I'm more than happy to clarify.

With that in mind, if we have a new net-kernel class as the namespace, we
can name the kernel device elaborately, and that name is not necessarily
equal to the device name exposed to userspace. For example, we can use the
driver name as the prefix as opposed to "eth" or ":eth". And we don't
need to have auto-managed netdevs locked into the ":" prefix at all (I
intentionally left it out of this RFC patch to ask for comments on
the namespace solution, which is much cleaner). That said, a device named
by userspace through udev may be called something like ens3 or
switch1-port2, but in the kernel-net namespace it may look like
ixgbevf0 or mlxsw1p2.

So if we all agree that introducing a new namespace is the right thing to
do, `ip link' will no longer serve the purpose of displaying the
information for kernel-net devnames, for the sake of avoiding ambiguity
and namespace collision: it's entirely possible that the ip link name could
collide with a kernel-net devname, and it becomes unclear which name of a
netdev object the command is expected to operate on. That's why I
thought showing the kernel-only netdevs using a separate subcommand
makes more sense.

Thoughts and comments? Please let me know.

Thanks,
-Siwei

>
> What I really don't understand still is the use case... really.
>
> So there are control netdevs, what exactly is the problem with that?
>
> Are we not exporting enough information for applications to handle
> these devices sanely?  If so, then's let add that information.
>
> We can set netdev->type to ETH_P_LINUXCONTROL or something like that.
>
> Another alternative is to add an interface flag like IFF_CONTROL or
> similar, and that probably is much nicer.
>
> Hiding the devices means that we acknowledge that applications are
> currently broken with control netdevs... and we want them to stay
> broken!
>
> That doesn't sound like a good plan to me.
>
> So let's fix handling of control netdevs instead of hiding them.
>
> Thanks.

[PATCH net 3/5] ibmvnic: Fix reset scheduler error handling

2018-04-06 Thread Thomas Falcon

In some cases, if the driver is waiting for a reset following
a device parameter change, failure to schedule a reset can result
in a hang since a completion signal is never sent.

If the device configuration is being altered by a tool such
as ethtool or ifconfig, it could cause the console to hang
if the reset request does not get scheduled. Add some additional
error handling code to exit the wait_for_completion if there is
one in progress.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 39 --
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 153a868..bbcd07a 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1875,23 +1875,25 @@ static void __ibmvnic_reset(struct work_struct *work)
mutex_unlock(>reset_lock);
 }
 
-static void ibmvnic_reset(struct ibmvnic_adapter *adapter,
- enum ibmvnic_reset_reason reason)
+static int ibmvnic_reset(struct ibmvnic_adapter *adapter,
+enum ibmvnic_reset_reason reason)
 {
struct ibmvnic_rwi *rwi, *tmp;
struct net_device *netdev = adapter->netdev;
struct list_head *entry;
+   int ret;
 
if (adapter->state == VNIC_REMOVING ||
adapter->state == VNIC_REMOVED) {
+   ret = EBUSY;
netdev_dbg(netdev, "Adapter removing, skipping reset\n");
-   return;
+   goto err;
}
 
if (adapter->state == VNIC_PROBING) {
netdev_warn(netdev, "Adapter reset during probe\n");
-   adapter->init_done_rc = EAGAIN;
-   return;
+   ret = adapter->init_done_rc = EAGAIN;
+   goto err;
}
 
mutex_lock(>rwi_lock);
@@ -1901,7 +1903,8 @@ static void ibmvnic_reset(struct ibmvnic_adapter *adapter,
if (tmp->reset_reason == reason) {
netdev_dbg(netdev, "Skipping matching reset\n");
mutex_unlock(>rwi_lock);
-   return;
+   ret = EBUSY;
+   goto err;
}
}
 
@@ -1909,7 +1912,8 @@ static void ibmvnic_reset(struct ibmvnic_adapter *adapter,
if (!rwi) {
mutex_unlock(>rwi_lock);
ibmvnic_close(netdev);
-   return;
+   ret = ENOMEM;
+   goto err;
}
 
rwi->reset_reason = reason;
@@ -1918,6 +1922,12 @@ static void ibmvnic_reset(struct ibmvnic_adapter 
*adapter,
 
netdev_dbg(adapter->netdev, "Scheduling reset (reason %d)\n", reason);
schedule_work(>ibmvnic_reset);
+
+   return 0;
+err:
+   if (adapter->wait_for_reset)
+   adapter->wait_for_reset = false;
+   return -ret;
 }
 
 static void ibmvnic_tx_timeout(struct net_device *dev)
@@ -2052,6 +2062,8 @@ static void ibmvnic_netpoll_controller(struct net_device 
*dev)
 
 static int wait_for_reset(struct ibmvnic_adapter *adapter)
 {
+   int rc, ret;
+
adapter->fallback.mtu = adapter->req_mtu;
adapter->fallback.rx_queues = adapter->req_rx_queues;
adapter->fallback.tx_queues = adapter->req_tx_queues;
@@ -2059,11 +2071,15 @@ static int wait_for_reset(struct ibmvnic_adapter 
*adapter)
adapter->fallback.tx_entries = adapter->req_tx_entries_per_subcrq;
 
init_completion(>reset_done);
-   ibmvnic_reset(adapter, VNIC_RESET_CHANGE_PARAM);
adapter->wait_for_reset = true;
+   rc = ibmvnic_reset(adapter, VNIC_RESET_CHANGE_PARAM);
+   if (rc)
+   return rc;
wait_for_completion(>reset_done);
 
+   ret = 0;
if (adapter->reset_done_rc) {
+   ret = -EIO;
adapter->desired.mtu = adapter->fallback.mtu;
adapter->desired.rx_queues = adapter->fallback.rx_queues;
adapter->desired.tx_queues = adapter->fallback.tx_queues;
@@ -2071,12 +2087,15 @@ static int wait_for_reset(struct ibmvnic_adapter 
*adapter)
adapter->desired.tx_entries = adapter->fallback.tx_entries;
 
init_completion(>reset_done);
-   ibmvnic_reset(adapter, VNIC_RESET_CHANGE_PARAM);
+   adapter->wait_for_reset = true;
+   rc = ibmvnic_reset(adapter, VNIC_RESET_CHANGE_PARAM);
+   if (rc)
+   return ret;
wait_for_completion(>reset_done);
}
adapter->wait_for_reset = false;
 
-   return adapter->reset_done_rc;
+   return ret;
 }
 
 static int ibmvnic_change_mtu(struct net_device *netdev, int new_mtu)
-- 
1.8.3.1
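
The hang described in the commit message above comes down to waiting on
a completion that nobody will ever signal. A minimal user-space sketch
of the corrected pattern follows. It is not the driver code: the struct,
the model_* names and the inline "worker" are assumptions made for
illustration, and the fallback retry in the real wait_for_reset() is
left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the adapter fields the patch touches. */
struct model_adapter {
	bool removing;		/* models VNIC_REMOVING / VNIC_REMOVED */
	bool wait_for_reset;	/* set by the waiter before scheduling */
	bool reset_done;	/* models the reset_done completion */
	int reset_done_rc;	/* result reported by the reset worker */
};

/*
 * Models ibmvnic_reset(): returns 0 if a reset was queued, -errno if not.
 * On failure it clears wait_for_reset, as the err: label in the patch does.
 */
static int model_schedule_reset(struct model_adapter *a)
{
	if (a->removing) {
		a->wait_for_reset = false;
		return -EBUSY;
	}
	/* Stand-in for schedule_work(): the "worker" finishes inline. */
	a->reset_done = true;
	return 0;
}

/* Models wait_for_reset(): only block when a reset was really queued. */
static int model_wait_for_reset(struct model_adapter *a)
{
	int rc;

	a->reset_done = false;
	a->wait_for_reset = true;
	rc = model_schedule_reset(a);
	if (rc)
		return rc;	/* nothing queued, so do not wait */

	/* Stand-in for wait_for_completion(); a real waiter would sleep. */
	if (!a->reset_done)
		return -EIO;

	a->wait_for_reset = false;
	return a->reset_done_rc ? -EIO : 0;
}
```

The point of the sketch is the ordering: the waiter marks itself as
waiting, checks the scheduler's return value, and only then blocks.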

[PATCH net 5/5] ibmvnic: Do not reset CRQ for Mobility driver resets

2018-04-06 Thread Thomas Falcon

From: Nathan Fontenot 

When resetting the ibmvnic driver after a partition migration occurs
there is no requirement to do a reset of the main CRQ. The current
driver code does the required re-enable of the main CRQ, then does
a reset of the main CRQ later.

What we should be doing for a driver reset after a migration is to
re-enable the main CRQ, release all the sub-CRQs, and then allocate
new sub-CRQs after capability negotiation.

This patch updates the handling of mobility resets to do the proper
work and not reset the main CRQ. To do this the initialization/reset
of the main CRQ had to be moved out of the ibmvnic_init routine
and in to the ibmvnic_probe and do_reset routines.

Signed-off-by: Nathan Fontenot 
Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 55 ++
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 151542e..aad5658 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -118,6 +118,7 @@ static union sub_crq *ibmvnic_next_scrq(struct 
ibmvnic_adapter *,
 static int ibmvnic_init(struct ibmvnic_adapter *);
 static void release_crq_queue(struct ibmvnic_adapter *);
 static int __ibmvnic_set_mac(struct net_device *netdev, struct sockaddr *p);
+static int init_crq_queue(struct ibmvnic_adapter *adapter);
 
 struct ibmvnic_stat {
char name[ETH_GSTRING_LEN];
@@ -1224,7 +1225,6 @@ static int __ibmvnic_close(struct net_device *netdev)
rc = set_link_state(adapter, IBMVNIC_LOGICAL_LNK_DN);
if (rc)
return rc;
-   ibmvnic_cleanup(netdev);
adapter->state = VNIC_CLOSED;
return 0;
 }
@@ -1244,6 +1244,7 @@ static int ibmvnic_close(struct net_device *netdev)
 
mutex_lock(>reset_lock);
rc = __ibmvnic_close(netdev);
+   ibmvnic_cleanup(netdev);
mutex_unlock(>reset_lock);
 
return rc;
@@ -1726,14 +1727,10 @@ static int do_reset(struct ibmvnic_adapter *adapter,
old_num_rx_queues = adapter->req_rx_queues;
old_num_tx_queues = adapter->req_tx_queues;
 
-   if (rwi->reset_reason == VNIC_RESET_MOBILITY) {
-   rc = ibmvnic_reenable_crq_queue(adapter);
-   if (rc)
-   return 0;
-   ibmvnic_cleanup(netdev);
-   } else if (rwi->reset_reason == VNIC_RESET_FAILOVER) {
-   ibmvnic_cleanup(netdev);
-   } else {
+   ibmvnic_cleanup(netdev);
+
+   if (adapter->reset_reason != VNIC_RESET_MOBILITY &&
+   adapter->reset_reason != VNIC_RESET_FAILOVER) {
rc = __ibmvnic_close(netdev);
if (rc)
return rc;
@@ -1752,6 +1749,23 @@ static int do_reset(struct ibmvnic_adapter *adapter,
 */
adapter->state = VNIC_PROBED;
 
+   if (adapter->wait_for_reset) {
+   rc = init_crq_queue(adapter);
+   } else if (adapter->reset_reason == VNIC_RESET_MOBILITY) {
+   rc = ibmvnic_reenable_crq_queue(adapter);
+   release_sub_crqs(adapter, 1);
+   } else {
+   rc = ibmvnic_reset_crq(adapter);
+   if (!rc)
+   rc = vio_enable_interrupts(adapter->vdev);
+   }
+
+   if (rc) {
+   netdev_err(adapter->netdev,
+  "Couldn't initialize crq. rc=%d\n", rc);
+   return rc;
+   }
+
rc = ibmvnic_init(adapter);
if (rc)
return IBMVNIC_INIT_FAILED;
@@ -4500,19 +4514,6 @@ static int ibmvnic_init(struct ibmvnic_adapter *adapter)
u64 old_num_rx_queues, old_num_tx_queues;
int rc;
 
-   if (adapter->resetting && !adapter->wait_for_reset) {
-   rc = ibmvnic_reset_crq(adapter);
-   if (!rc)
-   rc = vio_enable_interrupts(adapter->vdev);
-   } else {
-   rc = init_crq_queue(adapter);
-   }
-
-   if (rc) {
-   dev_err(dev, "Couldn't initialize crq. rc=%d\n", rc);
-   return rc;
-   }
-
adapter->from_passive_init = false;
 
old_num_rx_queues = adapter->req_rx_queues;
@@ -4537,7 +4538,8 @@ static int ibmvnic_init(struct ibmvnic_adapter *adapter)
return -1;
}
 
-   if (adapter->resetting && !adapter->wait_for_reset) {
+   if (adapter->resetting && !adapter->wait_for_reset &&
+   adapter->reset_reason != VNIC_RESET_MOBILITY) {
if (adapter->req_rx_queues != old_num_rx_queues ||
adapter->req_tx_queues != old_num_tx_queues) {
release_sub_crqs(adapter, 0);
@@ -4625,6 +4627,13 @@

[PATCH net 4/5] ibmvnic: Fix failover case for non-redundant configuration

2018-04-06 Thread Thomas Falcon

There is a failover case for a non-redundant pseries VNIC
configuration that was not being handled properly. The current
implementation assumes that the driver will always have a redundant
device to communicate with following a failover notification. There
are cases, however, when a non-redundant configuration can receive
a failover request. If that happens, the driver should wait until
it receives a signal that the device is ready for operation.

The driver is agnostic of its backing hardware configuration,
so this fix necessarily affects all device failover management.
The driver needs to wait until it receives a signal that the device
is ready for resetting. A flag is introduced to track this intermediary
state where the driver is waiting for an active device.
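
As a rough illustration of the intermediary state described above, the
sketch below models open/close requests that arrive while a failover is
pending: they only record the requested state, and the device is brought
to that state once the partner signals that it is ready. All names here
(model_vnic, model_failover_notice and so on) are hypothetical, and the
way the pending flag gets set is an assumption, since that part of the
patch is not shown.

```c
#include <stdbool.h>

enum model_state { MODEL_CLOSED, MODEL_OPEN };

struct model_vnic {
	enum model_state state;	/* state requested through open/close */
	bool failover_pending;	/* waiting for the device to become ready */
	bool hw_up;		/* whether the device is actually running */
};

/* Assumed trigger: a failover notice arrives and the device goes away. */
static void model_failover_notice(struct model_vnic *v)
{
	v->failover_pending = true;
	v->hw_up = false;
}

/* Models ibmvnic_open(): record intent only while failover is pending. */
static int model_open(struct model_vnic *v)
{
	v->state = MODEL_OPEN;
	if (!v->failover_pending)
		v->hw_up = true;	/* normal path: bring the device up */
	return 0;
}

/* Models ibmvnic_close() with the same early return. */
static int model_close(struct model_vnic *v)
{
	v->state = MODEL_CLOSED;
	if (!v->failover_pending)
		v->hw_up = false;
	return 0;
}

/* Models IBMVNIC_CRQ_INIT plus the reset it schedules afterwards. */
static void model_partner_init(struct model_vnic *v)
{
	v->failover_pending = false;
	v->hw_up = (v->state == MODEL_OPEN);
}
```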

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 37 +
 drivers/net/ethernet/ibm/ibmvnic.h |  1 +
 2 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index bbcd07a..151542e 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -325,10 +325,11 @@ static void replenish_rx_pool(struct ibmvnic_adapter 
*adapter,
adapter->replenish_add_buff_failure++;
atomic_add(buffers_added, >available);
 
-   if (lpar_rc == H_CLOSED) {
+   if (lpar_rc == H_CLOSED || adapter->failover_pending) {
/* Disable buffer pool replenishment and report carrier off if
-* queue is closed. Firmware guarantees that a signal will
-* be sent to the driver, triggering a reset.
+* queue is closed or pending failover.
+* Firmware guarantees that a signal will be sent to the
+* driver, triggering a reset.
 */
deactivate_rx_pools(adapter);
netif_carrier_off(adapter->netdev);
@@ -1068,6 +1069,14 @@ static int ibmvnic_open(struct net_device *netdev)
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
int rc;
 
+   /* If device failover is pending, just set device state and return.
+* Device operation will be handled by reset routine.
+*/
+   if (adapter->failover_pending) {
+   adapter->state = VNIC_OPEN;
+   return 0;
+   }
+
mutex_lock(>reset_lock);
 
if (adapter->state != VNIC_CLOSED) {
@@ -1225,6 +1234,14 @@ static int ibmvnic_close(struct net_device *netdev)
struct ibmvnic_adapter *adapter = netdev_priv(netdev);
int rc;
 
+   /* If device failover is pending, just set device state and return.
+* Device operation will be handled by reset routine.
+*/
+   if (adapter->failover_pending) {
+   adapter->state = VNIC_CLOSED;
+   return 0;
+   }
+
mutex_lock(>reset_lock);
rc = __ibmvnic_close(netdev);
mutex_unlock(>reset_lock);
@@ -1559,8 +1576,9 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
dev_kfree_skb_any(skb);
tx_buff->skb = NULL;
 
-   if (lpar_rc == H_CLOSED) {
-   /* Disable TX and report carrier off if queue is closed.
+   if (lpar_rc == H_CLOSED || adapter->failover_pending) {
+   /* Disable TX and report carrier off if queue is closed
+* or pending failover.
 * Firmware guarantees that a signal will be sent to the
 * driver, triggering a reset or some other action.
 */
@@ -1884,9 +1902,10 @@ static int ibmvnic_reset(struct ibmvnic_adapter *adapter,
int ret;
 
if (adapter->state == VNIC_REMOVING ||
-   adapter->state == VNIC_REMOVED) {
+   adapter->state == VNIC_REMOVED ||
+   adapter->failover_pending) {
ret = EBUSY;
-   netdev_dbg(netdev, "Adapter removing, skipping reset\n");
+   netdev_dbg(netdev, "Adapter removing or pending failover, 
skipping reset\n");
goto err;
}
 
@@ -4162,7 +4181,9 @@ static void ibmvnic_handle_crq(union ibmvnic_crq *crq,
case IBMVNIC_CRQ_INIT:
dev_info(dev, "Partner initialized\n");
adapter->from_passive_init = true;
+   adapter->failover_pending = false;
complete(>init_done);
+   ibmvnic_reset(adapter, VNIC_RESET_FAILOVER);
break;
case IBMVNIC_CRQ_INIT_COMPLETE:
dev_info(dev, "Partner initialization complete\n");
@@ -4179,7 +4200,7 @@ static void ibmvnic_handle_crq(union ibmvnic_crq *crq,
ibmvnic_reset(adapter, VNIC_RESET_MOBILITY);
} else if (gen_crq->cmd

[PATCH net 2/5] ibmvnic: Zero used TX descriptor counter on reset

2018-04-06 Thread Thomas Falcon

The counter that tracks used TX descriptors pending completion
needs to be zeroed as part of a device reset. This change fixes
a bug causing transmit queues to be stopped unnecessarily and in
some cases a transmit queue stall and timeout reset. If the counter
is not reset, the remaining descriptors will not be "removed",
effectively reducing queue capacity. If the queue is over half full,
it will cause the queue to stall if stopped.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 58e0143..153a868 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2361,6 +2361,7 @@ static int reset_one_sub_crq_queue(struct ibmvnic_adapter 
*adapter,
}
 
memset(scrq->msgs, 0, 4 * PAGE_SIZE);
+   atomic_set(&scrq->used, 0);
scrq->cur = 0;
 
rc = h_reg_sub_crq(adapter->vdev->unit_address, scrq->msg_token,
-- 
1.8.3.1
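
To see why a stale counter matters, here is a small user-space model of
a transmit queue. It assumes a stop-when-full, wake-at-half-full policy,
which is what the "over half full" wording in the commit message
suggests; the ring size and the txq_* names are made up for the example.

```c
#include <stdbool.h>

#define MODEL_TX_ENTRIES 8	/* hypothetical ring size */

struct model_txq {
	int used;	/* descriptors handed out and not yet completed */
	bool stopped;
};

/* Queue one descriptor; stop the queue once the ring is full. */
static bool txq_xmit(struct model_txq *q)
{
	if (q->stopped)
		return false;
	if (++q->used >= MODEL_TX_ENTRIES)
		q->stopped = true;
	return true;
}

/* Complete n descriptors; wake the queue only at or below half full. */
static void txq_complete(struct model_txq *q, int n)
{
	q->used -= n;
	if (q->stopped && q->used <= MODEL_TX_ENTRIES / 2)
		q->stopped = false;
}

/*
 * Device reset: in-flight descriptors are dropped and never complete.
 * zero_used selects the patched behaviour (counter cleared).
 */
static void txq_reset(struct model_txq *q, bool zero_used)
{
	if (zero_used)
		q->used = 0;
}

/* Count how many descriptors can be queued before the queue stops. */
static int txq_capacity(struct model_txq *q)
{
	int n = 0;

	while (txq_xmit(q))
		n++;
	return n;
}
```

With six stale descriptors left over from before the reset, the model
queue accepts only two more, and completing those two still leaves it
above the half-full mark, so it never wakes up again.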

[PATCH net 0/5] ibmvnic: Fix driver reset and DMA bugs

2018-04-06 Thread Thomas Falcon

This patch series introduces some fixes to the driver reset
routines and a patch that fixes mistakes caught by the kernel
DMA debugger.

The reset fixes include a fix to reset TX queue counters properly
after a reset as well as updates to driver reset error-handling code.
It also provides updates to the reset handling routine for redundant
backing VF failover and partition migration cases.

Nathan Fontenot (1):
  ibmvnic: Do not reset CRQ for Mobility driver resets

Thomas Falcon (4):
  ibmvnic: Fix DMA mapping mistakes
  ibmvnic: Zero used TX descriptor counter on reset
  ibmvnic: Fix reset scheduler error handling
  ibmvnic: Fix failover case for non-redundant configuration

 drivers/net/ethernet/ibm/ibmvnic.c | 146 -
 drivers/net/ethernet/ibm/ibmvnic.h |   1 +
 2 files changed, 98 insertions(+), 49 deletions(-)

-- 
1.8.3.1

[PATCH net 1/5] ibmvnic: Fix DMA mapping mistakes

2018-04-06 Thread Thomas Falcon

Fix some mistakes caught by the DMA debugger. The first change
fixes an unnecessary unmap that should have been removed in an
earlier update. The next hunk fixes another bad unmap by zeroing
the bit checked to determine that an unmap is needed. The final 
change fixes some buffers that are unmapped with the wrong
direction specified.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index b492af6..58e0143 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -320,9 +320,6 @@ static void replenish_rx_pool(struct ibmvnic_adapter 
*adapter,
dev_info(dev, "replenish pools failure\n");
pool->free_map[pool->next_free] = index;
pool->rx_buff[index].skb = NULL;
-   if (!dma_mapping_error(dev, dma_addr))
-   dma_unmap_single(dev, dma_addr, pool->buff_size,
-DMA_FROM_DEVICE);
 
dev_kfree_skb_any(skb);
adapter->replenish_add_buff_failure++;
@@ -2574,7 +2571,7 @@ static int ibmvnic_complete_tx(struct ibmvnic_adapter 
*adapter,
union sub_crq *next;
int index;
int i, j;
-   u8 first;
+   u8 *first;
 
 restart_loop:
while (pending_scrq(adapter, scrq)) {
@@ -2605,11 +2602,12 @@ static int ibmvnic_complete_tx(struct ibmvnic_adapter 
*adapter,
txbuff->data_dma[j] = 0;
}
/* if sub_crq was sent indirectly */
-   first = txbuff->indir_arr[0].generic.first;
-   if (first == IBMVNIC_CRQ_CMD) {
+   first = &txbuff->indir_arr[0].generic.first;
+   if (*first == IBMVNIC_CRQ_CMD) {
dma_unmap_single(dev, txbuff->indir_dma,
 sizeof(txbuff->indir_arr),
 DMA_TO_DEVICE);
+   *first = 0;
}
 
if (txbuff->last_frag) {
@@ -3882,9 +3880,9 @@ static int handle_login_rsp(union ibmvnic_crq 
*login_rsp_crq,
int i;
 
dma_unmap_single(dev, adapter->login_buf_token, adapter->login_buf_sz,
-DMA_BIDIRECTIONAL);
+DMA_TO_DEVICE);
dma_unmap_single(dev, adapter->login_rsp_buf_token,
-adapter->login_rsp_buf_sz, DMA_BIDIRECTIONAL);
+adapter->login_rsp_buf_sz, DMA_FROM_DEVICE);
 
/* If the number of queues requested can't be allocated by the
 * server, the login response will return with code 1. We will need
-- 
1.8.3.1
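
The second fix in the patch above (clearing the marker that decides
whether an unmap is needed) can be modelled without any DMA at all. The
sketch below is only an illustration under assumptions: the marker
value, the struct and the txbuf_* helpers are hypothetical, and it
assumes that a buffer sent without indirect descriptors leaves the old
marker in place.

```c
#include <stdbool.h>

#define MODEL_CRQ_CMD 0x80	/* hypothetical marker value */

struct model_txbuf {
	unsigned char first;	/* marker: indirect descriptors were mapped */
	int mapped;		/* live mappings of the indirect array */
	int bad_unmaps;		/* unmaps attempted with no live mapping */
};

/* Send a buffer; only the indirect path maps and sets the marker. */
static void txbuf_send(struct model_txbuf *b, bool indirect)
{
	if (indirect) {
		b->first = MODEL_CRQ_CMD;
		b->mapped++;
	}
}

/*
 * Completion: unmap when the marker says so. clear_marker selects the
 * patched behaviour, where the marker is zeroed after the unmap.
 */
static void txbuf_complete(struct model_txbuf *b, bool clear_marker)
{
	if (b->first != MODEL_CRQ_CMD)
		return;
	if (b->mapped > 0)
		b->mapped--;
	else
		b->bad_unmaps++;	/* what the DMA debugger would flag */
	if (clear_marker)
		b->first = 0;
}
```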

[Patch net] tipc: use the right skb in tipc_sk_fill_sock_diag()

2018-04-06 Thread Cong Wang

Commit 4b2e6877b879 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag")
tried to fix the crash but failed; the crash is still 100% reproducible
with it.

In tipc_sk_fill_sock_diag(), skb is the diag dump we are filling, so it is
not correct to retrieve its NETLINK_CB(). Instead, like other protocol
diag code, we should use NETLINK_CB(cb->skb).sk here.

Reported-by: 
Fixes: 4b2e6877b879 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag")
Fixes: c30b70deb5f4 (tipc: implement socket diagnostics for AF_TIPC)
Cc: GhantaKrishnamurthy MohanKrishna 

Cc: Jon Maloy 
Cc: Ying Xue 
Signed-off-by: Cong Wang 
---
 net/tipc/diag.c   | 2 +-
 net/tipc/socket.c | 6 +++---
 net/tipc/socket.h | 4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/tipc/diag.c b/net/tipc/diag.c
index 46d9cd62f781..aaabb0b776dd 100644
--- a/net/tipc/diag.c
+++ b/net/tipc/diag.c
@@ -59,7 +59,7 @@ static int __tipc_add_sock_diag(struct sk_buff *skb,
if (!nlh)
return -EMSGSIZE;
 
-   err = tipc_sk_fill_sock_diag(skb, tsk, req->tidiag_states,
+   err = tipc_sk_fill_sock_diag(skb, cb, tsk, req->tidiag_states,
 __tipc_diag_gen_cookie);
if (err)
return err;
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index cee6674a3bf4..1fd1c8b5ce03 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -3257,8 +3257,8 @@ int tipc_nl_sk_walk(struct sk_buff *skb, struct 
netlink_callback *cb,
 }
 EXPORT_SYMBOL(tipc_nl_sk_walk);
 
-int tipc_sk_fill_sock_diag(struct sk_buff *skb, struct tipc_sock *tsk,
-  u32 sk_filter_state,
+int tipc_sk_fill_sock_diag(struct sk_buff *skb, struct netlink_callback *cb,
+  struct tipc_sock *tsk, u32 sk_filter_state,
   u64 (*tipc_diag_gen_cookie)(struct sock *sk))
 {
struct sock *sk = >sk;
@@ -3280,7 +3280,7 @@ int tipc_sk_fill_sock_diag(struct sk_buff *skb, struct 
tipc_sock *tsk,
nla_put_u32(skb, TIPC_NLA_SOCK_TIPC_STATE, (u32)sk->sk_state) ||
nla_put_u32(skb, TIPC_NLA_SOCK_INO, sock_i_ino(sk)) ||
nla_put_u32(skb, TIPC_NLA_SOCK_UID,
-   from_kuid_munged(sk_user_ns(NETLINK_CB(skb).sk),
+   from_kuid_munged(sk_user_ns(NETLINK_CB(cb->skb).sk),
 sock_i_uid(sk))) ||
nla_put_u64_64bit(skb, TIPC_NLA_SOCK_COOKIE,
  tipc_diag_gen_cookie(sk),
diff --git a/net/tipc/socket.h b/net/tipc/socket.h
index aae3fd4cd06c..aff9b2ae5a1f 100644
--- a/net/tipc/socket.h
+++ b/net/tipc/socket.h
@@ -61,8 +61,8 @@ int tipc_sk_rht_init(struct net *net);
 void tipc_sk_rht_destroy(struct net *net);
 int tipc_nl_sk_dump(struct sk_buff *skb, struct netlink_callback *cb);
 int tipc_nl_publ_dump(struct sk_buff *skb, struct netlink_callback *cb);
-int tipc_sk_fill_sock_diag(struct sk_buff *skb, struct tipc_sock *tsk,
-  u32 sk_filter_state,
+int tipc_sk_fill_sock_diag(struct sk_buff *skb, struct netlink_callback *cb,
+  struct tipc_sock *tsk, u32 sk_filter_state,
   u64 (*tipc_diag_gen_cookie)(struct sock *sk));
 int tipc_nl_sk_walk(struct sk_buff *skb, struct netlink_callback *cb,
int (*skb_handler)(struct sk_buff *skb,
-- 
2.13.0

Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-06 Thread Andrew Lunn

Hi Siwei

> I think everyone seems to agree not to fiddle with the ":" prefix, but
> rather have a new class of network subsystem under /sys/class thus a
> separate device namespace e.g. /sys/class/net-kernel for those
> auto-managed lower netdevs is needed.
 
How do you get a device into this new class? I don't know the Linux
driver model too well, but to get a device out of one class and into
another, i think you need to device_del(dev). modify dev->class and
then device_add(dev). However, device_add() says you are not allowed
to do this.

And i don't even see how this helps. Are you also not going to call
list_netdevice()? Are you going to add some other list for these
devices in a different class?

   Andrew

Re: [virtio-dev] Re: [RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device

2018-04-06 Thread Siwei Liu

(clicked the wrong reply button again, sorry)


On Thu, Apr 5, 2018 at 8:31 AM, Paolo Bonzini  wrote:
> On 04/04/2018 10:02, Siwei Liu wrote:
>>> pci_bus_num is almost always a bug if not done within
>>> a context of a PCI host, bridge, etc.
>>>
>>> In particular this will not DTRT if run before guest assigns bus
>>> numbers.
>>>
>> I was seeking means to reserve a specific pci bus slot from drivers,
>> and update the driver when guest assigns the bus number but it seems
>> there's no low-hanging fruits. Because of that reason the bus_num is
>> only obtained until it's really needed (during get_config) and I
>> assume at that point the pci bus assignment is already done. I know
>> the current one is not perfect, but we need that information (PCI
>> bus:slot.func number) to name the guest device correctly.
>
> Can you use the -device "id", and look it up as
>
> devices = container_get(qdev_get_machine(), "/peripheral");
> return object_resolve_path_component(devices, id);


No. The problem with using the device id is that the vfio device may
come and go at any time; this is particularly true when live migration
is happening. There's no guarantee we can get the bus:device.func info
if that device is gone. Currently the binding between vfio and
virtio-net is weakly coupled through the backup property, so there's no
better way than specifying the bus id and addr property directly.

Regards,
-Siwei

>
> ?
>
> Thanks,
>
> Paolo

[PATCH 6/8] ipconfig: Correctly initialise ic_nameservers

2018-04-06 Thread Chris Novakovic

ic_nameservers, which stores the list of name servers discovered by
ipconfig, is initialised (i.e. has all of its elements set to NONE, or
0xffffffff) by ic_nameservers_predef() in the following scenarios:

 - before the "ip=" and "nfsaddrs=" kernel command line parameters are
   parsed (in ip_auto_config_setup());
 - before autoconfiguring via DHCP or BOOTP (in ic_bootp_init()), in
   order to clear any values that may have been set after parsing "ip="
   or "nfsaddrs=" and are no longer needed.

This means that ic_nameservers_predef() is not called when neither "ip="
nor "nfsaddrs=" is specified on the kernel command line. In this
scenario, every element in ic_nameservers remains set to 0x00000000,
which is indistinguishable from ANY and causes pnp_seq_show() to write
the following (bogus) information to /proc/net/pnp:

  #MANUAL
  nameserver 0.0.0.0
  nameserver 0.0.0.0
  nameserver 0.0.0.0

This is potentially problematic for systems that blindly link
/etc/resolv.conf to /proc/net/pnp.

Ensure that ic_nameservers is also initialised when neither "ip=" nor
"nfsaddrs=" is specified by calling ic_nameservers_predef() in
ip_auto_config(), but only when ip_auto_config_setup() was not called
earlier. This causes the following to be written to /proc/net/pnp, and
is consistent with what gets written when ipconfig is configured
manually but no name servers are specified on the kernel command line:

  #MANUAL

Signed-off-by: Chris Novakovic 
---
 net/ipv4/ipconfig.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index 0f460d6d3cce..e11dfd29a929 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -750,6 +750,11 @@ static void __init ic_bootp_init_ext(u8 *e)
  */
 static inline void __init ic_bootp_init(void)
 {
+   /* Re-initialise all name servers to NONE, in case any were set via the
+* "ip=" or "nfsaddrs=" kernel command line parameters: any IP addresses
+* specified there will already have been decoded but are no longer
+* needed
+*/
ic_nameservers_predef();
 
dev_add_pack(_packet_type);
@@ -1370,6 +1375,13 @@ static int __init ip_auto_config(void)
int err;
unsigned int i;
 
+   /* Initialise all name servers to NONE (but only if the "ip=" or
+* "nfsaddrs=" kernel command line parameters weren't decoded, otherwise
+* we'll overwrite the IP addresses specified there)
+*/
+   if (ic_set_manually == 0)
+   ic_nameservers_predef();
+
 #ifdef CONFIG_PROC_FS
proc_create("pnp", 0444, init_net.proc_net, _seq_fops);
 #endif /* CONFIG_PROC_FS */
@@ -1593,6 +1605,7 @@ static int __init ip_auto_config_setup(char *addrs)
return 1;
}
 
+   /* Initialise all name servers to NONE */
ic_nameservers_predef();
 
/* Parse string for static IP assignment.  */
-- 
2.14.1
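
The bug fixed by the patch above is easy to reproduce in a few lines of
user-space C. The sketch below is a simplified model, not the kernel
code: the model_* names are invented, addresses are kept in host byte
order for brevity, and only the name server lines of /proc/net/pnp are
produced (the "#MANUAL" header is omitted).

```c
#include <stdint.h>
#include <stdio.h>

#define MODEL_NONE 0xffffffffu	/* INADDR_NONE: "no address" sentinel */
#define MODEL_NAMESERVERS_MAX 3

/* Static storage starts out zeroed, so every slot looks like 0.0.0.0. */
static uint32_t model_nameservers[MODEL_NAMESERVERS_MAX];

/* Models ic_nameservers_predef(): mark every slot as unset. */
static void model_nameservers_predef(void)
{
	for (int i = 0; i < MODEL_NAMESERVERS_MAX; i++)
		model_nameservers[i] = MODEL_NONE;
}

/*
 * Models the name server part of pnp_seq_show(): one line per slot that
 * is not NONE. Returns the number of lines written to buf.
 */
static int model_pnp_show(char *buf, size_t len)
{
	size_t off = 0;
	int lines = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';
	for (int i = 0; i < MODEL_NAMESERVERS_MAX; i++) {
		uint32_t a = model_nameservers[i];
		int n;

		if (a == MODEL_NONE)
			continue;	/* unset slot: print nothing */
		n = snprintf(buf + off, len - off, "nameserver %u.%u.%u.%u\n",
			     (unsigned)(a >> 24) & 0xff,
			     (unsigned)(a >> 16) & 0xff,
			     (unsigned)(a >> 8) & 0xff,
			     (unsigned)a & 0xff);
		if (n < 0 || (size_t)n >= len - off)
			break;		/* output would be truncated */
		off += (size_t)n;
		lines++;
	}
	return lines;
}
```

Calling model_pnp_show() before model_nameservers_predef() yields three
"nameserver 0.0.0.0" lines; calling it afterwards yields none.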

[PATCH 7/8] ipconfig: Write NTP server IPs to /proc/net/ntp

2018-04-06 Thread Chris Novakovic

Distributed filesystems are most effective when the server and client
clocks are synchronised. Embedded devices often use NFS for their
root filesystem but typically do not contain an RTC, so the clocks of
the NFS server and the embedded device will be out-of-sync when the root
filesystem is mounted (and may not be synchronised until late in the
boot process).

Extend ipconfig with the ability to export IP addresses of NTP servers
it discovers to /proc/net/ntp. They can be supplied as follows:

 - If ipconfig is configured manually via the "ip=" or "nfsaddrs="
   kernel command line parameters, one NTP server can be specified in
   the new "<ntp0-ip>" parameter.
 - If ipconfig is autoconfigured via DHCP, request DHCP option 42 in
   the DHCPDISCOVER message, and record the IP addresses of up to three
   NTP servers sent by the responding DHCP server in the subsequent
   DHCPOFFER message.

ipconfig will only write the NTP server IP addresses it discovers to
/proc/net/ntp, one per line (in the order received from the DHCP server,
if DHCP autoconfiguration is used); making use of these NTP servers is
the responsibility of a user space process (e.g. an initrd/initramfs
script that invokes an NTP client before mounting an NFS root
filesystem).
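
As an illustration of the user space side, the following sketch parses
text in the "one IPv4 address per line" format described above. It works
on a string so that it can be tried without the patch applied; reading
the text from /proc/net/ntp and handing the result to an NTP client are
left to the caller. The function name and the limit constant are
assumptions made for this example.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>

#define MODEL_NTP_MAX 3	/* the text above mentions up to three servers */

/*
 * Parse newline-separated IPv4 addresses from text into out[].
 * Lines that are empty or not valid addresses are skipped.
 * Returns the number of addresses stored (at most max).
 */
static size_t parse_ntp_servers(const char *text, struct in_addr *out,
				size_t max)
{
	size_t count = 0;

	while (*text && count < max) {
		size_t len = strcspn(text, "\n");
		char line[INET_ADDRSTRLEN];

		if (len > 0 && len < sizeof(line)) {
			memcpy(line, text, len);
			line[len] = '\0';
			if (inet_pton(AF_INET, line, &out[count]) == 1)
				count++;
		}
		text += len;
		if (*text == '\n')
			text++;
	}
	return count;
}
```

An initramfs script would more likely do this in shell; the C version is
only meant to pin down the format.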

Signed-off-by: Chris Novakovic 
---
 Documentation/filesystems/nfs/nfsroot.txt | 35 +--
 net/ipv4/ipconfig.c   | 99 ---
 2 files changed, 119 insertions(+), 15 deletions(-)

diff --git a/Documentation/filesystems/nfs/nfsroot.txt 
b/Documentation/filesystems/nfs/nfsroot.txt
index a1030bea60d3..4d55470f7ca9 100644
--- a/Documentation/filesystems/nfs/nfsroot.txt
+++ b/Documentation/filesystems/nfs/nfsroot.txt
@@ -5,6 +5,7 @@ Written 1996 by Gero Kuhlmann 
 Updated 1997 by Martin Mares 
 Updated 2006 by Nico Schottelius 
 Updated 2006 by Horms 
+Updated 2018 by Chris Novakovic 
 
 
 
@@ -79,7 +80,7 @@ nfsroot=[:][,]
 
 
 ip=:::
-   :
+   ::
 
   This parameter tells the kernel how to configure IP addresses of devices
   and also how to set up the IP routing table. It was originally called
@@ -178,9 +179,18 @@ 
ip=:::
   IP address of secondary nameserver.
See .
 
-  After configuration (whether manual or automatic) is complete, a file is
-  created at /proc/net/pnp in the following format; lines are omitted if
-  their respective value is empty following configuration.
+   IP address of a Network Time Protocol (NTP) server.
+   Value is exported to /proc/net/ntp, but is otherwise unused
+   (see below).
+
+   Default: None if not using autoconfiguration; determined
+   automatically if using autoconfiguration.
+
+  After configuration (whether manual or automatic) is complete, two files
+  are created in the following format; lines are omitted if their respective
+  value is empty following configuration:
+
+  - /proc/net/pnp:
 
#PROTO: (depending on configuration 
method)
domain  (if autoconfigured, the DNS 
domain)
@@ -189,13 +199,26 @@ 
ip=:::
nameserver (tertiary name server IP)
bootserver   (NFS server IP)
 
-   and  are requested during autoconfiguration; they
-  cannot be specified as part of the "ip=" kernel command line parameter.
+  - /proc/net/ntp:
+
+  (NTP server IP)
+  (NTP server IP)
+  (NTP server IP)
+
+   and  (in /proc/net/pnp) and  and 
+  (in /proc/net/ntp) are requested during autoconfiguration; they cannot be
+  specified as part of the "ip=" kernel command line parameter.
 
   Because the "domain" and "nameserver" options are recognised by DNS
   resolvers, /etc/resolv.conf is often linked to /proc/net/pnp on systems
   that use an NFS root filesystem.
 
+  Note that the kernel will not synchronise the system time with any NTP
+  servers it discovers; this is the responsibility of a user space process
+  (e.g. an initrd/initramfs script that passes the IP addresses listed in
+  /proc/net/ntp to an NTP client before mounting the real root filesystem
+  if it is on NFS).
+
 
 nfsrootdebug
 
diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index e11dfd29a929..a5d68e506494 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -28,6 +28,9 @@
  *
  *  Multiple Nameservers in /proc/net/pnp
  *  --  Josef Siemes , Aug 2002
+ *
+ *  NTP servers in /proc/net/ntp
+ *  --  Chris Novakovic , April 2018
  */
 
 #include 
@@ -93,6 +96,7 @@
 #define CONF_TIMEOUT_MAX   (HZ*30) /* Maximum allowed timeout */
 #define CONF_NAMESERVERS_MAX   3   /* Maximum number of nameservers

[PATCH 8/8] CREDITS: Add Chris Novakovic

2018-04-06 Thread Chris Novakovic

Signed-off-by: Chris Novakovic 
---
 CREDITS | 4 
 1 file changed, 4 insertions(+)

diff --git a/CREDITS b/CREDITS
index 989cda91c427..5a13bf62c569 100644
--- a/CREDITS
+++ b/CREDITS
@@ -2765,6 +2765,10 @@ E: nor...@nocrew.org
 W: http://www.lysator.liu.se/~noring/
 D: dsp56k device driver
 
+N: Chris Novakovic
+E: ch...@chrisn.me.uk
+D: ipconfig: NTP server support, bug fixes, documentation
+
 N: Michael O'Reilly
 E: mich...@iinet.com.au
 E: oreil...@tartarus.uwa.edu.au
-- 
2.14.1

[PATCH 0/8] ipconfig: NTP server support, bug fixes, documentation improvements

2018-04-06 Thread Chris Novakovic

This series (against net-next) makes various improvements to ipconfig:

 - Patch #1 correctly documents the behaviour of parameter 4 in the
   "ip=" and "nfsaddrs=" command line parameters.
 - Patch #2 tidies up the printk()s for reporting configured name
   servers.
 - Patch #3 fixes a bug in autoconfiguration via BOOTP whereby the IP
   addresses of IEN-116 name servers are requested from the BOOTP
   server, rather than those of DNS name servers.
 - Patch #4 requests the number of DNS servers specified by
   CONF_NAMESERVERS_MAX when autoconfiguring via BOOTP, rather than
   hardcoding it to 2.
 - Patch #5 fully documents the contents and format of /proc/net/pnp in
   Documentation/filesystems/nfs/nfsroot.txt.
 - Patch #6 fixes a bug whereby bogus information is written to
   /proc/net/pnp when ipconfig is not used.
 - Patch #7 allows for NTP servers to be configured (manually on the
   kernel command line or automatically via DHCP), enabling systems with
   an NFS root filesystem to synchronise their clock before mounting
   their root filesystem.

Patch #7 gets a few warnings when run through checkpatch.pl, but I felt
it'd be better to use the same style as surrounding code where
appropriate, rather than adhering strictly to the net/ style guide.

Chris Novakovic (8):
  ipconfig: Document setting of NIS domain name
  ipconfig: Tidy up reporting of name servers
  ipconfig: BOOTP: Don't request IEN-116 name servers
  ipconfig: BOOTP: Request CONF_NAMESERVERS_MAX name servers
  ipconfig: Document /proc/net/pnp
  ipconfig: Correctly initialise ic_nameservers
  ipconfig: Write NTP server IPs to /proc/net/ntp
  CREDITS: Add Chris Novakovic

 CREDITS   |   4 +
 Documentation/filesystems/nfs/nfsroot.txt |  70 ++---
 net/ipv4/ipconfig.c   | 121 +++---
 3 files changed, 174 insertions(+), 21 deletions(-)

-- 
2.14.1

[PATCH 5/8] ipconfig: Document /proc/net/pnp

2018-04-06 Thread Chris Novakovic

Fully document the format used by the /proc/net/pnp file written by
ipconfig, explain where its values originate from, and clarify that the
tertiary name server IP and DNS domain name are only written to the file
when autoconfiguration is used.

Signed-off-by: Chris Novakovic 
---
 Documentation/filesystems/nfs/nfsroot.txt | 34 ++-
 1 file changed, 29 insertions(+), 5 deletions(-)

diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt
index 1513e5d663fd..a1030bea60d3 100644
--- a/Documentation/filesystems/nfs/nfsroot.txt
+++ b/Documentation/filesystems/nfs/nfsroot.txt
@@ -110,6 +110,9 @@ ip=:::
will not be triggered if it is missing and NFS root is not
in operation.
 
+   Value is exported to /proc/net/pnp with the prefix "bootserver "
+   (see below).
+
Default: Determined using autoconfiguration.
 The address of the autoconfiguration server is used.
 
@@ -165,12 +168,33 @@ ip=:::
 
 Default: any
 
-  IP address of first nameserver.
-   Value gets exported by /proc/net/pnp which is often linked
-   on embedded systems by /etc/resolv.conf.
+  IP address of primary nameserver.
+   Value is exported to /proc/net/pnp with the prefix "nameserver "
+   (see below).
+
+   Default: None if not using autoconfiguration; determined
+   automatically if using autoconfiguration.
+
+  IP address of secondary nameserver.
+   See .
+
+  After configuration (whether manual or automatic) is complete, a file is
+  created at /proc/net/pnp in the following format; lines are omitted if
+  their respective value is empty following configuration.
+
+   #PROTO: (depending on configuration method)
+   domain  (if autoconfigured, the DNS domain)
+   nameserver (primary name server IP)
+   nameserver (secondary name server IP)
+   nameserver (tertiary name server IP)
+   bootserver   (NFS server IP)
+
+   and  are requested during autoconfiguration; they
+  cannot be specified as part of the "ip=" kernel command line parameter.
 
-  IP address of second nameserver.
-   Same as above.
+  Because the "domain" and "nameserver" options are recognised by DNS
+  resolvers, /etc/resolv.conf is often linked to /proc/net/pnp on systems
+  that use an NFS root filesystem.
 
 
 nfsrootdebug
-- 
2.14.1
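
For readers who want to consume the file format documented in patch 5/8
above, here is a minimal userspace sketch in C. It is not part of the patch
series. The names pnp_parse(), struct pnp_info and PNP_MAX_NS are
hypothetical, and the parser assumes each value is a single
whitespace-free token following the documented prefix.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PNP_MAX_NS 3	/* assumption: mirrors CONF_NAMESERVERS_MAX (currently 3) */

/* Hypothetical container for the values found in a pnp-style file. */
struct pnp_info {
	char proto[16];
	char domain[64];
	char nameserver[PNP_MAX_NS][16];
	int  n_nameservers;
	char bootserver[16];
};

/*
 * Parse pnp-style text line by line.
 * Returns the number of lines that matched a documented prefix.
 * Lines that are absent simply leave the matching field empty.
 */
static int pnp_parse(const char *text, struct pnp_info *out)
{
	char line[128];
	int recognised = 0;

	assert(text != NULL && out != NULL);
	memset(out, 0, sizeof(*out));

	while (*text) {
		size_t len = strcspn(text, "\n");
		size_t n = len < sizeof(line) - 1 ? len : sizeof(line) - 1;

		/* Copy one line (truncated if overlong) and advance. */
		memcpy(line, text, n);
		line[n] = '\0';
		text += len;
		if (*text == '\n')
			text++;

		if (sscanf(line, "#PROTO: %15s", out->proto) == 1 ||
		    sscanf(line, "domain %63s", out->domain) == 1 ||
		    sscanf(line, "bootserver %15s", out->bootserver) == 1) {
			recognised++;
		} else if (out->n_nameservers < PNP_MAX_NS &&
			   sscanf(line, "nameserver %15s",
				  out->nameserver[out->n_nameservers]) == 1) {
			out->n_nameservers++;
			recognised++;
		}
	}
	return recognised;
}
```

A caller would read /proc/net/pnp into a buffer and pass it to pnp_parse();
an empty out->domain afterwards is consistent with manual configuration.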

[PATCH 4/8] ipconfig: BOOTP: Request CONF_NAMESERVERS_MAX name servers

2018-04-06 Thread Chris Novakovic

When ipconfig is autoconfigured via BOOTP, the request packet
initialised by ic_bootp_init_ext() always allocates 8 bytes for the name
server option, limiting the BOOTP server to responding with at most 2
name servers even though ipconfig in fact supports an arbitrary number
of name servers (as defined by CONF_NAMESERVERS_MAX, which is currently
3).

Only request name servers in the request packet if CONF_NAMESERVERS_MAX
is positive (to comply with [1, §3.8]), and allocate enough space in the
packet for CONF_NAMESERVERS_MAX name servers to indicate the maximum
number we can accept in response.

[1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions":
https://tools.ietf.org/rfc/rfc2132.txt

Signed-off-by: Chris Novakovic 
---
 net/ipv4/ipconfig.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index bcf3c4f9882d..0f460d6d3cce 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -721,9 +721,11 @@ static void __init ic_bootp_init_ext(u8 *e)
*e++ = 3;   /* Default gateway request */
*e++ = 4;
e += 4;
+#if CONF_NAMESERVERS_MAX > 0
*e++ = 6;   /* (DNS) name server request */
-   *e++ = 8;
-   e += 8;
+   *e++ = 4 * CONF_NAMESERVERS_MAX;
+   e += 4 * CONF_NAMESERVERS_MAX;
+#endif
*e++ = 12;  /* Host name request */
*e++ = 32;
e += 32;
-- 
2.14.1
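
The option layout that patch 4/8 above changes can be illustrated with a
small standalone sketch. This is userspace C, not the kernel code: the
function name bootp_put_dns_request() and the constant BOOTP_OPT_DNS are
hypothetical, and unlike ic_bootp_init_ext() the sketch zeroes the value
bytes explicitly instead of relying on a pre-cleared packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BOOTP_OPT_DNS 6	/* RFC 2132 section 3.8, "Domain Name Server" */

/*
 * Write a DNS name server request option at e: one tag byte, one length
 * byte, then room for max_ns IPv4 addresses (4 bytes each).
 * Returns the number of bytes written; nothing is written when max_ns is 0,
 * matching the "#if CONF_NAMESERVERS_MAX > 0" guard in the patch.
 */
static size_t bootp_put_dns_request(uint8_t *e, unsigned int max_ns)
{
	size_t len = 4u * max_ns;

	assert(e != NULL);
	assert(len <= 255);	/* the option length field is a single byte */

	if (max_ns == 0)
		return 0;

	e[0] = BOOTP_OPT_DNS;
	e[1] = (uint8_t)len;
	memset(e + 2, 0, len);	/* placeholder space for the server's reply */
	return 2 + len;
}
```

With max_ns set to 3 the option occupies 14 bytes (2 header bytes plus 12
value bytes), where the previous hardcoded request occupied 10.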

[PATCH 1/8] ipconfig: Document setting of NIS domain name

2018-04-06 Thread Chris Novakovic

ic_do_bootp_ext() is responsible for parsing the "ip=" and "nfsaddrs="
kernel parameters. If a "." character is found in parameter 4 (the
client's hostname), everything before the first "." is used as the
hostname, and everything after it is used as the NIS domain name (but
not necessarily the DNS domain name).

Document this behaviour in Documentation/filesystems/nfs/nfsroot.txt,
as it is not made explicit.

Signed-off-by: Chris Novakovic 
---
 Documentation/filesystems/nfs/nfsroot.txt | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt
index 5efae00f6c7f..1513e5d663fd 100644
--- a/Documentation/filesystems/nfs/nfsroot.txt
+++ b/Documentation/filesystems/nfs/nfsroot.txt
@@ -123,10 +123,13 @@ ip=:::
 
Default:  Determined using autoconfiguration.
 
- Name of the client. May be supplied by autoconfiguration,
-   but its absence will not trigger autoconfiguration.
-   If specified and DHCP is used, the user provided hostname will
-   be carried in the DHCP request to hopefully update DNS record.
+ Name of the client. If a '.' character is present, anything
+   before the first '.' is used as the client's hostname, and anything
+   after it is used as its NIS domain name. May be supplied by
+   autoconfiguration, but its absence will not trigger autoconfiguration.
+   If specified and DHCP is used, the user-provided hostname (and NIS
+   domain name, if present) will be carried in the DHCP request; this
+   may cause a DNS record to be created or updated for the client.
 
Default: Client IP address is used in ASCII notation.
 
-- 
2.14.1
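
The splitting rule documented by patch 1/8 above (everything before the
first '.' is the hostname, everything after it is the NIS domain name) can
be sketched in userspace C as follows. The function name
split_client_name() and its buffer-based interface are hypothetical and are
not taken from ipconfig.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Split parameter 4 of "ip=" at the first '.' character.
 * host receives the part before the dot, nis the part after it.
 * Without a dot, the whole string is the hostname and nis is left empty.
 * Both outputs are always NUL-terminated (and truncated if too small).
 */
static void split_client_name(const char *param,
			      char *host, size_t hostlen,
			      char *nis, size_t nislen)
{
	const char *dot;

	assert(param != NULL && host != NULL && nis != NULL);
	assert(hostlen > 0 && nislen > 0);

	dot = strchr(param, '.');
	if (dot) {
		snprintf(host, hostlen, "%.*s", (int)(dot - param), param);
		snprintf(nis, nislen, "%s", dot + 1);
	} else {
		snprintf(host, hostlen, "%s", param);
		nis[0] = '\0';
	}
}
```

Note that only the first '.' matters: "a.b.c" yields the hostname "a" and
the NIS domain name "b.c".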

[PATCH 2/8] ipconfig: Tidy up reporting of name servers

2018-04-06 Thread Chris Novakovic

Commit 5e953778a2aab04929a5e7b69f53dc26e39b079e ("ipconfig: add
nameserver IPs to kernel-parameter ip=") adds the IP addresses of
discovered name servers to the summary printed by ipconfig when
configuration is complete. It appears the intention in ip_auto_config()
was to print the name servers on a new line (especially given the
spacing and lack of comma before "nameserver0="), but they're actually
printed on the same line as the NFS root filesystem configuration
summary:

  [0.686186] IP-Config: Complete:
  [0.686226]  device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, mask=255.255.255.0, gw=10.0.0.1
  [0.686328]  host=test, domain=example.com, nis-domain=(none)
  [0.686386]  bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath= nameserver0=10.0.0.1

This makes it harder to read and parse ipconfig's output. Instead, print
the name servers on a separate line:

  [0.791250] IP-Config: Complete:
  [0.791289]  device=eth0, hwaddr=xx:xx:xx:xx:xx:xx, ipaddr=10.0.0.2, mask=255.255.255.0, gw=10.0.0.1
  [0.791407]  host=test, domain=example.com, nis-domain=(none)
  [0.791475]  bootserver=10.0.0.1, rootserver=10.0.0.1, rootpath=
  [0.791476]  nameserver0=10.0.0.1

Signed-off-by: Chris Novakovic 
---
 net/ipv4/ipconfig.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index 43f620feb1c4..d0ea0ecc9008 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -1481,16 +1481,19 @@ static int __init ip_auto_config(void)
&ic_servaddr, &root_server_addr, root_server_path);
if (ic_dev_mtu)
pr_cont(", mtu=%d", ic_dev_mtu);
-   for (i = 0; i < CONF_NAMESERVERS_MAX; i++)
+   /* Name servers (if any): */
+   for (i = 0; i < CONF_NAMESERVERS_MAX; i++) {
if (ic_nameservers[i] != NONE) {
-   pr_cont(" nameserver%u=%pI4",
-   i, &ic_nameservers[i]);
-   break;
+   if (i == 0)
+   pr_info(" nameserver%u=%pI4",
+   i, &ic_nameservers[i]);
+   else
+   pr_cont(", nameserver%u=%pI4",
+   i, &ic_nameservers[i]);
}
-   for (i++; i < CONF_NAMESERVERS_MAX; i++)
-   if (ic_nameservers[i] != NONE)
-   pr_cont(", nameserver%u=%pI4", i, &ic_nameservers[i]);
-   pr_cont("\n");
+   if (i + 1 == CONF_NAMESERVERS_MAX)
+   pr_cont("\n");
+   }
 #endif /* !SILENT */
 
/*
-- 
2.14.1
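
The output format that patch 2/8 above aims for (all name servers on one
separate line, comma-separated) can be sketched in userspace C. This is a
variation, not the patch's loop: it tracks whether anything has been printed
yet instead of testing "i == 0", and it formats into a buffer instead of
calling pr_info()/pr_cont(). The names struct ip4, ns_is_none() and
format_nameservers() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* An IPv4 address as four bytes in network order. */
struct ip4 {
	uint8_t b[4];
};

/* Assumption: an unset slot is 255.255.255.255, like NONE in ipconfig.c. */
static int ns_is_none(const struct ip4 *a)
{
	return a->b[0] == 255 && a->b[1] == 255 &&
	       a->b[2] == 255 && a->b[3] == 255;
}

/*
 * Format every configured name server as " nameserverN=a.b.c.d", separated
 * by ", " and terminated by a newline. Writes nothing if no slot is set.
 * Returns the number of characters stored, excluding the trailing NUL.
 */
static size_t format_nameservers(const struct ip4 *ns, size_t n,
				 char *out, size_t outlen)
{
	size_t used = 0;
	size_t i;
	int first = 1;

	assert(ns != NULL && out != NULL && outlen > 0);
	out[0] = '\0';

	for (i = 0; i < n; i++) {
		int w;

		if (ns_is_none(&ns[i]))
			continue;

		w = snprintf(out + used, outlen - used,
			     "%snameserver%zu=%u.%u.%u.%u",
			     first ? " " : ", ", i,
			     (unsigned int)ns[i].b[0], (unsigned int)ns[i].b[1],
			     (unsigned int)ns[i].b[2], (unsigned int)ns[i].b[3]);
		if (w < 0 || (size_t)w >= outlen - used) {
			out[used] = '\0';	/* drop the truncated entry */
			return used;
		}
		used += (size_t)w;
		first = 0;
	}

	if (!first && used + 1 < outlen) {
		out[used++] = '\n';
		out[used] = '\0';
	}
	return used;
}
```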

[PATCH 3/8] ipconfig: BOOTP: Don't request IEN-116 name servers

2018-04-06 Thread Chris Novakovic

When ipconfig is autoconfigured via BOOTP, the request packet
initialised by ic_bootp_init_ext() allocates 8 bytes for tag 5 ("Name
Server" [1, §3.7]), but tag 5 in the response isn't processed by
ic_do_bootp_ext(). Instead, allocate the 8 bytes to tag 6 ("Domain Name
Server" [1, §3.8]), which is processed by ic_do_bootp_ext(), and appears
to have been the intended tag to request.

This won't cause any breakage for existing users, as tag 5 responses
provided by BOOTP servers weren't being processed anyway.

[1] RFC 2132, "DHCP Options and BOOTP Vendor Extensions":
https://tools.ietf.org/rfc/rfc2132.txt

Signed-off-by: Chris Novakovic 
---
 net/ipv4/ipconfig.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index d0ea0ecc9008..bcf3c4f9882d 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -721,7 +721,7 @@ static void __init ic_bootp_init_ext(u8 *e)
*e++ = 3;   /* Default gateway request */
*e++ = 4;
e += 4;
-   *e++ = 5;   /* Name server request */
+   *e++ = 6;   /* (DNS) name server request */
*e++ = 8;
e += 8;
*e++ = 12;  /* Host name request */
-- 
2.14.1

Re: TCP one-by-one acking - RFC interpretation question

2018-04-06 Thread Eric Dumazet



On 04/06/2018 03:05 AM, Michal Kubecek wrote:
> Hello,
> 
> I encountered a strange behaviour of some (non-linux) TCP stack which
> I believe is incorrect but support engineers from the company producing
> it claim is OK.
> 
> Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS sized
> segments but segments 2, 4 and 6 do not reach the server (receiver):
> 
>        ACK             SAK             SAK             SAK
>   +-------+-------+-------+-------+-------+-------+-------+
>   |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
>   +-------+-------+-------+-------+-------+-------+-------+
> 34273   35701   37129   38557   39985   41413   42841   44269
> 
> When segment 2 is retransmitted after RTO timeout, normal response would
> be ACK-ing segment 3 (38557) with SACK for 5 and 7 (39985-41413 and
> 42841-44269).
> 
> However, this server stack responds with two separate ACKs:
> 
>   - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
>   - ACK 38557, SACK 39985-41413 42841-44269

Hmmm... Yes this seems very very wrong and lazy.

Have you verified how more recent Linux kernels behave against such threats?

packetdrill test would be relatively easy to write.

Regardless of this broken alien stack, we might be able to work around this faster
than the vendor is able to fix and deploy a new stack.

( https://en.wikipedia.org/wiki/Robustness_principle )
Be conservative in what you do, be liberal in what you accept from others...



> 
> There is no payload from server, no window update and it happens even if
> there is no other packet received by server between those two. The
> result is that as segment 3 was never retransmitted, second ACK is
> interpreted as acking a newly arrived segment by 4.4 kernel so that the
> whole interval between first transmission of segment 3 and this second
> ACK is used for RTT estimator; even worse, when the same happens again
> for segment 5, both timeouts (for 2 and 4) are counted into its RTT.
> The result is RTO growing exponentially until it reaches the maximum
> (120 seconds) and the connection is effectively stalled.
> 
> In my opinion, server behaviour violates the last paragraph of RFC 5681,
> section 4.2:
> 
>   A TCP receiver MUST NOT generate more than one ACK for every incoming
>   segment, other than to update the offered window as the receiving
>   application consumes new data (see [RFC813] and page 42 of [RFC793]).
> 
> Server vendor claims that their behaviour is correct as first ACK is
> sent in response to segment 2 and second ACK in response to segment 3
> (which has just been delayed in the out of order queue).
> 
> Note that SACK doesn't really help here. First SACK block in first ACK
> (37129-38557) is actually invalid as it violates the "the bytes just
> below the block ... have not been received" condition from RFC 2018
> section 3. Therefore Linux 4.4 stack ignores this SACK block, detects
> (spurious) SACK reneging and unmarks the "previously sacked" flag of
> segment 3 so that when second ACK arrives, there is no trace of it
> having been sacked before. They already admitted this SACK block is
> incorrect but there is still disagreement about the "one-by-one acking"
> behaviour in general.
> 
> My question is: is my interpretation correct? If so, is there an even
> less ambiguous statement somewhere that receiver is supposed to send one
> ACK for "everything they got so far" rather than acking the segments one
> by one? While reading the RFCs, I always considered this obvious but
> apparently some people may think otherwise.
> 
> Thanks in advance,
> Michal Kubecek
>
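
To make the RTT inflation described in the quoted report concrete, here is
a minimal userspace sketch in C of the smoothing rules from RFC 6298
section 2. It is not the Linux estimator: it uses floating point, omits the
minimum and maximum RTO clamping, and the names struct rtt_est, rtt_sample()
and RTT_CLOCK_G are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define RTT_CLOCK_G 0.001	/* assumed clock granularity, in seconds */

/* State kept by the estimator; all times are in seconds. */
struct rtt_est {
	double srtt;
	double rttvar;
	double rto;
	int have_sample;
};

/* Feed one RTT measurement r and recompute SRTT, RTTVAR and RTO. */
static void rtt_sample(struct rtt_est *e, double r)
{
	double k_var;

	assert(e != NULL);
	assert(r >= 0.0);

	if (!e->have_sample) {
		/* First measurement: SRTT = R, RTTVAR = R/2. */
		e->srtt = r;
		e->rttvar = r / 2.0;
		e->have_sample = 1;
	} else {
		/* beta = 1/4 for RTTVAR, alpha = 1/8 for SRTT. */
		double diff = e->srtt > r ? e->srtt - r : r - e->srtt;

		e->rttvar = 0.75 * e->rttvar + 0.25 * diff;
		e->srtt = 0.875 * e->srtt + 0.125 * r;
	}

	/* RTO = SRTT + max(G, 4 * RTTVAR). */
	k_var = 4.0 * e->rttvar;
	e->rto = e->srtt + (k_var > RTT_CLOCK_G ? k_var : RTT_CLOCK_G);
}
```

Feeding five samples of 0.1 s leaves the RTO a little above 0.16 s. A single
3 s sample afterwards, such as one that spans a retransmission timeout
because the acked segment itself was never retransmitted, pushes the RTO
above 3 s, which is the feedback loop the report describes.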

Re: [PATCH v2] net: thunderx: rework mac addresses list to u64 array

2018-04-06 Thread Vadim Lomovtsev

Hi Robin,

On Fri, Apr 06, 2018 at 12:48:40PM +0100, Robin Murphy wrote:
> On 06/04/18 12:14, Vadim Lomovtsev wrote:
> > From: Vadim Lomovtsev 
> > 
> > It is too expensive to pass u64 values via a linked list; instead,
> > allocate an array for them sized by the overall number of MAC addresses
> > from the netdev.
> > 
> > This eventually removes multiple kmalloc() calls, avoids memory
> > fragmentation and allows putting a single null check on the kmalloc
> > return value in order to prevent a potential null pointer dereference.
> > 
> > Addresses-Coverity-ID: 1467429 ("Dereference null return value")
> > Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF")
> > Signed-off-by: Vadim Lomovtsev 
> > ---
> > Changes from v1 to v2:
> >   - C99 syntax: update xcast_addr_list struct field mc[0] -> mc[];
> > 
> > ---
> >   drivers/net/ethernet/cavium/thunder/nic.h|  7 +-
> >   drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 +---
> >   2 files changed, 11 insertions(+), 24 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h
> > index 5fc46c5a4f36..448d1fafc827 100644
> > --- a/drivers/net/ethernet/cavium/thunder/nic.h
> > +++ b/drivers/net/ethernet/cavium/thunder/nic.h
> > @@ -265,14 +265,9 @@ struct nicvf_drv_stats {
> >   struct cavium_ptp;
> > -struct xcast_addr {
> > -   struct list_head list;
> > -   u64  addr;
> > -};
> > -
> >   struct xcast_addr_list {
> > -   struct list_head list;
> > int  count;
> > +   u64  mc[];
> >   };
> >   struct nicvf_work {
> > diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> > index 1e9a31fef729..a26d8bc92e01 100644
> > --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> > +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> > @@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
> >   work.work);
> > struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
> > union nic_mbx mbx = {};
> > -   struct xcast_addr *xaddr, *next;
> > +   u8 idx = 0;
> > if (!vf_work)
> > return;
> > @@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
> > /* check if we have any specific MACs to be added to PF DMAC filter */
> > if (vf_work->mc) {
> > /* now go through kernel list of MACs and add them one by one */
> > -   list_for_each_entry_safe(xaddr, next,
> > -&vf_work->mc->list, list) {
> > +   for (idx = 0; idx < vf_work->mc->count; idx++) {
> > mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST;
> > -   mbx.xcast.data.mac = xaddr->addr;
> > +   mbx.xcast.data.mac = vf_work->mc->mc[idx];
> > nicvf_send_msg_to_pf(nic, &mbx);
> > -
> > -   /* after receiving ACK from PF release memory */
> > -   list_del(&xaddr->list);
> > -   kfree(xaddr);
> > -   vf_work->mc->count--;
> > }
> > kfree(vf_work->mc);
> > }
> > @@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
> > mode |= BGX_XCAST_MCAST_FILTER;
> > /* here we need to copy mc addrs */
> > if (netdev_mc_count(netdev)) {
> > -   struct xcast_addr *xaddr;
> > -
> > -   mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
> > -   INIT_LIST_HEAD(&mc_list->list);
> > +   mc_list = kmalloc(sizeof(*mc_list) +
> > + sizeof(u64) * netdev_mc_count(netdev),
> 
> FWIW if you really wanted to disambiguate that it's a structure and not just
> an array being allocated, then it could be expressed without explicit
> arithmetic:
> 
>   size = offsetof(typeof(*mc_list), mc[netdev_mc_count(netdev)]);
> 
> but that's probably just a matter of personal preference at this point.
> 
> Robin.
> 

Thanks for reviewing it and for the hint. From a style perspective, I believe
I should also get rid of the direct type names, and use your suggestion here.

I'll update the patch to v3, test and re-post.
Thank you for your time.

WBR,
Vadim

> > + GFP_ATOMIC);
> > +   if (unlikely(!mc_list))
> > +   return;
> > +   mc_list->count = 0;
> > netdev_hw_addr_list_for_each(ha, &netdev->mc) {
> > -   xaddr = kmalloc(sizeof(*xaddr),
> > -   GFP_ATOMIC);
> > -
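
The two ways of sizing the allocation discussed in this thread can be
compared in a small userspace sketch. It uses malloc() instead of kmalloc()
and uint64_t instead of u64, and the helper names are hypothetical. One
deliberate difference from Robin's suggestion: the sketch computes
offsetof(..., mc) plus n elements instead of offsetof(..., mc[n]), because
a runtime array index inside offsetof() is, as far as I know, accepted by
GCC and Clang but not guaranteed by the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-in for the reworked structure from the patch. */
struct xcast_addr_list {
	int      count;
	uint64_t mc[];	/* C99 flexible array member */
};

/* Size computed as in the patch: header plus n array elements. */
static size_t list_size_from_sizeof(size_t n)
{
	return sizeof(struct xcast_addr_list) + sizeof(uint64_t) * n;
}

/* Size computed from the offset of the flexible array member. */
static size_t list_size_from_offset(size_t n)
{
	return offsetof(struct xcast_addr_list, mc) + sizeof(uint64_t) * n;
}

/* Single allocation with a single NULL check, as the patch intends. */
static struct xcast_addr_list *list_alloc(size_t n)
{
	struct xcast_addr_list *l;

	/* Guard against overflow of the size computation. */
	assert(n <= (SIZE_MAX - sizeof(*l)) / sizeof(uint64_t));

	l = malloc(list_size_from_sizeof(n));
	if (!l)
		return NULL;
	l->count = 0;
	return l;
}
```

The sizeof-based size can never be smaller than the offset-based one, since
any trailing padding is included in sizeof but not in the member offset.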

Re: [RFC PATCH net-next v5 2/4] net: Introduce generic bypass module

2018-04-06 Thread Jiri Pirko

Thu, Apr 05, 2018 at 11:08:21PM CEST, sridhar.samudr...@intel.com wrote:
>This provides a generic interface for paravirtual drivers to listen
>for netdev register/unregister/link change events from pci ethernet
>devices with the same MAC and takeover their datapath. The notifier and
>event handling code is based on the existing netvsc implementation. A
>paravirtual driver can use this module by registering a set of ops and
>each instance of the device when it is probed.
>
>Signed-off-by: Sridhar Samudrala 
>---
> include/net/bypass.h |  80 ++
> net/Kconfig  |  18 +++
> net/core/Makefile|   1 +
> net/core/bypass.c| 406 +++
> 4 files changed, 505 insertions(+)
> create mode 100644 include/net/bypass.h
> create mode 100644 net/core/bypass.c
>
>diff --git a/include/net/bypass.h b/include/net/bypass.h
>new file mode 100644
>index ..e2dd122f951a
>--- /dev/null
>+++ b/include/net/bypass.h
>@@ -0,0 +1,80 @@
>+// SPDX-License-Identifier: GPL-2.0
>+/* Copyright (c) 2018, Intel Corporation. */
>+
>+#ifndef _NET_BYPASS_H
>+#define _NET_BYPASS_H
>+
>+#include 
>+
>+struct bypass_ops {

Perhaps "net_bypass_" would be better prefix for this module structs
and functions. No strong opinion though.


>+  int (*register_child)(struct net_device *bypass_netdev,
>+struct net_device *child_netdev);

We have master/slave upper/lower netdevices. This adds "child". Consider
using some existing names. Not sure if possible without loss of meaning.


>+  int (*join_child)(struct net_device *bypass_netdev,
>+struct net_device *child_netdev);
>+  int (*unregister_child)(struct net_device *bypass_netdev,
>+  struct net_device *child_netdev);
>+  int (*release_child)(struct net_device *bypass_netdev,
>+   struct net_device *child_netdev);
>+  int (*update_link)(struct net_device *bypass_netdev,
>+ struct net_device *child_netdev);
>+  rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
>+};
>+
>+struct bypass_instance {
>+  struct list_head list;
>+  struct net_device __rcu *bypass_netdev;
>+  struct bypass *bypass;
>+};
>+
>+struct bypass {
>+  struct list_head list;
>+  const struct bypass_ops *ops;
>+  const struct net_device_ops *netdev_ops;
>+  struct list_head instance_list;
>+  struct mutex lock;
>+};
>+
>+#if IS_ENABLED(CONFIG_NET_BYPASS)
>+
>+struct bypass *bypass_register_driver(const struct bypass_ops *ops,
>+const struct net_device_ops *netdev_ops);
>+void bypass_unregister_driver(struct bypass *bypass);
>+
>+int bypass_register_instance(struct bypass *bypass, struct net_device *dev);
>+int bypass_unregister_instance(struct bypass *bypass, struct net_device *dev);
>+
>+int bypass_unregister_child(struct net_device *child_netdev);
>+
>+#else
>+
>+static inline
>+struct bypass *bypass_register_driver(const struct bypass_ops *ops,
>+const struct net_device_ops *netdev_ops)
>+{
>+  return NULL;
>+}
>+
>+static inline void bypass_unregister_driver(struct bypass *bypass)
>+{
>+}
>+
>+static inline int bypass_register_instance(struct bypass *bypass,
>+ struct net_device *dev)
>+{
>+  return 0;
>+}
>+
>+static inline int bypass_unregister_instance(struct bypass *bypass,
>+   struct net_device *dev)
>+{
>+  return 0;
>+}
>+
>+static inline int bypass_unregister_child(struct net_device *child_netdev)
>+{
>+  return 0;
>+}
>+
>+#endif
>+
>+#endif /* _NET_BYPASS_H */
>diff --git a/net/Kconfig b/net/Kconfig
>index 0428f12c25c2..994445f4a96a 100644
>--- a/net/Kconfig
>+++ b/net/Kconfig
>@@ -423,6 +423,24 @@ config MAY_USE_DEVLINK
> on MAY_USE_DEVLINK to ensure they do not cause link errors when
> devlink is a loadable module and the driver using it is built-in.
> 
>+config NET_BYPASS
>+  tristate "Bypass interface"
>+  ---help---
>+This provides a generic interface for paravirtual drivers to listen
>+for netdev register/unregister/link change events from pci ethernet
>+devices with the same MAC and takeover their datapath. This also
>+enables live migration of a VM with direct attached VF by failing
>+over to the paravirtual datapath when the VF is unplugged.
>+
>+config MAY_USE_BYPASS
>+  tristate
>+  default m if NET_BYPASS=m
>+  default y if NET_BYPASS=y || NET_BYPASS=n
>+  help
>+Drivers using the bypass infrastructure should have a dependency
>+on MAY_USE_BYPASS to ensure they do not cause link errors when
>+bypass is a loadable module and the driver using it is built-in.
>+
> endif   # if NET
> 
> # Used by archs to tell that they support BPF JIT compiler

Re: [PATCH net-next] netns: filter uevents correctly

2018-04-06 Thread Christian Brauner

On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> >> On 05.04.2018 17:07, Christian Brauner wrote:
> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> >> >>>
> >> >>> enabled sending hotplug events into all network namespaces back in 2010.
> >> >>> Over time the set of uevents that get sent into all network namespaces has
> >> >>> shrunk. We have now reached the point where hotplug events for all devices
> >> >>> that carry a namespace tag are filtered according to that namespace.
> >> >>>
> >> >>> Specifically, they are filtered whenever the namespace tag of the kobject
> >> >>> does not match the namespace tag of the netlink socket. One example are
> >> >>> network devices. Uevents for network devices only show up in the network
> >> >>> namespaces these devices are moved to or created in.
> >> >>>
> >> >>> However, any uevent for a kobject that does not have a namespace tag
> >> >>> associated with it will not be filtered and we will *try* to broadcast it
> >> >>> into all network namespaces.
> >> >>>
> >> >>> The original patchset was written in 2010 before user namespaces were a
> >> >>> thing. With the introduction of user namespaces sending out uevents became
> >> >>> partially isolated as they were filtered by user namespaces:
> >> >>>
> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >> >>>
> >> >>> if (!net_eq(sock_net(sk), p->net)) {
> >> >>> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >>> return;
> >> >>>
> >> >>> if (!peernet_has_id(sock_net(sk), p->net))
> >> >>> return;
> >> >>>
> >> >>> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >>>  CAP_NET_BROADCAST))
> >> >>> return;
> >> >>> }
> >> >>>
> >> >>> The file_ns_capable() check will check whether the caller had
> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >> >>> namespace of interest. This check is fine in general but seems insufficient
> >> >>> to me when paired with uevents. The reason is that devices always belong to
> >> >>> the initial user namespace so uevents for kobjects that do not carry a
> >> >>> namespace tag should never be sent into another user namespace. This has
> >> >>> been the intention all along. But there's one case where this breaks,
> >> >>> namely if a new user namespace is created by root on the host and an
> >> >>> identity mapping is established between root on the host and root in the
> >> >>> new user namespace. Here's a reproducer:
> >> >>>
> >> >>>  sudo unshare -U --map-root
> >> >>>  udevadm monitor -k
> >> >>>  # Now change to initial user namespace and e.g. do
> >> >>>  modprobe kvm
> >> >>>  # or
> >> >>>  rmmod kvm
> >> >>>
> >> >>> will allow the non-initial user namespace to retrieve all uevents from the
> >> >>> host. This seems very anecdotal given that in the general case user
> >> >>> namespaces do not see any uevents and also can't really do anything useful
> >> >>> with them.
> >> >>>
> >> >>> Additionally, it is now possible to send uevents from userspace. As such we
> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >> >>> namespace of the network namespace of the netlink socket) userspace process
> >> >>> make a decision what uevents should be sent.
> >> >>>
> >> >>> This makes me think that we should simply ensure that uevents for kobjects
> >> >>> that do not carry a namespace tag are *always* filtered by user namespace
> >> >>> in kobj_bcast_filter(). Specifically:
> >> >>> - If the owning user namespace of the uevent socket is not init_user_ns the
> >> >>>   event will always be filtered.
> >> >>> - If the network namespace the uevent socket belongs to was created in the
> >> >>>   initial user namespace but was opened from a non-initial user namespace
> >> >>>   the event will be filtered as well.
> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are now
> >> >>> always only sent to the initial user namespace. The regression potential
> >> >>> for this is near to non-existent since user namespaces can't really do
> >> >>> anything with interesting devices.
> >> >>>
> >> >>> Signed-off-by: Christian Brauner 
> >> >>> ---
> >> >>>  lib/kobject_uevent.c | 10 +-
> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
> >> >>>
> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >>>
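
The two filtering rules proposed in the quoted commit message can be
modelled in a few lines of userspace C. This is only a decision-table
sketch under stated assumptions, not the kobj_bcast_filter() change itself:
struct user_ns_model, struct uevent_sock_model and
filter_untagged_uevent() are hypothetical names, and a user namespace is
reduced to a single "is this init_user_ns" flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a user namespace: only "is it init_user_ns?". */
struct user_ns_model {
	int is_init;
};

/* Hypothetical model of a uevent netlink socket. */
struct uevent_sock_model {
	/* user namespace owning the socket's network namespace */
	const struct user_ns_model *netns_owner;
	/* user namespace the socket was opened from */
	const struct user_ns_model *opener;
};

/*
 * Decide whether a uevent for a kobject WITHOUT a namespace tag should be
 * filtered (dropped) for this socket. Returns 1 to filter, 0 to deliver.
 */
static int filter_untagged_uevent(const struct uevent_sock_model *s)
{
	assert(s != NULL && s->netns_owner != NULL && s->opener != NULL);

	/* Rule 1: network namespace not owned by the initial user namespace. */
	if (!s->netns_owner->is_init)
		return 1;

	/* Rule 2: socket opened from a non-initial user namespace. */
	if (!s->opener->is_init)
		return 1;

	return 0;
}
```

Under this model the reproducer from the commit message falls under rule 2:
the socket lives in a network namespace owned by the initial user namespace
but is opened from the identity-mapped one.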

Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-06 Thread Jiri Pirko

Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudr...@intel.com wrote:
>This patch enables virtio_net to switch over to a VF datapath when a VF
>netdev is present with the same MAC address. It allows live migration
>of a VM with a direct attached VF without the need to setup a bond/team
>between a VF and virtio net device in the guest.
>
>The hypervisor needs to enable only one datapath at any time so that
>packets don't get looped back to the VM over the other datapath. When a VF
>is plugged, the virtio datapath link state can be marked as down. The
>hypervisor needs to unplug the VF device from the guest on the source host
>and reset the MAC filter of the VF to initiate failover of datapath to
>virtio before starting the migration. After the migration is completed,
>the destination hypervisor sets the MAC filter on the VF and plugs it back
>to the guest to switch over to VF datapath.
>
>When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>created that acts as a master device and tracks the state of the 2 lower
>netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>passthru device with the same MAC is registered as 'active' netdev.
>
>This patch is based on the discussion initiated by Jesse on this thread.
>https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
>Signed-off-by: Sridhar Samudrala 
>---
> drivers/net/Kconfig  |   1 +
> drivers/net/virtio_net.c | 612 ++-
> 2 files changed, 612 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
>index 891846655000..9e2cf61fd1c1 100644
>--- a/drivers/net/Kconfig
>+++ b/drivers/net/Kconfig
>@@ -331,6 +331,7 @@ config VETH
> config VIRTIO_NET
>   tristate "Virtio network driver"
>   depends on VIRTIO
>+  depends on MAY_USE_BYPASS
>   ---help---
> This is the virtual network driver for virtio.  It can be used with
> QEMU based VMMs (like KVM or Xen).  Say Y or M.
>diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>index befb5944f3fd..86b2f8f2947d 100644
>--- a/drivers/net/virtio_net.c
>+++ b/drivers/net/virtio_net.c
>@@ -30,8 +30,11 @@
> #include 
> #include 
> #include 
>+#include 
>+#include 
> #include 
> #include 
>+#include 
> 
> static int napi_weight = NAPI_POLL_WEIGHT;
> module_param(napi_weight, int, 0444);
>@@ -206,6 +209,9 @@ struct virtnet_info {
>   u32 speed;
> 
>   unsigned long guest_offloads;
>+
>+  /* upper netdev created when BACKUP feature enabled */
>+  struct net_device __rcu *bypass_netdev;
> };
> 
> struct padded_vnet_hdr {
>@@ -2275,6 +2281,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>   }
> }
> 
>+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
>+size_t len)
>+{
>+  struct virtnet_info *vi = netdev_priv(dev);
>+  int ret;
>+
>+  if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
>+  return -EOPNOTSUPP;
>+
>+  ret = snprintf(buf, len, "_bkup");
>+  if (ret >= len)
>+  return -EOPNOTSUPP;
>+
>+  return 0;
>+}
>+
> static const struct net_device_ops virtnet_netdev = {
>   .ndo_open= virtnet_open,
>   .ndo_stop= virtnet_close,
>@@ -2292,6 +2314,7 @@ static const struct net_device_ops virtnet_netdev = {
>   .ndo_xdp_xmit   = virtnet_xdp_xmit,
>   .ndo_xdp_flush  = virtnet_xdp_flush,
>   .ndo_features_check = passthru_features_check,
>+  .ndo_get_phys_port_name = virtnet_get_phys_port_name,
> };
> 
> static void virtnet_config_changed_work(struct work_struct *work)
>@@ -2689,6 +2712,576 @@ static int virtnet_validate(struct virtio_device *vdev)
>   return 0;
> }
> 
>+/* START of functions supporting VIRTIO_NET_F_BACKUP feature.
>+ * When BACKUP feature is enabled, an additional netdev(bypass netdev)
>+ * is created that acts as a master device and tracks the state of the
>+ * 2 lower netdevs. The original virtio_net netdev is registered as
>+ * 'backup' netdev and a passthru device with the same MAC is registered
>+ * as 'active' netdev.
>+ */
>+
>+/* bypass state maintained when BACKUP feature is enabled */
>+struct virtnet_bypass_info {
>+  /* passthru netdev with same MAC */
>+  struct net_device __rcu *active_netdev;
>+
>+  /* virtio_net netdev */
>+  struct net_device __rcu *backup_netdev;
>+
>+  /* active netdev stats */
>+  struct rtnl_link_stats64 active_stats;
>+
>+  /* backup netdev stats */
>+  struct rtnl_link_stats64 backup_stats;
>+
>+  /* aggregated stats */
>+  struct rtnl_link_stats64 bypass_stats;
>+
>+  /* spinlock while updating stats */
>+  spinlock_t stats_lock;
>+};
>+
>+static int virtnet_bypass_open(struct net_device *dev)
>+{
>+  struct virtnet_bypass_info *vbi = netdev_priv(dev);
>+  struct net_device *active_netdev,

[PATCH] ARM: dts: ls1021a: Specify TBIPA register address

2018-04-06 Thread Esben Haabendal

From: Esben Haabendal 

The current (mildly evil) fsl_pq_mdio code uses an undocumented shadow of
the TBIPA register on LS1021A, which happens to be read-only.
Changing the TBI PHY address therefore does not work on LS1021A.

The real (and documented) TBIPA register lies in the eTSEC block rather
than in the MDIO/MII block, and it is read/write, so using that address
fixes the problem.

Signed-off-by: Esben Haabendal 
---
 arch/arm/boot/dts/ls1021a.dtsi | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/arm/boot/dts/ls1021a.dtsi b/arch/arm/boot/dts/ls1021a.dtsi
index c31dad98f989..728e206009ea 100644
--- a/arch/arm/boot/dts/ls1021a.dtsi
+++ b/arch/arm/boot/dts/ls1021a.dtsi
@@ -587,7 +587,8 @@
device_type = "mdio";
#address-cells = <1>;
#size-cells = <0>;
-   reg = <0x0 0x2d24000 0x0 0x4000>;
+   reg = <0x0 0x2d24000 0x0 0x4000>,
+ <0x0 0x2d10030 0x0 0x4>;
};
 
ptp_clock@2d10e00 {
-- 
2.16.3

[PATCH 1/2] net/fsl_pq_mdio: Allow explicit specification of TBIPA address

2018-04-06 Thread Esben Haabendal

From: Esben Haabendal 

This introduces a simpler and more generic method for finding (and mapping)
the TBIPA register.

Instead of relying on complicated logic for deriving the TBIPA register
address from the MDIO or MII register block base address, which in some
cases even relies on undocumented shadow registers, a second "reg" entry
in the mdio bus devicetree node specifies the TBIPA register.

Backwards compatibility is kept, as the existing logic is applied when
only a single "reg" mapping is specified.
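
The lookup order described above can be sketched in plain userspace C.
This is only an illustration, not the driver code: struct demo_reg,
struct demo_mdio_node, demo_resolve_tbipa() and demo_legacy_infer() are
hypothetical names, and the "+ 0x30" legacy rule is made up for the
example. The addresses are the ones from the ls1021a.dtsi patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one <address size> pair of a "reg" property. */
struct demo_reg {
	uint64_t base;
	uint64_t size;
};

/* Hypothetical stand-in for the mdio node: just its "reg" entries. */
struct demo_mdio_node {
	const struct demo_reg *reg;
	size_t nreg;
};

/* Legacy path: derive the TBIPA address from the MDIO block base. */
typedef uint64_t (*demo_get_tbipa_fn)(uint64_t mdio_base);

/*
 * Prefer the explicit second "reg" entry; fall back to the legacy
 * inference when only one entry is given. Returns 0 when the address
 * cannot be determined at all.
 */
static uint64_t demo_resolve_tbipa(const struct demo_mdio_node *np,
				   demo_get_tbipa_fn infer)
{
	if (np->nreg >= 2)
		return np->reg[1].base;
	if (np->nreg == 1 && infer)
		return infer(np->reg[0].base);
	return 0;
}

/* Made-up legacy rule, for illustration only. */
static uint64_t demo_legacy_infer(uint64_t mdio_base)
{
	return mdio_base + 0x30;
}
```

With both "reg" entries present the explicit address wins; with a single
entry the sketch falls back to the inferred one, which is the backwards
compatible path.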

Signed-off-by: Esben Haabendal 
---
 .../devicetree/bindings/net/fsl-tsec-phy.txt   |  6 ++-
 drivers/net/ethernet/freescale/fsl_pq_mdio.c   | 50 +++---
 2 files changed, 39 insertions(+), 17 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt b/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt
index 594982c6b9f9..79bf352e659c 100644
--- a/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt
+++ b/Documentation/devicetree/bindings/net/fsl-tsec-phy.txt
@@ -6,7 +6,11 @@ the definition of the PHY node in booting-without-of.txt for an example
 of how to define a PHY.
 
 Required properties:
-  - reg : Offset and length of the register set for the device
+  - reg : Offset and length of the register set for the device, and optionally
+  the offset and length of the TBIPA register (TBI PHY address
+ register).  If TBIPA register is not specified, the driver will
+ attempt to infer it from the register set specified (your mileage may
+ vary).
   - compatible : Should define the compatible device type for the
 mdio. Currently supported strings/devices are:
- "fsl,gianfar-tbi"
diff --git a/drivers/net/ethernet/freescale/fsl_pq_mdio.c b/drivers/net/ethernet/freescale/fsl_pq_mdio.c
index 80ad16acf0f1..ac2c3f6a12bc 100644
--- a/drivers/net/ethernet/freescale/fsl_pq_mdio.c
+++ b/drivers/net/ethernet/freescale/fsl_pq_mdio.c
@@ -377,6 +377,38 @@ static const struct of_device_id fsl_pq_mdio_match[] = {
 };
 MODULE_DEVICE_TABLE(of, fsl_pq_mdio_match);
 
+static void set_tbipa(const u32 tbipa_val, struct platform_device *pdev,
+ uint32_t __iomem * (*get_tbipa)(void __iomem *),
+ void __iomem *reg_map, struct resource *reg_res)
+{
+   struct device_node *np = pdev->dev.of_node;
+   uint32_t __iomem *tbipa;
+   bool tbipa_mapped;
+
+   tbipa = of_iomap(np, 1);
+   if (tbipa) {
+   tbipa_mapped = true;
+   } else {
+   tbipa_mapped = false;
+   tbipa = (*get_tbipa)(reg_map);
+
+   /*
+* Add consistency check to make sure TBI is contained within
+* the mapped range (not because we would get a segfault,
+* rather to catch bugs in computing TBI address). Print error
+* message but continue anyway.
+*/
+   if ((void *)tbipa > reg_map + resource_size(reg_res) - 4)
+   dev_err(&pdev->dev, "invalid register map (should be at least 0x%04zx to contain TBI address)\n",
+   ((void *)tbipa - reg_map) + 4);
+   }
+
+   iowrite32be(be32_to_cpu(tbipa_val), tbipa);
+
+   if (tbipa_mapped)
+   iounmap(tbipa);
+}
+
 static int fsl_pq_mdio_probe(struct platform_device *pdev)
 {
const struct of_device_id *id =
@@ -450,8 +482,6 @@ static int fsl_pq_mdio_probe(struct platform_device *pdev)
 
if (tbi) {
const u32 *prop = of_get_property(tbi, "reg", NULL);
-   uint32_t __iomem *tbipa;
-
if (!prop) {
dev_err(&pdev->dev,
"missing 'reg' property in node %pOF\n",
@@ -459,20 +489,8 @@ static int fsl_pq_mdio_probe(struct platform_device *pdev)
err = -EBUSY;
goto error;
}
-
-   tbipa = data->get_tbipa(priv->map);
-
-   /*
-* Add consistency check to make sure TBI is contained
-* within the mapped range (not because we would get a
-* segfault, rather to catch bugs in computing TBI
-* address). Print error message but continue anyway.
-*/
-   if ((void *)tbipa > priv->map + resource_size(&res) - 4)
-   dev_err(&pdev->dev, "invalid register map (should be at least 0x%04zx to contain TBI address)\n",
-   ((void *)tbipa - priv->map) + 4);
-
-   iowrite32be(be32_to_cpup(prop), tbipa);
+   set_tbipa(*prop, pdev,
+ data->get_tbipa, priv->map, &res);
}
}
 
-- 
2.16.3

[PATCH net-next 5/5] net: stmmac: Switch stmmac_mode_ops to generic HW Interface Helpers

2018-04-06 Thread Jose Abreu

Switch stmmac_mode_ops to generic Hardware Interface Helpers instead of
using hard-coded callbacks. This makes the code more readable and more
flexible.

No functional change.

Signed-off-by: Jose Abreu 
Cc: David S. Miller 
Cc: Joao Pinto 
Cc: Giuseppe Cavallaro 
Cc: Alexandre Torgue 
---
 drivers/net/ethernet/stmicro/stmmac/chain_mode.c  | 20 +--
 drivers/net/ethernet/stmicro/stmmac/common.h  | 12 ---
 drivers/net/ethernet/stmicro/stmmac/hwif.h| 27 ++
 drivers/net/ethernet/stmicro/stmmac/ring_mode.c   | 24 ++---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 43 ++-
 5 files changed, 68 insertions(+), 58 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/chain_mode.c b/drivers/net/ethernet/stmicro/stmmac/chain_mode.c
index ca0f9c2..b9c9003 100644
--- a/drivers/net/ethernet/stmicro/stmmac/chain_mode.c
+++ b/drivers/net/ethernet/stmicro/stmmac/chain_mode.c
@@ -24,7 +24,7 @@
 
 #include "stmmac.h"
 
-static int stmmac_jumbo_frm(void *p, struct sk_buff *skb, int csum)
+static int jumbo_frm(void *p, struct sk_buff *skb, int csum)
 {
struct stmmac_tx_queue *tx_q = (struct stmmac_tx_queue *)p;
unsigned int nopaged_len = skb_headlen(skb);
@@ -93,7 +93,7 @@ static int stmmac_jumbo_frm(void *p, struct sk_buff *skb, int csum)
return entry;
 }
 
-static unsigned int stmmac_is_jumbo_frm(int len, int enh_desc)
+static unsigned int is_jumbo_frm(int len, int enh_desc)
 {
unsigned int ret = 0;
 
@@ -105,7 +105,7 @@ static unsigned int stmmac_is_jumbo_frm(int len, int enh_desc)
return ret;
 }
 
-static void stmmac_init_dma_chain(void *des, dma_addr_t phy_addr,
+static void init_dma_chain(void *des, dma_addr_t phy_addr,
  unsigned int size, unsigned int extend_desc)
 {
/*
@@ -135,7 +135,7 @@ static void stmmac_init_dma_chain(void *des, dma_addr_t phy_addr,
}
 }
 
-static void stmmac_refill_desc3(void *priv_ptr, struct dma_desc *p)
+static void refill_desc3(void *priv_ptr, struct dma_desc *p)
 {
struct stmmac_rx_queue *rx_q = (struct stmmac_rx_queue *)priv_ptr;
struct stmmac_priv *priv = rx_q->priv_data;
@@ -151,7 +151,7 @@ static void stmmac_refill_desc3(void *priv_ptr, struct dma_desc *p)
  sizeof(struct dma_desc)));
 }
 
-static void stmmac_clean_desc3(void *priv_ptr, struct dma_desc *p)
+static void clean_desc3(void *priv_ptr, struct dma_desc *p)
 {
struct stmmac_tx_queue *tx_q = (struct stmmac_tx_queue *)priv_ptr;
struct stmmac_priv *priv = tx_q->priv_data;
@@ -169,9 +169,9 @@ static void stmmac_clean_desc3(void *priv_ptr, struct dma_desc *p)
 }
 
 const struct stmmac_mode_ops chain_mode_ops = {
-   .init = stmmac_init_dma_chain,
-   .is_jumbo_frm = stmmac_is_jumbo_frm,
-   .jumbo_frm = stmmac_jumbo_frm,
-   .refill_desc3 = stmmac_refill_desc3,
-   .clean_desc3 = stmmac_clean_desc3,
+   .init = init_dma_chain,
+   .is_jumbo_frm = is_jumbo_frm,
+   .jumbo_frm = jumbo_frm,
+   .refill_desc3 = refill_desc3,
+   .clean_desc3 = clean_desc3,
 };
diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index 7291561..59673c6 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -405,18 +405,6 @@ struct mii_regs {
unsigned int clk_csr_mask;
 };
 
-/* Helpers to manage the descriptors for chain and ring modes */
-struct stmmac_mode_ops {
-   void (*init) (void *des, dma_addr_t phy_addr, unsigned int size,
- unsigned int extend_desc);
-   unsigned int (*is_jumbo_frm) (int len, int ehn_desc);
-   int (*jumbo_frm)(void *priv, struct sk_buff *skb, int csum);
-   int (*set_16kib_bfsize)(int mtu);
-   void (*init_desc3)(struct dma_desc *p);
-   void (*refill_desc3) (void *priv, struct dma_desc *p);
-   void (*clean_desc3) (void *priv, struct dma_desc *p);
-};
-
 struct mac_device_info {
const struct stmmac_ops *mac;
const struct stmmac_desc_ops *desc;
diff --git a/drivers/net/ethernet/stmicro/stmmac/hwif.h b/drivers/net/ethernet/stmicro/stmmac/hwif.h
index e23c0a3..f81ded4 100644
--- a/drivers/net/ethernet/stmicro/stmmac/hwif.h
+++ b/drivers/net/ethernet/stmicro/stmmac/hwif.h
@@ -391,4 +391,31 @@ struct stmmac_hwtimestamp {
 #define stmmac_get_systime(__priv, __args...) \
stmmac_do_void_callback(__priv, ptp, get_systime, __args)
 
+/* Helpers to manage the descriptors for chain and ring modes */
+struct stmmac_mode_ops {
+   void (*init) (void *des, dma_addr_t phy_addr, unsigned int size,
+ unsigned int extend_desc);
+   unsigned int (*is_jumbo_frm) (int len, int ehn_desc);
+   int (*jumbo_frm)(void *priv, struct sk_buff *skb,

[PATCH net-next 2/5] net: stmmac: Switch stmmac_dma_ops to generic HW Interface Helpers

2018-04-06 Thread Jose Abreu

Switch stmmac_dma_ops to generic Hardware Interface Helpers instead of
using hard-coded callbacks. This makes the code more readable and more
flexible.

No functional change.

Signed-off-by: Jose Abreu 
Cc: David S. Miller 
Cc: Joao Pinto 
Cc: Giuseppe Cavallaro 
Cc: Alexandre Torgue 
---
 drivers/net/ethernet/stmicro/stmmac/common.h   |  50 ---
 drivers/net/ethernet/stmicro/stmmac/hwif.h | 106 +++
 .../net/ethernet/stmicro/stmmac/stmmac_ethtool.c   |  14 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  | 147 +
 4 files changed, 172 insertions(+), 145 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index 2c50d8c..b27221b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -381,56 +381,6 @@ struct dma_features {
 extern const struct stmmac_desc_ops enh_desc_ops;
 extern const struct stmmac_desc_ops ndesc_ops;
 
-/* Specific DMA helpers */
-struct stmmac_dma_ops {
-   /* DMA core initialization */
-   int (*reset)(void __iomem *ioaddr);
-   void (*init)(void __iomem *ioaddr, struct stmmac_dma_cfg *dma_cfg,
-u32 dma_tx, u32 dma_rx, int atds);
-   void (*init_chan)(void __iomem *ioaddr,
- struct stmmac_dma_cfg *dma_cfg, u32 chan);
-   void (*init_rx_chan)(void __iomem *ioaddr,
-struct stmmac_dma_cfg *dma_cfg,
-u32 dma_rx_phy, u32 chan);
-   void (*init_tx_chan)(void __iomem *ioaddr,
-struct stmmac_dma_cfg *dma_cfg,
-u32 dma_tx_phy, u32 chan);
-   /* Configure the AXI Bus Mode Register */
-   void (*axi)(void __iomem *ioaddr, struct stmmac_axi *axi);
-   /* Dump DMA registers */
-   void (*dump_regs)(void __iomem *ioaddr, u32 *reg_space);
-   /* Set tx/rx threshold in the csr6 register
-* An invalid value enables the store-and-forward mode */
-   void (*dma_mode)(void __iomem *ioaddr, int txmode, int rxmode,
-int rxfifosz);
-   void (*dma_rx_mode)(void __iomem *ioaddr, int mode, u32 channel,
-   int fifosz, u8 qmode);
-   void (*dma_tx_mode)(void __iomem *ioaddr, int mode, u32 channel,
-   int fifosz, u8 qmode);
-   /* To track extra statistic (if supported) */
-   void (*dma_diagnostic_fr) (void *data, struct stmmac_extra_stats *x,
-  void __iomem *ioaddr);
-   void (*enable_dma_transmission) (void __iomem *ioaddr);
-   void (*enable_dma_irq)(void __iomem *ioaddr, u32 chan);
-   void (*disable_dma_irq)(void __iomem *ioaddr, u32 chan);
-   void (*start_tx)(void __iomem *ioaddr, u32 chan);
-   void (*stop_tx)(void __iomem *ioaddr, u32 chan);
-   void (*start_rx)(void __iomem *ioaddr, u32 chan);
-   void (*stop_rx)(void __iomem *ioaddr, u32 chan);
-   int (*dma_interrupt) (void __iomem *ioaddr,
- struct stmmac_extra_stats *x, u32 chan);
-   /* If supported then get the optional core features */
-   void (*get_hw_feature)(void __iomem *ioaddr,
-  struct dma_features *dma_cap);
-   /* Program the HW RX Watchdog */
-   void (*rx_watchdog)(void __iomem *ioaddr, u32 riwt, u32 number_chan);
-   void (*set_tx_ring_len)(void __iomem *ioaddr, u32 len, u32 chan);
-   void (*set_rx_ring_len)(void __iomem *ioaddr, u32 len, u32 chan);
-   void (*set_rx_tail_ptr)(void __iomem *ioaddr, u32 tail_ptr, u32 chan);
-   void (*set_tx_tail_ptr)(void __iomem *ioaddr, u32 tail_ptr, u32 chan);
-   void (*enable_tso)(void __iomem *ioaddr, bool en, u32 chan);
-};
-
 struct mac_device_info;
 
 /* Helpers to program the MAC core */
diff --git a/drivers/net/ethernet/stmicro/stmmac/hwif.h b/drivers/net/ethernet/stmicro/stmmac/hwif.h
index 4994677..e1a9ae6 100644
--- a/drivers/net/ethernet/stmicro/stmmac/hwif.h
+++ b/drivers/net/ethernet/stmicro/stmmac/hwif.h
@@ -122,4 +122,110 @@ struct stmmac_desc_ops {
 #define stmmac_set_mss(__priv, __args...) \
stmmac_do_void_callback(__priv, desc, set_mss, __args)
 
+struct stmmac_dma_cfg;
+struct dma_features;
+
+/* Specific DMA helpers */
+struct stmmac_dma_ops {
+   /* DMA core initialization */
+   int (*reset)(void __iomem *ioaddr);
+   void (*init)(void __iomem *ioaddr, struct stmmac_dma_cfg *dma_cfg,
+u32 dma_tx, u32 dma_rx, int atds);
+   void (*init_chan)(void __iomem *ioaddr,
+ struct stmmac_dma_cfg *dma_cfg, u32 chan);
+   void (*init_rx_chan)(void __iomem *ioaddr,
+struct stmmac_dma_cfg *dma_cfg,
+u32 dma_rx_phy, u32 chan);
+

[PATCH net-next 1/5] net: stmmac: Switch stmmac_desc_ops to generic HW Interface Helpers

2018-04-06 Thread Jose Abreu

Switch stmmac_desc_ops to generic Hardware Interface Helpers instead of
using hard-coded callbacks. This makes the code more readable and more
flexible.

No functional change.

Signed-off-by: Jose Abreu 
Cc: David S. Miller 
Cc: Joao Pinto 
Cc: Giuseppe Cavallaro 
Cc: Alexandre Torgue 
---
 drivers/net/ethernet/stmicro/stmmac/chain_mode.c   |  14 +--
 drivers/net/ethernet/stmicro/stmmac/common.h   |  55 +
 drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c |   4 +-
 drivers/net/ethernet/stmicro/stmmac/enh_desc.c |   4 +-
 drivers/net/ethernet/stmicro/stmmac/hwif.h | 125 +
 drivers/net/ethernet/stmicro/stmmac/norm_desc.c|   4 +-
 drivers/net/ethernet/stmicro/stmmac/ring_mode.c|  15 +--
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  | 110 +-
 8 files changed, 195 insertions(+), 136 deletions(-)
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/hwif.h

diff --git a/drivers/net/ethernet/stmicro/stmmac/chain_mode.c b/drivers/net/ethernet/stmicro/stmmac/chain_mode.c
index e93c40b..ca0f9c2 100644
--- a/drivers/net/ethernet/stmicro/stmmac/chain_mode.c
+++ b/drivers/net/ethernet/stmicro/stmmac/chain_mode.c
@@ -51,8 +51,8 @@ static int stmmac_jumbo_frm(void *p, struct sk_buff *skb, int csum)
tx_q->tx_skbuff_dma[entry].buf = des2;
tx_q->tx_skbuff_dma[entry].len = bmax;
/* do not close the descriptor and do not set own bit */
-   priv->hw->desc->prepare_tx_desc(desc, 1, bmax, csum, STMMAC_CHAIN_MODE,
-   0, false, skb->len);
+   stmmac_prepare_tx_desc(priv, desc, 1, bmax, csum, STMMAC_CHAIN_MODE,
+   0, false, skb->len);
 
while (len != 0) {
tx_q->tx_skbuff[entry] = NULL;
@@ -68,9 +68,8 @@ static int stmmac_jumbo_frm(void *p, struct sk_buff *skb, int csum)
return -1;
tx_q->tx_skbuff_dma[entry].buf = des2;
tx_q->tx_skbuff_dma[entry].len = bmax;
-   priv->hw->desc->prepare_tx_desc(desc, 0, bmax, csum,
-   STMMAC_CHAIN_MODE, 1,
-   false, skb->len);
+   stmmac_prepare_tx_desc(priv, desc, 0, bmax, csum,
+   STMMAC_CHAIN_MODE, 1, false, skb->len);
len -= bmax;
i++;
} else {
@@ -83,9 +82,8 @@ static int stmmac_jumbo_frm(void *p, struct sk_buff *skb, int csum)
tx_q->tx_skbuff_dma[entry].buf = des2;
tx_q->tx_skbuff_dma[entry].len = len;
/* last descriptor can be set now */
-   priv->hw->desc->prepare_tx_desc(desc, 0, len, csum,
-   STMMAC_CHAIN_MODE, 1,
-   true, skb->len);
+   stmmac_prepare_tx_desc(priv, desc, 0, len, csum,
+   STMMAC_CHAIN_MODE, 1, true, skb->len);
len = 0;
}
}
diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index ad2388a..2c50d8c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -32,6 +32,7 @@
 #endif
 
 #include "descs.h"
+#include "hwif.h"
 #include "mmc.h"
 
 /* Synopsys Core versions */
@@ -377,60 +378,6 @@ struct dma_features {
 
 #define JUMBO_LEN  9000
 
-/* Descriptors helpers */
-struct stmmac_desc_ops {
-   /* DMA RX descriptor ring initialization */
-   void (*init_rx_desc) (struct dma_desc *p, int disable_rx_ic, int mode,
- int end);
-   /* DMA TX descriptor ring initialization */
-   void (*init_tx_desc) (struct dma_desc *p, int mode, int end);
-
-   /* Invoked by the xmit function to prepare the tx descriptor */
-   void (*prepare_tx_desc) (struct dma_desc *p, int is_fs, int len,
-bool csum_flag, int mode, bool tx_own,
-bool ls, unsigned int tot_pkt_len);
-   void (*prepare_tso_tx_desc)(struct dma_desc *p, int is_fs, int len1,
-   int len2, bool tx_own, bool ls,
-   unsigned int tcphdrlen,
-   unsigned int tcppayloadlen);
-   /* Set/get the owner of the descriptor */
-   void (*set_tx_owner) (struct dma_desc *p);
-   int (*get_tx_owner) (struct dma_desc *p);
-   /* Clean the tx descriptor as soon as the tx irq is received */
-   void (*release_tx_desc) (struct dma_desc *p, int mode);
-   /* Clear interrupt on tx

[PATCH net-next 0/5] net: stmmac: Stop using hard-coded callbacks

2018-04-06 Thread Jose Abreu

This is a starting point for a cleanup and re-organization of stmmac.

In this series we stop using hard-coded callbacks throughout the code and
instead use helpers which are defined in a single place ("hwif.h").

This brings several advantages:
1) Less typing :)
2) Guaranteed function pointer check
3) More flexibility

By 2) we stop using the repeated pattern of:
if (priv->hw->mac->some_func)
priv->hw->mac->some_func(...)

I didn't check, but I expect the final .ko will be bigger with this series
because *all* of the function pointers are checked.

Anyway, I hope this can make the code more readable and more flexible now.
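
As a rough illustration of point 2), the helper idea can be sketched in
plain userspace C. This is a simplified sketch and not the contents of
hwif.h: the demo_* names, the two-member ops table and the choice of
-EINVAL for a missing callback are all assumptions made for the example.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical ops table, standing in for one of the stmmac_*_ops structs. */
struct demo_ops {
	void (*set_mac)(int *reg, int enable);	/* callback without a result */
	int (*rx_ipc)(int *reg);		/* callback with a result */
};

struct demo_priv {
	const struct demo_ops *mac;
	int reg;
};

/* Call a void callback if it exists: 0 on success, -EINVAL if missing. */
#define demo_do_void_callback(priv, module, cname, ...) \
	(((priv)->module && (priv)->module->cname) ? \
	 ((priv)->module->cname(__VA_ARGS__), 0) : -EINVAL)

/* Call an int callback if it exists and pass its result on, else -EINVAL. */
#define demo_do_callback(priv, module, cname, ...) \
	(((priv)->module && (priv)->module->cname) ? \
	 (priv)->module->cname(__VA_ARGS__) : -EINVAL)

/* One thin wrapper per callback, all defined in a single place. */
#define demo_set_mac(priv, ...) \
	demo_do_void_callback(priv, mac, set_mac, __VA_ARGS__)
#define demo_rx_ipc(priv, ...) \
	demo_do_callback(priv, mac, rx_ipc, __VA_ARGS__)

static void demo_set_mac_impl(int *reg, int enable)
{
	*reg = enable ? 1 : 0;
}

/* This table deliberately leaves rx_ipc unset. */
static const struct demo_ops demo_partial_ops = {
	.set_mac = demo_set_mac_impl,
	.rx_ipc = NULL,
};
```

A caller then writes demo_set_mac(&priv, &priv.reg, 1) and gets the NULL
check for free, instead of repeating the "if (ops->fn) ops->fn(...)"
pattern at every call site.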

Cc: David S. Miller 
Cc: Joao Pinto 
Cc: Giuseppe Cavallaro 
Cc: Alexandre Torgue 

Jose Abreu (5):
  net: stmmac: Switch stmmac_desc_ops to generic HW Interface Helpers
  net: stmmac: Switch stmmac_dma_ops to generic HW Interface Helpers
  net: stmmac: Switch stmmac_ops to generic HW Interface Helpers
  net: stmmac: Switch stmmac_hwtimestamp to generic HW Interface Helpers
  net: stmmac: Switch stmmac_mode_ops to generic HW Interface Helpers

 drivers/net/ethernet/stmicro/stmmac/chain_mode.c   |  34 +-
 drivers/net/ethernet/stmicro/stmmac/common.h   | 199 +-
 drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c |   4 +-
 drivers/net/ethernet/stmicro/stmmac/dwmac5.c   |  19 +-
 drivers/net/ethernet/stmicro/stmmac/dwmac5.h   |   6 +-
 drivers/net/ethernet/stmicro/stmmac/enh_desc.c |   4 +-
 drivers/net/ethernet/stmicro/stmmac/hwif.h | 421 
 drivers/net/ethernet/stmicro/stmmac/norm_desc.c|   4 +-
 drivers/net/ethernet/stmicro/stmmac/ring_mode.c|  39 +-
 .../net/ethernet/stmicro/stmmac/stmmac_ethtool.c   |  82 ++--
 .../net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c  |  34 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  | 439 +
 drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c   |  18 +-
 13 files changed, 726 insertions(+), 577 deletions(-)
 create mode 100644 drivers/net/ethernet/stmicro/stmmac/hwif.h

-- 
2.9.3

[PATCH net-next 3/5] net: stmmac: Switch stmmac_ops to generic HW Interface Helpers

2018-04-06 Thread Jose Abreu

Switch stmmac_ops to generic Hardware Interface Helpers instead of using
hard-coded callbacks. This makes the code more readable and more
flexible.

No functional change.

Signed-off-by: Jose Abreu 
Cc: David S. Miller 
Cc: Joao Pinto 
Cc: Giuseppe Cavallaro 
Cc: Alexandre Torgue 
---
 drivers/net/ethernet/stmicro/stmmac/common.h   |  70 ---
 drivers/net/ethernet/stmicro/stmmac/dwmac5.c   |  19 +--
 drivers/net/ethernet/stmicro/stmmac/dwmac5.h   |   6 +-
 drivers/net/ethernet/stmicro/stmmac/hwif.h | 138 +
 .../net/ethernet/stmicro/stmmac/stmmac_ethtool.c   |  68 --
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  | 122 +-
 6 files changed, 235 insertions(+), 188 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index b27221b..0e0b6f1 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -383,76 +383,6 @@ extern const struct stmmac_desc_ops ndesc_ops;
 
 struct mac_device_info;
 
-/* Helpers to program the MAC core */
-struct stmmac_ops {
-   /* MAC core initialization */
-   void (*core_init)(struct mac_device_info *hw, struct net_device *dev);
-   /* Enable the MAC RX/TX */
-   void (*set_mac)(void __iomem *ioaddr, bool enable);
-   /* Enable and verify that the IPC module is supported */
-   int (*rx_ipc)(struct mac_device_info *hw);
-   /* Enable RX Queues */
-   void (*rx_queue_enable)(struct mac_device_info *hw, u8 mode, u32 queue);
-   /* RX Queues Priority */
-   void (*rx_queue_prio)(struct mac_device_info *hw, u32 prio, u32 queue);
-   /* TX Queues Priority */
-   void (*tx_queue_prio)(struct mac_device_info *hw, u32 prio, u32 queue);
-   /* RX Queues Routing */
-   void (*rx_queue_routing)(struct mac_device_info *hw, u8 packet,
-u32 queue);
-   /* Program RX Algorithms */
-   void (*prog_mtl_rx_algorithms)(struct mac_device_info *hw, u32 rx_alg);
-   /* Program TX Algorithms */
-   void (*prog_mtl_tx_algorithms)(struct mac_device_info *hw, u32 tx_alg);
-   /* Set MTL TX queues weight */
-   void (*set_mtl_tx_queue_weight)(struct mac_device_info *hw,
-   u32 weight, u32 queue);
-   /* RX MTL queue to RX dma mapping */
-   void (*map_mtl_to_dma)(struct mac_device_info *hw, u32 queue, u32 chan);
-   /* Configure AV Algorithm */
-   void (*config_cbs)(struct mac_device_info *hw, u32 send_slope,
-  u32 idle_slope, u32 high_credit, u32 low_credit,
-  u32 queue);
-   /* Dump MAC registers */
-   void (*dump_regs)(struct mac_device_info *hw, u32 *reg_space);
-   /* Handle extra events on specific interrupts hw dependent */
-   int (*host_irq_status)(struct mac_device_info *hw,
-  struct stmmac_extra_stats *x);
-   /* Handle MTL interrupts */
-   int (*host_mtl_irq_status)(struct mac_device_info *hw, u32 chan);
-   /* Multicast filter setting */
-   void (*set_filter)(struct mac_device_info *hw, struct net_device *dev);
-   /* Flow control setting */
-   void (*flow_ctrl)(struct mac_device_info *hw, unsigned int duplex,
- unsigned int fc, unsigned int pause_time, u32 tx_cnt);
-   /* Set power management mode (e.g. magic frame) */
-   void (*pmt)(struct mac_device_info *hw, unsigned long mode);
-   /* Set/Get Unicast MAC addresses */
-   void (*set_umac_addr)(struct mac_device_info *hw, unsigned char *addr,
- unsigned int reg_n);
-   void (*get_umac_addr)(struct mac_device_info *hw, unsigned char *addr,
- unsigned int reg_n);
-   void (*set_eee_mode)(struct mac_device_info *hw,
-bool en_tx_lpi_clockgating);
-   void (*reset_eee_mode)(struct mac_device_info *hw);
-   void (*set_eee_timer)(struct mac_device_info *hw, int ls, int tw);
-   void (*set_eee_pls)(struct mac_device_info *hw, int link);
-   void (*debug)(void __iomem *ioaddr, struct stmmac_extra_stats *x,
- u32 rx_queues, u32 tx_queues);
-   /* PCS calls */
-   void (*pcs_ctrl_ane)(void __iomem *ioaddr, bool ane, bool srgmi_ral,
-bool loopback);
-   void (*pcs_rane)(void __iomem *ioaddr, bool restart);
-   void (*pcs_get_adv_lp)(void __iomem *ioaddr, struct rgmii_adv *adv);
-   /* Safety Features */
-   int (*safety_feat_config)(void __iomem *ioaddr, unsigned int asp);
-   bool (*safety_feat_irq_status)(struct net_device *ndev,
-   void __iomem *ioaddr, unsigned int asp,
-   struct stmmac_safety_stats

[PATCH net-next 4/5] net: stmmac: Switch stmmac_hwtimestamp to generic HW Interface Helpers

2018-04-06 Thread Jose Abreu

Switch stmmac_hwtimestamp to generic Hardware Interface Helpers instead
of using hard-coded callbacks. This makes the code more readable and
more flexible.

No functional change.

Signed-off-by: Jose Abreu 
Cc: David S. Miller 
Cc: Joao Pinto 
Cc: Giuseppe Cavallaro 
Cc: Alexandre Torgue 
---
 drivers/net/ethernet/stmicro/stmmac/common.h   | 12 
 drivers/net/ethernet/stmicro/stmmac/hwif.h | 25 
 .../net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c  | 34 --
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  | 17 +--
 drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c   | 18 
 5 files changed, 56 insertions(+), 50 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index 0e0b6f1..7291561 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -383,18 +383,6 @@ extern const struct stmmac_desc_ops ndesc_ops;
 
 struct mac_device_info;
 
-/* PTP and HW Timer helpers */
-struct stmmac_hwtimestamp {
-   void (*config_hw_tstamping) (void __iomem *ioaddr, u32 data);
-   u32 (*config_sub_second_increment)(void __iomem *ioaddr, u32 ptp_clock,
-  int gmac4);
-   int (*init_systime) (void __iomem *ioaddr, u32 sec, u32 nsec);
-   int (*config_addend) (void __iomem *ioaddr, u32 addend);
-   int (*adjust_systime) (void __iomem *ioaddr, u32 sec, u32 nsec,
-  int add_sub, int gmac4);
-u64(*get_systime) (void __iomem *ioaddr);
-};
-
 extern const struct stmmac_hwtimestamp stmmac_ptp;
 extern const struct stmmac_mode_ops dwmac4_ring_mode_ops;
 
diff --git a/drivers/net/ethernet/stmicro/stmmac/hwif.h b/drivers/net/ethernet/stmicro/stmmac/hwif.h
index 9575135..e23c0a3 100644
--- a/drivers/net/ethernet/stmicro/stmmac/hwif.h
+++ b/drivers/net/ethernet/stmicro/stmmac/hwif.h
@@ -366,4 +366,29 @@ struct stmmac_ops {
 #define stmmac_safety_feat_dump(__priv, __args...) \
stmmac_do_callback(__priv, mac, safety_feat_dump, __args)
 
+/* PTP and HW Timer helpers */
+struct stmmac_hwtimestamp {
+   void (*config_hw_tstamping) (void __iomem *ioaddr, u32 data);
+   void (*config_sub_second_increment)(void __iomem *ioaddr, u32 ptp_clock,
+  int gmac4, u32 *ssinc);
+   int (*init_systime) (void __iomem *ioaddr, u32 sec, u32 nsec);
+   int (*config_addend) (void __iomem *ioaddr, u32 addend);
+   int (*adjust_systime) (void __iomem *ioaddr, u32 sec, u32 nsec,
+  int add_sub, int gmac4);
+   void (*get_systime) (void __iomem *ioaddr, u64 *systime);
+};
+
+#define stmmac_config_hw_tstamping(__priv, __args...) \
+   stmmac_do_void_callback(__priv, ptp, config_hw_tstamping, __args)
+#define stmmac_config_sub_second_increment(__priv, __args...) \
+   stmmac_do_void_callback(__priv, ptp, config_sub_second_increment, __args)
+#define stmmac_init_systime(__priv, __args...) \
+   stmmac_do_callback(__priv, ptp, init_systime, __args)
+#define stmmac_config_addend(__priv, __args...) \
+   stmmac_do_callback(__priv, ptp, config_addend, __args)
+#define stmmac_adjust_systime(__priv, __args...) \
+   stmmac_do_callback(__priv, ptp, adjust_systime, __args)
+#define stmmac_get_systime(__priv, __args...) \
+   stmmac_do_void_callback(__priv, ptp, get_systime, __args)
+
 #endif /* __STMMAC_HWIF_H__ */
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c
index 08c19eb..8d9cc21 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_hwtstamp.c
@@ -24,13 +24,13 @@
 #include "common.h"
 #include "stmmac_ptp.h"
 
-static void stmmac_config_hw_tstamping(void __iomem *ioaddr, u32 data)
+static void config_hw_tstamping(void __iomem *ioaddr, u32 data)
 {
writel(data, ioaddr + PTP_TCR);
 }
 
-static u32 stmmac_config_sub_second_increment(void __iomem *ioaddr,
- u32 ptp_clock, int gmac4)
+static void config_sub_second_increment(void __iomem *ioaddr,
+   u32 ptp_clock, int gmac4, u32 *ssinc)
 {
u32 value = readl(ioaddr + PTP_TCR);
unsigned long data;
@@ -57,10 +57,11 @@ static u32 stmmac_config_sub_second_increment(void __iomem *ioaddr,
 
writel(reg_value, ioaddr + PTP_SSIR);
 
-   return data;
+   if (ssinc)
+   *ssinc = data;
 }
 
-static int stmmac_init_systime(void __iomem *ioaddr, u32 sec, u32 nsec)
+static int init_systime(void __iomem *ioaddr, u32 sec, u32 nsec)
 {
int limit;
u32 value;
@@ -85,7 +86,7 @@ static int stmmac_init_systime(void __iomem *ioaddr, u32 sec, u32 nsec)
return 0;

[iovisor-dev] Best userspace programming API for XDP features query to kernel?

2018-04-06 Thread Daniel Borkmann

On 04/05/2018 10:51 PM, Jesper Dangaard Brouer wrote:
> On Thu, 5 Apr 2018 12:37:19 +0200
> Daniel Borkmann  wrote:
> 
>> On 04/04/2018 02:28 PM, Jesper Dangaard Brouer via iovisor-dev wrote:
>>> Hi Suricata people,
>>>
>>> When Eric Leblond (and I helped) integrated XDP in Suricata, we ran
>>> into the issue that at Suricata load/start time, we cannot determine
>>> if the chosen XDP config options, like xdp-cpu-redirect[1], are valid on
>>> this HW (e.g. requires driver XDP_REDIRECT support and bpf cpumap).
>>>
>>> We would have liked a way to report that the suricata.yaml config was
>>> invalid for this hardware/setup.  Now, it just loads, and packets get
>>> silently dropped by XDP (well a WARN_ONCE and catchable via tracepoints).
>>>
>>> My question to suricata developers: (Q1) Do you already have code that
>>> queries the kernel or drivers for features?
>>>
>>> At the IOvisor call (2 weeks ago), we discussed two options for exposing
>>> the XDP features available in a given driver.
>>>
>>> Option#1: Extend existing ethtool -k/-K "offload and other features"
>>> with some XDP features, that userspace can query. (Do you already query
>>> offloads, regarding Q1)
>>>
>>> Option#2: Invent a new 'ip link set xdp' netlink msg with a query option.  
>>
>> I don't really mind if you go via ethtool, as long as we handle this
>> generically from there and e.g. call the dev's ndo_bpf handler such that
>> we keep all the information in one place. This can be a new ndo_bpf command
>> e.g. XDP_QUERY_FEATURES or such.
> 
> Just to be clear: notice that, as Victor points out[2], they go through
> the IOCTL (SIOCETHTOOL) programmatically and do not use cmdline tools.

Sure, that was perfectly clear. (But at the same time if you extend the
ioctl, it's obvious to also add support to actual ethtool cmdline tool.)

> [2] https://github.com/OISF/suricata/blob/master/src/util-ioctl.c#L326
> 
> If you want everything to go through the driver's ndo_bpf call anyway
> (whose userspace API is netlink based) then at what point do you

Not really, that's the front end. ndo_bpf itself is a plain netdev op
and has no real tie to netlink.

> want drivers to call their own ndo_bpf, when activated through their
> ethtool_ops? (Sorry, but I don't follow the flow you are proposing)
> 
> Notice, I'm not directly against using the driver's ndo_bpf call.  I can
> see it does provide the kernel more flexibility than the ethtool IOCTL.

What I was saying is that even if you go via ethtool ioctl api, where
you end up in dev_ethtool() and have some new ETHTOOL_* query command,
then instead of adding a new ethtool_ops callback, we can and should
reuse ndo_bpf from there.
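
To make the proposed flow concrete, here is a small userspace sketch of
"ethtool front end, ndo_bpf back end". Everything in it is hypothetical:
XDP_QUERY_FEATURES is only a suggestion in this thread, and the demo_*
types, feature bits and function names are invented for illustration and
do not exist in the kernel.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command codes; the query command is only a proposal here. */
enum demo_bpf_command {
	DEMO_XDP_SETUP_PROG,
	DEMO_XDP_QUERY_FEATURES,
};

/* Hypothetical feature bits. */
#define DEMO_XDP_F_REDIRECT	(1u << 0)
#define DEMO_XDP_F_META		(1u << 1)

struct demo_netdev_bpf {
	enum demo_bpf_command command;
	uint32_t features;	/* filled in by the driver on a query */
};

struct demo_net_device {
	int (*ndo_bpf)(struct demo_net_device *dev,
		       struct demo_netdev_bpf *bpf);
};

/*
 * ethtool-side entry point: no new ethtool_ops hook is added, the query
 * is simply forwarded to the driver's existing ndo_bpf handler.
 */
static int demo_ethtool_get_xdp_features(struct demo_net_device *dev,
					 uint32_t *features)
{
	struct demo_netdev_bpf bpf = {
		.command = DEMO_XDP_QUERY_FEATURES,
		.features = 0,
	};
	int err;

	if (!dev->ndo_bpf)
		return -EOPNOTSUPP;

	err = dev->ndo_bpf(dev, &bpf);
	if (err)
		return err;

	*features = bpf.features;
	return 0;
}

/* Example driver handler that supports meta data but not redirect. */
static int demo_driver_bpf(struct demo_net_device *dev,
			   struct demo_netdev_bpf *bpf)
{
	(void)dev;

	switch (bpf->command) {
	case DEMO_XDP_QUERY_FEATURES:
		bpf->features = DEMO_XDP_F_META;
		return 0;
	default:
		return -EINVAL;
	}
}
```

The point of the shape is that the driver keeps all XDP knowledge in one
handler, and the ethtool and netlink front ends can both reach it.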

[...]
> Here, I want to discuss how drivers expose/tell userspace that they
> support a given feature: Specifically a bit for: XDP_REDIRECT action
> support.
> 
>> Same for meta data,
> 
> Well, not really.  It would be a "nice-to-have", but not strictly
> needed as a feature bit.  XDP meta-data is controlled via a helper.
> And the BPF-prog can detect/see at runtime that the helper bpf_xdp_adjust_meta
> returns -ENOTSUPP (and needs to check the ret value anyhow).  Thus,
> there is not that much gained by exposing this to be detected at setup
> time, as all drivers should eventually support this, and we can detect
> it at runtime.
> 
> The missing XDP_REDIRECT action feature bit is different, as the
> BPF-prog cannot detect at runtime that this is an unsupported action.
> Plus, at setup time we cannot query the driver for supported XDP actions.

Ok, so with the example of meta data, you're arguing that it's okay
to load a native XDP program onto a driver, and run actual traffic on
the NIC in order to probe for the availability of the feature when you're
saying that it "can detect/see [at] runtime". I totally agree with you
that all drivers should eventually support this (same with XDP_REDIRECT),
but today there are even differences in drivers on bpf_xdp_adjust_meta()/
bpf_xdp_adjust_head() with regards to how much headroom they have available,
etc (e.g. some of them have none), so right now you can either go and
read the code or do a runtime test with running actual traffic through
the NIC to check whether your BPF prog is supported or not. Theoretically,
you can do the same runtime test with XDP_REDIRECT (taking the warn in
bpf_warn_invalid_xdp_action() aside for a moment), but you do have the
trace_xdp_exception() tracepoint to figure it out. Yes, it's a painful
hassle, but overall it's not as different as you were trying to argue
here. For /both/ cases it would be nice to know at setup time whether
this would be supported or not. Hence, such query is not just limited to
XDP_REDIRECT alone. Eventually once such interface is agreed upon,
undoubtedly the list of feature bits will grow is what I'm trying to say;
only arguing on the XDP_REDIRECT here would be short term.

[...]
>> What about keeping this high level to users? E.g. say you have 2 options
>> that drivers can expose as

Re: WARNING: CPU: 3 PID: 0 at net/sched/sch_hfsc.c:1388 hfsc_dequeue+0x319/0x350 [sch_hfsc]

2018-04-06 Thread Marco Berizzi

> On 19 March 2018 at 11:07, Jamal Hadi Salim wrote:
> 
> On 18-03-15 08:48 PM, Cong Wang wrote:
> 
> > On Wed, Mar 14, 2018 at 1:10 AM, Marco Berizzi  wrote:
> > 
> > > > On 9 March 2018 at 0:14, Cong Wang wrote:
> > 
> > It has been reported here:
> > https://bugzilla.kernel.org/show_bug.cgi?id=109581
> > 
> > And there is a workaround from Konstantin:
> > https://patchwork.ozlabs.org/patch/803885/
> > 
> > Unfortunately I don't think that is a real fix, we probably need to
> > fix HFSC itself rather than just workaround the qlen==0. It is not
> > trivial since HFSC implementation is not easy to understand.
> > Maybe Jamal knows better than me.
> 
> Sorry for the latency - I looked at this on the plane and it is very
> specific to fq/codel. It is not clear to me why codel needs this but
> i note it has been there from the initial commit and from that
> perspective the patch looks reasonable. In any case:
> Punting it to Eric (on Cc).
> 
> cheers,
> jamal

Hello everyone,

About this bugzilla report https://bugzilla.kernel.org/show_bug.cgi?id=109581

I'm getting this error after 4.16-rc4 (also 4.16.0 is affected).
Till 4.15.7 I did not get this error message (linux 4.15.[1,2,3,4,5,6,7] is 
fine).

The bugzilla report is about linux 4.3.3

Re: [PATCH 2/2] net: phy: dp83640: Read strapped configuration settings

2018-04-06 Thread Esben Haabendal

David Miller  writes:

> From: Andrew Lunn 
> Date: Thu, 5 Apr 2018 22:40:49 +0200
>
>> Or could it still contain whatever state the last boot of Linux, or
>> maybe the bootloader, left the PHY in?
>
> Right, this is my concern as well.

I don't think that should happen.
With config_init() being called (in phy_init_hw()) after soft_reset(),
any state set by software should be cleared.

From the DP83620 datasheet description of what happens when BMCR_RESET is
set:

The software reset will reset the device such that all registers
will be reset to default values and the hardware configuration
values will be maintained.

But something else that could be a concern is the risk that there are
boards out there with wrong hardware configuration, which work with
current Linux (because it ignores hardware configuration).  Such designs
could break with this patch.

If we need to safeguard against that, maybe we could just keep the
genphy_read_config() function in the kernel, and let board specific code
use it as a phy_fixup where hardware configuration is to be respected.

Would that be a better approach?

/Esben

Re: [PATCH v2] net: thunderx: rework mac addresses list to u64 array

2018-04-06 Thread Robin Murphy


On 06/04/18 12:14, Vadim Lomovtsev wrote:

From: Vadim Lomovtsev 

It is too expensive to pass u64 values via linked list, instead
allocate array for them by overall number of mac addresses from netdev.

This eventually removes multiple kmalloc() calls, avoids memory
fragmentation and allows a single null check on the kmalloc
return value in order to prevent a potential null pointer dereference.

Addresses-Coverity-ID: 1467429 ("Dereference null return value")
Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for 
VF")
Signed-off-by: Vadim Lomovtsev 
---
Changes from v1 to v2:
  - C99 syntax: update xcast_addr_list struct field mc[0] -> mc[];

---
  drivers/net/ethernet/cavium/thunder/nic.h|  7 +-
  drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 +---
  2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nic.h 
b/drivers/net/ethernet/cavium/thunder/nic.h
index 5fc46c5a4f36..448d1fafc827 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -265,14 +265,9 @@ struct nicvf_drv_stats {
  
  struct cavium_ptp;
  
-struct xcast_addr {

-   struct list_head list;
-   u64  addr;
-};
-
  struct xcast_addr_list {
-   struct list_head list;
int  count;
+   u64  mc[];
  };
  
  struct nicvf_work {

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 1e9a31fef729..a26d8bc92e01 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct 
*work_arg)
  work.work);
struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
union nic_mbx mbx = {};
-   struct xcast_addr *xaddr, *next;
+   u8 idx = 0;
  
  	if (!vf_work)

return;
@@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct 
*work_arg)
/* check if we have any specific MACs to be added to PF DMAC filter */
if (vf_work->mc) {
/* now go through kernel list of MACs and add them one by one */
-   list_for_each_entry_safe(xaddr, next,
-    &vf_work->mc->list, list) {
+   for (idx = 0; idx < vf_work->mc->count; idx++) {
mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST;
-   mbx.xcast.data.mac = xaddr->addr;
+   mbx.xcast.data.mac = vf_work->mc->mc[idx];
nicvf_send_msg_to_pf(nic, &mbx);
-
-   /* after receiving ACK from PF release memory */
-   list_del(&xaddr->list);
-   kfree(xaddr);
-   vf_work->mc->count--;
}
kfree(vf_work->mc);
}
@@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
mode |= BGX_XCAST_MCAST_FILTER;
/* here we need to copy mc addrs */
if (netdev_mc_count(netdev)) {
-   struct xcast_addr *xaddr;
-
-   mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
-   INIT_LIST_HEAD(&mc_list->list);
+   mc_list = kmalloc(sizeof(*mc_list) +
+ sizeof(u64) * 
netdev_mc_count(netdev),


FWIW if you really wanted to disambiguate that it's a structure and not 
just an array being allocated, then it could be expressed without 
explicit arithmetic:


size = offsetof(typeof(*mc_list), mc[netdev_mc_count(netdev)]);

but that's probably just a matter of personal preference at this point.

Robin.


+ GFP_ATOMIC);
+   if (unlikely(!mc_list))
+   return;
+   mc_list->count = 0;
netdev_hw_addr_list_for_each(ha, &netdev->mc) {
-   xaddr = kmalloc(sizeof(*xaddr),
-   GFP_ATOMIC);
-   xaddr->addr =
+   mc_list->mc[mc_list->count] =
ether_addr_to_u64(ha->addr);
-   list_add_tail(&xaddr->list,
- &mc_list->list);
mc_list->count++;
}
}
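
The two ways of sizing the allocation discussed above (explicit arithmetic versus offsetof() on the flexible array member) can be compared in a small userspace sketch. This is only an illustration under stated assumptions: uint64_t stands in for the kernel's u64, malloc() stands in for kmalloc(..., GFP_ATOMIC), and the helper names list_size_arith(), list_size_offsetof() and list_alloc() are hypothetical and not part of the driver. A run-time index inside offsetof() relies on a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-in for the reworked struct from nic.h (u64 -> uint64_t). */
struct xcast_addr_list {
	int      count;
	uint64_t mc[];	/* C99 flexible array member */
};

/* Size as computed in the patch: struct header plus n array slots. */
static size_t list_size_arith(size_t n)
{
	return sizeof(struct xcast_addr_list) + sizeof(uint64_t) * n;
}

/*
 * Size as suggested in the review: offset of the one-past-the-end element.
 * Using a run-time index here is a GCC/Clang extension.
 */
static size_t list_size_offsetof(size_t n)
{
	return offsetof(struct xcast_addr_list, mc[n]);
}

/* Allocate room for n addresses; malloc() stands in for kmalloc(). */
static struct xcast_addr_list *list_alloc(size_t n)
{
	struct xcast_addr_list *l;

	/* For this particular struct both computations agree. */
	assert(list_size_arith(n) == list_size_offsetof(n));

	l = malloc(list_size_offsetof(n));
	if (!l)
		return NULL;	/* single NULL check, as in the patch */
	l->count = 0;
	return l;
}
```

For this struct the two sizes are equal because mc[] starts exactly at sizeof(struct xcast_addr_list). For structs with trailing padding after the last fixed member, the sizeof-based form can be slightly larger, which is harmless for allocation.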

Re: [PATCH v2] net: thunderx: rework mac addresses list to u64 array

2018-04-06 Thread Vadim Lomovtsev

On Fri, Apr 06, 2018 at 06:47:26AM -0500, Gustavo A. R. Silva wrote:
> 
> 
> On 04/06/2018 06:43 AM, Vadim Lomovtsev wrote:
> > Hi Gustavo,
> > 
> > On Fri, Apr 06, 2018 at 06:36:28AM -0500, Gustavo A. R. Silva wrote:
> > > Hi Vadim,
> > > 
> > > On 04/06/2018 06:14 AM, Vadim Lomovtsev wrote:
> > > > From: Vadim Lomovtsev 
> > > > 
> > > > It is too expensive to pass u64 values via linked list, instead
> > > > allocate array for them by overall number of mac addresses from netdev.
> > > > 
> > > > This eventually removes multiple kmalloc() calls, avoids memory
> > > > fragmentation and allows a single null check on the kmalloc
> > > > return value in order to prevent a potential null pointer dereference.
> > > > 
> > > > Addresses-Coverity-ID: 1467429 ("Dereference null return value")
> > > > Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback 
> > > > implementation for VF")
> > > 
> > > It'd be nice if you add:
> > > 
> > > Reported-by: Gustavo A. R. Silva 
> > 
> > Ok, I could do that.
> > 
> > Just to explain: I didn't do it yet since I got this report from
> > Dan Carpenter initially
> > (https://www.spinics.net/lists/linux-kernel-janitors/msg40720.html)
> > and was working on it when I found your patches, so I asked you to give
> > me some time to prepare and test an update to the driver.
> > 
> 
> Oh I got it. No big deal. I think in this case you should add Dan's
> Reported-by instead.

Ok then.

> 
> BTW nice patch. :)
>

Thank you for reviewing it.

WBR,
Vadim

> Thanks
> --
> Gustavo
> 
> > So would it be acceptable to put two 'Reported-by:' tags then?
> > 
> > WBR,
> > Vadim
> > 
> > > 
> > > Thanks
> > > --
> > > Gustavo
> > > 
> > > > Signed-off-by: Vadim Lomovtsev 
> > > > ---
> > > > Changes from v1 to v2:
> > > >- C99 syntax: update xcast_addr_list struct field mc[0] -> mc[];
> > > > 
> > > > ---
> > > >drivers/net/ethernet/cavium/thunder/nic.h|  7 +-
> > > >drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 
> > > > +---
> > > >2 files changed, 11 insertions(+), 24 deletions(-)
> > > > 
> > > > diff --git a/drivers/net/ethernet/cavium/thunder/nic.h 
> > > > b/drivers/net/ethernet/cavium/thunder/nic.h
> > > > index 5fc46c5a4f36..448d1fafc827 100644
> > > > --- a/drivers/net/ethernet/cavium/thunder/nic.h
> > > > +++ b/drivers/net/ethernet/cavium/thunder/nic.h
> > > > @@ -265,14 +265,9 @@ struct nicvf_drv_stats {
> > > >struct cavium_ptp;
> > > > -struct xcast_addr {
> > > > -   struct list_head list;
> > > > -   u64  addr;
> > > > -};
> > > > -
> > > >struct xcast_addr_list {
> > > > -   struct list_head list;
> > > > int  count;
> > > > +   u64  mc[];
> > > >};
> > > >struct nicvf_work {
> > > > diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
> > > > b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> > > > index 1e9a31fef729..a26d8bc92e01 100644
> > > > --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> > > > +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> > > > @@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct 
> > > > work_struct *work_arg)
> > > >   work.work);
> > > > struct nicvf *nic = container_of(vf_work, struct nicvf, 
> > > > rx_mode_work);
> > > > union nic_mbx mbx = {};
> > > > -   struct xcast_addr *xaddr, *next;
> > > > +   u8 idx = 0;
> > > > if (!vf_work)
> > > > return;
> > > > @@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct 
> > > > work_struct *work_arg)
> > > > /* check if we have any specific MACs to be added to PF DMAC 
> > > > filter */
> > > > if (vf_work->mc) {
> > > > /* now go through kernel list of MACs and add them one 
> > > > by one */
> > > > -   list_for_each_entry_safe(xaddr, next,
> > > > -    &vf_work->mc->list, list) {
> > > > +   for (idx = 0; idx < vf_work->mc->count; idx++) {
> > > > mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST;
> > > > -   mbx.xcast.data.mac = xaddr->addr;
> > > > +   mbx.xcast.data.mac = vf_work->mc->mc[idx];
> > > > nicvf_send_msg_to_pf(nic, &mbx);
> > > > -
> > > > -   /* after receiving ACK from PF release memory */
> > > > -   list_del(&xaddr->list);
> > > > -   kfree(xaddr);
> > > > -   vf_work->mc->count--;
> > > > }
> > > > kfree(vf_work->mc);
> > > > }
> > > > @@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device 
> > > > *netdev)
> > > > mode |= BGX_XCAST_MCAST_FILTER;
> > > > /* here we need to copy mc addrs */
>

TCP one-by-one acking - RFC interpretation question

2018-04-06 Thread Michal Kubecek

Hello,

I encountered a strange behaviour of some (non-linux) TCP stack which
I believe is incorrect but support engineers from the company producing
it claim is OK.

Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS sized
segments but segments 2, 4 and 6 do not reach the server (receiver):

     ACK             SAK             SAK             SAK
  +-------+-------+-------+-------+-------+-------+-------+
  |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
  +-------+-------+-------+-------+-------+-------+-------+
34273   35701   37129   38557   39985   41413   42841   44269

When segment 2 is retransmitted after RTO timeout, normal response would
be ACK-ing segment 3 (38557) with SACK for 5 and 7 (39985-41413 and
42841-44269).

However, this server stack responds with two separate ACKs:

  - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
  - ACK 38557, SACK 39985-41413 42841-44269

There is no payload from server, no window update and it happens even if
there is no other packet received by server between those two. The
result is that as segment 3 was never retransmitted, second ACK is
interpreted as acking a newly arrived segment by 4.4 kernel so that the
whole interval between first transmission of segment 3 and this second
ACK is used for RTT estimator; even worse, when the same happens again
for segment 5, both timeouts (for 2 and 4) are counted into its RTT.
The result is RTO growing exponentially until it reaches the maximum
(120 seconds) and the connection is effectively stalled.

In my opinion, server behaviour violates the last paragraph of RFC 5681,
section 4.2:

  A TCP receiver MUST NOT generate more than one ACK for every incoming
  segment, other than to update the offered window as the receiving
  application consumes new data (see [RFC813] and page 42 of [RFC793]).

Server vendor claims that their behaviour is correct as first ACK is
sent in response to segment 2 and second ACK in response to segment 3
(which has just been delayed in the out of order queue).

Note that SACK doesn't really help here. First SACK block in first ACK
(37129-38557) is actually invalid as it violates the "the bytes just
below the block ... have not been received" condition from RFC 2018
section 3. Therefore Linux 4.4 stack ignores this SACK block, detects
(spurious) SACK reneging and unmarks the "previously sacked" flag of
segment 3 so that when second ACK arrives, there is no trace of it
having been sacked before. They already admitted this SACK block is
incorrect but there is still disagreement about the "one-by-one acking"
behaviour in general.

My question is: is my interpretation correct? If so, is there an even
less ambiguous statement somewhere that receiver is supposed to send one
ACK for "everything they got so far" rather than acking the segments one
by one? While reading the RFCs, I always considered this obvious but
apparently some people may think otherwise.

Thanks in advance,
Michal Kubecek
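
The "normal response" described above, one cumulative ACK covering everything contiguous plus SACK blocks for the remaining islands, can be sketched in a few lines of C. This is a minimal illustration and not code from any TCP stack: the names build_ack(), struct ack_info and MAX_SACK are hypothetical, segments are assumed to be contiguous and delimited by a bounds[] array, and the SACK blocks are emitted in sequence order for simplicity (RFC 2018 asks for the block containing the most recently received segment to come first).

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_SACK 4	/* illustrative cap on the number of SACK blocks */

struct sack_block {
	uint32_t start;	/* first sequence number of the block */
	uint32_t end;	/* sequence number just past the block */
};

struct ack_info {
	uint32_t ack;			/* cumulative ACK */
	size_t nsack;			/* number of valid SACK blocks */
	struct sack_block sack[MAX_SACK];
};

/*
 * bounds[] holds nseg + 1 sequence numbers delimiting nseg contiguous
 * segments; rcvd[i] is non-zero if segment i has arrived at the receiver.
 */
static struct ack_info build_ack(const uint32_t *bounds, const int *rcvd,
				 size_t nseg)
{
	struct ack_info a = { .ack = bounds[0], .nsack = 0 };
	size_t i = 0;

	/* The cumulative ACK advances over the contiguous received prefix. */
	while (i < nseg && rcvd[i])
		i++;
	a.ack = bounds[i];

	/* Every later run of received segments becomes one SACK block. */
	while (i < nseg) {
		size_t j;

		if (!rcvd[i]) {
			i++;
			continue;
		}
		j = i;
		while (j < nseg && rcvd[j])
			j++;
		if (a.nsack < MAX_SACK) {
			a.sack[a.nsack].start = bounds[i];
			a.sack[a.nsack].end = bounds[j];
			a.nsack++;
		}
		i = j;
	}
	return a;
}
```

With the sequence numbers from the diagram and segments 1, 2, 3, 5 and 7 marked as received (that is, after the retransmitted segment 2 arrives), this yields a single ACK for 38557 with SACK blocks 39985-41413 and 42841-44269, rather than two separate ACKs.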

[RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page

2018-04-06 Thread Quentin Monnet

eBPF helper functions can be called from within eBPF programs to perform
a variety of tasks that would be otherwise hard or impossible to do with
eBPF itself. There is a growing number of such helper functions in the
kernel, but documentation is scarce. The main user space header file
does contain a short commented description of most helpers, but it is
somewhat outdated and not complete. It is more a "cheat sheet" than a
real documentation accessible to new eBPF developers.

This commit attempts to improve the situation by replacing the existing
overview for the helpers with a more developed description. Furthermore,
a Python script is added to generate a manual page for eBPF helpers. The
workflow is the following, and requires the rst2man utility:

$ ./scripts/bpf_helpers_doc.py \
--filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
$ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
$ man /tmp/bpf-helpers.7

The objective is to keep all documentation related to the helpers in a
single place, and to be able to generate from here a manual page that
could be packaged in the man-pages repository and shipped with most
distributions [1].

Additionally, parsing the prototypes of the helper functions could
hopefully be reused, with a different Printer object, to generate
header files needed in some eBPF-related projects.

Regarding the description of each helper, it comprises several items:

- The function prototype.
- A description of the function and of its arguments (except for a
  couple of cases, when there are no arguments and the return value
  makes the function usage really obvious).
- A description of return values (if not void).
- A listing of eBPF program types (if relevant, map types) compatible
  with the helper.
- Information about the helper being restricted to GPL programs, or not.
- The kernel version in which the helper was introduced.
- The commit that introduced the helper (this is mostly to have it in
  the source of the man page, as it can be used to track changes and
  update the page).

For several helpers, descriptions are inspired (at times, nearly copied)
from the commit logs introducing them in the kernel--Many thanks to
their respective authors! They were completed as much as possible, the
objective being to have something easily accessible even for people just
starting with eBPF. There is probably a bit more work to do in this
direction for some helpers.

Some RST formatting is used in the descriptions (not in function
prototypes, to keep them readable, but the Python script provided in
order to generate the RST for the manual page does add formatting to
prototypes, to produce something pretty) to get "bold" and "italics" in
manual pages. Hopefully, the descriptions in the bpf.h file remain
perfectly readable. Note that the few trailing white spaces are
intentional; removing them would break paragraphs for rst2man.

The descriptions should ideally be updated each time someone adds a new
helper, or updates the behaviour (compatibility extended to new program
types, new socket option supported...) or the interface (new flags
available, ...) of existing ones.

[1] I have not contacted people from the man-pages project prior to
sending this RFC, so I can offer no guarantee at this time that they
would agree to take the generated man page.

Cc: linux-...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Signed-off-by: Quentin Monnet 
---
 include/uapi/linux/bpf.h   | 2237 
 scripts/bpf_helpers_doc.py |  568 +++
 2 files changed, 2429 insertions(+), 376 deletions(-)
 create mode 100755 scripts/bpf_helpers_doc.py

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c5ec89732a8d..f47aeddbbe0a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -367,394 +367,1879 @@ union bpf_attr {
 
 /* BPF helper function descriptions:
  *
- * void *bpf_map_lookup_elem(&map, &key)
- * Return: Map value or NULL
- *
- * int bpf_map_update_elem(&map, &key, &value, flags)
- * Return: 0 on success or negative error
- *
- * int bpf_map_delete_elem(&map, &key)
- * Return: 0 on success or negative error
- *
- * int bpf_probe_read(void *dst, int size, void *src)
- * Return: 0 on success or negative error
+ * void *bpf_map_lookup_elem(struct bpf_map *map, void *key)
+ * Description
+ * Perform a lookup in *map* for an entry associated to *key*.
+ * Return
+ * Map value associated to *key*, or **NULL** if no entry was
+ * found.
+ * For
+ * All types of programs. Limited to maps of types
+ * **BPF_MAP_TYPE_HASH**,
+ * **BPF_MAP_TYPE_ARRAY**,
+ * **BPF_MAP_TYPE_PERCPU_HASH**,
+ * **BPF_MAP_TYPE_PERCPU_ARRAY**,
+ * **BPF_MAP_TYPE_LRU_HASH**,
+ * **BPF_MAP_TYPE_LRU_PERCPU_HASH**,
+ * **BPF_MAP_TYPE_LPM_TRIE**,
+ *

Re: [PATCH v2] net: thunderx: rework mac addresses list to u64 array

2018-04-06 Thread Vadim Lomovtsev

Hi Gustavo,

On Fri, Apr 06, 2018 at 06:36:28AM -0500, Gustavo A. R. Silva wrote:
> Hi Vadim,
> 
> On 04/06/2018 06:14 AM, Vadim Lomovtsev wrote:
> > From: Vadim Lomovtsev 
> > 
> > It is too expensive to pass u64 values via linked list, instead
> > allocate array for them by overall number of mac addresses from netdev.
> > 
> > This eventually removes multiple kmalloc() calls, avoids memory
> > fragmentation and allows a single null check on the kmalloc
> > return value in order to prevent a potential null pointer dereference.
> > 
> > Addresses-Coverity-ID: 1467429 ("Dereference null return value")
> > Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback 
> > implementation for VF")
> 
> It'd be nice if you add:
> 
> Reported-by: Gustavo A. R. Silva 

Ok, I could do that.

Just to explain: I didn't do it yet since I got this report from
Dan Carpenter initially
(https://www.spinics.net/lists/linux-kernel-janitors/msg40720.html)
and was working on it when I found your patches, so I asked you to give
me some time to prepare and test an update to the driver.

So would it be acceptable to put two 'Reported-by:' tags then?

WBR,
Vadim

> 
> Thanks
> --
> Gustavo
> 
> > Signed-off-by: Vadim Lomovtsev 
> > ---
> > Changes from v1 to v2:
> >   - C99 syntax: update xcast_addr_list struct field mc[0] -> mc[];
> > 
> > ---
> >   drivers/net/ethernet/cavium/thunder/nic.h|  7 +-
> >   drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 
> > +---
> >   2 files changed, 11 insertions(+), 24 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/cavium/thunder/nic.h 
> > b/drivers/net/ethernet/cavium/thunder/nic.h
> > index 5fc46c5a4f36..448d1fafc827 100644
> > --- a/drivers/net/ethernet/cavium/thunder/nic.h
> > +++ b/drivers/net/ethernet/cavium/thunder/nic.h
> > @@ -265,14 +265,9 @@ struct nicvf_drv_stats {
> >   struct cavium_ptp;
> > -struct xcast_addr {
> > -   struct list_head list;
> > -   u64  addr;
> > -};
> > -
> >   struct xcast_addr_list {
> > -   struct list_head list;
> > int  count;
> > +   u64  mc[];
> >   };
> >   struct nicvf_work {
> > diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
> > b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> > index 1e9a31fef729..a26d8bc92e01 100644
> > --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> > +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> > @@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct 
> > *work_arg)
> >   work.work);
> > struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
> > union nic_mbx mbx = {};
> > -   struct xcast_addr *xaddr, *next;
> > +   u8 idx = 0;
> > if (!vf_work)
> > return;
> > @@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct 
> > work_struct *work_arg)
> > /* check if we have any specific MACs to be added to PF DMAC filter */
> > if (vf_work->mc) {
> > /* now go through kernel list of MACs and add them one by one */
> > -   list_for_each_entry_safe(xaddr, next,
> > -    &vf_work->mc->list, list) {
> > +   for (idx = 0; idx < vf_work->mc->count; idx++) {
> > mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST;
> > -   mbx.xcast.data.mac = xaddr->addr;
> > +   mbx.xcast.data.mac = vf_work->mc->mc[idx];
> > nicvf_send_msg_to_pf(nic, &mbx);
> > -
> > -   /* after receiving ACK from PF release memory */
> > -   list_del(&xaddr->list);
> > -   kfree(xaddr);
> > -   vf_work->mc->count--;
> > }
> > kfree(vf_work->mc);
> > }
> > @@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device 
> > *netdev)
> > mode |= BGX_XCAST_MCAST_FILTER;
> > /* here we need to copy mc addrs */
> > if (netdev_mc_count(netdev)) {
> > -   struct xcast_addr *xaddr;
> > -
> > -   mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
> > -   INIT_LIST_HEAD(&mc_list->list);
> > +   mc_list = kmalloc(sizeof(*mc_list) +
> > + sizeof(u64) * 
> > netdev_mc_count(netdev),
> > + GFP_ATOMIC);
> > +   if (unlikely(!mc_list))
> > +   return;
> > +   mc_list->count = 0;
> > netdev_hw_addr_list_for_each(ha, &netdev->mc) {
> > -   xaddr = kmalloc(sizeof(*xaddr),
> > -   GFP_ATOMIC);
> > -   xaddr->addr =
> > +

Re: [PATCH v2] net: thunderx: rework mac addresses list to u64 array

2018-04-06 Thread Gustavo A. R. Silva




On 04/06/2018 06:43 AM, Vadim Lomovtsev wrote:

Hi Gustavo,

On Fri, Apr 06, 2018 at 06:36:28AM -0500, Gustavo A. R. Silva wrote:

Hi Vadim,

On 04/06/2018 06:14 AM, Vadim Lomovtsev wrote:

From: Vadim Lomovtsev 

It is too expensive to pass u64 values via linked list, instead
allocate array for them by overall number of mac addresses from netdev.

This eventually removes multiple kmalloc() calls, avoids memory
fragmentation and allows a single null check on the kmalloc
return value in order to prevent a potential null pointer dereference.

Addresses-Coverity-ID: 1467429 ("Dereference null return value")
Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for 
VF")


It'd be nice if you add:

Reported-by: Gustavo A. R. Silva 


Ok, I could do that.

Just to explain: I didn't do it yet since I got this report from
Dan Carpenter initially
(https://www.spinics.net/lists/linux-kernel-janitors/msg40720.html)
and was working on it when I found your patches, so I asked you to give
me some time to prepare and test an update to the driver.



Oh I got it. No big deal. I think in this case you should add Dan's
Reported-by instead.


BTW nice patch. :)

Thanks
--
Gustavo


So would it be acceptable to put two 'Reported-by:' tags then?

WBR,
Vadim



Thanks
--
Gustavo


Signed-off-by: Vadim Lomovtsev 
---
Changes from v1 to v2:
   - C99 syntax: update xcast_addr_list struct field mc[0] -> mc[];

---
   drivers/net/ethernet/cavium/thunder/nic.h|  7 +-
   drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 
+---
   2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nic.h 
b/drivers/net/ethernet/cavium/thunder/nic.h
index 5fc46c5a4f36..448d1fafc827 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -265,14 +265,9 @@ struct nicvf_drv_stats {
   struct cavium_ptp;
-struct xcast_addr {
-   struct list_head list;
-   u64  addr;
-};
-
   struct xcast_addr_list {
-   struct list_head list;
int  count;
+   u64  mc[];
   };
   struct nicvf_work {
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 1e9a31fef729..a26d8bc92e01 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct 
*work_arg)
  work.work);
struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
union nic_mbx mbx = {};
-   struct xcast_addr *xaddr, *next;
+   u8 idx = 0;
if (!vf_work)
return;
@@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct 
*work_arg)
/* check if we have any specific MACs to be added to PF DMAC filter */
if (vf_work->mc) {
/* now go through kernel list of MACs and add them one by one */
-   list_for_each_entry_safe(xaddr, next,
-    &vf_work->mc->list, list) {
+   for (idx = 0; idx < vf_work->mc->count; idx++) {
mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST;
-   mbx.xcast.data.mac = xaddr->addr;
+   mbx.xcast.data.mac = vf_work->mc->mc[idx];
nicvf_send_msg_to_pf(nic, &mbx);
-
-   /* after receiving ACK from PF release memory */
-   list_del(&xaddr->list);
-   kfree(xaddr);
-   vf_work->mc->count--;
}
kfree(vf_work->mc);
}
@@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
mode |= BGX_XCAST_MCAST_FILTER;
/* here we need to copy mc addrs */
if (netdev_mc_count(netdev)) {
-   struct xcast_addr *xaddr;
-
-   mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
-   INIT_LIST_HEAD(&mc_list->list);
+   mc_list = kmalloc(sizeof(*mc_list) +
+ sizeof(u64) * 
netdev_mc_count(netdev),
+ GFP_ATOMIC);
+   if (unlikely(!mc_list))
+   return;
+   mc_list->count = 0;
netdev_hw_addr_list_for_each(ha, &netdev->mc) {
-   xaddr = kmalloc(sizeof(*xaddr),
-   GFP_ATOMIC);
-   xaddr->addr =
+

tcp hang when socket fills up ?

2018-04-06 Thread Dominique Martinet

(current kernel: vanilla 4.14.29)

I've been running into trouble recently where a socket "fills up" (as
in, select() will no longer return it in its write fd set, and attempting
to write to it after setting it to O_NONBLOCK fails with EWOULDBLOCK) and
it doesn't seem to ever "unfill" until the TCP connection times out.

The previous time I blamed the application for not trying to read the
socket either, as I assumed the buffers could be shared and just waiting
wouldn't take data out, but this time I'm a bit more skeptical as socat
waits for the fd in both read and write...

Let me take a minute to describe my setup, I don't think that how the
socket was created matters but it might be interesting:
 - I have two computers behind NATs, no port forwarding on either side
 - One (call it C for client) runs ssh with a proxycommand ncat/socat to
control the source port, e.g.
$ ssh -o ProxyCommand="socat stdio tcp::,sourceport=" server
 - The server runs another socat to connect to that and forwards to ssh
locally, e.g.
$ socat tcp::,sourceport= tcp:127.0.0.1:22

(yes, both are connect() calls, and that just works™ thanks to initial
syn replay and conntrack on routers)

When things stall, the first socat is in a select() with both fds in the
read set, so it's waiting for data.
The second socat has data coming from ssh and is in a select with the
client-facing socket in both read and write - that select never returns
so the socket is not available for reading or writing and does not free
up until the connection eventually times out a few minutes later.

At this point, I only see tcp replays in tcpdump/wireshark. I've
compared dumps from both sides and there are no lost packets, only
reordering - there is always a batch of acks that were sent regularly
coming in shortly before the hang. Here's the trace on the server:

16:49:26.735042 IP .13317 > .31872: Flags 
[.], seq 70476:71850, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 
1617129473], length 1374
16:49:26.735046 IP .13317 > .31872: Flags 
[.], seq 71850:73224, ack 4190, win 307, options [nop,nop,TS val 1313937641 ecr 
1617129473], length 1374
16:49:26.735334 IP .31872 > .13317: Flags 
[.], ack 41622, win 918, options [nop,nop,TS val 1617129478 ecr 1313937609], 
length 0
16:49:26.736005 IP .31872 > .13317: Flags 
[.], ack 42996, win 940, options [nop,nop,TS val 1617129478 ecr 1313937609], 
length 0
16:49:26.736402 IP .13317 > .31872: Flags 
[.], seq 73224:74598, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 
1617129473], length 1374
16:49:26.736408 IP .13317 > .31872: Flags 
[.], seq 74598:75972, ack 4190, win 307, options [nop,nop,TS val 1313937643 ecr 
1617129473], length 1374
16:49:26.738561 IP .31872 > .13317: Flags 
[.], ack 44370, win 963, options [nop,nop,TS val 1617129482 ecr 1313937616], 
length 0
16:49:26.739539 IP .31872 > .13317: Flags 
[.], ack 45744, win 986, options [nop,nop,TS val 1617129482 ecr 1313937616], 
length 0
16:49:26.739882 IP .31872 > .13317: Flags 
[.], ack 47118, win 1008, options [nop,nop,TS val 1617129484 ecr 1313937617], 
length 0
16:49:26.740255 IP .31872 > .13317: Flags 
[.], ack 48492, win 1031, options [nop,nop,TS val 1617129484 ecr 1313937617], 
length 0
16:49:26.746756 IP .31872 > .13317: Flags 
[.], ack 49866, win 1053, options [nop,nop,TS val 1617129493 ecr 1313937627], 
length 0
16:49:26.747923 IP .31872 > .13317: Flags 
[.], ack 51240, win 1076, options [nop,nop,TS val 1617129494 ecr 1313937627], 
length 0
16:49:26.749083 IP .31872 > .13317: Flags 
[.], ack 52614, win 1099, options [nop,nop,TS val 1617129495 ecr 1313937629], 
length 0
16:49:26.750171 IP .31872 > .13317: Flags 
[.], ack 53988, win 1121, options [nop,nop,TS val 1617129496 ecr 1313937629], 
length 0
16:49:26.750808 IP .31872 > .13317: Flags 
[.], ack 55362, win 1144, options [nop,nop,TS val 1617129497 ecr 1313937629], 
length 0
16:49:26.754648 IP .31872 > .13317: Flags 
[.], ack 56736, win 1167, options [nop,nop,TS val 1617129500 ecr 1313937629], 
length 0
16:49:26.755985 IP .31872 > .13317: Flags 
[.], ack 58110, win 1189, options [nop,nop,TS val 1617129501 ecr 1313937630], 
length 0
16:49:26.758513 IP .31872 > .13317: Flags 
[.], ack 59484, win 1212, options [nop,nop,TS val 1617129502 ecr 1313937630], 
length 0
16:49:26.759096 IP .31872 > .13317: Flags 
[.], ack 60858, win 1234, options [nop,nop,TS val 1617129503 ecr 1313937635], 
length 0
16:49:26.759421 IP .31872 > .13317: Flags 
[.], ack 62232, win 1257, options [nop,nop,TS val 1617129503 ecr 1313937635], 
length 0
16:49:26.759755 IP .31872 > .13317: Flags 
[.], ack 63606, win 1280, options [nop,nop,TS val 1617129504 ecr 1313937636], 
length 0
16:49:26.760653 IP .31872 > .13317: Flags 
[.], ack 64980, win 1302, options [nop,nop,TS val 1617129505 ecr 1313937636], 
length 0
16:49:26.761453 IP .31872 > .13317: Flags 
[.], ack 66354, win 1325, options [nop,nop,TS val 1617129506 ecr 1313937638], 
length 0
16:49:26.762199 IP .31872 > .13317: Flags 
[.], ack 67728, win 1348, options [nop,nop,TS

[PATCH v2] net: thunderx: rework mac addresses list to u64 array

2018-04-06 Thread Vadim Lomovtsev

From: Vadim Lomovtsev 

It is too expensive to pass u64 values via linked list, instead
allocate array for them by overall number of mac addresses from netdev.

This eventually removes multiple kmalloc() calls, avoids memory
fragmentation and allows a single NULL check on the kmalloc() return
value, preventing a potential NULL pointer dereference.

Addresses-Coverity-ID: 1467429 ("Dereference null return value")
Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback 
implementation for VF")
Signed-off-by: Vadim Lomovtsev 
---
Changes from v1 to v2:
 - C99 syntax: update xcast_addr_list struct field mc[0] -> mc[];

---
 drivers/net/ethernet/cavium/thunder/nic.h|  7 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 +---
 2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nic.h 
b/drivers/net/ethernet/cavium/thunder/nic.h
index 5fc46c5a4f36..448d1fafc827 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -265,14 +265,9 @@ struct nicvf_drv_stats {
 
 struct cavium_ptp;
 
-struct xcast_addr {
-   struct list_head list;
-   u64  addr;
-};
-
 struct xcast_addr_list {
-   struct list_head list;
int  count;
+   u64  mc[];
 };
 
 struct nicvf_work {
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 1e9a31fef729..a26d8bc92e01 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct 
*work_arg)
  work.work);
struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
union nic_mbx mbx = {};
-   struct xcast_addr *xaddr, *next;
+   u8 idx = 0;
 
if (!vf_work)
return;
@@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct 
*work_arg)
/* check if we have any specific MACs to be added to PF DMAC filter */
if (vf_work->mc) {
/* now go through kernel list of MACs and add them one by one */
-   list_for_each_entry_safe(xaddr, next,
-                            &vf_work->mc->list, list) {
+   for (idx = 0; idx < vf_work->mc->count; idx++) {
mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST;
-   mbx.xcast.data.mac = xaddr->addr;
+   mbx.xcast.data.mac = vf_work->mc->mc[idx];
nicvf_send_msg_to_pf(nic, &mbx);
-
-   /* after receiving ACK from PF release memory */
-   list_del(&xaddr->list);
-   kfree(xaddr);
-   vf_work->mc->count--;
}
kfree(vf_work->mc);
}
@@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
mode |= BGX_XCAST_MCAST_FILTER;
/* here we need to copy mc addrs */
if (netdev_mc_count(netdev)) {
-   struct xcast_addr *xaddr;
-
-   mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
-   INIT_LIST_HEAD(&mc_list->list);
+   mc_list = kmalloc(sizeof(*mc_list) +
+ sizeof(u64) * 
netdev_mc_count(netdev),
+ GFP_ATOMIC);
+   if (unlikely(!mc_list))
+   return;
+   mc_list->count = 0;
netdev_hw_addr_list_for_each(ha, &netdev->mc) {
-   xaddr = kmalloc(sizeof(*xaddr),
-   GFP_ATOMIC);
-   xaddr->addr =
+   mc_list->mc[mc_list->count] =
ether_addr_to_u64(ha->addr);
-   list_add_tail(&xaddr->list,
- &mc_list->list);
mc_list->count++;
}
}
-- 
2.14.3
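
The allocation pattern in the patch above can be sketched in userspace C.
This is only an illustration under stated assumptions: struct addr_list,
mac_to_u64() and addr_list_build() are hypothetical stand-ins for
struct xcast_addr_list, ether_addr_to_u64() and the loop in
nicvf_set_rx_mode(), and malloc() stands in for kmalloc(..., GFP_ATOMIC).

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the patched struct xcast_addr_list. */
struct addr_list {
	int      count;
	uint64_t mc[];   /* C99 flexible array member, sized at allocation */
};

/* Pack a 6-byte MAC into the low 48 bits of a u64, first byte most
 * significant (assumed to match the kernel's ether_addr_to_u64()).
 */
static uint64_t mac_to_u64(const uint8_t mac[6])
{
	uint64_t v = 0;

	for (int i = 0; i < 6; i++)
		v = (v << 8) | mac[i];
	return v;
}

/* One allocation covers the header plus n entries, so one NULL check
 * is enough; the caller releases everything with a single free().
 */
static struct addr_list *addr_list_build(const uint8_t (*macs)[6], int n)
{
	struct addr_list *l = malloc(sizeof(*l) + sizeof(uint64_t) * (size_t)n);

	if (!l)
		return NULL;
	l->count = 0;
	for (int i = 0; i < n; i++)
		l->mc[l->count++] = mac_to_u64(macs[i]);
	return l;
}
```

Compared with a linked list of per-address nodes, the consumer side
reduces to an indexed loop over mc[0..count-1].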

[PATCH iproute2] bridge: fix typo in hairpin error message

2018-04-06 Thread Guillaume Nault

No 'g' to hairpin.

Signed-off-by: Guillaume Nault 
---
 bridge/link.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/bridge/link.c b/bridge/link.c
index 579d57e7..8d89aca2 100644
--- a/bridge/link.c
+++ b/bridge/link.c
@@ -312,7 +312,7 @@ static int brlink_modify(int argc, char **argv)
return -1;
} else if (strcmp(*argv, "hairpin") == 0) {
NEXT_ARG();
-   if (!on_off("hairping", &hairpin, *argv))
+   if (!on_off("hairpin", &hairpin, *argv))
return -1;
} else if (strcmp(*argv, "fastleave") == 0) {
NEXT_ARG();
-- 
2.17.0

Re: [PATCH v2] net: thunderx: rework mac addresses list to u64 array

2018-04-06 Thread Gustavo A. R. Silva


Hi Vadim,

On 04/06/2018 06:14 AM, Vadim Lomovtsev wrote:

From: Vadim Lomovtsev 

It is too expensive to pass u64 values via linked list, instead
allocate array for them by overall number of mac addresses from netdev.

This eventually removes multiple kmalloc() calls, avoids memory
fragmentation and allows a single NULL check on the kmalloc() return
value, preventing a potential NULL pointer dereference.

Addresses-Coverity-ID: 1467429 ("Dereference null return value")
Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for 
VF")


It'd be nice if you add:

Reported-by: Gustavo A. R. Silva 

Thanks
--
Gustavo


Signed-off-by: Vadim Lomovtsev 
---
Changes from v1 to v2:
  - C99 syntax: update xcast_addr_list struct field mc[0] -> mc[];

---
  drivers/net/ethernet/cavium/thunder/nic.h|  7 +-
  drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 +---
  2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nic.h 
b/drivers/net/ethernet/cavium/thunder/nic.h
index 5fc46c5a4f36..448d1fafc827 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -265,14 +265,9 @@ struct nicvf_drv_stats {
  
  struct cavium_ptp;
  
-struct xcast_addr {

-   struct list_head list;
-   u64  addr;
-};
-
  struct xcast_addr_list {
-   struct list_head list;
int  count;
+   u64  mc[];
  };
  
  struct nicvf_work {

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 1e9a31fef729..a26d8bc92e01 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct 
*work_arg)
  work.work);
struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
union nic_mbx mbx = {};
-   struct xcast_addr *xaddr, *next;
+   u8 idx = 0;
  
  	if (!vf_work)

return;
@@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct 
*work_arg)
/* check if we have any specific MACs to be added to PF DMAC filter */
if (vf_work->mc) {
/* now go through kernel list of MACs and add them one by one */
-   list_for_each_entry_safe(xaddr, next,
-                            &vf_work->mc->list, list) {
+   for (idx = 0; idx < vf_work->mc->count; idx++) {
mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST;
-   mbx.xcast.data.mac = xaddr->addr;
+   mbx.xcast.data.mac = vf_work->mc->mc[idx];
nicvf_send_msg_to_pf(nic, &mbx);
-
-   /* after receiving ACK from PF release memory */
-   list_del(&xaddr->list);
-   kfree(xaddr);
-   vf_work->mc->count--;
}
kfree(vf_work->mc);
}
@@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
mode |= BGX_XCAST_MCAST_FILTER;
/* here we need to copy mc addrs */
if (netdev_mc_count(netdev)) {
-   struct xcast_addr *xaddr;
-
-   mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
-   INIT_LIST_HEAD(&mc_list->list);
+   mc_list = kmalloc(sizeof(*mc_list) +
+ sizeof(u64) * 
netdev_mc_count(netdev),
+ GFP_ATOMIC);
+   if (unlikely(!mc_list))
+   return;
+   mc_list->count = 0;
netdev_hw_addr_list_for_each(ha, &netdev->mc) {
-   xaddr = kmalloc(sizeof(*xaddr),
-   GFP_ATOMIC);
-   xaddr->addr =
+   mc_list->mc[mc_list->count] =
ether_addr_to_u64(ha->addr);
-   list_add_tail(&xaddr->list,
- &mc_list->list);
mc_list->count++;
}
}

Re: [iovisor-dev] Best userspace programming API for XDP features query to kernel?

2018-04-06 Thread Jesper Dangaard Brouer

On Thu, 5 Apr 2018 14:47:16 -0700
Jakub Kicinski  wrote:

> On Thu, 5 Apr 2018 22:51:33 +0200, Jesper Dangaard Brouer wrote:
> > > What about nfp in terms of XDP
> > > offload capabilities, should they be included as well or is probing to 
> > > load
> > > the program and see if it loads/JITs as we do today just fine (e.g. you'd
> > > otherwise end up with extra flags on a per BPF helper basis)?
> > 
> > No, flags per BPF helper basis. As I've described above, helper belong
> > to the BPF core, not the driver.  Here I want to know what the specific
> > driver support.  
> 
> I think Daniel meant for nfp offload.  The offload restrictions are
> quite involved, are we going to be able to express those?

Let's keep things separate.

I'm requesting something really simple.  I want the driver to tell me
what XDP actions it supports.  We/I can implement an XDP_QUERY_ACTIONS
via ndo_bpf, problem solved.  It is small, specific and simple.
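
A minimal sketch of what such a query could return, assuming a plain
bitmask with one bit per XDP action. XDP_QUERY_ACTIONS is only named in
this thread, not defined; struct xdp_query_actions, ACT_BIT() and both
functions below are hypothetical. The enum values are assumed to mirror
enum xdp_action from the UAPI header linux/bpf.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the values of enum xdp_action in linux/bpf.h. */
enum xdp_act { ACT_ABORTED = 0, ACT_DROP, ACT_PASS, ACT_TX, ACT_REDIRECT };

/* Hypothetical reply to an XDP_QUERY_ACTIONS request: one bit per action. */
struct xdp_query_actions {
	uint32_t supported;
};

#define ACT_BIT(a) (UINT32_C(1) << (a))

/* What a driver without XDP_REDIRECT support might report. */
static struct xdp_query_actions example_driver_query(void)
{
	struct xdp_query_actions q = {
		.supported = ACT_BIT(ACT_ABORTED) | ACT_BIT(ACT_DROP) |
			     ACT_BIT(ACT_PASS) | ACT_BIT(ACT_TX),
	};

	return q;
}

/* Setup-time check a loader could run before attaching a program. */
static bool xdp_action_supported(struct xdp_query_actions q, enum xdp_act a)
{
	return (q.supported & ACT_BIT(a)) != 0;
}
```

With a reply like this, a loader could refuse at setup time to attach a
program that returns XDP_REDIRECT, instead of discovering the missing
support at runtime.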

For my other use-case of enabling XDP-xmit on a device, I can
implement another ndo_bpf extension. The current approach is to load
a dummy XDP prog via ndo_bpf anyway (which is awkward). Again, a
specific change that lets us move one step further.

For your nfp offload use-case, you/we have to find a completely
different solution.  You have hit a design choice made by BPF:
BPF is part of the core kernel, and helpers cannot be loaded as
kernel modules, because we cannot remove or add helpers after the
verifier has certified the program.  And your nfp offload driver
basically comes as a kernel module.
 (Details: you basically already solved your issue by modifying the
core verifier to do a call back to bpf_prog_offload_verify_insn()).
Point being this is very different from what I'm requesting.  Thus, for
offloading you already have a solution to my setup-time detection
problem, as your program gets rejected at setup/load time by the verifier.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Re: [RFC] ethtool: Support for driver private ioctl's

2018-04-06 Thread Michal Kubecek

On Thu, Apr 05, 2018 at 08:50:49AM -0700, Florian Fainelli wrote:
> On 04/05/2018 03:47 AM, Jose Abreu wrote:
> > Background: Synopsys Ethernet IP's have a certain number of
> > features which can be reconfigured at runtime. Giving you two
> > examples: One of the most recent one is the safety features,
> > which can be enabled/disabled and forced at runtime. Another one
> > is a Flexible RX Parser which can route specific packets to
> > specific RX DMA channels. Given that these are features specific
> > to our IP's it would not be useful to add an uniform API for this
> > because the users would only be one or two drivers ...
> 
> Parsing of packets and directing the matched packets to specific
> queues/channels can be done through ethtool rxnfc API, tc/cls_flower as
> well, so you should really check whether those APIs don't already allow
> you to do what you want.
> 
> ethtool already supports a concept of private  flags, not ioctl() though
> which allows you to toggle boolean values for instance (or technically
> up to how many bits a "flag" is used to represent) is that enough or do
> you need to turn on/off the feature as well as pass configuration
> parameters?

Perhaps introducing "driver/device specific tunables" (i.e. something
like tunables or PHY tunables but specific to a particular device) could
be a way. But it could get out of control quickly and users wouldn't be
happy if they had to set the same (or almost the same) parameter under
five different names for five NIC vendors.

Michal Kubecek

[PATCH net 1/3] lan78xx: PHY DSP registers initialization to address EEE link drop issues with long cables

2018-04-06 Thread Raghuram Chary J

Configure the DSP registers of the PHY device to handle
GbE EEE failures with cable lengths above 40m.

Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 
Ethernet device driver")
Signed-off-by: Raghuram Chary J 
---
 drivers/net/phy/microchip.c  | 123 ++-
 include/linux/microchipphy.h |   8 +++
 2 files changed, 130 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/microchip.c b/drivers/net/phy/microchip.c
index 0f293ef28935..174ae9808722 100644
--- a/drivers/net/phy/microchip.c
+++ b/drivers/net/phy/microchip.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define DRIVER_AUTHOR  "WOOJUNG HUH "
 #define DRIVER_DESC"Microchip LAN88XX PHY driver"
@@ -66,6 +67,107 @@ static int lan88xx_suspend(struct phy_device *phydev)
return 0;
 }
 
+static void lan88xx_TR_reg_set(struct phy_device *phydev, u16 regaddr,
+  u32 data)
+{
+   int val;
+   u16 buf;
+
+   /* Get access to token ring page */
+   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
+ LAN88XX_EXT_PAGE_ACCESS_TR);
+
+   phy_write(phydev, LAN88XX_EXT_PAGE_TR_LOW_DATA, (data & 0xFFFF));
+   phy_write(phydev, LAN88XX_EXT_PAGE_TR_HIGH_DATA,
+ (data & 0x00FF0000) >> 16);
+
+   /* Config control bits [15:13] of register */
+   buf = (regaddr & ~(0x3 << 13));/* Clr [14:13] to write data in reg */
+   buf |= 0x8000; /* Set [15] to Packet transmit */
+
+   phy_write(phydev, LAN88XX_EXT_PAGE_TR_CR, buf);
+
+   usleep_range(1000, 2000);/* Wait for Data to be written */
+   val = phy_read(phydev, LAN88XX_EXT_PAGE_TR_CR);
+   if (!(val & 0x8000))
+   pr_warn("TR Register[0x%X] configuration failed\n", regaddr);
+
+   /* Setting to Main page registers space*/
+   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_0);
+}
+
+static void lan88xx_config_TR_regs(struct phy_device *phydev)
+{
+   /* Get access to Channel 0x1, Node 0xF , Register 0x01.
+* Write 24-bit value 0x12B00A to register. Setting MrvlTrFix1000Kf,
+* MrvlTrFix1000Kp, MasterEnableTR bits.
+*/
+   lan88xx_TR_reg_set(phydev, 0x0F82, 0x12B00A);
+
+   /* Get access to Channel b'10, Node b'1101, Register 0x06.
+* Write 24-bit value 0xD2C46F to register. Setting SSTrKf1000Slv,
+* SSTrKp1000Mas bits.
+*/
+   lan88xx_TR_reg_set(phydev, 0x168C, 0xD2C46F);
+
+   /* Get access to Channel b'10, Node b'1111, Register 0x11.
+* Write 24-bit value 0x620 to register. Setting rem_upd_done_thresh
+* bits
+*/
+   lan88xx_TR_reg_set(phydev, 0x17A2, 0x620);
+
+   /* Get access to Channel b'10, Node b'1101, Register 0x10.
+* Write 24-bit value 0xEEFFDD to register. Setting
+* eee_TrKp1Long_1000, eee_TrKp2Long_1000, eee_TrKp3Long_1000,
+* eee_TrKp1Short_1000,eee_TrKp2Short_1000, eee_TrKp3Short_1000 bits.
+*/
+   lan88xx_TR_reg_set(phydev, 0x16A0, 0xEEFFDD);
+
+   /* Get access to Channel b'10, Node b'1101, Register 0x13.
+* Write 24-bit value 0x071448 to register. Setting
+* slv_lpi_tr_tmr_val1, slv_lpi_tr_tmr_val2 bits.
+*/
+   lan88xx_TR_reg_set(phydev, 0x16A6, 0x071448);
+
+   /* Get access to Channel b'10, Node b'1101, Register 0x12.
+* Write 24-bit value 0x13132F to register. Setting
+* slv_sigdet_timer_val1, slv_sigdet_timer_val2 bits.
+*/
+   lan88xx_TR_reg_set(phydev, 0x16A4, 0x13132F);
+
+   /* Get access to Channel b'10, Node b'1101, Register 0x14.
+* Write 24-bit value 0x0 to register. Setting eee_3level_delay,
+* eee_TrKf_freeze_delay bits.
+*/
+   lan88xx_TR_reg_set(phydev, 0x16A8, 0x0);
+
+   /* Get access to Channel b'01, Node b'1111, Register 0x34.
+* Write 24-bit value 0x91B06C to register. Setting
+* FastMseSearchThreshLong1000, FastMseSearchThreshShort1000,
+* FastMseSearchUpdGain1000 bits.
+*/
+   lan88xx_TR_reg_set(phydev, 0x0FE8, 0x91B06C);
+
+   /* Get access to Channel b'01, Node b'1111, Register 0x3E.
+* Write 24-bit value 0xC0A028 to register. Setting
+* FastMseKp2ThreshLong1000, FastMseKp2ThreshShort1000,
+* FastMseKp2UpdGain1000, FastMseKp2ExitEn1000 bits.
+*/
+   lan88xx_TR_reg_set(phydev, 0x0FFC, 0xC0A028);
+
+   /* Get access to Channel b'01, Node b'1111, Register 0x35.
+* Write 24-bit value 0x041600 to register. Setting
+* FastMseSearchPhShNum1000, FastMseSearchClksPerPh1000,
+* FastMsePhChangeDelay1000 bits.
+*/
+   lan88xx_TR_reg_set(phydev, 0x0FEA, 0x041600);
+
+   /* Get access to Channel b'10, Node b'1101, Register 0x03.
+* Write 24-bit value 0x04 to register. Setting TrFreeze bits.
+*/
+
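
The 16-bit addresses passed to lan88xx_TR_reg_set() in the patch above
appear to pack the channel, node and register numbers named in each
comment. The sketch below encodes that layout; it is inferred purely
from the address/comment pairs in the patch (for example Channel 0x1,
Node 0xF, Register 0x01 giving 0x0F82), not from a datasheet, and
tr_regaddr() is a hypothetical helper that the patch does not contain.

```c
#include <stdint.h>

/* Field layout inferred from the patch comments (an assumption):
 *   bits [12:11] channel, bits [10:7] node, bits [6:1] register.
 * Bits [15:13] are left clear; lan88xx_TR_reg_set() sets the control
 * bits there itself.
 */
static uint16_t tr_regaddr(unsigned channel, unsigned node, unsigned reg)
{
	return (uint16_t)(((channel & 0x3u) << 11) |
			  ((node & 0xFu) << 7) |
			  ((reg & 0x3Fu) << 1));
}
```

A helper like this would let each call site state channel, node and
register directly instead of repeating them in a comment next to a
magic number.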

[PATCH net 0/3] lan78xx: Fixes and enhancements

2018-04-06 Thread Raghuram Chary J

This series of patches contains a fix and enhancements for the
lan78xx driver.

Raghuram Chary J (3):
  lan78xx: PHY DSP registers initialization to address EEE link drop
issues with long cables
  lan78xx: Add support to dump lan78xx registers
  lan78xx: Lan7801 Support for Fixed PHY

 drivers/net/phy/microchip.c  | 123 ++-
 drivers/net/usb/Kconfig  |   1 +
 drivers/net/usb/lan78xx.c|  93 ++--
 include/linux/microchipphy.h |   8 +++
 4 files changed, 220 insertions(+), 5 deletions(-)

-- 
2.16.2

[PATCH net 2/3] lan78xx: Add support to dump lan78xx registers

2018-04-06 Thread Raghuram Chary J

In order to dump lan78xx family registers using ethtool, add
support at lan78xx driver level.

Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 
Ethernet device driver")
Signed-off-by: Raghuram Chary J 
---
 drivers/net/usb/lan78xx.c | 51 +++
 1 file changed, 51 insertions(+)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index 55a78eb96961..e3cc3b504c87 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -278,6 +278,30 @@ struct lan78xx_statstage64 {
u64 eee_tx_lpi_time;
 };
 
+u32 lan78xx_regs[] = {
+   ID_REV,
+   INT_STS,
+   HW_CFG,
+   PMT_CTL,
+   E2P_CMD,
+   E2P_DATA,
+   USB_STATUS,
+   VLAN_TYPE,
+   MAC_CR,
+   MAC_RX,
+   MAC_TX,
+   FLOW,
+   ERR_STS,
+   MII_ACC,
+   MII_DATA,
+   EEE_TX_LPI_REQ_DLY,
+   EEE_TW_TX_SYS,
+   EEE_TX_LPI_REM_DLY,
+   WUCSR
+};
+
+#define PHY_REG_SIZE (32 * sizeof(u32))
+
 struct lan78xx_net;
 
 struct lan78xx_priv {
@@ -1604,6 +1628,31 @@ static int lan78xx_set_pause(struct net_device *net,
return ret;
 }
 
+static int lan78xx_get_regs_len(struct net_device *netdev)
+{
+   return (sizeof(lan78xx_regs) + PHY_REG_SIZE);
+}
+
+static void
+lan78xx_get_regs(struct net_device *netdev, struct ethtool_regs *regs,
+void *buf)
+{
+   u32 *data = buf;
+   int i, j;
+   struct lan78xx_net *dev = netdev_priv(netdev);
+
+   /* Read Device/MAC registers */
+   for (i = 0, j = 0; i < (sizeof(lan78xx_regs) / sizeof(u32)); i++, j++)
+   lan78xx_read_reg(dev, lan78xx_regs[i], &data[j]);
+
+   if (!netdev->phydev)
+   return;
+
+   /* Read PHY registers */
+   for (i = 0; i < 32; i++, j++)
+   data[j] = phy_read(netdev->phydev, i);
+}
+
 static const struct ethtool_ops lan78xx_ethtool_ops = {
.get_link   = lan78xx_get_link,
.nway_reset = phy_ethtool_nway_reset,
@@ -1624,6 +1673,8 @@ static const struct ethtool_ops lan78xx_ethtool_ops = {
.set_pauseparam = lan78xx_set_pause,
.get_link_ksettings = lan78xx_get_link_ksettings,
.set_link_ksettings = lan78xx_set_link_ksettings,
+   .get_regs_len   = lan78xx_get_regs_len,
+   .get_regs   = lan78xx_get_regs,
 };
 
 static int lan78xx_ioctl(struct net_device *netdev, struct ifreq *rq, int cmd)
-- 
2.16.2

[PATCH net 3/3] lan78xx: Lan7801 Support for Fixed PHY

2018-04-06 Thread Raghuram Chary J

Adding Fixed PHY support to the lan78xx driver.

Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 
Ethernet device driver")
Signed-off-by: Raghuram Chary J 
---
 drivers/net/usb/Kconfig   |  1 +
 drivers/net/usb/lan78xx.c | 42 ++
 2 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig
index f28bd74ac275..418b0904cecb 100644
--- a/drivers/net/usb/Kconfig
+++ b/drivers/net/usb/Kconfig
@@ -111,6 +111,7 @@ config USB_LAN78XX
select MII
select PHYLIB
select MICROCHIP_PHY
+   select FIXED_PHY
help
  This option adds support for Microchip LAN78XX based USB 2
  & USB 3 10/100/1000 Ethernet adapters.
diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index e3cc3b504c87..e67b2dabde66 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -36,7 +36,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include "lan78xx.h"
 
 #define DRIVER_AUTHOR  "WOOJUNG HUH "
@@ -426,6 +426,7 @@ struct lan78xx_net {
struct statstagestats;
 
struct irq_domain_data  domain_data;
+   struct phy_device   *fixedphy;
 };
 
 /* define external phy id */
@@ -2058,11 +2059,39 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
int ret;
u32 mii_adv;
struct phy_device *phydev;
+   struct fixed_phy_status fphy_status = {
+   .link = 1,
+   .speed = SPEED_1000,
+   .duplex = DUPLEX_FULL,
+   };
 
phydev = phy_find_first(dev->mdiobus);
if (!phydev) {
-   netdev_err(dev->net, "no PHY found\n");
-   return -EIO;
+   if (dev->chipid == ID_REV_CHIP_ID_7801_) {
+   u32 buf;
+
+   netdev_info(dev->net, "PHY Not Found!! Registering 
Fixed PHY\n");
+   phydev = fixed_phy_register(PHY_POLL, &fphy_status, -1,
+   NULL);
+   if (IS_ERR(phydev)) {
+   netdev_err(dev->net, "No PHY/fixed_PHY 
found\n");
+   return -ENODEV;
+   }
+   netdev_info(dev->net, "Registered FIXED PHY\n");
+   dev->interface = PHY_INTERFACE_MODE_RGMII;
+   dev->fixedphy = phydev;
+   ret = lan78xx_write_reg(dev, MAC_RGMII_ID,
+   MAC_RGMII_ID_TXC_DELAY_EN_);
+   ret = lan78xx_write_reg(dev, RGMII_TX_BYP_DLL, 0x3D00);
+   ret = lan78xx_read_reg(dev, HW_CFG, &buf);
+   buf |= HW_CFG_CLK125_EN_;
+   buf |= HW_CFG_REFCLK25_EN_;
+   ret = lan78xx_write_reg(dev, HW_CFG, buf);
+   goto phyinit;
+   } else {
+   netdev_err(dev->net, "no PHY found\n");
+   return -EIO;
+   }
}
 
if ((dev->chipid == ID_REV_CHIP_ID_7800_) ||
@@ -2100,7 +2129,7 @@ static int lan78xx_phy_init(struct lan78xx_net *dev)
ret = -EIO;
goto error;
}
-
+phyinit:
/* if phyirq is not set, use polling mode in phylib */
if (dev->domain_data.phyirq > 0)
phydev->irq = dev->domain_data.phyirq;
@@ -3559,6 +3588,11 @@ static void lan78xx_disconnect(struct usb_interface 
*intf)
udev = interface_to_usbdev(intf);
 
net = dev->net;
+
+   if (dev->fixedphy) {
+   fixed_phy_unregister(dev->fixedphy);
+   dev->fixedphy = NULL;
+   }
unregister_netdev(net);
 
cancel_delayed_work_sync(&dev->wq);
-- 
2.16.2

[PATCH] vhost-net: set packet weight of tx polling to 2 * vq size

2018-04-06 Thread 张海斌

handle_tx will delay rx for tens or even hundreds of milliseconds when tx is
busy polling UDP packets with small payloads (e.g. a 1-byte UDP payload),
because VHOST_NET_WEIGHT accounts only for bytes sent, not for packet count.

Ping-Latencies shown below were tested between two Virtual Machines using
netperf (UDP_STREAM, len=1), and then another machine pinged the client:

Packet-Weight   Ping-Latencies (milliseconds)
                   min      avg      max
Origin           3.319   18.489   57.303
64               1.643    2.021    2.552
128              1.825    2.600    3.224
256              1.997    2.710    4.295
512              1.860    3.171    4.631
1024             2.002    4.173    9.056
2048             2.257    5.650    9.688
4096             2.093    8.508   15.943

Ring size is a hint from device about a burst size it can tolerate. Based on
benchmarks, set the weight to 2 * vq size.
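
The effect of the second budget can be sketched with a small userspace
model. Everything here is illustrative: tx_run() is a hypothetical
stand-in for the send loop in handle_tx(), and BYTE_WEIGHT uses 0x80000
as an assumed value for the existing byte budget.

```c
#include <stddef.h>

#define BYTE_WEIGHT 0x80000u                   /* assumed byte budget per run */
#define PKT_WEIGHT(vq_num) ((vq_num) * 2u)     /* packet budget: 2 * vq size */

/* Returns how many packets of pkt_len bytes a single run sends before
 * it has to requeue itself, given `pending` packets waiting.
 */
static unsigned tx_run(unsigned pending, size_t pkt_len, unsigned vq_num)
{
	size_t total_len = 0;
	unsigned sent_pkts = 0;

	while (sent_pkts < pending) {
		total_len += pkt_len;
		sent_pkts++;
		/* Stop on whichever budget is exhausted first. */
		if (total_len >= BYTE_WEIGHT ||
		    sent_pkts >= PKT_WEIGHT(vq_num))
			break;
	}
	return sent_pkts;
}
```

In this model, 1-byte packets on a 256-entry ring stop after 512
packets instead of running until the byte budget is spent, while
4096-byte packets still hit the byte budget first.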

To evaluate this change, further tests were done using netperf (RR, TX) between
two machines with Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz, and vq size was
tweaked through qemu. The results below do not show obvious changes.

vq size=256 TCP_RR                vq size=512 TCP_RR
size/sessions/+thu%/+normalize%   size/sessions/+thu%/+normalize%
   1/   1/  -7%/-2%                  1/   1/   0%/-2%
   1/   4/  +1%/ 0%                  1/   4/  +1%/ 0%
   1/   8/  +1%/-2%                  1/   8/   0%/+1%
  64/   1/  -6%/ 0%                 64/   1/  +7%/+3%
  64/   4/   0%/+2%                 64/   4/  -1%/+1%
  64/   8/   0%/ 0%                 64/   8/  -1%/-2%
 256/   1/  -3%/-4%                256/   1/  -4%/-2%
 256/   4/  +3%/+4%                256/   4/  +1%/+2%
 256/   8/  +2%/ 0%                256/   8/  +1%/-1%

vq size=256 UDP_RR                vq size=512 UDP_RR
size/sessions/+thu%/+normalize%   size/sessions/+thu%/+normalize%
   1/   1/  -5%/+1%                  1/   1/  -3%/-2%
   1/   4/  +4%/+1%                  1/   4/  -2%/+2%
   1/   8/  -1%/-1%                  1/   8/  -1%/ 0%
  64/   1/  -2%/-3%                 64/   1/  +1%/+1%
  64/   4/  -5%/-1%                 64/   4/  +2%/ 0%
  64/   8/   0%/-1%                 64/   8/  -2%/+1%
 256/   1/  +7%/+1%                256/   1/  -7%/ 0%
 256/   4/  +1%/+1%                256/   4/  -3%/-4%
 256/   8/  +2%/+2%                256/   8/  +1%/+1%

vq size=256 TCP_STREAM            vq size=512 TCP_STREAM
size/sessions/+thu%/+normalize%   size/sessions/+thu%/+normalize%
  64/   1/   0%/-3%                 64/   1/   0%/ 0%
  64/   4/  +3%/-1%                 64/   4/  -2%/+4%
  64/   8/  +9%/-4%                 64/   8/  -1%/+2%
 256/   1/  +1%/-4%                256/   1/  +1%/+1%
 256/   4/  -1%/-1%                256/   4/  -3%/ 0%
 256/   8/  +7%/+5%                256/   8/  -3%/ 0%
 512/   1/  +1%/ 0%                512/   1/  -1%/-1%
 512/   4/  +1%/-1%                512/   4/   0%/ 0%
 512/   8/  +7%/-5%                512/   8/  +6%/-1%
1024/   1/   0%/-1%               1024/   1/   0%/+1%
1024/   4/  +3%/ 0%               1024/   4/  +1%/ 0%
1024/   8/  +8%/+5%               1024/   8/  -1%/ 0%
2048/   1/  +2%/+2%               2048/   1/  -1%/ 0%
2048/   4/  +1%/ 0%               2048/   4/   0%/-1%
2048/   8/  -2%/ 0%               2048/   8/   5%/-1%
4096/   1/  -2%/ 0%               4096/   1/  -2%/ 0%
4096/   4/  +2%/ 0%               4096/   4/   0%/ 0%
4096/   8/  +9%/-2%               4096/   8/  -5%/-1%

Signed-off-by: Haibin Zhang 
Signed-off-by: Yunfang Tai 
Signed-off-by: Lidong Chen 
---
 drivers/vhost/net.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8139bc70ad7d..3563a305cc0a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -44,6 +44,10 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
 
+/* Max number of packets transferred before requeueing the job.
+ * Using this limit prevents one virtqueue from starving rx. */
+#define VHOST_NET_PKT_WEIGHT(vq) ((vq)->num * 2)
+
 /* MAX number of TX used buffers for outstanding zerocopy */
 #define VHOST_MAX_PEND 128
 #define VHOST_GOODCOPY_LEN 256
@@ -473,6 +477,7 @@ static void handle_tx(struct vhost_net *net)
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
bool zcopy, zcopy_used;
+   int sent_pkts =

Re: marvell switch

2018-04-06 Thread Ran Shalit

On Thu, Apr 5, 2018 at 11:46 PM, Andrew Lunn  wrote:
>> > Hi Ran
>> >
>> > The Marvell driver makes each port act like a normal Linux network
>> > interface. So if you want to enable a port, do
>> >
>> > ip link set lan0 up
>> >
>> > Want to add an ip address to a port
>> >
>> > ip addr add 10.42.42.42/24 dev lan0
>> >
>> > Want to bridge two ports
>> >
>> > ip link add name br0 type bridge
>> > ip link set dev br0 up
>> > ip link set dev lan0 master br0
>> > ip link set dev lan1 master br0
>> >
>> > Just treat them as normal interfaces.
>> >
>>
>> If I may please ask,
>> What is the purpose of using bridge for configuring switch interfaces.
>> Is it in order to isolate some ports from others?
>> I ask because according to my understanding the default configuration of
>> the driver is to enable switch in "flat" configuration, i.e. as if all
>> ports are connected to each other.
>
> Please think about what i said. They are standard Linux network
> interfaces. Do standard Linux network interfaces bridge themselves
> together by default? No, you need to configure a bridge.
>
>  Andrew

I understand now...
Thank you very much.
ranran

Re: WARNING in xfrm6_tunnel_net_exit

2018-04-06 Thread syzbot


syzbot has found reproducer for the following crash on upstream commit
3c8ba0d61d04ced9f8d9ff93977995a9e4e96e91 (Sat Mar 31 01:52:36 2018 +)
kernel.h: Retain constant expression output for max()/min()
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=777bf170a89e7b326405


So far this crash happened 10982 times on linux-next, mmots, net-next,  
upstream.
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=5399809707999232
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=4550974920196096
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-1647968177339044852

compiler: gcc (GCC) 8.0.1 20180301 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+777bf170a89e7b326...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed.

IPVS: ftp: loaded support on port[0] = 21
IPVS: ftp: loaded support on port[0] = 21
IPVS: ftp: loaded support on port[0] = 21
IPVS: ftp: loaded support on port[0] = 21
IPVS: ftp: loaded support on port[0] = 21
WARNING: CPU: 0 PID: 180 at net/ipv6/xfrm6_tunnel.c:345  
xfrm6_tunnel_net_exit+0x2c0/0x4f0 net/ipv6/xfrm6_tunnel.c:345

Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 180 Comm: kworker/u4:4 Not tainted 4.16.0+ #2
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

Workqueue: netns cleanup_net
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x1b9/0x29f lib/dump_stack.c:53
 panic+0x22f/0x4de kernel/panic.c:183
 __warn.cold.8+0x163/0x1a3 kernel/panic.c:547
 report_bug+0x252/0x2d0 lib/bug.c:186
 fixup_bug arch/x86/kernel/traps.c:178 [inline]
 do_error_trap+0x1bc/0x470 arch/x86/kernel/traps.c:296
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
 invalid_op+0x1b/0x40 arch/x86/entry/entry_64.S:991
RIP: 0010:xfrm6_tunnel_net_exit+0x2c0/0x4f0 net/ipv6/xfrm6_tunnel.c:345
RSP: 0018:8801d96373d8 EFLAGS: 00010293
RAX: 8801d961c080 RBX: 8801b0e999a0 RCX: 866b08c6
RDX:  RSI: 866b08d0 RDI: 0007
RBP: 8801d96374f8 R08: 8801d961c080 R09: ed003b6046c2
R10: 0003 R11: 0003 R12: 007c
R13: ed003b2c6e82 R14: 8801d96374d0 R15: 8801b6185f80
 ops_exit_list.isra.7+0xb0/0x160 net/core/net_namespace.c:152
 cleanup_net+0x51d/0xb20 net/core/net_namespace.c:523
 process_one_work+0xc1e/0x1b50 kernel/workqueue.c:2145
 worker_thread+0x1cc/0x1440 kernel/workqueue.c:2279
 kthread+0x345/0x410 kernel/kthread.c:238
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:411
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..

Re: [PATCH v2] net: thunderx: rework mac addresses list to u64 array

2018-04-06 Thread David Miller

From: Vadim Lomovtsev 
Date: Fri,  6 Apr 2018 04:14:25 -0700

> diff --git a/drivers/net/ethernet/cavium/thunder/nic.h 
> b/drivers/net/ethernet/cavium/thunder/nic.h
> index 5fc46c5a4f36..448d1fafc827 100644
> --- a/drivers/net/ethernet/cavium/thunder/nic.h
> +++ b/drivers/net/ethernet/cavium/thunder/nic.h
> @@ -265,14 +265,9 @@ struct nicvf_drv_stats {
>  
>  struct cavium_ptp;
>  
> -struct xcast_addr {
> - struct list_head list;
> - u64  addr;
> -};
> -
>  struct xcast_addr_list {
> - struct list_head list;
>   int  count;
> + u64  mc[];
>  };
>  
>  struct nicvf_work {
> diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
> b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> index 1e9a31fef729..a26d8bc92e01 100644
> --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> @@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct 
> *work_arg)
> work.work);
>   struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
>   union nic_mbx mbx = {};
> - struct xcast_addr *xaddr, *next;
> + u8 idx = 0;
^^^

>  
>   if (!vf_work)
>   return;
> @@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct 
> *work_arg)
>   /* check if we have any specific MACs to be added to PF DMAC filter */
>   if (vf_work->mc) {
>   /* now go through kernel list of MACs and add them one by one */
> - list_for_each_entry_safe(xaddr, next,
> -                              &vf_work->mc->list, list) {
> + for (idx = 0; idx < vf_work->mc->count; idx++) {

vf_work->mc->count is an 'int', therefore 'idx' should be declared 'int' as well,
not a 'u8'.
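
One hazard of such a type mismatch, offered as an illustration rather
than as the reviewer's stated reasoning: an 8-bit index wraps from 255
back to 0, so it can never reach an 'int' bound above 255 and the loop
would not terminate. The functions below are hypothetical userspace
models; the `limit` argument exists only so the demonstration itself
stops.

```c
#include <stdint.h>

/* Counts loop iterations with a u8 index against an int bound, giving
 * up after `limit` iterations.
 */
static unsigned iterations_u8(int count, unsigned limit)
{
	unsigned n = 0;
	uint8_t idx;

	/* idx is promoted to int for the comparison, but wraps at 256. */
	for (idx = 0; idx < count && n < limit; idx++)
		n++;
	return n;
}

/* Same loop with an int index, matching the type of the bound. */
static unsigned iterations_int(int count, unsigned limit)
{
	unsigned n = 0;
	int idx;

	for (idx = 0; idx < count && n < limit; idx++)
		n++;
	return n;
}
```

For a count of 300, the int version finishes after 300 iterations,
while the u8 version only stops when it hits the artificial limit.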

Re: [PATCH] dp83640: Ensure against premature access to PHY registers after reset

2018-04-06 Thread David Miller

From: Andrew Lunn 
Date: Fri, 6 Apr 2018 16:14:10 +0200

> On Fri, Apr 06, 2018 at 04:05:40PM +0200, Esben Haabendal wrote:
>> From: Esben Haabendal 
>> 
>> Signed-off-by: Esben Haabendal 
>> ---
>>  drivers/net/phy/dp83640.c | 17 +
>>  1 file changed, 17 insertions(+)
>> 
>> diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
>> index 654f42d00092..48403170096a 100644
>> --- a/drivers/net/phy/dp83640.c
>> +++ b/drivers/net/phy/dp83640.c
>> @@ -1207,6 +1207,22 @@ static void dp83640_remove(struct phy_device *phydev)
>>  kfree(dp83640);
>>  }
>>  
>> +static int dp83640_soft_reset(struct phy_device *phydev)
>> +{
>> +int ret;
>> +
>> +ret = genphy_soft_reset(phydev);
>> +if (ret < 0)
>> +return ret;
>> +
>> +/* From DP83640 datasheet: "Software driver code must wait 3 us
>> + * following a software reset before allowing further serial MII
>> + * operations with the DP83640." */
>> +udelay(3);
> 
> Hi Esben
> 
> The accuracy of udelay() is not guaranteed. So you probably want to be
> a bit pessimistic, and use 10.

Agreed.

Re: [PATCH net-next 0/5] net: stmmac: Stop using hard-coded callbacks

2018-04-06 Thread David Miller

From: Jose Abreu 
Date: Fri,  6 Apr 2018 14:08:14 +0100

> This is a starting point for a cleanup and re-organization of stmmac.
> 
> In this series we stop using hard-coded callbacks along the code and use
> instead helpers which are defined in a single place ("hwif.h").
> 
> This brings several advantages:
>   1) Less typing :)
>   2) Guaranteed function pointer check
>   3) More flexibility
> 
> By 2) we stop using the repeated pattern of:
>   if (priv->hw->mac->some_func)
>   priv->hw->mac->some_func(...)
> 
> I didn't check but I expect the final .ko will be bigger with this series
> because *all* of function pointers are checked.
> 
> Anyway, I hope this can make the code more readable and more flexible now.

The net-next tree is closed, please resubmit this series when it opens
back up.

Thank you.
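
For readers unfamiliar with the pattern discussed in the cover letter, the following is a rough userspace sketch of a "checked callback" helper. All names here (demo_mac_ops, demo_priv, demo_mac_call) are hypothetical and are not taken from the series; returning -EINVAL for a missing callback is likewise an assumption made for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical ops table standing in for priv->hw->mac */
struct demo_mac_ops {
	int (*set_mtu)(int mtu);
	int (*set_speed)(int speed);	/* left NULL below to show the check */
};

struct demo_priv {
	const struct demo_mac_ops *mac;
};

/* Call mac->op(args) if it is set, otherwise evaluate to -EINVAL */
#define demo_mac_call(priv, op, ...) \
	(((priv)->mac && (priv)->mac->op) ? \
	 (priv)->mac->op(__VA_ARGS__) : -EINVAL)

static int demo_set_mtu(int mtu)
{
	return mtu > 0 ? 0 : -ERANGE;
}

static const struct demo_mac_ops demo_ops = {
	.set_mtu = demo_set_mtu,
};
```

The pointer check lives in one place, so call sites shrink to a single line such as demo_mac_call(priv, set_mtu, 1500), at the cost of checking every callback on every call.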

Re: [PATCH v3] net: thunderx: rework mac addresses list to u64 array

2018-04-06 Thread Vadim Lomovtsev

Self-NACK here, because of https://lkml.org/lkml/2018/4/6/724

Sorry for noise.

Vadim

On Fri, Apr 06, 2018 at 07:04:43AM -0700, Vadim Lomovtsev wrote:
> From: Vadim Lomovtsev 
> 
> It is too expensive to pass u64 values via linked list, instead
> allocate array for them by overall number of mac addresses from netdev.
> 
> This eventually removes multiple kmalloc() calls, avoids memory
> fragmentation and allows a single null check on the kmalloc
> return value in order to prevent a potential null pointer dereference.
> 
> Addresses-Coverity-ID: 1467429 ("Dereference null return value")
> Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback 
> implementation for VF")
> Reported-by: Dan Carpenter 
> Signed-off-by: Vadim Lomovtsev 
> ---
> Changes from v1 to v2:
>  - C99 syntax: update xcast_addr_list struct field mc[0] -> mc[];
> Changes from v2 to v3:
>  - update commit description with 'Reported-by: Dan Carpenter';
>  - update size calculations for mc list to offsetof() call
>instead of explicit arithmetic;
> ---
>  drivers/net/ethernet/cavium/thunder/nic.h|  7 +-
>  drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 
> +---
>  2 files changed, 11 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cavium/thunder/nic.h 
> b/drivers/net/ethernet/cavium/thunder/nic.h
> index 5fc46c5a4f36..448d1fafc827 100644
> --- a/drivers/net/ethernet/cavium/thunder/nic.h
> +++ b/drivers/net/ethernet/cavium/thunder/nic.h
> @@ -265,14 +265,9 @@ struct nicvf_drv_stats {
>  
>  struct cavium_ptp;
>  
> -struct xcast_addr {
> - struct list_head list;
> - u64  addr;
> -};
> -
>  struct xcast_addr_list {
> - struct list_head list;
>   int  count;
> + u64  mc[];
>  };
>  
>  struct nicvf_work {
> diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
> b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> index 1e9a31fef729..7d9e58533a83 100644
> --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> @@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct 
> *work_arg)
> work.work);
>   struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
>   union nic_mbx mbx = {};
> - struct xcast_addr *xaddr, *next;
> + u8 idx = 0;
>  
>   if (!vf_work)
>   return;
> @@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct 
> *work_arg)
>   /* check if we have any specific MACs to be added to PF DMAC filter */
>   if (vf_work->mc) {
>   /* now go through kernel list of MACs and add them one by one */
> - list_for_each_entry_safe(xaddr, next,
> -  &vf_work->mc->list, list) {
> + for (idx = 0; idx < vf_work->mc->count; idx++) {
>   mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST;
> - mbx.xcast.data.mac = xaddr->addr;
> + mbx.xcast.data.mac = vf_work->mc->mc[idx];
>   nicvf_send_msg_to_pf(nic, &mbx);
> -
> - /* after receiving ACK from PF release memory */
> - list_del(&xaddr->list);
> - kfree(xaddr);
> - vf_work->mc->count--;
>   }
>   kfree(vf_work->mc);
>   }
> @@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device 
> *netdev)
>   mode |= BGX_XCAST_MCAST_FILTER;
>   /* here we need to copy mc addrs */
>   if (netdev_mc_count(netdev)) {
> - struct xcast_addr *xaddr;
> -
> - mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
> - INIT_LIST_HEAD(&mc_list->list);
> + mc_list = kmalloc(offsetof(typeof(*mc_list),
> +
> mc[netdev_mc_count(netdev)]),
> +   GFP_ATOMIC);
> + if (unlikely(!mc_list))
> + return;
> + mc_list->count = 0;
>   netdev_hw_addr_list_for_each(ha, &netdev->mc) {
> - xaddr = kmalloc(sizeof(*xaddr),
> - GFP_ATOMIC);
> - xaddr->addr =
> + mc_list->mc[mc_list->count] =
>   ether_addr_to_u64(ha->addr);
> - list_add_tail(&xaddr->list,
> -   &mc_list->list);
>   mc_list->count++;
>
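
The allocation scheme in the patch above can be sketched in userspace roughly as follows. This is an illustration only: malloc() stands in for kmalloc(), mac_to_u64() is a hand-written stand-in for the kernel helper, xcast_list_build() is a made-up name, and the size is computed with sizeof arithmetic, which for this struct layout gives the same value as the offsetof() form used in the patch.

```c
#include <stdint.h>
#include <stdlib.h>

struct xcast_addr_list {
	int count;
	uint64_t mc[];	/* C99 flexible array member */
};

/* Pack a 6-byte MAC address into the low 48 bits of a 64-bit value */
static uint64_t mac_to_u64(const uint8_t addr[6])
{
	uint64_t u = 0;
	int i;

	for (i = 0; i < 6; i++)
		u = (u << 8) | addr[i];
	return u;
}

/* One allocation for the header plus n entries; NULL on failure */
static struct xcast_addr_list *xcast_list_build(const uint8_t (*addrs)[6],
						int n)
{
	struct xcast_addr_list *list;
	int i;

	list = malloc(sizeof(*list) + (size_t)n * sizeof(list->mc[0]));
	if (!list)
		return NULL;	/* single check instead of one per entry */

	list->count = 0;
	for (i = 0; i < n; i++)
		list->mc[list->count++] = mac_to_u64(addrs[i]);
	return list;
}
```

The consumer then frees the whole list with a single free(), mirroring the single kfree(vf_work->mc) left in the work handler.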

Re: [PATCH net] net/sched: fix NULL dereference in the error path of tcf_bpf_init()

2018-04-06 Thread Lucas Bates

On Thu, Apr 5, 2018 at 7:19 PM, Davide Caratti  wrote:
> when tcf_bpf_init_from_ops() fails (e.g. because of program having invalid
> number of instructions), tcf_bpf_cfg_cleanup() calls bpf_prog_put(NULL) or
> bpf_prog_destroy(NULL). Unless CONFIG_BPF_SYSCALL is unset, this causes
> the following error:
>
>  BUG: unable to handle kernel NULL pointer dereference at 0020
>  PGD 80007345a067 P4D 80007345a067 PUD 340e1067 PMD 0
>  Oops:  [#1] SMP PTI
>  Modules linked in: act_bpf(E) ip6table_filter ip6_tables iptable_filter 
> binfmt_misc ext4 mbcache jbd2 crct10dif_pclmul crc32_pclmul 
> ghash_clmulni_intel snd_hda_codec_generic pcbc snd_hda_intel snd_hda_codec 
> snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm aesni_intel crypto_simd 
> glue_helper cryptd joydev snd_timer snd virtio_balloon pcspkr soundcore 
> i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c 
> ata_generic pata_acpi qxl drm_kms_helper syscopyarea sysfillrect sysimgblt 
> fb_sys_fops ttm virtio_blk drm virtio_net virtio_console i2c_core 
> crc32c_intel serio_raw virtio_pci ata_piix libata virtio_ring floppy virtio 
> dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_bpf]
>  CPU: 3 PID: 5654 Comm: tc Tainted: GE4.16.0.bpf_test+ #408
>  Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>  RIP: 0010:__bpf_prog_put+0xc/0xc0
>  RSP: 0018:9594003ef728 EFLAGS: 00010202
>  RAX:  RBX: 9594003ef758 RCX: 0024
>  RDX:  RSI: 0001 RDI: 
>  RBP:  R08: 0001 R09: 0044
>  R10: 0220 R11: 8a7ab9f17131 R12: 
>  R13: 8a7ab7c3c8e0 R14: 0001 R15: 8a7ab88f1054
>  FS:  7fcb2f17c740() GS:8a7abfd8() knlGS:
>  CS:  0010 DS:  ES:  CR0: 80050033
>  CR2: 0020 CR3: 7c888006 CR4: 001606e0
>  Call Trace:
>   tcf_bpf_cfg_cleanup+0x2f/0x40 [act_bpf]
>   tcf_bpf_cleanup+0x4c/0x70 [act_bpf]
>   __tcf_idr_release+0x79/0x140
>   tcf_bpf_init+0x125/0x330 [act_bpf]
>   tcf_action_init_1+0x2cc/0x430
>   ? get_page_from_freelist+0x3f0/0x11b0
>   tcf_action_init+0xd3/0x1b0
>   tc_ctl_action+0x18b/0x240
>   rtnetlink_rcv_msg+0x29c/0x310
>   ? _cond_resched+0x15/0x30
>   ? __kmalloc_node_track_caller+0x1b9/0x270
>   ? rtnl_calcit.isra.29+0x100/0x100
>   netlink_rcv_skb+0xd2/0x110
>   netlink_unicast+0x17c/0x230
>   netlink_sendmsg+0x2cd/0x3c0
>   sock_sendmsg+0x30/0x40
>   ___sys_sendmsg+0x27a/0x290
>   ? mem_cgroup_commit_charge+0x80/0x130
>   ? page_add_new_anon_rmap+0x73/0xc0
>   ? do_anonymous_page+0x2a2/0x560
>   ? __handle_mm_fault+0xc75/0xe20
>   __sys_sendmsg+0x58/0xa0
>   do_syscall_64+0x6e/0x1a0
>   entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>  RIP: 0033:0x7fcb2e58eba0
>  RSP: 002b:7ffc93c496c8 EFLAGS: 0246 ORIG_RAX: 002e
>  RAX: ffda RBX: 7ffc93c497f0 RCX: 7fcb2e58eba0
>  RDX:  RSI: 7ffc93c49740 RDI: 0003
>  RBP: 5ac6a646 R08: 0002 R09: 
>  R10: 7ffc93c49120 R11: 0246 R12: 
>  R13: 7ffc93c49804 R14: 0001 R15: 0066afa0
>  Code: 5f 00 48 8b 43 20 48 c7 c7 70 2f 7c b8 c7 40 10 00 00 00 00 5b e9 a5 
> 8b 61 00 0f 1f 44 00 00 0f 1f 44 00 00 41 54 55 48 89 fd 53 <48> 8b 47 20 f0 
> ff 08 74 05 5b 5d 41 5c c3 41 89 f4 0f 1f 44 00
>  RIP: __bpf_prog_put+0xc/0xc0 RSP: 9594003ef728
>  CR2: 0020
>
> Fix it in tcf_bpf_cfg_cleanup(), ensuring that bpf_prog_{put,destroy}(f)
> is called only when f is not NULL.
>
> Fixes: bbc09e7842a5 ("net/sched: fix idr leak on the error path of 
> tcf_bpf_init()")
> Reported-by: Lucas Bates 
> Signed-off-by: Davide Caratti 

That does the trick, thanks Davide.

Tested-by: Lucas Bates
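
As a rough illustration of the fix described in the commit message, here is a userspace sketch of a cleanup routine that only releases a program pointer when one was actually attached. The demo_* types and functions are hypothetical stand-ins, not the act_bpf code; the release helpers dereference their argument, which is the assumption that makes the NULL guard necessary.

```c
#include <stdbool.h>
#include <stddef.h>

struct demo_prog {
	int refcnt;
};

struct demo_cfg {
	struct demo_prog *filter;	/* NULL if init failed early */
	bool is_ebpf;
};

/* Both release helpers dereference their argument unconditionally */
static void demo_prog_put(struct demo_prog *p)
{
	p->refcnt--;
}

static void demo_prog_destroy(struct demo_prog *p)
{
	p->refcnt = 0;
}

static void demo_cfg_cleanup(const struct demo_cfg *cfg)
{
	struct demo_prog *filter = cfg->filter;

	if (!filter)
		return;	/* nothing was attached, nothing to release */

	if (cfg->is_ebpf)
		demo_prog_put(filter);
	else
		demo_prog_destroy(filter);
}
```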

Re: [PATCH net] net: dsa: Discard frames from unused ports

2018-04-06 Thread Andrew Lunn

On Thu, Apr 05, 2018 at 03:20:14PM -0700, Florian Fainelli wrote:
> On 04/04/2018 07:17 PM, Andrew Lunn wrote:
> > On Wed, Apr 04, 2018 at 05:49:10PM -0700, Florian Fainelli wrote:
> >> On 04/04/2018 04:56 PM, Andrew Lunn wrote:
> >>> The Marvell switches under some conditions will pass a frame to the
> >>> host with the port being the CPU port. Such frames are invalid, and
> >>> should be dropped. Not dropping them can result in a crash when
> >>> incrementing the receive statistics for an invalid port.
> >>>
> >>> Reported-by: Chris Healy 
> >>> Fixes: 5f6b4e14cada ("net: dsa: User per-cpu 64-bit statistics")
> >>
> >> Are you sure this is the commit that introduced the problem?
> > 
> > Hi Florian
> > 
> > Well, the problem is it crashes when trying to update the
> > statistics. The CPU port is not allocated a p->stats64, only slave
> > ports get those. So before this patch, there was no crash and the
> > frame would be delivered to the master interface. This in itself is
> > probably not correct, but also not fatal. Talking to Chris, it seems
> > this behaviour has existed for a long while. I needed to use lldpd to
> > trigger the issue, because i assume the Marvell switch sees these as
> > special frames and forwards them to the CPU. The other thing is, the
> > code got refactored recently. So this fix will not rebase to too many
> > earlier versions. It needs a fix per tagging protocol for before the
> > common dsa_master_find_slave() was added.
> 
> Yes what you are explaining makes sense, but does not that mean we would
> just be accessing a garbage memory location before as well?

Humm, yes. I actually picked the wrong patch. It took two attempts to
get the stats64 working. I should have picked the first one.

Before stats64, we just used skb->dev. I need to look back at older
code, but skb->dev is valid in the versions i tested. It points to the
master device. So we don't crash.

However, i agree, we should fix this for the LTS kernels.

 Andrew
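
The idea under discussion, dropping frames whose source port is not a user port before touching per-port statistics, might be sketched like this. It is a simplified userspace model with hypothetical names (demo_switch, demo_find_slave, demo_rcv), not the DSA tagging code.

```c
#include <stddef.h>

#define DEMO_MAX_PORTS 4

enum demo_port_type { DEMO_PORT_UNUSED, DEMO_PORT_CPU, DEMO_PORT_USER };

struct demo_port {
	enum demo_port_type type;
	unsigned long rx_packets;	/* only meaningful for user ports */
};

struct demo_switch {
	struct demo_port ports[DEMO_MAX_PORTS];
};

/* Return the user port for a source port number, or NULL if invalid */
static struct demo_port *demo_find_slave(struct demo_switch *sw, int port)
{
	if (port < 0 || port >= DEMO_MAX_PORTS)
		return NULL;
	if (sw->ports[port].type != DEMO_PORT_USER)
		return NULL;	/* CPU and unused ports have no stats */
	return &sw->ports[port];
}

/* Returns 0 if the frame was accounted, -1 if it was dropped */
static int demo_rcv(struct demo_switch *sw, int source_port)
{
	struct demo_port *p = demo_find_slave(sw, source_port);

	if (!p)
		return -1;
	p->rx_packets++;
	return 0;
}
```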

Re: [PATCH net-next] netns: filter uevents correctly

2018-04-06 Thread Eric W. Biederman

Christian Brauner  writes:

> On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
>> Christian Brauner  writes:
>> 
>> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
>> >> On 05.04.2018 17:07, Christian Brauner wrote:
>> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
>> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network 
>> >> >>> namespaces")
>> >> >>>
>> >> >>> enabled sending hotplug events into all network namespaces back in 
>> >> >>> 2010.
>> >> >>> Over time the set of uevents that get sent into all network 
>> >> >>> namespaces has
>> >> >>> shrunk. We have now reached the point where hotplug events for all 
>> >> >>> devices
>> >> >>> that carry a namespace tag are filtered according to that namespace.
>> >> >>>
>> >> >>> Specifically, they are filtered whenever the namespace tag of the 
>> >> >>> kobject
>> >> >>> does not match the namespace tag of the netlink socket. One example 
>> >> >>> are
>> >> >>> network devices. Uevents for network devices only show up in the 
>> >> >>> network
>> >> >>> namespaces these devices are moved to or created in.
>> >> >>>
>> >> >>> However, any uevent for a kobject that does not have a namespace tag
>> >> >>> associated with it will not be filtered and we will *try* to 
>> >> >>> broadcast it
>> >> >>> into all network namespaces.
>> >> >>>
>> >> >>> The original patchset was written in 2010 before user namespaces were 
>> >> >>> a
>> >> >>> thing. With the introduction of user namespaces sending out uevents 
>> >> >>> became
>> >> >>> partially isolated as they were filtered by user namespaces:
>> >> >>>
>> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
>> >> >>>
>> >> >>> if (!net_eq(sock_net(sk), p->net)) {
>> >> >>> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> >> >>> return;
>> >> >>>
>> >> >>> if (!peernet_has_id(sock_net(sk), p->net))
>> >> >>> return;
>> >> >>>
>> >> >>> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >> >>>  CAP_NET_BROADCAST))
>> >> >>> return;
>> >> >>> }
>> >> >>>
>> >> >>> The file_ns_capable() check will check whether the caller had
>> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the 
>> >> >>> user
>> >> >>> namespace of interest. This check is fine in general but seems 
>> >> >>> insufficient
>> >> >>> to me when paired with uevents. The reason is that devices always 
>> >> >>> belong to
>> >> >>> the initial user namespace so uevents for kobjects that do not carry a
>> >> >>> namespace tag should never be sent into another user namespace. This 
>> >> >>> has
>> >> >>> been the intention all along. But there's one case where this breaks,
>> >> >>> namely if a new user namespace is created by root on the host and an
>> >> >>> identity mapping is established between root on the host and root in 
>> >> >>> the
>> >> >>> new user namespace. Here's a reproducer:
>> >> >>>
>> >> >>>  sudo unshare -U --map-root
>> >> >>>  udevadm monitor -k
>> >> >>>  # Now change to initial user namespace and e.g. do
>> >> >>>  modprobe kvm
>> >> >>>  # or
>> >> >>>  rmmod kvm
>> >> >>>
>> >> >>> will allow the non-initial user namespace to retrieve all uevents 
>> >> >>> from the
>> >> >>> host. This seems very anecdotal given that in the general case user
>> >> >>> namespaces do not see any uevents and also can't really do anything 
>> >> >>> useful
>> >> >>> with them.
>> >> >>>
>> >> >>> Additionally, it is now possible to send uevents from userspace. As 
>> >> >>> such we
>> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>> >> >>> namespace of the network namespace of the netlink socket) userspace 
>> >> >>> process
>> >> >>> make a decision what uevents should be sent.
>> >> >>>
>> >> >>> This makes me think that we should simply ensure that uevents for 
>> >> >>> kobjects
>> >> >>> that do not carry a namespace tag are *always* filtered by user 
>> >> >>> namespace
>> >> >>> in kobj_bcast_filter(). Specifically:
>> >> >>> - If the owning user namespace of the uevent socket is not 
>> >> >>> init_user_ns the
>> >> >>>   event will always be filtered.
>> >> >>> - If the network namespace the uevent socket belongs to was created 
>> >> >>> in the
>> >> >>>   initial user namespace but was opened from a non-initial user 
>> >> >>> namespace
>> >> >>>   the event will be filtered as well.
>> >> >>> Put another way, uevents for kobjects not carrying a namespace tag 
>> >> >>> are now
>> >> >>> always only sent to the initial user namespace. The regression 
>> >> >>> potential
>> >> >>> for this is near to non-existent since user namespaces can't really do
>> >> >>> anything with interesting devices.
>> >> >>>
>> >> >>> Signed-off-by: Christian Brauner 
>>
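
The two filtering rules proposed in the quoted commit message can be modelled as a small predicate. This is a userspace sketch under stated assumptions: demo_user_ns, demo_uevent_sock and demo_filter_untagged are hypothetical names, and the real kobj_bcast_filter() works on kernel socket and namespace structures rather than on these stand-ins.

```c
#include <stdbool.h>

struct demo_user_ns {
	int id;
};

/* Stand-in for the initial user namespace */
static const struct demo_user_ns demo_init_user_ns = { 0 };

struct demo_uevent_sock {
	const struct demo_user_ns *netns_owner;	/* owner of the socket's netns */
	const struct demo_user_ns *opener;	/* user ns the socket was opened from */
};

/* true: the uevent for an untagged kobject is filtered (not delivered) */
static bool demo_filter_untagged(const struct demo_uevent_sock *s)
{
	if (s->netns_owner != &demo_init_user_ns)
		return true;	/* rule 1: netns not owned by init_user_ns */
	if (s->opener != &demo_init_user_ns)
		return true;	/* rule 2: opened from a non-initial user ns */
	return false;
}
```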

Re: [RFC] ethtool: Support for driver private ioctl's

2018-04-06 Thread Andrew Lunn

On Fri, Apr 06, 2018 at 02:51:15PM +0100, Jose Abreu wrote:
> Hi Florian,
> 
> On 05-04-2018 16:50, Florian Fainelli wrote:
> >
> > On 04/05/2018 03:47 AM, Jose Abreu wrote:
> >> Hi All,
> >>
> >> I would like to know your opinion regarding adding support for
> >> driver private ioctl's in ethtool.
> >>
> >> Background: Synopsys Ethernet IP's have a certain number of
> >> features which can be reconfigured at runtime. Giving you two
> >> examples: One of the most recent one is the safety features,
> >> which can be enabled/disabled and forced at runtime.

Hi Jose

Is there a reason somebody would decide to use the Ethernet in
'unsafe' mode? Cannot you just turn it on by default?

 Andrew

Re: [RFC] ethtool: Support for driver private ioctl's

2018-04-06 Thread Jose Abreu

Hi Andrew,

On 06-04-2018 15:47, Andrew Lunn wrote:
> On Fri, Apr 06, 2018 at 02:51:15PM +0100, Jose Abreu wrote:
>> Hi Florian,
>>
>> On 05-04-2018 16:50, Florian Fainelli wrote:
>>> On 04/05/2018 03:47 AM, Jose Abreu wrote:
>>>> Hi All,
>>>>
>>>> I would like to know your opinion regarding adding support for
>>>> driver private ioctl's in ethtool.
>>>>
>>>> Background: Synopsys Ethernet IP's have a certain number of
>>>> features which can be reconfigured at runtime. Giving you two
>>>> examples: One of the most recent one is the safety features,
>>>> which can be enabled/disabled and forced at runtime.
> Hi Jose
>
> Is there a reason somebody would decide to use the Ethernet in
> 'unsafe' mode? Cannot you just turn it on by default?

Yes, it's already on by default. I was just trying to give an
example of a user-reconfigurable feature, maybe it was not the
best one :)

Thanks and Best Regards,
Jose Miguel Abreu

>
>Andrew

Re: TCP one-by-one acking - RFC interpretation question

2018-04-06 Thread Michal Kubecek

On Fri, Apr 06, 2018 at 05:01:29AM -0700, Eric Dumazet wrote:
> 
> 
> On 04/06/2018 03:05 AM, Michal Kubecek wrote:
> > Hello,
> > 
> > I encountered a strange behaviour of some (non-linux) TCP stack which
> > I believe is incorrect but support engineers from the company producing
> > it claim is OK.
> > 
> > Assume a client (sender, Linux 4.4 kernel) sends a stream of MSS sized
> > segments but segments 2, 4 and 6 do not reach the server (receiver):
> > 
> > >      ACK             SAK             SAK             SAK
> > >   +-------+-------+-------+-------+-------+-------+-------+
> > >   |   1   |   2   |   3   |   4   |   5   |   6   |   7   |
> > >   +-------+-------+-------+-------+-------+-------+-------+
> > > 34273   35701   37129   38557   39985   41413   42841   44269
> > 
> > When segment 2 is retransmitted after RTO timeout, normal response would
> > be ACK-ing segment 3 (38557) with SACK for 5 and 7 (39985-41413 and
> > 42841-44269).
> > 
> > However, this server stack responds with two separate ACKs:
> > 
> >   - ACK 37129, SACK 37129-38557 39985-41413 42841-44269
> >   - ACK 38557, SACK 39985-41413 42841-44269
> 
> Hmmm... Yes this seems very very wrong and lazy.
> 
> Have you verified behavior of more recent linux kernel to such threats ?

No, unfortunately the problem was only encountered by our customer in
production environment (they tried to reproduce in a test lab but no
luck). They are running backups to NFS server and it happens from time
to time (in the order of hours, IIUC). So it would be probably hard to
let them try with more recent kernel.

On the other hand, they reported that SLE11 clients (kernel 3.0) do not
run into this kind of problem. It was originally reported as a
regression on migration from SLE11-SP4 (3.0 kernel) to SLE12-SP2 (4.4
kernel) and the problem was reported as "SLE12-SP2 is ignoring dupacks"
(which seems to be mostly caused by the switch to RACK).

It also seems that part of the problem is specific packet loss pattern
where at some point, many packets are lost in "every second" pattern.
The customer finally started to investigate this problem and it seems it
has something to do with their bonding setup (they provided no details,
my guess is packets are divided over two paths and one of them fails).

> packetdrill test would be relatively easy to write.

I'll try but I have very little experience with writing packetdrill
scripts so it will probably take some time.

> Regardless of this broken alien stack, we might be able to work around
> this faster than the vendor is able to fix and deploy a new stack.
> 
> ( https://en.wikipedia.org/wiki/Robustness_principle )
> Be conservative in what you do, be liberal in what you accept from
> others...

I was thinking about this a bit. "Fixing" the acknowledgment number
could do the trick but it doesn't feel correct. We might use the fact
that TSecr of both ACKs above matches TSval of the retransmission which
triggered them so that RTT calculated from timestamp would be the right
one. So perhaps something like "prefer timestamp RTT if measured RTT
seems way too off". But I'm not sure if it couldn't break other use
cases where (high) measured RTT is actually correct, rather than (low)
timestamp RTT.

Michal Kubecek
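
For reference, the response expected from the receiver in the scenario quoted above can be worked out mechanically. The sketch below is an illustration only, using the sequence numbers from the example; the demo_* names are made up. It computes the cumulative ACK and the number of SACK blocks from a per-segment "received" map.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NSEG 7

/* Segment i covers [demo_seq[i], demo_seq[i + 1]) */
static const uint32_t demo_seq[DEMO_NSEG + 1] = {
	34273, 35701, 37129, 38557, 39985, 41413, 42841, 44269
};

/* Cumulative ACK: end of the contiguous received run from the start */
static uint32_t demo_cum_ack(const bool rcvd[DEMO_NSEG])
{
	int i = 0;

	while (i < DEMO_NSEG && rcvd[i])
		i++;
	return demo_seq[i];
}

/* Number of SACK blocks: received runs beyond the cumulative ACK */
static int demo_sack_blocks(const bool rcvd[DEMO_NSEG])
{
	int i = 0, blocks = 0;

	while (i < DEMO_NSEG && rcvd[i])
		i++;	/* skip the cumulatively acknowledged part */
	for (; i < DEMO_NSEG; i++)
		if (rcvd[i] && (i == 0 || !rcvd[i - 1]))
			blocks++;	/* start of a new received run */
	return blocks;
}
```

Once segment 2 arrives, segments 1 to 3 are contiguous, so a single ACK of 38557 with two SACK blocks (segments 5 and 7) describes the receiver state completely.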

Re: [RFC 0/9] bpf: Add buildid check support

2018-04-06 Thread Jiri Olsa

On Thu, Apr 05, 2018 at 06:37:23PM -0700, Alexei Starovoitov wrote:
> On Thu, Apr 05, 2018 at 05:16:36PM +0200, Jiri Olsa wrote:
> > hi,
> > eBPF programs loaded for kprobes are allowed to read kernel
> > internal structures. We check the provided kernel version
> > to ensure that the program is loaded for the proper kernel. 
> > 
> > The problem is that the version check is not enough, because
> > it only follows the version setup from kernel's Makefile.
> > However, the internal kernel structures change based on the
> > .config data, so in practise we have different kernels with
> > same version.
> > 
> > The eBPF kprobe program thus then get loaded for different
> > kernel than it's been built for, get wrong data (silently)
> > and provide misleading output.
> > 
> > This patchset implements additional check in eBPF loading code
> > on provided build ID (from kernel's elf image, .notes section
> > GNU build ID) to ensure we load the eBPF program on correct
> > kernel.
> > 
> > Also available in here (based on bpf-next/master):
> >   https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
> >   bpf/checksum
> > 
> > This patchset consists of several changes:
> > 
> > - adding CONFIG_BUILDID_H option that instructs the build
> >   to generate uapi header file with build ID data, that
> >   will be included by eBPF program
> > 
> > - adding CONFIG_BPF_BUILDID_CHECK option and new bpf_attr
> >   field to allow build ID checking when loading the eBPF
> >   program
> > 
> > - changing libbpf to read and pass build ID to the kernel
> > 
> > - several small side fixes
> > 
> > - example perf eBPF code in bpf-samples/bpf-stdout-example.c
> >   to show the build ID support/usage.
> > 
> > # perf record -vv  -e ./bpf-samples/bpf-stdout-example.c kill 2>&1 | 
> > grep buildid
> > libbpf: section(7) buildid, size 21, link 0, flags 3, type=1
> > libbpf: kernel buildid of ./bpf-samples/bpf-stdout-example.c is: 
> > 6e25edeb408513184e2753bebad25d42314501a0
> > 
> >   The buildid is provided the same way we provide kernel
> >   version, in a special "buildid" section:
> > 
> > # cat ./bpf-samples/bpf-stdout-example.c
> > ...
> > #include 
> > 
> > char _buildid[] SEC("buildid") = LINUX_BUILDID_DATA;
> > ...
> > 
> >   where LINUX_BUILDID_DATA is defined in the generated buildid.h.
> > 
> > please note it's an RFC ;-) any comments and suggestions are welcome
> 
> I think this is overkill.
> 
> We're very heavy users of kprobe+bpf. It's used for lots
> of different cases and usage is constantly growing,
> but I haven't seen a single case of :
> 
> > The eBPF kprobe program thus then get loaded for different
> > kernel than it's been built for, get wrong data (silently)
> > and provide misleading output.
> 
> but I saw plenty of the opposite. People pre-compile the program
> and hack kernel version when they load, since they know in advance
> that kprobe+bpf doesn't use any kernel specific things.
> The existing kernel version check for kprobe+bpf is already annoying
> to them.

perhaps verifier could detect this (via bpf_probe_read usage) and disable
the version check automatically for such programs?

and in the same way force the version check (or buildid when enabled)
once the bpf_probe_read is detected

thanks,
jirka

Re: [PATCH v2] net: thunderx: rework mac addresses list to u64 array

2018-04-06 Thread Vadim Lomovtsev

On Fri, Apr 06, 2018 at 11:06:03AM -0400, David Miller wrote:
> From: Vadim Lomovtsev 
> Date: Fri,  6 Apr 2018 04:14:25 -0700
> 
> > diff --git a/drivers/net/ethernet/cavium/thunder/nic.h 
> > b/drivers/net/ethernet/cavium/thunder/nic.h
> > index 5fc46c5a4f36..448d1fafc827 100644
> > --- a/drivers/net/ethernet/cavium/thunder/nic.h
> > +++ b/drivers/net/ethernet/cavium/thunder/nic.h
> > @@ -265,14 +265,9 @@ struct nicvf_drv_stats {
> >  
> >  struct cavium_ptp;
> >  
> > -struct xcast_addr {
> > -   struct list_head list;
> > -   u64  addr;
> > -};
> > -
> >  struct xcast_addr_list {
> > -   struct list_head list;
> > int  count;
> > +   u64  mc[];
> >  };
> >  
> >  struct nicvf_work {
> > diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
> > b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> > index 1e9a31fef729..a26d8bc92e01 100644
> > --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> > +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
> > @@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct 
> > *work_arg)
> >   work.work);
> > struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
> > union nic_mbx mbx = {};
> > -   struct xcast_addr *xaddr, *next;
> > +   u8 idx = 0;
> ^^^
> 
> >  
> > if (!vf_work)
> > return;
> > @@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct 
> > work_struct *work_arg)
> > /* check if we have any specific MACs to be added to PF DMAC filter */
> > if (vf_work->mc) {
> > /* now go through kernel list of MACs and add them one by one */
> > -   list_for_each_entry_safe(xaddr, next,
> > -&vf_work->mc->list, list) {
> > +   for (idx = 0; idx < vf_work->mc->count; idx++) {
> 
> vf_work->mc->count is an 'int' therefore 'idx' should be declared 'int' as 
> well,
> not a 'u8'.

My bad, sorry.
Will post v4 shortly then.


WBR,
Vadim

Re: [PATCH net 1/3] lan78xx: PHY DSP registers initialization to address EEE link drop issues with long cables

2018-04-06 Thread David Miller

From: Andrew Lunn 
Date: Fri, 6 Apr 2018 16:43:42 +0200

> On Fri, Apr 06, 2018 at 11:42:02AM +0530, Raghuram Chary J wrote:
>> The patch is to configure DSP registers of PHY device
>> to handle Gbe-EEE failures with >40m cable length.
>> 
>> Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 
>> Ethernet device driver")
>> Signed-off-by: Raghuram Chary J 
>> ---
>>  drivers/net/phy/microchip.c  | 123 
>> ++-
>>  include/linux/microchipphy.h |   8 +++
>>  2 files changed, 130 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/net/phy/microchip.c b/drivers/net/phy/microchip.c
>> index 0f293ef28935..174ae9808722 100644
>> --- a/drivers/net/phy/microchip.c
>> +++ b/drivers/net/phy/microchip.c
>> @@ -20,6 +20,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #define DRIVER_AUTHOR   "WOOJUNG HUH "
>>  #define DRIVER_DESC "Microchip LAN88XX PHY driver"
>> @@ -66,6 +67,107 @@ static int lan88xx_suspend(struct phy_device *phydev)
>>  return 0;
>>  }
>>  
>> +static void lan88xx_TR_reg_set(struct phy_device *phydev, u16 regaddr,
>> +   u32 data)
>> +{
>> +int val;
>> +u16 buf;
>> +
>> +/* Get access to token ring page */
>> +phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
>> +  LAN88XX_EXT_PAGE_ACCESS_TR);
> 
> Hi Raghuram
> 
> You might want to look at phy_read_paged(), phy_write_paged(), etc.
> 
> There can be race conditions with paged access.

Yep, so something like:

static void lan88xx_TR_reg_set(struct phy_device *phydev, u16 regaddr,
			       u32 data)
{
	int save_page, val;
	u16 buf;

	save_page = phy_save_page(phydev);
	phy_write_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR,
			LAN88XX_EXT_PAGE_TR_LOW_DATA, (data & 0xFFFF));
	phy_write_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR,
			LAN88XX_EXT_PAGE_TR_HIGH_DATA,
			(data & 0x00FF0000) >> 16);

	/* Config control bits [15:13] of register */
	buf = (regaddr & ~(0x3 << 13));	/* Clr [14:13] to write data in reg */
	buf |= 0x8000;			/* Set [15] to Packet transmit */

	phy_write_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR,
			LAN88XX_EXT_PAGE_TR_CR, buf);
	usleep_range(1000, 2000);	/* Wait for Data to be written */

	val = phy_read_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR,
			     LAN88XX_EXT_PAGE_TR_CR);
	if (!(val & 0x8000))
		pr_warn("TR Register[0x%X] configuration failed\n", regaddr);

	phy_restore_page(phydev, save_page, 0);
}

Since PHY accesses and thus things like phy_save_page() can fail, the
return type of this function should be changed to 'int' and some error
checking should be added.
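
A hedged illustration of what returning 'int' with error checking could look like, written as a userspace mock rather than against phylib. The struct mock_phy, mock_write() and tr_reg_set() names are hypothetical, the -EIO failure injection exists only for the demonstration, and the data masks assume the low 16 bits go to one register and bits [23:16] to the other.

```c
#include <errno.h>
#include <stdint.h>

/* Mock device: fail_at selects which write (1-based) reports an error */
struct mock_phy {
	int fail_at;
	int calls;
	uint16_t low, high, cr;
};

static int mock_write(struct mock_phy *p, uint16_t *reg, uint16_t val)
{
	if (++p->calls == p->fail_at)
		return -EIO;	/* injected bus failure */
	*reg = val;
	return 0;
}

/* Every access is checked and the first error is returned to the caller */
static int tr_reg_set(struct mock_phy *p, uint16_t regaddr, uint32_t data)
{
	uint16_t buf;
	int err;

	err = mock_write(p, &p->low, data & 0xFFFF);
	if (err)
		return err;

	err = mock_write(p, &p->high, (data & 0x00FF0000) >> 16);
	if (err)
		return err;

	buf = regaddr & ~(0x3 << 13);	/* clear bits [14:13] */
	buf |= 0x8000;			/* set bit 15 */

	return mock_write(p, &p->cr, buf);
}
```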

[net 2/2] ice: Bug fixes in ethtool code

2018-04-06 Thread Jeff Kirsher

From: Anirudh Venkataramanan 

1) Return correct size from ice_get_regs_len.
2) Fix incorrect use of ARRAY_SIZE in ice_get_regs.

Fixes: fcea6f3da546 (ice: Add stats and ethtool support)
Signed-off-by: Anirudh Venkataramanan 
Tested-by: Tony Brelinski 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_ethtool.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c 
b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index 186764a5c263..1db304c01d10 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -156,7 +156,7 @@ ice_get_drvinfo(struct net_device *netdev, struct 
ethtool_drvinfo *drvinfo)
 
 static int ice_get_regs_len(struct net_device __always_unused *netdev)
 {
-   return ARRAY_SIZE(ice_regs_dump_list);
+   return sizeof(ice_regs_dump_list);
 }
 
 static void
@@ -170,7 +170,7 @@ ice_get_regs(struct net_device *netdev, struct ethtool_regs 
*regs, void *p)
 
regs->version = 1;
 
-   for (i = 0; i < ARRAY_SIZE(ice_regs_dump_list) / sizeof(u32); ++i)
+   for (i = 0; i < ARRAY_SIZE(ice_regs_dump_list); ++i)
regs_buf[i] = rd32(hw, ice_regs_dump_list[i]);
 }
 
-- 
2.14.3
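
The distinction the patch above relies on, byte size versus element count, can be shown with a short userspace sketch. The register offsets and demo_rd32() below are placeholders invented for the example, not the contents of ice_regs_dump_list or the driver's rd32().

```c
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Placeholder register offsets, not the real ice register list */
static const uint32_t demo_regs_dump_list[] = {
	0x10, 0x14, 0x18, 0x1c, 0x20
};

/* Stand-in for a register read: returns a value derived from the offset */
static uint32_t demo_rd32(uint32_t reg)
{
	return reg ^ 0xA5A5A5A5u;
}

/* The dump length is reported in bytes, so sizeof() is the right tool */
static int demo_get_regs_len(void)
{
	return (int)sizeof(demo_regs_dump_list);
}

/* The loop walks elements, so ARRAY_SIZE() is the right bound */
static void demo_get_regs(uint32_t *regs_buf)
{
	size_t i;

	for (i = 0; i < ARRAY_SIZE(demo_regs_dump_list); ++i)
		regs_buf[i] = demo_rd32(demo_regs_dump_list[i]);
}
```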

[net 1/2] ice: Fix error return code in ice_init_hw()

2018-04-06 Thread Jeff Kirsher

From: Wei Yongjun 

Fix to return error code ICE_ERR_NO_MEMORY from the alloc error
handling case instead of 0, as done elsewhere in this function.

Fixes: dc49c7723676 ("ice: Get MAC/PHY/link info and scheduler topology")
Signed-off-by: Wei Yongjun 
Acked-by: Anirudh Venkataramanan 
Tested-by: Tony Brelinski 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_common.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_common.c 
b/drivers/net/ethernet/intel/ice/ice_common.c
index 385f5d425d19..21977ec984c4 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -468,8 +468,10 @@ enum ice_status ice_init_hw(struct ice_hw *hw)
mac_buf_len = sizeof(struct ice_aqc_manage_mac_read_resp);
mac_buf = devm_kzalloc(ice_hw_to_dev(hw), mac_buf_len, GFP_KERNEL);
 
-   if (!mac_buf)
+   if (!mac_buf) {
+   status = ICE_ERR_NO_MEMORY;
goto err_unroll_fltr_mgmt_struct;
+   }
 
status = ice_aq_manage_mac_read(hw, mac_buf, mac_buf_len, NULL);
devm_kfree(ice_hw_to_dev(hw), mac_buf);
-- 
2.14.3
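
The class of bug fixed here, a goto-based error path that returns a stale success status, can be sketched in userspace as follows. The demo_* names and the simulate_alloc_failure parameter are assumptions made for the example; this is not the ice_init_hw() code.

```c
#include <stdlib.h>

enum demo_status { DEMO_OK = 0, DEMO_ERR_NO_MEMORY = -1 };

static int demo_unrolled;	/* counts cleanup runs, for demonstration */

static enum demo_status demo_init_hw(size_t buf_len,
				     int simulate_alloc_failure)
{
	enum demo_status status = DEMO_OK;
	void *buf;

	buf = simulate_alloc_failure ? NULL : malloc(buf_len);
	if (!buf) {
		/* without this assignment the error path would return 0 */
		status = DEMO_ERR_NO_MEMORY;
		goto err_unroll;
	}

	free(buf);
	return DEMO_OK;

err_unroll:
	demo_unrolled++;
	return status;
}
```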

[net 0/2][pull request] Intel Wired LAN Driver Updates 2018-04-06

2018-04-06 Thread Jeff Kirsher

This series contains a couple of fixes for the new ice driver.

Wei Yongjun fixes the return error code for error case during init.

Anirudh fixes the incorrect use of ARRAY_SIZE() in the ice ethtool code
and fixes the "for" loop calculations.

The following are changes since commit 3239534a79ee6f20cffd974173a1e62e0730e8ac:
  net/sched: fix NULL dereference in the error path of tcf_bpf_init()
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 100GbE

Anirudh Venkataramanan (1):
  ice: Bug fixes in ethtool code

Wei Yongjun (1):
  ice: Fix error return code in ice_init_hw()

 drivers/net/ethernet/intel/ice/ice_common.c  | 4 +++-
 drivers/net/ethernet/intel/ice/ice_ethtool.c | 4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)

-- 
2.14.3

Re: [net 0/2][pull request] Intel Wired LAN Driver Updates 2018-04-06

2018-04-06 Thread David Miller

From: Jeff Kirsher 
Date: Fri,  6 Apr 2018 08:36:28 -0700

> This series contains a couple of fixes for the new ice driver.
> 
> Wei Yongjun fixes the return error code for error case during init.
> 
> Anirudh fixes the incorrect use of ARRAY_SIZE() in the ice ethtool code
> and fixed "for" loop calculations.

Pulled, thanks Jeff.

Re: [RFC] ethtool: Support for driver private ioctl's

2018-04-06 Thread Jose Abreu

Hi Florian,

On 05-04-2018 16:50, Florian Fainelli wrote:
>
> On 04/05/2018 03:47 AM, Jose Abreu wrote:
>> Hi All,
>>
>> I would like to know your opinion regarding adding support for
>> driver private ioctl's in ethtool.
>>
>> Background: Synopsys Ethernet IP's have a certain number of
>> features which can be reconfigured at runtime. Giving you two
>> examples: One of the most recent one is the safety features,
>> which can be enabled/disabled and forced at runtime. Another one
>> is a Flexible RX Parser which can route specific packets to
>> specific RX DMA channels. Given that these are features specific
>> to our IP's it would not be useful to add an uniform API for this
>> because the users would only be one or two drivers ...
> Parsing of packets and directing the matched packets to specific
> queues/channels can be done through ethtool rxnfc API, tc/cls_flower as
> well, so you should really check whether those APIs don't already allow
> you to do what you want.

Hmm, but in our case this is done directly by HW; we just have to
program a kind of table which will automatically route the
packets. Does this API support this?

>
> ethtool already supports a concept of private  flags, not ioctl() though
> which allows you to toggle boolean values for instance (or technically
> up to how many bits a "flag" is used to represent) is that enough or do
> you need to turn on/off the feature as well as pass configuration
> parameters?

Some of them I can just turn on/off, but the rest need
configuration, and sometimes the configuration is extensive (as
with the RX Parser, where we have to pass the routing table).

>
>> This new feature would change the help usage for ethtool so that
>> each driver private option would be shown, and then each driver
>> specific file would have a structure with all the available
>> options. Finally, each driver would have to handle the private
>> IOCTL's.
>>
>> We already have this working locally and now I would like to know
>> your opinion about upstreaming this ... Do you think this can be
>> useful for anyone else? Or should we change direction to use, for
>> example, debugfs/configfs?
> In general, even if there is only one driver implementing a particular
> feature, the approach chosen is to come up with an API that is as
> generic as possible. Even if there is a single user of that API in tree,
> having something that was thought to be generic is better than allowing
> uncontrolled private ioctl() implementations.

I understand your point of view, but this seems like overkill
for the -net subsystem because it's specific to our IP. Or are you
just suggesting a new ethtool entry, i.e. adding a new #define to
the list, plus -net handling ...?

Thanks and Best Regards,
Jose Miguel Abreu

[PATCH v3] net: thunderx: rework mac addresses list to u64 array

2018-04-06 Thread Vadim Lomovtsev

From: Vadim Lomovtsev 

It is too expensive to pass u64 values via a linked list; instead,
allocate an array for them, sized by the total number of MAC addresses
from the netdev.

This removes multiple kmalloc() calls, avoids memory fragmentation, and
allows a single NULL check on the kmalloc() return value, which
prevents a potential NULL pointer dereference.

Addresses-Coverity-ID: 1467429 ("Dereference null return value")
Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF")
Reported-by: Dan Carpenter 
Signed-off-by: Vadim Lomovtsev 
---
Changes from v1 to v2:
 - C99 syntax: update xcast_addr_list struct field mc[0] -> mc[];
Changes from v2 to v3:
 - update commit description with 'Reported-by: Dan Carpenter';
 - update size calculations for mc list to offsetof() call
   instead of explicit arithmetic;
---
 drivers/net/ethernet/cavium/thunder/nic.h|  7 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 +---
 2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h
index 5fc46c5a4f36..448d1fafc827 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -265,14 +265,9 @@ struct nicvf_drv_stats {
 
 struct cavium_ptp;
 
-struct xcast_addr {
-	struct list_head list;
-	u64              addr;
-};
-
 struct xcast_addr_list {
-	struct list_head list;
 	int              count;
+	u64              mc[];
 };
 
 struct nicvf_work {
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 1e9a31fef729..7d9e58533a83 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
 						  work.work);
 	struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
 	union nic_mbx mbx = {};
-	struct xcast_addr *xaddr, *next;
+	u8 idx = 0;
 
 	if (!vf_work)
 		return;
@@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg)
 	/* check if we have any specific MACs to be added to PF DMAC filter */
 	if (vf_work->mc) {
 		/* now go through kernel list of MACs and add them one by one */
-		list_for_each_entry_safe(xaddr, next,
-					 &vf_work->mc->list, list) {
+		for (idx = 0; idx < vf_work->mc->count; idx++) {
 			mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST;
-			mbx.xcast.data.mac = xaddr->addr;
+			mbx.xcast.data.mac = vf_work->mc->mc[idx];
 			nicvf_send_msg_to_pf(nic, &mbx);
-
-			/* after receiving ACK from PF release memory */
-			list_del(&xaddr->list);
-			kfree(xaddr);
-			vf_work->mc->count--;
 		}
 		kfree(vf_work->mc);
 	}
@@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
 			mode |= BGX_XCAST_MCAST_FILTER;
 			/* here we need to copy mc addrs */
 			if (netdev_mc_count(netdev)) {
-				struct xcast_addr *xaddr;
-
-				mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
-				INIT_LIST_HEAD(&mc_list->list);
+				mc_list = kmalloc(offsetof(typeof(*mc_list),
+							   mc[netdev_mc_count(netdev)]),
+						  GFP_ATOMIC);
+				if (unlikely(!mc_list))
+					return;
+				mc_list->count = 0;
 				netdev_hw_addr_list_for_each(ha, &netdev->mc) {
-					xaddr = kmalloc(sizeof(*xaddr),
-							GFP_ATOMIC);
-					xaddr->addr =
+					mc_list->mc[mc_list->count] =
 						ether_addr_to_u64(ha->addr);
-					list_add_tail(&xaddr->list,
-						      &mc_list->list);
 					mc_list->count++;
 				}
 			}
-- 
2.14.3
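
For readers unfamiliar with the pattern in this patch, here is a userspace sketch of a header-plus-flexible-array allocation: one malloc() sized for the count field and n entries, followed by one NULL check. It rests on several assumptions: xcast_list_alloc() and mac_to_u64() are hypothetical helpers, malloc() stands in for kmalloc(), and the size is written as offsetof() plus n * sizeof() because the offsetof(..., mc[n]) form with a runtime index used in the patch relies on a compiler extension.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace mirror of the reworked structure. */
struct xcast_addr_list {
	int      count;
	uint64_t mc[];		/* C99 flexible array member */
};

/* Pack a 6-byte MAC into a u64, most significant byte first
 * (the same convention as the kernel's ether_addr_to_u64()). */
static uint64_t mac_to_u64(const uint8_t addr[6])
{
	uint64_t u = 0;
	int i;

	for (i = 0; i < 6; i++)
		u = (u << 8) | addr[i];
	return u;
}

/* One allocation sized for the header plus n entries; one NULL check. */
static struct xcast_addr_list *xcast_list_alloc(const uint8_t (*macs)[6], int n)
{
	struct xcast_addr_list *l;
	int i;

	l = malloc(offsetof(struct xcast_addr_list, mc) +
		   (size_t)n * sizeof(l->mc[0]));
	if (!l)
		return NULL;

	l->count = 0;
	for (i = 0; i < n; i++)
		l->mc[l->count++] = mac_to_u64(macs[i]);
	return l;
}
```

The consumer then walks l->mc[0..count-1] and releases everything with a single free(l), which mirrors the single kfree(vf_work->mc) left in the work handler.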

Re: [PATCH] dp83640: Ensure against premature access to PHY registers after reset

2018-04-06 Thread Andrew Lunn

On Fri, Apr 06, 2018 at 04:05:40PM +0200, Esben Haabendal wrote:
> From: Esben Haabendal 
> 
> Signed-off-by: Esben Haabendal 
> ---
>  drivers/net/phy/dp83640.c | 17 +
>  1 file changed, 17 insertions(+)
> 
> diff --git a/drivers/net/phy/dp83640.c b/drivers/net/phy/dp83640.c
> index 654f42d00092..48403170096a 100644
> --- a/drivers/net/phy/dp83640.c
> +++ b/drivers/net/phy/dp83640.c
> @@ -1207,6 +1207,22 @@ static void dp83640_remove(struct phy_device *phydev)
>   kfree(dp83640);
>  }
>  
> +static int dp83640_soft_reset(struct phy_device *phydev)
> +{
> + int ret;
> +
> + ret = genphy_soft_reset(phydev);
> + if (ret < 0)
> + return ret;
> +
> + /* From DP83640 datasheet: "Software driver code must wait 3 us
> +  * following a software reset before allowing further serial MII
> +  * operations with the DP83640." */
> + udelay(3);

Hi Esben

The accuracy of udelay() is not guaranteed. So you probably want to be
a bit pessimistic, and use 10.

  Andrew

Re: [PATCH net 1/3] lan78xx: PHY DSP registers initialization to address EEE link drop issues with long cables

2018-04-06 Thread Andrew Lunn

On Fri, Apr 06, 2018 at 11:42:02AM +0530, Raghuram Chary J wrote:
> The patch is to configure DSP registers of PHY device
> to handle Gbe-EEE failures with >40m cable length.
> 
> Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 Ethernet device driver")
> Signed-off-by: Raghuram Chary J 
> ---
>  drivers/net/phy/microchip.c  | 123 ++-
>  include/linux/microchipphy.h |   8 +++
>  2 files changed, 130 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/microchip.c b/drivers/net/phy/microchip.c
> index 0f293ef28935..174ae9808722 100644
> --- a/drivers/net/phy/microchip.c
> +++ b/drivers/net/phy/microchip.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define DRIVER_AUTHOR"WOOJUNG HUH "
>  #define DRIVER_DESC  "Microchip LAN88XX PHY driver"
> @@ -66,6 +67,107 @@ static int lan88xx_suspend(struct phy_device *phydev)
>   return 0;
>  }
>  
> +static void lan88xx_TR_reg_set(struct phy_device *phydev, u16 regaddr,
> +u32 data)
> +{
> + int val;
> + u16 buf;
> +
> + /* Get access to token ring page */
> + phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
> +   LAN88XX_EXT_PAGE_ACCESS_TR);

Hi Raghuram

You might want to look at phy_read_paged(), phy_write_paged(), etc.

There can be race conditions with paged access.

  Andrew

Re: Enable and configure storm prevention in a network device

2018-04-06 Thread Andrew Lunn

On Thu, Apr 05, 2018 at 03:35:06PM -0700, Florian Fainelli wrote:
> On 04/05/2018 01:20 PM, David Miller wrote:
> > From: Murali Karicheri 
> > Date: Thu, 5 Apr 2018 16:14:49 -0400
> > 
> >> Is there a standard way to implement and configure storm prevention
> >> in a Linux network device?
> > 
> > What kind of "storm", an interrupt storm?
> > 
> 
> I would assume Murali is referring to L2 broadcast storms which is
> common in switches. There is not an API for that AFAICT and I am not
> sure what a proper API would look like.

tc?

The Marvell switches have leaky buckets, which can be used for
limiting broadcast and multicast packets, as well as traffic shaping
in general. Storm prevention is just a form of traffic shaping, so if
we have generic traffic shaping, it can be used for storm prevention.

   Andrew
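
To make the "storm prevention is traffic shaping" point concrete, here is a software sketch of a rate limiter for broadcast/multicast packets. It is a token bucket, a close relative of the leaky buckets mentioned above, and is not how any particular switch implements it in hardware. All names (struct bucket, bucket_init(), bucket_allow()) are hypothetical, and the arithmetic assumes the gap between packets is small enough that the multiplication does not overflow 64 bits.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical policer state; names are illustrative only. */
struct bucket {
	uint64_t rate_pps;	/* tokens (packets) added per second */
	uint64_t burst;		/* bucket depth in packets */
	uint64_t tokens;	/* tokens currently available */
	uint64_t last_ns;	/* time accounted for by past refills */
};

static void bucket_init(struct bucket *b, uint64_t rate_pps,
			uint64_t burst, uint64_t now_ns)
{
	b->rate_pps = rate_pps;
	b->burst = burst;
	b->tokens = burst;	/* start with a full bucket */
	b->last_ns = now_ns;
}

/* Returns true if a packet arriving at now_ns may be forwarded,
 * false if it exceeds the configured rate and should be dropped. */
static bool bucket_allow(struct bucket *b, uint64_t now_ns)
{
	uint64_t add = (now_ns - b->last_ns) * b->rate_pps / NSEC_PER_SEC;

	if (add > 0) {
		uint64_t total = b->tokens + add;

		b->tokens = total > b->burst ? b->burst : total;
		/* advance only by the time that produced whole tokens */
		b->last_ns += add * NSEC_PER_SEC / b->rate_pps;
	}
	if (b->tokens == 0)
		return false;
	b->tokens--;
	return true;
}
```

With a rate of 1000 packets per second and a burst of 2, two back-to-back packets pass, a third at the same instant is dropped, and one more passes a millisecond later.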

Re: [RFC] ethtool: Support for driver private ioctl's

2018-04-06 Thread Jose Abreu

Hi Michal,

On 06-04-2018 10:07, Michal Kubecek wrote:
> On Thu, Apr 05, 2018 at 08:50:49AM -0700, Florian Fainelli wrote:
>> On 04/05/2018 03:47 AM, Jose Abreu wrote:
>>> Background: Synopsys Ethernet IP's have a certain number of
>>> features which can be reconfigured at runtime. Giving you two
>>> examples: One of the most recent one is the safety features,
>>> which can be enabled/disabled and forced at runtime. Another one
>>> is a Flexible RX Parser which can route specific packets to
>>> specific RX DMA channels. Given that these are features specific
>>> to our IP's it would not be useful to add an uniform API for this
>>> because the users would only be one or two drivers ...
>> Parsing of packets and directing the matched packets to specific
>> queues/channels can be done through ethtool rxnfc API, tc/cls_flower as
>> well, so you should really check whether those APIs don't already allow
>> you to do what you want.
>>
>> ethtool already supports a concept of private  flags, not ioctl() though
>> which allows you to toggle boolean values for instance (or technically
>> up to how many bits a "flag" is used to represent) is that enough or do
>> you need to turn on/off the feature as well as pass configuration
>> parameters?
> Perhaps introducing "driver/device specific tunables" (i.e. something
> like tunables or PHY tunables but specific to a particular device) could
> be a way. But it could get out of control quickly and users wouldn't be
> happy if they had to set the same (or almost the same) parameter under
> five different names for five NIC vendors.

Yeah, that wouldn't be good, but I think this should be the
developer's responsibility: to check whether there is an existing
API/ethtool entry before implementing the "tunable". A big
concern, for me at least, is that ethtool already has a lot
of options and introducing even more would confuse
users ...

Thanks and Best Regards,
Jose Miguel Abreu

>
> Michal Kubecek
