Re: [patch net-next 0/2] mlxsw: Update firmware version

2017-11-14 Thread Ido Schimmel
On Tue, Nov 14, 2017 at 05:59:36PM -0700, David Ahern wrote:
> Can you wait until said firmware version has been pushed into
> linux-firmware *before* sending these patches.

OK.

> Considering the driver refuses to work without the right firmware
> version this has a negative impact on users.

I'm aware of the consequences.


Re: [PATCH 1/2] bpf: add a bpf_override_function helper

2017-11-14 Thread Ingo Molnar

* Josef Bacik  wrote:

> > > Then 'not crashing kernel' requirement will be preserved.
> > > btrfs or whatever else we will be testing with override_return
> > > will be functioning in 'stress test' mode and if bpf program
> > > is not careful and returns error all the time then one particular
> > > subsystem (like btrfs) will not be functional, but the kernel
> > > will not be crashing.
> > > Thoughts?
> > 
> > Yeah, that approach sounds much better to me: it should fundamentally
> > be opt-in, and it should be documented that it should not be possible
> > to crash the kernel via changing the return value.
> > 
> > I'd make it a bit clearer in the naming what the purpose of the
> > annotation is: for example, would BPF_ALLOW_ERROR_INJECTION() work for
> > you guys? I.e. I think it should generally be used to change actual
> > integer error values - or at most user pointers, but not kernel
> > pointers. Not enforced in a type-safe manner, but the naming should
> > give enough hints?
> > 
> > Such return-injection BPF programs can still totally confuse user-space,
> > obviously: for example returning an IO error could corrupt application
> > data - but that's the nature of such facilities, and similar results
> > could already be achieved via ptrace as well. But the result of a BPF
> > program should never be _worse_ than ptrace, in terms of kernel
> > integrity.
> > 
> > Note that with such a safety mechanism in place, no kernel message has
> > to be generated either, I suspect.
> > 
> > In any case, my NAK would be lifted with such an approach.
> 
> I'm going to want to annotate kmalloc, so it's still going to be possible
> to make things go horribly wrong. Is this still going to be OK with you?
> Obviously I want to use this for btrfs, but really what I used this for
> originally was an NBD problem where I had to do special handling for
> getting EINTR back from kernel_sendmsg, which was a pain to trigger
> properly without this patch. Opt-in is going to make it so we're just
> flagging important function calls anyway, because those are the ones
> that fail rarely and that we want to test, which puts us back in the
> same situation you are worried about, so it doesn't make much sense to
> me to do it this way. Thanks,

I suppose - let's see how it goes? The important factor is the opt-in
aspect, I believe.

Technically the kernel should never crash on a kmalloc() failure either,
although obviously things can go horribly wrong from user-space's
perspective.

Thanks,

Ingo
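
For reference, a minimal sketch of what such an opt-in annotation could
look like - the section name and macro body here are assumptions for
illustration, not the API that was eventually merged:

/*
 * Sketch only: record the address of an explicitly annotated function in
 * a dedicated section, so the kernel can refuse to override the return
 * value of anything that is not on this allow-list.
 */
#define BPF_ALLOW_ERROR_INJECTION(fname)				\
	static unsigned long __attribute__((used,			\
		section("_error_injection_whitelist")))			\
	_eil_addr_##fname = (unsigned long)&fname;

Only annotated functions - error-returning ones like the kernel_sendmsg
case Josef mentions - would then be eligible for return override.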


Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-14 Thread Eric W. Biederman
Kirill Tkhai  writes:

> Currently a mutex is used to protect the pernet operations list. It
> makes cleanup_net() execute the ->exit methods of the same operations
> set that was used at the time of ->init, even after the net namespace
> is unlinked from net_namespace_list.
>
> But the problem is that we need to synchronize_rcu() after the net is
> removed from net_namespace_list():
>
> Destroy net_ns:
> cleanup_net()
>   mutex_lock(&net_mutex)
>   list_del_rcu(&net->list)
>   synchronize_rcu()  <--- Sleep there for ages
>   list_for_each_entry_reverse(ops, &pernet_list, list)
> ops_exit_list(ops, &net_exit_list)
>   list_for_each_entry_reverse(ops, &pernet_list, list)
> ops_free_list(ops, &net_exit_list)
>   mutex_unlock(&net_mutex)
>
> This primitive is not fast, especially on systems with many processors
> and/or when preemptible RCU is enabled in the config. So, for the whole
> time cleanup_net() is waiting for an RCU grace period, creation of new
> net namespaces is not possible; the tasks performing it sleep on the
> same mutex:
>
> Create net_ns:
> copy_net_ns()
>   mutex_lock_killable(&net_mutex)<--- Sleep there for ages
>
> The solution is to convert net_mutex to an rw_semaphore. Then the
> pernet_operations::init/::exit methods, modifying the net-related data,
> will require down_read() locking only, while down_write() will be used
> for changing pernet_list.
>
> This gives a significant performance increase, as you may see below.
> Measured is sequential net namespace creation in a cycle, in a single
> thread, without other tasks (single user mode):
>
> 1)int main(int argc, char *argv[])
> {
> unsigned nr;
> if (argc < 2) {
> fprintf(stderr, "Provide nr iterations arg\n");
> return 1;
> }
> nr = atoi(argv[1]);
> while (nr-- > 0) {
> if (unshare(CLONE_NEWNET)) {
> perror("Can't unshare");
> return 1;
> }
> }
> return 0;
> }
>
> Origin, 10 unshare():
> 0.03user 23.14system 1:39.85elapsed 23%CPU
>
> Patched, 10 unshare():
> 0.03user 67.49system 1:08.34elapsed 98%CPU
>
> 2)for i in {1..1}; do unshare -n bash -c exit; done
>
> Origin:
> real 1m24,190s
> user 0m6,225s
> sys 0m15,132s
>
> Patched:
> real 0m18,235s   (4.6 times faster)
> user 0m4,544s
> sys 0m13,796s
>
> This patch requires commit 76f8507f7a64 "locking/rwsem: Add
> down_read_killable()" from Linus' tree (not in net-next yet).

Using a rwsem to protect the list of operations makes sense.

That should allow removing the single threaded workqueue.

I am not wild about taking the rwsem down_write in
rtnl_link_unregister and net_ns_barrier.  I think that works, but it
goes from being a mild hack to being a pretty bad hack, and it is
something else that can kill the parallelism you are seeking to add.

There are about 204 instances of struct pernet_operations.  That is a
lot of code to have carefully audited to ensure it can all run in
parallel at once.  The existence of the exit_batch method,
net_ns_barrier, for_each_net and the taking of net_mutex in
rtnl_link_unregister all testify to the fact that there are data
structures accessed by multiple network namespaces.

My preference would be to:

- Add the net_sem in addition to net_mutex with down_write only held in
  register and unregister, and maybe net_ns_barrier and
  rtnl_link_unregister.

- Factor out struct pernet_ops out of struct pernet_operations.  With
  struct pernet_ops not having the exit_batch method.  With pernet_ops
  being embedded as an anonymous member of the old struct pernet_operations.

- Add [un]register_pernet_{sys,dev} functions that take a struct
  pernet_ops, that don't take net_mutex.  Have them order the
  pernet_list as:

  pernet_sys
  pernet_subsys
  pernet_device
  pernet_dev

  With the chunk in the middle taking the net_mutex.

  I think I would enforce the ordering with a failure to register
  if a subsystem or a device tried to register out of order.  

- Disable use of the single threaded workqueue if nothing needs the
  net_mutex.

- Add a test mode that deliberately spawns threads on multiple
  processors and deliberately creates multiple network namespaces
  at the same time.

- Add a test mode that deliberately spawns threads on multiple
  processors and deliberately destroys multiple network namespaces
  at the same time.
  
- Convert the code to unlocked operation, one pernet_operations at a
  time.  Be careful with the loopback device: its order in the list
  strongly matters.

- Finally remove the unnecessary code.


At the end of the day, because all of the operations for one network
namespace will run in parallel with all of the operations for another
network namespace, all of the sophistication that goes into batching the
cleanup of multiple network namespaces can be removed, as different
tasks (not sharing a lock) can wait in synchronize_rcu at the same time
without blocking each other.
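
For reference, a minimal sketch of the locking split under discussion,
assuming a net_sem rwsem as in Kirill's patch (simplified, not the actual
net/core/net_namespace.c code):

static DECLARE_RWSEM(net_sem);	/* proposed replacement for net_mutex */

/* Namespace setup/teardown take the rwsem shared, so the pernet
 * ->init/->exit methods of many namespaces can run in parallel: */
static void cleanup_net_sketch(void)
{
	down_read(&net_sem);
	/* ... ops_exit_list()/ops_free_list() for this namespace ... */
	up_read(&net_sem);
}

/* Only modifications of pernet_list itself take the rwsem exclusive: */
static void register_pernet_sketch(struct pernet_operations *ops)
{
	down_write(&net_sem);
	/* ... list_add_tail(&ops->list, &pernet_list); ... */
	up_write(&net_sem);
}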


Re: [regression, 4.14] xfrm: Fix stack-out-of-bounds read in xfrm_state_find breaks selinux-testsuite

2017-11-14 Thread Steffen Klassert
On Tue, Nov 14, 2017 at 03:46:30PM -0500, Stephen Smalley wrote:
> Hi,
> 
> 4.14 is failing the selinux-testsuite labeled IPSEC tests despite
> having just been fixed in commit cf37966751747727 ("xfrm: do
> unconditional template resolution before pcpu cache check").  The
> breaking commit is the very next one, commit c9f3f813d462c72d ("xfrm:
> Fix stack-out-of-bounds read in xfrm_state_find.").  Unlike the earlier
> breakage, which caused use of the wrong SA, this one leads to a failure
> on connect(). Running ip xfrm monitor during one of the failing tests
> shows the following:
> acquire proto ah 
>   sel src 127.0.0.1/32 dst 127.0.0.1/32 proto tcp sport 0 dport 65535
> dev lo 
>   policy src 127.0.0.1/32 dst 127.0.0.1/32 proto tcp 
> security context
> unconfined_u:unconfined_r:test_inet_client_t:s0-s0:c0.c1023 
> dir out priority 0 ptype main 
> tmpl src 0.0.0.0 dst 0.0.0.0
> proto ah reqid 0 mode transport

Yes, I see. This is because there are wildcard src and dst
addresses on the template. I'll revert this one for now.

I am slowly starting to think that the concept of having a socket
policy on an IPv6 socket that maps to IPv4 is fundamentally
broken. The bug I tried to fix here is not the first one
reported by syzkaller for this scenario, and I
fear it will not be the last.

Thanks for reporting this!


RE: [PATCH net-next] net: assign err to 0 at begin in do_setlink() function

2017-11-14 Thread Yuan, Linyu (NSB - CN/Shanghai)
Thanks, I will provide a new patch

> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of David Miller
> Sent: Wednesday, November 15, 2017 1:20 PM
> To: Yuan, Linyu (NSB - CN/Shanghai)
> Cc: cug...@163.com; netdev@vger.kernel.org
> Subject: Re: [PATCH net-next] net: assign err to 0 at begin in do_setlink()
> function
> 
> From: "Yuan, Linyu (NSB - CN/Shanghai)" 
> Date: Wed, 15 Nov 2017 05:15:35 +
> 
> >
> >
> >> -Original Message-
> >> From: netdev-ow...@vger.kernel.org
> [mailto:netdev-ow...@vger.kernel.org]
> >> On Behalf Of David Miller
> >> Sent: Wednesday, November 15, 2017 1:08 PM
> >> To: cug...@163.com
> >> Cc: netdev@vger.kernel.org; Yuan, Linyu (NSB - CN/Shanghai)
> >> Subject: Re: [PATCH net-next] net: assign err to 0 at begin in do_setlink()
> >> function
> >>
> >> From: yuan linyu 
> >> Date: Tue, 14 Nov 2017 22:30:59 +0800
> >>
> >> > From: yuan linyu 
> >> >
> >> > each netlink attribute has proper handling when an error happens;
> >> > exiting attribute processing implies there was no error,
> >> > so the trailing err = 0; is useless.
> >> >
> >> > assign err = 0; at the beginning in case no attributes are set.
> >> >
> >> > Signed-off-by: yuan linyu 
> >>
> >> This change is not correct.
> >>
> >> The IFLA_VF_PORTS code block can finish with err set non-zero.
> >
> > Do you mean this block can return err = -EOPNOTSUPP?
> > That means nla_for_each_nested() in this block is never entered, right?
> 
> Yes, that is what I mean.


Re: [PATCH net-next] net: assign err to 0 at begin in do_setlink() function

2017-11-14 Thread David Miller
From: "Yuan, Linyu (NSB - CN/Shanghai)" 
Date: Wed, 15 Nov 2017 05:15:35 +

> 
> 
>> -Original Message-
>> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
>> On Behalf Of David Miller
>> Sent: Wednesday, November 15, 2017 1:08 PM
>> To: cug...@163.com
>> Cc: netdev@vger.kernel.org; Yuan, Linyu (NSB - CN/Shanghai)
>> Subject: Re: [PATCH net-next] net: assign err to 0 at begin in do_setlink()
>> function
>> 
>> From: yuan linyu 
>> Date: Tue, 14 Nov 2017 22:30:59 +0800
>> 
>> > From: yuan linyu 
>> >
>> > each netlink attribute has proper handling when an error happens;
>> > exiting attribute processing implies there was no error,
>> > so the trailing err = 0; is useless.
>> >
>> > assign err = 0; at the beginning in case no attributes are set.
>> >
>> > Signed-off-by: yuan linyu 
>> 
>> This change is not correct.
>> 
>> The IFLA_VF_PORTS code block can finish with err set non-zero.
> 
> Do you mean this block can return err = -EOPNOTSUPP?
> That means nla_for_each_nested() in this block is never entered, right?

Yes, that is what I mean.


Re: [PATCH] openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start

2017-11-14 Thread David Miller
From: "Gustavo A. R. Silva" 
Date: Tue, 14 Nov 2017 14:26:16 -0600

> It seems that the intention of the code is to null check the value
> returned by function genlmsg_put. But the current code is null
> checking the address of the pointer that holds the value returned
> by genlmsg_put.
> 
> Fix this by properly null checking the value returned by function
> genlmsg_put in order to avoid a pontential null pointer dereference.
> 
> Addresses-Coverity-ID: 1461561 ("Dereference before null check")
> Addresses-Coverity-ID: 1461562 ("Dereference null return value")
> Fixes: 96fbc13d7e77 ("openvswitch: Add meter infrastructure")
> Signed-off-by: Gustavo A. R. Silva 

Applied.


Re: [PATCH net-next v2 0/2] netem: fix compilation on 32 bit

2017-11-14 Thread David Miller
From: Stephen Hemminger 
Date: Tue, 14 Nov 2017 11:27:00 -0800

> A couple of places where 64 bit CPU was being assumed incorrectly.

Series applied, thanks.


RE: [PATCH net-next] net: assign err to 0 at begin in do_setlink() function

2017-11-14 Thread Yuan, Linyu (NSB - CN/Shanghai)


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org]
> On Behalf Of David Miller
> Sent: Wednesday, November 15, 2017 1:08 PM
> To: cug...@163.com
> Cc: netdev@vger.kernel.org; Yuan, Linyu (NSB - CN/Shanghai)
> Subject: Re: [PATCH net-next] net: assign err to 0 at begin in do_setlink()
> function
> 
> From: yuan linyu 
> Date: Tue, 14 Nov 2017 22:30:59 +0800
> 
> > From: yuan linyu 
> >
> > each netlink attribute has proper handling when an error happens;
> > exiting attribute processing implies there was no error,
> > so the trailing err = 0; is useless.
> >
> > assign err = 0; at the beginning in case no attributes are set.
> >
> > Signed-off-by: yuan linyu 
> 
> This change is not correct.
> 
> The IFLA_VF_PORTS code block can finish with err set non-zero.

Do you mean this block can return err = -EOPNOTSUPP?
That means nla_for_each_nested() in this block is never entered, right?

> The assignment to zero there is really necessary.


Re: [PATCH] net: Protect iterations over net::fib_notifier_ops in fib_seq_sum()

2017-11-14 Thread David Miller
From: Kirill Tkhai 
Date: Tue, 14 Nov 2017 16:51:56 +0300

> There is at least one unlocked deletion of net->ipv4.fib_notifier_ops
> from net::fib_notifier_ops:
> 
> ip_fib_net_exit()
>   rtnl_unlock()
>   fib4_notifier_exit()
> fib_notifier_ops_unregister(net->ipv4.notifier_ops)
>   list_del_rcu(&ops->list)
> 
> So fib_seq_sum() can't use rtnl_lock() only for protection.
> 
> The possible solution could be to use rtnl_lock()
> in fib_notifier_ops_unregister(), but this adds
> a possible delay during net namespace creation,
> so we better use rcu_read_lock() till someone
> really needs the mutex (if that happens).
> 
> Signed-off-by: Kirill Tkhai 

Applied.
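
For reference, a sketch of the resulting iteration pattern described
above (simplified; field and callback names follow net/core/fib_notifier.c,
but the real function also holds rtnl and takes module references):

static int fib_seq_sum_sketch(struct net *net)
{
	struct fib_notifier_ops *ops;
	int fib_seq = 0;

	rcu_read_lock();	/* guards against the unlocked list_del_rcu() */
	list_for_each_entry_rcu(ops, &net->fib_notifier_ops, list)
		fib_seq += ops->fib_seq_read(net);
	rcu_read_unlock();

	return fib_seq;
}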


Re: [PATCH net-next v2] tcp: Namespace-ify sysctl_tcp_default_congestion_control

2017-11-14 Thread David Miller
From: Stephen Hemminger 
Date: Tue, 14 Nov 2017 08:25:49 -0800

> Make the default TCP congestion control a per-namespace value. This
> changes the default congestion control to a pointer to congestion ops
> (rather than implicitly the first element of the available list).
> 
> The congestion control setting of new namespaces is inherited
> from the current setting of the root namespace.
> 
> Signed-off-by: Stephen Hemminger 
> ---
> v2 - drop restriction that modules are only loaded in initial name space.

Applied.
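
In rough terms the change boils down to new namespaces carrying their own
default instead of consulting a global - a sketch under assumed field
names (the real patch also handles module reference counting):

static int tcp_sk_init_sketch(struct net *net)
{
	/* Inherit the current setting of the root namespace. */
	net->ipv4.tcp_congestion_control =
		init_net.ipv4.tcp_congestion_control;
	return 0;
}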


Re: [PATCH net-next] net: assign err to 0 at begin in do_setlink() function

2017-11-14 Thread David Miller
From: yuan linyu 
Date: Tue, 14 Nov 2017 22:30:59 +0800

> From: yuan linyu 
> 
> each netlink attribute has proper handling when an error happens;
> exiting attribute processing implies there was no error,
> so the trailing err = 0; is useless.
> 
> assign err = 0; at the beginning in case no attributes are set.
> 
> Signed-off-by: yuan linyu 

This change is not correct.

The IFLA_VF_PORTS code block can finish with err set non-zero.
The assignment to zero there is really necessary.
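
The shape of the block in question (simplified from net/core/rtnetlink.c)
shows why: when the nested loop body never runs, err would otherwise keep
the -EOPNOTSUPP assigned before it.

if (tb[IFLA_VF_PORTS]) {
	struct nlattr *attr;
	int rem;

	err = -EOPNOTSUPP;
	if (!ops->ndo_set_vf_port)
		goto errout;

	nla_for_each_nested(attr, tb[IFLA_VF_PORTS], rem) {
		/* ... err = ops->ndo_set_vf_port(dev, vf, port); ... */
		if (err < 0)
			goto errout;
	}
	err = 0;	/* needed: the loop may not have run at all */
}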


[PATCH net-next] tcp: highest_sack fix

2017-11-14 Thread Eric Dumazet
From: Eric Dumazet 

syzbot easily found a regression added in our latest patches [1]

No longer set tp->highest_sack to the head of the send queue since
this is not logical and error prone.

Only sack processing should maintain the pointer to an skb from rtx queue.

We might in the future only remember the sequence instead of a pointer to skb,
since rb-tree should allow a fast lookup.

[1]
BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1706 [inline]
BUG: KASAN: use-after-free in tcp_ack+0x42bb/0x4fd0 net/ipv4/tcp_input.c:3537
Read of size 4 at addr 8801c154faa8 by task syz-executor4/12860

CPU: 0 PID: 12860 Comm: syz-executor4 Not tainted 4.14.0-next-20171113+ #41
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x257 lib/dump_stack.c:53
 print_address_description+0x73/0x250 mm/kasan/report.c:252
 kasan_report_error mm/kasan/report.c:351 [inline]
 kasan_report+0x25b/0x340 mm/kasan/report.c:409
 __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
 tcp_highest_sack_seq include/net/tcp.h:1706 [inline]
 tcp_ack+0x42bb/0x4fd0 net/ipv4/tcp_input.c:3537
 tcp_rcv_established+0x672/0x18a0 net/ipv4/tcp_input.c:5439
 tcp_v4_do_rcv+0x2ab/0x7d0 net/ipv4/tcp_ipv4.c:1468
 sk_backlog_rcv include/net/sock.h:909 [inline]
 __release_sock+0x124/0x360 net/core/sock.c:2264
 release_sock+0xa4/0x2a0 net/core/sock.c:2778
 tcp_sendmsg+0x3a/0x50 net/ipv4/tcp.c:1462
 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
 sock_sendmsg_nosec net/socket.c:632 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:642
 ___sys_sendmsg+0x75b/0x8a0 net/socket.c:2048
 __sys_sendmsg+0xe5/0x210 net/socket.c:2082
 SYSC_sendmsg net/socket.c:2093 [inline]
 SyS_sendmsg+0x2d/0x50 net/socket.c:2089
 entry_SYSCALL_64_fastpath+0x1f/0x96
RIP: 0033:0x452879
RSP: 002b:7fc9761bfbe8 EFLAGS: 0212 ORIG_RAX: 002e
RAX: ffda RBX: 00758020 RCX: 00452879
RDX:  RSI: 20917fc8 RDI: 0015
RBP: 0086 R08:  R09: 
R10:  R11: 0212 R12: 006ee3a0
R13:  R14: 7fc9761c06d4 R15: 

Allocated by task 12860:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
 kmem_cache_alloc_node+0x144/0x760 mm/slab.c:3638
 __alloc_skb+0xf1/0x780 net/core/skbuff.c:193
 alloc_skb_fclone include/linux/skbuff.h:1023 [inline]
 sk_stream_alloc_skb+0x11d/0x900 net/ipv4/tcp.c:870
 tcp_sendmsg_locked+0x1341/0x3b80 net/ipv4/tcp.c:1299
 tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1461
 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
 sock_sendmsg_nosec net/socket.c:632 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:642
 SYSC_sendto+0x358/0x5a0 net/socket.c:1749
 SyS_sendto+0x40/0x50 net/socket.c:1717
 entry_SYSCALL_64_fastpath+0x1f/0x96

Freed by task 12860:
 save_stack+0x43/0xd0 mm/kasan/kasan.c:447
 set_track mm/kasan/kasan.c:459 [inline]
 kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
 __cache_free mm/slab.c:3492 [inline]
 kmem_cache_free+0x77/0x280 mm/slab.c:3750
 kfree_skbmem+0xdd/0x1d0 net/core/skbuff.c:603
 __kfree_skb+0x1d/0x20 net/core/skbuff.c:642
 sk_wmem_free_skb include/net/sock.h:1419 [inline]
 tcp_rtx_queue_unlink_and_free include/net/tcp.h:1682 [inline]
 tcp_clean_rtx_queue net/ipv4/tcp_input.c:3111 [inline]
 tcp_ack+0x1b17/0x4fd0 net/ipv4/tcp_input.c:3593
 tcp_rcv_established+0x672/0x18a0 net/ipv4/tcp_input.c:5439
 tcp_v4_do_rcv+0x2ab/0x7d0 net/ipv4/tcp_ipv4.c:1468
 sk_backlog_rcv include/net/sock.h:909 [inline]
 __release_sock+0x124/0x360 net/core/sock.c:2264
 release_sock+0xa4/0x2a0 net/core/sock.c:2778
 tcp_sendmsg+0x3a/0x50 net/ipv4/tcp.c:1462
 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
 sock_sendmsg_nosec net/socket.c:632 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:642
 ___sys_sendmsg+0x75b/0x8a0 net/socket.c:2048
 __sys_sendmsg+0xe5/0x210 net/socket.c:2082
 SYSC_sendmsg net/socket.c:2093 [inline]
 SyS_sendmsg+0x2d/0x50 net/socket.c:2089
 entry_SYSCALL_64_fastpath+0x1f/0x96

The buggy address belongs to the object at 8801c154fa80
 which belongs to the cache skbuff_fclone_cache of size 456
The buggy address is located 40 bytes inside of
 456-byte region [8801c154fa80, 8801c154fc48)
The buggy address belongs to the page:
page:ea00070553c0 count:1 mapcount:0 mapping:8801c154f080 index:0x0
flags: 0x2fffc000100(slab)
raw: 02fffc000100 8801c154f080  00010006
raw: ea00070a5a20 ea0006a18360 8801d9ca0500 
page dumped because: kasan: bad access detected

Fixes: 737ff314563c ("tcp: use sequence distance to detect reordering")
Signed-off-by: Eric Dumazet 
Cc: Yuchung Cheng 
Reported-by: syzbot 

linux-next: build warning after merge of the net-next tree

2017-11-14 Thread Stephen Rothwell
Hi all,

After merging the net-next tree, today's linux-next build (powerpc
allyesconfig) produced this warning:

In file included from drivers/net/ethernet/ibm/ibmvnic.c:52:0:
drivers/net/ethernet/ibm/ibmvnic.c: In function 'ibmvnic_open':
include/linux/dma-mapping.h:571:2: warning: 'dma_addr' may be used 
uninitialized in this function [-Wmaybe-uninitialized]
  debug_dma_mapping_error(dev, dma_addr);
  ^
drivers/net/ethernet/ibm/ibmvnic.c:852:13: note: 'dma_addr' was declared here
  dma_addr_t dma_addr;
 ^

Introduced by commit

  4e6759be28e4 ("ibmvnic: Feature implementation of Vital Product Data (VPD) 
for the ibmvnic driver")

-- 
Cheers,
Stephen Rothwell


Re: [PATCH] ibmveth: Kernel crash LSO offload flag toggle

2017-11-14 Thread Benjamin Herrenschmidt
On Wed, 2017-11-15 at 13:47 +1100, Daniel Axtens wrote:
> Hi Bryant,
> 
> This looks a bit better, but...
> 
> > The following patch ensures that the bounce_buffer is not null
> > prior to using it within skb_copy_from_linear_data.
> 
> How would this occur?
> 
> Looking at ibmveth.c, I see bounce_buffer being freed in ibmveth_close()
> and allocated in ibmveth_open() only. If allocation fails, the whole
> opening of the device fails with -ENOMEM.
> 
> It seems your test case - changing TSO - causes ibmveth_set_tso() to
> cause an adaptor restart - an ibmveth_close(dev) and then an
> ibmveth_open(dev). Is this happening in parallel with an out of memory
> condition - is the memory allocation failing?
> 
> Alternatively, could it be the case that you're closing the device while
> packets are in flight, and then trying to use a bounce_buffer that's
> been freed elsewhere? Do you need to decouple memory freeing from
> ibmveth_close?

Hrm, you should at least stop the tx queue and NAPI (and synchronize)
while doing a reset. A lot of drivers, rather than doing close/open
(which does subtly different things) tend to instead fire a work queue
(often called reset_task) which does the job (and uses the same lower
level helpers as open/close to free/allocate the rings etc...).
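
A sketch of that reset_task pattern (the work item and some adapter field
names here are assumptions for illustration; ibmveth does not currently
have such a handler):

static void ibmveth_reset_task_sketch(struct work_struct *work)
{
	struct ibmveth_adapter *adapter =
		container_of(work, struct ibmveth_adapter, reset_task);
	struct net_device *netdev = adapter->netdev;

	netif_tx_disable(netdev);	/* stop and synchronize the tx queue */
	napi_disable(&adapter->napi);	/* stop and synchronize NAPI */

	/* ... free and reallocate rings/buffers using the same lower
	 * level helpers that open/close use ... */

	napi_enable(&adapter->napi);
	netif_tx_wake_all_queues(netdev);
}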


> > The problem can be recreated toggling on/off Large send offload.
> > 
> > The following script, when run (along with some iperf traffic),
> > recreates the crash within 5-10 mins or so.
> > 
> > while true
> > do
> > ethtool -k ibmveth0 | grep tcp-segmentation-offload
> > ethtool -K ibmveth0 tso off
> > ethtool -k ibmveth0 | grep tcp-segmentation-offload
> > ethtool -K ibmveth0 tso on
> > done
> > 
> > Note: This issue happens the very first time large send offload is
> > turned off too (but the above script recreates the issue every time)
> > 
> > [76563.914173] Unable to handle kernel paging request for data at address 
> > 0x
> > [76563.914197] Faulting instruction address: 0xc0063940
> > [76563.914205] Oops: Kernel access of bad area, sig: 11 [#1]
> > [76563.914210] SMP NR_CPUS=2048 NUMA pSeries
> > [76563.914217] Modules linked in: rpadlpar_io rpaphp dccp_diag dccp 
> > tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag nls_utf8 
> > isofs binfmt_misc pseries_rng rtc_generic autofs4 ibmvfc scsi_transport_fc 
> > ibmvscsi ibmveth
> > [76563.914251] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.0-34-generic 
> > #53-Ubuntu
> 
> ^--- yikes!
> 
> There are relevant changes to this area since 4.4:
> 2c42bf4b4317 ("ibmveth: check return of skb_linearize in ibmveth_start_xmit")
> 66aa0678efc2 ("ibmveth: Support to enable LSO/CSO for Trunk VEA.")
> 
> Does this crash occurs on a more recent kernel?
> 
> > diff --git a/drivers/net/ethernet/ibm/ibmveth.c 
> > b/drivers/net/ethernet/ibm/ibmveth.c
> > index f210398200ece..1d29b1649118d 100644
> > --- a/drivers/net/ethernet/ibm/ibmveth.c
> > +++ b/drivers/net/ethernet/ibm/ibmveth.c
> > @@ -1092,8 +1092,14 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff 
> > *skb,
> >  */
> > if (force_bounce || (!skb_is_nonlinear(skb) &&
> > (skb->len < tx_copybreak))) {
> > -   skb_copy_from_linear_data(skb, adapter->bounce_buffer,
> > - skb->len);
> > +   if (adapter->bounce_buffer) {
> > +   skb_copy_from_linear_data(skb, adapter->bounce_buffer,
> > + skb->len);
> > +   } else {
> > +   adapter->tx_send_failed++;
> > +   netdev->stats.tx_dropped++;
> > +   goto out;
> 
> Finally, as I alluded to at the top of this message, isn't the
> disappearance of the bounce-buffer a pretty serious issue? As I
> understand it, it means your data structures are now inconsistent. Do
> you need to - at the least - be more chatty here?
> 
> Regards,
> Daniel
> > +   }
> >  
> > descs[0].fields.flags_len = desc_flags | skb->len;
> > descs[0].fields.address = adapter->bounce_buffer_dma;
> > -- 
> > 2.13.6 (Apple Git-96)


Re: SRIOV switchdev mode BoF minutes

2017-11-14 Thread Jakub Kicinski
On Tue, 14 Nov 2017 19:04:36 -0800, Alexander Duyck wrote:
> On Tue, Nov 14, 2017 at 3:36 PM, Jakub Kicinski
>  wrote:
> > On Tue, 14 Nov 2017 15:05:08 -0800, Alexander Duyck wrote:  
> >> >> We basically need to do some feasibility research to see if we can
> >> >> actually meet all the requirements for switchdev on i40e. We have been
> >> >> getting mixed messages where we are given a great many "yes, but" type
> >> >> answers. For i40e we are looking into it but I don't have high
> >> >> confidence in our ability to actually support it in hardware/firmware.
> >> >> If it were as easy as you have been led to believe, we would have done
> >> >> it months ago when we were researching the requirements to support
> >> >> switchdev  
> >> >
> >> > wait, Sridhar made seven rounds of his submission (this is the v7
> >> > pointer [1]) and you still don't know if what you were attempting
> >> > to push upstream can work, something is weird here, can you
> >> > clarify? Jeff?  
> >>
> >> Not weird so much as stubborn. The patches were being pushed based on
> >> the assumption that the community would accept a NIC generating port
> >> representors that didn't necessarily pass traffic, and then even when
> >> we had them passing traffic the PF still wasn't configured to handle
> >> being the default destination for traffic without any rules
> >> associated, instead VFs would directly send to the outside world.  
> >
> > Perhaps the way forward is to lift the requirement on passing traffic,
> > as long as the limitation is clearly expressed to the users.  
> 
> No, I am not arguing for that because then SwitchDev will fall into
> disarray. If we want to have a strict definition for what is SwitchDev
> and what isn't I am okay with that. It gives us a definition of what
> our hardware needs to do in order to support it and without that we
> are going to get hardware that just bends the rules to claim support
> for it.

Let me make sure we understand each other.  The switchdev SR-IOV mode is
what happens when user requests DEVLINK_ESWITCH_MODE_SWITCHDEV.  Are you
saying you are opposed to adding DEVLINK_ESWITCH_MODE_VEPA?

> All I am asking for is for us to not close the door to the possibility
> of adding features to legacy SR-IOV. I am hoping to use a source
> macvlan based approach to make it so that we can support "port
> representors" for devices that can't support full SwitchDev. The idea
> would be to use them to get as close to SwitchDev level support on
> legacy devices as possible without using full SwitchDev. That should
> solve a good part of the issue, but I am pretty certain I need to be
> able to extend legacy SR-IOV in order to support it. I had talked with
> Jiri at netdev 2.1 about it back when we had submitted the v7 patches,
> and the decision was to look at doing "port representors" but don't
> associate them with SwitchDev. I was out on Sabbatical for most of the
> summer and I am just now starting on the macvlan work I had planned. I
> hope to have it done before the next netdev and then we can discuss it
> there if it needs more discussion than what we can have on the mailing
> list.

I don't know what you mean with the macvlan based approach.  Could you
perhaps describe it in more detail?  Will it allow users to configure
forwarding and queueing with existing, standard tools and APIs?

> I'm fine with us placing any legacy SR-IOV changes under more
> scrutiny. I am just not open to saying we will not extend or update
> any features for legacy SR-IOV. The fact is we are still selling a ton
> of ixgbe based parts, so I can't say with any certainty that there
> won't be a request for some new SR-IOV feature in the future.
> 
> - Alex


Re: [PATCH] macvlan: verify MTU before lowerdev xmit

2017-11-14 Thread Daniel Axtens
Hi Shannon,

> Now that I think about this a little more, why is this not already 
> getting handled by the NETDEV_CHANGEMTU notifier?  In what case are you 
> running into this, and why is it not triggering the notifier?

My commit message was probably not super clear here - apologies for any
mangling of terminology/concepts. The changed MTU is getting propagated
from the lowerdev to the macvtap interface just fine, it's just not
getting enforced.

Hopefully the following diagrams make it clearer.

This is the current behaviour: a VM sends a packet of 1514 bytes and it
makes it through the macvtap/macvlan and lowerdev, to the destination,
even though the intermediate MTUs are lower.

*---*
|VM |
|   |
| mtu 1500  |
*---*
  |
  V 1514 byte packet
  |
*---*
|  macvtap  |
|   |
| mtu 1480  |
*---*
  |
  V 1514 byte packet
  |
*---*
| lowerdev  |
|   |
| mtu 1480  |
*---*
  |
  V 1514 byte packet
  |
*---*
|   dest|
|   |
| mtu 1500  |
*---*

My thought here is that the lowerdev should not be asked to transmit a
packet that is larger than its MTU. The patch causes the following
behaviour:


*---*
|VM |
|   |
| mtu 1500  |
*---*
  |
  V 1514 byte packet
  |
*---*
|  macvtap  |
|   |
| mtu 1480  |
*---*
  |
  | packet dropped
  X  1500 > 1480

I think this makes macvlan consistent with bridges.
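
A minimal sketch of the check being discussed - where exactly it sits in
macvlan_start_xmit and how the drop is accounted are details for v2:

	/* Drop frames whose payload exceeds the lower device's MTU
	 * instead of handing them to lowerdev (sketch, not the patch): */
	if (unlikely(skb->len > vlan->lowerdev->mtu + ETH_HLEN)) {
		kfree_skb(skb);
		return NET_XMIT_DROP;
	}
	skb->dev = vlan->lowerdev;
	return dev_queue_xmit(skb);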

>> As for the other paths, I see that the call to dev_forward_skb() already 
>> has this protection in it, but does the call to dev_queue_xmit_accel() 
>> in macvlan_start_xmit() need similar protection?

I think so. I will have a look and do a v2 if the core idea of the patch
is OK.

Regards,
Daniel



Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-14 Thread Eric W. Biederman
Kirill Tkhai  writes:

> On 14.11.2017 21:39, Cong Wang wrote:
>> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai  wrote:
>>> @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags,
>>>
>>> get_user_ns(user_ns);
>>>
> >>> -   rv = mutex_lock_killable(&net_mutex);
> >>> +   rv = down_read_killable(&net_sem);
> >>> if (rv < 0) {
> >>> net_free(net);
> >>> dec_net_namespaces(ucounts);
> >>> @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags,
> >>> list_add_tail_rcu(&net->list, &net_namespace_list);
> >>> rtnl_unlock();
> >>> }
> >>> -   mutex_unlock(&net_mutex);
> >>> +   up_read(&net_sem);
> >>> if (rv < 0) {
> >>> dec_net_namespaces(ucounts);
> >>> put_user_ns(user_ns);
> >>> @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work)
> >>> list_replace_init(&cleanup_list, &net_kill_list);
> >>> spin_unlock_irq(&cleanup_list_lock);
> >>>
> >>> -   mutex_lock(&net_mutex);
> >>> +   down_read(&net_sem);
> >>>
> >>> /* Don't let anyone else find us. */
> >>> rtnl_lock();
> >>> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
> >>> list_for_each_entry_reverse(ops, &pernet_list, list)
> >>> ops_free_list(ops, &net_exit_list);
> >>>
> >>> -   mutex_unlock(&net_mutex);
> >>> +   up_read(&net_sem);
>> 
>> After your patch setup_net() could run concurrently with cleanup_net(),
>> given that ops_exit_list() is called on error path of setup_net() too,
>> it means ops->exit() now could run concurrently if it doesn't have its
>> own lock. Not sure if this breaks any existing user.
>
> Yes, there will be possible concurrent ops->init() for one net namespace
> and ops->exit() for another one. I haven't found pernet operations that
> have a problem with that. If they exist, they are hidden and not clearly
> seen. The pernet operations in general do not touch someone else's memory.
> If one suddenly does, KASAN should show it after a while.

Certainly the use of hash tables shared between multiple network
namespaces would count.  I don't remember how many of these we have, but
there used to be quite a few.

Eric



Re: [PATCH v4 1/1] xdp: Sample xdp program implementing ip forward

2017-11-14 Thread Christina Jacob
On Thu, Nov 9, 2017 at 7:08 AM, Jesper Dangaard Brouer
 wrote:
> On Wed, 08 Nov 2017 10:40:24 +0900 (KST)
> David Miller  wrote:
>
>> From: Christina Jacob 
>> Date: Sun,  5 Nov 2017 08:52:30 +0530
>>
>> > From: Christina Jacob 
>> >
>> > Implements port to port forwarding with route table and arp table
>> > lookup for ipv4 packets using bpf_redirect helper function and
>> > lpm_trie  map.
>> >
>> > Signed-off-by: Christina Jacob 
>>
>> Applied to net-next, thank you.
>
> I've not had time to proper test (and review) this V4 patch, but I
> guess I'll have to do so when I get home from Seoul...
>
> I especially want to measure the effect of using bpf_redirect_map().
> To Christina: what performance improvement did you see on your
> board/arch when switching from bpf_redirect() to bpf_redirect_map()?

ndo_xdp_flush is yet to be implemented in our driver.
So I don't see any difference moving from bpf_redirect to bpf_redirect_map.

>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
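
For context, the devmap-based variant Jesper is asking about looks
roughly like this (map layout in the style of samples/bpf of that era;
the slot numbering is an assumption):

struct bpf_map_def SEC("maps") tx_port = {
	.type = BPF_MAP_TYPE_DEVMAP,
	.key_size = sizeof(int),
	.value_size = sizeof(int),
	.max_entries = 64,
};

SEC("xdp_redirect_map")
int xdp_prog_redirect_map(struct xdp_md *ctx)
{
	int vport = 0;	/* devmap slot filled in by the control plane */

	/* Unlike plain bpf_redirect(), the map variant lets the core
	 * batch transmits and flush once per NAPI poll via ndo_xdp_flush,
	 * which is where the expected speedup comes from. */
	return bpf_redirect_map(&tx_port, vport, 0);
}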


Re: SRIOV switchdev mode BoF minutes

2017-11-14 Thread Alexander Duyck
On Tue, Nov 14, 2017 at 3:36 PM, Jakub Kicinski
 wrote:
> On Tue, 14 Nov 2017 15:05:08 -0800, Alexander Duyck wrote:
>> >> We basically need to do some feasibility research to see if we can
>> >> actually meet all the requirements for switchdev on i40e. We have been
>> >> getting mixed messages where we are given a great many "yes, but" type
>> >> answers. For i40e we are looking into it but I don't have high
>> >> confidence in our ability to actually support it in hardware/firmware.
>> >> If it were as easy as you have been led to believe, we would have done
>> >> it months ago when we were researching the requirements to support
>> >> switchdev
>> >
>> > wait, Sridhar made seven rounds of his submission (this is the v7
>> > pointer [1]) and you still don't know if what you were attempting
>> > to push upstream can work, something is weird here, can you
>> > clarify? Jeff?
>>
>> Not weird so much as stubborn. The patches were being pushed based on
>> the assumption that the community would accept a NIC generating port
>> representors that didn't necessarily pass traffic, and then even when
>> we had them passing traffic the PF still wasn't configured to handle
>> being the default destination for traffic without any rules
>> associated, instead VFs would directly send to the outside world.
>
> Perhaps the way forward is to lift the requirement on passing traffic,
> as long as the limitation is clearly expressed to the users.

No, I am not arguing for that because then SwitchDev will fall into
disarray. If we want to have a strict definition for what is SwitchDev
and what isn't I am okay with that. It gives us a definition of what
our hardware needs to do in order to support it and without that we
are going to get hardware that just bends the rules to claim support
for it.

All I am asking for is for us to not close the door to the possibility
of adding features to legacy SR-IOV. I am hoping to use a source
macvlan based approach to make it so that we can support "port
representors" for devices that can't support full SwitchDev. The idea
would be to use them to get as close to SwitchDev level support on
legacy devices as possible without using full SwitchDev. That should
solve a good part of the issue, but I am pretty certain I need to be
able to extend legacy SR-IOV in order to support it. I had talked with
Jiri at netdev 2.1 about it back when we had submitted the v7 patches,
and the decision was to look at doing "port representors" but don't
associate them with SwitchDev. I was out on Sabbatical for most of the
summer and I am just now starting on the macvlan work I had planned. I
hope to have it done before the next netdev and then we can discuss it
there if it needs more discussion than what we can have on the mailing
list.

I'm fine with us placing any legacy SR-IOV changes under more
scrutiny. I am just not open to saying we will not extend or update
any features for legacy SR-IOV. The fact is we are still selling a ton
of ixgbe based parts, so I can't say with any certainty that there
won't be a request for some new SR-IOV feature in the future.

- Alex


Re: [PATCH] ibmveth: Kernel crash LSO offload flag toggle

2017-11-14 Thread Daniel Axtens
Hi Bryant,

This looks a bit better, but...

> The following patch ensures that the bounce_buffer is not null
> prior to using it within skb_copy_from_linear_data.

How would this occur?

Looking at ibmveth.c, I see bounce_buffer being freed in ibmveth_close()
and allocated in ibmveth_open() only. If allocation fails, the whole
opening of the device fails with -ENOMEM.

It seems your test case - changing TSO - causes ibmveth_set_tso() to
cause an adaptor restart - an ibmveth_close(dev) and then an
ibmveth_open(dev). Is this happening in parallel with an out of memory
condition - is the memory allocation failing?

Alternatively, could it be the case that you're closing the device while
packets are in flight, and then trying to use a bounce_buffer that's
been freed elsewhere? Do you need to decouple memory freeing from
ibmveth_close?

> The problem can be recreated toggling on/off Large send offload.
>
> The following script, when run (along with some iperf traffic),
> recreates the crash within 5-10 mins or so.
>
> while true
> do
>   ethtool -k ibmveth0 | grep tcp-segmentation-offload
>   ethtool -K ibmveth0 tso off
>   ethtool -k ibmveth0 | grep tcp-segmentation-offload
>   ethtool -K ibmveth0 tso on
> done
>
> Note: This issue happens the very first time large send offload is
> turned off too (but the above script recreates the issue every time)
>
> [76563.914173] Unable to handle kernel paging request for data at address 
> 0x
> [76563.914197] Faulting instruction address: 0xc0063940
> [76563.914205] Oops: Kernel access of bad area, sig: 11 [#1]
> [76563.914210] SMP NR_CPUS=2048 NUMA pSeries
> [76563.914217] Modules linked in: rpadlpar_io rpaphp dccp_diag dccp tcp_diag 
> udp_diag inet_diag unix_diag af_packet_diag netlink_diag nls_utf8 isofs 
> binfmt_misc pseries_rng rtc_generic autofs4 ibmvfc scsi_transport_fc ibmvscsi 
> ibmveth
> [76563.914251] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.4.0-34-generic 
> #53-Ubuntu
^--- yikes!

There are relevant changes to this area since 4.4:
2c42bf4b4317 ("ibmveth: check return of skb_linearize in ibmveth_start_xmit")
66aa0678efc2 ("ibmveth: Support to enable LSO/CSO for Trunk VEA.")

Does this crash occurs on a more recent kernel?

> diff --git a/drivers/net/ethernet/ibm/ibmveth.c 
> b/drivers/net/ethernet/ibm/ibmveth.c
> index f210398200ece..1d29b1649118d 100644
> --- a/drivers/net/ethernet/ibm/ibmveth.c
> +++ b/drivers/net/ethernet/ibm/ibmveth.c
> @@ -1092,8 +1092,14 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff 
> *skb,
>*/
>   if (force_bounce || (!skb_is_nonlinear(skb) &&
>   (skb->len < tx_copybreak))) {
> - skb_copy_from_linear_data(skb, adapter->bounce_buffer,
> -   skb->len);
> + if (adapter->bounce_buffer) {
> + skb_copy_from_linear_data(skb, adapter->bounce_buffer,
> +   skb->len);
> + } else {
> + adapter->tx_send_failed++;
> + netdev->stats.tx_dropped++;
> + goto out;
Finally, as I alluded to at the top of this message, isn't the
disappearance of the bounce-buffer a pretty serious issue? As I
understand it, it means your data structures are now inconsistent. Do
you need to - at the least - be more chatty here?

Regards,
Daniel
> + }
>  
>   descs[0].fields.flags_len = desc_flags | skb->len;
>   descs[0].fields.address = adapter->bounce_buffer_dma;
> -- 
> 2.13.6 (Apple Git-96)


Re: [PATCH] net: recvmsg: Unconditionally zero struct sockaddr_storage

2017-11-14 Thread Kees Cook
On Wed, Nov 1, 2017 at 5:48 AM, Eric W. Biederman  wrote:
> Eric Dumazet  writes:
>
>> On Tue, 2017-10-31 at 09:14 -0700, Kees Cook wrote:
>>> Some protocols do not correctly wipe the contents of the on-stack
>>> struct sockaddr_storage sent down into recvmsg() (e.g. SCTP), and leak
>>> kernel stack contents to userspace. This wipes it unconditionally before
>>> per-protocol handlers run.
>>>
>>> Note that leaks like this are mitigated by building with
>>> CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL=y
>>>
>>> Reported-by: Alexander Potapenko 
>>> Cc: "David S. Miller" 
>>> Cc: netdev@vger.kernel.org
>>> Signed-off-by: Kees Cook 
>>> ---
>>>  net/socket.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/net/socket.c b/net/socket.c
>>> index c729625eb5d3..34183f4fbdf8 100644
>>> --- a/net/socket.c
>>> +++ b/net/socket.c
>>> @@ -2188,6 +2188,7 @@ static int ___sys_recvmsg(struct socket *sock, struct 
>>> user_msghdr __user *msg,
>>>  struct sockaddr __user *uaddr;
>>>  int __user *uaddr_len = COMPAT_NAMELEN(msg);
>>>
>>> +memset(&addr, 0, sizeof(addr));
>>>  msg_sys->msg_name = &addr;
>>>
>>
>> This kind of patch comes every year.
>>
>> Standard answer is : We fix the buggy protocol, we do not make
>> everything slower just because we are lazy.
>>
>> struct sockaddr is 128 bytes, but IPV4 only uses a fraction of it.
>>
>> Also memset() is using long word stores, so subsequent 4-byte or 2-byte
>> stores to the same location hit a performance problem on x86.
>>
>> By adding all this defensive programming, we would give strong
>> incentives to bypass the kernel for networking. That would be bad.
>
> In this case it looks like the root cause is something in sctp
> not filling in the ipv6 sin6_scope_id.
>
> Which is not only a leak but a correctness bug.
>
> I ran the reproducer test program and while none of the leak checkers
> are telling me anything I have gotten as far as seeing that the returned
> length is correct and sometimes nonsense.
>
> Hmm.
>
> At a quick look it looks like all that is necessary is to do this:
>
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index 51c488769590..6301913d0516 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -807,9 +807,10 @@ static void sctp_inet6_skb_msgname(struct sk_buff *skb, 
> char *msgname,
> addr->v6.sin6_flowinfo = 0;
> addr->v6.sin6_port = sh->source;
> addr->v6.sin6_addr = ipv6_hdr(skb)->saddr;
> -   if (ipv6_addr_type(&addr->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL) {
> +   if (ipv6_addr_type(&addr->v6.sin6_addr) & IPV6_ADDR_LINKLOCAL)
> addr->v6.sin6_scope_id = sctp_v6_skb_iif(skb);
> -   }
> +   else
> +   addr->v6.sin6_scope_id = 0;
> }
>
> *addr_len = sctp_v6_addr_to_user(sctp_sk(skb->sk), addr);
>

It looks like this never landed anywhere? Eric, are you able to resend
this as a full patch?

Thanks!

-Kees

-- 
Kees Cook
Pixel Security


Re: [RFC PATCH 06/17] net: sched: explicit locking in gso_cpu fallback

2017-11-14 Thread Willem de Bruijn
>> -static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>> +static inline int __dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>>  {
>
> Perhaps dev_requeue_skb_qdisc_locked is more descriptive. Or
> adding a lockdep_is_held(..) also documents that the __locked variant
> below is not just a lock/unlock wrapper around this inner function.
>
>> -   q->gso_skb = skb;
>> +   __skb_queue_head(&q->gso_skb, skb);
>> q->qstats.requeues++;
>> qdisc_qstats_backlog_inc(q, skb);
>> q->q.qlen++;/* it's still part of the queue */
>> @@ -57,6 +56,30 @@ static inline int dev_requeue_skb(struct sk_buff *skb, 
>> struct Qdisc *q)
>> return 0;
>>  }
>>
>> +static inline int dev_requeue_skb_locked(struct sk_buff *skb, struct Qdisc 
>> *q)
>> +{
>> +   spinlock_t *lock = qdisc_lock(q);
>> +
>> +   spin_lock(lock);
>> +   __skb_queue_tail(&q->gso_skb, skb);
>
> why does this requeue at the tail, unlike __dev_requeue_skb?

I guess that requeue has to queue at the tail in the lockless case,
and it does not matter in the qdisc_locked case, as then there can
only ever be at most one outstanding gso_skb.


Re: [RFC PATCH 03/17] net: sched: remove remaining uses for qdisc_qlen in xmit path

2017-11-14 Thread Willem de Bruijn
On Tue, Nov 14, 2017 at 7:11 PM, Willem de Bruijn
 wrote:
> On Mon, Nov 13, 2017 at 3:08 PM, John Fastabend
>  wrote:
>> sch_direct_xmit() uses qdisc_qlen as a return value but all call sites
>> of the routine only check if it is zero or not. Simplify the logic so
>> that we don't need to return an actual queue length value.
>>
>> This introduces a case now where sch_direct_xmit would have returned
>> a qlen of zero but now it returns true.
>
> You mean the first case, when the likely(skb) branch failed?
> Can that return false, then?

I misunderstood. __qdisc_run will just take one extra loop, since it
will no longer break out of the loop on reading the last packet on
the queue with qdisc_restart.

That does have a subtle (but benign if rare) side effect of possibly
calling __netif_schedule while there is nothing queued, if either
the quota is exactly exhausted or need_resched is true.


[PATCH iproute2/net-next v2]tc: B.W limits can now be specified in %.

2017-11-14 Thread Nishanth Devarajan
This patch adapts the tc command line interface to allow bandwidth limits
to be specified as a percentage of the interface's capacity.

For this purpose, we've modified and moved int read_prop() from
ip/iptuntap.c to lib/utils.c to make it accessible to tc.

Additionally, adding this functionality requires passing the specified
device string to each class/qdisc which changes the prototype for a
couple of functions: the .parse_qopt and .parse_copt interfaces. The
device string is a required parameter for tc-qdisc and tc-class, and when
not specified, the kernel returns ENODEV. In this patch, if the user tries
to specify a bandwidth percentage without naming the device, we return an
error from userspace.

Signed-off-by: Nishanth Devarajan 
---
 include/utils.h |  1 +
 ip/iptuntap.c   | 32 
 lib/utils.c | 51 
 man/man8/tc.8   |  4 +++-
 tc/q_atm.c  |  2 +-
 tc/q_cbq.c  | 25 +-
 tc/q_choke.c|  9 ++--
 tc/q_clsact.c   |  2 +-
 tc/q_codel.c|  2 +-
 tc/q_drr.c  |  4 ++--
 tc/q_dsmark.c   |  4 ++--
 tc/q_fifo.c |  2 +-
 tc/q_fq.c   | 16 +++---
 tc/q_fq_codel.c |  2 +-
 tc/q_gred.c |  9 ++--
 tc/q_hfsc.c | 45 ++-
 tc/q_hhf.c  |  2 +-
 tc/q_htb.c  | 18 
 tc/q_ingress.c  |  2 +-
 tc/q_mqprio.c   |  2 +-
 tc/q_multiq.c   |  2 +-
 tc/q_netem.c|  9 ++--
 tc/q_pie.c  |  2 +-
 tc/q_prio.c |  2 +-
 tc/q_qfq.c  |  4 ++--
 tc/q_red.c  |  9 ++--
 tc/q_rr.c   |  2 +-
 tc/q_sfb.c  |  2 +-
 tc/q_sfq.c  |  2 +-
 tc/q_tbf.c  | 16 +++---
 tc/tc.c |  2 +-
 tc/tc_class.c   |  2 +-
 tc/tc_qdisc.c   |  2 +-
 tc/tc_util.c| 66 +
 tc/tc_util.h|  8 +--
 35 files changed, 268 insertions(+), 96 deletions(-)

diff --git a/include/utils.h b/include/utils.h
index 3d91c50..63fea7c 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -87,6 +87,7 @@ int get_prefix(inet_prefix *dst, char *arg, int family);
 int mask2bits(__u32 netmask);
 int get_addr_ila(__u64 *val, const char *arg);
 
+int read_prop(char *dev, char *prop, long *value);
 int get_hex(char c);
 int get_integer(int *val, const char *arg, int base);
 int get_unsigned(unsigned *val, const char *arg, int base);
diff --git a/ip/iptuntap.c b/ip/iptuntap.c
index b46e452..09f2be2 100644
--- a/ip/iptuntap.c
+++ b/ip/iptuntap.c
@@ -223,38 +223,6 @@ static int do_del(int argc, char **argv)
return tap_del_ioctl();
 }
 
-static int read_prop(char *dev, char *prop, long *value)
-{
-   char fname[IFNAMSIZ+25], buf[80], *endp;
-   ssize_t len;
-   int fd;
-   long result;
-
-   sprintf(fname, "/sys/class/net/%s/%s", dev, prop);
-   fd = open(fname, O_RDONLY);
-   if (fd < 0) {
-   if (strcmp(prop, "tun_flags"))
-   fprintf(stderr, "open %s: %s\n", fname,
-   strerror(errno));
-   return -1;
-   }
-   len = read(fd, buf, sizeof(buf)-1);
-   close(fd);
-   if (len < 0) {
-   fprintf(stderr, "read %s: %s", fname, strerror(errno));
-   return -1;
-   }
-
-   buf[len] = 0;
-   result = strtol(buf, &endp, 0);
-   if (*endp != '\n') {
-   fprintf(stderr, "Failed to parse %s\n", fname);
-   return -1;
-   }
-   *value = result;
-   return 0;
-}
-
 static void print_flags(long flags)
 {
if (flags & IFF_TUN)
diff --git a/lib/utils.c b/lib/utils.c
index 4f2fa28..1332410 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -39,6 +39,57 @@
 int resolve_hosts;
 int timestamp_short;
 
+int read_prop(char *dev, char *prop, long *value)
+{
+   char fname[128], buf[80], *endp, *nl;
+   FILE *fp;
+   long result;
+   int ret;
+
+   ret = snprintf(fname, sizeof(fname), "/sys/class/net/%s/%s",
+   dev, prop);
+
+   if (ret <= 0 || ret >= sizeof(fname)) {
+   fprintf(stderr, "could not build pathname for property\n");
+   return -1;
+   }
+
+   fp = fopen(fname, "r");
+   if (fp == NULL) {
+   fprintf(stderr, "fopen %s: %s\n", fname, strerror(errno));
+   return -1;
+   }
+
+   if (!fgets(buf, sizeof(buf), fp)) {
+   fclose(fp);
+   goto out;
+   }
+
+   nl = strchr(buf, '\n');
+   if (nl)
+   *nl = '\0';
+
+   fclose(fp);
+   result = strtoul(buf, &endp, 0);
+
+   if (buf == endp || *endp) {
+   fprintf(stderr, "value \"%s\" in file %s is not a number\n",
+   buf, fname);
+   goto out;
+   }
+
+   if (result == ULONG_MAX && errno == ERANGE) {
+   fprintf(stderr, "strtoul %s: %s", fname, strerror(errno));
+   goto out;
+   }
+
+   
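
The conversion at the heart of the patch amounts to reading the device
speed from sysfs via read_prop() and scaling by the percentage - a sketch
under assumed helper and unit conventions, not the submitted code:

/* Convert "NN%" into bytes/sec using /sys/class/net/<dev>/speed,
 * which reports Mbit/s. */
static int get_percent_rate_sketch(unsigned int *rate, const char *str,
				   const char *dev)
{
	long dev_mbit;
	double percent;

	if (read_prop((char *)dev, "speed", &dev_mbit))
		return -1;	/* no device name or no speed attribute */

	if (sscanf(str, "%lf%%", &percent) != 1 ||
	    percent < 0 || percent > 100)
		return -1;

	/* Mbit/s -> bytes/s, scaled by the requested percentage. */
	*rate = dev_mbit * 1000000.0 / 8.0 * percent / 100.0;
	return 0;
}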

[PATCHv2 net] geneve: fix fill_info when link down

2017-11-14 Thread Hangbin Liu
geneve->sock4/6 were added with geneve_open and released with geneve_stop.
So when the geneve link is down, we are not able to show the remote address
and checksum info after commit 11387fe4a98 ("geneve: fix fill_info when
using collect_metadata").

Fix this by avoiding passing *_REMOTE{,6} for COLLECT_METADATA, since they
are mutually exclusive, and by always showing UDP_ZERO_CSUM6_RX info.

Fixes: 11387fe4a98 ("geneve: fix fill_info when using collect_metadata")
Signed-off-by: Hangbin Liu 
---
 drivers/net/geneve.c | 24 ++--
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index ed51018..b9d8d71 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -1503,6 +1503,7 @@ static int geneve_fill_info(struct sk_buff *skb, const 
struct net_device *dev)
 {
struct geneve_dev *geneve = netdev_priv(dev);
struct ip_tunnel_info *info = &geneve->info;
+   bool metadata = geneve->collect_md;
__u8 tmp_vni[3];
__u32 vni;
 
@@ -1511,32 +1512,24 @@ static int geneve_fill_info(struct sk_buff *skb, const 
struct net_device *dev)
if (nla_put_u32(skb, IFLA_GENEVE_ID, vni))
goto nla_put_failure;
 
-   if (rtnl_dereference(geneve->sock4)) {
+   if (!metadata && ip_tunnel_info_af(info) == AF_INET) {
if (nla_put_in_addr(skb, IFLA_GENEVE_REMOTE,
info->key.u.ipv4.dst))
goto nla_put_failure;
-
if (nla_put_u8(skb, IFLA_GENEVE_UDP_CSUM,
   !!(info->key.tun_flags & TUNNEL_CSUM)))
goto nla_put_failure;
 
-   }
-
 #if IS_ENABLED(CONFIG_IPV6)
-   if (rtnl_dereference(geneve->sock6)) {
+   } else if (!metadata) {
if (nla_put_in6_addr(skb, IFLA_GENEVE_REMOTE6,
 &info->key.u.ipv6.dst))
goto nla_put_failure;
-
if (nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_TX,
   !(info->key.tun_flags & TUNNEL_CSUM)))
goto nla_put_failure;
-
-   if (nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_RX,
-  !geneve->use_udp6_rx_checksums))
-   goto nla_put_failure;
-   }
 #endif
+   }
 
if (nla_put_u8(skb, IFLA_GENEVE_TTL, info->key.ttl) ||
nla_put_u8(skb, IFLA_GENEVE_TOS, info->key.tos) ||
@@ -1546,10 +1539,13 @@ static int geneve_fill_info(struct sk_buff *skb, const 
struct net_device *dev)
if (nla_put_be16(skb, IFLA_GENEVE_PORT, info->key.tp_dst))
goto nla_put_failure;
 
-   if (geneve->collect_md) {
-   if (nla_put_flag(skb, IFLA_GENEVE_COLLECT_METADATA))
+   if (metadata && nla_put_flag(skb, IFLA_GENEVE_COLLECT_METADATA))
goto nla_put_failure;
-   }
+
+   if (nla_put_u8(skb, IFLA_GENEVE_UDP_ZERO_CSUM6_RX,
+  !geneve->use_udp6_rx_checksums))
+   goto nla_put_failure;
+
return 0;
 
 nla_put_failure:
-- 
2.5.5



Re: [RFC PATCH 12/17] net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mq

2017-11-14 Thread Willem de Bruijn
>  /**
>   * gnet_stats_copy_queue - copy queue statistics into statistics TLV
> diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
> index 213b586..bc59f05 100644
> --- a/net/sched/sch_mq.c
> +++ b/net/sched/sch_mq.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  struct mq_sched {
> struct Qdisc**qdiscs;
> @@ -103,15 +104,25 @@ static int mq_dump(struct Qdisc *sch, struct sk_buff 
> *skb)
> memset(&sch->qstats, 0, sizeof(sch->qstats));
>
> for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
> +   struct gnet_stats_basic_cpu __percpu *cpu_bstats = NULL;
> +   struct gnet_stats_queue __percpu *cpu_qstats = NULL;
> +   __u32 qlen = 0;
> +
> qdisc = netdev_get_tx_queue(dev, ntx)->qdisc_sleeping;
> spin_lock_bh(qdisc_lock(qdisc));
> -   sch->q.qlen += qdisc->q.qlen;
> -   sch->bstats.bytes   += qdisc->bstats.bytes;
> -   sch->bstats.packets += qdisc->bstats.packets;
> -   sch->qstats.backlog += qdisc->qstats.backlog;
> -   sch->qstats.drops   += qdisc->qstats.drops;
> -   sch->qstats.requeues+= qdisc->qstats.requeues;
> -   sch->qstats.overlimits  += qdisc->qstats.overlimits;
> +
> +   if (qdisc_is_percpu_stats(qdisc)) {
> +   cpu_bstats = qdisc->cpu_bstats;
> +   cpu_qstats = qdisc->cpu_qstats;
> +   }

Else, these fields are NULL. So no need for the intermediate
variables.

> +
> +   qlen = qdisc_qlen_sum(qdisc);
> +
> +   __gnet_stats_copy_basic(NULL, &sch->bstats,
> +   cpu_bstats, &qdisc->bstats);
> +   __gnet_stats_copy_queue(&sch->qstats,
> +   cpu_qstats, &qdisc->qstats, qlen);

These functions add when the percpu variant is taken, but assign
otherwise, so they overwrite on each iteration of the loop.


Re: [RFC PATCH 10/17] net: sched: qdisc_qlen for per cpu logic

2017-11-14 Thread Willem de Bruijn
On Mon, Nov 13, 2017 at 3:10 PM, John Fastabend
 wrote:
> Add qdisc qlen helper routines for lockless qdiscs to use.
>
> The qdisc qlen is no longer used in the hotpath but it is reported
> via stats query on the qdisc so it still needs to be tracked. This
> adds the per cpu operations needed.
>
> Signed-off-by: John Fastabend 
> ---
>  0 files changed
>
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 4717c4b..bad24a9 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -291,8 +291,16 @@ static inline void qdisc_cb_private_validate(const 
> struct sk_buff *skb, int sz)
> BUILD_BUG_ON(sizeof(qcb->data) < sz);
>  }
>
> +static inline int qdisc_qlen_cpu(const struct Qdisc *q)
> +{
> +   return this_cpu_ptr(q->cpu_qstats)->qlen;
> +}
> +
>  static inline int qdisc_qlen(const struct Qdisc *q)
>  {
> +   if (q->flags & TCQ_F_NOLOCK)
> +   return qdisc_qlen_cpu(q);

Shouldn't this return qdisc_qlen_sum from the follow-on patch?
The two patches can also be squashed.
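
For reference, a per-cpu sum along those lines might look like the
following (sketch; assumes a qlen field in the per-cpu gnet_stats_queue):

static inline int qdisc_qlen_sum(const struct Qdisc *q)
{
	__u32 qlen = 0;
	int i;

	for_each_possible_cpu(i)
		qlen += per_cpu_ptr(q->cpu_qstats, i)->qlen;

	return qlen;
}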


[PATCH net] bpf: fix lockdep splat

2017-11-14 Thread Eric Dumazet
From: Eric Dumazet 

pcpu_freelist_pop() needs the same lockdep awareness as
pcpu_freelist_populate() to avoid a false positive.

 [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]

 switchto-defaul/12508 [HC0[0]:SC0[6]:HE0:SE0] is trying to acquire:
  (&htab->buckets[i].lock){......}, at: [] 
__htab_percpu_map_update_elem+0x1cb/0x300
 
 and this task is already holding:
  (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}, at: 
[] __dev_queue_xmit+0x868/0x1240
 which would create a new lock dependency:
  (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...} -> 
(&htab->buckets[i].lock){......}
 
 but this new dependency connects a SOFTIRQ-irq-safe lock:
  (dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2){+.-...}
 ... which became SOFTIRQ-irq-safe at:
   [] __lock_acquire+0x42b/0x1f10
   [] lock_acquire+0xbc/0x1b0
   [] _raw_spin_lock+0x38/0x50
   [] __dev_queue_xmit+0x868/0x1240
   [] dev_queue_xmit+0x10/0x20
   [] ip_finish_output2+0x439/0x590
   [] ip_finish_output+0x150/0x2f0
   [] ip_output+0x7d/0x260
   [] ip_local_out+0x5e/0xe0
   [] ip_queue_xmit+0x205/0x620
   [] tcp_transmit_skb+0x5a8/0xcb0
   [] tcp_write_xmit+0x242/0x1070
   [] __tcp_push_pending_frames+0x3c/0xf0
   [] tcp_rcv_established+0x312/0x700
   [] tcp_v4_do_rcv+0x11c/0x200
   [] tcp_v4_rcv+0xaa2/0xc30
   [] ip_local_deliver_finish+0xa7/0x240
   [] ip_local_deliver+0x66/0x200
   [] ip_rcv_finish+0xdd/0x560
   [] ip_rcv+0x295/0x510
   [] __netif_receive_skb_core+0x988/0x1020
   [] __netif_receive_skb+0x21/0x70
   [] process_backlog+0x6f/0x230
   [] net_rx_action+0x229/0x420
   [] __do_softirq+0xd8/0x43d
   [] do_softirq_own_stack+0x1c/0x30
   [] do_softirq+0x55/0x60
   [] __local_bh_enable_ip+0xa8/0xb0
   [] cpu_startup_entry+0x1c7/0x500
   [] start_secondary+0x113/0x140
 
 to a SOFTIRQ-irq-unsafe lock:
  (&head->lock){+.+...}
 ... which became SOFTIRQ-irq-unsafe at:
 ...  [] __lock_acquire+0x82f/0x1f10
   [] lock_acquire+0xbc/0x1b0
   [] _raw_spin_lock+0x38/0x50
   [] pcpu_freelist_pop+0x7a/0xb0
   [] htab_map_alloc+0x50c/0x5f0
   [] SyS_bpf+0x265/0x1200
   [] entry_SYSCALL_64_fastpath+0x12/0x17
 
 other info that might help us debug this:
 
 Chain exists of:
   dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2 --> &htab->buckets[i].lock 
--> &head->lock
 
  Possible interrupt unsafe locking scenario:
 
        CPU0                    CPU1
        ----                    ----
   lock(&head->lock);
                                local_irq_disable();
                                lock(dev_queue->dev->qdisc_class ?: 
                                &qdisc_tx_lock#2);
                                lock(&htab->buckets[i].lock);
   <Interrupt>
     lock(dev_queue->dev->qdisc_class ?: &qdisc_tx_lock#2);
 
  *** DEADLOCK ***

Fixes: e19494edab82 ("bpf: introduce percpu_freelist")
Signed-off-by: Eric Dumazet 
---
 kernel/bpf/percpu_freelist.c |8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/percpu_freelist.c b/kernel/bpf/percpu_freelist.c
index 
5c51d1985b5102fbba64929b5626d41fb4b88b66..673fa6fe2d73cf815daf8a0c2eb0ce6eab15b41e
 100644
--- a/kernel/bpf/percpu_freelist.c
+++ b/kernel/bpf/percpu_freelist.c
@@ -78,8 +78,10 @@ struct pcpu_freelist_node *pcpu_freelist_pop(struct 
pcpu_freelist *s)
 {
struct pcpu_freelist_head *head;
struct pcpu_freelist_node *node;
+   unsigned long flags;
int orig_cpu, cpu;
 
+   local_irq_save(flags);
orig_cpu = cpu = raw_smp_processor_id();
while (1) {
head = per_cpu_ptr(s->freelist, cpu);
@@ -87,14 +89,16 @@ struct pcpu_freelist_node *pcpu_freelist_pop(struct 
pcpu_freelist *s)
node = head->first;
if (node) {
head->first = node->next;
-   raw_spin_unlock(&head->lock);
+   raw_spin_unlock_irqrestore(&head->lock, flags);
return node;
}
raw_spin_unlock(&head->lock);
cpu = cpumask_next(cpu, cpu_possible_mask);
if (cpu >= nr_cpu_ids)
cpu = 0;
-   if (cpu == orig_cpu)
+   if (cpu == orig_cpu) {
+   local_irq_restore(flags);
return NULL;
+   }
}
 }




Re: [net-next 00/10][pull request] 40GbE Intel Wired LAN Driver Updates 2017-11-14

2017-11-14 Thread David Miller
From: Jeff Kirsher 
Date: Tue, 14 Nov 2017 14:03:37 -0800

> On Tue, 2017-11-14 at 23:52 +0200, Or Gerlitz wrote:
>> On Tue, Nov 14, 2017 at 9:03 PM, Jeff Kirsher
>>  wrote:
>> > This series contains updates to i40e and i40evf only.
>> 
>> see http://vger.kernel.org/~davem/net-next.html
> 
> You're right, Or, I failed to look at the URL for the status of net-next. 
>  I made the mistake of going through the netdev emails to see if Dave
> had sent out a "net-next closed" email and forgot about the URL.
> 
> Dave, I am fine with holding onto most of these patches until net-next
> opens back up.  There are a few fixes in the series which I will
> resubmit to your net tree, after you sync with Linus later this week.

Yes, please hold onto this until net-next is open again.

Thanks.


Re: [patch net-next 0/2] mlxsw: Update firmware version

2017-11-14 Thread David Ahern
On 11/14/17 5:17 AM, David Miller wrote:
> From: Jiri Pirko 
> Date: Sun, 12 Nov 2017 09:01:23 +0100
> 
>> From: Jiri Pirko 
>>
>> Ido says:
>>
>> This set adjusts the driver to use a new firmware version. The new
>> version includes various enhancements and fixes detailed in the first
>> patch.
>>
>> The second patch enables batch deletion of neighbours on a router
>> interface (RIF) which was not possible with previous versions.
> 
> Series applied, thanks.
> 

Can you wait until said firmware version has been pushed into
linux-firmware *before* sending these patches. Considering the driver
refuses to work without the right firmware version this has a negative
impact on users.


Re: [PATCH net-next v2 0/2] retire IPX and NCPFS

2017-11-14 Thread David Miller
From: Stephen Hemminger 
Date: Tue, 14 Nov 2017 08:37:13 -0800

> These are both old decrepit protocols that need to be sent
> to pasture.

These need to go to gregkh and his staging/ tree, not net-next.


Re: [PATCH] ipv6: sr: update the struct ipv6_sr_hdr

2017-11-14 Thread David Miller
From: Ahmed Abdelsalam 
Date: Tue, 14 Nov 2017 15:31:48 +0100

> Also, it will not make sense to have the field name different from the draft.

That is the danger of defining user facing things against a draft which
is constantly changing.

Sorry, we are stuck with the current name.


Re: [PATCH] ipv6: sr: update the struct ipv6_sr_hdr

2017-11-14 Thread David Miller
From: Edward Cree 
Date: Tue, 14 Nov 2017 14:14:01 +

> On 14/11/17 12:37, David Miller wrote:
>> From: Ahmed Abdelsalam 
>> Date: Sun, 12 Nov 2017 21:37:01 +0100
>> 
>>> diff --git a/include/uapi/linux/seg6.h b/include/uapi/linux/seg6.h
>>> index 2f6fb0d..3f4b3ab 100644
>>> --- a/include/uapi/linux/seg6.h
>>> +++ b/include/uapi/linux/seg6.h
>>> @@ -26,9 +26,9 @@ struct ipv6_sr_hdr {
>>> __u8hdrlen;
>>> __u8type;
>>> __u8segments_left;
>>> -   __u8first_segment;
>>> +   __u8last_entry;
>> 
>> This is user ABI and cannot be changed.
>> 
>> Sorry, folks should have considered these issues when the SR
>> changes were submitted.  This field must keep the name 'first_segment'
>> forever.
> 
> Surely renaming struct fields only changes the API, not the ABI?
> Binaries compiled against the old struct definition will still behave the
>  same (AFAICT the patch doesn't change how these fields are used), while
>  sources being recompiled (so they care about the name change) can be
>  changed.

Yes, it will stop existing applications from compiling and we can't
do that.


Re: [RFC PATCH 06/17] net: sched: explicit locking in gso_cpu fallback

2017-11-14 Thread Willem de Bruijn
>  /* use instead of qdisc->dequeue() for all qdiscs queried with ->peek() */
>  static inline struct sk_buff *qdisc_dequeue_peeked(struct Qdisc *sch)
>  {
> -   struct sk_buff *skb = sch->gso_skb;
> +   struct sk_buff *skb = skb_peek(&sch->gso_skb);
>
> if (skb) {
> -   sch->gso_skb = NULL;
> +   skb = __skb_dequeue(&sch->gso_skb);
> qdisc_qstats_backlog_dec(sch, skb);
> sch->q.qlen--;

In lockless qdiscs, can this race, so that __skb_dequeue returns NULL?
Same for its use in qdisc_peek_dequeued.

> -static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
> +static inline int __dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>  {

Perhaps dev_requeue_skb_qdisc_locked is more descriptive. Or
adding a lockdep_is_held(..) also documents that the __locked variant
below is not just a lock/unlock wrapper around this inner function.
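
For example (a sketch of the suggested annotation, assuming the qdisc lock
is indeed what protects gso_skb here):

	static inline int __dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
	{
		lockdep_assert_held(qdisc_lock(q));

		__skb_queue_head(&q->gso_skb, skb);
		/* ... stats updates and __netif_schedule() as in the patch ... */
		return 0;
	}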

> -   q->gso_skb = skb;
> +   __skb_queue_head(&q->gso_skb, skb);
> q->qstats.requeues++;
> qdisc_qstats_backlog_inc(q, skb);
> q->q.qlen++;/* it's still part of the queue */
> @@ -57,6 +56,30 @@ static inline int dev_requeue_skb(struct sk_buff *skb, 
> struct Qdisc *q)
> return 0;
>  }
>
> +static inline int dev_requeue_skb_locked(struct sk_buff *skb, struct Qdisc 
> *q)
> +{
> +   spinlock_t *lock = qdisc_lock(q);
> +
> +   spin_lock(lock);
> +   __skb_queue_tail(&q->gso_skb, skb);

why does this requeue at the tail, unlike __dev_requeue_skb?

> +   spin_unlock(lock);
> +
> +   qdisc_qstats_cpu_requeues_inc(q);
> +   qdisc_qstats_cpu_backlog_inc(q, skb);
> +   qdisc_qstats_cpu_qlen_inc(q);
> +   __netif_schedule(q);
> +
> +   return 0;
> +}
> +


Re: [PATCH 0/9] use network namespace for iSCSI control interfaces

2017-11-14 Thread Chris Leech
On Wed, Nov 08, 2017 at 10:31:04AM +, David Laight wrote:
> From: Chris Leech
> > Sent: 07 November 2017 22:45
> > 
> > I've posted these changes to allow iSCSI management within a container
> > using a network namespace to the SCSI and Open-iSCSI lists, but seeing
> > as it's not really SCSI/block related I'm casting a wider net looking
> > for reviews.
> 
> I didn't spot you acquiring and releasing references to the namespace.
> (I might have missed it, the relevant patch is difficult to read).
> 
> If the sockets are created in the context of the process whose namespace
> you are using you don't need it, but given the hooks and callbacks
> I'm not at all sure that is obviously true.

Thanks David,

Looking at it again, you're right and I think I need to hold a reference
for the iSCSI host and handle namespace deletion.  Even for iscsi_tcp
the socket gets handed off from the creating process to the transport
and can outlive iscsid.

I'm looking at migration or destruction now rather than later.

Chris



Re: [PATCH 4.4 27/56] cdc_ncm: Set NTB format again after altsetting switch for Huawei devices

2017-11-14 Thread Ben Hutchings
On Tue, 2017-07-11 at 17:21 +0200, Enrico Mioso wrote:
> From: Enrico Mioso 
> 
> commit 2b02c20ce0c28974b44e69a2e2f5ddc6a470ad6f upstream.
[...]
> --- a/drivers/net/usb/cdc_ncm.c
> +++ b/drivers/net/usb/cdc_ncm.c
> @@ -724,8 +724,10 @@ int cdc_ncm_bind_common(struct usbnet *d
>   u8 *buf;
>   int len;
>   int temp;
> + int err;
>   u8 iface_no;
>   struct usb_cdc_parsed_header hdr;
> + u16 curr_ntb_format;
[...]
> + err = usbnet_read_cmd(dev, USB_CDC_GET_NTB_FORMAT,
> +   USB_TYPE_CLASS | USB_DIR_IN | 
> USB_RECIP_INTERFACE,
> +   0, iface_no, &curr_ntb_format, 2);
> + if (err < 0) {
> + goto error2;
> + }
> +
> + if (curr_ntb_format == USB_CDC_NCM_NTB32_FORMAT) {
[...]

usbnet_read_cmd() doesn't do any byte-swapping, so it looks like
curr_ntb_format will have little-endian byte order (__le16 not u16). 
The comparison will then need to be done using
le16_to_cpu(curr_ntb_format).
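
Something along these lines, perhaps (untested sketch):

	__le16 curr_ntb_format;	/* was u16 */

	err = usbnet_read_cmd(dev, USB_CDC_GET_NTB_FORMAT,
			      USB_TYPE_CLASS | USB_DIR_IN | USB_RECIP_INTERFACE,
			      0, iface_no, &curr_ntb_format, 2);
	if (err < 0)
		goto error2;

	/* the device reports the format little-endian; convert before
	 * comparing against the host-order constant */
	if (le16_to_cpu(curr_ntb_format) == USB_CDC_NCM_NTB32_FORMAT) {
		/* ... switch back to NTB16 as in the patch ... */
	}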

Ben.

-- 
Ben Hutchings
Software Developer, Codethink Ltd.



Re: [patch net-next RFC v2 07/11] mlxsw: spectrum: Register KVD resources with devlink

2017-11-14 Thread David Ahern
On 11/14/17 9:18 AM, Jiri Pirko wrote:
> +static int
> +mlxsw_sp_resource_kvd_size_validate(struct devlink *devlink, u64 size,
> + struct list_head *resource_list,
> + struct netlink_ext_ack *extack)
> +{
> + struct mlxsw_core *mlxsw_core = devlink_priv(devlink);
> + u32 kvd_size, single_size, double_size, linear_size;
> + struct devlink_resource *resource;
> +
> + kvd_size = MLXSW_CORE_RES_GET(mlxsw_core, KVD_SIZE);
> + if (kvd_size != size) {
> + NL_SET_ERR_MSG(extack, MLXSW_SP_PREFIX "kvd size cannot be 
> chagned");
> + return -EINVAL;
> + }
> +
> + list_for_each_entry(resource, resource_list, list) {
> + switch (resource->id) {
> + case MLXSW_SP_RESOURCE_KVD_LINEAR:
> + linear_size = resource->size_new;
> + break;
> + case MLXSW_SP_RESOURCE_KVD_HASH_SINGLE:
> + single_size = resource->size_new;
> + break;
> + case MLXSW_SP_RESOURCE_KVD_HASH_DOUBLE:
> + double_size = resource->size_new;
> + break;
> + }
> + }
> +
> + /* Overlap is not supported */
> + if (linear_size + single_size + double_size > kvd_size) {
> + NL_SET_ERR_MSG(extack, MLXSW_SP_PREFIX "Overlap is not 
> supported");
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +

gcc warnings due to the above:

/home/dsa/kernel-3.git/drivers/net/ethernet/mellanox/mlxsw/spectrum.c:
In function ‘mlxsw_sp_resource_kvd_size_validate’:
/home/dsa/kernel-3.git/drivers/net/ethernet/mellanox/mlxsw/spectrum.c:3974:32:
warning: ‘linear_size’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
  if (linear_size + single_size + double_size > kvd_size) {
^
/home/dsa/kernel-3.git/drivers/net/ethernet/mellanox/mlxsw/spectrum.c:3950:29:
warning: ‘double_size’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
  u32 kvd_size, single_size, double_size, linear_size;
 ^
/home/dsa/kernel-3.git/drivers/net/ethernet/mellanox/mlxsw/spectrum.c:3950:16:
warning: ‘single_size’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
  u32 kvd_size, single_size, double_size, linear_size;
^
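
The minimal fix is presumably to initialize the three sizes, e.g. (a sketch;
alternatively the function could insist that all three resources are present
in the list):

	u32 kvd_size, single_size = 0, double_size = 0, linear_size = 0;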


Re: [RFC PATCH 03/17] net: sched: remove remaining uses for qdisc_qlen in xmit path

2017-11-14 Thread Willem de Bruijn
On Mon, Nov 13, 2017 at 3:08 PM, John Fastabend
 wrote:
> sch_direct_xmit() uses qdisc_qlen as a return value but all call sites
> of the routine only check if it is zero or not. Simplify the logic so
> that we don't need to return an actual queue length value.
>
> This introduces a case where sch_direct_xmit() would previously have
> returned a qlen of zero but now returns true.

You mean the first case, when the likely(skb) branch failed?
Can that return false, then?

> However in this case all
> call sites of sch_direct_xmit will implement a dequeue() and get
> a null skb and abort. This trades tracking qlen in the hotpath for
> an extra dequeue operation. Overall this seems to be good for
> performance.
>
> Signed-off-by: John Fastabend 

> @@ -198,18 +198,19 @@ int sch_direct_xmit(struct sk_buff *skb, struct Qdisc 
> *q,
>
> if (dev_xmit_complete(ret)) {
> /* Driver sent out skb successfully or skb was consumed */
> -   ret = qdisc_qlen(q);
> +   ret = true;
> } else {
> /* Driver returned NETDEV_TX_BUSY - requeue skb */
> if (unlikely(ret != NETDEV_TX_BUSY))
> net_warn_ratelimited("BUG %s code %d qlen %d\n",
>  dev->name, ret, q->q.qlen);
>
> -   ret = dev_requeue_skb(skb, q);
> +   dev_requeue_skb(skb, q);
> +   ret = false;
> }
>
> if (ret && netif_xmit_frozen_or_stopped(txq))
> -   ret = 0;
> +   ret = false;
>
> return ret;
>  }


This can now become

   if (!dev_xmit_complete(ret)) {
  if (unlikely(ret != NETDEV_TX_BUSY))
...
  dev_requeue_skb(skb, q);
  return false;
  }

  if (netif_xmit_frozen_or_stopped(txq))
return false;

  return true;


Re: SRIOV switchdev mode BoF minutes

2017-11-14 Thread Jakub Kicinski
On Tue, 14 Nov 2017 15:05:08 -0800, Alexander Duyck wrote:
> >> We basically need to do some feasibility research to see if we can
> >> actually meet all the requirements for switchdev on i40e. We have been
> >> getting mixed messages where we are given a great many "yes, but" type
> >> answers. For i40e we are looking into it but I don't have high
> >> confidence in our ability to actually support it in hardware/firmware.
> >> If it were as easy as you have been led to believe, we would have done
> >> it months ago when we were researching the requirements to support 
> >> switchdev  
> >
> > wait, Sridhar made seven rounds of his submission (this is the v7
> > pointer [1]) and you
> > still don't know if what you were attempting to push upstream can
> > work, something is
> > weird here, can you clarify? Jeff?  
> 
> Not weird so much as stubborn. The patches were being pushed based on
> the assumption that the community would accept a NIC generating port
> representors that didn't necessarily pass traffic, and then even when
> we had them passing traffic the PF still wasn't configured to handle
> being the default destination for traffic without any rules
> associated, instead VFs would directly send to the outside world.

Perhaps the way forward is to lift the requirement on passing traffic,
as long as the limitation is clearly expressed to the users.


Re: [RFC PATCH 15/17] net: sched: pfifo_fast use skb_array

2017-11-14 Thread Willem de Bruijn
>  static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
>  {
> struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
> -   int band = bitmap2band[priv->bitmap];
> +   struct sk_buff *skb = NULL;
> +   int band;
>
> -   if (band >= 0) {
> -   struct qdisc_skb_head *qh = band2list(priv, band);
> +   for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
> +   struct skb_array *q = band2list(priv, band);
>
> -   return qh->head;
> +   skb = __skb_array_peek(q);
> +   if (!skb)
> +   continue;
> }

This explicit continue is not needed.
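
i.e. the loop can collapse to something like:

	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++)
		skb = __skb_array_peek(band2list(priv, band));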

>  static void pfifo_fast_reset(struct Qdisc *qdisc)
>  {
> -   int prio;
> +   int i, band;
> struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>
> -   for (prio = 0; prio < PFIFO_FAST_BANDS; prio++)
> -   __qdisc_reset_queue(band2list(priv, prio));
> +   for (band = 0; band < PFIFO_FAST_BANDS; band++) {
> +   struct skb_array *q = band2list(priv, band);
> +   struct sk_buff *skb;
>
> -   priv->bitmap = 0;
> -   qdisc->qstats.backlog = 0;
> -   qdisc->q.qlen = 0;
> +   while ((skb = skb_array_consume_bh(q)) != NULL)
> +   __skb_array_destroy_skb(skb);

Can use regular kfree_skb after dequeue. This skb_array specific
callback probably only exists to match the ptr_ring callback typedef.
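
i.e. (sketch):

	while ((skb = skb_array_consume_bh(q)) != NULL)
		kfree_skb(skb);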

>  static int pfifo_fast_dump(struct Qdisc *qdisc, struct sk_buff *skb)
> @@ -685,17 +688,40 @@ static int pfifo_fast_dump(struct Qdisc *qdisc, struct 
> sk_buff *skb)
>
>  static int pfifo_fast_init(struct Qdisc *qdisc, struct nlattr *opt)
>  {
> -   int prio;
> +   unsigned int qlen = qdisc_dev(qdisc)->tx_queue_len;
> struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
> +   int prio;
> +
> +   /* guard against zero length rings */
> +   if (!qlen)
> +   return -EINVAL;
>
> -   for (prio = 0; prio < PFIFO_FAST_BANDS; prio++)
> -   qdisc_skb_head_init(band2list(priv, prio));
> +   for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
> +   struct skb_array *q = band2list(priv, prio);
> +   int err;
> +
> +   err = skb_array_init(q, qlen, GFP_KERNEL);
> +   if (err)
> +   return -ENOMEM;

This relies on the caller calling ops->destroy on error to free partially
allocated state. For uninitialized bands, that calls spin_lock on an
uninitialized spinlock from skb_array_cleanup -> ptr_ring_cleanup ->
ptr_ring_consume.

> +   }
>
> /* Can by-pass the queue discipline */
> qdisc->flags |= TCQ_F_CAN_BYPASS;
> return 0;
>  }
>
> +static void pfifo_fast_destroy(struct Qdisc *sch)
> +{
> +   struct pfifo_fast_priv *priv = qdisc_priv(sch);
> +   int prio;
> +
> +   for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
> +   struct skb_array *q = band2list(priv, prio);
> +
> +   skb_array_cleanup(q);
> +   }
> +}
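
One way around that (a sketch, relying on the qdisc priv area being zeroed
at allocation time, so a band whose skb_array_init() never ran still has a
NULL ring):

	static void pfifo_fast_destroy(struct Qdisc *sch)
	{
		struct pfifo_fast_priv *priv = qdisc_priv(sch);
		int prio;

		for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
			struct skb_array *q = band2list(priv, prio);

			/* NULL ring: init never ran (or failed) for this
			 * band, so the embedded spinlock is uninitialized
			 * and there is nothing to clean up */
			if (!q->ring.queue)
				continue;
			skb_array_cleanup(q);
		}
	}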


Re: SRIOV switchdev mode BoF minutes

2017-11-14 Thread Jakub Kicinski
On Tue, 14 Nov 2017 12:00:32 -0800, Alexander Duyck wrote:
> On Tue, Nov 14, 2017 at 8:44 AM, Or Gerlitz wrote:
> > On Mon, Nov 13, 2017 at 7:10 PM, Alexander Duyck wrote:  
> >> On Sun, Nov 12, 2017 at 10:16 PM, Or Gerlitz wrote:  
> > Lets focus on this point for a moment before discussing the other points
> > you raised.
>
> When SR-IOV was introduced there were two available modes, Virtual
> Ethernet Port Aggregation, aka VEPA, and Virtual Ethernet Bridging,
> aka VEB. The fact is SwitchDev is designed specifically for networking
> SR-IOV with Virtual Ethernet Bridging, aka VEB. You argue that the
> legacy model is bad, but I would argue that is because the legacy
> model was really designed to work more for both VEPA than with VEB,
> whereas SwitchDev only focuses on VEB. If you take a look in the ixgbe
> or i40e drivers you will see that we support configuring both of those
> modes via ndo_bridge_setlink since we have customer install bases that
> actually prefer VEPA over VEB as they prefer to have their traffic
> centrally managed instead of having the local host managing the
> traffic. We cannot just arbitrarily tell our customers they are doing
> SR-IOV using the "wrong model".

Maybe that's an obvious statement, but perhaps the real problem we are
grappling with here is that VEPA doesn't really exist as forwarding
model outside of SR-IOV NICs.  So we have no software construct that
cleanly maps onto it for offload.

> I would rather not have SwitchDev become the next SystemD. The type
> argument you are making is basically dictating to us and our customers
> how things are supposed to work based on your view things. We have
> different hardware, different customers, and all of our needs aren't
> necessarily met by SwitchDev. I would agree that SwitchDev is the
> go-to solution for VEB configuration, and we do plan to have future
> hardware support it. In addition I would argue that for the sake of
> consistency we should make sure that any feature that gets added to
> the legacy has to be supported by the SwitchDev model as well before
> it could be supported. If anything my hope is to evolve the legacy
> model to have much of the same look and feel as SwitchDev, but that
> will take time and require changes to the legacy model.

To me the whole point of switchdev is to reuse existing ABIs, TC, FDB,
bridging etc.  We are arguing to stop adding special SR-IOV features,
if the general direction of things is to just reflect configuration
done with SW ABIs to the hardware.  I think saying we need feature
parity between the models is missing this crucial point.

Also, the more ways there are to configure a single thing, the more confusing
it will be to the users.


Re: Per-CPU Queueing for QoS

2017-11-14 Thread Michael Ma
2017-11-13 20:53 GMT-08:00 Tom Herbert :
> On Mon, Nov 13, 2017 at 7:45 PM, John Fastabend
>  wrote:
>> On 11/13/2017 06:18 PM, Michael Ma wrote:
>>> 2017-11-13 16:13 GMT-08:00 Alexander Duyck :
 On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet  
 wrote:
> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma  wrote:
>>> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger 
>>> :
 On Sun, 12 Nov 2017 13:43:13 -0800
 Michael Ma  wrote:

> Any comments? We plan to implement this as a qdisc and appreciate any 
> early feedback.
>
> Thanks,
> Michael
>
>> On Nov 9, 2017, at 5:20 PM, Michael Ma  wrote:
>>
>> Currently txq/qdisc selection is based on flow hash so packets from
>> the same flow will follow the order when they enter qdisc/txq, which
>> avoids out-of-order problem.
>>
>> To improve the concurrency of QoS algorithm we plan to have multiple
>> per-cpu queues for a single TC class and do busy polling from a
>> per-class thread to drain these queues. If we can do this frequently
>> enough the out-of-order situation in this polling thread should not 
>> be
>> that bad.
>>
>> To give more details - in the send path we introduce per-cpu 
>> per-class
>> queues so that packets from the same class and same core will be
>> enqueued to the same place. Then a per-class thread poll the queues
>> belonging to its class from all the cpus and aggregate them into
>> another per-class queue. This can effectively reduce contention but
>> inevitably introduces potential out-of-order issue.
>>
>> Any concern/suggestion for working towards this direction?

 In general, there are no meta design discussions in Linux development.
 Several developers have tried to do lockless
 qdisc and similar things in the past.

 The devil is in the details, show us the code.
>>>
>>> Thanks for the response, Stephen. The code is fairly straightforward,
>>> we have a per-cpu per-class queue defined as this:
>>>
>>> struct bandwidth_group
>>> {
>>> struct skb_list queues[MAX_CPU_COUNT];
>>> struct skb_list drain;
>>> }
>>>
>>> "drain" queue is used to aggregate per-cpu queues belonging to the
>>> same class. In the enqueue function, we determine the cpu where the
>>> packet is processed and enqueue it to the corresponding per-cpu queue:
>>>
>>> int cpu;
>>> struct bandwidth_group *bwg = _rx_groups[bwgid];
>>>
>>> cpu = get_cpu();
>>> skb_list_append(&bwg->queues[cpu], skb);
>>>
>>> Here we don't check the flow of the packet so if there is task
>>> migration or multiple threads sending packets through the same flow we
>>> theoretically can have packets enqueued to different queues and
>>> aggregated to the "drain" queue out of order.
>>>
>>> Also AFAIK there is no lockless htb-like qdisc implementation
>>> currently, however if there is already similar effort ongoing please
>>> let me know.
>>
>> The question I would have is how would this differ from using XPS w/
>> mqprio? Would this be a classful qdisc like HTB or a classless one
>> like mqprio?
>>
>> From what I can tell XPS would be able to get you your per-cpu
>> functionality, the benefit of it though would be that it would avoid
>> out-of-order issues for sockets originating on the local system. The
>> only thing I see as an issue right now is that the rate limiting with
>> mqprio is assumed to be handled via hardware due to mechanisms such as
>> DCB.
>
> I think one of the key point was in : " do busy polling from a per-class
> thread to drain these queues."
>
> I mentioned this idea in TX path of :
>
> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf

 I think this is a bit different from that idea in that the busy
 polling is transferring packets from a per-cpu qdisc to a per traffic
 class queueing discipline. Basically it would be a busy_poll version
 of qdisc_run that would be transferring packets from one qdisc onto
 another instead of attempting to transmit them directly.
>>>
>>> The idea is to have the whole part implemented as one classful qdisc
>>> and remove the qdisc root lock since all the synchronization will be
>>> handled internally (let's put aside that other filters/actions/qdiscs
>>> might still require a root lock for now).

 What I think is tripping me up is that I don't think this is 

Re: Per-CPU Queueing for QoS

2017-11-14 Thread Michael Ma
2017-11-13 19:45 GMT-08:00 John Fastabend :
> On 11/13/2017 06:18 PM, Michael Ma wrote:
>> 2017-11-13 16:13 GMT-08:00 Alexander Duyck :
>>> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet  
>>> wrote:
 On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma  wrote:
>> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger 
>> :
>>> On Sun, 12 Nov 2017 13:43:13 -0800
>>> Michael Ma  wrote:
>>>
 Any comments? We plan to implement this as a qdisc and appreciate any 
 early feedback.

 Thanks,
 Michael

> On Nov 9, 2017, at 5:20 PM, Michael Ma  wrote:
>
> Currently txq/qdisc selection is based on flow hash so packets from
> the same flow will follow the order when they enter qdisc/txq, which
> avoids out-of-order problem.
>
> To improve the concurrency of QoS algorithm we plan to have multiple
> per-cpu queues for a single TC class and do busy polling from a
> per-class thread to drain these queues. If we can do this frequently
> enough the out-of-order situation in this polling thread should not be
> that bad.
>
> To give more details - in the send path we introduce per-cpu per-class
> queues so that packets from the same class and same core will be
> enqueued to the same place. Then a per-class thread poll the queues
> belonging to its class from all the cpus and aggregate them into
> another per-class queue. This can effectively reduce contention but
> inevitably introduces potential out-of-order issue.
>
> Any concern/suggestion for working towards this direction?
>>>
>>> In general, there are no meta design discussions in Linux development.
>>> Several developers have tried to do lockless
>>> qdisc and similar things in the past.
>>>
>>> The devil is in the details, show us the code.
>>
>> Thanks for the response, Stephen. The code is fairly straightforward,
>> we have a per-cpu per-class queue defined as this:
>>
>> struct bandwidth_group
>> {
>> struct skb_list queues[MAX_CPU_COUNT];
>> struct skb_list drain;
>> }
>>
>> "drain" queue is used to aggregate per-cpu queues belonging to the
>> same class. In the enqueue function, we determine the cpu where the
>> packet is processed and enqueue it to the corresponding per-cpu queue:
>>
>> int cpu;
>> struct bandwidth_group *bwg = _rx_groups[bwgid];
>>
>> cpu = get_cpu();
>> skb_list_append(&bwg->queues[cpu], skb);
>>
>> Here we don't check the flow of the packet so if there is task
>> migration or multiple threads sending packets through the same flow we
>> theoretically can have packets enqueued to different queues and
>> aggregated to the "drain" queue out of order.
>>
>> Also AFAIK there is no lockless htb-like qdisc implementation
>> currently, however if there is already similar effort ongoing please
>> let me know.
>
> The question I would have is how would this differ from using XPS w/
> mqprio? Would this be a classful qdisc like HTB or a classless one
> like mqprio?
>
> From what I can tell XPS would be able to get you your per-cpu
> functionality, the benefit of it though would be that it would avoid
> out-of-order issues for sockets originating on the local system. The
> only thing I see as an issue right now is that the rate limiting with
> mqprio is assumed to be handled via hardware due to mechanisms such as
> DCB.

 I think one of the key point was in : " do busy polling from a per-class
 thread to drain these queues."

 I mentioned this idea in TX path of :

 https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>>
>>> I think this is a bit different from that idea in that the busy
>>> polling is transferring packets from a per-cpu qdisc to a per traffic
>>> class queueing discipline. Basically it would be a busy_poll version
>>> of qdisc_run that would be transferring packets from one qdisc onto
>>> another instead of attempting to transmit them directly.
>>
>> The idea is to have the whole part implemented as one classful qdisc
>> and remove the qdisc root lock since all the synchronization will be
>> handled internally (let's put aside that other filters/actions/qdiscs
>> might still require a root lock for now).
>>>
>>> What I think is tripping me up is that I don't think this is even
>>> meant to work with a multiqueue device. The description seems more
>>> like a mqprio implementation feeding into a prio qdisc which is then
>>> used for dequeue. To me it seems like this solution 

Re: SRIOV switchdev mode BoF minutes

2017-11-14 Thread Alexander Duyck
On Tue, Nov 14, 2017 at 1:50 PM, Or Gerlitz  wrote:
> On Tue, Nov 14, 2017 at 10:00 PM, Alexander Duyck
>  wrote:
>> On Tue, Nov 14, 2017 at 8:44 AM, Or Gerlitz  wrote:
>>> On Mon, Nov 13, 2017 at 7:10 PM, Alexander Duyck
>>>  wrote:
 On Sun, Nov 12, 2017 at 10:16 PM, Or Gerlitz  wrote:
> On Sun, Nov 12, 2017 at 10:38 PM, Alexander Duyck
>>>
> The what we call slow path requirements are the following:
>
> 1. xmit on VF rep always turns to a receive on the VF, regardless of
> the offloaded SW steering rules ("send-to-vport")
>
> 2. xmit on VF which doesn't meet any offloaded SW steering rules must
> be received into the host OS from the VF rep
>>>
> 1,2 above must hold also for the uplink and the PF reps
>>>
 I am well aware of the requirements. We discussed these with Jiri at
 the previous netdev.
>>>
> When the i40e limitation was described to @ netdev, it seems you have a 
> problem
> with VF xmit that should be turned to be a recv on the VF rep but also
> goes to the wire.
>>>
> It smells as if a FW patch can solve that, isn't it?
>>>
 That is a huge maybe. We looked into it last time and while we can
 meet requirements 1 and 2 we do so with a heavy performance penalty
 due to the fact that we don't support anywhere near the same number of
 flows as a true switch. Also while that might work for i40e
>>>
>>> to recap on i40e, you can support the slow path requirements, but  you have 
>>> an
>>> issue with the fast path (== offloaded flows)? what is the issue there?
>>
>> We basically need to do some feasibility research to see if we can
>> actually meet all the requirements for switchdev on i40e. We have been
>> getting mixed messages where we are given a great many "yes, but" type
>> answers. For i40e we are looking into it but I don't have high
>> confidence in our ability to actually support it in hardware/firmware.
>> If it were as easy as you have been led to believe, we would have done
>> it months ago when we were researching the requirements to support switchdev
>
> wait, Sridhar made seven rounds of his submission (this is the v7
> pointer [1]) and you
> still don't know if what you were attempting to push upstream can
> work, something is
> weird here, can you clarify? Jeff?

Not weird so much as stubborn. The patches were being pushed based on
the assumption that the community would accept a NIC generating port
representors that didn't necessarily pass traffic, and then even when
we had them passing traffic the PF still wasn't configured to handle
being the default destination for traffic without any rules
associated, instead VFs would directly send to the outside world.

> Sridhar, maybe you can explain if/what wrong assumptions you had in your code
> and what you think is the gap to address them and come up with proper
> impl for i40e?
>
> [1] https://marc.info/?l=linux-netdev=149083338400922=2

For starters the firmware change you are talking about didn't exist
during this time frame. We can ignore those patches as they assumed
that port representors didn't necessarily have to pass traffic.

>> In addition i40e isn't really my concern. I am much more
>> concerned about ixgbe as it has a much larger install base and many
>> more customers that are still buying it today.
>>
 we still have a much larger install base of ixgbe ports that we have to 
 support.
>>>
>>> ok, but support is one thing and keeping on enhancing a ten-year-old wrong
>>> SW model is a 2nd thing
>>
>> The model might be 10 years old, but as I said we are still shipping
>> new silicon that was released just over a year ago that is supported
>> by the ixgbe driver.
>>
>> Also I don't know if the term "enhancing" is the right word for what I
>> am thinking. I'm not talking about adding new drivers that only
>> support legacy mode.  We are looking at probably having to refactor
>> the whole concept of "trusted" VF in order to break it out into
>> smaller buckets. In addition I plan to come up with a source mode
>> macvlan based "port representor" for legacy SR-IOV and hope to be able
>> to use that to start working on a better path for SR-IOV live
>> migration.
>>
>> Fundamentally the problem I have with us saying we cannot extend
>> legacy mode SR-IOV is that 82599 is a very large piece of the existing
>> install base for 10Gbit in general. We have it shipping on brand new
>> platforms as the silicon that is installed on the motherboard. With
>> that being the case people are going to want to get the most value
>> they can out of the silicon that they purchased since in many cases it
>> is just a standard part of the platform.
>
> Getting the most value still doesn't mean you should approach the community
> and ask to keep enhancing a wrong SW model for a switch.
>
> For example, suppose a single new bit module param to 

Re: Per-CPU Queueing for QoS

2017-11-14 Thread Michael Ma
2017-11-13 18:18 GMT-08:00 Michael Ma :
> 2017-11-13 16:13 GMT-08:00 Alexander Duyck :
>> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet  wrote:
>>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
 On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma  wrote:
 > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger 
 > :
 >> On Sun, 12 Nov 2017 13:43:13 -0800
 >> Michael Ma  wrote:
 >>
 >>> Any comments? We plan to implement this as a qdisc and appreciate any 
 >>> early feedback.
 >>>
 >>> Thanks,
 >>> Michael
 >>>
 >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma  wrote:
 >>> >
 >>> > Currently txq/qdisc selection is based on flow hash so packets from
 >>> > the same flow will follow the order when they enter qdisc/txq, which
 >>> > avoids out-of-order problem.
 >>> >
 >>> > To improve the concurrency of QoS algorithm we plan to have multiple
 >>> > per-cpu queues for a single TC class and do busy polling from a
 >>> > per-class thread to drain these queues. If we can do this frequently
 >>> > enough the out-of-order situation in this polling thread should not 
 >>> > be
 >>> > that bad.
 >>> >
 >>> > To give more details - in the send path we introduce per-cpu 
 >>> > per-class
 >>> > queues so that packets from the same class and same core will be
 >>> > enqueued to the same place. Then a per-class thread poll the queues
 >>> > belonging to its class from all the cpus and aggregate them into
 >>> > another per-class queue. This can effectively reduce contention but
 >>> > inevitably introduces potential out-of-order issue.
 >>> >
 >>> > Any concern/suggestion for working towards this direction?
 >>
 >> In general, there are no meta design discussions in Linux development.
 >> Several developers have tried to do lockless
 >> qdisc and similar things in the past.
 >>
 >> The devil is in the details, show us the code.
 >
 > Thanks for the response, Stephen. The code is fairly straightforward,
 > we have a per-cpu per-class queue defined as this:
 >
 > struct bandwidth_group
 > {
 > struct skb_list queues[MAX_CPU_COUNT];
 > struct skb_list drain;
 > }
 >
 > "drain" queue is used to aggregate per-cpu queues belonging to the
 > same class. In the enqueue function, we determine the cpu where the
 > packet is processed and enqueue it to the corresponding per-cpu queue:
 >
 > int cpu;
 > struct bandwidth_group *bwg = _rx_groups[bwgid];
 >
 > cpu = get_cpu();
 > skb_list_append(&bwg->queues[cpu], skb);
 >
 > Here we don't check the flow of the packet so if there is task
 > migration or multiple threads sending packets through the same flow we
 > theoretically can have packets enqueued to different queues and
 > aggregated to the "drain" queue out of order.
 >
 > Also AFAIK there is no lockless htb-like qdisc implementation
 > currently, however if there is already similar effort ongoing please
 > let me know.

 The question I would have is how would this differ from using XPS w/
 mqprio? Would this be a classful qdisc like HTB or a classless one
 like mqprio?

 From what I can tell XPS would be able to get you your per-cpu
 functionality, the benefit of it though would be that it would avoid
 out-of-order issues for sockets originating on the local system. The
 only thing I see as an issue right now is that the rate limiting with
 mqprio is assumed to be handled via hardware due to mechanisms such as
 DCB.
>>>
>>> I think one of the key point was in : " do busy polling from a per-class
>>> thread to drain these queues."
>>>
>>> I mentioned this idea in TX path of :
>>>
>>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>
>> I think this is a bit different from that idea in that the busy
>> polling is transferring packets from a per-cpu qdisc to a per traffic
>> class queueing discipline. Basically it would be a busy_poll version
>> of qdisc_run that would be transferring packets from one qdisc onto
>> another instead of attempting to transmit them directly.
>
> The idea is to have the whole part implemented as one classful qdisc
> and remove the qdisc root lock since all the synchronization will be
> handled internally (let's put aside that other filters/actions/qdiscs
> might still require a root lock for now).
>>
>> What I think is tripping me up is that I don't think this is even
>> meant to work with a multiqueue device. The description seems more
>> like a mqprio implementation feeding into a prio qdisc which is then
>> used for dequeue. To me it seems like this solution would be pulling
>> complexity off of the enqueue 

Re: len = bpf_probe_read_str(); bpf_perf_event_output(... len) == FAIL

2017-11-14 Thread Yonghong Song



On 11/14/17 12:25 PM, Daniel Borkmann wrote:

On 11/14/2017 07:15 PM, Yonghong Song wrote:

On 11/14/17 6:19 AM, Daniel Borkmann wrote:

On 11/14/2017 02:42 PM, Arnaldo Carvalho de Melo wrote:

On Tue, Nov 14, 2017 at 02:09:34PM +0100, Daniel Borkmann wrote:

On 11/14/2017 01:58 PM, Arnaldo Carvalho de Melo wrote:

On Tue, Nov 14, 2017 at 01:09:39AM +0100, Daniel Borkmann wrote:

On 11/13/2017 04:08 PM, Arnaldo Carvalho de Melo wrote:

libbpf: -- BEGIN DUMP LOG ---
libbpf:
0: (79) r3 = *(u64 *)(r1 +104)
1: (b7) r2 = 0
2: (bf) r6 = r1
3: (bf) r1 = r10
4: (07) r1 += -128
5: (b7) r2 = 128
6: (85) call bpf_probe_read_str#45
7: (bf) r1 = r0
8: (07) r1 += -1
9: (67) r1 <<= 32
10: (77) r1 >>= 32
11: (25) if r1 > 0x7f goto pc+11


Right, so the compiler is optimizing the two tests into a single one above,
which means the lower bound cannot properly be derived again by the verifier,
and thus you'll get the error. A similar issue was seen recently [1].

Does the below hack work for you?

int prog([...])
{
  char filename[128];
  int ret = bpf_probe_read_str(filename, sizeof(filename), 
filename_ptr);
  if (ret > 0)
  bpf_perf_event_output(ctx, &__bpf_stdout__, 
BPF_F_CURRENT_CPU, filename,
    ret & (sizeof(filename) - 1));
  return 1;
}

r0 should keep on tracking bounds here at least:

prog:
     0:    bf 16 00 00 00 00 00 00 r6 = r1
     1:    bf a1 00 00 00 00 00 00 r1 = r10
     2:    07 01 00 00 80 ff ff ff r1 += -128
     3:    b7 02 00 00 80 00 00 00 r2 = 128
     4:    85 00 00 00 2d 00 00 00 call 45
     5:    67 00 00 00 20 00 00 00 r0 <<= 32
     6:    c7 00 00 00 20 00 00 00 r0 s>>= 32
     7:    b7 01 00 00 01 00 00 00 r1 = 1
     8:    6d 01 0a 00 00 00 00 00 if r1 s> r0 goto 10
     9:    57 00 00 00 7f 00 00 00 r0 &= 127
    10:    bf a4 00 00 00 00 00 00 r4 = r10
    11:    07 04 00 00 80 ff ff ff r4 += -128
    12:    bf 61 00 00 00 00 00 00 r1 = r6
    13:    18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0ll
    15:    18 03 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 r3 = 
4294967295ll
    17:    bf 05 00 00 00 00 00 00 r5 = r0
    18:    85 00 00 00 19 00 00 00 call 25

    [1] http://patchwork.ozlabs.org/project/netdev/list/?series=13211


Not yet:

6: (85) call bpf_probe_read_str#45
7: (bf) r1 = r0
8: (67) r1 <<= 32
9: (77) r1 >>= 32
10: (15) if r1 == 0x0 goto pc+10
   R0=inv(id=0) R1=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) 
R6=ctx(id=0,off=0,imm=0) R10=fp0
11: (57) r0 &= 127
12: (bf) r4 = r10
13: (07) r4 += -128
14: (bf) r1 = r6
15: (18) r2 = 0x92bfc2aba840u
17: (18) r3 = 0xffffffff
19: (bf) r5 = r0
20: (85) call bpf_perf_event_output#25
invalid stack type R4 off=-128 access_size=0

I'll try updating clang/llvm...

Full details:

[root@jouet bpf]# cat open.c
#include "bpf.h"

SEC("prog=do_sys_open filename")
int prog(void *ctx, int err, const char __user *filename_ptr)
{
 char filename[128];
 const unsigned len = bpf_probe_read_str(filename, sizeof(filename), 
filename_ptr);


Btw, I was using 'int' here above instead of 'unsigned' as strncpy_from_unsafe()
could potentially return errors like -EFAULT.


I changed to int, didn't help
  

Currently having a version compiled from the git tree:

# llc --version
LLVM (http://llvm.org/):
    LLVM version 6.0.0git-2d810c2
    Optimized build.
    Default target: x86_64-unknown-linux-gnu
    Host CPU: skylake


[root@jouet bpf]# llc --version
LLVM (http://llvm.org/):
    LLVM version 4.0.0svn

Old stuff! ;-) Will change, but improving these messages should be on
the radar, I think :-)


Yep, agree, I think we need a generic, better solution for this type of
issue instead of converting individual helpers to handle 0 min bound and
then only bailing out in such case; need to brainstorm a bit on that.

I think for the above in your case ...

   [...]
    6: (85) call bpf_probe_read_str#45
    7: (bf) r1 = r0
    8: (67) r1 <<= 32
    9: (77) r1 >>= 32
   10: (15) if r1 == 0x0 goto pc+10
    R0=inv(id=0) R1=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) 
R6=ctx(id=0,off=0,imm=0) R10=fp0
   11: (57) r0 &= 127
   [...]

... the shifts on r1 might be due to using 32 bit type, so if you find
a way to avoid these and have the 

Re: Microchip KSZ* DSA drivers Re: [PATCH v1 RFC 1/1] Add Microchip KSZ8795 DSA driver

2017-11-14 Thread Pavel Machek
On Tue 2017-11-14 01:49:47, tristram...@microchip.com wrote:
> > Subject: Microchip KSZ* DSA drivers Re: [PATCH v1 RFC 1/1] Add Microchip
> > KSZ8795 DSA driver
> > 
> > Hi!
> > 
> > Are there any news here? Is there new release planned? Is there a git
> > tree somewhere? I probably should get it working, soon.. so I guess I
> > can help with testing.
> > 
> 
> Reviewed patches will be submitted formally to net-next within this week.
> 
> BTW, how is the KSZ8895 driver working for you?  Are there some issues that
> prevent you from using it?

I tried patching it, but it seems I did not have all the necessary
pieces, so I did not get it to build. KSZ8895 seemed to depend on all
the other drivers and I probably missed a patch somewhere.

I have these in my tree:

commit 6262f8173a21eec96f6b600adf5b9c7131e2c8d5
Author: Pavel Machek 
Date:   Sun Oct 22 12:46:03 2017 +0200

From: 
Subject: [PATCH RFC] Add Microchip KSZ8895 DSA driver

From: Tristram Ha 

Add Microchip KSZ8895 DSA driver.

Signed-off-by: Tristram Ha 

commit 4ea582e4b168b6413248ff587604c9fea6dc4f52
Author: Tristram Ha 
Date:   Fri Oct 6 13:33:29 2017 -0700

Add other KSZ switch support so that patch check does not complain

Add other KSZ switch support so that patch check does not complain.

Signed-off-by: Tristram Ha 

commit 75d9e1a610590cb1f94b7ede3635360026fc4e57
Author: Tristram Ha 
Date:   Fri Oct 6 13:33:05 2017 -0700

Modify tag_ksz.c so that tail tag code can be used by other KSZ switch 
drivers

Modify tag_ksz.c so that tail tag code can be used by other KSZ switch
drivers.

Signed-off-by: Tristram Ha 

commit 62efdaf5db7987e313846b82eef642404978cda7
Author: Tristram Ha 
Date:   Fri Oct 6 13:33:04 2017 -0700

Add MIB counter reading support

Add MIB counter reading support.
Rename ksz_9477_reg.h to ksz9477_reg.h for consistency as the product
name is always KSZ.
Header file ksz_priv.h no longer contains any chip specific data.

Signed-off-by: Tristram Ha 

commit 5b29601cfa2c3032af560ae7ab10f55974e493fe
Author: Tristram Ha 
Date:   Fri Oct 6 13:33:03 2017 -0700

Break KSZ9477 DSA driver into two files

Break KSZ9477 DSA driver into two files in preparation to add more KSZ
switch drivers.
Add common functions in ksz_common.h so that other KSZ switch drivers
can access code in ksz_common.c.
Add ksz_spi.h for common functions used by KSZ switch SPI drivers.

Signed-off-by: Tristram Ha 

commit 02cb0a65cfe8434f93d9e16b0c65c86e5fe06870
Author: Tristram Ha 
Date:   Fri Oct 6 13:33:02 2017 -0700

Rename ksz_spi.c to ksz9477_spi.c

Rename ksz_spi.c to ksz9477_spi.c and update Kconfig in preparation to add
more KSZ switch drivers.

Signed-off-by: Tristram Ha 

commit 0bc57b44d7d27fa7f7d682dded502d8682c6
Author: Tristram Ha 
Date:   Fri Oct 6 13:33:01 2017 -0700

Rename some functions with ksz9477 prefix

Rename some functions with ksz9477 prefix to separate chip specific code
from common code.

Signed-off-by: Tristram Ha 

commit 2d3f74e7dd923d5d0c9c2c01609297be1bc56b67
Author: Tristram Ha 
Date:   Fri Oct 6 13:33:00 2017 -0700

Clean up code according to patch check suggestions

Clean up code according to patch check suggestions.

Signed-off-by: Tristram Ha 

commit d0e348b7b7938be2e91997c1dcdea185c8166b00
Author: Tristram Ha 
Date:   Fri Oct 6 13:32:59 2017 -0700

Replace license with GPL

Replace license with GPL.

Signed-off-by: Tristram Ha 

commit d514cb5324feac3f6d8ede6649573f3c393f5e11
Author: Pavel Machek 
Date:   Sun Oct 22 12:33:21 2017 +0200

Config for v4.14-rc5.


...and I guess I miss some KSZ8795 stuff:

  AR  lib/built-in.o
  drivers/net/dsa/microchip/ksz8895.c: In function
  'ksz8895_get_tag_protocol':
  drivers/net/dsa/microchip/ksz8895.c:633:9: error:
  'DSA_TAG_PROTO_KSZ8795' undeclared (first use in this function)
  drivers/net/dsa/microchip/ksz8895.c:633:9: note: each undeclared
  identifier is reported only once for each function it appears in
  drivers/net/dsa/microchip/ksz8895.c:634:1: warning: control reaches
  end of non-void function [-Wreturn-type]


If you had a git tree, or simply one big patch I could apply.. that
would be welcome :-).

Best regards,
Pavel

-- 

Re: [net-next,1/3] netem: convert to qdisc_watchdog_schedule_ns

2017-11-14 Thread Stephen Hemminger
On Tue, 14 Nov 2017 21:11:13 +
James Hogan  wrote:

> On Tue, Nov 07, 2017 at 12:59:34PM -0800, Dave Taht wrote:
> > diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> > index db0228a..443a75d 100644
> > --- a/net/sched/sch_netem.c
> > +++ b/net/sched/sch_netem.c  
> 
> ...
> 
> > @@ -305,11 +305,11 @@ static bool loss_event(struct netem_sched_data *q)
> >   * std deviation sigma.  Uses table lookup to approximate the desired
> >   * distribution, and a uniformly-distributed pseudo-random source.
> >   */
> > -static psched_tdiff_t tabledist(psched_tdiff_t mu, psched_tdiff_t sigma,
> > -   struct crndstate *state,
> > -   const struct disttable *dist)
> > +static s64 tabledist(s64 mu, s64 sigma,  
> 
> sigma is used in a modulo operation in this function, which results in
> this error on a bunch of MIPS configs once it is made 64-bits wide:
> 
> net/sched/sch_netem.o: In function `tabledist':
> net/sched/sch_netem.c:330: undefined reference to `__moddi3'
> 
> Should that code not be using <linux/math64.h>, i.e. div_s64_rem(), now
> that it is 64bit?
> 
> Thanks
> James

Not really, since the random value is only 32 bit; the sigma value being 64 bit makes
no sense really.
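
For reference, the math64.h route would look something like this, where rnd
is tabledist()'s 32-bit random sample (a sketch assuming, per the above, that
sigma in practice fits in 32 bits so the divisor can be narrowed):

	s32 rem;

	/* div_s64_rem() avoids the libgcc __moddi3 helper that a raw
	 * 64-bit % generates on 32-bit targets such as MIPS */
	div_s64_rem(rnd, 2 * (s32)sigma, &rem);
	return rem - sigma + mu;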




Re: [PATCH][v3] uprobes/x86: emulate push insns for uprobe on x86

2017-11-14 Thread Yonghong Song



On 11/14/17 8:03 AM, Oleg Nesterov wrote:

On 11/14, Oleg Nesterov wrote:



+#ifdef CONFIG_X86_64
+   if (test_thread_flag(TIF_ADDR32))
+   return -ENOSYS;
+#endif


No, this doesn't look right, see my previous email. You should do this
check in the "if (insn->length == 2)" branch below; "push bp" should be
emulated correctly.

And test_thread_flag(TIF_ADDR32) is not right too. The caller is not
necessarily the probed task. See is_64bit_mm(mm) in arch_uprobe_analyze_insn().

And again... please check if uprobe_init_insn() fails or not in this case
(32bit task does, say, "push r8"). If it fails, your V2 should be fine.


As a reminder, uprobes && 32-bit is broken; let me quote another of my emails:

The 3rd bug means that you simply can't uprobe a 32bit task on a 64bit
system, the in_compat_syscall() logic in get_unmapped_area() looks very
wrong although I need to re-check.


Yes,


I didn't have time for this problem so far. But emulation should work, so
you can hopefully test your patch.


Ah, no, sizeof_long() is broken for the same reason, so you can't test it...


Right. I hacked emulate_push_stack() (original name: push_ret_address) 
with sizeof_long = 4, and a 32bit binary uprobe then works fine on the 
x86_64 platform... But that will involve a bigger change to propagate
the is_64bit_mm() along the call graph.



OK, I'll try to do something tomorrow, then we will see what we can do
with your patch...


Thanks for reviewing! I will wait for your further comments/direction
before next step.



But it would be nice if you can check what uprobe_init_insn() does in this
case, see above.


As mentioned in my previous email, for a 32bit application the
compiler won't generate "push %r8", as %r8 is only available on the
x86_64 platform. For a 32bit app, I see "push %bp" etc., which does not
have a rex_prefix. They cannot be emulated right now because 
emulate_push_stack() needs changes.




Oleg.



Re: [PATCH][v3] uprobes/x86: emulate push insns for uprobe on x86

2017-11-14 Thread Yonghong Song



On 11/14/17 7:51 AM, Oleg Nesterov wrote:

On 11/13, Yonghong Song wrote:


+static int push_setup_xol_ops(struct arch_uprobe *auprobe, struct insn *insn)
+{
+   u8 opc1 = OPCODE1(insn), reg_offset = 0;
+
+   if (opc1 < 0x50 || opc1 > 0x57)
+   return -ENOSYS;
+
+   if (insn->length > 2)
+   return -ENOSYS;
+#ifdef CONFIG_X86_64
+   if (test_thread_flag(TIF_ADDR32))
+   return -ENOSYS;
+#endif


No, this doesn't look right, see my previous email. You should do this
check in the "if (insn->length == 2)" branch below; "push bp" should be
emulated correctly.

And test_thread_flag(TIF_ADDR32) is not right too. The caller is not
necessarily the probed task. See is_64bit_mm(mm) in arch_uprobe_analyze_insn().


I printed out some statistics. On x86_64 platform, for 32bit 
application, test_thread_flag(TIF_ADDR32) returns true and 
is_64bit_mm(mm) returns false. For 64bit application,
test_thread_flag(TIF_ADDR32) returns false and is_64bit_mm(mm) returns 
true. So that is why my patch works fine.


I did not fully understand how to trigger "the caller is not necessarily 
the probed task." So in the next revision, I will use is_64bit_mm(mm) 
instead.
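
For example, per Oleg's earlier comment, the check might land like this
(a sketch; how mm gets plumbed down to this point is still open in the
thread):

	if (insn->length == 2) {
		/* a two-byte push here means a REX.B prefix, i.e.
		 * push %r8..%r15, which is invalid for a 32-bit task */
		if (!is_64bit_mm(mm))
			return -ENOSYS;
		/* ... decode reg_offset from the second opcode byte ... */
	}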




And again... please check if uprobe_init_insn() fails or not in this case
(32bit task does, say, "push r8"). If it fails, your V2 should be fine.


The compiler won't generate "push %r8" for a 32bit task since register 
%r8 is not available in 32bit instructions.





As a reminder, uprobes && 32-bit is broken; let me quote another of my emails:

The 3rd bug means that you simply can't uprobe a 32bit task on a 64bit
system, the in_compat_syscall() logic in get_unmapped_area() looks very
wrong although I need to re-check.

I didn't have time for this problem so far. But emulation should work, so
you can hopefully test your patch.

Oleg.



Re: [PATCH][v2] uprobes/x86: emulate push insns for uprobe on x86

2017-11-14 Thread Yonghong Song



On 11/14/17 7:34 AM, Oleg Nesterov wrote:

On 11/13, Yonghong Song wrote:


On 11/13/17 4:59 AM, Oleg Nesterov wrote:

+   switch (opc1) {
+   case 0x50:
+   reg_offset = offsetof(struct pt_regs, r8);
+   break;
+   case 0x51:
+   reg_offset = offsetof(struct pt_regs, r9);
+   break;
+   case 0x52:
+   reg_offset = offsetof(struct pt_regs, r10);
+   break;
+   case 0x53:
+   reg_offset = offsetof(struct pt_regs, r11);
+   break;
+   case 0x54:
+   reg_offset = offsetof(struct pt_regs, r12);
+   break;
+   case 0x55:
+   reg_offset = offsetof(struct pt_regs, r13);
+   break;
+   case 0x56:
+   reg_offset = offsetof(struct pt_regs, r14);
+   break;
+   case 0x57:
+   reg_offset = offsetof(struct pt_regs, r15);
+   break;
+   }
+#else
+   return -ENOSYS;
+#endif


OK, but shouldn't we also return ENOSYS if CONFIG_X86_64=y but the probed task 
is 32bit?


Just tested with a 32bit app on an x86 box and it segfaults.


Hmm. How did you verify this?


On an x86_32 box, I compiled the test case with static libraries 
(including static libc). I then ran this binary on x86_64 with
uprobe enabled. You will need to install the glibc-static package to make it 
work.




Your v3 doesn't look right and it seems you misunderstood me...


Yes, we would need to
return ENOSYS if the app is 32bit on 64bit system.


Only if insn->length == 2. "push bp" and other valid 32bit push'es should be
emulated correctly or your patch is wrong. Confused...

Or in this case uprobe_init_insn(x86_64 => false) should fail and 
push_setup_xol_ops()
won't be called?


So it doesn't fail?

Oleg.



Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-14 Thread Andrei Vagin
On Tue, Nov 14, 2017 at 10:00:59AM -0800, Eric Dumazet wrote:
> On Tue, 2017-11-14 at 09:44 -0800, Andrei Vagin wrote:
> > On Tue, Nov 14, 2017 at 04:53:33PM +0300, Kirill Tkhai wrote:
> > > Curently mutex is used to protect pernet operations list. It makes
> > > cleanup_net() to execute ->exit methods of the same operations set,
> > > which was used on the time of ->init, even after net namespace is
> > > unlinked from net_namespace_list.
> > > 
> > > But the problem is that we need to synchronize_rcu() after the net is
> > > removed from net_namespace_list():
> > > 
> > > Destroy net_ns:
> > > cleanup_net()
> > >   mutex_lock(&net_mutex)
> > >   list_del_rcu(&net->list)
> > >   synchronize_rcu()                  <--- Sleep there for ages
> > >   list_for_each_entry_reverse(ops, &pernet_list, list)
> > >     ops_exit_list(ops, &net_exit_list)
> > >   list_for_each_entry_reverse(ops, &pernet_list, list)
> > >     ops_free_list(ops, &net_exit_list)
> > >   mutex_unlock(&net_mutex)
> > > 
> > > This primitive is not fast, especially on systems with many processors
> > > and/or when preemptible RCU is enabled in the config. So, all the time
> > > cleanup_net() is waiting for the RCU grace period, creation of new net
> > > namespaces is not possible; the tasks that attempt it are sleeping on
> > > the same mutex:
> > > 
> > > Create net_ns:
> > > copy_net_ns()
> > >   mutex_lock_killable(&net_mutex)    <--- Sleep there for ages
> > > 
> > > The solution is to convert net_mutex to an rw_semaphore. Then,
> > > pernet_operations::init/::exit methods, modifying the net-related data,
> > > will require down_read() locking only, while down_write() will be used
> > > for changing pernet_list.
> > > 
> > > This gives a significant performance increase, as you may see below.
> > > Measured is sequential net namespace creation in a cycle, in a single
> > > thread, without other tasks (single user mode):
> > > 
> > > 1)int main(int argc, char *argv[])
> > > {
> > > unsigned nr;
> > > if (argc < 2) {
> > > fprintf(stderr, "Provide nr iterations arg\n");
> > > return 1;
> > > }
> > > nr = atoi(argv[1]);
> > > while (nr-- > 0) {
> > > if (unshare(CLONE_NEWNET)) {
> > > perror("Can't unshare");
> > > return 1;
> > > }
> > > }
> > > return 0;
> > > }
> > > 
> > > Origin, 10 unshare():
> > > 0.03user 23.14system 1:39.85elapsed 23%CPU
> > > 
> > > Patched, 10 unshare():
> > > 0.03user 67.49system 1:08.34elapsed 98%CPU
> > > 
> > > 2)for i in {1..1}; do unshare -n bash -c exit; done
> > 
> > Hi Kirill,
> > 
> > This mutex has another role. You know that net namespaces are destroyed
> > asynchronously, and the net mutex guarantees that the backlog will not
> > get big. If we have something in the backlog, we know that it will be
> > handled before creating a new net ns.
> > 
> > As far as I remember net namespaces are created much faster than
> > they are destroyed, so with these changes we can create a really big
> > backlog, can't we?
> 
> Please take a look at the recent patches I did :
> 
> 8ca712c373a462cfa1b62272870b6c2c74aa83f9 Merge branch 
> 'net-speedup-netns-create-delete-time'
> 64bc17811b72758753e2b64cd8f2a63812c61fe1 ipv4: speedup ipv6 tunnels dismantle
> bb401caefe9d2c65e0c0fa23b21deecfbfa473fe ipv6: speedup ipv6 tunnels dismantle
> 789e6ddb0b2fb5d5024b760b178a47876e4de7a6 tcp: batch tcp_net_metrics_exit
> a90c9347e90ed1e9323d71402ed18023bc910cd8 ipv6: addrlabel: per netns list
> d464e84eed02993d40ad55fdc19f4523e4deee5b kobject: factorize skb setup in 
> kobject_uevent_net_broadcast()
> 4a336a23d619e96aef37d4d054cfadcdd1b581ba kobject: copy env blob in one go
> 16dff336b33d87c15d9cbe933cfd275aae2a8251 kobject: add 
> kobject_uevent_net_broadcast()
> 

Good job! Now it really works much faster. I tested these patches with
Kirill's one and everything works well. I could not reproduce a
situation where the backlog starts growing.

Thanks Kirill and Eric.


Re: [net-next 00/10][pull request] 40GbE Intel Wired LAN Driver Updates 2017-11-14

2017-11-14 Thread Jeff Kirsher
On Tue, 2017-11-14 at 23:52 +0200, Or Gerlitz wrote:
> On Tue, Nov 14, 2017 at 9:03 PM, Jeff Kirsher
>  wrote:
> > This series contains updates to i40e and i40evf only.
> 
> see http://vger.kernel.org/~davem/net-next.html

You're right, Or, I failed to look at the URL for the status of net-next.
 I made the mistake of going through the netdev emails to see if Dave
had sent out a "net-next closed" email and forgot about the URL.

Dave, I am fine with holding onto most of these patches until net-next
opens back up.  There are a few fixes in the series which I will
resubmit to your net tree, after you sync with Linus later this week.



Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet

2017-11-14 Thread Shaohua Li
On Tue, Nov 14, 2017 at 11:13:10AM -0800, Tom Herbert wrote:
> On Tue, Nov 14, 2017 at 10:24 AM, Shaohua Li  wrote:
> > On Wed, Nov 08, 2017 at 09:44:51AM -0800, Tom Herbert wrote:
> >> On Fri, Aug 18, 2017 at 3:27 PM, David Miller  wrote:
> >> > From: Martin KaFai Lau 
> >> > Date: Fri, 18 Aug 2017 13:51:36 -0700
> >> >
> >> >> It seems like that middle box specifically drops TCP_RST if it
> >> >> does not know anything about this flow.  Since the flowlabel of the 
> >> >> TCP_RST
> >> >> (sent in TW state) is always different, it always lands to a different 
> >> >> middle
> >> >> box.  All of these TCP_RST cannot be delivered.
> >> >
> >> > This really is illegal behavior.  The flow label is not a flow _KEY_
> >> > by any definition whatsoever.
> >> >
> >> > Flow labels are an optimization, not a determinant for flow matching
> >> > particularly for proper TCP state processing.
> >> >
> >> > I'd rather you invest all of this energy getting that vendor to fix
> >> > their kit.
> >> >
> >> We're now seeing several router vendors recommending people to not use
> >> flow labels for ECMP hashing. This is precisely because when a flow
> >> label changes, network devices that maintain state (firewalls, NAT,
> >> load balancers) can't deal with packets being rerouted so connections
> >> are dropped. Unfortunately, the need for packets of a flow to always
> >> follow the same path has become an implicit requirement that I think
> >> we need follow at least as the default behavior.
> >>
> >> Martin: is there any change you could resurrect these patches? In
> >> order to solve the general problem of making routing consistent, I
> >> believe we want to keep sk_tx_hash consistent for the connection from
> >> which a consistent flow label can be derived. To avoid the overhead of
> >> a hash field in sk_common, maybe we could initially set a connection
> >> hash to a five-tuple hash for a flow instead of a random value? So in
> >> TW state the consistent hash can be computed on the fly.
> >
> > Hi Tom,
> > Do we really need to use the five-tuple hash? There are several places
> > using the current random hash, which looks more lightweight. To fix the
> > issue, we only need to make sure the reset packet includes the correct
> > flowlabel. Like what my previous patch did, we can set tw->tw_flowlabel
> > in tcp_time_wait based on txhash and use it in the reset packet. In this
> > way we can use the random hash and not add an extra field in sock.
> >
> Shaohua,
> 
> But that patch discards the full txhash in TW. So it's not just a
> problem with the flow label. sk_tx_hash can also be used for route
> selection in ECMP, port selection when we're doing tunneling, etc. The
> general solution should maintain tx_hash or be able to reconstruct it
> in any state; the flow label fix is a point solution.

Hi Tom,

do you want to keep sk_rethink_txhash() then? If we changed the hash to a
random number, we can't reconstruct it, for sure.

Thanks,
Shaohua
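
To make the TW idea concrete, a minimal sketch of deriving a stable flow
label from the saved txhash (my illustration; the helper name and the
plain 20-bit masking are mine - the kernel's ip6_make_flowlabel() does
the real folding):

	static inline __be32 tw_flowlabel_from_txhash(u32 txhash)
	{
		/* the IPv6 flow label is the low 20 bits of the header word,
		 * so any stable function of txhash folded into 20 bits works */
		return htonl(txhash & 0x000FFFFFU);
	}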


Re: [RFC PATCH 5/5] selinux: Add SCTP support

2017-11-14 Thread Richard Haines
On Mon, 2017-11-13 at 17:40 -0500, Paul Moore wrote:
> On Mon, Nov 13, 2017 at 5:05 PM, Richard Haines
>  wrote:
> > On Mon, 2017-11-06 at 19:09 -0500, Paul Moore wrote:
> > > On Tue, Oct 17, 2017 at 9:59 AM, Richard Haines
> > >  wrote:
> > > > The SELinux SCTP implementation is explained in:
> > > > Documentation/security/SELinux-sctp.txt
> > > > 
> > > > Signed-off-by: Richard Haines 
> > > > ---
> > > >  Documentation/security/SELinux-sctp.txt | 108 +
> > > >  security/selinux/hooks.c| 268
> > > > ++--
> > > >  security/selinux/include/classmap.h |   3 +-
> > > >  security/selinux/include/netlabel.h |   9 +-
> > > >  security/selinux/include/objsec.h   |   5 +
> > > >  security/selinux/netlabel.c |  52 ++-
> > > >  6 files changed, 427 insertions(+), 18 deletions(-)
> > > >  create mode 100644 Documentation/security/SELinux-sctp.txt
> 
> ...
> 
> > > > +Policy Statements
> > > > +==
> > > > +The following class and permissions to support SCTP are
> > > > available
> > > > within the
> > > > +kernel:
> > > > +class sctp_socket inherits socket { node_bind }
> > > > +
> > > > +whenever the following policy capability is enabled:
> > > > +policycap extended_socket_class;
> > > > +
> > > > +The SELinux SCTP support adds the additional permissions that
> > > > are
> > > > explained
> > > > +in the sections below:
> > > > +association bindx connectx
> > > 
> > > Is the distinction between bind and bindx significant?  The same
> > > question applies to connect/connectx.  I think we can probably
> > > just
> > > reuse bind and connect in these cases.
> > 
> > This has been discussed before with Marcelo and keeping
> > bindx/connectx
> > is a useful distinction.
> 
> My apologies, I must have forgotten/missed that discussion.  Do you
> have an archive pointer?

No this was off list, however I've copied the relevant bits:

> SCTP Socket Option Permissions
> ===
> Permissions that are validated on setsockopt(2) calls (note that the
> sctp_socket SETOPT permission must be allowed):
>
> This option requires the BINDX_ADDR permission:
> SCTP_SOCKOPT_BINDX_REM - Remove additional bind address.

Can't see a usage for this one.

>
> These options require the SET_PARAMS permission:
> SCTP_PEER_ADDR_PARAMS  - Set heartbeats and address max
> retransmissions.
> SCTP_PEER_ADDR_THLDS  - Set thresholds.
> SCTP_ASSOCINFO- Set association / endpoint parameters.

Same for these, considering we are not willing to go as deep as only
allowing them within a given threshold. But even then, it sounds
like too much.

>
>
> SCTP Bind, Connect and ASCONF Chunk Parameter Permission Checks
> ===============================================================
> The hook security_sctp_addr_list() is called by SCTP when processing
> various options (@optname) to check permissions required for the list
> of ipv4/ipv6 addresses (@address) as follows:
> 
> |           sctp_socket BIND type permission checks               |
> |       (The socket must also have the BIND permission)           |
> |  @optname                  | Permission    |  @address          |
> |----------------------------|---------------|--------------------------|
> |SCTP_SOCKOPT_BINDX_ADD      | BINDX_ADDRS   | One or more ipv4/ipv6 adr|

This one can be useful, for that privilege-dropping case.

Paul note: I later changed BINDX_ADDRS to just BINDX

> |SCTP_PRIMARY_ADDR           | SET_PRI_ADDR  | Single ipv4 or ipv6 adr  |
> |SCTP_SET_PEER_PRIMARY_ADDR  | SET_PEER_ADDR | Single ipv4 or ipv6 adr  |

But for these, I can't see a use case.

> 
> 
> |          sctp_socket CONNECT type permission checks             |
> |      (The socket must also have the CONNECT permission)         |
> |  @optname                  | Permission    |  @address          |
> |----------------------------|---------------|--------------------------|
> |SCTP_SOCKOPT_CONNECTX       | CONNECTX      | One or more ipv4/ipv6 adr|
> |SCTP_PARAM_ADD_IP           | BINDX_ADDRS   | One or more ipv4/ipv6 adr|

The 2 above, can be useful.

> |SCTP_PARAM_DEL_IP           | BINDX_ADDRS   | One or more ipv4/ipv6 adr|
> |SCTP_PARAM_SET_PRIMARY      | SET_PRI_ADDR  | Single ipv4 or ipv6 adr  |

But not these two..

> 
>
> SCTP_SOCKOPT_BINDX_ADD - Allows additional bind addresses to be
> associated after (optionally) calling
> bind(3).
> sctp_bindx(3) adds a set of bind
> addresses on a socket.
>
> SCTP_SOCKOPT_CONNECTX - Allows the allocation of multiple
> addresses for reaching a peer
> (multi-homed).
> sctp_connectx(3) initiates a connection
> on an 

Re: [net-next 00/10][pull request] 40GbE Intel Wired LAN Driver Updates 2017-11-14

2017-11-14 Thread Or Gerlitz
On Tue, Nov 14, 2017 at 9:03 PM, Jeff Kirsher
 wrote:
> This series contains updates to i40e and i40evf only.

see http://vger.kernel.org/~davem/net-next.html


Re: SRIOV switchdev mode BoF minutes

2017-11-14 Thread Or Gerlitz
On Tue, Nov 14, 2017 at 10:00 PM, Alexander Duyck
 wrote:
> On Tue, Nov 14, 2017 at 8:44 AM, Or Gerlitz  wrote:
>> On Mon, Nov 13, 2017 at 7:10 PM, Alexander Duyck
>>  wrote:
>>> On Sun, Nov 12, 2017 at 10:16 PM, Or Gerlitz  wrote:
 On Sun, Nov 12, 2017 at 10:38 PM, Alexander Duyck
>>
 What we call the slow path requirements are the following:

 1. xmit on VF rep always turns to a receive on the VF, regardless of
 the offloaded SW steering rules ("send-to-vport")

 2. xmit on VF which doesn't meet any offloaded SW steering rules must
 be received into the host OS from the VF rep
>>
 1,2 above must hold also for the uplink and the PF reps
>>
>>> I am well aware of the requirements. We discussed these with Jiri at
>>> the previous netdev.
>>
 When the i40e limitation was described at netdev, it seems you have a
 problem with VF xmit that should turn into a receive on the VF rep but
 also goes to the wire.
>>
 It smells as if a FW patch can solve that, isn't that?
>>
>>> That is a huge maybe. We looked into it last time and while we can
>>> meet requirements 1 and 2 we do so with a heavy performance penalty
>>> due to the fact that we don't support anywhere near the same number of
>>> flows as a true switch. Also while that might work for i40e
>>
>> to recap on i40e, you can support the slow path requirements, but  you have 
>> an
>> issue with the fast path (== offloaded flows)? what is the issue there?
>
> We basically need to do some feasibility research to see if we can
> actually meet all the requirements for switchdev on i40e. We have been
> getting mixed messages where we are given a great many "yes, but" type
> answers. For i40e we are looking into it but I don't have high
> confidence in our ability to actually support it in hardware/firmware.
> If it were as easy as you have been led to believe, we would have done
> it months ago when we were researching the requirements to support switchdev

Wait - Sridhar made seven rounds of his submission (this is the v7
pointer [1]), and you still don't know if what you were attempting to
push upstream can work? Something is weird here; can you clarify? Jeff?

Sridhar, maybe you can explain what wrong assumptions, if any, you had
in your code, what you think the gap is in addressing them, and how to
come up with a proper implementation for i40e?

[1] https://marc.info/?l=linux-netdev=149083338400922=2


> In addition i40e isn't really my concern. I am much more
> concerned about ixgbe as it has a much larger install base and many
> more customers that are still buying it today.
>
>>> we still have a much larger install base of ixgbe ports that we have to 
>>> support.
>>
>> ok, but support is one thing, and keeping on enhancing a ten-year-old
>> wrong SW model is a 2nd thing
>
> The model might be 10 years old, but as I said we are still shipping
> new silicon that was released just over a year ago that is supported
> by the ixgbe driver.
>
> Also I don't know if the term "enhancing" is the right word for what I
> am thinking. I'm not talking about adding new drivers that only
> support legacy mode.  We are looking at probably having to refactor
> the whole concept of "trusted" VF in order to break it out into
> smaller buckets. In addition I plan to come up with a source mode
> macvlan based "port representor" for legacy SR-IOV and hope to be able
> to use that to start working on a better path for SR-IOV live
> migration.
>
> Fundamentally the problem I have with us saying we cannot extend
> legacy mode SR-IOV is that 82599 is a very large piece of the existing
> install base for 10Gbit in general. We have it shipping on brand new
> platforms as the silicon that is installed on the motherboard. With
> that being the case people are going to want to get the most value
> they can out of the silicon that they purchased since in many cases it
> is just a standard part of the platform.

Getting the most value still doesn't mean you should approach the community
and ask to keep enhancing a wrong SW model for a switch.

For example, suppose a single new one-bit module param for IXGBE would
get you to sell another 100K or 1M or 10M pieces per year, but we as a
community have decided that module params are not the way to go - would
you come and ask us to add the module param so you can get more
business?


> I'm not saying we have new parts. I'm saying we have existing parts
> that will likely need some work done. SwitchDev was only introduced
> about 2 years ago. We have parts that were released around or before
> then with functionality that didn't anticipate this. We still haven't
> finished fully implementing all the features that were available on
> the parts, that is what I am arguing. Usually new features go in for
> several years after a part is released, usually something on the 3 to
> 5 year range.

> When SR-IOV was introduced there were two available 

[vhost:vhost 1/9] drivers//virtio/virtio_balloon.c:167:3: error: implicit declaration of function 'balloon_page_push'; did you mean 'balloon_page_putback'?

2017-11-14 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost
head:   74d5d28b9903cd48c22c04f887354a3d49621c13
commit: 47349198d7a879febab785a00c39c2de379f35c2 [1/9] virtio_balloon: fix 
deadlock on OOM
config: sh-allyesconfig (attached as .config)
compiler: sh4-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
git checkout 47349198d7a879febab785a00c39c2de379f35c2
# save the attached .config to linux build tree
make.cross ARCH=sh 

All errors (new ones prefixed by >>):

   drivers//virtio/virtio_balloon.c: In function 'fill_balloon':
>> drivers//virtio/virtio_balloon.c:167:3: error: implicit declaration of 
>> function 'balloon_page_push'; did you mean 'balloon_page_putback'? 
>> [-Werror=implicit-function-declaration]
  balloon_page_push(&pages, page);
  ^
  balloon_page_putback
>> drivers//virtio/virtio_balloon.c:174:17: error: implicit declaration of 
>> function 'balloon_page_pop'; did you mean 'balloon_page_alloc'? 
>> [-Werror=implicit-function-declaration]
 while ((page = balloon_page_pop(&pages))) {
^~~~
balloon_page_alloc
   drivers//virtio/virtio_balloon.c:174:15: warning: assignment makes pointer 
from integer without a cast [-Wint-conversion]
 while ((page = balloon_page_pop(&pages))) {
  ^
   cc1: some warnings being treated as errors

vim +167 drivers//virtio/virtio_balloon.c

   143  
   144  static unsigned fill_balloon(struct virtio_balloon *vb, size_t num)
   145  {
   146          unsigned num_allocated_pages;
   147          unsigned num_pfns;
   148          struct page *page;
   149          LIST_HEAD(pages);
   150  
   151          /* We can only do one array worth at a time. */
   152          num = min(num, ARRAY_SIZE(vb->pfns));
   153  
   154          for (num_pfns = 0; num_pfns < num;
   155               num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE) {
   156                  struct page *page = balloon_page_alloc();
   157  
   158                  if (!page) {
   159                          dev_info_ratelimited(&vb->vdev->dev,
   160                                               "Out of puff! Can't get %u pages\n",
   161                                               VIRTIO_BALLOON_PAGES_PER_PAGE);
   162                          /* Sleep for at least 1/5 of a second before retry. */
   163                          msleep(200);
   164                          break;
   165                  }
   166  
 > 167                  balloon_page_push(&pages, page);
   168          }
   169  
   170          mutex_lock(&vb->balloon_lock);
   171  
   172          vb->num_pfns = 0;
   173  
 > 174          while ((page = balloon_page_pop(&pages))) {
   175                  balloon_page_enqueue(&vb->vb_dev_info, page);
   176  
   177                  vb->num_pfns += VIRTIO_BALLOON_PAGES_PER_PAGE;
   178  
   179                  set_page_pfns(vb, vb->pfns + vb->num_pfns, page);
   180                  vb->num_pages += VIRTIO_BALLOON_PAGES_PER_PAGE;
   181                  if (!virtio_has_feature(vb->vdev,
   182                                          VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
   183                          adjust_managed_page_count(page, -1);
   184          }
   185  
   186          num_allocated_pages = vb->num_pfns;
   187          /* Did we get any? */
   188          if (vb->num_pfns != 0)
   189                  tell_host(vb, vb->inflate_vq);
   190          mutex_unlock(&vb->balloon_lock);
   191  
   192          return num_allocated_pages;
   193  }
   194  
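
For context, the two helpers the robot flags as undeclared are simple
wrappers around a local page list; a sketch of their shape (to the best
of my knowledge this matches their eventual form in
include/linux/balloon_compaction.h - the build failure suggests this
config never sees those declarations):

	static inline void balloon_page_push(struct list_head *pages,
					     struct page *page)
	{
		/* stash a freshly allocated page on the local list */
		list_add(&page->lru, pages);
	}

	static inline struct page *balloon_page_pop(struct list_head *pages)
	{
		struct page *page = list_first_entry_or_null(pages,
							     struct page, lru);

		if (!page)
			return NULL;

		list_del(&page->lru);
		return page;
	}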

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




RE: [PATCH net] hv_netvsc: preserve hw_features on mtu/channels/ringparam changes

2017-11-14 Thread Haiyang Zhang


> -Original Message-
> From: Vitaly Kuznetsov [mailto:vkuzn...@redhat.com]
> Sent: Tuesday, November 14, 2017 11:58 AM
> To: Stephen Hemminger 
> Cc: netdev@vger.kernel.org; linux-ker...@vger.kernel.org;
> de...@linuxdriverproject.org; KY Srinivasan ; Haiyang
> Zhang ; Stephen Hemminger
> ; Mohammed Gamal 
> Subject: Re: [PATCH net] hv_netvsc: preserve hw_features on
> mtu/channels/ringparam changes
> 
> Stephen Hemminger  writes:
> 
> > On Tue, 14 Nov 2017 16:22:05 +0100
> > Vitaly Kuznetsov  wrote:
> >
> > Yes, this looks like a real issue.
> >
> >> +  /* Query hardware capabilities if we're called from netvsc_probe() */
> >> +  if (!net->hw_features) {
> >> +  ret = rndis_netdev_set_hwcaps(net_device, rndis_device);
> >> +  if (ret != 0)
> >> +  goto err_dev_remv;
> >> +  }
> >> +
> >
> > Rather than conditional behavior in rndis_filter_device_add, it would
> > be cleaner to make the call to get hardware capabilities there.
> >
> > Please respin and make the query of host a separate function.
> 
> You mean call rndis_netdev_set_hwcaps() from netvsc_probe()? Will do.
> 
> One question though: in case we'll be avoiding
> rndis_filter_set_offload_params() call on mtu/channels/ringparam changes -
> - can we trust the host to preserve what was there before the RNDIS reset?
> In case not we'll have to untangle what is
> rndis_netdev_set_hwcaps() in my patch splitting it into two: hw_features
> setup and rndis_filter_set_offload_params() and leaving the later in
> rndis_filter_device_add().

After removing/re-adding the RNDIS dev, you should pass the parameters
to the host again.

Thanks,
- Haiyang


Re: [net-next,1/3] netem: convert to qdisc_watchdog_schedule_ns

2017-11-14 Thread James Hogan
On Tue, Nov 14, 2017 at 09:11:12PM +, James Hogan wrote:
> On Tue, Nov 07, 2017 at 12:59:34PM -0800, Dave Taht wrote:
> > diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> > index db0228a..443a75d 100644
> > --- a/net/sched/sch_netem.c
> > +++ b/net/sched/sch_netem.c
> 
> ...
> 
> > @@ -305,11 +305,11 @@ static bool loss_event(struct netem_sched_data *q)
> >   * std deviation sigma.  Uses table lookup to approximate the desired
> >   * distribution, and a uniformly-distributed pseudo-random source.
> >   */
> > -static psched_tdiff_t tabledist(psched_tdiff_t mu, psched_tdiff_t sigma,
> > -   struct crndstate *state,
> > -   const struct disttable *dist)
> > +static s64 tabledist(s64 mu, s64 sigma,
> 
> sigma is used in a modulo operation in this function, which results in
> this error on a bunch of MIPS configs once it is made 64-bits wide:
> 
> net/sched/sch_netem.o: In function `tabledist':
> net/sched/sch_netem.c:330: undefined reference to `__moddi3'
> 
> Should that code not be using <linux/math64.h>, i.e. div_s64_rem(), now
> that it is 64-bit?

For the record, Dave has kindly pointed me at:
https://patchwork.ozlabs.org/project/netdev/list/?series=13554

which fixes the MIPS builds.

Cheers
James




Re: [net-next,v2,0/2] netem: fix compilation on 32 bit

2017-11-14 Thread James Hogan
On Tue, Nov 14, 2017 at 11:27:00AM -0800, Stephen Hemminger wrote:
> A couple of places where 64 bit CPU was being assumed incorrectly.
> 
> Stephen Hemminger (2):
>   netem: fix 64 bit divide
>   netem: remove unnecessary 64 bit modulus
> 
>  net/sched/sch_netem.c | 17 +++--
>  1 file changed, 7 insertions(+), 10 deletions(-)
> Acked-by: Randy Dunlap 
> Acked-by: Dave Taht 

These appear to fix the MIPS build, so

Tested-by: James Hogan 

Thanks
James




Re: [net-next,1/3] netem: convert to qdisc_watchdog_schedule_ns

2017-11-14 Thread James Hogan
On Tue, Nov 07, 2017 at 12:59:34PM -0800, Dave Taht wrote:
> diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> index db0228a..443a75d 100644
> --- a/net/sched/sch_netem.c
> +++ b/net/sched/sch_netem.c

...

> @@ -305,11 +305,11 @@ static bool loss_event(struct netem_sched_data *q)
>   * std deviation sigma.  Uses table lookup to approximate the desired
>   * distribution, and a uniformly-distributed pseudo-random source.
>   */
> -static psched_tdiff_t tabledist(psched_tdiff_t mu, psched_tdiff_t sigma,
> - struct crndstate *state,
> - const struct disttable *dist)
> +static s64 tabledist(s64 mu, s64 sigma,

sigma is used in a modulo operation in this function, which results in
this error on a bunch of MIPS configs once it is made 64-bits wide:

net/sched/sch_netem.o: In function `tabledist':
net/sched/sch_netem.c:330: undefined reference to `__moddi3'

Should that code not be using <linux/math64.h>, i.e. div_s64_rem(), now
that it is 64-bit?
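
For illustration, a sketch of the 32-bit-safe form (my example, not the
actual fix; div_s64_rem() comes from linux/math64.h and avoids the
libgcc __moddi3 call, assuming sigma has already been bounded to 32
bits as the follow-up fixes do):

	#include <linux/math64.h>

	/* rnd % (2 * sigma) without a native 64-bit modulus */
	static s32 bounded_offset(u32 rnd, s32 sigma)
	{
		s32 rem;

		div_s64_rem((s64)rnd, 2 * sigma, &rem);
		return rem;
	}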

Thanks
James




Re: [PATCH net v2] ipv6: set all.accept_dad to 0 by default

2017-11-14 Thread Girish Moodalbail

On 11/14/17 11:10 AM, Stefano Brivio wrote:

On Tue, 14 Nov 2017 10:30:33 -0800
Girish Moodalbail  wrote:


On 11/14/17 5:21 AM, Nicolas Dichtel wrote:

With commits 35e015e1f577 and a2d3f3e33853, the global 'accept_dad' flag
is also taken into account (default value is 1). If either global or
per-interface flag is non-zero, DAD will be enabled on a given interface.

This is not backward compatible: before those patches, the user could
disable DAD just by setting the per-interface flag to 0. Now, the
user instead needs to set both flags to 0 to actually disable DAD.

Restore the previous behaviour by setting the default for the global
'accept_dad' flag to 0. This way, DAD is still enabled by default,
as per-interface flags are set to 1 on device creation, but setting
them to 0 is enough to disable DAD on a given interface.

- Before 35e015e1f577 and a2d3f3e33853:
            global    per-interface    DAD enabled
[default]     1             1              yes
              X             0              no
              X             1              yes

- After 35e015e1f577 and a2d3f3e33853:
            global    per-interface    DAD enabled
[default]     1             1              yes
              0             0              no
              0             1              yes
              1             0              yes

- After this fix:
            global    per-interface    DAD enabled
              1             1              yes
              0             0              no
[default]     0             1              yes
              1             0              yes


The above table can be summarized as:

- After this fix:
            global    per-interface    DAD enabled
              1             X              yes
              0             0              no
[default]     0             1              yes

So, if global is set to '1', then irrespective of what the per-interface
value is, DAD will be enabled. Is it not confusing? Shouldn't the more
specific value override the general value?


Might be a bit confusing, yes, but in order to implement an overriding
mechanism you would need to implement a tristate option as Eric K.
proposed. That is, by default you would have -1 (meaning "don't care")
on per-interface flags, and if this value is changed then the
per-interface value wins over the global one.

Sensible, but I think it's outside of the scope of this patch, which is
just intended to restore a specific pre-existing userspace expectation.


On the other hand, if the global is set to '0', then the per-interface value
will be honored (it overrides the global). So, the meaning of the global flag
varies based on its value. Isn't that confusing as well?


I don't find this confusing though. Setting the global flag always has
the meaning of "force enabling DAD on all interfaces".

You would have the same problem if you chose a logical AND between
global and per-interface flag. There, setting the global flag would mean
"force disabling DAD on all interfaces".

So the only indisputable improvement I see here would be to implement a
"don't care" value (either for global or for per-interface flags). But
I'd rather agree with Nicolas that we should fix a potentially broken
userspace assumption first.


Agree.

Thanks,
~Girish
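
A compact way to read the tables above - a sketch of the (unchanged)
combination logic, where the function and parameter names are mine, not
the kernel's:

	/* DAD runs if either the global or the per-interface flag is set;
	 * the fix only changes the *default* of the global flag to 0.
	 */
	static bool dad_enabled(int global_accept_dad, int ifp_accept_dad)
	{
		return global_accept_dad || ifp_accept_dad;
	}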




[regression, 4.14] xfrm: Fix stack-out-of-bounds read in xfrm_state_find breaks selinux-testsuite

2017-11-14 Thread Stephen Smalley
Hi,

4.14 is failing the selinux-testsuite labeled IPSEC tests despite
having just been fixed in commit cf37966751747727 ("xfrm: do
unconditional template resolution before pcpu cache check").  The
breaking commit is the very next one, commit c9f3f813d462c72d ("xfrm:
Fix stack-out-of-bounds read in xfrm_state_find.").  Unlike the earlier
breakage, which caused use of the wrong SA, this one leads to a failure
on connect(). Running ip xfrm monitor during one of the failing tests
shows the following:
acquire proto ah 
  sel src 127.0.0.1/32 dst 127.0.0.1/32 proto tcp sport 0 dport 65535
dev lo 
  policy src 127.0.0.1/32 dst 127.0.0.1/32 proto tcp 
security context
unconfined_u:unconfined_r:test_inet_client_t:s0-s0:c0.c1023 
dir out priority 0 ptype main 
tmpl src 0.0.0.0 dst 0.0.0.0
proto ah reqid 0 mode transport

Expired src 127.0.0.1 dst 0.0.0.0
proto ah spi 0x reqid 0 mode transport
replay-window 0 
sel src 127.0.0.1/32 dst 127.0.0.1/32 proto tcp sport 0 dport
65535 dev lo 
hard 1

On the last working commit, connect() instead succeeds and ip xfrm
monitor shows the following:
Async event  (0x20)  timer expired 
src 127.0.0.1 dst 127.0.0.1  reqid 0x0 protocol ah  SPI 0x200
Async event  (0x10)  replay update 
src 127.0.0.1 dst 127.0.0.1  reqid 0x0 protocol ah  SPI 0x200
Async event  (0x10)  replay update 
src 127.0.0.1 dst 127.0.0.1  reqid 0x0 protocol ah  SPI 0x200

Reproducer:
# Install a Fedora VM w/ SELinux enabled (default).
$ git clone https://github.com/SELinuxProject/selinux-testsuite/
# Make sure you have the requisite kernel config options enabled.
$ cd linux
$ ./scripts/kconfig/merge_config.sh .config ../selinux-testsuite/defconfig
$ make
$ sudo make modules_install install
$ sudo reboot
# Install dependencies.
sudo dnf install perl-Test perl-Test-Harness perl-Test-Simple selinux-policy-devel gcc libselinux-devel net-tools netlabel_tools iptables
sudo make test

After running once as above, you can run just the inet socket tests
again via:
cd tests/inet_socket
./test




Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-14 Thread Kirill Tkhai
On 14.11.2017 21:38, Andrei Vagin wrote:
> On Tue, Nov 14, 2017 at 09:04:06PM +0300, Kirill Tkhai wrote:
>> On 14.11.2017 20:44, Andrei Vagin wrote:
>>> On Tue, Nov 14, 2017 at 04:53:33PM +0300, Kirill Tkhai wrote:
 Currently a mutex is used to protect the pernet operations list. It makes
 cleanup_net() execute the ->exit methods of the same operations set
 which was used at the time of ->init, even after the net namespace is
 unlinked from net_namespace_list.

 But the problem is that we need to synchronize_rcu() after the net is
 removed from net_namespace_list():

 Destroy net_ns:
 cleanup_net()
   mutex_lock(&net_mutex)
   list_del_rcu(&net->list)
   synchronize_rcu()                  <--- Sleep there for ages
   list_for_each_entry_reverse(ops, &pernet_list, list)
     ops_exit_list(ops, &net_exit_list)
   list_for_each_entry_reverse(ops, &pernet_list, list)
     ops_free_list(ops, &net_exit_list)
   mutex_unlock(&net_mutex)

 This primitive is not fast, especially on systems with many processors
 and/or when preemptible RCU is enabled in the config. So, all the time
 cleanup_net() is waiting for the RCU grace period, creation of new net
 namespaces is not possible; the tasks that attempt it are sleeping on
 the same mutex:

 Create net_ns:
 copy_net_ns()
   mutex_lock_killable(&net_mutex)    <--- Sleep there for ages

 The solution is to convert net_mutex to an rw_semaphore. Then,
 pernet_operations::init/::exit methods, modifying the net-related data,
 will require down_read() locking only, while down_write() will be used
 for changing pernet_list.

 This gives a significant performance increase, as you may see below.
 Measured is sequential net namespace creation in a cycle, in a single
 thread, without other tasks (single user mode):

 1)int main(int argc, char *argv[])
 {
 unsigned nr;
 if (argc < 2) {
 fprintf(stderr, "Provide nr iterations arg\n");
 return 1;
 }
 nr = atoi(argv[1]);
 while (nr-- > 0) {
 if (unshare(CLONE_NEWNET)) {
 perror("Can't unshare");
 return 1;
 }
 }
 return 0;
 }

 Origin, 10 unshare():
 0.03user 23.14system 1:39.85elapsed 23%CPU

 Patched, 10 unshare():
 0.03user 67.49system 1:08.34elapsed 98%CPU

 2)for i in {1..1}; do unshare -n bash -c exit; done
>>>
>>> Hi Kirill,
>>>
>>> This mutex has another role. You know that net namespaces are destroyed
>>> asynchronously, and the net mutex guarantees that the backlog will not
>>> get big. If we have something in the backlog, we know that it will be
>>> handled before creating a new net ns.
>>>
>>> As far as I remember net namespaces are created much faster than
>>> they are destroyed, so with these changes we can create a really big
>>> backlog, can't we?
>>
>> I don't think limitation is a good goal or a role for the mutex,
>> because it's very easy to create many net namespaces even though the
>> mutex exists. You may open /proc/[pid]/ns/net like a file, and the
>> net_ns counter will increment. Then do unshare(), and the mutex has
>> no way to protect against that.
> 
> You are right, but with the mutex a user cannot sustain a big backlog
> for a long time; it is shrunk to zero periodically. With these changes
> he can sustain a big backlog for a long time.
> 
> A big backlog affects other users. If someone creates namespaces, he
> probably expects that they will be destroyed in a reasonable time.

Here are 2 different topics:
1)Limitation.
This problem exists in the current code, and the patch does not make it worse.
If someone wants to limit the number of net namespaces per user_ns/cgroup,
he should solve it in another way. The mutex does not help there.

2)Speed-up of struct net destruction.
Many structures in the kernel are released via RCU. The only difference
of struct net is that it has more delayed actions than, for example,
task_struct. The user is not interested in the time when a task_struct is
completely destroyed. But I may imagine it's more important for struct net,
if (for example) some delayed connections are destroyed from ops->exit().
Are they?!
But as in the first paragraph, I think this is not related to the patch,
because in the current code you may generate dozens of thousands of net
namespaces and then destroy them at once. They will be destroyed over the
same long time, and there is an additional, worse thing: it won't be
possible to create a new one for a long period. So, it's completely
unrelated to the patch.
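
For what it's worth, the "generate dozens of thousands and destroy at
once" scenario needs nothing beyond unshare() and an open fd - a minimal
sketch (my illustration):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sched.h>

	int main(void)
	{
		int i;

		/* each iteration leaves one pinned netns behind; raise
		 * RLIMIT_NOFILE to push past ~1000 of them */
		for (i = 0; i < 1000; i++) {
			/* the fd keeps the current netns alive ... */
			if (open("/proc/self/ns/net", O_RDONLY) < 0)
				return 1;
			/* ... while unshare() moves us to a fresh one */
			if (unshare(CLONE_NEWNET))
				return 1;
		}
		return 0;
	}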
 
> But currently someone else can increase the destroy time to really big
> values. This problem existed before your patches, but they may make it
> worse. The question here is: Should we think about this problem

[PATCH] openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start

2017-11-14 Thread Gustavo A. R. Silva
It seems that the intention of the code is to null-check the value
returned by the function genlmsg_put. But the current code is null-checking
the address of the pointer that holds the value returned by genlmsg_put.

Fix this by properly null-checking the value returned by the function
genlmsg_put in order to avoid a potential null pointer dereference.

Addresses-Coverity-ID: 1461561 ("Dereference before null check")
Addresses-Coverity-ID: 1461562 ("Dereference null return value")
Fixes: 96fbc13d7e77 ("openvswitch: Add meter infrastructure")
Signed-off-by: Gustavo A. R. Silva 
---
 net/openvswitch/meter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/openvswitch/meter.c b/net/openvswitch/meter.c
index 2a5ba35..bc0b6fc 100644
--- a/net/openvswitch/meter.c
+++ b/net/openvswitch/meter.c
@@ -106,7 +106,7 @@ ovs_meter_cmd_reply_start(struct genl_info *info, u8 cmd,
 	*ovs_reply_header = genlmsg_put(skb, info->snd_portid,
 					info->snd_seq,
 					&dp_meter_genl_family, 0, cmd);
-	if (!ovs_reply_header) {
+	if (!*ovs_reply_header) {
 		nlmsg_free(skb);
 		return ERR_PTR(-EMSGSIZE);
 	}
-- 
2.7.4
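
The pattern is easy to trip over wherever a function fills in an out
parameter - an illustrative sketch in plain C (not OVS code):

	struct reply;
	struct reply *make_reply(void);	/* may return NULL */

	static int start_reply(struct reply **out)
	{
		*out = make_reply();

		/* "if (!out)" would repeat the bug: it tests the address of
		 * a caller-owned variable, which is never NULL here */
		if (!*out)	/* correct: tests the value make_reply() returned */
			return -1;
		return 0;
	}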



Re: len = bpf_probe_read_str(); bpf_perf_event_output(... len) == FAIL

2017-11-14 Thread Daniel Borkmann
On 11/14/2017 07:15 PM, Yonghong Song wrote:
> On 11/14/17 6:19 AM, Daniel Borkmann wrote:
>> On 11/14/2017 02:42 PM, Arnaldo Carvalho de Melo wrote:
>>> Em Tue, Nov 14, 2017 at 02:09:34PM +0100, Daniel Borkmann escreveu:
 On 11/14/2017 01:58 PM, Arnaldo Carvalho de Melo wrote:
> Em Tue, Nov 14, 2017 at 01:09:39AM +0100, Daniel Borkmann escreveu:
>> On 11/13/2017 04:08 PM, Arnaldo Carvalho de Melo wrote:
>>> libbpf: -- BEGIN DUMP LOG ---
>>> libbpf:
>>> 0: (79) r3 = *(u64 *)(r1 +104)
>>> 1: (b7) r2 = 0
>>> 2: (bf) r6 = r1
>>> 3: (bf) r1 = r10
>>> 4: (07) r1 += -128
>>> 5: (b7) r2 = 128
>>> 6: (85) call bpf_probe_read_str#45
>>> 7: (bf) r1 = r0
>>> 8: (07) r1 += -1
>>> 9: (67) r1 <<= 32
>>> 10: (77) r1 >>= 32
>>> 11: (25) if r1 > 0x7f goto pc+11
>>
>> Right, so the compiler is optimizing the two tests into a single one 
>> above,
>> which means lower bound cannot properly be derived again by the verifier 
>> due
>> to this and thus you'll get the error. Similar issue was seen recently 
>> [1].
>>
>> Does the below hack work for you?
>>
>> int prog([...])
>> {
>>  char filename[128];
>>  int ret = bpf_probe_read_str(filename, sizeof(filename), 
>> filename_ptr);
>>  if (ret > 0)
>>  bpf_perf_event_output(ctx, &__bpf_stdout__, 
>> BPF_F_CURRENT_CPU, filename,
>>    ret & (sizeof(filename) - 1));
>>  return 1;
>> }
>>
>> r0 should keep on tracking bounds here at least:
>>
>> prog:
>>     0:    bf 16 00 00 00 00 00 00 r6 = r1
>>     1:    bf a1 00 00 00 00 00 00 r1 = r10
>>     2:    07 01 00 00 80 ff ff ff r1 += -128
>>     3:    b7 02 00 00 80 00 00 00 r2 = 128
>>     4:    85 00 00 00 2d 00 00 00 call 45
>>     5:    67 00 00 00 20 00 00 00 r0 <<= 32
>>     6:    c7 00 00 00 20 00 00 00 r0 s>>= 32
>>     7:    b7 01 00 00 01 00 00 00 r1 = 1
>>     8:    6d 01 0a 00 00 00 00 00 if r1 s> r0 goto 10
>>     9:    57 00 00 00 7f 00 00 00 r0 &= 127
>>    10:    bf a4 00 00 00 00 00 00 r4 = r10
>>    11:    07 04 00 00 80 ff ff ff r4 += -128
>>    12:    bf 61 00 00 00 00 00 00 r1 = r6
>>    13:    18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 
>> 0ll
>>    15:    18 03 00 00 ff ff ff ff 00 00 00 00 00 00 00 00 r3 = 
>> 4294967295ll
>>    17:    bf 05 00 00 00 00 00 00 r5 = r0
>>    18:    85 00 00 00 19 00 00 00 call 25
>>
>>    [1] 
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__patchwork.ozlabs.org_project_netdev_list_-3Fseries-3D13211=DwIDaQ=5VD0RTtNlTh3ycd41b3MUw=DA8e1B5r073vIqRrFz7MRA=Qp3xFfXEz-CT8rzYtrHeXbow2M6FlsUzwcY32i3_2Q0=z0d6b_hxStA845Kh7epJ-JiFwkiWqUH_z3fEadwqAQY=
>
> Not yet:
>
> 6: (85) call bpf_probe_read_str#45
> 7: (bf) r1 = r0
> 8: (67) r1 <<= 32
> 9: (77) r1 >>= 32
> 10: (15) if r1 == 0x0 goto pc+10
>   R0=inv(id=0) R1=inv(id=0,umax_value=4294967295,var_off=(0x0; 
> 0x)) R6=ctx(id=0,off=0,imm=0) R10=fp0
> 11: (57) r0 &= 127
> 12: (bf) r4 = r10
> 13: (07) r4 += -128
> 14: (bf) r1 = r6
> 15: (18) r2 = 0x92bfc2aba840u
> 17: (18) r3 = 0x
> 19: (bf) r5 = r0
> 20: (85) call bpf_perf_event_output#25
> invalid stack type R4 off=-128 access_size=0
>
> I'll try updating clang/llvm...
>
> Full details:
>
> [root@jouet bpf]# cat open.c
> #include "bpf.h"
>
> SEC("prog=do_sys_open filename")
> int prog(void *ctx, int err, const char __user *filename_ptr)
> {
> char filename[128];
> const unsigned len = bpf_probe_read_str(filename, sizeof(filename), 
> filename_ptr);

 Btw, I was using 'int' here above instead of 'unsigned' as 
 strncpy_from_unsafe()
 could potentially return errors like -EFAULT.
>>>
>>> I changed to int, didn't help
>>>  
 Currently having a version compiled from the git tree:

 # llc --version
 LLVM 
 (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_=DwIDaQ=5VD0RTtNlTh3ycd41b3MUw=DA8e1B5r073vIqRrFz7MRA=Qp3xFfXEz-CT8rzYtrHeXbow2M6FlsUzwcY32i3_2Q0=BKC_Gu9s1hw0v13OCgCpfsGtAY2hE7dujFqg8LNaK2I=):
    LLVM version 6.0.0git-2d810c2
    Optimized build.
    Default target: x86_64-unknown-linux-gnu
    Host CPU: skylake
>>>
>>> [root@jouet bpf]# llc --version
>>> LLVM 
>>> (https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_=DwIDaQ=5VD0RTtNlTh3ycd41b3MUw=DA8e1B5r073vIqRrFz7MRA=Qp3xFfXEz-CT8rzYtrHeXbow2M6FlsUzwcY32i3_2Q0=BKC_Gu9s1hw0v13OCgCpfsGtAY2hE7dujFqg8LNaK2I=):
>>>    LLVM version 4.0.0svn
>>>
>>> Old stuff! ;-) Will change, but improving these messages should be on
>>> the radar, 

RE: Broken netlink ABI

2017-11-14 Thread Jon Maloy


> -Original Message-
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of David Ahern
> Sent: Tuesday, November 14, 2017 15:18
> To: Jon Maloy ; netdev@vger.kernel.org; Jiri
> Pirko 
> Cc: David Miller (da...@davemloft.net) 
> Subject: Re: Broken netlink ABI
> 
> On 11/14/17 1:15 PM, David Ahern wrote:
> > On 11/14/17 12:19 PM, Jon Maloy wrote:
> >> When I give the command:
> >> ~$ tipc node set addr 1.1.2
> >>
> >> I get the following response:
> >>
> >> error: Numerical result out of range
> >> Unable to get TIPC nl family id (module loaded?) error, message
> >> initialisation failed
> >
> > tipc is sending a u32 for the family attribute when it should be a u16:
> >
> > diff --git a/tipc/msg.c b/tipc/msg.c
> > index 22c6bb20..dc09d05048f3 100644
> > --- a/tipc/msg.c
> > +++ b/tipc/msg.c
> > @@ -125,7 +125,7 @@ static int get_family(void)
> > genl->cmd = CTRL_CMD_GETFAMILY;
> > genl->version = 1;
> >
> > -   mnl_attr_put_u32(nlh, CTRL_ATTR_FAMILY_ID, GENL_ID_CTRL);
> > +   mnl_attr_put_u16(nlh, CTRL_ATTR_FAMILY_ID, GENL_ID_CTRL);
> > mnl_attr_put_strz(nlh, CTRL_ATTR_FAMILY_NAME,
> > TIPC_GENL_V2_NAME);
> >
> > if ((err = msg_query(nlh, family_id_cb, _family)))
> >
> > With the above change the tipc command runs fine.

I can fix that, but that doesn't change the fact that binaries that have
been around and worked flawlessly for years now all of a sudden have stopped
working.
Whether the user is doing right or wrong, that is for me the very definition
of a broken ABI, and it is unacceptable.

Either you have to remove the test in your patch, or you can try to identify 
tipc and devlink in the code and exempt those from your test.

BR
///jon

> >
> 
> devlink is similarly broken:
> 
> diff --git a/devlink/mnlg.c b/devlink/mnlg.c index
> 9e27de275518..b1e1b0ab32f6 100644
> --- a/devlink/mnlg.c
> +++ b/devlink/mnlg.c
> @@ -163,7 +163,7 @@ int mnlg_socket_group_add(struct mnlg_socket *nlg,
> const char *group_name)
> 
> nlh = __mnlg_msg_prepare(nlg, CTRL_CMD_GETFAMILY,
>  NLM_F_REQUEST | NLM_F_ACK, GENL_ID_CTRL, 1);
> -   mnl_attr_put_u32(nlh, CTRL_ATTR_FAMILY_ID, nlg->id);
> +   mnl_attr_put_u16(nlh, CTRL_ATTR_FAMILY_ID, nlg->id);
> 
> err = mnlg_socket_send(nlg, nlh);
> if (err < 0)


Re: [PATCH net-next v2 0/2] netem: fix compilation on 32 bit

2017-11-14 Thread Dave Taht
On Tue, Nov 14, 2017 at 11:40 AM, Randy Dunlap  wrote:
> On 11/14/2017 11:27 AM, Stephen Hemminger wrote:
>> A couple of places where 64 bit CPU was being assumed incorrectly.
>>
>> Stephen Hemminger (2):
>>   netem: fix 64 bit divide
>>   netem: remove unnecessary 64 bit modulus
>>
>>  net/sched/sch_netem.c | 17 +++--
>>  1 file changed, 7 insertions(+), 10 deletions(-)
>>
>
> Acked-by: Randy Dunlap 

Acked-by: Dave Taht 

Thx.


Re: Broken netlink ABI

2017-11-14 Thread David Ahern
On 11/14/17 1:15 PM, David Ahern wrote:
> On 11/14/17 12:19 PM, Jon Maloy wrote:
>> When I give the command:
>> ~$ tipc node set addr 1.1.2
>>
>> I get the following response:
>>
>> error: Numerical result out of range
>> Unable to get TIPC nl family id (module loaded?)
>> error, message initialisation failed
> 
> tipc is sending a u32 for the family attribute when it should be a u16:
> 
> diff --git a/tipc/msg.c b/tipc/msg.c
> index 22c6bb20..dc09d05048f3 100644
> --- a/tipc/msg.c
> +++ b/tipc/msg.c
> @@ -125,7 +125,7 @@ static int get_family(void)
> genl->cmd = CTRL_CMD_GETFAMILY;
> genl->version = 1;
> 
> -   mnl_attr_put_u32(nlh, CTRL_ATTR_FAMILY_ID, GENL_ID_CTRL);
> +   mnl_attr_put_u16(nlh, CTRL_ATTR_FAMILY_ID, GENL_ID_CTRL);
> mnl_attr_put_strz(nlh, CTRL_ATTR_FAMILY_NAME, TIPC_GENL_V2_NAME);
> 
> if ((err = msg_query(nlh, family_id_cb, _family)))
> 
> With the above change the tipc command runs fine.
> 

devlink is similarly broken:

diff --git a/devlink/mnlg.c b/devlink/mnlg.c
index 9e27de275518..b1e1b0ab32f6 100644
--- a/devlink/mnlg.c
+++ b/devlink/mnlg.c
@@ -163,7 +163,7 @@ int mnlg_socket_group_add(struct mnlg_socket *nlg,
const char *group_name)

nlh = __mnlg_msg_prepare(nlg, CTRL_CMD_GETFAMILY,
 NLM_F_REQUEST | NLM_F_ACK,
GENL_ID_CTRL, 1);
-   mnl_attr_put_u32(nlh, CTRL_ATTR_FAMILY_ID, nlg->id);
+   mnl_attr_put_u16(nlh, CTRL_ATTR_FAMILY_ID, nlg->id);

err = mnlg_socket_send(nlg, nlh);
if (err < 0)


Re: Broken netlink ABI

2017-11-14 Thread David Ahern
On 11/14/17 12:19 PM, Jon Maloy wrote:
> When I give the command:
> ~$ tipc node set addr 1.1.2
> 
> I get the following response:
> 
> error: Numerical result out of range
> Unable to get TIPC nl family id (module loaded?)
> error, message initialisation failed

tipc is sending a u32 for the family attribute when it should be a u16:

diff --git a/tipc/msg.c b/tipc/msg.c
index 22c6bb20..dc09d05048f3 100644
--- a/tipc/msg.c
+++ b/tipc/msg.c
@@ -125,7 +125,7 @@ static int get_family(void)
genl->cmd = CTRL_CMD_GETFAMILY;
genl->version = 1;

-   mnl_attr_put_u32(nlh, CTRL_ATTR_FAMILY_ID, GENL_ID_CTRL);
+   mnl_attr_put_u16(nlh, CTRL_ATTR_FAMILY_ID, GENL_ID_CTRL);
mnl_attr_put_strz(nlh, CTRL_ATTR_FAMILY_NAME, TIPC_GENL_V2_NAME);

if ((err = msg_query(nlh, family_id_cb, _family)))

With the above change the tipc command runs fine.
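
The failure mode is easy to see in miniature - a sketch of the
exact-length rule that commit 28033ae4e0f adds for fixed-size attribute
types (simplified and paraphrased, not the kernel code):

	/* a 4-byte payload in an NLA_U16 slot now fails validation */
	static int check_fixed_attr(const struct nlattr *nla, int expected_len)
	{
		if (nla_len(nla) != expected_len)
			return -ERANGE;	/* "Numerical result out of range" */
		return 0;
	}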


Re: Regression in throughput between kvm guests over virtual bridge

2017-11-14 Thread Matthew Rosato
On 11/12/2017 01:34 PM, Wei Xu wrote:
> On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
 This case should be quite similar to pktgen: if you got an improvement
 with pktgen, usually it was also the same for UDP. Could you please try
 to disable tso, gso, gro, ufo on all host tap devices and guest
 virtio-net devices? Currently the most significant tests would be like
 this AFAICT:

 Host->VM     4.12    4.13
  TCP:
  UDP:
 pktgen:

 Don't want to bother you too much, so maybe 4.12 & 4.13 without Jason's
 patch should work, since we have seen positive numbers for that; you can
 also temporarily skip net-next as well.
>>>
>>> Here are the requested numbers, averaged over numerous runs --  guest is
>>> 4GB+1vcpu, host uperf/pktgen bound to 1 host CPU + qemu and vhost thread
>>> pinned to other unique host CPUs.  tso, gso, gro, ufo disabled on host
>>> taps / guest virtio-net devs as requested:
>>>
>>> Host->VM    4.12          4.13
>>> TCP:        9.92Gb/s      6.44Gb/s
>>> UDP:        5.77Gb/s      6.63Gb/s
>>> pktgen:     1572403pps    1904265pps
>>>
>>> UDP/pktgen both show improvement from 4.12->4.13.  More interesting,
>>> however, is that I am seeing the TCP regression for the first time from
>>> host->VM.  I wonder if the combination of CPU binding + disabling of one
>>> or more of tso/gso/gro/ufo is related.
>>>

 If you see UDP and pktgen are aligned, then it might be helpful to continue
 the other two cases, otherwise we fail in the first place.
>>>
>>
>> I continued running many iterations of these tests between 4.12 and
>> 4.13..  My throughput findings can be summarized as:
> 
> Really nice to have these numbers.
> 

Wasn't sure if you were asking for the individual #s -- Just in case,
here are the other averages I used to draw my conclusions:

VM->VM      4.12        4.13
UDP         9.06Gb/s    8.99Gb/s
TCP         9.16Gb/s    8.67Gb/s

VM->Host    4.12        4.13
UDP         9.70Gb/s    9.53Gb/s
TCP         6.12Gb/s    6.00Gb/s

>>
>> VM->VM case:
>> UDP:  roughly equivalent
>> TCP:  Consistent regression (5-10%)
>>
>> VM->Host
>> Both UDP and TCP traffic are roughly equivalent.
> 
> The patch improves performance for Rx from the guest's point of view, so
> Tx shows no big difference, since Rx packets are far fewer than Tx in
> this case.
> 
>>
>> Host->VM
>> UDP+pktgen: improvement (5-10%), but inconsistent
>> TCP: Consistent regression (25-30%)
> 
> Maybe we can try to figure out this case first, since it is the shortest
> path. Can you have a look at the TCP statistics and paste a few outputs
> between tests? I suspect there is some retransmitting, zero window
> probing, etc.
> 

Grabbed some netstat -s results after a few minutes of running (snipped
uninteresting icmp and udp sections).  The test was the TCP Host->VM
scenario, with binding and tso/gso/gro/ufo disabled as before:


Host 4.12

Ip:
Forwarding: 1
3724964 total packets received
0 forwarded
0 incoming packets discarded
3724964 incoming packets delivered
526 requests sent out
Tcp:
4 active connection openings
1 passive connection openings
0 failed connection attempts
0 connection resets received
1 connections established
3724954 segments received
133112205 segments sent out
93106 segments retransmitted
0 bad segments received
2 resets sent
TcpExt:
5 delayed acks sent
8 packets directly queued to recvmsg prequeue
TCPDirectCopyFromPrequeue: 1736
146 packet headers predicted
4 packet headers predicted and directly queued to user
3218205 acknowledgments not containing data payload received
506561 predicted acknowledgments
TCPSackRecovery: 2096
TCPLostRetransmit: 860
93106 fast retransmits
TCPLossProbes: 5
TCPSackShifted: 1959097
TCPSackMerged: 458343
TCPSackShiftFallback: 7969
TCPRcvCoalesce: 2
TCPOrigDataSent: 133112178
TCPHystartTrainDetect: 2
TCPHystartTrainCwnd: 96
TCPWinProbe: 2
IpExt:
InBcastPkts: 4
InOctets: 226014831
OutOctets: 193103919403
InBcastOctets: 1312
InNoECTPkts: 3724964


Host 4.13

Ip:
Forwarding: 1
5930785 total packets received
0 forwarded
0 incoming packets discarded
5930785 incoming packets delivered
4495113 requests sent out
Tcp:
4 active connection openings
1 passive connection openings
0 failed connection attempts
0 connection resets received
1 connections established
5930775 segments received
73226521 segments sent out
13975 segments retransmitted
0 bad segments received
4 resets sent
TcpExt:
5 delayed acks sent
8 packets directly queued to recvmsg prequeue
TCPDirectCopyFromPrequeue: 1736
18 packet headers predicted
4 packet headers predicted and directly queued to user
4091720 

Re: SRIOV switchdev mode BoF minutes

2017-11-14 Thread Alexander Duyck
On Tue, Nov 14, 2017 at 8:44 AM, Or Gerlitz  wrote:
> On Mon, Nov 13, 2017 at 7:10 PM, Alexander Duyck
>  wrote:
>> On Sun, Nov 12, 2017 at 10:16 PM, Or Gerlitz  wrote:
>>> On Sun, Nov 12, 2017 at 10:38 PM, Alexander Duyck
>
>>> What we call the slow path requirements are the following:
>>>
>>> 1. xmit on VF rep always turns to a receive on the VF, regardless of
>>> the offloaded SW steering rules ("send-to-vport")
>>>
>>> 2. xmit on VF which doesn't meet any offloaded SW steering rules must
>>> be received into the host OS from the VF rep
>
>>> 1,2 above must hold also for the uplink and the PF reps
>
>> I am well aware of the requirements. We discussed these with Jiri at
>> the previous netdev.
>
>>> When the i40e limitation was described at netdev, it seems you have a
>>> problem with VF xmit that should turn into a receive on the VF rep but
>>> also goes to the wire.
>
>>> It smells as if a FW patch can solve that, isn't that?
>
>> That is a huge maybe. We looked into it last time and while we can
>> meet requirements 1 and 2 we do so with a heavy performance penalty
>> due to the fact that we don't support anywhere near the same number of
>> flows as a true switch. Also while that might work for i40e
>
> to recap on i40e, you can support the slow path requirements, but  you have an
> issue with the fast path (== offloaded flows)? what is the issue there?

We basically need to do some feasibility research to see if we can
actually meet all the requirements for switchdev on i40e. We have been
getting mixed messages where we are given a great many "yes, but" type
answers. For i40e we are looking into it but I don't have high
confidence in our ability to actually support it in hardware/firmware.
If it were as easy as you have been led to believe, we would have done
it months ago when we were researching the requirements to support
switchdev. In addition i40e isn't really my concern. I am much more
concerned about ixgbe as it has a much larger install base and many
more customers that are still buying it today.

>> we still have a much larger install base of ixgbe ports that we have to 
>> support.
>
> ok, but support is one thing, and keeping on enhancing a ten-year-old
> wrong SW model is a 2nd thing

The model might be 10 years old, but as I said we are still shipping
new silicon that was released just over a year ago that is supported
by the ixgbe driver.

Also I don't know if the term "enhancing" is the right word for what I
am thinking. I'm not talking about adding new drivers that only
support legacy mode.  We are looking at probably having to refactor
the whole concept of "trusted" VF in order to break it out into
smaller buckets. In addition I plan to come up with a source mode
macvlan based "port representor" for legacy SR-IOV and hope to be able
to use that to start working on a better path for SR-IOV live
migration.

Fundamentally the problem I have with us saying we cannot extend
legacy mode SR-IOV is that 82599 is a very large piece of the existing
install base for 10Gbit in general. We have it shipping on brand new
platforms as the silicon that is installed on the motherboard. With
that being the case people are going to want to get the most value
they can out of the silicon that they purchased since in many cases it
is just a standard part of the platform.

> I would have to disagree with this. For devices such as 82599 that
> doesn't have a true switch this may limit future functionality since
> we can't move it over to switchdev mode. For example one thing I may
> need to add is the ability to disable multicast and broadcast receive
> on a per-VF basis at some point in the future.
>
>>> We are on the same boat with ConnectX3/mlx4, so us lucky that misery loves
>>> company (my google search also yielded "many narrow-half consolation" is 
>>> that
>>> completely unrelated?) - the legacy mode for ixgbe/mlx4 is there for ~8-10 
>>> years
>>> - and since then both companies had 2-3 newer HW generations. I don't see 
>>> why
>>> you can't come to your customers and tell that newish functionality needs 
>>> newer
>>> HW - it will also help sell more from the new stuff..  If you keep
>>> extending the legacy mode, more ppl/drivers will do that as well and it 
>>> will not let us go
>>> in the right direction.
>
>> Well I don't know about you guys, but we still are selling parts
>> supported by ixgbe
>
> Same here, we are selling lots of CX3 and have to support that, but I didn't
> see why someone will want new features there.

I think the difference is that we get pressed on as part of the
platform instead of being a single component. If a customer wants some
specific feature enabled on 82599 as a part of the platform we tend to
need to go along with it in order to avoid being a roadblock in a sale
of other components.

>> still been adding new hardware as recently as just a couple years ago.
>
> wait, that's 

Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-14 Thread Kirill Tkhai
On 14.11.2017 21:39, Cong Wang wrote:
> On Tue, Nov 14, 2017 at 5:53 AM, Kirill Tkhai  wrote:
>> @@ -406,7 +406,7 @@ struct net *copy_net_ns(unsigned long flags,
>>
>>         get_user_ns(user_ns);
>>
>> -       rv = mutex_lock_killable(&net_mutex);
>> +       rv = down_read_killable(&net_sem);
>>         if (rv < 0) {
>>                 net_free(net);
>>                 dec_net_namespaces(ucounts);
>> @@ -421,7 +421,7 @@ struct net *copy_net_ns(unsigned long flags,
>>                 list_add_tail_rcu(&net->list, &net_namespace_list);
>>                 rtnl_unlock();
>>         }
>> -       mutex_unlock(&net_mutex);
>> +       up_read(&net_sem);
>>         if (rv < 0) {
>>                 dec_net_namespaces(ucounts);
>>                 put_user_ns(user_ns);
>> @@ -446,7 +446,7 @@ static void cleanup_net(struct work_struct *work)
>>         list_replace_init(&cleanup_list, &net_kill_list);
>>         spin_unlock_irq(&cleanup_list_lock);
>>
>> -       mutex_lock(&net_mutex);
>> +       down_read(&net_sem);
>>
>>         /* Don't let anyone else find us. */
>>         rtnl_lock();
>> @@ -486,7 +486,7 @@ static void cleanup_net(struct work_struct *work)
>>         list_for_each_entry_reverse(ops, &pernet_list, list)
>>                 ops_free_list(ops, &net_exit_list);
>>
>> -       mutex_unlock(&net_mutex);
>> +       up_read(&net_sem);
> 
> After your patch setup_net() could run concurrently with cleanup_net(),
> given that ops_exit_list() is called on the error path of setup_net() too;
> it means ops->exit() now could run concurrently if it doesn't have its
> own lock. Not sure if this breaks any existing user.

Yes, there can be a concurrent ops->init() for one net namespace and
ops->exit() for another. I haven't found pernet operations that have a
problem with that. If they exist, they are hidden and not clearly seen.
The pernet operations in general do not touch someone else's memory.
If suddenly there is one, KASAN should show it after a while.
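
In other words, a pernet_operations implementation that does touch state
shared across namespaces would now need its own serialization - an
illustrative sketch (all names are mine):

	static DEFINE_SPINLOCK(foo_shared_lock);

	static int __net_init foo_net_init(struct net *net)
	{
		/* ->init() for one netns may now run concurrently with
		 * ->exit() for another; guard any cross-namespace state */
		spin_lock(&foo_shared_lock);
		/* ... update bookkeeping shared between namespaces ... */
		spin_unlock(&foo_shared_lock);
		return 0;
	}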


Re: [PATCH net-next v2 0/2] netem: fix compilation on 32 bit

2017-11-14 Thread Randy Dunlap
On 11/14/2017 11:27 AM, Stephen Hemminger wrote:
> A couple of places where 64 bit CPU was being assumed incorrectly.
> 
> Stephen Hemminger (2):
>   netem: fix 64 bit divide
>   netem: remove unnecessary 64 bit modulus
> 
>  net/sched/sch_netem.c | 17 +++--
>  1 file changed, 7 insertions(+), 10 deletions(-)
> 

Acked-by: Randy Dunlap 

Thanks.

-- 
~Randy


Re: [PATCH][net-next] tcp: remove the now redundant non-null check on tskb

2017-11-14 Thread Eric Dumazet
On Tue, 2017-11-14 at 18:41 +, Colin King wrote:
> From: Colin Ian King 
> 
> The non-null check on tskb is redundant as it is in an else
> section of a check on tskb where tskb is always null. Remove
> the redundant if statement and the label coalesce.
> 
> Detected by CoverityScan, CID#1457751 ("Logically dead code")
> 
> Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue")
> Signed-off-by: Colin Ian King 
> ---
>  net/ipv4/tcp_output.c | 6 +-----
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 071bdd34f8eb..b58c986b2b27 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -3053,7 +3053,6 @@ void tcp_send_fin(struct sock *sk)
>   tskb = skb_rb_last(&sk->tcp_rtx_queue);
>  
>   if (tskb) {
> -coalesce:
>   TCP_SKB_CB(tskb)->tcp_flags |= TCPHDR_FIN;
>   TCP_SKB_CB(tskb)->end_seq++;
>   tp->write_seq++;
> @@ -3069,11 +3068,8 @@ void tcp_send_fin(struct sock *sk)
>   }
>   } else {
>   skb = alloc_skb_fclone(MAX_TCP_HEADER, sk->sk_allocation);
> - if (unlikely(!skb)) {
> - if (tskb)
> - goto coalesce;
> + if (unlikely(!skb))
>   return;
> - }
>   INIT_LIST_HEAD(&skb->tcp_tsorted_anchor);
>   skb_reserve(skb, MAX_TCP_HEADER);
>   sk_forced_mem_schedule(sk, skb->truesize);

Hmm... I would rather try to use skb_rb_last(), because
alloc_skb_fclone() might fail even if tcp_under_memory_pressure() is
false.

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 76dbe884f2469660028684a46fc19afa000a1353..eea017b8a8918815226fd1412c0a7b8e484aeca8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3070,6 +3070,7 @@ void tcp_send_fin(struct sock *sk)
} else {
skb = alloc_skb_fclone(MAX_TCP_HEADER, sk->sk_allocation);
if (unlikely(!skb)) {
+   tskb = skb_rb_last(&sk->tcp_rtx_queue);
if (tskb)
goto coalesce;
return;




Re: Broken netlink ABI

2017-11-14 Thread David Ahern
On 11/14/17 12:19 PM, Jon Maloy wrote:
> commit 28033ae4e0f ("net: netlink: Update attr validation to require exact 
> length for some types") breaks the netlink ABI.

It's not breaking the ABI; it's enforcing expected attributes based on
policy.

> 
> When I give the command:
> ~$ tipc node set addr 1.1.2
> 
> I get the following response:
> 
> error: Numerical result out of range
> Unable to get TIPC nl family id (module loaded?)
> error, message initialisation failed
> 

I'll take a look at tipc and get back to you.


[PATCH net-next v2 2/2] netem: remove unnecessary 64 bit modulus

2017-11-14 Thread Stephen Hemminger
Fix compilation on 32 bit platforms (where a 64 bit modulus operation
requires extra libgcc helper functions) by truncation.
The jitter for table distribution is limited to a 32 bit value
because random numbers are scaled as 32 bit value.

Also fix some whitespace.

Fixes: 99803171ef04 ("netem: add uapi to express delay and jitter in 
nanoseconds")
Reported-by: Randy Dunlap 
Signed-off-by: Stephen Hemminger 
---
 net/sched/sch_netem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 644323d6081c..dd70924cbcdf 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -312,9 +312,9 @@ static bool loss_event(struct netem_sched_data *q)
  * std deviation sigma.  Uses table lookup to approximate the desired
  * distribution, and a uniformly-distributed pseudo-random source.
  */
-static s64 tabledist(s64 mu, s64 sigma,
+static s64 tabledist(s64 mu, s32 sigma,
 struct crndstate *state,
-const struct disttable *dist)
+const struct disttable *dist)
 {
s64 x;
long t;
@@ -327,7 +327,7 @@ static s64 tabledist(s64 mu, s64 sigma,
 
/* default uniform distribution */
if (dist == NULL)
-   return (rnd % (2*sigma)) - sigma + mu;
+   return (rnd % (2 * sigma)) - sigma + mu;
 
t = dist->table[rnd % dist->size];
x = (sigma % NETEM_DIST_SCALE) * t;
-- 
2.11.0



[PATCH net-next v2 1/2] netem: use 64 bit divide by rate

2017-11-14 Thread Stephen Hemminger
Since times are now expressed in nanoseconds, we need to do a true
64 bit divide. The old code would truncate the rate at 32 bits.
Rename the function to better express its current usage.

Signed-off-by: Stephen Hemminger 
---
 net/sched/sch_netem.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index b686e755fda9..644323d6081c 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -339,10 +339,8 @@ static s64 tabledist(s64 mu, s64 sigma,
return  x / NETEM_DIST_SCALE + (sigma / NETEM_DIST_SCALE) * t + mu;
 }
 
-static u64 packet_len_2_sched_time(unsigned int len,
-  struct netem_sched_data *q)
+static u64 packet_time_ns(u64 len, const struct netem_sched_data *q)
 {
-   u64 offset;
len += q->packet_overhead;
 
if (q->cell_size) {
@@ -352,9 +350,8 @@ static u64 packet_len_2_sched_time(unsigned int len,
cells++;
len = cells * (q->cell_size + q->cell_overhead);
}
-   offset = (u64)len * NSEC_PER_SEC;
-   do_div(offset, q->rate);
-   return offset;
+
+   return div64_u64(len * NSEC_PER_SEC, q->rate);
 }
 
 static void tfifo_reset(struct Qdisc *sch)
@@ -556,7 +553,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
now = last->time_to_send;
}
 
-   delay += packet_len_2_sched_time(qdisc_pkt_len(skb), q);
+   delay += packet_time_ns(qdisc_pkt_len(skb), q);
}
 
cb->time_to_send = now + delay;
-- 
2.11.0



[PATCH net-next v2 0/2] netem: fix compilation on 32 bit

2017-11-14 Thread Stephen Hemminger
A couple of places where 64 bit CPU was being assumed incorrectly.

Stephen Hemminger (2):
  netem: fix 64 bit divide
  netem: remove unnecessary 64 bit modulus

 net/sched/sch_netem.c | 17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

-- 
2.11.0



Broken netlink ABI

2017-11-14 Thread Jon Maloy
commit 28033ae4e0f ("net: netlink: Update attr validation to require exact 
length for some types") breaks the netlink ABI.

When I give the command:
~$ tipc node set addr 1.1.2

I get the following response:

error: Numerical result out of range
Unable to get TIPC nl family id (module loaded?)
error, message initialisation failed

The module is definitely loaded:

~$ lsmod 
tipc 172032 0 - Live 0xa0062000
ip6_udp_tunnel 16384 1 tipc, Live 0xa0034000
udp_tunnel 16384 1 tipc, Live 0xa0039000

Bisecting reveals that the culprit is the commit referred to above, or more 
exactly the lines:

 if (nla_attr_len[pt->type]) {
 	if (attrlen != nla_attr_len[pt->type])
 		return -ERANGE;
 	return 0;
 }

This test compares the following values:
attrlen == 4 , 
nla_attr_len[pt->type] == 2

The corresponding code in the tipc tool is:

static int get_family(void)
{
int err;
int nl_family;
struct nlmsghdr *nlh;
struct genlmsghdr *genl;
char buf[MNL_SOCKET_BUFFER_SIZE];

nlh = mnl_nlmsg_put_header(buf);
nlh->nlmsg_type = GENL_ID_CTRL;
nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;

genl = mnl_nlmsg_put_extra_header(nlh, sizeof(struct genlmsghdr));
genl->cmd = CTRL_CMD_GETFAMILY;
genl->version = 1;

mnl_attr_put_u32(nlh, CTRL_ATTR_FAMILY_ID, GENL_ID_CTRL);
mnl_attr_put_strz(nlh, CTRL_ATTR_FAMILY_NAME, TIPC_GENL_V2_NAME);

if ((err = msg_query(nlh, family_id_cb, &nl_family)))
return err;

return nl_family;
}

I didn't dig further into this, but you will notice that tipc, in contrast to 
most other iproute2 tools, uses libmnl to build messages. Maybe the reason can 
be found there?

BR
Jon Maloy
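
The mismatch is consistent with CTRL_ATTR_FAMILY_ID being declared as a u16
attribute in the generic netlink controller policy, while the tool puts it
as a u32 (4 bytes). Assuming that is the root cause, a minimal sketch of a
userspace-side fix would be to emit the attribute with its declared width:

/* hypothetical fix sketch: CTRL_ATTR_FAMILY_ID is a u16 attribute,
 * so put 2 bytes instead of 4 */
mnl_attr_put_u16(nlh, CTRL_ATTR_FAMILY_ID, GENL_ID_CTRL);

Whether old binaries already in the field can be fixed this way is of
course the real ABI question.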



Re: [PATCH V4 net 0/2] ipv6: fix flowlabel issue for reset packet

2017-11-14 Thread Tom Herbert
On Tue, Nov 14, 2017 at 10:24 AM, Shaohua Li  wrote:
> On Wed, Nov 08, 2017 at 09:44:51AM -0800, Tom Herbert wrote:
>> On Fri, Aug 18, 2017 at 3:27 PM, David Miller  wrote:
>> > From: Martin KaFai Lau 
>> > Date: Fri, 18 Aug 2017 13:51:36 -0700
>> >
>> >> It seems like that middle box specifically drops TCP_RST if it
>> >> does not know anything about this flow.  Since the flowlabel of the 
>> >> TCP_RST
>> >> (sent in TW state) is always different, it always lands to a different 
>> >> middle
>> >> box.  All of these TCP_RST cannot be delivered.
>> >
>> > This really is illegal behavior.  The flow label is not a flow _KEY_
>> > by any definition whatsoever.
>> >
>> > Flow labels are an optimization, not a determinant for flow matching
>> > particularly for proper TCP state processing.
>> >
>> > I'd rather you invest all of this energy getting that vendor to fix
>> > their kit.
>> >
>> We're now seeing several router vendors recommending people to not use
>> flow labels for ECMP hashing. This is precisely because when a flow
>> label changes, network devices that maintain state (firewalls, NAT,
>> load balancers) can't deal with packets being rerouted so connections
>> are dropped. Unfortunately, the need for packets of a flow to always
>> follow the same path has become an implicit requirement that I think
>> we need follow at least as the default behavior.
>>
>> Martin: is there any change you could resurrect these patches? In
>> order to solve the general problem of making routing consistent, I
>> believe we want to keep sk_tx_hash consistent for the connection from
>> which a consistent flow label can be derived. To avoid the overhead of
>> a hash field in sk_common, maybe we could initially set a connection
>> hash to a five-tuple hash for a flow instead of a random value? So in
>> TW state the consistent hash can be computed on the fly.
>
> Hi Tom,
> Do we really need to use the five-tuple hash? There are several places using
> the current random hash, which looks more lightweight. To fix the issue, we
> only need to make sure the reset packet includes the correct flowlabel. Like
> what my previous patch did, we can set tw->tw_flowlabel in tcp_time_wait
> based on txhash and use it for the reset packet. In this way we can use the
> random hash and not add an extra field in sock.
>
Shaohua,

But that patch discards the full txhash in TW. So it's not just a
problem with the flow label. sk_txhash can also be used for route
selection in ECMP, port selection when we're doing tunneling, etc. The
general solution should maintain the txhash, or be able to reconstruct it
in any state; the flow label fix is a point solution.

Thanks,
Tom

> Thanks,
> Shaohua
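
For illustration, a minimal sketch of Tom's suggestion (hypothetical helper,
not from any posted patch): derive the initial hash deterministically from
the 4-tuple, so the same value - and hence the same flow label and ECMP
path - can be recomputed on the fly in TIME_WAIT:

#include <linux/jhash.h>

static u32 sk_txhash_from_flow(__be32 saddr, __be32 daddr,
			       __be16 sport, __be16 dport)
{
	u32 h = jhash_3words((__force u32)saddr, (__force u32)daddr,
			     ((u32)(__force u16)sport << 16) |
			     (__force u16)dport,
			     0 /* assumed fixed seed */);

	return h ?: 1;	/* 0 is reserved for "hash not set" */
}

A fixed seed makes the hash guessable off-host, which is presumably part of
why the stack uses a random txhash today; the trade-off here is
reconstructibility in states that keep no socket.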


Re: [PATCH net v2] ipv6: set all.accept_dad to 0 by default

2017-11-14 Thread Stefano Brivio
On Tue, 14 Nov 2017 10:30:33 -0800
Girish Moodalbail  wrote:

> On 11/14/17 5:21 AM, Nicolas Dichtel wrote:
> > With commits 35e015e1f577 and a2d3f3e33853, the global 'accept_dad' flag
> > is also taken into account (default value is 1). If either global or
> > per-interface flag is non-zero, DAD will be enabled on a given interface.
> > 
> > This is not backward compatible: before those patches, the user could
> > disable DAD just by setting the per-interface flag to 0. Now, the
> > user instead needs to set both flags to 0 to actually disable DAD.
> > 
> > Restore the previous behaviour by setting the default for the global
> > 'accept_dad' flag to 0. This way, DAD is still enabled by default,
> > as per-interface flags are set to 1 on device creation, but setting
> > them to 0 is enough to disable DAD on a given interface.
> > 
> > - Before 35e015e1f57a7 and a2d3f3e33853:
> >   global   per-interface   DAD enabled
> > [default]   1 1  yes
> >  X 0  no
> >  X 1  yes
> > 
> > - After 35e015e1f577 and a2d3f3e33853:
> >   global   per-interface   DAD enabled
> > [default]   1 1  yes
> >  0 0  no
> >  0 1  yes
> >  1 0  yes
> > 
> > - After this fix:
> >   global   per-interface   DAD enabled
> >  1 1  yes
> >  0 0  no
> > [default]   0 1  yes
> >  1 0  yes  
> 
> Above table can be summarized to..
> 
> - After this fix:
>    global   per-interface   DAD enabled
>  1 X  yes
>  0 0  no
> [default]   0 1  yes
> 
> So, if global is set to '1', then irrespective of what the per-interface
> value is, DAD will be enabled. Is that not confusing? Shouldn't the more
> specific value override the general one?

Might be a bit confusing, yes, but in order to implement an overriding
mechanism you would need to implement a tristate option as Eric K.
proposed. That is, by default you would have -1 (meaning "don't care")
on per-interface flags, and if this value is changed then the
per-interface value wins over the global one.
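
A minimal sketch of that tristate resolution (hypothetical, with -1 as the
per-interface "don't care" default):

static bool dad_enabled(int global, int per_if)
{
	if (per_if >= 0)	/* explicitly configured: interface wins */
		return per_if != 0;
	return global != 0;	/* otherwise fall back to the global flag */
}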

Sensible, but I think it's outside of the scope of this patch, which is
just intended to restore a specific pre-existing userspace expectation.

> On the other hand, if the global is set to '0', then the per-interface value
> will be honored (it overrides the global one). So, the meaning of the global
> flag varies based on its value. Isn't that confusing as well?

I don't find this confusing though. Setting the global flag always has
the meaning of "force enabling DAD on all interfaces".

You would have the same problem if you chose a logical AND between
global and per-interface flag. There, setting the global flag would mean
"force disabling DAD on all interfaces".

So the only indisputable improvement I see here would be to implement a
"don't care" value (either for global or for per-interface flags). But
I'd rather agree with Nicolas that we should fix a potentially broken
userspace assumption first.

-- 
Stefano


Re: [PATCH] netem: fix 64 bit divide

2017-11-14 Thread Stephen Hemminger
On Tue, 14 Nov 2017 09:57:11 -0800
Randy Dunlap  wrote:

> On 11/14/2017 09:28 AM, Stephen Hemminger wrote:
> > Since times are now expressed in 64 bit nanosecond, need to now do
> > true 64 bit divide. Change the name of the function to better express
> > the new units.
> > 
> > Fixes: 99803171ef04 ("netem: add uapi to express delay and jitter in 
> > nanoseconds")
> > Reported-by: Randy Dunlap 
> > Signed-off-by: Stephen Hemminger   
> 
> Hi Stephen,
> 
> I still get the error. Maybe it's this % operator:
> 
>   skb->data[prandom_u32() % skb_headlen(skb)] ^=
>   1<<(prandom_u32() % 8);
> 
> in netem_enqueue().
> 
> > ---
> >  net/sched/sch_netem.c | 11 ---
> >  1 file changed, 4 insertions(+), 7 deletions(-)
> > 
> > diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> > index b686e755fda9..644323d6081c 100644
> > --- a/net/sched/sch_netem.c
> > +++ b/net/sched/sch_netem.c
> > @@ -339,10 +339,8 @@ static s64 tabledist(s64 mu, s64 sigma,
> > return  x / NETEM_DIST_SCALE + (sigma / NETEM_DIST_SCALE) * t + mu;
> >  }
> >  
> > -static u64 packet_len_2_sched_time(unsigned int len,
> > -  struct netem_sched_data *q)
> > +static u64 packet_time_ns(u64 len, const struct netem_sched_data *q)
> >  {
> > -   u64 offset;
> > len += q->packet_overhead;
> >  
> > if (q->cell_size) {
> > @@ -352,9 +350,8 @@ static u64 packet_len_2_sched_time(unsigned int len,
> > cells++;
> > len = cells * (q->cell_size + q->cell_overhead);
> > }
> > -   offset = (u64)len * NSEC_PER_SEC;
> > -   do_div(offset, q->rate);
> > -   return offset;
> > +
> > +   return div64_u64(len * NSEC_PER_SEC, q->rate);
> >  }
> >  
> >  static void tfifo_reset(struct Qdisc *sch)
> > @@ -556,7 +553,7 @@ static int netem_enqueue(struct sk_buff *skb, struct 
> > Qdisc *sch,
> > now = last->time_to_send;
> > }
> >  
> > -   delay += packet_len_2_sched_time(qdisc_pkt_len(skb), q);
> > +   delay += packet_time_ns(qdisc_pkt_len(skb), q);
> > }
> >  
> > cb->time_to_send = now + delay;
> >   
> 
> 

Close, the problem is here:
return (rnd % (2*sigma)) - sigma + mu;
139d:   8b 4d dcmov-0x24(%ebp),%ecx
13a0:   8b 5d e0mov-0x20(%ebp),%ebx
13a3:   0f a4 cb 01 shld   $0x1,%ecx,%ebx
13a7:   01 c9   add%ecx,%ecx
13a9:   53  push   %ebx
13aa:   51  push   %ecx
13ab:   e8 fc ff ff ff  call   13ac 
13ac: R_386_PC32   __moddi3

Will add second patch for that.



Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-14 Thread Eric Dumazet
On Tue, 2017-11-14 at 10:11 -0800, Stephen Hemminger wrote:
> On Tue, 14 Nov 2017 16:53:33 +0300
> Kirill Tkhai  wrote:
> 
> > +   /*
> > +    * RCU-protected list, modifiable by pernet-init and -exit methods.
> > +    * When net namespace is alive (net::count > 0), all the changes
> > +    * are made under rw_sem held on write.
> > +    */
> > +   struct list_head    fib_notifier_ops;
> >  
> 
> If you use __rcu annotation then it could be checked (and the comment is not 
> needed).

I do not think we can use __rcu annotation yet on a struct list_head ?

(The annotation would be needed on the members)
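
For reference, a minimal illustration of the distinction:

struct foo {
	struct bar __rcu *b;	/* ok: __rcu annotates a pointer, and sparse
				 * can check rcu_dereference(foo->b) */
	struct list_head list;	/* no single pointer to annotate here; the
				 * RCU-ness lives in the ->next accesses
				 * inside the list_*_rcu() helpers */
};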






Re: [PATCH net] sctp: check stream reset info len before making reconf chunk

2017-11-14 Thread Marcelo Ricardo Leitner
On Tue, Nov 14, 2017 at 04:27:41PM +0800, Xin Long wrote:
> On Tue, Nov 14, 2017 at 3:54 AM, Marcelo Ricardo Leitner
>  wrote:
> > On Mon, Nov 13, 2017 at 11:15:40PM +0800, Xin Long wrote:
> >> On Mon, Nov 13, 2017 at 11:09 PM, Neil Horman  
> >> wrote:
> >> > On Mon, Nov 13, 2017 at 01:39:27PM +0800, Xin Long wrote:
> >> >> Now when resetting stream, if both in and out flags are set, the info
> >> >> len can reach:
> >> >>   sizeof(struct sctp_strreset_outreq) + SCTP_MAX_STREAM(65535) +
> >> >>   sizeof(struct sctp_strreset_inreq)  + SCTP_MAX_STREAM(65535)
> >> >> even without duplicated stream no, this value is far greater than the
> >> >> chunk's max size.
> >> >>
> >> >> _sctp_make_chunk doesn't do any check for this, which would cause the
> >> >> skb it allocs is huge, syzbot even reported a crash due to this.
> >> >>
> >> >> This patch is to check stream reset info len before making reconf
> >> >> chunk and return NULL if the len exceeds chunk's capacity.
> >> >>
> >> >> Fixes: cc16f00f6529 ("sctp: add support for generating stream reconf 
> >> >> ssn reset request chunk")
> >> >> Reported-by: Dmitry Vyukov 
> >> >> Signed-off-by: Xin Long 
> >> >> ---
> >> >>  net/sctp/sm_make_chunk.c | 7 +--
> >> >>  net/sctp/stream.c| 8 +---
> >> >>  2 files changed, 10 insertions(+), 5 deletions(-)
> >> >>
> >> >> diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
> >> >> index 514465b..a21328a 100644
> >> >> --- a/net/sctp/sm_make_chunk.c
> >> >> +++ b/net/sctp/sm_make_chunk.c
> >> >> @@ -3598,14 +3598,17 @@ struct sctp_chunk *sctp_make_strreset_req(
> >> >>   __u16 stream_len = stream_num * 2;
> >
> > Unrelated, but.. won't stream_len overflow if stream_num >= 32768?
> > When called form sctp_send_reset_streams() I don't see anything
> > restricting it to such range.
> right.
> 
> >
> >> >>   struct sctp_strreset_inreq inreq;
> >> >>   struct sctp_chunk *retval;
> >> >> - __u16 outlen, inlen;
> >> >> + int outlen, inlen;
> >> >>
> >> >>   outlen = (sizeof(outreq) + stream_len) * out;
> >> >>   inlen = (sizeof(inreq) + stream_len) * in;
> >> >>
> >> >> + if (outlen + inlen > SCTP_MAX_CHUNK_LEN - sizeof(struct 
> >> >> sctp_chunkhdr))
> >> >> + return ERR_PTR(-EINVAL);
> >> >> +
> >> > Why all the ERR_PTR manipulations here?  Just returning NULL, like the
> >> > function has been doing, is sufficient to set ENOMEM at both call sites
> >> I don't like ERR_PTR handling here either,
> >> But it shouldn't be ENOMEM, should it ?
> >>
> >> It may confuse users, but I'm also ok to let it just return
> >> ENOMEM as you wish. wdyt ?
> >
> > Returning ENOMEM in the above error can be misleading. It's not that
> > we cannot allocate it, it's that it won't fit the packet no matter how
> > much memory we add to the system.
> right.
> 
> let's move the check into sctp_send_reset_streams()
> 
> I believe this one fixes them both:
> @@ -139,15 +139,31 @@ int sctp_send_reset_streams(struct sctp_association 
> *asoc,
> 
> str_nums = params->srs_number_streams;
> str_list = params->srs_stream_list;
> -   if (out && str_nums)
> -   for (i = 0; i < str_nums; i++)
> -   if (str_list[i] >= stream->outcnt)
> -   goto out;
> +   if (str_nums) {
> +   int param_len = 0;
> 
> -   if (in && str_nums)
> -   for (i = 0; i < str_nums; i++)
> -   if (str_list[i] >= stream->incnt)
> -   goto out;
> +   if (out) {
> +   for (i = 0; i < str_nums; i++)
> +   if (str_list[i] >= stream->outcnt)
> +   goto out;
> +
> +   param_len = str_nums * 2 +
 sizeof(__u16) --^

> +   sizeof(struct sctp_strreset_outreq);
> +   }
> +
> +   if (in) {
> +   for (i = 0; i < str_nums; i++)
> +   if (str_list[i] >= stream->incnt)
> +   goto out;
> +
> +   param_len += str_nums * 2 +
  sizeof(__u16) --^

> +sizeof(struct sctp_strreset_inreq);
> +   }
> +
> +   if (param_len > SCTP_MAX_CHUNK_LEN -
> +   sizeof(struct sctp_reconf_chunk))
> +   goto out;
> +   }
> 
> 
> and in this fix, it's good to do all the checks only when str_nums != 0.
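
Folding the sizeof(__u16) corrections above into the proposed check, the
length computation would read (sketch only; each stream number is a __u16):

param_len = 0;
if (out)
	param_len = str_nums * sizeof(__u16) +
		    sizeof(struct sctp_strreset_outreq);
if (in)
	param_len += str_nums * sizeof(__u16) +
		     sizeof(struct sctp_strreset_inreq);
if (param_len > SCTP_MAX_CHUNK_LEN - sizeof(struct sctp_reconf_chunk))
	goto out;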


Re: [PATCH] macvlan: verify MTU before lowerdev xmit

2017-11-14 Thread Shannon Nelson

On 11/14/2017 9:03 AM, Shannon Nelson wrote:

On 11/14/2017 2:32 AM, Daniel Axtens wrote:

If a macvlan device which is not in bridge mode receives a packet,
it is sent straight to the lowerdev without checking against the
device's MTU. This also happens for multicast traffic.

Add an is_skb_forwardable() check against the lowerdev before
sending the packet out through it. I think this is the simplest
and best way to do it, and is consistent with the use of
dev_forward_skb() in the bridge path.

This is easy to replicate:
  - create a VM with a macvtap connection in private mode
  - set the lowerdev MTU to something low in the host (e.g. 1480)
  - do not set the MTU lower in the guest (e.g. keep at 1500)
  - netperf to a different host with the same high MTU
  - observe that currently, the driver will forward too-big packets
  - observe that with this patch the packets are dropped

Cc: Shannon Nelson 
Signed-off-by: Daniel Axtens 

---

After hearing Shannon's lightning talk on macvlan at netdev I
figured I'd strike while the iron is hot and get this out of my
patch queue where it has been languishing.
---
  drivers/net/macvlan.c | 4 ++++
  1 file changed, 4 insertions(+)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index a178c5efd33e..8adcad6798c5 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -534,6 +534,10 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)

  }
  xmit_world:
+    /* verify MTU */
+    if (!is_skb_forwardable(vlan->lowerdev, skb))
+    return NET_XMIT_DROP;
+


Good catch.  I remember your comment on this last week, but hadn't had a 
chance to look it up yet.


As for the other paths, I see that the call to dev_forward_skb() already 
has this protection in it, but does the call to dev_queue_xmit_accel() 
in macvlan_start_xmit() need similar protection?




Now that I think about this a little more, why is this not already 
getting handled by the NETDEV_CHANGEMTU notifier?  In what case are you 
running into this, and why is it not triggering the notifier?


sln






  skb->dev = vlan->lowerdev;
  return dev_queue_xmit(skb);
  }



[net-next 07/10] i40e: add helper conversion function for link_speed

2017-11-14 Thread Jeff Kirsher
From: Jacob Keller 

We introduced the virtchnl interface in order to have an interface for
talking to a virtual device driver which was host-driver agnostic. This
interface has its own definitions, including one for link speed.

The host driver has to talk to the virtchnl interface using these new
definitions in order to remain compatible. Today, the i40e link_speed
enumerations are value-exact matches for the virtchnl interface, so it
was originally decided to simply use a typecast.

However, this is unsafe, and makes it easier for future drivers to
continue this unsafe practice. There is nothing guaranteeing these
values are exact, and the type-cast would hide any compiler warning
which indicates the problem.

Rather than rely on this type cast, introduce a helper function which
can convert the AdminQ link speed definition into a virtchnl
definition. This can then be used by host driver implementations in
order to safely convert to the interface recognized by the virtual
functions.

If the link speed is not able to be represented by the virtchnl
definitions we'll report UNKNOWN which is the safest result.

This will ensure that, should the driver-specific link_speed bit
definitions ever change, we do not report them incorrectly to the
VF.

Additionally, this provides a better pattern for future drivers to copy,
as it is more likely a future device may not use the exact same bit-wise
definition as the current virtchnl interface.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_prototype.h   | 31 ++
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |  4 +--
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_prototype.h 
b/drivers/net/ethernet/intel/i40e/i40e_prototype.h
index 3bb6659db822..e70bebc4d2a3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_prototype.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_prototype.h
@@ -343,6 +343,37 @@ static inline struct i40e_rx_ptype_decoded 
decode_rx_desc_ptype(u8 ptype)
return i40e_ptype_lookup[ptype];
 }
 
+/**
+ * i40e_virtchnl_link_speed - Convert AdminQ link_speed to virtchnl definition
+ * @link_speed: the speed to convert
+ *
+ * Returns the link_speed in terms of the virtchnl interface, for use in
+ * converting link_speed as reported by the AdminQ into the format used for
+ * talking to virtchnl devices. If we can't represent the link speed properly,
+ * report LINK_SPEED_UNKNOWN.
+ **/
+static inline enum virtchnl_link_speed
+i40e_virtchnl_link_speed(enum i40e_aq_link_speed link_speed)
+{
+   switch (link_speed) {
+   case I40E_LINK_SPEED_100MB:
+   return VIRTCHNL_LINK_SPEED_100MB;
+   case I40E_LINK_SPEED_1GB:
+   return VIRTCHNL_LINK_SPEED_1GB;
+   case I40E_LINK_SPEED_10GB:
+   return VIRTCHNL_LINK_SPEED_10GB;
+   case I40E_LINK_SPEED_40GB:
+   return VIRTCHNL_LINK_SPEED_40GB;
+   case I40E_LINK_SPEED_20GB:
+   return VIRTCHNL_LINK_SPEED_20GB;
+   case I40E_LINK_SPEED_25GB:
+   return VIRTCHNL_LINK_SPEED_25GB;
+   case I40E_LINK_SPEED_UNKNOWN:
+   default:
+   return VIRTCHNL_LINK_SPEED_UNKNOWN;
+   }
+}
+
 /* prototype for functions used for SW locks */
 
 /* i40e_common for VF drivers*/
diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 6172f4a4cf71..7738ddab8ad6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -81,12 +81,12 @@ static void i40e_vc_notify_vf_link_state(struct i40e_vf *vf)
if (vf->link_forced) {
pfe.event_data.link_event.link_status = vf->link_up;
pfe.event_data.link_event.link_speed =
-   (vf->link_up ? I40E_LINK_SPEED_40GB : 0);
+   (vf->link_up ? VIRTCHNL_LINK_SPEED_40GB : 0);
} else {
pfe.event_data.link_event.link_status =
ls->link_info & I40E_AQ_LINK_UP;
pfe.event_data.link_event.link_speed =
-   (enum virtchnl_link_speed)ls->link_speed;
+   i40e_virtchnl_link_speed(ls->link_speed);
}
i40e_aq_send_msg_to_vf(hw, abs_vf_id, VIRTCHNL_OP_EVENT,
   0, (u8 *)&pfe, sizeof(pfe), NULL);
-- 
2.14.2



[net-next 09/10] i40e/i40evf: Bump driver versions

2017-11-14 Thread Jeff Kirsher
From: Alice Michael 

Bump the i40e driver from 2.1.14 to 2.3.2.

Bump the i40evf driver from 3.0.1 to 3.2.2

Signed-off-by: Alice Michael 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 4 ++--
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 17e6f64299cf..02880d833c17 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -47,8 +47,8 @@ static const char i40e_driver_string[] =
 #define DRV_KERN "-k"
 
 #define DRV_VERSION_MAJOR 2
-#define DRV_VERSION_MINOR 1
-#define DRV_VERSION_BUILD 14
+#define DRV_VERSION_MINOR 3
+#define DRV_VERSION_BUILD 2
 #define DRV_VERSION __stringify(DRV_VERSION_MAJOR) "." \
 __stringify(DRV_VERSION_MINOR) "." \
 __stringify(DRV_VERSION_BUILD)DRV_KERN
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 45cff725ed31..75dc94f490f3 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -45,8 +45,8 @@ static const char i40evf_driver_string[] =
 #define DRV_KERN "-k"
 
 #define DRV_VERSION_MAJOR 3
-#define DRV_VERSION_MINOR 0
-#define DRV_VERSION_BUILD 1
+#define DRV_VERSION_MINOR 2
+#define DRV_VERSION_BUILD 2
 #define DRV_VERSION __stringify(DRV_VERSION_MAJOR) "." \
 __stringify(DRV_VERSION_MINOR) "." \
 __stringify(DRV_VERSION_BUILD) \
-- 
2.14.2



[net-next 08/10] i40e: Fix for NUP NVM image downgrade failure

2017-11-14 Thread Jeff Kirsher
From: Jacob Keller 

Since commit 96a39aed25e6 ("i40e: Acquire NVM lock before
reads on all devices") we've used the NVM lock
to synchronize NVM reads even on devices which don't strictly
need the lock.

Doing so can cause a regression on older firmware prior to 1.5,
especially when downgrading the firmware.

Fix this by only grabbing the lock if we're running on an X722
device (which requires the lock as it uses the AdminQ to read
the NVM), or if we're currently running 1.5 or newer firmware.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_adminq.c | 6 ++++++
 drivers/net/ethernet/intel/i40e/i40e_common.c | 3 ++-
 drivers/net/ethernet/intel/i40e/i40e_nvm.c    | 8 +++++---
 drivers/net/ethernet/intel/i40e/i40e_type.h   | 1 +
 4 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_adminq.c 
b/drivers/net/ethernet/intel/i40e/i40e_adminq.c
index 9dcb2a961197..9af74253c3f7 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_adminq.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_adminq.c
@@ -613,6 +613,12 @@ i40e_status i40e_init_adminq(struct i40e_hw *hw)
hw->flags |= I40E_HW_FLAG_AQ_PHY_ACCESS_CAPABLE;
}
 
+   /* Newer versions of firmware require lock when reading the NVM */
+   if (hw->aq.api_maj_ver > 1 ||
+   (hw->aq.api_maj_ver == 1 &&
+hw->aq.api_min_ver >= 5))
+   hw->flags |= I40E_HW_FLAG_NVM_READ_REQUIRES_LOCK;
+
/* The ability to RX (not drop) 802.1ad frames was added in API 1.7 */
if (hw->aq.api_maj_ver > 1 ||
(hw->aq.api_maj_ver == 1 &&
diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c 
b/drivers/net/ethernet/intel/i40e/i40e_common.c
index 0203665cb53c..13c79468a6da 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -948,7 +948,8 @@ i40e_status i40e_init_shared_code(struct i40e_hw *hw)
hw->pf_id = (u8)(func_rid & 0x7);
 
if (hw->mac.type == I40E_MAC_X722)
-   hw->flags |= I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE;
+   hw->flags |= I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE |
+I40E_HW_FLAG_NVM_READ_REQUIRES_LOCK;
 
status = i40e_init_nvm(hw);
return status;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_nvm.c 
b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
index 0ccab0a5d717..7689c2ee0d46 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_nvm.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
@@ -328,15 +328,17 @@ static i40e_status __i40e_read_nvm_word(struct i40e_hw 
*hw,
 i40e_status i40e_read_nvm_word(struct i40e_hw *hw, u16 offset,
   u16 *data)
 {
-   i40e_status ret_code;
+   i40e_status ret_code = 0;
 
-   ret_code = i40e_acquire_nvm(hw, I40E_RESOURCE_READ);
+   if (hw->flags & I40E_HW_FLAG_NVM_READ_REQUIRES_LOCK)
+   ret_code = i40e_acquire_nvm(hw, I40E_RESOURCE_READ);
if (ret_code)
return ret_code;
 
ret_code = __i40e_read_nvm_word(hw, offset, data);
 
-   i40e_release_nvm(hw);
+   if (hw->flags & I40E_HW_FLAG_NVM_READ_REQUIRES_LOCK)
+   i40e_release_nvm(hw);
 
return ret_code;
 }
diff --git a/drivers/net/ethernet/intel/i40e/i40e_type.h 
b/drivers/net/ethernet/intel/i40e/i40e_type.h
index 00d4833e9925..0e8568719b4e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_type.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_type.h
@@ -629,6 +629,7 @@ struct i40e_hw {
 #define I40E_HW_FLAG_AQ_SRCTL_ACCESS_ENABLE BIT_ULL(0)
 #define I40E_HW_FLAG_802_1AD_CAPABLE        BIT_ULL(1)
 #define I40E_HW_FLAG_AQ_PHY_ACCESS_CAPABLE  BIT_ULL(2)
+#define I40E_HW_FLAG_NVM_READ_REQUIRES_LOCK BIT_ULL(3)
u64 flags;
 
/* Used in set switch config AQ command */
-- 
2.14.2



[net-next 10/10] i40e: remove redundant initialization of read_size

2017-11-14 Thread Jeff Kirsher
From: Colin Ian King 

Variable read_size is initialized, but the value is never read; it is
instead set inside the do-loop, hence the initialization is redundant
and can be removed. Cleans up the following clang warning:

drivers/net/ethernet/intel/i40e/i40e_nvm.c:390:6: warning: Value stored
to 'read_size' during its initialization is never read

Signed-off-by: Colin Ian King 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_nvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_nvm.c 
b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
index 7689c2ee0d46..425713fb72e5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_nvm.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
@@ -389,7 +389,7 @@ static i40e_status i40e_read_nvm_buffer_aq(struct i40e_hw 
*hw, u16 offset,
   u16 *words, u16 *data)
 {
i40e_status ret_code;
-   u16 read_size = *words;
+   u16 read_size;
bool last_cmd = false;
u16 words_read = 0;
u16 i = 0;
-- 
2.14.2



[net-next 04/10] i40evf: release bit locks in reverse order

2017-11-14 Thread Jeff Kirsher
From: Jacob Keller 

Although not strictly necessary, it is customary to reverse the order in
which we release locks that we acquire. This helps preserve lock
ordering during future refactors, which can help avoid potential
deadlock situations.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 2b135dc29fde..9a945ffef1e1 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -1960,8 +1960,8 @@ static void i40evf_reset_task(struct work_struct *work)
 
adapter->aq_required |= I40EVF_FLAG_AQ_ADD_MAC_FILTER;
adapter->aq_required |= I40EVF_FLAG_AQ_ADD_VLAN_FILTER;
-   clear_bit(__I40EVF_IN_CRITICAL_TASK, &adapter->crit_section);
clear_bit(__I40EVF_IN_CLIENT_TASK, &adapter->crit_section);
+   clear_bit(__I40EVF_IN_CRITICAL_TASK, &adapter->crit_section);
i40evf_misc_irq_enable(adapter);
 
mod_timer(&adapter->watchdog_timer, jiffies + 2);
-- 
2.14.2



[net-next 03/10] i40evf: use spinlock to protect (mac|vlan)_filter_list

2017-11-14 Thread Jeff Kirsher
From: Jacob Keller 

Stop overloading the __I40EVF_IN_CRITICAL_TASK bit lock to protect the
mac_filter_list and vlan_filter_list. Instead, implement a spinlock to
protect these two lists, similar to how we protect the hash in the i40e
PF code.

Ensure that every place where we access the list uses the spinlock to
ensure consistency, and stop holding the critical section around blocks
of code which only need access to the macvlan filter lists.

This refactor helps simplify the locking behavior, and is necessary as
a future refactor to the __I40EVF_IN_CRITICAL_TASK would cause
a deadlock otherwise.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf.h |  4 +-
 drivers/net/ethernet/intel/i40evf/i40evf_main.c| 84 +++---
 .../net/ethernet/intel/i40evf/i40evf_virtchnl.c| 42 +--
 3 files changed, 81 insertions(+), 49 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf.h 
b/drivers/net/ethernet/intel/i40evf/i40evf.h
index de0af521d602..47040ab2e298 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf.h
+++ b/drivers/net/ethernet/intel/i40evf/i40evf.h
@@ -199,6 +199,9 @@ struct i40evf_adapter {
wait_queue_head_t down_waitqueue;
struct i40e_q_vector *q_vectors;
struct list_head vlan_filter_list;
+   struct list_head mac_filter_list;
+   /* Lock to protect accesses to MAC and VLAN lists */
+   spinlock_t mac_vlan_list_lock;
char misc_vector_name[IFNAMSIZ + 9];
int num_active_queues;
int num_req_queues;
@@ -206,7 +209,6 @@ struct i40evf_adapter {
/* TX */
struct i40e_ring *tx_rings;
u32 tx_timeout_count;
-   struct list_head mac_filter_list;
u32 tx_desc_count;
 
/* RX */
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 320408f39a44..2b135dc29fde 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -706,7 +706,8 @@ static void i40evf_configure_rx(struct i40evf_adapter 
*adapter)
  * @adapter: board private structure
  * @vlan: vlan tag
  *
- * Returns ptr to the filter object or NULL
+ * Returns ptr to the filter object or NULL. Must be called while holding the
+ * mac_vlan_list_lock.
  **/
 static struct
 i40evf_vlan_filter *i40evf_find_vlan(struct i40evf_adapter *adapter, u16 vlan)
@@ -731,14 +732,8 @@ static struct
 i40evf_vlan_filter *i40evf_add_vlan(struct i40evf_adapter *adapter, u16 vlan)
 {
struct i40evf_vlan_filter *f = NULL;
-   int count = 50;
 
-   while (test_and_set_bit(__I40EVF_IN_CRITICAL_TASK,
-   &adapter->crit_section)) {
-   udelay(1);
-   if (--count == 0)
-   goto out;
-   }
+   spin_lock_bh(&adapter->mac_vlan_list_lock);
 
f = i40evf_find_vlan(adapter, vlan);
if (!f) {
@@ -755,8 +750,7 @@ i40evf_vlan_filter *i40evf_add_vlan(struct i40evf_adapter 
*adapter, u16 vlan)
}
 
 clearout:
-   clear_bit(__I40EVF_IN_CRITICAL_TASK, &adapter->crit_section);
-out:
+   spin_unlock_bh(&adapter->mac_vlan_list_lock);
return f;
 }
 
@@ -768,21 +762,16 @@ i40evf_vlan_filter *i40evf_add_vlan(struct i40evf_adapter 
*adapter, u16 vlan)
 static void i40evf_del_vlan(struct i40evf_adapter *adapter, u16 vlan)
 {
struct i40evf_vlan_filter *f;
-   int count = 50;
 
-   while (test_and_set_bit(__I40EVF_IN_CRITICAL_TASK,
-   &adapter->crit_section)) {
-   udelay(1);
-   if (--count == 0)
-   return;
-   }
+   spin_lock_bh(&adapter->mac_vlan_list_lock);
 
f = i40evf_find_vlan(adapter, vlan);
if (f) {
f->remove = true;
adapter->aq_required |= I40EVF_FLAG_AQ_DEL_VLAN_FILTER;
}
-   clear_bit(__I40EVF_IN_CRITICAL_TASK, &adapter->crit_section);
+
+   spin_unlock_bh(&adapter->mac_vlan_list_lock);
 }
 
 /**
@@ -824,7 +813,8 @@ static int i40evf_vlan_rx_kill_vid(struct net_device 
*netdev,
  * @adapter: board private structure
  * @macaddr: the MAC address
  *
- * Returns ptr to the filter object or NULL
+ * Returns ptr to the filter object or NULL. Must be called while holding the
+ * mac_vlan_list_lock.
  **/
 static struct
 i40evf_mac_filter *i40evf_find_filter(struct i40evf_adapter *adapter,
@@ -854,26 +844,17 @@ i40evf_mac_filter *i40evf_add_filter(struct 
i40evf_adapter *adapter,
 u8 *macaddr)
 {
struct i40evf_mac_filter *f;
-   int count = 50;
 
if (!macaddr)
return NULL;
 
-   while (test_and_set_bit(__I40EVF_IN_CRITICAL_TASK,
-   &adapter->crit_section)) {
-   udelay(1);
-   if (--count == 0)
-   

[net-next 06/10] i40e: update VFs of link state after GET_VF_RESOURCES

2017-11-14 Thread Jeff Kirsher
From: Jacob Keller 

We currently notify a VF of the link state after ENABLE_QUEUES, which is
the last thing a VF does after being configured. Guests may not actually
ENABLE_QUEUES until they get configured, and thus between driver load
and device configuration the VF may show inaccurate link status.

Fix this by also sending the link state after GET_VF_RESOURCES. Although
we could remove the message following ENABLE_QUEUES, it's not that
significant of a loss, so this patch just keeps both to ensure maximum
compatibility with guests on various OSes.

Specifically, without this patch guests running FreeBSD will display
inaccurate link state until the device is brought up. This is mostly
a cosmetic issue but can be confusing to system administrators.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index f8a794b72462..6172f4a4cf71 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -2748,6 +2748,7 @@ int i40e_vc_process_vf_msg(struct i40e_pf *pf, s16 vf_id, 
u32 v_opcode,
break;
case VIRTCHNL_OP_GET_VF_RESOURCES:
ret = i40e_vc_get_vf_resources_msg(vf, msg);
+   i40e_vc_notify_vf_link_state(vf);
break;
case VIRTCHNL_OP_RESET_VF:
i40e_vc_reset_vf_msg(vf);
-- 
2.14.2



[net-next 01/10] i40e: display priority_xon and priority_xoff stats

2017-11-14 Thread Jeff Kirsher
From: Alice Michael 

Display some more stats that were already being counted, to help users
understand when priority xon/xoff packets are being sent/received

Signed-off-by: Alice Michael 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index dc9b8dcf4a1e..2e5ccddeb60a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -126,6 +126,10 @@ static const struct i40e_stats i40e_gstrings_stats[] = {
I40E_PF_STAT("link_xoff_rx", stats.link_xoff_rx),
I40E_PF_STAT("link_xon_tx", stats.link_xon_tx),
I40E_PF_STAT("link_xoff_tx", stats.link_xoff_tx),
+   I40E_PF_STAT("priority_xon_rx", stats.priority_xon_rx),
+   I40E_PF_STAT("priority_xoff_rx", stats.priority_xoff_rx),
+   I40E_PF_STAT("priority_xon_tx", stats.priority_xon_tx),
+   I40E_PF_STAT("priority_xoff_tx", stats.priority_xoff_tx),
I40E_PF_STAT("rx_size_64", stats.rx_size_64),
I40E_PF_STAT("rx_size_127", stats.rx_size_127),
I40E_PF_STAT("rx_size_255", stats.rx_size_255),
-- 
2.14.2



[net-next 05/10] i40evf: hold the critical task bit lock while opening

2017-11-14 Thread Jeff Kirsher
From: Jacob Keller 

If i40evf_open() is called quickly at the same time as a reset occurs
(such as via ethtool) it is possible for the device to attempt to open
while a reset is in progress. This occurs because the driver was not
holding the critical task bit lock during i40evf_open, nor was it
holding it around the call to i40evf_up_complete() in
i40evf_reset_task().

We didn't hold the lock previously because calls to i40evf_down() would
take the bit lock directly, and this would have caused a deadlock.

To avoid this, we'll move the bit lock handling out of i40evf_down() and
into the callers of this function. Additionally, we'll now hold the bit
lock over the entire set of steps when going up or down, to ensure that
we remain consistent.

Ultimately this causes us to serialize the transitions between down and
up properly, and avoid changing status while we're resetting.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 40 +++--
 1 file changed, 31 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index 9a945ffef1e1..45cff725ed31 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -1035,6 +1035,8 @@ static void i40evf_configure(struct i40evf_adapter 
*adapter)
 /**
  * i40evf_up_complete - Finish the last steps of bringing up a connection
  * @adapter: board private structure
+ *
+ * Expects to be called while holding the __I40EVF_IN_CRITICAL_TASK bit lock.
  **/
 static void i40evf_up_complete(struct i40evf_adapter *adapter)
 {
@@ -1052,6 +1054,8 @@ static void i40evf_up_complete(struct i40evf_adapter 
*adapter)
 /**
  * i40e_down - Shutdown the connection processing
  * @adapter: board private structure
+ *
+ * Expects to be called while holding the __I40EVF_IN_CRITICAL_TASK bit lock.
  **/
 void i40evf_down(struct i40evf_adapter *adapter)
 {
@@ -1061,10 +1065,6 @@ void i40evf_down(struct i40evf_adapter *adapter)
if (adapter->state <= __I40EVF_DOWN_PENDING)
return;
 
-   while (test_and_set_bit(__I40EVF_IN_CRITICAL_TASK,
-   &adapter->crit_section))
-   usleep_range(500, 1000);
-
netif_carrier_off(netdev);
netif_tx_disable(netdev);
adapter->link_up = false;
@@ -1097,7 +1097,6 @@ void i40evf_down(struct i40evf_adapter *adapter)
adapter->aq_required |= I40EVF_FLAG_AQ_DISABLE_QUEUES;
}
 
-   clear_bit(__I40EVF_IN_CRITICAL_TASK, &adapter->crit_section);
mod_timer_pending(&adapter->watchdog_timer, jiffies + 1);
 }
 
@@ -1960,8 +1959,6 @@ static void i40evf_reset_task(struct work_struct *work)
 
adapter->aq_required |= I40EVF_FLAG_AQ_ADD_MAC_FILTER;
adapter->aq_required |= I40EVF_FLAG_AQ_ADD_VLAN_FILTER;
-   clear_bit(__I40EVF_IN_CLIENT_TASK, &adapter->crit_section);
-   clear_bit(__I40EVF_IN_CRITICAL_TASK, &adapter->crit_section);
i40evf_misc_irq_enable(adapter);
 
mod_timer(&adapter->watchdog_timer, jiffies + 2);
@@ -1998,9 +1995,13 @@ static void i40evf_reset_task(struct work_struct *work)
adapter->state = __I40EVF_DOWN;
wake_up(&adapter->down_waitqueue);
}
+   clear_bit(__I40EVF_IN_CLIENT_TASK, &adapter->crit_section);
+   clear_bit(__I40EVF_IN_CRITICAL_TASK, &adapter->crit_section);
 
return;
 reset_err:
+   clear_bit(__I40EVF_IN_CLIENT_TASK, &adapter->crit_section);
+   clear_bit(__I40EVF_IN_CRITICAL_TASK, &adapter->crit_section);
dev_err(&adapter->pdev->dev, "failed to allocate resources during reinit\n");
i40evf_close(netdev);
 }
@@ -2244,8 +2245,14 @@ static int i40evf_open(struct net_device *netdev)
return -EIO;
}
 
-   if (adapter->state != __I40EVF_DOWN)
-   return -EBUSY;
+   while (test_and_set_bit(__I40EVF_IN_CRITICAL_TASK,
+   &adapter->crit_section))
+   usleep_range(500, 1000);
+
+   if (adapter->state != __I40EVF_DOWN) {
+   err = -EBUSY;
+   goto err_unlock;
+   }
 
/* allocate transmit descriptors */
err = i40evf_setup_all_tx_resources(adapter);
@@ -2269,6 +2276,8 @@ static int i40evf_open(struct net_device *netdev)
 
i40evf_irq_enable(adapter, true);
 
+   clear_bit(__I40EVF_IN_CRITICAL_TASK, &adapter->crit_section);
+
return 0;
 
 err_req_irq:
@@ -2278,6 +2287,8 @@ static int i40evf_open(struct net_device *netdev)
i40evf_free_all_rx_resources(adapter);
 err_setup_tx:
i40evf_free_all_tx_resources(adapter);
+err_unlock:
+   clear_bit(__I40EVF_IN_CRITICAL_TASK, &adapter->crit_section);
 
return err;
 }
@@ -2301,6 +2312,9 @@ static int i40evf_close(struct net_device *netdev)
if (adapter->state <= __I40EVF_DOWN_PENDING)

[net-next 02/10] i40evf: don't rely on netif_running() outside rtnl_lock()

2017-11-14 Thread Jeff Kirsher
From: Jacob Keller 

In i40evf_reset_task we use netif_running() to determine whether or not
the device is currently up. This allows us to properly free queue memory
and shut down things before we request the hardware reset.

It turns out that netif_running() may return true before the device is
fully up, as the kernel core code sets __LINK_STATE_START prior to
calling .ndo_open. Since we're not holding the rtnl_lock(), it's
possible that the driver's i40evf_open handler
function is currently being called while we're resetting.

We can't simply hold the rtnl_lock() while checking netif_running() as
this could cause a deadlock with the i40evf_open() function.
Additionally, we can't avoid the deadlock by holding the rtnl_lock()
over the whole reset path, as this essentially serializes all resets,
and can cause massive delays if we have multiple VFs on a system.

Instead, let's just check our own internal __I40EVF_RUNNING state
field. This allows us to ensure that the state is correct and is only
set after we've finished bringing the device up.

Without this change we might free data structures about device queues
and other memory before they've been fully allocated.

Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40evf/i40evf_main.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c 
b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index ca2ebdbd24d7..320408f39a44 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -1796,7 +1796,11 @@ static void i40evf_disable_vf(struct i40evf_adapter 
*adapter)
 
adapter->flags |= I40EVF_FLAG_PF_COMMS_FAILED;
 
-   if (netif_running(adapter->netdev)) {
+   /* We don't use netif_running() because it may be true prior to
+* ndo_open() returning, so we can't assume it means all our open
+* tasks have finished, since we're not holding the rtnl_lock here.
+*/
+   if (adapter->state == __I40EVF_RUNNING) {
set_bit(__I40E_VSI_DOWN, adapter->vsi.state);
netif_carrier_off(adapter->netdev);
netif_tx_disable(adapter->netdev);
@@ -1854,6 +1858,7 @@ static void i40evf_reset_task(struct work_struct *work)
struct i40evf_mac_filter *f;
u32 reg_val;
int i = 0, err;
+   bool running;
 
while (test_and_set_bit(__I40EVF_IN_CLIENT_TASK,
&adapter->crit_section))
@@ -1913,7 +1918,13 @@ static void i40evf_reset_task(struct work_struct *work)
}
 
 continue_reset:
-   if (netif_running(netdev)) {
+   /* We don't use netif_running() because it may be true prior to
+* ndo_open() returning, so we can't assume it means all our open
+* tasks have finished, since we're not holding the rtnl_lock here.
+*/
+   running = (adapter->state == __I40EVF_RUNNING);
+
+   if (running) {
netif_carrier_off(netdev);
netif_tx_stop_all_queues(netdev);
adapter->link_up = false;
@@ -1964,7 +1975,10 @@ static void i40evf_reset_task(struct work_struct *work)
 
mod_timer(&adapter->watchdog_timer, jiffies + 2);
 
-   if (netif_running(adapter->netdev)) {
+   /* We were running when the reset started, so we need to restore some
+* state here.
+*/
+   if (running) {
/* allocate transmit descriptors */
err = i40evf_setup_all_tx_resources(adapter);
if (err)
-- 
2.14.2



[net-next 00/10][pull request] 40GbE Intel Wired LAN Driver Updates 2017-11-14

2017-11-14 Thread Jeff Kirsher
This series contains updates to i40e and i40evf only.

Alice simply adds the displaying of additional stats that were already
being counted.

Jake provides several fixes and updates starting with fix to i40evf
where we might free data structures about device queues and other memory
before they have been fully allocated, so instead of using netif_running()
to determine whether or not the device is currently up, check our own
internal state __I40EVF_RUNNING field.  Implements a spinlock to protect
the mac_filter_list and vlan_filter_list, instead of overloading the
bit lock __I40EVF_IN_CRITICAL_TASK.  Serializes the transitions between
down and up properly and avoids changing status while we are resetting.
Fixed a regression on older firmware prior to 1.5 by grabbing the NVM
lock to synchronize NVM reads only on newer firmware or on X722
devices.

Colin Ian King removes a redundant variable initialization.

The following are changes since commit 4497478c60c04d2bf37082e27fc98f4f835db96b:
  net: stmmac: fix LPI transitioning for dwmac4
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE

Alice Michael (2):
  i40e: display priority_xon and priority_xoff stats
  i40e/i40evf: Bump driver versions

Colin Ian King (1):
  i40e: remove redundant initialization of read_size

Jacob Keller (7):
  i40evf: don't rely on netif_running() outside rtnl_lock()
  i40evf: use spinlock to protect (mac|vlan)_filter_list
  i40evf: release bit locks in reverse order
  i40evf: hold the critical task bit lock while opening
  i40e: update VFs of link state after GET_VF_RESOURCES
  i40e: add helper conversion function for link_speed
  i40e: Fix for NUP NVM image downgrade failure

 drivers/net/ethernet/intel/i40e/i40e_adminq.c  |   6 +
 drivers/net/ethernet/intel/i40e/i40e_common.c  |   3 +-
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |   4 +
 drivers/net/ethernet/intel/i40e/i40e_main.c|   4 +-
 drivers/net/ethernet/intel/i40e/i40e_nvm.c |  10 +-
 drivers/net/ethernet/intel/i40e/i40e_prototype.h   |  31 +
 drivers/net/ethernet/intel/i40e/i40e_type.h|   1 +
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c |   5 +-
 drivers/net/ethernet/intel/i40evf/i40evf.h |   4 +-
 drivers/net/ethernet/intel/i40evf/i40evf_main.c| 148 +
 .../net/ethernet/intel/i40evf/i40evf_virtchnl.c|  42 +-
 11 files changed, 186 insertions(+), 72 deletions(-)

-- 
2.14.2


