from:"Kirill Tkhai"

[PATCH net-next 0/3] Refactorings on af_inet pernet initialization

2018-09-20 Thread Kirill Tkhai

This patch set makes several cleanups around inet_init_net().

---

Cong Wang (1):
  ipv4: initialize ra_mutex in inet_init_net()

Kirill Tkhai (2):
  net: Remove inet_exit_net()
  net: Register af_inet_ops earlier


 net/core/net_namespace.c |1 -
 net/ipv4/af_inet.c   |   13 +
 2 files changed, 5 insertions(+), 9 deletions(-)

--
Signed-off-by: Kirill Tkhai

[PATCH 2/3] net: Register af_inet_ops earlier

2018-09-20 Thread Kirill Tkhai

This function just initializes locks and defaults.
Let register it before other pernet operation,
since some of them potentially may relay on that.

Signed-off-by: Kirill Tkhai 
---
 net/ipv4/af_inet.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index f4ecbe0aaf1a..bbd3a072ffea 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1938,6 +1938,9 @@ static int __init inet_init(void)
for (q = inetsw_array; q < _array[INETSW_ARRAY_LEN]; ++q)
inet_register_protosw(q);
 
+   if (init_inet_pernet_ops())
+   pr_crit("%s: Cannot init ipv4 inet pernet ops\n", __func__);
+
/*
 *  Set the ARP module up
 */
@@ -1975,9 +1978,6 @@ static int __init inet_init(void)
if (ip_mr_init())
pr_crit("%s: Cannot init ipv4 mroute\n", __func__);
 #endif
-
-   if (init_inet_pernet_ops())
-   pr_crit("%s: Cannot init ipv4 inet pernet ops\n", __func__);
/*
 *  Initialise per-cpu ipv4 mibs
 */

[PATCH 3/3] ipv4: initialize ra_mutex in inet_init_net()

2018-09-20 Thread Kirill Tkhai

From: Cong Wang 

ra_mutex is a IPv4 specific mutex, it is inside struct netns_ipv4,
but its initialization is in the generic netns code, setup_net().

Move it to IPv4 specific net init code, inet_init_net().

Fixes: d9ff3049739e ("net: Replace ip_ra_lock with per-net mutex")
Signed-off-by: Cong Wang 
Acked-by: Kirill Tkhai 
---
 net/core/net_namespace.c |1 -
 net/ipv4/af_inet.c   |2 ++
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 670c84b1bfc2..b272ccfcbf63 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -308,7 +308,6 @@ static __net_init int setup_net(struct net *net, struct 
user_namespace *user_ns)
net->user_ns = user_ns;
idr_init(>netns_ids);
spin_lock_init(>nsid_lock);
-   mutex_init(>ipv4.ra_mutex);
 
list_for_each_entry(ops, _list, list) {
error = ops_init(ops, net);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index bbd3a072ffea..d4623144e237 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1818,6 +1818,8 @@ static __net_init int inet_init_net(struct net *net)
net->ipv4.sysctl_igmp_llm_reports = 1;
net->ipv4.sysctl_igmp_qrv = 2;
 
+   mutex_init(>ipv4.ra_mutex);
+
return 0;
 }

[PATCH 1/3] net: Remove inet_exit_net()

2018-09-20 Thread Kirill Tkhai

This function does nothing, and since ops_exit_list()
checks for NULL ->exit method, we do not need stub here.

Signed-off-by: Kirill Tkhai 
---
 net/ipv4/af_inet.c |5 -
 1 file changed, 5 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1fbe2f815474..f4ecbe0aaf1a 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1821,13 +1821,8 @@ static __net_init int inet_init_net(struct net *net)
return 0;
 }
 
-static __net_exit void inet_exit_net(struct net *net)
-{
-}
-
 static __net_initdata struct pernet_operations af_inet_ops = {
.init = inet_init_net,
-   .exit = inet_exit_net,
 };
 
 static int __init init_inet_pernet_ops(void)

Re: [Patch net-next] ipv4: initialize ra_mutex in inet_init_net()

2018-09-20 Thread Kirill Tkhai

On 20.09.2018 0:28, Cong Wang wrote:
> On Wed, Sep 19, 2018 at 1:25 AM Kirill Tkhai  wrote:
>>
>> On 18.09.2018 23:17, Cong Wang wrote:
>>> On Mon, Sep 17, 2018 at 12:25 AM Kirill Tkhai  wrote:
>>>> In inet_init() the order of registration is:
>>>>
>>>> ip_mr_init();
>>>> init_inet_pernet_ops();
>>>>
>>>> This means, ipmr_net_ops pernet operations are before af_inet_ops
>>>> in pernet_list. So, there is a theoretical probability, sometimes
>>>> in the future, we will have a problem during a fail of net initialization.
>>>>
>>>> Say,
>>>>
>>>> setup_net():
>>>> ipmr_net_ops->init() returns 0
>>>> xxx->init()  returns error
>>>> and then we do:
>>>> ipmr_net_ops->exit(),
>>>>
>>>> which could touch ra_mutex (theoretically).
>>>
>>> How could ra_mutex be touched in this scenario?
>>>
>>> ra_mutex is only used in ip_ra_control() which is called
>>> only by {get,set}sockopt(). I don't see anything related
>>> to netns exit() path here.
>>
>> Currently, it is not touched. But it's an ordinary practice,
>> someone closes sockets in pernet ->exit methods. For example,
>> we close percpu icmp sockets in icmp_sk_exit(), which are
>> also of RAW type, and there is also called ip_ra_control()
>> for them. Yes, they differ by their protocol; icmp sockets
>> are of IPPROTO_ICMP protocol, while ip_ra_control() acts
>> on IPPROTO_RAW sockets, but it's not good anyway. This does
>> not look reliable for the future. In case of someone changes
>> something here, we may do not notice this for the long time,
>> while some users will meet bugs on their places.
> 
> First of all, we only consider current code base. Even if you
> really planned to changed this in the future, it would be still your
> responsibility to take care of it. Why? Simple, FIFO. My patch
> comes ahead of any future changes here, obviously.
> 
> Secondly, it is certainly not hard to notice. I am pretty sure
> you would get a warning/crash if this bug triggered.
> 
> 
> 
>>
>> Problems on error paths is not easy to detect on testing,
>> while user may meet them. We had issue of same type with
>> uninitialized xfrm_policy_lock. It was introduced in 2013,
>> while the problem was found only in 2017:
> 
> Been there, done that, I've fixed multiple IPv6 init code
> error path. This is not where we disagree, by the way.
> 
> 
>>
>> introduced by 283bc9f35bbb
>> fixed  by c28a45cb
>>
>> (Last week I met it on RH7 kernel, which still has no a fix.
>>  But this talk is not about distribution kernels, just about
>>  the way).
>>
>> I just want to say if someone makes some creativity on top
>> of this code, it will be to more friendly from us to him/her
>> to not force this person to think about such not obvious details,
>> but just to implement nice architecture right now.
> 
> You keep saying nice architecture, how did you architect
> ipv4.ra_mutex into net/core/net_namespace.c? It is apparently
> not nice.
> 
> Good luck with your future.
> 
> I am tired, let's just drop it. I have no interest to fix your mess.

You added me to CC, so you probably want to know my opinion about this.
Since it's not a real problem fix, but just a refactoring, I say you
my opinion, how this refactoring may be made better. If you don't want
to know my opinion, you may consider not to CC me.

Just this, not a reason to take offense.

Thanks,
Kirill

Re: [Patch net-next] ipv4: initialize ra_mutex in inet_init_net()

2018-09-19 Thread Kirill Tkhai

On 18.09.2018 23:17, Cong Wang wrote:
> On Mon, Sep 17, 2018 at 12:25 AM Kirill Tkhai  wrote:
>> In inet_init() the order of registration is:
>>
>> ip_mr_init();
>> init_inet_pernet_ops();
>>
>> This means, ipmr_net_ops pernet operations are before af_inet_ops
>> in pernet_list. So, there is a theoretical probability, sometimes
>> in the future, we will have a problem during a fail of net initialization.
>>
>> Say,
>>
>> setup_net():
>> ipmr_net_ops->init() returns 0
>> xxx->init()  returns error
>> and then we do:
>> ipmr_net_ops->exit(),
>>
>> which could touch ra_mutex (theoretically).
> 
> How could ra_mutex be touched in this scenario?
> 
> ra_mutex is only used in ip_ra_control() which is called
> only by {get,set}sockopt(). I don't see anything related
> to netns exit() path here.

Currently, it is not touched. But it's an ordinary practice,
someone closes sockets in pernet ->exit methods. For example,
we close percpu icmp sockets in icmp_sk_exit(), which are
also of RAW type, and there is also called ip_ra_control()
for them. Yes, they differ by their protocol; icmp sockets
are of IPPROTO_ICMP protocol, while ip_ra_control() acts
on IPPROTO_RAW sockets, but it's not good anyway. This does
not look reliable for the future. In case of someone changes
something here, we may do not notice this for the long time,
while some users will meet bugs on their places.

Problems on error paths is not easy to detect on testing,
while user may meet them. We had issue of same type with
uninitialized xfrm_policy_lock. It was introduced in 2013,
while the problem was found only in 2017:

introduced by 283bc9f35bbb
fixed  by c28a45cb

(Last week I met it on RH7 kernel, which still has no a fix.
 But this talk is not about distribution kernels, just about
 the way).

I just want to say if someone makes some creativity on top
of this code, it will be to more friendly from us to him/her
to not force this person to think about such not obvious details,
but just to implement nice architecture right now.

Thanks,
Kirill

Re: [Patch net-next] ipv4: initialize ra_mutex in inet_init_net()

2018-09-17 Thread Kirill Tkhai

On 14.09.2018 23:32, Cong Wang wrote:
> ra_mutex is a IPv4 specific mutex, it is inside struct netns_ipv4,
> but its initialization is in the generic netns code, setup_net().
> 
> Move it to IPv4 specific net init code, inet_init_net().
> 
> Fixes: d9ff3049739e ("net: Replace ip_ra_lock with per-net mutex")
> Cc: Kirill Tkhai 
> Signed-off-by: Cong Wang 
> ---
>  net/core/net_namespace.c | 1 -
>  net/ipv4/af_inet.c   | 2 ++
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 670c84b1bfc2..b272ccfcbf63 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -308,7 +308,6 @@ static __net_init int setup_net(struct net *net, struct 
> user_namespace *user_ns)
>   net->user_ns = user_ns;
>   idr_init(>netns_ids);
>   spin_lock_init(>nsid_lock);
> - mutex_init(>ipv4.ra_mutex);
>  
>   list_for_each_entry(ops, _list, list) {
>   error = ops_init(ops, net);
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 20fda8fb8ffd..57b7bffb93e5 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1817,6 +1817,8 @@ static __net_init int inet_init_net(struct net *net)
>   net->ipv4.sysctl_igmp_llm_reports = 1;
>   net->ipv4.sysctl_igmp_qrv = 2;
>  
> + mutex_init(>ipv4.ra_mutex);
> +

In inet_init() the order of registration is:

ip_mr_init();
init_inet_pernet_ops();

This means, ipmr_net_ops pernet operations are before af_inet_ops
in pernet_list. So, there is a theoretical probability, sometimes
in the future, we will have a problem during a fail of net initialization.

Say,

setup_net():
ipmr_net_ops->init() returns 0
xxx->init()  returns error
and then we do:
ipmr_net_ops->exit(),

which could touch ra_mutex (theoretically).

Your patch is OK, but since you do this, we may also swap the order
of registration of ipmr_net_ops and af_inet_ops better too.

Kirill

Re: [PATCH net-next 03/13] net: sched: extend Qdisc with rcu

2018-09-06 Thread Kirill Tkhai

On 06.09.2018 11:30, Eric Dumazet wrote:
> 
> 
> On 09/06/2018 12:58 AM, Vlad Buslov wrote:
> 
> ...
> 
>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> index 18e22a5a6550..239c73f29471 100644
>> --- a/include/net/sch_generic.h
>> +++ b/include/net/sch_generic.h
>> @@ -90,6 +90,7 @@ struct Qdisc {
>>  struct gnet_stats_queue __percpu *cpu_qstats;
>>  int padded;
>>  refcount_t  refcnt;
>> +struct rcu_head rcu;
>>  
>>  /*
>>   * For performance sake on SMP, we put highly modified fields at the end
> 
> Probably better to move this at the end of struct Qdisc,
> not risking unexpected performance regressions in fast path.

Do you mean regressions on UP? On SMP it looks like this field
fits in the unused gap created by:

struct sk_buff_head gso_skb cacheline_aligned_in_smp;

Kirill

Re: [PATCH] net: Fix device name resolving crash in default_device_exit()

2018-06-22 Thread Kirill Tkhai

On 21.06.2018 18:28, David Ahern wrote:
> On 6/21/18 4:03 AM, Kirill Tkhai wrote:
>>> This patch does not remove the BUG, so does not really solve the
>>> problem. ie., it is fairly trivial to write a script (32k dev%d named
>>> devices in init_net) that triggers it again, so your commit subject and
>>> commit log are not correct with the references to 'fixing the problem'.
>>
>> 1)I'm not agree with you and I don't think removing the BUG() is a good idea.
>> This function is called from the place, where it must not fail. But it can
>> fail, and the problem with name is not the only reason of this happens.
>> We can't continue further pernet_operations in case of a problem happened
>> in default_device_exit(), and we can't remove the BUG() before this function
>> becomes of void type. But we are not going to make it of void type. So
>> we can't remove the BUG().
> 
> You missed my point: that the function can still fail means you are not
> "fixing" the problem, only delaying it.

Till the function is of int type and it can fail, we can't remove the BUG().
And this does not connected with name resolution.

>>
>> 2)In case of the script is trivial, can't you just post it here to show
>> what type of devices you mean? Is there real problem or this is
>> a theoretical thinking?
> 
> Current code:
> 
> # ip li sh dev eth2
> 4: eth2:  mtu 1500 qdisc mq state UP
> mode DEFAULT group default qlen 1000
> link/ether 02:e0:f9:46:64:80 brd ff:ff:ff:ff:ff:ff
> # ip netns add fubar
> # ip li set eth2 netns fubar
> # ip li add eth2 type dummy
> # ip li add dev4 type dummy
> # ip netns del fubar
> --> BUG
> kernel:[78079.127748] default_device_exit: failed to move eth2 to
> init_net: -17
> 
> 
> With your patch:
> 
> # ip li sh dev eth2
> 4: eth2:  mtu 1500 qdisc mq state UP
> mode DEFAULT group default qlen 1000
> link/ether 02:e0:f9:46:64:80 brd ff:ff:ff:ff:ff:ff
> # ip netns add fubar
> # ip li set eth2 netns fubar
> # ip li add eth2 type dummy
> # for n in $(seq 0 $((32*1024))); do
>   echo "li add dev${n} type dummy"
>   done > ip.batch
> # ip -batch ip.batch
> # ip netns del fubar
> --> BUG
> kernel:[   25.800024] default_device_exit: failed to move eth2 to
> init_net: -17

Yeah, this has a sense.

>>
>> All virtual devices I see have rtnl_link_ops, so that they just destroyed
>> in default_device_exit_batch(). According to physical devices, it's difficult
>> to imagine a node with 32k physical devices, and if someone tried to deploy
>> them it may meet problems not only in this place.
> 
> Nothing says it has to be a physical device. It is only checking for a name.
> 
>>
>>> The change does provide more variability in naming and reduces the
>>> likelihood of not being able to push a device back to init_net.
>>
>> No, it provides. With the patch one may move real device to a container,
>> and allow to do with the device anything including changing of device
>> index. Then, the destruction of the container does not resilt a kernel
>> panic just because of two devices have the same index.

Re: [PATCH] net: Fix device name resolving crash in default_device_exit()

2018-06-21 Thread Kirill Tkhai

On 20.06.2018 20:15, David Ahern wrote:
> On 6/20/18 2:57 AM, Kirill Tkhai wrote:
>> From: Kirill Tkhai 
>>
>> The following script makes kernel to crash since it can't obtain
>> a name for a device, when the name is occupied by another device:
>>
>> #!/bin/bash
>> ifconfig eth0 down
>> ifconfig eth1 down
>> index=`cat /sys/class/net/eth1/ifindex`
>> ip link set eth1 name dev$index
>> unshare -n sleep 1h &
>> pid=$!
>> while [[ "`readlink /proc/self/ns/net`" == "`readlink /proc/$pid/ns/net`" 
>> ]]; do continue; done
>> ip link set dev$index netns $pid
>> ip link set eth0 name dev$index
>> kill -9 $pid
>>
>> Kernel messages:
>>
>> virtio_net virtio1 dev3: renamed from eth1
>> virtio_net virtio0 dev3: renamed from eth0
>> default_device_exit: failed to move dev3 to init_net: -17
>> [ cut here ]
>> kernel BUG at net/core/dev.c:8978!
>> invalid opcode:  [#1] PREEMPT SMP
>> CPU: 1 PID: 276 Comm: kworker/u8:3 Not tainted 4.17.0+ #292
>> Workqueue: netns cleanup_net
>> RIP: 0010:default_device_exit+0x9c/0xb0
>> [stack trace snipped]
>>
>> This patch gives more variability during choosing new name
>> of device and fixes the problem.
>>
>> Signed-off-by: Kirill Tkhai 
>> ---
>>
>> Since there is no suggestions how to fix this in another way, I'm resending 
>> the patch.
> 
> This patch does not remove the BUG, so does not really solve the
> problem. ie., it is fairly trivial to write a script (32k dev%d named
> devices in init_net) that triggers it again, so your commit subject and
> commit log are not correct with the references to 'fixing the problem'.

1)I'm not agree with you and I don't think removing the BUG() is a good idea.
This function is called from the place, where it must not fail. But it can
fail, and the problem with name is not the only reason of this happens.
We can't continue further pernet_operations in case of a problem happened
in default_device_exit(), and we can't remove the BUG() before this function
becomes of void type. But we are not going to make it of void type. So
we can't remove the BUG().

2)In case of the script is trivial, can't you just post it here to show
what type of devices you mean? Is there real problem or this is
a theoretical thinking?

All virtual devices I see have rtnl_link_ops, so that they just destroyed
in default_device_exit_batch(). According to physical devices, it's difficult
to imagine a node with 32k physical devices, and if someone tried to deploy
them it may meet problems not only in this place.

> The change does provide more variability in naming and reduces the
> likelihood of not being able to push a device back to init_net.

No, it provides. With the patch one may move real device to a container,
and allow to do with the device anything including changing of device
index. Then, the destruction of the container does not resilt a kernel
panic just because of two devices have the same index.

Kirill

[PATCH] net: Fix device name resolving crash in default_device_exit()

2018-06-20 Thread Kirill Tkhai

From: Kirill Tkhai 

The following script makes kernel to crash since it can't obtain
a name for a device, when the name is occupied by another device:

#!/bin/bash
ifconfig eth0 down
ifconfig eth1 down
index=`cat /sys/class/net/eth1/ifindex`
ip link set eth1 name dev$index
unshare -n sleep 1h &
pid=$!
while [[ "`readlink /proc/self/ns/net`" == "`readlink /proc/$pid/ns/net`" ]]; 
do continue; done
ip link set dev$index netns $pid
ip link set eth0 name dev$index
kill -9 $pid

Kernel messages:

virtio_net virtio1 dev3: renamed from eth1
virtio_net virtio0 dev3: renamed from eth0
default_device_exit: failed to move dev3 to init_net: -17
[ cut here ]
kernel BUG at net/core/dev.c:8978!
invalid opcode:  [#1] PREEMPT SMP
CPU: 1 PID: 276 Comm: kworker/u8:3 Not tainted 4.17.0+ #292
Workqueue: netns cleanup_net
RIP: 0010:default_device_exit+0x9c/0xb0
[stack trace snipped]

This patch gives more variability during choosing new name
of device and fixes the problem.

Signed-off-by: Kirill Tkhai 
---

Since there is no suggestions how to fix this in another way, I'm resending the 
patch.

 net/core/dev.c |4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 6e18242a1cae..6c9b9303ded6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8959,7 +8959,6 @@ static void __net_exit default_device_exit(struct net 
*net)
rtnl_lock();
for_each_netdev_safe(net, dev, aux) {
int err;
-   char fb_name[IFNAMSIZ];
 
/* Ignore unmoveable devices (i.e. loopback) */
if (dev->features & NETIF_F_NETNS_LOCAL)
@@ -8970,8 +8969,7 @@ static void __net_exit default_device_exit(struct net 
*net)
continue;
 
/* Push remaining network devices to init_net */
-   snprintf(fb_name, IFNAMSIZ, "dev%d", dev->ifindex);
-   err = dev_change_net_namespace(dev, _net, fb_name);
+   err = dev_change_net_namespace(dev, _net, "dev%d");
if (err) {
pr_emerg("%s: failed to move %s to init_net: %d\n",
 __func__, dev->name, err);

Re: [PATCH] net: Fix device name resolving crash in default_device_exit()

2018-06-18 Thread Kirill Tkhai

On 18.06.2018 14:21, Kirill Tkhai wrote:
> On 17.06.2018 21:58, David Ahern wrote:
>> On 6/15/18 3:44 AM, Kirill Tkhai wrote:
>>> Hm, but is this a likely case, when real device is moved to net ns, so it
>>> requires moving to init_net back? It seems the most devices moved to 
>>> !init_net
>>> are virtual and they just destroyed in default_device_exit_batch(). Or we 
>>> have
>>> more devices to care here?
>>>
>>> I don't much want to insert here something like below:
>>>
>>> if (__dev_get_by_name(_net, dev->name))
>>> snprintf(fb_name, IFNAMSIZ, "dev%d", dev->ifindex);
>>> err = dev_change_net_namespace(dev, _net, "dev%d");
>>>
>>> because dev_change_net_namespace() is generic interface and it's used not 
>>> only here,
>>> and this will crumble the code in corner cases.
>>>
>>> Maybe you have better ideas about this?
>>
>> There are a lot of use cases these days (e.g., switch NOS) with 1000's
>> (10's of 1000's) of netdevices. On top of that support for port netdevs
>> in a namespace to create virtual switches needs to happen (and I suspect
>> will happen in the next few years). That becomes one example where
>> netdevices representing physical ports can be pushed back to init_net.
> 
> Oh, then we really need to do something with rtnl_mutex. Otherwise
> this will stop working at all.
> 
>> That said, not many easy options at the moment for the bug you are fixing.
>>
>> Further, panic'ing a node because the move back to init_net fails is
>> just wrong.
> 
> So, let's fix it for now like in the patch to avoid the panic. Then we
> can rework this in generic way to make the generic fallback name for moved
> devices. Maybe, something like to give all moved device a fallback name
> like "__moved--".

Just to clarify, I mean that this small fix will be easy to backport to stable.

Kirill

Re: [PATCH] net: Fix device name resolving crash in default_device_exit()

2018-06-18 Thread Kirill Tkhai

On 17.06.2018 21:58, David Ahern wrote:
> On 6/15/18 3:44 AM, Kirill Tkhai wrote:
>> Hm, but is this a likely case, when real device is moved to net ns, so it
>> requires moving to init_net back? It seems the most devices moved to 
>> !init_net
>> are virtual and they just destroyed in default_device_exit_batch(). Or we 
>> have
>> more devices to care here?
>>
>> I don't much want to insert here something like below:
>>
>>  if (__dev_get_by_name(_net, dev->name))
>>  snprintf(fb_name, IFNAMSIZ, "dev%d", dev->ifindex);
>>  err = dev_change_net_namespace(dev, _net, "dev%d");
>>
>> because dev_change_net_namespace() is generic interface and it's used not 
>> only here,
>> and this will crumble the code in corner cases.
>>
>> Maybe you have better ideas about this?
> 
> There are a lot of use cases these days (e.g., switch NOS) with 1000's
> (10's of 1000's) of netdevices. On top of that support for port netdevs
> in a namespace to create virtual switches needs to happen (and I suspect
> will happen in the next few years). That becomes one example where
> netdevices representing physical ports can be pushed back to init_net.

Oh, then we really need to do something with rtnl_mutex. Otherwise
this will stop working at all.

> That said, not many easy options at the moment for the bug you are fixing.
> 
> Further, panic'ing a node because the move back to init_net fails is
> just wrong.

So, let's fix it for now like in the patch to avoid the panic. Then we
can rework this in generic way to make the generic fallback name for moved
devices. Maybe, something like to give all moved device a fallback name
like "__moved--".

Kirill

Re: [PATCH] net: Fix device name resolving crash in default_device_exit()

2018-06-15 Thread Kirill Tkhai

On 14.06.2018 20:11, David Ahern wrote:
> On 6/14/18 6:38 AM, Kirill Tkhai wrote:
>> The following script makes kernel to crash since it can't obtain
>> a name for a device, when the name is occupied by another device:
>>
>> #!/bin/bash
>> ifconfig eth0 down
>> ifconfig eth1 down
>> index=`cat /sys/class/net/eth1/ifindex`
>> ip link set eth1 name dev$index
>> unshare -n sleep 1h &
>> pid=$!
>> while [[ "`readlink /proc/self/ns/net`" == "`readlink /proc/$pid/ns/net`" 
>> ]]; do continue; done
>> ip link set dev$index netns $pid
>> ip link set eth0 name dev$index
>> kill -9 $pid
>>
>> Kernel messages:
>>
>> virtio_net virtio1 dev3: renamed from eth1
>> virtio_net virtio0 dev3: renamed from eth0
>> default_device_exit: failed to move dev3 to init_net: -17
>> [ cut here ]
>> kernel BUG at net/core/dev.c:8978!
>> invalid opcode:  [#1] PREEMPT SMP
>> CPU: 1 PID: 276 Comm: kworker/u8:3 Not tainted 4.17.0+ #292
>> Workqueue: netns cleanup_net
>> RIP: 0010:default_device_exit+0x9c/0xb0
>> [stack trace snipped]
>>
>> This patch gives more variability during choosing new name
>> of device and fixes the problem.
>>
>> Signed-off-by: Kirill Tkhai 
>> ---
>>  net/core/dev.c |4 +---
>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 6e18242a1cae..6c9b9303ded6 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -8959,7 +8959,6 @@ static void __net_exit default_device_exit(struct net 
>> *net)
>>  rtnl_lock();
>>  for_each_netdev_safe(net, dev, aux) {
>>  int err;
>> -char fb_name[IFNAMSIZ];
>>  
>>  /* Ignore unmoveable devices (i.e. loopback) */
>>  if (dev->features & NETIF_F_NETNS_LOCAL)
>> @@ -8970,8 +8969,7 @@ static void __net_exit default_device_exit(struct net 
>> *net)
>>  continue;
>>  
>>  /* Push remaining network devices to init_net */
>> -snprintf(fb_name, IFNAMSIZ, "dev%d", dev->ifindex);
>> -err = dev_change_net_namespace(dev, _net, fb_name);
>> +err = dev_change_net_namespace(dev, _net, "dev%d");
>>  if (err) {
>>  pr_emerg("%s: failed to move %s to init_net: %d\n",
>>   __func__, dev->name, err);
>>
> 
> This could cause repeated looping over __dev_alloc_name. If init_net has
> a large number of devices, it is going to be a performance bottleneck.

Hm, but is this a likely case, when real device is moved to net ns, so it
requires moving to init_net back? It seems the most devices moved to !init_net
are virtual and they just destroyed in default_device_exit_batch(). Or we have
more devices to care here?

I don't much want to insert here something like below:

if (__dev_get_by_name(_net, dev->name))
snprintf(fb_name, IFNAMSIZ, "dev%d", dev->ifindex);
err = dev_change_net_namespace(dev, _net, "dev%d");

because dev_change_net_namespace() is generic interface and it's used not only 
here,
and this will crumble the code in corner cases.

Maybe you have better ideas about this?

Kirill

[PATCH] net: Fix device name resolving crash in default_device_exit()

2018-06-14 Thread Kirill Tkhai

The following script makes kernel to crash since it can't obtain
a name for a device, when the name is occupied by another device:

#!/bin/bash
ifconfig eth0 down
ifconfig eth1 down
index=`cat /sys/class/net/eth1/ifindex`
ip link set eth1 name dev$index
unshare -n sleep 1h &
pid=$!
while [[ "`readlink /proc/self/ns/net`" == "`readlink /proc/$pid/ns/net`" ]]; 
do continue; done
ip link set dev$index netns $pid
ip link set eth0 name dev$index
kill -9 $pid

Kernel messages:

virtio_net virtio1 dev3: renamed from eth1
virtio_net virtio0 dev3: renamed from eth0
default_device_exit: failed to move dev3 to init_net: -17
[ cut here ]
kernel BUG at net/core/dev.c:8978!
invalid opcode:  [#1] PREEMPT SMP
CPU: 1 PID: 276 Comm: kworker/u8:3 Not tainted 4.17.0+ #292
Workqueue: netns cleanup_net
RIP: 0010:default_device_exit+0x9c/0xb0
[stack trace snipped]

This patch gives more variability during choosing new name
of device and fixes the problem.

Signed-off-by: Kirill Tkhai 
---
 net/core/dev.c |4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 6e18242a1cae..6c9b9303ded6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8959,7 +8959,6 @@ static void __net_exit default_device_exit(struct net 
*net)
rtnl_lock();
for_each_netdev_safe(net, dev, aux) {
int err;
-   char fb_name[IFNAMSIZ];
 
/* Ignore unmoveable devices (i.e. loopback) */
if (dev->features & NETIF_F_NETNS_LOCAL)
@@ -8970,8 +8969,7 @@ static void __net_exit default_device_exit(struct net 
*net)
continue;
 
/* Push remaining network devices to init_net */
-   snprintf(fb_name, IFNAMSIZ, "dev%d", dev->ifindex);
-   err = dev_change_net_namespace(dev, _net, fb_name);
+   err = dev_change_net_namespace(dev, _net, "dev%d");
if (err) {
pr_emerg("%s: failed to move %s to init_net: %d\n",
 __func__, dev->name, err);

Re: [PATCH net] kcm: fix races on sk_receive_queue

2018-06-06 Thread Kirill Tkhai

On 06.06.2018 16:16, Paolo Abeni wrote:
> KCM removes the packets from sk_receive_queue in requeue_rx_msgs()
> 
> without acquiring any lock. Moreover, in R() when the MSG_PEEK
> flag is not present, the skb is peeked and dequeued with two
> separate, non-atomic, calls.
> 
> The above create room for races, which SYZBOT has been able to
> exploit, causing list corruption and kernel oops:
> 
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault:  [#1] SMP KASAN
> Dumping ftrace buffer:
> (ftrace buffer empty)
> Modules linked in:
> CPU: 0 PID: 8484 Comm: syz-executor919 Not tainted 4.17.0-rc7+ #74
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:__skb_unlink include/linux/skbuff.h:1844 [inline]
> RIP: 0010:skb_unlink+0xc1/0x160 net/core/skbuff.c:2921
> RSP: 0018:8801d012f6f0 EFLAGS: 00010002
> RAX: 0286 RBX: 8801d6e073c0 RCX: 0001
> RDX: dc00 RSI: 0004 RDI: 0008
> RBP: 8801d012f718 R08: ed0038bb3b6d R09: ed0038bb3b6c
> R10: ed0038bb3b6c R11: 8801c5d9db63 R12: 
> R13:  R14: 8801c5d9db60 R15: 8801d012fce0
> FS:  00ab7880() GS:8801dae0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 20e5b000 CR3: 0001c31fb000 CR4: 001406f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>   kcm_recvmsg+0x48d/0x590 net/kcm/kcmsock.c:1160
>   sock_recvmsg_nosec+0x8c/0xb0 net/socket.c:802
>   ___sys_recvmsg+0x2b6/0x680 net/socket.c:2279
>   __sys_recvmmsg+0x2f9/0xb80 net/socket.c:2391
>   do_sys_recvmmsg+0xe4/0x190 net/socket.c:2472
>   __do_sys_recvmmsg net/socket.c:2485 [inline]
>   __se_sys_recvmmsg net/socket.c:2481 [inline]
>   __x64_sys_recvmmsg+0xbe/0x150 net/socket.c:2481
>   do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x4417a9
> RSP: 002b:7ffe27282838 EFLAGS: 0206 ORIG_RAX: 012b
> RAX: ffda RBX:  RCX: 004417a9
> RDX: 04f7 RSI: 22c0 RDI: 0006
> RBP:  R08: 2200 R09: 7ffe272829f8
> R10: 0060 R11: 0206 R12: 01f3
> R13: 0001f871 R14:  R15: 
> Code: 00 00 00 49 8d 7d 08 4c 8b 63 08 48 ba 00 00 00 00 00 fc ff df 48 c7
> 43 08 00 00 00 00 48 89 f9 48 c7 03 00 00 00 00 48 c1 e9 03 <80> 3c 11 00
> 75 5b 4c 89 e1 4d 89 65 08 48 ba 00 00 00 00 00 fc
> RIP: __skb_unlink include/linux/skbuff.h:1844 [inline] RSP: 8801d012f6f0
> RIP: skb_unlink+0xc1/0x160 net/core/skbuff.c:2921 RSP: 8801d012f6f0
> 
> To fix the above, we need to use the locked version of the socket dequeue
> helper in requeue_rx_msgs() and kcm_wait_data is changed to dequeue
> the available skb when not peeking.
> 
> RFC -> v1:
>  - use skb_dequeue(), as suggested by Tom
>  - explicitly close the race between skb_peek and skb_unlink
> 
> Fixes: ab7ac4eb9832 ("kcm: Kernel Connection Multiplexor module")
> Reported-and-tested-by: syzbot+278279efdd2730dd1...@syzkaller.appspotmail.com
> Signed-off-by: Paolo Abeni 
> ---
> This is an RFC, since I'm really new to this area, anyway the syzport
> reported success in testing the proposed fix.
> This is very likely a scenario where the upcoming skb->prev,next -> list_head
> conversion would have helped a lot, thanks to list poisoning and list debug
> ---
>  net/kcm/kcmsock.c | 19 ---
>  1 file changed, 12 insertions(+), 7 deletions(-)
> 
> diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
> index d3601d421571..dd2d02bb35ae 100644
> --- a/net/kcm/kcmsock.c
> +++ b/net/kcm/kcmsock.c
> @@ -223,7 +223,7 @@ static void requeue_rx_msgs(struct kcm_mux *mux, struct 
> sk_buff_head *head)
>   struct sk_buff *skb;
>   struct kcm_sock *kcm;
>  
> - while ((skb = __skb_dequeue(head))) {
> + while ((skb = skb_dequeue(head))) {

I try to find how the patch protects against the following race:

requeue_rx_msgs() kcm_recvmsg()
  skb = skb_dequeue()   skb = kcm_wait_data(peek = true)
  ...   ...
free skb...
... skb_copy_datagram_msg(skb) <--- Use after free?

Isn't there possible a use-after-free?

Thanks,
Kirill

>   /* Reset destructor to avoid calling kcm_rcv_ready */
>   skb->destructor = sock_rfree;
>   skb_orphan(skb);
> @@ -1080,12 +1080,17 @@ static int kcm_sendmsg(struct socket *sock, struct 
> msghdr *msg, size_t len)
>   return err;
>  }
>  
> -static struct sk_buff *kcm_wait_data(struct sock *sk, int flags,
> +static struct sk_buff *kcm_wait_data(struct sock *sk, int

Re: [RFC PATCH] kcm: hold rx mux lock when updating the receive queue.

2018-06-06 Thread Kirill Tkhai

Hi, Paolo,

below is couple my thoughts about this.

On 06.06.2018 12:44, Paolo Abeni wrote:
> On Tue, 2018-06-05 at 18:06 +0200, Paolo Abeni wrote:
>> On Tue, 2018-06-05 at 08:35 -0700, Tom Herbert wrote:
>>> Paolo, thanks for looking into this! Can you try replacing
>>> __skb_dequeue in requeue_rx_msgs with skb_dequeue to see if that is
>>> the fix.
>>
>> Sure, I'll retrigger the test, and report the result here (or directly
>> a new patch, should the test be succesful)
> 
> Contrary to my expectations, the suggested change does not fix the
> issue. I'm still investigating the overall locking schema.

kcm_rcv_strparser()->unreserve_rx_kcm()->requeue_rx_msgs()->__skb_dequeue()

seems needed to be synchronized with:

kcm_recvmsg()->kcm_wait_data().

Otherwise, requeue_rx_msgs() removes kcm_recvmsg() peeked skb.

The solution could be to take lock_sock(>sk) in requeue_rx_msgs(), but
we can't do that since there is already locked another socket (and potentially,
this may be a reason of deadlock).

The approach you made in initial patch seems good for me to solve this problem.
The only thing I'm not sure is either lock_sock() is needed in kcm_recvmsg() 
after
this.

Thanks,
Kirill

Re: Dead lock condition occured ipanic during register_netdevice_notifier call in 4.9.102

2018-05-31 Thread Kirill Tkhai

Hi, Illyas,

On 31.05.2018 11:43, Mansoor, Illyas wrote:
> We are facing mutex dead lock condition that we think might be related to a 
> fix that you have provided in:
> Merge branch 
> 'Close-race-between-un-register_netdevice_notifier-and-pernet_operations' 
> commit b9a12601541eb55d07e00261a5112a4bc36fe7be
> 
> We tried to backport the patch series, but got stuck due to dependencies not 
> met in 4.9.102 kernel for these patch series.
> Could you please provide some pointers, so that we can fix in 4.9.y kernel.
> 
> Appreciate any help or pointers on this one.
> 
> Ipanic logs pasted below:
> 
> <3>[ 6513.681473] INFO: task sensors@1.0-ser:2744 blocked for more than 120 
> seconds.
> <3>[ 6513.689723]   Tainted: P U  W  O
> 4.9.102-quilt-2e5dc0ac-07850-g222b9655589b #1
> <3>[ 6513.699108] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> <6>[ 6513.707997] sensors@1.0-ser D0  2744  1 0x
> <4>[ 6513.708007]  880223f38040 88027fc980c0  
> 880271987000
> <4>[ 6513.708024]  88026f9ae040 c9d57d40 81b363d1 
> 81396e0b
> <4>[ 6513.708032]  00ffc9d57d20 88027fc980c0 c9d57d90 
> 88026f9ae040
> <4>[ 6513.708040] Call Trace:
> <4>[ 6513.708056]  [] ? __schedule+0x221/0x6e0
> <4>[ 6513.708063]  [] ? sidtab_context_to_sid+0x39b/0x410
> <4>[ 6513.708068]  [] schedule+0x36/0x90
> <4>[ 6513.708072]  [] schedule_preempt_disabled+0x18/0x30
> <4>[ 6513.708078]  [] __mutex_lock_slowpath+0x185/0x3f0
> <4>[ 6513.708083]  [] mutex_lock+0x25/0x30
> <4>[ 6513.708089]  [] rtnl_lock+0x15/0x20
> <4>[ 6513.708095]  [] register_netdevice_notifier+0x2d/0x200
> <4>[ 6513.708107]  [] raw_init+0x8b/0x90
> <4>[ 6513.708118]  [] can_create+0xe1/0x1c0
> <4>[ 6513.708129]  [] __sock_create+0x12e/0x210
> <4>[ 6513.708141]  [] SyS_socket+0x55/0xb0
> <4>[ 6513.708156]  [] do_syscall_64+0x6a/0xe0
> <4>[ 6513.708166]  [] 
> entry_SYSCALL_64_after_swapgs+0x5d/0xd7
> <4>[ 6513.708171] NMI backtrace for cpu 2
> <4>[ 6513.708178] CPU: 2 PID: 482 Comm: khungtaskd Tainted: P U  W  O
> 4.9.102-quilt-2e5dc0ac-07850-g222b9655589b #1
> <4>[ 6513.708180]  c9eafdd0 813f56bc  
> 
> <4>[ 6513.708188]  c9eafe00 813f9fe1 0002 
> 
> <4>[ 6513.708195]  81042d80 826120f8 c9eafe30 
> 813fa0a3

1)I'm not sure commit b9a12601541eb55d07e00261a5112a4bc36fe7be will help here, 
because this
stack looks for me like just someone does not release the mutex. It's possible 
firstly
try to analyze who actually owns it.

2)Also, note that rtnl_is_locked() is used in wrong way in one driver there
(see WILC_WFI_deinit_mon_interface()), so it also may introduce an imbalance
(if you use the driver).

Kirill

Re: Expected result when racing listen(2) on two sockets bound to the same address

2018-05-23 Thread Kirill Tkhai

Hi,

On 23.05.2018 14:15, Alexander Kurtz wrote:
> [Please keep me CC'ed; I'm not subscribed to the list]
> 
> Hi!
> 
> The program shown below (also available at [0]) does the following:
> 
>  * Create two sockets
>  * Enable SO_REUSEADDR on both
>  * Bind both sockets to [::1]:12345
>  * Spawn two threads which both call listen(2) on one socket each
>  * Check that at least one thread succeeded
> 
> Unfortunately, when running this program on Linux 4.16, it is sometimes
> possible that neither thread succeeds in calling listen(2):
> 
> $ uname -a
> Linux shepard 4.16.0-1-amd64 #1 SMP Debian 4.16.5-1 (2018-04-29) x86_64 
> GNU/Linux
> $ time make
> cc -Wall -Wextra -pedantic -Werror -O3listenrace.c  -lpthread -o 
> listenrace
> for i in `seq 1`; do ./listenrace; done
> listenrace: listenrace.c:58: main: Assertion `result1 == 0 || result2 == 
> 0' failed.
> Aborted
> listenrace: listenrace.c:58: main: Assertion `result1 == 0 || result2 == 
> 0' failed.
> Aborted
> listenrace: listenrace.c:58: main: Assertion `result1 == 0 || result2 == 
> 0' failed.
> Aborted
> 
> real  0m8.201s
> user  0m6.801s
> sys   0m2.141s
> $ 
> 
> As can be seen, on 3 runs (out of 1) calling listen(2) failed in
> *both* threads. Is this to be expected (i.e. "don't do this then") or
> could this be some race condition in the Linux kernel?

At the first sight, it is in kernel and is expected:

inet_csk_listen_start()
{
/* There is race window here: we announce ourselves listening,
 * but this transition is still not validated by get_port().
 * It is OK, because this socket enters to hash table only
 * after validation is complete.
 */
inet_sk_state_store(sk, TCP_LISTEN);
if (!sk->sk_prot->get_port(sk, inet->inet_num)) {

}

CC netdev.

Kirill

Re: [PATCH net-next] tun: Do SIOCGSKNS out of rtnl_lock()

2018-05-09 Thread Kirill Tkhai

Hi, Jason,

On 09.05.2018 10:18, Jason Wang wrote:
> 
> 
> On 2018年05月09日 00:21, Kirill Tkhai wrote:
>> Since net ns of tun device is assigned on the device creation,
>> and it never changes, we do not need to use any lock to get it
>> from alive tun.
>>
>> Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
>> ---
>>   drivers/net/tun.c |   18 +++---
>>   1 file changed, 7 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index d3c04ab9752a..44d4f3d25350 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -2850,10 +2850,10 @@ static long __tun_chr_ioctl(struct file *file, 
>> unsigned int cmd,
>>   unsigned long arg, int ifreq_len)
>>   {
>>   struct tun_file *tfile = file->private_data;
>> +    struct net *net = sock_net(>sk);
>>   struct tun_struct *tun;
>>   void __user* argp = (void __user*)arg;
>>   struct ifreq ifr;
>> -    struct net *net;
>>   kuid_t owner;
>>   kgid_t group;
>>   int sndbuf;
>> @@ -2877,14 +2877,18 @@ static long __tun_chr_ioctl(struct file *file, 
>> unsigned int cmd,
>>    */
>>   return put_user(IFF_TUN | IFF_TAP | TUN_FEATURES,
>>   (unsigned int __user*)argp);
>> -    } else if (cmd == TUNSETQUEUE)
>> +    } else if (cmd == TUNSETQUEUE) {
>>   return tun_set_queue(file, );
>> +    } else if (cmd == SIOCGSKNS) {
> 
> Not for this patch, reusing socket ioctl cmd is probably not good though they 
> were probably not intersected (see ioctl-number.txt). We probably need to 
> introduce TUN specific ioctls for SIOCGSKNS and SIOCGIFHWADDR and warn for 
> socket ones.

The most of socket ioctl cmds use 0x8900 type:

#define SOCK_IOC_TYPE   0x89

while tun cmd is 5400 ('T'). They should not intersect.

The only exceptions are

#define SIOCINQ FIONREAD
#define SIOCOUTQTIOCOUTQ

#define TIOCOUTQ0x5411
#define FIONREAD0x541B

But they can't intersect even with exceptions, since tun nr starts from 200:

#define TUNSETNOCSUM  _IOW('T', 200, int) 

and 200 > 0x1b (==27).

I implemented SIOCGSKNS cmd in the same style as older socket cmds were used.
I'm not sure, we can remove existing SIOCGIFHWADDR, since they are already used.
If we add a warn, which time will we able to remove them? Some old software may
use it, and in case of the program isn't developed any more, nobody can fix this
warnings, even if he/she sees them..

Kirill

> 
>> +    if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
>> +    return -EPERM;
>> +    return open_related_ns(>ns, get_net_ns);
>> +    }
>>     ret = 0;
>>   rtnl_lock();
>>     tun = tun_get(tfile);
>> -    net = sock_net(>sk);
>>   if (cmd == TUNSETIFF) {
>>   ret = -EEXIST;
>>   if (tun)
>> @@ -2914,14 +2918,6 @@ static long __tun_chr_ioctl(struct file *file, 
>> unsigned int cmd,
>>   tfile->ifindex = ifindex;
>>   goto unlock;
>>   }
>> -    if (cmd == SIOCGSKNS) {
>> -    ret = -EPERM;
>> -    if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
>> -    goto unlock;
>> -
>> -    ret = open_related_ns(>ns, get_net_ns);
>> -    goto unlock;
>> -    }
>>     ret = -EBADFD;
>>   if (!tun)
>>
>

[PATCH net-next] tun: Do SIOCGSKNS out of rtnl_lock()

2018-05-08 Thread Kirill Tkhai

Since net ns of tun device is assigned on the device creation,
and it never changes, we do not need to use any lock to get it
from alive tun.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 drivers/net/tun.c |   18 +++---
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d3c04ab9752a..44d4f3d25350 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2850,10 +2850,10 @@ static long __tun_chr_ioctl(struct file *file, unsigned 
int cmd,
unsigned long arg, int ifreq_len)
 {
struct tun_file *tfile = file->private_data;
+   struct net *net = sock_net(>sk);
struct tun_struct *tun;
void __user* argp = (void __user*)arg;
struct ifreq ifr;
-   struct net *net;
kuid_t owner;
kgid_t group;
int sndbuf;
@@ -2877,14 +2877,18 @@ static long __tun_chr_ioctl(struct file *file, unsigned 
int cmd,
 */
return put_user(IFF_TUN | IFF_TAP | TUN_FEATURES,
(unsigned int __user*)argp);
-   } else if (cmd == TUNSETQUEUE)
+   } else if (cmd == TUNSETQUEUE) {
return tun_set_queue(file, );
+   } else if (cmd == SIOCGSKNS) {
+   if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
+   return -EPERM;
+   return open_related_ns(>ns, get_net_ns);
+   }
 
ret = 0;
rtnl_lock();
 
tun = tun_get(tfile);
-   net = sock_net(>sk);
if (cmd == TUNSETIFF) {
ret = -EEXIST;
if (tun)
@@ -2914,14 +2918,6 @@ static long __tun_chr_ioctl(struct file *file, unsigned 
int cmd,
tfile->ifindex = ifindex;
goto unlock;
}
-   if (cmd == SIOCGSKNS) {
-   ret = -EPERM;
-   if (!ns_capable(net->user_ns, CAP_NET_ADMIN))
-   goto unlock;
-
-   ret = open_related_ns(>ns, get_net_ns);
-   goto unlock;
-   }
 
ret = -EBADFD;
if (!tun)

[PATCH net-next] net: Fix coccinelle warning

2018-04-26 Thread Kirill Tkhai

kbuild test robot says:

  >coccinelle warnings: (new ones prefixed by >>)
  >>> net/core/dev.c:1588:2-3: Unneeded semicolon

So, let's remove it.

Reported-by: kbuild test robot <l...@intel.com>
Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 net/core/dev.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index c624a04dad1f..72e9299cd1e6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1587,7 +1587,7 @@ const char *netdev_cmd_to_name(enum netdev_cmd cmd)
N(UDP_TUNNEL_DROP_INFO) N(CHANGE_TX_QUEUE_LEN)
N(CVLAN_FILTER_PUSH_INFO) N(CVLAN_FILTER_DROP_INFO)
N(SVLAN_FILTER_PUSH_INFO) N(SVLAN_FILTER_DROP_INFO)
-   };
+   }
 #undef N
return "UNKNOWN_NETDEV_EVENT";
 }

Re: [PATCH] iptables: Per-net ns lock

2018-04-23 Thread Kirill Tkhai

On 21.04.2018 02:06, Andrei Vagin wrote:
> On Fri, Apr 20, 2018 at 04:42:47PM +0300, Kirill Tkhai wrote:
>> Containers want to restore their own net ns,
>> while they may have no their own mnt ns.
>> This case they share host's /run/xtables.lock
>> file, but they may not have permission to open
>> it.
>>
>> Patch makes /run/xtables.lock to be per-namespace,
>> i.e., to refer to the caller task's net ns.
>>
>> Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
>> ---
>>  iptables/xshared.c |7 ++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/iptables/xshared.c b/iptables/xshared.c
>> index 06db72d4..b6dbe4e7 100644
>> --- a/iptables/xshared.c
>> +++ b/iptables/xshared.c
>> @@ -254,7 +254,12 @@ static int xtables_lock(int wait, struct timeval 
>> *wait_interval)
>>  time_left.tv_sec = wait;
>>  time_left.tv_usec = 0;
>>  
>> -fd = open(XT_LOCK_NAME, O_CREAT, 0600);
>> +if (symlink("/proc/self/ns/net", XT_LOCK_NAME) != 0 &&
> 
> Any user can open this file and take the lock. Before this patch, the
> lock file could be opened only by the root user. It means that any user
> will be able to block all iptables operations. Do I miss something?

Yes, this is the idea. It looks like the only way to save compatibility
with old iptables and to allow to set rules from nested net namespaces.
Also, this allows to synchronize with containers, which have its own mount
namespace.

Comparing to existing interfaces in kernel, there is an example. Ordinary user
can open a file RO on a partition, and this prevents root from umounting it
But this is never considered as a problem, and nobody makes partitions available
only for root in 0600 mode to prevent this. There is lsof, and it's easy to find
the lock owner. The same with iptables. The lock is not a critical protection,
it's just a try for different users to synchronize between each other. Real 
protection
happens in setsockopt() path.

> [root@fc24 ~]# ln -s /proc/self/ns/net /run/xtables.lock2
> [root@fc24 ~]# ls -l /run/xtables.lock2
> lrwxrwxrwx 1 root root 17 Apr 21 01:52 /run/xtables.lock2 ->
> /proc/self/ns/net
> [root@fc24 ~]# ls -l /proc/self/ns/net 
> lrwxrwxrwx 1 root root 0 Apr 21 01:52 /proc/self/ns/net ->
> net:[4026531993]
> 
> Thanks,
> Andrei
> 
>> +errno != EEXIST) {
>> +fprintf(stderr, "Fatal: can't create lock file\n");
> 
>   fprintf(stderr, "Fatal: can't create lock file %s: %s\n",
>   XT_LOCK_NAME, strerror(errno));
> 
>> +return XT_LOCK_FAILED;
>> +}
>> +fd = open(XT_LOCK_NAME, O_RDONLY);
>>  if (fd < 0) {
>>  fprintf(stderr, "Fatal: can't open lock file %s: %s\n",
>>  XT_LOCK_NAME, strerror(errno));

Kirill

[PATCH] iptables: Per-net ns lock

2018-04-20 Thread Kirill Tkhai

Containers want to restore their own net ns,
while they may have no their own mnt ns.
This case they share host's /run/xtables.lock
file, but they may not have permission to open
it.

Patch makes /run/xtables.lock to be per-namespace,
i.e., to refer to the caller task's net ns.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 iptables/xshared.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/iptables/xshared.c b/iptables/xshared.c
index 06db72d4..b6dbe4e7 100644
--- a/iptables/xshared.c
+++ b/iptables/xshared.c
@@ -254,7 +254,12 @@ static int xtables_lock(int wait, struct timeval 
*wait_interval)
time_left.tv_sec = wait;
time_left.tv_usec = 0;
 
-   fd = open(XT_LOCK_NAME, O_CREAT, 0600);
+   if (symlink("/proc/self/ns/net", XT_LOCK_NAME) != 0 &&
+   errno != EEXIST) {
+   fprintf(stderr, "Fatal: can't create lock file\n");
+   return XT_LOCK_FAILED;
+   }
+   fd = open(XT_LOCK_NAME, O_RDONLY);
if (fd < 0) {
fprintf(stderr, "Fatal: can't open lock file %s: %s\n",
XT_LOCK_NAME, strerror(errno));

Re: [PATCH RFC iptables] iptables: Per-net ns lock

2018-04-20 Thread Kirill Tkhai

Hi, Florian,

On 20.04.2018 13:50, Florian Westphal wrote:
> Kirill Tkhai <ktk...@virtuozzo.com> wrote:
>> Pablo, Florian, could you please provide comments on this?
>>
>> On 09.04.2018 19:55, Kirill Tkhai wrote:
>>> In CRIU and LXC-restore we met the situation,
>>> when iptables in container can't be restored
>>> because of permission denied:
>>>
>>> https://github.com/checkpoint-restore/criu/issues/469
>>>
>>> Containers want to restore their own net ns,
>>> while they may have no their own mnt ns.
>>> This case they share host's /run/xtables.lock
>>> file, but they may not have permission to open
>>> it.
>>>
>>> Patch makes /run/xtables.lock to be per-namespace,
>>> i.e., to refer to the caller task's net ns.
> 
> It looks ok to me but then again the entire userspace
> lock thing is a ugly band aid :-/

I'm agree, but I'm not sure there is a possibility
to go away from it in classic iptables...

Kirill

Re: [PATCH RFC iptables] iptables: Per-net ns lock

2018-04-20 Thread Kirill Tkhai

Pablo, Florian, could you please provide comments on this?

On 09.04.2018 19:55, Kirill Tkhai wrote:
> In CRIU and LXC-restore we met the situation,
> when iptables in container can't be restored
> because of permission denied:
> 
> https://github.com/checkpoint-restore/criu/issues/469
> 
> Containers want to restore their own net ns,
> while they may have no their own mnt ns.
> This case they share host's /run/xtables.lock
> file, but they may not have permission to open
> it.
> 
> Patch makes /run/xtables.lock to be per-namespace,
> i.e., to refer to the caller task's net ns.
> 
> What you think?
> 
> Thanks,
> Kirill
> 
> Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
> ---
>  iptables/xshared.c |7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/iptables/xshared.c b/iptables/xshared.c
> index 06db72d4..b6dbe4e7 100644
> --- a/iptables/xshared.c
> +++ b/iptables/xshared.c
> @@ -254,7 +254,12 @@ static int xtables_lock(int wait, struct timeval 
> *wait_interval)
>   time_left.tv_sec = wait;
>   time_left.tv_usec = 0;
>  
> - fd = open(XT_LOCK_NAME, O_CREAT, 0600);
> + if (symlink("/proc/self/ns/net", XT_LOCK_NAME) != 0 &&
> + errno != EEXIST) {
> + fprintf(stderr, "Fatal: can't create lock file\n");
> + return XT_LOCK_FAILED;
> + }
> + fd = open(XT_LOCK_NAME, O_RDONLY);
>   if (fd < 0) {
>   fprintf(stderr, "Fatal: can't open lock file %s: %s\n",
>   XT_LOCK_NAME, strerror(errno));
>

Re: [bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow)

2018-04-19 Thread Kirill Tkhai

On 19.04.2018 19:44, Al Viro wrote:
> On Thu, Apr 19, 2018 at 04:34:48PM +0100, Al Viro wrote:
> 
>> IOW, we only get there if our vfsmount was an MNT_INTERNAL one.
>> So we have mnt->mnt_umount of some MNT_INTERNAL mount found in
>> ->mnt_pins of some other mount.  Which, AFAICS, means that
>> it used to be mounted on that other mount.  How the hell can
>> that happen?
>>
>> It looks like you somehow get a long chain of MNT_INTERNAL mounts
>> stacked on top of each other, which ought to be prevented by
>> mnt_flags &= ~MNT_INTERNAL_FLAGS;
>> in do_add_mount().  Nuts...
> 
> Argh...  Nuts is right - clone_mnt() preserves the sodding
> MNT_INTERNAL, with obvious results.
> 
> netns is related to the problem, by exposing MNT_INTERNAL mounts
> (in /proc/*/ns/*) for mount --bind to copy and attach to the
> tree.  AFAICS, the minimal reproducer is
> 
> touch /tmp/a
> unshare -m sh -c 'for i in `seq 1`; do mount --bind /proc/1/ns/net 
> /tmp/a; done'
> 
> (and it can be anything in /proc/*/ns/*, really)
> 
> I think the fix should be along the lines of the following:
> 
> Don't leak MNT_INTERNAL away from internal mounts
> 
> We want it only for the stuff created by SB_KERNMOUNT mounts, *not* for
> their copies.
> 
> Cc: sta...@kernel.org
> Signed-off-by: Al Viro <v...@zeniv.linux.org.uk>

Flawless victory! Thanks.

Tested-by: Kirill Tkhai <ktk...@virtuozzo.com>

> ---
> diff --git a/fs/namespace.c b/fs/namespace.c
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -1089,7 +1089,8 @@ static struct mount *clone_mnt(struct mount *old, 
> struct dentry *root,
>   goto out_free;
>   }
>  
> - mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED);
> + mnt->mnt.mnt_flags = old->mnt.mnt_flags;
> + mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL);
>   /* Don't allow unprivileged users to change mount flags */
>   if (flag & CL_UNPRIVILEGED) {
>   mnt->mnt.mnt_flags |= MNT_LOCK_ATIME;
>

[bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow)

2018-04-19 Thread Kirill Tkhai

Hi, Al,

commit 87b95ce0964c016ede92763be9c164e49f1019e9 is the first after which the 
below test crashes the kernel:

Author: Al Viro <v...@zeniv.linux.org.uk>
Date:   Sat Jan 10 19:01:08 2015 -0500

switch the IO-triggering parts of umount to fs_pin

Signed-off-by: Al Viro <v...@zeniv.linux.org.uk>

$modprobe dummy

$while true
 do
 mkdir /var/run/netns
 touch /var/run/netns/init_net
 mount --bind /proc/1/ns/net /var/run/netns/init_net

 ip netns add foo
 ip netns exec foo ip link add dummy0 type dummy
 ip netns delete foo
done

[   22.058349] ip (3249) used greatest stack depth: 8 bytes left
[   22.182195] BUG: unable to handle kernel paging request at 00035bb1f080
[   22.183065] IP: [] kick_process+0x34/0x80
[   22.183065] PGD 0 
[   22.183065] Thread overran stack, or stack corrupted
[   22.183065] Oops:  [#1] PREEMPT SMP 
[   22.183065] CPU: 1 PID: 3255 Comm: ip Not tainted 3.19.0-rc5+ #111
[   22.183065] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.11.1-1 04/01/2014
[   22.183065] task: 88007c475100 ti: 88007b3cc000 task.ti: 
88007b3cc000
[   22.183065] RIP: 0010:[]  [] 
kick_process+0x34/0x80
[   22.183065] RSP: 0018:88007b3cfcf8  EFLAGS: 00010293
[   22.183065] RAX: 00012900 RBX: 88007c475100 RCX: 88007b20e7b8
[   22.183065] RDX: 7b3cc028 RSI: 819b05f8 RDI: 819cb999
[   22.183065] RBP: 88007b3cfd08 R08: 81cbf688 R09: 88007d3d0810
[   22.183065] R10: 88007fc933c8 R11:  R12: 7b3cc028
[   22.183065] R13: 88007c475100 R14:  R15: 7fff7793a448
[   22.183065] FS:  7fc987546700() GS:88007fc8() 
knlGS:
[   22.183065] CS:  0010 DS:  ES:  CR0: 8005003b
[   22.183065] CR2: 00035bb1f080 CR3: 01c11000 CR4: 06e0
[   22.183065] Stack:
[   22.183065]  88007c3b67b8 88007b3cfd98 88007b3cfd18 
81066b05
[   22.183065]  88007b3cfd38 81176f4c 88007b3cfd48 
88007c3b68a0
[   22.183065]  88007b3cfd48 811f 88007b3cfd68 
81177a49
[   22.183065] Call Trace:
[   22.183065]  [] task_work_add+0x45/0x60
[   22.183065]  [] mntput_no_expire+0xdc/0x150
[   22.183065]  [] mntput+0x1f/0x30
[   22.183065]  [] drop_mountpoint+0x29/0x30
[   22.183065]  [] pin_kill+0x66/0xf0
[   22.183065]  [] ? __wake_up_common+0x90/0x90
[   22.183065]  [] group_pin_kill+0x19/0x40
[   22.183065]  [] namespace_unlock+0x58/0x60
[   22.183065]  [] drop_collected_mounts+0x4e/0x60
[   22.183065]  [] put_mnt_ns+0x2d/0x50
[   22.183065]  [] free_nsproxy+0x1a/0x80
[   22.183065]  [] switch_task_namespaces+0x58/0x70
[   22.183065]  [] exit_task_namespaces+0xb/0x10
[   22.183065]  [] do_exit+0x2c7/0xc00
[   22.183065]  [] do_group_exit+0x3a/0xa0
[   22.183065]  [] SyS_exit_group+0xf/0x10
[   22.183065]  [] system_call_fastpath+0x12/0x17

Kirill

On 19.04.2018 01:08, Kirill Tkhai wrote:
> Hi, Alexander!
> 
> On 18.04.2018 22:45, Alexander Aring wrote:
>> I currently can crash my net/master kernel by execute the following script:
>>
>> --- snip
>>
>> modprobe dummy
>>
>> #mkdir /var/run/netns
>> #touch /var/run/netns/init_net
>> #mount --bind /proc/1/ns/net /var/run/netns/init_net
>>
>> while true
>> do
>> mkdir /var/run/netns
>> touch /var/run/netns/init_net
>> mount --bind /proc/1/ns/net /var/run/netns/init_net
>>
>> ip netns add foo
>> ip netns exec foo ip link add dummy0 type dummy
>> ip netns delete foo
>> done
> 
> Fast answer is the best, so I tried your test on my not-for-work computer.
> There is old kernel without asynchronous pernet operations:
> 
> $uname -a
> Linux localhost.localdomain 4.15.0-2-amd64 #1 SMP Debian 4.15.11-1 
> (2018-03-20) x86_64 GNU/Linux
> 
> After approximately 15 seconds of your test execution it died :(
> (Hopefully, I executed it in "init 1" with all partitions RO as usual).
> 
> There is no serial console, so I can't say that the first stack is exactly
> the same as you see. But it crashed. So, it seems, the problem have been
> existing long ago.
> 
> Have you tried to reproduce it in older kernels or to bisect the problem 
> commit?
> Or maybe it doesn't reproduce on old kernels in your environment?
> 
>> --- snap
>>
>> After max ~1 minute the kernel will crash.
>> Doing my hack of saving init_net outside the loop it will run fine...
>> So the mount bind is necessary.
>>
>> The last message which I see is:
>>
>> BUG: stack guard page was hit at f0751759 (stack is
>> 69363195..73ddc474)
>> kernel stack overflow (double-fault):  [#1] SMP PTI
>> Modules linked in:
&g

Re: net namespaces kernel stack overflow

2018-04-18 Thread Kirill Tkhai

Hi, Alexander!

On 18.04.2018 22:45, Alexander Aring wrote:
> I currently can crash my net/master kernel by execute the following script:
> 
> --- snip
> 
> modprobe dummy
> 
> #mkdir /var/run/netns
> #touch /var/run/netns/init_net
> #mount --bind /proc/1/ns/net /var/run/netns/init_net
> 
> while true
> do
> mkdir /var/run/netns
> touch /var/run/netns/init_net
> mount --bind /proc/1/ns/net /var/run/netns/init_net
> 
> ip netns add foo
> ip netns exec foo ip link add dummy0 type dummy
> ip netns delete foo
> done

Fast answer is the best, so I tried your test on my not-for-work computer.
There is old kernel without asynchronous pernet operations:

$uname -a
Linux localhost.localdomain 4.15.0-2-amd64 #1 SMP Debian 4.15.11-1 (2018-03-20) 
x86_64 GNU/Linux

After approximately 15 seconds of your test execution it died :(
(Hopefully, I executed it in "init 1" with all partitions RO as usual).

There is no serial console, so I can't say that the first stack is exactly
the same as you see. But it crashed. So, it seems, the problem have been
existing long ago.

Have you tried to reproduce it in older kernels or to bisect the problem commit?
Or maybe it doesn't reproduce on old kernels in your environment?

> --- snap
> 
> After max ~1 minute the kernel will crash.
> Doing my hack of saving init_net outside the loop it will run fine...
> So the mount bind is necessary.
> 
> The last message which I see is:
> 
> BUG: stack guard page was hit at f0751759 (stack is
> 69363195..73ddc474)
> kernel stack overflow (double-fault):  [#1] SMP PTI
> Modules linked in:
> CPU: 0 PID: 13917 Comm: ip Not tainted 4.16.0-11878-gef9d066f6808 #32
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 
> 04/01/2014
> RIP: 0010:validate_chain.isra.23+0x44/0xc40
> RSP: 0018:c92cbff8 EFLAGS: 00010002
> RAX: 0004 RBX: 0e58b88e1d4d15da RCX: 0e58b88e1d4d15da
> RDX:  RSI: 8802b25ee2a0 RDI: 8802b25edb00
> RBP: 0e58b88e1d4d15da R08:  R09: 0004
> R10: c92cc050 R11: 8802b1054be8 R12: 0001
> R13: 8802b25ee268 R14: 8802b25edb00 R15: 
> FS:  () GS:8802bfc0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: c92cbfe8 CR3: 02024000 CR4: 06f0
> Call Trace:
>  ? get_max_files+0x10/0x10
>  __lock_acquire+0x332/0x710
>  lock_acquire+0x67/0xb0
>  ? lockref_put_or_lock+0x9/0x30
>  ? dput.part.7+0x17/0x2d0
>  _raw_spin_lock+0x2b/0x60
>  ? lockref_put_or_lock+0x9/0x30
>  lockref_put_or_lock+0x9/0x30
>  dput.part.7+0x1ec/0x2d0
>  drop_mountpoint+0x10/0x40
>  pin_kill+0x9b/0x3a0
>  ? wait_woken+0x90/0x90
>  ? mnt_pin_kill+0x2d/0x100
>  mnt_pin_kill+0x2d/0x100
>  cleanup_mnt+0x66/0x70
>  pin_kill+0x9b/0x3a0
>  ? wait_woken+0x90/0x90
>  ? mnt_pin_kill+0x2d/0x100
>  mnt_pin_kill+0x2d/0x100
>  cleanup_mnt+0x66/0x70
> ...
> 
> I guess maybe it has something to do with recently switching to
> migrate per-net ops to async.
> 
> - Alex

Kirill

[PATCH RFC iptables] iptables: Per-net ns lock

2018-04-09 Thread Kirill Tkhai

In CRIU and LXC-restore we met the situation,
when iptables in container can't be restored
because of permission denied:

https://github.com/checkpoint-restore/criu/issues/469

Containers want to restore their own net ns,
while they may have no their own mnt ns.
This case they share host's /run/xtables.lock
file, but they may not have permission to open
it.

Patch makes /run/xtables.lock to be per-namespace,
i.e., to refer to the caller task's net ns.

What you think?

Thanks,
Kirill

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 iptables/xshared.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/iptables/xshared.c b/iptables/xshared.c
index 06db72d4..b6dbe4e7 100644
--- a/iptables/xshared.c
+++ b/iptables/xshared.c
@@ -254,7 +254,12 @@ static int xtables_lock(int wait, struct timeval 
*wait_interval)
time_left.tv_sec = wait;
time_left.tv_usec = 0;
 
-   fd = open(XT_LOCK_NAME, O_CREAT, 0600);
+   if (symlink("/proc/self/ns/net", XT_LOCK_NAME) != 0 &&
+   errno != EEXIST) {
+   fprintf(stderr, "Fatal: can't create lock file\n");
+   return XT_LOCK_FAILED;
+   }
+   fd = open(XT_LOCK_NAME, O_RDONLY);
if (fd < 0) {
fprintf(stderr, "Fatal: can't open lock file %s: %s\n",
XT_LOCK_NAME, strerror(errno));

Re: [PATCH net-next] netns: filter uevents correctly

2018-04-05 Thread Kirill Tkhai

On 05.04.2018 17:07, Christian Brauner wrote:
> On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> On 04.04.2018 22:48, Christian Brauner wrote:
>>> commit 07e98962fa77 ("kobject: Send hotplug events in all network 
>>> namespaces")
>>>
>>> enabled sending hotplug events into all network namespaces back in 2010.
>>> Over time the set of uevents that get sent into all network namespaces has
>>> shrunk. We have now reached the point where hotplug events for all devices
>>> that carry a namespace tag are filtered according to that namespace.
>>>
>>> Specifically, they are filtered whenever the namespace tag of the kobject
>>> does not match the namespace tag of the netlink socket. One example are
>>> network devices. Uevents for network devices only show up in the network
>>> namespaces these devices are moved to or created in.
>>>
>>> However, any uevent for a kobject that does not have a namespace tag
>>> associated with it will not be filtered and we will *try* to broadcast it
>>> into all network namespaces.
>>>
>>> The original patchset was written in 2010 before user namespaces were a
>>> thing. With the introduction of user namespaces sending out uevents became
>>> partially isolated as they were filtered by user namespaces:
>>>
>>> net/netlink/af_netlink.c:do_one_broadcast()
>>>
>>> if (!net_eq(sock_net(sk), p->net)) {
>>> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>>> return;
>>>
>>> if (!peernet_has_id(sock_net(sk), p->net))
>>> return;
>>>
>>> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>>>  CAP_NET_BROADCAST))
>>> j   return;
>>> }
>>>
>>> The file_ns_capable() check will check whether the caller had
>>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
>>> namespace of interest. This check is fine in general but seems insufficient
>>> to me when paired with uevents. The reason is that devices always belong to
>>> the initial user namespace so uevents for kobjects that do not carry a
>>> namespace tag should never be sent into another user namespace. This has
>>> been the intention all along. But there's one case where this breaks,
>>> namely if a new user namespace is created by root on the host and an
>>> identity mapping is established between root on the host and root in the
>>> new user namespace. Here's a reproducer:
>>>
>>>  sudo unshare -U --map-root
>>>  udevadm monitor -k
>>>  # Now change to initial user namespace and e.g. do
>>>  modprobe kvm
>>>  # or
>>>  rmmod kvm
>>>
>>> will allow the non-initial user namespace to retrieve all uevents from the
>>> host. This seems very anecdotal given that in the general case user
>>> namespaces do not see any uevents and also can't really do anything useful
>>> with them.
>>>
>>> Additionally, it is now possible to send uevents from userspace. As such we
>>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>>> namespace of the network namespace of the netlink socket) userspace process
>>> make a decision what uevents should be sent.
>>>
>>> This makes me think that we should simply ensure that uevents for kobjects
>>> that do not carry a namespace tag are *always* filtered by user namespace
>>> in kobj_bcast_filter(). Specifically:
>>> - If the owning user namespace of the uevent socket is not init_user_ns the
>>>   event will always be filtered.
>>> - If the network namespace the uevent socket belongs to was created in the
>>>   initial user namespace but was opened from a non-initial user namespace
>>>   the event will be filtered as well.
>>> Put another way, uevents for kobjects not carrying a namespace tag are now
>>> always only sent to the initial user namespace. The regression potential
>>> for this is near to non-existent since user namespaces can't really do
>>> anything with interesting devices.
>>>
>>> Signed-off-by: Christian Brauner <christian.brau...@ubuntu.com>
>>> ---
>>>  lib/kobject_uevent.c | 10 +-
>>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
>>> index 15ea216a67ce..cb98cddb6e3b 100644
>>

Re: [PATCH net-next] netns: filter uevents correctly

2018-04-05 Thread Kirill Tkhai

On 04.04.2018 22:48, Christian Brauner wrote:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
> 
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk. We have now reached the point where hotplug events for all devices
> that carry a namespace tag are filtered according to that namespace.
> 
> Specifically, they are filtered whenever the namespace tag of the kobject
> does not match the namespace tag of the netlink socket. One example are
> network devices. Uevents for network devices only show up in the network
> namespaces these devices are moved to or created in.
> 
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will *try* to broadcast it
> into all network namespaces.
> 
> The original patchset was written in 2010 before user namespaces were a
> thing. With the introduction of user namespaces sending out uevents became
> partially isolated as they were filtered by user namespaces:
> 
> net/netlink/af_netlink.c:do_one_broadcast()
> 
> if (!net_eq(sock_net(sk), p->net)) {
> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> return;
> 
> if (!peernet_has_id(sock_net(sk), p->net))
> return;
> 
> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>  CAP_NET_BROADCAST))
> j   return;
> }
> 
> The file_ns_capable() check will check whether the caller had
> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> namespace of interest. This check is fine in general but seems insufficient
> to me when paired with uevents. The reason is that devices always belong to
> the initial user namespace so uevents for kobjects that do not carry a
> namespace tag should never be sent into another user namespace. This has
> been the intention all along. But there's one case where this breaks,
> namely if a new user namespace is created by root on the host and an
> identity mapping is established between root on the host and root in the
> new user namespace. Here's a reproducer:
> 
>  sudo unshare -U --map-root
>  udevadm monitor -k
>  # Now change to initial user namespace and e.g. do
>  modprobe kvm
>  # or
>  rmmod kvm
> 
> will allow the non-initial user namespace to retrieve all uevents from the
> host. This seems very anecdotal given that in the general case user
> namespaces do not see any uevents and also can't really do anything useful
> with them.
> 
> Additionally, it is now possible to send uevents from userspace. As such we
> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> namespace of the network namespace of the netlink socket) userspace process
> make a decision what uevents should be sent.
> 
> This makes me think that we should simply ensure that uevents for kobjects
> that do not carry a namespace tag are *always* filtered by user namespace
> in kobj_bcast_filter(). Specifically:
> - If the owning user namespace of the uevent socket is not init_user_ns the
>   event will always be filtered.
> - If the network namespace the uevent socket belongs to was created in the
>   initial user namespace but was opened from a non-initial user namespace
>   the event will be filtered as well.
> Put another way, uevents for kobjects not carrying a namespace tag are now
> always only sent to the initial user namespace. The regression potential
> for this is near to non-existent since user namespaces can't really do
> anything with interesting devices.
> 
> Signed-off-by: Christian Brauner 
> ---
>  lib/kobject_uevent.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 15ea216a67ce..cb98cddb6e3b 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct 
> sk_buff *skb, void *data)
>   return sock_ns != ns;
>   }
>  
> - return 0;
> + /*
> +  * The kobject does not carry a namespace tag so filter by user
> +  * namespace below.
> +  */
> + if (sock_net(dsk)->user_ns != _user_ns)
> + return 1;
> +
> + /* Check if socket was opened from non-initial user namespace. */
> + return sk_user_ns(dsk) != _user_ns;
>  }
>  #endif

So, this prohibits to listen events of all devices except network-related
in containers? If it's so, I don't think it's a good solution. Uevents is not
net-devices-only related interface and it's used for all devices in system.
People may want to delegate block devices to nested user_ns, for example.

Better we should think about something like "generic device <-> user_ns" 
connection,
and to filter events by this user_ns.

Thanks,
Kirill

Re: possible deadlock in skb_queue_tail

2018-04-03 Thread Kirill Tkhai

On 03.04.2018 14:25, Dmitry Vyukov wrote:
> On Tue, Apr 3, 2018 at 11:50 AM, Kirill Tkhai <ktk...@virtuozzo.com> wrote:
>> On 02.04.2018 12:20, syzbot wrote:
>>> Hello,
>>>
>>> syzbot hit the following crash on net-next commit
>>> 06b19fe9a6df7aaa423cd8404ebe5ac9ec4b2960 (Sun Apr 1 03:37:33 2018 +)
>>> Merge branch 'chelsio-inline-tls'
>>> syzbot dashboard link: 
>>> https://syzkaller.appspot.com/bug?extid=6b495100f17ca8554ab9
>>>
>>> Unfortunately, I don't have any reproducer for this crash yet.
>>> Raw console output: 
>>> https://syzkaller.appspot.com/x/log.txt?id=6218830443446272
>>> Kernel config: 
>>> https://syzkaller.appspot.com/x/.config?id=3327544840960562528
>>> compiler: gcc (GCC) 7.1.1 20170620
>>>
>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>> Reported-by: syzbot+6b495100f17ca8554...@syzkaller.appspotmail.com
>>> It will help syzbot understand when the bug is fixed. See footer for 
>>> details.
>>> If you forward the report, please keep this part and the footer.
>>>
>>>
>>> ==
>>> WARNING: possible circular locking dependency detected
>>> 4.16.0-rc6+ #290 Not tainted
>>> --
>>> syz-executor7/20971 is trying to acquire lock:
>>>  (_unix_sk_receive_queue_lock_key){+.+.}, at: [<271ef0d8>] 
>>> skb_queue_tail+0x26/0x150 net/core/skbuff.c:2899
>>>
>>> but task is already holding lock:
>>>  (&(>lock)->rlock/1){+.+.}, at: [<4e725e14>] 
>>> unix_state_double_lock+0x7b/0xb0 net/unix/af_unix.c:1088
>>>
>>> which lock already depends on the new lock.
>>>
>>>
>>> the existing dependency chain (in reverse order) is:
>>>
>>> -> #1 (&(>lock)->rlock/1){+.+.}:
>>>_raw_spin_lock_nested+0x28/0x40 kernel/locking/spinlock.c:354
>>>sk_diag_dump_icons net/unix/diag.c:82 [inline]
>>>sk_diag_fill.isra.4+0xa52/0xfe0 net/unix/diag.c:144
>>>sk_diag_dump net/unix/diag.c:178 [inline]
>>>unix_diag_dump+0x400/0x4f0 net/unix/diag.c:206
>>>netlink_dump+0x492/0xcf0 net/netlink/af_netlink.c:2221
>>>__netlink_dump_start+0x4ec/0x710 net/netlink/af_netlink.c:2318
>>>netlink_dump_start include/linux/netlink.h:214 [inline]
>>>unix_diag_handler_dump+0x3e7/0x750 net/unix/diag.c:307
>>>__sock_diag_cmd net/core/sock_diag.c:230 [inline]
>>>sock_diag_rcv_msg+0x204/0x360 net/core/sock_diag.c:261
>>>netlink_rcv_skb+0x14b/0x380 net/netlink/af_netlink.c:2443
>>>sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:272
>>>netlink_unicast_kernel net/netlink/af_netlink.c:1307 [inline]
>>>netlink_unicast+0x4c4/0x6b0 net/netlink/af_netlink.c:1333
>>>netlink_sendmsg+0xa4a/0xe80 net/netlink/af_netlink.c:1896
>>>sock_sendmsg_nosec net/socket.c:629 [inline]
>>>sock_sendmsg+0xca/0x110 net/socket.c:639
>>>sock_write_iter+0x31a/0x5d0 net/socket.c:908
>>>call_write_iter include/linux/fs.h:1782 [inline]
>>>new_sync_write fs/read_write.c:469 [inline]
>>>__vfs_write+0x684/0x970 fs/read_write.c:482
>>>vfs_write+0x189/0x510 fs/read_write.c:544
>>>SYSC_write fs/read_write.c:589 [inline]
>>>SyS_write+0xef/0x220 fs/read_write.c:581
>>>do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>>>entry_SYSCALL_64_after_hwframe+0x42/0xb7
>>>
>>> -> #0 (_unix_sk_receive_queue_lock_key){+.+.}:
>>>lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
>>>__raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>>>_raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
>>>skb_queue_tail+0x26/0x150 net/core/skbuff.c:2899
>>>unix_dgram_sendmsg+0xa30/0x1610 net/unix/af_unix.c:1807
>>>sock_sendmsg_nosec net/socket.c:629 [inline]
>>>sock_sendmsg+0xca/0x110 net/socket.c:639
>>>___sys_sendmsg+0x320/0x8b0 net/socket.c:2047
>>>__sys_sendmmsg+0x1ee/0x620 net/socket.c:2137
>>>SYSC_sendmmsg net/socket.c:2168 [inline]
>>>SyS_sendmmsg+0x35/0x60 net/socket.c:2163
>>>do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>>

Re: possible deadlock in skb_queue_tail

2018-04-03 Thread Kirill Tkhai

On 02.04.2018 12:20, syzbot wrote:
> Hello,
> 
> syzbot hit the following crash on net-next commit
> 06b19fe9a6df7aaa423cd8404ebe5ac9ec4b2960 (Sun Apr 1 03:37:33 2018 +)
> Merge branch 'chelsio-inline-tls'
> syzbot dashboard link: 
> https://syzkaller.appspot.com/bug?extid=6b495100f17ca8554ab9
> 
> Unfortunately, I don't have any reproducer for this crash yet.
> Raw console output: 
> https://syzkaller.appspot.com/x/log.txt?id=6218830443446272
> Kernel config: https://syzkaller.appspot.com/x/.config?id=3327544840960562528
> compiler: gcc (GCC) 7.1.1 20170620
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+6b495100f17ca8554...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for details.
> If you forward the report, please keep this part and the footer.
> 
> 
> ==
> WARNING: possible circular locking dependency detected
> 4.16.0-rc6+ #290 Not tainted
> --
> syz-executor7/20971 is trying to acquire lock:
>  (_unix_sk_receive_queue_lock_key){+.+.}, at: [<271ef0d8>] 
> skb_queue_tail+0x26/0x150 net/core/skbuff.c:2899
> 
> but task is already holding lock:
>  (&(>lock)->rlock/1){+.+.}, at: [<4e725e14>] 
> unix_state_double_lock+0x7b/0xb0 net/unix/af_unix.c:1088
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (&(>lock)->rlock/1){+.+.}:
>    _raw_spin_lock_nested+0x28/0x40 kernel/locking/spinlock.c:354
>    sk_diag_dump_icons net/unix/diag.c:82 [inline]
>    sk_diag_fill.isra.4+0xa52/0xfe0 net/unix/diag.c:144
>    sk_diag_dump net/unix/diag.c:178 [inline]
>    unix_diag_dump+0x400/0x4f0 net/unix/diag.c:206
>    netlink_dump+0x492/0xcf0 net/netlink/af_netlink.c:2221
>    __netlink_dump_start+0x4ec/0x710 net/netlink/af_netlink.c:2318
>    netlink_dump_start include/linux/netlink.h:214 [inline]
>    unix_diag_handler_dump+0x3e7/0x750 net/unix/diag.c:307
>    __sock_diag_cmd net/core/sock_diag.c:230 [inline]
>    sock_diag_rcv_msg+0x204/0x360 net/core/sock_diag.c:261
>    netlink_rcv_skb+0x14b/0x380 net/netlink/af_netlink.c:2443
>    sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:272
>    netlink_unicast_kernel net/netlink/af_netlink.c:1307 [inline]
>    netlink_unicast+0x4c4/0x6b0 net/netlink/af_netlink.c:1333
>    netlink_sendmsg+0xa4a/0xe80 net/netlink/af_netlink.c:1896
>    sock_sendmsg_nosec net/socket.c:629 [inline]
>    sock_sendmsg+0xca/0x110 net/socket.c:639
>    sock_write_iter+0x31a/0x5d0 net/socket.c:908
>    call_write_iter include/linux/fs.h:1782 [inline]
>    new_sync_write fs/read_write.c:469 [inline]
>    __vfs_write+0x684/0x970 fs/read_write.c:482
>    vfs_write+0x189/0x510 fs/read_write.c:544
>    SYSC_write fs/read_write.c:589 [inline]
>    SyS_write+0xef/0x220 fs/read_write.c:581
>    do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>    entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
> -> #0 (_unix_sk_receive_queue_lock_key){+.+.}:
>    lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
>    __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
>    _raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:152
>    skb_queue_tail+0x26/0x150 net/core/skbuff.c:2899
>    unix_dgram_sendmsg+0xa30/0x1610 net/unix/af_unix.c:1807
>    sock_sendmsg_nosec net/socket.c:629 [inline]
>    sock_sendmsg+0xca/0x110 net/socket.c:639
>    ___sys_sendmsg+0x320/0x8b0 net/socket.c:2047
>    __sys_sendmmsg+0x1ee/0x620 net/socket.c:2137
>    SYSC_sendmmsg net/socket.c:2168 [inline]
>    SyS_sendmmsg+0x35/0x60 net/socket.c:2163
>    do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>    entry_SYSCALL_64_after_hwframe+0x42/0xb7

sk_diag_dump_icons() dumps only sockets in TCP_LISTEN state.
TCP_LISTEN state may be assigned in only place in net/unix/af_unix.c:
it's unix_listen(). The function is applied to stream and seqpacket
socket types.

It can't be stream because of the second stack, and seqpacket also can't,
as I don't think it's possible for gcc to inline unix_seqpacket_sendmsg()
in the way, we don't see it in the stack.

So, this is looks like false positive result for me.

Kirill

> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>    CPU0    CPU1
>        
>   lock(&(>lock)->rlock/1);
>    lock(_unix_sk_receive_queue_lock_key);
>    lock(&(>lock)->rlock/1);
>   lock(_unix_sk_receive_queue_lock_key);
> 
>  *** DEADLOCK ***
> 
> 1 lock held by syz-executor7/20971:
>  #0:  (&(>lock)->rlock/1){+.+.}, at: [<4e725e14>] 
> unix_state_double_lock+0x7b/0xb0 net/unix/af_unix.c:1088
> 
> stack

Re: [BUG/Q] can_pernet_exit() leaves devices on dead net

2018-04-02 Thread Kirill Tkhai

Hi, Oliver,

On 02.04.2018 18:28, Oliver Hartkopp wrote:
> Hi Kirill, Marc,
> 
> I checked the code once more and added some debug output to the other parts 
> in CAN notifier code.
> 
> In fact the code pointed to by both of you seems to be obsolete as I only 
> wanted to be 'really sure' that no leftovers of the CAN filters at module 
> unloading.
> 
> 
>> Yes, this one looks good:
>> https://marc.info/?l=linux-can=150169589119335=2
>>
>> Regards,
>> Kirill
>>
> 
> I was obviously too cautious ;-)
> 
> All tests I made resulted in the function iterating through all the CAN 
> netdevices doing exactly nothing.
> 
> I'm fine with removing that stuff - but I'm not sure whether it's worth to 
> push that patch to stable 4.12+ or even before 4.12 (without namespace 
> support - and removing rcu_barrier() too).
> 
> Any opinions?

I think the same -- it's not need for stable as there is just iteration over 
empty list, i.e., noop.

Kirill

Re: [PATCH v2 net-next 08/12] inet: frags: use rhashtables for reassembly units

2018-03-30 Thread Kirill Tkhai

Hi, Eric,

thanks for more small patches in v2. One comment below.

On 30.03.2018 23:42, Eric Dumazet wrote:
> Some applications still rely on IP fragmentation, and to be fair linux
> reassembly unit is not working under any serious load.
> 
> It uses static hash tables of 1024 buckets, and up to 128 items per bucket 
> (!!!)
> 
> A work queue is supposed to garbage collect items when host is under memory
> pressure, and doing a hash rebuild, changing seed used in hash computations.
> 
> This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
> occurring every 5 seconds if host is under fire.
> 
> Then there is the problem of sharing this hash table for all netns.
> 
> It is time to switch to rhashtables, and allocate one of them per netns
> to speedup netns dismantle, since this is a critical metric these days.
> 
> Lookup is now using RCU. A followup patch will even remove
> the refcount hold/release left from prior implementation and save
> a couple of atomic operations.
> 
> Before this patch, 16 cpus (16 RX queue NIC) could not handle more
> than 1 Mpps frags DDOS.
> 
> After the patch, I reach 7 Mpps without any tuning, and can use up to 2GB
> of storage for the fragments.
> 
> $ grep FRAG /proc/net/sockstat
> FRAG: inuse 1966916 memory 2140004608
> 
> A followup patch will change the limits for 64bit arches.
> 
> Signed-off-by: Eric Dumazet 
> Cc: Florian Westphal 
> Cc: Nikolay Aleksandrov 
> Cc: Jesper Dangaard Brouer 
> Cc: Alexander Aring 
> Cc: Stefan Schmidt 
> ---
>  Documentation/networking/ip-sysctl.txt  |   7 +-
>  include/net/inet_frag.h |  81 +++---
>  include/net/ipv6.h  |  16 +-
>  net/ieee802154/6lowpan/6lowpan_i.h  |  26 +-
>  net/ieee802154/6lowpan/reassembly.c |  93 +++
>  net/ipv4/inet_fragment.c| 352 +---
>  net/ipv4/ip_fragment.c  | 112 
>  net/ipv6/netfilter/nf_conntrack_reasm.c |  51 +---
>  net/ipv6/reassembly.c   | 110 
>  9 files changed, 269 insertions(+), 579 deletions(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt 
> b/Documentation/networking/ip-sysctl.txt
> index 
> 33f35f049ad57ad6c06ed6e089966e346d72d108..6f2a3670e44b6662ce53c16cb7ca1e4f61274c15
>  100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -134,13 +134,10 @@ min_adv_mss - INTEGER
>  IP Fragmentation:
>  
>  ipfrag_high_thresh - INTEGER
> - Maximum memory used to reassemble IP fragments. When
> - ipfrag_high_thresh bytes of memory is allocated for this purpose,
> - the fragment handler will toss packets until ipfrag_low_thresh
> - is reached. This also serves as a maximum limit to namespaces
> - different from the initial one.
> + Maximum memory used to reassemble IP fragments.
>  
>  ipfrag_low_thresh - INTEGER
> + (Obsolete since linux-4.17)
>   Maximum memory used to reassemble IP fragments before the kernel
>   begins to remove incomplete fragment queues to free up resources.
>   The kernel still accepts new fragments for defragmentation.
> diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
> index 
> 69e531ed81894393e07cac9e953825fcb55ef42a..3fec0d3a0d0186e98afb951784e1fe7329ba6d77
>  100644
> --- a/include/net/inet_frag.h
> +++ b/include/net/inet_frag.h
> @@ -2,7 +2,11 @@
>  #ifndef __NET_FRAG_H__
>  #define __NET_FRAG_H__
>  
> +#include 
> +
>  struct netns_frags {
> + struct rhashtable   rhashtable cacheline_aligned_in_smp;
> +
>   /* Keep atomic mem on separate cachelines in structs that include it */
>   atomic_tmem cacheline_aligned_in_smp;
>   /* sysctls */
> @@ -26,12 +30,30 @@ enum {
>   INET_FRAG_COMPLETE  = BIT(2),
>  };
>  
> +struct frag_v4_compare_key {
> + __be32  saddr;
> + __be32  daddr;
> + u32 user;
> + u32 vif;
> + __be16  id;
> + u16 protocol;
> +};
> +
> +struct frag_v6_compare_key {
> + struct in6_addr saddr;
> + struct in6_addr daddr;
> + u32 user;
> + __be32  id;
> + u32 iif;
> +};
> +
>  /**
>   * struct inet_frag_queue - fragment queue
>   *
> - * @lock: spinlock protecting the queue
> + * @node: rhash node
> + * @key: keys identifying this frag.
>   * @timer: queue expiration timer
> - * @list: hash bucket list
> + * @lock: spinlock protecting this frag
>   * @refcnt: reference count of the queue
>   * @fragments: received fragments head
>   * @fragments_tail: received fragments tail
> @@ -41,12 +63,16 @@ enum {
>   * @flags: fragment queue flags
>   * @max_size: maximum received fragment size
>   * @net: namespace that this frag belongs to
> - * @list_evictor: list of queues to forcefully evict (e.g. due

[PATCH RESEND net-next 1/2] net: Remove net_rwsem from {, un}register_netdevice_notifier()

2018-03-30 Thread Kirill Tkhai

These functions take net_rwsem, while wireless_nlevent_flush()
also takes it. But down_read() can't be taken recursive,
because of rw_semaphore design, which prevents it to be occupied
by only readers forever.

Since we take pernet_ops_rwsem in {,un}register_netdevice_notifier(),
net list can't change, so these down_read()/up_read() can be removed.

Fixes: f0b07bb151b0 "net: Introduce net_rwsem to protect net_namespace_list"
Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 net/core/dev.c |5 -
 1 file changed, 5 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 07da7add4845..8edb58829124 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1633,7 +1633,6 @@ int register_netdevice_notifier(struct notifier_block *nb)
goto unlock;
if (dev_boot_phase)
goto unlock;
-   down_read(_rwsem);
for_each_net(net) {
for_each_netdev(net, dev) {
err = call_netdevice_notifier(nb, NETDEV_REGISTER, dev);
@@ -1647,7 +1646,6 @@ int register_netdevice_notifier(struct notifier_block *nb)
call_netdevice_notifier(nb, NETDEV_UP, dev);
}
}
-   up_read(_rwsem);
 
 unlock:
rtnl_unlock();
@@ -1671,7 +1669,6 @@ int register_netdevice_notifier(struct notifier_block *nb)
}
 
 outroll:
-   up_read(_rwsem);
raw_notifier_chain_unregister(_chain, nb);
goto unlock;
 }
@@ -1704,7 +1701,6 @@ int unregister_netdevice_notifier(struct notifier_block 
*nb)
if (err)
goto unlock;
 
-   down_read(_rwsem);
for_each_net(net) {
for_each_netdev(net, dev) {
if (dev->flags & IFF_UP) {
@@ -1715,7 +1711,6 @@ int unregister_netdevice_notifier(struct notifier_block 
*nb)
call_netdevice_notifier(nb, NETDEV_UNREGISTER, dev);
}
}
-   up_read(_rwsem);
 unlock:
rtnl_unlock();
up_write(_ops_rwsem);

[PATCH RESEND net-next 2/2] net: Do not take net_rwsem in __rtnl_link_unregister()

2018-03-30 Thread Kirill Tkhai

This function calls call_netdevice_notifier(), which also
may take net_rwsem. So, we can't use net_rwsem here.

This patch makes callers of this functions take pernet_ops_rwsem,
like register_netdevice_notifier() does. This will protect
the modifications of net_namespace_list, and allows notifiers
to take it (they won't have to care about context).

Since __rtnl_link_unregister() is used on module load
and unload (which are not frequent operations), this looks
for me better, than make all call_netdevice_notifier()
always executing in "protected net_namespace_list" context.

Also, this fixes the problem we had a deal in 328fbe747ad4
"Close race between {un, }register_netdevice_notifier and ...",
and guarantees __rtnl_link_unregister() does not skip
exitting net.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 drivers/net/dummy.c  |2 ++
 drivers/net/ifb.c|2 ++
 net/core/net_namespace.c |1 +
 net/core/rtnetlink.c |6 +++---
 4 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index 30b1c8512049..0d15a12a4560 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -219,6 +219,7 @@ static int __init dummy_init_module(void)
 {
int i, err = 0;
 
+   down_write(_ops_rwsem);
rtnl_lock();
err = __rtnl_link_register(_link_ops);
if (err < 0)
@@ -233,6 +234,7 @@ static int __init dummy_init_module(void)
 
 out:
rtnl_unlock();
+   up_write(_ops_rwsem);
 
return err;
 }
diff --git a/drivers/net/ifb.c b/drivers/net/ifb.c
index 0008da7e9d4c..5f2897ec0edc 100644
--- a/drivers/net/ifb.c
+++ b/drivers/net/ifb.c
@@ -330,6 +330,7 @@ static int __init ifb_init_module(void)
 {
int i, err;
 
+   down_write(_ops_rwsem);
rtnl_lock();
err = __rtnl_link_register(_link_ops);
if (err < 0)
@@ -344,6 +345,7 @@ static int __init ifb_init_module(void)
 
 out:
rtnl_unlock();
+   up_write(_ops_rwsem);
 
return err;
 }
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 7fdf321d4997..a11e03f920d3 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -51,6 +51,7 @@ static bool init_net_initialized;
  * outside.
  */
 DECLARE_RWSEM(pernet_ops_rwsem);
+EXPORT_SYMBOL_GPL(pernet_ops_rwsem);
 
 #define MIN_PERNET_OPS_ID  \
((sizeof(struct net_generic) + sizeof(void *) - 1) / sizeof(void *))
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index e86b28482ca7..45936922d7e2 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -412,17 +412,17 @@ static void __rtnl_kill_links(struct net *net, struct 
rtnl_link_ops *ops)
  * __rtnl_link_unregister - Unregister rtnl_link_ops from rtnetlink.
  * @ops: struct rtnl_link_ops * to unregister
  *
- * The caller must hold the rtnl_mutex.
+ * The caller must hold the rtnl_mutex and guarantee net_namespace_list
+ * integrity (hold pernet_ops_rwsem for writing to close the race
+ * with setup_net() and cleanup_net()).
  */
 void __rtnl_link_unregister(struct rtnl_link_ops *ops)
 {
struct net *net;
 
-   down_read(_rwsem);
for_each_net(net) {
__rtnl_kill_links(net, ops);
}
-   up_read(_rwsem);
list_del(>list);
 }
 EXPORT_SYMBOL_GPL(__rtnl_link_unregister);

[PATCH RESEND net-next 0/2] net_rwsem fixes

2018-03-30 Thread Kirill Tkhai

(Forgot to CC netdev@ :)

Hi,

there is wext_netdev_notifier_call()->wireless_nlevent_flush()
netdevice notifier, which takes net_rwsem, so we can't take
net_rwsem in {,un}register_netdevice_notifier().

Since {,un}register_netdevice_notifier() is executed under
pernet_ops_rwsem, net_namespace_list can't change, while we
holding it, so there is no need net_rwsem in these functions [1/2].

The same is in [2/2]. We make callers of __rtnl_link_unregister()
take pernet_ops_rwsem, and close the race with setup_net()
and cleanup_net(), so __rtnl_link_unregister() does not need it.
This also fixes the problem of that __rtnl_link_unregister() does
not see initializing and exiting nets.

Thanks,
Kirill
---

Kirill Tkhai (2):
  net: Remove net_rwsem from {,un}register_netdevice_notifier()
  net: Do not take net_rwsem in __rtnl_link_unregister()


 drivers/net/dummy.c  |2 ++
 drivers/net/ifb.c|2 ++
 net/core/dev.c   |5 -
 net/core/net_namespace.c |1 +
 net/core/rtnetlink.c |6 +++---
 5 files changed, 8 insertions(+), 8 deletions(-)

--
Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>

Re: [PATCH net-next 0/3] Close race between {un, }register_netdevice_notifier and pernet_operations

2018-03-30 Thread Kirill Tkhai

On 30.03.2018 18:00, David Miller wrote:
> From: Kirill Tkhai <ktk...@virtuozzo.com>
> Date: Thu, 29 Mar 2018 17:03:15 +0300
> 
>> the problem is {,un}register_netdevice_notifier() do not take
>> pernet_ops_rwsem, and they don't see network namespaces, being
>> initialized in setup_net() and cleanup_net(), since at this
>> time net is not hashed to net_namespace_list.
>>
>> This may lead to imbalance, when a notifier is called at time of
>> setup_net()/net is alive, but it's not called at time of cleanup_net(),
>> for the devices, hashed to the net, and vise versa. See (3/3) for
>> the scheme of imbalance.
>>
>> This patchset fixes the problem by acquiring pernet_ops_rwsem
>> at the time of {,un}register_netdevice_notifier() (3/3).
>> (1-2/3) are preparations in xfrm and netfilter subsystems.
>>
>> The problem was introduced a long ago, but backporting won't be easy,
>> since every previous kernel version may have changes in netdevice
>> notifiers, and they all need review and testing. Otherwise, there
>> may be more pernet_operations, which register or unregister
>> netdevice notifiers, and that leads to deadlock (which is was fixed
>> in 1-2/3). This patchset is for net-next.
> 
> I am applying this series and skipping the rwsem revert.
> 
> Thanks Kirill.

Thanks, David.

I'll send the fixing patch soon today.

Kirill

Re: [PATCH net-next 4/6] inet: frags: use rhashtables for reassembly units

2018-03-30 Thread Kirill Tkhai

Hi, Eric,

On 30.03.2018 08:22, Eric Dumazet wrote:
> Some applications still rely on IP fragmentation, and to be fair linux
> reassembly unit is not working under any serious load.
> 
> It uses static hash tables of 1024 buckets, and up to 128 items per bucket 
> (!!!)
> 
> A work queue is supposed to garbage collect items when host is under memory
> pressure, and doing a hash rebuild, changing seed used in hash computations.
> 
> This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
> occurring every 5 seconds if host is under fire.
> 
> Then there is the problem of sharing this hash table for all netns.
> 
> It is time to switch to rhashtables, and allocate one of them per netns
> to speedup netns dismantle, since this is a critical metric these days.
> 
> Lookup is now using RCU. A followup patch will even remove
> the refcount hold/release left from prior implementation and save
> a couple of atomic operations.
> 
> Before this patch, 16 cpus (16 RX queue NIC) could not handle more
> than 1 Mpps frags DDOS.
> 
> After the patch, I reach 7 Mpps without any tuning, and can use up to 2GB
> of storage for the fragments.

Great results!

Please, see some comments below.
 
> $ grep FRAG /proc/net/sockstat
> FRAG: inuse 1966916 memory 2140004608
> 
> A followup patch will change the limits for 64bit arches.
> 
> Signed-off-by: Eric Dumazet 
> Cc: Florian Westphal 
> Cc: Nikolay Aleksandrov 
> Cc: Jesper Dangaard Brouer 
> Cc: Alexander Aring 
> Cc: Stefan Schmidt 
> ---
>  Documentation/networking/ip-sysctl.txt  |   7 +-
>  include/net/inet_frag.h |  99 +++---
>  include/net/ipv6.h  |  20 +-
>  net/ieee802154/6lowpan/6lowpan_i.h  |  26 +-
>  net/ieee802154/6lowpan/reassembly.c | 108 +++
>  net/ipv4/inet_fragment.c| 399 +---
>  net/ipv4/ip_fragment.c  | 165 +-
>  net/ipv6/netfilter/nf_conntrack_reasm.c |  62 ++--
>  net/ipv6/reassembly.c   | 152 +
>  9 files changed, 344 insertions(+), 694 deletions(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt 
> b/Documentation/networking/ip-sysctl.txt
> index 
> 1d1120753ae82d0aee3e934a3d9c074b70dcbca6..c3b65f24e58aa72b720861d816fb76f9956800f0
>  100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -134,13 +134,10 @@ min_adv_mss - INTEGER
>  IP Fragmentation:
>  
>  ipfrag_high_thresh - INTEGER
> - Maximum memory used to reassemble IP fragments. When
> - ipfrag_high_thresh bytes of memory is allocated for this purpose,
> - the fragment handler will toss packets until ipfrag_low_thresh
> - is reached. This also serves as a maximum limit to namespaces
> - different from the initial one.
> + Maximum memory used to reassemble IP fragments.
>  
>  ipfrag_low_thresh - INTEGER
> + (Obsolete since linux-4.17)
>   Maximum memory used to reassemble IP fragments before the kernel
>   begins to remove incomplete fragment queues to free up resources.
>   The kernel still accepts new fragments for defragmentation.
> diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
> index 
> 69e531ed81894393e07cac9e953825fcb55ef42a..05099f9f980e2384c0c8cd7e74659656b585cd22
>  100644
> --- a/include/net/inet_frag.h
> +++ b/include/net/inet_frag.h
> @@ -2,15 +2,20 @@
>  #ifndef __NET_FRAG_H__
>  #define __NET_FRAG_H__
>  
> +#include 
> +
>  struct netns_frags {
> - /* Keep atomic mem on separate cachelines in structs that include it */
> - atomic_tmem cacheline_aligned_in_smp;
>   /* sysctls */
>   int timeout;
>   int high_thresh;
>   int low_thresh;
>   int max_dist;
>   struct inet_frags   *f;
> +
> + /* Keep atomic mem on separate cachelines in structs that include it */
> + atomic_tmem cacheline_aligned_in_smp;

The patch is big, and it seems it's possible to extract refactorings like this
in separate patch/patches. Here just two lines moved down.

> +
> + struct rhashtable   rhashtable cacheline_aligned_in_smp;
>  };
>  
>  /**
> @@ -26,12 +31,31 @@ enum {
>   INET_FRAG_COMPLETE  = BIT(2),
>  };
>  
> +struct frag_v4_compare_key {
> + __be32  saddr;
> + __be32  daddr;
> + u32 user;
> + u32 vif;
> + __be16  id;
> + u16 protocol;
> +};
> +
> +struct frag_v6_compare_key {
> + struct in6_addr saddr;
> + struct in6_addr daddr;
> + u32 user;
> + __be32  id;
> + u32 iif;
> +};
> +
>  /**
>   * struct inet_frag_queue - fragment queue
>   *
> - * @lock: spinlock protecting the queue
> + * @node: rhash

Re: [PATCH net-next] net: Revert net_rwsem

2018-03-30 Thread Kirill Tkhai

On 30.03.2018 12:23, Kirill Tkhai wrote:
> This reverts:
> 
> 152f253152cc net: Remove rtnl_lock() in nf_ct_iterate_destroy()
> ec9c780925c5 ovs: Remove rtnl_lock() from ovs_exit_net()
> 350311aab4c0 security: Remove rtnl_lock() in selinux_xfrm_notify_policyload()
> 10256debb918 net: Don't take rtnl_lock() in wireless_nlevent_flush()
> f0b07bb151b0 net: Introduce net_rwsem to protect net_namespace_list
> 
> There are missed, that down_read() can't be taken recursive.
> This is because of rw_semaphore design, which prevents it
> to be occupied by only readers forever.
> 
> So, we can't take it in register_netdevice_notifier(), as it's also
> taken in wext_netdev_notifier_call()->wireless_nlevent_flush().
> 
> The solution is to protect net_namespace_list modifications
> in register_netdevice_notifier() via pernet_ops_rwsem, as it's
> made in:
> 
> https://patchwork.ozlabs.org/project/netdev/list/?series=36495
> 
> Then, we won't have to take net_rwsem in call_netevent_notifiers().
> 
> But since the patchset is not in kernel, let's just revert net_rwsem
> for now, and I'll resubmit it later (after the above patchset).
> 
> Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>

Or, if https://patchwork.ozlabs.org/project/netdev/list/?series=36495
is ok enough to go in kernel in a little while, I'll make another fix
(removing down_read(_rwsem) from {,un}register_netdevice_notifiers()).
Please, let me know if some other actions are required.

Thanks,
Kirill

> ---
>  drivers/infiniband/core/roce_gid_mgmt.c |2 --
>  include/linux/rtnetlink.h   |1 -
>  include/net/net_namespace.h |1 -
>  net/core/dev.c  |5 -
>  net/core/fib_notifier.c |2 --
>  net/core/net_namespace.c|   18 +-
>  net/core/rtnetlink.c|5 -
>  net/netfilter/nf_conntrack_core.c   |4 ++--
>  net/openvswitch/datapath.c  |4 ++--
>  net/wireless/wext-core.c|6 --
>  security/selinux/include/xfrm.h |4 ++--
>  11 files changed, 15 insertions(+), 37 deletions(-)
> 
> diff --git a/drivers/infiniband/core/roce_gid_mgmt.c 
> b/drivers/infiniband/core/roce_gid_mgmt.c
> index cc2966380c0c..5a52ec77940a 100644
> --- a/drivers/infiniband/core/roce_gid_mgmt.c
> +++ b/drivers/infiniband/core/roce_gid_mgmt.c
> @@ -403,12 +403,10 @@ static void enum_all_gids_of_dev_cb(struct ib_device 
> *ib_dev,
>* our feet
>*/
>   rtnl_lock();
> - down_read(_rwsem);
>   for_each_net(net)
>   for_each_netdev(net, ndev)
>   if (is_eth_port_of_netdev(ib_dev, port, rdma_ndev, 
> ndev))
>   add_netdev_ips(ib_dev, port, rdma_ndev, ndev);
> - up_read(_rwsem);
>   rtnl_unlock();
>  }
>  
> diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
> index 5225832bd6ff..c7d1e4689325 100644
> --- a/include/linux/rtnetlink.h
> +++ b/include/linux/rtnetlink.h
> @@ -37,7 +37,6 @@ extern int rtnl_lock_killable(void);
>  
>  extern wait_queue_head_t netdev_unregistering_wq;
>  extern struct rw_semaphore pernet_ops_rwsem;
> -extern struct rw_semaphore net_rwsem;
>  
>  #ifdef CONFIG_PROVE_LOCKING
>  extern bool lockdep_rtnl_is_held(void);
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index 47e35cce3b64..1ab4f920f109 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -291,7 +291,6 @@ static inline struct net *read_pnet(const possible_net_t 
> *pnet)
>  #endif
>  }
>  
> -/* Protected by net_rwsem */
>  #define for_each_net(VAR)\
>   list_for_each_entry(VAR, _namespace_list, list)
>  
> diff --git a/net/core/dev.c b/net/core/dev.c
> index eca5458b2753..e13807b5c84d 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1629,7 +1629,6 @@ int register_netdevice_notifier(struct notifier_block 
> *nb)
>   goto unlock;
>   if (dev_boot_phase)
>   goto unlock;
> - down_read(_rwsem);
>   for_each_net(net) {
>   for_each_netdev(net, dev) {
>   err = call_netdevice_notifier(nb, NETDEV_REGISTER, dev);
> @@ -1643,7 +1642,6 @@ int register_netdevice_notifier(struct notifier_block 
> *nb)
>   call_netdevice_notifier(nb, NETDEV_UP, dev);
>   }
>   }
> - up_read(_rwsem);
>  
>  unlock:
>   rtnl_unlock();
> @@ -1666,7 +1664,6 @@ int register_netdevice_notifier(struct notifier_block 
> *nb)
>   }
>  
>  outroll:
> - up_read(_rwsem);
>

[PATCH net-next] net: Revert net_rwsem

2018-03-30 Thread Kirill Tkhai

This reverts:

152f253152cc net: Remove rtnl_lock() in nf_ct_iterate_destroy()
ec9c780925c5 ovs: Remove rtnl_lock() from ovs_exit_net()
350311aab4c0 security: Remove rtnl_lock() in selinux_xfrm_notify_policyload()
10256debb918 net: Don't take rtnl_lock() in wireless_nlevent_flush()
f0b07bb151b0 net: Introduce net_rwsem to protect net_namespace_list

There are missed, that down_read() can't be taken recursive.
This is because of rw_semaphore design, which prevents it
to be occupied by only readers forever.

So, we can't take it in register_netdevice_notifier(), as it's also
taken in wext_netdev_notifier_call()->wireless_nlevent_flush().

The solution is to protect net_namespace_list modifications
in register_netdevice_notifier() via pernet_ops_rwsem, as it's
made in:

https://patchwork.ozlabs.org/project/netdev/list/?series=36495

Then, we won't have to take net_rwsem in call_netevent_notifiers().

But since the patchset is not in kernel, let's just revert net_rwsem
for now, and I'll resubmit it later (after the above patchset).

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 drivers/infiniband/core/roce_gid_mgmt.c |2 --
 include/linux/rtnetlink.h   |1 -
 include/net/net_namespace.h |1 -
 net/core/dev.c  |5 -
 net/core/fib_notifier.c |2 --
 net/core/net_namespace.c|   18 +-
 net/core/rtnetlink.c|5 -
 net/netfilter/nf_conntrack_core.c   |4 ++--
 net/openvswitch/datapath.c  |4 ++--
 net/wireless/wext-core.c|6 --
 security/selinux/include/xfrm.h |4 ++--
 11 files changed, 15 insertions(+), 37 deletions(-)

diff --git a/drivers/infiniband/core/roce_gid_mgmt.c 
b/drivers/infiniband/core/roce_gid_mgmt.c
index cc2966380c0c..5a52ec77940a 100644
--- a/drivers/infiniband/core/roce_gid_mgmt.c
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -403,12 +403,10 @@ static void enum_all_gids_of_dev_cb(struct ib_device 
*ib_dev,
 * our feet
 */
rtnl_lock();
-   down_read(_rwsem);
for_each_net(net)
for_each_netdev(net, ndev)
if (is_eth_port_of_netdev(ib_dev, port, rdma_ndev, 
ndev))
add_netdev_ips(ib_dev, port, rdma_ndev, ndev);
-   up_read(_rwsem);
rtnl_unlock();
 }
 
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 5225832bd6ff..c7d1e4689325 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -37,7 +37,6 @@ extern int rtnl_lock_killable(void);
 
 extern wait_queue_head_t netdev_unregistering_wq;
 extern struct rw_semaphore pernet_ops_rwsem;
-extern struct rw_semaphore net_rwsem;
 
 #ifdef CONFIG_PROVE_LOCKING
 extern bool lockdep_rtnl_is_held(void);
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 47e35cce3b64..1ab4f920f109 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -291,7 +291,6 @@ static inline struct net *read_pnet(const possible_net_t 
*pnet)
 #endif
 }
 
-/* Protected by net_rwsem */
 #define for_each_net(VAR)  \
list_for_each_entry(VAR, _namespace_list, list)
 
diff --git a/net/core/dev.c b/net/core/dev.c
index eca5458b2753..e13807b5c84d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1629,7 +1629,6 @@ int register_netdevice_notifier(struct notifier_block *nb)
goto unlock;
if (dev_boot_phase)
goto unlock;
-   down_read(_rwsem);
for_each_net(net) {
for_each_netdev(net, dev) {
err = call_netdevice_notifier(nb, NETDEV_REGISTER, dev);
@@ -1643,7 +1642,6 @@ int register_netdevice_notifier(struct notifier_block *nb)
call_netdevice_notifier(nb, NETDEV_UP, dev);
}
}
-   up_read(_rwsem);
 
 unlock:
rtnl_unlock();
@@ -1666,7 +1664,6 @@ int register_netdevice_notifier(struct notifier_block *nb)
}
 
 outroll:
-   up_read(_rwsem);
raw_notifier_chain_unregister(_chain, nb);
goto unlock;
 }
@@ -1697,7 +1694,6 @@ int unregister_netdevice_notifier(struct notifier_block 
*nb)
if (err)
goto unlock;
 
-   down_read(_rwsem);
for_each_net(net) {
for_each_netdev(net, dev) {
if (dev->flags & IFF_UP) {
@@ -1708,7 +1704,6 @@ int unregister_netdevice_notifier(struct notifier_block 
*nb)
call_netdevice_notifier(nb, NETDEV_UNREGISTER, dev);
}
}
-   up_read(_rwsem);
 unlock:
rtnl_unlock();
return err;
diff --git a/net/core/fib_notifier.c b/net/core/fib_notifier.c
index 13a40b831d6d..b793b523aba3 100644
--- a/net/core/fib_notifier.c
+++ b/net/core/fib_notifier.c
@@ -39,7 +39,6 @@ static unsigned int fib_seq_sum(void)

[PATCH net-next 5/5] net: Remove rtnl_lock() in nf_ct_iterate_destroy()

2018-03-29 Thread Kirill Tkhai

rtnl_lock() doesn't protect net::ct::count,
and it's not needed for__nf_ct_unconfirmed_destroy()
and for nf_queue_nf_hook_drop().

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 net/netfilter/nf_conntrack_core.c |2 --
 1 file changed, 2 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index 370f9b7f051b..41ff04ee2554 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1763,7 +1763,6 @@ nf_ct_iterate_destroy(int (*iter)(struct nf_conn *i, void 
*data), void *data)
 {
struct net *net;
 
-   rtnl_lock();
down_read(_rwsem);
for_each_net(net) {
if (atomic_read(>ct.count) == 0)
@@ -1772,7 +1771,6 @@ nf_ct_iterate_destroy(int (*iter)(struct nf_conn *i, void 
*data), void *data)
nf_queue_nf_hook_drop(net);
}
up_read(_rwsem);
-   rtnl_unlock();
 
/* Need to wait for netns cleanup worker to finish, if its
 * running -- it might have deleted a net namespace from

[PATCH net-next 2/5] net: Don't take rtnl_lock() in wireless_nlevent_flush()

2018-03-29 Thread Kirill Tkhai

This function iterates over net_namespace_list and flushes
the queue for every of them. What does this rtnl_lock()
protects?! Since we may add skbs to net::wext_nlevents
without rtnl_lock(), it does not protects us about queuers.

It guarantees, two threads can't flush the queue in parallel,
that can change the order, but since skb can be queued
in any order, it doesn't matter, how many threads do this
in parallel. In case of several threads, this will be even
faster.

So, we can remove rtnl_lock() here, as it was used for
iteration over net_namespace_list only.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 net/wireless/wext-core.c |4 
 1 file changed, 4 deletions(-)

diff --git a/net/wireless/wext-core.c b/net/wireless/wext-core.c
index 544d7b62d7ca..5e677dac2a0c 100644
--- a/net/wireless/wext-core.c
+++ b/net/wireless/wext-core.c
@@ -347,8 +347,6 @@ void wireless_nlevent_flush(void)
struct sk_buff *skb;
struct net *net;
 
-   ASSERT_RTNL();
-
down_read(_rwsem);
for_each_net(net) {
while ((skb = skb_dequeue(>wext_nlevents)))
@@ -412,9 +410,7 @@ subsys_initcall(wireless_nlevent_init);
 /* Process events generated by the wireless layer or the driver. */
 static void wireless_nlevent_process(struct work_struct *work)
 {
-   rtnl_lock();
wireless_nlevent_flush();
-   rtnl_unlock();
 }
 
 static DECLARE_WORK(wireless_nlevent_work, wireless_nlevent_process);

[PATCH net-next 3/5] security: Remove rtnl_lock() in selinux_xfrm_notify_policyload()

2018-03-29 Thread Kirill Tkhai

rt_genid_bump_all() consists of ipv4 and ipv6 part.
ipv4 part is incrementing of net::ipv4::rt_genid,
and I see many places, where it's read without rtnl_lock().

ipv6 part calls __fib6_clean_all(), and it's also
called without rtnl_lock() in other places.

So, rtnl_lock() here was used to iterate net_namespace_list only,
and we can remove it.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 security/selinux/include/xfrm.h |2 --
 1 file changed, 2 deletions(-)

diff --git a/security/selinux/include/xfrm.h b/security/selinux/include/xfrm.h
index 31d66431be1e..a0b465316292 100644
--- a/security/selinux/include/xfrm.h
+++ b/security/selinux/include/xfrm.h
@@ -47,12 +47,10 @@ static inline void selinux_xfrm_notify_policyload(void)
 {
struct net *net;
 
-   rtnl_lock();
down_read(_rwsem);
for_each_net(net)
rt_genid_bump_all(net);
up_read(_rwsem);
-   rtnl_unlock();
 }
 #else
 static inline int selinux_xfrm_enabled(void)

[PATCH net-next 4/5] ovs: Remove rtnl_lock() from ovs_exit_net()

2018-03-29 Thread Kirill Tkhai

Here we iterate for_each_net() and removes
vport from alive net to the exiting net.

ovs_net::dps are protected by ovs_mutex(),
and the others, who change it (ovs_dp_cmd_new(),
__dp_destroy()) also take it.
The same with datapath::ports list.

So, we remove rtnl_lock() here.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 net/openvswitch/datapath.c |2 --
 1 file changed, 2 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 9746ee30a99b..015e24e08909 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -2363,12 +2363,10 @@ static void __net_exit ovs_exit_net(struct net *dnet)
list_for_each_entry_safe(dp, dp_next, _net->dps, list_node)
__dp_destroy(dp);
 
-   rtnl_lock();
down_read(_rwsem);
for_each_net(net)
list_vports_from_net(net, dnet, );
up_read(_rwsem);
-   rtnl_unlock();
 
/* Detach all vports from given namespace. */
list_for_each_entry_safe(vport, vport_next, , detach_list) {

[PATCH net-next 1/5] net: Introduce net_rwsem to protect net_namespace_list

2018-03-29 Thread Kirill Tkhai

rtnl_lock() is used everywhere, and contention is very high.
When someone wants to iterate over alive net namespaces,
he/she has no a possibility to do that without exclusive lock.
But the exclusive rtnl_lock() in such places is overkill,
and it just increases the contention. Yes, there is already
for_each_net_rcu() in kernel, but it requires rcu_read_lock(),
and this can't be sleepable. Also, sometimes it may be need
really prevent net_namespace_list growth, so for_each_net_rcu()
is not fit there.

This patch introduces new rw_semaphore, which will be used
instead of rtnl_mutex to protect net_namespace_list. It is
sleepable and allows not-exclusive iterations over net
namespaces list. It allows to stop using rtnl_lock()
in several places (what is made in next patches) and makes
less the time, we keep rtnl_mutex. Here we just add new lock,
while the explanation of we can remove rtnl_lock() there are
in next patches.

Fine grained locks generally are better, then one big lock,
so let's do that with net_namespace_list, while the situation
allows that.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 drivers/infiniband/core/roce_gid_mgmt.c |2 ++
 include/linux/rtnetlink.h   |1 +
 include/net/net_namespace.h |1 +
 net/core/dev.c  |5 +
 net/core/fib_notifier.c |2 ++
 net/core/net_namespace.c|   18 +-
 net/core/rtnetlink.c|5 +
 net/netfilter/nf_conntrack_core.c   |2 ++
 net/openvswitch/datapath.c  |2 ++
 net/wireless/wext-core.c|2 ++
 security/selinux/include/xfrm.h |2 ++
 11 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/roce_gid_mgmt.c 
b/drivers/infiniband/core/roce_gid_mgmt.c
index 5a52ec77940a..cc2966380c0c 100644
--- a/drivers/infiniband/core/roce_gid_mgmt.c
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -403,10 +403,12 @@ static void enum_all_gids_of_dev_cb(struct ib_device 
*ib_dev,
 * our feet
 */
rtnl_lock();
+   down_read(_rwsem);
for_each_net(net)
for_each_netdev(net, ndev)
if (is_eth_port_of_netdev(ib_dev, port, rdma_ndev, 
ndev))
add_netdev_ips(ib_dev, port, rdma_ndev, ndev);
+   up_read(_rwsem);
rtnl_unlock();
 }
 
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index c7d1e4689325..5225832bd6ff 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -37,6 +37,7 @@ extern int rtnl_lock_killable(void);
 
 extern wait_queue_head_t netdev_unregistering_wq;
 extern struct rw_semaphore pernet_ops_rwsem;
+extern struct rw_semaphore net_rwsem;
 
 #ifdef CONFIG_PROVE_LOCKING
 extern bool lockdep_rtnl_is_held(void);
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 1ab4f920f109..47e35cce3b64 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -291,6 +291,7 @@ static inline struct net *read_pnet(const possible_net_t 
*pnet)
 #endif
 }
 
+/* Protected by net_rwsem */
 #define for_each_net(VAR)  \
list_for_each_entry(VAR, _namespace_list, list)
 
diff --git a/net/core/dev.c b/net/core/dev.c
index e13807b5c84d..eca5458b2753 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1629,6 +1629,7 @@ int register_netdevice_notifier(struct notifier_block *nb)
goto unlock;
if (dev_boot_phase)
goto unlock;
+   down_read(_rwsem);
for_each_net(net) {
for_each_netdev(net, dev) {
err = call_netdevice_notifier(nb, NETDEV_REGISTER, dev);
@@ -1642,6 +1643,7 @@ int register_netdevice_notifier(struct notifier_block *nb)
call_netdevice_notifier(nb, NETDEV_UP, dev);
}
}
+   up_read(_rwsem);
 
 unlock:
rtnl_unlock();
@@ -1664,6 +1666,7 @@ int register_netdevice_notifier(struct notifier_block *nb)
}
 
 outroll:
+   up_read(_rwsem);
raw_notifier_chain_unregister(_chain, nb);
goto unlock;
 }
@@ -1694,6 +1697,7 @@ int unregister_netdevice_notifier(struct notifier_block 
*nb)
if (err)
goto unlock;
 
+   down_read(_rwsem);
for_each_net(net) {
for_each_netdev(net, dev) {
if (dev->flags & IFF_UP) {
@@ -1704,6 +1708,7 @@ int unregister_netdevice_notifier(struct notifier_block 
*nb)
call_netdevice_notifier(nb, NETDEV_UNREGISTER, dev);
}
}
+   up_read(_rwsem);
 unlock:
rtnl_unlock();
return err;
diff --git a/net/core/fib_notifier.c b/net/core/fib_notifier.c
index 0c048bdeb016..614b985c92a4 100644
--- a/net/core/fib_notifier.c
+++ b/net/core/fib_notifier.c
@@ -33,6 +33,7 @@ static unsigned int fib_seq_sum(void)
st

[PATCH net-next 0/5] Introduce net_rwsem to protect net_namespace_list

2018-03-29 Thread Kirill Tkhai

The series introduces fine grained rw_semaphore, which will be used
instead of rtnl_lock() to protect net_namespace_list.

This improves scalability and allows to do non-exclusive sleepable
iteration for_each_net(), which is enough for most cases.

scripts/get_maintainer.pl gives enormous list of people, and I add
all to CC.

Note, that this patch is independent of "Close race between
{un, }register_netdevice_notifier and pernet_operations":
https://patchwork.ozlabs.org/project/netdev/list/?series=36495

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---

Kirill Tkhai (5):
  net: Introduce net_rwsem to protect net_namespace_list
  net: Don't take rtnl_lock() in wireless_nlevent_flush()
  security: Remove rtnl_lock() in selinux_xfrm_notify_policyload()
  ovs: Remove rtnl_lock() from ovs_exit_net()
  net: Remove rtnl_lock() in nf_ct_iterate_destroy()


 drivers/infiniband/core/roce_gid_mgmt.c |2 ++
 include/linux/rtnetlink.h   |1 +
 include/net/net_namespace.h |1 +
 net/core/dev.c  |5 +
 net/core/fib_notifier.c |2 ++
 net/core/net_namespace.c|   18 +-
 net/core/rtnetlink.c|5 +
 net/netfilter/nf_conntrack_core.c   |4 ++--
 net/openvswitch/datapath.c  |4 ++--
 net/wireless/wext-core.c|6 ++
 security/selinux/include/xfrm.h |4 ++--
 11 files changed, 37 insertions(+), 15 deletions(-)

--
Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>

[PATCH net-next 3/3] net: Close race between {un, }register_netdevice_notifier() and setup_net()/cleanup_net()

2018-03-29 Thread Kirill Tkhai

{un,}register_netdevice_notifier() iterate over all net namespaces
hashed to net_namespace_list. But pernet_operations register and
unregister netdevices in unhashed net namespace, and they are not
seen for netdevice notifiers. This results in asymmetry:

1)Race with register_netdevice_notifier()
  pernet_operations::init(net)  ...
   register_netdevice() ...
call_netdevice_notifiers()  ...
  ... nb is not called ...
  ...   register_netdevice_notifier(nb) -> net skipped
  ...   ...
  list_add_tail(>list, ..) ...

  Then, userspace stops using net, and it's destructed:

  pernet_operations::exit(net)
   unregister_netdevice()
call_netdevice_notifiers()
  ... nb is called ...

This always happens with net::loopback_dev, but it may be not the only device.

2)Race with unregister_netdevice_notifier()
  pernet_operations::init(net)
   register_netdevice()
call_netdevice_notifiers()
  ... nb is called ...

  Then, userspace stops using net, and it's destructed:

  list_del_rcu(>list)  ...
  pernet_operations::exit(net)  unregister_netdevice_notifier(nb) -> net skipped
   dev_change_net_namespace()   ...
call_netdevice_notifiers()
  ... nb is not called ...
   unregister_netdevice()
call_netdevice_notifiers()
  ... nb is not called ...

This race is more danger, since dev_change_net_namespace() moves real
network devices, which use not trivial netdevice notifiers, and if this
will happen, the system will be left in unpredictable state.

The patch closes the race. During the testing I found two places,
where register_netdevice_notifier() is called from pernet init/exit
methods (which led to deadlock) and fixed them (see previous patches).

The review moved me to one more unusual registration place:
raw_init() (can driver). It may be a reason of problems,
if someone creates in-kernel CAN_RAW sockets, since they
will be destroyed in exit method and raw_release()
will call unregister_netdevice_notifier(). But grep over
kernel tree does not show, someone creates such sockets
from kernel space.

Theoretically, there can be more places like this, and which are
hidden from review, but we found them on the first bumping there
(since there is no a race, it will be 100% reproducible).

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 net/core/dev.c |6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index e13807b5c84d..43abc5785a85 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1623,6 +1623,8 @@ int register_netdevice_notifier(struct notifier_block *nb)
struct net *net;
int err;
 
+   /* Close race with setup_net() and cleanup_net() */
+   down_write(_ops_rwsem);
rtnl_lock();
err = raw_notifier_chain_register(_chain, nb);
if (err)
@@ -1645,6 +1647,7 @@ int register_netdevice_notifier(struct notifier_block *nb)
 
 unlock:
rtnl_unlock();
+   up_write(_ops_rwsem);
return err;
 
 rollback:
@@ -1689,6 +1692,8 @@ int unregister_netdevice_notifier(struct notifier_block 
*nb)
struct net *net;
int err;
 
+   /* Close race with setup_net() and cleanup_net() */
+   down_write(_ops_rwsem);
rtnl_lock();
err = raw_notifier_chain_unregister(_chain, nb);
if (err)
@@ -1706,6 +1711,7 @@ int unregister_netdevice_notifier(struct notifier_block 
*nb)
}
 unlock:
rtnl_unlock();
+   up_write(_ops_rwsem);
return err;
 }
 EXPORT_SYMBOL(unregister_netdevice_notifier);

[PATCH net-next 2/3] netfilter: Rework xt_TEE netdevice notifier

2018-03-29 Thread Kirill Tkhai

Register netdevice notifier for every iptable entry
is not good, since this breaks modularity, and
the hidden synchronization is based on rtnl_lock().

This patch reworks the synchronization via new lock,
while the rest of logic remains as it was before.
This is required for the next patch.

Tested via:

while :; do
unshare -n iptables -t mangle -A OUTPUT -j TEE --gateway 1.1.1.2 --oif 
lo;
done

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 net/netfilter/xt_TEE.c |   73 ++--
 1 file changed, 46 insertions(+), 27 deletions(-)

diff --git a/net/netfilter/xt_TEE.c b/net/netfilter/xt_TEE.c
index 86b0580b2216..475957cfcf50 100644
--- a/net/netfilter/xt_TEE.c
+++ b/net/netfilter/xt_TEE.c
@@ -20,7 +20,7 @@
 #include 
 
 struct xt_tee_priv {
-   struct notifier_block   notifier;
+   struct list_headlist;
struct xt_tee_tginfo*tginfo;
int oif;
 };
@@ -51,29 +51,35 @@ tee_tg6(struct sk_buff *skb, const struct xt_action_param 
*par)
 }
 #endif
 
+static DEFINE_MUTEX(priv_list_mutex);
+static LIST_HEAD(priv_list);
+
 static int tee_netdev_event(struct notifier_block *this, unsigned long event,
void *ptr)
 {
struct net_device *dev = netdev_notifier_info_to_dev(ptr);
struct xt_tee_priv *priv;
 
-   priv = container_of(this, struct xt_tee_priv, notifier);
-   switch (event) {
-   case NETDEV_REGISTER:
-   if (!strcmp(dev->name, priv->tginfo->oif))
-   priv->oif = dev->ifindex;
-   break;
-   case NETDEV_UNREGISTER:
-   if (dev->ifindex == priv->oif)
-   priv->oif = -1;
-   break;
-   case NETDEV_CHANGENAME:
-   if (!strcmp(dev->name, priv->tginfo->oif))
-   priv->oif = dev->ifindex;
-   else if (dev->ifindex == priv->oif)
-   priv->oif = -1;
-   break;
+   mutex_lock(_list_mutex);
+   list_for_each_entry(priv, _list, list) {
+   switch (event) {
+   case NETDEV_REGISTER:
+   if (!strcmp(dev->name, priv->tginfo->oif))
+   priv->oif = dev->ifindex;
+   break;
+   case NETDEV_UNREGISTER:
+   if (dev->ifindex == priv->oif)
+   priv->oif = -1;
+   break;
+   case NETDEV_CHANGENAME:
+   if (!strcmp(dev->name, priv->tginfo->oif))
+   priv->oif = dev->ifindex;
+   else if (dev->ifindex == priv->oif)
+   priv->oif = -1;
+   break;
+   }
}
+   mutex_unlock(_list_mutex);
 
return NOTIFY_DONE;
 }
@@ -89,8 +95,6 @@ static int tee_tg_check(const struct xt_tgchk_param *par)
return -EINVAL;
 
if (info->oif[0]) {
-   int ret;
-
if (info->oif[sizeof(info->oif)-1] != '\0')
return -EINVAL;
 
@@ -100,14 +104,11 @@ static int tee_tg_check(const struct xt_tgchk_param *par)
 
priv->tginfo  = info;
priv->oif = -1;
-   priv->notifier.notifier_call = tee_netdev_event;
info->priv= priv;
 
-   ret = register_netdevice_notifier(>notifier);
-   if (ret) {
-   kfree(priv);
-   return ret;
-   }
+   mutex_lock(_list_mutex);
+   list_add(>list, _list);
+   mutex_unlock(_list_mutex);
} else
info->priv = NULL;
 
@@ -120,7 +121,9 @@ static void tee_tg_destroy(const struct xt_tgdtor_param 
*par)
struct xt_tee_tginfo *info = par->targinfo;
 
if (info->priv) {
-   unregister_netdevice_notifier(>priv->notifier);
+   mutex_lock(_list_mutex);
+   list_del(>priv->list);
+   mutex_unlock(_list_mutex);
kfree(info->priv);
}
static_key_slow_dec(_tee_enabled);
@@ -153,13 +156,29 @@ static struct xt_target tee_tg_reg[] __read_mostly = {
 #endif
 };
 
+static struct notifier_block tee_netdev_notifier = {
+   .notifier_call = tee_netdev_event,
+};
+
 static int __init tee_tg_init(void)
 {
-   return xt_register_targets(tee_tg_reg, ARRAY_SIZE(tee_tg_reg));
+   int ret;
+
+   ret = xt_register_targets(tee_tg_reg, ARRAY_SIZE(tee_tg_reg));
+   if (ret)
+   return ret;
+   ret = register_netdevice_notifier(_netdev_notifier);
+   if (ret) {
+   xt_unregister_targets(tee_tg_reg, ARRAY_SIZE(tee_tg_reg));
+

[PATCH net-next 1/3] xfrm: Register xfrm_dev_notifier in appropriate place

2018-03-29 Thread Kirill Tkhai

Currently, driver registers it from pernet_operations::init method,
and this breaks modularity, because initialization of net namespace
and netdevice notifiers are orthogonal actions. We don't have
per-namespace netdevice notifiers; all of them are global for all
devices in all namespaces.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 include/net/xfrm.h |2 +-
 net/xfrm/xfrm_device.c |2 +-
 net/xfrm/xfrm_policy.c |3 +--
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index aa027ba1d032..a872379b69da 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1894,7 +1894,7 @@ static inline struct xfrm_offload *xfrm_offload(struct 
sk_buff *skb)
 #endif
 }
 
-void __net_init xfrm_dev_init(void);
+void __init xfrm_dev_init(void);
 
 #ifdef CONFIG_XFRM_OFFLOAD
 void xfrm_dev_resume(struct sk_buff *skb);
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index e87d6c4dd5b6..175941e15a6e 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -350,7 +350,7 @@ static struct notifier_block xfrm_dev_notifier = {
.notifier_call  = xfrm_dev_event,
 };
 
-void __net_init xfrm_dev_init(void)
+void __init xfrm_dev_init(void)
 {
register_netdevice_notifier(_dev_notifier);
 }
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 625b3fca5704..f29c8d588116 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2895,8 +2895,6 @@ static int __net_init xfrm_policy_init(struct net *net)
INIT_LIST_HEAD(>xfrm.policy_all);
INIT_WORK(>xfrm.policy_hash_work, xfrm_hash_resize);
INIT_WORK(>xfrm.policy_hthresh.work, xfrm_hash_rebuild);
-   if (net_eq(net, _net))
-   xfrm_dev_init();
return 0;
 
 out_bydst:
@@ -2999,6 +2997,7 @@ void __init xfrm_init(void)
INIT_WORK(_pcpu_work[i], xfrm_pcpu_work_fn);
 
register_pernet_subsys(_net_ops);
+   xfrm_dev_init();
seqcount_init(_policy_hash_generation);
xfrm_input_init();
 }

[PATCH net-next 0/3] Close race between {un, }register_netdevice_notifier and pernet_operations

2018-03-29 Thread Kirill Tkhai

Hi,

the problem is {,un}register_netdevice_notifier() do not take
pernet_ops_rwsem, and they don't see network namespaces, being
initialized in setup_net() and cleanup_net(), since at this
time net is not hashed to net_namespace_list.

This may lead to imbalance, when a notifier is called at time of
setup_net()/net is alive, but it's not called at time of cleanup_net(),
for the devices, hashed to the net, and vise versa. See (3/3) for
the scheme of imbalance.

This patchset fixes the problem by acquiring pernet_ops_rwsem
at the time of {,un}register_netdevice_notifier() (3/3).
(1-2/3) are preparations in xfrm and netfilter subsystems.

The problem was introduced a long ago, but backporting won't be easy,
since every previous kernel version may have changes in netdevice
notifiers, and they all need review and testing. Otherwise, there
may be more pernet_operations, which register or unregister
netdevice notifiers, and that leads to deadlock (which is was fixed
in 1-2/3). This patchset is for net-next.

Thanks,
Kirill
---

Kirill Tkhai (3):
  xfrm: Register xfrm_dev_notifier in appropriate place
  netfilter: Rework xt_TEE netdevice notifier
  net: Close race between {un,}register_netdevice_notifier() and 
setup_net()/cleanup_net()


 include/net/xfrm.h |2 +
 net/core/dev.c |6 
 net/netfilter/xt_TEE.c |   73 ++--
 net/xfrm/xfrm_device.c |2 +
 net/xfrm/xfrm_policy.c |3 +-
 5 files changed, 55 insertions(+), 31 deletions(-)

--
Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>

Re: [PATCH V5 net-next 06/14] net/tls: Add generic NIC offload infrastructure

2018-03-28 Thread Kirill Tkhai

On 28.03.2018 02:56, Saeed Mahameed wrote:
> From: Ilya Lesokhin 
> 
> This patch adds a generic infrastructure to offload TLS crypto to a
> network device. It enables the kernel TLS socket to skip encryption
> and authentication operations on the transmit side of the data path.
> Leaving those computationally expensive operations to the NIC.
> 
> The NIC offload infrastructure builds TLS records and pushes them to
> the TCP layer just like the SW KTLS implementation and using the same API.
> TCP segmentation is mostly unaffected. Currently the only exception is
> that we prevent mixed SKBs where only part of the payload requires
> offload. In the future we are likely to add a similar restriction
> following a change cipher spec record.
> 
> The notable differences between SW KTLS and NIC offloaded TLS
> implementations are as follows:
> 1. The offloaded implementation builds "plaintext TLS record", those
> records contain plaintext instead of ciphertext and place holder bytes
> instead of authentication tags.
> 2. The offloaded implementation maintains a mapping from TCP sequence
> number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
> TLS socket, we can use the tls NIC offload infrastructure to obtain
> enough context to encrypt the payload of the SKB.
> A TLS record is released when the last byte of the record is ack'ed,
> this is done through the new icsk_clean_acked callback.
> 
> The infrastructure should be extendable to support various NIC offload
> implementations.  However it is currently written with the
> implementation below in mind:
> The NIC assumes that packets from each offloaded stream are sent as
> plaintext and in-order. It keeps track of the TLS records in the TCP
> stream. When a packet marked for offload is transmitted, the NIC
> encrypts the payload in-place and puts authentication tags in the
> relevant place holders.
> 
> The responsibility for handling out-of-order packets (i.e. TCP
> retransmission, qdisc drops) falls on the netdev driver.
> 
> The netdev driver keeps track of the expected TCP SN from the NIC's
> perspective.  If the next packet to transmit matches the expected TCP
> SN, the driver advances the expected TCP SN, and transmits the packet
> with TLS offload indication.
> 
> If the next packet to transmit does not match the expected TCP SN. The
> driver calls the TLS layer to obtain the TLS record that includes the
> TCP of the packet for transmission. Using this TLS record, the driver
> posts a work entry on the transmit queue to reconstruct the NIC TLS
> state required for the offload of the out-of-order packet. It updates
> the expected TCP SN accordingly and transmits the now in-order packet.
> The same queue is used for packet transmission and TLS context
> reconstruction to avoid the need for flushing the transmit queue before
> issuing the context reconstruction request.
> 
> Signed-off-by: Ilya Lesokhin 
> Signed-off-by: Boris Pismenny 
> Signed-off-by: Aviad Yehezkel 
> Signed-off-by: Saeed Mahameed 
> ---
>  include/net/tls.h | 120 +--
>  net/tls/Kconfig   |  10 +
>  net/tls/Makefile  |   2 +
>  net/tls/tls_device.c  | 759 
> ++
>  net/tls/tls_device_fallback.c | 454 +
>  net/tls/tls_main.c| 120 ---
>  net/tls/tls_sw.c  | 132 
>  7 files changed, 1476 insertions(+), 121 deletions(-)
>  create mode 100644 net/tls/tls_device.c
>  create mode 100644 net/tls/tls_device_fallback.c
> 
> diff --git a/include/net/tls.h b/include/net/tls.h
> index 437a746300bf..0a8529e9ec21 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
> @@ -57,21 +57,10 @@
>  
>  #define TLS_AAD_SPACE_SIZE   13
>  
> -struct tls_sw_context {
> +struct tls_sw_context_tx {

What the reason splitting this into tx + rx does not go in separate patch?

>   struct crypto_aead *aead_send;
> - struct crypto_aead *aead_recv;
>   struct crypto_wait async_wait;
>  
> - /* Receive context */
> - struct strparser strp;
> - void (*saved_data_ready)(struct sock *sk);
> - unsigned int (*sk_poll)(struct file *file, struct socket *sock,
> - struct poll_table_struct *wait);
> - struct sk_buff *recv_pkt;
> - u8 control;
> - bool decrypted;
> -
> - /* Sending context */
>   char aad_space[TLS_AAD_SPACE_SIZE];
>  
>   unsigned int sg_plaintext_size;
> @@ -88,6 +77,50 @@ struct tls_sw_context {
>   struct scatterlist sg_aead_out[2];
>  };
>  
> +struct tls_sw_context_rx {
> + struct crypto_aead *aead_recv;
> + struct crypto_wait async_wait;
> +
> + struct strparser strp;
> + void (*saved_data_ready)(struct sock *sk);
> + unsigned int (*sk_poll)(struct file *file, struct socket *sock,
> + struct

Re: [PATCH net-next 0/5] Make pernet_operations always read locked

2018-03-27 Thread Kirill Tkhai

On 27.03.2018 20:18, David Miller wrote:
> From: Kirill Tkhai <ktk...@virtuozzo.com>
> Date: Tue, 27 Mar 2018 18:01:42 +0300
> 
>> All the pernet_operations are converted, and the last one
>> is in this patchset (nfsd_net_ops acked by J. Bruce Fields).
>> So, it's the time to kill pernet_operations::async field,
>> and make setup_net() and cleanup_net() always require
>> the rwsem only read locked.
>>
>> All further pernet_operations have to be developed to fit
>> this rule. Some of previous patches added a comment to
>> struct pernet_operations about that.
>>
>> Also, this patchset renames net_sem to pernet_ops_rwsem
>> to make the target area of the rwsem is more clear visible,
>> and adds more comments.
> 
> Amazing work, series applied, thanks!

Thanks a lot, David!

Kirill

[PATCH net-next 5/5] net: Add more comments

2018-03-27 Thread Kirill Tkhai

This adds comments to different places to improve
readability.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 include/net/net_namespace.h |4 
 net/core/net_namespace.c|2 ++
 net/core/rtnetlink.c|2 +-
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 922e8b6fb422..1ab4f920f109 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -323,6 +323,10 @@ struct pernet_operations {
 * have to keep in mind all other pernet_operations and
 * to introduce a locking, if they share common resources.
 *
+* The only time they are called with exclusive lock is
+* from register_pernet_subsys(), unregister_pernet_subsys()
+* register_pernet_device() and unregister_pernet_device().
+*
 * Exit methods using blocking RCU primitives, such as
 * synchronize_rcu(), should be implemented via exit_batch.
 * Then, destruction of a group of net requires single
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 9e8ee4640451..b5796d17a302 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -43,6 +43,8 @@ static bool init_net_initialized;
 /*
  * pernet_ops_rwsem: protects: pernet_list, net_generic_ids,
  * init_net_initialized and first_device pointer.
+ * This is internal net namespace object. Please, don't use it
+ * outside.
  */
 DECLARE_RWSEM(pernet_ops_rwsem);
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 73011a60434c..2d3949789cef 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -459,7 +459,7 @@ static void rtnl_lock_unregistering_all(void)
  */
 void rtnl_link_unregister(struct rtnl_link_ops *ops)
 {
-   /* Close the race with cleanup_net() */
+   /* Close the race with setup_net() and cleanup_net() */
down_write(_ops_rwsem);
rtnl_lock_unregistering_all();
__rtnl_link_unregister(ops);

[PATCH net-next 4/5] net: Rename net_sem to pernet_ops_rwsem

2018-03-27 Thread Kirill Tkhai

net_sem is some undefined area name, so it will be better
to make the area more defined.

Rename it to pernet_ops_rwsem for better readability and
better intelligibility.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 include/linux/rtnetlink.h   |2 +-
 include/net/net_namespace.h |   12 +++-
 net/core/net_namespace.c|   40 
 net/core/rtnetlink.c|4 ++--
 4 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 562a175c35a9..c7d1e4689325 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -36,7 +36,7 @@ extern int rtnl_is_locked(void);
 extern int rtnl_lock_killable(void);
 
 extern wait_queue_head_t netdev_unregistering_wq;
-extern struct rw_semaphore net_sem;
+extern struct rw_semaphore pernet_ops_rwsem;
 
 #ifdef CONFIG_PROVE_LOCKING
 extern bool lockdep_rtnl_is_held(void);
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 37bcf8382b61..922e8b6fb422 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -60,9 +60,10 @@ struct net {
 
struct list_headlist;   /* list of network namespaces */
struct list_headexit_list;  /* To linked to call pernet exit
-* methods on dead net (net_sem
-* read locked), or to 
unregister
-* pernet ops (net_sem wr 
locked).
+* methods on dead net (
+* pernet_ops_rwsem read 
locked),
+* or to unregister pernet ops
+* (pernet_ops_rwsem write 
locked).
 */
struct llist_node   cleanup_list;   /* namespaces on death row */
 
@@ -95,8 +96,9 @@ struct net {
/* core fib_rules */
struct list_headrules_ops;
 
-   struct list_headfib_notifier_ops;  /* protected by net_sem */
-
+   struct list_headfib_notifier_ops;  /* Populated by
+   * register_pernet_subsys()
+   */
struct net_device   *loopback_dev;  /* The loopback */
struct netns_core   core;
struct netns_mibmib;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index eef17ad29dea..9e8ee4640451 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -41,10 +41,10 @@ EXPORT_SYMBOL(init_net);
 
 static bool init_net_initialized;
 /*
- * net_sem: protects: pernet_list, net_generic_ids,
+ * pernet_ops_rwsem: protects: pernet_list, net_generic_ids,
  * init_net_initialized and first_device pointer.
  */
-DECLARE_RWSEM(net_sem);
+DECLARE_RWSEM(pernet_ops_rwsem);
 
 #define MIN_PERNET_OPS_ID  \
((sizeof(struct net_generic) + sizeof(void *) - 1) / sizeof(void *))
@@ -72,7 +72,7 @@ static int net_assign_generic(struct net *net, unsigned int 
id, void *data)
BUG_ON(id < MIN_PERNET_OPS_ID);
 
old_ng = rcu_dereference_protected(net->gen,
-  lockdep_is_held(_sem));
+  lockdep_is_held(_ops_rwsem));
if (old_ng->s.len > id) {
old_ng->ptr[id] = data;
return 0;
@@ -289,7 +289,7 @@ struct net *get_net_ns_by_id(struct net *net, int id)
  */
 static __net_init int setup_net(struct net *net, struct user_namespace 
*user_ns)
 {
-   /* Must be called with net_sem held */
+   /* Must be called with pernet_ops_rwsem held */
const struct pernet_operations *ops, *saved_ops;
int error = 0;
LIST_HEAD(net_exit_list);
@@ -422,13 +422,13 @@ struct net *copy_net_ns(unsigned long flags,
net->ucounts = ucounts;
get_user_ns(user_ns);
 
-   rv = down_read_killable(_sem);
+   rv = down_read_killable(_ops_rwsem);
if (rv < 0)
goto put_userns;
 
rv = setup_net(net, user_ns);
 
-   up_read(_sem);
+   up_read(_ops_rwsem);
 
if (rv < 0) {
 put_userns:
@@ -480,7 +480,7 @@ static void cleanup_net(struct work_struct *work)
/* Atomically snapshot the list of namespaces to cleanup */
net_kill_list = llist_del_all(_list);
 
-   down_read(_sem);
+   down_read(_ops_rwsem);
 
/* Don't let anyone else find us. */
rtnl_lock();
@@ -519,7 +519,7 @@ static void cleanup_net(struct work_struct *work)
list_for_each_entry_reverse(ops, _list, list)
ops_free_list(ops, _exit_list);
 
-   up_read(_sem);
+   up_read(_ops_rwsem);
 
/* Ensure there are no outstanding rcu callbacks

[PATCH net-next 3/5] net: Drop pernet_operations::async

2018-03-27 Thread Kirill Tkhai

Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 drivers/infiniband/core/cma.c  |1 -
 drivers/net/bonding/bond_main.c|1 -
 drivers/net/geneve.c   |1 -
 drivers/net/gtp.c  |1 -
 drivers/net/ipvlan/ipvlan_main.c   |1 -
 drivers/net/loopback.c |1 -
 drivers/net/ppp/ppp_generic.c  |1 -
 drivers/net/ppp/pppoe.c|1 -
 drivers/net/vrf.c  |1 -
 drivers/net/vxlan.c|1 -
 drivers/net/wireless/mac80211_hwsim.c  |1 -
 fs/lockd/svc.c |1 -
 fs/nfs/blocklayout/rpc_pipefs.c|1 -
 fs/nfs/dns_resolve.c   |1 -
 fs/nfs/inode.c |1 -
 fs/nfs_common/grace.c  |1 -
 fs/nfsd/nfsctl.c   |1 -
 fs/proc/proc_net.c |1 -
 include/net/net_namespace.h|6 --
 kernel/audit.c |1 -
 lib/kobject_uevent.c   |1 -
 net/8021q/vlan.c   |1 -
 net/bridge/br.c|1 -
 net/bridge/br_netfilter_hooks.c|1 -
 net/bridge/netfilter/ebtable_broute.c  |1 -
 net/bridge/netfilter/ebtable_filter.c  |1 -
 net/bridge/netfilter/ebtable_nat.c |1 -
 net/bridge/netfilter/nf_log_bridge.c   |1 -
 net/caif/caif_dev.c|1 -
 net/can/af_can.c   |1 -
 net/can/bcm.c  |1 -
 net/can/gw.c   |1 -
 net/core/dev.c |2 --
 net/core/fib_notifier.c|1 -
 net/core/fib_rules.c   |1 -
 net/core/net-procfs.c  |2 --
 net/core/net_namespace.c   |2 --
 net/core/pktgen.c  |1 -
 net/core/rtnetlink.c   |1 -
 net/core/sock.c|2 --
 net/core/sock_diag.c   |1 -
 net/core/sysctl_net_core.c |1 -
 net/dccp/ipv4.c|1 -
 net/dccp/ipv6.c|1 -
 net/ieee802154/6lowpan/reassembly.c|1 -
 net/ieee802154/core.c  |1 -
 net/ipv4/af_inet.c |2 --
 net/ipv4/arp.c |1 -
 net/ipv4/devinet.c |1 -
 net/ipv4/fib_frontend.c|1 -
 net/ipv4/fou.c |1 -
 net/ipv4/icmp.c|1 -
 net/ipv4/igmp.c|1 -
 net/ipv4/ip_fragment.c |1 -
 net/ipv4/ip_gre.c  |3 ---
 net/ipv4/ip_vti.c  |1 -
 net/ipv4/ipip.c|1 -
 net/ipv4/ipmr.c|1 -
 net/ipv4/netfilter/arp_tables.c|1 -
 net/ipv4/netfilter/arptable_filter.c   |1 -
 net/ipv4/netfilter/ip_tables.c |1 -
 net/ipv4/netfilter/ipt_CLUSTERIP.c |1 -
 net/ipv4/netfilter/iptable_filter.c|1 -
 net/ipv4/netfilter/iptable_mangle.c|1 -
 net/ipv4/netfilter/iptable_nat.c   |1 -
 net/ipv4/netfilter/iptable_raw.c   |1 -
 net/ipv4/netfilter/iptable_security.c  |1 -
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |1 -
 net/ipv4/netfilter/nf_defrag_ipv4.c|1 -
 net/ipv4/netfilter/nf_log_arp.c|1 -
 net/ipv4/netfilter/nf_log_ipv4.c   |1 -
 net/ipv4/ping.c|1 -
 net/ipv4/proc.c|1 -
 net/ipv4/raw.c |1 -
 net/ipv4/route.c   |4 
 net/ipv4/sysctl_net_ipv4.c |1 -
 net/ipv4/tcp_ipv4.c|2 --
 net/ipv4/tcp_metrics.c |1 -
 net/ipv4/udp.c |2 --
 net/ipv4/udplite.c |1 -
 net/ipv4/xfrm4_policy.c|1 -
 net/ipv6/addrconf.c|2 --
 net/ipv6/addrlabel.c   |1 -
 net/ipv6/af_inet6.c|1 -
 net/ipv6/fib6_r

[PATCH net-next 2/5] net: Reflect all pernet_operations are converted

2018-03-27 Thread Kirill Tkhai

All pernet_operations are reviewed and converted, hooray!
Reflect this in core code: setup_net() and cleanup_net()
will take down_read() always.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 net/core/net_namespace.c |   43 ++-
 1 file changed, 6 insertions(+), 37 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 95ba2c53bd9a..0f614523a13f 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -40,9 +40,8 @@ struct net init_net = {
 EXPORT_SYMBOL(init_net);
 
 static bool init_net_initialized;
-static unsigned nr_sync_pernet_ops;
 /*
- * net_sem: protects: pernet_list, net_generic_ids, nr_sync_pernet_ops,
+ * net_sem: protects: pernet_list, net_generic_ids,
  * init_net_initialized and first_device pointer.
  */
 DECLARE_RWSEM(net_sem);
@@ -406,7 +405,6 @@ struct net *copy_net_ns(unsigned long flags,
 {
struct ucounts *ucounts;
struct net *net;
-   unsigned write;
int rv;
 
if (!(flags & CLONE_NEWNET))
@@ -424,25 +422,14 @@ struct net *copy_net_ns(unsigned long flags,
refcount_set(>passive, 1);
net->ucounts = ucounts;
get_user_ns(user_ns);
-again:
-   write = READ_ONCE(nr_sync_pernet_ops);
-   if (write)
-   rv = down_write_killable(_sem);
-   else
-   rv = down_read_killable(_sem);
+
+   rv = down_read_killable(_sem);
if (rv < 0)
goto put_userns;
 
-   if (!write && unlikely(READ_ONCE(nr_sync_pernet_ops))) {
-   up_read(_sem);
-   goto again;
-   }
rv = setup_net(net, user_ns);
 
-   if (write)
-   up_write(_sem);
-   else
-   up_read(_sem);
+   up_read(_sem);
 
if (rv < 0) {
 put_userns:
@@ -490,21 +477,11 @@ static void cleanup_net(struct work_struct *work)
struct net *net, *tmp, *last;
struct llist_node *net_kill_list;
LIST_HEAD(net_exit_list);
-   unsigned write;
 
/* Atomically snapshot the list of namespaces to cleanup */
net_kill_list = llist_del_all(_list);
-again:
-   write = READ_ONCE(nr_sync_pernet_ops);
-   if (write)
-   down_write(_sem);
-   else
-   down_read(_sem);
 
-   if (!write && unlikely(READ_ONCE(nr_sync_pernet_ops))) {
-   up_read(_sem);
-   goto again;
-   }
+   down_read(_sem);
 
/* Don't let anyone else find us. */
rtnl_lock();
@@ -543,10 +520,7 @@ static void cleanup_net(struct work_struct *work)
list_for_each_entry_reverse(ops, _list, list)
ops_free_list(ops, _exit_list);
 
-   if (write)
-   up_write(_sem);
-   else
-   up_read(_sem);
+   up_read(_sem);
 
/* Ensure there are no outstanding rcu callbacks using this
 * network namespace.
@@ -1006,9 +980,6 @@ static int register_pernet_operations(struct list_head 
*list,
rcu_barrier();
if (ops->id)
ida_remove(_generic_ids, *ops->id);
-   } else if (!ops->async) {
-   pr_info_once("Pernet operations %ps are sync.\n", ops);
-   nr_sync_pernet_ops++;
}
 
return error;
@@ -1016,8 +987,6 @@ static int register_pernet_operations(struct list_head 
*list,
 
 static void unregister_pernet_operations(struct pernet_operations *ops)
 {
-   if (!ops->async)
-   BUG_ON(nr_sync_pernet_ops-- == 0);
__unregister_pernet_operations(ops);
rcu_barrier();
if (ops->id)

[PATCH net-next 1/5] net: Convert nfsd_net_ops

2018-03-27 Thread Kirill Tkhai

These pernet_operations look similar to rpcsec_gss_net_ops,
they just create and destroy another caches. So, they also
can be async.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
Acked-by: J. Bruce Fields <bfie...@redhat.com>
---
 fs/nfsd/nfsctl.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nfsd/nfsctl.c b/fs/nfsd/nfsctl.c
index d107b4426f7e..1e3824e6cce0 100644
--- a/fs/nfsd/nfsctl.c
+++ b/fs/nfsd/nfsctl.c
@@ -1263,6 +1263,7 @@ static struct pernet_operations nfsd_net_ops = {
.exit = nfsd_exit_net,
.id   = _net_id,
.size = sizeof(struct nfsd_net),
+   .async = true,
 };
 
 static int __init init_nfsd(void)

[PATCH net-next 0/5] Make pernet_operations always read locked

2018-03-27 Thread Kirill Tkhai

All the pernet_operations are converted, and the last one
is in this patchset (nfsd_net_ops acked by J. Bruce Fields).
So, it's the time to kill pernet_operations::async field,
and make setup_net() and cleanup_net() always require
the rwsem only read locked.

All further pernet_operations have to be developed to fit
this rule. Some of previous patches added a comment to
struct pernet_operations about that.

Also, this patchset renames net_sem to pernet_ops_rwsem
to make the target area of the rwsem is more clear visible,
and adds more comments.

Thanks,
Kirill
---

Kirill Tkhai (5):
  net: Convert nfsd_net_ops
  net: Reflect all pernet_operations are converted
  net: Drop pernet_operations::async
  net: Rename net_sem to pernet_ops_rwsem
  net: Add more comments


 drivers/infiniband/core/cma.c  |1 
 drivers/net/bonding/bond_main.c|1 
 drivers/net/geneve.c   |1 
 drivers/net/gtp.c  |1 
 drivers/net/ipvlan/ipvlan_main.c   |1 
 drivers/net/loopback.c |1 
 drivers/net/ppp/ppp_generic.c  |1 
 drivers/net/ppp/pppoe.c|1 
 drivers/net/vrf.c  |1 
 drivers/net/vxlan.c|1 
 drivers/net/wireless/mac80211_hwsim.c  |1 
 fs/lockd/svc.c |1 
 fs/nfs/blocklayout/rpc_pipefs.c|1 
 fs/nfs/dns_resolve.c   |1 
 fs/nfs/inode.c |1 
 fs/nfs_common/grace.c  |1 
 fs/proc/proc_net.c |1 
 include/linux/rtnetlink.h  |2 -
 include/net/net_namespace.h|   22 +++
 kernel/audit.c |1 
 lib/kobject_uevent.c   |1 
 net/8021q/vlan.c   |1 
 net/bridge/br.c|1 
 net/bridge/br_netfilter_hooks.c|1 
 net/bridge/netfilter/ebtable_broute.c  |1 
 net/bridge/netfilter/ebtable_filter.c  |1 
 net/bridge/netfilter/ebtable_nat.c |1 
 net/bridge/netfilter/nf_log_bridge.c   |1 
 net/caif/caif_dev.c|1 
 net/can/af_can.c   |1 
 net/can/bcm.c  |1 
 net/can/gw.c   |1 
 net/core/dev.c |2 -
 net/core/fib_notifier.c|1 
 net/core/fib_rules.c   |1 
 net/core/net-procfs.c  |2 -
 net/core/net_namespace.c   |   77 +++-
 net/core/pktgen.c  |1 
 net/core/rtnetlink.c   |7 +-
 net/core/sock.c|2 -
 net/core/sock_diag.c   |1 
 net/core/sysctl_net_core.c |1 
 net/dccp/ipv4.c|1 
 net/dccp/ipv6.c|1 
 net/ieee802154/6lowpan/reassembly.c|1 
 net/ieee802154/core.c  |1 
 net/ipv4/af_inet.c |2 -
 net/ipv4/arp.c |1 
 net/ipv4/devinet.c |1 
 net/ipv4/fib_frontend.c|1 
 net/ipv4/fou.c |1 
 net/ipv4/icmp.c|1 
 net/ipv4/igmp.c|1 
 net/ipv4/ip_fragment.c |1 
 net/ipv4/ip_gre.c  |3 -
 net/ipv4/ip_vti.c  |1 
 net/ipv4/ipip.c|1 
 net/ipv4/ipmr.c|1 
 net/ipv4/netfilter/arp_tables.c|1 
 net/ipv4/netfilter/arptable_filter.c   |1 
 net/ipv4/netfilter/ip_tables.c |1 
 net/ipv4/netfilter/ipt_CLUSTERIP.c |1 
 net/ipv4/netfilter/iptable_filter.c|1 
 net/ipv4/netfilter/iptable_mangle.c|1 
 net/ipv4/netfilter/iptable_nat.c   |1 
 net/ipv4/netfilter/iptable_raw.c   |1 
 net/ipv4/netfilter/iptable_security.c  |1 
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |1 
 net/ipv4/netfilter/nf_defrag_ipv4.c|1 
 net/ipv4/netfilter/nf_log_arp.c|1 
 net/ipv4/netfilter/nf_log_ipv4.c   |1 
 net/ipv4/ping.c|1 
 net/ipv4/proc.c|1 
 net/ipv4/raw.c |1 
 net/ipv4/route.c

Re: [PATCH net-next] net: Export net->ipv6.sysctl.ip_nonlocal_bind to /proc

2018-03-27 Thread Kirill Tkhai

Please, ignore this.

Thanks,
Kirill

On 27.03.2018 14:24, Kirill Tkhai wrote:
> Currenly, this parameter can be configured via sysctl
> only. But sysctl is considered as depricated interface
> (man 2 sysctl), and it only can applied to current's net
> namespace (this requires to do setns() to change it in
> not current's net ns).
> 
> So, let's export the parameter to /proc in standard way,
> and this allows to access another process namespace
> via /proc/[pid]/net/ip6_nonlocal_bind.
> 
> Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
> ---
>  net/ipv6/proc.c |   48 
>  1 file changed, 48 insertions(+)
> 
> diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
> index 6e57028d2e91..2d0aa59c2d0d 100644
> --- a/net/ipv6/proc.c
> +++ b/net/ipv6/proc.c
> @@ -312,6 +312,47 @@ int snmp6_unregister_dev(struct inet6_dev *idev)
>   return 0;
>  }
>  
> +static int nonlocal_bind_show(struct seq_file *seq, void *v)
> +{
> + struct net *net = seq->private;
> +
> + seq_printf(seq, "%d\n", !!net->ipv6.sysctl.ip_nonlocal_bind);
> + return 0;
> +}
> +
> +static int open_nonlocal_bind(struct inode *inode, struct file *file)
> +{
> + return single_open_net(inode, file, nonlocal_bind_show);
> +}
> +
> +static ssize_t write_nonlocal_bind(struct file *file, const char __user 
> *ubuf,
> +size_t count, loff_t *ppos)
> +{
> + struct net *net = ((struct seq_file *)file->private_data)->private;
> + char buf[3];
> +
> + if (*ppos || count <= 0 || count > sizeof(buf))
> + return -EINVAL;
> +
> + if (copy_from_user(buf, ubuf, count))
> + return -EFAULT;
> + buf[0] -= '0';
> + if ((count == 3 && buf[2] != '\0') ||
> + (count >= 2 && buf[1] != '\n') ||
> + (buf[0] != 0 && buf[0] != 1))
> + return -EINVAL;
> +
> + net->ipv6.sysctl.ip_nonlocal_bind = buf[0];
> + return count;
> +}
> +
> +static const struct file_operations nonlocal_bind_ops = {
> + .open   = open_nonlocal_bind,
> + .read   = seq_read,
> + .write  = write_nonlocal_bind,
> + .release = single_release_net,
> +};
> +
>  static int __net_init ipv6_proc_init_net(struct net *net)
>  {
>   if (!proc_create("sockstat6", 0444, net->proc_net,
> @@ -321,12 +362,18 @@ static int __net_init ipv6_proc_init_net(struct net 
> *net)
>   if (!proc_create("snmp6", 0444, net->proc_net, _seq_fops))
>   goto proc_snmp6_fail;
>  
> +if (!proc_create_data("ip6_nonlocal_bind", 0644,
> +net->proc_net, _bind_ops, net))
> + goto proc_bind_fail;
> +
>   net->mib.proc_net_devsnmp6 = proc_mkdir("dev_snmp6", net->proc_net);
>   if (!net->mib.proc_net_devsnmp6)
>   goto proc_dev_snmp6_fail;
>   return 0;
>  
>  proc_dev_snmp6_fail:
> + remove_proc_entry("ip6_nonlocal_bind", net->proc_net);
> +proc_bind_fail:
>   remove_proc_entry("snmp6", net->proc_net);
>  proc_snmp6_fail:
>   remove_proc_entry("sockstat6", net->proc_net);
> @@ -337,6 +384,7 @@ static void __net_exit ipv6_proc_exit_net(struct net *net)
>  {
>   remove_proc_entry("sockstat6", net->proc_net);
>   remove_proc_entry("dev_snmp6", net->proc_net);
> + remove_proc_entry("ip6_nonlocal_bind", net->proc_net);
>   remove_proc_entry("snmp6", net->proc_net);
>  }
>  
>

[PATCH net-next] net: Export net->ipv6.sysctl.ip_nonlocal_bind to /proc

2018-03-27 Thread Kirill Tkhai

Currenly, this parameter can be configured via sysctl
only. But sysctl is considered as depricated interface
(man 2 sysctl), and it only can applied to current's net
namespace (this requires to do setns() to change it in
not current's net ns).

So, let's export the parameter to /proc in standard way,
and this allows to access another process namespace
via /proc/[pid]/net/ip6_nonlocal_bind.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 net/ipv6/proc.c |   48 
 1 file changed, 48 insertions(+)

diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index 6e57028d2e91..2d0aa59c2d0d 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -312,6 +312,47 @@ int snmp6_unregister_dev(struct inet6_dev *idev)
return 0;
 }
 
+static int nonlocal_bind_show(struct seq_file *seq, void *v)
+{
+   struct net *net = seq->private;
+
+   seq_printf(seq, "%d\n", !!net->ipv6.sysctl.ip_nonlocal_bind);
+   return 0;
+}
+
+static int open_nonlocal_bind(struct inode *inode, struct file *file)
+{
+   return single_open_net(inode, file, nonlocal_bind_show);
+}
+
+static ssize_t write_nonlocal_bind(struct file *file, const char __user *ubuf,
+  size_t count, loff_t *ppos)
+{
+   struct net *net = ((struct seq_file *)file->private_data)->private;
+   char buf[3];
+
+   if (*ppos || count <= 0 || count > sizeof(buf))
+   return -EINVAL;
+
+   if (copy_from_user(buf, ubuf, count))
+   return -EFAULT;
+   buf[0] -= '0';
+   if ((count == 3 && buf[2] != '\0') ||
+   (count >= 2 && buf[1] != '\n') ||
+   (buf[0] != 0 && buf[0] != 1))
+   return -EINVAL;
+
+   net->ipv6.sysctl.ip_nonlocal_bind = buf[0];
+   return count;
+}
+
+static const struct file_operations nonlocal_bind_ops = {
+   .open   = open_nonlocal_bind,
+   .read   = seq_read,
+   .write  = write_nonlocal_bind,
+   .release = single_release_net,
+};
+
 static int __net_init ipv6_proc_init_net(struct net *net)
 {
if (!proc_create("sockstat6", 0444, net->proc_net,
@@ -321,12 +362,18 @@ static int __net_init ipv6_proc_init_net(struct net *net)
if (!proc_create("snmp6", 0444, net->proc_net, _seq_fops))
goto proc_snmp6_fail;
 
+if (!proc_create_data("ip6_nonlocal_bind", 0644,
+  net->proc_net, _bind_ops, net))
+   goto proc_bind_fail;
+
net->mib.proc_net_devsnmp6 = proc_mkdir("dev_snmp6", net->proc_net);
if (!net->mib.proc_net_devsnmp6)
goto proc_dev_snmp6_fail;
return 0;
 
 proc_dev_snmp6_fail:
+   remove_proc_entry("ip6_nonlocal_bind", net->proc_net);
+proc_bind_fail:
remove_proc_entry("snmp6", net->proc_net);
 proc_snmp6_fail:
remove_proc_entry("sockstat6", net->proc_net);
@@ -337,6 +384,7 @@ static void __net_exit ipv6_proc_exit_net(struct net *net)
 {
remove_proc_entry("sockstat6", net->proc_net);
remove_proc_entry("dev_snmp6", net->proc_net);
+   remove_proc_entry("ip6_nonlocal_bind", net->proc_net);
remove_proc_entry("snmp6", net->proc_net);
 }

Re: [PATCH net-next nfs 1/6] net: Convert rpcsec_gss_net_ops

2018-03-27 Thread Kirill Tkhai

On 26.03.2018 21:36, J. Bruce Fields wrote:
> On Fri, Mar 23, 2018 at 02:53:34PM -0400, Anna Schumaker wrote:
>>
>>
>> On 03/13/2018 06:49 AM, Kirill Tkhai wrote:
>>> These pernet_operations initialize and destroy sunrpc_net_id refered
>>> per-net items. Only used global list is cache_list, and accesses
>>> already serialized.
>>>
>>> sunrpc_destroy_cache_detail() check for list_empty() without
>>> cache_list_lock, but when it's called from
>>> unregister_pernet_subsys(), there can't be callers in parallel, so
>>> we won't miss list_empty() in this case.
>>>
>>> Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
>>
>> It might make sense to take these and the other NFS patches through
>> the net tree, since the pernet_operations don't yet have the async
>> field in my tree (and I therefore can't compile once these are
>> applied).
> 
> Ditto for the nfsd patch, so, for what it's worth:
> 
>   Acked-by: J. Bruce Fields <bfie...@redhat.com>
> 
> for that patch.--b.

Thanks, Bruce.

Kirill

Re: [PATCH net-next nfs 1/6] net: Convert rpcsec_gss_net_ops

2018-03-26 Thread Kirill Tkhai

On 23.03.2018 21:53, Anna Schumaker wrote:
> 
> 
> On 03/13/2018 06:49 AM, Kirill Tkhai wrote:
>> These pernet_operations initialize and destroy sunrpc_net_id
>> refered per-net items. Only used global list is cache_list,
>> and accesses already serialized.
>>
>> sunrpc_destroy_cache_detail() check for list_empty() without
>> cache_list_lock, but when it's called from unregister_pernet_subsys(),
>> there can't be callers in parallel, so we won't miss list_empty()
>> in this case.
>>
>> Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
> 
> It might make sense to take these and the other NFS patches through the net 
> tree, since the pernet_operations don't yet have the async field in my tree 
> (and I therefore can't compile once these are applied).
> 
> Acked-by: Anna Schumaker <anna.schuma...@netapp.com>

Thanks a lot, Anna!

Kirill

>> ---
>>  net/sunrpc/auth_gss/auth_gss.c |1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
>> index 9463af4b32e8..44f939cb6bc8 100644
>> --- a/net/sunrpc/auth_gss/auth_gss.c
>> +++ b/net/sunrpc/auth_gss/auth_gss.c
>> @@ -2063,6 +2063,7 @@ static __net_exit void rpcsec_gss_exit_net(struct net 
>> *net)
>>  static struct pernet_operations rpcsec_gss_net_ops = {
>>  .init = rpcsec_gss_init_net,
>>  .exit = rpcsec_gss_exit_net,
>> +.async = true,
>>  };
>>  
>>  /*
>>

[PATCH net-next v1 3/4] net: Convert nfs4_dns_resolver_ops

2018-03-26 Thread Kirill Tkhai

These pernet_operations look similar to rpcsec_gss_net_ops,
they just create and destroy another cache. Also they create
and destroy directory. So, they also look safe to be async.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
Acked-by: Anna Schumaker <anna.schuma...@netapp.com>
---
 fs/nfs/dns_resolve.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nfs/dns_resolve.c b/fs/nfs/dns_resolve.c
index 060c658eab66..e90bd69ab653 100644
--- a/fs/nfs/dns_resolve.c
+++ b/fs/nfs/dns_resolve.c
@@ -410,6 +410,7 @@ static void nfs4_dns_net_exit(struct net *net)
 static struct pernet_operations nfs4_dns_resolver_ops = {
.init = nfs4_dns_net_init,
.exit = nfs4_dns_net_exit,
+   .async = true,
 };
 
 static int rpc_pipefs_event(struct notifier_block *nb, unsigned long event,

[PATCH net-next v1 4/4] net: Convert nfs4blocklayout_net_ops

2018-03-26 Thread Kirill Tkhai

These pernet_operations create and destroy per-net pipe
and dentry, and they seem safe to be marked as async.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
Acked-by: Anna Schumaker <anna.schuma...@netapp.com>
---
 fs/nfs/blocklayout/rpc_pipefs.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nfs/blocklayout/rpc_pipefs.c b/fs/nfs/blocklayout/rpc_pipefs.c
index 9fb067a6f7e0..ef9fa111b009 100644
--- a/fs/nfs/blocklayout/rpc_pipefs.c
+++ b/fs/nfs/blocklayout/rpc_pipefs.c
@@ -261,6 +261,7 @@ static void nfs4blocklayout_net_exit(struct net *net)
 static struct pernet_operations nfs4blocklayout_net_ops = {
.init = nfs4blocklayout_net_init,
.exit = nfs4blocklayout_net_exit,
+   .async = true,
 };
 
 int __init bl_init_pipefs(void)

[PATCH net-next v1 2/4] net: Convert sunrpc_net_ops

2018-03-26 Thread Kirill Tkhai

These pernet_operations look similar to rpcsec_gss_net_ops,
they just create and destroy another caches. So, they also
can be async.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
Acked-by: Anna Schumaker <anna.schuma...@netapp.com>
---
 net/sunrpc/sunrpc_syms.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sunrpc/sunrpc_syms.c b/net/sunrpc/sunrpc_syms.c
index 56f9eff74150..68287e921847 100644
--- a/net/sunrpc/sunrpc_syms.c
+++ b/net/sunrpc/sunrpc_syms.c
@@ -79,6 +79,7 @@ static struct pernet_operations sunrpc_net_ops = {
.exit = sunrpc_exit_net,
.id = _net_id,
.size = sizeof(struct sunrpc_net),
+   .async = true,
 };
 
 static int __init

[PATCH net-next v1 0/4] Converting pernet_operations (part #7.1)

2018-03-26 Thread Kirill Tkhai

Hi,

this is a resending of the 4 patches from path #7.

Anna kindly reviewed them and suggested to take the patches
through net tree, since there is pernet_operations::async only
in net-next.git.

There is Anna's acks on every header, the rest of patch
has no changes.

Thanks,
Kirill
---

Kirill Tkhai (4):
  net: Convert rpcsec_gss_net_ops
  net: Convert sunrpc_net_ops
  net: Convert nfs4_dns_resolver_ops
  net: Convert nfs4blocklayout_net_ops


 fs/nfs/blocklayout/rpc_pipefs.c |1 +
 fs/nfs/dns_resolve.c|1 +
 net/sunrpc/auth_gss/auth_gss.c  |1 +
 net/sunrpc/sunrpc_syms.c|1 +
 4 files changed, 4 insertions(+)

--
Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
Acked-by: Anna Schumaker <anna.schuma...@netapp.com>

[PATCH net-next v1 1/4] net: Convert rpcsec_gss_net_ops

2018-03-26 Thread Kirill Tkhai

These pernet_operations initialize and destroy sunrpc_net_id
refered per-net items. Only used global list is cache_list,
and accesses already serialized.

sunrpc_destroy_cache_detail() check for list_empty() without
cache_list_lock, but when it's called from unregister_pernet_subsys(),
there can't be callers in parallel, so we won't miss list_empty()
in this case.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
Acked-by: Anna Schumaker <anna.schuma...@netapp.com>
---
 net/sunrpc/auth_gss/auth_gss.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sunrpc/auth_gss/auth_gss.c b/net/sunrpc/auth_gss/auth_gss.c
index 9463af4b32e8..44f939cb6bc8 100644
--- a/net/sunrpc/auth_gss/auth_gss.c
+++ b/net/sunrpc/auth_gss/auth_gss.c
@@ -2063,6 +2063,7 @@ static __net_exit void rpcsec_gss_exit_net(struct net 
*net)
 static struct pernet_operations rpcsec_gss_net_ops = {
.init = rpcsec_gss_init_net,
.exit = rpcsec_gss_exit_net,
+   .async = true,
 };
 
 /*

[PATCH net-next v2 2/3] infiniband: Replace usnic_ib_netdev_event_to_string() with netdev_cmd_to_name()

2018-03-23 Thread Kirill Tkhai

This function just calls netdev_cmd_to_name().

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 drivers/infiniband/hw/usnic/usnic_ib_main.c |   15 +--
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/usnic/usnic_ib_main.c 
b/drivers/infiniband/hw/usnic/usnic_ib_main.c
index 5bf3b20eba25..ca5638091b55 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_main.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_main.c
@@ -95,11 +95,6 @@ void usnic_ib_log_vf(struct usnic_ib_vf *vf)
 }
 
 /* Start of netdev section */
-static inline const char *usnic_ib_netdev_event_to_string(unsigned long event)
-{
-   return netdev_cmd_to_name(event);
-}
-
 static void usnic_ib_qp_grp_modify_active_to_err(struct usnic_ib_dev *us_ibdev)
 {
struct usnic_ib_ucontext *ctx;
@@ -172,7 +167,7 @@ static void usnic_ib_handle_usdev_event(struct usnic_ib_dev 
*us_ibdev,
ib_dispatch_event(_event);
} else {
usnic_dbg("Ignoring %s on %s\n",
-   usnic_ib_netdev_event_to_string(event),
+   netdev_cmd_to_name(event),
us_ibdev->ib_dev.name);
}
break;
@@ -209,7 +204,7 @@ static void usnic_ib_handle_usdev_event(struct usnic_ib_dev 
*us_ibdev,
break;
default:
usnic_dbg("Ignoring event %s on %s",
-   usnic_ib_netdev_event_to_string(event),
+   netdev_cmd_to_name(event),
us_ibdev->ib_dev.name);
}
mutex_unlock(_ibdev->usdev_lock);
@@ -251,7 +246,7 @@ static int usnic_ib_handle_inet_event(struct usnic_ib_dev 
*us_ibdev,
switch (event) {
case NETDEV_DOWN:
usnic_info("%s via ip notifiers",
-   usnic_ib_netdev_event_to_string(event));
+   netdev_cmd_to_name(event));
usnic_fwd_del_ipaddr(us_ibdev->ufdev);
usnic_ib_qp_grp_modify_active_to_err(us_ibdev);
ib_event.event = IB_EVENT_GID_CHANGE;
@@ -262,7 +257,7 @@ static int usnic_ib_handle_inet_event(struct usnic_ib_dev 
*us_ibdev,
case NETDEV_UP:
usnic_fwd_add_ipaddr(us_ibdev->ufdev, ifa->ifa_address);
usnic_info("%s via ip notifiers: ip %pI4",
-   usnic_ib_netdev_event_to_string(event),
+   netdev_cmd_to_name(event),
_ibdev->ufdev->inaddr);
ib_event.event = IB_EVENT_GID_CHANGE;
ib_event.device = _ibdev->ib_dev;
@@ -271,7 +266,7 @@ static int usnic_ib_handle_inet_event(struct usnic_ib_dev 
*us_ibdev,
break;
default:
usnic_info("Ignoring event %s on %s",
-   usnic_ib_netdev_event_to_string(event),
+   netdev_cmd_to_name(event),
us_ibdev->ib_dev.name);
}
mutex_unlock(_ibdev->usdev_lock);

[PATCH net-next v2 3/3] net: Drop NETDEV_UNREGISTER_FINAL

2018-03-23 Thread Kirill Tkhai

Last user is gone after bdf5bd7f2132 "rds: tcp: remove
register_netdevice_notifier infrastructure.", so we can
remove this netdevice command. This allows to delete
rtnl_lock() in netdev_run_todo(), which is hot path for
net namespace unregistration.

dev_change_net_namespace() and netdev_wait_allrefs()
have rcu_barrier() before NETDEV_UNREGISTER_FINAL call,
and the source commits say they were introduced to
delemit the call with NETDEV_UNREGISTER, but this patch
leaves them on the places, since they require additional
analysis, whether we need in them for something else.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 drivers/infiniband/hw/qedr/main.c |4 ++--
 include/linux/netdevice.h |1 -
 include/rdma/ib_verbs.h   |4 ++--
 net/core/dev.c|7 ---
 4 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/qedr/main.c 
b/drivers/infiniband/hw/qedr/main.c
index db4bf97c0e15..eb32abb0099a 100644
--- a/drivers/infiniband/hw/qedr/main.c
+++ b/drivers/infiniband/hw/qedr/main.c
@@ -90,8 +90,8 @@ static struct net_device *qedr_get_netdev(struct ib_device 
*dev, u8 port_num)
dev_hold(qdev->ndev);
 
/* The HW vendor's device driver must guarantee
-* that this function returns NULL before the net device reaches
-* NETDEV_UNREGISTER_FINAL state.
+* that this function returns NULL before the net device has finished
+* NETDEV_UNREGISTER state.
 */
return qdev->ndev;
 }
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index dd5a04c971d5..2a2d9cf50aa2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2336,7 +2336,6 @@ enum netdev_cmd {
NETDEV_PRE_TYPE_CHANGE,
NETDEV_POST_TYPE_CHANGE,
NETDEV_POST_INIT,
-   NETDEV_UNREGISTER_FINAL,
NETDEV_RELEASE,
NETDEV_NOTIFY_PEERS,
NETDEV_JOIN,
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 73b2387e3f74..a1c5e8320f86 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2126,8 +2126,8 @@ struct ib_device {
 * net device of device @device at port @port_num or NULL if such
 * a net device doesn't exist. The vendor driver should call dev_hold
 * on this net device. The HW vendor's device driver must guarantee
-* that this function returns NULL before the net device reaches
-* NETDEV_UNREGISTER_FINAL state.
+* that this function returns NULL before the net device has finished
+* NETDEV_UNREGISTER state.
 */
struct net_device *(*get_netdev)(struct ib_device *device,
 u8 port_num);
diff --git a/net/core/dev.c b/net/core/dev.c
index eb8ff44f46d6..cb4d2ceb36fb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1584,7 +1584,6 @@ const char *netdev_cmd_to_name(enum netdev_cmd cmd)
N(RESEND_IGMP) N(PRECHANGEMTU) N(CHANGEINFODATA) N(BONDING_INFO)
N(PRECHANGEUPPER) N(CHANGELOWERSTATE) N(UDP_TUNNEL_PUSH_INFO)
N(UDP_TUNNEL_DROP_INFO) N(CHANGE_TX_QUEUE_LEN)
-   N(UNREGISTER_FINAL)
};
 #undef N
return "UNKNOWN_NETDEV_EVENT";
@@ -8089,7 +8088,6 @@ static void netdev_wait_allrefs(struct net_device *dev)
rcu_barrier();
rtnl_lock();
 
-   call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);
if (test_bit(__LINK_STATE_LINKWATCH_PENDING,
 >state)) {
/* We must not have linkwatch events
@@ -8161,10 +8159,6 @@ void netdev_run_todo(void)
= list_first_entry(, struct net_device, todo_list);
list_del(>todo_list);
 
-   rtnl_lock();
-   call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);
-   __rtnl_unlock();
-
if (unlikely(dev->reg_state != NETREG_UNREGISTERING)) {
pr_err("network todo '%s' but state %d\n",
   dev->name, dev->reg_state);
@@ -8606,7 +8600,6 @@ int dev_change_net_namespace(struct net_device *dev, 
struct net *net, const char
 */
call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
rcu_barrier();
-   call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);
 
new_nsid = peernet2id_alloc(dev_net(dev), net);
/* If there is an ifindex conflict assign a new one */

[PATCH net-next v2 0/3] Drop NETDEV_UNREGISTER_FINAL (was unnamed)

2018-03-23 Thread Kirill Tkhai

This series drops unused NETDEV_UNREGISTER_FINAL
after some preparations.

v2: New patch [2/3]. Use switch() in [1/3].

The first version was acked by Jason Gunthorpe,
and [1/3] was acked by David Ahern.

Since there are differences to v1, I haven't added
Acked-by tags of people. It would be nice, if you
fill OK to tag v2 too.

Thanks,
Kirill
---

Kirill Tkhai (3):
  net: Make NETDEV_XXX commands enum { }
  infiniband: Replace usnic_ib_netdev_event_to_string() with 
netdev_cmd_to_name()
  net: Drop NETDEV_UNREGISTER_FINAL


 drivers/infiniband/hw/qedr/main.c   |4 +-
 drivers/infiniband/hw/usnic/usnic_ib_main.c |   28 ++-
 include/linux/netdevice.h   |   68 ++-
 include/rdma/ib_verbs.h |4 +-
 net/core/dev.c  |   25 --
 5 files changed, 63 insertions(+), 66 deletions(-)

--
Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>

[PATCH net-next 2/2] net: Drop NETDEV_UNREGISTER_FINAL

2018-03-23 Thread Kirill Tkhai

Last user is gone after bdf5bd7f2132 "rds: tcp: remove
register_netdevice_notifier infrastructure.", so we can
remove this netdevice command. This allows to delete
rtnl_lock() in netdev_run_todo(), which is hot path for
net namespace unregistration.

dev_change_net_namespace() and netdev_wait_allrefs()
have rcu_barrier() before NETDEV_UNREGISTER_FINAL call,
and the source commits say they were introduced to
delemit the call with NETDEV_UNREGISTER, but this patch
leaves them on the places, since they require additional
analysis, whether we need in them for something else.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 drivers/infiniband/hw/qedr/main.c |4 ++--
 include/linux/netdevice.h |1 -
 include/rdma/ib_verbs.h   |4 ++--
 net/core/dev.c|6 --
 4 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/qedr/main.c 
b/drivers/infiniband/hw/qedr/main.c
index db4bf97c0e15..eb32abb0099a 100644
--- a/drivers/infiniband/hw/qedr/main.c
+++ b/drivers/infiniband/hw/qedr/main.c
@@ -90,8 +90,8 @@ static struct net_device *qedr_get_netdev(struct ib_device 
*dev, u8 port_num)
dev_hold(qdev->ndev);
 
/* The HW vendor's device driver must guarantee
-* that this function returns NULL before the net device reaches
-* NETDEV_UNREGISTER_FINAL state.
+* that this function returns NULL before the net device has finished
+* NETDEV_UNREGISTER state.
 */
return qdev->ndev;
 }
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index dd5a04c971d5..2a2d9cf50aa2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2336,7 +2336,6 @@ enum netdev_cmd {
NETDEV_PRE_TYPE_CHANGE,
NETDEV_POST_TYPE_CHANGE,
NETDEV_POST_INIT,
-   NETDEV_UNREGISTER_FINAL,
NETDEV_RELEASE,
NETDEV_NOTIFY_PEERS,
NETDEV_JOIN,
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 73b2387e3f74..a1c5e8320f86 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2126,8 +2126,8 @@ struct ib_device {
 * net device of device @device at port @port_num or NULL if such
 * a net device doesn't exist. The vendor driver should call dev_hold
 * on this net device. The HW vendor's device driver must guarantee
-* that this function returns NULL before the net device reaches
-* NETDEV_UNREGISTER_FINAL state.
+* that this function returns NULL before the net device has finished
+* NETDEV_UNREGISTER state.
 */
struct net_device *(*get_netdev)(struct ib_device *device,
 u8 port_num);
diff --git a/net/core/dev.c b/net/core/dev.c
index bcc9fe68240b..c444da3e4b82 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8086,7 +8086,6 @@ static void netdev_wait_allrefs(struct net_device *dev)
rcu_barrier();
rtnl_lock();
 
-   call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);
if (test_bit(__LINK_STATE_LINKWATCH_PENDING,
 >state)) {
/* We must not have linkwatch events
@@ -8158,10 +8157,6 @@ void netdev_run_todo(void)
= list_first_entry(, struct net_device, todo_list);
list_del(>todo_list);
 
-   rtnl_lock();
-   call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);
-   __rtnl_unlock();
-
if (unlikely(dev->reg_state != NETREG_UNREGISTERING)) {
pr_err("network todo '%s' but state %d\n",
   dev->name, dev->reg_state);
@@ -8603,7 +8598,6 @@ int dev_change_net_namespace(struct net_device *dev, 
struct net *net, const char
 */
call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
rcu_barrier();
-   call_netdevice_notifiers(NETDEV_UNREGISTER_FINAL, dev);
 
new_nsid = peernet2id_alloc(dev_net(dev), net);
/* If there is an ifindex conflict assign a new one */

[PATCH net-next 0/2] Converting pernet_operations (part #11)

2018-03-22 Thread Kirill Tkhai

Hi,

this series continues to review and to convert pernet_operations
to make them possible to be executed in parallel for several
net namespaces at the same time.

I thought last series was last, but there is one
new pernet_operations came to kernel. This is
udp_sysctl_ops, and here we convert it.

Also, David Howells acked rxrpc_net_ops, so I resend
the patch in case of it should be queued by patchwork:

https://www.spinics.net/lists/netdev/msg490678.html

Thanks,
Kirill
---

Kirill Tkhai (2):
  net: Convert udp_sysctl_ops
  net: Convert rxrpc_net_ops


 net/ipv4/udp.c |3 ++-
 net/rxrpc/net_ns.c |1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

--
Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>

[PATCH net-next 2/2] net: Convert rxrpc_net_ops

2018-03-22 Thread Kirill Tkhai

These pernet_operations modifies rxrpc_net_id-pointed
per-net entities. There is external link to AF_RXRPC
in fs/afs/Kconfig, but it seems there is no other
pernet_operations interested in that per-net entities.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
Acked-by: David Howells <dhowe...@redhat.com>
---
 net/rxrpc/net_ns.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/rxrpc/net_ns.c b/net/rxrpc/net_ns.c
index f18c9248e0d4..5fd939dabf41 100644
--- a/net/rxrpc/net_ns.c
+++ b/net/rxrpc/net_ns.c
@@ -106,4 +106,5 @@ struct pernet_operations rxrpc_net_ops = {
.exit   = rxrpc_exit_net,
.id = _net_id,
.size   = sizeof(struct rxrpc_net),
+   .async  = true,
 };

[PATCH net-next 1/2] net: Convert udp_sysctl_ops

2018-03-22 Thread Kirill Tkhai

These pernet_operations just initialize udp4 defaults.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 net/ipv4/udp.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 908fc02fb4f8..c6dc019bc64b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2842,7 +2842,8 @@ static int __net_init udp_sysctl_init(struct net *net)
 }
 
 static struct pernet_operations __net_initdata udp_sysctl_ops = {
-   .init   = udp_sysctl_init,
+   .init   = udp_sysctl_init,
+   .async  = true,
 };
 
 void __init udp_init(void)

Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure

2018-03-22 Thread Kirill Tkhai

On 22.03.2018 15:38, Boris Pismenny wrote:
> ...

 Can't we move this check in tls_dev_event() and use it for all types of 
 events?
 Then we avoid duplicate code.

>>>
>>> No. Not all events require this check. Also, the result is different for 
>>> different events.
>>
>> No. You always return NOTIFY_DONE, in case of !(netdev->features & 
>> NETIF_F_HW_TLS_TX).
>> See below:
>>
>> static int tls_check_dev_ops(struct net_device *dev)
>> {
>> if (!dev->tlsdev_ops)
>>     return NOTIFY_BAD;
>>
>> return NOTIFY_DONE;
>> }
>>
>> static int tls_device_down(struct net_device *netdev)
>> {
>> struct tls_context *ctx, *tmp;
>> struct list_head list;
>> unsigned long flags;
>>
>> ...
>> return NOTIFY_DONE;
>> }
>>
>> static int tls_dev_event(struct notifier_block *this, unsigned long event,
>>   void *ptr)
>> {
>>  struct net_device *dev = netdev_notifier_info_to_dev(ptr);
>>
>> if (!(netdev->features & NETIF_F_HW_TLS_TX))
>>     return NOTIFY_DONE;
>>    switch (event) {
>>  case NETDEV_REGISTER:
>>  case NETDEV_FEAT_CHANGE:
>>  return tls_check_dev_ops(dev);
>>    case NETDEV_DOWN:
>>  return tls_device_down(dev);
>>  }
>>  return NOTIFY_DONE;
>> }
>>  
> 
> Sure, will fix in V3.
> 
> +
> +    /* Request a write lock to block new offload attempts
> + */
> +    percpu_down_write(_offload_lock);

 What is the reason percpu_rwsem is chosen here? It looks like this 
 primitive
 gives more advantages readers, then plain rwsem does. But it also gives
 disadvantages to writers. It would be good, unless tls_device_down() is 
 called
 with rtnl_lock() held from netdevice notifier. But since netdevice notifier
 are called with rtnl_lock() held, percpu_rwsem will increase the time 
 rtnl_lock()
 is locked.
>>> We use the a rwsem to allow multiple (readers) invocations of 
>>> tls_set_device_offload, which is triggered by the user (persumably) during 
>>> the TLS handshake. This might be considered a fast-path.
>>>
>>> However, we must block all calls to tls_set_device_offload while we are 
>>> processing NETDEV_DOWN events (writer).
>>>
>>> As you've mentioned, the percpu rwsem is more efficient for readers, 
>>> especially on NUMA systems, where cache-line bouncing occurs during reader 
>>> acquire and reduces performance.
>>
>> Hm, and who are the readers? It's used from do_tls_setsockopt_tx(), but it 
>> doesn't
>> seem to be performance critical. Who else?
>>
> 
> It depends on whether you consider the TLS handshake code as critical.
> The readers are TCP connections processing the CCS message of the TLS 
> handshake. They are providing key material to the kernel to start using 
> Kernel TLS.

The thing is rtnl_lock() is critical for the rest of the system,
while TLS handshake is small subset of actions the system makes.

rtnl_lock() is used just almost everywhere, from netlink messages
to netdev ioctls.

Currently, you even just can't close raw socket without rtnl lock.
So, all of this is big reason to avoid doing rcu waitings under it.

Kirill


 Can't we use plain rwsem here instead?

>>>
>>> Its a performance tradeoff. I'm not certain that the percpu rwsem write 
>>> side acquire is significantly worse than using the global rwsem.
>>>
>>> For now, while all of this is experimental, can we agree to focus on the 
>>> performance of readers? We can change it later if it becomes a problem.
>>
>> Same as above.
>>   
> 
> Replaced with rwsem from V2.

Re: [PATCH net-next v3 0/5] Rework ip_ra_chain protection

2018-03-22 Thread Kirill Tkhai

On 22.03.2018 12:44, Kirill Tkhai wrote:
> Commit 1215e51edad1 "ipv4: fix a deadlock in ip_ra_control"
> made rtnl_lock() be used in raw_close(). This function is called
> on every RAW socket destruction, so that rtnl_mutex is taken
> every time. This scales very sadly. I observe cleanup_net()
> spending a lot of time in rtnl_lock() and raw_close() is one
> of the biggest rtnl user (since we have percpu net->ipv4.icmp_sk).
> 
> This patchset reworks the locking: reverts the problem commit
> and its descendant, and introduces rtnl-independent locking.
> This may have a continuation, and someone may work on killing
> rtnl_lock() in mrtsock_destruct() in the future.
> 
> Thanks,
> Kirill
> 
> ---
> v3: Change patches order: [2/5] and [3/5].
> v2: Fix sparse warning [4/5], as reported by kbuild test robot.
> 
> ---
> 
> Kirill Tkhai (5):
>   net: Revert "ipv4: get rid of ip_ra_lock"
>   net: Move IP_ROUTER_ALERT out of lock_sock(sk)
>   net: Revert "ipv4: fix a deadlock in ip_ra_control"
>   net: Make ip_ra_chain per struct net
>   net: Replace ip_ra_lock with per-net mutex
> 
> 
>  include/net/ip.h |   13 +++--
>  include/net/netns/ipv4.h |2 ++
>  net/core/net_namespace.c |1 +
>  net/ipv4/ip_input.c  |5 ++---
>  net/ipv4/ip_sockglue.c   |   34 +-
>  net/ipv4/ipmr.c  |   11 +++++--
>  net/ipv4/raw.c   |2 --
>  7 files changed, 38 insertions(+), 30 deletions(-)
> 
> --
> Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>

JFI: I used the below program to test:

#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 

int main()
{
int sk, v, i = 0;

if (unshare(CLONE_NEWNET)) {
perror("unshare");
return 1;
}
sk = socket(AF_INET, SOCK_RAW, IPPROTO_IGMP);
if (sk < 0) {
perror("socket");
return 1;
}
for (i = 0; i < 3; i++)
fork();

while (1) {
setsockopt(sk, IPPROTO_IP, MRT_INIT, (void *), sizeof(v));
setsockopt(sk, IPPROTO_IP, MRT_DONE, (void *), sizeof(v));
v = (i++)%2;
setsockopt(sk, IPPROTO_IP, IP_ROUTER_ALERT, (void *), 
sizeof(v));
}

return 0;
}

[PATCH net-next v3 4/5] net: Make ip_ra_chain per struct net

2018-03-22 Thread Kirill Tkhai

This is optimization, which makes ip_call_ra_chain()
iterate less sockets to find the sockets it's looking for.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 include/net/ip.h |   13 +++--
 include/net/netns/ipv4.h |1 +
 net/ipv4/ip_input.c  |5 ++---
 net/ipv4/ip_sockglue.c   |   15 ++-
 4 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index fe63ba95d12b..d53b5a9eae34 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -91,6 +91,17 @@ static inline int inet_sdif(struct sk_buff *skb)
return 0;
 }
 
+/* Special input handler for packets caught by router alert option.
+   They are selected only by protocol field, and then processed likely
+   local ones; but only if someone wants them! Otherwise, router
+   not running rsvpd will kill RSVP.
+
+   It is user level problem, what it will make with them.
+   I have no idea, how it will masquearde or NAT them (it is joke, joke :-)),
+   but receiver should be enough clever f.e. to forward mtrace requests,
+   sent to multicast group to reach destination designated router.
+ */
+
 struct ip_ra_chain {
struct ip_ra_chain __rcu *next;
struct sock *sk;
@@ -101,8 +112,6 @@ struct ip_ra_chain {
struct rcu_head rcu;
 };
 
-extern struct ip_ra_chain __rcu *ip_ra_chain;
-
 /* IP flags. */
 #define IP_CE  0x8000  /* Flag: "Congestion"   */
 #define IP_DF  0x4000  /* Flag: "Don't Fragment"   */
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 382bfd7583cf..97d7ee6667c7 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -49,6 +49,7 @@ struct netns_ipv4 {
 #endif
struct ipv4_devconf *devconf_all;
struct ipv4_devconf *devconf_dflt;
+   struct ip_ra_chain __rcu *ra_chain;
 #ifdef CONFIG_IP_MULTIPLE_TABLES
struct fib_rules_ops*rules_ops;
boolfib_has_custom_rules;
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 57fc13c6ab2b..7582713dd18f 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -159,7 +159,7 @@ bool ip_call_ra_chain(struct sk_buff *skb)
struct net_device *dev = skb->dev;
struct net *net = dev_net(dev);
 
-   for (ra = rcu_dereference(ip_ra_chain); ra; ra = 
rcu_dereference(ra->next)) {
+   for (ra = rcu_dereference(net->ipv4.ra_chain); ra; ra = 
rcu_dereference(ra->next)) {
struct sock *sk = ra->sk;
 
/* If socket is bound to an interface, only report
@@ -167,8 +167,7 @@ bool ip_call_ra_chain(struct sk_buff *skb)
 */
if (sk && inet_sk(sk)->inet_num == protocol &&
(!sk->sk_bound_dev_if ||
-sk->sk_bound_dev_if == dev->ifindex) &&
-   net_eq(sock_net(sk), net)) {
+sk->sk_bound_dev_if == dev->ifindex)) {
if (ip_is_fragment(ip_hdr(skb))) {
if (ip_defrag(net, skb, 
IP_DEFRAG_CALL_RA_CHAIN))
return true;
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index bf5f44b27b7e..f36d35fe924b 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -322,18 +322,6 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, 
struct ipcm_cookie *ipc,
return 0;
 }
 
-
-/* Special input handler for packets caught by router alert option.
-   They are selected only by protocol field, and then processed likely
-   local ones; but only if someone wants them! Otherwise, router
-   not running rsvpd will kill RSVP.
-
-   It is user level problem, what it will make with them.
-   I have no idea, how it will masquearde or NAT them (it is joke, joke :-)),
-   but receiver should be enough clever f.e. to forward mtrace requests,
-   sent to multicast group to reach destination designated router.
- */
-struct ip_ra_chain __rcu *ip_ra_chain;
 static DEFINE_SPINLOCK(ip_ra_lock);
 
 
@@ -350,6 +338,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 {
struct ip_ra_chain *ra, *new_ra;
struct ip_ra_chain __rcu **rap;
+   struct net *net = sock_net(sk);
 
if (sk->sk_type != SOCK_RAW || inet_sk(sk)->inet_num == IPPROTO_RAW)
return -EINVAL;
@@ -357,7 +346,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
new_ra = on ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
 
spin_lock_bh(_ra_lock);
-   for (rap = _ra_chain;
+   for (rap = >ipv4.ra_chain;
 (ra = rcu_dereference_protected(*rap,
lockdep_is_held(_ra_lock))) != NULL;
 rap = >next) {

[PATCH net-next v3 5/5] net: Replace ip_ra_lock with per-net mutex

2018-03-22 Thread Kirill Tkhai

Since ra_chain is per-net, we may use per-net mutexes
to protect them in ip_ra_control(). This improves
scalability.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 include/net/netns/ipv4.h |1 +
 net/core/net_namespace.c |1 +
 net/ipv4/ip_sockglue.c   |   15 ++-
 3 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 97d7ee6667c7..8491bc9c86b1 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -50,6 +50,7 @@ struct netns_ipv4 {
struct ipv4_devconf *devconf_all;
struct ipv4_devconf *devconf_dflt;
struct ip_ra_chain __rcu *ra_chain;
+   struct mutexra_mutex;
 #ifdef CONFIG_IP_MULTIPLE_TABLES
struct fib_rules_ops*rules_ops;
boolfib_has_custom_rules;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index c340d5cfbdec..95ba2c53bd9a 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -301,6 +301,7 @@ static __net_init int setup_net(struct net *net, struct 
user_namespace *user_ns)
net->user_ns = user_ns;
idr_init(>netns_ids);
spin_lock_init(>nsid_lock);
+   mutex_init(>ipv4.ra_mutex);
 
list_for_each_entry(ops, _list, list) {
error = ops_init(ops, net);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index f36d35fe924b..5ad2d8ed3a3f 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -322,9 +322,6 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, 
struct ipcm_cookie *ipc,
return 0;
 }
 
-static DEFINE_SPINLOCK(ip_ra_lock);
-
-
 static void ip_ra_destroy_rcu(struct rcu_head *head)
 {
struct ip_ra_chain *ra = container_of(head, struct ip_ra_chain, rcu);
@@ -345,21 +342,21 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 
new_ra = on ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
 
-   spin_lock_bh(_ra_lock);
+   mutex_lock(>ipv4.ra_mutex);
for (rap = >ipv4.ra_chain;
 (ra = rcu_dereference_protected(*rap,
-   lockdep_is_held(_ra_lock))) != NULL;
+   lockdep_is_held(>ipv4.ra_mutex))) != NULL;
 rap = >next) {
if (ra->sk == sk) {
if (on) {
-   spin_unlock_bh(_ra_lock);
+   mutex_unlock(>ipv4.ra_mutex);
kfree(new_ra);
return -EADDRINUSE;
}
/* dont let ip_call_ra_chain() use sk again */
ra->sk = NULL;
RCU_INIT_POINTER(*rap, ra->next);
-   spin_unlock_bh(_ra_lock);
+   mutex_unlock(>ipv4.ra_mutex);
 
if (ra->destructor)
ra->destructor(sk);
@@ -374,7 +371,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
}
}
if (!new_ra) {
-   spin_unlock_bh(_ra_lock);
+   mutex_unlock(>ipv4.ra_mutex);
return -ENOBUFS;
}
new_ra->sk = sk;
@@ -383,7 +380,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
RCU_INIT_POINTER(new_ra->next, ra);
rcu_assign_pointer(*rap, new_ra);
sock_hold(sk);
-   spin_unlock_bh(_ra_lock);
+   mutex_unlock(>ipv4.ra_mutex);
 
return 0;
 }

[PATCH net-next v3 3/5] net: Revert "ipv4: fix a deadlock in ip_ra_control"

2018-03-22 Thread Kirill Tkhai

This reverts commit 1215e51edad1.
Since raw_close() is used on every RAW socket destruction,
the changes made by 1215e51edad1 scale sadly. This clearly
seen on endless unshare(CLONE_NEWNET) test, and cleanup_net()
kwork spends a lot of time waiting for rtnl_lock() introduced
by this commit.

Previous patch moved IP_ROUTER_ALERT out of rtnl_lock(),
so we revert this patch.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 net/ipv4/ip_sockglue.c |1 -
 net/ipv4/ipmr.c|   11 +--
 net/ipv4/raw.c |2 --
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index dcbf6afe27e7..bf5f44b27b7e 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -594,7 +594,6 @@ static bool setsockopt_needs_rtnl(int optname)
case MCAST_LEAVE_GROUP:
case MCAST_LEAVE_SOURCE_GROUP:
case MCAST_UNBLOCK_SOURCE:
-   case IP_ROUTER_ALERT:
return true;
}
return false;
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index d752a70855d8..f6be5db16da2 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1399,7 +1399,7 @@ static void mrtsock_destruct(struct sock *sk)
struct net *net = sock_net(sk);
struct mr_table *mrt;
 
-   ASSERT_RTNL();
+   rtnl_lock();
ipmr_for_each_table(mrt, net) {
if (sk == rtnl_dereference(mrt->mroute_sk)) {
IPV4_DEVCONF_ALL(net, MC_FORWARDING)--;
@@ -1411,6 +1411,7 @@ static void mrtsock_destruct(struct sock *sk)
mroute_clean_tables(mrt, false);
}
}
+   rtnl_unlock();
 }
 
 /* Socket options and virtual interface manipulation. The whole
@@ -1475,8 +1476,13 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, 
char __user *optval,
if (sk != rcu_access_pointer(mrt->mroute_sk)) {
ret = -EACCES;
} else {
+   /* We need to unlock here because mrtsock_destruct takes
+* care of rtnl itself and we can't change that due to
+* the IP_ROUTER_ALERT setsockopt which runs without it.
+*/
+   rtnl_unlock();
ret = ip_ra_control(sk, 0, NULL);
-   goto out_unlock;
+   goto out;
}
break;
case MRT_ADD_VIF:
@@ -1588,6 +1594,7 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, 
char __user *optval,
}
 out_unlock:
rtnl_unlock();
+out:
return ret;
 }
 
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 54648d20bf0f..720bef7da2f6 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -711,9 +711,7 @@ static void raw_close(struct sock *sk, long timeout)
/*
 * Raw sockets may have direct kernel references. Kill them.
 */
-   rtnl_lock();
ip_ra_control(sk, 0, NULL);
-   rtnl_unlock();
 
sk_common_release(sk);
 }

[PATCH net-next v3 0/5] Rework ip_ra_chain protection

2018-03-22 Thread Kirill Tkhai

Commit 1215e51edad1 "ipv4: fix a deadlock in ip_ra_control"
made rtnl_lock() be used in raw_close(). This function is called
on every RAW socket destruction, so that rtnl_mutex is taken
every time. This scales very sadly. I observe cleanup_net()
spending a lot of time in rtnl_lock() and raw_close() is one
of the biggest rtnl user (since we have percpu net->ipv4.icmp_sk).

This patchset reworks the locking: reverts the problem commit
and its descendant, and introduces rtnl-independent locking.
This may have a continuation, and someone may work on killing
rtnl_lock() in mrtsock_destruct() in the future.

Thanks,
Kirill

---
v3: Change patches order: [2/5] and [3/5].
v2: Fix sparse warning [4/5], as reported by kbuild test robot.

---

Kirill Tkhai (5):
  net: Revert "ipv4: get rid of ip_ra_lock"
  net: Move IP_ROUTER_ALERT out of lock_sock(sk)
  net: Revert "ipv4: fix a deadlock in ip_ra_control"
  net: Make ip_ra_chain per struct net
  net: Replace ip_ra_lock with per-net mutex


 include/net/ip.h |   13 +++--
 include/net/netns/ipv4.h |2 ++
 net/core/net_namespace.c |1 +
 net/ipv4/ip_input.c  |5 ++---
 net/ipv4/ip_sockglue.c   |   34 +-
 net/ipv4/ipmr.c  |   11 +--
 net/ipv4/raw.c   |2 --
 7 files changed, 38 insertions(+), 30 deletions(-)

--
Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>

[PATCH net-next v3 2/5] net: Move IP_ROUTER_ALERT out of lock_sock(sk)

2018-03-22 Thread Kirill Tkhai

ip_ra_control() does not need sk_lock. Who are the another
users of ip_ra_chain? ip_mroute_setsockopt() doesn't take
sk_lock, while parallel IP_ROUTER_ALERT syscalls are
synchronized by ip_ra_lock. So, we may move this command
out of sk_lock.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 net/ipv4/ip_sockglue.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index be7c3b71914d..dcbf6afe27e7 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -647,6 +647,8 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 
/* If optlen==0, it is equivalent to val == 0 */
 
+   if (optname == IP_ROUTER_ALERT)
+   return ip_ra_control(sk, val ? 1 : 0, NULL);
if (ip_mroute_opt(optname))
return ip_mroute_setsockopt(sk, optname, optval, optlen);
 
@@ -1157,9 +1159,6 @@ static int do_ip_setsockopt(struct sock *sk, int level,
goto e_inval;
inet->mc_all = val;
break;
-   case IP_ROUTER_ALERT:
-   err = ip_ra_control(sk, val ? 1 : 0, NULL);
-   break;
 
case IP_FREEBIND:
if (optlen < 1)

[PATCH net-next v3 1/5] net: Revert "ipv4: get rid of ip_ra_lock"

2018-03-22 Thread Kirill Tkhai

This reverts commit ba3f571d5dde. The commit was made
after 1215e51edad1 "ipv4: fix a deadlock in ip_ra_control",
and killed ip_ra_lock, which became useless after rtnl_lock()
made used to destroy every raw ipv4 socket. This scales
very bad, and next patch in series reverts 1215e51edad1.
ip_ra_lock will be used again.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 net/ipv4/ip_sockglue.c |   12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 74c962b9b09c..be7c3b71914d 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -334,6 +334,7 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, 
struct ipcm_cookie *ipc,
sent to multicast group to reach destination designated router.
  */
 struct ip_ra_chain __rcu *ip_ra_chain;
+static DEFINE_SPINLOCK(ip_ra_lock);
 
 
 static void ip_ra_destroy_rcu(struct rcu_head *head)
@@ -355,17 +356,21 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 
new_ra = on ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
 
+   spin_lock_bh(_ra_lock);
for (rap = _ra_chain;
-(ra = rtnl_dereference(*rap)) != NULL;
+(ra = rcu_dereference_protected(*rap,
+   lockdep_is_held(_ra_lock))) != NULL;
 rap = >next) {
if (ra->sk == sk) {
if (on) {
+   spin_unlock_bh(_ra_lock);
kfree(new_ra);
return -EADDRINUSE;
}
/* dont let ip_call_ra_chain() use sk again */
ra->sk = NULL;
RCU_INIT_POINTER(*rap, ra->next);
+   spin_unlock_bh(_ra_lock);
 
if (ra->destructor)
ra->destructor(sk);
@@ -379,14 +384,17 @@ int ip_ra_control(struct sock *sk, unsigned char on,
return 0;
}
}
-   if (!new_ra)
+   if (!new_ra) {
+   spin_unlock_bh(_ra_lock);
return -ENOBUFS;
+   }
new_ra->sk = sk;
new_ra->destructor = destructor;
 
RCU_INIT_POINTER(new_ra->next, ra);
rcu_assign_pointer(*rap, new_ra);
sock_hold(sk);
+   spin_unlock_bh(_ra_lock);
 
return 0;
 }

Re: [PATCH V2 net-next 06/14] net/tls: Add generic NIC offload infrastructure

2018-03-21 Thread Kirill Tkhai

Hi, Saeed,

thanks for fixing some of my remarks, but I've dived into the code
more deeply, and found with a sadness, the patch lacks the readability.

It too big and not fit kernel coding style. Please, see some comments
below.

Can we do something with patch length? Is there a way to split it in
several small patches? It's difficult to review the logic of changes.

On 22.03.2018 00:01, Saeed Mahameed wrote:
> From: Ilya Lesokhin 
> 
> This patch adds a generic infrastructure to offload TLS crypto to a
> network devices. It enables the kernel TLS socket to skip encryption
> and authentication operations on the transmit side of the data path.
> Leaving those computationally expensive operations to the NIC.
> 
> The NIC offload infrastructure builds TLS records and pushes them to
> the TCP layer just like the SW KTLS implementation and using the same API.
> TCP segmentation is mostly unaffected. Currently the only exception is
> that we prevent mixed SKBs where only part of the payload requires
> offload. In the future we are likely to add a similar restriction
> following a change cipher spec record.
> 
> The notable differences between SW KTLS and NIC offloaded TLS
> implementations are as follows:
> 1. The offloaded implementation builds "plaintext TLS record", those
> records contain plaintext instead of ciphertext and place holder bytes
> instead of authentication tags.
> 2. The offloaded implementation maintains a mapping from TCP sequence
> number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
> TLS socket, we can use the tls NIC offload infrastructure to obtain
> enough context to encrypt the payload of the SKB.
> A TLS record is released when the last byte of the record is ack'ed,
> this is done through the new icsk_clean_acked callback.
> 
> The infrastructure should be extendable to support various NIC offload
> implementations.  However it is currently written with the
> implementation below in mind:
> The NIC assumes that packets from each offloaded stream are sent as
> plaintext and in-order. It keeps track of the TLS records in the TCP
> stream. When a packet marked for offload is transmitted, the NIC
> encrypts the payload in-place and puts authentication tags in the
> relevant place holders.
> 
> The responsibility for handling out-of-order packets (i.e. TCP
> retransmission, qdisc drops) falls on the netdev driver.
> 
> The netdev driver keeps track of the expected TCP SN from the NIC's
> perspective.  If the next packet to transmit matches the expected TCP
> SN, the driver advances the expected TCP SN, and transmits the packet
> with TLS offload indication.
> 
> If the next packet to transmit does not match the expected TCP SN. The
> driver calls the TLS layer to obtain the TLS record that includes the
> TCP of the packet for transmission. Using this TLS record, the driver
> posts a work entry on the transmit queue to reconstruct the NIC TLS
> state required for the offload of the out-of-order packet. It updates
> the expected TCP SN accordingly and transmit the now in-order packet.
> The same queue is used for packet transmission and TLS context
> reconstruction to avoid the need for flushing the transmit queue before
> issuing the context reconstruction request.
> 
> Signed-off-by: Ilya Lesokhin 
> Signed-off-by: Boris Pismenny 
> Signed-off-by: Aviad Yehezkel 
> Signed-off-by: Saeed Mahameed 
> ---
>  include/net/tls.h |  74 +++-
>  net/tls/Kconfig   |  10 +
>  net/tls/Makefile  |   2 +
>  net/tls/tls_device.c  | 793 
> ++
>  net/tls/tls_device_fallback.c | 415 ++
>  net/tls/tls_main.c|  33 +-
>  6 files changed, 1320 insertions(+), 7 deletions(-)
>  create mode 100644 net/tls/tls_device.c
>  create mode 100644 net/tls/tls_device_fallback.c
> 
> diff --git a/include/net/tls.h b/include/net/tls.h
> index 4913430ab807..0bfb1b0a156a 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
> @@ -77,6 +77,37 @@ struct tls_sw_context {
>   struct scatterlist sg_aead_out[2];
>  };
>  
> +struct tls_record_info {
> + struct list_head list;
> + u32 end_seq;
> + int len;
> + int num_frags;
> + skb_frag_t frags[MAX_SKB_FRAGS];
> +};
> +
> +struct tls_offload_context {
> + struct crypto_aead *aead_send;
> + spinlock_t lock;/* protects records list */
> + struct list_head records_list;
> + struct tls_record_info *open_record;
> + struct tls_record_info *retransmit_hint;
> + u64 hint_record_sn;
> + u64 unacked_record_sn;
> +
> + struct scatterlist sg_tx_data[MAX_SKB_FRAGS];
> + void (*sk_destruct)(struct sock *sk);
> + u8 driver_state[];
> + /* The TLS layer reserves room for driver specific state
> +  * Currently the belief is that there is not enough
> +  * driver specific state

Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure

2018-03-21 Thread Kirill Tkhai

On 21.03.2018 18:53, Boris Pismenny wrote:
> ...
>>
>> Other patches have two licenses in header. Can I distribute this file under 
>> GPL license terms?
>>
> 
> Sure, I'll update the license to match other files under net/tls.
> 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +#include 
>>> +
>>> +#include 
>>> +#include 
>>> +
>>> +/* device_offload_lock is used to synchronize tls_dev_add
>>> + * against NETDEV_DOWN notifications.
>>> + */
>>> +DEFINE_STATIC_PERCPU_RWSEM(device_offload_lock);
>>> +
>>> +static void tls_device_gc_task(struct work_struct *work);
>>> +
>>> +static DECLARE_WORK(tls_device_gc_work, tls_device_gc_task);
>>> +static LIST_HEAD(tls_device_gc_list);
>>> +static LIST_HEAD(tls_device_list);
>>> +static DEFINE_SPINLOCK(tls_device_lock);
>>> +
>>> +static void tls_device_free_ctx(struct tls_context *ctx)
>>> +{
>>> +    struct tls_offload_context *offlad_ctx = tls_offload_ctx(ctx);
>>> +
>>> +    kfree(offlad_ctx);
>>> +    kfree(ctx);
>>> +}
>>> +
>>> +static void tls_device_gc_task(struct work_struct *work)
>>> +{
>>> +    struct tls_context *ctx, *tmp;
>>> +    struct list_head gc_list;
>>> +    unsigned long flags;
>>> +
>>> +    spin_lock_irqsave(_device_lock, flags);
>>> +    INIT_LIST_HEAD(_list);
>>
>> This is stack variable, and it should be initialized outside of global 
>> spinlock.
>> There is LIST_HEAD() primitive for that in kernel.
>> There is one more similar place below.
>>
> 
> Sure.
> 
>>> +    list_splice_init(_device_gc_list, _list);
>>> +    spin_unlock_irqrestore(_device_lock, flags);
>>> +
>>> +    list_for_each_entry_safe(ctx, tmp, _list, list) {
>>> +    struct net_device *netdev = ctx->netdev;
>>> +
>>> +    if (netdev) {
>>> +    netdev->tlsdev_ops->tls_dev_del(netdev, ctx,
>>> +    TLS_OFFLOAD_CTX_DIR_TX);
>>> +    dev_put(netdev);
>>> +    }
>>
>> How is possible the situation we meet NULL netdev here >
> 
> This can happen in tls_device_down. tls_deviec_down is called whenever a 
> netdev that is used for TLS inline crypto offload goes down. It gets called 
> via the NETDEV_DOWN event of the netdevice notifier.
> 
> This flow is somewhat similar to the xfrm_device netdev notifier. However, we 
> do not destroy the socket (as in destroying the xfrm_state in xfrm_device). 
> Instead, we cleanup the netdev state and allow software fallback to handle 
> the rest of the traffic.
> 
>>> +
>>> +    list_del(>list);
>>> +    tls_device_free_ctx(ctx);
>>> +    }
>>> +}
>>> +
>>> +static void tls_device_queue_ctx_destruction(struct tls_context *ctx)
>>> +{
>>> +    unsigned long flags;
>>> +
>>> +    spin_lock_irqsave(_device_lock, flags);
>>> +    list_move_tail(>list, _device_gc_list);
>>> +
>>> +    /* schedule_work inside the spinlock
>>> + * to make sure tls_device_down waits for that work.
>>> + */
>>> +    schedule_work(_device_gc_work);
>>> +
>>> +    spin_unlock_irqrestore(_device_lock, flags);
>>> +}
>>> +
>>> +/* We assume that the socket is already connected */
>>> +static struct net_device *get_netdev_for_sock(struct sock *sk)
>>> +{
>>> +    struct inet_sock *inet = inet_sk(sk);
>>> +    struct net_device *netdev = NULL;
>>> +
>>> +    netdev = dev_get_by_index(sock_net(sk), inet->cork.fl.flowi_oif);
>>> +
>>> +    return netdev;
>>> +}
>>> +
>>> +static int attach_sock_to_netdev(struct sock *sk, struct net_device 
>>> *netdev,
>>> + struct tls_context *ctx)
>>> +{
>>> +    int rc;
>>> +
>>> +    rc = netdev->tlsdev_ops->tls_dev_add(netdev, sk, 
>>> TLS_OFFLOAD_CTX_DIR_TX,
>>> + >crypto_send,
>>> + tcp_sk(sk)->write_seq);
>>> +    if (rc) {
>>> +    pr_err_ratelimited("The netdev has refused to offload this 
>>> socket\n");
>>> +    goto out;
>>> +    }
>>> +
>>> +    rc = 0;
>>> +out:
>>> +    return rc;
>>> +}
>>> +
>>> +static void destroy_record(struct tls_record_info *record)
>>> +{
>>> +    skb_frag_t *frag;
>>> +    int nr_frags = record->num_frags;
>>> +
>>> +    while (nr_frags > 0) {
>>> +    frag = >frags[nr_frags - 1];
>>> +    __skb_frag_unref(frag);
>>> +    --nr_frags;
>>> +    }
>>> +    kfree(record);
>>> +}
>>> +
>>> +static void delete_all_records(struct tls_offload_context *offload_ctx)
>>> +{
>>> +    struct tls_record_info *info, *temp;
>>> +
>>> +    list_for_each_entry_safe(info, temp, _ctx->records_list, list) 
>>> {
>>> +    list_del(>list);
>>> +    destroy_record(info);
>>> +    }
>>> +
>>> +    offload_ctx->retransmit_hint = NULL;
>>> +}
>>> +
>>> +static void tls_icsk_clean_acked(struct sock *sk, u32 acked_seq)
>>> +{
>>> +    struct tls_context *tls_ctx = tls_get_ctx(sk);
>>> +    struct tls_offload_context *ctx;
>>> +    struct tls_record_info *info, *temp;
>>> +    unsigned long flags;
>>> +    u64 deleted_records = 0;
>>> +
>>> +    if (!tls_ctx)
>>> +    return;
>>> +
>>> +    ctx = tls_offload_ctx(tls_ctx);
>>> +
>>> +

Re: [PATCH net-next 06/14] net/tls: Add generic NIC offload infrastructure

2018-03-21 Thread Kirill Tkhai

On 20.03.2018 05:45, Saeed Mahameed wrote:
> From: Ilya Lesokhin 
> 
> This patch adds a generic infrastructure to offload TLS crypto to a
> network devices. It enables the kernel TLS socket to skip encryption
> and authentication operations on the transmit side of the data path.
> Leaving those computationally expensive operations to the NIC.
> 
> The NIC offload infrastructure builds TLS records and pushes them to
> the TCP layer just like the SW KTLS implementation and using the same API.
> TCP segmentation is mostly unaffected. Currently the only exception is
> that we prevent mixed SKBs where only part of the payload requires
> offload. In the future we are likely to add a similar restriction
> following a change cipher spec record.
> 
> The notable differences between SW KTLS and NIC offloaded TLS
> implementations are as follows:
> 1. The offloaded implementation builds "plaintext TLS record", those
> records contain plaintext instead of ciphertext and place holder bytes
> instead of authentication tags.
> 2. The offloaded implementation maintains a mapping from TCP sequence
> number to TLS records. Thus given a TCP SKB sent from a NIC offloaded
> TLS socket, we can use the tls NIC offload infrastructure to obtain
> enough context to encrypt the payload of the SKB.
> A TLS record is released when the last byte of the record is ack'ed,
> this is done through the new icsk_clean_acked callback.
> 
> The infrastructure should be extendable to support various NIC offload
> implementations.  However it is currently written with the
> implementation below in mind:
> The NIC assumes that packets from each offloaded stream are sent as
> plaintext and in-order. It keeps track of the TLS records in the TCP
> stream. When a packet marked for offload is transmitted, the NIC
> encrypts the payload in-place and puts authentication tags in the
> relevant place holders.
> 
> The responsibility for handling out-of-order packets (i.e. TCP
> retransmission, qdisc drops) falls on the netdev driver.
> 
> The netdev driver keeps track of the expected TCP SN from the NIC's
> perspective.  If the next packet to transmit matches the expected TCP
> SN, the driver advances the expected TCP SN, and transmits the packet
> with TLS offload indication.
> 
> If the next packet to transmit does not match the expected TCP SN. The
> driver calls the TLS layer to obtain the TLS record that includes the
> TCP of the packet for transmission. Using this TLS record, the driver
> posts a work entry on the transmit queue to reconstruct the NIC TLS
> state required for the offload of the out-of-order packet. It updates
> the expected TCP SN accordingly and transmit the now in-order packet.
> The same queue is used for packet transmission and TLS context
> reconstruction to avoid the need for flushing the transmit queue before
> issuing the context reconstruction request.
> 
> Signed-off-by: Ilya Lesokhin 
> Signed-off-by: Boris Pismenny 
> Signed-off-by: Aviad Yehezkel 
> Signed-off-by: Saeed Mahameed 
> ---
>  include/net/tls.h |  70 +++-
>  net/tls/Kconfig   |  10 +
>  net/tls/Makefile  |   2 +
>  net/tls/tls_device.c  | 804 
> ++
>  net/tls/tls_device_fallback.c | 419 ++
>  net/tls/tls_main.c|  33 +-
>  6 files changed, 1331 insertions(+), 7 deletions(-)
>  create mode 100644 net/tls/tls_device.c
>  create mode 100644 net/tls/tls_device_fallback.c
> 
> diff --git a/include/net/tls.h b/include/net/tls.h
> index 4913430ab807..ab98a6dc4929 100644
> --- a/include/net/tls.h
> +++ b/include/net/tls.h
> @@ -77,6 +77,37 @@ struct tls_sw_context {
>   struct scatterlist sg_aead_out[2];
>  };
>  
> +struct tls_record_info {
> + struct list_head list;
> + u32 end_seq;
> + int len;
> + int num_frags;
> + skb_frag_t frags[MAX_SKB_FRAGS];
> +};
> +
> +struct tls_offload_context {
> + struct crypto_aead *aead_send;
> + spinlock_t lock;/* protects records list */
> + struct list_head records_list;
> + struct tls_record_info *open_record;
> + struct tls_record_info *retransmit_hint;
> + u64 hint_record_sn;
> + u64 unacked_record_sn;
> +
> + struct scatterlist sg_tx_data[MAX_SKB_FRAGS];
> + void (*sk_destruct)(struct sock *sk);
> + u8 driver_state[];
> + /* The TLS layer reserves room for driver specific state
> +  * Currently the belief is that there is not enough
> +  * driver specific state to justify another layer of indirection
> +  */
> +#define TLS_DRIVER_STATE_SIZE (max_t(size_t, 8, sizeof(void *)))
> +};
> +
> +#define TLS_OFFLOAD_CONTEXT_SIZE 
>   \
> + (ALIGN(sizeof(struct tls_offload_context), sizeof(void *)) +   \
> +  TLS_DRIVER_STATE_SIZE)
> +
>  enum {
>

Re: [PATCH net-next v2 2/5] net: Revert "ipv4: fix a deadlock in ip_ra_control"

2018-03-20 Thread Kirill Tkhai

On 20.03.2018 22:25, Kirill Tkhai wrote:
> Hi, David,
> 
> thanks for the review!
> 
> On 20.03.2018 19:23, David Miller wrote:
>> From: Kirill Tkhai <ktk...@virtuozzo.com>
>> Date: Mon, 19 Mar 2018 12:14:54 +0300
>>
>>> This reverts commit 1215e51edad1.
>>> Since raw_close() is used on every RAW socket destruction,
>>> the changes made by 1215e51edad1 scale sadly. This clearly
>>> seen on endless unshare(CLONE_NEWNET) test, and cleanup_net()
>>> kwork spends a lot of time waiting for rtnl_lock() introduced
>>> by this commit.
>>>
>>> Next patches in series will rework this in another way,
>>> so now we revert 1215e51edad1. Also, it doesn't seen
>>> mrtsock_destruct() takes sk_lock, and the comment to the commit
>>> does not show the actual stack dump. So, there is a question
>>> did we really need in it.
>>>
>>> Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
>>
>> Kirill, I think the commit you are reverting is legitimate.
>>
>> The IP_RAW_CONTROL path has an ABBA deadlock with other paths once
>> you revert this, so you are reintroducing a bug.
> 
> The talk is about IP_ROUTER_ALERT, I assume there is just an erratum.
>  
>> All code paths that must take both RTNL and the socket lock must
>> do them in the same order.  And that order is RTNL then socket
>> lock.
> 
> The place I change in this patch is IP_ROUTER_ALERT. There is only
> a call of ip_ra_control(), while this function does not need socket
> lock. Please, see next patch. It moves this ip_ra_control() out
> of socket lock. And it fixes the problem pointed in reverted patch
> in another way. So, if there is ABBA, after next patch it becomes
> solved. Does this mean I have to merge [2/5] and [3/5] together?

We also can just change the order of patches, and make [3/5] go before [2/5].
Then, the kernel still remains bisectable. How do you think about this?

Thanks,
Kirill

>> But you are breaking that here by getting us back into a state
>> where IP_RAW_CONTROL setsockopt will take the socket lock and
>> then RTNL.
>>
>> Again, we can't take, or retake, RTNL if we have the socket lock
>> currently.
>>
>> The only valid locking order is socket lock then RTNL.
> 
> Thanks,
> Kirill
>

Re: [PATCH net-next v2 2/5] net: Revert "ipv4: fix a deadlock in ip_ra_control"

2018-03-20 Thread Kirill Tkhai

Hi, David,

thanks for the review!

On 20.03.2018 19:23, David Miller wrote:
> From: Kirill Tkhai <ktk...@virtuozzo.com>
> Date: Mon, 19 Mar 2018 12:14:54 +0300
> 
>> This reverts commit 1215e51edad1.
>> Since raw_close() is used on every RAW socket destruction,
>> the changes made by 1215e51edad1 scale sadly. This clearly
>> seen on endless unshare(CLONE_NEWNET) test, and cleanup_net()
>> kwork spends a lot of time waiting for rtnl_lock() introduced
>> by this commit.
>>
>> Next patches in series will rework this in another way,
>> so now we revert 1215e51edad1. Also, it doesn't seen
>> mrtsock_destruct() takes sk_lock, and the comment to the commit
>> does not show the actual stack dump. So, there is a question
>> did we really need in it.
>>
>> Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
> 
> Kirill, I think the commit you are reverting is legitimate.
> 
> The IP_RAW_CONTROL path has an ABBA deadlock with other paths once
> you revert this, so you are reintroducing a bug.

The talk is about IP_ROUTER_ALERT, I assume there is just an erratum.
 
> All code paths that must take both RTNL and the socket lock must
> do them in the same order.  And that order is RTNL then socket
> lock.

The place I change in this patch is IP_ROUTER_ALERT. There is only
a call of ip_ra_control(), while this function does not need socket
lock. Please, see next patch. It moves this ip_ra_control() out
of socket lock. And it fixes the problem pointed in reverted patch
in another way. So, if there is ABBA, after next patch it becomes
solved. Does this mean I have to merge [2/5] and [3/5] together?

> But you are breaking that here by getting us back into a state
> where IP_RAW_CONTROL setsockopt will take the socket lock and
> then RTNL.
> 
> Again, we can't take, or retake, RTNL if we have the socket lock
> currently.
> 
> The only valid locking order is socket lock then RTNL.

Thanks,
Kirill

Re: [PATCH net-next] rds: tcp: remove register_netdevice_notifier infrastructure.

2018-03-19 Thread Kirill Tkhai

On 19.03.2018 16:52, Sowmini Varadhan wrote:
> The netns deletion path does not need to wait for all net_devices
> to be unregistered before dismantling rds_tcp state for the netns
> (we are able to dismantle this state on module unload even when
> all net_devices are active so there is no dependency here).
> 
> This patch removes code related to netdevice notifiers and
> refactors all the code needed to dismantle rds_tcp state
> into a ->exit callback for the pernet_operations used with
> register_pernet_device().
> 
> Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>

I just repeat my words:
rds_tcp_listen_sock destruction looks nice and safe, since all
the places the sockets is dereferenced use sk_callback_lock.
So they don't miss rds_tcp_listen_sock = NULL, as rds_tcp_listen_stop()
takes the lock too.

rds_tcp_conn_list is populated from:

1)rds_tcp_accept_one(), which can't happen after we flushed the queue
in rds_tcp_listen_stop();

2)rds_sendmsg(), which is triggered by userspace, and that's impossible,
when net is dead;

3)rds_ib_cm_handle_connect(), which call rds_conn_create() with init_net
argument only. This may race with module unloading only, but this problem
is already solved in RDS by rds_destroy_pending() check, which care about
that.

Reviewed-by: Kirill Tkhai <ktk...@virtuozzo.com>

(The only thing I don't know is the reason we need to destroy the sockets
 before last netdevice, but I haven't dived into that. Just to mention this...).

Thanks, Sowmini.

Kirill

> ---
>  net/rds/tcp.c |   93 
> ++---
>  1 files changed, 23 insertions(+), 70 deletions(-)
> 
> diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> index 08ea9cd..4f3a32c 100644
> --- a/net/rds/tcp.c
> +++ b/net/rds/tcp.c
> @@ -485,40 +485,6 @@ static __net_init int rds_tcp_init_net(struct net *net)
>   return err;
>  }
>  
> -static void __net_exit rds_tcp_exit_net(struct net *net)
> -{
> - struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
> -
> - if (rtn->rds_tcp_sysctl)
> - unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
> -
> - if (net != _net && rtn->ctl_table)
> - kfree(rtn->ctl_table);
> -
> - /* If rds_tcp_exit_net() is called as a result of netns deletion,
> -  * the rds_tcp_kill_sock() device notifier would already have cleaned
> -  * up the listen socket, thus there is no work to do in this function.
> -  *
> -  * If rds_tcp_exit_net() is called as a result of module unload,
> -  * i.e., due to rds_tcp_exit() -> unregister_pernet_subsys(), then
> -  * we do need to clean up the listen socket here.
> -  */
> - if (rtn->rds_tcp_listen_sock) {
> - struct socket *lsock = rtn->rds_tcp_listen_sock;
> -
> - rtn->rds_tcp_listen_sock = NULL;
> - rds_tcp_listen_stop(lsock, >rds_tcp_accept_w);
> - }
> -}
> -
> -static struct pernet_operations rds_tcp_net_ops = {
> - .init = rds_tcp_init_net,
> - .exit = rds_tcp_exit_net,
> - .id = _tcp_netid,
> - .size = sizeof(struct rds_tcp_net),
> - .async = true,
> -};
> -
>  static void rds_tcp_kill_sock(struct net *net)
>  {
>   struct rds_tcp_connection *tc, *_tc;
> @@ -546,40 +512,38 @@ static void rds_tcp_kill_sock(struct net *net)
>   rds_conn_destroy(tc->t_cpath->cp_conn);
>  }
>  
> -void *rds_tcp_listen_sock_def_readable(struct net *net)
> +static void __net_exit rds_tcp_exit_net(struct net *net)
>  {
>   struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
> - struct socket *lsock = rtn->rds_tcp_listen_sock;
>  
> - if (!lsock)
> - return NULL;
> + rds_tcp_kill_sock(net);
>  
> - return lsock->sk->sk_user_data;
> + if (rtn->rds_tcp_sysctl)
> + unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
> +
> + if (net != _net && rtn->ctl_table)
> + kfree(rtn->ctl_table);
>  }
>  
> -static int rds_tcp_dev_event(struct notifier_block *this,
> -  unsigned long event, void *ptr)
> +static struct pernet_operations rds_tcp_net_ops = {
> + .init = rds_tcp_init_net,
> + .exit = rds_tcp_exit_net,
> + .id = _tcp_netid,
> + .size = sizeof(struct rds_tcp_net),
> + .async = true,
> +};
> +
> +void *rds_tcp_listen_sock_def_readable(struct net *net)
>  {
> - struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> + struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
> + struct socket *lsock = rtn->rds_tcp_listen_sock;
>  
> - /* rds-tcp registers as a pern

Re: [PATCH 1/2 v3] net: add uevent socket member

2018-03-19 Thread Kirill Tkhai

Thanks for doing this. One small comment below.

On 17.03.2018 14:08, Christian Brauner wrote:
> This commit adds struct uevent_sock to struct net. Since struct uevent_sock
> records the position of the uevent socket in the uevent socket list we can
> trivially remove it from the uevent socket list during cleanup. This speeds
> up the old removal codepath.
> Note, list_del() will hit __list_del_entry_valid() in its call chain which
> will validate that the element is a member of the list. If it isn't it will
> take care that the list is not modified.
> 
> Signed-off-by: Christian Brauner 
> ---
> Changelog v2->v3:
> * patch added
>   This patch was split out of the follow up patch
>   Subject: [PATCH 2/2 v3] netns: send uevent messages
> 
> Changelog v1->v2:
> * patch not present
> 
> Changelog v0->v1:
> * patch not present
> ---
>  include/net/net_namespace.h |  4 +++-
>  lib/kobject_uevent.c| 19 +--
>  2 files changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index f306b2aa15a4..abd7d91bffac 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -40,7 +40,7 @@ struct net_device;
>  struct sock;
>  struct ctl_table_header;
>  struct net_generic;
> -struct sock;
> +struct uevent_sock;
>  struct netns_ipvs;
>  
>  
> @@ -79,6 +79,8 @@ struct net {
>   struct sock *rtnl;  /* rtnetlink socket */
>   struct sock *genl_sock;
>  
> + struct uevent_sock  *uevent_sock;   /* uevent socket */
> +
>   struct list_headdev_base_head;
>   struct hlist_head   *dev_name_head;
>   struct hlist_head   *dev_index_head;
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 9fe6ec8fda28..cbdc60542cab 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -32,11 +32,13 @@ u64 uevent_seqnum;
>  #ifdef CONFIG_UEVENT_HELPER
>  char uevent_helper[UEVENT_HELPER_PATH_LEN] = CONFIG_UEVENT_HELPER_PATH;
>  #endif
> -#ifdef CONFIG_NET
> +
>  struct uevent_sock {
>   struct list_head list;
>   struct sock *sk;
>  };
> +
> +#ifdef CONFIG_NET
>  static LIST_HEAD(uevent_sock_list);
>  #endif
>  
> @@ -621,6 +623,9 @@ static int uevent_net_init(struct net *net)
>   kfree(ue_sk);
>   return -ENODEV;
>   }
> +
> + net->uevent_sock = ue_sk;
> +
>   mutex_lock(_sock_mutex);
>   list_add_tail(_sk->list, _sock_list);
>   mutex_unlock(_sock_mutex);
> @@ -629,22 +634,16 @@ static int uevent_net_init(struct net *net)
>  
>  static void uevent_net_exit(struct net *net)
>  {
> - struct uevent_sock *ue_sk;
> + struct uevent_sock *ue_sk = net->uevent_sock;
>  
>   mutex_lock(_sock_mutex);
> - list_for_each_entry(ue_sk, _sock_list, list) {
> - if (sock_net(ue_sk->sk) == net)
> - goto found;
> - }
> - mutex_unlock(_sock_mutex);
> - return;
> -
> -found:
>   list_del(_sk->list);
>   mutex_unlock(_sock_mutex);
>  
>   netlink_kernel_release(ue_sk->sk);
>   kfree(ue_sk);
> +
> + return;

There is end of function. Doesn't return is excess here?

>  }
>  
>  static struct pernet_operations uevent_net_ops = {
> 

Kirill

[PATCH net-next 1/2] net: Convert lowpan_frags_ops

2018-03-19 Thread Kirill Tkhai

These pernet_operations register and unregister sysctl.
Also, there is inet_frags_exit_net() called in exit method,
which has to be safe after a560002437d3 "net: Fix hlist
corruptions in inet_evict_bucket()".

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 net/ieee802154/6lowpan/reassembly.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ieee802154/6lowpan/reassembly.c 
b/net/ieee802154/6lowpan/reassembly.c
index 85bf86ad6b18..a9ccb1322f69 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -603,6 +603,7 @@ static void __net_exit lowpan_frags_exit_net(struct net 
*net)
 static struct pernet_operations lowpan_frags_ops = {
.init = lowpan_frags_init_net,
.exit = lowpan_frags_exit_net,
+   .async = true,
 };
 
 int __init lowpan_net_frag_init(void)

[PATCH net-next 2/2] net: Convert nf_ct_net_ops

2018-03-19 Thread Kirill Tkhai

These pernet_operations register and unregister sysctl.
Also, there is inet_frags_exit_net() called in exit method,
which has to be safe after a560002437d3 "net: Fix hlist
corruptions in inet_evict_bucket()".

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 net/ipv6/netfilter/nf_conntrack_reasm.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
b/net/ipv6/netfilter/nf_conntrack_reasm.c
index b84ce3e6d728..34136fe80ed5 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -646,6 +646,7 @@ static void nf_ct_net_exit(struct net *net)
 static struct pernet_operations nf_ct_net_ops = {
.init = nf_ct_net_init,
.exit = nf_ct_net_exit,
+   .async = true,
 };
 
 int nf_ct_frag6_init(void)

[PATCH net-next 0/2] Converting pernet_operations (part #10)

2018-03-19 Thread Kirill Tkhai

Hi,

this series continues to review and to convert pernet_operations
to make them possible to be executed in parallel for several
net namespaces at the same time. There are two ops similar
to already converted ip4_frags_ops.

This requires a560002437d3 "net: Fix hlist corruptions
in inet_evict_bucket()", which is currently in net.git.

It seems to be the finial convertion series.

Thanks,
Kirill
---

Kirill Tkhai (2):
  net: Convert lowpan_frags_ops
  net: Convert nf_ct_net_ops


 net/ieee802154/6lowpan/reassembly.c |1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c |1 +
 2 files changed, 2 insertions(+)

--
Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>

[PATCH net-next] net: Convert can_pernet_ops

2018-03-19 Thread Kirill Tkhai

These pernet_operations create and destroy /proc entries
and cancel per-net timer.

Also, there are unneed iterations over empty list of net
devices, since all net devices must be already moved
to init_net or unregistered by default_device_ops. This
already was mentioned here:

https://marc.info/?l=linux-can=150169589119335=2

So, it looks safe to make them async.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 net/can/af_can.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/net/can/af_can.c b/net/can/af_can.c
index 6da324550eec..e899970398a1 100644
--- a/net/can/af_can.c
+++ b/net/can/af_can.c
@@ -954,6 +954,7 @@ static struct notifier_block can_netdev_notifier 
__read_mostly = {
 static struct pernet_operations can_pernet_ops __read_mostly = {
.init = can_pernet_init,
.exit = can_pernet_exit,
+   .async = true,
 };
 
 static __init int can_init(void)

[PATCH net-next can] Converting pernet_operations (part #9)

2018-03-19 Thread Kirill Tkhai

Hi,

this series continues to review and to convert pernet_operations
to make them possible to be executed in parallel for several
net namespaces at the same time. There is only one patch converting
can_pernet_ops.

Thanks,
Kirill
---

Kirill Tkhai (1):
  net: Convert can_pernet_ops


 net/can/af_can.c |1 +
 1 file changed, 1 insertion(+)

--
Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>

Re: [rds-devel] [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)

2018-03-19 Thread Kirill Tkhai

Hi, Sowmini,

thanks for looking into this.

On 18.03.2018 23:45, Sowmini Varadhan wrote:
> On (03/18/18 00:55), Kirill Tkhai wrote:
>>
>> I just want to make rds not using NETDEV_UNREGISTER_FINAL. If there is
>> another solution to do that, I'm not again that.
> 
> The patch below takes care of this. I've done some preliminary testing,
> and I'll send it upstream after doing additional self-review/testing.
> Please also take a look, if you can, to see if I missed something.
> 
> Thanks for the input,
> 
> --Sowmini
> ---patch follows
> 
> diff --git a/net/rds/tcp.c b/net/rds/tcp.c
> index 08ea9cd..87c2643 100644
> --- a/net/rds/tcp.c
> +++ b/net/rds/tcp.c
> @@ -485,40 +485,6 @@ static __net_init int rds_tcp_init_net(struct net *net)
>   return err;
>  }
>  
> -static void __net_exit rds_tcp_exit_net(struct net *net)
> -{
> - struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
> -
> - if (rtn->rds_tcp_sysctl)
> - unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
> -
> - if (net != _net && rtn->ctl_table)
> - kfree(rtn->ctl_table);
> -
> - /* If rds_tcp_exit_net() is called as a result of netns deletion,
> -  * the rds_tcp_kill_sock() device notifier would already have cleaned
> -  * up the listen socket, thus there is no work to do in this function.
> -  *
> -  * If rds_tcp_exit_net() is called as a result of module unload,
> -  * i.e., due to rds_tcp_exit() -> unregister_pernet_subsys(), then
> -  * we do need to clean up the listen socket here.
> -  */
> - if (rtn->rds_tcp_listen_sock) {
> - struct socket *lsock = rtn->rds_tcp_listen_sock;
> -
> - rtn->rds_tcp_listen_sock = NULL;
> - rds_tcp_listen_stop(lsock, >rds_tcp_accept_w);
> - }
> -}
> -
> -static struct pernet_operations rds_tcp_net_ops = {
> - .init = rds_tcp_init_net,
> - .exit = rds_tcp_exit_net,
> - .id = _tcp_netid,
> - .size = sizeof(struct rds_tcp_net),
> - .async = true,
> -};
> -
>  static void rds_tcp_kill_sock(struct net *net)
>  {
>   struct rds_tcp_connection *tc, *_tc;
> @@ -546,40 +512,38 @@ static void rds_tcp_kill_sock(struct net *net)
>   rds_conn_destroy(tc->t_cpath->cp_conn);
>  }
>  
> -void *rds_tcp_listen_sock_def_readable(struct net *net)
> +static void __net_exit rds_tcp_exit_net(struct net *net)
>  {
>   struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid);
> - struct socket *lsock = rtn->rds_tcp_listen_sock;
>  
> - if (!lsock)
> - return NULL;
> + rds_tcp_kill_sock(net);

rds_tcp_listen_sock destruction looks nice and safe, since all
the places the sockets is dereferenced use sk_callback_lock.
So they don't miss rds_tcp_listen_sock = NULL, as rds_tcp_listen_stop()
takes the lock too.

rds_tcp_conn_list is populated from:

1)rds_tcp_accept_one(), which can't happen after we flushed the queue
in rds_tcp_listen_stop();

2)rds_sendmsg(), which is triggered by userspace, and that's impossible,
when net is dead;

3)rds_ib_cm_handle_connect(), which call rds_conn_create() with init_net
argument only. This may race with module unloading only, but this problem
is already solved in RDS by rds_destroy_pending() check, which care about
that:

static bool rds_tcp_is_unloading(struct rds_connection *conn)
{
return atomic_read(_tcp_unloading) != 0;
}

static void rds_tcp_exit(void)
{
rds_tcp_set_unloading();
synchronize_rcu();
...
}

static struct rds_connection * __rds_conn_create(...)
{
...
rcu_read_lock();
if (rds_destroy_pending(conn))
ret = -ENETDOWN;
else
ret = trans->conn_alloc(conn, GFP_ATOMIC);
...
}

So, everything looks good for me.

Kirill

>  
> - return lsock->sk->sk_user_data;
> + if (rtn->rds_tcp_sysctl)
> + unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
> +
> + if (net != _net && rtn->ctl_table)
> + kfree(rtn->ctl_table);
>  }
>  
> -static int rds_tcp_dev_event(struct notifier_block *this,
> -  unsigned long event, void *ptr)
> +static struct pernet_operations rds_tcp_net_ops = {
> + .init = rds_tcp_init_net,
> + .exit = rds_tcp_exit_net,
> + .id = _tcp_netid,
> + .size = sizeof(struct rds_tcp_net),
> + .async = true,
> +};
> +
> +void *rds_tcp_listen_sock_def_readable(struct net *net)
>  {
> - struct net_device *dev = netdev_notifier_info_to_dev(ptr);
> + struct rds_tcp_net *rtn = net_generic(ne

[PATCH net-next v2 5/5] net: Replace ip_ra_lock with per-net mutex

2018-03-19 Thread Kirill Tkhai

Since ra_chain is per-net, we may use per-net mutexes
to protect them in ip_ra_control(). This improves
scalability.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 include/net/netns/ipv4.h |1 +
 net/core/net_namespace.c |1 +
 net/ipv4/ip_sockglue.c   |   15 ++-
 3 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 97d7ee6667c7..8491bc9c86b1 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -50,6 +50,7 @@ struct netns_ipv4 {
struct ipv4_devconf *devconf_all;
struct ipv4_devconf *devconf_dflt;
struct ip_ra_chain __rcu *ra_chain;
+   struct mutexra_mutex;
 #ifdef CONFIG_IP_MULTIPLE_TABLES
struct fib_rules_ops*rules_ops;
boolfib_has_custom_rules;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index c340d5cfbdec..95ba2c53bd9a 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -301,6 +301,7 @@ static __net_init int setup_net(struct net *net, struct 
user_namespace *user_ns)
net->user_ns = user_ns;
idr_init(>netns_ids);
spin_lock_init(>nsid_lock);
+   mutex_init(>ipv4.ra_mutex);
 
list_for_each_entry(ops, _list, list) {
error = ops_init(ops, net);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index f36d35fe924b..5ad2d8ed3a3f 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -322,9 +322,6 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, 
struct ipcm_cookie *ipc,
return 0;
 }
 
-static DEFINE_SPINLOCK(ip_ra_lock);
-
-
 static void ip_ra_destroy_rcu(struct rcu_head *head)
 {
struct ip_ra_chain *ra = container_of(head, struct ip_ra_chain, rcu);
@@ -345,21 +342,21 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 
new_ra = on ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
 
-   spin_lock_bh(_ra_lock);
+   mutex_lock(>ipv4.ra_mutex);
for (rap = >ipv4.ra_chain;
 (ra = rcu_dereference_protected(*rap,
-   lockdep_is_held(_ra_lock))) != NULL;
+   lockdep_is_held(>ipv4.ra_mutex))) != NULL;
 rap = >next) {
if (ra->sk == sk) {
if (on) {
-   spin_unlock_bh(_ra_lock);
+   mutex_unlock(>ipv4.ra_mutex);
kfree(new_ra);
return -EADDRINUSE;
}
/* dont let ip_call_ra_chain() use sk again */
ra->sk = NULL;
RCU_INIT_POINTER(*rap, ra->next);
-   spin_unlock_bh(_ra_lock);
+   mutex_unlock(>ipv4.ra_mutex);
 
if (ra->destructor)
ra->destructor(sk);
@@ -374,7 +371,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
}
}
if (!new_ra) {
-   spin_unlock_bh(_ra_lock);
+   mutex_unlock(>ipv4.ra_mutex);
return -ENOBUFS;
}
new_ra->sk = sk;
@@ -383,7 +380,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
RCU_INIT_POINTER(new_ra->next, ra);
rcu_assign_pointer(*rap, new_ra);
sock_hold(sk);
-   spin_unlock_bh(_ra_lock);
+   mutex_unlock(>ipv4.ra_mutex);
 
return 0;
 }

[PATCH net-next v2 4/5] net: Make ip_ra_chain per struct net

2018-03-19 Thread Kirill Tkhai

This is optimization, which makes ip_call_ra_chain()
iterate less sockets to find the sockets it's looking for.

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 include/net/ip.h |   13 +++--
 include/net/netns/ipv4.h |1 +
 net/ipv4/ip_input.c  |5 ++---
 net/ipv4/ip_sockglue.c   |   15 ++-
 4 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index fe63ba95d12b..d53b5a9eae34 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -91,6 +91,17 @@ static inline int inet_sdif(struct sk_buff *skb)
return 0;
 }
 
+/* Special input handler for packets caught by router alert option.
+   They are selected only by protocol field, and then processed likely
+   local ones; but only if someone wants them! Otherwise, router
+   not running rsvpd will kill RSVP.
+
+   It is user level problem, what it will make with them.
+   I have no idea, how it will masquearde or NAT them (it is joke, joke :-)),
+   but receiver should be enough clever f.e. to forward mtrace requests,
+   sent to multicast group to reach destination designated router.
+ */
+
 struct ip_ra_chain {
struct ip_ra_chain __rcu *next;
struct sock *sk;
@@ -101,8 +112,6 @@ struct ip_ra_chain {
struct rcu_head rcu;
 };
 
-extern struct ip_ra_chain __rcu *ip_ra_chain;
-
 /* IP flags. */
 #define IP_CE  0x8000  /* Flag: "Congestion"   */
 #define IP_DF  0x4000  /* Flag: "Don't Fragment"   */
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 382bfd7583cf..97d7ee6667c7 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -49,6 +49,7 @@ struct netns_ipv4 {
 #endif
struct ipv4_devconf *devconf_all;
struct ipv4_devconf *devconf_dflt;
+   struct ip_ra_chain __rcu *ra_chain;
 #ifdef CONFIG_IP_MULTIPLE_TABLES
struct fib_rules_ops*rules_ops;
boolfib_has_custom_rules;
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index 57fc13c6ab2b..7582713dd18f 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -159,7 +159,7 @@ bool ip_call_ra_chain(struct sk_buff *skb)
struct net_device *dev = skb->dev;
struct net *net = dev_net(dev);
 
-   for (ra = rcu_dereference(ip_ra_chain); ra; ra = 
rcu_dereference(ra->next)) {
+   for (ra = rcu_dereference(net->ipv4.ra_chain); ra; ra = 
rcu_dereference(ra->next)) {
struct sock *sk = ra->sk;
 
/* If socket is bound to an interface, only report
@@ -167,8 +167,7 @@ bool ip_call_ra_chain(struct sk_buff *skb)
 */
if (sk && inet_sk(sk)->inet_num == protocol &&
(!sk->sk_bound_dev_if ||
-sk->sk_bound_dev_if == dev->ifindex) &&
-   net_eq(sock_net(sk), net)) {
+sk->sk_bound_dev_if == dev->ifindex)) {
if (ip_is_fragment(ip_hdr(skb))) {
if (ip_defrag(net, skb, 
IP_DEFRAG_CALL_RA_CHAIN))
return true;
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index bf5f44b27b7e..f36d35fe924b 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -322,18 +322,6 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, 
struct ipcm_cookie *ipc,
return 0;
 }
 
-
-/* Special input handler for packets caught by router alert option.
-   They are selected only by protocol field, and then processed likely
-   local ones; but only if someone wants them! Otherwise, router
-   not running rsvpd will kill RSVP.
-
-   It is user level problem, what it will make with them.
-   I have no idea, how it will masquearde or NAT them (it is joke, joke :-)),
-   but receiver should be enough clever f.e. to forward mtrace requests,
-   sent to multicast group to reach destination designated router.
- */
-struct ip_ra_chain __rcu *ip_ra_chain;
 static DEFINE_SPINLOCK(ip_ra_lock);
 
 
@@ -350,6 +338,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
 {
struct ip_ra_chain *ra, *new_ra;
struct ip_ra_chain __rcu **rap;
+   struct net *net = sock_net(sk);
 
if (sk->sk_type != SOCK_RAW || inet_sk(sk)->inet_num == IPPROTO_RAW)
return -EINVAL;
@@ -357,7 +346,7 @@ int ip_ra_control(struct sock *sk, unsigned char on,
new_ra = on ? kmalloc(sizeof(*new_ra), GFP_KERNEL) : NULL;
 
spin_lock_bh(_ra_lock);
-   for (rap = _ra_chain;
+   for (rap = >ipv4.ra_chain;
 (ra = rcu_dereference_protected(*rap,
lockdep_is_held(_ra_lock))) != NULL;
 rap = >next) {

1 2 3 4 5 >

1 - 100 of 418 matches

Mail list logo