Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed

2018-03-07 Thread Eric Dumazet
On Wed, 2018-03-07 at 13:51 +, Yuval Mintz wrote:
> > > > > After unshare test kworker hangs for ages:
> > > > > 
> > > > > $ while :; do unshare -n true; done &
> > > > > 
> > > > > $ perf report 
> > > > > -   88,82% 0,00%  kworker/u16:0  [kernel.vmlinux]  [k
> > > > > ]
> > > > > cleanup_net
> > > > >  cleanup_net
> > > > >    - ops_exit_list.isra.9
> > > > >   - 85,68% igmp6_net_exit
> > > > >  - 53,31% sock_release
> > > > > - inet_release
> > > > >    - 25,33% rawv6_close
> > > > >   - ip6mr_sk_done
> > > > >  + 23,38% synchronize_rcu
> > > > > 
> > > > > Keep in mind, this perf report shows the time a function was
> > > > > executing, and
> > > > > it does not show the time, it was sleeping. But it's easy to
> > > > > imagine,
> > > > > how
> > > > > much it is sleeping, if synchronize_rcu() execution takes the
> > > > > most
> > > > > time.
> > > > > Top shows the kworker R time is less then 1%.
> > > > > 
> > > > > This happen, because of in ip6mr_sk_done() we do too many
> > > > > synchronize_rcu(),
> > > > > even for the sockets, that are not referenced in mr_table,
> > > > > and which
> > > > > are not
> > > > > need it. So, the progress of kworker becomes very slow.
> > > > > 
> > > > > The patch introduces apparent solution, and it makes
> > > > > ip6mr_sk_done()
> > > > > to skip
> > > > > synchronize_rcu() for sockets, that are not need that. After
> > > > > the
> > > > > patch,
> > > > > kworker becomes able to warm the cpu up as expected.
> > > > > 
> > > > > Signed-off-by: Kirill Tkhai 
> > > > > ---
> > > > >  net/ipv6/ip6mr.c |4 +++-
> > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
> > > > > index 2a38f9de45d3..290a8d0d5eac 100644
> > > > > --- a/net/ipv6/ip6mr.c
> > > > > +++ b/net/ipv6/ip6mr.c
> > > > > @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk)
> > > > >   }
> > > > >   }
> > > > >   rtnl_unlock();
> > > > > - synchronize_rcu();
> > > > > +
> > > > > + if (!err)
> > > > > + synchronize_rcu();
> > > > > 
> > > > 
> > > > 
> > > > But... what is this synchronize_rcu() doing exactly ?
> > > > 
> > > > This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c
> > > > 
> > > > ("ip6mr: Make mroute_sk rcu-based")
> > > > 
> > > > Typically on a delete, the synchronize_rcu() would be needed
> > > > before
> > > > freeing the deleted object.
> > > > 
> > > > But nowadays we have better way : SOCK_RCU_FREE
> > > 
> > > Hm. I'm agree with you. This is hot path, and mroute sockets
> > > created from
> > 
> > userspace
> > > will delay userspace tasks close() and exit(). Since there may be
> > > many such
> > 
> > sockets,
> > > we may get a zombie task, which can't be reaped for ages. This
> > > slows down
> > 
> > the system
> > > directly.
> > > 
> > > Fix for pernet_operations works, but we need generic solution
> > > instead.
> > > 
> > > The commit "8571ab479a6e1ef46ead5ebee567e128a422767c" says:
> > > 
> > > ip6mr: Make mroute_sk rcu-based
> > > 
> > > In ipmr the mr_table socket is handled under RCU. Introduce
> > > the same
> > > for ip6mr.
> > > 
> > > There is no pointing to improvements it invents, or to the
> > > problem it solves.
> > 
> > The description
> > > looks like a cleanup. It's too expensive cleanup, if it worsens
> > > the
> > 
> > performance a hundred
> > > times.
> > > 
> > > Can't we simply revert it?!
> > > 
> > > Yuval, do you have ideas to fix that (maybe, via SOCK_RCU_FREE
> > > suggested
> > 
> > by Eric)?
> 
> Sorry, failed to notice ip6mr_sk_done() is called unconditionally
> from
> rawv6_close(). But even with your suggested fix it should be
> ~resolved
> [as only sockets used for ip6mr would reach the sync]

Yes but some user setups could still hit this synchronize_rcu() quite
hard.

New synchronize_rcu() should not be added in possibly critical paths
unless absolutely needed (if really there is no other ways)

We have to be very careful here, since this kind of mistake can have a
100x impact.

> .
> Or are you claiming there's still some performance hit even with your
> suggested change?
> 
> > > 
> > > We actually use rcu_dereference() in ip6mr_cache_report() only.
> > > The only
> > 
> > user of dereference
> > > is sock_queue_rcv_skb(). Superficial analysis shows we purge the
> > > queue in
> > 
> > inet_sock_destruct().
> > 
> > + So this change should be safe.
> 
> I might have misunderstood the comment from
> commit 4c9687098f24 ("ipmr: RCU conversion of mroute_sk") when
> writing
> this; Thought comment regarding ip_ra_destroy() meant that for the
> IPv6 case
> we DO have to make sure there's a grace-period before destroying the
> socket.

Yeah but since 2010 things have changed a lot in the kernel ;)

And we now have SOCK_RCU_FREE for the few sockets 

RE: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed

2018-03-07 Thread Yuval Mintz
> >>> After unshare test kworker hangs for ages:
> >>>
> >>> $ while :; do unshare -n true; done &
> >>>
> >>> $ perf report 
> >>> -   88,82% 0,00%  kworker/u16:0  [kernel.vmlinux]  [k]
> >>> cleanup_net
> >>>  cleanup_net
> >>>    - ops_exit_list.isra.9
> >>>   - 85,68% igmp6_net_exit
> >>>  - 53,31% sock_release
> >>> - inet_release
> >>>    - 25,33% rawv6_close
> >>>   - ip6mr_sk_done
> >>>  + 23,38% synchronize_rcu
> >>>
> >>> Keep in mind, this perf report shows the time a function was
> >>> executing, and
> >>> it does not show the time, it was sleeping. But it's easy to imagine,
> >>> how
> >>> much it is sleeping, if synchronize_rcu() execution takes the most
> >>> time.
> >>> Top shows the kworker R time is less then 1%.
> >>>
> >>> This happen, because of in ip6mr_sk_done() we do too many
> >>> synchronize_rcu(),
> >>> even for the sockets, that are not referenced in mr_table, and which
> >>> are not
> >>> need it. So, the progress of kworker becomes very slow.
> >>>
> >>> The patch introduces apparent solution, and it makes ip6mr_sk_done()
> >>> to skip
> >>> synchronize_rcu() for sockets, that are not need that. After the
> >>> patch,
> >>> kworker becomes able to warm the cpu up as expected.
> >>>
> >>> Signed-off-by: Kirill Tkhai 
> >>> ---
> >>>  net/ipv6/ip6mr.c |4 +++-
> >>>  1 file changed, 3 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
> >>> index 2a38f9de45d3..290a8d0d5eac 100644
> >>> --- a/net/ipv6/ip6mr.c
> >>> +++ b/net/ipv6/ip6mr.c
> >>> @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk)
> >>>   }
> >>>   }
> >>>   rtnl_unlock();
> >>> - synchronize_rcu();
> >>> +
> >>> + if (!err)
> >>> + synchronize_rcu();
> >>>
> >>
> >>
> >> But... what is this synchronize_rcu() doing exactly ?
> >>
> >> This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c
> >>
> >> ("ip6mr: Make mroute_sk rcu-based")
> >>
> >> Typically on a delete, the synchronize_rcu() would be needed before
> >> freeing the deleted object.
> >>
> >> But nowadays we have better way : SOCK_RCU_FREE
> >
> > Hm. I'm agree with you. This is hot path, and mroute sockets created from
> userspace
> > will delay userspace tasks close() and exit(). Since there may be many such
> sockets,
> > we may get a zombie task, which can't be reaped for ages. This slows down
> the system
> > directly.
> >
> > Fix for pernet_operations works, but we need generic solution instead.
> >
> > The commit "8571ab479a6e1ef46ead5ebee567e128a422767c" says:
> >
> > ip6mr: Make mroute_sk rcu-based
> >
> > In ipmr the mr_table socket is handled under RCU. Introduce the same
> > for ip6mr.
> >
> > There is no pointing to improvements it invents, or to the problem it 
> > solves.
> The description
> > looks like a cleanup. It's too expensive cleanup, if it worsens the
> performance a hundred
> > times.
> >
> > Can't we simply revert it?!
> >
> > Yuval, do you have ideas to fix that (maybe, via SOCK_RCU_FREE suggested
> by Eric)?

Sorry, failed to notice ip6mr_sk_done() is called unconditionally from
rawv6_close(). But even with your suggested fix it should be ~resolved
[as only sockets used for ip6mr would reach the sync].
Or are you claiming there's still some performance hit even with your
suggested change?

> >
> > We actually use rcu_dereference() in ip6mr_cache_report() only. The only
> user of dereference
> > is sock_queue_rcv_skb(). Superficial analysis shows we purge the queue in
> inet_sock_destruct().
> 
> + So this change should be safe.

I might have misunderstood the comment from
commit 4c9687098f24 ("ipmr: RCU conversion of mroute_sk") when writing
this; Thought comment regarding ip_ra_destroy() meant that for the IPv6 case
we DO have to make sure there's a grace-period before destroying the socket.

> 
> > Thanks,
> > Kirill
> >


Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed

2018-03-07 Thread Kirill Tkhai
On 07.03.2018 12:22, Kirill Tkhai wrote:
> On 06.03.2018 19:50, Eric Dumazet wrote:
>> On Tue, 2018-03-06 at 19:24 +0300, Kirill Tkhai wrote:
>>> After unshare test kworker hangs for ages:
>>>
>>> $ while :; do unshare -n true; done &
>>>
>>> $ perf report 
>>> -   88,82% 0,00%  kworker/u16:0  [kernel.vmlinux]  [k]
>>> cleanup_net
>>>  cleanup_net
>>>    - ops_exit_list.isra.9
>>>   - 85,68% igmp6_net_exit
>>>  - 53,31% sock_release
>>> - inet_release
>>>    - 25,33% rawv6_close
>>>   - ip6mr_sk_done
>>>  + 23,38% synchronize_rcu
>>>
>>> Keep in mind, this perf report shows the time a function was
>>> executing, and
>>> it does not show the time, it was sleeping. But it's easy to imagine,
>>> how
>>> much it is sleeping, if synchronize_rcu() execution takes the most
>>> time.
>>> Top shows the kworker R time is less then 1%.
>>>
>>> This happen, because of in ip6mr_sk_done() we do too many
>>> synchronize_rcu(),
>>> even for the sockets, that are not referenced in mr_table, and which
>>> are not
>>> need it. So, the progress of kworker becomes very slow.
>>>
>>> The patch introduces apparent solution, and it makes ip6mr_sk_done()
>>> to skip
>>> synchronize_rcu() for sockets, that are not need that. After the
>>> patch,
>>> kworker becomes able to warm the cpu up as expected.
>>>
>>> Signed-off-by: Kirill Tkhai 
>>> ---
>>>  net/ipv6/ip6mr.c |4 +++-
>>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
>>> index 2a38f9de45d3..290a8d0d5eac 100644
>>> --- a/net/ipv6/ip6mr.c
>>> +++ b/net/ipv6/ip6mr.c
>>> @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk)
>>>     }
>>>     }
>>>     rtnl_unlock();
>>> -   synchronize_rcu();
>>> +
>>> +   if (!err)
>>> +   synchronize_rcu();
>>>  
>>
>>
>> But... what is this synchronize_rcu() doing exactly ?
>>
>> This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c
>>
>> ("ip6mr: Make mroute_sk rcu-based")
>>
>> Typically on a delete, the synchronize_rcu() would be needed before
>> freeing the deleted object.
>>
>> But nowadays we have better way : SOCK_RCU_FREE
> 
> Hm. I'm agree with you. This is hot path, and mroute sockets created from 
> userspace
> will delay userspace tasks close() and exit(). Since there may be many such 
> sockets,
> we may get a zombie task, which can't be reaped for ages. This slows down the 
> system
> directly.
> 
> Fix for pernet_operations works, but we need generic solution instead.
> 
> The commit "8571ab479a6e1ef46ead5ebee567e128a422767c" says:
> 
> ip6mr: Make mroute_sk rcu-based
> 
> In ipmr the mr_table socket is handled under RCU. Introduce the same
> for ip6mr.
> 
> There is no pointing to improvements it invents, or to the problem it solves. 
> The description
> looks like a cleanup. It's too expensive cleanup, if it worsens the 
> performance a hundred
> times.
> 
> Can't we simply revert it?!
> 
> Yuval, do you have ideas to fix that (maybe, via SOCK_RCU_FREE suggested by 
> Eric)?
> 
> We actually use rcu_dereference() in ip6mr_cache_report() only. The only user 
> of dereference
> is sock_queue_rcv_skb(). Superficial analysis shows we purge the queue in 
> inet_sock_destruct().

+ So this change should be safe.

> Thanks,
> Kirill
> 


Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed

2018-03-07 Thread Kirill Tkhai
On 06.03.2018 19:50, Eric Dumazet wrote:
> On Tue, 2018-03-06 at 19:24 +0300, Kirill Tkhai wrote:
>> After unshare test kworker hangs for ages:
>>
>> $ while :; do unshare -n true; done &
>>
>> $ perf report 
>> -   88,82% 0,00%  kworker/u16:0  [kernel.vmlinux]  [k]
>> cleanup_net
>>  cleanup_net
>>    - ops_exit_list.isra.9
>>   - 85,68% igmp6_net_exit
>>  - 53,31% sock_release
>> - inet_release
>>    - 25,33% rawv6_close
>>   - ip6mr_sk_done
>>  + 23,38% synchronize_rcu
>>
>> Keep in mind, this perf report shows the time a function was
>> executing, and
>> it does not show the time, it was sleeping. But it's easy to imagine,
>> how
>> much it is sleeping, if synchronize_rcu() execution takes the most
>> time.
>> Top shows the kworker R time is less then 1%.
>>
>> This happen, because of in ip6mr_sk_done() we do too many
>> synchronize_rcu(),
>> even for the sockets, that are not referenced in mr_table, and which
>> are not
>> need it. So, the progress of kworker becomes very slow.
>>
>> The patch introduces apparent solution, and it makes ip6mr_sk_done()
>> to skip
>> synchronize_rcu() for sockets, that are not need that. After the
>> patch,
>> kworker becomes able to warm the cpu up as expected.
>>
>> Signed-off-by: Kirill Tkhai 
>> ---
>>  net/ipv6/ip6mr.c |4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
>> index 2a38f9de45d3..290a8d0d5eac 100644
>> --- a/net/ipv6/ip6mr.c
>> +++ b/net/ipv6/ip6mr.c
>> @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk)
>>  }
>>  }
>>  rtnl_unlock();
>> -synchronize_rcu();
>> +
>> +if (!err)
>> +synchronize_rcu();
>>  
> 
> 
> But... what is this synchronize_rcu() doing exactly ?
> 
> This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c
> 
> ("ip6mr: Make mroute_sk rcu-based")
> 
> Typically on a delete, the synchronize_rcu() would be needed before
> freeing the deleted object.
> 
> But nowadays we have better way : SOCK_RCU_FREE

Hm. I'm agree with you. This is hot path, and mroute sockets created from 
userspace
will delay userspace tasks close() and exit(). Since there may be many such 
sockets,
we may get a zombie task, which can't be reaped for ages. This slows down the 
system
directly.

Fix for pernet_operations works, but we need generic solution instead.

The commit "8571ab479a6e1ef46ead5ebee567e128a422767c" says:

ip6mr: Make mroute_sk rcu-based

In ipmr the mr_table socket is handled under RCU. Introduce the same
for ip6mr.

There is no pointing to improvements it invents, or to the problem it solves. 
The description
looks like a cleanup. It's too expensive cleanup, if it worsens the performance 
a hundred
times.

Can't we simply revert it?!

Yuval, do you have ideas to fix that (maybe, via SOCK_RCU_FREE suggested by 
Eric)?

We actually use rcu_dereference() in ip6mr_cache_report() only. The only user 
of dereference
is sock_queue_rcv_skb(). Superficial analysis shows we purge the queue in 
inet_sock_destruct().

Thanks,
Kirill


Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed

2018-03-06 Thread Eric Dumazet
On Tue, 2018-03-06 at 08:58 -0800, Eric Dumazet wrote:
> 
> To be clear, your patch is fine Kirill,
> 
> I am only sad seeing one can add a synchronize_rcu() in hot path
> without anyone complaining during code review.

lpaa2:~# time for i in {1..1000}; do unshare -n /bin/false;done

real7m18.911s
user0m0.185s
sys 0m2.314s

Instead of less than 10 seconds not a long time ago :/



Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed

2018-03-06 Thread Eric Dumazet
On Tue, 2018-03-06 at 08:50 -0800, Eric Dumazet wrote:
> 
> But... what is this synchronize_rcu() doing exactly ?
> 
> This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c
> 
> ("ip6mr: Make mroute_sk rcu-based")
> 
> Typically on a delete, the synchronize_rcu() would be needed before
> freeing the deleted object.
> 
> But nowadays we have better way : SOCK_RCU_FREE

To be clear, your patch is fine Kirill,

I am only sad seeing one can add a synchronize_rcu() in hot path
without anyone complaining during code review.

Reviewed-by: Eric Dumazet 




Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed

2018-03-06 Thread Eric Dumazet
On Tue, 2018-03-06 at 19:24 +0300, Kirill Tkhai wrote:
> After unshare test kworker hangs for ages:
> 
> $ while :; do unshare -n true; done &
> 
> $ perf report 
> -   88,82% 0,00%  kworker/u16:0  [kernel.vmlinux]  [k]
> cleanup_net
>  cleanup_net
>    - ops_exit_list.isra.9
>   - 85,68% igmp6_net_exit
>  - 53,31% sock_release
> - inet_release
>    - 25,33% rawv6_close
>   - ip6mr_sk_done
>  + 23,38% synchronize_rcu
> 
> Keep in mind, this perf report shows the time a function was
> executing, and
> it does not show the time, it was sleeping. But it's easy to imagine,
> how
> much it is sleeping, if synchronize_rcu() execution takes the most
> time.
> Top shows the kworker R time is less then 1%.
> 
> This happen, because of in ip6mr_sk_done() we do too many
> synchronize_rcu(),
> even for the sockets, that are not referenced in mr_table, and which
> are not
> need it. So, the progress of kworker becomes very slow.
> 
> The patch introduces apparent solution, and it makes ip6mr_sk_done()
> to skip
> synchronize_rcu() for sockets, that are not need that. After the
> patch,
> kworker becomes able to warm the cpu up as expected.
> 
> Signed-off-by: Kirill Tkhai 
> ---
>  net/ipv6/ip6mr.c |4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
> index 2a38f9de45d3..290a8d0d5eac 100644
> --- a/net/ipv6/ip6mr.c
> +++ b/net/ipv6/ip6mr.c
> @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk)
>   }
>   }
>   rtnl_unlock();
> - synchronize_rcu();
> +
> + if (!err)
> + synchronize_rcu();
>  


But... what is this synchronize_rcu() doing exactly ?

This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c

("ip6mr: Make mroute_sk rcu-based")

Typically on a delete, the synchronize_rcu() would be needed before
freeing the deleted object.

But nowadays we have better way : SOCK_RCU_FREE