Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed
On Wed, 2018-03-07 at 13:51 +, Yuval Mintz wrote: > > > > > After unshare test kworker hangs for ages: > > > > > > > > > > $ while :; do unshare -n true; done & > > > > > > > > > > $ perf report > > > > > - 88,82% 0,00% kworker/u16:0 [kernel.vmlinux] [k > > > > > ] > > > > > cleanup_net > > > > > cleanup_net > > > > > - ops_exit_list.isra.9 > > > > > - 85,68% igmp6_net_exit > > > > > - 53,31% sock_release > > > > > - inet_release > > > > > - 25,33% rawv6_close > > > > > - ip6mr_sk_done > > > > > + 23,38% synchronize_rcu > > > > > > > > > > Keep in mind, this perf report shows the time a function was > > > > > executing, and > > > > > it does not show the time, it was sleeping. But it's easy to > > > > > imagine, > > > > > how > > > > > much it is sleeping, if synchronize_rcu() execution takes the > > > > > most > > > > > time. > > > > > Top shows the kworker R time is less then 1%. > > > > > > > > > > This happen, because of in ip6mr_sk_done() we do too many > > > > > synchronize_rcu(), > > > > > even for the sockets, that are not referenced in mr_table, > > > > > and which > > > > > are not > > > > > need it. So, the progress of kworker becomes very slow. > > > > > > > > > > The patch introduces apparent solution, and it makes > > > > > ip6mr_sk_done() > > > > > to skip > > > > > synchronize_rcu() for sockets, that are not need that. After > > > > > the > > > > > patch, > > > > > kworker becomes able to warm the cpu up as expected. > > > > > > > > > > Signed-off-by: Kirill Tkhai> > > > > --- > > > > > net/ipv6/ip6mr.c |4 +++- > > > > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c > > > > > index 2a38f9de45d3..290a8d0d5eac 100644 > > > > > --- a/net/ipv6/ip6mr.c > > > > > +++ b/net/ipv6/ip6mr.c > > > > > @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk) > > > > > } > > > > > } > > > > > rtnl_unlock(); > > > > > - synchronize_rcu(); > > > > > + > > > > > + if (!err) > > > > > + synchronize_rcu(); > > > > > > > > > > > > > > > > > But... what is this synchronize_rcu() doing exactly ? > > > > > > > > This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c > > > > > > > > ("ip6mr: Make mroute_sk rcu-based") > > > > > > > > Typically on a delete, the synchronize_rcu() would be needed > > > > before > > > > freeing the deleted object. > > > > > > > > But nowadays we have better way : SOCK_RCU_FREE > > > > > > Hm. I'm agree with you. This is hot path, and mroute sockets > > > created from > > > > userspace > > > will delay userspace tasks close() and exit(). Since there may be > > > many such > > > > sockets, > > > we may get a zombie task, which can't be reaped for ages. This > > > slows down > > > > the system > > > directly. > > > > > > Fix for pernet_operations works, but we need generic solution > > > instead. > > > > > > The commit "8571ab479a6e1ef46ead5ebee567e128a422767c" says: > > > > > > ip6mr: Make mroute_sk rcu-based > > > > > > In ipmr the mr_table socket is handled under RCU. Introduce > > > the same > > > for ip6mr. > > > > > > There is no pointing to improvements it invents, or to the > > > problem it solves. > > > > The description > > > looks like a cleanup. It's too expensive cleanup, if it worsens > > > the > > > > performance a hundred > > > times. > > > > > > Can't we simply revert it?! > > > > > > Yuval, do you have ideas to fix that (maybe, via SOCK_RCU_FREE > > > suggested > > > > by Eric)? > > Sorry, failed to notice ip6mr_sk_done() is called unconditionally > from > rawv6_close(). But even with your suggested fix it should be > ~resolved > [as only sockets used for ip6mr would reach the sync] Yes but some user setups could still hit this synchronize_rcu() quite hard. New synchronize_rcu() should not be added in possibly critical paths unless absolutely needed (if really there is no other ways) We have to be very careful here, since this kind of mistake can have a 100x impact. > . > Or are you claiming there's still some performance hit even with your > suggested change? > > > > > > > We actually use rcu_dereference() in ip6mr_cache_report() only. > > > The only > > > > user of dereference > > > is sock_queue_rcv_skb(). Superficial analysis shows we purge the > > > queue in > > > > inet_sock_destruct(). > > > > + So this change should be safe. > > I might have misunderstood the comment from > commit 4c9687098f24 ("ipmr: RCU conversion of mroute_sk") when > writing > this; Thought comment regarding ip_ra_destroy() meant that for the > IPv6 case > we DO have to make sure there's a grace-period before destroying the > socket. Yeah but since 2010 things have changed a lot in the kernel ;) And we now have SOCK_RCU_FREE for the few sockets
RE: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed
> >>> After unshare test kworker hangs for ages: > >>> > >>> $ while :; do unshare -n true; done & > >>> > >>> $ perf report > >>> - 88,82% 0,00% kworker/u16:0 [kernel.vmlinux] [k] > >>> cleanup_net > >>> cleanup_net > >>> - ops_exit_list.isra.9 > >>> - 85,68% igmp6_net_exit > >>> - 53,31% sock_release > >>> - inet_release > >>> - 25,33% rawv6_close > >>> - ip6mr_sk_done > >>> + 23,38% synchronize_rcu > >>> > >>> Keep in mind, this perf report shows the time a function was > >>> executing, and > >>> it does not show the time, it was sleeping. But it's easy to imagine, > >>> how > >>> much it is sleeping, if synchronize_rcu() execution takes the most > >>> time. > >>> Top shows the kworker R time is less then 1%. > >>> > >>> This happen, because of in ip6mr_sk_done() we do too many > >>> synchronize_rcu(), > >>> even for the sockets, that are not referenced in mr_table, and which > >>> are not > >>> need it. So, the progress of kworker becomes very slow. > >>> > >>> The patch introduces apparent solution, and it makes ip6mr_sk_done() > >>> to skip > >>> synchronize_rcu() for sockets, that are not need that. After the > >>> patch, > >>> kworker becomes able to warm the cpu up as expected. > >>> > >>> Signed-off-by: Kirill Tkhai> >>> --- > >>> net/ipv6/ip6mr.c |4 +++- > >>> 1 file changed, 3 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c > >>> index 2a38f9de45d3..290a8d0d5eac 100644 > >>> --- a/net/ipv6/ip6mr.c > >>> +++ b/net/ipv6/ip6mr.c > >>> @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk) > >>> } > >>> } > >>> rtnl_unlock(); > >>> - synchronize_rcu(); > >>> + > >>> + if (!err) > >>> + synchronize_rcu(); > >>> > >> > >> > >> But... what is this synchronize_rcu() doing exactly ? > >> > >> This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c > >> > >> ("ip6mr: Make mroute_sk rcu-based") > >> > >> Typically on a delete, the synchronize_rcu() would be needed before > >> freeing the deleted object. > >> > >> But nowadays we have better way : SOCK_RCU_FREE > > > > Hm. I'm agree with you. This is hot path, and mroute sockets created from > userspace > > will delay userspace tasks close() and exit(). Since there may be many such > sockets, > > we may get a zombie task, which can't be reaped for ages. This slows down > the system > > directly. > > > > Fix for pernet_operations works, but we need generic solution instead. > > > > The commit "8571ab479a6e1ef46ead5ebee567e128a422767c" says: > > > > ip6mr: Make mroute_sk rcu-based > > > > In ipmr the mr_table socket is handled under RCU. Introduce the same > > for ip6mr. > > > > There is no pointing to improvements it invents, or to the problem it > > solves. > The description > > looks like a cleanup. It's too expensive cleanup, if it worsens the > performance a hundred > > times. > > > > Can't we simply revert it?! > > > > Yuval, do you have ideas to fix that (maybe, via SOCK_RCU_FREE suggested > by Eric)? Sorry, failed to notice ip6mr_sk_done() is called unconditionally from rawv6_close(). But even with your suggested fix it should be ~resolved [as only sockets used for ip6mr would reach the sync]. Or are you claiming there's still some performance hit even with your suggested change? > > > > We actually use rcu_dereference() in ip6mr_cache_report() only. The only > user of dereference > > is sock_queue_rcv_skb(). Superficial analysis shows we purge the queue in > inet_sock_destruct(). > > + So this change should be safe. I might have misunderstood the comment from commit 4c9687098f24 ("ipmr: RCU conversion of mroute_sk") when writing this; Thought comment regarding ip_ra_destroy() meant that for the IPv6 case we DO have to make sure there's a grace-period before destroying the socket. > > > Thanks, > > Kirill > >
Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed
On 07.03.2018 12:22, Kirill Tkhai wrote: > On 06.03.2018 19:50, Eric Dumazet wrote: >> On Tue, 2018-03-06 at 19:24 +0300, Kirill Tkhai wrote: >>> After unshare test kworker hangs for ages: >>> >>> $ while :; do unshare -n true; done & >>> >>> $ perf report >>> - 88,82% 0,00% kworker/u16:0 [kernel.vmlinux] [k] >>> cleanup_net >>> cleanup_net >>> - ops_exit_list.isra.9 >>> - 85,68% igmp6_net_exit >>> - 53,31% sock_release >>> - inet_release >>> - 25,33% rawv6_close >>> - ip6mr_sk_done >>> + 23,38% synchronize_rcu >>> >>> Keep in mind, this perf report shows the time a function was >>> executing, and >>> it does not show the time, it was sleeping. But it's easy to imagine, >>> how >>> much it is sleeping, if synchronize_rcu() execution takes the most >>> time. >>> Top shows the kworker R time is less then 1%. >>> >>> This happen, because of in ip6mr_sk_done() we do too many >>> synchronize_rcu(), >>> even for the sockets, that are not referenced in mr_table, and which >>> are not >>> need it. So, the progress of kworker becomes very slow. >>> >>> The patch introduces apparent solution, and it makes ip6mr_sk_done() >>> to skip >>> synchronize_rcu() for sockets, that are not need that. After the >>> patch, >>> kworker becomes able to warm the cpu up as expected. >>> >>> Signed-off-by: Kirill Tkhai>>> --- >>> net/ipv6/ip6mr.c |4 +++- >>> 1 file changed, 3 insertions(+), 1 deletion(-) >>> >>> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c >>> index 2a38f9de45d3..290a8d0d5eac 100644 >>> --- a/net/ipv6/ip6mr.c >>> +++ b/net/ipv6/ip6mr.c >>> @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk) >>> } >>> } >>> rtnl_unlock(); >>> - synchronize_rcu(); >>> + >>> + if (!err) >>> + synchronize_rcu(); >>> >> >> >> But... what is this synchronize_rcu() doing exactly ? >> >> This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c >> >> ("ip6mr: Make mroute_sk rcu-based") >> >> Typically on a delete, the synchronize_rcu() would be needed before >> freeing the deleted object. >> >> But nowadays we have better way : SOCK_RCU_FREE > > Hm. I'm agree with you. This is hot path, and mroute sockets created from > userspace > will delay userspace tasks close() and exit(). Since there may be many such > sockets, > we may get a zombie task, which can't be reaped for ages. This slows down the > system > directly. > > Fix for pernet_operations works, but we need generic solution instead. > > The commit "8571ab479a6e1ef46ead5ebee567e128a422767c" says: > > ip6mr: Make mroute_sk rcu-based > > In ipmr the mr_table socket is handled under RCU. Introduce the same > for ip6mr. > > There is no pointing to improvements it invents, or to the problem it solves. > The description > looks like a cleanup. It's too expensive cleanup, if it worsens the > performance a hundred > times. > > Can't we simply revert it?! > > Yuval, do you have ideas to fix that (maybe, via SOCK_RCU_FREE suggested by > Eric)? > > We actually use rcu_dereference() in ip6mr_cache_report() only. The only user > of dereference > is sock_queue_rcv_skb(). Superficial analysis shows we purge the queue in > inet_sock_destruct(). + So this change should be safe. > Thanks, > Kirill >
Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed
On 06.03.2018 19:50, Eric Dumazet wrote: > On Tue, 2018-03-06 at 19:24 +0300, Kirill Tkhai wrote: >> After unshare test kworker hangs for ages: >> >> $ while :; do unshare -n true; done & >> >> $ perf report >> - 88,82% 0,00% kworker/u16:0 [kernel.vmlinux] [k] >> cleanup_net >> cleanup_net >> - ops_exit_list.isra.9 >> - 85,68% igmp6_net_exit >> - 53,31% sock_release >> - inet_release >> - 25,33% rawv6_close >> - ip6mr_sk_done >> + 23,38% synchronize_rcu >> >> Keep in mind, this perf report shows the time a function was >> executing, and >> it does not show the time, it was sleeping. But it's easy to imagine, >> how >> much it is sleeping, if synchronize_rcu() execution takes the most >> time. >> Top shows the kworker R time is less then 1%. >> >> This happen, because of in ip6mr_sk_done() we do too many >> synchronize_rcu(), >> even for the sockets, that are not referenced in mr_table, and which >> are not >> need it. So, the progress of kworker becomes very slow. >> >> The patch introduces apparent solution, and it makes ip6mr_sk_done() >> to skip >> synchronize_rcu() for sockets, that are not need that. After the >> patch, >> kworker becomes able to warm the cpu up as expected. >> >> Signed-off-by: Kirill Tkhai>> --- >> net/ipv6/ip6mr.c |4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c >> index 2a38f9de45d3..290a8d0d5eac 100644 >> --- a/net/ipv6/ip6mr.c >> +++ b/net/ipv6/ip6mr.c >> @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk) >> } >> } >> rtnl_unlock(); >> -synchronize_rcu(); >> + >> +if (!err) >> +synchronize_rcu(); >> > > > But... what is this synchronize_rcu() doing exactly ? > > This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c > > ("ip6mr: Make mroute_sk rcu-based") > > Typically on a delete, the synchronize_rcu() would be needed before > freeing the deleted object. > > But nowadays we have better way : SOCK_RCU_FREE Hm. I'm agree with you. This is hot path, and mroute sockets created from userspace will delay userspace tasks close() and exit(). Since there may be many such sockets, we may get a zombie task, which can't be reaped for ages. This slows down the system directly. Fix for pernet_operations works, but we need generic solution instead. The commit "8571ab479a6e1ef46ead5ebee567e128a422767c" says: ip6mr: Make mroute_sk rcu-based In ipmr the mr_table socket is handled under RCU. Introduce the same for ip6mr. There is no pointing to improvements it invents, or to the problem it solves. The description looks like a cleanup. It's too expensive cleanup, if it worsens the performance a hundred times. Can't we simply revert it?! Yuval, do you have ideas to fix that (maybe, via SOCK_RCU_FREE suggested by Eric)? We actually use rcu_dereference() in ip6mr_cache_report() only. The only user of dereference is sock_queue_rcv_skb(). Superficial analysis shows we purge the queue in inet_sock_destruct(). Thanks, Kirill
Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed
On Tue, 2018-03-06 at 08:58 -0800, Eric Dumazet wrote: > > To be clear, your patch is fine Kirill, > > I am only sad seeing one can add a synchronize_rcu() in hot path > without anyone complaining during code review. lpaa2:~# time for i in {1..1000}; do unshare -n /bin/false;done real7m18.911s user0m0.185s sys 0m2.314s Instead of less than 10 seconds not a long time ago :/
Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed
On Tue, 2018-03-06 at 08:50 -0800, Eric Dumazet wrote: > > But... what is this synchronize_rcu() doing exactly ? > > This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c > > ("ip6mr: Make mroute_sk rcu-based") > > Typically on a delete, the synchronize_rcu() would be needed before > freeing the deleted object. > > But nowadays we have better way : SOCK_RCU_FREE To be clear, your patch is fine Kirill, I am only sad seeing one can add a synchronize_rcu() in hot path without anyone complaining during code review. Reviewed-by: Eric Dumazet
Re: [PATCH RESEND net-next] net: Do synchronize_rcu() in ip6mr_sk_done() only if this is needed
On Tue, 2018-03-06 at 19:24 +0300, Kirill Tkhai wrote: > After unshare test kworker hangs for ages: > > $ while :; do unshare -n true; done & > > $ perf report > - 88,82% 0,00% kworker/u16:0 [kernel.vmlinux] [k] > cleanup_net > cleanup_net > - ops_exit_list.isra.9 > - 85,68% igmp6_net_exit > - 53,31% sock_release > - inet_release > - 25,33% rawv6_close > - ip6mr_sk_done > + 23,38% synchronize_rcu > > Keep in mind, this perf report shows the time a function was > executing, and > it does not show the time, it was sleeping. But it's easy to imagine, > how > much it is sleeping, if synchronize_rcu() execution takes the most > time. > Top shows the kworker R time is less then 1%. > > This happen, because of in ip6mr_sk_done() we do too many > synchronize_rcu(), > even for the sockets, that are not referenced in mr_table, and which > are not > need it. So, the progress of kworker becomes very slow. > > The patch introduces apparent solution, and it makes ip6mr_sk_done() > to skip > synchronize_rcu() for sockets, that are not need that. After the > patch, > kworker becomes able to warm the cpu up as expected. > > Signed-off-by: Kirill Tkhai> --- > net/ipv6/ip6mr.c |4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c > index 2a38f9de45d3..290a8d0d5eac 100644 > --- a/net/ipv6/ip6mr.c > +++ b/net/ipv6/ip6mr.c > @@ -1485,7 +1485,9 @@ int ip6mr_sk_done(struct sock *sk) > } > } > rtnl_unlock(); > - synchronize_rcu(); > + > + if (!err) > + synchronize_rcu(); > But... what is this synchronize_rcu() doing exactly ? This was added in 8571ab479a6e1ef46ead5ebee567e128a422767c ("ip6mr: Make mroute_sk rcu-based") Typically on a delete, the synchronize_rcu() would be needed before freeing the deleted object. But nowadays we have better way : SOCK_RCU_FREE