Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching

2019-10-08 Thread Uladzislau Rezki
On Fri, Oct 04, 2019 at 01:20:38PM -0400, Joel Fernandes wrote:
> On Tue, Oct 01, 2019 at 01:27:02PM +0200, Uladzislau Rezki wrote:
> [snip] 
> > > > I have just a small question related to workloads and performance
> > > > evaluation. Are you aware of any specific workloads which benefit from
> > > > it, for example in the mobile area, etc.? I am asking because I am
> > > > thinking about backporting it and reusing it in our kernel.
> > > 
> > > I am not aware of a mobile usecase that benefits, but there are server
> > > workloads where it makes the system more stable in the face of a
> > > kfree_rcu() flood.
> > > 
> > OK, I got it. I wanted to test it to find out how it could affect mobile
> > workloads.
> > 
> > >
> > > For the KVA allocator work, I see it is quite similar to the way binder
> > > allocates blocks. See function: binder_alloc_new_buf_locked(). Is there
> > > any chance to reuse any code? For one thing, binder also has an rbtree
> > > for fast lookup of allocated blocks. Does the KVA allocator not have
> > > the need for that?
> > >
> > Well, there is a difference. Actually, the free blocks are not sorted by
> > their size as in the binder layer, if I understand the code correctly.
> > 
> > Instead, I keep the free blocks sorted by start address in ascending
> > order and maintain an augmented value (the biggest free size in the left
> > or right sub-tree) for each node. That allows navigating toward the
> > lowest address and a block that definitely suits. As a result, our
> > allocations become sequential, which is important.
> 
> Right, I realized after sending the email that binder and KVA sort
> differently, though they both try to use free sizes during allocation.
> 
> Would you have any papers that survey various rb-tree-based allocator
> algorithms and their tradeoffs? I am interested in studying these more,
> especially in relation to the binder driver. It would also be nice to
> contribute to papers surveying both of these allocators to describe the
> state of the art.
> 
So far I do not have any paper with that kind of comparison. But that is
interesting for sure, especially analyzing a model based on, for example,
a B-Tree, where we can fully utilize cache performance, because regular
binary trees are just pointer chasing.

As for the binder driver and its allocator, is it O(log N) complexity? Is
there any bottleneck in its implementation?

Thanks!

--
Vlad Rezki


Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching

2019-10-04 Thread Joel Fernandes
On Tue, Oct 01, 2019 at 01:27:02PM +0200, Uladzislau Rezki wrote:
[snip] 
> > > I have just a small question related to workloads and performance
> > > evaluation. Are you aware of any specific workloads which benefit from
> > > it, for example in the mobile area, etc.? I am asking because I am
> > > thinking about backporting it and reusing it in our kernel.
> > 
> > I am not aware of a mobile usecase that benefits, but there are server
> > workloads where it makes the system more stable in the face of a
> > kfree_rcu() flood.
> > 
> OK, I got it. I wanted to test it to find out how it could affect mobile
> workloads.
> 
> >
> > For the KVA allocator work, I see it is quite similar to the way binder
> > allocates blocks. See function: binder_alloc_new_buf_locked(). Is there
> > any chance to reuse any code? For one thing, binder also has an rbtree
> > for fast lookup of allocated blocks. Does the KVA allocator not have
> > the need for that?
> >
> Well, there is a difference. Actually, the free blocks are not sorted by
> their size as in the binder layer, if I understand the code correctly.
> 
> Instead, I keep the free blocks sorted by start address in ascending
> order and maintain an augmented value (the biggest free size in the left
> or right sub-tree) for each node. That allows navigating toward the
> lowest address and a block that definitely suits. As a result, our
> allocations become sequential, which is important.

Right, I realized after sending the email that binder and KVA sort
differently, though they both try to use free sizes during allocation.

Would you have any papers that survey various rb-tree-based allocator
algorithms and their tradeoffs? I am interested in studying these more,
especially in relation to the binder driver. It would also be nice to
contribute to papers surveying both of these allocators to describe the
state of the art.

thanks,

 - Joel


> >
> > And, nice LPC presentation! I was there ;)
> > 
> Thanks :)
> 
> --
> Vlad Rezki


Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching

2019-10-01 Thread Uladzislau Rezki
> > Hello, Joel.
> > 
> > First of all, thank you for improving it. I also noticed high pressure
> > on the RCU machinery when performing some vmalloc tests while a
> > kfree_rcu() flood occurred. Therefore I got rid of using kfree_rcu()
> > there.
> 
> Replying a bit late due to overseas conference travel and vacation.
> 
> When you say 'high pressure', do you mean memory pressure or just system
> load?
> 
>
> Memory pressure slightly increases with the kfree_rcu() rework with the
> benefit of much fewer grace periods.
> 
I meant system load, because of the high number of cycles in the kfree_rcu()
symbol under stress. But I do not have the numbers next to me, because it
was quite a long time ago. As for memory usage, I understand that.

> > I have just a small question related to workloads and performance
> > evaluation. Are you aware of any specific workloads which benefit from
> > it, for example in the mobile area, etc.? I am asking because I am
> > thinking about backporting it and reusing it in our kernel.
> 
> I am not aware of a mobile usecase that benefits, but there are server
> workloads where it makes the system more stable in the face of a
> kfree_rcu() flood.
> 
OK, I got it. I wanted to test it to find out how it could affect mobile
workloads.

>
> For the KVA allocator work, I see it is quite similar to the way binder
> allocates blocks. See function: binder_alloc_new_buf_locked(). Is there
> any chance to reuse any code? For one thing, binder also has an rbtree
> for fast lookup of allocated blocks. Does the KVA allocator not have the
> need for that?
>
Well, there is a difference. Actually, the free blocks are not sorted by
their size as in the binder layer, if I understand the code correctly.

Instead, I keep the free blocks sorted by start address in ascending order
and maintain an augmented value (the biggest free size in the left or right
sub-tree) for each node. That allows navigating toward the lowest address
and a block that definitely suits. As a result, our allocations become
sequential, which is important.

>
> And, nice LPC presentation! I was there ;)
> 
Thanks :)

--
Vlad Rezki


Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching

2019-09-30 Thread Joel Fernandes
On Wed, Sep 18, 2019 at 11:58:11AM +0200, Uladzislau Rezki wrote:
> > Recently, a discussion about the stability and performance of a system
> > involving a high rate of kfree_rcu() calls surfaced on the list [1],
> > which led to another discussion about how to prepare for this situation.
> > 
> > This patch adds basic batching support for kfree_rcu(). It is "basic"
> > because we do none of the slab management, dynamic allocation, or code
> > moving that some previous attempts did [2]. These fancier improvements
> > can be follow-up patches, and there are different ideas being discussed
> > in those regards. This is an effort to start simple and build up from
> > there. In the future, an extension to use kfree_bulk and possibly
> > per-slab batching could be done to further improve performance, due to
> > cache locality and slab-specific bulk-free optimizations. By using an
> > array of pointers, the worker thread processing the work would need to
> > read less data, since it no longer needs to deal with large rcu_head(s).
> > 
> > Torture tests follow in the next patch and show an approximately 5x
> > reduction in the number of grace periods on a 16-CPU system. More
> > details and test data are in that patch.
> > 
> > There is an implication for rcu_barrier() with this patch. Since the
> > kfree_rcu() calls can be batched and may not yet have been handed to the
> > RCU machinery (in fact, the monitor may not even have run yet to do the
> > queue_rcu_work()), there seems to be no easy way of implementing
> > rcu_barrier() to wait for the kfree_rcu()s that have already been made.
> > So this means a kfree_rcu() followed by an rcu_barrier() does not imply
> > that memory will be freed once rcu_barrier() returns.
> > 
> > Another implication is higher active memory usage (although not runaway
> > usage) until the kfree_rcu() flooding ends, compared to the non-batched
> > case. More details about this are in the second patch, which adds an
> > rcuperf test.
> > 
> > Finally, in the near future we will get rid of the kfree_rcu() special
> > casing within RCU, such as in rcu_do_batch, and switch everything to
> > just batching. Currently we don't do that since the timer subsystem is
> > not yet up and we cannot schedule the kfree_rcu() monitor, as the timer
> > subsystem's locks are not initialized. That would also mean getting rid
> > of kfree_call_rcu_nobatch() entirely.
> > 
> Hello, Joel.
> 
> First of all, thank you for improving it. I also noticed high pressure
> on the RCU machinery when performing some vmalloc tests while a
> kfree_rcu() flood occurred. Therefore I got rid of using kfree_rcu()
> there.

Replying a bit late due to overseas conference travel and vacation.

When you say 'high pressure', do you mean memory pressure or just system
load?

Memory pressure slightly increases with the kfree_rcu() rework, with the
benefit of far fewer grace periods.
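The tradeoff mentioned above (a bit more memory held, far fewer grace
periods) falls out of the batching itself, and can be mimicked with a small
user-space mock; all names here are invented for illustration and none of
this is the kernel code:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Mock of kfree_rcu() batching: rather than one RCU callback (and
 * thus grace-period work) per request, pointers are parked in an
 * array and a single deferred flush frees them all after one
 * simulated grace period.  Illustrative names only.
 */
#define BATCH_CAP 1024

static void *batch[BATCH_CAP];
static int batch_len;
static int grace_periods;	/* grace periods the flood cost us */

static void mock_kfree_rcu(void *p)
{
	/* Just queue the pointer; no grace period starts per call. */
	if (batch_len < BATCH_CAP)
		batch[batch_len++] = p;
}

/* The "monitor" work: one grace period covers the whole batch. */
static void mock_monitor_flush(void)
{
	int i;

	if (!batch_len)
		return;
	grace_periods++;
	for (i = 0; i < batch_len; i++)
		free(batch[i]);
	batch_len = 0;
}
```

Flooding the mock with N requests and then flushing once costs a single
simulated grace period instead of one per request, at the price of holding
all N objects until the flush.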

> I have just a small question related to workloads and performance
> evaluation. Are you aware of any specific workloads which benefit from
> it, for example in the mobile area, etc.? I am asking because I am
> thinking about backporting it and reusing it in our kernel.

I am not aware of a mobile usecase that benefits, but there are server
workloads where it makes the system more stable in the face of a
kfree_rcu() flood.

For the KVA allocator work, I see it is quite similar to the way binder
allocates blocks. See function: binder_alloc_new_buf_locked(). Is there
any chance to reuse any code? For one thing, binder also has an rbtree
for fast lookup of allocated blocks. Does the KVA allocator not have the
need for that?

And, nice LPC presentation! I was there ;)

thanks,

 - Joel



Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching

2019-09-18 Thread Uladzislau Rezki
> Recently, a discussion about the stability and performance of a system
> involving a high rate of kfree_rcu() calls surfaced on the list [1],
> which led to another discussion about how to prepare for this situation.
> 
> This patch adds basic batching support for kfree_rcu(). It is "basic"
> because we do none of the slab management, dynamic allocation, or code
> moving that some previous attempts did [2]. These fancier improvements
> can be follow-up patches, and there are different ideas being discussed
> in those regards. This is an effort to start simple and build up from
> there. In the future, an extension to use kfree_bulk and possibly
> per-slab batching could be done to further improve performance, due to
> cache locality and slab-specific bulk-free optimizations. By using an
> array of pointers, the worker thread processing the work would need to
> read less data, since it no longer needs to deal with large rcu_head(s).
> 
> Torture tests follow in the next patch and show an approximately 5x
> reduction in the number of grace periods on a 16-CPU system. More
> details and test data are in that patch.
> 
> There is an implication for rcu_barrier() with this patch. Since the
> kfree_rcu() calls can be batched and may not yet have been handed to the
> RCU machinery (in fact, the monitor may not even have run yet to do the
> queue_rcu_work()), there seems to be no easy way of implementing
> rcu_barrier() to wait for the kfree_rcu()s that have already been made.
> So this means a kfree_rcu() followed by an rcu_barrier() does not imply
> that memory will be freed once rcu_barrier() returns.
> 
> Another implication is higher active memory usage (although not runaway
> usage) until the kfree_rcu() flooding ends, compared to the non-batched
> case. More details about this are in the second patch, which adds an
> rcuperf test.
> 
> Finally, in the near future we will get rid of the kfree_rcu() special
> casing within RCU, such as in rcu_do_batch, and switch everything to
> just batching. Currently we don't do that since the timer subsystem is
> not yet up and we cannot schedule the kfree_rcu() monitor, as the timer
> subsystem's locks are not initialized. That would also mean getting rid
> of kfree_call_rcu_nobatch() entirely.
> 
Hello, Joel.

First of all, thank you for improving it. I also noticed high pressure on
the RCU machinery when performing some vmalloc tests while a kfree_rcu()
flood occurred. Therefore I got rid of using kfree_rcu() there.

I have just a small question related to workloads and performance
evaluation. Are you aware of any specific workloads which benefit from it,
for example in the mobile area, etc.? I am asking because I am thinking
about backporting it and reusing it in our kernel.

Thank you!

--
Vlad Rezki


Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching

2019-08-17 Thread Paul E. McKenney
On Sat, Aug 17, 2019 at 01:53:29AM -0400, Joel Fernandes wrote:
> On Fri, Aug 16, 2019 at 10:20:23PM -0700, Paul E. McKenney wrote:
> > On Sat, Aug 17, 2019 at 12:30:24AM -0400, Joel Fernandes wrote:
> > > On Fri, Aug 16, 2019 at 08:56:37PM -0700, Paul E. McKenney wrote:
> > > > On Fri, Aug 16, 2019 at 09:32:23PM -0400, Joel Fernandes wrote:
> > > > > Hi Paul,
> > > > > 
> > > > > On Fri, Aug 16, 2019 at 3:16 PM Paul E. McKenney 
> > > > >  wrote:
> > > > > > > > Hello, Joel,
> > > > > > > >
> > > > > > > > I reworked the commit log as follows, but was then
> > > > > > > > unsuccessful in working out which -rcu commit to apply it
> > > > > > > > to.  Could you please tell me what commit to apply this to?
> > > > > > > > (Once applied, git cherry-pick is usually pretty good about
> > > > > > > > handling minor conflicts.)
> > > > > > >
> > > > > > > It was originally based on v5.3-rc2
> > > > > > >
> > > > > > > I was able to apply it just now to the rcu -dev branch and I 
> > > > > > > pushed it here:
> > > > > > > https://github.com/joelagnel/linux-kernel.git (branch paul-dev)
> > > > > > >
> > > > > > > Let me know if any other issues, thanks for the change log rework!
> > > > > >
> > > > > > Pulled and cherry-picked, thank you!
> > > > > >
> > > > > > Just for grins, I also pushed out a from-joel.2019.08.16a
> > > > > > showing the results of the pull.  If you pull that branch, then
> > > > > > run something like "gitk v5.3-rc2..", and then do the same with
> > > > > > branch "dev", comparing the two might illustrate some of the
> > > > > > reasons for the current restrictions on pull requests and trees
> > > > > > subject to rebase.
> > > > > 
> > > > > Right, I did the compare and see what you mean. I guess sending any
> > > > > future pull requests against Linux -next would be the best option?
> > > > 
> > > > Hmmm...  You really want to send some pull requests, don't you?  ;-)
> > > 
> > > I would be lying if I said I don't have the itch to ;-)
> > > 
> > > > Suppose you had sent that pull request against Linux -next or v5.2
> > > > or wherever.  What would happen next, given the high probability of a
> > > > conflict with someone else's patch?  What would the result look like?
> > > 
> > > One hopes that the tools are able to automatically resolve the
> > > conflict; however, adequate re-inspection of the resulting code and
> > > testing would be needed in either case, to ensure the conflict
> > > resolution (whether manual or automatic) happened correctly.
> > 
> > I didn't ask you to hope.  I instead asked you to tell me what would
> > actually happen.  ;-)
> > 
> > You could actually try this by randomly grouping the patches in -rcu
> > (say, placing every third patch into one of three groups), generating
> > separate pull requests, and then merging the pull requests together.
> > Then you wouldn't have to hope.  You could instead look at it in (say)
> > gitk after the pieces were put together.
> 
> So you take whatever is worked on in 'dev' and create separate branches out
> of them, then merge them together later?
> 
> I have seen you doing these tricks and would love to get ideas from your
> experiences on these.

If the release dates line up, perhaps I can demo it for v5.4 at LPC.

> > > IIUC, this usually depends on the maintainer's preference on which
> > > branch to send patches against.
> > > 
> > > Are you saying -rcu's dev branch is still the best option to send
> > > patches against, even though it is rebased often?
> > 
> > Sounds like we might need to discuss this face to face.
> 
> Yes, let us talk for sure at plumbers, thank you so much!
> 
> (Also I sent a patch just now to fix that xchg() issue).

Yes, I just now squashed it in, thank you!

Thanx, Paul


Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching

2019-08-16 Thread Joel Fernandes
On Fri, Aug 16, 2019 at 10:20:23PM -0700, Paul E. McKenney wrote:
> On Sat, Aug 17, 2019 at 12:30:24AM -0400, Joel Fernandes wrote:
> > On Fri, Aug 16, 2019 at 08:56:37PM -0700, Paul E. McKenney wrote:
> > > On Fri, Aug 16, 2019 at 09:32:23PM -0400, Joel Fernandes wrote:
> > > > Hi Paul,
> > > > 
> > > > On Fri, Aug 16, 2019 at 3:16 PM Paul E. McKenney 
> > > >  wrote:
> > > > > > > Hello, Joel,
> > > > > > >
> > > > > I reworked the commit log as follows, but was then unsuccessful in
> > > > > working out which -rcu commit to apply it to.  Could you please
> > > > > tell me what commit to apply this to?  (Once applied, git
> > > > > cherry-pick is usually pretty good about handling minor conflicts.)
> > > > > >
> > > > > > It was originally based on v5.3-rc2
> > > > > >
> > > > > > I was able to apply it just now to the rcu -dev branch and I pushed 
> > > > > > it here:
> > > > > > https://github.com/joelagnel/linux-kernel.git (branch paul-dev)
> > > > > >
> > > > > > Let me know if any other issues, thanks for the change log rework!
> > > > >
> > > > > Pulled and cherry-picked, thank you!
> > > > >
> > > > > Just for grins, I also pushed out a from-joel.2019.08.16a showing
> > > > > the results of the pull.  If you pull that branch, then run
> > > > > something like "gitk v5.3-rc2..", and then do the same with branch
> > > > > "dev", comparing the two might illustrate some of the reasons for
> > > > > the current restrictions on pull requests and trees subject to
> > > > > rebase.
> > > > 
> > > > Right, I did the compare and see what you mean. I guess sending any
> > > > future pull requests against Linux -next would be the best option?
> > > 
> > > Hmmm...  You really want to send some pull requests, don't you?  ;-)
> > 
> > I would be lying if I said I don't have the itch to ;-)
> > 
> > > Suppose you had sent that pull request against Linux -next or v5.2
> > > or wherever.  What would happen next, given the high probability of a
> > > conflict with someone else's patch?  What would the result look like?
> > 
> > One hopes that the tools are able to automatically resolve the
> > conflict; however, adequate re-inspection of the resulting code and
> > testing would be needed in either case, to ensure the conflict
> > resolution (whether manual or automatic) happened correctly.
> 
> I didn't ask you to hope.  I instead asked you to tell me what would
> actually happen.  ;-)
> 
> You could actually try this by randomly grouping the patches in -rcu
> (say, placing every third patch into one of three groups), generating
> separate pull requests, and then merging the pull requests together.
> Then you wouldn't have to hope.  You could instead look at it in (say)
> gitk after the pieces were put together.

So you take whatever is worked on in 'dev' and create separate branches out
of them, then merge them together later?

I have seen you doing these tricks and would love to get ideas from your
experiences on these.

> > IIUC, this usually depends on the maintainer's preference on which branch to
> > send patches against.
> > 
> > Are you saying -rcu's dev branch is still the best option to send patches
> > against, even though it is rebased often?
> 
> Sounds like we might need to discuss this face to face.

Yes, let us talk for sure at plumbers, thank you so much!

(Also I sent a patch just now to fix that xchg() issue).

 - Joel



Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching

2019-08-16 Thread Paul E. McKenney
On Sat, Aug 17, 2019 at 12:30:24AM -0400, Joel Fernandes wrote:
> On Fri, Aug 16, 2019 at 08:56:37PM -0700, Paul E. McKenney wrote:
> > On Fri, Aug 16, 2019 at 09:32:23PM -0400, Joel Fernandes wrote:
> > > Hi Paul,
> > > 
> > > On Fri, Aug 16, 2019 at 3:16 PM Paul E. McKenney  
> > > wrote:
> > > > > > Hello, Joel,
> > > > > >
> > > > > > I reworked the commit log as follows, but was then unsuccessful in
> > > > > > working out which -rcu commit to apply it to.  Could you please
> > > > > > tell me what commit to apply this to?  (Once applied, git
> > > > > > cherry-pick is usually pretty good about handling minor conflicts.)
> > > > >
> > > > > It was originally based on v5.3-rc2
> > > > >
> > > > > I was able to apply it just now to the rcu -dev branch and I pushed 
> > > > > it here:
> > > > > https://github.com/joelagnel/linux-kernel.git (branch paul-dev)
> > > > >
> > > > > Let me know if any other issues, thanks for the change log rework!
> > > >
> > > > Pulled and cherry-picked, thank you!
> > > >
> > > > Just for grins, I also pushed out a from-joel.2019.08.16a showing the
> > > > results of the pull.  If you pull that branch, then run something like
> > > > "gitk v5.3-rc2..", and then do the same with branch "dev", comparing the
> > > > two might illustrate some of the reasons for the current restrictions
> > > > on pull requests and trees subject to rebase.
> > > 
> > > Right, I did the compare and see what you mean. I guess sending any
> > > future pull requests against Linux -next would be the best option?
> > 
> > Hmmm...  You really want to send some pull requests, don't you?  ;-)
> 
> I would be lying if I said I don't have the itch to ;-)
> 
> > Suppose you had sent that pull request against Linux -next or v5.2
> > or wherever.  What would happen next, given the high probability of a
> > conflict with someone else's patch?  What would the result look like?
> 
> One hopes that the tools are able to automatically resolve the conflict;
> however, adequate re-inspection of the resulting code and testing would
> be needed in either case, to ensure the conflict resolution (whether
> manual or automatic) happened correctly.

I didn't ask you to hope.  I instead asked you to tell me what would
actually happen.  ;-)

You could actually try this by randomly grouping the patches in -rcu
(say, placing every third patch into one of three groups), generating
separate pull requests, and then merging the pull requests together.
Then you wouldn't have to hope.  You could instead look at it in (say)
gitk after the pieces were put together.

And there are more questions.  For example, how would this affect testing
given issues involving both RCU and other pieces of the kernel?
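The suggested experiment can be sketched as a throwaway script; the
repository contents, branch names, and patch grouping below are all
invented for illustration and do not reflect the actual -rcu tree:

```shell
#!/bin/sh
# Sketch: take a linear "dev" series, place every third patch into one
# of three topic branches, then merge the three resulting "pull
# requests" together and inspect the merged history.
set -e

dir=$(mktemp -d)
cd "$dir"
git init -q
git config user.email demo@example.com
git config user.name demo
git checkout -q -b dev
git commit -q --allow-empty -m base
git tag base

# A stand-in for a development branch: six independent one-file patches.
for i in 1 2 3 4 5 6; do
    echo "change $i" > "file$i"
    git add "file$i"
    git commit -q -m "patch $i"
done

# Group the patches round-robin into three topic branches off "base".
for g in 0 1 2; do
    git checkout -q -b "group$g" base
    i=0
    for c in $(git rev-list --reverse base..dev); do
        if [ $((i % 3)) -eq "$g" ]; then
            git cherry-pick "$c" >/dev/null
        fi
        i=$((i + 1))
    done
done

# Merge the three branches together, as three pull requests would be.
git checkout -q -b integration group0
git merge -q -m "merge the three groups" group1 group2

git log --graph --oneline integration
```

After the merges, gitk (or git log --graph) on the integration branch
shows the stitched-together history such a split would create.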

> IIUC, this usually depends on the maintainer's preference on which branch to
> send patches against.
> 
> Are you saying -rcu's dev branch is still the best option to send patches
> against, even though it is rebased often?

Sounds like we might need to discuss this face to face.

Thanx, Paul


Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching

2019-08-16 Thread Joel Fernandes
On Fri, Aug 16, 2019 at 08:56:37PM -0700, Paul E. McKenney wrote:
> On Fri, Aug 16, 2019 at 09:32:23PM -0400, Joel Fernandes wrote:
> > Hi Paul,
> > 
> > On Fri, Aug 16, 2019 at 3:16 PM Paul E. McKenney  
> > wrote:
> > > > > Hello, Joel,
> > > > >
> > > > > I reworked the commit log as follows, but was then unsuccessful in
> > > > > working out which -rcu commit to apply it to.  Could you please
> > > > > tell me what commit to apply this to?  (Once applied, git cherry-pick
> > > > > is usually pretty good about handling minor conflicts.)
> > > >
> > > > It was originally based on v5.3-rc2
> > > >
> > > > I was able to apply it just now to the rcu -dev branch and I pushed it 
> > > > here:
> > > > https://github.com/joelagnel/linux-kernel.git (branch paul-dev)
> > > >
> > > > Let me know if any other issues, thanks for the change log rework!
> > >
> > > Pulled and cherry-picked, thank you!
> > >
> > > Just for grins, I also pushed out a from-joel.2019.08.16a showing the
> > > results of the pull.  If you pull that branch, then run something like
> > > "gitk v5.3-rc2..", and then do the same with branch "dev", comparing the
> > > two might illustrate some of the reasons for the current restrictions
> > > on pull requests and trees subject to rebase.
> > 
> > Right, I did the compare and see what you mean. I guess sending any
> > future pull requests against Linux -next would be the best option?
> 
> Hmmm...  You really want to send some pull requests, don't you?  ;-)

I would be lying if I said I don't have the itch to ;-)

> Suppose you had sent that pull request against Linux -next or v5.2
> or wherever.  What would happen next, given the high probability of a
> conflict with someone else's patch?  What would the result look like?

One hopes that the tools are able to automatically resolve the conflict;
however, adequate re-inspection of the resulting code and testing would be
needed in either case, to ensure the conflict resolution (whether manual
or automatic) happened correctly.

IIUC, this usually depends on the maintainer's preference on which branch to
send patches against.

Are you saying -rcu's dev branch is still the best option to send patches
against, even though it is rebased often?

thanks,

 - Joel



Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching

2019-08-16 Thread Paul E. McKenney
On Fri, Aug 16, 2019 at 09:32:23PM -0400, Joel Fernandes wrote:
> Hi Paul,
> 
> On Fri, Aug 16, 2019 at 3:16 PM Paul E. McKenney  
> wrote:
> > > > Hello, Joel,
> > > >
> > > > I reworked the commit log as follows, but was then unsuccessful in
> > > > working out which -rcu commit to apply it to.  Could you please
> > > > tell me what commit to apply this to?  (Once applied, git cherry-pick
> > > > is usually pretty good about handling minor conflicts.)
> > >
> > > It was originally based on v5.3-rc2
> > >
> > > I was able to apply it just now to the rcu -dev branch and I pushed it 
> > > here:
> > > https://github.com/joelagnel/linux-kernel.git (branch paul-dev)
> > >
> > > Let me know if any other issues, thanks for the change log rework!
> >
> > Pulled and cherry-picked, thank you!
> >
> > Just for grins, I also pushed out a from-joel.2019.08.16a showing the
> > results of the pull.  If you pull that branch, then run something like
> > "gitk v5.3-rc2..", and then do the same with branch "dev", comparing the
> > two might illustrate some of the reasons for the current restrictions
> > on pull requests and trees subject to rebase.
> 
> Right, I did the compare and see what you mean. I guess sending any
> future pull requests against Linux -next would be the best option?

Hmmm...  You really want to send some pull requests, don't you?  ;-)

Suppose you had sent that pull request against Linux -next or v5.2
or wherever.  What would happen next, given the high probability of a
conflict with someone else's patch?  What would the result look like?

Thanx, Paul



Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching

2019-08-16 Thread Joel Fernandes
Hi Paul,

On Fri, Aug 16, 2019 at 3:16 PM Paul E. McKenney  wrote:
> > > Hello, Joel,
> > >
> > > I reworked the commit log as follows, but was then unsuccessful in
> > > working out which -rcu commit to apply it to.  Could you please
> > > tell me what commit to apply this to?  (Once applied, git cherry-pick
> > > is usually pretty good about handling minor conflicts.)
> >
> > It was originally based on v5.3-rc2
> >
> > I was able to apply it just now to the rcu -dev branch and I pushed it here:
> > https://github.com/joelagnel/linux-kernel.git (branch paul-dev)
> >
> > Let me know if any other issues, thanks for the change log rework!
>
> Pulled and cherry-picked, thank you!
>
> Just for grins, I also pushed out a from-joel.2019.08.16a showing the
> results of the pull.  If you pull that branch, then run something like
> "gitk v5.3-rc2..", and then do the same with branch "dev", comparing the
> two might illustrate some of the reasons for the current restrictions
> on pull requests and trees subject to rebase.

Right, I did the compare and see what you mean. I guess sending any
future pull requests against Linux -next would be the best option?

thanks,

 - Joel


Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching

2019-08-16 Thread Paul E. McKenney
On Fri, Aug 16, 2019 at 01:44:29PM -0400, Joel Fernandes wrote:
> On Fri, Aug 16, 2019 at 09:43:30AM -0700, Paul E. McKenney wrote:
> > Hello, Joel,
> > 
> > I reworked the commit log as follows, but was then unsuccessful in
> > working out which -rcu commit to apply it to.  Could you please
> > tell me what commit to apply this to?  (Once applied, git cherry-pick
> > is usually pretty good about handling minor conflicts.)
> 
> It was originally based on v5.3-rc2
> 
> I was able to apply it just now to the rcu -dev branch and I pushed it here:
> https://github.com/joelagnel/linux-kernel.git (branch paul-dev)
> 
> Let me know if any other issues, thanks for the change log rework!

Pulled and cherry-picked, thank you!

Just for grins, I also pushed out a from-joel.2019.08.16a showing the
results of the pull.  If you pull that branch, then run something like
"gitk v5.3-rc2..", and then do the same with branch "dev", comparing the
two might illustrate some of the reasons for the current restrictions
on pull requests and trees subject to rebase.

Thanx, Paul

> thanks,
> 
>  - Joel
> 
> > Thanx, Paul
> > 
> > On Wed, Aug 14, 2019 at 12:04:10PM -0400, Joel Fernandes (Google) wrote:
> > > A recent discussion about the stability and performance of systems
> > > with kfree_rcu() flooding [1] led to another discussion about how to
> > > better handle this situation.
> > > 
> > > This commit starts simple, adding only basic batching support for
> > > kfree_rcu(). It is "basic" because it does none of the slab management,
> > > dynamic allocation, or code movement carried out by a previous attempt
> > > [2].  These additional improvements can be implemented later as agreement
> > > is reached on these other issues.  For example, future work might
> > > increase cache locality by applying vector object lists, kfree_bulk(),
> > > or per-slab batching to further improve handling of kfree_rcu() floods.
> > > 
> > > Performance tests are provided in a later commit.  These tests show a
> > > 5x reduction in the number of grace periods on a 16-CPU system, with
> > > minimal increase in kfree() latency.
> > > 
> > > Note that this commit prevents rcu_barrier() from waiting for the
> > > execution of the kfree() calls associated with prior kfree_rcu()
> > > invocations.  This should not be a problem, given that the resulting
> > > pending kfree() calls do not prevent module unloading or filesystem
> > > unmounting.  The reason rcu_barrier() no longer waits for the kfree()
> > > calls is that the kfree_rcu() requests are now batched, so that at
> > > any given time there might be kfree_rcu() requests that are not yet
> > > known to the core RCU machinery.  Furthermore, once a kfree_rcu()
> > > grace period has elapsed, the actual kfree() invocations happen in
> > > workqueue context.  So rcu_barrier() no longer waits for all of the
> > > prior requests, nor does it wait for the workqueue handlers to
> > > start, let alone complete.  If there is ever a good reason to wait for
> > > the kfree() invocation corresponding to all prior kfree_rcu() calls,
> > > an appropriate kfree_rcu_barrier() can be constructed.  However, at
> > > the moment no reasonable use case is apparent.
> > > 
> > > This commit can result in increased memory footprint because the
> > > batching can increase the kfree_rcu()-to-kfree() latency.  Later
> > > commits will reduce this memory footprint.
> > > 
> > > Later commits will also remove the special handling of kfree_rcu() by
> > > __rcu_reclaim() within the RCU core.  This will require changes to
> > > rcuperf testing and to early boot handling of kfree_rcu().
> > > 
> > > [1] http://lore.kernel.org/lkml/20190723035725-mutt-send-email-...@kernel.org
> > > [2] https://lkml.org/lkml/2017/12/19/824
> > > 
> > > Cc: kernel-t...@android.com
> > > Cc: kernel-t...@lge.com
> > > Co-developed-by: Byungchul Park 
> > > Signed-off-by: Byungchul Park 
> > > Signed-off-by: Joel Fernandes (Google) 
> > > 
> > > ---
> > > v3->v4: Some corrections by Paul.
> > >   Used xchg in places to simplify code.
> > > 
> > > v2->v3: Just some code comment changes thanks to Byungchul.
> > > 
> > > RFCv1->PATCH v2: Removed limits on the ->head list, just let it grow.
> > >Dropped KFREE_MAX_JIFFIES to HZ/50 from HZ/20 to 
> > > reduce OOM occurrence.
> > >Removed sleeps in rcuperf test, just using 
> > > cond_resched() in loop.
> > >Better code comments ;)
> > > 
> > >  include/linux/rcutiny.h |   5 ++
> > >  include/linux/rcutree.h |   1 +
> > >  kernel/rcu/tree.c   | 194 ++--
> > >  3 files changed, 194 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> > > index 8e727f57d814..383f2481750f 100644
> > > --- a/include/linux/rcutiny.h
> > > +++ b/include/linux/rcutiny.h
> > > 

Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching

2019-08-16 Thread Joel Fernandes
On Fri, Aug 16, 2019 at 09:43:30AM -0700, Paul E. McKenney wrote:
> Hello, Joel,
> 
> I reworked the commit log as follows, but was then unsuccessful in
> working out which -rcu commit to apply it to.  Could you please
> tell me what commit to apply this to?  (Once applied, git cherry-pick
> is usually pretty good about handling minor conflicts.)

It was originally based on v5.3-rc2

I was able to apply it just now to the rcu -dev branch and I pushed it here:
https://github.com/joelagnel/linux-kernel.git (branch paul-dev)

Let me know if there are any other issues, thanks for the change log rework!

thanks,

 - Joel

>   Thanx, Paul
> 
> On Wed, Aug 14, 2019 at 12:04:10PM -0400, Joel Fernandes (Google) wrote:
> > A recent discussion about the stability and performance of systems
> > with kfree_rcu() flooding [1] led to another discussion of how to better handle
> > this situation.
> > 
> > This commit starts simple, adding only basic batching support for
> > kfree_rcu(). It is "basic" because it does none of the slab management,
> > dynamic allocation, or code movement carried out by a previous attempt
> > [2].  These additional improvements can be implemented later as agreement
> > is reached on these other issues.  For example, future work might
> > increase cache locality by applying vector object lists, kfree_bulk(),
> > or per-slab batching to further improve handling of kfree_rcu() floods.
> > 
> > Performance tests are provided in a later commit.  These tests show a
> > 5x reduction in number of grace periods on a 16 CPU system, with minimal
> > increase in kfree() latency.
> > 
> > Note that this commit prevents rcu_barrier() from waiting for the
> > execution of the kfree() calls associated with prior kfree_rcu()
> > invocations.  This should not be a problem, given that the resulting
> > pending kfree() calls do not prevent module unloading or filesystem
> > unmounting.  The reason rcu_barrier() no longer waits for the kfree()
> > calls is that the kfree_rcu() requests are now batched, so that at
> > any given time there might be kfree_rcu() requests that are not yet
> > known to the core RCU machinery.  Furthermore, once a kfree_rcu()
> > grace period has elapsed, the actual kfree() invocations happen in
> > workqueue context.  So rcu_barrier() no longer waits for all of the
> > prior requests, nor does it wait for the workqueue handlers to
> > start, let alone complete.  If there is ever a good reason to wait for
> > the kfree() invocation corresponding to all prior kfree_rcu() calls,
> > an appropriate kfree_rcu_barrier() can be constructed.  However, at
> > the moment no reasonable use case is apparent.
> > 
> > This commit can result in increased memory footprint because the
> > batching can increase the kfree_rcu()-to-kfree() latency.  Later
> > commits will reduce this memory footprint.
> > 
> > Later commits will also remove the special handling of kfree_rcu() by
> > __rcu_reclaim() within the RCU core.  This will require changes to
> > rcuperf testing and to early boot handling of kfree_rcu().
> > 
> > [1] http://lore.kernel.org/lkml/20190723035725-mutt-send-email-...@kernel.org
> > [2] https://lkml.org/lkml/2017/12/19/824
> > 
> > Cc: kernel-t...@android.com
> > Cc: kernel-t...@lge.com
> > Co-developed-by: Byungchul Park 
> > Signed-off-by: Byungchul Park 
> > Signed-off-by: Joel Fernandes (Google) 
> > 
> > ---
> > v3->v4: Some corrections by Paul.
> > Used xchg in places to simplify code.
> > 
> > v2->v3: Just some code comment changes thanks to Byungchul.
> > 
> > RFCv1->PATCH v2: Removed limits on the ->head list, just let it grow.
> >Dropped KFREE_MAX_JIFFIES to HZ/50 from HZ/20 to reduce 
> > OOM occurrence.
> >Removed sleeps in rcuperf test, just using 
> > cond_resched() in loop.
> >Better code comments ;)
> > 
> >  include/linux/rcutiny.h |   5 ++
> >  include/linux/rcutree.h |   1 +
> >  kernel/rcu/tree.c   | 194 ++--
> >  3 files changed, 194 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> > index 8e727f57d814..383f2481750f 100644
> > --- a/include/linux/rcutiny.h
> > +++ b/include/linux/rcutiny.h
> > @@ -39,6 +39,11 @@ static inline void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > call_rcu(head, func);
> >  }
> >  
> > +static inline void kfree_call_rcu_nobatch(struct rcu_head *head, rcu_callback_t func)
> > +{
> > +   call_rcu(head, func);
> > +}
> > +
> >  void rcu_qs(void);
> >  
> >  static inline void rcu_softirq_qs(void)
> > diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> > index 735601ac27d3..7e38b39ec634 100644
> > --- a/include/linux/rcutree.h
> > +++ b/include/linux/rcutree.h
> > @@ -34,6 +34,7 @@ static inline void rcu_virt_note_context_switch(int cpu)
> >  
> >  void synchronize_rcu_expedited(void);
> >  void 

Re: [PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching

2019-08-16 Thread Paul E. McKenney
Hello, Joel,

I reworked the commit log as follows, but was then unsuccessful in
working out which -rcu commit to apply it to.  Could you please
tell me what commit to apply this to?  (Once applied, git cherry-pick
is usually pretty good about handling minor conflicts.)

Thanx, Paul

On Wed, Aug 14, 2019 at 12:04:10PM -0400, Joel Fernandes (Google) wrote:
> A recent discussion about the stability and performance of systems
> with kfree_rcu() flooding [1] led to another discussion of how to better handle
> this situation.
> 
> This commit starts simple, adding only basic batching support for
> kfree_rcu(). It is "basic" because it does none of the slab management,
> dynamic allocation, or code movement carried out by a previous attempt
> [2].  These additional improvements can be implemented later as agreement
> is reached on these other issues.  For example, future work might
> increase cache locality by applying vector object lists, kfree_bulk(),
> or per-slab batching to further improve handling of kfree_rcu() floods.
> 
> Performance tests are provided in a later commit.  These tests show a
> 5x reduction in number of grace periods on a 16 CPU system, with minimal
> increase in kfree() latency.
> 
> Note that this commit prevents rcu_barrier() from waiting for the
> execution of the kfree() calls associated with prior kfree_rcu()
> invocations.  This should not be a problem, given that the resulting
> pending kfree() calls do not prevent module unloading or filesystem
> unmounting.  The reason rcu_barrier() no longer waits for the kfree()
> calls is that the kfree_rcu() requests are now batched, so that at
> any given time there might be kfree_rcu() requests that are not yet
> known to the core RCU machinery.  Furthermore, once a kfree_rcu()
> grace period has elapsed, the actual kfree() invocations happen in
> workqueue context.  So rcu_barrier() no longer waits for all of the
> prior requests, nor does it wait for the workqueue handlers to
> start, let alone complete.  If there is ever a good reason to wait for
> the kfree() invocation corresponding to all prior kfree_rcu() calls,
> an appropriate kfree_rcu_barrier() can be constructed.  However, at
> the moment no reasonable use case is apparent.
> 
> This commit can result in increased memory footprint because the
> batching can increase the kfree_rcu()-to-kfree() latency.  Later
> commits will reduce this memory footprint.
> 
> Later commits will also remove the special handling of kfree_rcu() by
> __rcu_reclaim() within the RCU core.  This will require changes to
> rcuperf testing and to early boot handling of kfree_rcu().
> 
> [1] http://lore.kernel.org/lkml/20190723035725-mutt-send-email-...@kernel.org
> [2] https://lkml.org/lkml/2017/12/19/824
> 
> Cc: kernel-t...@android.com
> Cc: kernel-t...@lge.com
> Co-developed-by: Byungchul Park 
> Signed-off-by: Byungchul Park 
> Signed-off-by: Joel Fernandes (Google) 
> 
> ---
> v3->v4: Some corrections by Paul.
>   Used xchg in places to simplify code.
> 
> v2->v3: Just some code comment changes thanks to Byungchul.
> 
> RFCv1->PATCH v2: Removed limits on the ->head list, just let it grow.
>Dropped KFREE_MAX_JIFFIES to HZ/50 from HZ/20 to reduce 
> OOM occurrence.
>Removed sleeps in rcuperf test, just using 
> cond_resched() in loop.
>Better code comments ;)
> 
>  include/linux/rcutiny.h |   5 ++
>  include/linux/rcutree.h |   1 +
>  kernel/rcu/tree.c   | 194 ++--
>  3 files changed, 194 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
> index 8e727f57d814..383f2481750f 100644
> --- a/include/linux/rcutiny.h
> +++ b/include/linux/rcutiny.h
> @@ -39,6 +39,11 @@ static inline void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
>   call_rcu(head, func);
>  }
>  
> +static inline void kfree_call_rcu_nobatch(struct rcu_head *head, rcu_callback_t func)
> +{
> + call_rcu(head, func);
> +}
> +
>  void rcu_qs(void);
>  
>  static inline void rcu_softirq_qs(void)
> diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
> index 735601ac27d3..7e38b39ec634 100644
> --- a/include/linux/rcutree.h
> +++ b/include/linux/rcutree.h
> @@ -34,6 +34,7 @@ static inline void rcu_virt_note_context_switch(int cpu)
>  
>  void synchronize_rcu_expedited(void);
>  void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func);
> +void kfree_call_rcu_nobatch(struct rcu_head *head, rcu_callback_t func);
>  
>  void rcu_barrier(void);
>  bool rcu_eqs_special_set(int cpu);
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index a14e5fbbea46..1d1847cadea2 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2593,17 +2593,185 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>  }
>  EXPORT_SYMBOL_GPL(call_rcu);
>  
> +
> +/* Maximum number of jiffies to wait before 

[PATCH v4 1/2] rcu/tree: Add basic support for kfree_rcu() batching

2019-08-14 Thread Joel Fernandes (Google)
Recently a discussion about the stability and performance of a system
involving a high rate of kfree_rcu() calls surfaced on the list [1],
which led to another discussion about how to prepare for this situation.

This patch adds basic batching support for kfree_rcu(). It is "basic"
because we do none of the slab management, dynamic allocation, code
moving or any of the other things, some of which previous attempts did
[2]. These fancier improvements can be follow-up patches and there are
different ideas being discussed in that regard. This is an effort to
start simple, and build up from there. In the future, an extension to
use kfree_bulk and possibly per-slab batching could be done to further
improve performance due to cache-locality and slab-specific bulk free
optimizations. By using an array of pointers, the worker thread
processing the work would need to read less data since it does not
need to deal with large rcu_head(s) any longer.

Torture tests follow in the next patch and show around a 5x reduction
in the number of grace periods on a 16 CPU system. More
details and test data are in that patch.

There is an implication with rcu_barrier() with this patch. Since the
kfree_rcu() calls can be batched and may not yet have been handed to the
RCU machinery (in fact, the monitor may not even have run yet to do the
queue_rcu_work()), there seems to be no easy way of implementing
rcu_barrier() to wait for those kfree_rcu()s that have already been
made. So this means a
kfree_rcu() followed by an rcu_barrier() does not imply that memory will
be freed once rcu_barrier() returns.

Another implication is higher active memory usage (although not
runaway) until the kfree_rcu() flooding ends, in comparison to
without batching. More details about this are in the second patch which
adds an rcuperf test.

Finally, in the near future we will get rid of kfree_rcu() special casing
within RCU such as in rcu_do_batch and switch everything to just
batching. Currently we don't do that since the timer subsystem is not
yet up and we cannot schedule the kfree_rcu() monitor, as the timer
subsystem's locks are not initialized. That would also mean getting rid of
kfree_call_rcu_nobatch() entirely.

[1] http://lore.kernel.org/lkml/20190723035725-mutt-send-email-...@kernel.org
[2] https://lkml.org/lkml/2017/12/19/824

Cc: kernel-t...@android.com
Cc: kernel-t...@lge.com
Co-developed-by: Byungchul Park 
Signed-off-by: Byungchul Park 
Signed-off-by: Joel Fernandes (Google) 

---
v3->v4: Some corrections by Paul.
Used xchg in places to simplify code.

v2->v3: Just some code comment changes thanks to Byungchul.

RFCv1->PATCH v2: Removed limits on the ->head list, just let it grow.
   Dropped KFREE_MAX_JIFFIES to HZ/50 from HZ/20 to reduce OOM 
occurrence.
   Removed sleeps in rcuperf test, just using cond_resched() in 
loop.
   Better code comments ;)

 include/linux/rcutiny.h |   5 ++
 include/linux/rcutree.h |   1 +
 kernel/rcu/tree.c   | 194 ++--
 3 files changed, 194 insertions(+), 6 deletions(-)

diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 8e727f57d814..383f2481750f 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -39,6 +39,11 @@ static inline void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
call_rcu(head, func);
 }
 
+static inline void kfree_call_rcu_nobatch(struct rcu_head *head, rcu_callback_t func)
+{
+   call_rcu(head, func);
+}
+
 void rcu_qs(void);
 
 static inline void rcu_softirq_qs(void)
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 735601ac27d3..7e38b39ec634 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -34,6 +34,7 @@ static inline void rcu_virt_note_context_switch(int cpu)
 
 void synchronize_rcu_expedited(void);
 void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func);
+void kfree_call_rcu_nobatch(struct rcu_head *head, rcu_callback_t func);
 
 void rcu_barrier(void);
 bool rcu_eqs_special_set(int cpu);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a14e5fbbea46..1d1847cadea2 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2593,17 +2593,185 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 }
 EXPORT_SYMBOL_GPL(call_rcu);
 
+
+/* Maximum number of jiffies to wait before draining a batch. */
+#define KFREE_DRAIN_JIFFIES (HZ / 50)
+
 /*
- * Queue an RCU callback for lazy invocation after a grace period.
- * This will likely be later named something like "call_rcu_lazy()",
- * but this change will require some way of tagging the lazy RCU
- * callbacks in the list of pending callbacks. Until then, this
- * function may only be called from __kfree_rcu().
+ * Maximum number of kfree(s) to batch, if this limit is hit then the batch of
+ * kfree(s) is queued for freeing after a grace period, right away.
  */
-void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
+struct