Re: Blocking vcache_tryvget() across VOP_INACTIVE() - unneeded

2020-01-21 Thread Thor Lancelot Simon
On Tue, Jan 21, 2020 at 11:12:06PM +0100, Mateusz Guzik wrote:
> 
> This does not happen with rb trees and would not happen with the hash
> table if there was bucket locking instead of per-cpu.
> 
> I would stick to hash tables since they are easier to scale (both with
> and without locks).
> 
> For instance, if 2 threads look up "/bin" and "/usr" and there is no
> hash collision, they lock *separate* buckets and suffer less in terms
> of cacheline bouncing. By comparison, with rb trees they will take the
> same lock. Of course this similarly does not scale if they are looking
> up the same path.

I think there's probably a generic bucket-locked hashtable implementation
in one of the code contributions Coyote Point made long ago.  That said,
we're talking about a screenful of code, so it's not like it'd be hard to
rewrite, either.  Do we want such a thing?  Or do we want to press on
towards more modern "that didn't work, try again" approaches like
pserialize or the COW scheme you describe?

-- 
 Thor Lancelot Simon t...@panix.com
  "Whether or not there's hope for change is not the question.  If you
   want to be a free person, you don't stand up for human rights because
   it will work, but because it is right."  --Andrei Sakharov


Re: Blocking vcache_tryvget() across VOP_INACTIVE() - unneeded

2020-01-21 Thread Mateusz Guzik
On 1/21/20, Andrew Doran  wrote:
> On Thu, Jan 16, 2020 at 04:51:44AM +0100, Mateusz Guzik wrote:
>>
>> I'm assuming the goal for the foreseeable future is to achieve path
>> lookup with shared vnode locks.
>
> Good guess.  There is a prototype of LK_SHARED lookup on the ad-namecache
> branch, along with lookup using namecache only that takes as few vnode
> locks and refcounts as possible.  In the easiest case only the root vnode
> and the leaf vnode are touched (although whether the root vnode ever
> actually needs a reference is another question).  There are many
> "non-easy" cases though.
>

As in, you can get away without ever accessing it as long as there are no
sufficiently long ".." chains and no absolute paths (or symlinks)?

My solution to the problem (not implemented yet) is to introduce a
copy-on-write structure which holds the ref to the rootvnode. Then lwps
can just use it, at worst refing/unrefing that struct, but never the
vnode itself. This also covers the current working directory.
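
Roughly along these lines (just a sketch of the idea; the struct and
function names are made up, and the allocation/publication side when root
or cwd changes is not shown):

#include <sys/atomic.h>
#include <sys/kmem.h>
#include <sys/vnode.h>

/*
 * Hypothetical sketch: a refcounted, copy-on-write snapshot of the
 * process root and cwd.  Lookups take a reference on the snapshot
 * instead of on the vnodes; changing root or cwd allocates a new
 * snapshot and publishes it, and the old one is freed when its last
 * user drops it.
 */
struct pathinfo {
	volatile unsigned int	pi_refcnt;
	struct vnode		*pi_rootdir;	/* held vnode references */
	struct vnode		*pi_cwd;
};

static struct pathinfo *
pathinfo_ref(struct pathinfo *pi)
{
	atomic_inc_uint(&pi->pi_refcnt);
	return pi;
}

static void
pathinfo_rele(struct pathinfo *pi)
{
	if (atomic_dec_uint_nv(&pi->pi_refcnt) == 0) {
		vrele(pi->pi_rootdir);		/* only now touch the vnodes */
		vrele(pi->pi_cwd);
		kmem_free(pi, sizeof(*pi));
	}
}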

I have seen the branch. I think the use of rb trees instead of a hash
table is pessimal.

My speculation as to where the wins over the current code come from boils
down to the per-cpu locks being eliminated. Specifically, since v_interlock
is a lock from the bufobj and is heavily used in vm, the likelihood of
getting preempted while holding it increases. Say CPU0 takes its own pcpu
lock and proceeds to take v_interlock, but has to wait and is taken off the
CPU. If the new thread running on CPU0 performs *any* path lookup it will
also block, instead of blocking only if it was looking up the affected
vnode. In other words there is huge potential to stall *all* lookups on a
given CPU.
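
Sketched in pseudo-C (illustrative only, not the actual vcache lookup code;
nc_pcpu_lock() is a made-up stand-in for the per-cpu lock):

#include <sys/mutex.h>
#include <sys/vnode.h>

kmutex_t *nc_pcpu_lock(void);		/* hypothetical accessor for this
					   CPU's lookup lock */

/* Illustration of the stall, not actual vcache code. */
static void
lookup_with_pcpu_lock(struct vnode *vp)
{
	kmutex_t *pcpu = nc_pcpu_lock();

	mutex_enter(pcpu);		/* taken for *every* lookup on this CPU */
	mutex_enter(vp->v_interlock);	/* shared with bufobj/uvm; the owner
					   may be off-CPU, so we can end up
					   sleeping while holding pcpu */
	/* ... vnode ref/validity checks ... */
	mutex_exit(vp->v_interlock);
	mutex_exit(pcpu);

	/*
	 * While this thread sleeps above, every other lookup on the same
	 * CPU blocks on pcpu, regardless of which vnode it is after.  With
	 * per-bucket (or per-tree) locks, only lookups of the affected
	 * name would block.
	 */
}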

This does not happen with rb trees and would not happen with the hash
table if there was bucket locking instead of per-cpu.

I would stick to hash tables since they are easier to scale (both with
and without locks).

For instance, if 2 threads look up "/bin" and "/usr" and there is no
hash collision, they lock *separate* buckets and suffer less in terms
of cacheline bouncing. By comparison, with rb trees they will take the
same lock. Of course this similarly does not scale if they are looking
up the same path.
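
Something like the following (a sketch only; the hash function choice and
the nc_* structures are illustrative, not taken from any patch):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/hash.h>
#include <sys/mutex.h>
#include <sys/queue.h>

#define	NC_NBUCKET	256		/* illustrative; power of two */

struct ncentry {
	LIST_ENTRY(ncentry)	ne_list;
	const char		*ne_name;
	size_t			ne_namelen;
};

/* One lock per bucket, each bucket on its own cache line. */
struct ncbucket {
	kmutex_t		nb_lock;
	LIST_HEAD(, ncentry)	nb_head;
} __aligned(COHERENCY_UNIT);

static struct ncbucket ncbuckets[NC_NBUCKET];

static struct ncentry *
nc_lookup_bucket(const char *name, size_t namelen)
{
	struct ncbucket *nb;
	struct ncentry *ne;

	/* "/bin" and "/usr" almost always hash to different buckets... */
	nb = &ncbuckets[hash32_buf(name, namelen, HASH32_BUF_INIT) &
	    (NC_NBUCKET - 1)];

	/* ...so concurrent lookups of different names take different locks
	   and do not bounce the same lock cache line between CPUs. */
	mutex_enter(&nb->nb_lock);
	LIST_FOREACH(ne, &nb->nb_head, ne_list) {
		if (ne->ne_namelen == namelen &&
		    memcmp(ne->ne_name, name, namelen) == 0)
			break;
	}
	mutex_exit(&nb->nb_lock);
	return ne;
}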

-- 
Mateusz Guzik 



Re: libc.so's vmobjlock / v_interlock

2020-01-21 Thread Andrew Doran
On Sun, Jan 19, 2020 at 09:50:24PM +, Andrew Doran wrote:

> Here's a dtrace flamegraph for a kernel build on my system, while running a
> kernel from the ad-namecache branch.
> 
>   http://www.netbsd.org/~ad/2020/jan19.svg
> 
> The biggest single remaining concurrency problem is the mutex on libc.so's
> uvm_object / vnode.  It pops up in a number of places, but is most clearly
> seen above the "uvm_fault_internal" box.
> 
> I think making the vmobjlock / v_interlock an rwlock could help with making
> inroads on this problem, because in the libc.so case not much is modified
> under cover of the lock (mostly pmap state).
> 
> That's not a small undertaking, but I reckon it would take about a day of
> donkey work to change every mutex_enter(..) to a rw_enter(.., RW_WRITER) and
> make a start at choosing rw_write_held() or rw_lock_held() for the
> assertions, followed by a bit of benchmarking to check for a regression and
> a jumbo commit.  From there on things could be changed incrementally.
> 
> Thoughts?

Here's a first set of changes to do just this.  I don't plan on doing more
on this for about a month because I will be travelling soon.

http://www.netbsd.org/~ad/2020/vmobjlock.diff
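
The mechanical shape of the change is roughly as follows.  This is an
illustration of the conversion described above, not an excerpt from the
diff, and it assumes uobj->vmobjlock has already been switched from a
kmutex_t * to a krwlock_t *:

#include <sys/systm.h>
#include <sys/rwlock.h>
#include <uvm/uvm_object.h>

/*
 * Illustration only.  Each locking site changes mechanically; sites later
 * shown to be read-only can then be relaxed to RW_READER.
 */
static void
obj_op(struct uvm_object *uobj)
{
	/* was: mutex_enter(uobj->vmobjlock); */
	rw_enter(uobj->vmobjlock, RW_WRITER);

	/* was: KASSERT(mutex_owned(uobj->vmobjlock)); */
	KASSERT(rw_write_held(uobj->vmobjlock));	/* or rw_lock_held() */

	/* ... inspect or modify object state ... */

	/* was: mutex_exit(uobj->vmobjlock); */
	rw_exit(uobj->vmobjlock);
}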

Andrew


- This splits v_interlock apart from vmobjlock.  v_interlock stays a mutex,
  vmobjlock becomes an RW lock.  The lock order is vmobjlock -> v_interlock.
  There are a few shared items that need to be examined closely and either
  fixed or given good excuses, e.g. VI_EXECMAP, but it's mostly right.

- I made anons/amaps have an rwlock too.  There is bound to be an application
  for this; I think it might be PostgreSQL, but I must check.

- There seems to be no performance regression from doing this.  Not having
  vnode stuff adding contention to the vmobjlock makes the numbers better.

To get to the point of doing concurrent faults for the simple cases, like
those needed for libc.so, I think the following are also needed:

- PV lists in the pmap need explicit locking (unless the pmap uses a global
  lock, which is fine if it suits the machine).  I have done the PV list
  locking for x86 and it's in the patch.

  I'm suggesting a lock still needs to be held on the UVM object / amap for
  pmap ops.  But that can be a read lock; for x86 I want a write lock held
  only for pmap_page_protect(VM_PROT_NONE) aka pmap_page_remove(), because
  trying to have the x86 pmap manage walking the PV list while removing PTEs,
  while everything is changing around it, and have it all nice and
  concurrent, would be really tough.  I don't think
  pmap_page_protect(VM_PROT_NONE) is a hotspot anyway.

- With the above done, uvm_unmap_remove() and uvm_map_protect() can take
  a reader lock on the object.  I've tried this out and it works fine
  "on my machine", except that in the libc.so case it then starts to
  compete with uvm_fault() for the vmobjlock.

- PG_BUSY could become a reader/writer lock of sorts: I'm thinking of
  something like a counter in struct vm_page, covered by pg->interlock
  (a rough sketch follows after this list).  Interested to hear any other
  ideas.

- Then it's a trip into the uvm_fault() -> getpages -> uvn_findpages()
  path looking for what's needed; I expect probably not a whole lot if you
  only want to handle faults for stuff that's in-core using a read lock.
  I don't know the uvm_fault() code too well.
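
To illustrate the counter idea mentioned above (purely hypothetical: the
struct and helpers below stand in for struct vm_page and pg->interlock and
are not proposed code; waiting for a busy page rather than failing is
omitted):

#include <sys/types.h>
#include <sys/mutex.h>

struct busy_page {
	kmutex_t	interlock;	/* stands in for pg->interlock */
	int		busycnt;	/* > 0 readers, -1 exclusive, 0 idle */
};

static bool
page_busy_shared(struct busy_page *pg)
{
	bool ok;

	mutex_enter(&pg->interlock);
	ok = (pg->busycnt >= 0);	/* no exclusive holder */
	if (ok)
		pg->busycnt++;
	mutex_exit(&pg->interlock);
	return ok;
}

static bool
page_busy_exclusive(struct busy_page *pg)
{
	bool ok;

	mutex_enter(&pg->interlock);
	ok = (pg->busycnt == 0);	/* nobody else holds the page busy */
	if (ok)
		pg->busycnt = -1;
	mutex_exit(&pg->interlock);
	return ok;
}

static void
page_unbusy(struct busy_page *pg)
{
	mutex_enter(&pg->interlock);
	if (pg->busycnt == -1)
		pg->busycnt = 0;	/* drop exclusive hold */
	else
		pg->busycnt--;		/* drop one shared hold */
	mutex_exit(&pg->interlock);
}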


Re: Adaptive RW locks

2020-01-21 Thread Andrew Doran
To follow up: in testing this I ended up discovering a number of tedious,
complicated deadlocks that could occur due to softints, kernel_lock and
other factors.  Trying to mitigate them killed the performance gain and it
still wasn't right.  I'm abandoning this idea because in practice it seems
too complicated.  On a more positive note there are a couple of LOCKDEBUG
improvements out of it.

Andrew


Re: Blocking vcache_tryvget() across VOP_INACTIVE() - unneeded

2020-01-21 Thread Andrew Doran
On Thu, Jan 16, 2020 at 04:51:44AM +0100, Mateusz Guzik wrote:

> On 1/15/20, Andrew Doran  wrote:
> > I don't understand why we do this.  I don't think it's needed at all,
> > because if the file really is deleted and the inode is getting cleaned
> > out, then it shouldn't be getting new refs in any of the usual ways and
> > we don't need to worry about it.  And in any case the vnode lock
> > prevents anyone from messing with the inode during VOP_INACTIVE().
> >
> > I want to change this because it causes nasty problems on MP.  For
> > example I see threads very regularly thrown out of the namecache by
> > vcache_tryvget(), and when they finally get the vnode ref'd and locked
> > they do it by digging through the file system metadata (when everything
> > is already in cache).
> >
> >   http://www.netbsd.org/~ad/inactive.diff
> >
> 
> I think looking at a bigger picture suggests a different solution, which
> will also keep the current blocked <-> loaded transition for consideration
> later.
> 
> I'm assuming the goal for the foreseeable future is to achieve path lookup
> with shared vnode locks.

Good guess.  There is a prototype of LK_SHARED lookup on the ad-namecache
branch, along with lookup using namecache only that takes as few vnode locks
and refcounts as possible.  In the easiest case only the root vnode and the
leaf vnode are touched (although whether the root vnode ever actually needs
a reference is another question).  There are many "non-easy" cases though.

> Since the ->v_usecount transition to 0 happens all the time and VOP_INACTIVE
> processing is almost never needed, the current requirement to hold an
> exclusive lock just to find out there is nothing to do should be lifted.
> Apart from getting away with only a shared lock (or no locks), this would
> also naturally eliminate vnode state changes for the common case.
> 
> Preferably vfs would provide a method for filesystems to call to signify
> they want VOP_INACTIVE to be called for a given vnode; then vrelel could
> just test a bit. However, this is probably not worth the effort right now.
> 
> Said locking was also a problem in FreeBSD. It got temporarily sorted
> out with the introduction of VOP_NEED_INACTIVE -- by default it returns 1,
> but filesystems which provide their own method easily avoid the problem.
> I think starting with requiring at least a shared lock would do the trick
> just fine while providing race-free methods at least for ufs and tmpfs.
> 
> Then vrelel could stick to:
> diff --git a/sys/kern/vfs_vnode.c b/sys/kern/vfs_vnode.c
> index d05d5180f730..45d80dab4a2c 100644
> --- a/sys/kern/vfs_vnode.c
> +++ b/sys/kern/vfs_vnode.c
> @@ -774,7 +774,7 @@ vrelel(vnode_t *vp, int flags, int lktype)
>  * If not clean, deactivate the vnode, but preserve
>  * our reference across the call to VOP_INACTIVE().
>  */
> -   if (VSTATE_GET(vp) == VS_RECLAIMED) {
> +   if (VSTATE_GET(vp) == VS_RECLAIMED || !VOP_NEED_INACTIVE(vp)) {
> VOP_UNLOCK(vp);
> } else {
> VSTATE_CHANGE(vp, VS_LOADED, VS_BLOCKED);
> 
> which even with exclusive locks already avoids the blocked state problem.
> 
> Another option is to move some of the current body into smaller helpers
> and then let filesystems piece together their own VOP_VRELE.

Interesting.  I had tried another approach, a VOP_INACTIVE() that could be
called with an LK_SHARED lock.  It returned ENOLCK if there was work to do
that needed an exclusive lock, but it was ugly.  I will take a look at the
FreeBSD solution.
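
For reference, the filesystem side of such a hook might look roughly like
the following.  This is hypothetical, loosely modeled on the FreeBSD
VOP_NEED_INACTIVE idea described above; it is not code from either tree,
and the exact set of conditions is a guess for illustration:

#include <sys/vnode.h>
#include <ufs/ufs/inode.h>

/*
 * Hypothetical ufs-style "need inactive?" test: only vnodes whose inode
 * was unlinked, or which have pending metadata updates, need the full
 * VOP_INACTIVE pass; everything else can skip it.
 */
static int
ufs_need_inactive(struct vnode *vp)
{
	struct inode *ip = VTOI(vp);

	return ip->i_nlink <= 0 ||
	    (ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_UPDATE | IN_MODIFIED)) != 0;
}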

Thanks,
Andrew