Re: Please review: lookup changes

2020-04-04 Thread Andrew Doran
On Wed, Mar 11, 2020 at 05:40:59PM +0100, J. Hannken-Illjes wrote:

> > On 5. Mar 2020, at 23:48, Andrew Doran  wrote:
> > 
> > Hi,
> > 
> > I'd like to merge the changes on the ad-namecache branch, and would
> > appreciate any code review.  The motivation for these changes is, as
> > you might guess, performance.  I put an overview below.  I won't be
> > merging this for a couple of weeks in any case.
> 
> [snip]
> 
> > vfs_vnode.c:
> > 
> > Namecache related changes, and the change to not block new vnode
> > references across VOP_INACTIVE() mentioned on tech-kern.  I don't
> > think this blocking is needed at least for well behaved FSes like
> > ffs & tmpfs.
> 
> I suppose this blocking is no longer needed.  It originates from
> the early steps to clean up the vnode lifecycle.  If anything
> goes wrong it is quite simple to undo this change.
> 
> Blocking further references should only be needed for vrecycle().
> 
> We have to terminate vrelel() early if we got new references,
> diff attached.

I was wondering about the same thing the other day.  Changes applied, thank you!

Andrew


Re: Please review: lookup changes

2020-03-17 Thread Andrew Doran
On Sun, Mar 08, 2020 at 09:22:28PM +, Taylor R Campbell wrote:

> Maybe we could use a critbit tree?  It will have cache utilization
> characteristics similar to a red/black tree (fanout is 2 either way),
> but the lookup can be pserialized and each step may be a little
> cheaper computationally.
> 
> https://mumble.net/~campbell/hg/critbit/
> 
> Alternatively, there is rmind's thmap already in tree.

I had a look and this is a really interesting data structure.  I think this,
plus a seqlock and a nice reclamation method, would do great.  My only
concern is that single threaded performance should be top notch too because
single component name lookups are often more frequent even than traps and
syscalls put together.  I suppose it's a case of trying it.
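
A rough sketch of what the read side could look like: a seqlock guarding a
lockless lookup, retried if a writer ran concurrently.  This is generic C11
code with made-up names (nc_seq, nc_lookup_unlocked) and simplified memory
ordering -- not anything from the branch or from the critbit code:

    #include <stdatomic.h>

    extern _Atomic unsigned nc_seq;                /* odd while a writer runs */
    extern void *nc_lookup_unlocked(const char *); /* e.g. critbit/thmap walk */

    void *
    nc_lookup(const char *name)
    {
            unsigned s;
            void *res;

            for (;;) {
                    s = atomic_load_explicit(&nc_seq, memory_order_acquire);
                    if (s & 1)
                            continue;       /* writer active; retry */
                    res = nc_lookup_unlocked(name);
                    atomic_thread_fence(memory_order_acquire);
                    if (atomic_load_explicit(&nc_seq,
                        memory_order_relaxed) == s)
                            return res;     /* snapshot was stable */
            }
    }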

Another application I can think of is _lwp_unpark(), although the minor
complication there is that the radixtree is now also used to allocate LIDs
by scanning for unused slots.

Andrew


Re: Please review: lookup changes

2020-03-11 Thread J. Hannken-Illjes
> On 5. Mar 2020, at 23:48, Andrew Doran  wrote:
> 
> Hi,
> 
> I'd like to merge the changes on the ad-namecache branch, and would
> appreciate any code review.  The motivation for these changes is, as
> you might guess, performance.  I put an overview below.  I won't be
> merging this for a couple of weeks in any case.

[snip]

> vfs_vnode.c:
> 
>   Namecache related changes, and the change to not block new vnode
>   references across VOP_INACTIVE() mentioned on tech-kern.  I don't
>   think this blocking is needed at least for well behaved FSes like
>   ffs & tmpfs.

I suppose this blocking is no longer needed.  It originates from
the early steps to clean up the vnode lifecycle.  If anything
goes wrong it is quite simple to undo this change.

Blocking further references should only be needed for vrecycle().

We have to terminate vrelel() early if we got new references,
diff attached.
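
A minimal model of the idea (this is not the attached diff; names and
structure are illustrative only): once new references are no longer blocked
across the inactive step, the release path has to re-check the use count
afterwards and return early instead of reclaiming.

    #include <stdatomic.h>

    struct obj {
            _Atomic unsigned usecount;
    };

    extern void obj_inactive(struct obj *);  /* stands in for VOP_INACTIVE() */
    extern void obj_reclaim(struct obj *);   /* stands in for reclaim/free */

    void
    obj_release(struct obj *o)
    {
            /* Drop our reference; only the last holder goes further. */
            if (atomic_fetch_sub(&o->usecount, 1) != 1)
                    return;

            obj_inactive(o);   /* new references are not blocked here */

            /*
             * Terminate early if new references were gained while the
             * object was being deactivated.  (Real code would close the
             * remaining window under the interlock before reclaiming.)
             */
            if (atomic_load(&o->usecount) != 0)
                    return;

            obj_reclaim(o);
    }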

--
J. Hannken-Illjes - hann...@eis.cs.tu-bs.de - TU Braunschweig



addendum.diff
Description: Binary data




Re: Please review: lookup changes

2020-03-11 Thread Mateusz Guzik
>> locks and not hashing itself. Note that at the time vnode interlock and
>> vm object lock were the same thing and in this workload the lock is
>> under a lot of contention. Should the lock owner get preempted, anyone
>> doing a lookup to the affected vnode (e.g., libc) will be holding
>> the relevant per-cpu lock and will block on a turnstile. Whoever ends
>> up running on the affected cpu is likely to do a lookup on their own,
>> but the relevant per-cpu lock is taken, so they too go off cpu. The same
>> thing happening on more than one cpu at a time could easily result in a
>> cascading failure, which I strongly suspect is precisely what happened.
>>
>> That is, the win does not stem from rb trees but finer-grained locking
>> which does not block other threads which look up something else on
>> the same cpu.
>
> Not on NetBSD.  Kernel preemption is possible and allowed (mainly for real
> time applications), but happens infrequently during normal operation.  There
> are a number of pieces of code that take advantage of that fact and are
> "optimistically per-CPU", and they work very well as preemption is rarely
> observed.  Further the blocking case on v_interlock in cache_lookup() is
> rare.  That's not to say it doesn't happen, it does, but I don't think it's
> enough to explain the performance differences.
>

I noted that I suspected preemption was occurring because of contention on
the vm side.  It should only take one such preemption to start the cascade.

>> As mentioned earlier I think rb trees instead of a hash are pessimal
>> here.
>>
>> First, a little step back. The lookup starts with securing vnodes from
>> cwdinfo. This represents a de facto global serialisation point (times two
>> since the work has to be reverted later). In FreeBSD I implemented an
>> equivalent with copy-on-write semantics. Easy take on it is that I take
>> an rwlock-equivalent and then grab a reference on the found struct.
>> This provides me with an implicit reference on root and current working
>> directory vnodes. If the struct is unshared on fork, aforementioned
>> serialisation point becomes localized to the process.
>
> Interesting.  Sounds somewhat like both NetBSD and FreeBSD do for process
> credentials.
>

I was thinking about doing precisely that but I found it iffy to have
permanently stored references per-thread. With the proposal they get
"gained" around the actual lookup, otherwise this is very similar to
what mountcheckdirs is dealing with right now.

>> In my tests even with lookups which share most path components, the
>> last one tends to be different. Using a hash means this typically
>> results in grabbing different locks for that case and consequently
>> fewer cache-line ping pongs.
>
> My experience has been different.  What I've observed is that shared hash
> tables usually generate huge cache pressure unless they're small and rarely
> updated.  If the hash were small, matching the access pattern (e.g.
> per-dir) then I think it would have the opportunity to make maximum use of
> the cache.  That could be a great win and certainly better than rbtree.
>

Well in my tests this is all heavily dominated by SMP-effects, which I
expect to be exacerbated by just one lock.

Side note is that I had a look at your vput.  The pre-read + VOP_UNLOCK +
actual loop to drop the ref definitely slows things down, if only a little,
as this can force a shared cacheline transition out from under someone
cmpxchg'ing.
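
For illustration, the difference between the two idioms in generic C11 terms
(not the actual vput() code): a single read-modify-write acquires the cache
line exclusively once, whereas a preliminary plain load pulls it in shared
state first and the following CAS may then have to re-acquire it and retry
under contention.

    #include <stdatomic.h>

    /* One RMW: the cache line is taken exclusively once. */
    unsigned
    drop_ref_rmw(_Atomic unsigned *cnt)
    {
            return atomic_fetch_sub(cnt, 1) - 1;
    }

    /* Pre-read then CAS: the initial load is an extra shared-state
     * acquisition of the line, and the loop can spin under contention. */
    unsigned
    drop_ref_cas(_Atomic unsigned *cnt)
    {
            unsigned old = atomic_load(cnt);

            while (!atomic_compare_exchange_weak(cnt, &old, old - 1))
                    ;       /* 'old' is refreshed by the failed CAS */
            return old - 1;
    }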

That said, can you generate a flamegraph from a fully patched kernel?
Curious where the time is spent now, my bet is spinning on vnode locks.

-- 
Mateusz Guzik 



Re: Please review: lookup changes

2020-03-11 Thread Mateusz Guzik
Something ate a huge chunk of my e-mail, resending

On 3/8/20, Andrew Doran  wrote:
> On Sat, Mar 07, 2020 at 12:14:05AM +0100, Mateusz Guzik wrote:
>> I believe it is always legal to just upgrade the lock in getattr if
>> necessary. Since itimes updates are typically not needed, this would mean
>> common case would get away with shared locking. This would run into issues
>> in vput, for which see below.
>
> Hmm.  With NetBSD upgrading a lock in the middle of a VOP could have
> consequences for fstrans.  I'm not sure.
>

I don't know how you do upgrades specifically. "Mere" upgrades can of
course fail in the presence of additional readers. In FreeBSD there is an
option to release the lock entirely and relock from scratch, in particular
meaning you can undo the fstrans enter (if necessary). While I don't know the
specifics of your VFS layer, v_usecount which is kept should postpone
freeing of the vnode. Should it fall victim of forced unmount or some
other state change you can just return an error, pretending getattr
never got this far.
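
A sketch of that relock-from-scratch pattern in generic terms (a pthread
rwlock stands in for the vnode lock; the held reference is what keeps the
object alive across the unlocked window; all names are hypothetical):

    #include <errno.h>
    #include <pthread.h>
    #include <stdbool.h>

    struct node {
            pthread_rwlock_t lock;
            bool doomed;            /* e.g. revoked / forcibly unmounted */
    };

    /*
     * Called with the lock held shared.  Returns 0 with the lock held
     * exclusive, or an error with the lock released if the node changed
     * while it was briefly unlocked -- the caller then backs out as if
     * the operation never got this far.
     */
    int
    upgrade_or_fail(struct node *n)
    {
            pthread_rwlock_unlock(&n->lock);   /* give up the shared lock */
            pthread_rwlock_wrlock(&n->lock);   /* relock from scratch */

            if (n->doomed) {                   /* revalidate after the gap */
                    pthread_rwlock_unlock(&n->lock);
                    return ENOENT;
            }
            return 0;
    }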

>> Assuming this is fixed fstat could also shared lock. I think this is a
>> simple change to make and most certainly worth doing.
>>
>> > vfs_vnode.c:
>> >
>> >   Namecache related changes, and the change to not block new vnode
>> >   references across VOP_INACTIVE() mentioned on tech-kern.  I don't
>> >   think this blocking is needed at least for well behaved FSes like
>> >   ffs & tmpfs.
>> >
>>
>> I think this is very hairy and the entire thing is avoidable.
>>
>> I believe I mentioned some other time that FreeBSD got VOP_NEED_INACTIVE.
>> You should be able to trivially implement an equivalent which would be
>> called if the vnode is only shared locked. Supporting an unlocked case
>> comes with extra woes which may not be worth it at this stage.
>>
>> Since inactive processing is almost never needed this would avoid a lot
>> of lock contention.
>>
>> With all this in place you would also not run into the problematic state
>> in the fast path in the common case.
>>
>> This is a cheap cop out, the real solution would encode need for inactive
>> in usecount as a flag but that's a lot of work.
>
> Two separate issues - I agree on avoiding VOP_INACTIVE(), but the issue here
> is peculiar to NetBSD I think.  NetBSD sets the equivalent of the 4.4BSD
> VXLOCK flag to block vget() across VOP_INACTIVE().  Should only be needed
> for revoke/clean.  At one point we did have VXLOCK embedded in v_usecount
> which allowed for vget() with one atomic op, but that was undone at some
> point, I don't know why.
>

If you have a dedicated spot which is guaranteed to be executed for
exclusion against vget then things become easier in terms of effort
required. A dedicated vget variant for the namecache can xadd into the
counter and if the bit is set, backpedal from it. New arrivals should be
eliminated by purging nc entries beforehand. Everyone else can
cmpxchg-loop with the interlock like right now.
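
Roughly, with a hypothetical flag bit embedded in the use count (illustrative
C11, not the NetBSD vget() implementation):

    #include <errno.h>
    #include <stdatomic.h>

    #define USECOUNT_BLOCKED 0x80000000u   /* hypothetical VXLOCK-style bit */

    /*
     * Optimistic get: one xadd.  If the blocked bit turns out to be set,
     * backpedal and let the caller fall back to the interlocked slow path.
     */
    int
    usecount_get_fast(_Atomic unsigned *usecount)
    {
            unsigned old = atomic_fetch_add(usecount, 1);

            if (old & USECOUNT_BLOCKED) {
                    atomic_fetch_sub(usecount, 1);
                    return EBUSY;
            }
            return 0;
    }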

>> > Performance:
>> >
>> >   During a kernel build on my test system it yields about a 15% drop
>> >   in system time:
>> >
>> >   HEAD     79.16s real  1602.97s user   419.54s system
>> >   changes  77.26s real  1564.38s user   368.41s system
>> >
>> >   The rbtree lookup can be assumed to be slightly slower than a hash
>> >   table lookup and microbenchmarking shows this sometimes in the
>> >   single CPU case, but as part of namei() it ends up being ameliorated
>> >   somewhat by the reduction in locking costs, the absence of hash
>> >   bucket collisions with rbtrees, and better L2/L3 cache use on hot
>> >   directories.  I don't have numbers but I know joerg@ saw significant
>> >   speedups on pbulk builds with an earlier version of this code with
>> >   little changed from HEAD other than hash -> rbtree.  Anyway rbtree
>> >   isn't set in stone as the data structure, it's just what we have in
>> >   tree that's most suitable right now.  It can be changed.
>> >
>>
>> I strongly suspect the win observed with pbulk had to with per-cpu
>> locks and not hashing itself. Note that at the time vnode interlock and
>> vm object lock were the same thing and in this workload the lock is
>> under a lot of contention. Should the lock owner get preempted, anyone
>> doing a lookup to the affected vnode (e.g., libc) will be holding
>> the relevant per-cpu lock and will block on a turnstile. Whoever ends
>> up running on the affected cpu is likely to do a lookup on their own,
>> but the relevant per-cpu lock is taken, so they too go off cpu. The same
>> thing happening on more than one cpu at a time could easily result in a
>> cascading failure, which I strongly suspect is precisely what happened.
>>
>> That is, the win does not stem from rb trees but finer-grained locking
>> which does not block other threads which look up something else on
>> the same cpu.
>
> Not on NetBSD.  Kernel preemption 

re: Please review: lookup changes

2020-03-10 Thread matthew green
> This reminds me that we need to find a way to release upper layer
> vnodes when the lower layer is recyclable -- see the comment in
> layer_inactive.  Otherwise files deleted under a null mount while open
> through the null mount may persist on disk indefinitely as long as the
> upper layer is still around...

FWIW, i think it's worse than you describe.  for me:

- it doesn't matter what layer i delete the files from

- the files aren't open anymore

the only way i found to recover disk space was to unmount
the upper layer (which hung for a while, presumably writing
the new disk contents), or, later, to reduce kern.maxvnodes.


.mrg.


Re: Please review: lookup changes

2020-03-10 Thread Taylor R Campbell
> Date: Mon, 9 Mar 2020 19:59:03 +
> From: Andrew Doran 
> 
> As far as I recall layer vnodes persist unless recycled like any other
> vnode, so there should be nothing special happening there around contention
> due to lifecycle.  I think you could cache or transpose the cached names
> above, maybe using some kind of VOP.  I definitely think it's doable but
> beyond that, at least for me there are lots of unknowns so I have avoided it.

This reminds me that we need to find a way to release upper layer
vnodes when the lower layer is recyclable -- see the comment in
layer_inactive.  Otherwise files deleted under a null mount while open
through the null mount may persist on disk indefinitely as long as the
upper layer is still around...

(But no need for that to be in the scope of namecache improvements.)

> I also expect that if ufs_lookup() were opened up to allow a shared lock
> along the full path then pressure would be diverted to buffer locks
> previously shielded by the vnode lock.  I would be unsurprised to see that
> performing worse under contention than the single exclusive lock with that
> code as-is.  There's also the fact that modifications and uncached lookups*
> are often a tiny proportion of activity on the system in comparison to
> cached lookups, so without a specific application in mind I'm not sure
> there's a great deal to be won there.
> 
> Looking up names directly in the namecache bypasses these problem to some
> extent, because the directory vnode lock needs to be touched far less often,
> and when it is touched it's because there's real work to be done.

OK, makes sense.

Other than layerfs specifically avoiding the namecache for now, are
there any other barriers to pulling the copypasta cache_* calls out of
every VOP_LOOKUP and into namei?

> LOCKLEAF is something very old and LOCKSHARED a minor tweak to it (a
> conditional picking either LK_SHARED/LK_EXCLUSIVE on the way out of
> namei()).  I'd prefer that namei() didn't lock any vnodes to begin with but
> that's the scheme we have (see my comment re: 1993 :-).  It's also what
> FreeBSD does for this and in my experience these little differences are
> troublesome WRT code sharing so I decided to match their flag.

How much code sharing can we actually take advantage of in the vfs
layer?

> > > vfs_syscalls.c:
> > > 
> > >   Corrected vnode lock types: LK_SHARED ok for VOP_ACCESS(), but
> > >   LK_EXCLUSIVE still needed for VOP_GETATTR() due to itimes handling.
> > >   That's a shame and it would be nice to use a seqlock or something
> > >   there eventually.
> > 
> > I don't understand this itimes handling.  Why do we ever _set_ any
> > itimes in VOP_GETATTR?
> > [...]
> > Why don't we check our watch immediately when we set IN_CHANGE ,
> > and defer the disk write to update the inode like UFS_WAPBL_UPDATE
> > does -- and just skip UFS_ITIMES in VOP_GETATTR?
> 
> This looks to go back at least as far as Unix 32V (& possibly further) in one
> form or another, from what I can see.  Whatever the original intent I think
> it might have morphed into an attempt to avoid querying the system clock at
> some point, which is about the only point I can see for it now.

That was my conclusion too.  Let's get rid of it and make VOP_GETATTR
work with a shared lock (if not a seqlock)!

> > > vfs_vnode.c:
> > > 
> > >   Namecache related changes, and the change to not block new vnode
> > >   references across VOP_INACTIVE() mentioned on tech-kern.  I don't
> > >   think this blocking is needed at least for well behaved FSes like
> > >   ffs & tmpfs.
> > 
> > hannken@ cautions me that not blocking until VOP_INACTIVE completes
> > may be dangerous.
> > 
> > The last major deadlock I found in lfs is [...]
> 
> Is this something that should be done in VOP_RECLAIM() instead?  Then any
> concerns about VOP_INACTIVE() don't apply since the inode is well and truly
> dead.

I don't think so.  I believe the reason we block during VOP_INACTIVE
is to prevent file systems from gaining new references to vnodes while
vrele is deciding whether to reclaim them -- that way, the decision to
reclaim a vnode is final and made, under the vnode lock, in one place
that doesn't have to worry about backing out.

Historically, the logic that dealt with backing out of the decision to
reclaim a vnode has been riddled with races and bugs and incoherent
workarounds, so I think it's a Good Thing that hannken@ structured the
logic so that the decision really is final.

For the lfs segment writer, examining a vnode that may be about to be
reclaimed is safe because the lfs segment writer never creates new
references -- actually it just wants to skip all of the vnodes with
ip->i_nlink == 0 because they never need to be written out to disk.
The namecache is a different story, though!


Re: Please review: lookup changes

2020-03-09 Thread Andrew Doran
On Mon, Mar 09, 2020 at 07:59:03PM +, Andrew Doran wrote:

> > Why don't we check our watch immediately when we set IN_CHANGE ,
> > and defer the disk write to update the inode like UFS_WAPBL_UPDATE
> > does -- and just skip UFS_ITIMES in VOP_GETATTR?
> 
> This looks to go back at least as far as Unix 32V (& possibly further) in one
> form or another, from what I can see.  Whatever the original intent I think
> it might have morphed into an attempt to avoid querying the system clock at
> some point, which is about the only point I can see for it now.

Considering this originates in a different era...  I suppose it might also
yield a better chance of observing distinct timestamps, if the system had
poor clock resolution.

Andrew


Re: Please review: lookup changes

2020-03-09 Thread Andrew Doran
On Sun, Mar 08, 2020 at 09:22:28PM +, Taylor R Campbell wrote:

> > Date: Thu, 5 Mar 2020 22:48:25 +
> > From: Andrew Doran 
> > 
> > I'd like to merge the changes on the ad-namecache branch, and would
> > appreciate any code review.  The motivation for these changes is, as
> > you might guess, performance.  I put an overview below.  I won't be
> > merging this for a couple of weeks in any case.
> 
> Thanks for working on this.  Lookups are a big point of contention, so
> it's worth putting effort into fixing that.  At the same time, I want
> to make sure we don't regress into the vfs locking insanity we had ten
> years ago; dholland@ and especially hannken@ have put a lot of work
> into cleaning it up and making it understandable and maintainable.
> Please make sure to wait until hannken@ has had an opportunity to weigh
> in on it.

Absolutely.  Some philosophical ramblings:

I see the namecache as a thing that allows you to avoid re-doing expensive
stuff that you've already done once before - whether that's disc I/O and
metadata crawls (old application) or multiprocessor locking (the app here).

Were it 1993 I might well be up for ripping out vnode locking and going back
to inode locking, but we're way too far down the chosen path now, and there
has been major effort invested by many people over the years to work the
vnode locking into shape, as you note.  It works very well now, and I don't
think we should mess with it too much.

With that in mind I have tried do this in a way that requires as few changes
to the vnode locking and file systems as possible: use the namecache to do
things fast whenever possible, but preserve the strongly locked path.

> > If the file system exports IMNT_NCLOOKUP, this now tries to "fast
> > forward" through as many component names as possible by looking
> > directly in the namecache, without taking vnode locks nor vnode
> > refs, and stops when the leaf is reached or there is something that
> > can't be handled using the namecache.  In the most optimistic case a
> > vnode reference is taken only at the root and the leaf. 
> > IMNT_NCLOOKUP is implemented right now only for tmpfs, ffs.
> 
> What file systems can't support IMNT_NCLOOKUP and why?

Good question.  Other than layered file systems I don't know for sure, but I
think it can be done for most FSes.  NFS might have its own complications; I
haven't looked yet.  IMNT_NCLOOKUP is more of a feature gate than anything
else.  The file system hooks are unintrusive and few in number but adding
those hooks takes careful, time consuming code inspection to make sure
nothing is missed.

> > Layered file systems can't work that way (names are cached below),
> 
> It seems there's a tradeoff here:
> 
> 1. We could cache layered vnodes, at the cost of doubling the size of
>the namecache.
> 
> 2. We could _not_ cache layered vnodes, at the cost of having to redo
>layer_node_create and vcache_get for every lookup -- a process
>which scales worse than namecache lookups (global vcache_lock).
> 
> Am I missing anything else about this?  Is this tradeoff the only
> reason that it's still up to the vnode operations to call cache_lookup
> and cache_enter, rather than something namei does directly -- so that
> layer_lookup can skip them?

As far as I recall layer vnodes persist unless recycled like any other
vnode, so there should be nothing special happening there around contention
due to lifecycle.  I think you could cache or transpose the cached names
above, maybe using some kind of VOP.  I definitely think it's doable but
beyond that, at least for me there are lots of unknowns so I have avoided it.

> > and the permissions check won't work for anything unusual like ACLs,
> > so there is also IMNT_SHRLOOKUP.  If the file system exports
> > IMNT_SHRLOOKUP, then VOP_LOOKUP() for "read only" lookups (e.g.  not
> > RENAME & friends) is tried first with an LK_SHARED lock on the
> > directory.  For an FS that can do the whole lookup this way (e.g. 
> > tmpfs) all is good.  Otherwise the FS is free to return ENOLCK from
> > VOP_LOOKUP() if something tricky comes up, and the lookup gets
> > retried with LK_EXCLUSIVE (e.g. ffs when it needs to scan directory
> > metadata).
> 
> Except for access to the create/rename/remove hints (which we should
> do differently anyway [*]), in what circumstances does ufs lookup need
> an exclusive lock?  Why does scanning directory metadata require that?
>
> [*] Specifically: We should push the last component lookup into
> directory operations, and leave it to *fs_rename/remove/ to invoke
> an internal lookup routine that does the right thing with locks and
> returns the hint to the caller rather than stashing it in the
> directory's struct inode.  This has been intended for several years,
> but dholland and I have had a shortage of round tuits.

I was thinking of the sequential scan optimisation from 4.3BSD ("2passes" in
the 

Re: Please review: lookup changes

2020-03-08 Thread Taylor R Campbell
> Date: Thu, 5 Mar 2020 22:48:25 +
> From: Andrew Doran 
> 
> I'd like to merge the changes on the ad-namecache branch, and would
> appreciate any code review.  The motivation for these changes is, as
> you might guess, performance.  I put an overview below.  I won't be
> merging this for a couple of weeks in any case.

Thanks for working on this.  Lookups are a big point of contention, so
it's worth putting effort into fixing that.  At the same time, I want
to make sure we don't regress into the vfs locking insanity we had ten
years ago; dholland@ and especially hannken@ have put a lot of work
into cleaning it up and making it understandable and maintainable.
Please make sure to wait until hannken@ has had an opportunity to weigh
in on it.

>   If the file system exports IMNT_NCLOOKUP, this now tries to "fast
>   forward" through as many component names as possible by looking
>   directly in the namecache, without taking vnode locks nor vnode
>   refs, and stops when the leaf is reached or there is something that
>   can't be handled using the namecache.  In the most optimistic case a
>   vnode reference is taken only at the root and the leaf. 
>   IMNT_NCLOOKUP is implemented right now only for tmpfs, ffs.

What file systems can't support IMNT_NCLOOKUP and why?

>   Layered file systems can't work that way (names are cached below),

It seems there's a tradeoff here:

1. We could cache layered vnodes, at the cost of doubling the size of
   the namecache.

2. We could _not_ cache layered vnodes, at the cost of having to redo
   layer_node_create and vcache_get for every lookup -- a process
   which scales worse than namecache lookups (global vcache_lock).

Am I missing anything else about this?  Is this tradeoff the only
reason that it's still up to the vnode operations to call cache_lookup
and cache_enter, rather than something namei does directly -- so that
layer_lookup can skip them?

>   and the permissions check won't work for anything unusual like ACLs,
>   so there is also IMNT_SHRLOOKUP.  If the file system exports
>   IMNT_SHRLOOKUP, then VOP_LOOKUP() for "read only" lookups (e.g.  not
>   RENAME & friends) is tried first with an LK_SHARED lock on the
>   directory.  For an FS that can do the whole lookup this way (e.g. 
>   tmpfs) all is good.  Otherwise the FS is free to return ENOLCK from
>   VOP_LOOKUP() if something tricky comes up, and the lookup gets
>   retried with LK_EXCLUSIVE (e.g. ffs when it needs to scan directory
>   metadata).
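
In code, the retry described above might look roughly like this (hypothetical
helpers standing in for the real locking and VOP_LOOKUP() calls; this is not
the branch's namei code):

    #include <errno.h>

    #define LK_SHARED    1
    #define LK_EXCLUSIVE 2

    extern void dir_lock(void *dvp, int lktype);
    extern void dir_unlock(void *dvp);
    extern int  fs_lookup(void *dvp, const char *name, void **vpp);

    int
    lookup_component(void *dvp, const char *name, void **vpp, int rdonly)
    {
            int lktype = rdonly ? LK_SHARED : LK_EXCLUSIVE;
            int error;

            dir_lock(dvp, lktype);
            error = fs_lookup(dvp, name, vpp);
            if (error == ENOLCK && lktype == LK_SHARED) {
                    /* The FS hit something it can't do under a shared
                     * lock (e.g. a directory metadata scan): retry with
                     * the exclusive lock. */
                    dir_unlock(dvp);
                    dir_lock(dvp, LK_EXCLUSIVE);
                    error = fs_lookup(dvp, name, vpp);
            }
            dir_unlock(dvp);
            return error;
    }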

Except for access to the create/rename/remove hints (which we should
do differently anyway [*]), in what circumstances does ufs lookup need
an exclusive lock?  Why does scanning directory metadata require that?

[*] Specifically: We should push the last component lookup into
directory operations, and leave it to *fs_rename/remove/ to invoke
an internal lookup routine that does the right thing with locks and
returns the hint to the caller rather than stashing it in the
directory's struct inode.  This has been intended for several years,
but dholland and I have had a shortage of round tuits.

>   I also added a LOCKSHARED flag to go with LOCKLEAF (get an LK_SHARED
>   lock on the leaf instead of LK_EXCLUSIVE).  This matches FreeBSD.

I worry that after we spent several years getting rid of unnecessary
complexity in the incomprehensible namei flags that rendered the
subsystem unmaintainable, we may be sliding back toward that.  Could
we do this with a separate operation that happens after namei instead?

	NDINIT(&nd, ...);
	error = namei(&nd);
	if (error)
		goto out;
	namei_lockleaf(&nd);

(namei_lockleaf would obviously have to be careful not to lock when
it's a `.' lookup and the parent is already locked.  Perhaps this
won't work -- I haven't thought it through -- but it would be nice to
decrease rather than increase the complexity of namei itself.)

On a related note: Why do VFS_VGET, VFS_ROOT, and VFS_FHTOVP need a
lktype argument?  Why can't they just return the vnode unlocked, and
leave it to the caller to just call vn_lock?

> vfs_syscalls.c:
> 
>   Corrected vnode lock types: LK_SHARED ok for VOP_ACCESS(), but
>   LK_EXCLUSIVE still needed for VOP_GETATTR() due to itimes handling.
>   That's a shame and it would be nice to use a seqlock or something
>   there eventually.

I don't understand this itimes handling.  Why do we ever _set_ any
itimes in VOP_GETATTR?

My understanding is that we try to defer issuing disk writes to update
inode times as long as possible (except in wapbl where it can be done
asynchronously and write-combined -- if nevertheless serially across
the whole file system -- via the log).

That explains why, e.g., ufs has patterns like

ip->i_flag |= IN_CHANGE;
UFS_WAPBL_UPDATE(vp, NULL, NULL, 0);

to indicate that a change just happened and to defer a disk write
either to the next wapbl commit 

Re: Please review: lookup changes

2020-03-08 Thread Andrew Doran
On Thu, Mar 05, 2020 at 10:48:25PM +, Andrew Doran wrote:

>   http://www.netbsd.org/~ad/2020/namecache.diff
>   http://www.netbsd.org/~ad/2020/vfs_lookup.c
>   http://www.netbsd.org/~ad/2020/vfs_cache.c

This was missing one small change:

http://www.netbsd.org/~ad/2020/namecache2.diff

To compile it also needs this first:

cd src/sys/sys
USETOOLS=no make namei

Andrew


Re: Please review: lookup changes

2020-03-08 Thread Andrew Doran
On Sat, Mar 07, 2020 at 12:14:05AM +0100, Mateusz Guzik wrote:

> On 3/5/20, Andrew Doran  wrote:
> > vfs_cache.c:
> >
> >   I changed the namecache to operate using a per-directory index of
> >   names rather than a global index, and changed the reclamation
> >   mechanism for cache entries to ape the VM system's CLOCK algorithm
> >   rather than using LRU / pseudo-LRU.  This made concurrency easier to
> >   manage, and the nasty locking scheme from 5.0 is gone.
> >
> 
> I don't think rb trees (or in fact anything non-hash) is a good idea for
> this purpose. See way below.

Right, I've admitted to the fact that rbtree is not ideal here.  But it is
"good enough", and I'm happy for someone to replace it with some other data
structure later on.
 
> I doubt aging namecache entries is worth the complexity. That said, very
> likely this code does not really win anything in practice.

If there was a 1:1 relationship I would agree but due to links (including
dot and dotdot) it's N:M and I don't fancy changing that behaviour too much
right now.  Definitely something to consider in the future.
 
> >   I also changed the namecache to cache permission information for
> >   directories, so that it can be used to do single component lookups
> >   fully in-cache for file systems that can work that way.
> >
> 
Low priority, but I think this comes with a bug. kauth() has this support
for "listeners" and, as it happens, in the base system at least there is
nobody hooked for vnode lookup. The key point being that normally the vnode
is passed in shared-locked, but fast path lookup avoids this and
consequently, if there is a real-world user of the feature somewhere, they
may be silently broken.

Good point.  Yes that aspect of kauth is a bit of a mess.  Plus it adds a
lot of CPU overhead.  It could do with its own solution for concurrency
that's straightforward, easy to maintain and doesn't have too many tentacles
that depend on concurrency rules in the various subsystems.

> I had a similar problem with the MAC framework and devised a simple flag
> which can be checked upfront to disable fast path lookup.
> 
> So this would have a side effect of removing the call in the first place.
> genfs_can_access is rather pessimal for checking VEXEC. I wrote a dedicated
> routine which only checks for VEXEC and starts with a check for
> (mode & 0111) == 0111 upfront. It happens to almost always be true.
> 
> >   This is described in the comments in the file.
> >
> > vfs_getcwd.c:
> >
> >   The design pattern here was to hold vnode locks as long as possible
>> >   and clearly delineate the handoff from one lock to another.  It's a
> >   nice way of working but not good for MP performance, and not needed
> >   for correctness.  I pushed the locks back as far as possible and try
> >   to do the lookup in the namecache without taking vnode locks (not
> >   needed to look at namecache).
> >
> > vfs_lookup.c:
> >
> >   Like vfs_getcwd.c the vnode locking is pushed back far as possible.
> >
> >   If the file system exports IMNT_NCLOOKUP, this now tries to "fast
> >   forward" through as many component names as possible by looking
> >   directly in the namecache, without taking vnode locks nor vnode
> >   refs, and stops when the leaf is reached or there is something that
> >   can't be handled using the namecache.  In the most optimistic case a
> >   vnode reference is taken only at the root and the leaf.
> >   IMNT_NCLOOKUP is implemented right now only for tmpfs, ffs.
> >
> >   Layered file systems can't work that way (names are cached below),
> >   and the permissions check won't work for anything unusual like ACLs,
> >   so there is also IMNT_SHRLOOKUP.  If the file system exports
> >   IMNT_SHRLOOKUP, then VOP_LOOKUP() for "read only" lookups (e.g.  not
> >   RENAME & friends) is tried first with an LK_SHARED lock on the
> >   directory.  For an FS that can do the whole lookup this way (e.g.
> >   tmpfs) all is good.  Otherwise the FS is free to return ENOLCK from
> >   VOP_LOOKUP() if something tricky comes up, and the lookup gets
> >   retried with LK_EXCLUSIVE (e.g. ffs when it needs to scan directory
> >   metadata).
> >
> >   I also added a LOCKSHARED flag to go with LOCKLEAF (get an LK_SHARED
> >   lock on the leaf instead of LK_EXCLUSIVE).  This matches FreeBSD.
> >
> > vfs_syscalls.c:
> >
> >   Corrected vnode lock types: LK_SHARED ok for VOP_ACCESS(), but
> >   LK_EXCLUSIVE still needed for VOP_GETATTR() due to itimes handling.
> >   That's a shame and it would be nice to use a seqlock or something
> >   there eventually.
> >
> 
> I'm considering doing something like this myself. Lookup doing this could
> avoid even referencing the vnode in the first place in the common case.
> 
> The itimes situation is very broken in my opinion. Filesystems mark vnode
> as