Re: [Cluster-devel] RFC: hold i_rwsem until aio completes

2020-02-03 Thread Dave Chinner
On Mon, Feb 03, 2020 at 06:46:41PM +0100, Christoph Hellwig wrote:
> On Sat, Jan 18, 2020 at 08:28:38PM +1100, Dave Chinner wrote:
> > I think it's pretty gross, actually. It  makes the same mistake made
> > with locking in the old direct IO code - it encodes specific lock
> > operations via flags into random locations in the DIO path. This is
> > a very slippery slope, and IMO it is an layering violation to encode
> > specific filesystem locking smeantics into a layer that is supposed
> > to be generic and completely filesystem agnostic. i.e.  this
> > mechanism breaks if a filesystem moves to a different type of lock
> > (e.g. range locks), and history teaches us that we'll end up making
> > a horrible, unmaintainable mess to support different locking
> > mechanisms and contexts.
> > 
> > I think that we should be moving to a model where the filesystem
> > provides an unlock method in the iomap operations structure, and if
> > the method is present in iomap_dio_complete() it gets called for the
> > filesystem to unlock the inode at the appropriate point. This also
> > allows the filesystem to provide a different method for read or
> > write unlock, depending on what type of lock it held at submission.
> > This gets rid of the need for the iomap code to know what type of
> > lock the caller holds, too.
> 
> I'd rather avoid yet another method.  But I think with a little
> tweaking we can move the unlock into the ->end_io method.

That would work, too :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Cluster-devel] RFC: hold i_rwsem until aio completes

2020-02-03 Thread Christoph Hellwig
On Sat, Jan 18, 2020 at 08:28:38PM +1100, Dave Chinner wrote:
> I think it's pretty gross, actually. It  makes the same mistake made
> with locking in the old direct IO code - it encodes specific lock
> operations via flags into random locations in the DIO path. This is
> a very slippery slope, and IMO it is an layering violation to encode
> specific filesystem locking smeantics into a layer that is supposed
> to be generic and completely filesystem agnostic. i.e.  this
> mechanism breaks if a filesystem moves to a different type of lock
> (e.g. range locks), and history teaches us that we'll end up making
> a horrible, unmaintainable mess to support different locking
> mechanisms and contexts.
> 
> I think that we should be moving to a model where the filesystem
> provides an unlock method in the iomap operations structure, and if
> the method is present in iomap_dio_complete() it gets called for the
> filesystem to unlock the inode at the appropriate point. This also
> allows the filesystem to provide a different method for read or
> write unlock, depending on what type of lock it held at submission.
> This gets rid of the need for the iomap code to know what type of
> lock the caller holds, too.

I'd rather avoid yet another method.  But I think with a little
tweaking we can move the unlock into the ->end_io method.




Re: [Cluster-devel] RFC: hold i_rwsem until aio completes

2020-02-03 Thread Christoph Hellwig
On Thu, Jan 16, 2020 at 03:00:04PM +0100, Jan Kara wrote:
> I'd like to note that using i_dio_count has also one advantage you didn't
> mention. For AIO case, if you need to hold i_rwsem in exclusive mode,
> holding the i_rwsem just for submission part is a significant performance
> advantage (shorter lock hold times allow for higher IO parallelism). I
> guess this could be mitigated by downgrading the lock to shared mode
> once the IO is submitted. But there will be still some degradation visible
> for the cases of mixed exclusive and shared acquisitions because shared
> holders will be blocking exclusive ones for longer time.
> 
> This may be especially painful for filesystems that don't implement DIO
> overwrites with i_rwsem in shared mode...

True.  Fortunately there are patches for ext4 out to move over to that
scheme.  gfs2 will need a little more attention, but that also for other
reasons.




Re: [Cluster-devel] RFC: hold i_rwsem until aio completes

2020-01-18 Thread Matthew Wilcox
On Wed, Jan 15, 2020 at 03:33:47PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 15, 2020 at 09:24:28AM -0400, Jason Gunthorpe wrote:
> 
> > I was interested because you are talking about allowing the read/write side
> > of a rw sem to be held across a return to user space/etc, which is the
> > same basic problem.
> 
> No it is not; allowing the lock to be held across userspace doesn't
> change the owner. This is a crucial difference, PI depends on there
> being a distinct owner. That said, allowing the lock to be held across
> userspace still breaks PI in that it completely wrecks the ability to
> analyze the critical section.

Thinking about this from a PI point of view, the problem is not that we
returned to userspace still holding the lock, it's that boosting this
process's priority will not help release the lock faster because this
process no longer owns the lock.

If we had a lock owner handoff API (ie I can donate my lock to another
owner), that would solve this problem.  We'd want to have special owners
to denote "RCU" "bottom halves" or "irq" so we know what we can do about
PI.  I don't think we need a "I have stolen this lock from somebody else"
API, but maybe I'm wrong there.




Re: [Cluster-devel] RFC: hold i_rwsem until aio completes

2020-01-18 Thread Dave Chinner
On Tue, Jan 14, 2020 at 05:12:13PM +0100, Christoph Hellwig wrote:
> Hi all,
> 
> Asynchronous read/write operations currently use a rather magic locking
> scheme, were access to file data is normally protected using a rw_semaphore,
> but if we are doing aio where the syscall returns to userspace before the
> I/O has completed we also use an atomic_t to track the outstanding aio
> ops.  This scheme has lead to lots of subtle bugs in file systems where
> didn't wait to the count to reach zero, and due to its adhoc nature also
> means we have to serialize direct I/O writes that are smaller than the
> file system block size.
> 
> All this is solved by releasing i_rwsem only when the I/O has actually
> completed, but doings so is against to mantras of Linux locking primites:
> 
>  (1) no unlocking by another process than the one that acquired it
>  (2) no return to userspace with locks held
> 
> It actually happens we have various places that work around this.  A few
> callers do non-owner unlocks of rwsems, which are pretty nasty for
> PREEMPT_RT as the owner tracking doesn't work.  OTOH the file system
> freeze code has both problems and works around them a little better,
> although in a somewhat awkward way, in that it releases the lockdep
> object when returning to userspace, and reacquires it when done, and
> also clears the rwsem owner when returning to userspace, and then sets
> the new onwer before unlocking.
> 
> This series tries to follow that scheme, also it doesn't fully work.  The
> first issue is that the rwsem code has a bug where it doesn't properly
> handle clearing the owner.  This series has a patch to fix that, but it
> is ugly and might not be correct so some help is needed.  Second I/O
> completions often come from interrupt context, which means the re-acquire
> is recorded as from irq context, leading to warnings about incorrect
> contexts.  I wonder if we could just have a bit in lockdep that says
> returning to userspace is ok for this particular lock?  That would also
> clean up the fsfreeze situation a lot.
> 
> Let me know what you think of all this.  While I converted all the iomap
> using file systems only XFS is actually tested.

I think it's pretty gross, actually. It  makes the same mistake made
with locking in the old direct IO code - it encodes specific lock
operations via flags into random locations in the DIO path. This is
a very slippery slope, and IMO it is an layering violation to encode
specific filesystem locking smeantics into a layer that is supposed
to be generic and completely filesystem agnostic. i.e.  this
mechanism breaks if a filesystem moves to a different type of lock
(e.g. range locks), and history teaches us that we'll end up making
a horrible, unmaintainable mess to support different locking
mechanisms and contexts.

I think that we should be moving to a model where the filesystem
provides an unlock method in the iomap operations structure, and if
the method is present in iomap_dio_complete() it gets called for the
filesystem to unlock the inode at the appropriate point. This also
allows the filesystem to provide a different method for read or
write unlock, depending on what type of lock it held at submission.
This gets rid of the need for the iomap code to know what type of
lock the caller holds, too.

This way we always have a consistent point in the AIO/DIO completion
model where the inode gets unlocked, it only gets called for the IO
contexts where the filesystem wants/needs unlock on IO competion
semantics, and it's completely filesystem IO-lock implementation
agnostic. And for filesystems that use the inode i_rwsem, we can
just provide simple helper functions for their read/write unlock
methods.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com




Re: [Cluster-devel] RFC: hold i_rwsem until aio completes

2020-01-16 Thread Jan Kara
Hello!

On Tue 14-01-20 17:12:13, Christoph Hellwig wrote:
> Asynchronous read/write operations currently use a rather magic locking
> scheme, were access to file data is normally protected using a rw_semaphore,
> but if we are doing aio where the syscall returns to userspace before the
> I/O has completed we also use an atomic_t to track the outstanding aio
> ops.  This scheme has lead to lots of subtle bugs in file systems where
> didn't wait to the count to reach zero, and due to its adhoc nature also
> means we have to serialize direct I/O writes that are smaller than the
> file system block size.
> 
> All this is solved by releasing i_rwsem only when the I/O has actually
> completed, but doings so is against to mantras of Linux locking primites:
> 
>  (1) no unlocking by another process than the one that acquired it
>  (2) no return to userspace with locks held

I'd like to note that using i_dio_count has also one advantage you didn't
mention. For AIO case, if you need to hold i_rwsem in exclusive mode,
holding the i_rwsem just for submission part is a significant performance
advantage (shorter lock hold times allow for higher IO parallelism). I
guess this could be mitigated by downgrading the lock to shared mode
once the IO is submitted. But there will be still some degradation visible
for the cases of mixed exclusive and shared acquisitions because shared
holders will be blocking exclusive ones for longer time.

This may be especially painful for filesystems that don't implement DIO
overwrites with i_rwsem in shared mode...


Honza
-- 
Jan Kara 
SUSE Labs, CR




Re: [Cluster-devel] RFC: hold i_rwsem until aio completes

2020-01-15 Thread Waiman Long
On 1/15/20 9:49 AM, Jason Gunthorpe wrote:
> On Wed, Jan 15, 2020 at 03:33:47PM +0100, Peter Zijlstra wrote:
>> On Wed, Jan 15, 2020 at 09:24:28AM -0400, Jason Gunthorpe wrote:
>>
>>> I was interested because you are talking about allowing the read/write side
>>> of a rw sem to be held across a return to user space/etc, which is the
>>> same basic problem.
>> No it is not; allowing the lock to be held across userspace doesn't
>> change the owner. This is a crucial difference, PI depends on there
>> being a distinct owner. That said, allowing the lock to be held across
>> userspace still breaks PI in that it completely wrecks the ability to
>> analyze the critical section.
> I'm not sure what you are contrasting?
>
> I was remarking that I see many places open code a rwsem using an
> atomic and a completion specifically because they need to do the
> things Christoph identified:
>
>> (1) no unlocking by another process than the one that acquired it
>> (2) no return to userspace with locks held
> As an example flow: obtain the read side lock, schedual a work queue,
> return to user space, and unlock the read side from the work queue.

We currently have down_read_non_owner() and up_read_non_owner() that
perform the lock and unlock without lockdep tracking. Of course, that is
a hack and their use must be carefully scrutinized to make sure that
there is no deadlock or other potentially locking issues.

Cheers,
Longman



Re: [Cluster-devel] RFC: hold i_rwsem until aio completes

2020-01-15 Thread Jason Gunthorpe
On Wed, Jan 15, 2020 at 04:36:14PM +0100, Christoph Hellwig wrote:

> synchronous and currently hack that up, so a version of the percpu_ref
> that actually waits for the other references to away like we hacked
> up various places seems to exactly suit your requirements.

Ah, yes, sounds like a similar goal, many cases I'm thinking about
also hack up a kref to trigger a completion to make it
synchronous.

Jason




Re: [Cluster-devel] RFC: hold i_rwsem until aio completes

2020-01-15 Thread Christoph Hellwig
On Wed, Jan 15, 2020 at 09:24:28AM -0400, Jason Gunthorpe wrote:
> > Your requirement seems a little different, and in fact in many ways
> > similar to the percpu_ref primitive.
> 
> I was interested because you are talking about allowing the read/write side
> of a rw sem to be held across a return to user space/etc, which is the
> same basic problem.
> 
> precpu refcount looks more like a typical refcount with a release that
> is called by whatever context does the final put. The point above is
> to basically move the release of a refcount into a synchrnous path by
> introducing some barrier to wait for the refcount to go to zero. In
> the above the barrier is the down_write() as it is really closer to a
> rwsem than a refcount.

No, percpu_ref is a little different than the name suggests, as it has
a magic initial reference, and then the other short term reference.  To
actually tear it down now just a normal put of the reference is needed,
but an explicit percpu_ref_kill or percpu_ref_kill_and_confirm.  Various
callers (including all I added) would like that operation to be
synchronous and currently hack that up, so a version of the percpu_ref
that actually waits for the other references to away like we hacked
up various places seems to exactly suit your requirements.




Re: [Cluster-devel] RFC: hold i_rwsem until aio completes

2020-01-15 Thread Jason Gunthorpe
On Wed, Jan 15, 2020 at 03:33:47PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 15, 2020 at 09:24:28AM -0400, Jason Gunthorpe wrote:
> 
> > I was interested because you are talking about allowing the read/write side
> > of a rw sem to be held across a return to user space/etc, which is the
> > same basic problem.
> 
> No it is not; allowing the lock to be held across userspace doesn't
> change the owner. This is a crucial difference, PI depends on there
> being a distinct owner. That said, allowing the lock to be held across
> userspace still breaks PI in that it completely wrecks the ability to
> analyze the critical section.

I'm not sure what you are contrasting?

I was remarking that I see many places open code a rwsem using an
atomic and a completion specifically because they need to do the
things Christoph identified:

> (1) no unlocking by another process than the one that acquired it
> (2) no return to userspace with locks held

As an example flow: obtain the read side lock, schedual a work queue,
return to user space, and unlock the read side from the work queue.

If we can make some general primative that addresses this then maybe
those open coded places can convert as well?

Regards,
Jason




Re: [Cluster-devel] RFC: hold i_rwsem until aio completes

2020-01-15 Thread Peter Zijlstra
On Wed, Jan 15, 2020 at 09:24:28AM -0400, Jason Gunthorpe wrote:

> I was interested because you are talking about allowing the read/write side
> of a rw sem to be held across a return to user space/etc, which is the
> same basic problem.

No it is not; allowing the lock to be held across userspace doesn't
change the owner. This is a crucial difference, PI depends on there
being a distinct owner. That said, allowing the lock to be held across
userspace still breaks PI in that it completely wrecks the ability to
analyze the critical section.




Re: [Cluster-devel] RFC: hold i_rwsem until aio completes

2020-01-15 Thread Jason Gunthorpe
On Wed, Jan 15, 2020 at 07:56:14AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 14, 2020 at 03:27:00PM -0400, Jason Gunthorpe wrote:
> > I've seen similar locking patterns quite a lot, enough I've thought
> > about having a dedicated locking primitive to do it. It really wants
> > to be a rwsem, but as here the rwsem rules don't allow it.
> > 
> > The common pattern I'm looking at looks something like this:
> > 
> >  'try begin read'() // aka down_read_trylock()
> > 
> >   /* The lockdep release hackery you describe,
> >  the rwsem remains read locked */
> >  'exit reader'()
> > 
> >  .. delegate unlock to work queue, timer, irq, etc ..
> > 
> > in the new context:
> > 
> >  're_enter reader'() // Get our lockdep tracking back
> > 
> >  'end reader'() // aka up_read()
> > 
> > vs a typical write side:
> > 
> >  'begin write'() // aka down_write()
> > 
> >  /* There is no reason to unlock it before kfree of the rwsem memory.
> > Somehow the user prevents any new down_read_trylock()'s */
> >  'abandon writer'() // The object will be kfree'd with a locked writer
> >  kfree()
> > 
> > The typical goal is to provide an object destruction path that can
> > serialize and fence all readers wherever they may be before proceeding
> > to some synchronous destruction.
> > 
> > Usually this gets open coded with some atomic/kref/refcount and a
> > completion or wait queue. Often implemented wrongly, lacking the write
> > favoring bias in the rwsem, and lacking any lockdep tracking on the
> > naked completion.
> > 
> > Not to discourage your patch, but to ask if we can make the solution
> > more broadly applicable?
> 
> Your requirement seems a little different, and in fact in many ways
> similar to the percpu_ref primitive.

I was interested because you are talking about allowing the read/write side
of a rw sem to be held across a return to user space/etc, which is the
same basic problem.

precpu refcount looks more like a typical refcount with a release that
is called by whatever context does the final put. The point above is
to basically move the release of a refcount into a synchrnous path by
introducing some barrier to wait for the refcount to go to zero. In
the above the barrier is the down_write() as it is really closer to a
rwsem than a refcount.

Thanks,
Jason




Re: [Cluster-devel] RFC: hold i_rwsem until aio completes

2020-01-15 Thread Jason Gunthorpe
On Tue, Jan 14, 2020 at 05:12:13PM +0100, Christoph Hellwig wrote:
> Hi all,
> 
> Asynchronous read/write operations currently use a rather magic locking
> scheme, were access to file data is normally protected using a rw_semaphore,
> but if we are doing aio where the syscall returns to userspace before the
> I/O has completed we also use an atomic_t to track the outstanding aio
> ops.  This scheme has lead to lots of subtle bugs in file systems where
> didn't wait to the count to reach zero, and due to its adhoc nature also
> means we have to serialize direct I/O writes that are smaller than the
> file system block size.

I've seen similar locking patterns quite a lot, enough I've thought
about having a dedicated locking primitive to do it. It really wants
to be a rwsem, but as here the rwsem rules don't allow it.

The common pattern I'm looking at looks something like this:

 'try begin read'() // aka down_read_trylock()

  /* The lockdep release hackery you describe,
 the rwsem remains read locked */
 'exit reader'()

 .. delegate unlock to work queue, timer, irq, etc ..

in the new context:

 're_enter reader'() // Get our lockdep tracking back

 'end reader'() // aka up_read()

vs a typical write side:

 'begin write'() // aka down_write()

 /* There is no reason to unlock it before kfree of the rwsem memory.
Somehow the user prevents any new down_read_trylock()'s */
 'abandon writer'() // The object will be kfree'd with a locked writer
 kfree()

The typical goal is to provide an object destruction path that can
serialize and fence all readers wherever they may be before proceeding
to some synchronous destruction.

Usually this gets open coded with some atomic/kref/refcount and a
completion or wait queue. Often implemented wrongly, lacking the write
favoring bias in the rwsem, and lacking any lockdep tracking on the
naked completion.

Not to discourage your patch, but to ask if we can make the solution
more broadly applicable?

Thanks,
Jason




Re: [Cluster-devel] RFC: hold i_rwsem until aio completes

2020-01-14 Thread Christoph Hellwig
On Tue, Jan 14, 2020 at 03:27:00PM -0400, Jason Gunthorpe wrote:
> I've seen similar locking patterns quite a lot, enough I've thought
> about having a dedicated locking primitive to do it. It really wants
> to be a rwsem, but as here the rwsem rules don't allow it.
> 
> The common pattern I'm looking at looks something like this:
> 
>  'try begin read'() // aka down_read_trylock()
> 
>   /* The lockdep release hackery you describe,
>  the rwsem remains read locked */
>  'exit reader'()
> 
>  .. delegate unlock to work queue, timer, irq, etc ..
> 
> in the new context:
> 
>  're_enter reader'() // Get our lockdep tracking back
> 
>  'end reader'() // aka up_read()
> 
> vs a typical write side:
> 
>  'begin write'() // aka down_write()
> 
>  /* There is no reason to unlock it before kfree of the rwsem memory.
> Somehow the user prevents any new down_read_trylock()'s */
>  'abandon writer'() // The object will be kfree'd with a locked writer
>  kfree()
> 
> The typical goal is to provide an object destruction path that can
> serialize and fence all readers wherever they may be before proceeding
> to some synchronous destruction.
> 
> Usually this gets open coded with some atomic/kref/refcount and a
> completion or wait queue. Often implemented wrongly, lacking the write
> favoring bias in the rwsem, and lacking any lockdep tracking on the
> naked completion.
> 
> Not to discourage your patch, but to ask if we can make the solution
> more broadly applicable?

Your requirement seems a little different, and in fact in many ways
similar to the percpu_ref primitive.




Re: [Cluster-devel] RFC: hold i_rwsem until aio completes

2020-01-14 Thread Christoph Hellwig
On Tue, Jan 14, 2020 at 10:47:07AM -0800, Matthew Wilcox wrote:
> It would be helpful if we could also use the same lockdep logic
> for PageLocked.  Again, it's a case where returning to userspace with
> PageLock held is fine, because we're expecting an interrupt to come in
> and drop the lock for us.

Yes, this is a very typical pattern for I/O.  Besides the page and
buffer head bit locks it also applies to the semaphore in the xfs_buf
structure and probably various other places that currently used hand
crafted or legacy locking primitives to escape lockdep.

> Perhaps the right answer is, from lockdep's point of view, to mark the
> lock as being released at the point where we submit the I/O.  Then
> in the completion path release the lock without telling lockdep we
> released it.

That is similar to what the fsfreeze code does, but I don't think it
is very optimal, as misses to track any dependencies after I/O
submission, and at least some of the completions paths do take
locks.  But it might be a start.




Re: [Cluster-devel] RFC: hold i_rwsem until aio completes

2020-01-14 Thread Matthew Wilcox
On Tue, Jan 14, 2020 at 05:12:13PM +0100, Christoph Hellwig wrote:
> Second I/O
> completions often come from interrupt context, which means the re-acquire
> is recorded as from irq context, leading to warnings about incorrect
> contexts.  I wonder if we could just have a bit in lockdep that says
> returning to userspace is ok for this particular lock?  That would also
> clean up the fsfreeze situation a lot.

It would be helpful if we could also use the same lockdep logic
for PageLocked.  Again, it's a case where returning to userspace with
PageLock held is fine, because we're expecting an interrupt to come in
and drop the lock for us.

Perhaps the right answer is, from lockdep's point of view, to mark the
lock as being released at the point where we submit the I/O.  Then
in the completion path release the lock without telling lockdep we
released it.

That would catch cases where we inadvertently returned to userspace
without submitting the I/O, for example.