Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-09-04 Thread Ira Weiny
On Tue, Sep 03, 2019 at 08:26:18AM +1000, Dave Chinner wrote:
> On Wed, Aug 28, 2019 at 07:02:31PM -0700, Ira Weiny wrote:
> > On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> > > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > > > "Leases are associated with an open file description (see open(2)).  
> > > > This means
> > > > that duplicate file descriptors (created by, for example, fork(2) or 
> > > > dup(2))
> > > > refer to the same lease, and this lease may be modified or released 
> > > > using any
> > > > of these descriptors.  Furthermore,  the lease is released by either an
> > > > explicit F_UNLCK operation on any of these duplicate file descriptors, 
> > > > or when
> > > > all such file descriptors have been closed."
> > > 
> > > Right, the lease is attached to the struct file, so it follows
> > > where-ever the struct file goes. That doesn't mean it's actually
> > > useful when the struct file is duplicated and/or passed to another
> > > process. :/
> > > 
> > > AFAICT, the problem is that when we take another reference to the
> > > struct file, or when the struct file is passed to a different
> > > process, nothing updates the lease or lease state attached to that
> > > struct file.
> > 
> > Ok, I probably should have made this more clear in the cover letter but 
> > _only_
> > the process which took the lease can actually pin memory.
> 
> Sure, no question about that.
> 
> > That pinned memory _can_ be passed to another process but those 
> > sub-process' can
> > _not_ use the original lease to pin _more_ of the file.  They would need to
> > take their own lease to do that.
> 
> Yes, they would need a new lease to extend it. But that ignores the
> fact they don't have a lease on the existing pins they are using and
> have no control over the lease those pins originated under.  e.g.
> the originating process dies (for whatever reason) and now we have
> pins without a valid lease holder.

Define "valid lease holder"?

> 
> If something else now takes an exclusive lease on the file (because
> the original exclusive lease no longer exists), it's not going to
> work correctly because of the zombied page pins caused by closing
> the exclusive lease they were gained under. IOWs, pages pinned under
> an exclusive lease are no longer "exclusive" the moment the original
> exclusive lease is dropped, and pins passed to another process are
> no longer covered by the original lease they were created under.

The page pins are not zombied the lease is.  The lease still exists, it can't
be dropped while the pins are in place.  I need to double check the
implementation but that was the intent.

Yep just did a quick check, I have a test for that.  If the page pins exist
then the lease can _not_ be released.  Closing the FD will "zombie" the lease
but it and the struct file will still exist until the pins go away.

Furthermore, a "zombie" lease is _not_ sufficient to pin more pages.  (I have a
test for this too.)  I apologize that I don't have something to submit to
xfstests.  I'm new to that code base.

I'm happy to share the code I have which I've been using to test...  But it is
pretty rough as it has undergone a number of changes.  I think it would be
better to convert my test series to xfstests.

However, I don't know if it is ok to require RDMA within those tests.  Right
now that is the only sub-system I have allowed to create these page pins.  So
I'm not sure what to do at this time.  I'm open to suggestions.

> 
> > Sorry for not being clear on that.
> 
> I know exactly what you are saying. What I'm failing to get across
> is that file layout leases don't actually allow the behaviour you
> want to have.

Not currently, no.  But we are discussing the semantics to allow them _to_ have
the behavior needed.

> 
> > > As such, leases that require callbacks to userspace are currently
> > > only valid within the process context the lease was taken in.
> > 
> > But for long term pins we are not requiring callbacks.
> 
> Regardless, we still require an active lease for long term pins so
> that other lease holders fail operations appropriately. And that
> exclusive lease must follow the process that pins the pages so that
> the life cycle is the same...

I disagree.  See below.

> 
> > > Indeed, even closing the fd the lease was taken on without
> > > F_UNLCKing it first doesn't mean the lease has been torn down if
> > > there is some other reference to the struct file. That means the
> > > original lease owner will still get SIGIO delivered to that fd on a
> > > lease break regardless of whether it is open or not. ANd if we
> > > implement "layout lease not released within SIGIO response timeout"
> > > then that process will get killed, despite the fact it may not even
> > > have a reference to that file anymore.
> > 
> > I'm not seeing 

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-09-02 Thread Dave Chinner
On Wed, Aug 28, 2019 at 07:02:31PM -0700, Ira Weiny wrote:
> On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > > "Leases are associated with an open file description (see open(2)).  This 
> > > means
> > > that duplicate file descriptors (created by, for example, fork(2) or 
> > > dup(2))
> > > refer to the same lease, and this lease may be modified or released using 
> > > any
> > > of these descriptors.  Furthermore,  the lease is released by either an
> > > explicit F_UNLCK operation on any of these duplicate file descriptors, or 
> > > when
> > > all such file descriptors have been closed."
> > 
> > Right, the lease is attached to the struct file, so it follows
> > where-ever the struct file goes. That doesn't mean it's actually
> > useful when the struct file is duplicated and/or passed to another
> > process. :/
> > 
> > AFAICT, the problem is that when we take another reference to the
> > struct file, or when the struct file is passed to a different
> > process, nothing updates the lease or lease state attached to that
> > struct file.
> 
> Ok, I probably should have made this more clear in the cover letter but _only_
> the process which took the lease can actually pin memory.

Sure, no question about that.

> That pinned memory _can_ be passed to another process but those sub-process' 
> can
> _not_ use the original lease to pin _more_ of the file.  They would need to
> take their own lease to do that.

Yes, they would need a new lease to extend it. But that ignores the
fact they don't have a lease on the existing pins they are using and
have no control over the lease those pins originated under.  e.g.
the originating process dies (for whatever reason) and now we have
pins without a valid lease holder.

If something else now takes an exclusive lease on the file (because
the original exclusive lease no longer exists), it's not going to
work correctly because of the zombied page pins caused by closing
the exclusive lease they were gained under. IOWs, pages pinned under
an exclusive lease are no longer "exclusive" the moment the original
exclusive lease is dropped, and pins passed to another process are
no longer covered by the original lease they were created under.

> Sorry for not being clear on that.

I know exactly what you are saying. What I'm failing to get across
is that file layout leases don't actually allow the behaviour you
want to have.

> > As such, leases that require callbacks to userspace are currently
> > only valid within the process context the lease was taken in.
> 
> But for long term pins we are not requiring callbacks.

Regardless, we still require an active lease for long term pins so
that other lease holders fail operations appropriately. And that
exclusive lease must follow the process that pins the pages so that
the life cycle is the same...

> > Indeed, even closing the fd the lease was taken on without
> > F_UNLCKing it first doesn't mean the lease has been torn down if
> > there is some other reference to the struct file. That means the
> > original lease owner will still get SIGIO delivered to that fd on a
> > lease break regardless of whether it is open or not. ANd if we
> > implement "layout lease not released within SIGIO response timeout"
> > then that process will get killed, despite the fact it may not even
> > have a reference to that file anymore.
> 
> I'm not seeing that as a problem.  This is all a result of the application
> failing to do the right thing.

How is that not a problem?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-29 Thread Ira Weiny
On Wed, Aug 28, 2019 at 08:27:23PM -0700, John Hubbard wrote:
> On 8/28/19 7:02 PM, Ira Weiny wrote:
> > On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> > > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> ...
> > > 
> > > Sure, that part works because the struct file is passed. It doesn't
> > > end up with the same fd number in the other process, though.
> > > 
> > > The issue is that layout leases need to notify userspace when they
> > > are broken by the kernel, so a lease stores the owner pid/tid in the
> > > file->f_owner field via __f_setown(). It also keeps a struct fasync
> > > attached to the file_lock that records the fd that the lease was
> > > created on.  When a signal needs to be sent to userspace for that
> > > lease, we call kill_fasync() and that walks the list of fasync
> > > structures on the lease and calls:
> > > 
> > >   send_sigio(fown, fa->fa_fd, band);
> > > 
> > > And it does for every fasync struct attached to a lease. Yes, a
> > > lease can track multiple fds, but it can only track them in a single
> > > process context. The moment the struct file is shared with another
> > > process, the lease is no longer capable of sending notifications to
> > > all the lease holders.
> > > 
> > > Yes, you can change the owning process via F_SETOWNER, but that's
> > > still only a single process context, and you can't change the fd in
> > > the fasync list. You can add new fd to an existing lease by calling
> > > F_SETLEASE on the new fd, but you still only have a single process
> > > owner context for signal delivery.
> > > 
> > > As such, leases that require callbacks to userspace are currently
> > > only valid within the process context the lease was taken in.
> > 
> > But for long term pins we are not requiring callbacks.
> > 
> 
> Hi Ira,
> 
> If "require callbacks to userspace" means sending SIGIO, then actually
> FOLL_LONGTERM *does* require those callbacks. Because we've been, so
> far, equating FOLL_LONGTERM with the vaddr_pin struct and with a lease.
> 
> What am I missing here?

We agreed back in June that the layout lease would have 2 "levels".  The
"normal" layout lease would cause SIGIO and could be broken and another
"exclusive" level which could _not_ be broken.

Because we _can't_ _trust_ user space to react to the SIGIO properly the
"exclusive" lease is required to take the longterm pins.  Also this is the
lease which causes the truncate to fail (return ETXTBSY) because the kernel
can't break the lease.

The vaddr_pin struct in the current RFC is there for a couple of reasons.

1) To ensure that we have a way to correlate the long term pin user with the
   file if the data file FD's are closed.  (ie the application has zombie'd the
   lease).

2) And more importantly as a token the vaddr_pin*() callers use to be able to
   properly ref count the file itself while in use.

Ira



Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-28 Thread John Hubbard

On 8/28/19 7:02 PM, Ira Weiny wrote:

On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:

On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:

On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:

On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:

...


Sure, that part works because the struct file is passed. It doesn't
end up with the same fd number in the other process, though.

The issue is that layout leases need to notify userspace when they
are broken by the kernel, so a lease stores the owner pid/tid in the
file->f_owner field via __f_setown(). It also keeps a struct fasync
attached to the file_lock that records the fd that the lease was
created on.  When a signal needs to be sent to userspace for that
lease, we call kill_fasync() and that walks the list of fasync
structures on the lease and calls:

send_sigio(fown, fa->fa_fd, band);

And it does for every fasync struct attached to a lease. Yes, a
lease can track multiple fds, but it can only track them in a single
process context. The moment the struct file is shared with another
process, the lease is no longer capable of sending notifications to
all the lease holders.

Yes, you can change the owning process via F_SETOWNER, but that's
still only a single process context, and you can't change the fd in
the fasync list. You can add new fd to an existing lease by calling
F_SETLEASE on the new fd, but you still only have a single process
owner context for signal delivery.

As such, leases that require callbacks to userspace are currently
only valid within the process context the lease was taken in.


But for long term pins we are not requiring callbacks.



Hi Ira,

If "require callbacks to userspace" means sending SIGIO, then actually
FOLL_LONGTERM *does* require those callbacks. Because we've been, so
far, equating FOLL_LONGTERM with the vaddr_pin struct and with a lease.

What am I missing here?

thanks,
--
John Hubbard
NVIDIA


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-28 Thread Ira Weiny
On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > >
> > > > IMHO it is wrong to try and create a model where the file lease exists
> > > > independently from the kernel object relying on it. In other words the
> > > > IB MR object itself should hold a reference to the lease it relies
> > > > upon to function properly.
> > > 
> > > That still doesn't work. Leases are not individually trackable or
> > > reference counted objects objects - they are attached to a struct
> > > file bUt, in reality, they are far more restricted than a struct
> > > file.
> > > 
> > > That is, a lease specifically tracks the pid and the _open fd_ it
> > > was obtained for, so it is essentially owned by a specific process
> > > context.  Hence a lease is not able to be passed to a separate
> > > process context and have it still work correctly for lease break
> > > notifications.  i.e. the layout break signal gets delivered to
> > > original process that created the struct file, if it still exists
> > > and has the original fd still open. It does not get sent to the
> > > process that currently holds a reference to the IB context.

But this is an exclusive layout lease which does not send a signal.  There is
no way to break it.

> > >
> > 
> > The fcntl man page says:
> > 
> > "Leases are associated with an open file description (see open(2)).  This 
> > means
> > that duplicate file descriptors (created by, for example, fork(2) or dup(2))
> > refer to the same lease, and this lease may be modified or released using 
> > any
> > of these descriptors.  Furthermore,  the lease is released by either an
> > explicit F_UNLCK operation on any of these duplicate file descriptors, or 
> > when
> > all such file descriptors have been closed."
> 
> Right, the lease is attached to the struct file, so it follows
> where-ever the struct file goes. That doesn't mean it's actually
> useful when the struct file is duplicated and/or passed to another
> process. :/
> 
> AFAICT, the problem is that when we take another reference to the
> struct file, or when the struct file is passed to a different
> process, nothing updates the lease or lease state attached to that
> struct file.

Ok, I probably should have made this more clear in the cover letter but _only_
the process which took the lease can actually pin memory.

That pinned memory _can_ be passed to another process but those sub-process' can
_not_ use the original lease to pin _more_ of the file.  They would need to
take their own lease to do that.

Sorry for not being clear on that.

> 
> > From this I took it that the child process FD would have the lease as well
> > _and_ could release it.  I _assumed_ that applied to SCM_RIGHTS but it does 
> > not
> > seem to work the same way as dup() so I'm not so sure.
> 
> Sure, that part works because the struct file is passed. It doesn't
> end up with the same fd number in the other process, though.
> 
> The issue is that layout leases need to notify userspace when they
> are broken by the kernel, so a lease stores the owner pid/tid in the
> file->f_owner field via __f_setown(). It also keeps a struct fasync
> attached to the file_lock that records the fd that the lease was
> created on.  When a signal needs to be sent to userspace for that
> lease, we call kill_fasync() and that walks the list of fasync
> structures on the lease and calls:
> 
>   send_sigio(fown, fa->fa_fd, band);
> 
> And it does for every fasync struct attached to a lease. Yes, a
> lease can track multiple fds, but it can only track them in a single
> process context. The moment the struct file is shared with another
> process, the lease is no longer capable of sending notifications to
> all the lease holders.
> 
> Yes, you can change the owning process via F_SETOWNER, but that's
> still only a single process context, and you can't change the fd in
> the fasync list. You can add new fd to an existing lease by calling
> F_SETLEASE on the new fd, but you still only have a single process
> owner context for signal delivery.
> 
> As such, leases that require callbacks to userspace are currently
> only valid within the process context the lease was taken in.

But for long term pins we are not requiring callbacks.

> Indeed, even closing the fd the lease was taken on without
> F_UNLCKing it first doesn't mean the lease has been torn down if
> there is some other reference to the struct file. That means the
> original lease owner will still get SIGIO delivered to that fd on a
> lease break regardless of whether it is open or not. ANd if we
> implement "layout lease not released within SIGIO response timeout"
> then that process will get killed, despite the fact it may not even
> have a reference to that file anymore.

I'm not seeing that as a problem.  This is all a result 

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-25 Thread Dave Chinner
On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> > > 
> > > > > But the fact that RDMA, and potentially others, can "pass the
> > > > > pins" to other processes is something I spent a lot of time trying to 
> > > > > work out.
> > > > 
> > > > There's nothing in file layout lease architecture that says you
> > > > can't "pass the pins" to another process.  All the file layout lease
> > > > requirements say is that if you are going to pass a resource for
> > > > which the layout lease guarantees access for to another process,
> > > > then the destination process already have a valid, active layout
> > > > lease that covers the range of the pins being passed to it via the
> > > > RDMA handle.
> > > 
> > > How would the kernel detect and enforce this? There are many ways to
> > > pass a FD.
> > 
> > AFAIC, that's not really a kernel problem. It's more of an
> > application design constraint than anything else. i.e. if the app
> > passes the IB context to another process without a lease, then the
> > original process is still responsible for recalling the lease and
> > has to tell that other process to release the IB handle and it's
> > resources.
> > 
> > > IMHO it is wrong to try and create a model where the file lease exists
> > > independently from the kernel object relying on it. In other words the
> > > IB MR object itself should hold a reference to the lease it relies
> > > upon to function properly.
> > 
> > That still doesn't work. Leases are not individually trackable or
> > reference counted objects objects - they are attached to a struct
> > file bUt, in reality, they are far more restricted than a struct
> > file.
> > 
> > That is, a lease specifically tracks the pid and the _open fd_ it
> > was obtained for, so it is essentially owned by a specific process
> > context.  Hence a lease is not able to be passed to a separate
> > process context and have it still work correctly for lease break
> > notifications.  i.e. the layout break signal gets delivered to
> > original process that created the struct file, if it still exists
> > and has the original fd still open. It does not get sent to the
> > process that currently holds a reference to the IB context.
> >
> 
> The fcntl man page says:
> 
> "Leases are associated with an open file description (see open(2)).  This 
> means
> that duplicate file descriptors (created by, for example, fork(2) or dup(2))
> refer to the same lease, and this lease may be modified or released using any
> of these descriptors.  Furthermore,  the lease is released by either an
> explicit F_UNLCK operation on any of these duplicate file descriptors, or when
> all such file descriptors have been closed."

Right, the lease is attached to the struct file, so it follows
where-ever the struct file goes. That doesn't mean it's actually
useful when the struct file is duplicated and/or passed to another
process. :/

AFAICT, the problem is that when we take another reference to the
struct file, or when the struct file is passed to a different
process, nothing updates the lease or lease state attached to that
struct file.

> From this I took it that the child process FD would have the lease as well
> _and_ could release it.  I _assumed_ that applied to SCM_RIGHTS but it does 
> not
> seem to work the same way as dup() so I'm not so sure.

Sure, that part works because the struct file is passed. It doesn't
end up with the same fd number in the other process, though.

The issue is that layout leases need to notify userspace when they
are broken by the kernel, so a lease stores the owner pid/tid in the
file->f_owner field via __f_setown(). It also keeps a struct fasync
attached to the file_lock that records the fd that the lease was
created on.  When a signal needs to be sent to userspace for that
lease, we call kill_fasync() and that walks the list of fasync
structures on the lease and calls:

send_sigio(fown, fa->fa_fd, band);

And it does for every fasync struct attached to a lease. Yes, a
lease can track multiple fds, but it can only track them in a single
process context. The moment the struct file is shared with another
process, the lease is no longer capable of sending notifications to
all the lease holders.

Yes, you can change the owning process via F_SETOWNER, but that's
still only a single process context, and you can't change the fd in
the fasync list. You can add new fd to an existing lease by calling
F_SETLEASE on the new fd, but you still only have a single process
owner context for signal delivery.

As such, leases that require callbacks to userspace are currently
only valid within the process context the lease was taken in.
Indeed, even closing the fd the lease was taken on without
F_UNLCKing it first doesn't mean the lease has been torn down if
there 

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-25 Thread Jason Gunthorpe
On Fri, Aug 23, 2019 at 09:49:12PM -0700, Ira Weiny wrote:

> So far, I have not been able to get RDMA to have an issue like Jason suggested
> would happen (or used to happen).  So from that perspective it may be ok to
> hang the close.

No, it is not OK to hang the close. You will deadlock on process
destruction when the 'lease fd' hangs waiting for the 'uverbs fd'
which is later in the single threaded destruction sequence.

This is different from the uverbs deadlock I outlined

Jason


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-25 Thread Jason Gunthorpe
On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> > 
> > > > But the fact that RDMA, and potentially others, can "pass the
> > > > pins" to other processes is something I spent a lot of time trying to 
> > > > work out.
> > > 
> > > There's nothing in file layout lease architecture that says you
> > > can't "pass the pins" to another process.  All the file layout lease
> > > requirements say is that if you are going to pass a resource for
> > > which the layout lease guarantees access for to another process,
> > > then the destination process already have a valid, active layout
> > > lease that covers the range of the pins being passed to it via the
> > > RDMA handle.
> > 
> > How would the kernel detect and enforce this? There are many ways to
> > pass a FD.
> 
> AFAIC, that's not really a kernel problem. It's more of an
> application design constraint than anything else. i.e. if the app
> passes the IB context to another process without a lease, then the
> original process is still responsible for recalling the lease and
> has to tell that other process to release the IB handle and it's
> resources.

It is a kernel problem, the MR exists and is doing DMA. That relies on
the lease to prevent data corruption.

The sanest outcome I could suggest is that when the kernel detects the
MR has outlived the lease it needs then we forcibly abort the entire
RDMA state. Ie the application has malfunctioned and gets wacked with
a very big hammer.

> That still doesn't work. Leases are not individually trackable or
> reference counted objects objects - they are attached to a struct
> file bUt, in reality, they are far more restricted than a struct
> file.

This is the problem. How to link something that is not refcounted to
the refcounted world of file descriptors does not seem very obvious.

There are too many places where struct file relies on its refcounting
to try to and plug them.

Jason


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-23 Thread Ira Weiny
On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> > 
> > > > But the fact that RDMA, and potentially others, can "pass the
> > > > pins" to other processes is something I spent a lot of time trying to 
> > > > work out.
> > > 
> > > There's nothing in file layout lease architecture that says you
> > > can't "pass the pins" to another process.  All the file layout lease
> > > requirements say is that if you are going to pass a resource for
> > > which the layout lease guarantees access for to another process,
> > > then the destination process already have a valid, active layout
> > > lease that covers the range of the pins being passed to it via the
> > > RDMA handle.
> > 
> > How would the kernel detect and enforce this? There are many ways to
> > pass a FD.
> 
> AFAIC, that's not really a kernel problem. It's more of an
> application design constraint than anything else. i.e. if the app
> passes the IB context to another process without a lease, then the
> original process is still responsible for recalling the lease and
> has to tell that other process to release the IB handle and it's
> resources.
> 
> > IMHO it is wrong to try and create a model where the file lease exists
> > independently from the kernel object relying on it. In other words the
> > IB MR object itself should hold a reference to the lease it relies
> > upon to function properly.
> 
> That still doesn't work. Leases are not individually trackable or
> reference counted objects objects - they are attached to a struct
> file bUt, in reality, they are far more restricted than a struct
> file.
> 
> That is, a lease specifically tracks the pid and the _open fd_ it
> was obtained for, so it is essentially owned by a specific process
> context.  Hence a lease is not able to be passed to a separate
> process context and have it still work correctly for lease break
> notifications.  i.e. the layout break signal gets delivered to
> original process that created the struct file, if it still exists
> and has the original fd still open. It does not get sent to the
> process that currently holds a reference to the IB context.
>

The fcntl man page says:

"Leases are associated with an open file description (see open(2)).  This means
that duplicate file descriptors (created by, for example, fork(2) or dup(2))
refer to the same lease, and this lease may be modified or released using any
of these descriptors.  Furthermore,  the lease is released by either an
explicit F_UNLCK operation on any of these duplicate file descriptors, or when
all such file descriptors have been closed."

>From this I took it that the child process FD would have the lease as well
_and_ could release it.  I _assumed_ that applied to SCM_RIGHTS but it does not
seem to work the same way as dup() so I'm not so sure.

Ira

> 
> So while a struct file passed to another process might still have
> an active lease, and you can change the owner of the struct file
> via fcntl(F_SETOWN), you can't associate the existing lease with a
> the new fd in the new process and so layout break signals can't be
> directed at the lease fd
> 
> This really means that a lease can only be owned by a single process
> context - it can't be shared across multiple processes (so I was
> wrong about dup/pass as being a possible way of passing them)
> because there's only one process that can "own" a struct file, and
> that where signals are sent when the lease needs to be broken.
> 
> So, fundamentally, if you want to pass a resource that pins a file
> layout between processes, both processes need to hold a layout lease
> on that file range. And that means exclusive leases and passing
> layouts between processes are fundamentally incompatible because you
> can't hold two exclusive leases on the same file range
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-23 Thread Ira Weiny
On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 01:44:21PM -0700, Ira Weiny wrote:
> > On Wed, Aug 21, 2019 at 04:48:10PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:
> > > 
> > > > > Oh, I didn't think we were talking about that. Hanging the close of
> > > > > the datafile fd contingent on some other FD's closure is a recipe for
> > > > > deadlock..
> > > > 
> > > > The discussion between Jan and Dave was concerning what happens when a 
> > > > user
> > > > calls
> > > > 
> > > > fd = open()
> > > > fnctl(...getlease...)
> > > > addr = mmap(fd...)
> > > > ib_reg_mr() 
> > > > munmap(addr...)
> > > > close(fd)
> > > 
> > > I don't see how blocking close(fd) could work.
> > 
> > Well Dave was saying this _could_ work. FWIW I'm not 100% sure it will but I
> > can't prove it won't..
> 
> Right, I proposed it as a possible way of making sure application
> developers don't do this. It _could_ be made to work (e.g. recording
> longterm page pins on the vma->file), but this is tangential to 
> the discussion of requiring active references to all resources
> covered by the layout lease.
> 
> I think allowing applications to behave like the above is simply
> poor system level design, regardless of the interaction with
> filesystems and layout leases.
> 
> > Maybe we are all just touching a different part of this
> > elephant[1] but the above scenario or one without munmap is very reasonably
> > something a user would do.  So we can either allow the close to complete (my
> > current patches) or try to make it block like Dave is suggesting.

My belief when writing the current series was that hanging the close would
cause deadlock.  But it seems I was wrong because of the delayed __fput().

So far, I have not been able to get RDMA to have an issue like Jason suggested
would happen (or used to happen).  So from that perspective it may be ok to
hang the close.

> > 
> > I don't disagree with Dave with the semantics being nice and clean for the
> > filesystem.
> 
> I'm not trying to make it "nice and clean for the filesystem".
> 
> The problem is not just RDMA/DAX - anything that is directly
> accessing the block device under the filesystem has the same set of
> issues. That is, the filesystem controls the life cycle of the
> blocks in the block device, so direct access to the blocks by any
> means needs to be co-ordinated with the filesystem. Pinning direct
> access to a file via page pins attached to a hardware context that
> the filesystem knows nothing about is not an access model that the
> filesystems can support.
> 
> IOWs, anyone looking at this problem just from the RDMA POV of page
> pins is not seeing all the other direct storage access mechainsms
> that we need to support in the filesystems. RDMA on DAX is just one
> of them.  pNFS is another. Remote acces via NVMeOF is another. XDP
> -> DAX (direct file data placement from the network hardware) is
> another. There are /lots/ of different direct storage access
> mechanisms that filesystems need to support and we sure as hell do
> not want to have to support special case semantics for every single
> one of them.

My use of struct file was based on the fact that FDs are a primary interface
for linux and my thought was that they would be more universal than having file
pin information stored in an RDMA specific structure.

XDP is not as direct; it uses sockets.  But sockets also have a struct file
which I believe could be used in a similar manner.  I'm not 100% sure of the
xdp_umem lifetime yet but it seems that my choice of using struct file was a
good one in this respect.

> 
> Hence if we don't start with a sane model for arbitrating direct
> access to the storage at the filesystem level we'll never get this
> stuff to work reliably, let alone work together coherently.  An
> application that wants a direct data path to storage should have a
> single API that enables then to safely access the storage,
> regardless of how they are accessing the storage.
> 
> From that perspective, what we are talking about here with RDMA
> doing "mmap, page pin, unmap, close" and "pass page pins via
> SCM_RIGHTS" are fundamentally unworkable from the filesystem
> perspective. They are use-after-free situations from the filesystem
> perspective - they do not hold direct references to anything in the
> filesystem, and so the filesytem is completely unaware of them.

I see your point of view but looking at it from a different point of view I
don't see this as a "use after free".

The user has explicitly registered this memory (and layout) with another direct
access subsystem (RDMA for example) so why do they need to keep the FD around?

> 
> The filesystem needs to be aware of /all users/ of it's resources if
> it's going to manage them sanely.

>From the way I look at it the underlying filesystem _is_ aware of the leases
with my patch set.  And so to is the user.  It is just not through the 

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-23 Thread Dave Chinner
On Fri, Aug 23, 2019 at 10:15:04AM -0700, Ira Weiny wrote:
> On Fri, Aug 23, 2019 at 10:59:14AM +1000, Dave Chinner wrote:
> > On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > > > 
> > > > > > > So that leaves just the normal close() syscall exit case, where 
> > > > > > > the
> > > > > > > application has full control of the order in which resources are
> > > > > > > released. We've already established that we can block in this
> > > > > > > context.  Blocking in an interruptible state will allow fatal 
> > > > > > > signal
> > > > > > > delivery to wake us, and then we fall into the
> > > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > > > 
> > > > > > The major problem with RDMA is that it doesn't always wait on 
> > > > > > close() for the
> > > > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > > > deadlock of the form:
> > > > > > 
> > > > > >uverbs_destroy_ufile_hw()
> > > > > >   mutex_lock()
> > > > > >[..]
> > > > > > mmput()
> > > > > >  exit_mmap()
> > > > > >   remove_vma()
> > > > > >fput();
> > > > > > file_operations->release()
> > > > > 
> > > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > > the final __fput() call is moved out of line.
> > > > 
> > > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > > have special code avoiding it, which is the code that is messing up
> > > > Ira's lifetime model.
> > > > 
> > > > Ira, you could try unraveling the special locking, that solves your
> > > > lifetime issues?
> > > 
> > > Yes I will try to prove this out...  But I'm still not sure this fully 
> > > solves
> > > the problem.
> > > 
> > > This only ensures that the process which has the RDMA context (RDMA FD) 
> > > is safe
> > > with regard to hanging the close for the "data file FD" (the file which 
> > > has
> > > pinned pages) in that _same_ process.  But what about the scenario.
> > > 
> > > Process A has the RDMA context FD and data file FD (with lease) open.
> > > 
> > > Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.
> > 
> > Passing the RDMA context dependent on a file layout lease to another
> > process that doesn't have a file layout lease or a reference to the
> > original lease should be considered a violation of the layout lease.
> > Process B does not have an active layout lease, and so by the rules
> > of layout leases, it is not allowed to pin the layout of the file.
> > 
> 
> I don't disagree with the semantics of this.  I just don't know how to enforce
> it.
> 
> > > Process A attempts to exit (hangs because data file FD is pinned).
> > > 
> > > Admin kills process A.  kill works because we have allowed for it...
> > > 
> > > Process B _still_ has the RDMA context FD open _and_ therefore still 
> > > holds the
> > > file pins.
> > > 
> > > Truncation still fails.
> > > 
> > > Admin does not know which process is holding the pin.
> > > 
> > > What am I missing?
> > 
> > Application does not hold the correct file layout lease references.
> > Passing the fd via SCM_RIGHTS to a process without a layout lease
> > is equivalent to not using layout leases in the first place.
> 
> Ok, So If I understand you correctly you would support a failure of SCM_RIGHTS
> in this case?  I'm ok with that but not sure how to implement it right now.
> 
> To that end, I would like to simplify this slightly because I'm not convinced
> that SCM_RIGHTS is a problem we need to solve right now.  ie I don't know of a
> user who wants to do this.

I don't think we can support it, let alone want to. SCM_RIGHTS was a
mistake made years ago that has been causing bugs and complexity to
try and avoid those bugs ever since.  I'm only taking about it
because someone else raised it and I asummed they raised it because
they want it to "work".

> Right now duplication via SCM_RIGHTS could fail if _any_ file pins (and by
> definition leases) exist underneath the "RDMA FD" (or other direct access FD,
> like XDP etc) being duplicated.

Sounds like a fine idea to me.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-23 Thread Dave Chinner
On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> 
> > > But the fact that RDMA, and potentially others, can "pass the
> > > pins" to other processes is something I spent a lot of time trying to 
> > > work out.
> > 
> > There's nothing in file layout lease architecture that says you
> > can't "pass the pins" to another process.  All the file layout lease
> > requirements say is that if you are going to pass a resource for
> > which the layout lease guarantees access for to another process,
> > then the destination process already have a valid, active layout
> > lease that covers the range of the pins being passed to it via the
> > RDMA handle.
> 
> How would the kernel detect and enforce this? There are many ways to
> pass a FD.

AFAIC, that's not really a kernel problem. It's more of an
application design constraint than anything else. i.e. if the app
passes the IB context to another process without a lease, then the
original process is still responsible for recalling the lease and
has to tell that other process to release the IB handle and it's
resources.

> IMHO it is wrong to try and create a model where the file lease exists
> independently from the kernel object relying on it. In other words the
> IB MR object itself should hold a reference to the lease it relies
> upon to function properly.

That still doesn't work. Leases are not individually trackable or
reference counted objects objects - they are attached to a struct
file bUt, in reality, they are far more restricted than a struct
file.

That is, a lease specifically tracks the pid and the _open fd_ it
was obtained for, so it is essentially owned by a specific process
context. Hence a lease is not able to be passed to a separate
process context and have it still work correctly for lease break
notifications.  i.e. the layout break signal gets delivered to
original process that created the struct file, if it still exists
and has the original fd still open. It does not get sent to the
process that currently holds a reference to the IB context.

So while a struct file passed to another process might still have
an active lease, and you can change the owner of the struct file
via fcntl(F_SETOWN), you can't associate the existing lease with a
the new fd in the new process and so layout break signals can't be
directed at the lease fd

This really means that a lease can only be owned by a single process
context - it can't be shared across multiple processes (so I was
wrong about dup/pass as being a possible way of passing them)
because there's only one process that can "own" a struct file, and
that where signals are sent when the lease needs to be broken.

So, fundamentally, if you want to pass a resource that pins a file
layout between processes, both processes need to hold a layout lease
on that file range. And that means exclusive leases and passing
layouts between processes are fundamentally incompatible because you
can't hold two exclusive leases on the same file range

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-23 Thread Ira Weiny
On Fri, Aug 23, 2019 at 10:59:14AM +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > > 
> > > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > > application has full control of the order in which resources are
> > > > > > released. We've already established that we can block in this
> > > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > > delivery to wake us, and then we fall into the
> > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > > 
> > > > > The major problem with RDMA is that it doesn't always wait on close() 
> > > > > for the
> > > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > > deadlock of the form:
> > > > > 
> > > > >uverbs_destroy_ufile_hw()
> > > > >   mutex_lock()
> > > > >[..]
> > > > > mmput()
> > > > >  exit_mmap()
> > > > >   remove_vma()
> > > > >fput();
> > > > > file_operations->release()
> > > > 
> > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > the final __fput() call is moved out of line.
> > > 
> > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > have special code avoiding it, which is the code that is messing up
> > > Ira's lifetime model.
> > > 
> > > Ira, you could try unraveling the special locking, that solves your
> > > lifetime issues?
> > 
> > Yes I will try to prove this out...  But I'm still not sure this fully 
> > solves
> > the problem.
> > 
> > This only ensures that the process which has the RDMA context (RDMA FD) is 
> > safe
> > with regard to hanging the close for the "data file FD" (the file which has
> > pinned pages) in that _same_ process.  But what about the scenario.
> > 
> > Process A has the RDMA context FD and data file FD (with lease) open.
> > 
> > Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.
> 
> Passing the RDMA context dependent on a file layout lease to another
> process that doesn't have a file layout lease or a reference to the
> original lease should be considered a violation of the layout lease.
> Process B does not have an active layout lease, and so by the rules
> of layout leases, it is not allowed to pin the layout of the file.
> 

I don't disagree with the semantics of this.  I just don't know how to enforce
it.

> > Process A attempts to exit (hangs because data file FD is pinned).
> > 
> > Admin kills process A.  kill works because we have allowed for it...
> > 
> > Process B _still_ has the RDMA context FD open _and_ therefore still holds 
> > the
> > file pins.
> > 
> > Truncation still fails.
> > 
> > Admin does not know which process is holding the pin.
> > 
> > What am I missing?
> 
> Application does not hold the correct file layout lease references.
> Passing the fd via SCM_RIGHTS to a process without a layout lease
> is equivalent to not using layout leases in the first place.

Ok, So If I understand you correctly you would support a failure of SCM_RIGHTS
in this case?  I'm ok with that but not sure how to implement it right now.

To that end, I would like to simplify this slightly because I'm not convinced
that SCM_RIGHTS is a problem we need to solve right now.  ie I don't know of a
user who wants to do this.

Right now duplication via SCM_RIGHTS could fail if _any_ file pins (and by
definition leases) exist underneath the "RDMA FD" (or other direct access FD,
like XDP etc) being duplicated.  Later, if this becomes a use case we will need
to code up the proper checks, potentially within each of the subsystems.  This
is because, with RDMA at least, there are potentially large numbers of MR's and
file leases which may have to be checked.

Ira



Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-23 Thread Jason Gunthorpe
On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:

> > But the fact that RDMA, and potentially others, can "pass the
> > pins" to other processes is something I spent a lot of time trying to work 
> > out.
> 
> There's nothing in file layout lease architecture that says you
> can't "pass the pins" to another process.  All the file layout lease
> requirements say is that if you are going to pass a resource for
> which the layout lease guarantees access for to another process,
> then the destination process already have a valid, active layout
> lease that covers the range of the pins being passed to it via the
> RDMA handle.

How would the kernel detect and enforce this? There are many ways to
pass a FD.

IMHO it is wrong to try and create a model where the file lease exists
independently from the kernel object relying on it. In other words the
IB MR object itself should hold a reference to the lease it relies
upon to function properly.

Then we don't have to wreck the unix FD model to fit this in.

Jason


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-22 Thread Dave Chinner
On Wed, Aug 21, 2019 at 01:44:21PM -0700, Ira Weiny wrote:
> On Wed, Aug 21, 2019 at 04:48:10PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:
> > 
> > > > Oh, I didn't think we were talking about that. Hanging the close of
> > > > the datafile fd contingent on some other FD's closure is a recipe for
> > > > deadlock..
> > > 
> > > The discussion between Jan and Dave was concerning what happens when a 
> > > user
> > > calls
> > > 
> > > fd = open()
> > > fnctl(...getlease...)
> > > addr = mmap(fd...)
> > > ib_reg_mr() 
> > > munmap(addr...)
> > > close(fd)
> > 
> > I don't see how blocking close(fd) could work.
> 
> Well Dave was saying this _could_ work. FWIW I'm not 100% sure it will but I
> can't prove it won't..

Right, I proposed it as a possible way of making sure application
developers don't do this. It _could_ be made to work (e.g. recording
longterm page pins on the vma->file), but this is tangential to 
the discussion of requiring active references to all resources
covered by the layout lease.

I think allowing applications to behave like the above is simply
poor system level design, regardless of the interaction with
filesystems and layout leases.

> Maybe we are all just touching a different part of this
> elephant[1] but the above scenario or one without munmap is very reasonably
> something a user would do.  So we can either allow the close to complete (my
> current patches) or try to make it block like Dave is suggesting.
> 
> I don't disagree with Dave with the semantics being nice and clean for the
> filesystem.

I'm not trying to make it "nice and clean for the filesystem".

The problem is not just RDMA/DAX - anything that is directly
accessing the block device under the filesystem has the same set of
issues. That is, the filesystem controls the life cycle of the
blocks in the block device, so direct access to the blocks by any
means needs to be co-ordinated with the filesystem. Pinning direct
access to a file via page pins attached to a hardware context that
the filesystem knows nothing about is not an access model that the
filesystems can support.

IOWs, anyone looking at this problem just from the RDMA POV of page
pins is not seeing all the other direct storage access mechainsms
that we need to support in the filesystems. RDMA on DAX is just one
of them.  pNFS is another. Remote acces via NVMeOF is another. XDP
-> DAX (direct file data placement from the network hardware) is
another. There are /lots/ of different direct storage access
mechanisms that filesystems need to support and we sure as hell do
not want to have to support special case semantics for every single
one of them.

Hence if we don't start with a sane model for arbitrating direct
access to the storage at the filesystem level we'll never get this
stuff to work reliably, let alone work together coherently.  An
application that wants a direct data path to storage should have a
single API that enables then to safely access the storage,
regardless of how they are accessing the storage.

>From that perspective, what we are talking about here with RDMA
doing "mmap, page pin, unmap, close" and "pass page pins via
SCM_RIGHTS" are fundamentally unworkable from the filesystem
perspective. They are use-after-free situations from the filesystem
perspective - they do not hold direct references to anything in the
filesystem, and so the filesytem is completely unaware of them.

The filesystem needs to be aware of /all users/ of it's resources if
it's going to manage them sanely.  It needs to be able to corectly
coordinate modifications to ownership of the underlying storage with
all the users directly accessing that physical storage regardless of
the mechanism being used to access the storage.  IOWs, access
control must be independent of the mechanism used to gain access to
the storage hardware.

That's what file layout leases are defining - the filesystem support
model for allowing direct storage access from userspace. It's not
defining "RDMA/FSDAX" access rules, it's defining a generic direct
access model. And one of the rules in this model is "if you don't
have an active reference to the file layout, you are not allowed to
directly access the layout.".

Anything else is unsupportable from the filesystem perspective -
designing an access mechanism that allows userspace to retain access
indefinitely by relying on hidden details of kernel subsystem
implementations is a terrible architecture.  Not only does it bleed
kernel implementation into the API and the behavioural model, it
means we can't ever change that internal kernel behaviour because
userspace may be dependent on it. I shouldn't be having to point out
how bad this is from a system design perspective.

That's where the "nice and clean" semantics come from - starting
from "what can we actually support?", "what exactly do all the
different direct access mechanisms actually require?", "does it work
for future technologies not yet on 

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-22 Thread Dave Chinner
On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > 
> > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > application has full control of the order in which resources are
> > > > > released. We've already established that we can block in this
> > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > delivery to wake us, and then we fall into the
> > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > 
> > > > The major problem with RDMA is that it doesn't always wait on close() 
> > > > for the
> > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > deadlock of the form:
> > > > 
> > > >uverbs_destroy_ufile_hw()
> > > >   mutex_lock()
> > > >[..]
> > > > mmput()
> > > >  exit_mmap()
> > > >   remove_vma()
> > > >fput();
> > > > file_operations->release()
> > > 
> > > I think this is wrong, and I'm pretty sure it's an example of why
> > > the final __fput() call is moved out of line.
> > 
> > Yes, I think so too, all I can say is this *used* to happen, as we
> > have special code avoiding it, which is the code that is messing up
> > Ira's lifetime model.
> > 
> > Ira, you could try unraveling the special locking, that solves your
> > lifetime issues?
> 
> Yes I will try to prove this out...  But I'm still not sure this fully solves
> the problem.
> 
> This only ensures that the process which has the RDMA context (RDMA FD) is 
> safe
> with regard to hanging the close for the "data file FD" (the file which has
> pinned pages) in that _same_ process.  But what about the scenario.
> 
> Process A has the RDMA context FD and data file FD (with lease) open.
> 
> Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.

Passing the RDMA context dependent on a file layout lease to another
process that doesn't have a file layout lease or a reference to the
original lease should be considered a violation of the layout lease.
Process B does not have an active layout lease, and so by the rules
of layout leases, it is not allowed to pin the layout of the file.

> Process A attempts to exit (hangs because data file FD is pinned).
> 
> Admin kills process A.  kill works because we have allowed for it...
> 
> Process B _still_ has the RDMA context FD open _and_ therefore still holds the
> file pins.
> 
> Truncation still fails.
> 
> Admin does not know which process is holding the pin.
> 
> What am I missing?

Application does not hold the correct file layout lease references.
Passing the fd via SCM_RIGHTS to a process without a layout lease
is equivalent to not using layout leases in the first place.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-21 Thread Jason Gunthorpe
On Wed, Aug 21, 2019 at 01:44:21PM -0700, Ira Weiny wrote:

> > The order FD's are closed during sigkill is not deterministic, so when
> > all the fputs happen during a kill'd exit we could end up blocking in
> > close(fd) as close(uverbs) will come after in the close
> > list. close(uverbs) is the thing that does the dereg_mr and releases
> > the pin.
> 
> Of course, that is a different scenario which needs to be fixed in my patch
> set.  Now that my servers are back up I can hopefully make progress.  (Power
> was down for them yesterday).

It isn't really a different scenario, the problem is that the
filesystem fd must be closable independenly of fencing the MR to avoid
deadlock cycles. Once you resolve that the issue of the uverbs FD out
living it won't matter one bit if it is in the same process or
another.

Jason


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-21 Thread Ira Weiny
On Wed, Aug 21, 2019 at 04:48:10PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:
> 
> > > Oh, I didn't think we were talking about that. Hanging the close of
> > > the datafile fd contingent on some other FD's closure is a recipe for
> > > deadlock..
> > 
> > The discussion between Jan and Dave was concerning what happens when a user
> > calls
> > 
> > fd = open()
> > fnctl(...getlease...)
> > addr = mmap(fd...)
> > ib_reg_mr() 
> > munmap(addr...)
> > close(fd)
> 
> I don't see how blocking close(fd) could work.

Well Dave was saying this _could_ work.  FWIW I'm not 100% sure it will but I
can't prove it won't..  Maybe we are all just touching a different part of this
elephant[1] but the above scenario or one without munmap is very reasonably
something a user would do.  So we can either allow the close to complete (my
current patches) or try to make it block like Dave is suggesting.

I don't disagree with Dave with the semantics being nice and clean for the
filesystem.  But the fact that RDMA, and potentially others, can "pass the
pins" to other processes is something I spent a lot of time trying to work out.

>
> Write it like this:
> 
>  fd = open()
>  uverbs = open(/dev/uverbs)
>  fnctl(...getlease...)
>  addr = mmap(fd...)
>  ib_reg_mr() 
>  munmap(addr...)
>   
> 
> The order FD's are closed during sigkill is not deterministic, so when
> all the fputs happen during a kill'd exit we could end up blocking in
> close(fd) as close(uverbs) will come after in the close
> list. close(uverbs) is the thing that does the dereg_mr and releases
> the pin.

Of course, that is a different scenario which needs to be fixed in my patch
set.  Now that my servers are back up I can hopefully make progress.  (Power
was down for them yesterday).

> 
> We don't need complexity with dup to create problems.

No but that complexity _will_ come unless we "zombie" layout leases.

Ira

[1] https://en.wikipedia.org/wiki/Blind_men_and_an_elephant

> 
> Jason
> 


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-21 Thread Jason Gunthorpe
On Wed, Aug 21, 2019 at 11:57:03AM -0700, Ira Weiny wrote:

> > Oh, I didn't think we were talking about that. Hanging the close of
> > the datafile fd contingent on some other FD's closure is a recipe for
> > deadlock..
> 
> The discussion between Jan and Dave was concerning what happens when a user
> calls
> 
> fd = open()
> fnctl(...getlease...)
> addr = mmap(fd...)
> ib_reg_mr() 
> munmap(addr...)
> close(fd)

I don't see how blocking close(fd) could work. Write it like this:

 fd = open()
 uverbs = open(/dev/uverbs)
 fnctl(...getlease...)
 addr = mmap(fd...)
 ib_reg_mr() 
 munmap(addr...)
  

The order FD's are closed during sigkill is not deterministic, so when
all the fputs happen during a kill'd exit we could end up blocking in
close(fd) as close(uverbs) will come after in the close
list. close(uverbs) is the thing that does the dereg_mr and releases
the pin.

We don't need complexity with dup to create problems.

Jason


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-21 Thread Ira Weiny
On Wed, Aug 21, 2019 at 11:43:30AM -0700, John Hubbard wrote:
> On 8/19/19 8:36 PM, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 08:09:33PM -0700, John Hubbard wrote:
> > > On 8/19/19 6:20 PM, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
> > > > > On 8/19/19 2:24 AM, Dave Chinner wrote:
> > > > > > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> > > > > > > On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > > > > > > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > > > > > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > > > > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > > > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > > > ...
> > AFAIA, there is no struct file here - the memory that has been pinned
> > is just something mapped into the application's address space.
> > 
> > It seems to me that the socket here is equivalent of the RDMA handle
> > that that owns the hardware that pins the pages. Again, that RDMA
> > handle is not aware of waht the mapping represents, hence need to
> > hold a layout lease if it's a file mapping.
> > 
> > SO from the filesystem persepctive, there's no difference between
> > XDP or RDMA - if it's a FSDAX mapping then it is DMAing directly
> > into the filesystem's backing store and that will require use of
> > layout leases to perform safely.
> > 
> 
> OK, got it! Makes perfect sense.

Just to chime in here... Yea from the FS perspective it is the same.

But on the driver side it is more complicated because of how the references to
the pins can be shared among other processes.

See the other branch of this thread

https://lkml.org/lkml/2019/8/21/828

Ira

> 
> thanks,
> -- 
> John Hubbard
> NVIDIA


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-21 Thread Ira Weiny
On Wed, Aug 21, 2019 at 11:57:03AM -0700, 'Ira Weiny' wrote:
> On Wed, Aug 21, 2019 at 03:13:43PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > > > 
> > > > > > > So that leaves just the normal close() syscall exit case, where 
> > > > > > > the
> > > > > > > application has full control of the order in which resources are
> > > > > > > released. We've already established that we can block in this
> > > > > > > context.  Blocking in an interruptible state will allow fatal 
> > > > > > > signal
> > > > > > > delivery to wake us, and then we fall into the
> > > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > > > 
> > > > > > The major problem with RDMA is that it doesn't always wait on 
> > > > > > close() for the
> > > > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > > > deadlock of the form:
> > > > > > 
> > > > > >uverbs_destroy_ufile_hw()
> > > > > >   mutex_lock()
> > > > > >[..]
> > > > > > mmput()
> > > > > >  exit_mmap()
> > > > > >   remove_vma()
> > > > > >fput();
> > > > > > file_operations->release()
> > > > > 
> > > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > > the final __fput() call is moved out of line.
> > > > 
> > > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > > have special code avoiding it, which is the code that is messing up
> > > > Ira's lifetime model.
> > > > 
> > > > Ira, you could try unraveling the special locking, that solves your
> > > > lifetime issues?
> > > 
> > > Yes I will try to prove this out...  But I'm still not sure this fully 
> > > solves
> > > the problem.
> > > 
> > > This only ensures that the process which has the RDMA context (RDMA FD) 
> > > is safe
> > > with regard to hanging the close for the "data file FD" (the file which 
> > > has
> > > pinned pages) in that _same_ process.  But what about the scenario.
> > 
> > Oh, I didn't think we were talking about that. Hanging the close of
> > the datafile fd contingent on some other FD's closure is a recipe for
> > deadlock..
> 
> The discussion between Jan and Dave was concerning what happens when a user
> calls
> 
> fd = open()
> fnctl(...getlease...)
> addr = mmap(fd...)
> ib_reg_mr() 
> munmap(addr...)
> close(fd)
> 
> Dave suggested:
> 
> "I'm of a mind to make the last close() on a file block if there's an
> active layout lease to prevent processes from zombie-ing layout
> leases like this. i.e. you can't close the fd until resources that
> pin the lease have been released."
> 
>   -- Dave https://lkml.org/lkml/2019/8/16/994

I think this may all be easier if there was a way to block a dup() if it comes
from an SCM_RIGHTS.  Does anyone know if that is easy?  I assume it would just
mean passing some flag through the dup() call chain.

Jason, if we did that would it break RDMA use cases?  I personally don't know
of any.  We could pass data back from vaddr_pin indicating that a file pin has
been taken and predicate the blocking of SCM_RIGHTS on that? 

Of course if the user called:

fd = open()
fnctl(...getlease...)
addr = mmap(fd...)
ib_reg_mr() 
munmap(addr...)
close(fd)
fork() <=== in another thread because close is hanging

Would that dup() "fd" above into the child?  Or maybe that would be part of the
work to make close() hang?  Ensure the fd/file is still in the FD table so it
gets dupped???

Ira


> 
> > 
> > IMHO the pin refcnt is held by the driver char dev FD, that is the
> > object you need to make it visible against.
> 
> I'm sorry but what do you mean by "driver char dev FD"?
> 
> > 
> > Why not just have a single table someplace of all the layout leases
> > with the file they are held on and the FD/socket/etc that is holding
> > the pin? Make it independent of processes and FDs?
> 
> If it is independent of processes how will we know which process is blocking
> the truncate?  Using a global table is an interesting idea but I still believe
> the users are going to want to track this to specific processes.  It's not
> clear to me how that would be done with a global table.
> 
> I agree the XDP/socket case is bothersome...  I was thinking that somewhere 
> the
> fd of the socket could be hooked up in this case.  But taking a look at it
> reveals that is not going to be easy.  And I assume XDP has the same issue WRT
> SCM_RIGHTS and the ability to share the xdp context?
> 
> Ira
> 
> > 
> > Jason


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-21 Thread Ira Weiny
On Wed, Aug 21, 2019 at 03:13:43PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > > 
> > > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > > application has full control of the order in which resources are
> > > > > > released. We've already established that we can block in this
> > > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > > delivery to wake us, and then we fall into the
> > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > > 
> > > > > The major problem with RDMA is that it doesn't always wait on close() 
> > > > > for the
> > > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > > deadlock of the form:
> > > > > 
> > > > >uverbs_destroy_ufile_hw()
> > > > >   mutex_lock()
> > > > >[..]
> > > > > mmput()
> > > > >  exit_mmap()
> > > > >   remove_vma()
> > > > >fput();
> > > > > file_operations->release()
> > > > 
> > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > the final __fput() call is moved out of line.
> > > 
> > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > have special code avoiding it, which is the code that is messing up
> > > Ira's lifetime model.
> > > 
> > > Ira, you could try unraveling the special locking, that solves your
> > > lifetime issues?
> > 
> > Yes I will try to prove this out...  But I'm still not sure this fully 
> > solves
> > the problem.
> > 
> > This only ensures that the process which has the RDMA context (RDMA FD) is 
> > safe
> > with regard to hanging the close for the "data file FD" (the file which has
> > pinned pages) in that _same_ process.  But what about the scenario.
> 
> Oh, I didn't think we were talking about that. Hanging the close of
> the datafile fd contingent on some other FD's closure is a recipe for
> deadlock..

The discussion between Jan and Dave was concerning what happens when a user
calls

fd = open()
fnctl(...getlease...)
addr = mmap(fd...)
ib_reg_mr() 
munmap(addr...)
close(fd)

Dave suggested:

"I'm of a mind to make the last close() on a file block if there's an
active layout lease to prevent processes from zombie-ing layout
leases like this. i.e. you can't close the fd until resources that
pin the lease have been released."

-- Dave https://lkml.org/lkml/2019/8/16/994

> 
> IMHO the pin refcnt is held by the driver char dev FD, that is the
> object you need to make it visible against.

I'm sorry but what do you mean by "driver char dev FD"?

> 
> Why not just have a single table someplace of all the layout leases
> with the file they are held on and the FD/socket/etc that is holding
> the pin? Make it independent of processes and FDs?

If it is independent of processes how will we know which process is blocking
the truncate?  Using a global table is an interesting idea but I still believe
the users are going to want to track this to specific processes.  It's not
clear to me how that would be done with a global table.

I agree the XDP/socket case is bothersome...  I was thinking that somewhere the
fd of the socket could be hooked up in this case.  But taking a look at it
reveals that is not going to be easy.  And I assume XDP has the same issue WRT
SCM_RIGHTS and the ability to share the xdp context?

Ira

> 
> Jason


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-21 Thread John Hubbard

On 8/19/19 8:36 PM, Dave Chinner wrote:

On Mon, Aug 19, 2019 at 08:09:33PM -0700, John Hubbard wrote:

On 8/19/19 6:20 PM, Dave Chinner wrote:

On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:

On 8/19/19 2:24 AM, Dave Chinner wrote:

On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:

On Sat 17-08-19 12:26:03, Dave Chinner wrote:

On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:

On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:

On Wed 14-08-19 11:08:49, Ira Weiny wrote:

On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:

...

AFAIA, there is no struct file here - the memory that has been pinned
is just something mapped into the application's address space.

It seems to me that the socket here is equivalent of the RDMA handle
that that owns the hardware that pins the pages. Again, that RDMA
handle is not aware of waht the mapping represents, hence need to
hold a layout lease if it's a file mapping.

SO from the filesystem persepctive, there's no difference between
XDP or RDMA - if it's a FSDAX mapping then it is DMAing directly
into the filesystem's backing store and that will require use of
layout leases to perform safely.



OK, got it! Makes perfect sense.

thanks,
--
John Hubbard
NVIDIA


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-21 Thread John Hubbard

On 8/21/19 11:13 AM, Jason Gunthorpe wrote:

On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:

On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:

On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:

On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:

On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:


So that leaves just the normal close() syscall exit case, where the
application has full control of the order in which resources are
released. We've already established that we can block in this
context.  Blocking in an interruptible state will allow fatal signal
delivery to wake us, and then we fall into the
fatal_signal_pending() case if we get a SIGKILL while blocking.


The major problem with RDMA is that it doesn't always wait on close() for the
MR holding the page pins to be destoyed. This is done to avoid a
deadlock of the form:

uverbs_destroy_ufile_hw()
   mutex_lock()
[..]
 mmput()
  exit_mmap()
   remove_vma()
fput();
 file_operations->release()


I think this is wrong, and I'm pretty sure it's an example of why
the final __fput() call is moved out of line.


Yes, I think so too, all I can say is this *used* to happen, as we
have special code avoiding it, which is the code that is messing up
Ira's lifetime model.

Ira, you could try unraveling the special locking, that solves your
lifetime issues?


Yes I will try to prove this out...  But I'm still not sure this fully solves
the problem.

This only ensures that the process which has the RDMA context (RDMA FD) is safe
with regard to hanging the close for the "data file FD" (the file which has
pinned pages) in that _same_ process.  But what about the scenario.


Oh, I didn't think we were talking about that. Hanging the close of
the datafile fd contingent on some other FD's closure is a recipe for
deadlock..

IMHO the pin refcnt is held by the driver char dev FD, that is the
object you need to make it visible against.



If you do that, it might make it a lot simpler to add lease support
to drivers like XDP, which is otherwise looking pretty difficult to
set up with an fd. (It's socket-based, and not immediately clear where
to connect up the fd.)


thanks,
--
John Hubbard
NVIDIA



Why not just have a single table someplace of all the layout leases
with the file they are held on and the FD/socket/etc that is holding
the pin? Make it independent of processes and FDs?

Jason



Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-21 Thread Jason Gunthorpe
On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > 
> > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > application has full control of the order in which resources are
> > > > > released. We've already established that we can block in this
> > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > delivery to wake us, and then we fall into the
> > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > 
> > > > The major problem with RDMA is that it doesn't always wait on close() 
> > > > for the
> > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > deadlock of the form:
> > > > 
> > > >uverbs_destroy_ufile_hw()
> > > >   mutex_lock()
> > > >[..]
> > > > mmput()
> > > >  exit_mmap()
> > > >   remove_vma()
> > > >fput();
> > > > file_operations->release()
> > > 
> > > I think this is wrong, and I'm pretty sure it's an example of why
> > > the final __fput() call is moved out of line.
> > 
> > Yes, I think so too, all I can say is this *used* to happen, as we
> > have special code avoiding it, which is the code that is messing up
> > Ira's lifetime model.
> > 
> > Ira, you could try unraveling the special locking, that solves your
> > lifetime issues?
> 
> Yes I will try to prove this out...  But I'm still not sure this fully solves
> the problem.
> 
> This only ensures that the process which has the RDMA context (RDMA FD) is 
> safe
> with regard to hanging the close for the "data file FD" (the file which has
> pinned pages) in that _same_ process.  But what about the scenario.

Oh, I didn't think we were talking about that. Hanging the close of
the datafile fd contingent on some other FD's closure is a recipe for
deadlock..

IMHO the pin refcnt is held by the driver char dev FD, that is the
object you need to make it visible against.

Why not just have a single table someplace of all the layout leases
with the file they are held on and the FD/socket/etc that is holding
the pin? Make it independent of processes and FDs?

Jason


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-21 Thread Ira Weiny
On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > 
> > > > So that leaves just the normal close() syscall exit case, where the
> > > > application has full control of the order in which resources are
> > > > released. We've already established that we can block in this
> > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > delivery to wake us, and then we fall into the
> > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > 
> > > The major problem with RDMA is that it doesn't always wait on close() for 
> > > the
> > > MR holding the page pins to be destoyed. This is done to avoid a
> > > deadlock of the form:
> > > 
> > >uverbs_destroy_ufile_hw()
> > >   mutex_lock()
> > >[..]
> > > mmput()
> > >  exit_mmap()
> > >   remove_vma()
> > >fput();
> > > file_operations->release()
> > 
> > I think this is wrong, and I'm pretty sure it's an example of why
> > the final __fput() call is moved out of line.
> 
> Yes, I think so too, all I can say is this *used* to happen, as we
> have special code avoiding it, which is the code that is messing up
> Ira's lifetime model.
> 
> Ira, you could try unraveling the special locking, that solves your
> lifetime issues?

Yes I will try to prove this out...  But I'm still not sure this fully solves
the problem.

This only ensures that the process which has the RDMA context (RDMA FD) is safe
with regard to hanging the close for the "data file FD" (the file which has
pinned pages) in that _same_ process.  But what about the scenario.

Process A has the RDMA context FD and data file FD (with lease) open.

Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.

Process A attempts to exit (hangs because data file FD is pinned).

Admin kills process A.  kill works because we have allowed for it...

Process B _still_ has the RDMA context FD open _and_ therefore still holds the
file pins.

Truncation still fails.

Admin does not know which process is holding the pin.

What am I missing?

Ira



Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-20 Thread Jason Gunthorpe
On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > 
> > > So that leaves just the normal close() syscall exit case, where the
> > > application has full control of the order in which resources are
> > > released. We've already established that we can block in this
> > > context.  Blocking in an interruptible state will allow fatal signal
> > > delivery to wake us, and then we fall into the
> > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > 
> > The major problem with RDMA is that it doesn't always wait on close() for 
> > the
> > MR holding the page pins to be destoyed. This is done to avoid a
> > deadlock of the form:
> > 
> >uverbs_destroy_ufile_hw()
> >   mutex_lock()
> >[..]
> > mmput()
> >  exit_mmap()
> >   remove_vma()
> >fput();
> > file_operations->release()
> 
> I think this is wrong, and I'm pretty sure it's an example of why
> the final __fput() call is moved out of line.

Yes, I think so too, all I can say is this *used* to happen, as we
have special code avoiding it, which is the code that is messing up
Ira's lifetime model.

Ira, you could try unraveling the special locking, that solves your
lifetime issues?

Jason


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-19 Thread Dave Chinner
On Mon, Aug 19, 2019 at 08:09:33PM -0700, John Hubbard wrote:
> On 8/19/19 6:20 PM, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
> > > On 8/19/19 2:24 AM, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> > > > > On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > > > > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > > > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > ...
> > > 
> > > Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
> > > memory with FOLL_LONGTERM, and wondering how to make that work here.
> > 
> > I'm not sure how this interacts with file mappings? I mean, this
> > is just pinning anonymous pages for direct data placement into
> > userspace, right?
> > 
> > Are you asking "what if this pinned memory was a file mapping?",
> > or something else?
> 
> Yes, mainly that one. Especially since the FOLL_LONGTERM flag is
> already there in xdp_umem_pin_pages(), unconditionally. So the
> simple rules about struct *vaddr_pin usage (set it to NULL if FOLL_LONGTERM is
> not set) are not going to work here.
> 
> 
> > 
> > > These are close to files, in how they're handled, but just different
> > > enough that it's not clear to me how to make work with this system.
> > 
> > I'm guessing that if they are pinning a file backed mapping, they
> > are trying to dma direct to the file (zero copy into page cache?)
> > and so they'll need to either play by ODP rules or take layout
> > leases, too
> > 
> 
> OK. I was just wondering if there was some simple way to dig up a
> struct file associated with a socket (I don't think so), but it sounds
> like this is an exercise that's potentially different for each subsystem.

AFAIA, there is no struct file here - the memory that has been pinned
is just something mapped into the application's address space.

It seems to me that the socket here is equivalent of the RDMA handle
that that owns the hardware that pins the pages. Again, that RDMA
handle is not aware of waht the mapping represents, hence need to
hold a layout lease if it's a file mapping.

SO from the filesystem persepctive, there's no difference between
XDP or RDMA - if it's a FSDAX mapping then it is DMAing directly
into the filesystem's backing store and that will require use of
layout leases to perform safely.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-19 Thread John Hubbard

On 8/19/19 6:20 PM, Dave Chinner wrote:

On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:

On 8/19/19 2:24 AM, Dave Chinner wrote:

On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:

On Sat 17-08-19 12:26:03, Dave Chinner wrote:

On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:

On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:

On Wed 14-08-19 11:08:49, Ira Weiny wrote:

On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:

...

Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
memory with FOLL_LONGTERM, and wondering how to make that work here.


I'm not sure how this interacts with file mappings? I mean, this
is just pinning anonymous pages for direct data placement into
userspace, right?

Are you asking "what if this pinned memory was a file mapping?",
or something else?


Yes, mainly that one. Especially since the FOLL_LONGTERM flag is
already there in xdp_umem_pin_pages(), unconditionally. So the
simple rules about struct *vaddr_pin usage (set it to NULL if FOLL_LONGTERM is
not set) are not going to work here.





These are close to files, in how they're handled, but just different
enough that it's not clear to me how to make work with this system.


I'm guessing that if they are pinning a file backed mapping, they
are trying to dma direct to the file (zero copy into page cache?)
and so they'll need to either play by ODP rules or take layout
leases, too



OK. I was just wondering if there was some simple way to dig up a
struct file associated with a socket (I don't think so), but it sounds
like this is an exercise that's potentially different for each subsystem.

thanks,
--
John Hubbard
NVIDIA


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-19 Thread Dave Chinner
On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
> On 8/19/19 2:24 AM, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> > > On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> ...
> > The last close is an interesting case because the __fput() call
> > actually runs from task_work() context, not where the last reference
> > is actually dropped. So it already has certain specific interactions
> > with signals and task exit processing via task_add_work() and
> > task_work_run().
> > 
> > task_add_work() calls set_notify_resume(task), so if nothing else
> > triggers when returning to userspace we run this path:
> > 
> > exit_to_usermode_loop()
> >tracehook_notify_resume()
> >  task_work_run()
> >__fput()
> > locks_remove_file()
> >   locks_remove_lease()
> > 
> > 
> > It's worth noting that locks_remove_lease() does a
> > percpu_down_read() which means we can already block in this context
> > removing leases
> > 
> > If there is a signal pending, the task work is run this way (before
> > the above notify path):
> > 
> > exit_to_usermode_loop()
> >do_signal()
> >  get_signal()
> >task_work_run()
> >  __fput()
> > 
> > We can detect this case via signal_pending() and even SIGKILL via
> > fatal_signal_pending(), and so we can decide not to block based on
> > the fact the process is about to be reaped and so the lease largely
> > doesn't matter anymore. I'd argue that it is close and we can't
> > easily back out, so we'd only break the block on a fatal signal
> > 
> > And then, of course, is the call path through do_exit(), which has
> > the PF_EXITING task flag set:
> > 
> > do_exit()
> >exit_task_work()
> >  task_work_run()
> >__fput()
> > 
> > and so it's easy to avoid blocking in this case, too.
> 
> Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
> memory with FOLL_LONGTERM, and wondering how to make that work here.

I'm not sure how this interacts with file mappings? I mean, this
is just pinning anonymous pages for direct data placement into
userspace, right?

Are you asking "what if this pinned memory was a file mapping?",
or something else?

> These are close to files, in how they're handled, but just different
> enough that it's not clear to me how to make work with this system.

I'm guessing that if they are pinning a file backed mapping, they
are trying to dma direct to the file (zero copy into page cache?)
and so they'll need to either play by ODP rules or take layout
leases, too

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-19 Thread Dave Chinner
On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> 
> > So that leaves just the normal close() syscall exit case, where the
> > application has full control of the order in which resources are
> > released. We've already established that we can block in this
> > context.  Blocking in an interruptible state will allow fatal signal
> > delivery to wake us, and then we fall into the
> > fatal_signal_pending() case if we get a SIGKILL while blocking.
> 
> The major problem with RDMA is that it doesn't always wait on close() for the
> MR holding the page pins to be destoyed. This is done to avoid a
> deadlock of the form:
> 
>uverbs_destroy_ufile_hw()
>   mutex_lock()
>[..]
> mmput()
>  exit_mmap()
>   remove_vma()
>fput();
> file_operations->release()

I think this is wrong, and I'm pretty sure it's an example of why
the final __fput() call is moved out of line.

fput()
  fput_many()
task_add_work(f, __fput())

and the call chain ends there.

Before the syscall returns to userspace, it then runs the __fput()
call through the task_work_run() interfaces, and hence the call
chain is just:

task_work_run
  __fput
> file_operations->release()
>  ib_uverbs_close()
>   uverbs_destroy_ufile_hw()
>mutex_lock()   <-- Deadlock

And there is no deadlock because nothing holds the mutex at this
point.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-19 Thread John Hubbard

On 8/19/19 2:24 AM, Dave Chinner wrote:

On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:

On Sat 17-08-19 12:26:03, Dave Chinner wrote:

On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:

On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:

On Wed 14-08-19 11:08:49, Ira Weiny wrote:

On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:

...

The last close is an interesting case because the __fput() call
actually runs from task_work() context, not where the last reference
is actually dropped. So it already has certain specific interactions
with signals and task exit processing via task_add_work() and
task_work_run().

task_add_work() calls set_notify_resume(task), so if nothing else
triggers when returning to userspace we run this path:

exit_to_usermode_loop()
   tracehook_notify_resume()
 task_work_run()
   __fput()
locks_remove_file()
  locks_remove_lease()


It's worth noting that locks_remove_lease() does a
percpu_down_read() which means we can already block in this context
removing leases

If there is a signal pending, the task work is run this way (before
the above notify path):

exit_to_usermode_loop()
   do_signal()
 get_signal()
   task_work_run()
 __fput()

We can detect this case via signal_pending() and even SIGKILL via
fatal_signal_pending(), and so we can decide not to block based on
the fact the process is about to be reaped and so the lease largely
doesn't matter anymore. I'd argue that it is close and we can't
easily back out, so we'd only break the block on a fatal signal

And then, of course, is the call path through do_exit(), which has
the PF_EXITING task flag set:

do_exit()
   exit_task_work()
 task_work_run()
   __fput()

and so it's easy to avoid blocking in this case, too.


Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
memory with FOLL_LONGTERM, and wondering how to make that work here.

These are close to files, in how they're handled, but just different
enough that it's not clear to me how to make work with this system.




So that leaves just the normal close() syscall exit case, where the
application has full control of the order in which resources are
released. We've already established that we can block in this
context.  Blocking in an interruptible state will allow fatal signal
delivery to wake us, and then we fall into the
fatal_signal_pending() case if we get a SIGKILL while blocking.

Hence I think blocking in this case would be OK - it indicates an
application bug (releasing a lease before releasing the resources)
but leaves SIGKILL available to administrators to resolve situations
involving buggy applications.

This requires applications to follow the rules: any process
that pins physical resources must have an active reference to a
layout lease, either via a duplicated fd or it's own private lease.
If the app doesn't play by the rules, it hangs in close() until it
is killed.


+1 for these rules, assuming that we can make them work. They are
easy to explain and intuitive.


thanks,
--
John Hubbard
NVIDIA


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-19 Thread Ira Weiny
On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> 
> > So that leaves just the normal close() syscall exit case, where the
> > application has full control of the order in which resources are
> > released. We've already established that we can block in this
> > context.  Blocking in an interruptible state will allow fatal signal
> > delivery to wake us, and then we fall into the
> > fatal_signal_pending() case if we get a SIGKILL while blocking.
> 
> The major problem with RDMA is that it doesn't always wait on close() for the
> MR holding the page pins to be destoyed. This is done to avoid a
> deadlock of the form:
> 
>uverbs_destroy_ufile_hw()
>   mutex_lock()
>[..]
> mmput()
>  exit_mmap()
>   remove_vma()
>fput();
> file_operations->release()
>  ib_uverbs_close()
>   uverbs_destroy_ufile_hw()
>mutex_lock()   <-- Deadlock
> 
> But, as I said to Ira earlier, I wonder if this is now impossible on
> modern kernels and we can switch to making the whole thing
> synchronous. That would resolve RDMA's main problem with this.

I'm still looking into this...  but my bigger concern is that the RDMA FD can
be passed to other processes via SCM_RIGHTS.  Which means the process holding
the pin may _not_ be the one with the open file and layout lease...

Ira



Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-19 Thread Jason Gunthorpe
On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:

> So that leaves just the normal close() syscall exit case, where the
> application has full control of the order in which resources are
> released. We've already established that we can block in this
> context.  Blocking in an interruptible state will allow fatal signal
> delivery to wake us, and then we fall into the
> fatal_signal_pending() case if we get a SIGKILL while blocking.

The major problem with RDMA is that it doesn't always wait on close() for the
MR holding the page pins to be destoyed. This is done to avoid a
deadlock of the form:

   uverbs_destroy_ufile_hw()
  mutex_lock()
   [..]
mmput()
 exit_mmap()
  remove_vma()
   fput();
file_operations->release()
 ib_uverbs_close()
  uverbs_destroy_ufile_hw()
   mutex_lock()   <-- Deadlock

But, as I said to Ira earlier, I wonder if this is now impossible on
modern kernels and we can switch to making the whole thing
synchronous. That would resolve RDMA's main problem with this.

Jason


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-19 Thread Dave Chinner
On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > 2) Second reason is that I thought I did not have a good way to tell if 
> > > the
> > >lease was actually in use.  What I mean is that letting the lease go 
> > > should
> > >be ok IFF we don't have any pins...  I was thinking that without 
> > > John's code
> > >we don't have a way to know if there are any pins...  But that is 
> > > wrong...
> > >All we have to do is check
> > > 
> > >   !list_empty(file->file_pins)
> > > 
> > > So now with this detail I think you are right, we should be able to hold 
> > > the
> > > lease through the struct file even if the process no longer has any
> > > "references" to it (ie closes and munmaps the file).
> > 
> > I really, really dislike the idea of zombie layout leases. It's a
> > nasty hack for poor application behaviour. This is a "we allow use
> > after layout lease release" API, and I think encoding largely
> > untraceable zombie objects into an API is very poor design.
> > 
> > From the fcntl man page:
> > 
> > LEASES
> > Leases are associated with an open file description (see
> > open(2)).  This means that duplicate file descriptors
> > (created by, for example, fork(2) or dup(2))  re‐ fer  to
> > the  same  lease,  and this lease may be modified or
> > released using any of these descriptors.  Furthermore, the
> > lease is released by either an explicit F_UNLCK operation on
> > any of these duplicate file descriptors, or when all such
> > file descriptors have been closed.
> > 
> > Leases are associated with *open* file descriptors, not the
> > lifetime of the struct file in the kernel. If the application closes
> > the open fds that refer to the lease, then the kernel does not
> > guarantee, and the application has no right to expect, that the
> > lease remains active in any way once the application closes all
> > direct references to the lease.
> > 
> > IOWs, applications using layout leases need to hold the lease fd
> > open for as long as the want access to the physical file layout. It
> > is a also a requirement of the layout lease that the holder releases
> > the resources it holds on the layout before it releases the layout
> > lease, exclusive lease or not. Closing the fd indicates they do not
> > need access to the file any more, and so the lease should be
> > reclaimed at that point.
> > 
> > I'm of a mind to make the last close() on a file block if there's an
> > active layout lease to prevent processes from zombie-ing layout
> > leases like this. i.e. you can't close the fd until resources that
> > pin the lease have been released.
> 
> Yeah, so this was my initial though as well [1]. But as the discussion in
> that thread revealed, the problem with blocking last close is that kernel
> does not really expect close to block. You could easily deadlock e.g. if
> the process gets SIGKILL, file with lease has fd 10, and the RDMA context
> holding pages pinned has fd 15.

Sure, I did think about this a bit about it before suggesting it :)

The last close is an interesting case because the __fput() call
actually runs from task_work() context, not where the last reference
is actually dropped. So it already has certain specific interactions
with signals and task exit processing via task_add_work() and
task_work_run().

task_add_work() calls set_notify_resume(task), so if nothing else
triggers when returning to userspace we run this path:

exit_to_usermode_loop()
  tracehook_notify_resume()
task_work_run()
  __fput()
locks_remove_file()
  locks_remove_lease()


It's worth noting that locks_remove_lease() does a
percpu_down_read() which means we can already block in this context
removing leases

If there is a signal pending, the task work is run this way (before
the above notify path):

exit_to_usermode_loop()
  do_signal()
get_signal()
  task_work_run()
__fput()

We can detect this case via signal_pending() and even SIGKILL via
fatal_signal_pending(), and so we can decide not to block based on
the fact the process is about to be reaped and so the lease largely
doesn't matter anymore. I'd argue that it is close and we can't
easily back out, so we'd only break the block on a fatal signal

And then, of course, is the call path through do_exit(), which has
the PF_EXITING task flag set:

do_exit()
  exit_task_work()
task_work_run()
  __fput()

and so it's easy to avoid blocking in this case, too.

So that leaves just the normal close() syscall exit case, where the
application has full control of the order in which resources are
released. We've already established that we can block in this

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-19 Thread Jan Kara
On Fri 16-08-19 16:20:07, Ira Weiny wrote:
> On Fri, Aug 16, 2019 at 12:05:28PM -0700, 'Ira Weiny' wrote:
> > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > > > Hello!
> > > > > 
> > > > > On Fri 09-08-19 15:58:14, ira.we...@intel.com wrote:
> > > > > > Pre-requisites
> > > > > > ==
> > > > > > Based on mmotm tree.
> > > > > > 
> > > > > > Based on the feedback from LSFmm, the LWN article, the RFC series 
> > > > > > since
> > > > > > then, and a ton of scenarios I've worked in my mind and/or 
> > > > > > tested...[1]
> > > > > > 
> > > > > > Solution summary
> > > > > > 
> > > > > > 
> > > > > > The real issue is that there is no use case for a user to have RDMA 
> > > > > > pinn'ed
> > > > > > memory which is then truncated.  So really any solution we present 
> > > > > > which:
> > > > > > 
> > > > > > A) Prevents file system corruption or data leaks
> > > > > > ...and...
> > > > > > B) Informs the user that they did something wrong
> > > > > > 
> > > > > > Should be an acceptable solution.
> > > > > > 
> > > > > > Because this is slightly new behavior.  And because this is going 
> > > > > > to be
> > > > > > specific to DAX (because of the lack of a page cache) we have made 
> > > > > > the user
> > > > > > "opt in" to this behavior.
> > > > > > 
> > > > > > The following patches implement the following solution.
> > > > > > 
> > > > > > 0) Registrations to Device DAX char devs are not affected
> > > > > > 
> > > > > > 1) The user has to opt in to allowing page pins on a file with an 
> > > > > > exclusive
> > > > > >layout lease.  Both exclusive and layout lease flags are user 
> > > > > > visible now.
> > > > > > 
> > > > > > 2) page pins will fail if the lease is not active when the file 
> > > > > > back page is
> > > > > >encountered.
> > > > > > 
> > > > > > 3) Any truncate or hole punch operation on a pinned DAX page will 
> > > > > > fail.
> > > > > 
> > > > > So I didn't fully grok the patch set yet but by "pinned DAX page" do 
> > > > > you
> > > > > mean a page which has corresponding file_pin covering it? Or do you 
> > > > > mean a
> > > > > page which has pincount increased? If the first then I'd rephrase 
> > > > > this to
> > > > > be less ambiguous, if the second then I think it is wrong. 
> > > > 
> > > > I mean the second.  but by "fail" I mean hang.  Right now the "normal" 
> > > > page
> > > > pincount processing will hang the truncate.  Given the discussion with 
> > > > John H
> > > > we can make this a bit better if we use something like FOLL_PIN and the 
> > > > page
> > > > count bias to indicate this type of pin.  Then I could fail the truncate
> > > > outright.  but that is not done yet.
> > > > 
> > > > so... I used the word "fail" to be a bit more vague as the final 
> > > > implementation
> > > > may return ETXTBUSY or hang as noted.
> > > 
> > > Ah, OK. Hanging is fine in principle but with longterm pins, your work
> > > makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that
> > > e.g. DIO will use page pins as well for its buffers and we must wait there
> > > until the pin is released. So please just clarify your 'fail' here a bit
> > > :).
> > 
> > It will fail with ETXTBSY.  I've fixed a bug...  See below.
> > 
> > > 
> > > > > > 4) The user has the option of holding the lease or releasing it.  
> > > > > > If they
> > > > > >release it no other pin calls will work on the file.
> > > > > 
> > > > > Last time we spoke the plan was that the lease is kept while the 
> > > > > pages are
> > > > > pinned (and an attempt to release the lease would block until the 
> > > > > pages are
> > > > > unpinned). That also makes it clear that the *lease* is what is making
> > > > > truncate and hole punch fail with ETXTBUSY and the file_pin structure 
> > > > > is
> > > > > just an implementation detail how the existence is efficiently 
> > > > > tracked (and
> > > > > what keeps the backing file for the pages open so that the lease does 
> > > > > not
> > > > > get auto-destroyed). Why did you change this?
> > > > 
> > > > closing the file _and_ unmaping it will cause the lease to be released
> > > > regardless of if we allow this or not.
> > > > 
> > > > As we discussed preventing the close seemed intractable.
> > > 
> > > Yes, preventing the application from closing the file is difficult. But
> > > from a quick look at your patches it seemed to me that you actually hold a
> > > backing file reference from the file_pin structure thus even though the
> > > application closes its file descriptor, the struct file (and thus the
> > > lease) lives further until the file_pin gets released. And that should 
> > > last
> > > as long as the pages are pinned. Am I missing something?
> > > 
> > > > I thought about failing the munmap but that seemed wrong as well.  But 
> > > > more
> > > > 

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-19 Thread Jan Kara
On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > 2) Second reason is that I thought I did not have a good way to tell if the
> >lease was actually in use.  What I mean is that letting the lease go 
> > should
> >be ok IFF we don't have any pins...  I was thinking that without John's 
> > code
> >we don't have a way to know if there are any pins...  But that is 
> > wrong...
> >All we have to do is check
> > 
> > !list_empty(file->file_pins)
> > 
> > So now with this detail I think you are right, we should be able to hold the
> > lease through the struct file even if the process no longer has any
> > "references" to it (ie closes and munmaps the file).
> 
> I really, really dislike the idea of zombie layout leases. It's a
> nasty hack for poor application behaviour. This is a "we allow use
> after layout lease release" API, and I think encoding largely
> untraceable zombie objects into an API is very poor design.
> 
> From the fcntl man page:
> 
> LEASES
>   Leases are associated with an open file description (see
>   open(2)).  This means that duplicate file descriptors
>   (created by, for example, fork(2) or dup(2))  re‐ fer  to
>   the  same  lease,  and this lease may be modified or
>   released using any of these descriptors.  Furthermore, the
>   lease is released by either an explicit F_UNLCK operation on
>   any of these duplicate file descriptors, or when all such
>   file descriptors have been closed.
> 
> Leases are associated with *open* file descriptors, not the
> lifetime of the struct file in the kernel. If the application closes
> the open fds that refer to the lease, then the kernel does not
> guarantee, and the application has no right to expect, that the
> lease remains active in any way once the application closes all
> direct references to the lease.
> 
> IOWs, applications using layout leases need to hold the lease fd
> open for as long as the want access to the physical file layout. It
> is a also a requirement of the layout lease that the holder releases
> the resources it holds on the layout before it releases the layout
> lease, exclusive lease or not. Closing the fd indicates they do not
> need access to the file any more, and so the lease should be
> reclaimed at that point.
> 
> I'm of a mind to make the last close() on a file block if there's an
> active layout lease to prevent processes from zombie-ing layout
> leases like this. i.e. you can't close the fd until resources that
> pin the lease have been released.

Yeah, so this was my initial though as well [1]. But as the discussion in
that thread revealed, the problem with blocking last close is that kernel
does not really expect close to block. You could easily deadlock e.g. if
the process gets SIGKILL, file with lease has fd 10, and the RDMA context
holding pages pinned has fd 15. Or you could wait for another process to
release page pins and blocking SIGKILL on that is also bad. So in the end
the least bad solution we've come up with were these "zombie" leases as you
call them and tracking them in /proc so that userspace at least has a way
of seeing them. But if you can come up with a different solution, I'm
certainly not attached to the current one...

Honza

[1] https://lore.kernel.org/lkml/20190606104203.gf7...@quack2.suse.cz
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-16 Thread Dave Chinner
On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> 2) Second reason is that I thought I did not have a good way to tell if the
>lease was actually in use.  What I mean is that letting the lease go should
>be ok IFF we don't have any pins...  I was thinking that without John's 
> code
>we don't have a way to know if there are any pins...  But that is wrong...
>All we have to do is check
> 
>   !list_empty(file->file_pins)
> 
> So now with this detail I think you are right, we should be able to hold the
> lease through the struct file even if the process no longer has any
> "references" to it (ie closes and munmaps the file).

I really, really dislike the idea of zombie layout leases. It's a
nasty hack for poor application behaviour. This is a "we allow use
after layout lease release" API, and I think encoding largely
untraceable zombie objects into an API is very poor design.

>From the fcntl man page:

LEASES
Leases are associated with an open file description (see
open(2)).  This means that duplicate file descriptors
(created by, for example, fork(2) or dup(2))  re‐ fer  to
the  same  lease,  and this lease may be modified or
released using any of these descriptors.  Furthermore, the
lease is released by either an explicit F_UNLCK operation on
any of these duplicate file descriptors, or when all such
file descriptors have been closed.

Leases are associated with *open* file descriptors, not the
lifetime of the struct file in the kernel. If the application closes
the open fds that refer to the lease, then the kernel does not
guarantee, and the application has no right to expect, that the
lease remains active in any way once the application closes all
direct references to the lease.

IOWs, applications using layout leases need to hold the lease fd
open for as long as the want access to the physical file layout. It
is a also a requirement of the layout lease that the holder releases
the resources it holds on the layout before it releases the layout
lease, exclusive lease or not. Closing the fd indicates they do not
need access to the file any more, and so the lease should be
reclaimed at that point.

I'm of a mind to make the last close() on a file block if there's an
active layout lease to prevent processes from zombie-ing layout
leases like this. i.e. you can't close the fd until resources that
pin the lease have been released.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ; -)

2019-08-16 Thread Ira Weiny
On Fri, Aug 16, 2019 at 12:05:28PM -0700, 'Ira Weiny' wrote:
> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > > Hello!
> > > > 
> > > > On Fri 09-08-19 15:58:14, ira.we...@intel.com wrote:
> > > > > Pre-requisites
> > > > > ==
> > > > >   Based on mmotm tree.
> > > > > 
> > > > > Based on the feedback from LSFmm, the LWN article, the RFC series 
> > > > > since
> > > > > then, and a ton of scenarios I've worked in my mind and/or 
> > > > > tested...[1]
> > > > > 
> > > > > Solution summary
> > > > > 
> > > > > 
> > > > > The real issue is that there is no use case for a user to have RDMA 
> > > > > pinn'ed
> > > > > memory which is then truncated.  So really any solution we present 
> > > > > which:
> > > > > 
> > > > > A) Prevents file system corruption or data leaks
> > > > > ...and...
> > > > > B) Informs the user that they did something wrong
> > > > > 
> > > > > Should be an acceptable solution.
> > > > > 
> > > > > Because this is slightly new behavior.  And because this is going to 
> > > > > be
> > > > > specific to DAX (because of the lack of a page cache) we have made 
> > > > > the user
> > > > > "opt in" to this behavior.
> > > > > 
> > > > > The following patches implement the following solution.
> > > > > 
> > > > > 0) Registrations to Device DAX char devs are not affected
> > > > > 
> > > > > 1) The user has to opt in to allowing page pins on a file with an 
> > > > > exclusive
> > > > >layout lease.  Both exclusive and layout lease flags are user 
> > > > > visible now.
> > > > > 
> > > > > 2) page pins will fail if the lease is not active when the file back 
> > > > > page is
> > > > >encountered.
> > > > > 
> > > > > 3) Any truncate or hole punch operation on a pinned DAX page will 
> > > > > fail.
> > > > 
> > > > So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> > > > mean a page which has corresponding file_pin covering it? Or do you 
> > > > mean a
> > > > page which has pincount increased? If the first then I'd rephrase this 
> > > > to
> > > > be less ambiguous, if the second then I think it is wrong. 
> > > 
> > > I mean the second.  but by "fail" I mean hang.  Right now the "normal" 
> > > page
> > > pincount processing will hang the truncate.  Given the discussion with 
> > > John H
> > > we can make this a bit better if we use something like FOLL_PIN and the 
> > > page
> > > count bias to indicate this type of pin.  Then I could fail the truncate
> > > outright.  but that is not done yet.
> > > 
> > > so... I used the word "fail" to be a bit more vague as the final 
> > > implementation
> > > may return ETXTBUSY or hang as noted.
> > 
> > Ah, OK. Hanging is fine in principle but with longterm pins, your work
> > makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that
> > e.g. DIO will use page pins as well for its buffers and we must wait there
> > until the pin is released. So please just clarify your 'fail' here a bit
> > :).
> 
> It will fail with ETXTBSY.  I've fixed a bug...  See below.
> 
> > 
> > > > > 4) The user has the option of holding the lease or releasing it.  If 
> > > > > they
> > > > >release it no other pin calls will work on the file.
> > > > 
> > > > Last time we spoke the plan was that the lease is kept while the pages 
> > > > are
> > > > pinned (and an attempt to release the lease would block until the pages 
> > > > are
> > > > unpinned). That also makes it clear that the *lease* is what is making
> > > > truncate and hole punch fail with ETXTBUSY and the file_pin structure is
> > > > just an implementation detail how the existence is efficiently tracked 
> > > > (and
> > > > what keeps the backing file for the pages open so that the lease does 
> > > > not
> > > > get auto-destroyed). Why did you change this?
> > > 
> > > closing the file _and_ unmaping it will cause the lease to be released
> > > regardless of if we allow this or not.
> > > 
> > > As we discussed preventing the close seemed intractable.
> > 
> > Yes, preventing the application from closing the file is difficult. But
> > from a quick look at your patches it seemed to me that you actually hold a
> > backing file reference from the file_pin structure thus even though the
> > application closes its file descriptor, the struct file (and thus the
> > lease) lives further until the file_pin gets released. And that should last
> > as long as the pages are pinned. Am I missing something?
> > 
> > > I thought about failing the munmap but that seemed wrong as well.  But 
> > > more
> > > importantly AFAIK RDMA can pass its memory pins to other processes via FD
> > > passing...  This means that one could pin this memory, pass it to another
> > > process and exit.  The file lease on the pin'ed file is lost.
> > 
> > Not if file_pin grabs struct file reference as I mentioned above...
> 

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-16 Thread Ira Weiny
On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > Hello!
> > > 
> > > On Fri 09-08-19 15:58:14, ira.we...@intel.com wrote:
> > > > Pre-requisites
> > > > ==
> > > > Based on mmotm tree.
> > > > 
> > > > Based on the feedback from LSFmm, the LWN article, the RFC series since
> > > > then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> > > > 
> > > > Solution summary
> > > > 
> > > > 
> > > > The real issue is that there is no use case for a user to have RDMA 
> > > > pinn'ed
> > > > memory which is then truncated.  So really any solution we present 
> > > > which:
> > > > 
> > > > A) Prevents file system corruption or data leaks
> > > > ...and...
> > > > B) Informs the user that they did something wrong
> > > > 
> > > > Should be an acceptable solution.
> > > > 
> > > > Because this is slightly new behavior.  And because this is going to be
> > > > specific to DAX (because of the lack of a page cache) we have made the 
> > > > user
> > > > "opt in" to this behavior.
> > > > 
> > > > The following patches implement the following solution.
> > > > 
> > > > 0) Registrations to Device DAX char devs are not affected
> > > > 
> > > > 1) The user has to opt in to allowing page pins on a file with an 
> > > > exclusive
> > > >layout lease.  Both exclusive and layout lease flags are user 
> > > > visible now.
> > > > 
> > > > 2) page pins will fail if the lease is not active when the file back 
> > > > page is
> > > >encountered.
> > > > 
> > > > 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> > > 
> > > So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> > > mean a page which has corresponding file_pin covering it? Or do you mean a
> > > page which has pincount increased? If the first then I'd rephrase this to
> > > be less ambiguous, if the second then I think it is wrong. 
> > 
> > I mean the second.  but by "fail" I mean hang.  Right now the "normal" page
> > pincount processing will hang the truncate.  Given the discussion with John 
> > H
> > we can make this a bit better if we use something like FOLL_PIN and the page
> > count bias to indicate this type of pin.  Then I could fail the truncate
> > outright.  but that is not done yet.
> > 
> > so... I used the word "fail" to be a bit more vague as the final 
> > implementation
> > may return ETXTBUSY or hang as noted.
> 
> Ah, OK. Hanging is fine in principle but with longterm pins, your work
> makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that
> e.g. DIO will use page pins as well for its buffers and we must wait there
> until the pin is released. So please just clarify your 'fail' here a bit
> :).

It will fail with ETXTBSY.  I've fixed a bug...  See below.

> 
> > > > 4) The user has the option of holding the lease or releasing it.  If 
> > > > they
> > > >release it no other pin calls will work on the file.
> > > 
> > > Last time we spoke the plan was that the lease is kept while the pages are
> > > pinned (and an attempt to release the lease would block until the pages 
> > > are
> > > unpinned). That also makes it clear that the *lease* is what is making
> > > truncate and hole punch fail with ETXTBUSY and the file_pin structure is
> > > just an implementation detail how the existence is efficiently tracked 
> > > (and
> > > what keeps the backing file for the pages open so that the lease does not
> > > get auto-destroyed). Why did you change this?
> > 
> > closing the file _and_ unmaping it will cause the lease to be released
> > regardless of if we allow this or not.
> > 
> > As we discussed preventing the close seemed intractable.
> 
> Yes, preventing the application from closing the file is difficult. But
> from a quick look at your patches it seemed to me that you actually hold a
> backing file reference from the file_pin structure thus even though the
> application closes its file descriptor, the struct file (and thus the
> lease) lives further until the file_pin gets released. And that should last
> as long as the pages are pinned. Am I missing something?
> 
> > I thought about failing the munmap but that seemed wrong as well.  But more
> > importantly AFAIK RDMA can pass its memory pins to other processes via FD
> > passing...  This means that one could pin this memory, pass it to another
> > process and exit.  The file lease on the pin'ed file is lost.
> 
> Not if file_pin grabs struct file reference as I mentioned above...
>  
> > The file lease is just a key to get the memory pin.  Once unlocked the 
> > procfs
> > tracking keeps track of where that pin goes and which processes need to be
> > killed to get rid of it.
> 
> I think having file lease being just a key to get the pin is conceptually
> wrong. The lease is what expresses: "I'm accessing these blocks directly,
> don't touch 

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-15 Thread Jan Kara
On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > Hello!
> > 
> > On Fri 09-08-19 15:58:14, ira.we...@intel.com wrote:
> > > Pre-requisites
> > > ==
> > >   Based on mmotm tree.
> > > 
> > > Based on the feedback from LSFmm, the LWN article, the RFC series since
> > > then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> > > 
> > > Solution summary
> > > 
> > > 
> > > The real issue is that there is no use case for a user to have RDMA 
> > > pinn'ed
> > > memory which is then truncated.  So really any solution we present which:
> > > 
> > > A) Prevents file system corruption or data leaks
> > > ...and...
> > > B) Informs the user that they did something wrong
> > > 
> > > Should be an acceptable solution.
> > > 
> > > Because this is slightly new behavior.  And because this is going to be
> > > specific to DAX (because of the lack of a page cache) we have made the 
> > > user
> > > "opt in" to this behavior.
> > > 
> > > The following patches implement the following solution.
> > > 
> > > 0) Registrations to Device DAX char devs are not affected
> > > 
> > > 1) The user has to opt in to allowing page pins on a file with an 
> > > exclusive
> > >layout lease.  Both exclusive and layout lease flags are user visible 
> > > now.
> > > 
> > > 2) page pins will fail if the lease is not active when the file back page 
> > > is
> > >encountered.
> > > 
> > > 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> > 
> > So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> > mean a page which has corresponding file_pin covering it? Or do you mean a
> > page which has pincount increased? If the first then I'd rephrase this to
> > be less ambiguous, if the second then I think it is wrong. 
> 
> I mean the second.  but by "fail" I mean hang.  Right now the "normal" page
> pincount processing will hang the truncate.  Given the discussion with John H
> we can make this a bit better if we use something like FOLL_PIN and the page
> count bias to indicate this type of pin.  Then I could fail the truncate
> outright.  but that is not done yet.
> 
> so... I used the word "fail" to be a bit more vague as the final 
> implementation
> may return ETXTBUSY or hang as noted.

Ah, OK. Hanging is fine in principle but with longterm pins, your work
makes sure they actually fail with ETXTBUSY, doesn't it? The thing is that
e.g. DIO will use page pins as well for its buffers and we must wait there
until the pin is released. So please just clarify your 'fail' here a bit
:).

> > > 4) The user has the option of holding the lease or releasing it.  If they
> > >release it no other pin calls will work on the file.
> > 
> > Last time we spoke the plan was that the lease is kept while the pages are
> > pinned (and an attempt to release the lease would block until the pages are
> > unpinned). That also makes it clear that the *lease* is what is making
> > truncate and hole punch fail with ETXTBUSY and the file_pin structure is
> > just an implementation detail how the existence is efficiently tracked (and
> > what keeps the backing file for the pages open so that the lease does not
> > get auto-destroyed). Why did you change this?
> 
> closing the file _and_ unmaping it will cause the lease to be released
> regardless of if we allow this or not.
> 
> As we discussed preventing the close seemed intractable.

Yes, preventing the application from closing the file is difficult. But
from a quick look at your patches it seemed to me that you actually hold a
backing file reference from the file_pin structure thus even though the
application closes its file descriptor, the struct file (and thus the
lease) lives further until the file_pin gets released. And that should last
as long as the pages are pinned. Am I missing something?

> I thought about failing the munmap but that seemed wrong as well.  But more
> importantly AFAIK RDMA can pass its memory pins to other processes via FD
> passing...  This means that one could pin this memory, pass it to another
> process and exit.  The file lease on the pin'ed file is lost.

Not if file_pin grabs struct file reference as I mentioned above...
 
> The file lease is just a key to get the memory pin.  Once unlocked the procfs
> tracking keeps track of where that pin goes and which processes need to be
> killed to get rid of it.

I think having file lease being just a key to get the pin is conceptually
wrong. The lease is what expresses: "I'm accessing these blocks directly,
don't touch them without coordinating with me." So it would be only natural
if we maintained the lease while we are accessing blocks instead of
transferring this protection responsibility to another structure - namely
file_pin - and letting the lease go. But maybe I miss some technical reason
why maintaining file lease is difficult. If that's the case, I'd like to hear
what...
 
> > > 

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-14 Thread Ira Weiny
On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> Hello!
> 
> On Fri 09-08-19 15:58:14, ira.we...@intel.com wrote:
> > Pre-requisites
> > ==
> > Based on mmotm tree.
> > 
> > Based on the feedback from LSFmm, the LWN article, the RFC series since
> > then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> > 
> > Solution summary
> > 
> > 
> > The real issue is that there is no use case for a user to have RDMA pinn'ed
> > memory which is then truncated.  So really any solution we present which:
> > 
> > A) Prevents file system corruption or data leaks
> > ...and...
> > B) Informs the user that they did something wrong
> > 
> > Should be an acceptable solution.
> > 
> > Because this is slightly new behavior.  And because this is going to be
> > specific to DAX (because of the lack of a page cache) we have made the user
> > "opt in" to this behavior.
> > 
> > The following patches implement the following solution.
> > 
> > 0) Registrations to Device DAX char devs are not affected
> > 
> > 1) The user has to opt in to allowing page pins on a file with an exclusive
> >layout lease.  Both exclusive and layout lease flags are user visible 
> > now.
> > 
> > 2) page pins will fail if the lease is not active when the file back page is
> >encountered.
> > 
> > 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> 
> So I didn't fully grok the patch set yet but by "pinned DAX page" do you
> mean a page which has corresponding file_pin covering it? Or do you mean a
> page which has pincount increased? If the first then I'd rephrase this to
> be less ambiguous, if the second then I think it is wrong. 

I mean the second.  but by "fail" I mean hang.  Right now the "normal" page
pincount processing will hang the truncate.  Given the discussion with John H
we can make this a bit better if we use something like FOLL_PIN and the page
count bias to indicate this type of pin.  Then I could fail the truncate
outright.  but that is not done yet.

so... I used the word "fail" to be a bit more vague as the final implementation
may return ETXTBUSY or hang as noted.

> 
> > 4) The user has the option of holding the lease or releasing it.  If they
> >release it no other pin calls will work on the file.
> 
> Last time we spoke the plan was that the lease is kept while the pages are
> pinned (and an attempt to release the lease would block until the pages are
> unpinned). That also makes it clear that the *lease* is what is making
> truncate and hole punch fail with ETXTBUSY and the file_pin structure is
> just an implementation detail how the existence is efficiently tracked (and
> what keeps the backing file for the pages open so that the lease does not
> get auto-destroyed). Why did you change this?

closing the file _and_ unmaping it will cause the lease to be released
regardless of if we allow this or not.

As we discussed preventing the close seemed intractable.

I thought about failing the munmap but that seemed wrong as well.  But more
importantly AFAIK RDMA can pass its memory pins to other processes via FD
passing...  This means that one could pin this memory, pass it to another
process and exit.  The file lease on the pin'ed file is lost.

The file lease is just a key to get the memory pin.  Once unlocked the procfs
tracking keeps track of where that pin goes and which processes need to be
killed to get rid of it.

> 
> > 5) Closing the file is ok.
> > 
> > 6) Unmapping the file is ok
> > 
> > 7) Pins against the files are tracked back to an owning file or an owning mm
> >depending on the internal subsystem needs.  With RDMA there is an owning
> >file which is related to the pined file.
> > 
> > 8) Only RDMA is currently supported
> 
> If you currently only need "owning file" variant in your patch set, then
> I'd just implement that and leave "owning mm" variant for later if it
> proves to be necessary. The things are complex enough as is...

I can do that...  I was trying to get io_uring working as well with the
owning_mm but I should save that for later.

> 
> > 9) Truncation of pages which are not actively pinned nor covered by a lease
> >will succeed.
> 
> Otherwise I like the design.

Thanks,
Ira

> 
>   Honza
> 
> -- 
> Jan Kara 
> SUSE Labs, CR


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-14 Thread Jan Kara
Hello!

On Fri 09-08-19 15:58:14, ira.we...@intel.com wrote:
> Pre-requisites
> ==
>   Based on mmotm tree.
> 
> Based on the feedback from LSFmm, the LWN article, the RFC series since
> then, and a ton of scenarios I've worked in my mind and/or tested...[1]
> 
> Solution summary
> 
> 
> The real issue is that there is no use case for a user to have RDMA pinn'ed
> memory which is then truncated.  So really any solution we present which:
> 
> A) Prevents file system corruption or data leaks
> ...and...
> B) Informs the user that they did something wrong
> 
> Should be an acceptable solution.
> 
> Because this is slightly new behavior.  And because this is going to be
> specific to DAX (because of the lack of a page cache) we have made the user
> "opt in" to this behavior.
> 
> The following patches implement the following solution.
> 
> 0) Registrations to Device DAX char devs are not affected
> 
> 1) The user has to opt in to allowing page pins on a file with an exclusive
>layout lease.  Both exclusive and layout lease flags are user visible now.
> 
> 2) page pins will fail if the lease is not active when the file back page is
>encountered.
> 
> 3) Any truncate or hole punch operation on a pinned DAX page will fail.

So I didn't fully grok the patch set yet but by "pinned DAX page" do you
mean a page which has corresponding file_pin covering it? Or do you mean a
page which has pincount increased? If the first then I'd rephrase this to
be less ambiguous, if the second then I think it is wrong. 

> 4) The user has the option of holding the lease or releasing it.  If they
>release it no other pin calls will work on the file.

Last time we spoke the plan was that the lease is kept while the pages are
pinned (and an attempt to release the lease would block until the pages are
unpinned). That also makes it clear that the *lease* is what is making
truncate and hole punch fail with ETXTBUSY and the file_pin structure is
just an implementation detail how the existence is efficiently tracked (and
what keeps the backing file for the pages open so that the lease does not
get auto-destroyed). Why did you change this?

> 5) Closing the file is ok.
> 
> 6) Unmapping the file is ok
> 
> 7) Pins against the files are tracked back to an owning file or an owning mm
>depending on the internal subsystem needs.  With RDMA there is an owning
>file which is related to the pined file.
> 
> 8) Only RDMA is currently supported

If you currently only need "owning file" variant in your patch set, then
I'd just implement that and leave "owning mm" variant for later if it
proves to be necessary. The things are complex enough as is...

> 9) Truncation of pages which are not actively pinned nor covered by a lease
>will succeed.

Otherwise I like the design.

Honza

-- 
Jan Kara 
SUSE Labs, CR