Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

2020-05-12 Thread Dave Chinner
f the load on RCU,
nor how long direct reclaim is taking. Calling synchronize_rcu()
incorrectly has pretty major downsides to it, so nobody should be
trying to expedite kfree_rcu() unless there is a good reason to do
so (e.g. at unmount to ensure everything allocated by a filesystem
has actually been freed). Hence I'd much prefer that the decision to
expedite callbacks be made by the RCU subsystem based on its known
callback load and some indication of how close memory reclaim is to
declaring OOM...
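
The RFC patch being replied to does this via a shrinker. A minimal
sketch of that shape - registration elided; rcu_expedite_gp() and
rcu_unexpedite_gp() are the existing APIs, everything else here is
illustrative only:

static unsigned long
rcu_gp_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
{
	return 1;	/* token object so ->scan_objects gets called */
}

static unsigned long
rcu_gp_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
	/* reclaim is running: shift to fast/inefficient GP mode */
	rcu_expedite_gp();
	return 0;	/* something must call rcu_unexpedite_gp() later */
}

static struct shrinker rcu_gp_shrinker = {
	.count_objects	= rcu_gp_shrink_count,
	.scan_objects	= rcu_gp_shrink_scan,
	.seeks		= DEFAULT_SEEKS,
};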

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH RFC 2/8] selftests: add stress testing tool for dcache

2020-05-12 Thread Dave Chinner
On Fri, May 08, 2020 at 03:23:17PM +0300, Konstantin Khlebnikov wrote:
> This tool fills dcache with negative dentries. Between iterations it prints
> statistics and measures time of inotify operation which might degrade.
> 
> Signed-off-by: Konstantin Khlebnikov 
> ---
>  tools/testing/selftests/filesystems/Makefile   |1 
>  .../testing/selftests/filesystems/dcache_stress.c  |  210 
> 

This sort of thing should go into fstests along with test scripts
that use it to exercise the dentry cache. We already have tools like
this in fstests (dirstress, metaperf, etc) for exercising name-based
operations like this, so it would fit right in.

That way it would get run by just about every filesystem developer
and distro QE department automatically and extremely frequently...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v5 0/4] Charge loop device i/o to issuing cgroup

2020-05-04 Thread Dave Chinner
On Wed, Apr 29, 2020 at 12:25:40PM +0200, Jan Kara wrote:
> On Wed 29-04-20 07:47:34, Dave Chinner wrote:
> > On Tue, Apr 28, 2020 at 12:13:46PM -0400, Dan Schatzberg wrote:
> > > The loop device runs all i/o to the backing file on a separate kworker
> > > thread which results in all i/o being charged to the root cgroup. This
> > > allows a loop device to be used to trivially bypass resource limits
> > > and other policy. This patch series fixes this gap in accounting.
> > 
> > How is this specific to the loop device? Isn't every block device
> > that offloads work to a kthread or single worker thread susceptible
> > to the same "exploit"?
> > 
> > Or is the problem simply that the loop worker thread is not
> > taking the IO's associated cgroup and submitting the IO with that
> > cgroup associated with it? That seems kinda simple to fix
> > 
> > > Naively charging cgroups could result in priority inversions through
> > > the single kworker thread in the case where multiple cgroups are
> > > reading/writing to the same loop device.
> > 
> > And that's where all the complexity and serialisation comes from,
> > right?
> > 
> > So, again: how is this unique to the loop device? Other block
> > devices also offload IO to kthreads to do blocking work and IO
> > submission to lower layers. Hence this seems to me like a generic
> > "block device does IO submission from different task" issue that
> > should be handled by generic infrastructure and not need to be
> > reimplemented multiple times in every block device driver that
> > offloads work to other threads...
> 
> Yeah, I was thinking about the same when reading the patch series
> description. We already have some cgroup workarounds for btrfs kthreads if
> I remember correctly, we have cgroup handling for flush workers, now we are
> adding cgroup handling for loopback device workers, and soon I'd expect
> someone comes with a need for DM/MD worker processes and IMHO it's getting
> out of hand because the complexity spreads through the kernel with every
> subsystem coming with a slightly different solution to the problem and also
> the number of kthreads gets multiplied by the number of cgroups. So I
> agree some generic solution how to approach IO throttling of kthreads /
> workers would be desirable.

Yup, that's pretty much what I was thinking: it's yet another
special snowflake for cgroup-aware IO

> OTOH I don't have a great idea how the generic infrastructure should look
> like...

I haven't given it any thought - it's not something I have any
bandwidth to spend time on.  I'll happily review a unified
generic cgroup-aware kthread-based IO dispatch mechanism, but I
don't have the time to design and implement that myself

OTOH, I will make time to stop people screwing up filesystems and
block devices with questionable complexity and unique, storage
device dependent userspace visible error behaviour. This sort of
change is objectively worse for users than not supporting the
functionality in the first place.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v5 0/4] Charge loop device i/o to issuing cgroup

2020-05-04 Thread Dave Chinner
On Tue, Apr 28, 2020 at 10:27:32PM -0400, Johannes Weiner wrote:
> On Wed, Apr 29, 2020 at 07:47:34AM +1000, Dave Chinner wrote:
> > On Tue, Apr 28, 2020 at 12:13:46PM -0400, Dan Schatzberg wrote:
> > > This patch series does some
> > > minor modification to the loop driver so that each cgroup can make
> > > forward progress independently to avoid this inversion.
> > > 
> > > With this patch series applied, the above script triggers OOM kills
> > > when writing through the loop device as expected.
> > 
> > NACK!
> > 
> > The IO that is disallowed should fail with ENOMEM or some similar
> > error, not trigger an OOM kill that shoots some innocent bystander
> > in the head. That's worse than using BUG() to report errors...
> 
> Did you actually read the script?

Of course I did. You specifically mean this bit:

echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;

And with this patch set the dd gets OOM killed because the
/tmp/backing_file usage accounted to the cgroup goes over 64MB and
so tmpfs OOM kills the IO...

As I said: that's very broken behaviour from a storage stack
perspective.

> It's OOMing because it's creating 256M worth of tmpfs pages inside a
> 64M cgroup. It's not killing an innocent bystander, it's killing in
> the cgroup that is allocating all that memory - after Dan makes sure
> that memory is accounted to its rightful owner.

What this example does is turn /tmp into thinly provisioned storage
for $CGROUP via a restricted quota. It has a device size of 512MB,
but only has physical storage capacity of 64MB, as constrained by
the cgroup memory limit.  You're dealing with management of -storage
devices- and -IO error reporting- here, not memory management.

When thin provisioned storage runs out of space - for whatever
reason - and it cannot resolve the issue by blocking, then it *must*
return ENOSPC to the IO submitter to tell it the IO has failed. This
is no different to if we set a quota on /tmp/backing_file and it is
exhausted at 64MB of data written - we fail the IO with ENOSPC or
EDQUOT, depending on which quota we used.

IOWs, we have solid expectations on how block devices report
unsolvable resource shortages during IO because those errors have to
be handled correctly by whatever is submitting the IO. We do not use
the OOM-killer to report or resolve ENOSPC errors.
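
For reference, the block layer already has the mechanics for this - a
minimal sketch (BLK_STS_NOSPC is the existing status code; the
surrounding context is illustrative):

	/*
	 * A thin device that cannot resolve a space shortage must fail
	 * the bio; the submitter sees this as -ENOSPC.
	 */
	bio->bi_status = BLK_STS_NOSPC;
	bio_endio(bio);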

Indeed, once you've killed the dd, that CGROUP still consumes 64MB
of tmpfs space in /tmp/backing_file, in which case any further IO to
$LOOP_DEV is also going to OOM kill. These are horrible semantics
for reporting errors to userspace.

And if the OOM-killer actually frees the 64MB of data written to
/tmp/backing_file through the $LOOP_DEV, then you're actually
corrupting the storage and ensuring that nobody can read the data
that was written to $LOOP_DEV.

So now lets put a filesystem on $LOOP_DEV in the above example, and
write out data to the filesystem which does IO to $LOOP_DEV. Just by
chance, the filesystem does some metadata writes in the context of
the user process doing the writes (because journalling, etc) and
that metadata IO is what pushes the loop device over the cgroup's
memory limit.

You kill the user application even though it wasn't directly
responsible for going over the 64MB limit of space in $LOOP_DEV.
What happens now to the filesystem's metadata IO?  Did $LOOP_DEV
return an error, or after the OOM kill did the IO succeed?  What
happens if all the IO being generated from the user application is
metadata and that starts failing - killing the user application
doesn't help anything - the filesystem IO is failing and that's
where the errors need to be reported.

And if the answer is "metadata IO isn't accounted to the $CGROUP"
then what happens when the tmpfs actually runs out of its 512MB of
space because of all the metadata the filesystem wrote (e.g. create
lots of zero length files)? That's an ENOSPC error, and we'll get
that from $LOOP_DEV just fine.

So why should the same error - running out of tmpfs space vs running
out of CGROUP quota space on tmpfs - be handled differently? Either
they are both ENOSPC IO errors, or they are both OOM kill vectors
because tmpfs space has run out...

See the problem here? We report storage resource shortages
(whatever the cause) by IO errors, not by killing userspace
processes. Userspace may be able to handle the IO error sanely; it
has no warning or choice when you use OOM kill to report ENOSPC
errors...

> As opposed to before this series, where all this memory isn't
> accounted properly and goes to the root cgroup - where, ironically, it
> could cause OOM and kill an actually innocent bystander.

Re: [PATCH] fs: xfs: fix a possible data race in xfs_inode_set_reclaim_tag()

2020-05-04 Thread Dave Chinner
On Tue, May 05, 2020 at 12:15:30AM +0800, Jia-Ju Bai wrote:
> We find that xfs_inode_set_reclaim_tag() and xfs_reclaim_inode() are
> concurrently executed at runtime in the following call contexts:
> 
> Thread1:
>   xfs_fs_put_super()
> xfs_unmountfs()
>   xfs_rtunmount_inodes()
> xfs_irele()
>   xfs_fs_destroy_inode()
> xfs_inode_set_reclaim_tag()
> 
> Thread2:
>   xfs_reclaim_worker()
> xfs_reclaim_inodes()
>   xfs_reclaim_inodes_ag()
> xfs_reclaim_inode()
> 
> In xfs_inode_set_reclaim_tag():
>   pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
>   ...
>   spin_lock(&ip->i_flags_lock);
> 
> In xfs_reclaim_inode():
>   spin_lock(&ip->i_flags_lock);
>   ...
>   ip->i_ino = 0;
>   spin_unlock(&ip->i_flags_lock);
> 
> Thus, a data race can occur for ip->i_ino.
> 
> To fix this data race, the spinlock ip->i_flags_lock is used to protect
> the access to ip->i_ino in xfs_inode_set_reclaim_tag().
> 
> This data race is found by our concurrency fuzzer.

This data race cannot happen.

xfs_reclaim_inode() will not be called on this inode until -after-
the XFS_ICI_RECLAIM_TAG is set in the radix tree for this inode, and
setting that is protected by the i_flags_lock.

So while the xfs_perag_get() call doesn't lock the ip->i_ino access,
there are -multiple- i_flags_lock lock/unlock cycles before
ip->i_ino is cleared in the reclaim worker. Hence there is a full
unlock->lock memory barrier for the ip->i_ino reset inside the
critical section vs xfs_inode_set_reclaim_tag().

Hence even if the reclaim worker could access the inode before the
XFS_ICI_RECLAIM_TAG is set, no data race exists here.
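
A plain-text sketch of the ordering described above:

  xfs_inode_set_reclaim_tag()           xfs_reclaim_inode()
  ---------------------------           -------------------
  xfs_perag_get(mp, ...)
    (unlocked ip->i_ino read)           (cannot run yet: the radix
  spin_lock(&ip->i_flags_lock)           tree reclaim tag is not set)
  ...set XFS_ICI_RECLAIM_TAG...
  spin_unlock(&ip->i_flags_lock)  --->  spin_lock(&ip->i_flags_lock)
                                        ip->i_ino = 0

The unlocked i_ino read has completed before the tag is set, and the
reclaim-side store cannot begin until after that.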

> Signed-off-by: Jia-Ju Bai 
> ---
>  fs/xfs/xfs_icache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 8bf1d15be3f6..a2de08222ff5 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -229,9 +229,9 @@ xfs_inode_set_reclaim_tag(
>   struct xfs_mount*mp = ip->i_mount;
>   struct xfs_perag*pag;
>  
> + spin_lock(&ip->i_flags_lock);
>   pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
>   spin_lock(&pag->pag_ici_lock);
> - spin_lock(&ip->i_flags_lock);

Also, this creates a lock inversion deadlock here with
xfs_iget_cache_hit() clearing the XFS_IRECLAIMABLE flag.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v5 0/4] Charge loop device i/o to issuing cgroup

2020-04-28 Thread Dave Chinner
On Tue, Apr 28, 2020 at 12:13:46PM -0400, Dan Schatzberg wrote:
> The loop device runs all i/o to the backing file on a separate kworker
> thread which results in all i/o being charged to the root cgroup. This
> allows a loop device to be used to trivially bypass resource limits
> and other policy. This patch series fixes this gap in accounting.

How is this specific to the loop device? Isn't every block device
that offloads work to a kthread or single worker thread susceptible
to the same "exploit"?

Or is the problem simply that the loop worker thread is not
taking the IO's associated cgroup and submitting the IO with that
cgroup associated with it? That seems kinda simple to fix
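
A minimal sketch of that fix, assuming the existing
kthread_associate_blkcg() helper; the function shape follows
drivers/block/loop.c, but treat the details as illustrative:

static void loop_handle_cmd(struct loop_cmd *cmd)
{
	struct request *rq = blk_mq_rq_from_pdu(cmd);
	struct loop_device *lo = rq->q->queuedata;

	/* run the backing file I/O in the blkcg of the issuing bio */
	if (rq->bio && rq->bio->bi_blkg)
		kthread_associate_blkcg(&rq->bio->bi_blkg->blkcg->css);

	do_req_filebacked(lo, rq);	/* completion handling elided */

	kthread_associate_blkcg(NULL);	/* detach the worker again */
}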

> Naively charging cgroups could result in priority inversions through
> the single kworker thread in the case where multiple cgroups are
> reading/writing to the same loop device.

And that's where all the complexity and serialisation comes from,
right?

So, again: how is this unique to the loop device? Other block
devices also offload IO to kthreads to do blocking work and IO
submission to lower layers. Hence this seems to me like a generic
"block device does IO submission from different task" issue that
should be handled by generic infrastructure and not need to be
reimplemented multiple times in every block device driver that
offloads work to other threads...

> This patch series does some
> minor modification to the loop driver so that each cgroup can make
> forward progress independently to avoid this inversion.
> 
> With this patch series applied, the above script triggers OOM kills
> when writing through the loop device as expected.

NACK!

The IO that is disallowed should fail with ENOMEM or some similar
error, not trigger an OOM kill that shoots some innocent bystander
in the head. That's worse than using BUG() to report errors...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

2020-04-28 Thread Dave Chinner
On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote:
> On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote:
> > On 2020/4/28 2:43 PM, Dave Chinner wrote:
> > > On Tue, Apr 28, 2020 at 06:09:47AM +, Ruan, Shiyang wrote:
> > > > On 2020/4/27 20:28:36, "Matthew Wilcox"  wrote:
> > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > > >   This patchset is a try to resolve the shared 'page cache' problem 
> > > > > > for
> > > > > >   fsdax.
> > > > > > 
> > > > > >   In order to track multiple mappings and indexes on one page, I
> > > > > >   introduced a dax-rmap rb-tree to manage the relationship.  A dax 
> > > > > > entry
> > > > > >   will be associated more than once if it is shared.  At the second 
> > > > > > time we
> > > > > >   associate this entry, we create this rb-tree and store its root in
> > > > > >   page->private(not used in fsdax).  Insert (->mapping, ->index) 
> > > > > > when
> > > > > >   dax_associate_entry() and delete it when dax_disassociate_entry().
> > > > > 
> > > > > Do we really want to track all of this on a per-page basis?  I would
> > > > > have thought a per-extent basis was more useful.  Essentially, create
> > > > > a new address_space for each shared extent.  Per page just seems like
> > > > > a huge overhead.
> > > > > 
> > > > Per-extent tracking is a nice idea for me.  I haven't thought of it
> > > > yet...
> > > > 
> > > > But the extent info is maintained by filesystem.  I think we need a way
> > > > to obtain this info from FS when associating a page.  May be a bit
> > > > complicated.  Let me think about it...
> > > 
> > > That's why I want the -user of this association- to do a filesystem
> > > callout instead of keeping its own naive tracking infrastructure.
> > > The filesystem can do an efficient, on-demand reverse mapping lookup
> > > from its own extent tracking infrastructure, and there's zero
> > > runtime overhead when there are no errors present.
> > > 
> > > At the moment, this "dax association" is used to "report" a storage
> > > media error directly to userspace. I say "report" because what it
> > > does is kill userspace processes dead. The storage media error
> > > actually needs to be reported to the owner of the storage media,
> > > which in the case of FS-DAX is the filesystem.
> > 
> > Understood.
> > 
> > BTW, this is the usage in memory-failure, so what about rmap?  I have not
> > found how to use this tracking in rmap.  Do you have any ideas?
> > 
> > > 
> > > That way the filesystem can then look up all the owners of that bad
> > > media range (i.e. the filesystem block it corresponds to) and take
> > > appropriate action. e.g.
> > 
> > I tried writing a function to look up all the owners' info of one block in
> > xfs for memory-failure use.  It was dropped in this patchset because I found
> > out that this lookup function needs 'rmapbt' to be enabled when mkfs.  But
> > by default, rmapbt is disabled.  I am not sure if it matters...
> 
> I'm pretty sure you can't have shared extents on an XFS filesystem if you
> _don't_ have the rmapbt feature enabled.  I mean, that's why it exists.

You're confusing reflink with rmap. :)

rmapbt does all the reverse mapping tracking, reflink just does the
shared data extent tracking.

But given that anyone who wants to use DAX with reflink is going to
have to mkfs their filesystem anyway (to turn on reflink), requiring
that rmapbt is also turned on is not a big deal. Especially as we
can check it at mount time in the kernel...
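
i.e. something as simple as this at filesystem creation time (both
options already exist in mkfs.xfs; the device name is illustrative):

	# reflink requires a re-mkfs anyway, so require rmapbt too
	mkfs.xfs -m reflink=1,rmapbt=1 /dev/pmem0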

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag

2019-10-21 Thread Dave Chinner
On Mon, Oct 21, 2019 at 03:49:31PM -0700, Ira Weiny wrote:
> On Mon, Oct 21, 2019 at 11:45:36AM +1100, Dave Chinner wrote:
> > On Sun, Oct 20, 2019 at 08:59:35AM -0700, ira.we...@intel.com wrote:
> > > @@ -1232,12 +1233,10 @@ xfs_diflags_to_linux(
> > >   inode->i_flags |= S_NOATIME;
> > >   else
> > >   inode->i_flags &= ~S_NOATIME;
> > > -#if 0/* disabled until the flag switching races are sorted out */
> > >   if (xflags & FS_XFLAG_DAX)
> > >   inode->i_flags |= S_DAX;
> > >   else
> > >   inode->i_flags &= ~S_DAX;
> > > -#endif
> > 
> > This code has bit-rotted. See xfs_setup_iops(), where we now have a
> > different inode->i_mapping->a_ops for DAX inodes.
> 
> :-(
> 
> > 
> > That, fundamentally, is the issue here - it's not setting/clearing
> > the DAX flag that is the issue, it's doing a swap of the
> > mapping->a_ops while there may be other code using that ops
> > structure.
> > 
> > IOWs, if there is any code anywhere in the kernel that
> > calls an address space op without holding one of the three locks we
> > hold here (i_rwsem, MMAPLOCK, ILOCK) then it can race with the swap
> > of the address space operations.
> > 
> > By limiting the address space swap to file sizes of zero, we rule
> > out the page fault path (mmap of a zero length file segv's with an
> > access beyond EOF on the first read/write page fault, right?).
> 
> Yes I checked that and thought we were safe here...
> 
> > However, other aops callers that might run unlocked and do the wrong
> > thing if the aops pointer is swapped between check of the aop method
> > existing and actually calling it even if the file size is zero?
> > 
> > A quick look shows that FIBMAP (ioctl_fibmap()) looks susceptible
> > to such a race condition with the current definitions of the XFS DAX
> > aops. I'm guessing there will be others, but I haven't looked
> > further than this...
> 
> I'll check for others and think on what to do about this.  ext4 will have the
> same problem I think.  :-(
> 
> I don't suppose using a single a_ops for both DAX and non-DAX is palatable?

IMO, no. It means we have to check IS_DAX() in every aops method,
and replicate a bunch of callouts to generic code. i.e. this sort of
thing:

	if (aops->method)
		return aops->method(...)

	/* do something else */

results in us having to replicate that logic as something like:

	if (!IS_DAX(inode))
		return filesystem_aops_method()

	/* do something else */

Indeed, the calling code may well do the wrong thing if we have
methods defined just to add IS_DAX() checks to avoid using that
functionality because the caller now thinks that functionality is
supported when in fact it isn't.

So it seems to me like an even bigger can of worms to try to use a
single aops structure for vastly different functionality
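
For context, this is roughly how the one-time selection works today -
a trimmed sketch of xfs_setup_iops() in fs/xfs/xfs_iops.c (the real
function also sets i_op and i_fop for every file type):

void
xfs_setup_iops(
	struct xfs_inode	*ip)
{
	struct inode		*inode = &ip->i_vnode;

	switch (inode->i_mode & S_IFMT) {
	case S_IFREG:
		/* one IS_DAX() check here, none inside the methods */
		if (IS_DAX(inode))
			inode->i_mapping->a_ops = &xfs_dax_aops;
		else
			inode->i_mapping->a_ops =
				&xfs_address_space_operations;
		break;
	default:
		break;
	}
}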

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 5/5] fs/xfs: Allow toggle of physical DAX flag

2019-10-20 Thread Dave Chinner
On Sun, Oct 20, 2019 at 08:59:35AM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> Switching between DAX and non-DAX on a file is racy with respect to data
> operations.  However, if no data is involved the flag is safe to switch.
> 
> Allow toggling the physical flag if a file is empty.  The file length
> check is not racy with respect to other operations as it is performed
> under XFS_MMAPLOCK_EXCL and XFS_IOLOCK_EXCL locks.
> 
> Signed-off-by: Ira Weiny 
> ---
>  fs/xfs/xfs_ioctl.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index d3a7340d3209..3839684c249b 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -33,6 +33,7 @@
>  #include "xfs_sb.h"
>  #include "xfs_ag.h"
>  #include "xfs_health.h"
> +#include "libxfs/xfs_dir2.h"
>  
>  #include 
>  #include 
> @@ -1232,12 +1233,10 @@ xfs_diflags_to_linux(
>   inode->i_flags |= S_NOATIME;
>   else
>   inode->i_flags &= ~S_NOATIME;
> -#if 0/* disabled until the flag switching races are sorted out */
>   if (xflags & FS_XFLAG_DAX)
>   inode->i_flags |= S_DAX;
>   else
>   inode->i_flags &= ~S_DAX;
> -#endif

This code has bit-rotted. See xfs_setup_iops(), where we now have a
different inode->i_mapping->a_ops for DAX inodes.

That, fundamentally, is the issue here - it's not setting/clearing
the DAX flag that is the issue, it's doing a swap of the
mapping->a_ops while there may be other code using that ops
structure.

IOWs, if there is any code anywhere in the kernel that
calls an address space op without holding one of the three locks we
hold here (i_rwsem, MMAPLOCK, ILOCK) then it can race with the swap
of the address space operations.

By limiting the address space swap to file sizes of zero, we rule
out the page fault path (mmap of a zero length file segv's with an
access beyond EOF on the first read/write page fault, right?).
However, other aops callers that might run unlocked and do the wrong
thing if the aops pointer is swapped between check of the aop method
existing and actually calling it even if the file size is zero?

A quick look shows that FIBMAP (ioctl_fibmap()) looks susceptible
to such a race condition with the current definitions of the XFS DAX
aops. I'm guessing there will be others, but I haven't looked
further than this...

>   /* lock, flush and invalidate mapping in preparation for flag change */
>   xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);
> +
> + if (i_size_read(inode) != 0) {
> + error = -EOPNOTSUPP;
> + goto out_unlock;
> + }

Wrong error. Should be the same as whatever is returned when we try
to change the extent size hint and can't because the file is
non-zero in length (-EINVAL, I think). Also needs a comment
explaining why this check exists, and it's probably better written as
i_size_read() > 0.
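
i.e. an untested sketch of the suggested shape, reusing the error
variable and label from the patch above:

	/* lock, flush and invalidate mapping in preparation for flag change */
	xfs_ilock(ip, XFS_MMAPLOCK_EXCL | XFS_IOLOCK_EXCL);

	/*
	 * Swapping the aops is only safe while nothing else can be
	 * using the mapping, so only allow it on empty files.
	 */
	if (i_size_read(inode) > 0) {
		error = -EINVAL;
		goto out_unlock;
	}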

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/5] fs/xfs: Isolate the physical DAX flag from effective

2019-10-20 Thread Dave Chinner
On Sun, Oct 20, 2019 at 08:59:32AM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> xfs_ioctl_setattr_dax_invalidate() currently checks if the DAX flag is
> changing as a quick check.
> 
> But the implementation mixes the physical (XFS_DIFLAG2_DAX) and
> effective (S_DAX) DAX flags.

More nuanced than that.

The idea was that if the mount option was set, clearing the
per-inode flag would override the mount option. i.e. the mount
option sets the S_DAX flag at inode instantiation, so using
FSSETXATTR to ensure the FS_XFLAG_DAX is not set would override the
mount option setting, giving applications a way of guaranteeing they
aren't using DAX to access the data.

So if the mount option is going to live on, I suspect that we want
to keep this code as it stands.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 10/14] iomap: lift the xfs writeback code to iomap

2019-10-17 Thread Dave Chinner
On Thu, Oct 17, 2019 at 07:56:20PM +0200, Christoph Hellwig wrote:
> Take the xfs writeback code and move it to fs/iomap.  A new structure
> with three methods is added as the abstraction from the generic writeback
> code to the file system.  These methods are used to map blocks, submit an
> ioend, and cancel a page that encountered an error before it was added to
> an ioend.
> 
> Signed-off-by: Christoph Hellwig 

With Darrick's renaming of the .submit_ioend method, this looks
fine.

Reviewed-by: Dave Chinner 
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 01/14] iomap: iomap that extends beyond EOF should be marked dirty

2019-10-17 Thread Dave Chinner
On Thu, Oct 17, 2019 at 11:39:17AM -0700, Darrick J. Wong wrote:
> On Thu, Oct 17, 2019 at 07:56:11PM +0200, Christoph Hellwig wrote:
> > From: Dave Chinner 
> > 
> > When doing a direct IO that spans the current EOF, and there are
> > written blocks beyond EOF that extend beyond the current write, the
> > only metadata update that needs to be done is a file size extension.
> > 
> > However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
> > there is IO completion metadata updates required, and hence we may
> > fail to correctly sync file size extensions made in IO completion
> > when O_DSYNC writes are being used and the hardware supports FUA.
> > 
> > Hence when setting IOMAP_F_DIRTY, we need to also take into account
> > whether the iomap spans the current EOF. If it does, then we need to
> > mark it dirty so that IO completion will call generic_write_sync()
> > to flush the inode size update to stable storage correctly.
> > 
> > Signed-off-by: Dave Chinner 
> > Signed-off-by: Christoph Hellwig 
> 
> Looks ok, but need fixes tag.  Also, might it be wise to split off the
> ext4 section into a separate patch so that it can be backported
> separately?

I 've done a bit more digging on this, and the ext4 part is not
needed for DAX as IOMAP_F_DIRTY is only used in the page fault path
and hence can't change the file size. As such, this only affects
direct IO. Hence the ext4 hunk can be added to the ext4 iomap-dio
patchset that is being developed rather than being in this patch.

Fixes: 3460cac1ca76 ("iomap: Use FUA for pure data O_DSYNC DIO writes")
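
For reference, the shape of the fix in xfs_file_iomap_begin() is
roughly this (paraphrased from the patch under review, not a verbatim
hunk):

	/*
	 * Writes that span EOF might trigger an I/O size update on
	 * completion, so consider them dirty for the purposes of
	 * O_DSYNC even if no other metadata changes are pending.
	 */
	if ((flags & IOMAP_WRITE) && offset + length > i_size_read(inode))
		iomap_flags |= IOMAP_F_DIRTY;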

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 12/12] iomap: cleanup iomap_ioend_compare

2019-10-15 Thread Dave Chinner
On Tue, Oct 15, 2019 at 05:43:45PM +0200, Christoph Hellwig wrote:
> Move the initialization of ia and ib to the declaration line and remove
> a superflous else.
> 
> Signed-off-by: Christoph Hellwig 

nice little cleanup.

Reviewed-by: Dave Chinner 
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 11/12] iomap: move struct iomap_page out of iomap.h

2019-10-15 Thread Dave Chinner
On Tue, Oct 15, 2019 at 05:43:44PM +0200, Christoph Hellwig wrote:
> Now that all the writepage code is in the iomap code there is no
> need to keep this structure public.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Carlos Maiolino 
> Reviewed-by: Darrick J. Wong 
> ---
>  fs/iomap/buffered-io.c | 17 +
>  include/linux/iomap.h  | 17 -
>  2 files changed, 17 insertions(+), 17 deletions(-)

Sensible, nothing should be playing around with internal iomap
per-page state.

Reviewed-by: Dave Chinner 
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 10/12] iomap: warn on inline maps in iomap_writepage_map

2019-10-15 Thread Dave Chinner
On Tue, Oct 15, 2019 at 05:43:43PM +0200, Christoph Hellwig wrote:
> And inline mapping should never mark the page dirty and thus never end up
> in writepages.  Add a check for that condition and warn if it happens.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Darrick J. Wong 
> Signed-off-by: Darrick J. Wong 
> ---
>  fs/iomap/buffered-io.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 00af08006cd3..76e72576f307 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1421,6 +1421,8 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>   error = wpc->ops->map_blocks(wpc, inode, file_offset);
>   if (error)
>   break;
> + if (WARN_ON_ONCE(wpc->iomap.type == IOMAP_INLINE))
> + continue;
>   if (wpc->iomap.type == IOMAP_HOLE)
>   continue;
>   iomap_add_to_ioend(inode, file_offset, page, iop, wpc, wbc,

looks fine.

Reviewed-by: Dave Chinner 
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 09/12] iomap: lift the xfs writeback code to iomap

2019-10-15 Thread Dave Chinner
_bio);
> + return error;
> + }
> +
> + submit_bio(ioend->io_bio);
> + return 0;
> +}

.
> +/*
> + * We implement an immediate ioend submission policy here to avoid needing to
> + * chain multiple ioends and hence nest mempool allocations which can violate
> + * forward progress guarantees we need to provide. The current ioend we are
> + * adding blocks to is cached on the writepage context, and if the new block

adding pages to ... , and if the new block mapping

> + * does not append to the cached ioend it will create a new ioend and cache 
> that
> + * instead.
> + *
> + * If a new ioend is created and cached, the old ioend is returned and queued
> + * locally for submission once the entire page is processed or an error has 
> been
> + * detected.  While ioends are submitted immediately after they are 
> completed,
> + * batching optimisations are provided by higher level block plugging.
> + *
> + * At the end of a writeback pass, there will be a cached ioend remaining on 
> the
> + * writepage context that the caller will need to submit.
> + */
> +static int
> +iomap_writepage_map(struct iomap_writepage_ctx *wpc,
> + struct writeback_control *wbc, struct inode *inode,
> + struct page *page, u64 end_offset)
> +{
> + struct iomap_page *iop = to_iomap_page(page);
> + struct iomap_ioend *ioend, *next;
> + unsigned len = i_blocksize(inode);
> + u64 file_offset; /* file offset of page */
> + int error = 0, count = 0, i;
> + LIST_HEAD(submit_list);
> +
> + WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
> + WARN_ON_ONCE(iop && atomic_read(&iop->write_count) != 0);
> +
> + /*
> +  * Walk through the page to find areas to write back. If we run off the
> +  * end of the current map or find the current map invalid, grab a new
> +  * one.
> +  */
> + for (i = 0, file_offset = page_offset(page);
> +  i < (PAGE_SIZE >> inode->i_blkbits) && file_offset < end_offset;
> +  i++, file_offset += len) {
> + if (iop && !test_bit(i, iop->uptodate))
> + continue;
> +
> + error = wpc->ops->map_blocks(wpc, inode, file_offset);
> + if (error)
> + break;
> + if (wpc->iomap.type == IOMAP_HOLE)
> + continue;
> + iomap_add_to_ioend(inode, file_offset, page, iop, wpc, wbc,
> +  &submit_list);
> + count++;
> + }
> +
> + WARN_ON_ONCE(!wpc->ioend && !list_empty(&submit_list));
> + WARN_ON_ONCE(!PageLocked(page));
> + WARN_ON_ONCE(PageWriteback(page));
> +
> + /*
> +  * On error, we have to fail the ioend here because we may have set
> +  * pages under writeback, we have to make sure we run IO completion to
> +  * mark the error state of the IO appropriately, so we can't cancel the
> +  * ioend directly here.

Few too many commas and run-ons here. Maybe reword it like this:

/*
 * We cannot cancel the ioend directly here if there is a submission
 * error. We may have already set pages under writeback and hence we
 * have to run IO completion to mark the error state of the pages under
 * writeback appropriately.

>
>
>   That means we have to mark this page as under
> +  * writeback if we included any blocks from it in the ioend chain so
> +  * that completion treats it correctly.
> +  *
> +  * If we didn't include the page in the ioend, the on error we can
   then on error

> +  * simply discard and unlock it as there are no other users of the page
> +  * now.  The caller will still need to trigger submission of outstanding
> +  * ioends on the writepage context so they are treated correctly on
> +  * error.
> +  */

.

> +static int
> +iomap_do_writepage(struct page *page, struct writeback_control *wbc, void 
> *data)
> +{
> + struct iomap_writepage_ctx *wpc = data;
> + struct inode *inode = page->mapping->host;
> + pgoff_t end_index;
> + u64 end_offset;
> + loff_t offset;
> +
> + trace_iomap_writepage(inode, page, 0, 0);
> +
> + /*
> +  * Refuse to write the page out if we are called from reclaim context.
> +  *
> +  * This avoids stack overflows when called from deeply used stacks in
> +  * random callers for direct reclaim or memcg reclaim.  We explicitly
> +  * allow reclaim from kswapd as the stack usage there is relatively low.
> +  *
> +  * This should never happen except in the case of a VM regression so
> +  * warn about it.
> +  */
> + if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
> + PF_MEMALLOC))
> + goto redirty;
> +
> + /*
> +  * Given that we do not allow direct reclaim to call us, we should
> +  * never be called while in a filesystem transaction.
> +  */

   never be called in a recursive filesystem reclaim context.

> + if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> + goto redirty;
> +

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 08/12] iomap: lift the xfs readpage / readpages tracing to iomap

2019-10-15 Thread Dave Chinner
On Tue, Oct 15, 2019 at 05:43:41PM +0200, Christoph Hellwig wrote:
> Lift the xfs code for tracing address space operations to the iomap
> layer.
> 
> Signed-off-by: Christoph Hellwig 

OK.

Reviewed-by: Dave Chinner 
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 07/12] iomap: zero newly allocated mapped blocks

2019-10-15 Thread Dave Chinner
On Tue, Oct 15, 2019 at 05:43:40PM +0200, Christoph Hellwig wrote:
> File systems like gfs2 don't support delayed allocations or unwritten
> extents and thus allocate normal mapped blocks to fill holes.  To
> cover the case of such file systems allocating new blocks to fill holes
> also zero out mapped blocks with the new flag.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Darrick J. Wong 

Sensible.

Reviewed-by: Dave Chinner 


-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 06/12] xfs: remove the fork fields in the writepage_ctx and ioend

2019-10-15 Thread Dave Chinner
On Tue, Oct 15, 2019 at 05:43:39PM +0200, Christoph Hellwig wrote:
> In preparation for moving the writeback code to iomap.c, replace the
> XFS-specific COW fork concept with the iomap IOMAP_F_SHARED flag.
> 
> Signed-off-by: Christoph Hellwig 

no problems I can spot.

Reviewed-by: Dave Chinner 
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 05/12] xfs: turn io_append_trans into an io_private void pointer

2019-10-15 Thread Dave Chinner
On Tue, Oct 15, 2019 at 05:43:38PM +0200, Christoph Hellwig wrote:
> In preparation for moving the ioend structure to common code we need
> to get rid of the xfs-specific xfs_trans type.  Just make it a file
> system private void pointer instead.
> 
> Signed-off-by: Christoph Hellwig 

looks good.

Reviewed-by: Dave Chinner 
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 04/12] xfs: refactor the ioend merging code

2019-10-15 Thread Dave Chinner
On Tue, Oct 15, 2019 at 05:43:37PM +0200, Christoph Hellwig wrote:
> Introduce two nicely abstracted helpers, which can be moved to the
> iomap code later.  Also use list_pop_entry and list_first_entry_or_null
> to simplify the code a bit.
> 
> Signed-off-by: Christoph Hellwig 

looks ok.

Reviewed-by: Dave Chinner 
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 03/12] xfs: use a struct iomap in xfs_writepage_ctx

2019-10-15 Thread Dave Chinner
On Tue, Oct 15, 2019 at 05:43:36PM +0200, Christoph Hellwig wrote:
> In preparation for moving the XFS writeback code to fs/iomap.c, switch
> it to use struct iomap instead of the XFS-specific struct xfs_bmbt_irec.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Carlos Maiolino 
> Reviewed-by: Darrick J. Wong 

Pretty straight forward.

Reviewed-by: Dave Chinner 
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 02/12] xfs: set IOMAP_F_NEW more carefully

2019-10-15 Thread Dave Chinner
On Tue, Oct 15, 2019 at 05:43:35PM +0200, Christoph Hellwig wrote:
> Don't set IOMAP_F_NEW if we COW over an existing allocated range, as
> these aren't strictly new allocations.  This is required to be able to
> use IOMAP_F_NEW to zero newly allocated blocks, which is required for
> the iomap code to fully support file systems that don't do delayed
> allocations or use unwritten extents.
> 
> Signed-off-by: Christoph Hellwig 

looks fine.

Reviewed-by: Dave Chinner 

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 01/12] xfs: initialize iomap->flags in xfs_bmbt_to_iomap

2019-10-15 Thread Dave Chinner
On Tue, Oct 15, 2019 at 05:43:34PM +0200, Christoph Hellwig wrote:
> Currently we don't overwrite the flags field in the iomap in
> xfs_bmbt_to_iomap.  This works fine with 0-initialized iomaps on stack,
> but is harmful once we want to be able to reuse an iomap in the
> writeback code.  Replace the shared parameter with a set of initial
> flags and thus ensure the flags field is always reinitialized.
> 
> Signed-off-by: Christoph Hellwig 

Looks fine.

Reviewed-by: Dave Chinner 
-- 
Dave Chinner
da...@fromorbit.com


Re: Lease semantic proposal

2019-10-10 Thread Dave Chinner
On Tue, Oct 01, 2019 at 02:01:57PM -0700, Ira Weiny wrote:
> On Mon, Sep 30, 2019 at 06:42:33PM +1000, Dave Chinner wrote:
> > On Wed, Sep 25, 2019 at 04:46:03PM -0700, Ira Weiny wrote:
> > > On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > > > Hence, AFIACT, the above definition of a F_RDLCK|F_LAYOUT lease
> > > > doesn't appear to be compatible with the semantics required by
> > > > existing users of layout leases.
> > > 
> > > I disagree.  Other than the addition of F_UNBREAK, I think this is 
> > > consistent
> > > with what is currently implemented.  Also, by exporting all this to user 
> > > space
> > > we can now write tests for it independent of the RDMA pinning.
> > 
> > The current usage of F_RDLCK | F_LAYOUT by the pNFS code allows
> > layout changes to occur to the file while the layout lease is held.
> 
> This was not my understanding.

These are the remote procedure calls that the pNFS client uses to
map and/or allocate space in the file it has a lease on:

struct export_operations {
	int (*map_blocks)(struct inode *inode, loff_t offset,
			  u64 len, struct iomap *iomap,
			  bool write, u32 *device_generation);
	int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
			     int nr_iomaps, struct iattr *iattr);
};

.map_blocks() allows the pnfs client to allocate blocks in the
storage.  .commit_blocks() is called once the write is complete to
do things like unwritten extent conversion on extents that it
allocated. In the XFS implementation of these methods, they call
directly into the XFS same block mapping code that the
read/write/mmap IO paths call into.
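
A trimmed sketch of that wiring (symbols from fs/xfs/xfs_export.c and
fs/xfs/xfs_pnfs.c; the NFS file handle methods are elided):

const struct export_operations xfs_export_operations = {
	/* ... fh encode/decode handlers elided ... */
#ifdef CONFIG_EXPORTFS_BLOCK_OPS
	.get_uuid	= xfs_fs_get_uuid,
	.map_blocks	= xfs_fs_map_blocks,
	.commit_blocks	= xfs_fs_commit_blocks,
#endif
};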

A typical pNFS use case is a HPC clusters, where thousands of nodes
might all be writing to separate parts of a huge sparse file (e.g.
out of core sparse matrix solver) and are reading/writing direct to
the storage via iSER or some other low level IB/RDMA storage
protocol.  Every write on every pNFS client needs space allocation,
so the pNFS server is basically just providing a remote interface to
the XFS space allocation interfaces for direct IO on the pNFS
clients.

IOWs, there can be thousands of concurrent pNFS layout leases on a
single inode at any given time and they can all be allocating space,
too.

> > IOWs, your definition of F_RDLCK | F_LAYOUT not being allowed
> > to change the is in direct contradition to existing users.
> > 
> > I've said this several times over the past few months now: shared
> > layout leases must allow layout modifications to be made.
> 
> I don't understand what the point of having a layout lease is then?

It's a *notification* mechanism.

Multiple processes can modify the file layout at the same time -
XFS was designed as a multi-writer filesystem from the ground up, and
we make use of that with shared IO locks for direct IO writes.

The read layout lease model we've used for pNFS is essentially the
same concurrent writer model that direct IO in XFS uses. And to
enable concurrent writers, we use shared locking for the layout
leases.

IOWs, the pNFS client IO model is very similar to local direct IO,
except for the fact they can remotely cache layout mappings.  Hence
if you do a server-side local buffered write (delayed allocation),
truncate, punch a hole, etc, (or a remote operation through the NFS
server that ends up in these same paths) the mappings those pNFS
clients hold are no longer guaranteed to cover valid data and/or
have correct physical mappings for valid data held on the server.

At this point, the layouts need to be invalidated, and so the layout
lease is broken by the filesystem operations that may cause an
issue. The pNFS server reacts to the lease break by recalling the
client layout(s) and the pNFS client has to request a new layout
from the server to be able to directly access the storage again.

i.e. the layout lease is primarily a *notification* mechanism to
allow safe interop between application level direct access
mechanisms and local filesystem access.

What you are trying to do is turn this multi-writer layout lease
notification mechanism into a single writer access control
mechanism. i.e. F_UNBREAK is all about /control/ of the layout and
who can and can't modify it, regardless of whether write
permissions have been granted or not.

It seems I have been unable to get this point across despite trying
for months now: access control is not a function provided by layout
leases. If you need to guarantee exclusive access to a file so
nobody else can modify it, direct access or through the filesystem,
then that is what permission bits, ACLs, file locks, LSMs, etc are
for. Don't try to overload a layout change notification mechanism
with data access controls.

> I a

Re: [RFC PATCH 0/7] xfs: add reflink & dedupe support for fsdax.

2019-10-10 Thread Dave Chinner
On Wed, Oct 09, 2019 at 10:11:52AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 08, 2019 at 11:31:44PM -0700, Christoph Hellwig wrote:
> > Btw, I just had a chat with Dan last week on this.  And he pointed out
> > that while this series deals with the read/write path issues of 
> > reflink on DAX it doesn't deal with the mmap side issue that
> > page->mapping and page->index can point back to exactly one file.
> > 
> > I think we want a few xfstests that reflink a file and then use the
> > different links using mmap, as that should blow up pretty reliably.
> 
> Hmm, you're right, we don't actually have a test that checks the
> behavior of mwriting all copies of a shared block.  Ok, I'll go write
> one.

I've pointed this problem out to everyone who has asked me "what do
we need to do to support reflink on DAX". I've even walked a couple
of people right through the problem that needs to be solved and
discussed the potential solutions to it.

Problems that I think need addressing:

- device dax and filesystem dax have fundamentally different
  needs in this space, so they need to be separated and not
  try to use the same solution.
- dax_lock_entry() being used as a substitute for
  page_lock() but it not being held on the page itself means
  it can't be extended to serialise access to the page
  across multiple mappings that are unaware of each other
- dax_lock_page/dax_unlock_page interface for hardware
  memory errors needs to report the error to the
  filesystem for processing and repair, not assume the page
  is user data and that killing processes is the only possible
  recovery mechanism.
- dax_associate_entry/dax_disassociate_entry can only work
  for a 1:1 page:mapping,index relationship. It needs to go
  away and be replaced by a mechanism that allows
  tracking multiple page mapping/index/state tuples. This
  has much wider use than DAX (e.g. sharing page cache pages
  between reflinked files)

I've proposed shadow pages (based on a concept from Matthew Wilcox)
for each read-only reflink mapping with the real physical page being
owned by the filesystem and indexed by LBA in the filesystem buffer
cache. This would be based on whether the extent in the file the
page is mapped from has multiple references to it.

i.e. When a new page mapping occurs in a shared extent, we add the
page to the buffer cache (i.e. point a struct xfs_buf at it) if it
isn't already present, then allocate a shadow page, point it at the
master, set it up with the new mapping,index tuple and add it to the
mapping tree. Then we can treat it as a unique page even though it
points to the read-only master page.

When the page gets COWed, we toss away the shadow page and the
master can be reclaimed when the reference count goes to zero or the
extent is no longer shared.  Think of it kind of like the way we
multiply reference the zero page for holes in mmap()d dax regions,
except we can have millions of them and they are found by physical
buffer cache index lookups. 
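
To make that concrete, a rough sketch of the structure being described
- every name here is hypothetical:

struct shadow_page {
	struct page		*master;	/* fs-owned physical page,
						   indexed by LBA in the
						   buffer cache */
	struct address_space	*mapping;	/* this file's mapping */
	pgoff_t			index;		/* page offset in this file */
};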

This works for both DAX and non-DAX sharing of read-only shared
filesystem pages. i.e. it would form the basis of single-copy
read-only page cache pages for reflinked files.

There was quite a bit of talk at LSFMM 2018 about having a linked
list of mapping structures hanging off a struct page, one for each
mapping that references the page. Operations would then have to walk
all mappings that reference the page. This was useful to other
subsystems (HMM?) for some purpose I forget, but I'm not sure it's
particularly useful by itself for non-dax reflink purposes - I
suspect the filesystem would still need to track such pages itself
in its buffer cache so it can find the cached page to link new
reflink copies to the same page...

ISTR a couple of other solutions were thrown around, but I don't
think anyone came up with a simple solution...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 02/11] iomap: copy the xfs writeback code to iomap.c

2019-10-08 Thread Dave Chinner
On Tue, Oct 08, 2019 at 08:34:36AM +0200, Christoph Hellwig wrote:
> On Tue, Oct 08, 2019 at 08:43:53AM +1100, Dave Chinner wrote:
> > > +static int
> > > +iomap_ioend_compare(void *priv, struct list_head *a, struct list_head *b)
> > > +{
> > > + struct iomap_ioend *ia, *ib;
> > > +
> > > + ia = container_of(a, struct iomap_ioend, io_list);
> > > + ib = container_of(b, struct iomap_ioend, io_list);
> > > + if (ia->io_offset < ib->io_offset)
> > > + return -1;
> > > + else if (ia->io_offset > ib->io_offset)
> > > + return 1;
> > > + return 0;
> > 
> > No need for the else here.
> 
> That is usually my comment :)  But in this case it is just copied over
> code, so I didn't want to do cosmetic changes.

*nod*

> > > + /*
> > > +  * Given that we do not allow direct reclaim to call us, we should
> > > +  * never be called while in a filesystem transaction.
> > > +  */
> > > + if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> > > + goto redirty;
> > 
> > Is this true for all expected callers of these functions rather than
> > just XFS? i.e. PF_MEMALLOC_NOFS is used by transactions in XFS to
> > prevent transaction context recursion, but other filesystems do not
> > do this..
> > 
> > FWIW, I can also see that this is going to cause us problems if high
> > level code starts using memalloc_nofs_save() and then calling
> > filemap_datawrite() and friends...
> 
> We have the check for direct reclaim just above, so any file system
> using this iomap code will not allow direct reclaim.  Which I think is
> a very good idea given that direct reclaim through the file system is
> a very bad idea.

*nod*

> That leaves with only the filemap_datawrite case, which so far is
> theoretical.  If that ever becomes a thing it is very obvious and we
> can just remove the debug check.

I expect it will be a thing sooner rather than later...

> > > +iomap_writepage(struct page *page, struct writeback_control *wbc,
> > > + struct iomap_writepage_ctx *wpc,
> > > + const struct iomap_writeback_ops *ops)
> > > +{
> > > + int ret;
> > > +
> > > + wpc->ops = ops;
> > > + ret = iomap_do_writepage(page, wbc, wpc);
> > > + if (!wpc->ioend)
> > > + return ret;
> > > + return iomap_submit_ioend(wpc, wpc->ioend, ret);
> > > +}
> > > +EXPORT_SYMBOL_GPL(iomap_writepage);
> > 
> > Can we kill ->writepage for iomap users, please? After all, we don't
> > mostly don't allow memory reclaim to do writeback of dirty pages,
> > and that's the only caller of ->writepage.
> 
> I'd rather not do this as part of this move.  But if you could expedite
> your patch to kill ->writepage from the large block size support patch
> and submit it ASAP on top of this series I would be very much in favor.

Ok, looks like the usual of more follow up patches on top of these.
I'm kinda waiting for these to land before porting the large block
size stuff on top of it...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups

2019-10-07 Thread Dave Chinner
On Fri, Oct 04, 2019 at 03:11:04PM -0700, Roman Gushchin wrote:
> This is a RFC patch, which is not intended to be merged as is,
> but hopefully will start a discussion which can result in a good
> solution for the described problem.
> 
> --
> 
> We've noticed that the number of dying cgroups on our production hosts
> tends to grow with the uptime. This time it's caused by the writeback
> code.
> 
> An inode which is getting dirty for the first time is associated
> with the wb structure (look at __inode_attach_wb()). It can later
> be switched to another wb under some conditions (e.g. some other
> cgroup is writing a lot of data to the same inode), but generally
> stays associated up to the end of life of the inode structure.
> 
> The problem is that the wb structure holds a reference to the original
> memory cgroup. So if the inode was dirty once, it has a good chance
> to pin down the original memory cgroup.
> 
> An example from the real life: some service runs periodically and
> updates rpm packages. Each time in a new memory cgroup. Installed
> .so files are heavily used by other cgroups, so corresponding inodes
> tend to stay alive for a long time. So do pinned memory cgroups.
> In production I've seen many hosts with 1-2 thousands of dying
> cgroups.
> 
> This is not the first problem with the dying memory cgroups. As
> always, the problem is with their relative size: memory cgroups
> are large objects, easily 100x-1000x larger that inodes. So keeping
> a couple of thousands of dying cgroups in memory without a good reason
> (what we easily do with inodes) is quite costly (and is measured
> in tens and hundreds of Mb).
> 
> One possible approach to this problem is to switch inodes associated
> with dying wbs to the root wb. Switching is a best effort operation
> which can fail silently, so unfortunately we can't run once over a
> list of associated inodes (even if we'd have such a list). So we
> really have to scan all inodes.
> 
> In the proposed patch I schedule a work on each memory cgroup
> deletion, which is probably too often. Alternatively, we can do it
> periodically under some conditions (e.g. the number of dying memory
> cgroups is larger than X). So it's basically a gc run.
> 
> I wonder if there are any better ideas?
> 
> Signed-off-by: Roman Gushchin 
> ---
>  fs/fs-writeback.c | 29 +
>  mm/memcontrol.c   |  5 +
>  2 files changed, 34 insertions(+)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 542b02d170f8..4bbc9a200b2c 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -545,6 +545,35 @@ static void inode_switch_wbs(struct inode *inode, int 
> new_wb_id)
>   up_read(&bdi->wb_switch_rwsem);
>  }
>  
> +static void reparent_dirty_inodes_one_sb(struct super_block *sb, void *arg)
> +{
> + struct inode *inode, *next;
> +
> + spin_lock(&sb->s_inode_list_lock);
> + list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
> + spin_lock(&inode->i_lock);
> + if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
> + spin_unlock(&inode->i_lock);
> + continue;
> + }
> +
> + if (inode->i_wb && wb_dying(inode->i_wb)) {
> + spin_unlock(&inode->i_lock);
> + inode_switch_wbs(inode, root_mem_cgroup->css.id);
> + continue;
> + }
> +
> + spin_unlock(&inode->i_lock);
> + }
> + spin_unlock(&sb->s_inode_list_lock);

No idea what the best solution is, but I think this is fundamentally
unworkable. It's not uncommon to have a hundred million cached
inodes these days, often on a single filesystem. Anything that
requires a brute-force system wide inode scan, especially without
conditional reschedule points, is largely a non-starter.

Also, inode_switch_wbs() is not guaranteed to move the inode to the
destination wb.  There can only be WB_FRN_MAX_IN_FLIGHT (1024)
switches in flight at once and switches are run via RCU callbacks,
so I suspect that using inode_switch_wbs() for bulk re-assignment is
going to be a lot more complex than just finding inodes to call
inode_switch_wbs() on
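
(To illustrate the problem - a hypothetical sketch, not a proposed
fix: a conditional reschedule point requires dropping
s_inode_list_lock, and the moment the lock is dropped both list
cursors can go away underneath us:

	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
		/* ... switch work as in the patch above ... */
		if (need_resched()) {
			/* BROKEN: 'inode' and 'next' may be freed once
			 * the list lock is dropped */
			spin_unlock(&sb->s_inode_list_lock);
			cond_resched();
			spin_lock(&sb->s_inode_list_lock);
		}
	}

A real version would have to pin the current inode or insert a
cursor entry before rescheduling, which is exactly the sort of
complexity a brute-force system wide scan forces on you.)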

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 02/15] fs: Introduce i_blocks_per_page

2019-10-07 Thread Dave Chinner
On Fri, Oct 04, 2019 at 12:28:12PM -0700, Matthew Wilcox wrote:
> On Wed, Sep 25, 2019 at 06:36:50PM +1000, Dave Chinner wrote:
> > I'm actually working on abstracting this code from both block size
> > and page size via the helpers below. We have a need to support block
> > size > page size, and so that requires touching a bunch of all the
> > same code as this patchset. I'm currently trying to combine your
> > last patch set with my patchset so I can easily test allocating 64k
> > page cache pages on a 64k block size filesystem on a 4k page size
> > machine with XFS
> 
> This all makes sense ...
> 
> > > - if (iop || i_blocksize(inode) == PAGE_SIZE)
> > > + if (iop || i_blocks_per_page(inode, page) <= 1)
> > >   return iop;
> > 
> > That also means checks like these become:
> > 
> > if (iop || iomap_chunks_per_page(inode, page) <= 1)
> > 
> > as a single file can now have multiple pages per block, a page per
> > block and multiple blocks per page as the page size changes...
> > 
> > I'd like to only have to make one pass over this code to abstract
> > out page and block sizes, so I'm guessing we'll need to do some
> > co-ordination here
> 
> Yup.  I'm happy if you want to send your patches out; I'll keep going
> with the patches I have for the moment, and we'll figure out how to
> merge the two series in a way that makes sense.

I'm waiting for the xfs -> iomap writeback changes to land in a
stable branch so I don't have to do things twice in slightly
different ways in the patchset. Once we get that in an iomap-next
branch I'll rebase my patches on top of it and go from there...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 05/11] iomap: zero newly allocated mapped blocks

2019-10-07 Thread Dave Chinner
On Tue, Oct 08, 2019 at 08:46:32AM +1100, Dave Chinner wrote:
> On Sun, Oct 06, 2019 at 05:46:02PM +0200, Christoph Hellwig wrote:
> > File systems like gfs2 don't support delayed allocations or unwritten
> > extents and thus allocate normal mapped blocks to fill holes.  To
> > cover the case of such file systems allocating new blocks to fill holes
> > also zero out mapped blocks with the new flag.
> > 
> > Signed-off-by: Christoph Hellwig 
> > Reviewed-by: Darrick J. Wong 
> > Signed-off-by: Darrick J. Wong 
> > ---
> >  fs/iomap/buffered-io.c | 12 ++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 23cc308f971d..4132c0cccb0a 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -207,6 +207,14 @@ iomap_read_inline_data(struct inode *inode, struct 
> > page *page,
> > SetPageUptodate(page);
> >  }
> >  
> > +static inline bool iomap_block_needs_zeroing(struct inode *inode,
> > +   struct iomap *iomap, loff_t pos)
> > +{
> > +   return iomap->type != IOMAP_MAPPED ||
> > +   (iomap->flags & IOMAP_F_NEW) ||
> > +   pos >= i_size_read(inode);
> 
> This is a change of logic - why is the IOMAP_F_NEW check added here
> and what bug does it fix?

Sorry, brain-fart here - that's what this patch is adding, it's not
a pure factoring patch :/

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 09/11] xfs: remove the fork fields in the writepage_ctx and ioend

2019-10-07 Thread Dave Chinner
On Sun, Oct 06, 2019 at 05:46:06PM +0200, Christoph Hellwig wrote:
> In preparation for moving the writeback code to iomap.c, replace the
> XFS-specific COW fork concept with the iomap IOMAP_F_SHARED flag.

"In preparation for switching XFS to use the fs/iomap writeback
code..."?

I suspect the IOMAP_F_SHARED hunk I pointed out in the previous
patch should be in this one...

> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Carlos Maiolino 
> Reviewed-by: Darrick J. Wong 
> ---
>  fs/xfs/xfs_aops.c | 42 ++
>  fs/xfs/xfs_aops.h |  2 +-
>  2 files changed, 23 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 9f22885902ef..8c101081e3b1 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -23,7 +23,6 @@
>   */
>  struct xfs_writepage_ctx {
>   struct iomapiomap;
> - int fork;
>   unsigned intdata_seq;
>   unsigned intcow_seq;
>   struct xfs_ioend*ioend;
> @@ -257,7 +256,7 @@ xfs_end_ioend(
>*/
>   error = blk_status_to_errno(ioend->io_bio->bi_status);
>   if (unlikely(error)) {
> - if (ioend->io_fork == XFS_COW_FORK)
> + if (ioend->io_flags & IOMAP_F_SHARED)
>   xfs_reflink_cancel_cow_range(ip, offset, size, true);
>   goto done;
>   }
> @@ -265,7 +264,7 @@ xfs_end_ioend(
>   /*
>* Success: commit the COW or unwritten blocks if needed.
>*/
> - if (ioend->io_fork == XFS_COW_FORK)
> + if (ioend->io_flags & IOMAP_F_SHARED)
>   error = xfs_reflink_end_cow(ip, offset, size);
>   else if (ioend->io_type == IOMAP_UNWRITTEN)
>   error = xfs_iomap_write_unwritten(ip, offset, size, false);
> @@ -298,7 +297,8 @@ xfs_ioend_can_merge(
>  {
>   if (ioend->io_bio->bi_status != next->io_bio->bi_status)
>   return false;
> - if ((ioend->io_fork == XFS_COW_FORK) ^ (next->io_fork == XFS_COW_FORK))
> + if ((ioend->io_flags & IOMAP_F_SHARED) ^
> + (next->io_flags & IOMAP_F_SHARED))
>   return false;
>   if ((ioend->io_type == IOMAP_UNWRITTEN) ^
>   (next->io_type == IOMAP_UNWRITTEN))

These probably should be indented too, as they are continuations,
not separate logic statements.

> @@ -768,7 +769,8 @@ xfs_add_to_ioend(
>   boolmerged, same_page = false;
>  
>   if (!wpc->ioend ||
> - wpc->fork != wpc->ioend->io_fork ||
> + (wpc->iomap.flags & IOMAP_F_SHARED) !=
> + (wpc->ioend->io_flags & IOMAP_F_SHARED) ||

Same here.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 08/11] xfs: use a struct iomap in xfs_writepage_ctx

2019-10-07 Thread Dave Chinner
On Sun, Oct 06, 2019 at 05:46:05PM +0200, Christoph Hellwig wrote:
> In preparation for moving the XFS writeback code to fs/iomap.c, switch
> it to use struct iomap instead of the XFS-specific struct xfs_bmbt_irec.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Carlos Maiolino 
> Reviewed-by: Darrick J. Wong 
> ---
>  fs/xfs/libxfs/xfs_bmap.c | 14 +--
>  fs/xfs/libxfs/xfs_bmap.h |  3 +-
>  fs/xfs/xfs_aops.c| 82 +++-
>  fs/xfs/xfs_aops.h|  2 +-
>  4 files changed, 50 insertions(+), 51 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 4edc25a2ba80..d0e7f9ef7b85 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -34,6 +34,7 @@
>  #include "xfs_ag_resv.h"
>  #include "xfs_refcount.h"
>  #include "xfs_icache.h"
> +#include "xfs_iomap.h"
>  
>  
>  kmem_zone_t  *xfs_bmap_free_item_zone;
> @@ -4454,16 +4455,21 @@ int
>  xfs_bmapi_convert_delalloc(
>   struct xfs_inode*ip,
>   int whichfork,
> - xfs_fileoff_t   offset_fsb,
> - struct xfs_bmbt_irec*imap,
> + xfs_off_t   offset,
> + struct iomap*iomap,
>   unsigned int*seq)
>  {
>   struct xfs_ifork*ifp = XFS_IFORK_PTR(ip, whichfork);
>   struct xfs_mount*mp = ip->i_mount;
> + xfs_fileoff_t   offset_fsb = XFS_B_TO_FSBT(mp, offset);
>   struct xfs_bmalloca bma = { NULL };
> + u16 flags = 0;
>   struct xfs_trans*tp;
>   int error;
>  
> + if (whichfork == XFS_COW_FORK)
> + flags |= IOMAP_F_SHARED;

That seems out of place - I don't see anywhere in this patch that
moves/removes setting the IOMAP_F_SHARED flag. i.e. this looks like a
change of behaviour

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 05/11] iomap: zero newly allocated mapped blocks

2019-10-07 Thread Dave Chinner
On Sun, Oct 06, 2019 at 05:46:02PM +0200, Christoph Hellwig wrote:
> File systems like gfs2 don't support delayed allocations or unwritten
> extents and thus allocate normal mapped blocks to fill holes.  To
> cover the case of such file systems allocating new blocks to fill holes
> also zero out mapped blocks with the new flag.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Darrick J. Wong 
> Signed-off-by: Darrick J. Wong 
> ---
>  fs/iomap/buffered-io.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 23cc308f971d..4132c0cccb0a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -207,6 +207,14 @@ iomap_read_inline_data(struct inode *inode, struct page 
> *page,
>   SetPageUptodate(page);
>  }
>  
> +static inline bool iomap_block_needs_zeroing(struct inode *inode,
> + struct iomap *iomap, loff_t pos)
> +{
> + return iomap->type != IOMAP_MAPPED ||
> + (iomap->flags & IOMAP_F_NEW) ||
> + pos >= i_size_read(inode);

This is a change of logic - why is the IOMAP_F_NEW check added here
and what bug does it fix?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 02/11] iomap: copy the xfs writeback code to iomap.c

2019-10-07 Thread Dave Chinner
On Sun, Oct 06, 2019 at 05:45:59PM +0200, Christoph Hellwig wrote:
> Takes the xfs writeback code and copies it to iomap.c.  A new structure
> with three methods is added as the abstraction from the generic
> writeback code to the file system.  These methods are used to map
> blocks, submit an ioend, and cancel a page that encountered an error
> before it was added to an ioend.
> 
> Signed-off-by: Christoph Hellwig 
> [darrick: create the new iomap code, we'll delete the xfs code separately]
> Reviewed-by: Darrick J. Wong 
> Signed-off-by: Darrick J. Wong 
.

> +static int
> +iomap_ioend_compare(void *priv, struct list_head *a, struct list_head *b)
> +{
> + struct iomap_ioend *ia, *ib;
> +
> + ia = container_of(a, struct iomap_ioend, io_list);
> + ib = container_of(b, struct iomap_ioend, io_list);
> + if (ia->io_offset < ib->io_offset)
> + return -1;
> + else if (ia->io_offset > ib->io_offset)
> + return 1;
> + return 0;

No need for the else here.
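
i.e. (illustrative) the tail of the function can just be:

	if (ia->io_offset < ib->io_offset)
		return -1;
	if (ia->io_offset > ib->io_offset)
		return 1;
	return 0;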

> +/*
> + * Test to see if we have an existing ioend structure that we could append to
> + * first, otherwise finish off the current ioend and start another.
> + */
> +static void
> +iomap_add_to_ioend(struct inode *inode, loff_t offset, struct page *page,
> + struct iomap_page *iop, struct iomap_writepage_ctx *wpc,
> + struct writeback_control *wbc, struct list_head *iolist)
> +{
> + sector_t sector = iomap_sector(&wpc->iomap, offset);
> + unsigned len = i_blocksize(inode);
> + unsigned poff = offset & (PAGE_SIZE - 1);
> + bool merged, same_page = false;
> +
> + if (!wpc->ioend ||
> + (wpc->iomap.flags & IOMAP_F_SHARED) !=
> + (wpc->ioend->io_flags & IOMAP_F_SHARED) ||

This second line of the comparison should be indented further as it
is a continuation of the previous line's logic statement, not a
unique logic statement by itself.
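
e.g. something like this (illustrative):

	if (!wpc->ioend ||
	    (wpc->iomap.flags & IOMAP_F_SHARED) !=
			(wpc->ioend->io_flags & IOMAP_F_SHARED) ||
	    wpc->iomap.type != wpc->ioend->io_type ||
	    ...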

> + wpc->iomap.type != wpc->ioend->io_type ||
> + sector != bio_end_sector(wpc->ioend->io_bio) ||
> + offset != wpc->ioend->io_offset + wpc->ioend->io_size) {
> + if (wpc->ioend)
> + list_add(&wpc->ioend->io_list, iolist);
> + wpc->ioend = iomap_alloc_ioend(inode, wpc, offset, sector, wbc);
> + }
.
> +static int
> +iomap_do_writepage(struct page *page, struct writeback_control *wbc, void 
> *data)
> +{
> + struct iomap_writepage_ctx *wpc = data;
> + struct inode *inode = page->mapping->host;
> + pgoff_t end_index;
> + u64 end_offset;
> + loff_t offset;
> +
> + trace_iomap_writepage(inode, page, 0, 0);
> +
> + /*
> +  * Refuse to write the page out if we are called from reclaim context.
> +  *
> +  * This avoids stack overflows when called from deeply used stacks in
> +  * random callers for direct reclaim or memcg reclaim.  We explicitly
> +  * allow reclaim from kswapd as the stack usage there is relatively low.
> +  *
> +  * This should never happen except in the case of a VM regression so
> +  * warn about it.
> +  */
> + if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
> + PF_MEMALLOC))
> + goto redirty;
> +
> + /*
> +  * Given that we do not allow direct reclaim to call us, we should
> +  * never be called while in a filesystem transaction.
> +  */
> + if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> + goto redirty;

Is this true for all expected callers of these functions rather than
just XFS? i.e. PF_MEMALLOC_NOFS is used by transactions in XFS to
prevent transaction context recursion, but other filesystems do not
do this...

FWIW, I can also see that this is going to cause us problems if high
level code starts using memalloc_nofs_save() and then calling
filemap_fdatawrite() and friends...
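
e.g. a pattern like this (hypothetical caller, sketch only):

	unsigned int nofs_flags = memalloc_nofs_save();

	/* ... allocation-sensitive filesystem work ... */
	filemap_fdatawrite(inode->i_mapping);	/* writeback sees
						 * PF_MEMALLOC_NOFS set
						 * and redirties the pages
						 * instead of writing them */
	memalloc_nofs_restore(nofs_flags);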

> +iomap_writepage(struct page *page, struct writeback_control *wbc,
> + struct iomap_writepage_ctx *wpc,
> + const struct iomap_writeback_ops *ops)
> +{
> + int ret;
> +
> + wpc->ops = ops;
> + ret = iomap_do_writepage(page, wbc, wpc);
> + if (!wpc->ioend)
> + return ret;
> + return iomap_submit_ioend(wpc, wpc->ioend, ret);
> +}
> +EXPORT_SYMBOL_GPL(iomap_writepage);

Can we kill ->writepage for iomap users, please? After all, we
mostly don't allow memory reclaim to do writeback of dirty pages,
and that's the only caller of ->writepage.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Lease semantic proposal

2019-09-30 Thread Dave Chinner
On Wed, Sep 25, 2019 at 04:46:03PM -0700, Ira Weiny wrote:
> On Tue, Sep 24, 2019 at 08:26:20AM +1000, Dave Chinner wrote:
> > Hence, AFIACT, the above definition of a F_RDLCK|F_LAYOUT lease
> > doesn't appear to be compatible with the semantics required by
> > existing users of layout leases.
> 
> I disagree.  Other than the addition of F_UNBREAK, I think this is consistent
> with what is currently implemented.  Also, by exporting all this to user space
> we can now write tests for it independent of the RDMA pinning.

The current usage of F_RDLCK | F_LAYOUT by the pNFS code allows
layout changes to occur to the file while the layout lease is held.
IOWs, your definition of F_RDLCK | F_LAYOUT not being allowed
to change the layout is in direct contradiction to existing users.

I've said this several times over the past few months now: shared
layout leases must allow layout modifications to be made. Only
allowing an exclusive layout lease to modify the layout rules out
many potential use cases for direct data placement and p2p DMA
applications, not to mention conflicts with the existing pNFS usage.
Layout leases need to support more than just RDMA, and tailoring the
API to exactly the immediate needs of RDMA is just going to make it
useless for anything else.

I'm getting frustrated now because we still seem to be going around
in circles and getting nowhere.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] mm: implement write-behind policy for sequential file writes

2019-09-25 Thread Dave Chinner
On Wed, Sep 25, 2019 at 11:15:30AM +0300, Konstantin Khlebnikov wrote:
> On 25/09/2019 10.18, Dave Chinner wrote:
> > On Tue, Sep 24, 2019 at 12:00:17PM +0300, Konstantin Khlebnikov wrote:
> > > On 24/09/2019 10.39, Dave Chinner wrote:
> > > > On Mon, Sep 23, 2019 at 06:06:46PM +0300, Konstantin Khlebnikov wrote:
> > > > > On 23/09/2019 17.52, Tejun Heo wrote:
> > > > > > Hello, Konstantin.
> > > > > > 
> > > > > > On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov 
> > > > > > wrote:
> > > > > > > With vm.dirty_write_behind 1 or 2 files are written even faster 
> > > > > > > and
> > > > > > 
> > > > > > Is the faster speed reproducible?  I don't quite understand why this
> > > > > > would be.
> > > > > 
> > > > > Writing to disk simply starts earlier.
> > > > 
> > > > Stupid question: how is this any different to simply winding down
> > > > our dirty writeback and throttling thresholds like so:
> > > > 
> > > > # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes
> > > > 
> > > > to start background writeback when there's 100MB of dirty pages in
> > > > memory, and then:
> > > > 
> > > > # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes
> > > > 
> > > > So that writers are directly throttled at 200MB of dirty pages in
> > > > memory?
> > > > 
> > > > This effectively gives us global writebehind behaviour with a
> > > > 100-200MB cache write burst for initial writes.
> > > 
> > > Global limits affect all dirty pages including memory-mapped and
> > > randomly touched. Write-behind aims only at sequential streams.
> > 
> > There are apps that do sequential writes via mmap()d files.
> > They should do writebehind too, yes?
> 
> I see no reason for that. This is a different scenario.

It is?

> Mmap has no clear signal about "end of write", only a page fault at
> the beginning. Theoretically we could implement a similar sliding
> window and start writeback on subsequent page faults.

sequential IO doing pwrite() in a loop has no clear signal about
"end of write", either. It's exactly the same as doing a memset(0)
on a mmap()d region to zero the file. i.e. the write doesn't stop
until EOF is reached...
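
i.e. the two patterns being compared are (userspace sketch):

	/* write() loop - no "end of write" signal until it stops */
	off_t off;
	for (off = 0; off < len; off += chunk)
		pwrite(fd, buf, chunk, off);

	/* mmap() equivalent - exactly the same sequential dirtying */
	char *p = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	memset(p, 0, len);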

> But applications that use memory-mapped files probably know better what
> to do with this data. I prefer to leave them alone for now.

By that argument, we shouldn't have readahead for mmap() access or
even read-around for page faults. We can track read and write faults
exactly for mmap(), so if you are tracking sequential page dirtying
for writebehind we can do that just as easily for mmap (via
->page_mkwrite) as we can for write() IO.

> > > > And, really such strict writebehind behaviour is going to cause all
> > > > sorts of unintended problems with filesystems because there will be
> > > > adverse interactions with delayed allocation. We need a substantial
> > > > amount of dirty data to be cached for writeback for fragmentation
> > > > minimisation algorithms to be able to do their job
> > > 
> > > I think most sequentially written files never change after close.
> > 
> > There are lots of apps that write zeros to initialise and allocate
> > space, then go write real data to them. Database WAL files are
> > commonly initialised like this...
> 
> Those zeros are just a bunch of dirty pages which have to be written.
> Sync and memory pressure will do that, so why shouldn't write-behind?

Huh? IIUC, the writebehind flag is a global behaviour flag for the
kernel - everything does writebehind or nothing does it, right?

Hence if you turn on writebehind, the writebehind will write the
zeros to disk before real data can be written. We no longer have
zeroing as something that sits in the cache until it's overwritten
with real data - that file now gets written twice and it delays the
application from actually writing real data until the zeros are all
on disk.

strict writebehind without the ability to burst temporary/short-term
data/state into the cache is going to cause a lot of performance
regressions in applications

> > > Except for knowing the final size of huge files (>16Mb in my patch)
> > > there should be no difference for delayed allocation.
> > 
> > There is, because you throttle the writes down such that there is
> > only 16MB of dirty data in memory. Hence filesystems will only
> > typically allocate in 16MB chunks as that's all the delalloc range
> > spans.

Re: [PATCH 02/15] fs: Introduce i_blocks_per_page

2019-09-25 Thread Dave Chinner
On Tue, Sep 24, 2019 at 05:52:01PM -0700, Matthew Wilcox wrote:
> From: "Matthew Wilcox (Oracle)" 
> 
> This helper is useful for both large pages in the page cache and for
> supporting block size larger than page size.  Convert some example
> users (we have a few different ways of writing this idiom).
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

I'm actually working on abstracting this code from both block size
and page size via the helpers below. We have a need to support block
size > page size, and so that requires touching a bunch of all the
same code as this patchset. I'm currently trying to combine your
last patch set with my patchset so I can easily test allocating 64k
page cache pages on a 64k block size filesystem on a 4k page size
machine with XFS

/*
 * Return the chunk size we should use for page cache based operations.
 * This supports both large block sizes and variable page sizes based on the
 * restriction that order-n blocks and page cache pages are order-n file offset
 * aligned.
 *
 * This will return the inode block size for block size < page_size(page),
 * otherwise it will return page_size(page).
 */
static inline unsigned
iomap_chunk_size(struct inode *inode, struct page *page)
{
return min_t(unsigned, page_size(page), i_blocksize(inode));
}

static inline unsigned
iomap_chunk_bits(struct inode *inode, struct page *page)
{
return min_t(unsigned, page_shift(page), inode->i_blkbits);
}

static inline unsigned
iomap_chunks_per_page(struct inode *inode, struct page *page)
{
return page_size(page) >> inode->i_blkbits;
}

Basically, the process is to convert the iomap code over to
iterating "chunks" rather than blocks or pages, and then allocate
a struct iomap_page according to the difference between page and
block size
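
(For illustration only - not from the patchset - a per-chunk loop
using the helpers above might look like:

	unsigned chunk = iomap_chunk_size(inode, page);
	unsigned poff;

	for (poff = 0; poff < page_size(page); poff += chunk) {
		/* per-chunk state: uptodate tracking, etc. */
	}

and the same loop then covers block size <= page size and
block size > page size.)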

> ---
>  fs/iomap/buffered-io.c  |  4 ++--
>  fs/jfs/jfs_metapage.c   |  2 +-
>  fs/xfs/xfs_aops.c   |  8 
>  include/linux/pagemap.h | 13 +
>  4 files changed, 20 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index e25901ae3ff4..0e76a4b6d98a 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -24,7 +24,7 @@ iomap_page_create(struct inode *inode, struct page *page)
>  {
>   struct iomap_page *iop = to_iomap_page(page);
>  
> - if (iop || i_blocksize(inode) == PAGE_SIZE)
> + if (iop || i_blocks_per_page(inode, page) <= 1)
>   return iop;

That also means checks like these become:

if (iop || iomap_chunks_per_page(inode, page) <= 1)

as a single file can now have multiple pages per block, a page per
block and multiple blocks per page as the page size changes...

I'd like to only have to make one pass over this code to abstract
out page and block sizes, so I'm guessing we'll need to do some
co-ordination here

> @@ -636,4 +636,17 @@ static inline unsigned long dir_pages(struct inode 
> *inode)
>  PAGE_SHIFT;
>  }
>  
> +/**
> + * i_blocks_per_page - How many blocks fit in this page.
> + * @inode: The inode which contains the blocks.
> + * @page: The (potentially large) page.
> + *
> + * Context: Any context.
> + * Return: The number of filesystem blocks covered by this page.
> + */
> +static inline
> +unsigned int i_blocks_per_page(struct inode *inode, struct page *page)
> +{
> + return page_size(page) >> inode->i_blkbits;
> +}
>  #endif /* _LINUX_PAGEMAP_H */

It also means that we largely don't need to touch mm headers as
all the helpers end up being iomap specific and private...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] mm: implement write-behind policy for sequential file writes

2019-09-25 Thread Dave Chinner
On Tue, Sep 24, 2019 at 12:08:04PM -0700, Linus Torvalds wrote:
> On Tue, Sep 24, 2019 at 12:39 AM Dave Chinner  wrote:
> >
> > Stupid question: how is this any different to simply winding down
> > our dirty writeback and throttling thresholds like so:
> >
> > # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes
> 
> Our dirty_background stuff is very questionable, but it exists (and
> has those insane defaults) because of various legacy reasons.

That's not what I was asking about.  The context is in the previous
lines you didn't quote:

> > > > Is the faster speed reproducible?  I don't quite understand why this
> > > > would be.
> > >
> > > Writing to disk simply starts earlier.
> >
> > Stupid question: how is this any different to simply winding down
> > our dirty writeback and throttling thresholds like so:

i.e. I'm asking about the reasons for the performance differential
not asking for an explanation of what writebehind is. If the
performance differential really is caused by writeback starting
sooner, then winding down dirty_background_bytes should produce
exactly the same performance because it will start writeback -much
faster-.

If it doesn't, then the assertion that the difference is caused by
earlier writeout is questionable and the code may not actually be
doing what is claimed

Basically, I'm asking for proof that the explanation is correct.

> > to start background writeback when there's 100MB of dirty pages in
> > memory, and then:
> >
> > # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes
> 
> The thing is, that also accounts for dirty shared mmap pages. And it
> really will kill some benchmarks that people take very very seriously.

Yes, I know that. I'm not suggesting that we do this,

[snip]

> Anyway, the end result of all this is that we have that
> balance_dirty_pages() that is pretty darn complex and I suspect very
> few people understand everything that goes on in that function.

I'd agree with you there - most of the ground work for the
balance_dirty_pages IO throttling feedback loop was all based on
concepts I developed to solve dirty page writeback thrashing
problems on Irix back in 2003.  The code we have in Linux was
written by Fenguang Wu with help from a lot of people, but the
underlying concepts of delegating IO to dedicated writeback threads
that calculate and track page cleaning rates (BDI writeback rates)
and then throttling incoming page dirtying rate to the page cleaning
rate all came out of my head

So, much as it may surprise you, I am one of the few people who do
actually understand how that whole complex mass of accounting and
feedback is supposed to work. :)

> Now, whether write-behind really _does_ help that, or whether it's
> just yet another tweak and complication, I can't actually say.

Neither can I at this point - I lack the data and that's why I was
asking if there was a perf difference with the existing limits wound
right down. Knowing whether the performance difference is simply a
result of starting writeback IO sooner tells me an awful lot about
what other behaviour is happening as a result of the changes in this
patch.

> But I
> don't think 'dirty_background_bytes' is really an argument against
> write-behind, it's just one knob on the very complex dirty handling we
> have.

Never said it was - just trying to determine if a one line
explanation is true or not.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] mm: implement write-behind policy for sequential file writes

2019-09-25 Thread Dave Chinner
On Tue, Sep 24, 2019 at 12:00:17PM +0300, Konstantin Khlebnikov wrote:
> On 24/09/2019 10.39, Dave Chinner wrote:
> > On Mon, Sep 23, 2019 at 06:06:46PM +0300, Konstantin Khlebnikov wrote:
> > > On 23/09/2019 17.52, Tejun Heo wrote:
> > > > Hello, Konstantin.
> > > > 
> > > > On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote:
> > > > > With vm.dirty_write_behind 1 or 2 files are written even faster and
> > > > 
> > > > Is the faster speed reproducible?  I don't quite understand why this
> > > > would be.
> > > 
> > > Writing to disk simply starts earlier.
> > 
> > Stupid question: how is this any different to simply winding down
> > our dirty writeback and throttling thresholds like so:
> > 
> > # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes
> > 
> > to start background writeback when there's 100MB of dirty pages in
> > memory, and then:
> > 
> > # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes
> > 
> > So that writers are directly throttled at 200MB of dirty pages in
> > memory?
> > 
> > This effectively gives us global writebehind behaviour with a
> > 100-200MB cache write burst for initial writes.
> 
> Global limits affect all dirty pages including memory-mapped and
> randomly touched. Write-behind aims only at sequential streams.

There are apps that do sequential writes via mmap()d files.
They should do writebehind too, yes?

> > And, really such strict writebehind behaviour is going to cause all
> > sorts of unintended problems with filesystems because there will be
> > adverse interactions with delayed allocation. We need a substantial
> > amount of dirty data to be cached for writeback for fragmentation
> > minimisation algorithms to be able to do their job
> 
> I think most sequentially written files never change after close.

There are lots of apps that write zeros to initialise and allocate
space, then go write real data to them. Database WAL files are
commonly initialised like this...
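
e.g. (illustrative userspace sketch, not from any specific database):

	static char zeros[1 << 16];	/* zero-filled buffer */
	int fd = open("wal.0001", O_CREAT|O_WRONLY|O_EXCL, 0600);
	off_t off;

	/* preallocate the log file by writing zeros... */
	for (off = 0; off < log_size; off += sizeof(zeros))
		pwrite(fd, zeros, sizeof(zeros), off);
	fdatasync(fd);
	/* ...then overwrite the zeroed range with real records later */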

> Except for knowing the final size of huge files (>16Mb in my patch)
> there should be no difference for delayed allocation.

There is, because you throttle the writes down such that there is
only 16MB of dirty data in memory. Hence filesystems will only
typically allocate in 16MB chunks as that's all the delalloc range
spans.

I'm not so concerned for XFS here, because our speculative
preallocation will handle this just fine, but for ext4 and btrfs
it's going to interleave the allocate of concurrent streaming writes
and fragment the crap out of the files.

In general, the smaller you make the individual file writeback
window, the worse the fragmentation problem gets
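
(Illustrative arithmetic: two concurrent 1GiB streaming writes, each
throttled to a 16MiB dirty window, give the allocator ~64 separate
16MiB targets per file, interleaved A/B/A/B on disk, instead of a
handful of large extents per file.)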

> Probably write-behind could provide a hint about the streaming pattern:
> pass something like "MSG_MORE" into the writeback call.

How does that help when we've only got dirty data and block
reservations up to EOF which is no more than 16MB away?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 2/2] mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two)

2019-09-24 Thread Dave Chinner
On Tue, Sep 24, 2019 at 11:19:29PM +0200, Vlastimil Babka wrote:
> On 9/23/19 7:51 PM, Darrick J. Wong wrote:
> > On Mon, Sep 23, 2019 at 07:17:10PM +0200, David Sterba wrote:
> >> On Mon, Sep 23, 2019 at 06:36:32PM +0200, Vlastimil Babka wrote:
> >>> So if anyone thinks this is a good idea, please express it (preferably
> >>> in a formal way such as Acked-by), otherwise it seems the patch will be
> >>> dropped (due to a private NACK, apparently).
> > 
> > Oh, I didn't realize that *some* of us are allowed the
> > privilege of gutting a patch via private NAK without any of that open
> > development discussion inconvenience.
> > 
> > As far as XFS is concerned I merged Dave's series that checks the
> > alignment of io memory allocations and falls back to vmalloc if the
> > alignment won't work, because I got tired of scrolling past the endless
> > discussion and bug reports and inaction spanning months.
> 
> I think it's a big fail of kmalloc API that you have to do that, and
> especially with vmalloc, which has the overhead of setting up page
> tables, and it's a waste for allocation requests smaller than page size.
> I wish we could have nice things.

I don't think the problem here is the code. The problem here is that
we have a dysfunctional development community and there are no
processes we can follow to ensure architectural problems in core
subsystems are addressed in a timely manner...

And this criticism isn't just of the mm/ here - this alignment
problem is exacerbated by exactly the same issue on the block layer
side. i.e. the block layer and drivers have -zero- bounds checking
to catch these sorts of things and the block layer maintainer will
not accept patches for runtime checks that would catch these issues
and make them instantly visible to us.

These are not code problems: we can fix the problems with code (and
I have done so to demonstrate "this is how we do what you say is
impossible").  The problem here is people in positions of
control/power are repeatedly demonstrating an inability to
compromise to reach a solution that works for everyone.

It's far better for us just to work around bullshit like this in XFS
now, then when the core subsystems get their act together years down
the track we can remove the workaround from XFS. Users don't care
how we fix the problem, they just want it fixed. If that means we
have to route around dysfunctional developer groups, then we'll just
have to do that

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2] mm: implement write-behind policy for sequential file writes

2019-09-24 Thread Dave Chinner
On Mon, Sep 23, 2019 at 06:06:46PM +0300, Konstantin Khlebnikov wrote:
> On 23/09/2019 17.52, Tejun Heo wrote:
> > Hello, Konstantin.
> > 
> > On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote:
> > > With vm.dirty_write_behind 1 or 2 files are written even faster and
> > 
> > Is the faster speed reproducible?  I don't quite understand why this
> > would be.
> 
> Writing to disk simply starts earlier.

Stupid question: how is this any different to simply winding down
our dirty writeback and throttling thresholds like so:

# echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes

to start background writeback when there's 100MB of dirty pages in
memory, and then:

# echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes

So that writers are directly throttled at 200MB of dirty pages in
memory?

This effectively gives us global writebehind behaviour with a
100-200MB cache write burst for initial writes.

And, really such strict writebehind behaviour is going to cause all
sorts of unintended problems with filesystems because there will be
adverse interactions with delayed allocation. We need a substantial
amount of dirty data to be cached for writeback for fragmentation
minimisation algorithms to be able to do their job

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Lease semantic proposal

2019-09-23 Thread Dave Chinner
ngle open fd. Single writers are something we
are always trying to avoid in XFS.

> Operations which change the layout are allowed by that process.  But 
> operations
> from other file descriptors which attempt to change the layout will break the
> lease through the standard lease break process.

If the F_WRLCK|F_LAYOUT lease is exclusive, who is actually able to
modify the layout?  Are you talking about processes that don't
actually hold leases modifying the layout? i.e. what are the
constraints on "exclusive access" here - is F_WRLCK|F_LAYOUT
only exclusive when every modification is co-operating and taking
the appropriate layout lease for every access to the file that is
made?

If that's the case, what happens when someone fails to get a read
lock and decides "I can break write locks just by using ftruncate()
to the same size without a layout lease". Or fallocate() to
preallocate space that is already allocated. Or many other things I
can think of.
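
e.g. (illustrative sketch of the abuse being described - the caller
holds no lease at all):

	fstat(fd, &st);
	ftruncate(fd, st.st_size);		/* "no change" truncate */
	fallocate(fd, 0, 0, st.st_size);	/* preallocate blocks that
						 * are already allocated */

Both are layout-modifying operations that would break the holder's
lease without the caller ever having to play by the lease rules.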

IOWs, this seems to me like a very fragile sort of construct that is
open to abuse and that will lead to everyone using F_UNBREAK, which
is highly unfriendly to everyone else...

> The F_LAYOUT flag is used to
> indicate a difference between a regular F_WRLCK and F_WRLCK with F_LAYOUT.  In
> the F_LAYOUT case opens for write do not break the lease.  But some 
> operations,
> if they change the underlying layout, may.
> 
> The distinction between read layout leases and write layout leases is that
> write layout leases can change the layout without breaking the lease within 
> the
> owning process.  This is useful to guarantee a layout prior to specifying the
> unbreakable flag described below.

Ok, so now you really are saying that F_RDLCK leases can only be
used on O_RDONLY file descriptors because any modification under a
F_RDLCK|LAYOUT will trigger a layout break.

> **Unbreakable Layout Leases (F_UNBREAK)**
> 
> In order to support pinning of file pages by direct user space users an
> unbreakable flag (F_UNBREAK) can be used to modify the read and write layout
> lease.  When specified, F_UNBREAK indicates that any user attempting to break
> the lease will fail with ETXTBUSY rather than follow the normal breaking
> procedure.
> 
> Both read and write layout leases can have the unbreakable flag (F_UNBREAK)
> specified.  The difference between an unbreakable read layout lease and an
> unbreakable write layout lease are that an unbreakable read layout lease is
> _not_ exclusive. 

Oh, this doesn't work at all. Now we get write()s to F_RDLCK leases
that can't break the leases and so all writes, even to processes
that own RDLCK|UNBREAK, will fail with ETXTBSY.

> This means that once a layout is established on a file,
> multiple unbreakable read layout leases can be taken by multiple processes and
> used to pin the underlying pages of that file.

Ok, so what happens when someone now takes a
F_WRLCK|F_LAYOUT|F_UNBREAK? Is that supposed to break
F_RDLCK|F_LAYOUT|F_UNBREAK, as the wording about F_WRLCK behaviour
implies it should?

> Care must therefore be taken to ensure that the layout of the file is as the
> user wants prior to using the unbreakable read layout lease.  A safe mechanism
> to do this would be to take a write layout lease and use fallocate() to set 
> the
> layout of the file.  The layout lease can then be "downgraded" to unbreakable
> read layout as long as no other user broke the write layout lease.

What are the semantics of this "downgrade" behaviour you speak of? :)

My thoughts are:
- RDLCK can only be used for O_RDONLY because write()
  requires breaking of leases
- WRLCK is open to abuse simply by not using a layout lease
  to do a "no change" layout modification
- RDLCK|F_UNBREAK is entirely unusable
- WRLCK|F_UNBREAK will be what every application uses
  because everything else either doesn't work or is too easy
  to abuse.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 0/5] hugetlbfs: Disable PMD sharing for large systems

2019-09-12 Thread Dave Chinner
On Wed, Sep 11, 2019 at 04:05:32PM +0100, Waiman Long wrote:
> A customer with large SMP systems (up to 16 sockets) running an application
> that uses a large amount of static hugepages (~500-1500GB) is experiencing
> random multisecond delays. These delays were caused by the long time it
> took to scan the VMA interval tree with mmap_sem held.
> 
> To fix this problem while preserving existing behavior as much as
> possible, we need to allow a timeout in down_write() and disable PMD
> sharing when it is taking too long to do so. Since a transaction can
> involve touching multiple huge pages, timing out for each of the huge
> page interactions does not completely solve the problem. So a threshold
> is set to completely disable PMD sharing if too many timeouts happen.
> 
> The first 4 patches of this 5-patch series add a new
> down_write_timedlock() API which accepts a timeout argument and returns
> true if locking is successful or false otherwise. It works more or less
> like a down_write_trylock() but the calling thread may sleep.

Just on general principle, this is a non-starter. If a lock is being
held too long, then whatever the lock is protecting needs fixing.
Adding timeouts to locks and sysctls to tune them is not a viable
solution to address latencies caused by algorithm scalability
issues.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [xfs] 610125ab1e: fsmark.app_overhead -71.2% improvement

2019-09-08 Thread Dave Chinner
On Mon, Sep 09, 2019 at 02:06:54PM +0800, Rong Chen wrote:
> Hi Dave,
> 
> On 9/9/19 1:32 PM, Dave Chinner wrote:
> > On Mon, Sep 09, 2019 at 09:58:49AM +0800, kernel test robot wrote:
> > > Greeting,
> > > 
> > > FYI, we noticed a -71.2% improvement of fsmark.app_overhead due to commit:
> > A negative improvement? That's somewhat ambiguous...
> 
> Sorry for causing the misunderstanding, it's an improvement, not a regression.
> 
> 
> > 
> > > 0e822255f95db400 610125ab1e4b1b48dcffe74d9d8
> > >  ---
> > >   %stddev %change %stddev
> > >   \  |\
> > >   1.095e+08   -71.2%   31557568fsmark.app_overhead
> > >6157   +95.5%  12034fsmark.files_per_sec
> > So, the files/s rate doubled, and the amount of time spent in
> > userspace by the fsmark app dropped by 70%.
> > 
> > >  167.31   -47.3%  88.25fsmark.time.elapsed_time
> > >  167.31   -47.3%  88.25
> > > fsmark.time.elapsed_time.max
> > Wall time went down by 50%.
> > 
> > >   91.00-8.8%  83.00
> > > fsmark.time.percent_of_cpu_this_job_got
> > >  148.15   -53.2%  69.38fsmark.time.system_time
> > As did system CPU.
> > 
> > IOWs, this change has changed create performance by a factor of 4 -
> > the file create is 2x faster for half the CPU spent.
> > 
> > I don't think this is a negative improvement - it's a large positive
> > improvement.  I suspect that you need to change the metric
> > classifications for this workload...
> To avoid misunderstanding, we'll use fsmark.files_per_sec instead of
> fsmark.app_overhead in the subject.

Well, the two are separate ways of measuring improvement. A change
in one without a change in the other is just as significant as
a change in both...

Cheers,

Dave.
-- 
Dave Chinner
dchin...@redhat.com


Re: [xfs] 610125ab1e: fsmark.app_overhead -71.2% improvement

2019-09-08 Thread Dave Chinner
On Mon, Sep 09, 2019 at 09:58:49AM +0800, kernel test robot wrote:
> Greeting,
> 
> FYI, we noticed a -71.2% improvement of fsmark.app_overhead due to commit:

A negative improvement? That's somewhat ambiguous...

> 0e822255f95db400 610125ab1e4b1b48dcffe74d9d8 
>  --- 
>  %stddev %change %stddev
>  \  |\  
>  1.095e+08   -71.2%   31557568fsmark.app_overhead
>   6157   +95.5%  12034fsmark.files_per_sec

So, the files/s rate doubled, and the amount of time spent in
userspace by the fsmark app dropped by 70%.

> 167.31   -47.3%  88.25fsmark.time.elapsed_time
> 167.31   -47.3%  88.25fsmark.time.elapsed_time.max

Wall time went down by 50%.

>  91.00-8.8%  83.00
> fsmark.time.percent_of_cpu_this_job_got
> 148.15   -53.2%  69.38fsmark.time.system_time

As did system CPU.

IOWs, this change has changed create performance by a factor of 4 -
the file create is 2x faster for half the CPU spent.
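
(i.e. 12034 / 6157 ≈ 2x the create rate for 69.38 / 148.15 ≈ 0.47x
the system time - roughly 4x the creates per CPU second.)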

I don't think this is a negative improvement - it's a large positive
improvement.  I suspect that you need to change the metric
classifications for this workload...

Cheers,

Dave.
-- 
Dave Chinner
dchin...@redhat.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-09-02 Thread Dave Chinner
On Wed, Aug 28, 2019 at 07:02:31PM -0700, Ira Weiny wrote:
> On Mon, Aug 26, 2019 at 03:55:10PM +1000, Dave Chinner wrote:
> > On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> > > On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > > "Leases are associated with an open file description (see open(2)).  This 
> > > means
> > > that duplicate file descriptors (created by, for example, fork(2) or 
> > > dup(2))
> > > refer to the same lease, and this lease may be modified or released using 
> > > any
> > > of these descriptors.  Furthermore,  the lease is released by either an
> > > explicit F_UNLCK operation on any of these duplicate file descriptors, or 
> > > when
> > > all such file descriptors have been closed."
> > 
> > Right, the lease is attached to the struct file, so it follows
> > where-ever the struct file goes. That doesn't mean it's actually
> > useful when the struct file is duplicated and/or passed to another
> > process. :/
> > 
> > AFAICT, the problem is that when we take another reference to the
> > struct file, or when the struct file is passed to a different
> > process, nothing updates the lease or lease state attached to that
> > struct file.
> 
> Ok, I probably should have made this more clear in the cover letter but _only_
> the process which took the lease can actually pin memory.

Sure, no question about that.

> That pinned memory _can_ be passed to another process but those
> sub-processes can
> _not_ use the original lease to pin _more_ of the file.  They would need to
> take their own lease to do that.

Yes, they would need a new lease to extend it. But that ignores the
fact they don't have a lease on the existing pins they are using and
have no control over the lease those pins originated under.  e.g.
the originating process dies (for whatever reason) and now we have
pins without a valid lease holder.

If something else now takes an exclusive lease on the file (because
the original exclusive lease no longer exists), it's not going to
work correctly because of the zombied page pins caused by closing
the exclusive lease they were gained under. IOWs, pages pinned under
an exclusive lease are no longer "exclusive" the moment the original
exclusive lease is dropped, and pins passed to another process are
no longer covered by the original lease they were created under.

> Sorry for not being clear on that.

I know exactly what you are saying. What I'm failing to get across
is that file layout leases don't actually allow the behaviour you
want to have.

> > As such, leases that require callbacks to userspace are currently
> > only valid within the process context the lease was taken in.
> 
> But for long term pins we are not requiring callbacks.

Regardless, we still require an active lease for long term pins so
that other lease holders fail operations appropriately. And that
exclusive lease must follow the process that pins the pages so that
the life cycle is the same...

> > Indeed, even closing the fd the lease was taken on without
> > F_UNLCKing it first doesn't mean the lease has been torn down if
> > there is some other reference to the struct file. That means the
> > original lease owner will still get SIGIO delivered to that fd on a
> > lease break regardless of whether it is open or not. And if we
> > implement "layout lease not released within SIGIO response timeout"
> > then that process will get killed, despite the fact it may not even
> > have a reference to that file anymore.
> 
> I'm not seeing that as a problem.  This is all a result of the application
> failing to do the right thing.

How is that not a problem?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] drivers/staging/exfat - by default, prohibit mount of fat/vfat

2019-09-01 Thread Dave Chinner
On Sat, Aug 31, 2019 at 11:37:27PM -0400, Valdis Klētnieks wrote:
> On Sun, 01 Sep 2019 11:07:21 +1000, Dave Chinner said:
> > Totally irrelevant to the issue at hand. You can easily co-ordinate
> > out of tree contributions through a github tree, or a tree on
> > kernel.org, etc.
> 
> Well.. I'm not personally wedded to the staging tree.  I'm just interested in
> getting a driver done and upstreamed with as little pain as possible. :)

Understood - I'm trying to head off you guys getting delayed
by sitting for a year in the staging tree polishing a turd and not
addressing the things that really need to be done first...

> Is there any preference for github versus kernel.org?  I can set up a github
> tree on my own, no idea who needs to do what for a kernel.org tree.

What ever is most convenient for you to manage and co-ordinate. :P

> Also, this (from another email of yours) was (at least to me) the most useful
> thing said so far:
> 
> > look at what other people have raised w.r.t. to that filesystem -
> > on-disk format validation, re-implementation of largely generic
> > code, lack of namespacing of functions leading to conflicts with
> > generic/VFS functionality, etc.
> 
> All of which are now on the to-do list, thanks.
> 
> Now one big question:
> 
> Should I heave all the vfat stuff overboard and make a module that
> *only* does exfat, or is there enough interest in an extended FAT module
> that does vfat and exfat, in which case the direction should be to re-align
> this module's code with vfat?

I don't know the details of the exfat spec or the code to know what
the best approach is. I've worked fairly closely with Christoph for
more than a decade - you need to think about what he says rather
than /how he says it/ because there's a lot of thought and knowledge
behind his reasoning. Hence if I were implementing exfat and
Christoph was saying "throw it away and extend fs/fat"
then that's what I'd be doing.

A lot of this is largely risk management - there's a good chance
that the people developing the code move on after the project is
done and merged, which leaves the people like Christoph with the
burden of long term code maintenance for that filesystem. There's
enough crusty, old, largely unmaintained filesystem code already,
and we don't want more. Implementing exfat on top of fs/fat kills
two birds with one stone - it modernises the fs/fat code base and
brings new functionality that will have more developers interested
in maintaining it over the long term. So from an overall filesystem
maintenance perspective, building on top of fs/fat makes a lot of
sense to a kernel filesystem developer...

This is not a judgement on the quality of the existing exfat code
or it's developers - it's just that there are very good reasons for
building on existing codebase rather than creating a whole new code
base that has very similar functionality.

That's my total involvement in this - I don't really care about
exfat at all - my stake in this is to improve the development,
review and merge process being undertaken. We have a history of lax
review, poor processes and really shitty code being merged into the
kernel and I've been on the cleanup squad for a few of them over the
past couple of years. That's a burnout vector, so it's in the
interests of my own sanity (and fs developers) to set standards and
precedence to prevent more trainwrecks from happening before the
train even leaves the station...

> > That's the choice you have to make now: listen to the reviewers
> > saying "resolve the fundamental issues before goign any further",
> 
> Well... *getting* a laundry list of what the reviewers see as the fundamental
> issues is the first step in resolving them ;)

You won't get them all at once. You'll get something new every round
of review as the bigger issues are worked out, the reviewers become
more familiar with the code and notice more detailed/subtle
issues. Most filesystem reviews start with developers trying to
understand the on-disk structure and architecture rather than focus
on whitespace and code cleanliness. Cleaning up the code can be done
after we work through all the structural issues...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] drivers/staging/exfat - by default, prohibit mount of fat/vfat

2019-08-31 Thread Dave Chinner
On Sat, Aug 31, 2019 at 06:25:21AM -0400, Valdis Klētnieks wrote:
> On Fri, 30 Aug 2019 23:46:16 -0700, Christoph Hellwig said:
> > Since when did Linux kernel submissions become "show me a better patch"
> > to reject something obviously bad?
> 
> Well, do you even have a *suggestion* for a better idea?  Other than "just rip
> it out"?  Keeping in mind that:
> 
> > As I said the right approach is to probably (pending comments from the
> > actual fat maintainer) to merge exfat support into the existing fs/fat/
> > codebase.  You obviously seem to disagree (and at the same time not).
> 
> At this point, there isn't any true consensus on whether that's the best
> approach at the current.

Which, quite frankly, means it has been merged prematurely.

Valdis - the model for getting a new filesystem merged is the one
taken by Orangefs. That was not merged through the staging tree,
it was reviewed via patches to linux-fsdevel that were iterated far
faster than the stable merge cycle allows, allowed all the cleanups
to be done independently of the feature work needed, the structural
changes were easy to discuss, quote, etc.

These are the sorts of problems we are having with EROFS right now,
even though it's been in staging for some time, and it's clear we
are already having them with exfat - fundamental architecture issues
have not yet been decided, and so there's likely major structural
change yet to be done.

That's stuff that is much more easily done and reviewed by patches
on a mailing list. You don't need the code in the staging tree to
get this sort of thing done and, really, having it already merged
gets in the way of doing major structural change as it cannot be
rapidly iterated independently of the kernel dev cycle...

So when Christoph say:

> "Just rip it out"

what he is really saying is that Greg has jumped the gun and is -
yet again - fucking over filesystem developers because he's
taken the review process for a new filesystem out of our hands _yet
again_.

He did this with POHMELFS, then Lustre, then EROFS - they all got
merged into stable over the objections of senior filesystem
developers.  The first two were an utter disaster, the latter is
rapidly turning into one.

You wanted a "show me a better patch" response from Christoph. What
he actually is saying is "we've got a better process for reviewing
and merging filesystems". That is, we need to reboot the exfat
process from the start - sort out the fundamental implementation
direction and then implement via the normal out-of-tree patch series
iteration process.

That's the fundamental problem here - we have a rogue maintainer
that is abusing their power by subverting our normal review and
merge process. I don't know (or care) what motive that maintainer
has for expedited merging of this filesystem, but history tells us
it _will_ result in a total clusterfuck in the near future. In fact,
I'd argue that the clusterfuck has already arrived

> And by the way, it seems like other filesystems rely on "others" to help out.
> Let's look at the non-merge commit for fs/ext4. And these are actual patches,
> not just reviewer comments

Totally irrelevant to the issue at hand. You can easily co-ordinate
out of tree contributions through a github tree, or a tree on
kernel.org, etc.

Being in the staging tree does not get you more robust review -
it'll just delay it until someone wants it out of the staging tree,
and then you'll get all the same issues with experienced filesystem
developer review as you are getting now.

That's the choice you have to make now: listen to the reviewers
saying "resolve the fundamental issues before goign any further",
or you can ignore that and have it rejected after another year of
work because the fundamental issues haven't been resolved while it
sits in staging

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] staging: exfat: add exfat filesystem code to staging

2019-08-31 Thread Dave Chinner
On Sat, Aug 31, 2019 at 06:31:45AM -0400, Valdis Klētnieks wrote:
> On Sat, 31 Aug 2019 07:54:10 +1000, Dave Chinner said:
> 
> > The correct place for new filesystem review is where all the
> > experienced filesystem developers hang out - that's linux-fsdevel,
> > not the driver staging tree.
> 
> So far everything's been cc'ed to linux-fsdevel, which has been spending
> more time discussing unlikely() usage in a different filesystem.

That's just noise - you'll get whitespace and other trivial
review on any list you post a patch series for review. Go back and
look at what other people have raised w.r.t. that filesystem -
on-disk format validation, re-implementation of largely generic
code, lack of namespacing of functions leading to conflicts with
generic/VFS functionality, etc.

Review bandwidth for things like on-disk format definition and
manipulation, consistency with other filesystems, efficient
integration into the generic infrastructure, etc is limited.
Everyone has to juggle that around the load they have for their own
filesystem maintenance, and there's usually only bandwidth for a
single filesystem at a time.

Just be patient - trying to force the merging of code before there's
even been consensus on fundamental architecture choices doesn't make
things better for anyone.  Merging incomplete filesystem code early
in the development cycle has -always- been something we've regretted
in the long run.  We've learnt this lesson many times before, yet we
seem doomed to repeat it yet again...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] staging: exfat: add exfat filesystem code to staging

2019-08-30 Thread Dave Chinner
On Thu, Aug 29, 2019 at 01:18:10PM +0200, Greg Kroah-Hartman wrote:
> On Thu, Aug 29, 2019 at 03:37:49AM -0700, Christoph Hellwig wrote:
> > On Thu, Aug 29, 2019 at 11:50:19AM +0200, Greg Kroah-Hartman wrote:
> > > I know the code is horrible, but I will gladly take horrible code into
> > > staging.  If it bothers you, just please ignore it.  That's what staging
> > > is there for :)
> > 
> > And then after a while you decide it's been long enough and force move
> > it out of staging like the POS erofs code?
> 
> Hey, that's not nice, erofs isn't a POS.  It could always use more
> review, which the developers asked for numerous times.
> 
> There's nothing different from a filesystem compared to a driver.  If
> it's stand-alone, and touches nothing else, all issues with it are
> self-contained and do not bother anyone else in the kernel.  We merge

I whole-heartedly disagree with that statement.

The major difference between filesystems and the rest of the kernel
that seems to be missed by most kernel developers is that
filesystems maintain persistent data - you can't fix a problem/bug
by rebooting or power cycling. Once the filesystem code screws up,
the user is left with a mess they have to fix and that invariably
results in data loss.

Users remember when a filesystem eats their data - they don't tend
to want to have anything to do with that filesystem ever again if it
happens to them. We still get people saying "XFS ate my data back in
2002, I don't trust it and I'll never use it again".

Users put up with shit hardware and drivers - it's an inconvenience
more than anything. They don't put up with buggy filesystems that
lose their data - that is completely unacceptible to users.  As a
result, the quality and stability standard for merging a new
filesystem needs to be far higher that what is acceptible for
merging a new driver.

The correct place for new filesystem review is where all the
experienced filesystem developers hang out - that's linux-fsdevel,
not the driver staging tree.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] xfs: Initialize label array properly

2019-08-30 Thread Dave Chinner
On Fri, Aug 30, 2019 at 02:37:07PM +0900, Austin Kim wrote:
> In case kernel stack variable is not initialized properly,
> there is a risk of kernel information disclosure.
> 
> So, initialize 'char label[]' array with null characters.

Can you describe the information disclosure vector here? I can't see
any, mostly because this is the "set label" function and that
doesn't return anything to userspace.

We also zero the on-disk label before we copy the user label into
it, so I don't see that anything can leak onto disk, either...
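
For illustration, the code in question follows this pattern (a
condensed paraphrase of the set-label ioctl path, not the verbatim
source):

	char	label[XFSLABEL_MAX + 1];
	size_t	len;

	/* the whole array is overwritten with user data, and nothing
	 * in it is ever copied back to userspace on this path */
	if (copy_from_user(label, newlabel, XFSLABEL_MAX + 1))
		return -EFAULT;
	len = strnlen(label, XFSLABEL_MAX + 1);

	/* the on-disk label is zeroed before the (length-limited)
	 * copy, so no stack contents can reach the disk either */
	memset(sbp->sb_fname, 0, sizeof(sbp->sb_fname));
	memcpy(sbp->sb_fname, label, len);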

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 01/19] dax: remove block device dependencies

2019-08-28 Thread Dave Chinner
On Wed, Aug 28, 2019 at 01:58:43PM -0400, Vivek Goyal wrote:
> On Tue, Aug 27, 2019 at 11:58:09PM -0700, Christoph Hellwig wrote:
> > On Tue, Aug 27, 2019 at 12:38:28PM -0400, Vivek Goyal wrote:
> > > > For bdev_dax_pgoff
> > > > I'd much rather have the partition offset if there is on in the daxdev
> > > > somehow so that we can get rid of the block device entirely.
> > > 
> > > IIUC, there is one block_device per partition while there is only one
> > > dax_device for the whole disk. So we can't directly move bdev logical
> > > offset into dax_device.
> > 
> > Well, then we need to find a way to get partitions for dax devices,
> > as we really should not expect a block device hiding behind a dax
> > dev.  That is just a weird legacy assumption - block device need to
> > layer on top of the dax device optionally.
> > 
> > > 
> > > We probably could put this in "iomap" and leave it to filesystems to
> > > report offset into dax_dev in iomap that way dax generic code does not
> > > have to deal with it. But that probably will be a bigger change.
> > 
> > And where would the file system get that information from?
> 
> File system knows about block device, can it just call get_start_sect()
> while filling iomap->addr. And this means we don't have to have
> parition information in dax device. Will something like following work?
> (Just a proof of concept patch).
> 
> 
> ---
>  drivers/dax/super.c |   11 +++
>  fs/dax.c|6 +++---
>  fs/ext4/inode.c |6 +-
>  include/linux/dax.h |1 +
>  4 files changed, 20 insertions(+), 4 deletions(-)
> 
> Index: rhvgoyal-linux/fs/ext4/inode.c
> ===
> --- rhvgoyal-linux.orig/fs/ext4/inode.c   2019-08-28 13:51:16.051937204 
> -0400
> +++ rhvgoyal-linux/fs/ext4/inode.c2019-08-28 13:51:44.453937204 -0400
> @@ -3589,7 +3589,11 @@ retry:
>   WARN_ON_ONCE(1);
>   return -EIO;
>   }
> - iomap->addr = (u64)map.m_pblk << blkbits;
> + if (IS_DAX(inode))
> + iomap->addr = ((u64)map.m_pblk << blkbits) +
> +   (get_start_sect(iomap->bdev) * 512);
> + else
> + iomap->addr = (u64)map.m_pblk << blkbits;

I'm not a fan of returning a physical device sector address from an
interface where every other user/caller expects this address to be a
logical block address into the block device. It creates a landmine
in the iomap API that callers may not be aware of and that's going
to cause bugs. We're trying really hard to keep special case hacks
like this out of the iomap infrastructure, so on those grounds alone
I'd suggest this is a dead end approach.

Hence I think that if the dax device needs a physical offset from
the start of the block device the filesystem sits on, it should be
set up at dax device instantiation time and so the filesystem/bdev
never needs to be queried again for this information.
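
A minimal sketch of that idea - capture the offset once when the
filesystem binds to the dax device at mount time. Note that
dax_set_start_sect() is hypothetical, standing in for whatever form
"store the offset in the dax_device" ends up taking:

	struct dax_device *dax_dev;

	dax_dev = fs_dax_get_by_bdev(mp->m_super->s_bdev);
	/* hypothetical helper: record the partition offset once so
	 * the bdev never needs to be consulted for it again */
	dax_set_start_sect(dax_dev, get_start_sect(mp->m_super->s_bdev));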

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-25 Thread Dave Chinner
On Fri, Aug 23, 2019 at 10:08:36PM -0700, Ira Weiny wrote:
> On Sat, Aug 24, 2019 at 10:11:24AM +1000, Dave Chinner wrote:
> > On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> > > 
> > > > > But the fact that RDMA, and potentially others, can "pass the
> > > > > pins" to other processes is something I spent a lot of time trying to 
> > > > > work out.
> > > > 
> > > > There's nothing in file layout lease architecture that says you
> > > > can't "pass the pins" to another process.  All the file layout lease
> > > > requirements say is that if you are going to pass a resource for
> > > > which the layout lease guarantees access to another process,
> > > > then the destination process must already have a valid, active layout
> > > > lease that covers the range of the pins being passed to it via the
> > > > RDMA handle.
> > > 
> > > How would the kernel detect and enforce this? There are many ways to
> > > pass a FD.
> > 
> > AFAIC, that's not really a kernel problem. It's more of an
> > application design constraint than anything else. i.e. if the app
> > passes the IB context to another process without a lease, then the
> > original process is still responsible for recalling the lease and
> > has to tell that other process to release the IB handle and its
> > resources.
> > 
> > > IMHO it is wrong to try and create a model where the file lease exists
> > > independently from the kernel object relying on it. In other words the
> > > IB MR object itself should hold a reference to the lease it relies
> > > upon to function properly.
> > 
> > That still doesn't work. Leases are not individually trackable or
> > reference counted objects - they are attached to a struct
> > file but, in reality, they are far more restricted than a struct
> > file.
> > 
> > That is, a lease specifically tracks the pid and the _open fd_ it
> > was obtained for, so it is essentially owned by a specific process
> > context.  Hence a lease is not able to be passed to a separate
> > process context and have it still work correctly for lease break
> > notifications.  i.e. the layout break signal gets delivered to the
> > original process that created the struct file, if it still exists
> > and has the original fd still open. It does not get sent to the
> > process that currently holds a reference to the IB context.
> >
> 
> The fcntl man page says:
> 
> "Leases are associated with an open file description (see open(2)).  This 
> means
> that duplicate file descriptors (created by, for example, fork(2) or dup(2))
> refer to the same lease, and this lease may be modified or released using any
> of these descriptors.  Furthermore,  the lease is released by either an
> explicit F_UNLCK operation on any of these duplicate file descriptors, or when
> all such file descriptors have been closed."

Right, the lease is attached to the struct file, so it follows
where-ever the struct file goes. That doesn't mean it's actually
useful when the struct file is duplicated and/or passed to another
process. :/

AFAICT, the problem is that when we take another reference to the
struct file, or when the struct file is passed to a different
process, nothing updates the lease or lease state attached to that
struct file.

> From this I took it that the child process FD would have the lease as well
> _and_ could release it.  I _assumed_ that applied to SCM_RIGHTS but it does 
> not
> seem to work the same way as dup() so I'm not so sure.

Sure, that part works because the struct file is passed. It doesn't
end up with the same fd number in the other process, though.

The issue is that layout leases need to notify userspace when they
are broken by the kernel, so a lease stores the owner pid/tid in the
file->f_owner field via __f_setown(). It also keeps a struct fasync
attached to the file_lock that records the fd that the lease was
created on.  When a signal needs to be sent to userspace for that
lease, we call kill_fasync() and that walks the list of fasync
structures on the lease and calls:

send_sigio(fown, fa->fa_fd, band);

And it does that for every fasync struct attached to a lease. Yes, a
lease can track multiple fds, but it can only track them in a single
process context. The moment the struct file is shared with another
process, the lease is no longer capable of sending notifications to
all the lease holders.
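
In outline, the notification path looks something like this (a
condensed paraphrase of the fs/fcntl.c machinery, not the verbatim
kernel code):

	/* walk the lease's fasync list and signal each registered fd */
	static void lease_notify(struct fasync_struct *fa, int band)
	{
		for (; fa; fa = fa->fa_next) {
			/* f_owner.pid was set by __f_setown() when the
			 * lease was taken - the signal can only go to
			 * that one process context */
			struct fown_struct *fown = &fa->fa_file->f_owner;

			send_sigio(fown, fa->fa_fd, band);
		}
	}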

Yes, you can change the owning process via F_SETOWN, but that...

Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-23 Thread Dave Chinner
On Fri, Aug 23, 2019 at 10:15:04AM -0700, Ira Weiny wrote:
> On Fri, Aug 23, 2019 at 10:59:14AM +1000, Dave Chinner wrote:
> > On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> > > On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > > > 
> > > > > > > So that leaves just the normal close() syscall exit case, where 
> > > > > > > the
> > > > > > > application has full control of the order in which resources are
> > > > > > > released. We've already established that we can block in this
> > > > > > > context.  Blocking in an interruptible state will allow fatal 
> > > > > > > signal
> > > > > > > delivery to wake us, and then we fall into the
> > > > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > > > 
> > > > > > The major problem with RDMA is that it doesn't always wait on 
> > > > > > close() for the
> > > > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > > > deadlock of the form:
> > > > > > 
> > > > uverbs_destroy_ufile_hw()
> > > >    mutex_lock()
> > > >     [..]
> > > >      mmput()
> > > >       exit_mmap()
> > > >        remove_vma()
> > > >         fput();
> > > >          file_operations->release()
> > > > > 
> > > > > I think this is wrong, and I'm pretty sure it's an example of why
> > > > > the final __fput() call is moved out of line.
> > > > 
> > > > Yes, I think so too, all I can say is this *used* to happen, as we
> > > > have special code avoiding it, which is the code that is messing up
> > > > Ira's lifetime model.
> > > > 
> > > > Ira, you could try unraveling the special locking, that solves your
> > > > lifetime issues?
> > > 
> > > Yes I will try to prove this out...  But I'm still not sure this fully 
> > > solves
> > > the problem.
> > > 
> > > This only ensures that the process which has the RDMA context (RDMA FD) 
> > > is safe
> > > with regard to hanging the close for the "data file FD" (the file which 
> > > has
> > > pinned pages) in that _same_ process.  But what about the scenario.
> > > 
> > > Process A has the RDMA context FD and data file FD (with lease) open.
> > > 
> > > Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.
> > 
> > Passing the RDMA context dependent on a file layout lease to another
> > process that doesn't have a file layout lease or a reference to the
> > original lease should be considered a violation of the layout lease.
> > Process B does not have an active layout lease, and so by the rules
> > of layout leases, it is not allowed to pin the layout of the file.
> > 
> 
> I don't disagree with the semantics of this.  I just don't know how to enforce
> it.
> 
> > > Process A attempts to exit (hangs because data file FD is pinned).
> > > 
> > > Admin kills process A.  kill works because we have allowed for it...
> > > 
> > > Process B _still_ has the RDMA context FD open _and_ therefore still 
> > > holds the
> > > file pins.
> > > 
> > > Truncation still fails.
> > > 
> > > Admin does not know which process is holding the pin.
> > > 
> > > What am I missing?
> > 
> > Application does not hold the correct file layout lease references.
> > Passing the fd via SCM_RIGHTS to a process without a layout lease
> > is equivalent to not using layout leases in the first place.
> 
> Ok, So If I understand you correctly you would support a failure of SCM_RIGHTS
> in this case?  I'm ok with that but not sure how to implement it right now.
> 
> To that end, I would like to simplify this slightly because I'm not convinced
> that SCM_RIGHTS is a problem we need to solve right now.  ie I don't know of a
> user who wants to do this.

I don't think we can support it, let alone want to. SCM_RIGHTS was a
mistake made years ago that has been causing bugs and complexity to
try and avoid those bugs ever since.  I'm only talking about it
because someone else raised it and I assumed they raised it because
they want it to "work".

> Right now duplication via SCM_RIGHTS could fail if _any_ file pins (and by
> definition leases) exist underneath the "RDMA FD" (or other direct access FD,
> like XDP etc) being duplicated.

Sounds like a fine idea to me.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-23 Thread Dave Chinner
On Fri, Aug 23, 2019 at 09:04:29AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 23, 2019 at 01:23:45PM +1000, Dave Chinner wrote:
> 
> > > But the fact that RDMA, and potentially others, can "pass the
> > > pins" to other processes is something I spent a lot of time trying to 
> > > work out.
> > 
> > There's nothing in file layout lease architecture that says you
> > can't "pass the pins" to another process.  All the file layout lease
> > requirements say is that if you are going to pass a resource for
> > which the layout lease guarantees access to another process,
> > then the destination process must already have a valid, active layout
> > lease that covers the range of the pins being passed to it via the
> > RDMA handle.
> 
> How would the kernel detect and enforce this? There are many ways to
> pass a FD.

AFAIC, that's not really a kernel problem. It's more of an
application design constraint than anything else. i.e. if the app
passes the IB context to another process without a lease, then the
original process is still responsible for recalling the lease and
has to tell that other process to release the IB handle and its
resources.

> IMHO it is wrong to try and create a model where the file lease exists
> independently from the kernel object relying on it. In other words the
> IB MR object itself should hold a reference to the lease it relies
> upon to function properly.

That still doesn't work. Leases are not individually trackable or
reference counted objects - they are attached to a struct
file but, in reality, they are far more restricted than a struct
file.

That is, a lease specifically tracks the pid and the _open fd_ it
was obtained for, so it is essentially owned by a specific process
context. Hence a lease is not able to be passed to a separate
process context and have it still work correctly for lease break
notifications.  i.e. the layout break signal gets delivered to the
original process that created the struct file, if it still exists
and has the original fd still open. It does not get sent to the
process that currently holds a reference to the IB context.

So while a struct file passed to another process might still have
an active lease, and you can change the owner of the struct file
via fcntl(F_SETOWN), you can't associate the existing lease with
the new fd in the new process and so layout break signals can't be
directed at the lease fd

This really means that a lease can only be owned by a single process
context - it can't be shared across multiple processes (so I was
wrong about dup/pass as being a possible way of passing them)
because there's only one process that can "own" a struct file, and
that's where signals are sent when the lease needs to be broken.

So, fundamentally, if you want to pass a resource that pins a file
layout between processes, both processes need to hold a layout lease
on that file range. And that means exclusive leases and passing
layouts between processes are fundamentally incompatible because you
can't hold two exclusive leases on the same file range

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-22 Thread Dave Chinner
may be dependent on it. I shouldn't be having to point out
how bad this is from a system design perspective.

That's where the "nice and clean" semantics come from - starting
from "what can we actually support?", "what exactly do all the
different direct access mechanisms actually require?", "does it work
for future technologies not yet on our radar?" and working from
there.  So I'm not just looking at what we are doing right now, I'm
looking at 15 years down the track when we still have to support
layout leases and we've got hardware we haven't dreamed of yet.  If
the model is clean, simple, robust, implementation independent and
has well defined semantics, then it should stand the test of time.
i.e. the "nice and clean" semantics have nothign to do with the
filesystem per se, but everything to do with ensuring the mechanism
is generic and workable for direct storage access for a long time
into the future.

We can't force people to use layout leases - at all, let alone
correctly - but if you want filesystems and enterprise distros to
support direct access to filesystem controlled storage, then direct
access applications need to follow a sane set of rules that are
supportable at the filesystem level.

> But the fact that RDMA, and potentially others, can "pass the
> pins" to other processes is something I spent a lot of time trying to work 
> out.

There's nothing in file layout lease architecture that says you
can't "pass the pins" to another process.  All the file layout lease
requirements say is that if you are going to pass a resource for
which the layout lease guarantees access to another process,
then the destination process must already have a valid, active layout
lease that covers the range of the pins being passed to it via the
RDMA handle.

i.e. as the pins pass from one process to another, they pass from
the protection of the lease process A holds to the protection that
the lease process B holds. This can probably even be done by
duplicating the lease fd and passing it by SCM_RIGHTS first.
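
For reference, the fd-passing side of that is just the standard
SCM_RIGHTS dance (plain userspace sketch, independent of the lease
API itself):

	#include <string.h>
	#include <sys/socket.h>
	#include <sys/uio.h>

	/* send one fd over a connected AF_UNIX socket */
	static int send_fd(int sock, int fd)
	{
		char dummy = 0;
		struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
		char cbuf[CMSG_SPACE(sizeof(int))];
		struct msghdr msg = {
			.msg_iov = &iov, .msg_iovlen = 1,
			.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
		};
		struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type = SCM_RIGHTS;
		cmsg->cmsg_len = CMSG_LEN(sizeof(int));
		memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

		return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
	}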

Cheers,

Dave.

-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-22 Thread Dave Chinner
On Wed, Aug 21, 2019 at 11:02:00AM -0700, Ira Weiny wrote:
> On Tue, Aug 20, 2019 at 08:55:15AM -0300, Jason Gunthorpe wrote:
> > On Tue, Aug 20, 2019 at 11:12:10AM +1000, Dave Chinner wrote:
> > > On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> > > > 
> > > > > So that leaves just the normal close() syscall exit case, where the
> > > > > application has full control of the order in which resources are
> > > > > released. We've already established that we can block in this
> > > > > context.  Blocking in an interruptible state will allow fatal signal
> > > > > delivery to wake us, and then we fall into the
> > > > > fatal_signal_pending() case if we get a SIGKILL while blocking.
> > > > 
> > > > The major problem with RDMA is that it doesn't always wait on close() 
> > > > for the
> > > > MR holding the page pins to be destoyed. This is done to avoid a
> > > > deadlock of the form:
> > > > 
> > > > uverbs_destroy_ufile_hw()
> > > >    mutex_lock()
> > > >     [..]
> > > >      mmput()
> > > >       exit_mmap()
> > > >        remove_vma()
> > > >         fput();
> > > >          file_operations->release()
> > > 
> > > I think this is wrong, and I'm pretty sure it's an example of why
> > > the final __fput() call is moved out of line.
> > 
> > Yes, I think so too, all I can say is this *used* to happen, as we
> > have special code avoiding it, which is the code that is messing up
> > Ira's lifetime model.
> > 
> > Ira, you could try unraveling the special locking, that solves your
> > lifetime issues?
> 
> Yes I will try to prove this out...  But I'm still not sure this fully solves
> the problem.
> 
> This only ensures that the process which has the RDMA context (RDMA FD) is 
> safe
> with regard to hanging the close for the "data file FD" (the file which has
> pinned pages) in that _same_ process.  But what about the scenario.
> 
> Process A has the RDMA context FD and data file FD (with lease) open.
> 
> Process A uses SCM_RIGHTS to pass the RDMA context FD to Process B.

Passing the RDMA context dependent on a file layout lease to another
process that doesn't have a file layout lease or a reference to the
original lease should be considered a violation of the layout lease.
Process B does not have an active layout lease, and so by the rules
of layout leases, it is not allowed to pin the layout of the file.

> Process A attempts to exit (hangs because data file FD is pinned).
> 
> Admin kills process A.  kill works because we have allowed for it...
> 
> Process B _still_ has the RDMA context FD open _and_ therefore still holds the
> file pins.
> 
> Truncation still fails.
> 
> Admin does not know which process is holding the pin.
> 
> What am I missing?

Application does not hold the correct file layout lease references.
Passing the fd via SCM_RIGHTS to a process without a layout lease
is equivalent to not using layout leases in the first place.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/3] xfs: add kmem_alloc_io()

2019-08-22 Thread Dave Chinner
On Thu, Aug 22, 2019 at 02:19:04PM +0200, Vlastimil Babka wrote:
> On 8/22/19 2:07 PM, Dave Chinner wrote:
> > On Thu, Aug 22, 2019 at 01:14:30PM +0200, Vlastimil Babka wrote:
> > 
> > No, the problem is this (using kmalloc as a general term for
> > allocation, whether it be kmalloc, kmem_cache_alloc, alloc_page, etc)
> > 
> >    some random kernel code
> >      kmalloc(GFP_KERNEL)
> >        reclaim
> >        PF_MEMALLOC
> >        shrink_slab
> >          xfs_inode_shrink
> >            XFS_ILOCK
> >              xfs_buf_allocate_memory()
> >                kmalloc(GFP_KERNEL)
> > 
> > And so locks on inodes in reclaim are seen below reclaim. Then
> > somewhere else we have:
> > 
> >    some high level read-only xfs code like readdir
> >      XFS_ILOCK
> >        xfs_buf_allocate_memory()
> >          kmalloc(GFP_KERNEL)
> >            reclaim
> > 
> > And this one throws false positive lockdep warnings because we
> > called into reclaim with XFS_ILOCK held and GFP_KERNEL alloc
> 
> OK, and what exactly makes this positive a false one? Why can't it continue 
> like
> the first example where reclaim leads to another XFS_ILOCK, thus deadlock?

Because above reclaim we only have operations being done on
referenced inodes, and below reclaim we only have unreferenced
inodes. We never lock the same inode both above and below reclaim
at the same time.

IOWs, an operation above reclaim cannot see, access or lock
unreferenced inodes, except in inode write clustering, and that uses
trylocks so cannot deadlock with reclaim.

An operation below reclaim cannot see, access or lock referenced
inodes except during inode write clustering, and that uses trylocks
so cannot deadlock with code above reclaim.

FWIW, I'm trying to make the inode writeback clustering go away from
reclaim at the moment, so even that possibility is going away soon.
That will change everything to trylocks in reclaim context, so
lockdep is going to stop tracking it entirely.
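
i.e. reclaim context will only ever do something like this (a sketch
of the idea, not the actual shrinker code):

	/*
	 * Reclaim only trylocks, so lockdep never sees a blocking
	 * XFS_ILOCK acquisition below reclaim.
	 */
	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
		return false;	/* inode busy - skip it, move on */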

Hmmm - maybe we're getting to the point where we actually
don't need GFP_NOFS/PF_MEMALLOC_NOFS at all in XFS anymore.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/3] xfs: add kmem_alloc_io()

2019-08-22 Thread Dave Chinner
On Thu, Aug 22, 2019 at 01:14:30PM +0200, Vlastimil Babka wrote:
> On 8/22/19 12:14 PM, Dave Chinner wrote:
> > On Thu, Aug 22, 2019 at 11:10:57AM +0200, Peter Zijlstra wrote:
> >> 
> >> Ah, current_gfp_context() already seems to transfer PF_MEMALLOC_NOFS
> >> into the GFP flags.
> >> 
> >> So are we sure it is broken and needs mending?
> > 
> > Well, that's what we are trying to work out. The problem is that we
> > have code that takes locks and does allocations that is called both
> > above and below the reclaim "lock" context. Once it's been seen
> > below the reclaim lock context, calling it with GFP_KERNEL context
> > above the reclaim lock context throws a deadlock warning.
> > 
> > The only way around that was to mark these allocation sites as
> > GFP_NOFS so lockdep is never allowed to see that recursion through
> > reclaim occur. Even though it isn't a deadlock vector.
> > 
> > What we're looking at is whether PF_MEMALLOC_NOFS changes this - I
> > don't think it does solve this problem. i.e. if we define the
> > allocation as GFP_KERNEL and then use PF_MEMALLOC_NOFS where reclaim
> > is not allowed, we still have GFP_KERNEL allocations in code above
> > reclaim that has also been seen below relcaim. And so we'll get
> > false positive warnings again.
> 
> If I understand both you and the code directly, the code sites won't call
> __fs_reclaim_acquire when called with current->flags including 
> PF_MEMALLOC_NOFS.
> So that would mean they "won't be seen below the reclaim" and all would be 
> fine,
> right?

No, the problem is this (using kmalloc as a general term for
allocation, whether it be kmalloc, kmem_cache_alloc, alloc_page, etc)

   some random kernel code
     kmalloc(GFP_KERNEL)
       reclaim
       PF_MEMALLOC
       shrink_slab
         xfs_inode_shrink
           XFS_ILOCK
             xfs_buf_allocate_memory()
               kmalloc(GFP_KERNEL)

And so locks on inodes in reclaim are seen below reclaim. Then
somewhere else we have:

   some high level read-only xfs code like readdir
     XFS_ILOCK
       xfs_buf_allocate_memory()
         kmalloc(GFP_KERNEL)
           reclaim

And this one throws false positive lockdep warnings because we
called into reclaim with XFS_ILOCK held and GFP_KERNEL alloc
context. So the only solution we had at the time to shut it up was:

   some high level read-only xfs code like readdir
     XFS_ILOCK
       xfs_buf_allocate_memory()
         kmalloc(GFP_NOFS)

So that lockdep sees it's not going to recurse into reclaim and
doesn't throw a warning...
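
The scoped PF_MEMALLOC_NOFS version of this would look something like
the following (sketch only - the function name is illustrative):

	/*
	 * Set the NOFS scope once where reclaim recursion must not
	 * happen (e.g. transaction context); every allocation inside
	 * the scope can then use plain GFP_KERNEL.
	 */
	void *xfs_alloc_in_scope(size_t size)
	{
		unsigned int	nofs_flag = memalloc_nofs_save();
		void		*ptr;

		ptr = kmalloc(size, GFP_KERNEL); /* behaves as GFP_NOFS */
		memalloc_nofs_restore(nofs_flag);
		return ptr;
	}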

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/3] xfs: add kmem_alloc_io()

2019-08-22 Thread Dave Chinner
On Thu, Aug 22, 2019 at 11:10:57AM +0200, Peter Zijlstra wrote:
> On Thu, Aug 22, 2019 at 10:51:30AM +0200, Peter Zijlstra wrote:
> > On Thu, Aug 22, 2019 at 12:59:48AM -0700, Christoph Hellwig wrote:
> > > On Thu, Aug 22, 2019 at 10:31:32AM +1000, Dave Chinner wrote:
> > > > > Btw, I think we should eventually kill off KM_NOFS and just use
> > > > > PF_MEMALLOC_NOFS in XFS, as the interface makes so much more sense.
> > > > > But that's something for the future.
> > > > 
> > > > Yeah, and it's not quite as simple as just using PF_MEMALLOC_NOFS
> > > > at high levels - we'll still need to annotate callers that use KM_NOFS
> > > > to avoid lockdep false positives. i.e. any code that can be called from
> > > > GFP_KERNEL and reclaim context will throw false positives from
> > > > lockdep if we don't annotate tehm correctly
> > > 
> > > Oh well.  For now we have the XFS kmem_wrappers to turn that into
> > > GFP_NOFS so we shouldn't be too worried, but I think that is something
> > > we should fix in lockdep to ensure it is generally useful.  I've added
> > > the maintainers and relevant lists to kick off a discussion.
> > 
> > Strictly speaking the fs_reclaim annotation is no longer part of the
> > lockdep core, but is simply a fake lock in page_alloc.c and thus falls
> > under the mm people's purview.
> > 
> > That said; it should be fairly straight forward to teach
> > __need_fs_reclaim() about PF_MEMALLOC_NOFS, much like how it already
> > knows about PF_MEMALLOC.
> 
> Ah, current_gfp_context() already seems to transfer PF_MEMALLOC_NOFS
> into the GFP flags.
> 
> So are we sure it is broken and needs mending?

Well, that's what we are trying to work out. The problem is that we
have code that takes locks and does allocations that is called both
above and below the reclaim "lock" context. Once it's been seen
below the reclaim lock context, calling it with GFP_KERNEL context
above the reclaim lock context throws a deadlock warning.

The only way around that was to mark these allocation sites as
GFP_NOFS so lockdep is never allowed to see that recursion through
reclaim occur. Even though it isn't a deadlock vector.

What we're looking at is whether PF_MEMALLOC_NOFS changes this - I
don't think it does solve this problem. i.e. if we define the
allocation as GFP_KERNEL and then use PF_MEMALLOC_NOFS where reclaim
is not allowed, we still have GFP_KERNEL allocations in code above
reclaim that has also been seen below reclaim. And so we'll get
false positive warnings again.

What I think we are going to have to do here is manually audit
each of the KM_NOFS call sites as we remove the NOFS from them and
determine if ___GFP_NOLOCKDEP is needed to stop lockdep from trying
to track these allocation sites. We've never used this tag because
we'd already fixed most of these false positives with explicit
GFP_NOFS tags long before ___GFP_NOLOCKDEP was created.
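
i.e. each audited site where the recursion is known to be safe would
end up as something like this (sketch):

	/*
	 * This allocation can safely recurse into reclaim, but lockdep
	 * cannot prove that - opt this one site out of reclaim lock
	 * tracking instead of lying about the allocation context.
	 */
	bp->b_addr = kmalloc(size, GFP_KERNEL | __GFP_NOLOCKDEP);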

But until someone starts doing the work, I don't know if it will
work or even whether converting to PF_MEMALLOC_NOFS is going to
introduce a bunch of new ways to get false positives from lockdep...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-19 Thread Dave Chinner
On Mon, Aug 19, 2019 at 08:09:33PM -0700, John Hubbard wrote:
> On 8/19/19 6:20 PM, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
> > > On 8/19/19 2:24 AM, Dave Chinner wrote:
> > > > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> > > > > On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > > > > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > > > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > ...
> > > 
> > > Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
> > > memory with FOLL_LONGTERM, and wondering how to make that work here.
> > 
> > I'm not sure how this interacts with file mappings? I mean, this
> > is just pinning anonymous pages for direct data placement into
> > userspace, right?
> > 
> > Are you asking "what if this pinned memory was a file mapping?",
> > or something else?
> 
> Yes, mainly that one. Especially since the FOLL_LONGTERM flag is
> already there in xdp_umem_pin_pages(), unconditionally. So the
> simple rules about struct *vaddr_pin usage (set it to NULL if FOLL_LONGTERM is
> not set) are not going to work here.
> 
> 
> > 
> > > These are close to files, in how they're handled, but just different
> > > enough that it's not clear to me how to make work with this system.
> > 
> > I'm guessing that if they are pinning a file backed mapping, they
> > are trying to dma direct to the file (zero copy into page cache?)
> > and so they'll need to either play by ODP rules or take layout
> > leases, too
> > 
> 
> OK. I was just wondering if there was some simple way to dig up a
> struct file associated with a socket (I don't think so), but it sounds
> like this is an exercise that's potentially different for each subsystem.

AFAIA, there is no struct file here - the memory that has been pinned
is just something mapped into the application's address space.

It seems to me that the socket here is the equivalent of the RDMA
handle that owns the hardware that pins the pages. Again, that RDMA
handle is not aware of what the mapping represents, hence the need
to hold a layout lease if it's a file mapping.

So from the filesystem perspective, there's no difference between
XDP or RDMA - if it's a FSDAX mapping then it is DMAing directly
into the filesystem's backing store and that will require use of
layout leases to perform safely.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-19 Thread Dave Chinner
On Mon, Aug 19, 2019 at 05:05:53PM -0700, John Hubbard wrote:
> On 8/19/19 2:24 AM, Dave Chinner wrote:
> > On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> > > On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > > > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> ...
> > The last close is an interesting case because the __fput() call
> > actually runs from task_work() context, not where the last reference
> > is actually dropped. So it already has certain specific interactions
> > with signals and task exit processing via task_add_work() and
> > task_work_run().
> > 
> > task_add_work() calls set_notify_resume(task), so if nothing else
> > triggers when returning to userspace we run this path:
> > 
> > exit_to_usermode_loop()
> >   tracehook_notify_resume()
> >     task_work_run()
> >       __fput()
> >         locks_remove_file()
> >           locks_remove_lease()
> > 
> > 
> > It's worth noting that locks_remove_lease() does a
> > percpu_down_read() which means we can already block in this context
> > removing leases
> > 
> > If there is a signal pending, the task work is run this way (before
> > the above notify path):
> > 
> > exit_to_usermode_loop()
> >   do_signal()
> >     get_signal()
> >       task_work_run()
> >         __fput()
> > 
> > We can detect this case via signal_pending() and even SIGKILL via
> > fatal_signal_pending(), and so we can decide not to block based on
> > the fact the process is about to be reaped and so the lease largely
> > doesn't matter anymore. I'd argue that it is close and we can't
> > easily back out, so we'd only break the block on a fatal signal
> > 
> > And then, of course, is the call path through do_exit(), which has
> > the PF_EXITING task flag set:
> > 
> > do_exit()
> >   exit_task_work()
> >     task_work_run()
> >       __fput()
> > 
> > and so it's easy to avoid blocking in this case, too.
> 
> Any thoughts about sockets? I'm looking at net/xdp/xdp_umem.c which pins
> memory with FOLL_LONGTERM, and wondering how to make that work here.

I'm not sure how this interacts with file mappings? I mean, this
is just pinning anonymous pages for direct data placement into
userspace, right?

Are you asking "what if this pinned memory was a file mapping?",
or something else?

> These are close to files, in how they're handled, but just different
> enough that it's not clear to me how to make work with this system.

I'm guessing that if they are pinning a file backed mapping, they
are trying to dma direct to the file (zero copy into page cache?)
and so they'll need to either play by ODP rules or take layout
leases, too

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-19 Thread Dave Chinner
On Mon, Aug 19, 2019 at 09:38:41AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 19, 2019 at 07:24:09PM +1000, Dave Chinner wrote:
> 
> > So that leaves just the normal close() syscall exit case, where the
> > application has full control of the order in which resources are
> > released. We've already established that we can block in this
> > context.  Blocking in an interruptible state will allow fatal signal
> > delivery to wake us, and then we fall into the
> > fatal_signal_pending() case if we get a SIGKILL while blocking.
> 
> The major problem with RDMA is that it doesn't always wait on close() for the
> MR holding the page pins to be destoyed. This is done to avoid a
> deadlock of the form:
> 
> uverbs_destroy_ufile_hw()
>    mutex_lock()
>     [..]
>      mmput()
>       exit_mmap()
>        remove_vma()
>         fput();
>          file_operations->release()

I think this is wrong, and I'm pretty sure it's an example of why
the final __fput() call is moved out of line.

fput()
  fput_many()
    task_work_add(f, __fput())

and the call chain ends there.
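
Condensed from fs/file_table.c (a paraphrase to show the shape of it,
not the exact code):

	/* the final fput never calls __fput() directly - it queues it
	 * as task work to run on return to userspace */
	void fput(struct file *file)
	{
		if (atomic_long_dec_and_test(&file->f_count)) {
			struct task_struct *task = current;

			if (likely(!in_interrupt() &&
				   !(task->flags & PF_KTHREAD))) {
				init_task_work(&file->f_u.fu_rcuhead,
					       ____fput);
				if (!task_work_add(task,
						   &file->f_u.fu_rcuhead,
						   true))
					return;	/* runs at syscall exit */
			}
			/* interrupt/kthread fallback: a workqueue */
			schedule_delayed_work(&delayed_fput_work, 1);
		}
	}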

Before the syscall returns to userspace, it then runs the __fput()
call through the task_work_run() interfaces, and hence the call
chain is just:

task_work_run
  __fput
> file_operations->release()
>  ib_uverbs_close()
>   uverbs_destroy_ufile_hw()
>    mutex_lock()   <-- Deadlock

And there is no deadlock because nothing holds the mutex at this
point.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-19 Thread Dave Chinner
On Mon, Aug 19, 2019 at 08:34:12AM +0200, Jan Kara wrote:
> On Sat 17-08-19 12:26:03, Dave Chinner wrote:
> > On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> > > On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > > > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> > > 2) Second reason is that I thought I did not have a good way to tell if 
> > > the
> > >lease was actually in use.  What I mean is that letting the lease go 
> > > should
> > >be ok IFF we don't have any pins...  I was thinking that without 
> > > John's code
> > >we don't have a way to know if there are any pins...  But that is 
> > > wrong...
> > >All we have to do is check
> > > 
> > >   !list_empty(file->file_pins)
> > > 
> > > So now with this detail I think you are right, we should be able to hold 
> > > the
> > > lease through the struct file even if the process no longer has any
> > > "references" to it (ie closes and munmaps the file).
> > 
> > I really, really dislike the idea of zombie layout leases. It's a
> > nasty hack for poor application behaviour. This is a "we allow use
> > after layout lease release" API, and I think encoding largely
> > untraceable zombie objects into an API is very poor design.
> > 
> > From the fcntl man page:
> > 
> > LEASES
> > Leases are associated with an open file description (see
> > open(2)).  This means that duplicate file descriptors
> > (created by, for example, fork(2) or dup(2)) refer to
> > the same lease, and this lease may be modified or
> > released using any of these descriptors.  Furthermore, the
> > lease is released by either an explicit F_UNLCK operation on
> > any of these duplicate file descriptors, or when all such
> > file descriptors have been closed.
> > 
> > Leases are associated with *open* file descriptors, not the
> > lifetime of the struct file in the kernel. If the application closes
> > the open fds that refer to the lease, then the kernel does not
> > guarantee, and the application has no right to expect, that the
> > lease remains active in any way once the application closes all
> > direct references to the lease.
> > 
> > IOWs, applications using layout leases need to hold the lease fd
> > open for as long as they want access to the physical file layout. It
> > is also a requirement of the layout lease that the holder releases
> > the resources it holds on the layout before it releases the layout
> > lease, exclusive lease or not. Closing the fd indicates they do not
> > need access to the file any more, and so the lease should be
> > reclaimed at that point.
> > 
> > I'm of a mind to make the last close() on a file block if there's an
> > active layout lease to prevent processes from zombie-ing layout
> > leases like this. i.e. you can't close the fd until resources that
> > pin the lease have been released.
> 
> Yeah, so this was my initial though as well [1]. But as the discussion in
> that thread revealed, the problem with blocking last close is that kernel
> does not really expect close to block. You could easily deadlock e.g. if
> the process gets SIGKILL, file with lease has fd 10, and the RDMA context
> holding pages pinned has fd 15.

Sure, I did think about this a bit before suggesting it :)

The last close is an interesting case because the __fput() call
actually runs from task_work() context, not where the last reference
is actually dropped. So it already has certain specific interactions
with signals and task exit processing via task_add_work() and
task_work_run().

task_work_add() calls set_notify_resume(task), so if nothing else
triggers when returning to userspace we run this path:

exit_to_usermode_loop()
  tracehook_notify_resume()
    task_work_run()
      __fput()
        locks_remove_file()
          locks_remove_lease()


It's worth noting that locks_remove_lease() does a
percpu_down_read() which means we can already block in this context
removing leases

If there is a signal pending, the task work is run this way (before
the above notify path):

exit_to_usermode_loop()
  do_signal()
    get_signal()
      task_work_run()
        __fput()

We can detect this case via signal_pending() and even SIGKILL via
fatal_signal_pending(), and so we can decide not to block based on
the fact the process is about to be reaped and so the lease largely
doesn't matter anymore. I'd argue that it is close and we can't
easily back out, so we'd only break the block on a fatal signal...
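
In code, that decision point might look like this (a sketch of the
idea only; lease_pins_released() is hypothetical):

	static int lease_wait_on_pins(struct file_lock *fl)
	{
		/* process is being reaped - blocking buys us nothing */
		if (fatal_signal_pending(current) ||
		    (current->flags & PF_EXITING))
			return 0;

		/* hypothetical predicate: true once all page pins
		 * taken under this lease have been dropped */
		return wait_event_interruptible(fl->fl_wait,
						lease_pins_released(fl));
	}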

[BUG 5.3-rc5] rwsem: use after free on task_struct if task exits with rwsem held

2019-08-18 Thread Dave Chinner
78a91800 index:0x0 compound_mapcount: 0
[  216.480784] flags: 0xd7c0010200(slab|head)
[  216.481807] raw: 00d7c0010200 dead0100 dead0122 888078a91800
[  216.483563] raw:  00030003 0001 
[  216.485310] page dumped because: kasan: bad access detected
[  216.486582] 
[  216.486943] Memory state around the buggy address:
[  216.488038]  0b3f7f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  216.489674]  0b3f7f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  216.491315] >0b3f8000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  216.492943] ^
[  216.494094]  0b3f8080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  216.495737]  0b3f8100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  216.497365] ==================================================================

I outlined both methods of causing this issue because they are two
different use-after-free cases - one is in the read slowpath, the
other is in the write slow path. 

I know that processes should not exit while holding a rwsem, but
bugs do happen.  I'd much prefer that leaked rwsems just hang and we
do not add the potential for random memory corruption into these
situations as well - a lock hang is much easier to debug than a
memory corruption

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)

2019-08-16 Thread Dave Chinner
On Fri, Aug 16, 2019 at 12:05:28PM -0700, Ira Weiny wrote:
> On Thu, Aug 15, 2019 at 03:05:58PM +0200, Jan Kara wrote:
> > On Wed 14-08-19 11:08:49, Ira Weiny wrote:
> > > On Wed, Aug 14, 2019 at 12:17:14PM +0200, Jan Kara wrote:
> 2) Second reason is that I thought I did not have a good way to tell if the
>lease was actually in use.  What I mean is that letting the lease go should
>be ok IFF we don't have any pins...  I was thinking that without John's 
> code
>we don't have a way to know if there are any pins...  But that is wrong...
>All we have to do is check
> 
>   !list_empty(file->file_pins)
> 
> So now with this detail I think you are right, we should be able to hold the
> lease through the struct file even if the process no longer has any
> "references" to it (ie closes and munmaps the file).

I really, really dislike the idea of zombie layout leases. It's a
nasty hack for poor application behaviour. This is a "we allow use
after layout lease release" API, and I think encoding largely
untraceable zombie objects into an API is very poor design.

From the fcntl man page:

LEASES
Leases are associated with an open file description (see
open(2)).  This means that duplicate file descriptors
(created by, for example, fork(2) or dup(2)) refer to
the same lease, and this lease may be modified or
released using any of these descriptors.  Furthermore, the
lease is released by either an explicit F_UNLCK operation on
any of these duplicate file descriptors, or when all such
file descriptors have been closed.

Leases are associated with *open* file descriptors, not the
lifetime of the struct file in the kernel. If the application closes
the open fds that refer to the lease, then the kernel does not
guarantee, and the application has no right to expect, that the
lease remains active in any way once the application closes all
direct references to the lease.

IOWs, applications using layout leases need to hold the lease fd
open for as long as they want access to the physical file layout. It
is also a requirement of the layout lease that the holder releases
the resources it holds on the layout before it releases the layout
lease, exclusive lease or not. Closing the fd indicates they do not
need access to the file any more, and so the lease should be
reclaimed at that point.

I'm of a mind to make the last close() on a file block if there's an
active layout lease to prevent processes from zombie-ing layout
leases like this. i.e. you can't close the fd until resources that
pin the lease have been released.
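
To make the required ordering concrete, correct application usage
looks something like this (an illustrative sketch only: F_LAYOUT is
the proposed constant from this RFC, the fcntl() invocation is a
guess at the final API, and addr is assumed to be an mmap() of fd):

	#include <fcntl.h>
	#include <unistd.h>
	#include <infiniband/verbs.h>

	static int pinned_io(const char *path, struct ibv_pd *pd,
			     void *addr, size_t len)
	{
		int fd = open(path, O_RDWR);
		if (fd < 0)
			return -1;

		fcntl(fd, F_SETLEASE, F_RDLCK | F_LAYOUT);

		struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
					       IBV_ACCESS_LOCAL_WRITE);
		/* ... RDMA to/from the file through mr ... */

		ibv_dereg_mr(mr);		/* 1. drop the page pins */
		fcntl(fd, F_SETLEASE, F_UNLCK);	/* 2. release the lease */
		close(fd);			/* 3. only now close */
		return 0;
	}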

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v5 01/18] xfs: compat_ioctl: use compat_ptr()

2019-08-15 Thread Dave Chinner
On Thu, Aug 15, 2019 at 03:26:49AM -0700, Christoph Hellwig wrote:
> On Thu, Aug 15, 2019 at 01:02:11AM -0700, Christoph Hellwig wrote:
> > In many ways I'd actually much rather have a table driven approach.
> > Let me try something..
> 
> Ok, it seems like we don't even need a table containing native and
> compat as we can just fall back.  The tables still seem nicer to read,
> though.
> 
> Let me know what you think of this:
> 
> http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-ioctl-table

Lots to like in that handful of patches. :)

It can easily go before or after Arnd's patch, and the merge
conflict either way would be minor, so I'm not really fussed either
way this gets sorted out...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 02/19] fs/locks: Add Exclusive flag to user Layout lease

2019-08-14 Thread Dave Chinner
On Wed, Aug 14, 2019 at 10:15:06AM -0400, Jeff Layton wrote:
> On Fri, 2019-08-09 at 15:58 -0700, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> > 
> > Add an exclusive lease flag which indicates that the layout mechanism
> > can not be broken.
> > 
> > Exclusive layout leases allow the file system to know that pages may be
> > GUP pined and that attempts to change the layout, ie truncate, should be
> > failed.
> > 
> > A process which attempts to break it's own exclusive lease gets an
> > EDEADLOCK return to help determine that this is likely a programming bug
> > vs someone else holding a resource.
.
> > diff --git a/include/uapi/asm-generic/fcntl.h 
> > b/include/uapi/asm-generic/fcntl.h
> > index baddd54f3031..88b175ceccbc 100644
> > --- a/include/uapi/asm-generic/fcntl.h
> > +++ b/include/uapi/asm-generic/fcntl.h
> > @@ -176,6 +176,8 @@ struct f_owner_ex {
> >  
> >  #define F_LAYOUT   16  /* layout lease to allow longterm pins such as
> >RDMA */
> > +#define F_EXCLUSIVE32  /* layout lease is exclusive */
> > +   /* FIXME or shoudl this be F_EXLCK??? */
> >  
> >  /* operations for bsd flock(), also used by the kernel implementation */
> >  #define LOCK_SH1   /* shared lock */
> 
> This interface just seems weird to me. The existing F_*LCK values aren't
> really set up to be flags, but are enumerated values (even if there are
> some gaps on some arches). For instance, on parisc and sparc:

I don't think we need to worry about this - the F_WRLCK version of
the layout lease should have these exclusive access semantics (i.e.
other ops fail rather than block waiting for lease recall) and hence
the API shouldn't need a new flag to specify them.

i.e. the primary difference between F_RDLCK and F_WRLCK layout
leases is that the F_RDLCK is a shared, co-operative lease model
where only delays in operations will be seen, while F_WRLCK is a
"guarantee exclusive access and I don't care what it breaks"
model... :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v5 01/18] xfs: compat_ioctl: use compat_ptr()

2019-08-14 Thread Dave Chinner
On Wed, Aug 14, 2019 at 10:42:28PM +0200, Arnd Bergmann wrote:
> For 31-bit s390 user space, we have to pass pointer arguments through
> compat_ptr() in the compat_ioctl handler.

Seems fair enough, but...
> 
> Signed-off-by: Arnd Bergmann 
> ---
>  fs/xfs/xfs_ioctl32.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
> index 7fcf7569743f..ad91e81a2fcf 100644
> --- a/fs/xfs/xfs_ioctl32.c
> +++ b/fs/xfs/xfs_ioctl32.c
> @@ -547,7 +547,7 @@ xfs_file_compat_ioctl(
>   struct inode*inode = file_inode(filp);
>   struct xfs_inode*ip = XFS_I(inode);
>   struct xfs_mount*mp = ip->i_mount;
> - void__user *arg = (void __user *)p;
> + void__user *arg = compat_ptr(p);
>   int error;
>  
>   trace_xfs_file_compat_ioctl(ip);
> @@ -576,7 +576,7 @@ xfs_file_compat_ioctl(
>   case XFS_IOC_SCRUB_METADATA:
>   case XFS_IOC_BULKSTAT:
>   case XFS_IOC_INUMBERS:
> - return xfs_file_ioctl(filp, cmd, p);
> + return xfs_file_ioctl(filp, cmd, (unsigned long)arg);

I don't really like having to sprinkle special casts through the
code because of this.

Perhaps do something like:

static inline unsigned long compat_ptr_mask(unsigned long p)
{
	return (unsigned long)compat_ptr(p);
}

and then up front you can do:

	void __user	*arg;

	p = compat_ptr_mask(p);
	arg = (void __user *)p;


and then the rest of the code remains unchanged but now uses p
correctly instead of having to change all the code to cast arg back
to an unsigned long...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space

2019-08-14 Thread Dave Chinner
On Wed, Aug 14, 2019 at 07:21:34AM -0400, Jeff Layton wrote:
> On Wed, 2019-08-14 at 18:05 +1000, Dave Chinner wrote:
> > On Mon, Aug 12, 2019 at 10:36:26AM -0700, Ira Weiny wrote:
> > > On Sat, Aug 10, 2019 at 09:52:31AM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 09, 2019 at 03:58:15PM -0700, ira.we...@intel.com wrote:
> > > > > + /*
> > > > > +  * NOTE on F_LAYOUT lease
> > > > > +  *
> > > > > +  * LAYOUT lease types are taken on files which the user knows 
> > > > > that
> > > > > +  * they will be pinning in memory for some indeterminate amount 
> > > > > of
> > > > > +  * time.
> > > > 
> > > > Indeed, layout leases have nothing to do with pinning of memory.
> > > 
> > > Yep, Fair enough.  I'll rework the comment.
> > > 
> > > > That's something an application that uses layout leases might do,
> > > > but it is largely irrelevant to the functionality layout leases
> > > > provide. What needs to be done here is explain what the layout lease
> > > > API actually guarantees w.r.t. the physical file layout, not what
> > > > some application is going to do with a lease. e.g.
> > > > 
> > > > The layout lease F_RDLCK guarantees that the holder will be
> > > > notified that the physical file layout is about to be
> > > > changed, and that it needs to release any resources it has
> > > > over the range of this lease, drop the lease and then
> > > > request it again to wait for the kernel to finish whatever
> > > > it is doing on that range.
> > > > 
> > > > The layout lease F_RDLCK also allows the holder to modify
> > > > the physical layout of the file. If an operation from the
> > > > lease holder occurs that would modify the layout, that lease
> > > > holder does not get notification that a change will occur,
> > > > but it will block until all other F_RDLCK leases have been
> > > > released by their holders before going ahead.
> > > > 
> > > > If there is a F_WRLCK lease held on the file, then a F_RDLCK
> > > > holder will fail any operation that may modify the physical
> > > > layout of the file. F_WRLCK provides exclusive physical
> > > > modification access to the holder, guaranteeing nothing else
> > > > will change the layout of the file while it holds the lease.
> > > > 
> > > > The F_WRLCK holder can change the physical layout of the
> > > > file if it so desires, this will block while F_RDLCK holders
> > > > are notified and release their leases before the
> > > > modification will take place.
> > > > 
> > > > We need to define the semantics we expose to userspace first.
> 
> Absolutely.
> 
> > > 
> > > Agreed.  I believe I have implemented the semantics you describe above.  
> > > Do I
> > > have your permission to use your verbiage as part of reworking the 
> > > comment and
> > > commit message?
> > 
> > Of course. :)
> > 
> > Cheers,
> > 
> 
> I'll review this in more detail soon, but subsequent postings of the set
> should probably also go to linux-api mailing list. This is a significant
> API change. It might not also hurt to get the glibc folks involved here
> too since you'll probably want to add the constants to the headers there
> as well.

Sure, but let's first get it to the point where we have something
that is actually workable, much more complete and somewhat validated
with unit tests before we start involving too many people. Wide
review of prototype code isn't really a good use of resources given
how much it's probably going to change from here...

> Finally, consider going ahead and drafting a patch to the fcntl(2)
> manpage if you think you have the API mostly nailed down. This API is a
> little counterintuitive (i.e. you can change the layout with an F_RDLCK
> lease), so it will need to be very clearly documented. I've also found
> that when creating a new API, documenting it tends to help highlight its
> warts and areas where the behavior is not clearly defined.

I find writing unit tests for xfstests to validate the new APIs work
as intended finds far more problems with the API than writing the
documentation. :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space

2019-08-14 Thread Dave Chinner
On Mon, Aug 12, 2019 at 10:36:26AM -0700, Ira Weiny wrote:
> On Sat, Aug 10, 2019 at 09:52:31AM +1000, Dave Chinner wrote:
> > On Fri, Aug 09, 2019 at 03:58:15PM -0700, ira.we...@intel.com wrote:
> > > + /*
> > > +  * NOTE on F_LAYOUT lease
> > > +  *
> > > +  * LAYOUT lease types are taken on files which the user knows that
> > > +  * they will be pinning in memory for some indeterminate amount of
> > > +  * time.
> > 
> > Indeed, layout leases have nothing to do with pinning of memory.
> 
> Yep, Fair enough.  I'll rework the comment.
> 
> > That's something an application that uses layout leases might do,
> > but it is largely irrelevant to the functionality layout leases
> > provide. What needs to be done here is explain what the layout lease
> > API actually guarantees w.r.t. the physical file layout, not what
> > some application is going to do with a lease. e.g.
> > 
> > The layout lease F_RDLCK guarantees that the holder will be
> > notified that the physical file layout is about to be
> > changed, and that it needs to release any resources it has
> > over the range of this lease, drop the lease and then
> > request it again to wait for the kernel to finish whatever
> > it is doing on that range.
> > 
> > The layout lease F_RDLCK also allows the holder to modify
> > the physical layout of the file. If an operation from the
> > lease holder occurs that would modify the layout, that lease
> > holder does not get notification that a change will occur,
> > but it will block until all other F_RDLCK leases have been
> > released by their holders before going ahead.
> > 
> > If there is a F_WRLCK lease held on the file, then a F_RDLCK
> > holder will fail any operation that may modify the physical
> > layout of the file. F_WRLCK provides exclusive physical
> > modification access to the holder, guaranteeing nothing else
> > will change the layout of the file while it holds the lease.
> > 
> > The F_WRLCK holder can change the physical layout of the
> > file if it so desires, this will block while F_RDLCK holders
> > are notified and release their leases before the
> > modification will take place.
> > 
> > We need to define the semantics we expose to userspace first.
> 
> Agreed.  I believe I have implemented the semantics you describe above.  Do I
> have your permission to use your verbiage as part of reworking the comment and
> commit message?

Of course. :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 07/19] fs/xfs: Teach xfs to use new dax_layout_busy_page()

2019-08-14 Thread Dave Chinner
On Mon, Aug 12, 2019 at 11:05:51AM -0700, Ira Weiny wrote:
> On Sat, Aug 10, 2019 at 09:30:37AM +1000, Dave Chinner wrote:
> > On Fri, Aug 09, 2019 at 03:58:21PM -0700, ira.we...@intel.com wrote:
> > > From: Ira Weiny 
> > > 
> > > dax_layout_busy_page() can now operate on a sub-range of the
> > > address_space provided.
> > > 
> > > Have xfs specify the sub range to dax_layout_busy_page()
> > 
> > Hmmm. I've got patches that change all these XFS interfaces to
> > support range locks. I'm not sure the way the ranges are passed here
> > is the best way to do it, and I suspect they aren't correct in some
> > cases, either
> > 
> > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > index ff3c1fae5357..f0de5486f6c1 100644
> > > --- a/fs/xfs/xfs_iops.c
> > > +++ b/fs/xfs/xfs_iops.c
> > > @@ -1042,10 +1042,16 @@ xfs_vn_setattr(
> > >   xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> > >   iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > >  
> > > - error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
> > > - if (error) {
> > > - xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> > > - return error;
> > > + if (iattr->ia_size < inode->i_size) {
> > > + loff_t  off = iattr->ia_size;
> > > + loff_t  len = inode->i_size - iattr->ia_size;
> > > +
> > > + error = xfs_break_layouts(inode, &iolock, off, len,
> > > +   BREAK_UNMAP);
> > > + if (error) {
> > > + xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> > > + return error;
> > > + }
> > 
> > This isn't right - truncate up still needs to break the layout on
> > the last filesystem block of the file,
> 
> I'm not following this?  From a user perspective they can't have done anything
> with the data beyond the EOF.  So isn't it safe to allow EOF to grow without
> changing the layout of that last block?


You're looking at this from the perspective of RDMA page pinning,
not the guarantees a filesystem has to provide to layout holders.

For example, truncate up has to zero the portion of the block beyond
EOF and that requires a data write. What happens if that block is a
shared extent and hence we have do a copy on write and alter the
file layout?

Or perhaps that tail block still has dirty data over it that is
marked for delayed allocation? Truncate up will have to write that
data to zero the delayed allocation extent that spans EOF, and hence
the truncate modifies the layout because it triggers allocation.

i.e. just because an operation does not change user data, it does
not mean that it will not change the file layout. There is a chance
that truncate up will modify the layout and so we need to break the
layout leases that span the range from the old size to the new
size...
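In code terms, that means the setattr path needs a truncate-up
branch as well, something like this (an illustrative sketch against
the patch above, not tested code):

	if (iattr->ia_size > inode->i_size) {
		/*
		 * Truncate up: the block spanning the old EOF may be
		 * zeroed, COWed or have delalloc converted, and
		 * post-EOF preallocation may be modified, so break
		 * layouts from the old EOF block to the maximum file
		 * offset.
		 */
		loff_t	off = round_down(inode->i_size,
					 i_blocksize(inode));

		error = xfs_break_layouts(inode, &iolock, off,
					  LLONG_MAX - off, BREAK_UNMAP);
		if (error) {
			xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
			return error;
		}
	}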

> > and truncate down needs to
> > extend to "maximum file offset" because we remove all extents beyond
> > EOF on a truncate down.
> 
> Ok, I was trying to allow a user to extend the file without conflicts if they
> were to have a pin on the 'beginning' of the original file.

If we want to allow file extension under a layout lease, the lease
has to extend beyond EOF, otherwise the new section of the file is
not covered by a lease. If leases only extend to the existing
EOF, then once the new data is written and the file is extended,
the lease owner needs to take a new lease on the range they just
wrote. So the application ends up having to do
write-lease-write-lease-... so that it has leases covering the
range of the file it is extending into.

Much better is to define a lease that extends to the maximum file
offset, such that it always covers the range past the existing EOF
and extending writes will automatically be covered. What this then
does is trigger layout break notifications on file size change,
whether by write, truncate or fallocate, without having to actually
know or track the exact file size in the lease...
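Internally that could be represented with a lease record whose range
is open-ended (a sketch using the kernel's struct file_lock; the
field values are illustrative only):

	/*
	 * Illustrative only: a shared layout lease from 'start' to
	 * the maximum file offset, so it covers the current EOF and
	 * any future extension of the file.
	 */
	struct file_lock fl = {
		.fl_type	= F_RDLCK,
		.fl_flags	= FL_LAYOUT,
		.fl_start	= start,
		.fl_end		= OFFSET_MAX,
	};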

> This sounds like
> you are saying that a layout lease must be dropped to do that?  In some ways I
> think I understand what you are driving at and I think I see how I may have
> been playing "fast and loose" with the strictness of the layout lease.  But
> from a user perspective if there is a part of the file which "does not exist"
> (beyond EOF) does it matter that the layout there may change?

Yes, it does, because userspace c

Re: [RFC PATCH v2 01/19] fs/locks: Export F_LAYOUT lease to user space

2019-08-09 Thread Dave Chinner
On Fri, Aug 09, 2019 at 03:58:15PM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> In order to support an opt-in policy for users to allow long term pins
> of FS DAX pages we need to export the LAYOUT lease to user space.
> 
> This is the first of 2 new lease flags which must be used to allow a
> long term pin to be made on a file.
> 
> After the complete series:
> 
> 0) Registrations to Device DAX char devs are not affected
> 
> 1) The user has to opt in to allowing page pins on a file with an exclusive
>layout lease.  Both exclusive and layout lease flags are user visible now.
> 
> 2) page pins will fail if the lease is not active when the file backed page is
>encountered.
> 
> 3) Any truncate or hole punch operation on a pinned DAX page will fail.
> 
> 4) The user has the option of holding the lease or releasing it.  If they
>release it no other pin calls will work on the file.
> 
> 5) Closing the file is ok.
> 
> 6) Unmapping the file is ok
> 
> 7) Pins against the files are tracked back to an owning file or an owning mm
>depending on the internal subsystem needs.  With RDMA there is an owning
>file which is related to the pinned file.
> 
> 8) Only RDMA is currently supported
> 
> 9) Truncation of pages which are not actively pinned nor covered by a lease
>will succeed.

This has nothing to do with layout leases or what they provide
access arbitration over. Layout leases have _nothing_ to do with
page pinning or RDMA - they arbitrate behaviour the file offset ->
physical block device mapping within the filesystem and the
behaviour that will occur when a specific lease is held.

The commit description needs to describe what F_LAYOUT actually
protects, when leases will get broken, etc., not how RDMA is going
to use it.

> @@ -2022,8 +2030,26 @@ static int do_fcntl_add_lease(unsigned int fd, struct file *filp, long arg)
>   struct file_lock *fl;
>   struct fasync_struct *new;
>   int error;
> + unsigned int flags = 0;
> +
> + /*
> +  * NOTE on F_LAYOUT lease
> +  *
> +  * LAYOUT lease types are taken on files which the user knows that
> +  * they will be pinning in memory for some indeterminate amount of
> +  * time.

Indeed, layout leases have nothing to do with pinning of memory.
That's something an application that uses layout leases might do,
but it largely irrelevant to the functionality layout leases
provide. What needs to be done here is explain what the layout lease
API actually guarantees w.r.t. the physical file layout, not what
some application is going to do with a lease. e.g.

The layout lease F_RDLCK guarantees that the holder will be
notified that the physical file layout is about to be
changed, and that it needs to release any resources it has
over the range of this lease, drop the lease and then
request it again to wait for the kernel to finish whatever
it is doing on that range.

The layout lease F_RDLCK also allows the holder to modify
the physical layout of the file. If an operation from the
lease holder occurs that would modify the layout, that lease
holder does not get notification that a change will occur,
but it will block until all other F_RDLCK leases have been
released by their holders before going ahead.

If there is a F_WRLCK lease held on the file, then a F_RDLCK
holder will fail any operation that may modify the physical
layout of the file. F_WRLCK provides exclusive physical
modification access to the holder, guaranteeing nothing else
will change the layout of the file while it holds the lease.

The F_WRLCK holder can change the physical layout of the
file if it so desires, this will block while F_RDLCK holders
are notified and release their leases before the
modification will take place.

We need to define the semantics we expose to userspace first.
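To make that concrete, the userspace side might look something like
this (a minimal sketch assuming the RFC's approach of ORing F_LAYOUT
into the F_SETLEASE argument; the constant's value here is a
placeholder, not ABI):

#include <fcntl.h>
#include <errno.h>

#ifndef F_LAYOUT
#define F_LAYOUT 16	/* placeholder value for illustration only */
#endif

/*
 * Take a shared layout lease: we are notified (SIGIO by default)
 * before the physical layout changes, and must then release our
 * resources and drop the lease.
 */
static int take_layout_lease(int fd)
{
	if (fcntl(fd, F_SETLEASE, F_RDLCK | F_LAYOUT) < 0)
		return -errno;
	return 0;
}

static int drop_layout_lease(int fd)
{
	if (fcntl(fd, F_SETLEASE, F_UNLCK) < 0)
		return -errno;
	return 0;
}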

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 07/19] fs/xfs: Teach xfs to use new dax_layout_busy_page()

2019-08-09 Thread Dave Chinner
On Fri, Aug 09, 2019 at 03:58:21PM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> dax_layout_busy_page() can now operate on a sub-range of the
> address_space provided.
> 
> Have xfs specify the sub range to dax_layout_busy_page()

Hmmm. I've got patches that change all these XFS interfaces to
support range locks. I'm not sure the way the ranges are passed here
is the best way to do it, and I suspect they aren't correct in some
cases, either

> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index ff3c1fae5357..f0de5486f6c1 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -1042,10 +1042,16 @@ xfs_vn_setattr(
>   xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>   iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
>  
> - error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
> - if (error) {
> - xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> - return error;
> + if (iattr->ia_size < inode->i_size) {
> + loff_t  off = iattr->ia_size;
> + loff_t  len = inode->i_size - iattr->ia_size;
> +
> + error = xfs_break_layouts(inode, &iolock, off, len,
> +   BREAK_UNMAP);
> + if (error) {
> + xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> + return error;
> + }

This isn't right - truncate up still needs to break the layout on
the last filesystem block of the file, and truncate down needs to
extend to "maximum file offset" because we remove all extents beyond
EOF on a truncate down.

i.e. when we use preallocation, the extent map extends beyond EOF,
and layout leases need to be able to extend beyond the current EOF
to allow the lease owner to do extending writes, extending truncate,
preallocation beyond EOF, etc safely without having to get a new
lease to cover the new region in the extended file...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH v2 08/19] fs/xfs: Fail truncate if page lease can't be broken

2019-08-09 Thread Dave Chinner
On Fri, Aug 09, 2019 at 03:58:22PM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> If pages are under a lease fail the truncate operation.  We change the order of
> lease breaks to directly fail the operation if the lease exists.
> 
> Select EXPORT_BLOCK_OPS for FS_DAX to ensure that xfs_break_lease_layouts() is
> defined for FS_DAX as well as pNFS.
> 
> Signed-off-by: Ira Weiny 
> ---
>  fs/Kconfig| 1 +
>  fs/xfs/xfs_file.c | 5 +++--
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 14cd4abdc143..c10b91f92528 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -48,6 +48,7 @@ config FS_DAX
>   select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)
>   select FS_IOMAP
>   select DAX
> + select EXPORTFS_BLOCK_OPS
>   help
> Direct Access (DAX) can be used on memory-backed block devices.
> If the block device supports DAX and the filesystem supports DAX,

That looks wrong. If you require xfs_break_lease_layouts() outside
of pnfs context, then move the function in the XFS code base to a
file that is built in. It's only external dependency is on the
break_layout() function, and XFS already has other unconditional
direct calls to break_layout()...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: XFS segementation fault with new linux 4.19.63

2019-08-06 Thread Dave Chinner
On Tue, Aug 06, 2019 at 09:10:57AM +0200, Kinky Nekoboi wrote:
> Addional info:
> 
> this only occurs if kernel is compiled with:
> 
> CONFIG_XFS_DEBUG=y
> 
> running 4.19.64 without xfs debugging works fine

I'm guessing 4.19 doesn't have commit c08768977b9a ("xfs: finobt AG
reserves don't consider last AG can be a runt")

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC PATCH 0/7] xfs: add reflink & dedupe support for fsdax.

2019-08-04 Thread Dave Chinner
On Thu, Aug 01, 2019 at 09:37:04AM +0800, Shiyang Ruan wrote:
> 
> 
> On 8/1/19 4:33 AM, Goldwyn Rodrigues wrote:
> > On 19:49 31/07, Shiyang Ruan wrote:
> > > This patchset aims to take care of this issue to make reflink and dedupe
> > > work correctly in XFS.
> > > 
> > > It is based on Goldwyn's patchsets: "v4 Btrfs dax support" and "Btrfs
> > > iomap".  I picked up some related patches and made a few fixes to make it
> > > basically work fine.
> > > 
> > > For dax framework:
> > >1. adapt to the latest change in iomap.
> > > 
> > > For XFS:
> > >1. report the source address and set IOMAP_COW type for those write
> > >   operations that need COW.
> > >2. update extent list at the end.
> > >3. add file contents comparison function based on dax framework.
> > >4. use xfs_break_layouts() to support dax.
> > 
> > Shiyang,
> > 
> > I think you used the older patches which does not contain the iomap changes
> > which would call iomap_begin() with two iomaps. I have it in the btrfs-iomap
> 
> Oh, Sorry for my carelessness.  This patchset is built on your "Btrfs
> iomap".  I didn't point it out in cover letter.
> 
> > branch and plan to update it today. It is built on v5.3-rcX, so it should
> > contain the changes which moves the iomap code to the different directory.
> > I will build the dax patches on top of that.
> > However, we are making a big dependency chain here
> Don't worry.  It's fine for me.  I'll follow your updates.

Hi Shiyang,

I'll wait for you to update your patches on top of the latest btrfs
patches before looking at this any further. It would be good to get
a set of iomap infrastructure patches separated from the btrfs
patchsets so we could have them both built from a common patchset.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v3 0/2] mm,thp: Add filemap_huge_fault() for THP

2019-07-31 Thread Dave Chinner
On Wed, Jul 31, 2019 at 04:32:21AM -0700, Matthew Wilcox wrote:
> On Wed, Jul 31, 2019 at 08:20:53PM +1000, Dave Chinner wrote:
> > On Wed, Jul 31, 2019 at 02:25:11AM -0600, William Kucharski wrote:
> > > This set of patches is the first step towards a mechanism for 
> > > automatically
> > > mapping read-only text areas of appropriate size and alignment to THPs
> > > whenever possible.
> > > 
> > > For now, the central routine, filemap_huge_fault(), and various support
> > > routines are only included if the experimental kernel configuration option
> > > 
> > >   RO_EXEC_FILEMAP_HUGE_FAULT_THP
> > > 
> > > is enabled.
> > > 
> > > This is because filemap_huge_fault() is dependent upon the
> > > address_space_operations vector readpage() pointing to a routine that will
> > > read and fill an entire large page at a time without polluting the page
> > > cache with PAGESIZE entries
> > 
> > How is the readpage code supposed to stuff a THP page into a bio?
> > 
> > i.e. Do bio's support huge pages, and if not, what is needed to
> > stuff a huge page in a bio chain?
> 
> I believe that the current BIO code (after Ming Lei's multipage patches
> from late last year / earlier this year) is capable of handling a
> PMD-sized page.
> 
> > Once you can answer that question, you should be able to easily
> > convert the iomap_readpage/iomap_readpage_actor code to support THP
> > pages without having to care about much else as iomap_readpage()
> > is already coded in a way that will iterate IO over the entire THP
> > for you
> 
> Christoph drafted a patch which illustrates the changes needed to the
> iomap code.  The biggest problem is:
> 
> struct iomap_page {
> atomic_tread_count;
> atomic_twrite_count;
> DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
> };
> 
> All of a sudden that needs to go from a single unsigned long bitmap (or
> two on 64kB page size machines) to 512 bytes on x86 and even larger on,
> eg, POWER.

The struct iomap_page is dynamically allocated, so the bitmap itself
can be sized appropriately for the size of the page the structure is
being allocated for. The current code is simple because we have a
bound of PAGE_SIZE, so the structure size is always small.

Making it dynamically sized would also reduce the size of the bitmap
because it only needs to track filesystem blocks, not sectors. The
fact it is hard coded means it has to support the worst case of
tracking uptodate state for 512 byte block sizes, hence the "128
bits on 64k pages" static size.

i.e. huge pages on a 4k block size filesystem only requires 512
*bits* for a 2MB page, not 512 * 8 bits.  And when I get back to the
64k block size on 4k page size support for XFS+iomap, that will go
down even further. i.e. the huge page will only have to track 32
filesystem blocks, not 512, and we're back to fitting in the
existing static iomap_page

So, yeah, I think the struct iomap_page needs to be dynamically
sized to support 2MB (or larger) pages effectively.
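Something like the following would do it (a sketch only - the names,
helpers and sizing here are illustrative, not the upstream
structure):

struct iomap_page {
	atomic_t	read_count;
	atomic_t	write_count;
	unsigned int	nr_blocks;	/* fs blocks tracked in this page */
	unsigned long	uptodate[];	/* one bit per fs block */
};

static struct iomap_page *
iomap_page_alloc(struct inode *inode, struct page *page)
{
	unsigned int nr = (PAGE_SIZE << compound_order(page)) >>
							inode->i_blkbits;
	struct iomap_page *iop;

	/*
	 * Bitmap is sized for this page/block size combination only:
	 * 512 bits for a 2MB page of 4k blocks, down to a single bit
	 * for a 64k block in a 64k page.
	 */
	iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr)),
			GFP_NOFS);
	if (iop)
		iop->nr_blocks = nr;
	return iop;
}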

/me wonders what is necessary for page invalidation to work
correctly for these huge pages. e.g. someone does a direct IO
write to a range within a cached read only huge page

Which reminds me, I bet there are assumptions in some of the iomap
code (or surrounding filesystem code) that assume if filesystem
block size = PAGE_SIZE there will be no iomap_page attached to the
page. And that if there is an iomap_page attached, then the block
size is < PAGE_SIZE. And don't make assumptions about block size
being <= PAGE_SIZE, as I have a patchset to support block size >
PAGE_SIZE for the iomap and XFS code which I'll be getting back to
Real Soon.

> It's egregious because no sane filesystem is going to fragment a PMD
> sized page into that number of discontiguous blocks,

It's not whether a sane filesystem will do that, the reality is that
it can happen and so it needs to work. Anyone using 512 byte block
size filesystems and expecting PMD sized pages to be *efficient* has
rocks in their head. We just need to make it work.

> so we never need
> to allocate the 520 byte data structure this suddenly becomes.  It'd be
> nice to have a more efficient data structure (maybe that tracks uptodate
> by extent instead of by individual sector?)

Extents can still get fragmented, and we have to support the worst
case fragmentation that can occur. Which is single filesystem
blocks. And that fragmentation can change during the life of the
page (punch out blocks, allocate different ones, COW, etc) so we
have to allocate the worst case up front even if we rarely (if
ever!) need it.

> But I don't understand the

Re: [PATCH v3 0/2] mm,thp: Add filemap_huge_fault() for THP

2019-07-31 Thread Dave Chinner
On Wed, Jul 31, 2019 at 02:25:11AM -0600, William Kucharski wrote:
> This set of patches is the first step towards a mechanism for automatically
> mapping read-only text areas of appropriate size and alignment to THPs
> whenever possible.
> 
> For now, the central routine, filemap_huge_fault(), and various support
> routines are only included if the experimental kernel configuration option
> 
>   RO_EXEC_FILEMAP_HUGE_FAULT_THP
> 
> is enabled.
> 
> This is because filemap_huge_fault() is dependent upon the
> address_space_operations vector readpage() pointing to a routine that will
> read and fill an entire large page at a time without polluting the page
> cache with PAGESIZE entries

How is the readpage code supposed to stuff a THP page into a bio?

i.e. Do bio's support huge pages, and if not, what is needed to
stuff a huge page in a bio chain?

Once you can answer that question, you should be able to easily
convert the iomap_readpage/iomap_readpage_actor code to support THP
pages without having to care about much else as iomap_readpage()
is already coded in a way that will iterate IO over the entire THP
for you
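To make the question concrete: the hope is that the multi-page bvec
work means something like this is sufficient (a sketch under that
assumption; bdev, sector and page are taken from the surrounding
readpage context, and this is not verified against every bio path):

	struct bio *bio = bio_alloc(GFP_NOFS, 1);

	bio_set_dev(bio, bdev);
	bio->bi_iter.bi_sector = sector;
	bio->bi_opf = REQ_OP_READ;

	/* Multi-page bvecs let one bio_add_page() span the whole THP. */
	if (bio_add_page(bio, page, HPAGE_PMD_SIZE, 0) != HPAGE_PMD_SIZE) {
		/* segment limits hit: would need to chain another bio */
	}
	submit_bio(bio);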

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-30 Thread Dave Chinner
On Tue, Jul 30, 2019 at 02:06:33AM +0000, Damien Le Moal wrote:
> If we had a pread_nofs()/pwrite_nofs(), that would work. Or we could define a
> RWF_NORECLAIM flag for pwritev2()/preadv2(). This last one could actually be
> the cleanest approach.

Clean, yes, but I'm not sure we want to expose kernel memory reclaim
capabilities to userspace... It would be misleading, too, because we
still want to allow reclaim to occur, just not have reclaim recurse
into other filesystems

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-28 Thread Dave Chinner
On Sat, Jul 27, 2019 at 02:59:59AM +0000, Damien Le Moal wrote:
> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
> > On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
> >>>
> >>> This looks like something that could hit every file systems, so
> >>> shouldn't we fix this in common code?  We could also look into
> >>> just using memalloc_nofs_save for the page cache allocation path
> >>> instead of the per-mapping gfp_mask.
> >>
> >> I think it has to be the entire IO path - any allocation from the
> >> underlying filesystem could recurse into the top level filesystem
> >> and then deadlock if the memory reclaim submits IO or blocks on
> >> IO completion from the upper filesystem. That's a bloody big hammer
> >> for something that is only necessary when there are stacked
> >> filesystems like this
> > 
> > Yeah that's why using memalloc_nofs_save() probably makes the most
> > sense, and dm_zoned should use that before it calls into ext4.
> 
> Unfortunately, with this particular setup, that will not solve the problem.
> dm-zoned submit BIOs to its backend drive in response to XFS activity. The
> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
> that HBA command ring. The commands themselves are read from the ring and
> executed by the tcmu-runner user process which executes them doing
> pread()/pwrite() to the ext4 file. The tcmu-runner process being a different
> context than the dm-zoned worker thread issuing the BIO,
> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.

Right, I'm talking about using memalloc_nofs_save() as a huge hammer
in the pread/pwrite() calling context, not the bio submission
context (which is typically GFP_NOFS above submit_bio() and GFP_NOIO
below).

> So back to Dave's point, we may be needing the big-hammer solution in the case
> of stacked file systems, while non-stacked setups do not necessarily need it
> (that is for the FS to decide). But I do not see how to implement this big
> hammer conditionally. How can a file system tell if it is at the top of the
> stack (big hammer not needed) or lower than the top level (big hammer needed)?

Age old question - tracking arbitrary nesting through mount/fs/bdev
layers is ... difficult. There is no easy way to tell.

> One simple hack would be an fcntl() or mount option to tell the FS to use
> GFP_NOFS unconditionally, but avoiding the bug would mean making sure that the
> applications or system setup is correct. So not so safe.

Wasn't there discussion at some point in the past about an interface
for special processes to be able to mark themselves as PF_MEMALLOC
(some kind of prctl, I think) for things like FUSE daemons? That
would prevent direct reclaim recursion for these userspace daemons
that are in the kernel memory reclaim IO path. It's the same
situation there, isn't it? How does fuse deal with this problem?
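For reference, this is the shape such an interface eventually took:
PR_SET_IO_FLUSHER (merged in Linux 5.6) marks a userspace daemon as
being in the block/filesystem writeback path, so its allocations are
treated as if PF_MEMALLOC_NOIO were set and reclaim cannot recurse
back into it. A userspace sketch, guarding the constant in case the
headers predate it:

#include <sys/prctl.h>
#include <stdio.h>

#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#endif

/*
 * Mark this daemon (e.g. tcmu-runner, a FUSE server) as part of the
 * kernel memory reclaim IO path so direct reclaim from its own
 * allocations cannot deadlock on the IO it is servicing.
 */
static void mark_io_flusher(void)
{
	if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) < 0)
		perror("prctl(PR_SET_IO_FLUSHER)");
}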

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-26 Thread Dave Chinner
On Thu, Jul 25, 2019 at 04:54:42AM -0700, Christoph Hellwig wrote:
> On Thu, Jul 25, 2019 at 06:33:58PM +0900, Damien Le Moal wrote:
> > +   gfp_t gfp_mask;
> > +
> > switch (ext4_inode_journal_mode(inode)) {
> > case EXT4_INODE_ORDERED_DATA_MODE:
> > case EXT4_INODE_WRITEBACK_DATA_MODE:
> > @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode)
> > inode->i_mapping->a_ops = &ext4_da_aops;
> > else
> > inode->i_mapping->a_ops = &ext4_aops;
> > +
> > +   /*
> > +* Ensure all page cache allocations are done from GFP_NOFS context to
> > +* prevent direct reclaim recursion back into the filesystem and blowing
> > +* stacks or deadlocking.
> > +*/
> > +   gfp_mask = mapping_gfp_mask(inode->i_mapping);
> > +   mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));
> 
> This looks like something that could hit every file systems, so
> shouldn't we fix this in common code?  We could also look into
> just using memalloc_nofs_save for the page cache allocation path
> instead of the per-mapping gfp_mask.

I think it has to be the entire IO path - any allocation from the
underlying filesystem could recurse into the top level filesystem
and then deadlock if the memory reclaim submits IO or blocks on
IO completion from the upper filesystem. That's a bloody big hammer
for something that is only necessary when there are stacked
filesystems like this
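As a sketch of what that hammer would look like if applied at the
I/O path entry points of the upper filesystem (an illustrative
wrapper only, not a real patch):

/*
 * Illustrative only: wrap the stacked filesystem's write entry
 * point so every allocation below it - page cache, bios, anything
 * in the lower filesystem - is implicitly GFP_NOFS and reclaim
 * cannot recurse back into a filesystem.
 */
static ssize_t
stacked_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	unsigned int	nofs_flag = memalloc_nofs_save();
	ssize_t		ret;

	ret = generic_file_write_iter(iocb, from);
	memalloc_nofs_restore(nofs_flag);
	return ret;
}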

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: pagecache locking

2019-07-09 Thread Dave Chinner
On Mon, Jul 08, 2019 at 03:31:14PM +0200, Jan Kara wrote:
> On Sat 06-07-19 09:31:57, Dave Chinner wrote:
> > On Wed, Jul 03, 2019 at 03:04:45AM +0300, Boaz Harrosh wrote:
> > > On 20/06/2019 01:37, Dave Chinner wrote:
> > > <>
> > > > 
> > > > I'd prefer it doesn't get lifted to the VFS because I'm planning on
> > > > getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is
> > > > likely to go away in the near term because a range lock can be
> > > > taken on either side of the mmap_sem in the page fault path.
> > > > 
> > > <>
> > > Sir Dave
> > > 
> > > Sorry if this was answered before. I am please very curious. In the zufs
> > > project I have an equivalent rw_MMAPLOCK that I _read_lock on page_faults.
> > > (Read & writes all take read-locks ...)
> > > The only reason I have it is because of lockdep actually.
> > > 
> > > Specifically for those xfstests that mmap a buffer then direct_IO in/out
> > > of that buffer from/to another file in the same FS or the same file.
> > > (For lockdep its the same case).
> > 
> > Which can deadlock if the same inode rwsem is taken on both sides of
> > the mmap_sem, as lockdep tells you...
> > 
> > > I would be perfectly happy to recursively _read_lock both from the top
> > > of the page_fault at the DIO path, and under in the page_fault. I'm
> > > _read_locking after all. But lockdep is hard to convince. So I stole the
> > > xfs idea of having an rw_MMAPLOCK. And grab yet another _write_lock at
> > > truncate/punch/clone time when all mapping traversal needs to stop for
> > > the destructive change to take place. (Allocations are done another way
> > > and are race safe with traversal)
> > > 
> > > How do you intend to address this problem with range-locks? ie recursively
> > > taking the same "lock"? because if not for the recursive-ity and lockdep
> > > I would not need the extra lock-object per inode.
> > 
> > As long as the IO ranges to the same file *don't overlap*, it should
> > be perfectly safe to take separate range locks (in read or write
> > mode) on either side of the mmap_sem as non-overlapping range locks
> > can be nested and will not self-deadlock.
> 
> I'd be really careful with nesting range locks. You can have nasty
> situations like:
> 
> Thread 1  Thread 2
> read_lock(0,1000) 
>   write_lock(500,1500) -> blocks due to read lock
> read_lock(1001,1500) -> blocks due to write lock (or you have to break
>   fairness and then deal with starvation issues).
>
> So once you allow nesting of range locks, you have to very carefully define
> what is and what is not allowed.

Yes. I do understand the problem with rwsem read nesting and how
that can translate to range locks.

That's why my range locks don't even try to block on other pending
waiters. The case where read nesting vs write might occur like above
is something like copy_file_range() vs truncate, but otherwise for
IO locks we simply don't have arbitrarily deep nesting of range
locks.

i.e. for your example my range lock would result in:

Thread 1			Thread 2
read_lock(0,1000)
				write_lock(500,1500)
				<registers pending write lock>
				<blocks until readers release,
				 but does not block new readers>
<...>
read_lock_nested(1001,1500)
<granted: nesting ignores the
 pending write waiter>
<...>
read_unlock(1001,1500)
<...>
read_unlock(0,1000)
				<write lock granted>
				<...>
				write_unlock(500,1500)

IOWs, I'm not trying to complicate the range lock implementation
with complex stuff like waiter fairness or anti-starvation semantics
at this point in time. Waiters simply don't impact whether a new lock
can be gained or not, and hence the limitations of rwsem semantics
don't apply.

If such functionality is necessary (and I suspect it will be to
prevent AIO from delaying truncate and fallocate-like operations
indefinitely) then I'll add a "barrier" lock type (e.g.
range_lock_write_barrier()) that will block new range lock attempts
across it's span.

However, because this can cause deadlocks like the above, a barrier
lock will not block new *_lock_nested() or *_trylock() calls, hence
allowing runs of nested range locking to complete and drain without
deadlocking on a conflicting barrier range. And that still can't be
modelled by existing rwsem semantics
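In API terms, the intent is something like this (hypothetical
function names throughout - none of this is merged code):

	/* truncate/fallocate side: fence off the range */
	range_lock_write_barrier(&tree, off, LLONG_MAX);

	/* new plain lock attempts across the barrier's span block... */
	range_lock_read(&tree, off, end);	/* waits on barrier */

	/*
	 * ...but nested and trylock attempts are not blocked by the
	 * barrier, so in-flight nested sequences (e.g. DIO plus a gup
	 * page fault) can complete and drain without deadlocking.
	 */
	range_lock_read_nested(&tree, off2, end2);	/* granted */
	range_trylock_read(&tree, off3, end3);		/* granted */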

> That's why in my range lock implementation
> ages back I've decided to treat range lock as a rwsem for deadlock
> verification purposes.

IMO, treating a range lock a

Re: pagecache locking

2019-07-07 Thread Dave Chinner
On Sun, Jul 07, 2019 at 06:05:16PM +0300, Boaz Harrosh wrote:
> On 06/07/2019 02:31, Dave Chinner wrote:
> 
> > 
> > As long as the IO ranges to the same file *don't overlap*, it should
> > be perfectly safe to take separate range locks (in read or write
> > mode) on either side of the mmap_sem as non-overlapping range locks
> > can be nested and will not self-deadlock.
> > 
> > The "recursive lock problem" still arises with DIO and page faults
> > inside gup, but it only occurs when the user buffer range overlaps
> > the DIO range to the same file. IOWs, the application is trying to
> > do something that has an undefined result and is likely to result in
> > data corruption. So, in that case I plan to have the gup page faults
> > fail and the DIO return -EDEADLOCK to userspace
> > 
> 
> This sounds very cool. I now understand. I hope you put all the tools
> for this in generic places so it will be easier to salvage.

That's the plan, though I'm not really caring about anything outside
XFS for the moment.

> One thing I will be very curious to see is how you teach lockdep
> about the "range locks can be nested" thing. I know its possible,
> other places do it, but its something I never understood.

The issue with lockdep is not nested locks, it's that there is no
concept of ranges. e.g.  This is fine:

P0				P1
read_lock(A, 0, 1000)
				read_lock(B, 0, 1000)
write_lock(B, 1001, 2000)
				write_lock(A, 1001, 2000)

Because the read/write lock ranges on file A don't overlap and so
can be held concurrently, similarly the ranges on file B. i.e. This
lock pattern does not result in deadlock.

However, this very similar lock pattern is not fine:

P0				P1
read_lock(A, 0, 1000)
				read_lock(B, 0, 1000)
write_lock(B, 500, 1500)
				write_lock(A, 900, 1900)

i.e. it's an ABBA deadlock because the lock ranges partially
overlap.

IOWs, the problem with lockdep is not nesting read lock or nesting
write locks (because that's valid, too), the problem is that it
needs to be taught about ranges. Once it knows about ranges, nested
read/write locking contexts don't require any special support...

As it is, tracking overlapping lock ranges in lockdep will be
interesting, given that I've been taking several thousand
non-overlapping range locks concurrently on a single file in my
testing. Tracking this sort of usage without completely killing the
machine looking for conflicts and order violations likely makes
lockdep validation of range locks a non-starter

> [ Ha one more question if you have time:
> 
>   In one of the mails, and you also mentioned it before, you said about
>   the rw_read_lock not being able to scale well on mammoth machines
>   over 10s of cores (maybe you said over 20).
>   I wonder why that happens. Is it because of the atomic operations,
>   or something in the lock algorithm. In my theoretical understanding,
>   as long as there are no write-lock-grabbers, why would the readers
>   interfere with each other?

Concurrent shared read lock/unlock are still atomic counting
operations.  Hence they bounce exclusive cachelines from CPU to
CPU...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: pagecache locking

2019-07-05 Thread Dave Chinner
On Wed, Jul 03, 2019 at 03:04:45AM +0300, Boaz Harrosh wrote:
> On 20/06/2019 01:37, Dave Chinner wrote:
> <>
> > 
> > I'd prefer it doesn't get lifted to the VFS because I'm planning on
> > getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is
> > likely to go away in the near term because a range lock can be
> > taken on either side of the mmap_sem in the page fault path.
> > 
> <>
> Sir Dave
> 
> Sorry if this was answered before. I am please very curious. In the zufs
> project I have an equivalent rw_MMAPLOCK that I _read_lock on page_faults.
> (Read & writes all take read-locks ...)
> The only reason I have it is because of lockdep actually.
> 
> Specifically for those xfstests that mmap a buffer then direct_IO in/out
> of that buffer from/to another file in the same FS or the same file.
> (For lockdep its the same case).

Which can deadlock if the same inode rwsem is taken on both sides of
the mmap_sem, as lockdep tells you...

> I would be perfectly happy to recursively _read_lock both from the top
> of the page_fault at the DIO path, and under in the page_fault. I'm
> _read_locking after all. But lockdep is hard to convince. So I stole the
> xfs idea of having an rw_MMAPLOCK. And grab yet another _write_lock at
> truncate/punch/clone time when all mapping traversal needs to stop for
> the destructive change to take place. (Allocations are done another way
> and are race safe with traversal)
> 
> How do you intend to address this problem with range-locks? ie recursively
> taking the same "lock"? because if not for the recursive-ity and lockdep
> I would not need the extra lock-object per inode.

As long as the IO ranges to the same file *don't overlap*, it should
be perfectly safe to take separate range locks (in read or write
mode) on either side of the mmap_sem as non-overlapping range locks
can be nested and will not self-deadlock.

The "recursive lock problem" still arises with DIO and page faults
inside gup, but it only occurs when the user buffer range overlaps
the DIO range to the same file. IOWs, the application is trying to
do something that has an undefined result and is likely to result in
data corruption. So, in that case I plan to have the gup page faults
fail and the DIO return -EDEADLOCK to userspace

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 11/12] iomap: move the xfs writeback code to iomap.c

2019-07-01 Thread Dave Chinner
On Mon, Jul 01, 2019 at 08:43:33AM +0200, Christoph Hellwig wrote:
> On Mon, Jul 01, 2019 at 10:08:59AM +1000, Dave Chinner wrote:
> > > Why do you assume you have to test it?  Back when we shared
> > > generic_file_read with everyone you also didn't test odd change to
> > > it with every possible fs.
> > 
> > I'm not sure what function you are referring to here. Can you
> > clarify?
> 
> Right now it is generic_file_read_iter(), but before iter it was
> generic_file_readv, generic_file_read, etc.

This generic code never came from XFS, so I'm still not sure what
you are refering to here? Some pointers to commits would help me
remember. :/

> > > If you change iomap.c, you'll test it
> > > with XFS, and Cc other maintainers so that they get a chance to
> > > also test it and comment on it, just like we do with other shared
> > > code in the kernel.
> > 
> > Which is why we've had problems with the generic code paths in the
> > past and other filesystems just copy and paste them before making
> > significant modifications. e.g. both ext4 and btrfs re-implement
> > write_cache_pages() rather than use the generic writeback code
> > because they have slightly different requirements and those
> > developers don't want to have to worry about other filesystems every
> > time there is an internal filesystem change that affects their
> > writeback constraints...
> > 
> > That's kinda what I'm getting at here: writeback isn't being shared
> > by any of the major filesystems for good reasons...
> 
> I very fundamentally disagree.  It is not shared for a bad reasons,
> and that is people not understanding the mess that the buffer head
> based code is, and not wanting to understand it. 

The problem with heavily shared code is that it requires far more
expertise, knowledge, capability and time to modify it. The code
essentially ossifies, because changing something fundamental risks
breaking other stuff that nobody actually understands anymore and is
unwilling to risk changing.

That's not a problem with bufferheads - that's a problem of widely
shared code that has been slowly hacked to pieces to "fix' random
problems that show up from different users of the shared code.

When the shared code ossifies like this, the only way to make
progress is to either copy it and do whatever you need privately,
or re-implement it completely. ext4 and btrfs have taken the route
of "copy and modify privately", whereas XFS has taken the
"re-implement it completely" path.

We're now starting down the "share the XFS re-implementation" path
we're slowly adding more complexity to the iomap code to handle the
different things each filesystem that is converted needs. With each
new fs adding their own little quirks, it gets harder to make
significant modifications without unknowingly breaking something in
some other filesystem.

It takes highly capable developers to make serious modifications
across highly shared code and the reality is that there are very few
of them around. Most developers simply aren't capable of taking on
such a task, especially given that they see capable, experienced
developers who won't even try because of past experiences akin to
a game of Running Man(*)

Shared code is good, up to the point where the sharing gets so
complex that even people with the capability are not willing to
touch/fix the code. That's what happened to bufferheads and it's a
pattern repeated across lots of kernel infrastructure code. Just
because you can handle these modifications doesn't mean everyone
else can or even wants to.

> And I'd much rather fix this than going down the copy an paste and
> slightly tweak it while fucking up something else route.

The copy-n-paste is a result of developers who have little knowledge
of things outside their domain of interest/expertise making the sane
decision to minimise risk of breaking something they know nothing
about. From an individual subsystem perspective, that's a -good
decision- to make, and that's the point I was trying to make.

You see that as a bad decision, because you are equating "shared code"
with "high quality" code. The reality is that shared code is often
poor quality because people get too scared to touch it. That's
exactly the situation I don't want us to get stuck with, and why I
want to see how multiple implementations of this abstracted writeback
path change what we have now before we start moving code about...

i.e. I'm not saying "we shouldn't do this", I'm just saying that "we
should do this because shared code is good" fundamentally conflicts
with the fact we've just re-implemented a bunch of stuff because
the *shared code was really bad*. And taking the same path that lead
to really bad shared code (i.e. organic growth without planning or
design) is likely to end up in the same place

Cheers,

Dave.

(*) https://www.imdb.com/title/tt0093894/

"A wrongly convicted man must try to survive a public execution
gauntlet staged as a game show."

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 11/12] iomap: move the xfs writeback code to iomap.c

2019-06-30 Thread Dave Chinner
On Fri, Jun 28, 2019 at 07:33:20AM +0200, Christoph Hellwig wrote:
> On Fri, Jun 28, 2019 at 10:45:42AM +1000, Dave Chinner wrote:
> > You've already mentioned two new users you want to add. I don't even
> > have zone capable hardware here to test one of the users you are
> > indicating will use this code, and I suspect that very few people
> > do.  That's a non-trivial increase in testing requirements for
> > filesystem developers and distro QA departments who will want to
> > change and/or validate this code path.
> 
> Why do you assume you have to test it?  Back when we shared
> generic_file_read with everyone you also didn't test odd change to
> it with every possible fs.

I'm not sure what function you are referring to here. Can you
clarify?

> If you change iomap.c, you'll test it
> with XFS, and Cc other maintainers so that they get a chance to
> also test it and comment on it, just like we do with other shared
> code in the kernel.

Which is why we've had problems with the generic code paths in the
past and other filesystems just copy and paste them before making
significant modifications. e.g. both ext4 and btrfs re-implement
write_cache_pages() rather than use the generic writeback code
because they have slightly different requirements and those
developers don't want to have to worry about other filesystems every
time there is an internal filesystem change that affects their
writeback constraints...

That's kinda what I'm getting at here: writeback isn't being shared
by any of the major filesystems for good reasons...

> > Indeed, integrating gfs2 into the existing generic iomap code has
> > required quite a bit of munging and adding new code paths and so on.
> > That's mostly been straight forward because it's just been adding
> > flags and conditional code to the existing paths. The way we
> > regularly rewrite sections of the XFS writeback code is a very
> > different sort of modification, and one that will be much harder to
> > do if we have to make those changes to generic code.
> 
> As the person who has done a lot of the recent rewriting of the
> writeback code I disagree.  Most of it has been to divorce it from
> leftovers of the buffer_head based single page at a time design from
> the stone age.  Very little is about XFS itself, most of it has been
> about not being stupid in a fairly generic way.  And every since
> I got rid of buffer heads xfs_aops.c has been intimately tied
  ^

*cough*

Getting rid of bufferheads in writeback was largely a result of work
I did over a period of several years, thank you very much. Yes, work
you did over the same time period also got us there, but it's not
all your work.

> into the iomap infrastructure, and I'd rather keep those details in
> one place.  I.e. with this series now XFS doesn't even need to know
> about the details of the iomap_page structure and the uptodate
> bits.  If for example I'd want to add sub-page dirty bits (which I
> don't if I can avoid it) I can handle this entirely in iomap now
> instead of spreading around iomap, xfs and duplicating the thing
> in every copy of the XFS code that would otherwise show up.

Yes, I understand your motivations, I'm just not convinced that it
is the right thing to do given the history of this code and the
history of filesystem writeback code in general

> > i.e. shared code is good if it's simple and doesn't have a lot of
> > external dependencies that restrict the type and scope of
> > modifications that can be made easily. Shared code that is complex
> > and comes from code that was tightly integrated with a specific
> > subsystem architecture is going to carry all those architectural
> > foibles into the new "generic" code. Once it gets sufficient
> > users it's going to end up with the same "we can't change this code"
> > problems that we had with the existing IO path, and we'll go back to
> > implementing our own writeback path
> 
> From the high level POV I agree with your stance.  But the point is
> that the writeback code is not tightly integrated with xfs, and that

Except it is

> is why I don't want it in XFS.  It is on other other hand very
> tightly integrated with the iomap buffer read and write into pagecache
> code, which is why I want to keep it together with that.

It's not tightly integrated into the iomap read side or page cache
implementations.  Writeback currently gets a dirty page, we look up
a block map, we add it to/create a cached ioend/bio pair.  There are
four lines of code in the entire XFS writeback code path that
interact with iomap specific state, and that's the 

Re: [PATCH 11/12] iomap: move the xfs writeback code to iomap.c

2019-06-27 Thread Dave Chinner
On Tue, Jun 25, 2019 at 12:10:20PM +0200, Christoph Hellwig wrote:
> On Tue, Jun 25, 2019 at 09:43:04AM +1000, Dave Chinner wrote:
> > I'm a little concerned this is going to limit what we can do
> > with the XFS IO path because now we can't change this code without
> > considering the direct impact on other filesystems. The QA burden of
> > changing the XFS writeback code goes through the roof with this
> > change (i.e. we can break multiple filesystems, not just XFS).
> 
> Going through the roof is a little exaggerated.

You've already mentioned two new users you want to add. I don't even
have zone capable hardware here to test one of the users you are
indicating will use this code, and I suspect that very few people
do.  That's a non-trivial increase in testing requirements for
filesystem developers and distro QA departments who will want to
change and/or validate this code path.

> Yes, it will be more
> testing overhead, but that is life in a world where we try to share
> code rather than duplicating it, which is pretty much a general
> kernel policy that has served us well.

Yes, but we also need to acknowledge why we have re-implemented
everything in fs/iomap.c - we haven't lifted code from XFS to do
that - we've replaced existing generic code that didn't do what we
needed and couldn't easily be modified to do what we needed because
of all its external dependencies.

Indeed, integrating gfs2 into the existing generic iomap code has
required quite a bit of munging and adding new code paths and so on.
That's mostly been straight forward because it's just been adding
flags and conditional code to the existing paths. The way we
regularly rewrite sections of the XFS writeback code is a very
different sort of modification, and one that will be much harder to
do if we have to make those changes to generic code.

i.e. shared code is good if it's simple and doesn't have a lot of
external dependencies that restrict the type and scope of
modifications that can be made easily. Shared code that is complex
and comes from code that was tightly integrated with a specific
subsystem architecture is going to carry all those architectural
foibles into the new "generic" code. Once it gets sufficient
users it's going to end up with the same "we can't change this code"
problems that we had with the existing IO path, and we'll go back to
implementing our own writeback path

> > The writepage code is one of the areas that, historically speaking,
> > has one of the highest rates of modification in XFS - we've
> > substantially reworked this code from top to bottom 4 or 5 times in
> > a bit over ten years, and each time it's been removing abstraction
> > layers and getting the writeback code closer to the internal XFS
> > extent mapping infrastructure.
> 
> I don't think we had all that much churn.

We've had more churn over time to the writeback code than just about
any other subsystem in XFS.

It also represents a complete 180-degree flip on how we've been
streamlining the writeback path in XFS over the past few years.
We've been moving closer and closer to the generic writeback
infrastructure as well as closer to the XFS inode extent tree.

I've been planning on taking it even closer to the extent tree to
give us lockless, modification range coherent extent map caching in
this path (e.g. write() can add new delalloc extents without
invalidating cached writeback maps).  This patchset re-introduces
the iomap abstraction over the bmbt - an abstraction we removed some
time ago - and that makes these sorts of improvements much harder
and more complex to implement

IOWs, I'm not convinced that lifting this code out of XFS is the
best long term plan for XFS or the iomap code here

> Yes, we've improved it a
> lot, but much of that was in response to core changes, and pretty much
> all of it benefits other users as well.  And the more users we have
> for this infrastructure that more clout it has with core VM folks
> when we have to push back odd design decisions.

If we have to make stuff "generic" to be able to influence how other
subsystems go about providing infrastructure to filesytsems, then
our development community and processes are even more broken than I
think they are. Developer communication and design influence are not
problems we should be trying to fix with code.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 12/12] iomap: add tracing for the address space operations

2019-06-27 Thread Dave Chinner
On Tue, Jun 25, 2019 at 12:15:15PM +0200, Christoph Hellwig wrote:
> On Tue, Jun 25, 2019 at 09:49:21AM +1000, Dave Chinner wrote:
> > > +#undef TRACE_SYSTEM
> > > +#define TRACE_SYSTEM iomap
> > 
> > Can you add a comment somewhere here that says these tracepoints are
> > volatile and we reserve the right to change them at any time so they
> > don't form any sort of persistent UAPI that we have to maintain?
> 
> Sure.  Note that we don't have any such comment in xfs either..

Yes, but that is buried inside the xfs code where we largely set our
own rules. This, however, is generic code where people have a habit
of arguing that tracepoints are stable API and they can never be
changed because some random userspace application may have hard
coded a dependency on it...

Hence we need to be explicit here that this is diagnostic/debug code
and anyone who tries to rely on it as a stable API gets to keep all
the broken bits to themselves.
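Something along these lines would do (a sketch of the requested
comment, not merged text):

/*
 * NOTE: these tracepoints are diagnostic and debugging aids only.
 * They are not a stable ABI: names, arguments and output formats
 * may change or disappear at any time. Do not build userspace
 * tooling that depends on them.
 */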

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 07/12] xfs: don't preallocate a transaction for file size updates

2019-06-27 Thread Dave Chinner
On Tue, Jun 25, 2019 at 12:25:07PM +0200, Christoph Hellwig wrote:
> On Tue, Jun 25, 2019 at 09:15:23AM +1000, Dave Chinner wrote:
> > > So, uh, how much of a hit do we take for having to allocate a
> > > transaction for a file size extension?  Particularly since we can
> > > combine those things now?
> > 
> > Unless we are out of log space, the transaction allocation and free
> > should be largely uncontended and so it's just a small amount of CPU
> > usage. i.e it's a slab allocation/free and then lockless space
> > reservation/free. If we are out of log space, then we sleep waiting
> > for space - the issue really comes down to where it is better to
> > sleep in that case
> 
> I see the general point, but we'll still have the same issue with
> unwritten extent conversion and cow completions, and I don't remember
> seeing any issue in that regard.

These are relatively rare for small file workloads - I'm really
talking about the effect of delalloc and how we've optimised
allocation during writeback to merge small, cross-file writeback
into much larger physical IOs. Neither unwritten extents nor COW are
used in these (common) cases, and if they are then the allocation
patterns prevent the cross-file IO merging in the block layer and so
we don't get the "hundred ioends for a hundred inodes from a single
physical IO completion" thundering herd problem

> And we'd hit exactly that case
> with random writes to preallocated or COW files, i.e. the typical image
> file workload.

I do see a noticeable amount of IO completion overhead in the host
when hitting unwritten extents in VM image workloads. I'll see if I
can track the number of kworkers we're stalling in under some of
these workloads, but I think it's still largely bound by the request
queue depth of the IO stack inside the VM because there is no IO
merging in these cases.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 12/12] iomap: add tracing for the address space operations

2019-06-24 Thread Dave Chinner
On Mon, Jun 24, 2019 at 07:52:53AM +0200, Christoph Hellwig wrote:
> Lift the xfs code for tracing address space operations to the iomap
> layer.
> 
> Signed-off-by: Christoph Hellwig 



> diff --git a/include/trace/events/iomap.h b/include/trace/events/iomap.h
> new file mode 100644
> index ..da50ece663f8
> --- /dev/null
> +++ b/include/trace/events/iomap.h
> @@ -0,0 +1,82 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2009-2019, Christoph Hellwig
> + * All Rights Reserved.
> + */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM iomap

Can you add a comment somewhere here that says these tracepoints are
volatile and we reserve the right to change them at any time so they
don't form any sort of persistent UAPI that we have to maintain?

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 11/12] iomap: move the xfs writeback code to iomap.c

2019-06-24 Thread Dave Chinner
On Mon, Jun 24, 2019 at 07:52:52AM +0200, Christoph Hellwig wrote:
> Takes the xfs writeback code and move it to iomap.c.  A new structure
> with three methods is added as the abstraction from the generic
> writeback code to the file system.  These methods are used to map
> blocks, submit an ioend, and cancel a page that encountered an error
> before it was added to an ioend.
> 
> Note that we temporarily lose the writepage tracing, but that will
> be added back soon.

I'm a little concerned this is going to limit what we can do
with the XFS IO path because now we can't change this code without
considering the direct impact on other filesystems. The QA burden of
changing the XFS writeback code goes through the roof with this
change (i.e. we can break multiple filesystems, not just XFS).

The writepage code is one of the areas that, historically speaking,
has one of the highest rates of modification in XFS - we've
substantially reworked this code from top to bottom 4 or 5 times in
a bit over ten years, and each time it's been removing abstraction
layers and getting the writeback code closer to the internal XFS
extent mapping infrastructure.

This steps the other way - it adds abstraction to move the XFS code
to be generic, and now we have to be concerned about how changes to
the XFS IO path affects other filesystems. While I can see the
desire to use this code in other filesystems, no other filesystem
does COW or delayed allocation like XFS and this functionality is
tightly tied into the iomap page writeback architecture.

As such, I'm not convinced that a wholesale lifting of this code
into the generic iomap code is going to make our life easier or
better. The stuff we've already got in fs/iomap.c is largely
uncontroversial and straight forward, but this writeback code is
anything but straight forward.

Another issue this raises is that fs/iomap.c is already a huge chunk
of code with lots of different functionality in it. Adding another
500+ lines of new functionality to it doesn't make it any easier to
navigate or find things.

If we are going to move this writeback code to the generic iomap
infrastructure, can we please split the iomap code up in to smaller
files first?  e.g. fs/iomap-dio.c for all the direct IO code,
fs/iomap-pageio.c for all the page-based IO, fs/iomap.c for all the
core functionality (like iomap_apply()) and fs/iomap-util.c for all
the miscellaneous one-off functions like fiemap, etc?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 07/12] xfs: don't preallocate a transaction for file size updates

2019-06-24 Thread Dave Chinner
On Mon, Jun 24, 2019 at 09:17:20AM -0700, Darrick J. Wong wrote:
> On Mon, Jun 24, 2019 at 07:52:48AM +0200, Christoph Hellwig wrote:
> > We have historically decided that we want to preallocate the xfs_trans
> > structure at writeback time so that we don't have to allocate one in
> > the I/O completion handler.  But we treat unwritten extent and COW
> > fork conversions different already, which proves that the transaction
> > allocations in the end I/O handler are not a problem.  Removing the
> > preallocation gets rid of a lot of corner case code, and also ensures
> > we only allocate one and log a transaction when actually required,
> > as the ioend merging can reduce the number of actual i_size updates
> > significantly.
> 
> That's what I thought when I wrote the ioend merging patches, but IIRC
> Dave objected on the grounds that most file writes are trivial file
> extending writes and therefore we should leave this alone to avoid
> slowing down the ioend path even if it came at a cost of cancelling a
> lot of empty transactions.

The issue is stuff like extracting a tarball, where we might write a
hundred thousand files and they are all written in a single IO. i.e.
there is no IO completion merging at all.

> I wasn't 100% convinced it mattered but ran out of time in the
> development window and never got around to researching if it made any
> difference.

Yeah, it's not all that simple :/

In these cases, we always have to allocate a transaction for every
file being written. If we do it before we submit the IO, then all
the transactions are allocated from the single writeback context. If
we don't have log space, data writeback pauses while the tail of the
AIL is pushed, metadata writeback occurs, and then the transaction
allocation for data writeback is woken, and data writeback
submission continues. It's fairly orderly, and we don't end up
trying to write back data while we are doing bulk metadata flushing
from the AIL.

If we delay the transaction allocation to the ioend context and we
are low on log space, we end up blocking a kworker on a transaction
allocation which then has to wait for metadata writeback. The
kworker infrastructure will then issue the next ioend work, which
then blocks on transaction allocation. Delayed allocation can cause
thousands of small file IOs to be inflight at the same time due to
intra-file contiguous allocation and IO merging in the block layer,
hence we can have thousands of individual IO completions that
require transaction allocation to be completed.

> So, uh, how much of a hit do we take for having to allocate a
> transaction for a file size extension?  Particularly since we can
> combine those things now?

Unless we are out of log space, the transaction allocation and free
should be largely uncontended and so it's just a small amount of CPU
usage. i.e. it's a slab allocation/free and then lockless space
reservation/free. If we are out of log space, then we sleep waiting
for space - the issue really comes down to where it is better to
sleep in that case
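
To make the ordering difference concrete, the two models look roughly
like this (a simplified sketch only - the xfs_trans_alloc() arguments
are elided, and all error handling is omitted):

	/* Model A: reserve log space before IO submission */
	xfs_trans_alloc(mp, ..., &ioend->io_append_trans); /* may sleep */
	submit_bio(ioend->io_bio);
	/* ... completion commits the pre-reserved transaction, no sleep */

	/* Model B: reserve log space at IO completion */
	submit_bio(ioend->io_bio);
	/* ... later, in the ioend completion kworker: */
	xfs_trans_alloc(mp, ..., &tp);	/* may sleep, blocking the kworker */

In model A the potential sleep happens in the single, orderly
writeback context; in model B it can happen in every one of the
thousands of concurrent IO completion works described above.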

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 06/12] xfs: remove XFS_TRANS_NOFS

2019-06-24 Thread Dave Chinner
On Mon, Jun 24, 2019 at 07:52:47AM +0200, Christoph Hellwig wrote:
> Instead of a magic flag for xfs_trans_alloc, just ensure all callers
> that can't reclaim through the file system use memalloc_nofs_save to
> set the per-task nofs flag.

I'm thinking that it would be a good idea to add comments to explain
exactly what the memalloc_nofs_save/restore() are protecting where
they are used. Right now the XFS_TRANS_NOFS flag is largely
undocumented, so a reader is left guessing as to why the flag is
necessary and what contexts it may apply to. Hence I think we should
fix that while we are changing over to a different GFP_NOFS
allocation context mechanism
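
e.g. something like this at a transaction allocation site (an
untested sketch - the justification comment is the important part):

	/*
	 * We are called holding locks that memory reclaim can take
	 * through the filesystem, so allocations here must not
	 * recurse into the filesystem. Set the task-wide NOFS
	 * context so everything allocated under this transaction
	 * is implicitly GFP_NOFS.
	 */
	unsigned int nofs_flags = memalloc_nofs_save();

	error = xfs_trans_alloc(mp, resp, blocks, rtextents, 0, &tp);
	...
	memalloc_nofs_restore(nofs_flags);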

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: pagecache locking (was: bcachefs status update) merged)

2019-06-19 Thread Dave Chinner
On Wed, Jun 19, 2019 at 12:38:38PM +0200, Jan Kara wrote:
> On Tue 18-06-19 07:21:56, Amir Goldstein wrote:
> > > > Right, but regardless of the spec we have to consider that the
> > > > behaviour of XFS comes from its Irix heritage (actually from EFS,
> > > > the predecessor of XFS from the late 1980s)
> > >
> > > Sure. And as I mentioned, I think it's technically the nicer guarantee.
> > >
> > > That said, it's a pretty *expensive* guarantee. It's one that you
> > > yourself are not willing to give for O_DIRECT IO.
> > >
> > > And it's not a guarantee that Linux has ever had. In fact, it's not
> > > even something I've ever seen anybody ever depend on.
> > >
> > > I agree that it's possible that some app out there might depend on
> > > that kind of guarantee, but I also suspect it's much much more likely
> > > that it's the other way around: XFS is being unnecessarily strict,
> > > because everybody is testing against filesystems that don't actually
> > > give the total atomicity guarantees.
> > >
> > > Nobody develops for other unixes any more (and nobody really ever did
> > > it by reading standards papers - even if they had been very explicit).
> > >
> > > And honestly, the only people who really do threaded accesses to the same 
> > > file
> > >
> > >  (a) don't want that guarantee in the first place
> > >
> > >  (b) are likely to use direct-io that apparently doesn't give that
> > > atomicity guarantee even on xfs
> > >
> > > so I do think it's moot.
> > >
> > > End result: if we had a really cheap range lock, I think it would be a
> > > good idea to use it (for the whole QoI implementation), but for
> > > practical reasons it's likely better to just stick to the current lack
> > > of serialization because it performs better and nobody really seems to
> > > want anything else anyway.
> > >
> > 
> > This is the point in the conversation where somebody usually steps in
> > and says "let the user/distro decide". Distro maintainers are in a much
> > better position to take the risk of breaking hypothetical applications.
> > 
> > I should point out that even if "strict atomic rw" behavior is desired, then
> > page cache warmup [1] significantly improves performance.
> > Having mentioned that, the discussion can now return to what is the
> > preferred way to solve the punch hole vs. page cache add race.
> > 
> > XFS may end up with special tailored range locks, which beings some
> > other benefits to XFS, but all filesystems need the solution for the punch
> > hole vs. page cache add race.
> > Jan recently took a stab at it for ext4 [2], but that didn't work out.
> 
> Yes, but I have an idea how to fix it. I just need to push acquiring ext4's
> i_mmap_sem down a bit further so that only page cache filling is protected
> by it but not copying of data out to userspace. But I didn't get to coding
> it last week due to other stuff.
> 
> > So I wonder what everyone thinks about Kent's page add lock as the
> > solution to the problem.
> > Allegedly, all filesystems (XFS included) are potentially exposed to
> > stale data exposure/data corruption.
> 
> When we first realized that hole-punching vs page faults races are an issue I
> was looking into a generic solution but at that time there was a sentiment
> against adding another rwsem to struct address_space (or inode) so we ended
> up with a private lock in ext4 (i_mmap_rwsem), XFS (I_MMAPLOCK), and other
> filesystems these days. If people think that growing struct inode for
> everybody is OK, we can think about lifting private filesystem solutions
> into a generic one. I'm fine with that.

I'd prefer it doesn't get lifted to the VFS because I'm planning on
getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is
likely to go away in the near term because a range lock can be
taken on either side of the mmap_sem in the page fault path.

> That being said as Dave said we use those fs-private locks also for
> serializing against equivalent issues arising for DAX. So the problem is
> not only about page cache but generally about doing IO and caching
> block mapping information for a file range. So the solution should not be
> too tied to page cache.

Yup, that was the point I was trying to make when Linus started
shouting at me about how caches work and how essential they are.  I
guess the fact that DAX doesn't use the page cache isn't as widely
known as I assumed it was...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: pagecache locking (was: bcachefs status update) merged)

2019-06-17 Thread Dave Chinner
On Fri, Jun 14, 2019 at 06:01:07PM -1000, Linus Torvalds wrote:
> On Thu, Jun 13, 2019 at 5:08 PM Linus Torvalds
>  wrote:
> >
> > I do not believe that posix itself actually requires that at all,
> > although extended standards may.
> 
> So I tried to see if I could find what this perhaps alludes to.
> 
> And I suspect it's not in the read/write thing, but the pthreads side
> talks about atomicity.
>
> Interesting, but I doubt if that's actually really intentional, since
> the non-thread read/write behavior specifically seems to avoid the
> whole concurrency issue.

The wording of posix changes every time they release a new version
of the standard, and it's _never_ obvious what behaviour the
standard is actually meant to define. They are always written with
sufficient ambiguity and wiggle room that they could mean
_anything_. The POSIX.1-2017 standard you quoted is quite different
to older versions, but it's no less ambiguous...

> The pthreads atomicity thing seems to be about not splitting up IO and
> doing it in chunks when you have m:n threading models, but can be
> (mis-)construed to have threads given higher atomicity guarantees than
> processes.

Right, but regardless of the spec we have to consider that the
behaviour of XFS comes from its Irix heritage (actually from EFS,
the predecessor of XFS from the late 1980s). i.e. the IO exclusion
model dates to long before POSIX had anything to say about pthreads,
and its wording about atomicity could only refer to
multi-process interactions.

These days, however, there is the unfortunate reality of a long tail
applications developed on other Unix systems under older POSIX
specifications that are still being ported to and deployed on Linux.
Hence the completely ambiguous behaviours defined in the older specs
are still just as important these days as the completely ambiguous
behaviours defined in the new specifications. :/

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: pagecache locking (was: bcachefs status update) merged)

2019-06-14 Thread Dave Chinner
On Thu, Jun 13, 2019 at 04:30:36PM -1000, Linus Torvalds wrote:
> On Thu, Jun 13, 2019 at 1:56 PM Dave Chinner  wrote:
> >
> > That said, the page cache is still far, far slower than direct IO,
> 
> Bullshit, Dave.
> 
> You've made that claim before, and it's been complete bullshit before
> too, and I've called you out on it then too.

Yes, your last run of insulting rants on this topic resulted in me
pointing out your CoC violations because you were unable to listen
or discuss the subject matter in a civil manner. And you've started
right where you left off last time

> Why do you continue to make this obviously garbage argument?
> 
> The key word in the "page cache" name is "cache".
> 
> Caches work, Dave.

Yes, they do, I see plenty of cases where the page cache works just
fine because it is still faster than most storage. But that's _not
what I said_.

Indeed, you haven't even bothered to ask me to clarify what I was
referring to in the statement you quoted. IOWs, you've taken _one
single statement_ I made from a huge email about complexities in
dealing with IO concurrency, the page cache and architectural flaws in
the existing code, quoted it out of context, fabricated a completely
new context and started ranting about how I know nothing about how
caches or the page cache work.

Not very professional but, unfortunately, an entirely predictable
and _expected_ response.

Linus, nobody can talk about direct IO without you screaming and
tossing all your toys out of the crib. If you can't be civil or you
find yourself writing some condescending "caching 101" explanation
to someone who has spent the last 15+ years working with filesystems
and caches, then you're far better off not saying anything.

---

So, in the interests of further _civil_ discussion, let me clarify
my statement for you: for a highly concurrent application that is
crunching through bulk data on large files on high throughput
storage, the page cache is still far, far slower than direct IO.

Which comes back to this statement you made:

> Is direct IO faster when you *know* it's not cached, and shouldn't
> be cached? Sure. But that's actually quite rare.

This is where I think you get the wrong end of the stick, Linus.

The world I work in has a significant proportion of applications
where the data set is too large to be cached effectively or is
better cached by the application than the kernel. IOWs, data being
cached efficiently by the page cache is the exception rather than
the rule. Hence, they use direct IO because it is faster than the
page cache. This is common in applications like major enterprise
databases, HPC apps, data mining/analysis applications, etc. and
there's an awful lot of the world that runs on these apps
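
i.e. the IO path for these applications is as simple as this
(minimal userspace sketch, error handling elided):

	#define _GNU_SOURCE	/* for O_DIRECT */
	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define BUF_SIZE	(1 << 20)

	int fd = open("datafile", O_RDONLY | O_DIRECT);
	void *buf;

	/* O_DIRECT requires sector aligned memory and file offsets */
	posix_memalign(&buf, 4096, BUF_SIZE);

	/* DMAs straight into buf - no page cache copy, no cache pollution */
	pread(fd, buf, BUF_SIZE, 0);

and the application then does its own caching of what it reads.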

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Dave Chinner
On Thu, Jun 13, 2019 at 01:34:06PM -0700, Ira Weiny wrote:
> On Thu, Jun 13, 2019 at 10:55:52AM +1000, Dave Chinner wrote:
> > On Wed, Jun 12, 2019 at 04:30:24PM -0700, Ira Weiny wrote:
> > > On Wed, Jun 12, 2019 at 05:37:53AM -0700, Matthew Wilcox wrote:
> > > > On Sat, Jun 08, 2019 at 10:10:36AM +1000, Dave Chinner wrote:
> > > > > On Fri, Jun 07, 2019 at 11:25:35AM -0700, Ira Weiny wrote:
> > > > > > Are you suggesting that we have something like this from user space?
> > > > > > 
> > > > > > fcntl(fd, F_SETLEASE, F_LAYOUT | F_UNBREAKABLE);
> > > > > 
> > > > > Rather than "unbreakable", perhaps a clearer description of the
> > > > > policy it entails is "exclusive"?
> > > > > 
> > > > > i.e. what we are talking about here is an exclusive lease that
> > > > > prevents other processes from changing the layout. i.e. the
> > > > > mechanism used to guarantee a lease is exclusive is that the layout
> > > > > becomes "unbreakable" at the filesystem level, but the policy we are
> > > > > actually presenting to users is "exclusive access"...
> > > > 
> > > > That's rather different from the normal meaning of 'exclusive' in the
> > > > context of locks, which is "only one user can have access to this at
> > > > a time".  As I understand it, this is rather more like a 'shared' or
> > > > 'read' lock.  The filesystem would be the one which wants an exclusive
> > > > lock, so it can modify the mapping of logical to physical blocks.
> > > > 
> > > > The complication being that by default the filesystem has an exclusive
> > > > lock on the mapping, and what we're trying to add is the ability for
> > > > readers to ask the filesystem to give up its exclusive lock.
> > > 
> > > This is an interesting view...
> > > 
> > > And after some more thought, exclusive does not seem like a good name for 
> > > this
> > > because technically F_WRLCK _is_ an exclusive lease...
> > > 
> > > In addition, the user does not need to take the "exclusive" write lease 
> > > to be
> > > notified of (broken by) an unexpected truncate.  A "read" lease is broken 
> > > by
> > > truncate.  (And "write" leases really don't do anything different WRT the
> > > interaction of the FS and the user app.  Write leases control "exclusive"
> > > access between other file descriptors.)
> > 
> > I've been assuming that there is only one type of layout lease -
> > there is no use case I've heard of for read/write layout leases, and
> > like you say there is zero difference in behaviour at the filesystem
> > level - they all have to be broken to allow a non-lease truncate to
> > proceed.
> > 
> > IMO, taking a "read lease" to be able to modify and write to the
> > underlying mapping of a file makes absolutely no sense at all.
> > IOWs, we're talking exactly about a revocable layout lease vs an
> > exclusive layout lease here, and so read/write really doesn't match
> > the policy or semantics we are trying to provide.
> 
> I humbly disagree, at least depending on how you look at it...  :-D
> 
> The patches as they stand expect the user to take a "read" layout lease which
> indicates they are currently using "reading" the layout as is.
> They are not
> changing ("writing" to) the layout.

As I said in a another email in the thread, a layout lease does not
make the layout "read only". It just means the lease owner will be
notified when someone else is about to modify it. The lease owner
can modify the mapping themselves, and they will not get notified
about their own modifications.

> They then pin pages which locks parts of
> the layout and therefore they expect no "writers" to change the layout.

Except they can change the layout themselves. It's perfectly valid
to get a layout lease, write() from offset 0 to EOF and fsync() to
initialise the file and allocate all the space in the file, then
mmap() it and hand it off to RDMA, all while holding the layout
lease.
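
In code, that perfectly valid sequence looks something like this
(using the F_LAYOUT lease proposed earlier in this thread - not an
existing upstream API - with declarations and error handling elided):

	fd = open("datafile", O_RDWR);
	fcntl(fd, F_SETLEASE, F_LAYOUT);	/* take the layout lease */

	/*
	 * Initialising the file modifies the layout, but we hold the
	 * lease, so we don't get notified of our own modifications.
	 */
	for (off = 0; off < len; off += sizeof(zeros))
		pwrite(fd, zeros, sizeof(zeros), off);
	fsync(fd);

	addr = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	/* ... hand addr off to RDMA, still holding the lease ... */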

> The "write" layout lease breaks the "read" layout lease indicating that the
> layout is being written to.

Layout leases do not work this way.

> In fact, this is what NFS does right now.  The lease it puts on the file is of
> "read" type.
> 
> nfs4layouts.c:

Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Dave Chinner
On Thu, Jun 13, 2019 at 07:31:07PM -0700, Matthew Wilcox wrote:
> On Fri, Jun 14, 2019 at 12:09:21PM +1000, Dave Chinner wrote:
> > If the lease holder modifies the mapping in a way that causes its
> > own internal state to screw up, then that's a bug in the lease
> > holder application.
> 
> Sounds like the lease semantics aren't the right ones for the longterm
> GUP users then.  The point of the longterm GUP is so the pages can be
> written to, and if the filesystem is going to move the pages around when
> they're written to, that just won't work.

And now we go full circle back to the constraints we decided on long
ago because we can't rely on demand paging RDMA hardware any time
soon to do everything we need to transparently support long-term GUP
on file-backed mappings. i.e.:

RDMA to file backed mappings must first preallocate and
write zeros to the range of the file they are mapping so
that the filesystem block mapping is complete and static for
the life of the RDMA mapping that will pin it.

IOWs, the layout lease will tell the RDMA application that the
static setup it has already done to work correctly with a file
backed mapping may be about to be broken by a third party.
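
(The pinning step itself is just the normal verbs memory
registration - roughly the following, using the libibverbs API, with
protection domain setup and error handling elided:

	/*
	 * Pins the pages backing addr for the lifetime of the MR -
	 * this is the long-term GUP that the preallocate-and-zero
	 * setup makes safe against layout changes.
	 */
	struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
				       IBV_ACCESS_LOCAL_WRITE |
				       IBV_ACCESS_REMOTE_WRITE);

and it is that pin which the layout lease break notification is
protecting.)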

-Dave.
-- 
Dave Chinner
da...@fromorbit.com

