Re: [Lsf-pc] [LSF/MM TOPIC] Badblocks checking/representation in filesystems

2017-01-18 Thread Darrick J. Wong
On Wed, Jan 18, 2017 at 03:39:17PM -0500, Jeff Moyer wrote:
> Jan Kara  writes:
> 
> > On Tue 17-01-17 15:14:21, Vishal Verma wrote:
> >> Your note on the online repair does raise another tangentially related
> >> topic. Currently, if there are badblocks, writes via the bio submission
> >> path will clear the error (if the hardware is able to remap the bad
> >> locations). However, if the filesystem is mounted with DAX, even
> >> non-mmap operations - read() and write() will go through the dax paths
> >> (dax_do_io()). We haven't found a good/agreeable way to perform
> >> error-clearing in this case. So currently, if a dax mounted filesystem
> >> has badblocks, the only way to clear those badblocks is to mount it
> >> without DAX, and overwrite/zero the bad locations. This is a pretty
> >> terrible user experience, and I'm hoping this can be solved in a better
> >> way.
> >
> > Please remind me, what is the problem with DAX code doing necessary work to
> > clear the error when it gets EIO from memcpy on write?
> 
> You won't get an MCE for a store;  only loads generate them.
> 
> Won't fallocate FL_ZERO_RANGE clear bad blocks when mounted with -o dax?

Not necessarily; XFS usually implements this by punching out the range
and then reallocating it as unwritten blocks.
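
(Roughly, as a sketch of that path rather than the exact call chain:

        /* what ZERO_RANGE amounts to inside XFS, more or less */
        error = xfs_free_file_space(ip, offset, len);   /* punch out the range */
        if (!error)
                error = xfs_alloc_file_space(ip, offset, len,
                                             XFS_BMAPI_PREALLOC); /* unwritten prealloc */

so nothing actually writes zeroes to the media, which is why it won't
necessarily clear a badblock the way a real overwrite would.)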

--D

> 
> Cheers,
> Jeff


Re: [LSF/MM TOPIC] Future direction of DAX

2017-01-16 Thread Darrick J. Wong
On Mon, Jan 16, 2017 at 03:00:41PM -0500, Jeff Moyer wrote:
> "Darrick J. Wong" <darrick.w...@oracle.com> writes:
> 
> >> - Whenever you mount a filesystem with DAX, it spits out a message that 
> >> says
> >>   "DAX enabled. Warning: EXPERIMENTAL, use at your own risk".  What 
> >> criteria
> >>   needs to be met for DAX to no longer be considered experimental?
> >
> > For XFS I'd like to get reflink working with it, for starters.
> 
> What do you mean by this, exactly?  When Dave outlined the requirements
> for PMEM_IMMUTABLE, it was very clear that metadata updates would not be
> possible.  And would you really consider this a barrier to marking dax
> fully supported?  I wouldn't.

For PMEM_IMMUTABLE files, yes, reflink cannot be supported.

I'm talking about supporting reflink for DAX files that are /not/
PMEM_IMMUTABLE, where user programs can mmap pmem directly but write
activity still must use fsync/msync to ensure that everything's on disk.

I wouldn't consider it a barrier in general (since ext4 also prints
EXPERIMENTAL warnings for DAX), merely one for XFS.  I don't even think
it's that big of a hurdle -- afaict XFS ought to be able to achieve this
by modifying iomap_begin to allocate new pmem blocks, memcpy the
contents, and update the memory mappings.  I think.
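
In pseudocode terms it'd be something like this; the helpers here are made
up for illustration, not real functions:

        /* in the DAX write path's iomap_begin: */
        if (IS_DAX(inode) && xfs_is_reflink_inode(ip) && (flags & IOMAP_WRITE)) {
                new = allocate_cow_destination(ip, offset, count);  /* hypothetical */
                copy_pmem_contents(old, new);                       /* hypothetical */
                fill_iomap(iomap, new);  /* fault path then maps the new blocks */
        }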

> > We probably need a bunch more verification work to show that file IO
> > doesn't adopt any bad quirks having turned on the per-inode DAX flag.
> 
> Can you be more specific?  We have ltp and xfstests.  If you have some
> mkfs/mount options that you think should be tested, speak up.  Beyond
> that, if it passes ./check -g auto and ltp, are we good?

That's probably good -- I simply wanted to know if we'd at least gotten
to the point that someone had run both suites with and without DAX and
not seen any major regressions between the two.

--D

> 
> -Jeff


Re: [PATCH 05/12] fs: add a F_IOINFO fcntl

2017-02-28 Thread Darrick J. Wong
On Tue, Feb 28, 2017 at 06:57:30AM -0800, Christoph Hellwig wrote:
> This fcntl can be used to query I/O parameters for the given file
> descriptor.  Initially it is used for the I/O alignment and atomic
> write parameters.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/fcntl.c | 18 ++
>  include/linux/fs.h |  1 +
>  include/uapi/linux/fcntl.h | 16 
>  3 files changed, 35 insertions(+)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index ca5d228be7ea..248fb4cc66a6 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -241,6 +241,21 @@ static int f_getowner_uids(struct file *filp, unsigned 
> long arg)
>  }
>  #endif
>  
> +static int fcntl_ioinfo(struct file *file, void __user *argp)
> +{
> + struct fcntl_ioinfo fio = { 0, };
> +
> + if (file->f_op->ioinfo) {
> + int ret = file->f_op->ioinfo(file, &fio);
> + if (ret)
> + return ret;
> + }
> +
> + if (copy_to_user(argp, &fio, sizeof(fio)))
> + return -EFAULT;
> + return 0;
> +}
> +
>  static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
>   struct file *filp)
>  {
> @@ -335,6 +350,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned 
> long arg,
>   case F_GET_SEALS:
>   err = shmem_fcntl(filp, cmd, arg);
>   break;
> + case F_IOINFO:
> + err = fcntl_ioinfo(filp, (void __user *)arg);
> + break;
>   default:
>   break;
>   }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2ba074328894..33b08a8c2bc3 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1680,6 +1680,7 @@ struct file_operations {
>   u64);
>   ssize_t (*dedupe_file_range)(struct file *, u64, u64, struct file *,
>   u64);
> + int (*ioinfo)(struct file *, struct fcntl_ioinfo *);
>  };
>  
>  struct inode_operations {
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index beed138bd359..6b0aaba7c623 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -42,6 +42,22 @@
>  #define F_SEAL_WRITE 0x0008  /* prevent writes */
>  /* (1U << 31) is reserved for signed error codes */
>  
> +
> +#define F_IOINFO (F_LINUX_SPECIFIC_BASE +  11)
> +
> +struct fcntl_ioinfo {
> + __u16   fio_flags;  /* FIO_FL_* */
> + __u16   fio_alignment;  /* required I/O alignment on disk */

Hm... is fio_alignment specified in units of bytes?  If so, then
shouldn't this be a __u32 so that we can handle some weird future device
that wants, say, 1MB alignment for its atomic IO?

I'm not sure there /are/ such weird devices, and the current patchset
assumes (probably sanely) that atomic requests only have to be
lba-aligned, but otoh this is a userland field and we have plenty of
reserved space.

Though, now that I look at the XFS ioinfo patch, I guess fio_alignment
is set only for O_DIRECT files?  So it's really the required alignment
for directio operations.

(Now I think I'm simply confused about this field.)
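
For context, this is how I'd expect userspace to consume it, going off the
structures in this patch (untested sketch):

        struct fcntl_ioinfo fio = { 0 };

        if (fcntl(fd, F_IOINFO, &fio) == 0)
                printf("flags 0x%x align %u max_atomic %llu\n",
                       fio.fio_flags, fio.fio_alignment,
                       (unsigned long long)fio.fio_max_atomic);

so whatever fio_alignment actually means (bytes? directio only?) probably
wants spelling out in the uapi header comment.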

--D

> + __u32   __reserved1;/* must be zero */
> + __u64   fio_max_atomic; /* max size for atomic writes */
> + __u64   __reserved2[14];/* must be zero */
> +};
> +
> +/* supports atomic writes using O_(D)SYNC */
> +#define FIO_FL_ATOMIC_OSYNC  (1 << 0)
> +/* supports atomic writes committed using fsync/fdatasync/msync */
> +#define FIO_FL_ATOMIC_FSYNC  (1 << 1)
> +
>  /*
>   * Types of directory notifications that may be requested.
>   */
> -- 
> 2.11.0
> 


Re: [PATCH 07/12] xfs: implement failure-atomic writes

2017-02-28 Thread Darrick J. Wong
On Tue, Feb 28, 2017 at 06:57:32AM -0800, Christoph Hellwig wrote:
> If O_ATOMIC is specified in the open flags this will cause XFS to
> allocate new extents in the COW for even if overwriting existing data,

"COW fork"^^^

The previous patch's commit message also has that quirk.

> and not remap them into the data fork until ->fsync is called,
> at which point the whole range will be atomically remapped into the
> data fork.  This allows applications to safely overwrite data instead
> of having to do double writes.

By the way, the copy on write code remembers the extents it has
allocated for CoW staging in the refcount btree so that it can free them
after a crash, which means that O_ATOMIC requires reflink to be enabled.
There doesn't seem to be any explicit checking that reflink is even
enabled, which will probably just lead to weird crashes on a pre-reflink
xfs.

FWIW I didn't see any checking anywhere (vfs or xfs) that the filesystem
can actually support O_ATOMIC.  If the FS doesn't support atomic writes,
shouldn't the kernel send EINVAL or something back to userspace?
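
Even something minimal in xfs_file_open() would do, along the lines of
(sketch only):

        if ((file->f_flags & O_ATOMIC) && !xfs_sb_version_hasreflink(&mp->m_sb))
                return -EOPNOTSUPP;     /* or -EINVAL, whatever we settle on */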

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/xfs/xfs_aops.c| 18 +-
>  fs/xfs/xfs_aops.h|  4 ++-
>  fs/xfs/xfs_file.c| 17 +
>  fs/xfs/xfs_iomap.c   | 18 --
>  fs/xfs/xfs_reflink.c | 69 
> ++--
>  fs/xfs/xfs_reflink.h |  5 ++--
>  6 files changed, 91 insertions(+), 40 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index c78b585b3d84..1c5efbb05b47 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -292,6 +292,7 @@ xfs_end_io(
>   if (unlikely(error)) {
>   switch (ioend->io_type) {
>   case XFS_IO_COW:
> + case XFS_IO_ATOMIC:

So we cancel the CoW staging blocks if the write was atomic and failed.

Later in the !error case we remap the blocks if it was a cow write, but
leave the mapping in memory if the write was atomic.  That is consistent
with the commit message, good.

At the start of xfs_reflink.c is a long block comment describing how the
copy on write mechanism works.  Since O_ATOMIC is a variant on CoW (it's
basically CoW with remapping deferred until fsync), please update the
comment so that the comments capture the details of how atomic writes
work.

(IOWs: Dave asked me to leave the big comment, so I'm going to try to
keep it fairly up to date.)

>   xfs_reflink_cancel_cow_range(ip, offset, size, 0);
>   break;
>   }
> @@ -327,7 +328,9 @@ xfs_end_bio(
>   struct xfs_ioend*ioend = bio->bi_private;
>   struct xfs_mount*mp = XFS_I(ioend->io_inode)->i_mount;
>  
> - if (ioend->io_type == XFS_IO_UNWRITTEN || ioend->io_type == XFS_IO_COW)
> + if (ioend->io_type == XFS_IO_UNWRITTEN ||
> + ioend->io_type == XFS_IO_COW ||
> + ioend->io_type == XFS_IO_ATOMIC)
>   queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
>   else if (ioend->io_append_trans)
>   queue_work(mp->m_data_workqueue, &ioend->io_work);
> @@ -354,6 +357,7 @@ xfs_map_blocks(
>   return -EIO;
>  
>   ASSERT(type != XFS_IO_COW);
> + ASSERT(type != XFS_IO_ATOMIC);
>   if (type == XFS_IO_UNWRITTEN)
>   bmapi_flags |= XFS_BMAPI_IGSTATE;
>  
> @@ -768,7 +772,8 @@ xfs_map_cow(
>   struct xfs_writepage_ctx *wpc,
>   struct inode*inode,
>   loff_t  offset,
> - unsigned int*new_type)
> + unsigned int*new_type,
> + boolatomic)
>  {
>   struct xfs_inode*ip = XFS_I(inode);
>   struct xfs_bmbt_irecimap;
> @@ -778,10 +783,10 @@ xfs_map_cow(
>   /*
>* If we already have a valid COW mapping keep using it.
>*/
> - if (wpc->io_type == XFS_IO_COW) {
> + if (wpc->io_type == XFS_IO_COW || wpc->io_type == XFS_IO_ATOMIC) {
>   wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap, offset);
>   if (wpc->imap_valid) {
> - *new_type = XFS_IO_COW;
> + *new_type = wpc->io_type;
>   return 0;
>   }
>   }
> @@ -807,7 +812,7 @@ xfs_map_cow(
>   return error;
>   }
>  
> - wpc->io_type = *new_type = XFS_IO_COW;
> + wpc->io_type = *new_type = atomic ? XFS_IO_ATOMIC : XFS_IO_COW;
>   wpc->imap_valid = true;
>   wpc->imap = imap;
>   return 0;
> @@ -886,7 +891,8 @@ xfs_writepage_map(
>   }
>  
>   if (XFS_I(inode)->i_cowfp) {
> - error = xfs_map_cow(wpc, inode, offset, &new_type);
> + error = xfs_map_cow(wpc, inode, offset, &new_type,
> + buffer_atomic(bh));
>   if (error)
>   goto out;
>   }
> diff --git 

Re: [RFC] failure atomic writes for file systems and block devices

2017-02-28 Thread Darrick J. Wong
On Tue, Feb 28, 2017 at 06:57:25AM -0800, Christoph Hellwig wrote:
> Hi all,
> 
> this series implements a new O_ATOMIC flag for failure atomic writes
> to files.   It is based on and tries to unify two earlier proposals,
> the first one for block devices by Chris Mason:
> 
>   https://lwn.net/Articles/573092/
> 
> and the second one for regular files, published by HP Research at
> Usenix FAST 2015:
> 
>   
> https://www.usenix.org/conference/fast15/technical-sessions/presentation/verma
> 
> It adds a new O_ATOMIC flag for open, which requests writes to be
> failure-atomic, that is either the whole write makes it to persistent
> storage, or none of it, even in case of power or other failures.
> 
> There are two implementation variants of this:  on block devices O_ATOMIC
> must be combined with O_(D)SYNC so that storage devices that can handle
> large writes atomically can simply do that without any additional work.
> This case is supported by NVMe.
> 
> The second case is for file systems, where we simply write new blocks
> out of place and then remap them into the file atomically on either
> completion of an O_(D)SYNC write or when fsync is called explicitly.
> 
> The semantics of the latter case are explained in detail at the Usenix
> paper above.

(Assuming there's no syncv involved here...?)

> Last but not least a new fcntl is implemented to provide information
> about I/O restrictions such as alignment requirements and the maximum
> atomic write size.
> 
> The implementation is simple and clean, but I'm rather unhappy about
> the interface as it has too many failure modes to bullet proof.  For
> one old kernels ignore unknown open flags silently, so applications

Ok, heh, disregard my review comment (for the xfs part) about the
seemingly insufficient O_ATOMIC validation.

> have to check the F_IOINFO fcntl before, which is a bit of a killer.
> Because of that I've also not implemented any other validity checks
> yet, as they might make things even worse when an open on a not supported
> file system or device fails, but not on an old kernel.  Maybe we need
> a new open version that checks arguments properly first?

Does fcntl(F_SETFL...) suffer from this?
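
(If I'm reading the series right, the dance an application has to do today
is roughly:

        struct fcntl_ioinfo fio = { 0 };
        int fd = open(path, O_RDWR | O_DSYNC | O_ATOMIC);

        if (fd < 0 || fcntl(fd, F_IOINFO, &fio) < 0 ||
            !(fio.fio_flags & FIO_FL_ATOMIC_OSYNC)) {
                /* old kernel, or a fs/device without atomic writes:
                 * O_ATOMIC was silently ignored, so fall back to double writes */
        }

which is indeed pretty easy to get wrong.)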

> Also I'm really worried about the NVMe failure modes - devices simply
> advertise an atomic write size, with no way for the device to know
> that the host requested a given write to be atomic, and thus no
> error reporting.

Yikes!

> This is made worse by NVMe 1.2 adding per-namespace
> atomic I/O parameters that devices can use to introduce additional
> odd alignment quirks - while there is some language in the spec
> requiring them not to weaken the per-controller guarantees it all
> looks rather weak and I'm not too confident in all implementations
> getting everything right.
> 
> Last but not least this depends on a few XFS patches, so to actually
> apply / run the patches please use this git tree:

Well, the XFS parts don't look too bad.

--D

> 
> git://git.infradead.org/users/hch/vfs.git O_ATOMIC
> 
> Gitweb:
> 
> http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/O_ATOMIC


[PATCH 3/3] block: implement (some of) fallocate for block devices

2016-09-29 Thread Darrick J. Wong
After much discussion, it seems that the fallocate feature flag
FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature
FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been
whitelisted for zeroing SCSI UNMAP.  Punch still requires that
FALLOC_FL_KEEP_SIZE is set.  A length that goes past the end of the
device will be clamped to the device size if KEEP_SIZE is set; or will
return -EINVAL if not.  Both start and length must be aligned to the
device's logical block size.

Since the semantics of fallocate are fairly well established already,
wire up the two pieces.  The other fallocate variants (collapse range,
insert range, and allocate blocks) are not supported.
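
For illustration, userspace exercises the two supported modes roughly like
this (offsets and lengths must be aligned to the logical block size):

        int fd = open("/dev/sdX", O_RDWR);

        /* write zeroes to a range */
        fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len);

        /* deallocate a range; KEEP_SIZE is mandatory for punch */
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len);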

Signed-off-by: Darrick J. Wong <darrick.w...@oracle.com>
Reviewed-by: Hannes Reinecke <h...@suse.com>
Reviewed-by: Bart Van Assche <bart.vanass...@sandisk.com>
---
 fs/block_dev.c |   77 
 fs/open.c  |3 +-
 2 files changed, 79 insertions(+), 1 deletion(-)


diff --git a/fs/block_dev.c b/fs/block_dev.c
index 08ae993..777fd9b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "internal.h"
 
@@ -1787,6 +1788,81 @@ static const struct address_space_operations 
def_blk_aops = {
.is_dirty_writeback = buffer_check_dirty_writeback,
 };
 
+#defineBLKDEV_FALLOC_FL_SUPPORTED  
\
+   (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |   \
+FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
+
+static long blkdev_fallocate(struct file *file, int mode, loff_t start,
+loff_t len)
+{
+   struct block_device *bdev = I_BDEV(bdev_file_inode(file));
+   struct request_queue *q = bdev_get_queue(bdev);
+   struct address_space *mapping;
+   loff_t end = start + len - 1;
+   loff_t isize;
+   int error;
+
+   /* Fail if we don't recognize the flags. */
+   if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
+   return -EOPNOTSUPP;
+
+   /* Don't go off the end of the device. */
+   isize = i_size_read(bdev->bd_inode);
+   if (start >= isize)
+   return -EINVAL;
+   if (end >= isize) {
+   if (mode & FALLOC_FL_KEEP_SIZE) {
+   len = isize - start;
+   end = start + len - 1;
+   } else
+   return -EINVAL;
+   }
+
+   /*
+* Don't allow IO that isn't aligned to logical block size.
+*/
+   if ((start | len) & (bdev_logical_block_size(bdev) - 1))
+   return -EINVAL;
+
+   /* Invalidate the page cache, including dirty pages. */
+   mapping = bdev->bd_inode->i_mapping;
+   truncate_inode_pages_range(mapping, start, end);
+
+   switch (mode) {
+   case FALLOC_FL_ZERO_RANGE:
+   case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
+   error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
+   GFP_KERNEL, false);
+   break;
+   case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
+   /* Only punch if the device can do zeroing discard. */
+   if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
+   return -EOPNOTSUPP;
+   error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
+GFP_KERNEL, 0);
+   break;
+   case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | 
FALLOC_FL_NO_HIDE_STALE:
+   if (!blk_queue_discard(q))
+   return -EOPNOTSUPP;
+   error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
+GFP_KERNEL, 0);
+   break;
+   default:
+   return -EOPNOTSUPP;
+   }
+   if (error)
+   return error;
+
+   /*
+* Invalidate again; if someone wandered in and dirtied a page,
+* the caller will be given -EBUSY.  The third argument is
+* inclusive, so the rounding here is safe.
+*/
+   return invalidate_inode_pages2_range(mapping,
+start >> PAGE_SHIFT,
+end >> PAGE_SHIFT);
+}
+
 const struct file_operations def_blk_fops = {
.open   = blkdev_open,
.release= blkdev_close,
@@ -1801,6 +1877,7 @@ const struct file_operations def_blk_fops = {
 #endif
.splice_read= generic_file_splice_read,
.splice_write   = iter_file_splice_write,
+   .fallocate  = blkdev_fallocate,
 };
 
 int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)
diff --git a/fs/open.c b/fs/open.c
index 4fd6e25..01b6092 100644

[PATCH 3/3] block: implement (some of) fallocate for block devices

2016-08-25 Thread Darrick J. Wong
After much discussion, it seems that the fallocate feature flag
FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature
FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been
whitelisted for zeroing SCSI UNMAP.  Punch still requires that
FALLOC_FL_KEEP_SIZE is set.  A length that goes past the end of the
device will be clamped to the device size if KEEP_SIZE is set; or will
return -EINVAL if not.  Both start and length must be aligned to the
device's logical block size.

Since the semantics of fallocate are fairly well established already,
wire up the two pieces.  The other fallocate variants (collapse range,
insert range, and allocate blocks) are not supported.

v2: Incorporate feedback from Christoph & Linus.  Tentatively add
a requirement that the fallocate arguments be aligned to logical block
size, and put in a few XXX comments ahead of LSF discussion.

v3: Forward port to 4.7.

Signed-off-by: Darrick J. Wong <darrick.w...@oracle.com>
---
 fs/block_dev.c |   84 
 fs/open.c  |3 +-
 2 files changed, 86 insertions(+), 1 deletion(-)


diff --git a/fs/block_dev.c b/fs/block_dev.c
index c3cdde8..4df3fc8 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "internal.h"
 
@@ -1786,6 +1787,88 @@ static const struct address_space_operations 
def_blk_aops = {
.is_dirty_writeback = buffer_check_dirty_writeback,
 };
 
+#defineBLKDEV_FALLOC_FL_SUPPORTED  
\
+   (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |   \
+FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
+
+static long blkdev_fallocate(struct file *file, int mode, loff_t start,
+loff_t len)
+{
+   struct block_device *bdev = I_BDEV(bdev_file_inode(file));
+   struct request_queue *q = bdev_get_queue(bdev);
+   struct address_space *mapping;
+   loff_t end = start + len - 1;
+   loff_t isize;
+   int error;
+
+   /* Fail if we don't recognize the flags. */
+   if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
+   return -EOPNOTSUPP;
+
+   /* Don't go off the end of the device. */
+   isize = i_size_read(bdev->bd_inode);
+   if (start >= isize)
+   return -EINVAL;
+   if (end > isize) {
+   if (mode & FALLOC_FL_KEEP_SIZE) {
+   len = isize - start;
+   end = start + len - 1;
+   } else
+   return -EINVAL;
+   }
+
+   /*
+* Don't allow IO that isn't aligned to logical block size.
+*/
+   if ((start | len) & (bdev_logical_block_size(bdev) - 1))
+   return -EINVAL;
+
+   /* Invalidate the page cache, including dirty pages. */
+   mapping = bdev->bd_inode->i_mapping;
+   truncate_inode_pages_range(mapping, start, end);
+
+   switch (mode) {
+   case FALLOC_FL_ZERO_RANGE:
+   case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
+   error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
+   GFP_KERNEL, false);
+   if (error)
+   return error;
+   break;
+   case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
+   /* Only punch if the device can do zeroing discard. */
+   if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
+   return -EOPNOTSUPP;
+   error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
+GFP_KERNEL, 0);
+   if (error)
+   return error;
+   break;
+   case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | 
FALLOC_FL_NO_HIDE_STALE:
+   /*
+* XXX: a well known search engine vendor interprets this
+* flag (in other circumstances) to mean "I don't care if
+* we can read stale contents later".  Is it appropriate
+* to wire this up to the non-zeroing discard?
+*/
+   error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
+GFP_KERNEL, 0);
+   if (error)
+   return error;
+   break;
+   default:
+   return -EOPNOTSUPP;
+   }
+
+   /*
+* Invalidate again; if someone wandered in and dirtied a page,
+* the caller will be given -EBUSY;
+*/
+   return invalidate_inode_pages2_range(mapping,
+start >> PAGE_SHIFT,
+end >> PAGE_SHIFT);
+}
+
 const struct file_operations def_blk_fops = {
.open

[PATCH 1/3] block: invalidate the page cache when issuing BLKZEROOUT.

2016-09-28 Thread Darrick J. Wong
Invalidate the page cache (as a regular O_DIRECT write would do) to avoid
returning stale cache contents at a later time.
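
For reference, the ioctl takes a byte-granular range that must be 512-byte
aligned:

        uint64_t range[2] = { start, length };  /* both in bytes, 512-aligned */

        ioctl(fd, BLKZEROOUT, range);

and with this patch the affected page cache range is dropped before the
zeroout is issued.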

Signed-off-by: Darrick J. Wong <darrick.w...@oracle.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>
---
 block/ioctl.c |   18 --
 1 file changed, 12 insertions(+), 6 deletions(-)


diff --git a/block/ioctl.c b/block/ioctl.c
index ed2397f..755119c 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -225,7 +225,8 @@ static int blk_ioctl_zeroout(struct block_device *bdev, 
fmode_t mode,
unsigned long arg)
 {
uint64_t range[2];
-   uint64_t start, len;
+   struct address_space *mapping;
+   uint64_t start, end, len;
 
if (!(mode & FMODE_WRITE))
return -EBADF;
@@ -235,18 +236,23 @@ static int blk_ioctl_zeroout(struct block_device *bdev, 
fmode_t mode,
 
start = range[0];
len = range[1];
+   end = start + len - 1;
 
if (start & 511)
return -EINVAL;
if (len & 511)
return -EINVAL;
-   start >>= 9;
-   len >>= 9;
-
-   if (start + len > (i_size_read(bdev->bd_inode) >> 9))
+   if (end >= (uint64_t)i_size_read(bdev->bd_inode))
+   return -EINVAL;
+   if (end < start)
return -EINVAL;
 
-   return blkdev_issue_zeroout(bdev, start, len, GFP_KERNEL, false);
+   /* Invalidate the page cache, including dirty pages */
+   mapping = bdev->bd_inode->i_mapping;
+   truncate_inode_pages_range(mapping, start, end);
+
+   return blkdev_issue_zeroout(bdev, start >> 9, len >> 9, GFP_KERNEL,
+   false);
 }
 
 static int put_ushort(unsigned long arg, unsigned short val)



Re: [PATCH 3/3] block: implement (some of) fallocate for block devices

2016-09-28 Thread Darrick J. Wong
On Wed, Sep 28, 2016 at 06:42:14PM -0700, Bart Van Assche wrote:
> On 09/28/16 17:39, Darrick J. Wong wrote:
> >+if (end > isize) {
> >+if (mode & FALLOC_FL_KEEP_SIZE) {
> >+len = isize - start;
> >+end = start + len - 1;
> >+} else
> >+return -EINVAL;
> >+}
> 
> If FALLOC_FL_KEEP_SIZE has been set and end == isize the above code won't
> reduce end to isize - 1. Shouldn't "end > isize" be changed into "end >=
> isize" ?

Oops.  Will fix and send out a v2.

> >+switch (mode) {
> >+case FALLOC_FL_ZERO_RANGE:
> >+case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
> >+error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
> >+GFP_KERNEL, false);
> >+if (error)
> >+return error;
> >+break;
> >+case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
> >+/* Only punch if the device can do zeroing discard. */
> >+if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
> >+return -EOPNOTSUPP;
> >+error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
> >+ GFP_KERNEL, 0);
> >+if (error)
> >+return error;
> >+break;
> >+case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | 
> >FALLOC_FL_NO_HIDE_STALE:
> >+error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
> >+ GFP_KERNEL, 0);
> >+if (error)
> >+return error;
> >+break;
> >+default:
> >+return -EOPNOTSUPP;
> >+}
> 
> Have you considered to move "if (error) return error" out of the switch
> statement?

Sure, I could do that.

> >+/*
> >+ * Invalidate again; if someone wandered in and dirtied a page,
> >+ * the caller will be given -EBUSY;
> >+ */
> >+return invalidate_inode_pages2_range(mapping,
> >+ start >> PAGE_SHIFT,
> >+ end >> PAGE_SHIFT);
> 
> A comment might be appropriate here that since end is inclusive and since
> the third argument of invalidate_inode_pages2_range() is inclusive that
> rounding down will yield the correct result.

/me thought the documentation of invalidate_inode_pages2_range was clear
enough on that point, but I could throw that into the comment too.

--D
> 
> Bart.
> 


[PATCH v2 3/3] block: implement (some of) fallocate for block devices

2016-09-28 Thread Darrick J. Wong
After much discussion, it seems that the fallocate feature flag
FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature
FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been
whitelisted for zeroing SCSI UNMAP.  Punch still requires that
FALLOC_FL_KEEP_SIZE is set.  A length that goes past the end of the
device will be clamped to the device size if KEEP_SIZE is set; or will
return -EINVAL if not.  Both start and length must be aligned to the
device's logical block size.

Since the semantics of fallocate are fairly well established already,
wire up the two pieces.  The other fallocate variants (collapse range,
insert range, and allocate blocks) are not supported.

Signed-off-by: Darrick J. Wong <darrick.w...@oracle.com>
---
v2: Incorporate feedback from Christoph & Linus.  Tentatively add
a requirement that the fallocate arguments be aligned to logical block
size, and put in a few XXX comments ahead of LSF discussion.
v3: Forward port to 4.7.
v4: Forward port to 4.8.
---
 fs/block_dev.c |   75 
 fs/open.c  |3 +-
 2 files changed, 77 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 08ae993..7b6d096 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "internal.h"
 
@@ -1787,6 +1788,79 @@ static const struct address_space_operations 
def_blk_aops = {
.is_dirty_writeback = buffer_check_dirty_writeback,
 };
 
+#defineBLKDEV_FALLOC_FL_SUPPORTED  
\
+   (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |   \
+FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
+
+static long blkdev_fallocate(struct file *file, int mode, loff_t start,
+loff_t len)
+{
+   struct block_device *bdev = I_BDEV(bdev_file_inode(file));
+   struct request_queue *q = bdev_get_queue(bdev);
+   struct address_space *mapping;
+   loff_t end = start + len - 1;
+   loff_t isize;
+   int error;
+
+   /* Fail if we don't recognize the flags. */
+   if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
+   return -EOPNOTSUPP;
+
+   /* Don't go off the end of the device. */
+   isize = i_size_read(bdev->bd_inode);
+   if (start >= isize)
+   return -EINVAL;
+   if (end >= isize) {
+   if (mode & FALLOC_FL_KEEP_SIZE) {
+   len = isize - start;
+   end = start + len - 1;
+   } else
+   return -EINVAL;
+   }
+
+   /*
+* Don't allow IO that isn't aligned to logical block size.
+*/
+   if ((start | len) & (bdev_logical_block_size(bdev) - 1))
+   return -EINVAL;
+
+   /* Invalidate the page cache, including dirty pages. */
+   mapping = bdev->bd_inode->i_mapping;
+   truncate_inode_pages_range(mapping, start, end);
+
+   switch (mode) {
+   case FALLOC_FL_ZERO_RANGE:
+   case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
+   error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
+   GFP_KERNEL, false);
+   break;
+   case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
+   /* Only punch if the device can do zeroing discard. */
+   if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
+   return -EOPNOTSUPP;
+   error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
+GFP_KERNEL, 0);
+   break;
+   case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | 
FALLOC_FL_NO_HIDE_STALE:
+   error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
+GFP_KERNEL, 0);
+   break;
+   default:
+   return -EOPNOTSUPP;
+   }
+   if (error)
+   return error;
+
+   /*
+* Invalidate again; if someone wandered in and dirtied a page,
+* the caller will be given -EBUSY.  The third argument is
+* inclusive, so the rounding here is safe.
+*/
+   return invalidate_inode_pages2_range(mapping,
+start >> PAGE_SHIFT,
+end >> PAGE_SHIFT);
+}
+
 const struct file_operations def_blk_fops = {
.open   = blkdev_open,
.release= blkdev_close,
@@ -1801,6 +1875,7 @@ const struct file_operations def_blk_fops = {
 #endif
.splice_read= generic_file_splice_read,
.splice_write   = iter_file_splice_write,
+   .fallocate  = blkdev_fallocate,
 };
 
 int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)
diff --git a/fs

Re: [PATCH v2 3/3] block: implement (some of) fallocate for block devices

2016-09-29 Thread Darrick J. Wong
On Thu, Sep 29, 2016 at 01:08:57PM -0700, Bart Van Assche wrote:
> On 09/28/2016 07:19 PM, Darrick J. Wong wrote:
> >After much discussion, it seems that the fallocate feature flag
> >FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature
> >FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been
> >whitelisted for zeroing SCSI UNMAP.  Punch still requires that
> >FALLOC_FL_KEEP_SIZE is set.  A length that goes past the end of the
> >device will be clamped to the device size if KEEP_SIZE is set; or will
> >return -EINVAL if not.  Both start and length must be aligned to the
> >device's logical block size.
> >
> >Since the semantics of fallocate are fairly well established already,
> >wire up the two pieces.  The other fallocate variants (collapse range,
> >insert range, and allocate blocks) are not supported.
> 
> For the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | FALLOC_FL_NO_HIDE_STALE
> case, it's probably safer not to try to send a discard to block devices that
> do not support discard in order not to hit block driver bugs. But that's
> something we can still discuss later. Hence:

I'll just change it to check the queue flags and post a new revision.
At this point I might as well repost the whole thing to reflect the
reviewed-bys.

--D

> Reviewed-by: Bart Van Assche <bart.vanass...@sandisk.com>


[PATCH 3/3] block: implement (some of) fallocate for block devices

2016-09-28 Thread Darrick J. Wong
After much discussion, it seems that the fallocate feature flag
FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature
FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been
whitelisted for zeroing SCSI UNMAP.  Punch still requires that
FALLOC_FL_KEEP_SIZE is set.  A length that goes past the end of the
device will be clamped to the device size if KEEP_SIZE is set; or will
return -EINVAL if not.  Both start and length must be aligned to the
device's logical block size.

Since the semantics of fallocate are fairly well established already,
wire up the two pieces.  The other fallocate variants (collapse range,
insert range, and allocate blocks) are not supported.

Signed-off-by: Darrick J. Wong <darrick.w...@oracle.com>
---
v2: Incorporate feedback from Christoph & Linus.  Tentatively add
a requirement that the fallocate arguments be aligned to logical block
size, and put in a few XXX comments ahead of LSF discussion.
v3: Forward port to 4.7.
v4: Forward port to 4.8.
---
 fs/block_dev.c |   78 
 fs/open.c  |3 +-
 2 files changed, 80 insertions(+), 1 deletion(-)


diff --git a/fs/block_dev.c b/fs/block_dev.c
index 08ae993..0c808fc 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "internal.h"
 
@@ -1787,6 +1788,82 @@ static const struct address_space_operations 
def_blk_aops = {
.is_dirty_writeback = buffer_check_dirty_writeback,
 };
 
+#defineBLKDEV_FALLOC_FL_SUPPORTED  
\
+   (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |   \
+FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_HIDE_STALE)
+
+static long blkdev_fallocate(struct file *file, int mode, loff_t start,
+loff_t len)
+{
+   struct block_device *bdev = I_BDEV(bdev_file_inode(file));
+   struct request_queue *q = bdev_get_queue(bdev);
+   struct address_space *mapping;
+   loff_t end = start + len - 1;
+   loff_t isize;
+   int error;
+
+   /* Fail if we don't recognize the flags. */
+   if (mode & ~BLKDEV_FALLOC_FL_SUPPORTED)
+   return -EOPNOTSUPP;
+
+   /* Don't go off the end of the device. */
+   isize = i_size_read(bdev->bd_inode);
+   if (start >= isize)
+   return -EINVAL;
+   if (end > isize) {
+   if (mode & FALLOC_FL_KEEP_SIZE) {
+   len = isize - start;
+   end = start + len - 1;
+   } else
+   return -EINVAL;
+   }
+
+   /*
+* Don't allow IO that isn't aligned to logical block size.
+*/
+   if ((start | len) & (bdev_logical_block_size(bdev) - 1))
+   return -EINVAL;
+
+   /* Invalidate the page cache, including dirty pages. */
+   mapping = bdev->bd_inode->i_mapping;
+   truncate_inode_pages_range(mapping, start, end);
+
+   switch (mode) {
+   case FALLOC_FL_ZERO_RANGE:
+   case FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE:
+   error = blkdev_issue_zeroout(bdev, start >> 9, len >> 9,
+   GFP_KERNEL, false);
+   if (error)
+   return error;
+   break;
+   case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE:
+   /* Only punch if the device can do zeroing discard. */
+   if (!blk_queue_discard(q) || !q->limits.discard_zeroes_data)
+   return -EOPNOTSUPP;
+   error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
+GFP_KERNEL, 0);
+   if (error)
+   return error;
+   break;
+   case FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE | 
FALLOC_FL_NO_HIDE_STALE:
+   error = blkdev_issue_discard(bdev, start >> 9, len >> 9,
+GFP_KERNEL, 0);
+   if (error)
+   return error;
+   break;
+   default:
+   return -EOPNOTSUPP;
+   }
+
+   /*
+* Invalidate again; if someone wandered in and dirtied a page,
+* the caller will be given -EBUSY;
+*/
+   return invalidate_inode_pages2_range(mapping,
+start >> PAGE_SHIFT,
+end >> PAGE_SHIFT);
+}
+
 const struct file_operations def_blk_fops = {
.open   = blkdev_open,
.release= blkdev_close,
@@ -1801,6 +1878,7 @@ const struct file_operations def_blk_fops = {
 #endif
.splice_read= generic_file_splice_read,
.splice_write   = iter_file_splice_write,
+   .fallocate  = blkdev_fallocate,
 };
 
 int ioctl_by_bdev(st

[PATCH 2/3] block: require write_same and discard requests align to logical block size

2016-09-28 Thread Darrick J. Wong
Make sure that the offset and length arguments that we're using to
construct WRITE SAME and DISCARD requests are actually aligned to the
logical block size.  Failure to do this causes other errors in other
parts of the block layer or the SCSI layer because disks don't support
partial logical block writes.
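
For example, on a drive with 4096-byte logical blocks the mask works out to
(4096 >> 9) - 1 = 7, so any sector or nr_sects value that is not a multiple
of 8 sectors (4KiB) is now rejected with -EINVAL.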

Signed-off-by: Darrick J. Wong <darrick.w...@oracle.com>
Reviewed-by: Christoph Hellwig <h...@lst.de>
Reviewed-by: Bart Van Assche <bart.vanass...@sandisk.com>
Reviewed-by: Martin K. Petersen <martin.peter...@oracle.com>
---
 block/blk-lib.c |   15 +++
 1 file changed, 15 insertions(+)


diff --git a/block/blk-lib.c b/block/blk-lib.c
index 083e56f..46fe924 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -31,6 +31,7 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
unsigned int granularity;
enum req_op op;
int alignment;
+   sector_t bs_mask;
 
if (!q)
return -ENXIO;
@@ -50,6 +51,10 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
op = REQ_OP_DISCARD;
}
 
+   bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
+   if ((sector | nr_sects) & bs_mask)
+   return -EINVAL;
+
/* Zero-sector (unknown) and one-sector granularities are the same.  */
granularity = max(q->limits.discard_granularity >> 9, 1U);
alignment = (bdev_discard_alignment(bdev) >> 9) % granularity;
@@ -150,10 +155,15 @@ int blkdev_issue_write_same(struct block_device *bdev, 
sector_t sector,
unsigned int max_write_same_sectors;
struct bio *bio = NULL;
int ret = 0;
+   sector_t bs_mask;
 
if (!q)
return -ENXIO;
 
+   bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
+   if ((sector | nr_sects) & bs_mask)
+   return -EINVAL;
+
/* Ensure that max_write_same_sectors doesn't overflow bi_size */
max_write_same_sectors = UINT_MAX >> 9;
 
@@ -202,6 +212,11 @@ static int __blkdev_issue_zeroout(struct block_device 
*bdev, sector_t sector,
int ret;
struct bio *bio = NULL;
unsigned int sz;
+   sector_t bs_mask;
+
+   bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
+   if ((sector | nr_sects) & bs_mask)
+   return -EINVAL;
 
while (nr_sects != 0) {
bio = next_bio(bio, min(nr_sects, (sector_t)BIO_MAX_PAGES),



Re: [PATCH 7/8] nowait aio: xfs

2017-04-06 Thread Darrick J. Wong
On Mon, Apr 03, 2017 at 11:52:11PM -0700, Christoph Hellwig wrote:
> > +   if (unaligned_io) {
> > +   /* If we are going to wait for other DIO to finish, bail */
> > +   if ((iocb->ki_flags & IOCB_NOWAIT) &&
> > +atomic_read(&inode->i_dio_count))
> > +   return -EAGAIN;
> > inode_dio_wait(inode);
> 
> This checks i_dio_count twice in the nowait case, I think it should be:
> 
>   if (iocb->ki_flags & IOCB_NOWAIT) {
>   if (atomic_read(>i_dio_count))
>   return -EAGAIN;
>   } else {
>   inode_dio_wait(inode);
>   }
> 
> > if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
> > if (flags & IOMAP_DIRECT) {
> > +   /* A reflinked inode will result in CoW alloc */
> > +   if (flags & IOMAP_NOWAIT) {
> > +   error = -EAGAIN;
> > +   goto out_unlock;
> > +   }
> 
> This is a bit pessimistic - just because the inode has any shared
> extents we could still write into unshared ones.  For now I think this
> pessimistic check is fine, but the comment should be corrected.

Consider what happens in both _reflink_{allocate,reserve}_cow.  If there
is already an existing reservation in the CoW fork then we'll have to
CoW and therefore can't satisfy the NOWAIT flag.  If there isn't already
anything in the CoW fork, then we have to see if there are shared blocks
by calling _reflink_trim_around_shared.  That performs a refcountbt
lookup, which involves locking the AGF, so we also can't satisfy NOWAIT.

IOWs, I think this hunk has to move outside the IOMAP_DIRECT check to
cover both write-to-reflinked-file cases.
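
i.e. something along these lines (sketch):

        if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
                /* any write to a reflinked file may need CoW setup or AGF locking */
                if (flags & IOMAP_NOWAIT) {
                        error = -EAGAIN;
                        goto out_unlock;
                }
                if (flags & IOMAP_DIRECT) {
                        ...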

--D



Re: [PATCH 7/8] nowait aio: xfs

2017-04-07 Thread Darrick J. Wong
On Fri, Apr 07, 2017 at 06:34:28AM -0500, Goldwyn Rodrigues wrote:
> 
> 
> On 04/06/2017 05:54 PM, Darrick J. Wong wrote:
> > On Mon, Apr 03, 2017 at 11:52:11PM -0700, Christoph Hellwig wrote:
> >>> + if (unaligned_io) {
> >>> + /* If we are going to wait for other DIO to finish, bail */
> >>> + if ((iocb->ki_flags & IOCB_NOWAIT) &&
> >>> +  atomic_read(&inode->i_dio_count))
> >>> + return -EAGAIN;
> >>>   inode_dio_wait(inode);
> >>
> >> This checks i_dio_count twice in the nowait case, I think it should be:
> >>
> >>if (iocb->ki_flags & IOCB_NOWAIT) {
> >>if (atomic_read(&inode->i_dio_count))
> >>return -EAGAIN;
> >>} else {
> >>inode_dio_wait(inode);
> >>}
> >>
> >>>   if ((flags & (IOMAP_WRITE | IOMAP_ZERO)) && xfs_is_reflink_inode(ip)) {
> >>>   if (flags & IOMAP_DIRECT) {
> >>> + /* A reflinked inode will result in CoW alloc */
> >>> + if (flags & IOMAP_NOWAIT) {
> >>> + error = -EAGAIN;
> >>> + goto out_unlock;
> >>> + }
> >>
> >> This is a bit pessimistic - just because the inode has any shared
> >> extents we could still write into unshared ones.  For now I think this
> >> pessimistic check is fine, but the comment should be corrected.
> > 
> > Consider what happens in both _reflink_{allocate,reserve}_cow.  If there
> > is already an existing reservation in the CoW fork then we'll have to
> > CoW and therefore can't satisfy the NOWAIT flag.  If there isn't already
> > anything in the CoW fork, then we have to see if there are shared blocks
> > by calling _reflink_trim_around_shared.  That performs a refcountbt
> > lookup, which involves locking the AGF, so we also can't satisfy NOWAIT.
> > 
> > IOWs, I think this hunk has to move outside the IOMAP_DIRECT check to
> > cover both write-to-reflinked-file cases.
> > 
> 
> IOMAP_NOWAIT is set only with IOMAP_DIRECT since the nowait feature is
> for direct-IO only. This is checked early on, when we are checking for

Ah, ok.  Disregard what I said about moving it then.

--D

> user-passed flags, and if not, -EINVAL is returned.
> 
> 
> -- 
> Goldwyn


Re: [PATCH v8 17/18] xfs: minimal conversion to errseq_t writeback error reporting

2017-06-29 Thread Darrick J. Wong
On Thu, Jun 29, 2017 at 09:19:53AM -0400, jlay...@kernel.org wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> Just check and advance the data errseq_t in struct file
> before returning from fsync on normal files. Internal filemap_*
> callers are left as-is.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>

--D

> ---
>  fs/xfs/xfs_file.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5fb5a0958a14..6600b264b0b6 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -140,7 +140,7 @@ xfs_file_fsync(
>  
>   trace_xfs_file_fsync(ip);
>  
> - error = filemap_write_and_wait_range(inode->i_mapping, start, end);
> + error = file_write_and_wait_range(file, start, end);
>   if (error)
>   return error;
>  
> -- 
> 2.13.0
> 


Re: [PATCH v8 12/18] Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors

2017-06-29 Thread Darrick J. Wong
On Thu, Jun 29, 2017 at 09:19:48AM -0400, jlay...@kernel.org wrote:
> From: Jeff Layton 
> 
> Let's try to make this extra clear for fs authors.
> 
> Cc: Jan Kara 
> Signed-off-by: Jeff Layton 
> ---
>  Documentation/filesystems/vfs.txt | 43 
> ---
>  1 file changed, 40 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/filesystems/vfs.txt 
> b/Documentation/filesystems/vfs.txt
> index f42b90687d40..1366043b3942 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -576,7 +576,42 @@ should clear PG_Dirty and set PG_Writeback.  It can be 
> actually
>  written at any point after PG_Dirty is clear.  Once it is known to be
>  safe, PG_Writeback is cleared.
>  
> -Writeback makes use of a writeback_control structure...
> +Writeback makes use of a writeback_control structure to direct the
> +operations.  This gives the writepage and writepages operations some
> +information about the nature of and reason for the writeback request,
> +and the constraints under which it is being done.  It is also used to
> +return information back to the caller about the result of a writepage or
> +writepages request.
> +
> +Handling errors during writeback
> +
> +Most applications that utilize the pagecache will periodically call
> +fsync to ensure that data written has made it to the backing store.

/me wonders if this sentence ought to be worded more strongly, e.g.

"Applications that utilize the pagecache must call a data
synchronization syscall such as fsync, fdatasync, or msync to ensure
that data written has made it to the backing store."

I'm also wondering -- fdatasync and msync will also report any writeback
errors that have happened anywhere (like fsync), since they all map to
vfs_fsync_range, correct?  If so, I think it worth it to state
explicitly that the other *sync methods behave the same as fsync w.r.t.
writeback error reporting.

--D

> +When there is an error during writeback, they expect that error to be
> +reported when fsync is called.  After an error has been reported on one
> +fsync, subsequent fsync calls on the same file descriptor should return
> +0, unless further writeback errors have occurred since the previous
> +fsync.
> +
> +Ideally, the kernel would report an error only on file descriptions on
> +which writes were done that subsequently failed to be written back.  The
> +generic pagecache infrastructure does not track the file descriptions
> +that have dirtied each individual page however, so determining which
> +file descriptors should get back an error is not possible.
> +
> +Instead, the generic writeback error tracking infrastructure in the
> +kernel settles for reporting errors to fsync on all file descriptions
> +that were open at the time that the error occurred.  In a situation with
> +multiple writers, all of them will get back an error on a subsequent fsync,
> +even if all of the writes done through that particular file descriptor
> +succeeded (or even if there were no writes on that file descriptor at all).
> +
> +Filesystems that wish to use this infrastructure should call
> +mapping_set_error to record the error in the address_space when it
> +occurs.  Then, at the end of their fsync operation, they should call
> +file_check_and_advance_wb_err to ensure that the struct file's error
> +cursor has advanced to the correct point in the stream of errors emitted
> +by the backing device(s).
>  
>  struct address_space_operations
>  ---
> @@ -804,7 +839,8 @@ struct address_space_operations {
>  The File Object
>  ===
>  
> -A file object represents a file opened by a process.
> +A file object represents a file opened by a process. This is also known
> +as an "open file description" in POSIX parlance.
>  
>  
>  struct file_operations
> @@ -887,7 +923,8 @@ otherwise noted.
>  
>release: called when the last reference to an open file is closed
>  
> -  fsync: called by the fsync(2) system call
> +  fsync: called by the fsync(2) system call. Also see the section above
> +  entitled "Handling errors during writeback".
>  
>fasync: called by the fcntl(2) system call when asynchronous
>   (non-blocking) mode is enabled for a file
> -- 
> 2.13.0
> 


Re: [xfstests PATCH v2 2/3] ext4: allow ext4 to use $SCRATCH_LOGDEV

2017-05-15 Thread Darrick J. Wong
On Tue, May 09, 2017 at 12:12:44PM -0400, Jeff Layton wrote:
> The writeback error handling test requires that you put the journal on a
> separate device. This allows us to use dmerror to simulate data
> writeback failure, without affecting the journal.
> 
> xfs already has infrastructure for this (a'la $SCRATCH_LOGDEV), so wire
> up the ext4 code so that it can do the same thing when _scratch_mkfs is
> called.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>

--D

> ---
>  common/rc | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/common/rc b/common/rc
> index 257b1903359d..8b815d9c8c33 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -675,6 +675,9 @@ _scratch_mkfs_ext4()
>   local tmp=`mktemp`
>   local mkfs_status
>  
> + [ "$USE_EXTERNAL" = yes -a ! -z "$SCRATCH_LOGDEV" ] && \
> + $mkfs_cmd -O journal_dev $SCRATCH_LOGDEV && \
> + mkfs_cmd="$mkfs_cmd -J device=$SCRATCH_LOGDEV"
>  
>   _scratch_do_mkfs "$mkfs_cmd" "$mkfs_filter" $* 2>$tmp.mkfserr 
> 1>$tmp.mkfsstd
>   mkfs_status=$?
> -- 
> 2.9.3
> 


Re: [PATCH 7/9] xfs: add support for passing in write hints for buffered writes

2017-06-20 Thread Darrick J. Wong
On Tue, Jun 20, 2017 at 06:22:04PM -0600, Jens Axboe wrote:
> Reviewed-by: Andreas Dilger <adil...@dilger.ca>
> Signed-off-by: Jens Axboe <ax...@kernel.dk>

Looks ok to me,
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>

--D

> ---
>  fs/xfs/xfs_aops.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 76b6f988e2fa..e4d9d470402c 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -506,6 +506,7 @@ xfs_submit_ioend(
>   return status;
>   }
>  
> + ioend->io_bio->bi_opf |= 
> write_hint_to_opf(inode_write_hint(ioend->io_inode));
>   submit_bio(ioend->io_bio);
>   return 0;
>  }
> @@ -565,6 +566,7 @@ xfs_chain_bio(
>   bio_chain(ioend->io_bio, new);
>   bio_get(ioend->io_bio); /* for xfs_destroy_ioend */
>   ioend->io_bio->bi_opf = REQ_OP_WRITE | wbc_to_write_flags(wbc);
> + ioend->io_bio->bi_opf |= 
> write_hint_to_opf(inode_write_hint(ioend->io_inode));
>   submit_bio(ioend->io_bio);
>   ioend->io_bio = new;
>  }
> -- 
> 2.7.4
> 


Re: [PATCH v7 21/22] xfs: minimal conversion to errseq_t writeback error reporting

2017-06-26 Thread Darrick J. Wong
On Fri, Jun 16, 2017 at 03:34:26PM -0400, Jeff Layton wrote:
> Just check and advance the data errseq_t in struct file
> before returning from fsync on normal files. Internal filemap_*
> callers are left as-is.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>
> ---
>  fs/xfs/xfs_file.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5fb5a0958a14..bc3b1575e8db 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -134,7 +134,7 @@ xfs_file_fsync(
>   struct inode*inode = file->f_mapping->host;
>   struct xfs_inode*ip = XFS_I(inode);
>   struct xfs_mount*mp = ip->i_mount;
> - int error = 0;
> + int error = 0, err2;
>   int log_flushed = 0;
>   xfs_lsn_t   lsn = 0;
>  
> @@ -142,10 +142,12 @@ xfs_file_fsync(
>  
>   error = filemap_write_and_wait_range(inode->i_mapping, start, end);
>   if (error)
> - return error;
> + goto out;
>  
> - if (XFS_FORCED_SHUTDOWN(mp))
> - return -EIO;
> + if (XFS_FORCED_SHUTDOWN(mp)) {
> + error = -EIO;
> + goto out;
> + }
>  
>   xfs_iflags_clear(ip, XFS_ITRUNCATED);
>  
> @@ -197,6 +199,11 @@ xfs_file_fsync(
>   mp->m_logdev_targp == mp->m_ddev_targp)
>   xfs_blkdev_issue_flush(mp->m_ddev_targp);
>  
> +out:
> + err2 = filemap_report_wb_err(file);

Could we have a comment here to remind anyone reading the code a year
from now that filemap_report_wb_err has side effects?  Pre-coffee me was
wondering why we'd bother calling filemap_report_wb_err in the
XFS_FORCED_SHUTDOWN case, then remembered that it touches data
structures.

The first sentence of the commit message (really, the word 'advance')
added as a comment was adequate to remind me of the side effects.
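
i.e. something as small as this would do:

out:
        /* note: this also advances the struct file's errseq_t cursor */
        err2 = filemap_report_wb_err(file);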

Once that's added,
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>

--D

> + if (!error)
> + error = err2;
> +
>   return error;
>  }
>  
> -- 
> 2.13.0
> 


Re: [PATCH 1/9] fs: add fcntl() interface for setting/getting write life time hints

2017-06-26 Thread Darrick J. Wong
On Mon, Jun 26, 2017 at 07:55:27AM -0600, Jens Axboe wrote:
> On 06/26/2017 03:51 AM, Christoph Hellwig wrote:
> > Please document the userspace API (added linux-api and linux-man
> > to CC for sugestions), especially including the odd effects of the
> > per-inode settings.
> 
> Of course, I'll send in a diff for the fcntl(2) man page.
> 
> > Also I would highly recommend to use different fcntl commands
> > for the file vs inode hints to avoid any strange behavior.
> 
> OK, used to have that too... I can add specific _FILE versions.

While you're at it, can you also send in an xfstest or two to check the
basic functionality of the fcntl so that we know the code reflects the
userspace API ("I set this hint and now I can query it back" and "file
hint overrides inode hint") that we want?
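
Roughly this shape, using whichever command names the series settles on:

        uint64_t hint = RWH_WRITE_LIFE_SHORT;

        fcntl(fd, F_SET_RW_HINT, &hint);        /* set the inode hint */
        hint = 0;
        fcntl(fd, F_GET_RW_HINT, &hint);        /* expect WRITE_LIFE_SHORT back */
        /* then set a different per-file hint and verify it wins over the inode's */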

--D

> 
> -- 
> Jens Axboe
> 


Re: [PATCH v7 21/22] xfs: minimal conversion to errseq_t writeback error reporting

2017-06-26 Thread Darrick J. Wong
On Mon, Jun 26, 2017 at 01:58:32PM -0400, jlay...@redhat.com wrote:
> On Mon, 2017-06-26 at 08:22 -0700, Darrick J. Wong wrote:
> > On Fri, Jun 16, 2017 at 03:34:26PM -0400, Jeff Layton wrote:
> > > Just check and advance the data errseq_t in struct file
> > > before returning from fsync on normal files. Internal filemap_*
> > > callers are left as-is.
> > > 
> > > Signed-off-by: Jeff Layton <jlay...@redhat.com>
> > > ---
> > >  fs/xfs/xfs_file.c | 15 +++
> > >  1 file changed, 11 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 5fb5a0958a14..bc3b1575e8db 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -134,7 +134,7 @@ xfs_file_fsync(
> > >   struct inode*inode = file->f_mapping->host;
> > >   struct xfs_inode*ip = XFS_I(inode);
> > >   struct xfs_mount*mp = ip->i_mount;
> > > - int error = 0;
> > > + int error = 0, err2;
> > >   int log_flushed = 0;
> > >   xfs_lsn_t   lsn = 0;
> > >  
> > > @@ -142,10 +142,12 @@ xfs_file_fsync(
> > >  
> > >   error = filemap_write_and_wait_range(inode->i_mapping,
> > > start, end);
> > >   if (error)
> > > - return error;
> > > + goto out;
> > >  
> > > - if (XFS_FORCED_SHUTDOWN(mp))
> > > - return -EIO;
> > > + if (XFS_FORCED_SHUTDOWN(mp)) {
> > > + error = -EIO;
> > > + goto out;
> > > + }
> > >  
> > >   xfs_iflags_clear(ip, XFS_ITRUNCATED);
> > >  
> > > @@ -197,6 +199,11 @@ xfs_file_fsync(
> > >   mp->m_logdev_targp == mp->m_ddev_targp)
> > >   xfs_blkdev_issue_flush(mp->m_ddev_targp);
> > >  
> > > +out:
> > > + err2 = filemap_report_wb_err(file);
> > 
> > Could we have a comment here to remind anyone reading the code a year
> > from now that filemap_report_wb_err has side effects?  Pre-coffee me
> > was
> > wondering why we'd bother calling filemap_report_wb_err in the
> > XFS_FORCED_SHUTDOWN case, then remembered that it touches data
> > structures.
> > 
> > The first sentence of the commit message (really, the word 'advance')
> > added as a comment was adequate to remind me of the side effects.
> > 
> > Once that's added,
> > Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>
> > 
> > --D
> > 
> 
> Yeah, definitely. I'm working on a respin of the series now to
> incorporate HCH's suggestion too. I'll add that in as well.
> 
> Maybe I should rename that function to file_check_and_advance_wb_err()
> ? It would be good to make it clear that it does advance the errseq_t
> cursor.

Seems like a good idea.

--D

> 
> > > + if (!error)
> > > + error = err2;
> > > +
> > >   return error;
> > >  }
> > >  
> > > -- 
> > > 2.13.0
> > > 


Re: [xfstests PATCH v3 1/5] generic: add a writeback error handling test

2017-06-06 Thread Darrick J. Wong
On Tue, Jun 06, 2017 at 04:12:58PM -0400, Jeff Layton wrote:
> On Tue, 2017-06-06 at 10:17 -0700, Darrick J. Wong wrote:
> > On Tue, Jun 06, 2017 at 08:23:25PM +0800, Eryu Guan wrote:
> > > On Tue, Jun 06, 2017 at 06:15:57AM -0400, Jeff Layton wrote:
> > > > On Tue, 2017-06-06 at 16:58 +0800, Eryu Guan wrote:
> > > > > On Wed, May 31, 2017 at 09:08:16AM -0400, Jeff Layton wrote:
> > > > > > I'm working on a set of kernel patches to change how writeback 
> > > > > > errors
> > > > > > are handled and reported in the kernel. Instead of reporting a
> > > > > > writeback error to only the first fsync caller on the file, I aim
> > > > > > to make the kernel report them once on every file description.
> > > > > > 
> > > > > > This patch adds a test for the new behavior. Basically, open many 
> > > > > > fds
> > > > > > to the same file, turn on dm_error, write to each of the fds, and 
> > > > > > then
> > > > > > fsync them all to ensure that they all get an error back.
> > > > > > 
> > > > > > To do that, I'm adding a new tools/dmerror script that the C program
> > > > > > can use to load the error table. For now, that's all it can do, but
> > > > > > we can fill it out with other commands as necessary.
> > > > > > 
> > > > > > Signed-off-by: Jeff Layton <jlay...@redhat.com>
> > > > > 
> > > > > Thanks for the new tests! And sorry for the late review..
> > > > > 
> > > > > It's testing a new behavior on error reporting on writeback, I'm not
> > > > > sure if we can call it a new feature or it fixed a bug? But it's more
> > > > > like a behavior change, I'm not sure how to categorize it.
> > > > > 
> > > > > Because if it's testing a new feature, we usually let test do proper
> > > > > detection of current test environment (based on actual behavior not
> > > > > kernel version) and _notrun on filesystems that don't have this 
> > > > > feature
> > > > > yet, instead of failing the test; if it's testing a bug fix, we could
> > > > > leave the test fail on unfixed filesystems, this also serves as a
> > > > > reminder that there's bug to fix.
> > > > > 
> > > > 
> > > > Thanks for the review! I'm not sure how to categorize this either. Since
> > > > the plan is to convert all the filesystems piecemeal, maybe we should
> > > > just consider it a new feature.
> > > 
> > > Then we need a new _require rule to properly detect for the 'feature'
> > > support. I'm not sure if this is doable, but something like
> > > _require_statx, _require_seek_data_hole would be good.
> > > 
> > > > 
> > > > > I pulled your test kernel tree, and test passed on EXT4 but failed on
> > > > > other local filesystems (XFS, btrfs). I assume that's expected.
> > > > > 
> > > > > Besides this kinda high-level question, some minor comments inline.
> > > > > 
> > > > 
> > > > Yes, ext4 should pass on my latest kernel tree, but everything else
> > > > should fail. 
> 
> Oh, and I should mention that ext2/3 also pass when mounted using ext4
> driver. Legacy ext2 driver sort of works, but it reports a few too many
> errors because of the way the ext2_error macro works. That shouldn't be
> too hard to fix, I just need some guidance on that one.
> 
> I had xfs and btrfs working with an earlier iteration of the patches,
> but now that we're converting a fs at a time, it's a little more work to
> get there. It shouldn't be too hard to do though. I'll probably re-post
> in a few days, and will try to take a stab at XFS and btrfs conversion
> too.
> 
> > > 
> > > With the new _require rule, test should _notrun on XFS and btrfs then.
> > 
> > Frankly I personally prefer that upstream XFS fails until someone fixes it. 
> > :)
> > (But that's just my opinion.)
> > 
> > That said, I'm not 100% sure what's required of XFS to play nicely with
> > this new mechanism -- glancing at the ext* patches it looks like we'd
> > need to set a fs flag and possibly change some or all of the "write
> > cached dirty buffers out to disk" calls to their _since variants?
> 
> Yeah, that's pretty much the size of it.
> 
> In fact, the latter part (changing to 

Re: [PATCH 04/11] fs: add support for allowing applications to pass in write life time hints

2017-06-14 Thread Darrick J. Wong
On Wed, Jun 14, 2017 at 09:45:05PM -0600, Jens Axboe wrote:
> Add four flags for the pwritev2(2) system call, allowing an application
> to give the kernel a hint about what on-media life times can be
> expected from a given write.
> 
> The intent is for these values to be relative to each other, no
> absolute meaning should be attached to these flag names.
> 
> Define IOCB flags to carry this over this information from the pwritev2
> RWF_WRITE_LIFE_* flags.
> 
> Reviewed-by: Andreas Dilger 
> Signed-off-by: Jens Axboe 
> ---
>  fs/read_write.c |  9 -
>  include/linux/fs.h  | 12 
>  include/uapi/linux/fs.h | 10 ++
>  3 files changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 47c1d4484df9..9cb2314efca3 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -678,7 +678,7 @@ static ssize_t do_iter_readv_writev(struct file *filp, 
> struct iov_iter *iter,
>   struct kiocb kiocb;
>   ssize_t ret;
>  
> - if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC))
> + if (flags & ~(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_WRITE_LIFE_MASK))
>   return -EOPNOTSUPP;
>  
>   init_sync_kiocb(&kiocb, filp);
> @@ -688,6 +688,13 @@ static ssize_t do_iter_readv_writev(struct file *filp, 
> struct iov_iter *iter,
>   kiocb.ki_flags |= IOCB_DSYNC;
>   if (flags & RWF_SYNC)
>   kiocb.ki_flags |= (IOCB_DSYNC | IOCB_SYNC);
> + if (flags & RWF_WRITE_LIFE_MASK) {
> + struct inode *inode = file_inode(filp);
> +
> + inode->i_write_hint = (flags & RWF_WRITE_LIFE_MASK) >>
> + RWF_WRITE_LIFE_SHIFT;

Hmm, so once set, hints stick around until someone else sets a different
one.  I suppose it's unlikely that you'd have two programs writing to
the same inode with different write hints, right?

Just wondering if anyone will be surprised that they thought they were
writing to an _EXTREME hint file but someone else switched it to _SHORT
on them.

Also, how does userspace query the write hint value once set?
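To make the concern concrete (a fragment using the RWF_WRITE_LIFE_* flags
from this patch; fd_a and fd_b are hypothetical descriptors for the same
inode):

	/* program A tags its writes as long lived... */
	pwritev2(fd_a, &iov_a, 1, 0, RWF_WRITE_LIFE_EXTREME);

	/* ...then program B comes along and rewrites i_write_hint */
	pwritev2(fd_b, &iov_b, 1, 0, RWF_WRITE_LIFE_SHORT);

	/*
	 * From here on the inode says SHORT, program A still believes
	 * EXTREME, and neither of them can query the current hint.
	 */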

> + kiocb.ki_flags |= inode->i_write_hint << IOCB_WRITE_LIFE_SHIFT;
> + }
>   kiocb.ki_pos = *ppos;
>  
>   if (type == READ)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f4f9df8ed059..63798b67fcfe 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -269,6 +269,12 @@ struct writeback_control;
>  #define IOCB_SYNC	(1 << 5)
>  #define IOCB_WRITE   (1 << 6)
>  
> +/*
> + * Steal 4-bits for stream information, this allows 16 valid streams
> + */
> +#define IOCB_WRITE_LIFE_SHIFT	7
> +#define IOCB_WRITE_LIFE_MASK (BIT(7) | BIT(8) | BIT(9) | BIT(10))
> +
>  struct kiocb {
>   struct file *ki_filp;
>   loff_t  ki_pos;
> @@ -292,6 +298,12 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, 
> struct file *filp)
>   };
>  }
>  
> +static inline int iocb_write_hint(const struct kiocb *iocb)
> +{
> + return (iocb->ki_flags & IOCB_WRITE_LIFE_MASK) >>
> + IOCB_WRITE_LIFE_SHIFT;
> +}
> +
>  /*
>   * "descriptor" for what we're up to with a read.
>   * This allows us to use the same read code yet
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 24e61a54feaa..58b7ee06b380 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -361,4 +361,14 @@ struct fscrypt_key {
>  #define RWF_DSYNC	0x0002 /* per-IO O_DSYNC */
>  #define RWF_SYNC 0x0004 /* per-IO O_SYNC */
>  
> +/*
> + * Data life time write flags, steal 4 bits for that
> + */
> +#define RWF_WRITE_LIFE_SHIFT 4
> +#define RWF_WRITE_LIFE_MASK  0x00f0 /* 4 bits of stream ID */
> +#define RWF_WRITE_LIFE_SHORT (1 << RWF_WRITE_LIFE_SHIFT)
> +#define RWF_WRITE_LIFE_MEDIUM	(2 << RWF_WRITE_LIFE_SHIFT)
> +#define RWF_WRITE_LIFE_LONG  (3 << RWF_WRITE_LIFE_SHIFT)
> +#define RWF_WRITE_LIFE_EXTREME   (4 << RWF_WRITE_LIFE_SHIFT)

Should O_TMPFILE files be created with i_write_hint =
RWF_WRITE_LIFE_SHORT by default?
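What I mean is roughly this, somewhere in the tmpfile creation path (just a
sketch; note that per this patch i_write_hint stores the already-shifted
value):

	/* tmpfiles are short-lived almost by definition */
	inode->i_write_hint = RWF_WRITE_LIFE_SHORT >> RWF_WRITE_LIFE_SHIFT;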

--D

> +
>  #endif /* _UAPI_LINUX_FS_H */
> -- 
> 2.7.4
> 


Re: [PATCH v6 19/20] xfs: minimal conversion to errseq_t writeback error reporting

2017-06-12 Thread Darrick J. Wong
On Mon, Jun 12, 2017 at 08:23:15AM -0400, Jeff Layton wrote:
> Just set the FS_WB_ERRSEQ flag to indicate that we want to use errseq_t
> based error reporting. Internal filemap_* calls are left as-is for now.
> 
> Signed-off-by: Jeff Layton 
> ---
>  fs/xfs/xfs_super.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 455a575f101d..28d3be187025 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1758,7 +1758,7 @@ static struct file_system_type xfs_fs_type = {
>   .name   = "xfs",
>   .mount  = xfs_fs_mount,
>   .kill_sb= kill_block_super,
> - .fs_flags   = FS_REQUIRES_DEV,
> + .fs_flags   = FS_REQUIRES_DEV | FS_WB_ERRSEQ,

Huh?  Why are there two patches with the same subject line?  And this
same bit of code too?  Or ... 11/13, 11/20?  What's going on here?



--D

>  };
>  MODULE_ALIAS_FS("xfs");
>  
> -- 
> 2.13.0
> 


Re: [xfstests PATCH v3 1/5] generic: add a writeback error handling test

2017-06-06 Thread Darrick J. Wong
On Tue, Jun 06, 2017 at 08:23:25PM +0800, Eryu Guan wrote:
> On Tue, Jun 06, 2017 at 06:15:57AM -0400, Jeff Layton wrote:
> > On Tue, 2017-06-06 at 16:58 +0800, Eryu Guan wrote:
> > > On Wed, May 31, 2017 at 09:08:16AM -0400, Jeff Layton wrote:
> > > > I'm working on a set of kernel patches to change how writeback errors
> > > > are handled and reported in the kernel. Instead of reporting a
> > > > writeback error to only the first fsync caller on the file, I aim
> > > > to make the kernel report them once on every file description.
> > > > 
> > > > This patch adds a test for the new behavior. Basically, open many fds
> > > > to the same file, turn on dm_error, write to each of the fds, and then
> > > > fsync them all to ensure that they all get an error back.
> > > > 
> > > > To do that, I'm adding a new tools/dmerror script that the C program
> > > > can use to load the error table. For now, that's all it can do, but
> > > > we can fill it out with other commands as necessary.
> > > > 
> > > > Signed-off-by: Jeff Layton 
> > > 
> > > Thanks for the new tests! And sorry for the late review..
> > > 
> > > It's testing a new behavior on error reporting on writeback, I'm not
> > > sure if we can call it a new feature or it fixed a bug? But it's more
> > > like a behavior change, I'm not sure how to categorize it.
> > > 
> > > Because if it's testing a new feature, we usually let test do proper
> > > detection of current test environment (based on actual behavior not
> > > kernel version) and _notrun on filesystems that don't have this feature
> > > yet, instead of failing the test; if it's testing a bug fix, we could
> > > leave the test fail on unfixed filesystems, this also serves as a
> > > reminder that there's bug to fix.
> > > 
> > 
> > Thanks for the review! I'm not sure how to categorize this either. Since
> > the plan is to convert all the filesystems piecemeal, maybe we should
> > just consider it a new feature.
> 
> Then we need a new _require rule to properly detect for the 'feature'
> support. I'm not sure if this is doable, but something like
> _require_statx, _require_seek_data_hole would be good.
> 
> > 
> > > I pulled your test kernel tree, and test passed on EXT4 but failed on
> > > other local filesystems (XFS, btrfs). I assume that's expected.
> > > 
> > > Besides this kinda high-level question, some minor comments inline.
> > > 
> > 
> > Yes, ext4 should pass on my latest kernel tree, but everything else
> > should fail. 
> 
> With the new _require rule, test should _notrun on XFS and btrfs then.

Frankly I personally prefer that upstream XFS fails until someone fixes it. :)
(But that's just my opinion.)

That said, I'm not 100% sure what's required of XFS to play nicely with
this new mechanism -- glancing at the ext* patches it looks like we'd
need to set a fs flag and possibly change some or all of the "write
cached dirty buffers out to disk" calls to their _since variants?
Metadata writeback errors are handled by retrying writes and/or shutting
down the fs, so I think the f_md_wb_error case is already covered.

That said, I agree that it's useful to detect that the kernel simply
lacks any of the new wb error reporting at all, so we can skip
the tests.

--D

> 
> > 
> > > > ---
> > > >  common/dmerror |  13 ++--
> > > >  doc/auxiliary-programs.txt |   8 +++
> > > >  src/Makefile   |   2 +-
> > > >  src/fsync-err.c| 161 
> > > > +
> > > 
> > > New binary needs an entry in .gitignore file.
> > > 
> > 
> > OK, thanks. Will fix.
> > 
> > > >  tests/generic/999  |  76 +
> > > >  tests/generic/999.out  |   3 +
> > > >  tests/generic/group|   1 +
> > > >  tools/dmerror  |  44 +
> > > 
> > > This file is used by the test, then it should be in src/ directory and
> > > be installed along with other executable files on "make install".
> > > Because files under tools/ are not installed. Most people will run tests
> > > in the root dir of xfstests and this is not a problem, but there're
> > > still cases people do "make && make install" and run fstests from
> > > /var/lib/xfstests (default installation target).
> > > 
> > 
> > Ok, no problem. I'll move it. I wasn't sure here since dmerror is a
> > shell script, and most of the stuff in src/ is stuff that needs to be
> > built.
> 
> We do have a few perl or shell scripts in src/ dir, we can see this from
> src/Makefile
> 
> $(LTINSTALL) -m 755 fill2attr fill2fs fill2fs_check scaleread.sh 
> $(PKG_LIB_DIR)/src
> 
> >  
> > > >  8 files changed, 302 insertions(+), 6 deletions(-)
> > > >  create mode 100644 src/fsync-err.c
> > > >  create mode 100755 tests/generic/999
> > > >  create mode 100644 tests/generic/999.out
> > > >  create mode 100755 tools/dmerror
> > > > 
> > > > diff --git a/common/dmerror b/common/dmerror
> > > > index d46c5d0b7266..238baa213b1f 100644
> > > > --- 

Re: [PATCH 3/4] blk-wbt: pass in enum wbt_flags to get_rq_wait()

2018-05-07 Thread Darrick J. Wong
On Mon, May 07, 2018 at 10:13:34AM -0600, Jens Axboe wrote:
> This is in preparation for having more write queues, in which
> case we would have needed to pass in more information than just
> a simple 'is_kswapd' boolean.
> 
> Signed-off-by: Jens Axboe <ax...@kernel.dk>

Looks ok (having not run any kind of testing on this yet),
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>

--D

> ---
>  block/blk-wbt.c | 25 +++--
>  block/blk-wbt.h |  4 +++-
>  2 files changed, 18 insertions(+), 11 deletions(-)
> 
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index 3e34b41bcefc..25d202345965 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -101,9 +101,13 @@ static bool wb_recent_wait(struct rq_wb *rwb)
>   return time_before(jiffies, wb->dirty_sleep + HZ);
>  }
>  
> -static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb, bool is_kswapd)
> +static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb,
> +   enum wbt_flags wb_acct)
>  {
> - return &rwb->rq_wait[is_kswapd];
> + if (wb_acct & WBT_KSWAPD)
> + return &rwb->rq_wait[WBT_RWQ_KSWAPD];
> +
> + return &rwb->rq_wait[WBT_RWQ_BG];
>  }
>  
>  static void rwb_wake_all(struct rq_wb *rwb)
> @@ -126,7 +130,7 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags wb_acct)
>   if (!(wb_acct & WBT_TRACKED))
>   return;
>  
> - rqw = get_rq_wait(rwb, wb_acct & WBT_KSWAPD);
> + rqw = get_rq_wait(rwb, wb_acct);
>   inflight = atomic_dec_return(&rqw->inflight);
>  
>   /*
> @@ -529,11 +533,12 @@ static inline bool may_queue(struct rq_wb *rwb, struct 
> rq_wait *rqw,
>   * Block if we will exceed our limit, or if we are currently waiting for
>   * the timer to kick off queuing again.
>   */
> -static void __wbt_wait(struct rq_wb *rwb, unsigned long rw, spinlock_t *lock)
> +static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
> +unsigned long rw, spinlock_t *lock)
>   __releases(lock)
>   __acquires(lock)
>  {
> - struct rq_wait *rqw = get_rq_wait(rwb, current_is_kswapd());
> + struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
>   DEFINE_WAIT(wait);
>  
>   if (may_queue(rwb, rqw, &wait, rw))
> @@ -584,7 +589,7 @@ static inline bool wbt_should_throttle(struct rq_wb *rwb, 
> struct bio *bio)
>   */
>  enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio *bio, spinlock_t *lock)
>  {
> - unsigned int ret = 0;
> + enum wbt_flags ret = 0;
>  
>   if (!rwb_enabled(rwb))
>   return 0;
> @@ -598,14 +603,14 @@ enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio 
> *bio, spinlock_t *lock)
>   return ret;
>   }
>  
> - __wbt_wait(rwb, bio->bi_opf, lock);
> + if (current_is_kswapd())
> + ret |= WBT_KSWAPD;
> +
> + __wbt_wait(rwb, ret, bio->bi_opf, lock);
>  
>   if (!blk_stat_is_active(rwb->cb))
>   rwb_arm_timer(rwb);
>  
> - if (current_is_kswapd())
> - ret |= WBT_KSWAPD;
> -
>   return ret | WBT_TRACKED;
>  }
>  
> diff --git a/block/blk-wbt.h b/block/blk-wbt.h
> index a232c98fbf4d..8038b4a0d4ef 100644
> --- a/block/blk-wbt.h
> +++ b/block/blk-wbt.h
> @@ -19,7 +19,9 @@ enum wbt_flags {
>  };
>  
>  enum {
> - WBT_NUM_RWQ = 2,
> + WBT_RWQ_BG  = 0,
> + WBT_RWQ_KSWAPD,
> + WBT_NUM_RWQ,
>  };
>  
>  /*
> -- 
> 2.7.4
> 


Re: [PATCH 1/4] block: break discard submissions into the user defined size

2018-05-07 Thread Darrick J. Wong
On Mon, May 07, 2018 at 10:13:32AM -0600, Jens Axboe wrote:
> Don't build discards bigger than what the user asked for, if the
> user decided to limit the size by writing to 'discard_max_bytes'.
> 
> Signed-off-by: Jens Axboe <ax...@kernel.dk>

Seems fine to me,
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>

--D

> ---
>  block/blk-lib.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index a676084d4740..7417d617091b 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -62,10 +62,11 @@ int __blkdev_issue_discard(struct block_device *bdev, 
> sector_t sector,
>   unsigned int req_sects;
>   sector_t end_sect, tmp;
>  
> - /* Make sure bi_size doesn't overflow */
> - req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);
> + /* Issue in chunks of the user defined max discard setting */
> + req_sects = min_t(sector_t, nr_sects,
> + q->limits.max_discard_sectors);
>  
> - /**
> + /*
>* If splitting a request, and the next starting sector would be
>* misaligned, stop the discard at the previous aligned sector.
>*/
> -- 
> 2.7.4
> 


Re: [PATCH 2/4] blk-wbt: account any writing command as a write

2018-05-07 Thread Darrick J. Wong
On Mon, May 07, 2018 at 10:13:33AM -0600, Jens Axboe wrote:
> We currently special case WRITE and FLUSH, but we should really
> just include any command with the write bit set. This ensures
> that we account DISCARD.
> 
> Reviewed-by: Christoph Hellwig <h...@lst.de>
> Signed-off-by: Jens Axboe <ax...@kernel.dk>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>

--D

> ---
>  block/blk-wbt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index f92fc84b5e2c..3e34b41bcefc 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -701,7 +701,7 @@ static int wbt_data_dir(const struct request *rq)
>  
>   if (op == REQ_OP_READ)
>   return READ;
> - else if (op == REQ_OP_WRITE || op == REQ_OP_FLUSH)
> + else if (op_is_write(op))
>   return WRITE;
>  
>   /* don't account */
> -- 
> 2.7.4
> 


Re: [PATCH 4/4] blk-wbt: throttle discards like background writes

2018-05-07 Thread Darrick J. Wong
On Mon, May 07, 2018 at 10:13:35AM -0600, Jens Axboe wrote:
> Throttle discards like we would any background write. Discards should
> be background activity, so if they are impacting foreground IO, then
> we will throttle them down.
> 
> Signed-off-by: Jens Axboe <ax...@kernel.dk>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>

--D

> ---
>  block/blk-stat.h |  6 +++---
>  block/blk-wbt.c  | 43 ++-
>  block/blk-wbt.h  |  4 +++-
>  3 files changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/block/blk-stat.h b/block/blk-stat.h
> index 2dd36347252a..c22049a8125e 100644
> --- a/block/blk-stat.h
> +++ b/block/blk-stat.h
> @@ -10,11 +10,11 @@
>  
>  /*
>   * from upper:
> - * 3 bits: reserved for other usage
> + * 4 bits: reserved for other usage
>   * 12 bits: size
> - * 49 bits: time
> + * 48 bits: time
>   */
> -#define BLK_STAT_RES_BITS	3
> +#define BLK_STAT_RES_BITS	4
>  #define BLK_STAT_SIZE_BITS   12
>  #define BLK_STAT_RES_SHIFT   (64 - BLK_STAT_RES_BITS)
>  #define BLK_STAT_SIZE_SHIFT  (BLK_STAT_RES_SHIFT - BLK_STAT_SIZE_BITS)
> diff --git a/block/blk-wbt.c b/block/blk-wbt.c
> index 25d202345965..a7a724580033 100644
> --- a/block/blk-wbt.c
> +++ b/block/blk-wbt.c
> @@ -106,6 +106,8 @@ static inline struct rq_wait *get_rq_wait(struct rq_wb 
> *rwb,
>  {
>   if (wb_acct & WBT_KSWAPD)
>   return &rwb->rq_wait[WBT_RWQ_KSWAPD];
> + else if (wb_acct & WBT_DISCARD)
> + return &rwb->rq_wait[WBT_RWQ_DISCARD];
>  
>   return &rwb->rq_wait[WBT_RWQ_BG];
>  }
> @@ -143,10 +145,13 @@ void __wbt_done(struct rq_wb *rwb, enum wbt_flags 
> wb_acct)
>   }
>  
>   /*
> -  * If the device does write back caching, drop further down
> -  * before we wake people up.
> +  * For discards, our limit is always the background. For writes, if
> +  * the device does write back caching, drop further down before we
> +  * wake people up.
>*/
> - if (rwb->wc && !wb_recent_wait(rwb))
> + if (wb_acct & WBT_DISCARD)
> + limit = rwb->wb_background;
> + else if (rwb->wc && !wb_recent_wait(rwb))
>   limit = 0;
>   else
>   limit = rwb->wb_normal;
> @@ -483,6 +488,9 @@ static inline unsigned int get_limit(struct rq_wb *rwb, 
> unsigned long rw)
>  {
>   unsigned int limit;
>  
> + if ((rw & REQ_OP_MASK) == REQ_OP_DISCARD)
> + return rwb->wb_background;
> +
>   /*
>* At this point we know it's a buffered write. If this is
>* kswapd trying to free memory, or REQ_SYNC is set, then
> @@ -564,21 +572,20 @@ static void __wbt_wait(struct rq_wb *rwb, enum 
> wbt_flags wb_acct,
>  
>  static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
>  {
> - const int op = bio_op(bio);
> -
> - /*
> -  * If not a WRITE, do nothing
> -  */
> - if (op != REQ_OP_WRITE)
> - return false;
> -
> - /*
> -  * Don't throttle WRITE_ODIRECT
> -  */
> - if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) == (REQ_SYNC | REQ_IDLE))
> + switch (bio_op(bio)) {
> + case REQ_OP_WRITE:
> + /*
> +  * Don't throttle WRITE_ODIRECT
> +  */
> + if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
> + (REQ_SYNC | REQ_IDLE))
> + return false;
> + /* fallthrough */
> + case REQ_OP_DISCARD:
> + return true;
> + default:
>   return false;
> -
> - return true;
> + }
>  }
>  
>  /*
> @@ -605,6 +612,8 @@ enum wbt_flags wbt_wait(struct rq_wb *rwb, struct bio 
> *bio, spinlock_t *lock)
>  
>   if (current_is_kswapd())
>   ret |= WBT_KSWAPD;
> + if (bio_op(bio) == REQ_OP_DISCARD)
> + ret |= WBT_DISCARD;
>  
>   __wbt_wait(rwb, ret, bio->bi_opf, lock);
>  
> diff --git a/block/blk-wbt.h b/block/blk-wbt.h
> index 8038b4a0d4ef..d6a125e49db5 100644
> --- a/block/blk-wbt.h
> +++ b/block/blk-wbt.h
> @@ -14,13 +14,15 @@ enum wbt_flags {
>   WBT_TRACKED = 1,/* write, tracked for throttling */
>   WBT_READ= 2,/* read */
>   WBT_KSWAPD  = 4,/* write, from kswapd */
> + WBT_DISCARD = 8,/* discard */
>  
> - WBT_NR_BITS = 3,/* number of bits */
> + WBT_NR_BITS = 4,/* number of bits */
>  };
>  
>  enum {
>   WBT_RWQ_BG  = 0,
>   WBT_RWQ_KSWAPD,
> + WBT_RWQ_DISCARD,
>   WBT_NUM_RWQ,
>  };
>  
> -- 
> 2.7.4
> 


Re: [PATCH 10/33] iomap: add an iomap-based bmap implementation

2018-05-09 Thread Darrick J. Wong
On Wed, May 09, 2018 at 09:48:07AM +0200, Christoph Hellwig wrote:
> This adds a simple iomap-based implementation of the legacy ->bmap
> interface.  Note that we can't easily add checks for rt or reflink
> files, so these will have to remain in the callers.  This interface
> just needs to die..

You /can/ check these...

if (iomap->bdev != inode->i_sb->s_bdev)
return 0;
if (iomap->flags & IOMAP_F_SHARED)
return 0;
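(The first check catches realtime files, whose data blocks live on a
different block device than the one ->bmap callers assume; the second
catches extents that are currently shared via reflink.)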

> Signed-off-by: Christoph Hellwig 
> ---
>  fs/iomap.c| 29 +
>  include/linux/iomap.h |  3 +++
>  2 files changed, 32 insertions(+)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index af525cb47339..049e0c4aacac 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1201,3 +1201,32 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>   return ret;
>  }
>  EXPORT_SYMBOL_GPL(iomap_dio_rw);
> +
> +static loff_t
> +iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
> + void *data, struct iomap *iomap)
> +{
> + sector_t *bno = data;
> +
> + if (iomap->type == IOMAP_MAPPED)
> + *bno = (iomap->addr + pos - iomap->offset) >> inode->i_blkbits;

Does this need to be careful w.r.t. overflow on systems where sector_t
is a 32-bit unsigned long?

Also, ioctl_fibmap() typecasts the returned sector_t to an int, which
also seems broken.  I agree the interface needs to die, but ioctls take
a long time to deprecate.

--D

> + return 0;
> +}
> +
> +/* legacy ->bmap interface.  0 is the error return (!) */
> +sector_t
> +iomap_bmap(struct address_space *mapping, sector_t bno,
> + const struct iomap_ops *ops)
> +{
> + struct inode *inode = mapping->host;
> + loff_t pos = bno >> inode->i_blkbits;
> + unsigned blocksize = i_blocksize(inode);
> +
> + if (filemap_write_and_wait(mapping))
> + return 0;
> +
> + bno = 0;
> + iomap_apply(inode, pos, blocksize, 0, ops, &bno, iomap_bmap_actor);
> + return bno;
> +}
> +EXPORT_SYMBOL_GPL(iomap_bmap);
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 19a07de28212..07f73224c38b 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -4,6 +4,7 @@
>  
>  #include 
>  
> +struct address_space;
>  struct fiemap_extent_info;
>  struct inode;
>  struct iov_iter;
> @@ -95,6 +96,8 @@ loff_t iomap_seek_hole(struct inode *inode, loff_t offset,
>   const struct iomap_ops *ops);
>  loff_t iomap_seek_data(struct inode *inode, loff_t offset,
>   const struct iomap_ops *ops);
> +sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
> + const struct iomap_ops *ops);
>  
>  /*
>   * Flags for direct I/O ->end_io:
> -- 
> 2.17.0
> 


Re: [PATCH 10/33] iomap: add an iomap-based bmap implementation

2018-05-10 Thread Darrick J. Wong
On Thu, May 10, 2018 at 08:42:50AM +0200, Christoph Hellwig wrote:
> On Wed, May 09, 2018 at 09:46:28AM -0700, Darrick J. Wong wrote:
> > On Wed, May 09, 2018 at 09:48:07AM +0200, Christoph Hellwig wrote:
> > > This adds a simple iomap-based implementation of the legacy ->bmap
> > > interface.  Note that we can't easily add checks for rt or reflink
> > > files, so these will have to remain in the callers.  This interface
> > > just needs to die..
> > 
> > You /can/ check these...
> > 
> > if (iomap->bdev != inode->i_sb->s_bdev)
> > return 0;
> > if (iomap->flags & IOMAP_F_SHARED)
> > return 0;
> 
> The latter only checks for a shared extent, not a file with possibly
> shared extents.  I'd rather keep the check for a file with possible
> shared extents.



> > > +static loff_t
> > > +iomap_bmap_actor(struct inode *inode, loff_t pos, loff_t length,
> > > + void *data, struct iomap *iomap)
> > > +{
> > > + sector_t *bno = data;
> > > +
> > > + if (iomap->type == IOMAP_MAPPED)
> > > + *bno = (iomap->addr + pos - iomap->offset) >> inode->i_blkbits;
> > 
> > Does this need to be careful w.r.t. overflow on systems where sector_t
> > is a 32-bit unsigned long?
> > 
> > Also, ioctl_fibmap() typecasts the returned sector_t to an int, which
> > also seems broken.  I agree the interface needs to die, but ioctls take
> > a long time to deprecate.
> 
> Not much we can do about the interface.

Yes, the interface is fubar, but if file /foo maps to block 8589934720
then do we return the truncated result 128?
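(Spelling out the arithmetic: 8589934720 is 0x200000080, so keeping only
the low 32 bits leaves 0x80 = 128, and the caller merrily goes off and
reads the wrong block.)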

--D


Re: stop using buffer heads in xfs and iomap

2018-05-10 Thread Darrick J. Wong
On Wed, May 09, 2018 at 09:47:57AM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series adds support for reading blocks from disk using the iomap
> interface, and then gradually switched the buffered I/O path to not
> require buffer heads.  It has survived xfstests for 1k and 4k block
> size.
> 
> There are various small changes to the core VFS, block and readahead
> code to make this happen.
> 
> 
> A git tree is available at:
> 
> git://git.infradead.org/users/hch/xfs.git xfs-remove-bufferheads
> 
> Gitweb:
> 
> 
> http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-remove-bufferheads

I ran xfstests on this for fun last night but hung in g/095:

FSTYP -- xfs (debug)
PLATFORM  -- Linux/x86_64 submarine-djwong-mtr01 4.17.0-rc4-djw
MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1, -i sparse=1, -b size=1024, /dev/sdf
MOUNT_OPTIONS -- /dev/sdf /opt

FWIW the stock v4 and the 'v5 with everything and 4k blocks' vms
passed, so I guess there's a bug somewhere in the sub-page block size
code paths...

--D

[ 2586.943205] run fstests generic/095 at 2018-05-09 23:28:01
[ 2587.252740] XFS (sdf): Unmounting Filesystem
[ 2587.908441] XFS (sdf): Mounting V5 Filesystem
[ 2587.914685] XFS (sdf): Ending clean mount
[ 2702.258764] INFO: task kworker/u10:3:11834 blocked for more than 60 seconds.
[ 2702.261734]   Tainted: GW 4.17.0-rc4-djw #2
[ 2702.263607] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2702.265600] kworker/u10:3   D11984 11834  2 0x8000
[ 2702.273445] Workqueue: writeback wb_workfn (flush-8:80)
[ 2702.274751] Call Trace:
[ 2702.275339]  ? __schedule+0x3e4/0xa70
[ 2702.276112]  ? blk_flush_plug_list+0xe4/0x280
[ 2702.277086]  schedule+0x40/0x90
[ 2702.277967]  io_schedule+0x16/0x40
[ 2702.278774]  __lock_page+0x12d/0x160
[ 2702.279680]  ? page_cache_tree_insert+0x100/0x100
[ 2702.280712]  write_cache_pages+0x32c/0x530
[ 2702.281820]  ? xfs_add_to_ioend+0x350/0x350 [xfs]
[ 2702.292350]  xfs_vm_writepages+0x57/0x80 [xfs]
[ 2702.294048]  do_writepages+0x1a/0x70
[ 2702.295068]  __writeback_single_inode+0x59/0x800
[ 2702.296118]  writeback_sb_inodes+0x282/0x550
[ 2702.297039]  __writeback_inodes_wb+0x87/0xb0
[ 2702.298173]  wb_writeback+0x430/0x5d0
[ 2702.299332]  ? wb_workfn+0x448/0x740
[ 2702.300578]  wb_workfn+0x448/0x740
[ 2702.301434]  ? lock_acquire+0xab/0x200
[ 2702.305413]  process_one_work+0x1ef/0x650
[ 2702.306687]  worker_thread+0x4d/0x3e0
[ 2702.307671]  kthread+0x106/0x140
[ 2702.308473]  ? rescuer_thread+0x340/0x340
[ 2702.309442]  ? kthread_delayed_work_timer_fn+0x90/0x90
[ 2702.310995]  ret_from_fork+0x3a/0x50
[ 2702.312088] INFO: task fio:2618 blocked for more than 60 seconds.
[ 2702.313395]   Tainted: GW 4.17.0-rc4-djw #2
[ 2702.315139] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2702.316820] fio D14224  2618   2612 0x
[ 2702.318050] Call Trace:
[ 2702.318757]  ? __schedule+0x3e4/0xa70
[ 2702.319639]  ? rwsem_down_read_failed+0x7f/0x170
[ 2702.320798]  schedule+0x40/0x90
[ 2702.321630]  rwsem_down_read_failed+0x128/0x170
[ 2702.322752]  ? current_time+0x18/0x70
[ 2702.323857]  ? xfs_file_dio_aio_read+0x6d/0x1c0 [xfs]
[ 2702.325162]  ? call_rwsem_down_read_failed+0x14/0x30
[ 2702.326423]  call_rwsem_down_read_failed+0x14/0x30
[ 2702.328393]  ? xfs_ilock+0x28f/0x330 [xfs]
[ 2702.329539]  down_read_nested+0x9d/0xa0
[ 2702.330452]  xfs_ilock+0x28f/0x330 [xfs]
[ 2702.331427]  xfs_file_dio_aio_read+0x6d/0x1c0 [xfs]
[ 2702.332590]  xfs_file_read_iter+0x9a/0xb0 [xfs]
[ 2702.333992]  __vfs_read+0x136/0x1a0
[ 2702.335133]  vfs_read+0xa3/0x150
[ 2702.336129]  ksys_read+0x45/0xa0
[ 2702.337085]  do_syscall_64+0x56/0x180
[ 2702.337985]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2702.339537] RIP: 0033:0x7ff4c152751d
[ 2702.340623] RSP: 002b:7fffa13c93b0 EFLAGS: 0293 ORIG_RAX: 

[ 2702.342774] RAX: ffda RBX: 00a0a2c0 RCX: 7ff4c152751d
[ 2702.344269] RDX: 0400 RSI: 00a17c00 RDI: 0005
[ 2702.345801] RBP: 7ff4a6f57000 R08: 08002000 R09: 0004
[ 2702.347342] R10: 0001 R11: 0293 R12: 
[ 2702.348861] R13: 0400 R14: 00a0a2e8 R15: 7ff4a6f57000
[ 2702.351298] INFO: task fio:2619 blocked for more than 60 seconds.
[ 2702.353158]   Tainted: GW 4.17.0-rc4-djw #2
[ 2702.355103] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 2702.357586] fio D14224  2619   2612 0x
[ 2702.359181] Call Trace:
[ 2702.359815]  ? __schedule+0x3e4/0xa70
[ 2702.360708]  ? rwsem_down_read_failed+0x7f/0x170
[ 2702.361880]  schedule+0x40/0x90
[ 2702.362727]  rwsem_down_read_failed+0x128/0x170
[ 2702.363811]  ? current_time+0x18/0x70
[ 2702.364775]  ? xfs_file_dio_aio_read+0x6d/0x1c0 [xfs]
[ 2702.365994]  ? call_rwsem_down_read_failed+0x14/0x30
[ 

Re: [PATCH 10/33] iomap: add an iomap-based bmap implementation

2018-05-11 Thread Darrick J. Wong
On Fri, May 11, 2018 at 08:25:27AM +0200, Christoph Hellwig wrote:
> On Thu, May 10, 2018 at 08:08:38AM -0700, Darrick J. Wong wrote:
> > > > > + sector_t *bno = data;
> > > > > +
> > > > > + if (iomap->type == IOMAP_MAPPED)
> > > > > + *bno = (iomap->addr + pos - iomap->offset) >> 
> > > > > inode->i_blkbits;
> > > > 
> > > > Does this need to be careful w.r.t. overflow on systems where sector_t
> > > > is a 32-bit unsigned long?
> > > > 
> > > > Also, ioctl_fibmap() typecasts the returned sector_t to an int, which
> > > > also seems broken.  I agree the interface needs to die, but ioctls take
> > > > a long time to deprecate.
> > > 
> > > Not much we can do about the interface.
> > 
> > Yes, the interface is fubar, but if file /foo maps to block 8589934720
> > then do we return the truncated result 128?
> 
> Then we'll get a corrupt result.  What do you think we could do here
> eithere in the old or new code?

I think the only thing we /can/ do is figure out if we'd be truncating
the result, dump a warning to the kernel log, and return 0, because we don't
want smartypants FIBMAP callers to be using crap block pointers.

Something like this for the bmap implementation...

uint64_t mapping = iomap->addr;

#ifndef CONFIG_LBDAF
if (mapping > ULONG_MAX) {
/* Do not truncate results. */
return 0;
}
#endif

...and in the bmap ioctl...

sector_t mapping = ...;

if (mapping > INT_MAX) {
WARN(1, "would truncate bmap result, go fix your stupid program");
return 0;
}

--D



Re: stop using buffer heads in xfs and iomap

2018-05-11 Thread Darrick J. Wong
On Fri, May 11, 2018 at 08:22:08AM +0200, Christoph Hellwig wrote:
> On Thu, May 10, 2018 at 08:13:03AM -0700, Darrick J. Wong wrote:
> > I ran xfstests on this for fun last night but hung in g/095:
> > 
> > FSTYP -- xfs (debug)
> > PLATFORM  -- Linux/x86_64 submarine-djwong-mtr01 4.17.0-rc4-djw
> > MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1, -i sparse=1, -b size=1024, 
> > /dev/sdf
> > MOUNT_OPTIONS -- /dev/sdf /opt
> > 
> > FWIW the stock v4 and the 'v5 with everything and 4k blocks' vms
> > passed, so I guess there's a bug somewhere in the sub-page block size
> > code paths...
> 
> I haven't seen that in my -b size 1024 -m reflink tests.  I'll try your above
> exact setup, too.  Is this disk or SSD?  How much memory and how many
> CPUs?

4 CPUs in a VM on a Nehalem-era machine, 2GB RAM, two 2.3GB virtio-scsi
disks...

...the VM host itself is a quad-core Nehalem, 24G RAM, atop an ext4 fs
on spinning rust in a raid1.

> Btw, I think the series might be worthwhile even if we have to delay
> the sub-page blocksize support - the blocksize == pagesize code is
> basically entirely separate and already very useful.  Only the last
> three patches contain the small blocksize support, without that we'll
> just continue using buffer heads for that case for now.

 I'll keep reading, it seemed generally ok until I hit the
sub-page part and my eyes glazed over. :)

--D



Re: [PATCH 11/12] xfs: convert to bioset_init()/mempool_init()

2018-05-21 Thread Darrick J. Wong
On Sun, May 20, 2018 at 06:25:57PM -0400, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet <kent.overstr...@gmail.com>

Looks ok, I guess...
Acked-by: Darrick J. Wong <darrick.w...@oracle.com>

--D

> ---
>  fs/xfs/xfs_aops.c  |  2 +-
>  fs/xfs/xfs_aops.h  |  2 +-
>  fs/xfs/xfs_super.c | 11 +--
>  3 files changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 0ab824f574..102463543d 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -594,7 +594,7 @@ xfs_alloc_ioend(
>   struct xfs_ioend*ioend;
>   struct bio  *bio;
>  
> - bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, xfs_ioend_bioset);
> + bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, &xfs_ioend_bioset);
>   xfs_init_bio_from_bh(bio, bh);
>  
>   ioend = container_of(bio, struct xfs_ioend, io_inline_bio);
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index 69346d460d..694c85b038 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -18,7 +18,7 @@
>  #ifndef __XFS_AOPS_H__
>  #define __XFS_AOPS_H__
>  
> -extern struct bio_set *xfs_ioend_bioset;
> +extern struct bio_set xfs_ioend_bioset;
>  
>  /*
>   * Types of I/O for bmap clustering and I/O completion tracking.
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index d714240529..f643d76db5 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -63,7 +63,7 @@
>  #include 
>  
>  static const struct super_operations xfs_super_operations;
> -struct bio_set *xfs_ioend_bioset;
> +struct bio_set xfs_ioend_bioset;
>  
>  static struct kset *xfs_kset;/* top-level xfs sysfs dir */
>  #ifdef DEBUG
> @@ -1845,10 +1845,9 @@ MODULE_ALIAS_FS("xfs");
>  STATIC int __init
>  xfs_init_zones(void)
>  {
> - xfs_ioend_bioset = bioset_create(4 * MAX_BUF_PER_PAGE,
> + if (bioset_init(&xfs_ioend_bioset, 4 * MAX_BUF_PER_PAGE,
>   offsetof(struct xfs_ioend, io_inline_bio),
> - BIOSET_NEED_BVECS);
> - if (!xfs_ioend_bioset)
> + BIOSET_NEED_BVECS))
>   goto out;
>  
>   xfs_log_ticket_zone = kmem_zone_init(sizeof(xlog_ticket_t),
> @@ -1997,7 +1996,7 @@ xfs_init_zones(void)
>   out_destroy_log_ticket_zone:
>   kmem_zone_destroy(xfs_log_ticket_zone);
>   out_free_ioend_bioset:
> - bioset_free(xfs_ioend_bioset);
> + bioset_exit(&xfs_ioend_bioset);
>   out:
>   return -ENOMEM;
>  }
> @@ -2029,7 +2028,7 @@ xfs_destroy_zones(void)
>   kmem_zone_destroy(xfs_btree_cur_zone);
>   kmem_zone_destroy(xfs_bmap_free_item_zone);
>   kmem_zone_destroy(xfs_log_ticket_zone);
> - bioset_free(xfs_ioend_bioset);
> + bioset_exit(&xfs_ioend_bioset);
>  }
>  
>  STATIC int __init
> -- 
> 2.17.0
> 


Re: [PATCH 05/34] fs: use ->is_partially_uptodate in page_cache_seek_hole_data

2018-05-21 Thread Darrick J. Wong
On Fri, May 18, 2018 at 06:48:01PM +0200, Christoph Hellwig wrote:
> This way the implementation doesn't depend on buffer_head internals.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/iomap.c | 83 +++---
>  1 file changed, 42 insertions(+), 41 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index bef5e91d40bf..0fecd5789d7b 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -594,31 +594,54 @@ EXPORT_SYMBOL_GPL(iomap_fiemap);
>   *
>   * Returns the offset within the file on success, and -ENOENT otherwise.

This comment is now wrong, since we return the offset via *lastoff and I
think the return value is whether or not we found what we were looking
for...?
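Maybe something like this for the replacement comment (just a sketch):

 * Returns true if the desired type of offset was found, with *lastoff
 * left pointing at it; returns false if this page doesn't contain what
 * we're looking for.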

>   */
> -static loff_t
> -page_seek_hole_data(struct page *page, loff_t lastoff, int whence)
> +static bool
> +page_seek_hole_data(struct inode *inode, struct page *page, loff_t *lastoff,
> + int whence)
>  {
> - loff_t offset = page_offset(page);
> - struct buffer_head *bh, *head;
> + const struct address_space_operations *ops = inode->i_mapping->a_ops;
> + unsigned int bsize = i_blocksize(inode), off;
>   bool seek_data = whence == SEEK_DATA;
> + loff_t poff = page_offset(page);
>  
> - if (lastoff < offset)
> - lastoff = offset;
> -
> - bh = head = page_buffers(page);
> - do {
> - offset += bh->b_size;
> - if (lastoff >= offset)
> - continue;
> + if (WARN_ON_ONCE(*lastoff >= poff + PAGE_SIZE))
> + return false;
>  
> + if (*lastoff < poff) {
>   /*
> -  * Any buffer with valid data in it should have BH_Uptodate set.
> +  * Last offset smaller than the start of the page means we found
> +  * a hole:
>*/
> - if (buffer_uptodate(bh) == seek_data)
> - return lastoff;
> + if (whence == SEEK_HOLE)
> + return true;
> + *lastoff = poff;
> + }
>  
> - lastoff = offset;
> - } while ((bh = bh->b_this_page) != head);
> - return -ENOENT;
> + /*
> +  * Just check the page unless we can and should check block ranges:
> +  */
> + if (bsize == PAGE_SIZE || !ops->is_partially_uptodate) {
> + if (PageUptodate(page) == seek_data)
> + return true;
> + return false;

return PageUptodate(page) == seek_data; ?

--D

> + }
> +
> + lock_page(page);
> + if (unlikely(page->mapping != inode->i_mapping))
> + goto out_unlock_not_found;
> +
> + for (off = 0; off < PAGE_SIZE; off += bsize) {
> + if ((*lastoff & ~PAGE_MASK) >= off + bsize)
> + continue;
> + if (ops->is_partially_uptodate(page, off, bsize) == seek_data) {
> + unlock_page(page);
> + return true;
> + }
> + *lastoff = poff + off + bsize;
> + }
> +
> +out_unlock_not_found:
> + unlock_page(page);
> + return false;
>  }
>  
>  /*
> @@ -655,30 +678,8 @@ page_cache_seek_hole_data(struct inode *inode, loff_t 
> offset, loff_t length,
>   for (i = 0; i < nr_pages; i++) {
>   struct page *page = pvec.pages[i];
>  
> - /*
> -  * At this point, the page may be truncated or
> -  * invalidated (changing page->mapping to NULL), or
> -  * even swizzled back from swapper_space to tmpfs file
> -  * mapping.  However, page->index will not change
> -  * because we have a reference on the page.
> - *
> -  * If current page offset is beyond where we've ended,
> -  * we've found a hole.
> - */
> - if (whence == SEEK_HOLE &&
> - lastoff < page_offset(page))
> + if (page_seek_hole_data(inode, page, &lastoff, whence))
>   goto check_range;
> -
> - lock_page(page);
> - if (likely(page->mapping == inode->i_mapping) &&
> - page_has_buffers(page)) {
> - lastoff = page_seek_hole_data(page, lastoff, 
> whence);
> - if (lastoff >= 0) {
> - unlock_page(page);
> - goto check_range;
> - }
> - }
> - unlock_page(page);
>   lastoff = page_offset(page) + PAGE_SIZE;
>   }
>   pagevec_release(&pvec);
> -- 
> 2.17.0
> 

Re: buffered I/O without buffer heads in xfs and iomap v2

2018-05-21 Thread Darrick J. Wong
On Fri, May 18, 2018 at 06:47:56PM +0200, Christoph Hellwig wrote:
> Hi all,
> 
> this series adds support for buffered I/O without buffer heads to
> the iomap and XFS code.
> 
> For now this series only contains support for block size == PAGE_SIZE,
> with the 4k support split into a separate series.
> 
> 
> A git tree is available at:
> 
> git://git.infradead.org/users/hch/xfs.git xfs-iomap-read.2
> 
> Gitweb:
> 
> 
> http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/xfs-iomap-read.2

Hmm, so I pulled this and ran my trivial stupid benchmark on for-next.
It's a stupid VM with 2GB of RAM and a 12GB virtio-scsi disk backed by
tmpfs:

# mkfs.xfs -f -m rmapbt=0,reflink=1 /dev/sda
meta-data=/dev/sda   isize=512agcount=4, agsize=823296
blks
 =   sectsz=512   attr=2, projid32bit=1
 =   crc=1finobt=1, sparse=1,
rmapbt=1
 =   reflink=1
data =   bsize=4096   blocks=3293184, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0, ftype=1
log  =internal log   bsize=4096   blocks=3693, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0
# mount /dev/sda /mnt
# xfs_io -f -c 'pwrite -W -S 0x64 -b 83886080 0 734003200' /mnt/a
wrote 734003200/734003200 bytes at offset 0
700 MiB, 9 ops; 0:00:01.06 (655.500 MiB/sec and 8.4279 ops/sec)
# cp --reflink=always /mnt/a /mnt/b
# xfs_io -f -c 'pwrite -W -S 0x65 -b 83886080 0 734003200' /mnt/b
wrote 734003200/734003200 bytes at offset 0
700 MiB, 9 ops; 0.9620 sec (727.615 MiB/sec and 9.3551 ops/sec)

Then I applied your series (not including the blocksize < pagesize
series) and saw this big regression:

# mkfs.xfs -f -m rmapbt=0,reflink=1 /dev/sda
meta-data=/dev/sda   isize=512agcount=4, agsize=823296
blks
 =   sectsz=512   attr=2, projid32bit=1
 =   crc=1finobt=1, sparse=1,
rmapbt=1
 =   reflink=1
data =   bsize=4096   blocks=3293184, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0, ftype=1
log  =internal log   bsize=4096   blocks=3693, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0
# mount /dev/sda /mnt
# xfs_io -f -c 'pwrite -W -S 0x64 -b 83886080 0 734003200' /mnt/a
wrote 734003200/734003200 bytes at offset 0
700 MiB, 9 ops; 0:00:08.04 (87.031 MiB/sec and 1.1190 ops/sec)
# cp --reflink=always /mnt/a /mnt/b
# xfs_io -f -c 'pwrite -W -S 0x65 -b 83886080 0 734003200' /mnt/b
wrote 734003200/734003200 bytes at offset 0
700 MiB, 9 ops; 0:00:21.61 (32.389 MiB/sec and 0.4164 ops/sec)

I'll see if I can spot the problem while I read through the v2 code...

--D

> 
> Changes since v1:
>  - fix the iomap_readpages error handling
>  - use unsigned file offsets in a few places to avoid arithmetic overflows
>  - allocate a iomap_page in iomap_page_mkwrite to fix generic/095
>  - improve a few comments
>  - add more asserts
>  - warn about truncated block numbers from ->bmap
>  - new patch to change the __do_page_cache_readahead return value to
>unsigned int
>  - remove an incorrectly added empty line
>  - make inline data an explicit iomap type instead of a flag
>  - add a IOMAP_F_BUFFER_HEAD flag to force use of buffers heads for gfs2,
>and keep the basic buffer head infrastructure around for now.


Re: [PATCH 16/34] iomap: add initial support for writes without buffer heads

2018-05-21 Thread Darrick J. Wong
On Fri, May 18, 2018 at 06:48:12PM +0200, Christoph Hellwig wrote:
> For now just limited to blocksize == PAGE_SIZE, where we can simply read
> in the full page in write begin, and just set the whole page dirty after
> copying data into it.  This code is enabled by default and XFS will now
> be feed pages without buffer heads in ->writepage and ->writepages.
> 
> If a file system sets the IOMAP_F_BUFFER_HEAD flag on the iomap the old
> path will still be used, this both helps the transition in XFS and
> prepares for the gfs2 migration to the iomap infrastructure.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  fs/iomap.c| 132 ++
>  fs/xfs/xfs_iomap.c|   6 +-
>  include/linux/iomap.h |   2 +
>  3 files changed, 127 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 821671af2618..cd4c563db80a 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -314,6 +314,58 @@ iomap_write_failed(struct inode *inode, loff_t pos, 
> unsigned len)
>   truncate_pagecache_range(inode, max(pos, i_size), pos + len);
>  }
>  
> +static int
> +iomap_read_page_sync(struct inode *inode, loff_t block_start, struct page 
> *page,
> + unsigned poff, unsigned plen, struct iomap *iomap)
> +{
> + struct bio_vec bvec;
> + struct bio bio;
> + int ret;
> +
> + bio_init(&bio, &bvec, 1);
> + bio.bi_opf = REQ_OP_READ;
> + bio.bi_iter.bi_sector = iomap_sector(iomap, block_start);
> + bio_set_dev(&bio, iomap->bdev);
> + __bio_add_page(, page, plen, poff);
> + ret = submit_bio_wait(&bio);
> + if (ret < 0 && iomap_block_needs_zeroing(inode, block_start, iomap))
> + zero_user(page, poff, plen);
> + return ret;
> +}
> +
> +static int
> +__iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
> + struct page *page, struct iomap *iomap)
> +{
> + loff_t block_size = i_blocksize(inode);
> + loff_t block_start = pos & ~(block_size - 1);
> + loff_t block_end = (pos + len + block_size - 1) & ~(block_size - 1);
> + unsigned poff = block_start & (PAGE_SIZE - 1);
> + unsigned plen = min_t(loff_t, PAGE_SIZE - poff, block_end - 
> block_start);
> + int status;
> +
> + WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE);
> +
> + if (PageUptodate(page))
> + return 0;
> +
> + if (iomap_block_needs_zeroing(inode, block_start, iomap)) {
> + unsigned from = pos & (PAGE_SIZE - 1), to = from + len;
> + unsigned pend = poff + plen;
> +
> + if (poff < from || pend > to)
> + zero_user_segments(page, poff, from, to, pend);
> + } else {
> + status = iomap_read_page_sync(inode, block_start, page,
> + poff, plen, iomap);

Something doesn't smell right here.  The only pages we need to read in
are the first and last pages in the write_begin range, and only if they
aren't page aligned and the underlying extent is IOMAP_MAPPED, right?

I also noticed that speculative preallocation kicks in by the second 80M
write() call, and writeback for the second call can successfully allocate
the entire preallocation, which means that the third (or nth) write call
can have a real extent already mapped in, and then we end up reading it.

--D

> + if (status < 0)
> + return status;
> + SetPageUptodate(page);
> + }
> +
> + return 0;
> +}
> +
>  static int
>  iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned 
> flags,
>   struct page **pagep, struct iomap *iomap)
> @@ -331,7 +383,10 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len, unsigned flags,
>   if (!page)
>   return -ENOMEM;
>  
> - status = __block_write_begin_int(page, pos, len, NULL, iomap);
> + if (iomap->flags & IOMAP_F_BUFFER_HEAD)
> + status = __block_write_begin_int(page, pos, len, NULL, iomap);
> + else
> + status = __iomap_write_begin(inode, pos, len, page, iomap);
>   if (unlikely(status)) {
>   unlock_page(page);
>   put_page(page);
> @@ -344,14 +399,63 @@ iomap_write_begin(struct inode *inode, loff_t pos, 
> unsigned len, unsigned flags,
>   return status;
>  }
>  
> +int
> +iomap_set_page_dirty(struct page *page)
> +{
> + struct address_space *mapping = page_mapping(page);
> + int newly_dirty;
> +
> + if (unlikely(!mapping))
> + return !TestSetPageDirty(page);
> +
> + /*
> +  * Lock out page->mem_cgroup migration to keep PageDirty
> +  * synchronized with per-memcg dirty page counters.
> +  */
> + lock_page_memcg(page);
> + newly_dirty = !TestSetPageDirty(page);
> + if (newly_dirty)
> + __set_page_dirty(page, mapping, 0);
> + unlock_page_memcg(page);
> +
> + if (newly_dirty)
> + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> + 

Re: [PATCH 0/3 v3] iomap: Use FUA for O_DSYNC DIO writes

2018-05-02 Thread Darrick J. Wong
On Wed, May 02, 2018 at 03:38:04PM +1000, Dave Chinner wrote:
> Hi folks,
> 
> Version 3 of the FUA for O-DSYNC patchset. This version fixes bugs
> found in the previous version. Functionality is otherwise the same
> as described in the first version:
> 
> https://marc.info/?l=linux-xfs&m=152213446528167&w=2

Will test the whole series, consider this a tentative:
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>

--D

> 
> Version 3:
> 
> - fixed O_SYNC behaviour as noticed by Jan Kara
> - fixed use after free on IO completion due
>   iomap_dio_complete_work() simplification added in version 2. Found
>   by KASAN when running xfstests.
> 
> Version 2:
> - Fixed comment typos in first patch
> - simplified iomap_dio_complete_work()
> - changed IOMAP_DIO_WRITE_SYNC to IOMAP_DIO_NEED_SYNC
> - split blk_queue_fua() into it's own patch
> - fixed formatting issue in last patch
> - update bio->io_opf directly rather than use bio_set_op_attrs()
> - Updated comment to mention we try to use FUA optimistically.
> 
> 


Re: [PATCH 2/2] xfs: add 'discard_sync' mount flag

2018-04-30 Thread Darrick J. Wong
On Mon, Apr 30, 2018 at 04:40:04PM -0600, Jens Axboe wrote:
> On 4/30/18 4:28 PM, Dave Chinner wrote:
> > On Mon, Apr 30, 2018 at 03:42:11PM -0600, Jens Axboe wrote:
> >> On 4/30/18 3:31 PM, Dave Chinner wrote:
> >>> On Mon, Apr 30, 2018 at 09:32:52AM -0600, Jens Axboe wrote:
>  XFS recently added support for async discards. While this can be
>  a win for some workloads and devices, there are also cases where
>  async bursty discard will severly harm the latencies of reads
>  and writes.
> >>>
> >>> FWIW, convention is it document the performance regression in the
> >>> commit message, not leave the reader to guess at what it was
> >>
> >> Yeah I'll give you that, I can improve the commit message for sure.
> >>
> >>> Did anyone analyse the pattern of discards being issued to work out
> >>> what pattern was worse for async vs sync discard? is it lots of
> >>> little discards, large extents being discarded, perhaps a problem
> >>> with the request request queue starving other IOs because we queue
> >>> so many async discards in such a short time (which is the difference
> >>> in behaviour vs the old code), or something else?
> >>
> >> What was observed was a big discard which would previously have
> >> gone down as smaller discards now going down as either one or many
> >> discards. Looking at the blktrace data, it's the difference between
> >>
> >> discard 1 queue
> >> discard 1 complete
> >> discatd 2 queue
> >> discard 2 complete
> >> [...]
> >> discard n queue
> >> discard n complete
> >>
> >> which is now
> >>
> >> discard 1 queue
> >> discard 2 queue
> >> [...]
> >> discard n queue
> >> [...]
> >> discard 1 complete
> >> discard 2 complete
> >> [...]
> >> discard n complete
> >>
> >> Note that we set a max discard size of 64MB for most devices,
> >> since it's been shown to have less impact on latencies for
> >> the IO that jobs actually care about.

Ok, so as I see it the problem here is that discards are not some
instantaneous command, but instead have some cost that is (probably)
less than (say) a full write of the same quantity of zeroes.  Previously
we'd feed the block layer one discard io at a time, but now we batch
them up and send multiple large discard requests at the same time.  The
discards then tie up the device for long periods of time.  We'd get the
same result if we sent gigabytes of write commands simultaneously too,
right?

Wouldn't the same problem happen if, say, 200 threads were each
issuing discards one by one because there are just too many bytes in
flight at a time?

So what I'm asking here is, can we throttle discard IOs so that no
single issuer can monopolize a device?  As a stupid example, if program
A is sending 200MB of reads, program B is sending 200MB of writes, and
xfs is sending 200MB of discards at the same time, can we throttle all
three so that they each get roughly a third of the queue at a time?

This way XFS can still send huge batches of discard IOs asynchronously,
and it's fine enough if some of those have to wait longer in order to
avoid starving other things.

(Yes, I imagine you could serialize discards and run them one at a time,
but that seems a little suboptimal...)
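
(To make that concrete, here is the rough shape of what I have in mind.
Nothing like this exists in the block layer today; the struct and helper
names below are invented purely for illustration, so read it as a sketch
of the idea, not a proposal:)

#include <linux/atomic.h>
#include <linux/types.h>
#include <linux/wait.h>

/*
 * Illustrative only: cap the number of discard bytes a single issuer may
 * have in flight at once.  (Initialization of the waitqueue and the limit
 * is omitted for brevity.)
 */
struct discard_throttle {
	atomic64_t		inflight;	/* discard bytes currently queued */
	u64			limit;		/* per-issuer cap, e.g. 1/3 of the queue */
	wait_queue_head_t	wait;
};

static void discard_throttle_acquire(struct discard_throttle *dt, u64 bytes)
{
	/* park new discard submissions while this issuer is over its cap */
	wait_event(dt->wait, atomic64_read(&dt->inflight) < dt->limit);
	atomic64_add(bytes, &dt->inflight);
}

static void discard_throttle_release(struct discard_throttle *dt, u64 bytes)
{
	/* called from discard bio completion */
	atomic64_sub(bytes, &dt->inflight);
	wake_up(&dt->wait);
}

The same accounting could just as easily hang off the request queue so that
reads and writes get their fair share too; the point is only that the
throttle lives below the filesystem, not in XFS.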

> > 
> > IOWs, this has nothing to do with XFS behaviour, and everything to
> > do with suboptimal scheduling control for concurrent queued discards
> > in the block layer
> 
> You could argue that, and it's fallout from XFS being the first
> user of async discard. Prior to that, we've never had that use
> case. I'm quite fine making all discards throttle to depth
> 1.
> 
> > XFS can issue discard requests of up to 8GB (on 4k block size
> > filesystems), and how they are optimised is completely up to the
> > block layer. blkdev_issue_discard() happens to be synchronous (for
> > historical reasons) and that does not match our asynchronous log IO
> > model. it's always been a cause of total filesystem stall problems
> > for XFS because we can't free log space until all the pending
> > discards have been issued. Hence we can see global filesystems
> > stalls of *minutes* with synchronous discards on bad devices and can
> > cause OOM and all sorts of other nasty cascading failures.
> > 
> > Async discard dispatch solves this problem for XFS - discards no
> > longer block forward journal progress, and we really don't want to
> > go back to having that ticking timebomb in XFS. Handling concurrent
> > discard requests in an optimal manner is not a filesystem problem -
> > it's an IO scheduling problem.
> > 
> > 
> > Essentially, we've exposed yet another limitation of the block/IO
> > layer handling of discard requests in the linux storage stack - it
> > does not have a configurable discard queue depth.
> > 
> > I'd much prefer these problems get handled at the IO scheduling
> > layer where there is visibility of device capabilities and request
> > queue state. i.e. we should be throttling async discards just like
> > we can throttle read or write IO, especially given there 

Re: [PATCH v5] Return bytes transferred for partial direct I/O

2018-01-04 Thread Darrick J. Wong
On Wed, Nov 22, 2017 at 06:29:01AM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> In case direct I/O encounters an error midway, it returns the error.
> Instead it should be returning the number of bytes transferred so far.
> 
> Test case for filesystems (with ENOSPC):
> 1. Create an almost full filesystem
> 2. Create a file, say /mnt/lastfile, until the filesystem is full.
> 3. Direct write() with count > sizeof /mnt/lastfile.
> 
> Result: write() returns -ENOSPC. However, file content has data written
> in step 3.
> 
> Changes since v1:
>  - incorporated iomap and block devices
> 
> Changes since v2:
>  - realized that file size was not increasing when performing a (partial)
>direct I/O because end_io function was receiving the error instead of
>size. Fixed.
> 
> Changes since v3:
>  - [hch] initialize transferred with dio->size and use transferred instead
>of dio->size.
> 
> Changes since v4:
>  - Refreshed to v4.14
> 
> Signed-off-by: Goldwyn Rodrigues 
> Reviewed-by: Christoph Hellwig 

Hmmm, I shoved this into 4.15-rc6 and saw this horrible splat running
generic/472 (that is the test that goes with this patch, right?) on XFS:

[ 2545.534331] XFS: Assertion failed: ret < 0 || ret == count, file: 
/storage/home/djwong/cdev/work/linux-xfs/fs/xfs/xfs_file.c, line: 598
[ 2545.538177] WARNING: CPU: 2 PID: 16340 at 
/storage/home/djwong/cdev/work/linux-xfs/fs/xfs/xfs_message.c:116 
assfail+0x58/0x90 [xfs]
[ 2545.539748] Modules linked in: xfs libcrc32c dax_pmem device_dax nd_pmem 
sch_fq_codel af_packet [last unloaded: xfs]
[ 2545.541191] CPU: 2 PID: 16340 Comm: xfs_io Not tainted 4.15.0-rc6-xfsx #5
[ 2545.542119] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.10.2-1ubuntu1djwong0 04/01/2014
[ 2545.543414] RIP: 0010:assfail+0x58/0x90 [xfs]
[ 2545.544032] RSP: 0018:88002b56f990 EFLAGS: 00010246
[ 2545.544760] RAX:  RBX:  RCX: 
[ 2545.545721] RDX: 0004 RSI: 000a RDI: a0a1eac4
[ 2545.546681] RBP: 88002b56fdd0 R08: 88002b56f6e0 R09: 
[ 2545.547646] R10: 11000fe14f6e R11:  R12: 1100056adf3a
[ 2545.548613] R13: 0040 R14: 88002b56fc38 R15: 
[ 2545.549579] FS:  7fc1f601e700() GS:88007f00() 
knlGS:
[ 2545.550666] CS:  0010 DS:  ES:  CR0: 80050033
[ 2545.551460] CR2: 7fc1f4bac000 CR3: 75165001 CR4: 001606e0
[ 2545.552423] Call Trace:
[ 2545.552850]  xfs_file_dio_aio_write+0x6e3/0xe40 [xfs]
[ 2545.553581]  ? kvm_clock_read+0x1f/0x30
[ 2545.554191]  ? xfs_file_dax_write+0x6a0/0x6a0 [xfs]
[ 2545.554885]  ? kvm_clock_read+0x1f/0x30
[ 2545.555439]  ? kvm_sched_clock_read+0x5/0x10
[ 2545.556046]  ? sched_clock+0x5/0x10
[ 2545.556553]  ? sched_clock_cpu+0x18/0x1e0
[ 2545.557136]  ? __lock_acquire+0xbbf/0x40f0
[ 2545.557719]  ? kvm_clock_read+0x1f/0x30
[ 2545.558272]  ? sched_clock+0x5/0x10
[ 2545.558848]  xfs_file_write_iter+0x34a/0xb50 [xfs]
[ 2545.559544]  do_iter_readv_writev+0x44c/0x700
[ 2545.560170]  ? _copy_from_user+0x96/0xd0
[ 2545.560729]  ? vfs_dedupe_file_range+0x820/0x820
[ 2545.561398]  do_iter_write+0x12c/0x550
[ 2545.561939]  ? rcu_lockdep_current_cpu_online+0xd7/0x120
[ 2545.562682]  ? rcu_read_lock_sched_held+0xa3/0x120
[ 2545.563366]  vfs_writev+0x175/0x2d0
[ 2545.563874]  ? vfs_iter_write+0xc0/0xc0
[ 2545.564434]  ? get_lock_stats+0x16/0x90
[ 2545.564992]  ? lock_downgrade+0x580/0x580
[ 2545.565583]  ? __fget+0x1e7/0x350
[ 2545.566077]  ? entry_SYSCALL_64_fastpath+0x5/0x96
[ 2545.566744]  ? do_pwritev+0x125/0x160
[ 2545.567277]  do_pwritev+0x125/0x160
[ 2545.567787]  entry_SYSCALL_64_fastpath+0x1f/0x96
[ 2545.568437] RIP: 0033:0x7fc1f56d6110
[ 2545.568948] RSP: 002b:7ffc07f11340 EFLAGS: 0293 ORIG_RAX: 
0128
[ 2545.569971] RAX: ffda RBX: 01ef9170 RCX: 7fc1f56d6110
[ 2545.570932] RDX: 0001 RSI: 01ef9170 RDI: 0003
[ 2545.571897] RBP:  R08:  R09: 0040
[ 2545.572857] R10:  R11: 0293 R12: 
[ 2545.573818] R13:  R14: 0040 R15: 01ef9170
[ 2545.574802] Code: df 48 89 fa 48 c1 ea 03 0f b6 04 02 48 89 fa 83 e2 07 38 
d0 7f 04 84 c0 75 15 0f b6 1d d6 51 65 00 80 fb 01 77 1e 83 e3 01 75 0b <0f> ff 
5b c3 e8 0f 59 1b e1 eb e4 0f 0b 48 c7 c7 20 40 8c a0 e8 
[ 2545.577451] ---[ end trace 7bcb2da267d05648 ]---

I suspect that all we need to do is remove the ASSERT(ret...) at the bottom of
xfs_file_dio_aio_write?
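
i.e. something like this (completely untested sketch; context lines elided
since I'm typing from memory, and the assert text is copied from the splat
above):

--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ xfs_file_dio_aio_write
 	/* ... */
-	ASSERT(ret < 0 || ret == count);
 	/* ... */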

--D

> ---
>  fs/block_dev.c |  2 +-
>  fs/direct-io.c |  4 +---
>  fs/iomap.c | 20 ++--
>  3 files changed, 12 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 789f55e851ae..4d3e4603f687 100644
> --- a/fs/block_dev.c
> +++ 

Re: [PATCH v5] Return bytes transferred for partial direct I/O

2018-01-18 Thread Darrick J. Wong
On Fri, Jan 05, 2018 at 06:15:55AM -0600, Goldwyn Rodrigues wrote:
> 
> 
> On 01/04/2018 08:10 PM, Darrick J. Wong wrote:
> > On Wed, Nov 22, 2017 at 06:29:01AM -0600, Goldwyn Rodrigues wrote:
> >> From: Goldwyn Rodrigues <rgold...@suse.com>
> >>
> >> In case direct I/O encounters an error midway, it returns the error.
> >> Instead it should be returning the number of bytes transferred so far.
> >>
> >> Test case for filesystems (with ENOSPC):
> >> 1. Create an almost full filesystem
> >> 2. Create a file, say /mnt/lastfile, until the filesystem is full.
> >> 3. Direct write() with count > sizeof /mnt/lastfile.
> >>
> >> Result: write() returns -ENOSPC. However, file content has data written
> >> in step 3.
> >>
> >> Changes since v1:
> >>  - incorporated iomap and block devices
> >>
> >> Changes since v2:
> >>  - realized that file size was not increasing when performing a (partial)
> >>direct I/O because end_io function was receiving the error instead of
> >>size. Fixed.
> >>
> >> Changes since v3:
> >>  - [hch] initialize transferred with dio->size and use transferred instead
> >>of dio->size.
> >>
> >> Changes since v4:
> >>  - Refreshed to v4.14
> >>
> >> Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>
> >> Reviewed-by: Christoph Hellwig <h...@lst.de>
> > 
> > Hmmm, I shoved this into 4.15-rc6 and saw this horrible splat running
> > generic/472 (that is the test that goes with this patch, right?) on XFS:
> 
> Yes, I will add it to the patch header.
> 
> > 
> > [ 2545.534331] XFS: Assertion failed: ret < 0 || ret == count, file: 
> > /storage/home/djwong/cdev/work/linux-xfs/fs/xfs/xfs_file.c, line: 598
> > [ 2545.538177] WARNING: CPU: 2 PID: 16340 at 
> > /storage/home/djwong/cdev/work/linux-xfs/fs/xfs/xfs_message.c:116 
> > assfail+0x58/0x90 [xfs]
> > [ 2545.539748] Modules linked in: xfs libcrc32c dax_pmem device_dax nd_pmem 
> > sch_fq_codel af_packet [last unloaded: xfs]
> > [ 2545.541191] CPU: 2 PID: 16340 Comm: xfs_io Not tainted 4.15.0-rc6-xfsx #5
> > [ 2545.542119] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> > 1.10.2-1ubuntu1djwong0 04/01/2014
> > [ 2545.543414] RIP: 0010:assfail+0x58/0x90 [xfs]
> > [ 2545.544032] RSP: 0018:88002b56f990 EFLAGS: 00010246
> > [ 2545.544760] RAX:  RBX:  RCX: 
> > 
> > [ 2545.545721] RDX: 0004 RSI: 000a RDI: 
> > a0a1eac4
> > [ 2545.546681] RBP: 88002b56fdd0 R08: 88002b56f6e0 R09: 
> > 
> > [ 2545.547646] R10: 11000fe14f6e R11:  R12: 
> > 1100056adf3a
> > [ 2545.548613] R13: 0040 R14: 88002b56fc38 R15: 
> > 
> > [ 2545.549579] FS:  7fc1f601e700() GS:88007f00() 
> > knlGS:
> > [ 2545.550666] CS:  0010 DS:  ES:  CR0: 80050033
> > [ 2545.551460] CR2: 7fc1f4bac000 CR3: 75165001 CR4: 
> > 001606e0
> > [ 2545.552423] Call Trace:
> > [ 2545.552850]  xfs_file_dio_aio_write+0x6e3/0xe40 [xfs]
> > [ 2545.553581]  ? kvm_clock_read+0x1f/0x30
> > [ 2545.554191]  ? xfs_file_dax_write+0x6a0/0x6a0 [xfs]
> > [ 2545.554885]  ? kvm_clock_read+0x1f/0x30
> > [ 2545.555439]  ? kvm_sched_clock_read+0x5/0x10
> > [ 2545.556046]  ? sched_clock+0x5/0x10
> > [ 2545.556553]  ? sched_clock_cpu+0x18/0x1e0
> > [ 2545.557136]  ? __lock_acquire+0xbbf/0x40f0
> > [ 2545.557719]  ? kvm_clock_read+0x1f/0x30
> > [ 2545.558272]  ? sched_clock+0x5/0x10
> > [ 2545.558848]  xfs_file_write_iter+0x34a/0xb50 [xfs]
> > [ 2545.559544]  do_iter_readv_writev+0x44c/0x700
> > [ 2545.560170]  ? _copy_from_user+0x96/0xd0
> > [ 2545.560729]  ? vfs_dedupe_file_range+0x820/0x820
> > [ 2545.561398]  do_iter_write+0x12c/0x550
> > [ 2545.561939]  ? rcu_lockdep_current_cpu_online+0xd7/0x120
> > [ 2545.562682]  ? rcu_read_lock_sched_held+0xa3/0x120
> > [ 2545.563366]  vfs_writev+0x175/0x2d0
> > [ 2545.563874]  ? vfs_iter_write+0xc0/0xc0
> > [ 2545.564434]  ? get_lock_stats+0x16/0x90
> > [ 2545.564992]  ? lock_downgrade+0x580/0x580
> > [ 2545.565583]  ? __fget+0x1e7/0x350
> > [ 2545.566077]  ? entry_SYSCALL_64_fastpath+0x5/0x96
> > [ 2545.566744]  ? do_pwritev+0x125/0x160
> > [ 2545.567277]  do_pwritev+0x125/0x160
> > [ 2545.567787]  entry_SYSCALL_64_fastpath+0x1f/0x96
> > [ 2545.568437] RIP: 003

Re: [PATCH v5 1/2] Return bytes transferred for partial direct I/O

2018-01-18 Thread Darrick J. Wong
On Fri, Jan 19, 2018 at 02:59:51PM +1100, Dave Chinner wrote:
> On Fri, Jan 19, 2018 at 02:13:53AM +, Al Viro wrote:
> > On Thu, Jan 18, 2018 at 06:57:40PM -0600, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues <rgold...@suse.com>
> > > 
> > > In case direct I/O encounters an error midway, it returns the error.
> > > Instead it should be returning the number of bytes transferred so far.
> > > 
> > > Test case for filesystems (with ENOSPC):
> > > 1. Create an almost full filesystem
> > > 2. Create a file, say /mnt/lastfile, until the filesystem is full.
> > > 3. Direct write() with count > sizeof /mnt/lastfile.
> > > 
> > > Result: write() returns -ENOSPC. However, file content has data written
> > > in step 3.
> > > 
> > > This fixes fstest generic/472.
> > 
> > OK...  I can live with that.  What about the XFS side?  It should be
> > a prereq, to avoid bisection hazard; I can throw both into vfs.git,
> > if XFS folks are OK with that.  Objections?
> 
> Going through the VFS tree seems the best approach to me - it's a
> trivial change. I'm sure Darrick will shout if it's going to be a
> problem, though.

vfs.git is fine, though the second patch to remove the xfs assert should
go first, as Al points out.

For both patches,
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com


Re: [PATCH v7 1/2] xfs: remove assert to check bytes returned

2018-02-08 Thread Darrick J. Wong
On Thu, Feb 08, 2018 at 12:59:47PM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgold...@suse.com>
> 
> Since we can return less than count in case of partial direct
> writes, remove the ASSERT.
> 
> Signed-off-by: Goldwyn Rodrigues <rgold...@suse.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.w...@oracle.com>

--D


Re: [PATCH v7 2/2] Return bytes transferred for partial direct I/O

2018-03-07 Thread Darrick J. Wong
On Thu, Feb 08, 2018 at 12:59:48PM -0600, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> In case direct I/O encounters an error midway, it returns the error.
> Instead it should be returning the number of bytes transferred so far.
> 
> Test case for filesystems (with ENOSPC):
> 1. Create an almost full filesystem
> 2. Create a file, say /mnt/lastfile, until the filesystem is full.
> 3. Direct write() with count > sizeof /mnt/lastfile.
> 
> Result: write() returns -ENOSPC. However, file content has data written
> in step 3.
> 
> Added a sysctl entry: dio_short_writes which is on by default. This is
> to support applications which expect either an error or the bytes submitted
> as a return value for the write calls.
> 
> This fixes fstest generic/472.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  Documentation/sysctl/fs.txt | 14 ++
>  fs/block_dev.c  |  2 +-
>  fs/direct-io.c  |  7 +--
>  fs/iomap.c  | 23 ---
>  include/linux/fs.h  |  1 +
>  kernel/sysctl.c |  9 +
>  6 files changed, 42 insertions(+), 14 deletions(-)
> 
> Changes since v1:
>  - incorporated iomap and block devices
> 
> Changes since v2:
>  - realized that file size was not increasing when performing a (partial)
>direct I/O because end_io function was receiving the error instead of
>size. Fixed.
> 
> Changes since v3:
>  - [hch] initialize transferred with dio->size and use transferred instead
>of dio->size.
> 
> Changes since v4:
>  - Refreshed to v4.14
> 
> Changes since v5:
>  - Added /proc/sys/fs/dio_short_writes (default 1) to guard older applications
>which expect write(fd, buf, count) returns either count or error.
> 
> Changes since v6:
>  - Corrected documentation
>  - Re-ordered patch
> 
> diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt
> index 6c00c1e2743f..21582f675985 100644
> --- a/Documentation/sysctl/fs.txt
> +++ b/Documentation/sysctl/fs.txt
> @@ -22,6 +22,7 @@ Currently, these files are in /proc/sys/fs:
>  - aio-max-nr
>  - aio-nr
>  - dentry-state
> +- dio_short_writes
>  - dquot-max
>  - dquot-nr
>  - file-max
> @@ -76,6 +77,19 @@ dcache isn't pruned yet.
>  
>  ==
>  
> +dio_short_writes:
> +
> +In case Direct I/O encounters a transient error, it returns
> +the error code, even if it has performed part of the write.
> +This flag, if on (default), will return the number of bytes written
> +so far, as per write(2) semantics. However, some older applications
> +still consider a direct write as an error if all of the I/O
> +submitted is not complete. I.e. write(fd, buf, count) != count.
> +This option can be disabled on systems in order to support
> +existing applications which do not expect short writes.
> +
> +==
> +
>  dquot-max & dquot-nr:
>  
>  The file dquot-max shows the maximum number of cached disk
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 4a181fcb5175..49d94360bb51 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -409,7 +409,7 @@ __blkdev_direct_IO(struct kiocb *iocb, struct iov_iter 
> *iter, int nr_pages)
>  
>   if (!ret)
>   ret = blk_status_to_errno(dio->bio.bi_status);
> - if (likely(!ret))
> + if (likely(dio->size))
>   ret = dio->size;
>  
>   bio_put(&dio->bio);
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 3aafb3343a65..9bd15be64c25 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -151,6 +151,7 @@ struct dio {
>  } cacheline_aligned_in_smp;
>  
>  static struct kmem_cache *dio_cache __read_mostly;
> +unsigned int sysctl_dio_short_writes = 1;
>  
>  /*
>   * How many pages are in the queue?
> @@ -262,7 +263,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, 
> unsigned int flags)
>   ret = dio->page_errors;
>   if (ret == 0)
>   ret = dio->io_error;
> - if (ret == 0)
> + if (!sysctl_dio_short_writes && (ret == 0))
>   ret = transferred;
>  
>   if (dio->end_io) {
> @@ -310,7 +311,9 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, 
> unsigned int flags)
>   }
>  
>   kmem_cache_free(dio_cache, dio);
> - return ret;
> + if (!sysctl_dio_short_writes)
> + return ret;
> + return transferred ? transferred : ret;
>  }
>  
>  static void dio_aio_complete_work(struct work_struct *work)
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 47d29ccffaef..a8d6908dc0de 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -716,23 +716,24 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
>   struct kiocb *iocb = dio->iocb;
>   struct inode *inode = file_inode(iocb->ki_filp);
>   loff_t offset = iocb->ki_pos;
> - ssize_t ret;
> + ssize_t err;
> + ssize_t transferred = dio->size;

I'm sorry to bring 

Re: [RFC PATCH 00/79] Generic page write protection and a solution to page waitqueue

2018-04-18 Thread Darrick J. Wong
On Wed, Apr 18, 2018 at 11:54:30AM -0400, Jerome Glisse wrote:
> On Wed, Apr 18, 2018 at 04:13:37PM +0200, Jan Kara wrote:
> > Hello,
> > 
> > so I finally got to this :)
> > 
> > On Wed 04-04-18 15:17:50, jgli...@redhat.com wrote:
> > > From: Jérôme Glisse 
> 
> [...]
> 
> > > --
> > > The Why ?
> > > 
> > > I have two objectives: duplicate memory read only across nodes and/or
> > > devices and work around PCIE atomic limitations. More on each of those
> > > objective below. I also want to put forward that it can solve the page
> > > wait list issue ie having each page with its own wait list and thus
> > > avoiding long wait list traversal latency recently reported [1].
> > > 
> > > It does allow KSM for file-backed pages (truly generic KSM even between
> > > both anonymous and file-backed pages). I am not sure how useful this can
> > > be; this was not an objective I pursued, this is just a for-free
> > > feature (see below).
> > 
> > I know some people (Matthew Wilcox?) wanted to do something like KSM for
> > file pages - not all virtualization schemes use overlayfs and e.g. if you
> > use reflinks (essentially shared on-disk extents among files) for your
> > container setup, you could save significant amounts of memory with the
> > ability to share pages in page cache among files that are reflinked.
> 
> Yes, I believe there are still use cases where KSM with file-backed pages
> makes sense; I am just not familiar enough with those workloads to know how
> big of a deal this is.

Imagine container farms where they deploy the base os image via cp --reflink.
This would be a huge win for btrfs/xfs but we've all been too terrified
of the memory manager to try it. :)

For those following at home, we had a track at LSFMM2017 (and hallway
bofs about this in previous years):
https://lwn.net/Articles/717950/

/me starts to look at this big series, having sent his own yesterday. :)

--D

> > > [1] https://groups.google.com/forum/#!topic/linux.kernel/Iit1P5BNyX8
> > > 
> > > --
> > > Per page wait list, so long page_waitqueue() !
> > > 
> > > Not implemented in this RFC but below is the logic and pseudo code
> > > at bottom of this email.
> > > 
> > > When there is a contention on struct page lock bit, the caller which
> > > is trying to lock the page will add itself to a waitqueue. The issue
> > > here is that multiple pages share the same wait queue, and on large
> > > systems with a lot of RAM this means we can quickly get to a long list
> > > of waiters for different pages (or for the same page) on the same
> > > list [1].
> > > 
> > > The present patchset virtually kills all places that need to access the
> > > page->mapping field and only a handful are left, namely for testing
> > > page truncation and for vmscan. The former can be removed if we reuse
> > > the PG_waiters flag for a new PG_truncate flag set on truncation; then
> > > we can virtually kill all dereferences of page->mapping (this patchset
> > > proves it is doable). NOTE THIS DOES NOT MEAN THAT MAPPING IS FREE TO
> > > BE USED BY ANYONE TO STORE WHATEVER IN STRUCT PAGE. SORRY, NO!
> > 
> > It is interesting that you can get rid of page->mapping uses in most
> > places. For page reclaim (vmscan) you'll still need a way to get from a
> > page to an address_space so that you can reclaim the page so you can hardly
> > get rid of page->mapping completely but you're right that with such limited
> > use that transition could be more complex / expensive.
> 
> The idea for vmscan is that you either have the regular mapping pointer stored
> in page->mapping or you have a pointer to a special struct which has a function
> pointer to a reclaim/walker function (rmap_walk_ksm).
> 
> > What I wonder though is what is the cost of this (in the terms of code size
> > and speed) - propagating the mapping down the stack costs something... Also
> > in terms of maintainability, code readability suffers a bit.
> 
> I haven't checked that; I will. I was not so concerned because in the vast
> majority of places there is already a struct address_space on the stack
> frame (ie a local variable in the function being called), so moving it to a
> function argument shouldn't impact that. However, as I expect this will be
> merged over multiple kernel release cycles, the intermediary steps will see
> an increase in stack size. The code size should only grow marginally, I expect.
> I will provide numbers with my next posting after LSF/MM.
> 
> 
> > This could be helped though. In some cases it seems we just use the mapping
> > because it was easily available but could get away without it. In other
> > cases (e.g. a lot of fs/buffer.c) we could make the bh -> mapping transition easy
> > by storing the mapping in the struct buffer_head - possibly it could
> > replace b_bdev pointer as we could get to that from the mapping with a bit
> > of magic and pointer chasing and 

Re: [PATCH] block: fix 32 bit overflow in __blkdev_issue_discard()

2018-11-13 Thread Darrick J. Wong
On Wed, Nov 14, 2018 at 08:43:37AM +1100, Dave Chinner wrote:
> From: Dave Chinner 
> 
> A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to
> fall into an endless loop in the discard code. The test is creating
> a device that is exactly 2^32 sectors in size to test mkfs boundary
> conditions around the 32 bit sector overflow region.
> 
> mkfs issues a discard for the entire device size by default, and
> hence this throws a sector count of 2^32 into
> blkdev_issue_discard(). It takes the number of sectors to discard as
> a sector_t - a 64 bit value.
> 
> The commit ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
> takes this sector count and casts it to a 32 bit value before
> comparing it against the maximum allowed discard size the device
> has. This truncates away the upper 32 bits, and so if the lower 32
> bits of the sector count is zero, it starts issuing discards of
> length 0. This causes the code to fall into an endless loop, issuing
> zero length discards over and over again on the same sector.
> 
> Fixes: ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
> Signed-off-by: Dave Chinner 

Fixes the regression for me too, so...

Tested-by: Darrick J. Wong 
Reviewed-by: Darrick J. Wong 

--D
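
(Aside: the truncation is easy to see without building a 2TB scratch device.
A throwaway userspace sketch, not kernel code, with a simplified stand-in
for min_t() and a made-up max discard size:)

#include <stdio.h>
#include <stdint.h>

/* simplified stand-in for the kernel's min_t(): cast both args to 'type' */
#define min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

int main(void)
{
	uint64_t nr_sects = 1ULL << 32;		/* exactly 2^32 sectors, as in xfs/259 */
	unsigned int max_sects = 0x3fffff;	/* stand-in for bio_allowed_max_sectors() */

	/* buggy version: the 64-bit count is truncated to 0 before the compare */
	unsigned int broken = min_t(unsigned int, nr_sects, max_sects);
	/* fixed version: do the comparison (and store the result) in 64 bits */
	uint64_t fixed = min_t(uint64_t, nr_sects, max_sects);

	printf("broken req_sects = %u\n", broken);	/* prints 0, forever */
	printf("fixed  req_sects = %llu\n", (unsigned long long)fixed);
	return 0;
}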

> ---
>  block/blk-lib.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index e8b3bb9bf375..144e156ed341 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -55,9 +55,12 @@ int __blkdev_issue_discard(struct block_device *bdev, 
> sector_t sector,
>   return -EINVAL;
>  
>   while (nr_sects) {
> - unsigned int req_sects = min_t(unsigned int, nr_sects,
> + sector_t req_sects = min_t(sector_t, nr_sects,
>   bio_allowed_max_sectors(q));
>  
> + WARN_ON_ONCE(req_sects == 0);
> + WARN_ON_ONCE((req_sects << 9) > UINT_MAX);
> +
>   bio = blk_next_bio(bio, 0, gfp_mask);
>   bio->bi_iter.bi_sector = sector;
>   bio_set_dev(bio, bdev);
> -- 
> 2.19.1
> 


Re: [PATCH] block: fix 32 bit overflow in __blkdev_issue_discard()

2018-11-15 Thread Darrick J. Wong
On Fri, Nov 16, 2018 at 09:13:37AM +1100, Dave Chinner wrote:
> On Thu, Nov 15, 2018 at 11:10:36AM +0800, Ming Lei wrote:
> > On Thu, Nov 15, 2018 at 12:22:01PM +1100, Dave Chinner wrote:
> > > On Thu, Nov 15, 2018 at 09:06:52AM +0800, Ming Lei wrote:
> > > > On Wed, Nov 14, 2018 at 08:18:24AM -0700, Jens Axboe wrote:
> > > > > On 11/13/18 2:43 PM, Dave Chinner wrote:
> > > > > > From: Dave Chinner 
> > > > > > 
> > > > > > A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to
> > > > > > fall into an endless loop in the discard code. The test is creating
> > > > > > a device that is exactly 2^32 sectors in size to test mkfs boundary
> > > > > > conditions around the 32 bit sector overflow region.
> > > > > > 
> > > > > > mkfs issues a discard for the entire device size by default, and
> > > > > > hence this throws a sector count of 2^32 into
> > > > > > blkdev_issue_discard(). It takes the number of sectors to discard as
> > > > > > a sector_t - a 64 bit value.
> > > > > > 
> > > > > > The commit ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
> > > > > > takes this sector count and casts it to a 32 bit value before
> > > > > > comparing it against the maximum allowed discard size the device
> > > > > > has. This truncates away the upper 32 bits, and so if the lower 32
> > > > > > bits of the sector count is zero, it starts issuing discards of
> > > > > > length 0. This causes the code to fall into an endless loop, issuing
> > > > > > zero length discards over and over again on the same sector.
> > > > > 
> > > > > Applied, thanks. Ming, can you please add a blktests test for
> > > > > this case? This is the 2nd time it's been broken.
> > > > 
> > > > OK, I will add a zram discard test in blktests, which should cover the
> > > > 1st report. For xfs/259, I need to investigate if it is easy to
> > > > do in blktests.
> > > 
> > > Just write a test that creates block devices of 2^32 + (-1,0,1)
> > > sectors and runs a discard across the entire device. That's all that
> > > xfs/259 it doing - exercising mkfs on 2TB, 4TB and 16TB boundaries.
> > > i.e. the boundaries where sectors and page cache indexes (on 4k page
> > > size systems) overflow 32 bit int and unsigned int sizes. mkfs
> > > issues a discard for the entire device, so it's testing that as
> > > well...
> > 
> > Indeed, I can reproduce this issue via the following commands:
> > 
> > modprobe scsi_debug virtual_gb=2049 sector_size=512 lbpws10=1 
> > dev_size_mb=512
> > blkdiscard /dev/sde
> > 
> > > 
> > > You need to write tests that exercise write_same, write_zeros and
> > > discard operations around these boundaries, because they all take
> > > a 64 bit sector count and stuff them into 32 bit size fields in
> > > the bio that is being submitted.
> > 
> > write_same/write_zeros are usually used by drivers directly, so we
> > may need to make the test case on some specific device.
> 
> My local linux iscsi server and client advertise support for them.
> It definitely does not ship zeros across the wire(*) when I use
> things like FALLOC_FL_ZERO_RANGE, but fstests does not have block
> device fallocate() tests for zeroing or punching...

fstests does (generic/{349,350,351}) but those basic functionality tests
don't include creating a 2^32 block device and seeing if overflows
happen... :/

...I also see that Eryu succeeded in kicking those tests out of the
quick group, so they probably don't run that often either.

--D

> 
> Cheers,
> 
> Dave.
> 
> (*) but the back end storage is a sparse file on an XFS filesystem,
> and the iscsi server fails to translate write_zeroes or
> WRITE_SAME(0) to FALLOC_FL_ZERO_RANGE on the storage side and hence
> is really slow because it physically writes zeros to the XFS file.
> i.e. the client offloads the operation to the server to minimise
> wire traffic, but then the server doesn't offload the operation to
> the storage
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com