Re: [PATCH v5 5/7] fsdax: Dedup file range to use a compare function

2021-05-14 Thread Darrick J. Wong
On Fri, May 14, 2021 at 08:35:44AM +, ruansy.f...@fujitsu.com wrote:
> 
> 
> > -Original Message-
> > From: Darrick J. Wong 
> > Subject: Re: [PATCH v5 5/7] fsdax: Dedup file range to use a compare function
> > 
> > On Tue, May 11, 2021 at 11:09:31AM +0800, Shiyang Ruan wrote:
> > > With dax we cannot deal with readpage() etc. So, we create a dax
> > > comparison function which is similar to
> > > vfs_dedupe_file_range_compare().
> > > And introduce dax_remap_file_range_prep() for filesystem use.
> > >
> > > Signed-off-by: Goldwyn Rodrigues 
> > > Signed-off-by: Shiyang Ruan 
> > > ---
> > >  fs/dax.c | 56 +++
> > >  fs/remap_range.c | 57 +---
> > >  fs/xfs/xfs_reflink.c |  8 +--
> > >  include/linux/dax.h  |  4 
> > >  include/linux/fs.h   | 12 ++
> > >  5 files changed, 123 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index ee9d28a79bfb..dedf1be0155c 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -1853,3 +1853,59 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> > >   return dax_insert_pfn_mkwrite(vmf, pfn, order);
> > >  }
> > > EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> > > +
> > > +static loff_t dax_range_compare_actor(struct inode *ino1, loff_t pos1,
> > > + struct inode *ino2, loff_t pos2, loff_t len, void *data,
> > > + struct iomap *smap, struct iomap *dmap)
> > > +{
> > > + void *saddr, *daddr;
> > > + bool *same = data;
> > > + int ret;
> > > +
> > > + if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
> > > + *same = true;
> > > + return len;
> > > + }
> > > +
> > > + if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
> > > + *same = false;
> > > + return 0;
> > > + }
> > > +
> > > + ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
> > > +   &saddr, NULL);
> > > + if (ret < 0)
> > > + return -EIO;
> > > +
> > > + ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
> > > +   &daddr, NULL);
> > > + if (ret < 0)
> > > + return -EIO;
> > > +
> > > + *same = !memcmp(saddr, daddr, len);
> > > + return len;
> > > +}
> > > +
> > > +int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> > > + struct inode *dest, loff_t destoff, loff_t len, bool *is_same,
> > > + const struct iomap_ops *ops)
> > > +{
> > > + int id, ret = 0;
> > > +
> > > + id = dax_read_lock();
> > > + while (len) {
> > > + ret = iomap_apply2(src, srcoff, dest, destoff, len, 0, ops,
> > > +is_same, dax_range_compare_actor);
> > > + if (ret < 0 || !*is_same)
> > > + goto out;
> > > +
> > > + len -= ret;
> > > + srcoff += ret;
> > > + destoff += ret;
> > > + }
> > > + ret = 0;
> > > +out:
> > > + dax_read_unlock(id);
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare);
> > > diff --git a/fs/remap_range.c b/fs/remap_range.c
> > > index e4a5fdd7ad7b..7bc4c8e3aa9f 100644
> > > --- a/fs/remap_range.c
> > > +++ b/fs/remap_range.c
> > > @@ -14,6 +14,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include <linux/dax.h>
> > >  #include "internal.h"
> > >
> > >  #include 
> > > @@ -199,9 +200,9 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
> > >   * Compare extents of two files to see if they are the same.
> > >   * Caller must have locked both inodes to prevent write races.
> > >   */
> > > -static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> > > -  struct inode *dest, loff_t destoff,
> > > -  loff_t len, bool *is_same)
> > > +int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> > > +   struct inode *
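
[Editor's note: a minimal userspace analogue of the chunked compare loop
in dax_dedupe_file_range_compare() above, assuming two already-open file
descriptors.  The kernel loop advances srcoff/destoff by however many
bytes each iomap_apply2() pass compared; this sketch does the same with
pread().]

#include <stdbool.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static int compare_ranges(int fd1, off_t off1, int fd2, off_t off2,
			  size_t len, bool *same)
{
	char buf1[4096], buf2[4096];

	*same = true;
	while (len) {
		size_t chunk = len < sizeof(buf1) ? len : sizeof(buf1);
		ssize_t n1 = pread(fd1, buf1, chunk, off1);
		ssize_t n2 = pread(fd2, buf2, chunk, off2);

		if (n1 <= 0 || n1 != n2)
			return -1;	/* short read or I/O error */
		if (memcmp(buf1, buf2, n1)) {
			*same = false;	/* mismatch: stop early */
			return 0;
		}
		len -= n1;	/* advance both offsets by what was compared */
		off1 += n1;
		off2 += n1;
	}
	return 0;
}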

Re: [PATCH v5 1/7] fsdax: Introduce dax_iomap_cow_copy()

2021-05-13 Thread Darrick J. Wong
On Thu, May 13, 2021 at 07:57:47AM +, ruansy.f...@fujitsu.com wrote:
> > -Original Message-
> > From: Darrick J. Wong 
> > Subject: Re: [PATCH v5 1/7] fsdax: Introduce dax_iomap_cow_copy()
> > 
> > On Tue, May 11, 2021 at 11:09:27AM +0800, Shiyang Ruan wrote:
> > > In the case where the iomap is a write operation and iomap is not
> > > equal to srcmap after iomap_begin, we consider it a CoW operation.
> > >
> > > The destination extent which iomap indicates is a newly allocated
> > > extent, so the data needs to be copied from srcmap to it.  In theory,
> > > it is better to copy only the head and tail ranges which fall outside
> > > of the unaligned write area instead of copying the whole aligned
> > > range. But in a dax page fault, the range will always be aligned.  So,
> > > we have to copy the whole range in this case.
> > >
> > > Signed-off-by: Shiyang Ruan 
> > > Reviewed-by: Christoph Hellwig 
> > > ---
> > >  fs/dax.c | 86 
> > >  1 file changed, 81 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index bf3fc8242e6c..f0249bb1d46a 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -1038,6 +1038,61 @@ static int dax_iomap_direct_access(struct iomap *iomap, loff_t pos, size_t size,
> > >   return rc;
> > >  }
> > >
> > > +/**
> > > + * dax_iomap_cow_copy(): Copy the data from source to destination before write.
> > > + * @pos: address to do copy from.
> > > + * @length:  size of copy operation.
> > > + * @align_size:  aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
> > > + * @srcmap:  iomap srcmap
> > > + * @daddr:   destination address to copy to.
> > > + *
> > > + * This can be called from two places. Either during DAX write fault, to copy
> > > + * the length size data to daddr. Or, while doing normal DAX write operation,
> > > + * dax_iomap_actor() might call this to do the copy of either start or end
> > > + * unaligned address. In this case the rest of the copy of aligned ranges is
> > > + * taken care by dax_iomap_actor() itself.
> > > + * Also, note DAX fault will always result in aligned pos and pos + length.
> > > + */
> > > +static int dax_iomap_cow_copy(loff_t pos, loff_t length, size_t align_size,
> > 
> > Nit: Linus has asked us not to continue the use of loff_t for file io
> > length.  Could you change this to 'uint64_t length', please?
> > (Assuming we even need the extra length bits?)
> > 
> > With that fixed up...
> > Reviewed-by: Darrick J. Wong 
> > 
> > --D
> > 
> > > + struct iomap *srcmap, void *daddr)
> > > +{
> > > + loff_t head_off = pos & (align_size - 1);
> > 
> > Other nit: head_off = round_down(pos, align_size); ?
> 
> We need the offset within a page here, either PTE or PMD.  So I think 
> round_down() is not suitable here.

Oops, yeah.  /me wonders if any of Matthew's folio cleanups will reduce
the amount of opencoding around this...

--D

> 
> 
> --
> Thanks,
> Ruan Shiyang.
> 
> > 
> > > + size_t size = ALIGN(head_off + length, align_size);
> > > + loff_t end = pos + length;
> > > + loff_t pg_end = round_up(end, align_size);
> > > + bool copy_all = head_off == 0 && end == pg_end;
> > > + void *saddr = 0;
> > > + int ret = 0;
> > > +
> > > + ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + if (copy_all) {
> > > + ret = copy_mc_to_kernel(daddr, saddr, length);
> > > + return ret ? -EIO : 0;
> > > + }
> > > +
> > > + /* Copy the head part of the range.  Note: we pass offset as length. */
> > > + if (head_off) {
> > > + ret = copy_mc_to_kernel(daddr, saddr, head_off);
> > > + if (ret)
> > > + return -EIO;
> > > + }
> > > +
> > > + /* Copy the tail part of the range */
> > > + if (end < pg_end) {
> > > + loff_t tail_off = head_off + length;
> > > + loff_t tail_len = pg_end - end;
> > > +
> > > + ret = copy_mc_to_kernel(daddr + tail_off, saddr + tail_off,
>
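
[Editor's note: a worked example of the copy-around arithmetic above,
assuming align_size = 4096 and an unaligned 100-byte write at file
offset 4196:

    head_off = pos & (align_size - 1)          =  100
    size     = ALIGN(head_off + length, 4096)  = 4096
    end      = pos + length                    = 4296
    pg_end   = round_up(end, 4096)             = 8192
    tail_off = head_off + length               =  200
    tail_len = pg_end - end                    = 3896

The head copy preserves the 100 bytes in front of the write, the tail
copy preserves the 3896 bytes behind it, and the 100 bytes in between
are left for the write itself to fill in; hence the "we pass offset as
length" note on the head copy.]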

Re: [PATCH v5 6/7] fs/xfs: Handle CoW for fsdax write() path

2021-05-11 Thread Darrick J. Wong
On Tue, May 11, 2021 at 11:09:32AM +0800, Shiyang Ruan wrote:
> In fsdax mode, WRITE and ZERO on a shared extent need CoW performed. After
> CoW, the newly allocated extents need to be remapped to the file.  So, add
> an iomap_end for dax write ops to do the remapping work.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/xfs_bmap_util.c |  3 +--
>  fs/xfs/xfs_file.c  |  9 +++
>  fs/xfs/xfs_iomap.c | 61 +-
>  fs/xfs/xfs_iomap.h |  4 +++
>  fs/xfs/xfs_iops.c  |  7 +++--
>  fs/xfs/xfs_reflink.c   |  3 +--
>  6 files changed, 72 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index a5e9d7d34023..2a36dc93ff27 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -965,8 +965,7 @@ xfs_free_file_space(
>   return 0;
>   if (offset + len > XFS_ISIZE(ip))
>   len = XFS_ISIZE(ip) - offset;
> - error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> - _buffered_write_iomap_ops);
> + error = xfs_iomap_zero_range(ip, offset, len, NULL);
>   if (error)
>   return error;
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 396ef36dcd0a..38d8eca05aee 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -684,11 +684,8 @@ xfs_file_dax_write(
>   pos = iocb->ki_pos;
>  
>   trace_xfs_file_dax_write(iocb, from);
> - ret = dax_iomap_rw(iocb, from, _direct_write_iomap_ops);
> - if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
> - i_size_write(inode, iocb->ki_pos);
> - error = xfs_setfilesize(ip, pos, ret);
> - }
> + ret = dax_iomap_rw(iocb, from, _dax_write_iomap_ops);
> +
>  out:
>   if (iolock)
>   xfs_iunlock(ip, iolock);
> @@ -1309,7 +1306,7 @@ __xfs_filemap_fault(
>  
>   ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
>   (write_fault && !vmf->cow_page) ?
> -  _direct_write_iomap_ops :
> +  _dax_write_iomap_ops :
>_read_iomap_ops);
>   if (ret & VM_FAULT_NEEDDSYNC)
>   ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index d154f42e2dc6..8b593a51480d 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -761,7 +761,8 @@ xfs_direct_write_iomap_begin(
>  
>   /* may drop and re-acquire the ilock */
>   error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> - &lockmode, flags & IOMAP_DIRECT);
> + &lockmode,
> + (flags & IOMAP_DIRECT) || IS_DAX(inode));
>   if (error)
>   goto out_unlock;
>   if (shared)
> @@ -854,6 +855,41 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
>   .iomap_begin= xfs_direct_write_iomap_begin,
>  };
>  
> +static int
> +xfs_dax_write_iomap_end(
> + struct inode*inode,
> + loff_t  pos,
> + loff_t  length,
> + ssize_t written,
> + unsigned intflags,
> + struct iomap*iomap)
> +{
> + int error = 0;
> + struct xfs_inode*ip = XFS_I(inode);
> + boolcow = xfs_is_cow_inode(ip);
> +
> + if (!written)
> + return 0;
> +
> + if (pos + written > i_size_read(inode) && !(flags & IOMAP_FAULT)) {
> + i_size_write(inode, pos + written);
> + error = xfs_setfilesize(ip, pos, written);
> + if (error && cow) {
> + xfs_reflink_cancel_cow_range(ip, pos, written, true);
> + return error;
> + }
> + }
> + if (cow)
> + error = xfs_reflink_end_cow(ip, pos, written);
> +
> + return error;
> +}
> +
> +const struct iomap_ops xfs_dax_write_iomap_ops = {
> + .iomap_begin= xfs_direct_write_iomap_begin,
> + .iomap_end  = xfs_dax_write_iomap_end,
> +};
> +
>  static int
>  xfs_buffered_write_iomap_begin(
>   struct inode*inode,
> @@ -1311,3 +1347,26 @@ xfs_xattr_iomap_begin(
>  const struct iomap_ops xfs_xattr_iomap_ops = {
>   .iomap_begin= xfs_xattr_iomap_begin,
>  };
> +
> +int
> +xfs_iomap_zero_range(
> + struct xfs_inode*ip,
> + loff_t  offset,
> + loff_t  len,
> + bool*did_zero)
> +{
> + return iomap_zero_range(VFS_I(ip), offset, len, did_zero,
> + IS_DAX(VFS_I(ip)) ? _dax_write_iomap_ops
> +   : _buffered_write_iomap_ops);
> +}
> +
> +int
> +xfs_iomap_truncate_page(
> + struct xfs_inode*ip,
> + loff_t  pos,
> + bool*did_zero)
> +{
> +   

Re: [PATCH v5 5/7] fsdax: Dedup file range to use a compare function

2021-05-11 Thread Darrick J. Wong
On Tue, May 11, 2021 at 11:09:31AM +0800, Shiyang Ruan wrote:
> With dax we cannot deal with readpage() etc. So, we create a dax
> comparison function which is similar to
> vfs_dedupe_file_range_compare().
> And introduce dax_remap_file_range_prep() for filesystem use.
> 
> Signed-off-by: Goldwyn Rodrigues 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c | 56 +++
>  fs/remap_range.c | 57 +---
>  fs/xfs/xfs_reflink.c |  8 +--
>  include/linux/dax.h  |  4 
>  include/linux/fs.h   | 12 ++
>  5 files changed, 123 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index ee9d28a79bfb..dedf1be0155c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1853,3 +1853,59 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>   return dax_insert_pfn_mkwrite(vmf, pfn, order);
>  }
>  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> +
> +static loff_t dax_range_compare_actor(struct inode *ino1, loff_t pos1,
> + struct inode *ino2, loff_t pos2, loff_t len, void *data,
> + struct iomap *smap, struct iomap *dmap)
> +{
> + void *saddr, *daddr;
> + bool *same = data;
> + int ret;
> +
> + if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
> + *same = true;
> + return len;
> + }
> +
> + if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
> + *same = false;
> + return 0;
> + }
> +
> + ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
> +   &saddr, NULL);
> + if (ret < 0)
> + return -EIO;
> +
> + ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
> +   &daddr, NULL);
> + if (ret < 0)
> + return -EIO;
> +
> + *same = !memcmp(saddr, daddr, len);
> + return len;
> +}
> +
> +int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> + struct inode *dest, loff_t destoff, loff_t len, bool *is_same,
> + const struct iomap_ops *ops)
> +{
> + int id, ret = 0;
> +
> + id = dax_read_lock();
> + while (len) {
> + ret = iomap_apply2(src, srcoff, dest, destoff, len, 0, ops,
> +is_same, dax_range_compare_actor);
> + if (ret < 0 || !*is_same)
> + goto out;
> +
> + len -= ret;
> + srcoff += ret;
> + destoff += ret;
> + }
> + ret = 0;
> +out:
> + dax_read_unlock(id);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare);
> diff --git a/fs/remap_range.c b/fs/remap_range.c
> index e4a5fdd7ad7b..7bc4c8e3aa9f 100644
> --- a/fs/remap_range.c
> +++ b/fs/remap_range.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/dax.h>
>  #include "internal.h"
>  
>  #include 
> @@ -199,9 +200,9 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
>   * Compare extents of two files to see if they are the same.
>   * Caller must have locked both inodes to prevent write races.
>   */
> -static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> -  struct inode *dest, loff_t destoff,
> -  loff_t len, bool *is_same)
> +int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> +   struct inode *dest, loff_t destoff,
> +   loff_t len, bool *is_same)
>  {
>   loff_t src_poff;
>   loff_t dest_poff;
> @@ -280,6 +281,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>  out_error:
>   return error;
>  }
> +EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
>  
>  /*
>   * Check that the two inodes are eligible for cloning, the ranges make
> @@ -289,9 +291,11 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>   * If there's an error, then the usual negative error code is returned.
>   * Otherwise returns 0 with *len set to the request length.
>   */
> -int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> -   struct file *file_out, loff_t pos_out,
> -   loff_t *len, unsigned int remap_flags)
> +static int
> +__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + loff_t *len, unsigned int remap_flags,
> + const struct iomap_ops *dax_read_ops)
>  {
>   struct inode *inode_in = file_inode(file_in);
>   struct inode *inode_out = file_inode(file_out);
> @@ -351,8 +355,17 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>   if (remap_flags & REMAP_FILE_DEDUP) {
>   bool  

Re: [PATCH v5 3/7] fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero

2021-05-11 Thread Darrick J. Wong
On Tue, May 11, 2021 at 11:09:29AM +0800, Shiyang Ruan wrote:
> Punching a hole in a reflinked file needs dax_copy_edge() too.  Otherwise,
> data in the non-aligned area will not be correct.  So, add the srcmap to
> dax_iomap_zero() and replace memset() with dax_copy_edge().
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Ritesh Harjani 
> ---
>  fs/dax.c   | 25 +++--
>  fs/iomap/buffered-io.c |  2 +-
>  include/linux/dax.h|  3 ++-
>  3 files changed, 18 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index ef0e564e7904..ee9d28a79bfb 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1186,7 +1186,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>  }
>  #endif /* CONFIG_FS_DAX_PMD */
>  
> -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
> +s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
> + struct iomap *srcmap)
>  {
>   sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
>   pgoff_t pgoff;
> @@ -1208,19 +1209,23 @@ s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>  
>   if (page_aligned)
>   rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
> - else
> + else {
>   rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
> - if (rc < 0) {
> - dax_read_unlock(id);
> - return rc;
> - }
> -
> - if (!page_aligned) {
> - memset(kaddr + offset, 0, size);
> + if (rc < 0)
> + goto out;
> + if (iomap->addr != srcmap->addr) {

Why isn't this "if (srcmap->type != IOMAP_HOLE)" ?

I suppose it has the same effect, since @iomap should never be a hole
and we should never have a @srcmap that's the same as @iomap, but
still, we use IOMAP_HOLE checks in most other parts of fs/iomap/.

Other than that, the logic looks decent to me.

--D

> + rc = dax_iomap_cow_copy(offset, size, PAGE_SIZE, srcmap,
> + kaddr);
> + if (rc < 0)
> + goto out;
> + } else
> + memset(kaddr + offset, 0, size);
>   dax_flush(iomap->dax_dev, kaddr + offset, size);
>   }
> +
> +out:
>   dax_read_unlock(id);
> - return size;
> + return rc < 0 ? rc : size;
>  }
>  
>  static loff_t
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index f2cd2034a87b..2734955ea67f 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -933,7 +933,7 @@ static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
>   s64 bytes;
>  
>   if (IS_DAX(inode))
> - bytes = dax_iomap_zero(pos, length, iomap);
> + bytes = dax_iomap_zero(pos, length, iomap, srcmap);
>   else
>   bytes = iomap_zero(inode, pos, length, iomap, srcmap);
>   if (bytes < 0)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index b52f084aa643..3275e01ed33d 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -237,7 +237,8 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>  int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
>  int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
> pgoff_t index);
> -s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap);
> +s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap,
> + struct iomap *srcmap);
>  static inline bool dax_mapping(struct address_space *mapping)
>  {
>   return mapping->host && IS_DAX(mapping->host);
> -- 
> 2.31.1
> 
> 
> 
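
[Editor's note: a sketch of the two equivalent predicates discussed
above, assuming, as the review states, that @iomap is never a hole and
@srcmap never aliases @iomap for a CoW write.  zero_needs_cow_copy() is
a hypothetical name, not from the patch.]

static bool zero_needs_cow_copy(const struct iomap *iomap,
				const struct iomap *srcmap)
{
	/* fs/iomap/ house style: a real srcmap means CoW */
	return srcmap->type != IOMAP_HOLE;
	/* equivalent under the assumptions above:
	 *	return iomap->addr != srcmap->addr;
	 */
}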


Re: [PATCH v5 1/7] fsdax: Introduce dax_iomap_cow_copy()

2021-05-11 Thread Darrick J. Wong
On Tue, May 11, 2021 at 11:09:27AM +0800, Shiyang Ruan wrote:
> In the case where the iomap is a write operation and iomap is not equal
> to srcmap after iomap_begin, we consider it a CoW operation.
> 
> The destination extent which iomap indicates is a newly allocated extent,
> so the data needs to be copied from srcmap to it.  In theory, it is
> better to copy only the head and tail ranges which fall outside of the
> unaligned write area instead of copying the whole aligned range. But in
> a dax page fault, the range will always be aligned.  So, we have to copy
> the whole range in this case.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 
> ---
>  fs/dax.c | 86 
>  1 file changed, 81 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index bf3fc8242e6c..f0249bb1d46a 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1038,6 +1038,61 @@ static int dax_iomap_direct_access(struct iomap *iomap, loff_t pos, size_t size,
>   return rc;
>  }
>  
> +/**
> + * dax_iomap_cow_copy(): Copy the data from source to destination before write.
> + * @pos: address to do copy from.
> + * @length:  size of copy operation.
> + * @align_size:  aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
> + * @srcmap:  iomap srcmap
> + * @daddr:   destination address to copy to.
> + *
> + * This can be called from two places. Either during DAX write fault, to copy
> + * the length size data to daddr. Or, while doing normal DAX write operation,
> + * dax_iomap_actor() might call this to do the copy of either start or end
> + * unaligned address. In this case the rest of the copy of aligned ranges is
> + * taken care by dax_iomap_actor() itself.
> + * Also, note DAX fault will always result in aligned pos and pos + length.
> + */
> +static int dax_iomap_cow_copy(loff_t pos, loff_t length, size_t align_size,

Nit: Linus has asked us not to continue the use of loff_t for file
io length.  Could you change this to 'uint64_t length', please?
(Assuming we even need the extra length bits?)

With that fixed up...
Reviewed-by: Darrick J. Wong 

--D

> + struct iomap *srcmap, void *daddr)
> +{
> + loff_t head_off = pos & (align_size - 1);

Other nit: head_off = round_down(pos, align_size); ?

> + size_t size = ALIGN(head_off + length, align_size);
> + loff_t end = pos + length;
> + loff_t pg_end = round_up(end, align_size);
> + bool copy_all = head_off == 0 && end == pg_end;
> + void *saddr = 0;
> + int ret = 0;
> +
> + ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
> + if (ret)
> + return ret;
> +
> + if (copy_all) {
> + ret = copy_mc_to_kernel(daddr, saddr, length);
> + return ret ? -EIO : 0;
> + }
> +
> + /* Copy the head part of the range.  Note: we pass offset as length. */
> + if (head_off) {
> + ret = copy_mc_to_kernel(daddr, saddr, head_off);
> + if (ret)
> + return -EIO;
> + }
> +
> + /* Copy the tail part of the range */
> + if (end < pg_end) {
> + loff_t tail_off = head_off + length;
> + loff_t tail_len = pg_end - end;
> +
> + ret = copy_mc_to_kernel(daddr + tail_off, saddr + tail_off,
> + tail_len);
> + if (ret)
> + return -EIO;
> + }
> + return 0;
> +}
> +
>  /*
>   * The user has performed a load from a hole in the file.  Allocating a new
>   * page in the file would cause excessive storage usage for workloads with
> @@ -1167,11 +1222,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>   struct dax_device *dax_dev = iomap->dax_dev;
>   struct iov_iter *iter = data;
>   loff_t end = pos + length, done = 0;
> + bool write = iov_iter_rw(iter) == WRITE;
>   ssize_t ret = 0;
>   size_t xfer;
>   int id;
>  
> - if (iov_iter_rw(iter) == READ) {
> + if (!write) {
>   end = min(end, i_size_read(inode));
>   if (pos >= end)
>   return 0;
> @@ -1180,7 +1236,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>   return iov_iter_zero(min(length, end - pos), iter);
>   }
>  
> - if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
> + /*
> +  * In DAX mode, we allow either pure overwrites of written extents, or
> +  * writes to unwritten extents as part of a copy-on-write operation.
> +  */
> + if (WARN_O

Re: [PATCH v5 2/7] fsdax: Replace mmap entry in case of CoW

2021-05-11 Thread Darrick J. Wong
On Tue, May 11, 2021 at 11:09:28AM +0800, Shiyang Ruan wrote:
> We replace the existing entry with the newly allocated one in case of CoW.
> Also, we mark the entry as PAGECACHE_TAG_TOWRITE so writeback marks this
> entry as writeprotected.  This helps with snapshots, so that new write
> pagefaults after a snapshot trigger a CoW.
> 
> Signed-off-by: Goldwyn Rodrigues 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Ritesh Harjani 

Seems fine to me...
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/dax.c | 39 ---
>  1 file changed, 28 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index f0249bb1d46a..ef0e564e7904 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -722,6 +722,10 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
>   return 0;
>  }
>  
> +/* DAX Insert Flag: The state of the entry we insert */
> +#define DAX_IF_DIRTY (1 << 0)
> +#define DAX_IF_COW   (1 << 1)
> +
>  /*
>   * By this point grab_mapping_entry() has ensured that we have a locked entry
>   * of the appropriate size so we don't have to worry about downgrading PMDs 
> to
> @@ -729,16 +733,19 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
>   * already in the tree, we will skip the insertion and just dirty the PMD as
>   * appropriate.
>   */
> -static void *dax_insert_entry(struct xa_state *xas,
> - struct address_space *mapping, struct vm_fault *vmf,
> - void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> +static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
> + void *entry, pfn_t pfn, unsigned long flags,
> + unsigned int insert_flags)
>  {
> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
>   void *new_entry = dax_make_entry(pfn, flags);
> + bool dirty = insert_flags & DAX_IF_DIRTY;
> + bool cow = insert_flags & DAX_IF_COW;
>  
>   if (dirty)
>   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> - if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
> + if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
>   unsigned long index = xas->xa_index;
>   /* we are replacing a zero page with block mapping */
>   if (dax_is_pmd_entry(entry))
> @@ -750,7 +757,7 @@ static void *dax_insert_entry(struct xa_state *xas,
>  
>   xas_reset(xas);
>   xas_lock_irq(xas);
> - if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> + if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
>   void *old;
>  
>   dax_disassociate_entry(entry, mapping, false);
> @@ -774,6 +781,9 @@ static void *dax_insert_entry(struct xa_state *xas,
>   if (dirty)
>   xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
>  
> + if (cow)
> + xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
> +
>   xas_unlock_irq(xas);
>   return entry;
>  }
> @@ -1109,8 +1119,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
>   pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
>   vm_fault_t ret;
>  
> - *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> - DAX_ZERO_PAGE, false);
> + *entry = dax_insert_entry(xas, vmf, *entry, pfn, DAX_ZERO_PAGE, 0);
>  
>   ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
>   trace_dax_load_hole(inode, vmf, ret);
> @@ -1137,8 +1146,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>   goto fallback;
>  
>   pfn = page_to_pfn_t(zero_page);
> - *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> - DAX_PMD | DAX_ZERO_PAGE, false);
> + *entry = dax_insert_entry(xas, vmf, *entry, pfn,
> +   DAX_PMD | DAX_ZERO_PAGE, 0);
>  
>   if (arch_needs_pgtable_deposit()) {
>   pgtable = pte_alloc_one(vma->vm_mm);
> @@ -1448,6 +1457,7 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
>   bool write = vmf->flags & FAULT_FLAG_WRITE;
>   bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap);
>   unsigned long entry_flags = pmd ? DAX_PMD : 0;
> + unsigned int insert_flags = 0;
>   int err = 0;
>   pfn_t pfn;
>   void *kaddr;
> @@ -1470,8 +1480,15 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
>   if (err)
>   return pmd ? VM_FAULT_FALLBACK : dax_fau
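
[Editor's note: a self-contained sketch of the xarray-mark pattern that
dax_insert_entry() relies on; XA_MARK_1 stands in for
PAGECACHE_TAG_TOWRITE.  A mark is per-entry metadata that is cheap to
set under the xa_state lock and cheap to iterate later, which is how
writeback finds the CoW entries tagged above.]

#include <linux/xarray.h>

static void tag_entry(struct xarray *xa, unsigned long index)
{
	XA_STATE(xas, xa, index);

	xas_lock_irq(&xas);
	xas_load(&xas);			/* assumes an entry exists here */
	xas_set_mark(&xas, XA_MARK_1);	/* analogous to TOWRITE */
	xas_unlock_irq(&xas);
}

static void walk_tagged(struct xarray *xa, unsigned long last)
{
	XA_STATE(xas, xa, 0);
	void *entry;

	rcu_read_lock();
	xas_for_each_marked(&xas, entry, last, XA_MARK_1) {
		/* writeback would process each marked entry here */
	}
	rcu_read_unlock();
}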

Re: [PATCH v5 7/7] fs/xfs: Add dax dedupe support

2021-05-11 Thread Darrick J. Wong
On Tue, May 11, 2021 at 11:09:33AM +0800, Shiyang Ruan wrote:
> Introduce xfs_mmaplock_two_inodes_and_break_dax_layout() for dax files
> that are going to be deduped.  After that, call the compare range function
> only when both files are DAX or neither is.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/xfs_file.c|  2 +-
>  fs/xfs/xfs_inode.c   | 66 +++-
>  fs/xfs/xfs_inode.h   |  1 +
>  fs/xfs/xfs_reflink.c |  4 +--
>  4 files changed, 69 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 38d8eca05aee..bd5002d38df4 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -823,7 +823,7 @@ xfs_wait_dax_page(
>   xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
>  }
>  
> -static int
> +int
>  xfs_break_dax_layouts(
>   struct inode*inode,
>   bool*retry)
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 0369eb22c1bb..0774b6e2b940 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3711,6 +3711,64 @@ xfs_iolock_two_inodes_and_break_layout(
>   return 0;
>  }
>  
> +static int
> +xfs_mmaplock_two_inodes_and_break_dax_layout(
> + struct inode*src,
> + struct inode*dest)

MMAPLOCK is an xfs_inode lock, so please pass those in here.

> +{
> + int error, attempts = 0;
> + boolretry;
> + struct xfs_inode*ip0, *ip1;
> + struct page *page;
> + struct xfs_log_item *lp;
> +
> + if (src > dest)
> + swap(src, dest);

The MMAPLOCK (and ILOCK) locking order is increasing inode number, not
the address of the incore object.  This is different (and not consistent
with) i_rwsem/XFS_IOLOCK, but those are the rules.
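
[Editor's note: a minimal sketch of the ordering rule just stated; the
helper name is hypothetical.  Sort by i_ino, not by pointer, before
taking the MMAPLOCKs.]

static void mmaplock_two_in_order(struct xfs_inode *ip0, struct xfs_inode *ip1)
{
	if (ip0->i_ino > ip1->i_ino)
		swap(ip0, ip1);

	xfs_ilock(ip0, XFS_MMAPLOCK_EXCL);
	if (ip1 != ip0)
		xfs_ilock(ip1, xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1));
}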

> + ip0 = XFS_I(src);
> + ip1 = XFS_I(dest);
> +
> +again:
> + retry = false;
> + /* Lock the first inode */
> + xfs_ilock(ip0, XFS_MMAPLOCK_EXCL);
> + error = xfs_break_dax_layouts(src, &retry);
> + if (error || retry) {
> + xfs_iunlock(ip0, XFS_MMAPLOCK_EXCL);
> + goto again;
> + }
> +
> + if (src == dest)
> + return 0;
> +
> + /* Nested lock the second inode */
> + lp = &ip0->i_itemp->ili_item;
> + if (lp && test_bit(XFS_LI_IN_AIL, &lp->li_flags)) {
> + if (!xfs_ilock_nowait(ip1,
> + xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1))) {
> + xfs_iunlock(ip0, XFS_MMAPLOCK_EXCL);
> + if ((++attempts % 5) == 0)
> + delay(1); /* Don't just spin the CPU */
> + goto again;
> + }
> + } else
> + xfs_ilock(ip1, xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1));
> + /*
> +  * We cannot use xfs_break_dax_layouts() directly here because it may
> +  * need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
> +  * for this nested lock case.
> +  */
> + page = dax_layout_busy_page(dest->i_mapping);
> + if (page) {
> + if (page_ref_count(page) != 1) {

This could be flattened to:

if (page && page_ref_count(page) != 1) {
...
}

--D

> + xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> + xfs_iunlock(ip0, XFS_MMAPLOCK_EXCL);
> + goto again;
> + }
> + }
> +
> + return 0;
> +}
> +
>  /*
>   * Lock two inodes so that userspace cannot initiate I/O via file syscalls or
>   * mmap activity.
> @@ -3721,10 +3779,16 @@ xfs_ilock2_io_mmap(
>   struct xfs_inode*ip2)
>  {
>   int ret;
> + struct inode*ino1 = VFS_I(ip1);
> + struct inode*ino2 = VFS_I(ip2);
>  
> - ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
> + ret = xfs_iolock_two_inodes_and_break_layout(ino1, ino2);
>   if (ret)
>   return ret;
> +
> + if (IS_DAX(ino1) && IS_DAX(ino2))
> + return xfs_mmaplock_two_inodes_and_break_dax_layout(ino1, ino2);
> +
>   if (ip1 == ip2)
>   xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
>   else
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index ca826cfba91c..2d0b344fb100 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -457,6 +457,7 @@ enum xfs_prealloc_flags {
>  
>  int  xfs_update_prealloc_flags(struct xfs_inode *ip,
> enum xfs_prealloc_flags flags);
> +int  xfs_break_dax_layouts(struct inode *inode, bool *retry);
>  int  xfs_break_layouts(struct inode *inode, uint *iolock,
>   enum layout_break_reason reason);
>  
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 9a780948dbd0..ff308304c5cd 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1324,8 +1324,8 @@ xfs_reflink_remap_prep(
>   if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
>   goto out_unlock;
> 

Re: [PATCH v5 0/7] fsdax,xfs: Add reflink support for fsdax

2021-05-10 Thread Darrick J. Wong
On Tue, May 11, 2021 at 11:09:26AM +0800, Shiyang Ruan wrote:
> This patchset is an attempt to add CoW support for fsdax, taking XFS,
> which has both the reflink and fsdax features, as an example.

Slightly off topic, but I noticed all my pmem disappeared once I rolled
forward to 5.13-rc1.  Am I the only lucky one?  Qemu 4.2, with fake
memory devices backed by tmpfs files -- info qtree says they're there,
but the kernel doesn't show anything in /proc/iomem.

--D
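
[Editor's note: for reference, a typical qemu invocation for file-backed
emulated pmem of the kind described above; exact flags vary by qemu
version, so treat this as a sketch:

    qemu-system-x86_64 -machine pc,nvdimm=on \
        -m 4G,slots=4,maxmem=32G \
        -object memory-backend-file,id=mem1,share=on,mem-path=/tmp/nvdimm0,size=8G \
        -device nvdimm,memdev=mem1,id=nv1,label-size=2M \
        ...

The guest should then show the region in /proc/iomem as "Persistent
Memory".]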

> Changes from V4:
>  - Fix the mistake of breaking dax layout for two inodes
>  - Add CONFIG_FS_DAX judgement for fsdax code in remap_range.c
>  - Fix other small problems and mistakes
> 
> Changes from V3:
>  - Take out the first 3 patches as a cleanup patchset[1], which has been
> sent yesterday.
>  - Fix usage of code in dax_iomap_cow_copy()
>  - Add comments for macro definitions
>  - Fix other code style problems and mistakes
> 
> One of the key mechanisms that needs to be implemented in fsdax is CoW.
> We copy the data from srcmap before we actually write data to the
> destination iomap, and we only copy the range in which data won't be
> changed.
> 
> Another mechanism is range comparison.  In the page cache case,
> readpage() is used to load data from disk into the page cache so that it
> can be compared.  In the fsdax case, readpage() does not work, so we need
> another way to compare data, one with direct access support.
> 
> With the two mechanisms implemented in fsdax, we are able to make reflink
> and fsdax work together in XFS.
> 
> Some of the patches are picked up from Goldwyn's patchset.  I made some
> changes to adapt to this patchset.
> 
> 
> (Rebased on v5.13-rc1 and patchset[1])
> [1]: https://lkml.org/lkml/2021/4/22/575
> 
> Shiyang Ruan (7):
>   fsdax: Introduce dax_iomap_cow_copy()
>   fsdax: Replace mmap entry in case of CoW
>   fsdax: Add dax_iomap_cow_copy() for dax_iomap_zero
>   iomap: Introduce iomap_apply2() for operations on two files
>   fsdax: Dedup file range to use a compare function
>   fs/xfs: Handle CoW for fsdax write() path
>   fs/xfs: Add dax dedupe support
> 
>  fs/dax.c   | 206 +++--
>  fs/iomap/apply.c   |  52 +++
>  fs/iomap/buffered-io.c |   2 +-
>  fs/remap_range.c   |  57 ++--
>  fs/xfs/xfs_bmap_util.c |   3 +-
>  fs/xfs/xfs_file.c  |  11 +--
>  fs/xfs/xfs_inode.c |  66 -
>  fs/xfs/xfs_inode.h |   1 +
>  fs/xfs/xfs_iomap.c |  61 +++-
>  fs/xfs/xfs_iomap.h |   4 +
>  fs/xfs/xfs_iops.c  |   7 +-
>  fs/xfs/xfs_reflink.c   |  15 +--
>  include/linux/dax.h|   7 +-
>  include/linux/fs.h |  12 ++-
>  include/linux/iomap.h  |   7 +-
>  15 files changed, 449 insertions(+), 62 deletions(-)
> 
> -- 
> 2.31.1
> 
> 
> 
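
[Editor's note: for context, how userspace ultimately exercises the
dedupe path this series enables, via the FIDEDUPERANGE ioctl; error
handling is trimmed for brevity.]

#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

static int dedupe_one_range(int src_fd, __u64 src_off,
			    int dst_fd, __u64 dst_off, __u64 len)
{
	struct file_dedupe_range *arg;
	int ret;

	/* one destination range appended after the header */
	arg = calloc(1, sizeof(*arg) + sizeof(struct file_dedupe_range_info));
	arg->src_offset = src_off;
	arg->src_length = len;
	arg->dest_count = 1;
	arg->info[0].dest_fd = dst_fd;
	arg->info[0].dest_offset = dst_off;

	ret = ioctl(src_fd, FIDEDUPERANGE, arg);
	if (ret == 0 && arg->info[0].status != FILE_DEDUPE_RANGE_SAME)
		ret = -1;	/* ranges differed; nothing was deduped */
	free(arg);
	return ret;
}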


Re: [Virtio-fs] [PATCH v3 2/3] dax: Add a wakeup mode parameter to put_unlocked_entry()

2021-04-22 Thread Darrick J. Wong
On Thu, Apr 22, 2021 at 07:24:58AM +0100, Christoph Hellwig wrote:
> On Wed, Apr 21, 2021 at 12:09:54PM -0700, Dan Williams wrote:
> > Can you get in the habit of not replying inline with new patches like
> > this? Collect the review feedback, take a pause, and resend the full
> > series so tooling like b4 and patchwork can track when a new posting
> > supersedes a previous one. As is, this inline style inflicts manual
> > effort on the maintainer.
> 
> Honestly I don't mind it at all.  If your shiny new tooling can't handle
> it maybe you should fix your shiny new tooling instead of changing
> everyone's workflow?

Just speaking for XFS here, but I don't like inline resubmissions
because that makes it /really/ hard to find the original patch 6 months
later when everything has paged out of my brain but random enterprise
distro backporters start asking questions ("is this an actively
exploited security fix?" "what were you smoking?" etc).

At least change the subject line to something that screams "new patch!"
so that mutt and lore will make it stand out.

(Granted this isn't XFS, so I am not the enforcer here ;))

--D


Re: [PATCH v4 7/7] fs/xfs: Add dedupe support for fsdax

2021-04-08 Thread Darrick J. Wong
On Thu, Apr 08, 2021 at 08:04:32PM +0800, Shiyang Ruan wrote:
> Add xfs_break_two_dax_layouts() to break the layouts of two dax files.
> Then call the compare range function only when both files are DAX or
> neither is.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/xfs_file.c| 20 
>  fs/xfs/xfs_inode.c   |  8 +++-
>  fs/xfs/xfs_inode.h   |  1 +
>  fs/xfs/xfs_reflink.c |  5 +++--
>  4 files changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 5795d5d6f869..1fd457167c12 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -842,6 +842,26 @@ xfs_break_dax_layouts(
>   0, 0, xfs_wait_dax_page(inode));
>  }
>  
> +int
> +xfs_break_two_dax_layouts(
> + struct inode*src,
> + struct inode*dest)
> +{
> + int error;
> + boolretry = false;
> +
> +retry:
> + error = xfs_break_dax_layouts(src, &retry);
> + if (error || retry)
> + goto retry;
> +
> + error = xfs_break_dax_layouts(dest, &retry);
> + if (error || retry)
> + goto retry;
> +
> + return error;
> +}
> +
>  int
>  xfs_break_layouts(
>   struct inode*inode,
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index f93370bd7b1e..c01786917eef 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3713,8 +3713,10 @@ xfs_ilock2_io_mmap(
>   struct xfs_inode*ip2)
>  {
>   int ret;
> + struct inode*inode1 = VFS_I(ip1);
> + struct inode*inode2 = VFS_I(ip2);
>  
> - ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
> + ret = xfs_iolock_two_inodes_and_break_layout(inode1, inode2);
>   if (ret)
>   return ret;
>   if (ip1 == ip2)
> @@ -3722,6 +3724,10 @@ xfs_ilock2_io_mmap(
>   else
>   xfs_lock_two_inodes(ip1, XFS_MMAPLOCK_EXCL,
>   ip2, XFS_MMAPLOCK_EXCL);
> +
> + if (IS_DAX(inode1) && IS_DAX(inode2))
> + ret = xfs_break_two_dax_layouts(inode1, inode2);

This is wrong on many levels.

The first problem is that xfs_break_two_dax_layouts calls
xfs_break_dax_layouts twice even if inode1 == inode2, which is
unnecessary.

The second problem is that xfs_break_dax_layouts can cycle the MMAPLOCK
on the inode that it's processing.  Since there are two inodes in play
here, you must be /very/ careful about maintaining correct locking order,
which for the MMAPLOCK is increasing order of xfs_inode.i_ino.  If you
drop the MMAPLOCK for the lower-numbered inode for any reason, you have
to drop both MMAPLOCKs and try again.

In other words, you have to replace all that nice MMAPLOCK code with a
new xfs_mmaplock_two_inodes_and_break_dax_layouts function that is
structured similarly to what xfs_iolock_two_inodes_and_break_layout
does for the IOLOCK and PNFS layouts.
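
[Editor's note: a sketch of the shape being requested here (the v5
posting earlier in this archive implements it): lock in increasing i_ino
order, and if breaking layouts means cycling a MMAPLOCK, drop both locks
and retry from the top.  A real version would also wait instead of
spinning, as xfs_wait_dax_page() does.]

static int mmaplock_two_and_break_dax_layouts(struct xfs_inode *ip1,
					      struct xfs_inode *ip2)
{
	bool retry;
	int error;

	if (ip1->i_ino > ip2->i_ino)
		swap(ip1, ip2);
again:
	retry = false;
	xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
	error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
	if (error || retry) {
		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
		if (error)
			return error;
		goto again;
	}
	if (ip1 == ip2)
		return 0;

	/* Nested second lock; we may not cycle it in place. */
	xfs_ilock(ip2, xfs_lock_inumorder(XFS_MMAPLOCK_EXCL, 1));
	if (dax_layout_busy_page(VFS_I(ip2)->i_mapping)) {
		xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
		goto again;
	}
	return 0;
}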

> +
>   return 0;
>  }
>  
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 88ee4c3930ae..5ef21924dddc 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -435,6 +435,7 @@ enum xfs_prealloc_flags {
>  
>  int  xfs_update_prealloc_flags(struct xfs_inode *ip,
> enum xfs_prealloc_flags flags);
> +int  xfs_break_two_dax_layouts(struct inode *inode1, struct inode *inode2);
>  int  xfs_break_layouts(struct inode *inode, uint *iolock,
>   enum layout_break_reason reason);
>  
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index a4cd6e8a7aa0..4426bcc8a985 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -29,6 +29,7 @@
>  #include "xfs_iomap.h"
>  #include "xfs_sb.h"
>  #include "xfs_ag_resv.h"
> +#include 

Why is this necessary?

--D

>  
>  /*
>   * Copy on Write of Shared Blocks
> @@ -1324,8 +1325,8 @@ xfs_reflink_remap_prep(
>   if (XFS_IS_REALTIME_INODE(src) || XFS_IS_REALTIME_INODE(dest))
>   goto out_unlock;
>  
> - /* Don't share DAX file data for now. */
> - if (IS_DAX(inode_in) || IS_DAX(inode_out))
> + /* Don't share DAX file data with non-DAX file. */
> + if (IS_DAX(inode_in) != IS_DAX(inode_out))
>   goto out_unlock;
>  
>   if (!IS_DAX(inode_in))
> -- 
> 2.31.0
> 
> 
> 


Re: [PATCH v4 6/7] fs/xfs: Handle CoW for fsdax write() path

2021-04-08 Thread Darrick J. Wong
On Thu, Apr 08, 2021 at 08:04:31PM +0800, Shiyang Ruan wrote:
> In fsdax mode, WRITE and ZERO on a shared extent need CoW performed. After
> CoW, the newly allocated extents need to be remapped to the file.  So, add
> an iomap_end for dax write ops to do the remapping work.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/xfs_bmap_util.c |  3 +--
>  fs/xfs/xfs_file.c  |  9 +++
>  fs/xfs/xfs_iomap.c | 58 +-
>  fs/xfs/xfs_iomap.h |  4 +++
>  fs/xfs/xfs_iops.c  |  7 +++--
>  fs/xfs/xfs_reflink.c   |  3 +--
>  6 files changed, 69 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index e7d68318e6a5..9fcea33dd2c9 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -954,8 +954,7 @@ xfs_free_file_space(
>   return 0;
>   if (offset + len > XFS_ISIZE(ip))
>   len = XFS_ISIZE(ip) - offset;
> - error = iomap_zero_range(VFS_I(ip), offset, len, NULL,
> - _buffered_write_iomap_ops);
> + error = xfs_iomap_zero_range(VFS_I(ip), offset, len, NULL);
>   if (error)
>   return error;
>  
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index a007ca0711d9..5795d5d6f869 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -684,11 +684,8 @@ xfs_file_dax_write(
>   pos = iocb->ki_pos;
>  
>   trace_xfs_file_dax_write(iocb, from);
> - ret = dax_iomap_rw(iocb, from, _direct_write_iomap_ops);
> - if (ret > 0 && iocb->ki_pos > i_size_read(inode)) {
> - i_size_write(inode, iocb->ki_pos);
> - error = xfs_setfilesize(ip, pos, ret);
> - }
> + ret = dax_iomap_rw(iocb, from, _dax_write_iomap_ops);
> +
>  out:
>   if (iolock)
>   xfs_iunlock(ip, iolock);
> @@ -1309,7 +1306,7 @@ __xfs_filemap_fault(
>  
>   ret = dax_iomap_fault(vmf, pe_size, &pfn, NULL,
>   (write_fault && !vmf->cow_page) ?
> -  _direct_write_iomap_ops :
> +  _dax_write_iomap_ops :
>_read_iomap_ops);
>   if (ret & VM_FAULT_NEEDDSYNC)
>   ret = dax_finish_sync_fault(vmf, pe_size, pfn);
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index e17ab7f42928..f818f989687b 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -760,7 +760,8 @@ xfs_direct_write_iomap_begin(
>  
>   /* may drop and re-acquire the ilock */
>   error = xfs_reflink_allocate_cow(ip, &imap, &cmap, &shared,
> - &lockmode, flags & IOMAP_DIRECT);
> + &lockmode,
> + flags & IOMAP_DIRECT || IS_DAX(inode));

Parentheses, please:
(flags & IOMAP_DIRECT) || IS_DAX(inode));

>   if (error)
>   goto out_unlock;
>   if (shared)
> @@ -853,6 +854,38 @@ const struct iomap_ops xfs_direct_write_iomap_ops = {
>   .iomap_begin= xfs_direct_write_iomap_begin,
>  };
>  
> +static int
> +xfs_dax_write_iomap_end(
> + struct inode*inode,
> + loff_t  pos,
> + loff_t  length,
> + ssize_t written,
> + unsigned intflags,
> + struct iomap*iomap)
> +{
> + int error = 0;
> + xfs_inode_t *ip = XFS_I(inode);

Please don't use typedefs:

struct xfs_inode*ip = XFS_I(inode);

> + boolcow = xfs_is_cow_inode(ip);
> +
> + if (pos + written > i_size_read(inode)) {

What if we wrote zero bytes?  Usually that means error, right?

> + i_size_write(inode, pos + written);
> + error = xfs_setfilesize(ip, pos, written);
> + if (error && cow) {
> + xfs_reflink_cancel_cow_range(ip, pos, written, true);
> + return error;
> + }
> + }
> + if (cow)
> + error = xfs_reflink_end_cow(ip, pos, written);
> +
> + return error;
> +}
> +
> +const struct iomap_ops xfs_dax_write_iomap_ops = {
> + .iomap_begin= xfs_direct_write_iomap_begin,
> + .iomap_end  = xfs_dax_write_iomap_end,
> +};
> +
>  static int
>  xfs_buffered_write_iomap_begin(
>   struct inode*inode,
> @@ -1314,3 +1347,26 @@ xfs_xattr_iomap_begin(
>  const struct iomap_ops xfs_xattr_iomap_ops = {
>   .iomap_begin= xfs_xattr_iomap_begin,
>  };
> +
> +int
> +xfs_iomap_zero_range(
> + struct inode*inode,

Might as well pass the xfs_inode pointers directly into these two functions.

--D

> + loff_t  offset,
> + loff_t  len,
> + bool*did_zero)
> +{
> + return iomap_zero_range(inode, offset, len, did_zero,
> + IS_DAX(inode) ? 

Re: [PATCH v4 5/7] fsdax: Dedup file range to use a compare function

2021-04-08 Thread Darrick J. Wong
On Thu, Apr 08, 2021 at 08:04:30PM +0800, Shiyang Ruan wrote:
> With dax we cannot deal with readpage() etc. So, we create a dax
> comparison function which is similar to
> vfs_dedupe_file_range_compare().
> And introduce dax_remap_file_range_prep() for filesystem use.
> 
> Signed-off-by: Goldwyn Rodrigues 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/dax.c | 56 
>  fs/remap_range.c | 45 ---
>  fs/xfs/xfs_reflink.c |  9 +--
>  include/linux/dax.h  |  4 
>  include/linux/fs.h   | 12 ++
>  5 files changed, 112 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index fcd1e932716e..ba924b6629a6 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1849,3 +1849,59 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>   return dax_insert_pfn_mkwrite(vmf, pfn, order);
>  }
>  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> +
> +static loff_t dax_range_compare_actor(struct inode *ino1, loff_t pos1,
> + struct inode *ino2, loff_t pos2, loff_t len, void *data,
> + struct iomap *smap, struct iomap *dmap)
> +{
> + void *saddr, *daddr;
> + bool *same = data;
> + int ret;
> +
> + if (smap->type == IOMAP_HOLE && dmap->type == IOMAP_HOLE) {
> + *same = true;
> + return len;
> + }
> +
> + if (smap->type == IOMAP_HOLE || dmap->type == IOMAP_HOLE) {
> + *same = false;
> + return 0;
> + }
> +
> + ret = dax_iomap_direct_access(smap, pos1, ALIGN(pos1 + len, PAGE_SIZE),
> +   &saddr, NULL);
> + if (ret < 0)
> + return -EIO;
> +
> + ret = dax_iomap_direct_access(dmap, pos2, ALIGN(pos2 + len, PAGE_SIZE),
> +   &daddr, NULL);
> + if (ret < 0)
> + return -EIO;
> +
> + *same = !memcmp(saddr, daddr, len);
> + return len;
> +}
> +
> +int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> + struct inode *dest, loff_t destoff, loff_t len, bool *is_same,
> + const struct iomap_ops *ops)
> +{
> + int id, ret = 0;
> +
> + id = dax_read_lock();
> + while (len) {
> + ret = iomap_apply2(src, srcoff, dest, destoff, len, 0, ops,
> +is_same, dax_range_compare_actor);
> + if (ret < 0 || !*is_same)
> + goto out;
> +
> + len -= ret;
> + srcoff += ret;
> + destoff += ret;
> + }
> + ret = 0;
> +out:
> + dax_read_unlock(id);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(dax_dedupe_file_range_compare);
> diff --git a/fs/remap_range.c b/fs/remap_range.c
> index e4a5fdd7ad7b..1fab0db49c68 100644
> --- a/fs/remap_range.c
> +++ b/fs/remap_range.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/dax.h>
>  #include "internal.h"
>  
>  #include 
> @@ -199,9 +200,9 @@ static void vfs_unlock_two_pages(struct page *page1, struct page *page2)
>   * Compare extents of two files to see if they are the same.
>   * Caller must have locked both inodes to prevent write races.
>   */
> -static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> -  struct inode *dest, loff_t destoff,
> -  loff_t len, bool *is_same)
> +int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> +   struct inode *dest, loff_t destoff,
> +   loff_t len, bool *is_same)
>  {
>   loff_t src_poff;
>   loff_t dest_poff;
> @@ -280,6 +281,7 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>  out_error:
>   return error;
>  }
> +EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
>  
>  /*
>   * Check that the two inodes are eligible for cloning, the ranges make
> @@ -289,9 +291,11 @@ static int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
>   * If there's an error, then the usual negative error code is returned.
>   * Otherwise returns 0 with *len set to the request length.
>   */
> -int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> -   struct file *file_out, loff_t pos_out,
> -   loff_t *len, unsigned int remap_flags)
> +static int
> +__generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + loff_t *len, unsigned int remap_flags,
> + const struct iomap_ops *ops)

Can we rename @ops to @dax_read_ops instead?

>  {
>   struct inode *inode_in = file_inode(file_in);
>   struct inode *inode_out = file_inode(file_out);
> @@ -351,8 +355,15 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
>   if (remap_flags & 

Re: [PATCH v4 4/7] iomap: Introduce iomap_apply2() for operations on two files

2021-04-08 Thread Darrick J. Wong
On Thu, Apr 08, 2021 at 08:04:29PM +0800, Shiyang Ruan wrote:
> Some operations, such as comparing a range of data in two files under
> fsdax mode, require nested iomap_begin()/iomap_end() calls on two files.
> Thus, we introduce iomap_apply2() to accept arguments from two files and
> iomap_actor2_t for actions on two files.
> 
> Signed-off-by: Shiyang Ruan 

Kinda wish we weren't propagating even more indirect call usage, but oh
well.

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/apply.c  | 52 +++
>  include/linux/iomap.h |  7 +-
>  2 files changed, 58 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> index 26ab6563181f..0493da5286ad 100644
> --- a/fs/iomap/apply.c
> +++ b/fs/iomap/apply.c
> @@ -97,3 +97,55 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
>  
>   return written ? written : ret;
>  }
> +
> +loff_t
> +iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, loff_t pos2,
> + loff_t length, unsigned int flags, const struct iomap_ops *ops,
> + void *data, iomap_actor2_t actor)
> +{
> + struct iomap smap = { .type = IOMAP_HOLE };
> + struct iomap dmap = { .type = IOMAP_HOLE };
> + loff_t written = 0, ret, ret2 = 0;
> + loff_t len1 = length, len2, min_len;
> +
> + ret = ops->iomap_begin(ino1, pos1, len1, flags, &smap, NULL);
> + if (ret)
> + goto out;
> + if (WARN_ON(smap.offset > pos1)) {
> + written = -EIO;
> + goto out_src;
> + }
> + if (WARN_ON(smap.length == 0)) {
> + written = -EIO;
> + goto out_src;
> + }
> + len2 = min_t(loff_t, len1, smap.length);
> +
> + ret = ops->iomap_begin(ino2, pos2, len2, flags, &dmap, NULL);
> + if (ret)
> + goto out_src;
> + if (WARN_ON(dmap.offset > pos2)) {
> + written = -EIO;
> + goto out_dest;
> + }
> + if (WARN_ON(dmap.length == 0)) {
> + written = -EIO;
> + goto out_dest;
> + }
> + min_len = min_t(loff_t, len2, dmap.length);
> +
> + written = actor(ino1, pos1, ino2, pos2, min_len, data, &smap, &dmap);
> +
> +out_dest:
> + if (ops->iomap_end)
> + ret2 = ops->iomap_end(ino2, pos2, len2,
> +   written > 0 ? written : 0, flags, &dmap);
> +out_src:
> + if (ops->iomap_end)
> + ret = ops->iomap_end(ino1, pos1, len1,
> +  written > 0 ? written : 0, flags, &smap);
> +out:
> + if (written)
> + return written;
> + return ret ?: ret2;
> +}
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index d202fd2d0f91..9493c48bcc9c 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -150,10 +150,15 @@ struct iomap_ops {
>   */
>  typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
>   void *data, struct iomap *iomap, struct iomap *srcmap);
> -
> +typedef loff_t (*iomap_actor2_t)(struct inode *ino1, loff_t pos1,
> + struct inode *ino2, loff_t pos2, loff_t len, void *data,
> + struct iomap *smap, struct iomap *dmap);
>  loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
>   unsigned flags, const struct iomap_ops *ops, void *data,
>   iomap_actor_t actor);
> +loff_t iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2,
> + loff_t pos2, loff_t length, unsigned int flags,
> + const struct iomap_ops *ops, void *data, iomap_actor2_t actor);
>  
>  ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
>   const struct iomap_ops *ops);
> -- 
> 2.31.0
> 
> 
> 
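
[Editor's note: a worked example of the length clamping in
iomap_apply2().  Suppose the caller asks for length = 1048576 bytes, but
the first ->iomap_begin() returns a source mapping covering only 262144
bytes and the second returns a destination mapping covering 65536:

    len1    = 1048576                      (caller's request)
    len2    = min(len1, smap.length) =  262144
    min_len = min(len2, dmap.length) =   65536

The actor runs over 65536 bytes, and the caller's loop (see
dax_dedupe_file_range_compare) advances both offsets by the actor's
return value and applies again for the remainder.]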


Re: [PATCH v4 2/7] fsdax: Replace mmap entry in case of CoW

2021-04-08 Thread Darrick J. Wong
On Thu, Apr 08, 2021 at 08:04:27PM +0800, Shiyang Ruan wrote:
> We replace the existing entry to the newly allocated one in case of CoW.
> Also, we mark the entry as PAGECACHE_TAG_TOWRITE so writeback marks this
> entry as writeprotected.  This helps us snapshots so new write
> pagefaults after snapshots trigger a CoW.
> 
> Signed-off-by: Goldwyn Rodrigues 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Ritesh Harjani 
> ---
>  fs/dax.c | 39 ---
>  1 file changed, 28 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index b4fd3813457a..e6c1354b27a8 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -722,6 +722,10 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
>   return 0;
>  }
>  
> +/* DAX Insert Flag for the entry we insert */

Might be worth mentioning that these are xarray marks for the inserted
entry, since this comment didn't help much.

> +#define DAX_IF_DIRTY (1 << 0)
> +#define DAX_IF_COW   (1 << 1)
> +
>  /*
>   * By this point grab_mapping_entry() has ensured that we have a locked entry
>   * of the appropriate size so we don't have to worry about downgrading PMDs 
> to
> @@ -729,16 +733,19 @@ static int copy_cow_page_dax(struct block_device *bdev, struct dax_device *dax_d
>   * already in the tree, we will skip the insertion and just dirty the PMD as
>   * appropriate.
>   */
> -static void *dax_insert_entry(struct xa_state *xas,
> - struct address_space *mapping, struct vm_fault *vmf,
> - void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> +static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
> + void *entry, pfn_t pfn, unsigned long flags,
> + unsigned int insert_flags)

Urk, two flags arguments.  Oh, I see.  We insert (shifted) pfn_t values
into the mapping as xarray values, so @flags determines the state flags
of the new entry value, whereas @insert_flags determines what xarray
mark we're going to attach (if any) to the inserted value.

--D

>  {
> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
>   void *new_entry = dax_make_entry(pfn, flags);
> + bool dirty = insert_flags & DAX_IF_DIRTY;
> + bool cow = insert_flags & DAX_IF_COW;
>  
>   if (dirty)
>   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> - if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
> + if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
>   unsigned long index = xas->xa_index;
>   /* we are replacing a zero page with block mapping */
>   if (dax_is_pmd_entry(entry))
> @@ -750,7 +757,7 @@ static void *dax_insert_entry(struct xa_state *xas,
>  
>   xas_reset(xas);
>   xas_lock_irq(xas);
> - if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> + if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
>   void *old;
>  
>   dax_disassociate_entry(entry, mapping, false);
> @@ -774,6 +781,9 @@ static void *dax_insert_entry(struct xa_state *xas,
>   if (dirty)
>   xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
>  
> + if (cow)
> + xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
> +
>   xas_unlock_irq(xas);
>   return entry;
>  }
> @@ -1109,8 +1119,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
>   pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
>   vm_fault_t ret;
>  
> - *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> - DAX_ZERO_PAGE, false);
> + *entry = dax_insert_entry(xas, vmf, *entry, pfn, DAX_ZERO_PAGE, 0);
>  
>   ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
>   trace_dax_load_hole(inode, vmf, ret);
> @@ -1137,8 +1146,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>   goto fallback;
>  
>   pfn = page_to_pfn_t(zero_page);
> - *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> - DAX_PMD | DAX_ZERO_PAGE, false);
> + *entry = dax_insert_entry(xas, vmf, *entry, pfn,
> +   DAX_PMD | DAX_ZERO_PAGE, 0);
>  
>   if (arch_needs_pgtable_deposit()) {
>   pgtable = pte_alloc_one(vma->vm_mm);
> @@ -1444,6 +1453,7 @@ static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
>   bool write = vmf->flags & FAULT_FLAG_WRITE;
>   bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap);
>   unsigned long entry_flags = pmd ? DAX_PMD : 0;
> + unsigned int insert_flags = 0;
>   int err = 0;
>   pfn_t pfn;
>   void *kaddr;
> @@ -1466,8 +1476,15 @@ static vm_fault_t dax_fault_actor(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   if (err)
>   return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
>  
> - *entry = dax_insert_entry(xas, mapping, vmf, 

Re: [PATCH v4 1/7] fsdax: Introduce dax_iomap_cow_copy()

2021-04-08 Thread Darrick J. Wong
On Thu, Apr 08, 2021 at 08:04:26PM +0800, Shiyang Ruan wrote:
> In the case where the iomap is a write operation and iomap is not equal
> to srcmap after iomap_begin, we consider it a CoW operation.
> 
> The destination extent indicated by iomap is a newly allocated extent,
> so the data needs to be copied from srcmap to the new extent.
> In theory, it would be better to copy only the head and tail ranges that
> fall outside of the written area instead of copying the whole aligned
> range.  But a DAX page fault always covers an aligned range, so we have
> to copy the whole range in this case.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 
> ---
>  fs/dax.c | 82 
>  1 file changed, 77 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 8d7e4e2cc0fb..b4fd3813457a 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1038,6 +1038,61 @@ static int dax_iomap_direct_access(struct iomap 
> *iomap, loff_t pos, size_t size,
>   return rc;
>  }
>  
> +/**
> + * dax_iomap_cow_copy(): Copy the data from source to destination before 
> write.
> + * @pos: address to do copy from.
> + * @length:  size of copy operation.
> + * @align_size:  aligned w.r.t align_size (either PMD_SIZE or PAGE_SIZE)
> + * @srcmap:  iomap srcmap
> + * @daddr:   destination address to copy to.
> + *
> + * This can be called from two places. Either during DAX write fault, to copy
> + * the length size data to daddr. Or, while doing normal DAX write operation,
> + * dax_iomap_actor() might call this to do the copy of either start or end
> + * unaligned address. In this case the rest of the copy of aligned ranges is
> + * taken care by dax_iomap_actor() itself.

Er... what?  This description is very confusing to me.  /me reads the
code, and ...

OH.

Given a range (pos, length) and a mapping for a source file, this
function copies all the bytes between pos and (pos + length) to daddr if
the range is aligned to @align_size.  But if pos and length are not both
aligned to align_size then it'll copy *around* the range, leaving the
area in the middle uncopied waiting for write_iter to fill it in with
whatever's in the iovec.

Yikes, this function is doing double duty and ought to be split into
two functions.

The first function does the COW work for a write fault to an mmap
region and does a straight copy.  Page faults are always aligned, so
this functionality is needed by dax_fault_actor.  Maybe this could be
named dax_fault_cow?

The second function does the prep COW work *around* a write so that we
always copy entire page/blocks.  This cow-around code is needed by
dax_iomap_actor.  This should perhaps be named dax_iomap_cow_around()?
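Roughly (editor's sketch only, reusing the helpers from the patch; the
two function names are hypothetical, and the bodies just restate the
copy-all and copy-around paths of dax_iomap_cow_copy() quoted below):

  /* page faults are always aligned, so a straight copy suffices */
  static int dax_fault_cow(loff_t pos, loff_t length, struct iomap *srcmap,
          void *daddr)
  {
          void *saddr;
          int ret;

          ret = dax_iomap_direct_access(srcmap, pos, length, &saddr, NULL);
          if (ret)
                  return ret;
          return copy_mc_to_kernel(daddr, saddr, length) ? -EIO : 0;
  }

  /* copy only the unaligned head and tail *around* a write */
  static int dax_iomap_cow_around(loff_t pos, loff_t length, size_t align_size,
          struct iomap *srcmap, void *daddr)
  {
          loff_t head_off = pos & (align_size - 1);
          loff_t end = pos + length;
          loff_t pg_end = round_up(end, align_size);
          void *saddr;
          int ret;

          ret = dax_iomap_direct_access(srcmap, pos,
                          ALIGN(head_off + length, align_size), &saddr, NULL);
          if (ret)
                  return ret;
          /* head: bytes in the same block before the write range */
          if (head_off && copy_mc_to_kernel(daddr, saddr, head_off))
                  return -EIO;
          /* tail: bytes after the write range, up to the block boundary */
          if (end < pg_end && copy_mc_to_kernel(daddr + head_off + length,
                          saddr + head_off + length, pg_end - end))
                  return -EIO;
          return 0;
  }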

> + * Also, note DAX fault will always result in aligned pos and pos + length.
> + */
> +static int dax_iomap_cow_copy(loff_t pos, loff_t length, size_t align_size,
> + struct iomap *srcmap, void *daddr)
> +{
> + loff_t head_off = pos & (align_size - 1);
> + size_t size = ALIGN(head_off + length, align_size);
> + loff_t end = pos + length;
> + loff_t pg_end = round_up(end, align_size);
> + bool copy_all = head_off == 0 && end == pg_end;
> + void *saddr = 0;
> + int ret = 0;
> +
> + ret = dax_iomap_direct_access(srcmap, pos, size, &saddr, NULL);
> + if (ret)
> + return ret;
> +
> + if (copy_all) {
> + ret = copy_mc_to_kernel(daddr, saddr, length);
> + return ret ? -EIO : 0;

I find it /very/ interesting that copy_mc_to_kernel takes an unsigned
int argument but returns an unsigned long (counting the bytes that
didn't get copied, oddly...but that's an existing API so I guess I'll
let it go.)
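For reference, the prototype as I recall it from include/linux/uaccess.h
in this era (double-check against your tree):

  /* returns the number of bytes *not* copied; 0 means success */
  unsigned long copy_mc_to_kernel(void *dst, const void *src, unsigned len);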

> + }
> +
> + /* Copy the head part of the range.  Note: we pass offset as length. */
> + if (head_off) {
> + ret = copy_mc_to_kernel(daddr, saddr, head_off);
> + if (ret)
> + return -EIO;
> + }
> +
> + /* Copy the tail part of the range */
> + if (end < pg_end) {
> + loff_t tail_off = head_off + length;
> + loff_t tail_len = pg_end - end;
> +
> + ret = copy_mc_to_kernel(daddr + tail_off, saddr + tail_off,
> + tail_len);
> + if (ret)
> + return -EIO;
> + }
> + return 0;
> +}
> +
>  /*
>   * The user has performed a load from a hole in the file.  Allocating a new
>   * page in the file would cause excessive storage usage for workloads with
> @@ -1167,11 +1222,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   struct dax_device *dax_dev = iomap->dax_dev;
>   struct iov_iter *iter = data;
>   loff_t end = pos + length, done = 0;
> + bool write = iov_iter_rw(iter) == WRITE;
>   ssize_t ret = 0;
>   size_t xfer;
>   int id;
>  
> - if (iov_iter_rw(iter) == READ) {
> + if (!write) {
>   

Re: [PATCH v2 2/3] fsdax: Factor helper: dax_fault_actor()

2021-04-08 Thread Darrick J. Wong
On Wed, Apr 07, 2021 at 09:38:22PM +0800, Shiyang Ruan wrote:
> The core logic in the two dax page fault functions is similar. So, move
> the logic into a common helper function. Also, to facilitate the
> addition of new features, such as CoW, switch-case is no longer used to
> handle different iomap types.
> 
> Signed-off-by: Shiyang Ruan 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Ritesh Harjani 
> ---
>  fs/dax.c | 294 ---
>  1 file changed, 148 insertions(+), 146 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index f843fb8fbbf1..6dea1fc11b46 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1054,6 +1054,66 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
>   return ret;
>  }
>  
> +#ifdef CONFIG_FS_DAX_PMD
> +static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault 
> *vmf,
> + struct iomap *iomap, void **entry)
> +{
> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> + unsigned long pmd_addr = vmf->address & PMD_MASK;
> + struct vm_area_struct *vma = vmf->vma;
> + struct inode *inode = mapping->host;
> + pgtable_t pgtable = NULL;
> + struct page *zero_page;
> + spinlock_t *ptl;
> + pmd_t pmd_entry;
> + pfn_t pfn;
> +
> + zero_page = mm_get_huge_zero_page(vmf->vma->vm_mm);
> +
> + if (unlikely(!zero_page))
> + goto fallback;
> +
> + pfn = page_to_pfn_t(zero_page);
> + *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> + DAX_PMD | DAX_ZERO_PAGE, false);
> +
> + if (arch_needs_pgtable_deposit()) {
> + pgtable = pte_alloc_one(vma->vm_mm);
> + if (!pgtable)
> + return VM_FAULT_OOM;
> + }
> +
> + ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
> + if (!pmd_none(*(vmf->pmd))) {
> + spin_unlock(ptl);
> + goto fallback;
> + }
> +
> + if (pgtable) {
> + pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
> + mm_inc_nr_ptes(vma->vm_mm);
> + }
> + pmd_entry = mk_pmd(zero_page, vmf->vma->vm_page_prot);
> + pmd_entry = pmd_mkhuge(pmd_entry);
> + set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
> + spin_unlock(ptl);
> + trace_dax_pmd_load_hole(inode, vmf, zero_page, *entry);
> + return VM_FAULT_NOPAGE;
> +
> +fallback:
> + if (pgtable)
> + pte_free(vma->vm_mm, pgtable);
> + trace_dax_pmd_load_hole_fallback(inode, vmf, zero_page, *entry);
> + return VM_FAULT_FALLBACK;
> +}
> +#else
> +static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault 
> *vmf,
> + struct iomap *iomap, void **entry)
> +{
> + return VM_FAULT_FALLBACK;
> +}
> +#endif /* CONFIG_FS_DAX_PMD */
> +
>  s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>  {
>   sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
> @@ -1291,6 +1351,64 @@ static vm_fault_t dax_fault_cow_page(struct vm_fault 
> *vmf, struct iomap *iomap,
>   return ret;
>  }
>  
> +/**
> + * dax_fault_actor - Common actor to handle pfn insertion in PTE/PMD fault.
> + * @vmf: vm fault instance
> + * @pfnp: pfn to be returned
> + * @xas: the dax mapping tree of a file
> + * @entry:   an unlocked dax entry to be inserted
> + * @pmd: distinguish whether it is a pmd fault
> + * @flags:   iomap flags
> + * @iomap:   from iomap_begin()
> + * @srcmap:  from iomap_begin(), not equal to iomap if it is a CoW
> + */
> +static vm_fault_t dax_fault_actor(struct vm_fault *vmf, pfn_t *pfnp,
> + struct xa_state *xas, void **entry, bool pmd,
> + unsigned int flags, struct iomap *iomap, struct iomap *srcmap)
> +{
> + struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> + size_t size = pmd ? PMD_SIZE : PAGE_SIZE;
> + loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
> + bool write = vmf->flags & FAULT_FLAG_WRITE;
> + bool sync = dax_fault_is_synchronous(flags, vmf->vma, iomap);
> + unsigned long entry_flags = pmd ? DAX_PMD : 0;
> + int err = 0;
> + pfn_t pfn;
> +
> + /* if we are reading UNWRITTEN and HOLE, return a hole. */
> + if (!write &&
> + (iomap->type == IOMAP_UNWRITTEN || iomap->type == IOMAP_HOLE)) {
> + if (!pmd)
> + return dax_load_hole(xas, mapping, entry, vmf);
> + else
> + return dax_pmd_load_hole(xas, vmf, iomap, entry);
> + }
> +
> + if (iomap->type != IOMAP_MAPPED) {
> + WARN_ON_ONCE(1);
> + return pmd ? VM_FAULT_FALLBACK : VM_FAULT_SIGBUS;
> + }
> +
> + err = dax_iomap_pfn(iomap, pos, size, &pfn);
> + if (err)
> + return pmd ? VM_FAULT_FALLBACK : dax_fault_return(err);
> +
> + *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn, entry_flags,
> +   write && !sync);
> +
> + if (sync)
> + 

Re: [PATCH 2/3] mm, dax, pmem: Introduce dev_pagemap_failure()

2021-03-17 Thread Darrick J. Wong
On Wed, Mar 17, 2021 at 09:08:23PM -0700, Dan Williams wrote:
> Jason wondered why the get_user_pages_fast() path takes references on a
> @pgmap object. The rationale was to protect against accessing a 'struct
> page' that might be in the process of being removed by the driver, but
> he rightly points out that should be solved the same way all gup-fast
> synchronization is solved which is invalidate the mapping and let the
> gup slow path do @pgmap synchronization [1].
> 
> To achieve that it means that new user mappings need to stop being
> created and all existing user mappings need to be invalidated.
> 
> For device-dax this is already the case as kill_dax() prevents future
> faults from installing a pte, and the single device-dax inode
> address_space can be trivially unmapped.
> 
> The situation is different for filesystem-dax where device pages could
> be mapped by any number of inode address_space instances. An initial
> thought was to treat the device removal event like a drop_pagecache_sb()
> event that walks superblocks and unmaps all inodes. However, Dave points
> out that it is not just the filesystem user-mappings that need to react
> to global DAX page-unmap events, it is also filesystem metadata
> (proposed DAX metadata access), and other drivers (upstream
> DM-writecache) that need to react to this event [2].
> 
> The only kernel facility that is meant to globally broadcast the loss of
> a page (via corruption or surprise remove) is memory_failure(). The
> downside of memory_failure() is that it is a pfn-at-a-time interface.
> However, the events that would trigger the need to call memory_failure()
> over a full PMEM device should be rare. Remove should always be
> coordinated by the administrator with the filesystem. If someone force
> removes a device from underneath a mounted filesystem the driver assumes
> they have a good reason, or otherwise get to keep the pieces. Since
> ->remove() callbacks can not fail the only option is to trigger the mass
> memory_failure().
> 
> The mechanism to determine whether memory_failure() triggers at
> pmem->remove() time is whether the associated dax_device has an elevated
> reference at @pgmap ->kill() time.
> 
> With this in place the get_user_pages_fast() path can drop its
> half-measure synchronization with an @pgmap reference.
> 
> Link: http://lore.kernel.org/r/20210224010017.gq2643...@ziepe.ca [1]
> Link: http://lore.kernel.org/r/20210302075736.gj4...@dread.disaster.area [2]
> Reported-by: Jason Gunthorpe 
> Cc: Dave Chinner 
> Cc: Christoph Hellwig 
> Cc: Shiyang Ruan 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Ira Weiny 
> Cc: Matthew Wilcox 
> Cc: Jan Kara 
> Cc: Andrew Morton 
> Cc: Naoya Horiguchi 
> Cc: "Darrick J. Wong" 
> Signed-off-by: Dan Williams 
> ---
>  drivers/dax/super.c  |   15 +++
>  drivers/nvdimm/pmem.c|   10 +-
>  drivers/nvdimm/pmem.h|1 +
>  include/linux/dax.h  |5 +
>  include/linux/memremap.h |5 +
>  include/linux/mm.h   |3 +++
>  mm/memory-failure.c  |   11 +--
>  mm/memremap.c|   11 +++
>  8 files changed, 58 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 5fa6ae9dbc8b..5ebcedf4a68c 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -624,6 +624,21 @@ void put_dax(struct dax_device *dax_dev)
>  }
>  EXPORT_SYMBOL_GPL(put_dax);
>  
> +bool dax_is_idle(struct dax_device *dax_dev)
> +{
> + struct inode *inode;
> +
> + if (!dax_dev)
> + return true;
> +
> + WARN_ONCE(test_bit(DAXDEV_ALIVE, &dax_dev->flags),
> +   "dax idle check on live device.\n");
> +
> + inode = &dax_dev->inode;
> + return atomic_read(&inode->i_count) < 2;
> +}
> +EXPORT_SYMBOL_GPL(dax_is_idle);
> +
>  /**
>   * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
>   * @host: alternate name for the device registered by a dax driver
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index b8a85bfb2e95..e8822c9262ee 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -348,15 +348,21 @@ static void pmem_pagemap_kill(struct dev_pagemap *pgmap)
>  {
>   struct request_queue *q =
>   container_of(pgmap->ref, struct request_queue, q_usage_counter);
> + struct pmem_device *pmem = q->queuedata;
>  
>   blk_freeze_queue_start(q);
> + kill_dax(pmem->dax_dev);
> + if (!dax_is_idle(pmem->dax_dev)) {
> + dev_warn(pmem->dev,
> +  "DAX active at remove, trigger mass memory failure\n");
>

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-04 Thread Darrick J. Wong
On Tue, Mar 02, 2021 at 09:49:30AM -0800, Dan Williams wrote:
> On Mon, Mar 1, 2021 at 11:57 PM Dave Chinner  wrote:
> >
> > On Mon, Mar 01, 2021 at 09:41:02PM -0800, Dan Williams wrote:
> > > On Mon, Mar 1, 2021 at 7:28 PM Darrick J. Wong  wrote:
> > > > > > I really don't see you seem to be telling us that invalidation is an
> > > > > > either/or choice. There's more ways to convert physical block
> > > > > > address -> inode file offset and mapping index than brute force
> > > > > > inode cache walks
> > > > >
> > > > > Yes, but I was trying to map it to an existing mechanism and the
> > > > > internals of drop_pagecache_sb() are, in coarse terms, close to what
> > > > > needs to happen here.
> > > >
> > > > Yes.  XFS (with rmap enabled) can do all the iteration and walking in
> > > > that function except for the invalidate_mapping_* call itself.  The goal
> > > > of this series is first to wire up a callback within both the block and
> > > > pmem subsystems so that they can take notifications and reverse-map them
> > > > through the storage stack until they reach an fs superblock.
> > >
> > > I'm chuckling because this "reverse map all the way up the block
> > > layer" is the opposite of what Dave said at the first reaction to my
> > > proposal, "can't the mm map pfns to fs inode  address_spaces?".
> >
> > Ah, no, I never said that the filesystem can't do reverse maps. I
> > was asking if the mm could directly (brute-force) invalidate PTEs
> > pointing at physical pmem ranges without needing walk the inode
> > mappings. That would be far more efficient if it could be done

So, uh, /can/ the kernel brute-force invalidate PTEs when the pmem
driver says that something died?  Part of what's keeping me from putting
together a coherent vision for how this would work is my relative
unfamiliarity with all things mm/.

> > > Today whenever the pmem driver receives new corrupted range
> > > notification from the lower level nvdimm
> > > infrastructure(nd_pmem_notify) it updates the 'badblocks' instance
> > > associated with the pmem gendisk and then notifies userspace that
> > > there are new badblocks. This seems a perfect place to signal an upper
> > > level stacked block device that may also be watching disk->bb. Then
> > > each gendisk in a stacked topology is responsible for watching the
> > > badblock notifications of the next level and storing a remapped
> > > instance of those blocks until ultimately the filesystem mounted on
> > > the top-level block device is responsible for registering for those
> > > top-level disk->bb events.
> > >
> > > The device gone notification does not map cleanly onto 'struct badblocks'.
> >
> > Filesystems are not allowed to interact with the gendisk
> > infrastructure - that's for supporting the device side of a block
> > device. It's a layering violation, and many a filesytem developer
> > has been shouted at for trying to do this. At most we can peek
> > through it to query functionality support from the request queue,
> > but otherwise filesystems do not interact with anything under
> > bdev->bd_disk.
> 
> So lets add an api that allows the querying of badblocks by bdev and
> let the block core handle the bd_disk interaction. I see other block
> functionality like blk-integrity reaching through gendisk. The fs need
> not interact with the gendisk directly.

(I thought it was ok for block code to fiddle with other block
internals, and it's filesystems messing with block internals that was
prohibited?)

> > As it is, badblocks are used by devices to manage internal state.
> > e.g. md for recording stripes that need recovery if the system
> > crashes while they are being written out.
> 
> I know, I was there when it was invented which is why it was
> top-of-mind when pmem had a need to communicate badblocks. Other block
> drivers have threatened to use it for badblocks tracking, but none of
> those have carried through on that initial interest.

I hadn't realized that badblocks was bolted onto gendisk nowadays, I
mistakenly thought it was still something internal to md.

Looking over badblocks, I see a major drawback in that it can only
remember a single page's worth of badblocks records.
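For reference, the limit comes from the single-page table in
include/linux/badblocks.h (paraphrased from memory; check your tree for
the exact layout):

  /* each record is a u64 packing sector, length and an "ack" bit,
   * and all records share one page, so only ~511 ranges fit */
  #define MAX_BADBLOCKS   ((PAGE_SIZE / 8) - 1)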

> > > If an upper level agent really cared about knowing about ->remove()
> > > events before they happened it could maybe do something like:
> > >
> > > dev = disk_to_dev(bdev->bd_disk)->parent;
> >

Re: [PATCH v3 02/11] blk: Introduce ->corrupted_range() for block device

2021-03-04 Thread Darrick J. Wong
On Wed, Feb 10, 2021 at 02:21:39PM +0100, Christoph Hellwig wrote:
> On Mon, Feb 08, 2021 at 06:55:21PM +0800, Shiyang Ruan wrote:
> > In fsdax mode, the memory failure happens on block device.  So, it is
> > needed to introduce an interface for block devices.  Each kind of block
> > device can handle the memory failure in their own ways.
> 
> As told before: DAX operations please do not add anything to the block
> device.  We've been working very hard to decouple DAX from the block
> device, and while we're not done, regressing the split should not happen.

I agree with you (Christoph) that (strictly speaking) within the scope of
the DAX work this isn't needed; xfs should be able to consume the
->memory_failure events directly and DTRT.

My vision here, however, is to establish upcalls for /both/ types of
storage.

Regular block devices can use ->corrupted_range to push error
notifications upwards through the block stack to a filesystem, and we
can finally do a teensy bit more with scsi sense data about media
errors, or thinp wanting to warn the filesystem that it's getting low on
space and maybe this would be an agreeable time to self-FITRIM, or raid
noticing that a mirror is inconsistent and can the fs do something to
resolve the dispute, etc.  Maybe we can use this mechanism to warn a
filesystem that someone did "echo 1 > /sys/block/sda/device/delete" and
we had better persist everything while we still can.
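Concretely, the two hook shapes under discussion, as they appear in the
patches quoted elsewhere in this digest (other struct members elided):

  /* in block_device_operations, for the block stack: */
  int (*corrupted_range)(struct gendisk *disk,
                         struct block_device *target_bdev,
                         loff_t target_offset, size_t len, void *data);

  /* in super_operations, for the filesystem on top: */
  int (*corrupted_range)(struct super_block *sb, struct block_device *bdev,
                         loff_t offset, size_t len, void *data);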

Memory devices will use ->memory_failure to tell us about ADR errors,
and I guess upcoming and past hotremove events.  For fsdax you'd
probably have to send the announcement and invalidate the current ptes
to force filesystem pagefaults and the like.

Either way, I think this piece is fine, but I would change the dax
side to send the ->memory_failure events directly to xfs.

A gap here is that xfs can attach to rt/log devices but we don't
currently plumb in enough information that get_active_super can find
the correct filesystem.

I dunno, maybe we should add this to the thread here[1]?

[1] 
https://lore.kernel.org/linux-xfs/CAPcyv4g3ZwbdLFx8bqMcNvXyrob8y6sBXXu=xptmty0vsk5...@mail.gmail.com/T/#m55a5c67153d0d10f3ff05a69d7e502914d97ac9d

--D


Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-03-01 Thread Darrick J. Wong
On Mon, Mar 01, 2021 at 12:55:53PM -0800, Dan Williams wrote:
> On Sun, Feb 28, 2021 at 2:39 PM Dave Chinner  wrote:
> >
> > On Sat, Feb 27, 2021 at 03:40:24PM -0800, Dan Williams wrote:
> > > On Sat, Feb 27, 2021 at 2:36 PM Dave Chinner  wrote:
> > > > On Fri, Feb 26, 2021 at 02:41:34PM -0800, Dan Williams wrote:
> > > > > On Fri, Feb 26, 2021 at 1:28 PM Dave Chinner  
> > > > > wrote:
> > > > > > On Fri, Feb 26, 2021 at 12:59:53PM -0800, Dan Williams wrote:
> > > > it points to, check if it points to the PMEM that is being removed,
> > > > grab the page it points to, map that to the relevant struct page,
> > > > run collect_procs() on that page, then kill the user processes that
> > > > map that page.
> > > >
> > > > So why can't we walk the ptes, check the physical pages that they
> > > > map to and if they map to a pmem page we go poison that
> > > > page and that kills any user process that maps it.
> > > >
> > > > i.e. I can't see how unexpected pmem device unplug is any different
> > > > to an MCE delivering a hwpoison event to a DAX mapped page.
> > >
> > > I guess the tradeoff is walking a long list of inodes vs walking a
> > > large array of pages.
> >
> > Not really. You're assuming all a filesystem has to do is invalidate
> > everything if a device goes away, and that's not true. Finding if an
> > inode has a mapping that spans a specific device in a multi-device
> > filesystem can be a lot more complex than that. Just walking inodes
> > is easy - determining which inodes need invalidation is the hard
> > part.
> 
> That inode-to-device level of specificity is not needed for the same
> reason that drop_caches does not need to be specific. If the wrong
> page is unmapped a re-fault will bring it back, and re-fault will fail
> for the pages that are successfully removed.
> 
> > That's where ->corrupt_range() comes in - the filesystem is already
> > set up to do reverse mapping from physical range to inode(s)
> > offsets...
> 
> Sure, but what is the need to get to that level of specificity with
> the filesystem for something that should rarely happen in the course
> of normal operation outside of a mistake?

I can't tell if we're conflating the "a bunch of your pmem went bad"
case with the "all your dimms fell out of the machine" case.

If, say, a single cacheline's worth of pmem goes bad on a node with 2TB
of pmem, I certainly want that level of specificity.  Just notify the
users of the dead piece, don't flush the whole machine down the drain.

> > > There's likely always more pages than inodes, but perhaps it's more
> > > efficient to walk the 'struct page' array than sb->s_inodes?
> >
> > I really don't see you seem to be telling us that invalidation is an
> > either/or choice. There's more ways to convert physical block
> > address -> inode file offset and mapping index than brute force
> > inode cache walks
> 
> Yes, but I was trying to map it to an existing mechanism and the
> internals of drop_pagecache_sb() are, in coarse terms, close to what
> needs to happen here.

Yes.  XFS (with rmap enabled) can do all the iteration and walking in
that function except for the invalidate_mapping_* call itself.  The goal
of this series is first to wire up a callback within both the block and
pmem subsystems so that they can take notifications and reverse-map them
through the storage stack until they reach an fs superblock.

Once the information has reached XFS, it can use its own reverse
mappings to figure out which pages of which inodes are now targetted.
The future of DAX hw error handling can be that you throw the spitwad at
us, and it's our problem to distill that into mm invalidation calls.
XFS' reverse mapping data is indexed by storage location and isn't
sharded by address_space, so (except for the DIMMs falling out), we
don't need to walk the entire inode list or scan the entire mapping.
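Condensed from the xfs_fs_corrupted_range() patch quoted later in this
digest, the distiller boils down to an rmapbt range query (editor's
sketch; signatures as of this patchset's vintage):

  xfs_fsblock_t   fsbno = XFS_B_TO_FSB(mp, offset);
  xfs_agnumber_t  agno  = XFS_FSB_TO_AGNO(mp, fsbno);

  error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
  cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agno);
  /* rmap_low/rmap_high key the query to the damaged agbno range;
   * xfs_corrupt_helper() runs once per owner record found */
  error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
                               xfs_corrupt_helper, &flags);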

Between XFS and DAX and mm, the mm already has the invalidation calls,
xfs already has the distiller, and so all we need is that first bit.
The current mm code doesn't fully solve the problem, nor does it need
to, since it handles DRAM errors acceptably* already.

* Actually, the hwpoison code should _also_ be calling ->corrupted_range
when DRAM goes bad so that we can detect metadata failures and either
reload the buffer or (if it was dirty) shut down.

> >
> > .
> >
> > > > IOWs, what needs to happen at this point is very filesystem
> > > > specific. Assuming that "device unplug == filesystem dead" is not
> > > > correct, nor is specifying a generic action that assumes the
> > > > filesystem is dead because a device it is using went away.
> > >
> > > Ok, I think I set this discussion in the wrong direction implying any
> > > mapping of this action to a "filesystem dead" event. It's just a "zap
> > > all ptes" event and upper layers recover from there.
> >
> > Yes, that's exactly what ->corrupt_range() is intended for. It
> > allows the filesystem to lock out access to the bad range
> > and then 

Re: Question about the "EXPERIMENTAL" tag for dax in XFS

2021-02-26 Thread Darrick J. Wong
On Fri, Feb 26, 2021 at 09:45:45AM +, ruansy.f...@fujitsu.com wrote:
> Hi, guys
> 
> Beside this patchset, I'd like to confirm something about the
> "EXPERIMENTAL" tag for dax in XFS.
> 
> In XFS, the "EXPERIMENTAL" tag, which is reported in a warning message
> when we mount a pmem device with the dax option, has existed for a
> while.  It's a bit annoying when using the fsdax feature.  So, my initial
> intention was to remove this tag.  And I started to find out and solve
> the problems which prevent it from being removed.
> 
> As discussed before, there are 3 main problems.  The first one is "dax
> semantics", which has been resolved.  The remaining two are "RMAP for
> fsdax" and "support dax reflink for filesystem", which I have been
> working on.  



> So, what I want to confirm is: does it mean that we can remove the
> "EXPERIMENTAL" tag when the remaining two problems are solved?

Yes.  I'd keep the experimental tag for a cycle or two to make sure that
nothing new pops up, but otherwise the two patchsets you've sent close
those two big remaining gaps.  Thank you for working on this!

> Or maybe there are other important problems need to be fixed before
> removing it?  If there are, could you please show me that?

That remains to be seen through QA/validation, but I think that's it.

Granted, I still have to read through the two patchsets...

--D

> 
> Thank you.
> 
> 
> --
> Ruan Shiyang.


Re: [PATCH v2 07/10] iomap: Introduce iomap_apply2() for operations on two files

2021-02-25 Thread Darrick J. Wong
On Fri, Feb 26, 2021 at 08:20:27AM +0800, Shiyang Ruan wrote:
> Some operations, such as comparing a range of data in two files under
> fsdax mode, require nested iomap_begin()/iomap_end() calls on two files.  Thus,
> we introduce iomap_apply2() to accept arguments from two files and
> iomap_actor2_t for actions on two files.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/iomap/apply.c  | 51 +++
>  include/linux/iomap.h |  7 +-
>  2 files changed, 57 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/apply.c b/fs/iomap/apply.c
> index 26ab6563181f..fd2f8bde5791 100644
> --- a/fs/iomap/apply.c
> +++ b/fs/iomap/apply.c
> @@ -97,3 +97,54 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t 
> length, unsigned flags,
>  
>   return written ? written : ret;
>  }
> +
> +loff_t
> +iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2, loff_t 
> pos2,
> + loff_t length, unsigned int flags, const struct iomap_ops *ops,
> + void *data, iomap_actor2_t actor)
> +{
> + struct iomap smap = { .type = IOMAP_HOLE };
> + struct iomap dmap = { .type = IOMAP_HOLE };
> + loff_t written = 0, ret;
> +
> + ret = ops->iomap_begin(ino1, pos1, length, 0, &smap, NULL);
> + if (ret)
> + goto out_src;
> + if (WARN_ON(smap.offset > pos1)) {
> + written = -EIO;
> + goto out_src;
> + }
> + if (WARN_ON(smap.length == 0)) {
> + written = -EIO;
> + goto out_src;
> + }
> +
> + ret = ops->iomap_begin(ino2, pos2, length, 0, &dmap, NULL);
> + if (ret)
> + goto out_dest;
> + if (WARN_ON(dmap.offset > pos2)) {
> + written = -EIO;
> + goto out_dest;
> + }
> + if (WARN_ON(dmap.length == 0)) {
> + written = -EIO;
> + goto out_dest;
> + }
> +
> + /* make sure extent length of two file is equal */
> + if (WARN_ON(smap.length != dmap.length)) {

Why not set smap.length and dmap.length to min(smap.length, dmap.length) ?
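i.e. something like (editor's sketch):

  /* clamp both mappings to the shorter extent instead of bailing out */
  smap.length = dmap.length = min(smap.length, dmap.length);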

--D

> + written = -EIO;
> + goto out_dest;
> + }
> +
> + written = actor(ino1, pos1, ino2, pos2, length, data, &smap, &dmap);
> +
> +out_dest:
> + if (ops->iomap_end)
> + ret = ops->iomap_end(ino2, pos2, length, 0, 0, &dmap);
> +out_src:
> + if (ops->iomap_end)
> + ret = ops->iomap_end(ino1, pos1, length, 0, 0, &smap);
> +
> + return ret;
> +}
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 5bd3cac4df9c..913f98897a77 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -148,10 +148,15 @@ struct iomap_ops {
>   */
>  typedef loff_t (*iomap_actor_t)(struct inode *inode, loff_t pos, loff_t len,
>   void *data, struct iomap *iomap, struct iomap *srcmap);
> -
> +typedef loff_t (*iomap_actor2_t)(struct inode *ino1, loff_t pos1,
> + struct inode *ino2, loff_t pos2, loff_t len, void *data,
> + struct iomap *smap, struct iomap *dmap);
>  loff_t iomap_apply(struct inode *inode, loff_t pos, loff_t length,
>   unsigned flags, const struct iomap_ops *ops, void *data,
>   iomap_actor_t actor);
> +loff_t iomap_apply2(struct inode *ino1, loff_t pos1, struct inode *ino2,
> + loff_t pos2, loff_t length, unsigned int flags,
> + const struct iomap_ops *ops, void *data, iomap_actor2_t actor);
>  
>  ssize_t iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *from,
>   const struct iomap_ops *ops);
> -- 
> 2.30.1
> 
> 
> 


Re: [PATCH 5/7] fsdax: Dedup file range to use a compare function

2021-02-18 Thread Darrick J. Wong
On Wed, Feb 17, 2021 at 11:24:18AM +0800, Ruan Shiyang wrote:
> 
> 
> > On 2021/2/10 9:19 PM, Christoph Hellwig wrote:
> > On Tue, Feb 09, 2021 at 05:46:13PM +0800, Ruan Shiyang wrote:
> > > 
> > > 
> > > > On 2021/2/9 5:34 PM, Christoph Hellwig wrote:
> > > > On Tue, Feb 09, 2021 at 05:15:13PM +0800, Ruan Shiyang wrote:
> > > > > The dax dedupe comparison needs the iomap_ops pointer as an argument, so 
> > > > > my
> > > > > understanding is that we don't modify the argument list of
> > > > > generic_remap_file_range_prep(), but move its code into
> > > > > __generic_remap_file_range_prep() whose argument list can be modified 
> > > > > to
> > > > > accept the iomap_ops pointer.  Then it looks like this:
> > > > 
> > > > I'd say just add the iomap_ops pointer to
> > > > generic_remap_file_range_prep and do away with the extra wrappers.  We
> > > > only have three callers anyway.
> > > 
> > > OK.
> > 
> > So looking at this again I think your proposal actually is better,
> > given that the iomap variant is still DAX specific.  Sorry for
> > the noise.
> > 
> > Also I think dax_file_range_compare should use iomap_apply instead
> > of open coding it.
> > 
> 
> There are two files, which are not reflinked, that need direct_access()
> here.  iomap_apply() can handle only one file at a time.  So, it seems that
> iomap_apply() is not suitable for this case...
> 
> 
> The pseudo code of this process is as follows:
> 
>   srclen = ops->begin()
>   destlen = ops->begin()
> 
>   direct_access(&smap, &saddr)
>   direct_access(&dmap, &daddr)
> 
>   same = !memcmp(saddr, daddr, min(srclen, destlen))
> 
>   ops->end()
>   ops->end()
> 
> I think a nested call like this is necessary.  That's why I used the
> open-coded way.

This might be a good place to implement an iomap_apply2() loop that
actually /does/ walk all the extents of file1 and file2.  There's now
two users of this idiom.

(Possibly structured as a "get next mappings from both" generator
function like Matthew Wilcox keeps asking for. :))

--D

> 
> --
> Thanks,
> Ruan Shiyang.
> > 
> 
> 


Re: [PATCH RESEND v2 08/10] md: Implement ->corrupted_range()

2021-02-01 Thread Darrick J. Wong
On Fri, Jan 29, 2021 at 02:27:55PM +0800, Shiyang Ruan wrote:
> With the support of ->rmap(), it is possible to obtain the superblock on
> a mapped device.
> 
> If a pmem device is used as one target of a mapped device, we cannot
> obtain its superblock directly.  With the help of sysfs, the mapped
> device can be found from the target devices.  So, we iterate
> bdev->bd_holder_disks to obtain its mapped device.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/md/dm.c   | 61 +++
>  drivers/nvdimm/pmem.c | 11 +++-
>  fs/block_dev.c| 42 -

I feel like this ^^^ part that implements the generic ability for a block
device with a bad sector to notify whatever's holding onto it (fs, other
block device) should be in patch 2.  That's generic block layer code,
and it's hard to tell (when you're looking at patch 2) what the bare
function declaration in it is really supposed to do.

Also, this patch is still difficult to review because it mixes device
mapper, nvdimm, and block layer changes!

>  include/linux/genhd.h |  2 ++
>  4 files changed, 107 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 7bac564f3faa..31b0c340b695 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -507,6 +507,66 @@ static int dm_blk_report_zones(struct gendisk *disk, 
> sector_t sector,
>  #define dm_blk_report_zones  NULL
>  #endif /* CONFIG_BLK_DEV_ZONED */
>  
> +struct corrupted_hit_info {
> + struct block_device *bdev;
> + sector_t offset;
> +};
> +
> +static int dm_blk_corrupted_hit(struct dm_target *ti, struct dm_dev *dev,
> + sector_t start, sector_t count, void *data)
> +{
> + struct corrupted_hit_info *bc = data;
> +
> + return bc->bdev == (void *)dev->bdev &&
> + (start <= bc->offset && bc->offset < start + count);
> +
> +}
> +
> +struct corrupted_do_info {
> + size_t length;
> + void *data;
> +};
> +
> +static int dm_blk_corrupted_do(struct dm_target *ti, struct block_device 
> *bdev,
> +sector_t disk_sect, void *data)
> +{
> + struct corrupted_do_info *bc = data;
> + loff_t disk_off = to_bytes(disk_sect);
> + loff_t bdev_off = to_bytes(disk_sect - get_start_sect(bdev));
> +
> + return bd_corrupted_range(bdev, disk_off, bdev_off, bc->length, 
> bc->data);
> +}
> +
> +static int dm_blk_corrupted_range(struct gendisk *disk,
> +   struct block_device *target_bdev,
> +   loff_t target_offset, size_t len, void *data)
> +{
> + struct mapped_device *md = disk->private_data;
> + struct dm_table *map;
> + struct dm_target *ti;
> + sector_t target_sect = to_sector(target_offset);
> + struct corrupted_hit_info hi = {target_bdev, target_sect};
> + struct corrupted_do_info di = {len, data};
> + int srcu_idx, i, rc = -ENODEV;
> +
> + map = dm_get_live_table(md, &srcu_idx);
> + if (!map)
> + return rc;
> +
> + for (i = 0; i < dm_table_get_num_targets(map); i++) {
> + ti = dm_table_get_target(map, i);
> + if (!(ti->type->iterate_devices && ti->type->rmap))
> + continue;
> + if (!ti->type->iterate_devices(ti, dm_blk_corrupted_hit, &hi))
> + continue;
> +
> + rc = ti->type->rmap(ti, target_sect, dm_blk_corrupted_do, &di);

Why is it necessary to call ->iterate_devices here?

If you pass the target_bdev, offset, and length to the dm-target's
->rmap function, it should be able to work backwards through its mapping
logic to come up with all the LBA ranges of the mapped_device that
are affected, and then it can call bd_corrupted_range on each of those
reverse mappings.

It would be helpful to have the changes to dm-linear.c in this patch
too, since that's the only real implementation at this point.
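For dm-linear the reverse mapping is just the inverse of linear_map();
here is an editor's sketch under the callback-style prototype this patch
uses (rmap_callout_fn is a hypothetical name, and the callback signature
is simplified from the patch's dm_blk_corrupted_do()):

  typedef int (*rmap_callout_fn)(struct dm_target *ti, sector_t sector,
                                 void *data);

  static int linear_rmap(struct dm_target *ti, sector_t dev_sector,
                         rmap_callout_fn fn, void *data)
  {
          struct linear_c *lc = ti->private;      /* dm-linear private state */

          /* invert linear_map(): device sector -> mapped_device sector */
          return fn(ti, ti->begin + (dev_sector - lc->start), data);
  }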

> + break;
> + }
> +
> + dm_put_live_table(md, srcu_idx);
> + return rc;
> +}
> +
>  static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
>   struct block_device **bdev)
>  {
> @@ -3062,6 +3122,7 @@ static const struct block_device_operations dm_blk_dops 
> = {
>   .getgeo = dm_blk_getgeo,
>   .report_zones = dm_blk_report_zones,
>   .pr_ops = _pr_ops,
> + .corrupted_range = dm_blk_corrupted_range,
>   .owner = THIS_MODULE
>  };
>  
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 501959947d48..3d9f4ccbbd9e 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -256,21 +256,16 @@ static int pmem_rw_page(struct block_device *bdev, 
> sector_t sector,
>  static int pmem_corrupted_range(struct gendisk *disk, struct block_device 
> *bdev,
>   loff_t disk_offset, size_t len, void *data)
>  {
> - struct super_block *sb;
>   loff_t bdev_offset;
>   sector_t disk_sector 

Re: [PATCH RESEND v2 09/10] xfs: Implement ->corrupted_range() for XFS

2021-02-01 Thread Darrick J. Wong
On Fri, Jan 29, 2021 at 02:27:56PM +0800, Shiyang Ruan wrote:
> This function is used to handle errors which may cause data lost in
> filesystem.  Such as memory failure in fsdax mode.
> 
> In XFS, it requires "rmapbt" feature in order to query for files or
> metadata which associated to the corrupted data.  Then we could call fs
> recover functions to try to repair the corrupted data.(did not
> implemented in this patchset)

I would suggest:
"If the rmap feature of XFS enabled, we can query it to find files and
metadata which are associated with the corrupt data.  For now all we do
is kill processes with that file mapped into their address spaces, but
future patches could actually do something about corrupt metadata."

> After that, the memory failure also needs to notify the processes who
> are using those files.
> 
> Only support data device.  Realtime device is not supported for now.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/xfs_fsops.c |   5 +++
>  fs/xfs/xfs_mount.h |   1 +
>  fs/xfs/xfs_super.c | 109 +
>  3 files changed, 115 insertions(+)
> 
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 959ce91a3755..f03901a5c673 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -498,6 +498,11 @@ xfs_do_force_shutdown(
>  "Corruption of in-memory data detected.  Shutting down filesystem");
>   if (XFS_ERRLEVEL_HIGH <= xfs_error_level)
>   xfs_stack_trace();
> + } else if (flags & SHUTDOWN_CORRUPT_META) {
> + xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_CORRUPT,
> +"Corruption of on-disk metadata detected.  Shutting down filesystem");
> + if (XFS_ERRLEVEL_HIGH <= xfs_error_level)
> + xfs_stack_trace();
>   } else if (logerror) {
>   xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_LOGERROR,
>   "Log I/O Error Detected. Shutting down filesystem");
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index dfa429b77ee2..8f0df67ffcc1 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -274,6 +274,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int 
> flags, char *fname,
>  #define SHUTDOWN_LOG_IO_ERROR   0x0002  /* write attempt to the log 
> failed */
>  #define SHUTDOWN_FORCE_UMOUNT   0x0004  /* shutdown from a forced 
> unmount */
>  #define SHUTDOWN_CORRUPT_INCORE 0x0008  /* corrupt in-memory data 
> structures */
> +#define SHUTDOWN_CORRUPT_META   0x0010  /* corrupt metadata on device */
>  
>  /*
>   * Flags for xfs_mountfs
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 813be879a5e5..93093fe0ee8a 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -35,6 +35,11 @@
>  #include "xfs_refcount_item.h"
>  #include "xfs_bmap_item.h"
>  #include "xfs_reflink.h"
> +#include "xfs_alloc.h"
> +#include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_rtalloc.h"
> +#include "xfs_bit.h"
>  
>  #include 
>  #include 
> @@ -1105,6 +1110,109 @@ xfs_fs_free_cached_objects(
>   return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
>  }
>  
> +static int
> +xfs_corrupt_helper(
> + struct xfs_btree_cur*cur,
> + struct xfs_rmap_irec*rec,
> + void*data)
> +{
> + struct xfs_inode*ip;
> + struct address_space*mapping;
> + int rc = 0;
> + int *flags = data;
> +
> + if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> + (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + // TODO check and try to fix metadata
> + rc = -EFSCORRUPTED;

The xfs_force_shutdown() call should go here, since SHUTDOWN_CORRUPT_META
is specific to this case.

I guess one could also dig through the buffer cache and delwri_submit
those buffers or something.
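For the first suggestion, that would be (editor's sketch):

          if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
              (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
                  xfs_force_shutdown(cur->bc_mp, SHUTDOWN_CORRUPT_META);
                  rc = -EFSCORRUPTED;
          }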

> + } else {
> + /*
> +  * Get files that incore, filter out others that are not in use.
> +  */
> + rc = xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner,
> +   XFS_IGET_INCORE, 0, &ip);
> + if (rc || !ip)
> + return rc;
> + if (!VFS_I(ip)->i_mapping)
> + goto out;
> +
> + mapping = VFS_I(ip)->i_mapping;
> + if (IS_DAX(VFS_I(ip)))
> + rc = mf_dax_mapping_kill_procs(mapping, rec->rm_offset,
> +*flags);
> + else
> + mapping_set_error(mapping, -EIO);
> +
> + // TODO try to fix data

What could we do to fix the data?  If we're not in S_DAX mode and
there's actually pagecache mapped in, does that imply that we could
mark it dirty and kick off dirty pagecache writeback?

> +out:
> + xfs_irele(ip);
> + }
> +
> + return rc;
> +}
> +
> +static int
> 

Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping

2021-01-14 Thread Darrick J. Wong
On Thu, Jan 14, 2021 at 05:38:33PM +0800, zhong jiang wrote:
> 
> On 2021/1/14 11:52 AM, Ruan Shiyang wrote:
> > 
> > 
> > On 2021/1/14 11:26 AM, zhong jiang wrote:
> > > 
> > > On 2021/1/14 9:44 AM, Ruan Shiyang wrote:
> > > > 
> > > > 
> > > > On 2021/1/13 6:04 PM, zhong jiang wrote:
> > > > > 
> > > > > On 2021/1/12 10:55 AM, Ruan Shiyang wrote:
> > > > > > 
> > > > > > 
> > > > > > On 2021/1/6 11:41 PM, Jan Kara wrote:
> > > > > > > On Thu 31-12-20 00:55:55, Shiyang Ruan wrote:
> > > > > > > > The current memory_failure_dev_pagemap() can
> > > > > > > > only handle single-mapped
> > > > > > > > dax page for fsdax mode.  The dax page could be
> > > > > > > > mapped by multiple files
> > > > > > > > and offsets if we let reflink feature & fsdax
> > > > > > > > mode work together. So,
> > > > > > > > we refactor the current implementation to support
> > > > > > > > handling memory failure on
> > > > > > > > each file and offset.
> > > > > > > > 
> > > > > > > > Signed-off-by: Shiyang Ruan 
> > > > > > > 
> > > > > > > Overall this looks OK to me, a few comments below.
> > > > > > > 
> > > > > > > > ---
> > > > > > > >   fs/dax.c    | 21 +++
> > > > > > > >   include/linux/dax.h |  1 +
> > > > > > > >   include/linux/mm.h  |  9 +
> > > > > > > >   mm/memory-failure.c | 91
> > > > > > > > ++---
> > > > > > > >   4 files changed, 100 insertions(+), 22 deletions(-)
> > > > > > 
> > > > > > ...
> > > > > > 
> > > > > > > >   @@ -345,9 +348,12 @@ static void
> > > > > > > > add_to_kill(struct task_struct *tsk, struct page
> > > > > > > > *p,
> > > > > > > >   }
> > > > > > > >     tk->addr = page_address_in_vma(p, vma);
> > > > > > > > -    if (is_zone_device_page(p))
> > > > > > > > -    tk->size_shift = dev_pagemap_mapping_shift(p, vma);
> > > > > > > > -    else
> > > > > > > > +    if (is_zone_device_page(p)) {
> > > > > > > > +    if (is_device_fsdax_page(p))
> > > > > > > > +    tk->addr = vma->vm_start +
> > > > > > > > +    ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > > > > > > 
> > > > > > > It seems strange to use 'pgoff' for dax pages and
> > > > > > > not for any other page.
> > > > > > > Why? I'd rather pass correct pgoff from all callers
> > > > > > > of add_to_kill() and
> > > > > > > avoid this special casing...
> > > > > > 
> > > > > > Because one fsdax page can be shared by multiple pgoffs.
> > > > > > I have to pass each pgoff in each iteration to calculate
> > > > > > the address in vma (for tk->addr).  Other kinds of pages
> > > > > > don't need this. They can get their unique address by
> > > > > > calling "page_address_in_vma()".
> > > > > > 
> > > > > IMO, an fsdax page can be shared by multiple files rather
> > > > > than multiple pgoffs if the fs supports reflink, because a
> > > > > page is only located in one mapping (page->mapping is
> > > > > exclusive); hence it only has one pgoff or index pointing at
> > > > > the node.
> > > > > 
> > > > > Or am I missing something about the feature?  Thanks,
> > > > 
> > > > Yes, a fsdax page is shared by multiple files because of
> > > > reflink. I think my description of 'pgoff' here is not correct. 
> > > > This 'pgoff' means the offset within a file. (We use rmap to
> > > > find out all the sharing files and their offsets.)  So, I said
> > > > that "can be shared by multiple pgoffs".  It's my bad.
> > > > 
> > > > I think I should name it another word to avoid misunderstandings.
> > > > 
> > > IMO, all the sharing files should use the same offset to share the
> > > fsdax page.  Why not?
> > 
> > The dedupe operation can let different files share the same data
> > extent, though the offsets are not the same.  So, files can share one
> > fsdax page at different offsets.
> Ok,  Get it.
> > 
> > > As you have said, a shared fsdax page should be inserted into
> > > different files' mappings, but page->index and page->mapping are
> > > exclusive; hence a page should only be placed in one mapping tree.
> > 
> > We can't use page->mapping and page->index here for reflink & fsdax, and
> > that's what this patchset aims to solve.  I introduced a series of
> > ->corrupted_range() calls, from mm to the pmem driver to the block device
> > and finally to the filesystem, using the filesystem's rmap feature to
> > find all files sharing the same data extent (fsdax page).
> 
> From this patch, each file has a mapping tree, and the shared page will be
> inserted into multiple files' mapping trees.  Then the filesystem uses file
> and offset to find the processes to kill.  Is that correct?

FWIW I thought the purpose of this patchset is to remove the (dax)
memory poison code's reliance on the pagecache mapping structure by
pushing poison notifications directly into the filesystem and letting
the filesystem perform reverse lookup operations to figure out which
file(s) have gone bad, and using the file list to call back into the mm
to kill processes.

Once that's done, I think(?) that puts us significantly closer 

Re: [PATCH 02/10] blk: Introduce ->corrupted_range() for block device

2021-01-08 Thread Darrick J. Wong
On Fri, Jan 08, 2021 at 10:55:00AM +0100, Christoph Hellwig wrote:
> It happens on a dax_device.  We should not intertwine dax and block_device
> even more after a lot of good work has happened to disentangle them.

I agree that the dax device should not be implied from the block device,
but what happens if regular block device drivers grow the ability to
(say) perform a background integrity scan and want to ->corrupted_range?

--D


Re: [RFC PATCH v3 8/9] md: Implement ->corrupted_range()

2021-01-08 Thread Darrick J. Wong
On Fri, Jan 08, 2021 at 05:52:11PM +0800, Ruan Shiyang wrote:
> 
> 
> On 2021/1/5 7:34 AM, Darrick J. Wong wrote:
> > On Fri, Dec 18, 2020 at 10:11:54AM +0800, Ruan Shiyang wrote:
> > > 
> > > 
> > On 2020/12/16 4:51 AM, Darrick J. Wong wrote:
> > > > On Tue, Dec 15, 2020 at 08:14:13PM +0800, Shiyang Ruan wrote:
> > > > > With the support of ->rmap(), it is possible to obtain the superblock 
> > > > > on
> > > > > a mapped device.
> > > > > 
> > > > > If a pmem device is used as one target of mapped device, we cannot
> > > > > obtain its superblock directly.  With the help of SYSFS, the mapped
> > > > > device can be found on the target devices.  So, we iterate the
> > > > > bdev->bd_holder_disks to obtain its mapped device.
> > > > > 
> > > > > Signed-off-by: Shiyang Ruan 
> > > > > ---
> > > > >drivers/md/dm.c   | 66 
> > > > > +++
> > > > >drivers/nvdimm/pmem.c |  9 --
> > > > >fs/block_dev.c| 21 ++
> > > > >include/linux/genhd.h |  7 +
> > > > >4 files changed, 100 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > > > index 4e0cbfe3f14d..9da1f9322735 100644
> > > > > --- a/drivers/md/dm.c
> > > > > +++ b/drivers/md/dm.c
> > > > > @@ -507,6 +507,71 @@ static int dm_blk_report_zones(struct gendisk 
> > > > > *disk, sector_t sector,
> > > > >#define dm_blk_report_zones NULL
> > > > >#endif /* CONFIG_BLK_DEV_ZONED */
> > > > > +struct dm_blk_corrupt {
> > > > > + struct block_device *bdev;
> > > > > + sector_t offset;
> > > > > +};
> > > > > +
> > > > > +static int dm_blk_corrupt_fn(struct dm_target *ti, struct dm_dev 
> > > > > *dev,
> > > > > + sector_t start, sector_t len, void 
> > > > > *data)
> > > > > +{
> > > > > + struct dm_blk_corrupt *bc = data;
> > > > > +
> > > > > + return bc->bdev == (void *)dev->bdev &&
> > > > > + (start <= bc->offset && bc->offset < start + 
> > > > > len);
> > > > > +}
> > > > > +
> > > > > +static int dm_blk_corrupted_range(struct gendisk *disk,
> > > > > +   struct block_device *target_bdev,
> > > > > +   loff_t target_offset, size_t len, 
> > > > > void *data)
> > > > > +{
> > > > > + struct mapped_device *md = disk->private_data;
> > > > > + struct block_device *md_bdev = md->bdev;
> > > > > + struct dm_table *map;
> > > > > + struct dm_target *ti;
> > > > > + struct super_block *sb;
> > > > > + int srcu_idx, i, rc = 0;
> > > > > + bool found = false;
> > > > > + sector_t disk_sec, target_sec = to_sector(target_offset);
> > > > > +
> > > > > + map = dm_get_live_table(md, &srcu_idx);
> > > > > + if (!map)
> > > > > + return -ENODEV;
> > > > > +
> > > > > + for (i = 0; i < dm_table_get_num_targets(map); i++) {
> > > > > + ti = dm_table_get_target(map, i);
> > > > > + if (ti->type->iterate_devices && ti->type->rmap) {
> > > > > + struct dm_blk_corrupt bc = {target_bdev, 
> > > > > target_sec};
> > > > > +
> > > > > + found = ti->type->iterate_devices(ti, 
> > > > > dm_blk_corrupt_fn, &bc);
> > > > > + if (!found)
> > > > > + continue;
> > > > > + disk_sec = ti->type->rmap(ti, target_sec);
> > > > 
> > > > What happens if the dm device has multiple reverse mappings because the
> > > > physical storage is being shared at multiple LBAs?  (e.g. a
> > > > deduplication target)
> > > 
> > > I thought that the dm device knows the mapping relationship, and it 

Re: [RFC PATCH v3 8/9] md: Implement ->corrupted_range()

2021-01-04 Thread Darrick J. Wong
On Fri, Dec 18, 2020 at 10:11:54AM +0800, Ruan Shiyang wrote:
> 
> 
> > On 2020/12/16 4:51 AM, Darrick J. Wong wrote:
> > On Tue, Dec 15, 2020 at 08:14:13PM +0800, Shiyang Ruan wrote:
> > > With the support of ->rmap(), it is possible to obtain the superblock on
> > > a mapped device.
> > > 
> > > If a pmem device is used as one target of mapped device, we cannot
> > > obtain its superblock directly.  With the help of SYSFS, the mapped
> > > device can be found on the target devices.  So, we iterate the
> > > bdev->bd_holder_disks to obtain its mapped device.
> > > 
> > > Signed-off-by: Shiyang Ruan 
> > > ---
> > >   drivers/md/dm.c   | 66 +++
> > >   drivers/nvdimm/pmem.c |  9 --
> > >   fs/block_dev.c| 21 ++
> > >   include/linux/genhd.h |  7 +
> > >   4 files changed, 100 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > > index 4e0cbfe3f14d..9da1f9322735 100644
> > > --- a/drivers/md/dm.c
> > > +++ b/drivers/md/dm.c
> > > @@ -507,6 +507,71 @@ static int dm_blk_report_zones(struct gendisk *disk, 
> > > sector_t sector,
> > >   #define dm_blk_report_zones NULL
> > >   #endif /* CONFIG_BLK_DEV_ZONED */
> > > +struct dm_blk_corrupt {
> > > + struct block_device *bdev;
> > > + sector_t offset;
> > > +};
> > > +
> > > +static int dm_blk_corrupt_fn(struct dm_target *ti, struct dm_dev *dev,
> > > + sector_t start, sector_t len, void *data)
> > > +{
> > > + struct dm_blk_corrupt *bc = data;
> > > +
> > > + return bc->bdev == (void *)dev->bdev &&
> > > + (start <= bc->offset && bc->offset < start + len);
> > > +}
> > > +
> > > +static int dm_blk_corrupted_range(struct gendisk *disk,
> > > +   struct block_device *target_bdev,
> > > +   loff_t target_offset, size_t len, void *data)
> > > +{
> > > + struct mapped_device *md = disk->private_data;
> > > + struct block_device *md_bdev = md->bdev;
> > > + struct dm_table *map;
> > > + struct dm_target *ti;
> > > + struct super_block *sb;
> > > + int srcu_idx, i, rc = 0;
> > > + bool found = false;
> > > + sector_t disk_sec, target_sec = to_sector(target_offset);
> > > +
> > > + map = dm_get_live_table(md, &srcu_idx);
> > > + if (!map)
> > > + return -ENODEV;
> > > +
> > > + for (i = 0; i < dm_table_get_num_targets(map); i++) {
> > > + ti = dm_table_get_target(map, i);
> > > + if (ti->type->iterate_devices && ti->type->rmap) {
> > > + struct dm_blk_corrupt bc = {target_bdev, target_sec};
> > > +
> > > + found = ti->type->iterate_devices(ti, 
> > > dm_blk_corrupt_fn, &bc);
> > > + if (!found)
> > > + continue;
> > > + disk_sec = ti->type->rmap(ti, target_sec);
> > 
> > What happens if the dm device has multiple reverse mappings because the
> > physical storage is being shared at multiple LBAs?  (e.g. a
> > deduplication target)
> 
> I thought that the dm device knows the mapping relationship, and it can be
> done by the implementation of ->rmap() in each target.  Did I understand it
> wrong?

The dm device /does/ know the mapping relationship.  I'm asking what
happens if there are *multiple* mappings.  For example, a deduplicating
dm device could observe that the upper level code wrote some data to
sector 200 and now it wants to write the same data to sector 500.
Instead of writing twice, it simply maps sector 500 in its LBA space to
the same space that it mapped sector 200.

Pretend that sector 200 on the dm-dedupe device maps to sector 64 on the
underlying storage (call it /dev/pmem1 and let's say it's the only
target sitting underneath the dm-dedupe device).

If /dev/pmem1 then notices that sector 64 has gone bad, it will start
calling ->corrupted_range handlers until it calls dm_blk_corrupted_range
on the dm-dedupe device.  At least in theory, the dm-dedupe driver's
rmap method ought to return both (64 -> 200) and (64 -> 500) so that
dm_blk_corrupted_range can pass on both corruption notices to whatever's
sitting atop the dedupe device.
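A callback-style ->rmap, like the one adopted in the RESEND v2 quoted
earlier in this digest, can express that.  Here is an editor's sketch
with the numbers from the example above (dedupe_rmap, the hard-coded
mappings, and rmap_callout_fn from the earlier dm-linear sketch are all
illustrative):

  static int dedupe_rmap(struct dm_target *ti, sector_t dev_sector,
                         rmap_callout_fn fn, void *data)
  {
          int rc;

          /* hypothetical dedupe index: physical sector 64 is mapped
           * at LBAs 200 and 500 */
          if (dev_sector == 64) {
                  rc = fn(ti, 200, data);
                  if (rc)
                          return rc;
                  return fn(ti, 500, data);
          }
          return 0;
  }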

At the moment, your ->rmap prototype is only capable of returning one
sector_t mappi

Re: [PATCH 09/10] xfs: Implement ->corrupted_range() for XFS

2021-01-04 Thread Darrick J. Wong
On Thu, Dec 31, 2020 at 12:56:00AM +0800, Shiyang Ruan wrote:
> This function is used to handle errors which may cause data loss in the
> filesystem, such as memory failure in fsdax mode.
> 
> In XFS, it requires the "rmapbt" feature in order to query for files or
> metadata which are associated with the corrupted data.  Then we could
> call fs recovery functions to try to repair the corrupted data.  (Not
> implemented in this patchset.)
> 
> After that, the memory failure also needs to notify the processes that
> are using those files.
> 
> Only support data device.  Realtime device is not supported for now.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/xfs_fsops.c |   5 +++
>  fs/xfs/xfs_mount.h |   1 +
>  fs/xfs/xfs_super.c | 107 +
>  3 files changed, 113 insertions(+)
> 
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index ef1d5bb88b93..0a2038875d32 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -501,6 +501,11 @@ xfs_do_force_shutdown(
>  "Corruption of in-memory data detected.  Shutting down filesystem");
>   if (XFS_ERRLEVEL_HIGH <= xfs_error_level)
>   xfs_stack_trace();
> + } else if (flags & SHUTDOWN_CORRUPT_META) {
> + xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_CORRUPT,
> +"Corruption of on-disk metadata detected.  Shutting down filesystem");
> + if (XFS_ERRLEVEL_HIGH <= xfs_error_level)
> + xfs_stack_trace();
>   } else if (logerror) {
>   xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_LOGERROR,
>   "Log I/O Error Detected. Shutting down filesystem");
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index dfa429b77ee2..8f0df67ffcc1 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -274,6 +274,7 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
>  #define SHUTDOWN_LOG_IO_ERROR	0x0002	/* write attempt to the log failed */
>  #define SHUTDOWN_FORCE_UMOUNT	0x0004	/* shutdown from a forced unmount */
>  #define SHUTDOWN_CORRUPT_INCORE	0x0008	/* corrupt in-memory data structures */
> +#define SHUTDOWN_CORRUPT_META	0x0010	/* corrupt metadata on device */
>  
>  /*
>   * Flags for xfs_mountfs
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index e3e229e52512..cbcad419bb9e 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -35,6 +35,11 @@
>  #include "xfs_refcount_item.h"
>  #include "xfs_bmap_item.h"
>  #include "xfs_reflink.h"
> +#include "xfs_alloc.h"
> +#include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_rtalloc.h"
> +#include "xfs_bit.h"
>  
>  #include 
>  #include 
> @@ -1103,6 +1108,107 @@ xfs_fs_free_cached_objects(
>   return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
>  }
>  
> +static int
> +xfs_corrupt_helper(
> + struct xfs_btree_cur*cur,
> + struct xfs_rmap_irec*rec,
> + void*data)
> +{
> + struct xfs_inode*ip;
> + struct address_space*mapping;
> + int rc = 0;
> + int *flags = data;
> +
> + if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> + (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + // TODO check and try to fix metadata
> + rc = -EFSCORRUPTED;
> + } else {
> + /*
> +  * Get files that are incore; filter out others that are not in use.
> +  */
> + rc = xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner,
> +   XFS_IGET_INCORE, 0, &ip);
> + if (rc || !ip)
> + return rc;
> + if (!VFS_I(ip)->i_mapping)
> + goto out;
> +
> + mapping = VFS_I(ip)->i_mapping;
> + if (IS_DAX(VFS_I(ip)))
> + rc = mf_dax_mapping_kill_procs(mapping, rec->rm_offset,
> +*flags);
> + else
> + mapping_set_error(mapping, -EFSCORRUPTED);

Hm.  I don't know if EFSCORRUPTED is the right error code for corrupt
file data, since we (so far) have only used it for corrupt metadata.

> +
> + // TODO try to fix data
> +out:
> + xfs_irele(ip);
> + }
> +
> + return rc;
> +}
> +
> +static int
> +xfs_fs_corrupted_range(
> + struct super_block  *sb,
> + struct block_device *bdev,
> + loff_t  offset,
> + size_t  len,
> + void*data)
> +{
> + struct xfs_mount*mp = XFS_M(sb);
> + struct xfs_trans*tp = NULL;
> + struct xfs_btree_cur*cur = NULL;
> + struct xfs_rmap_irecrmap_low, rmap_high;
> + struct xfs_buf  *agf_bp = NULL;
> + xfs_fsblock_t   fsbno = XFS_B_TO_FSB(mp, offset);
> + 

Re: [RFC PATCH v3 0/9] fsdax: introduce fs query to support reflink

2020-12-17 Thread Darrick J. Wong
On Fri, Dec 18, 2020 at 10:44:26AM +0800, Ruan Shiyang wrote:
> 
> 
> On 2020/12/17 上午4:55, Jane Chu wrote:
> > Hi, Shiyang,
> > 
> > On 12/15/2020 4:14 AM, Shiyang Ruan wrote:
> > > The call trace is like this:
> > > memory_failure()
> > >  pgmap->ops->memory_failure()      => pmem_pgmap_memory_failure()
> > >   gendisk->fops->corrupted_range() => pmem_corrupted_range() or
> > >                                       md_blk_corrupted_range()
> > >    sb->s_ops->corrupted_range()    => xfs_fs_corrupted_range()
> > >     xfs_rmap_query_range()
> > >      xfs_corrupt_helper()
> > >       * corrupted on metadata:
> > >           try to recover data, call xfs_force_shutdown()
> > >       * corrupted on file data:
> > >           try to recover data, call mf_dax_mapping_kill_procs()
> > > 
> > > The fsdax & reflink support for XFS is not contained in this patchset.
> > > 
> > > (Rebased on v5.10)
> > 
> > So I tried the patchset with pmem error injection, the SIGBUS payload
> > does not look right -
> > 
> > ** SIGBUS(7): **
> > ** si_addr(0x(nil)), si_lsb(0xC), si_code(0x4, BUS_MCEERR_AR) **
> > 
> > I expect the payload looks like
> > 
> > ** si_addr(0x7f3672e0), si_lsb(0x15), si_code(0x4, BUS_MCEERR_AR) **
> 
> Thanks for testing.  I tested the SIGBUS by writing a program which calls
> madvise(..., MADV_HWPOISON) to inject memory failure.  It just shows that
> the program is killed by SIGBUS.  I cannot get any detail from it.  So,
> could you please show me the right way (test tools) to test it?

I'm assuming that Jane is using a program that calls sigaction to
install a SIGBUS handler, and dumps the entire siginfo_t structure
whenever it receives one...
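
For reference, a minimal sketch of such a test program, assuming the
poisoned page is touched after mmap()ing a DAX file; the dump format is
illustrative, not Jane's actual tool:

	#define _GNU_SOURCE
	#include <signal.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	static void handler(int sig, siginfo_t *si, void *ctx)
	{
		/* printf is not async-signal-safe; fine for a throwaway test */
		printf("** SIGBUS(%d): **\n", sig);
		printf("** si_addr(%p), si_lsb(0x%x), si_code(0x%x) **\n",
		       si->si_addr, (unsigned)si->si_addr_lsb,
		       (unsigned)si->si_code);
		_exit(1);
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_sigaction = handler,
			.sa_flags = SA_SIGINFO,
		};

		sigaction(SIGBUS, &sa, NULL);
		/* mmap() a file on a DAX fs and touch the poisoned page here */
		return 0;
	}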

--D

> 
> --
> Thanks,
> Ruan Shiyang.
> 
> > 
> > thanks,
> > -jane
> > 
> > 
> > 
> > 
> > 
> > 
> 
> 


Re: [RFC PATCH v2 0/6] fsdax: introduce fs query to support reflink

2020-12-15 Thread Darrick J. Wong
On Wed, Dec 16, 2020 at 10:10:22AM +1100, Dave Chinner wrote:
> On Tue, Dec 15, 2020 at 11:05:07AM -0800, Jane Chu wrote:
> > On 12/15/2020 3:58 AM, Ruan Shiyang wrote:
> > > Hi Jane
> > > 
> > > On 2020/12/15 上午4:58, Jane Chu wrote:
> > > > Hi, Shiyang,
> > > > 
> > > > On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
> > > > > This patchset is a try to resolve the problem of tracking shared page
> > > > > for fsdax.
> > > > > 
> > > > > Change from v1:
> > > > >    - Introduce ->block_lost() for block device
> > > > >    - Support mapped device
> > > > >    - Add 'not available' warning for realtime device in XFS
> > > > >    - Rebased to v5.10-rc1
> > > > > 
> > > > > This patchset moves owner tracking from dax_associate_entry() to pmem
> > > > > device, by introducing an interface ->memory_failure() of struct
> > > > > pagemap.  The interface is called by memory_failure() in mm, and
> > > > > implemented by pmem device.  Then pmem device calls its ->block_lost()
> > > > > to find the filesystem which the damaged page located in, and call
> > > > > ->storage_lost() to track files or metadata associated with this page.
> > > > > Finally we are able to try to fix the damaged data in filesystem and 
> > > > > do
> > > > 
> > > > Does that mean clearing poison? if so, would you mind to elaborate
> > > > specifically which change does that?
> > > 
> > > Recovering data for filesystem (or pmem device) has not been done in
> > > this patchset...  I just triggered the handler for the files sharing the
> > > corrupted page here.
> > 
> > Thanks! That confirms my understanding.
> > 
> > With the framework provided by the patchset, how do you envision it to
> > ease/simplify poison recovery from the user's perspective?
> 
> At the moment, I'd say no change whatsoever.  The behaviour is
> necessary so that we can kill whatever user application maps
> multiply-shared physical blocks if there's a memory error. The
> recovery method from that is unchanged. The only advantage may be
> that the filesystem (if rmap enabled) can tell you the exact file
> and offset into the file where data was corrupted.
> 
> However, it can be worse, too: it may also now completely shut down
> the filesystem if the filesystem discovers the error is in metadata
> rather than user data. That's much more complex to recover from, and
> right now will require downtime to take the filesystem offline and
> run fsck to correct the error. That may trash whatever the metadata
> that can't be recovered points to, so you still have a user data
> recovery process to perform after this...

...though in the future I'd like to bypass the default behaviors
if there's somebody watching the sb notification that will also kick off
the appropriate repair activities.  The xfs auto-repair parts are coming
along nicely.  Dunno about userspace, though I figure if we can do
userspace page faults then some people could probably do autorepair
too.

--D

> > And how does it help in dealing with page faults upon poisoned
> > dax page?
> 
> It doesn't. If the page is poisoned, the same behaviour will occur
> as does now. This is simply error reporting infrastructure, not
> error handling.
> 
> Future work might change how we correct the faults found in the
> storage, but I think the user visible behaviour is going to be "kill
> apps mapping corrupted data" for a long time yet
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com


Re: [RFC PATCH v3 8/9] md: Implement ->corrupted_range()

2020-12-15 Thread Darrick J. Wong
On Tue, Dec 15, 2020 at 08:14:13PM +0800, Shiyang Ruan wrote:
> With the support of ->rmap(), it is possible to obtain the superblock on
> a mapped device.
> 
> If a pmem device is used as one target of a mapped device, we cannot
> obtain its superblock directly.  With the help of sysfs, the mapped
> device can be found from the target devices.  So, we iterate
> bdev->bd_holder_disks to obtain its mapped device.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  drivers/md/dm.c   | 66 +++
>  drivers/nvdimm/pmem.c |  9 --
>  fs/block_dev.c| 21 ++
>  include/linux/genhd.h |  7 +
>  4 files changed, 100 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 4e0cbfe3f14d..9da1f9322735 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -507,6 +507,71 @@ static int dm_blk_report_zones(struct gendisk *disk, 
> sector_t sector,
>  #define dm_blk_report_zones  NULL
>  #endif /* CONFIG_BLK_DEV_ZONED */
>  
> +struct dm_blk_corrupt {
> + struct block_device *bdev;
> + sector_t offset;
> +};
> +
> +static int dm_blk_corrupt_fn(struct dm_target *ti, struct dm_dev *dev,
> + sector_t start, sector_t len, void *data)
> +{
> + struct dm_blk_corrupt *bc = data;
> +
> + return bc->bdev == (void *)dev->bdev &&
> + (start <= bc->offset && bc->offset < start + len);
> +}
> +
> +static int dm_blk_corrupted_range(struct gendisk *disk,
> +   struct block_device *target_bdev,
> +   loff_t target_offset, size_t len, void *data)
> +{
> + struct mapped_device *md = disk->private_data;
> + struct block_device *md_bdev = md->bdev;
> + struct dm_table *map;
> + struct dm_target *ti;
> + struct super_block *sb;
> + int srcu_idx, i, rc = 0;
> + bool found = false;
> + sector_t disk_sec, target_sec = to_sector(target_offset);
> +
> + map = dm_get_live_table(md, &srcu_idx);
> + if (!map)
> + return -ENODEV;
> +
> + for (i = 0; i < dm_table_get_num_targets(map); i++) {
> + ti = dm_table_get_target(map, i);
> + if (ti->type->iterate_devices && ti->type->rmap) {
> + struct dm_blk_corrupt bc = {target_bdev, target_sec};
> +
> + found = ti->type->iterate_devices(ti, dm_blk_corrupt_fn, &bc);
> + if (!found)
> + continue;
> + disk_sec = ti->type->rmap(ti, target_sec);

What happens if the dm device has multiple reverse mappings because the
physical storage is being shared at multiple LBAs?  (e.g. a
deduplication target)

> + break;
> + }
> + }
> +
> + if (!found) {
> + rc = -ENODEV;
> + goto out;
> + }
> +
> + sb = get_super(md_bdev);
> + if (!sb) {
> + rc = bd_disk_holder_corrupted_range(md_bdev, 
> to_bytes(disk_sec), len, data);
> + goto out;
> + } else if (sb->s_op->corrupted_range) {
> + loff_t off = to_bytes(disk_sec - get_start_sect(md_bdev));
> +
> + rc = sb->s_op->corrupted_range(sb, md_bdev, off, len, data);

This "call bd_disk_holder_corrupted_range or sb->s_op->corrupted_range"
logic appears twice; should it be refactored into a common helper?

Or, should the superblock dispatch part move to
bd_disk_holder_corrupted_range?
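
For concreteness, a rough sketch of the common helper being suggested,
under the assumption that the byte offset has already been adjusted for
the partition start; the name bd_corrupted_range() is made up here:

	static int bd_corrupted_range(struct block_device *bdev, loff_t off,
				      size_t len, void *data)
	{
		struct super_block *sb = get_super(bdev);
		int rc = -EOPNOTSUPP;

		if (!sb)
			return bd_disk_holder_corrupted_range(bdev, off, len,
							      data);
		if (sb->s_op->corrupted_range)
			rc = sb->s_op->corrupted_range(sb, bdev, off, len, data);
		drop_super(sb);
		return rc;
	}

Both dm_blk_corrupted_range() and pmem_corrupted_range() could then
collapse into a call to this helper.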

> + }
> + drop_super(sb);
> +
> +out:
> + dm_put_live_table(md, srcu_idx);
> + return rc;
> +}
> +
>  static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_idx,
>   struct block_device **bdev)
>  {
> @@ -3084,6 +3149,7 @@ static const struct block_device_operations dm_blk_dops 
> = {
>   .getgeo = dm_blk_getgeo,
>   .report_zones = dm_blk_report_zones,
>   .pr_ops = _pr_ops,
> + .corrupted_range = dm_blk_corrupted_range,
>   .owner = THIS_MODULE
>  };
>  
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 4688bff19c20..e8cfaf860149 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -267,11 +267,14 @@ static int pmem_corrupted_range(struct gendisk *disk, 
> struct block_device *bdev,
>  
>   bdev_offset = (disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT;
>   sb = get_super(bdev);
> - if (sb && sb->s_op->corrupted_range) {
> + if (!sb) {
> + rc = bd_disk_holder_corrupted_range(bdev, bdev_offset, len, 
> data);
> + goto out;
> + } else if (sb->s_op->corrupted_range)
>   rc = sb->s_op->corrupted_range(sb, bdev, bdev_offset, len, 
> data);
> - drop_super(sb);

This is out of scope for this patch(set) but do you think that the scsi
disk driver should intercept media errors from sense data and call
->corrupted_range too?  ISTR Ted muttering that one of his employers had
a patchset to do more with sense data 

Re: [RFC PATCH v3 9/9] xfs: Implement ->corrupted_range() for XFS

2020-12-15 Thread Darrick J. Wong
On Tue, Dec 15, 2020 at 08:14:14PM +0800, Shiyang Ruan wrote:
> This function is used to handle errors which may cause data loss in
> filesystem, such as memory failure in fsdax mode.
> 
> In XFS, it requires the "rmapbt" feature in order to query for files or
> metadata associated with the corrupted data.  Then we could call fs
> recovery functions to try to repair the corrupted data.  (Not
> implemented in this patchset.)
> 
> After that, the memory failure also needs to notify the processes who
> are using those files.
> 
> Only the data device is supported.  The realtime device is not supported for now.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/xfs_fsops.c | 10 +
>  fs/xfs/xfs_mount.h |  2 +
>  fs/xfs/xfs_super.c | 93 ++
>  3 files changed, 105 insertions(+)
> 
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index ef1d5bb88b93..0ec1b44bfe88 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -501,6 +501,16 @@ xfs_do_force_shutdown(
>  "Corruption of in-memory data detected.  Shutting down filesystem");
>   if (XFS_ERRLEVEL_HIGH <= xfs_error_level)
>   xfs_stack_trace();
> + } else if (flags & SHUTDOWN_CORRUPT_META) {
> + xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_CORRUPT,
> +"Corruption of on-disk metadata detected.  Shutting down filesystem");
> + if (XFS_ERRLEVEL_HIGH <= xfs_error_level)
> + xfs_stack_trace();
> + } else if (flags & SHUTDOWN_CORRUPT_DATA) {
> + xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_CORRUPT,
> +"Corruption of on-disk file data detected.  Shutting down filesystem");
> + if (XFS_ERRLEVEL_HIGH <= xfs_error_level)
> + xfs_stack_trace();
>   } else if (logerror) {
>   xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_LOGERROR,
>   "Log I/O Error Detected. Shutting down filesystem");
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index dfa429b77ee2..e36c07553486 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -274,6 +274,8 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
>  #define SHUTDOWN_LOG_IO_ERROR	0x0002	/* write attempt to the log failed */
>  #define SHUTDOWN_FORCE_UMOUNT	0x0004	/* shutdown from a forced unmount */
>  #define SHUTDOWN_CORRUPT_INCORE	0x0008	/* corrupt in-memory data structures */
> +#define SHUTDOWN_CORRUPT_META	0x0010	/* corrupt metadata on device */
> +#define SHUTDOWN_CORRUPT_DATA	0x0020	/* corrupt file data on device */

This symbol isn't used anywhere.  I don't know why we'd shut down the fs
for data loss, as we don't do that anywhere else in xfs.

>  
>  /*
>   * Flags for xfs_mountfs
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index e3e229e52512..30202de7e89d 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -35,6 +35,11 @@
>  #include "xfs_refcount_item.h"
>  #include "xfs_bmap_item.h"
>  #include "xfs_reflink.h"
> +#include "xfs_alloc.h"
> +#include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_rtalloc.h"
> +#include "xfs_bit.h"
>  
>  #include 
>  #include 
> @@ -1103,6 +1108,93 @@ xfs_fs_free_cached_objects(
>   return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
>  }
>  
> +static int
> +xfs_corrupt_helper(
> + struct xfs_btree_cur*cur,
> + struct xfs_rmap_irec*rec,
> + void*data)
> +{
> + struct xfs_inode*ip;
> + int rc = 0;

Note: we usually use the name "error", not "rc".

> + int *flags = data;
> +
> + if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner)) {

There are a few more things to check here to detect if metadata has been
lost.  The first is that any loss in the extended attribute information
is considered filesystem metadata; and the second is that loss of an
extent btree block is also metadata.

IOWs, this check should be:

if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
// TODO check and try to fix metadata
return -EFSCORRUPTED;
}

> + // TODO check and try to fix metadata
> + rc = -EFSCORRUPTED;
> + } else {
> + /*
> +  * Get files that are incore; filter out others that are not in use.
> +  */
> + rc = xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner,
> +   XFS_IGET_INCORE, 0, &ip);
> + if (rc || !ip)
> + return rc;
> + if (!VFS_I(ip)->i_mapping)
> + goto out;
> +
> + if (IS_DAX(VFS_I(ip)))
> + rc = mf_dax_mapping_kill_procs(VFS_I(ip)->i_mapping,
> +rec->rm_offset, *flags);

If 

Re: Best solution for shifting DAX_ZERO_PAGE to XA_ZERO_ENTRY

2020-11-08 Thread Darrick J. Wong
On Sun, Nov 08, 2020 at 05:15:55PM -0800, Amy Parker wrote:
> I've been writing a patch to migrate the defined DAX_ZERO_PAGE
> to XA_ZERO_ENTRY for representing holes in files.

Why?  IIRC XA_ZERO_ENTRY ("no mapping in the address space") isn't the
same as DAX_ZERO_PAGE ("the zero page is mapped into the address space
because we took a read fault on a sparse file hole").

--D

> XA_ZERO_ENTRY
> is defined in include/linux/xarray.h using xa_mk_internal(257).
> This function returns a void pointer, which is incompatible with
> the bitwise arithmetic performed on it.
> 
> Currently, DAX_ZERO_PAGE is defined as an unsigned long,
> so I considered typecasting it. Typecasting every time would be
> repetitive and inefficient. I thought about making a new definition
> for it which has the typecast, but this breaks the original point of
> using already defined terms.
> 
> Should we go the route of adding a new definition, we might as
> well just change the definition of DAX_ZERO_PAGE. This would
> break the simplicity of the current DAX bit definitions:
> 
> #define DAX_LOCKED  (1UL << 0)
> #define DAX_PMD   (1UL << 1)
> #define DAX_ZERO_PAGE  (1UL << 2)
> #define DAX_EMPTY  (1UL << 3)
> 
> Any thoughts on this, and what could be the best solution here?
> 
> Best regards,
> Amy Parker
> (they/them)


Re: [PATCH] ext4/xfs: add page refcount helper

2020-10-07 Thread Darrick J. Wong
On Tue, Oct 06, 2020 at 04:09:30PM -0700, Ralph Campbell wrote:
> There are several places where ZONE_DEVICE struct pages assume a reference
> count == 1 means the page is idle and free. Instead of open coding this,
> add a helper function to hide this detail.
> 
> Signed-off-by: Ralph Campbell 
> Reviewed-by: Christoph Hellwig 
> ---
> 
> I'm resending this as a separate patch since I think it is ready to
> merge. Originally, this was part of an RFC and is unchanged from v3:
> https://lore.kernel.org/linux-mm/20201001181715.17416-1-rcampb...@nvidia.com
> 
> It applies cleanly to linux-5.9.0-rc7-mm1 but doesn't really
> depend on anything, just simple merge conflicts when applied to
> other trees.
> I'll let the various maintainers decide which tree and when to merge.
> It isn't urgent since it is a clean up patch.
> 
>  fs/dax.c|  4 ++--
>  fs/ext4/inode.c |  5 +
>  fs/xfs/xfs_file.c   |  4 +---
>  include/linux/dax.h | 10 ++
>  4 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 5b47834f2e1b..85c63f735909 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct 
> address_space *mapping,
>   for_each_mapped_pfn(entry, pfn) {
>   struct page *page = pfn_to_page(pfn);
>  
> - WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> + WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page));
>   WARN_ON_ONCE(page->mapping && page->mapping != mapping);
>   page->mapping = NULL;
>   page->index = 0;
> @@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry)
>   for_each_mapped_pfn(entry, pfn) {
>   struct page *page = pfn_to_page(pfn);
>  
> - if (page_ref_count(page) > 1)
> + if (!dax_layout_is_idle_page(page))
>   return page;
>   }
>   return NULL;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 771ed8b1fadb..132620cbfa13 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3937,10 +3937,7 @@ int ext4_break_layouts(struct inode *inode)
>   if (!page)
>   return 0;
>  
> - error = ___wait_var_event(&page->_refcount,
> - atomic_read(&page->_refcount) == 1,
> - TASK_INTERRUPTIBLE, 0, 0,
> - ext4_wait_dax_page(ei));
> + error = dax_wait_page(ei, page, ext4_wait_dax_page);
>   } while (error == 0);
>  
>   return error;
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 3d1b95124744..a5304aaeaa3a 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -749,9 +749,7 @@ xfs_break_dax_layouts(
>   return 0;
>  
>   *retry = true;
> - return ___wait_var_event(&page->_refcount,
> - atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> -         0, 0, xfs_wait_dax_page(inode));
> + return dax_wait_page(inode, page, xfs_wait_dax_page);

I don't mind this open-coded soup getting cleaned up into a macro,
though my general opinion is that if the mm/dax developers are ok with
this then:

Acked-by: Darrick J. Wong 

--D

>  }
>  
>  int
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index b52f084aa643..8909a91cd381 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space 
> *mapping)
>   return mapping->host && IS_DAX(mapping->host);
>  }
>  
> +static inline bool dax_layout_is_idle_page(struct page *page)
> +{
> + return page_ref_count(page) == 1;
> +}
> +
> +#define dax_wait_page(_inode, _page, _wait_cb)			\
> + ___wait_var_event(&(_page)->_refcount,  \
> + dax_layout_is_idle_page(_page), \
> + TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
> +
>  #ifdef CONFIG_DEV_DAX_HMEM_DEVICES
>  void hmem_register_device(int target_nid, struct resource *r);
>  #else
> -- 
> 2.20.1
> 


Re: [PATCH v2 5/9] iomap: Support arbitrarily many blocks per page

2020-09-22 Thread Darrick J. Wong
On Wed, Sep 23, 2020 at 03:48:59AM +0100, Matthew Wilcox wrote:
> On Tue, Sep 22, 2020 at 09:06:03PM -0400, Qian Cai wrote:
> > On Tue, 2020-09-22 at 18:05 +0100, Matthew Wilcox wrote:
> > > On Tue, Sep 22, 2020 at 12:23:45PM -0400, Qian Cai wrote:
> > > > On Fri, 2020-09-11 at 00:47 +0100, Matthew Wilcox (Oracle) wrote:
> > > > > Size the uptodate array dynamically to support larger pages in the
> > > > > page cache.  With a 64kB page, we're only saving 8 bytes per page 
> > > > > today,
> > > > > but with a 2MB maximum page size, we'd have to allocate more than 4kB
> > > > > per page.  Add a few debugging assertions.
> > > > > 
> > > > > Signed-off-by: Matthew Wilcox (Oracle) 
> > > > > Reviewed-by: Dave Chinner 
> > > > 
> > > > Some syscall fuzzing will trigger this on powerpc:
> > > > 
> > > > .config: https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config
> > > > 
> > > > [ 8805.895344][T445431] WARNING: CPU: 61 PID: 445431 at fs/iomap/buffered-io.c:78 iomap_page_release+0x250/0x270
> > > 
> > > Well, I'm glad it triggered.  That warning is:
> > > WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
> > > PageUptodate(page));
> > > so there was definitely a problem of some kind.
> > > 
> > > truncate_cleanup_page() calls
> > > do_invalidatepage() calls
> > > iomap_invalidatepage() calls
> > > iomap_page_release()
> > > 
> > > Is this the first warning?  I'm wondering if maybe there was an I/O error
> > > earlier which caused PageUptodate to get cleared again.  If it's easy to
> > > reproduce, perhaps you could try something like this?
> > > 
> > > +void dump_iomap_page(struct page *page, const char *reason)
> > > +{
> > > +   struct iomap_page *iop = to_iomap_page(page);
> > > +   unsigned int nr_blocks = i_blocks_per_page(page->mapping->host, 
> > > page);
> > > +
> > > +   dump_page(page, reason);
> > > +   if (iop)
> > > +   printk("iop:reads %d writes %d uptodate %*pb\n",
> > > +   atomic_read(&iop->read_bytes_pending),
> > > +   atomic_read(&iop->write_bytes_pending),
> > > +   nr_blocks, iop->uptodate);
> > > +   else
> > > +   printk("iop:none\n");
> > > +}
> > > 
> > > and then do something like:
> > > 
> > >   if (bitmap_full(iop->uptodate, nr_blocks) != PageUptodate(page))
> > >   dump_iomap_page(page, NULL);
> > 
> > This:
> > 
> > [ 1683.158254][T164965] page:4a6c16cd refcount:2 mapcount:0 
> > mapping:ea017dc5 index:0x2 pfn:0xc365c
> > [ 1683.158311][T164965] aops:xfs_address_space_operations ino:417b7e7 
> > dentry name:"trinity-testfile2"
> > [ 1683.158354][T164965] flags: 0x7fff800015(locked|uptodate|lru)
> > [ 1683.158392][T164965] raw: 007fff800015 c00c019c4b08 
> > c00c019a53c8 c000201c8362c1e8
> > [ 1683.158430][T164965] raw: 0002  
> > 0002 c000201c54db4000
> > [ 1683.158470][T164965] page->mem_cgroup:c000201c54db4000
> > [ 1683.158506][T164965] iop:none
> 
> Oh, I'm a fool.  This is after the call to detach_page_private() so
> page->private is NULL and we don't get the iop dumped.
> 
> Nevertheless, this is interesting.  Somehow, the page is marked Uptodate,
> but the bitmap is deemed not full.  There are three places where we set
> an iomap page Uptodate:
> 
> 1.  if (bitmap_full(iop->uptodate, i_blocks_per_page(inode, page)))
> SetPageUptodate(page);
> 
> 2.  if (page_has_private(page))
> iomap_iop_set_range_uptodate(page, off, len);
> else
> SetPageUptodate(page);
> 
> 3.  BUG_ON(page->index);
> ...
> SetPageUptodate(page);
> 
> It can't be #2 because the page has an iop.  It can't be #3 because the
> page->index is not 0.  So at some point in the past, the bitmap was full.
> 
> I don't think it's possible for inode->i_blksize to change, and you
> aren't running with THPs, so it's definitely not possible for thp_size()
> to change.  So i_blocks_per_page() isn't going to change.
> 
> We seem to have allocated enough memory for ->iop because that's also
> based on i_blocks_per_page().
> 
> I'm out of ideas.  Maybe I'll wake up with a better idea in the morning.
> I've been trying to reproduce this on x86 with a 1kB block size
> filesystem, and haven't been able to yet.  Maybe I'll try to setup a
> powerpc cross-compilation environment tomorrow.

FWIW I managed to reproduce it with the following fstests configuration
on a 1k block size fs on an x86 machine:

SECTION          -- -no-sections-
FSTYP            -- xfs
MKFS_OPTIONS     -- -m reflink=1,rmapbt=1 -i sparse=1 -b size=1024
MOUNT_OPTIONS    -- -o usrquota,grpquota,prjquota
HOST_OPTIONS     -- local.config
CHECK_OPTIONS    -- -g auto
XFS_MKFS_OPTIONS -- -bsize=4096
TIME_FACTOR      -- 1
LOAD_FACTOR      -- 1
TEST_DIR         -- /mnt
TEST_DEV         -- /dev/sde
SCRATCH_DEV      -- /dev/sdd
SCRATCH_MNT      -- /opt
OVL_UPPER        -- 

Re: [dm-devel] [PATCH v2] dm: Call proper helper to determine dax support

2020-09-18 Thread Darrick J. Wong
On Thu, Sep 17, 2020 at 10:30:03PM -0700, Dan Williams wrote:
> From: Jan Kara 
> 
> DM was calling generic_fsdax_supported() to determine whether a device
> referenced in the DM table supports DAX.  However, this is a helper for
> "leaf" device drivers so that they don't have to duplicate common
> generic checks.  High level code should call the dax_supported() helper,
> which calls into the appropriate helper for the particular device.  This
> problem manifested itself as
> kernel messages:
> 
> dm-3: error: dax access failed (-95)
> 
> when lvm2-testsuite run in cases where a DM device was stacked on top of
> another DM device.

Is there somewhere where it is documented which of:

bdev_dax_supported, generic_fsdax_supported, and dax_supported

one is supposed to use for a given circumstance?

I guess the last two can test a given range w/ blocksize; the first one
only does blocksize; and the middle one also checks with whatever fs
might be mounted? 

(I ask because it took me a while to figure out how to revert correctly
the brokenness in rc3-5 that broke my nightly dax fstesting.)

--D

> 
> Fixes: 7bf7eac8d648 ("dax: Arrange for dax_supported check to span multiple 
> devices")
> Cc: 
> Tested-by: Adrian Huang 
> Signed-off-by: Jan Kara 
> Acked-by: Mike Snitzer 
> Signed-off-by: Dan Williams 
> ---
> Changes since v1 [1]:
> - Add missing dax_read_lock() around dax_supported()
> 
> [1]: http://lore.kernel.org/r/20200916151445.450-1-j...@suse.cz
> 
>  drivers/dax/super.c   |4 
>  drivers/md/dm-table.c |   10 +++---
>  include/linux/dax.h   |   11 +--
>  3 files changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index e5767c83ea23..b6284c5cae0a 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -325,11 +325,15 @@ EXPORT_SYMBOL_GPL(dax_direct_access);
>  bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
>   int blocksize, sector_t start, sector_t len)
>  {
> + if (!dax_dev)
> + return false;
> +
>   if (!dax_alive(dax_dev))
>   return false;
>  
>   return dax_dev->ops->dax_supported(dax_dev, bdev, blocksize, start, 
> len);
>  }
> +EXPORT_SYMBOL_GPL(dax_supported);
>  
>  size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void 
> *addr,
>   size_t bytes, struct iov_iter *i)
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 5edc3079e7c1..229f461e7def 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -860,10 +860,14 @@ EXPORT_SYMBOL_GPL(dm_table_set_type);
>  int device_supports_dax(struct dm_target *ti, struct dm_dev *dev,
>   sector_t start, sector_t len, void *data)
>  {
> - int blocksize = *(int *) data;
> + int blocksize = *(int *) data, id;
> + bool rc;
>  
> - return generic_fsdax_supported(dev->dax_dev, dev->bdev, blocksize,
> -start, len);
> + id = dax_read_lock();
> + rc = dax_supported(dev->dax_dev, dev->bdev, blocksize, start, len);
> + dax_read_unlock(id);
> +
> + return rc;
>  }
>  
>  /* Check devices support synchronous DAX */
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 6904d4e0b2e0..9f916326814a 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -130,6 +130,8 @@ static inline bool generic_fsdax_supported(struct 
> dax_device *dax_dev,
>   return __generic_fsdax_supported(dax_dev, bdev, blocksize, start,
>   sectors);
>  }
> +bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
> + int blocksize, sector_t start, sector_t len);
>  
>  static inline void fs_put_dax(struct dax_device *dax_dev)
>  {
> @@ -157,6 +159,13 @@ static inline bool generic_fsdax_supported(struct 
> dax_device *dax_dev,
>   return false;
>  }
>  
> +static inline bool dax_supported(struct dax_device *dax_dev,
> + struct block_device *bdev, int blocksize, sector_t start,
> + sector_t len)
> +{
> + return false;
> +}
> +
>  static inline void fs_put_dax(struct dax_device *dax_dev)
>  {
>  }
> @@ -195,8 +204,6 @@ bool dax_alive(struct dax_device *dax_dev);
>  void *dax_get_private(struct dax_device *dax_dev);
>  long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long 
> nr_pages,
>   void **kaddr, pfn_t *pfn);
> -bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
> - int blocksize, sector_t start, sector_t len);
>  size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void 
> *addr,
>   size_t bytes, struct iov_iter *i);
>  size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void 
> *addr,
> 
> --
> dm-devel mailing list
> dm-de...@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 

Re: [PATCH v2 9/9] iomap: Change calling convention for zeroing

2020-09-17 Thread Darrick J. Wong
On Thu, Sep 17, 2020 at 11:11:15PM +0100, Matthew Wilcox wrote:
> On Thu, Sep 17, 2020 at 03:05:00PM -0700, Darrick J. Wong wrote:
> > > -static loff_t
> > > -iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
> > > - void *data, struct iomap *iomap, struct iomap *srcmap)
> > > +static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
> > > + loff_t length, void *data, struct iomap *iomap,
> > 
> > Any reason not to change @length and the return value to s64?
> 
> Because it's an actor, passed to iomap_apply, so its types have to match.
> I can change that, but it'll be a separate patch series.

Ah, right.  I seemingly forgot that. :(

Carry on.
Reviewed-by: Darrick J. Wong 

--D


Re: [PATCH v2 9/9] iomap: Change calling convention for zeroing

2020-09-17 Thread Darrick J. Wong
On Fri, Sep 11, 2020 at 12:47:07AM +0100, Matthew Wilcox (Oracle) wrote:
> Pass the full length to iomap_zero() and dax_iomap_zero(), and have
> them return how many bytes they actually handled.  This is preparatory
> work for handling THP, although it looks like DAX could actually take
> advantage of it if there's a larger contiguous area.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/dax.c   | 13 ++---
>  fs/iomap/buffered-io.c | 33 +++--
>  include/linux/dax.h|  3 +--
>  3 files changed, 22 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 994ab66a9907..6ad346352a8c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1037,18 +1037,18 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
>   return ret;
>  }
>  
> -int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
> -struct iomap *iomap)
> +s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)
>  {
>   sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
>   pgoff_t pgoff;
>   long rc, id;
>   void *kaddr;
>   bool page_aligned = false;
> -
> + unsigned offset = offset_in_page(pos);
> + unsigned size = min_t(u64, PAGE_SIZE - offset, length);
>  
>   if (IS_ALIGNED(sector << SECTOR_SHIFT, PAGE_SIZE) &&
> - IS_ALIGNED(size, PAGE_SIZE))
> + (size == PAGE_SIZE))
>   page_aligned = true;
>  
>   rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, &pgoff);
> @@ -1058,8 +1058,7 @@ int dax_iomap_zero(loff_t pos, unsigned offset, 
> unsigned size,
>   id = dax_read_lock();
>  
>   if (page_aligned)
> - rc = dax_zero_page_range(iomap->dax_dev, pgoff,
> -  size >> PAGE_SHIFT);
> + rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
>   else
>   rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
>   if (rc < 0) {
> @@ -1072,7 +1071,7 @@ int dax_iomap_zero(loff_t pos, unsigned offset, 
> unsigned size,
>   dax_flush(iomap->dax_dev, kaddr + offset, size);
>   }
>   dax_read_unlock(id);
> - return 0;
> + return size;
>  }
>  
>  static loff_t
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index cb25a7b70401..3e1eb40a73fd 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -898,11 +898,13 @@ iomap_file_unshare(struct inode *inode, loff_t pos, 
> loff_t len,
>  }
>  EXPORT_SYMBOL_GPL(iomap_file_unshare);
>  
> -static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
> - unsigned bytes, struct iomap *iomap, struct iomap *srcmap)
> +static s64 iomap_zero(struct inode *inode, loff_t pos, u64 length,
> + struct iomap *iomap, struct iomap *srcmap)
>  {
>   struct page *page;
>   int status;
> + unsigned offset = offset_in_page(pos);
> + unsigned bytes = min_t(u64, PAGE_SIZE - offset, length);
>  
>   status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap, srcmap);
>   if (status)
> @@ -914,38 +916,33 @@ static int iomap_zero(struct inode *inode, loff_t pos, 
> unsigned offset,
>   return iomap_write_end(inode, pos, bytes, bytes, page, iomap, srcmap);
>  }
>  
> -static loff_t
> -iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
> - void *data, struct iomap *iomap, struct iomap *srcmap)
> +static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
> + loff_t length, void *data, struct iomap *iomap,

Any reason not to change @length and the return value to s64?

--D

> + struct iomap *srcmap)
>  {
>   bool *did_zero = data;
>   loff_t written = 0;
> - int status;
>  
>   /* already zeroed?  we're done. */
>   if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> - return count;
> + return length;
>  
>   do {
> - unsigned offset, bytes;
> -
> - offset = offset_in_page(pos);
> - bytes = min_t(loff_t, PAGE_SIZE - offset, count);
> + s64 bytes;
>  
>   if (IS_DAX(inode))
> - status = dax_iomap_zero(pos, offset, bytes, iomap);
> + bytes = dax_iomap_zero(pos, length, iomap);
>   else
> - status = iomap_zero(inode, pos, offset, bytes, iomap,
> - srcmap);
> - if (status < 0)
> - return status;
> + bytes = iomap_zero(inode, pos, length, iomap, srcmap);
> + if (bytes < 0)
> + return bytes;
>  
>   pos += bytes;
> - count -= bytes;
> + length -= bytes;
>   written += bytes;
>   if (did_zero)
>   *did_zero = true;
> - } while (count > 0);
> + } while (length > 0);
>  
>   return written;
>  }
> diff --git 

Re: [PATCH v2 8/9] iomap: Convert iomap_write_end types

2020-09-17 Thread Darrick J. Wong
On Fri, Sep 11, 2020 at 12:47:06AM +0100, Matthew Wilcox (Oracle) wrote:
> iomap_write_end cannot return an error, so switch it to return
> size_t instead of int and remove the error checking from the callers.
> Also convert the arguments to size_t from unsigned int, in case anyone
> ever wants to support a page size larger than 2GB.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 

LGTM
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 31 ---
>  1 file changed, 12 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 64a5cb383f30..cb25a7b70401 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -663,9 +663,8 @@ iomap_set_page_dirty(struct page *page)
>  }
>  EXPORT_SYMBOL_GPL(iomap_set_page_dirty);
>  
> -static int
> -__iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
> - unsigned copied, struct page *page)
> +static size_t __iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> + size_t copied, struct page *page)
>  {
>   flush_dcache_page(page);
>  
> @@ -687,9 +686,8 @@ __iomap_write_end(struct inode *inode, loff_t pos, 
> unsigned len,
>   return copied;
>  }
>  
> -static int
> -iomap_write_end_inline(struct inode *inode, struct page *page,
> - struct iomap *iomap, loff_t pos, unsigned copied)
> +static size_t iomap_write_end_inline(struct inode *inode, struct page *page,
> + struct iomap *iomap, loff_t pos, size_t copied)
>  {
>   void *addr;
>  
> @@ -705,13 +703,14 @@ iomap_write_end_inline(struct inode *inode, struct page 
> *page,
>   return copied;
>  }
>  
> -static int
> -iomap_write_end(struct inode *inode, loff_t pos, unsigned len, unsigned 
> copied,
> - struct page *page, struct iomap *iomap, struct iomap *srcmap)
> +/* Returns the number of bytes copied.  May be 0.  Cannot be an errno. */
> +static size_t iomap_write_end(struct inode *inode, loff_t pos, size_t len,
> + size_t copied, struct page *page, struct iomap *iomap,
> + struct iomap *srcmap)
>  {
>   const struct iomap_page_ops *page_ops = iomap->page_ops;
>   loff_t old_size = inode->i_size;
> - int ret;
> + size_t ret;
>  
>   if (srcmap->type == IOMAP_INLINE) {
>   ret = iomap_write_end_inline(inode, page, iomap, pos, copied);
> @@ -790,11 +789,8 @@ iomap_write_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>  
>   copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
>  
> - status = iomap_write_end(inode, pos, bytes, copied, page, iomap,
> + copied = iomap_write_end(inode, pos, bytes, copied, page, iomap,
>   srcmap);
> - if (unlikely(status < 0))
> - break;
> - copied = status;
>  
>   cond_resched();
>  
> @@ -868,11 +864,8 @@ iomap_unshare_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>  
>   status = iomap_write_end(inode, pos, bytes, bytes, page, iomap,
>   srcmap);
> - if (unlikely(status <= 0)) {
> - if (WARN_ON_ONCE(status == 0))
> - return -EIO;
> - return status;
> - }
> + if (WARN_ON_ONCE(status == 0))
> + return -EIO;
>  
>   cond_resched();
>  
> -- 
> 2.28.0
> 


Re: [PATCH v2 7/9] iomap: Convert write_count to write_bytes_pending

2020-09-17 Thread Darrick J. Wong
On Fri, Sep 11, 2020 at 12:47:05AM +0100, Matthew Wilcox (Oracle) wrote:
> Instead of counting bio segments, count the number of bytes submitted.
> This insulates us from the block layer's definition of what a 'same page'
> is, which is not necessarily clear once THPs are involved.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 

Looks ok,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 19 ++-
>  1 file changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 1cf976a8e55c..64a5cb383f30 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -27,7 +27,7 @@
>   */
>  struct iomap_page {
>   atomic_t		read_bytes_pending;
> - atomic_t		write_count;
> + atomic_t		write_bytes_pending;
>   spinlock_t  uptodate_lock;
>   unsigned long   uptodate[];
>  };
> @@ -73,7 +73,7 @@ iomap_page_release(struct page *page)
>   if (!iop)
>   return;
>   WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
> - WARN_ON_ONCE(atomic_read(&iop->write_count));
> + WARN_ON_ONCE(atomic_read(&iop->write_bytes_pending));
>   WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
>   PageUptodate(page));
>   kfree(iop);
> @@ -1047,7 +1047,7 @@ EXPORT_SYMBOL_GPL(iomap_page_mkwrite);
>  
>  static void
>  iomap_finish_page_writeback(struct inode *inode, struct page *page,
> - int error)
> + int error, unsigned int len)
>  {
>   struct iomap_page *iop = to_iomap_page(page);
>  
> @@ -1057,9 +1057,9 @@ iomap_finish_page_writeback(struct inode *inode, struct 
> page *page,
>   }
>  
>   WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
> - WARN_ON_ONCE(iop && atomic_read(&iop->write_count) <= 0);
> + WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) <= 0);
>  
> - if (!iop || atomic_dec_and_test(&iop->write_count))
> + if (!iop || atomic_sub_and_test(len, &iop->write_bytes_pending))
>   end_page_writeback(page);
>  }
>  
> @@ -1093,7 +1093,8 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
>  
>   /* walk each page on bio, ending page IO on them */
>   bio_for_each_segment_all(bv, bio, iter_all)
> - iomap_finish_page_writeback(inode, bv->bv_page, error);
> + iomap_finish_page_writeback(inode, bv->bv_page, error,
> + bv->bv_len);
>   bio_put(bio);
>   }
>   /* The ioend has been freed by bio_put() */
> @@ -1309,8 +1310,8 @@ iomap_add_to_ioend(struct inode *inode, loff_t offset, 
> struct page *page,
>  
>   merged = __bio_try_merge_page(wpc->ioend->io_bio, page, len, poff,
>   &same_page);
> - if (iop && !same_page)
> - atomic_inc(&iop->write_count);
> + if (iop)
> + atomic_add(len, &iop->write_bytes_pending);
>  
>   if (!merged) {
>   if (bio_full(wpc->ioend->io_bio, len)) {
> @@ -1353,7 +1354,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>   LIST_HEAD(submit_list);
>  
>   WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
> - WARN_ON_ONCE(iop && atomic_read(&iop->write_count) != 0);
> + WARN_ON_ONCE(iop && atomic_read(&iop->write_bytes_pending) != 0);
>  
>   /*
>* Walk through the page to find areas to write back. If we run off the
> -- 
> 2.28.0
> 


Re: [PATCH v2 6/9] iomap: Convert read_count to read_bytes_pending

2020-09-17 Thread Darrick J. Wong
On Fri, Sep 11, 2020 at 12:47:04AM +0100, Matthew Wilcox (Oracle) wrote:
> Instead of counting bio segments, count the number of bytes submitted.
> This insulates us from the block layer's definition of what a 'same page'
> is, which is not necessarily clear once THPs are involved.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/iomap/buffered-io.c | 41 -
>  1 file changed, 12 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 9670c096b83e..1cf976a8e55c 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -26,7 +26,7 @@
>   * to track sub-page uptodate status and I/O completions.
>   */
>  struct iomap_page {
> - atomic_t		read_count;
> + atomic_t		read_bytes_pending;
>   atomic_t		write_count;
>   spinlock_t  uptodate_lock;
>   unsigned long   uptodate[];
> @@ -72,7 +72,7 @@ iomap_page_release(struct page *page)
>  
>   if (!iop)
>   return;
> - WARN_ON_ONCE(atomic_read(&iop->read_count));
> + WARN_ON_ONCE(atomic_read(&iop->read_bytes_pending));
>   WARN_ON_ONCE(atomic_read(&iop->write_count));
>   WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
>   PageUptodate(page));
> @@ -167,13 +167,6 @@ iomap_set_range_uptodate(struct page *page, unsigned 
> off, unsigned len)
>   SetPageUptodate(page);
>  }
>  
> -static void
> -iomap_read_finish(struct iomap_page *iop, struct page *page)
> -{
> - if (!iop || atomic_dec_and_test(&iop->read_count))
> - unlock_page(page);
> -}
> -
>  static void
>  iomap_read_page_end_io(struct bio_vec *bvec, int error)
>  {
> @@ -187,7 +180,8 @@ iomap_read_page_end_io(struct bio_vec *bvec, int error)
>   iomap_set_range_uptodate(page, bvec->bv_offset, bvec->bv_len);
>   }
>  
> - iomap_read_finish(iop, page);
> + if (!iop || atomic_sub_and_test(bvec->bv_len, &iop->read_bytes_pending))
> + unlock_page(page);
>  }
>  
>  static void
> @@ -267,30 +261,19 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   }
>  
>   ctx->cur_page_in_bio = true;
> + if (iop)
> + atomic_add(plen, &iop->read_bytes_pending);
>  
> - /*
> -  * Try to merge into a previous segment if we can.
> -  */
> + /* Try to merge into a previous segment if we can */
>   sector = iomap_sector(iomap, pos);
> - if (ctx->bio && bio_end_sector(ctx->bio) == sector)
> + if (ctx->bio && bio_end_sector(ctx->bio) == sector) {
> + if (__bio_try_merge_page(ctx->bio, page, plen, poff,
> + &same_page))
> + goto done;
>   is_contig = true;
> -
> - if (is_contig &&
> - __bio_try_merge_page(ctx->bio, page, plen, poff, &same_page)) {
> - if (!same_page && iop)
> - atomic_inc(&iop->read_count);
> - goto done;
>   }
>  
> - /*
> -  * If we start a new segment we need to increase the read count, and we
> -  * need to do so before submitting any previous full bio to make sure
> -  * that we don't prematurely unlock the page.
> -  */
> - if (iop)
> - atomic_inc(&iop->read_count);
> -
> - if (!ctx->bio || !is_contig || bio_full(ctx->bio, plen)) {
> + if (!is_contig || bio_full(ctx->bio, plen)) {
>   gfp_t gfp = mapping_gfp_constraint(page->mapping, GFP_KERNEL);
>   gfp_t orig_gfp = gfp;
>   int nr_vecs = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
> -- 
> 2.28.0
> 


Re: [PATCH v2 5/9] iomap: Support arbitrarily many blocks per page

2020-09-17 Thread Darrick J. Wong
On Fri, Sep 11, 2020 at 12:47:03AM +0100, Matthew Wilcox (Oracle) wrote:
> Size the uptodate array dynamically to support larger pages in the
> page cache.  With a 64kB page, we're only saving 8 bytes per page today,
> but with a 2MB maximum page size, we'd have to allocate more than 4kB
> per page.  Add a few debugging assertions.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Dave Chinner 

Looks ok,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 22 +-
>  1 file changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 7fc0e02d27b0..9670c096b83e 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -22,18 +22,25 @@
>  #include "../internal.h"
>  
>  /*
> - * Structure allocated for each page when block size < PAGE_SIZE to track
> - * sub-page uptodate status and I/O completions.
> + * Structure allocated for each page or THP when block size < page size
> + * to track sub-page uptodate status and I/O completions.
>   */
>  struct iomap_page {
>   atomic_t		read_count;
>   atomic_t		write_count;
>   spinlock_t  uptodate_lock;
> - DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
> + unsigned long   uptodate[];
>  };
>  
>  static inline struct iomap_page *to_iomap_page(struct page *page)
>  {
> + /*
> +  * per-block data is stored in the head page.  Callers should
> +  * not be dealing with tail pages (and if they are, they can
> +  * call thp_head() first).
> +  */
> + VM_BUG_ON_PGFLAGS(PageTail(page), page);
> +
>   if (page_has_private(page))
>   return (struct iomap_page *)page_private(page);
>   return NULL;
> @@ -45,11 +52,13 @@ static struct iomap_page *
>  iomap_page_create(struct inode *inode, struct page *page)
>  {
>   struct iomap_page *iop = to_iomap_page(page);
> + unsigned int nr_blocks = i_blocks_per_page(inode, page);
>  
> - if (iop || i_blocks_per_page(inode, page) <= 1)
> + if (iop || nr_blocks <= 1)
>   return iop;
>  
> - iop = kzalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
> + iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
> + GFP_NOFS | __GFP_NOFAIL);
>   spin_lock_init(&iop->uptodate_lock);
>   attach_page_private(page, iop);
>   return iop;
> @@ -59,11 +68,14 @@ static void
>  iomap_page_release(struct page *page)
>  {
>   struct iomap_page *iop = detach_page_private(page);
> + unsigned int nr_blocks = i_blocks_per_page(page->mapping->host, page);
>  
>   if (!iop)
>   return;
>   WARN_ON_ONCE(atomic_read(&iop->read_count));
>   WARN_ON_ONCE(atomic_read(&iop->write_count));
> + WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
> + PageUptodate(page));
>   kfree(iop);
>  }
>  
> -- 
> 2.28.0
> 


Re: [RFC PATCH 2/4] pagemap: introduce ->memory_failure()

2020-09-15 Thread Darrick J. Wong
On Tue, Sep 15, 2020 at 06:13:09PM +0800, Shiyang Ruan wrote:
> When memory-failure occurs, we call this function, which is implemented
> by each device.  For fsdax, the pmem device implements it.  The pmem
> device will find out the block device where the error page is located,
> get the filesystem on this block device, and finally call
> ->storage_lost() to handle the error in the filesystem layer.
> 
> Normally, a pmem device may contain one or more partitions, each
> partition contains a block device, each block device contains a
> filesystem.  So we are able to find out the filesystem by one offset on
> this pmem device.  However, in other cases, such as a mapped device, I
> didn't find a way to obtain the filesystem lying on it.  It is a
> problem that needs to be fixed.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  block/genhd.c| 12 
>  drivers/nvdimm/pmem.c| 31 +++
>  include/linux/genhd.h|  2 ++
>  include/linux/memremap.h |  3 +++
>  4 files changed, 48 insertions(+)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 99c64641c314..e7442b60683e 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -1063,6 +1063,18 @@ struct block_device *bdget_disk(struct gendisk *disk, 
> int partno)
>  }
>  EXPORT_SYMBOL(bdget_disk);
>  
> +struct block_device *bdget_disk_sector(struct gendisk *disk, sector_t sector)
> +{
> + struct block_device *bdev = NULL;
> + struct hd_struct *part = disk_map_sector_rcu(disk, sector);
> +
> + if (part)
> + bdev = bdget(part_devt(part));
> +
> + return bdev;
> +}
> +EXPORT_SYMBOL(bdget_disk_sector);
> +
>  /*
>   * print a full list of all partitions - intended for places where the root
>   * filesystem can't be mounted and thus to give the victim some idea of what
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index fab29b514372..3ed96486c883 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -364,9 +364,40 @@ static void pmem_release_disk(void *__pmem)
>   put_disk(pmem->disk);
>  }
>  
> +static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
> + struct mf_recover_controller *mfrc)
> +{
> + struct pmem_device *pdev;
> + struct block_device *bdev;
> + sector_t disk_sector;
> + loff_t bdev_offset;
> +
> + pdev = container_of(pgmap, struct pmem_device, pgmap);
> + if (!pdev->disk)
> + return -ENXIO;
> +
> + disk_sector = (PFN_PHYS(mfrc->pfn) - pdev->phys_addr) >> SECTOR_SHIFT;

Ah, I see, looking at the current x86 MCE code, the MCE handler gets a
physical address, which is then rounded down to a PFN, which is then
blown back up into a byte address(?) and then rounded down to sectors.
That is then blown back up into a byte address and passed on to XFS,
which rounds it down to fs blocksize.

/me wishes that wasn't so convoluted, but reforming the whole mm poison
system to have smaller blast radii isn't the purpose of this patch. :)
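
To make the round trip concrete, here is the chain with made-up numbers
(PAGE_SHIFT = 12, SECTOR_SHIFT = 9; the addresses are invented purely
for illustration):

	/*
	 * poisoned phys addr: 0x200001234, pmem phys_addr: 0x200000000
	 *
	 * pfn       = 0x200001234 >> PAGE_SHIFT         = 0x200001
	 * byte addr = PFN_PHYS(pfn)                     = 0x200001000
	 * sector    = (0x200001000 - 0x200000000) >> 9  = 8
	 * byte off  = 8 << SECTOR_SHIFT                 = 0x1000
	 *
	 * so a 1-byte error has grown to a page, then to a sector-
	 * aligned byte offset, before the fs rounds it to a block.
	 */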

> + bdev = bdget_disk_sector(pdev->disk, disk_sector);
> + if (!bdev)
> + return -ENXIO;
> +
> + // TODO what if block device contains a mapped device

Find its dev_pagemap_ops and invoke its memory_failure function? ;)

> + if (!bdev->bd_super)
> + goto out;
> +
> + bdev_offset = ((disk_sector - get_start_sect(bdev)) << SECTOR_SHIFT) -
> + pdev->data_offset;
> + bdev->bd_super->s_op->storage_lost(bdev->bd_super, bdev_offset, mfrc);

->storage_lost is required for all filesystems?

--D

> +
> +out:
> + bdput(bdev);
> + return 0;
> +}
> +
>  static const struct dev_pagemap_ops fsdax_pagemap_ops = {
>   .kill   = pmem_pagemap_kill,
>   .cleanup= pmem_pagemap_cleanup,
> + .memory_failure = pmem_pagemap_memory_failure,
>  };
>  
>  static int pmem_attach_disk(struct device *dev,
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index 4ab853461dff..16e9e13e0841 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -303,6 +303,8 @@ static inline void add_disk_no_queue_reg(struct gendisk 
> *disk)
>  extern void del_gendisk(struct gendisk *gp);
>  extern struct gendisk *get_gendisk(dev_t dev, int *partno);
>  extern struct block_device *bdget_disk(struct gendisk *disk, int partno);
> +extern struct block_device *bdget_disk_sector(struct gendisk *disk,
> + sector_t sector);
>  
>  extern void set_device_ro(struct block_device *bdev, int flag);
>  extern void set_disk_ro(struct gendisk *disk, int flag);
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 5f5b2df06e61..efebefa70d00 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -6,6 +6,7 @@
>  
>  struct resource;
>  struct device;
> +struct mf_recover_controller;
>  
>  /**
>   * struct vmem_altmap - pre-allocated storage for vmemmap_populate
> @@ -87,6 +88,8 @@ struct dev_pagemap_ops {
>* the 

Re: [RFC PATCH 1/4] fs: introduce ->storage_lost() for memory-failure

2020-09-15 Thread Darrick J. Wong
On Tue, Sep 15, 2020 at 06:13:08PM +0800, Shiyang Ruan wrote:
> This function is used to handle errors which may cause data loss in
> filesystem, such as memory failure in fsdax mode.
> 
> In XFS, it requires the "rmapbt" feature in order to query for files or
> metadata associated with the error block.  Then we could call fs
> recovery functions to try to repair the damaged data.  (Not implemented
> in this patch.)
> 
> After that, the memory-failure also needs to kill processes who are
> using those files.  The struct mf_recover_controller is created to store
> necessary parameters.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/xfs_super.c | 80 ++
>  include/linux/fs.h |  1 +
>  include/linux/mm.h |  6 
>  3 files changed, 87 insertions(+)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 71ac6c1cdc36..118d9c5d9e1e 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -35,6 +35,10 @@
>  #include "xfs_refcount_item.h"
>  #include "xfs_bmap_item.h"
>  #include "xfs_reflink.h"
> +#include "xfs_alloc.h"
> +#include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_bit.h"
>  
>  #include 
>  #include 
> @@ -1104,6 +1108,81 @@ xfs_fs_free_cached_objects(
>   return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
>  }
>  
> +static int
> +xfs_storage_lost_helper(
> + struct xfs_btree_cur*cur,
> + struct xfs_rmap_irec*rec,
> + void*priv)
> +{
> + struct xfs_inode*ip;
> + struct mf_recover_controller*mfrc = priv;
> + int rc = 0;
> +
> + if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner)) {
> + // TODO check and try to fix metadata
> + } else {
> + /*
> +  * Get files that incore, filter out others that are not in use.
> +  */
> + xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner, XFS_IGET_INCORE,
> +  0, &ip);

Missing return value check here.
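
i.e. something like (untested):

        rc = xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner,
                      XFS_IGET_INCORE, 0, &ip);
        if (rc)
                return rc;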

> + if (!ip)
> + return 0;
> + if (!VFS_I(ip)->i_mapping)
> + goto out;
> +
> + rc = mfrc->recover_fn(mfrc->pfn, mfrc->flags,
> +   VFS_I(ip)->i_mapping, rec->rm_offset);
> +
> + // TODO try to fix data
> +out:
> + xfs_irele(ip);
> + }
> +
> + return rc;
> +}
> +
> +static int
> +xfs_fs_storage_lost(
> + struct super_block  *sb,
> + loff_t  offset,

offset into which device?  XFS supports three...

I'm also a little surprised you don't pass in a length.

I guess that means this function will get called repeatedly for every
byte in the poisoned range?
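
IOWs, I'd have expected the prototype to look more like this (sketch;
the extra parameters are my invention):

        int (*storage_lost)(struct super_block *sb,
                            struct block_device *bdev,
                            loff_t offset, size_t len, void *priv);

so that the filesystem can tell its data/log/rt devices apart and
process the whole poisoned range in one call.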

> + void *priv)
> +{
> + struct xfs_mount *mp = XFS_M(sb);
> + struct xfs_trans *tp = NULL;
> + struct xfs_btree_cur *cur = NULL;
> + struct xfs_rmap_irec rmap_low, rmap_high;
> + struct xfs_buf  *agf_bp = NULL;
> + xfs_fsblock_t   fsbno = XFS_B_TO_FSB(mp, offset);
> + xfs_agnumber_t  agno = XFS_FSB_TO_AGNO(mp, fsbno);
> + xfs_agblock_t   agbno = XFS_FSB_TO_AGBNO(mp, fsbno);
> + int error = 0;
> +
> + error = xfs_trans_alloc_empty(mp, &tp);
> + if (error)
> + return error;
> +
> + error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
> + if (error)
> + return error;
> +
> + cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agno);

...and this is definitely the wrong call sequence if the malfunctioning
device is the realtime device.  If a dax rt device dies, you'll be
shooting down files on the data device, which will cause all sorts of
problems.

Question: Should all this poison recovery stuff go into a new file?
xfs_poison.c?  There's already a lot of code in xfs_super.c.

--D

> +
> + /* Construct a range for rmap query */
> + memset(&rmap_low, 0, sizeof(rmap_low));
> + memset(&rmap_high, 0xFF, sizeof(rmap_high));
> + rmap_low.rm_startblock = rmap_high.rm_startblock = agbno;
> +
> + error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
> +  xfs_storage_lost_helper, priv);
> + if (error == -ECANCELED)
> + error = 0;
> +
> + xfs_btree_del_cursor(cur, error);
> + xfs_trans_brelse(tp, agf_bp);
> + return error;
> +}
> +
>  static const struct super_operations xfs_super_operations = {
>   .alloc_inode= xfs_fs_alloc_inode,
>   .destroy_inode  = xfs_fs_destroy_inode,
> @@ -1117,6 +1196,7 @@ static const struct super_operations 
> xfs_super_operations = {
>   .show_options   = xfs_fs_show_options,
>   .nr_cached_objects  = xfs_fs_nr_cached_objects,
>   .free_cached_objects= xfs_fs_free_cached_objects,
> + .storage_lost   = xfs_fs_storage_lost,
>  };
>  
>  static int
> diff --git 

Re: [PATCH 2/2] xfs: don't update mtime on COW faults

2020-09-10 Thread Darrick J. Wong
On Sat, Sep 05, 2020 at 01:02:33PM -0400, Mikulas Patocka wrote:
> 
> 
> On Sat, 5 Sep 2020, Darrick J. Wong wrote:
> 
> > On Sat, Sep 05, 2020 at 08:13:02AM -0400, Mikulas Patocka wrote:
> > > When running in a dax mode, if the user maps a page with MAP_PRIVATE and
> > > PROT_WRITE, the xfs filesystem would incorrectly update ctime and mtime
> > > when the user hits a COW fault.
> > > 
> > > This breaks building of the Linux kernel.
> > > How to reproduce:
> > > 1. extract the Linux kernel tree on dax-mounted xfs filesystem
> > > 2. run make clean
> > > 3. run make -j12
> > > 4. run make -j12
> > > - at step 4, make would incorrectly rebuild the whole kernel (although it
> > >   was already built in step 3).
> > > 
> > > The reason for the breakage is that almost all object files depend on
> > > objtool. When we run objtool, it takes COW page fault on its .data
> > > section, and these faults will incorrectly update the timestamp of the
> > > objtool binary. The updated timestamp causes make to rebuild the whole
> > > tree.
> > > 
> > > Signed-off-by: Mikulas Patocka 
> > > Cc: sta...@vger.kernel.org
> > > 
> > > ---
> > >  fs/xfs/xfs_file.c |   11 +--
> > >  1 file changed, 9 insertions(+), 2 deletions(-)
> > > 
> > > Index: linux-2.6/fs/xfs/xfs_file.c
> > > ===
> > > --- linux-2.6.orig/fs/xfs/xfs_file.c  2020-09-05 10:01:42.0 
> > > +0200
> > > +++ linux-2.6/fs/xfs/xfs_file.c   2020-09-05 13:59:12.0 +0200
> > > @@ -1223,6 +1223,13 @@ __xfs_filemap_fault(
> > >   return ret;
> > >  }
> > >  
> > > +static bool
> > > +xfs_is_write_fault(
> > 
> > Call this xfs_is_shared_dax_write_fault, and throw in the IS_DAX() test?
> > 
> > You might as well make it a static inline.
> 
> Yes, it is possible. I'll send a second version.
> 
> > > + struct vm_fault *vmf)
> > > +{
> > > + return vmf->flags & FAULT_FLAG_WRITE && vmf->vma->vm_flags & VM_SHARED;
> > 
> > Also, is "shortcutting the normal fault path" the reason for ext2 and
> > xfs both being broken?
> > 
> > /me puzzles over why write_fault is always true for page_mkwrite and
> > pfn_mkwrite, but not for fault and huge_fault...
> > 
> > Also: Can you please turn this (checking for timestamp update behavior
> > wrt shared and private mapping write faults) into an fstest so we don't
> > mess this up again?
> 
> I've written this program that tests it - you can integrate it into your 
> testsuite.

I don't get it.  You're a filesystem maintainer too, which means you're
a regular contributor.  Do you:

(a) not use fstests?  If you don't, I really hope you use something else
to QA hpfs.

(b) really think that it's my problem to integrate and submit your
regression tests for you?

> Mikulas
> 
> 
> #include 

and (c) what do you want me to do with a piece of code that has no
signoff tag, no copyright, and no license?  This is your patch, and
therefore your responsibility to develop enough of an appropriate
regression test in a proper form that the rest of us can easily
determine we have the rights to contribute to it.

I don't have a problem with helping to tweak a properly licensed and
tagged test program into fstests, but this is a non-starter.

--D

> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> #define FILE_NAME "test.txt"
> 
> static struct stat st1, st2;
> 
> int main(void)
> {
>   int h, r;
>   char *map;
>   unlink(FILE_NAME);
>   h = creat(FILE_NAME, 0600);
>   if (h == -1) perror("creat"), exit(1);
>   r = write(h, "x", 1);
>   if (r != 1) perror("write"), exit(1);
>   if (close(h)) perror("close"), exit(1);
>   h = open(FILE_NAME, O_RDWR);
>   if (h == -1) perror("open"), exit(1);
> 
>   map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, h, 0);
>   if (map == MAP_FAILED) perror("mmap"), exit(1);
>   if (fstat(h, &st1)) perror("fstat"), exit(1);
>   sleep(2);
>   *map = 'y';
>   if (fstat(h, &st2)) perror("fstat"), exit(1);
>   if (memcmp(&st1, &st2, sizeof(struct stat))) fprintf(stderr, "BUG: COW 
> fault changed time!\n"), exit(1);
>   if (munmap(map, 4096)) perror("munmap"), exit(1);
> 
>   map = mmap(NULL, 4096, PROT_REA

Re: [PATCH 2/2] xfs: don't update mtime on COW faults

2020-09-05 Thread Darrick J. Wong
On Sat, Sep 05, 2020 at 08:13:02AM -0400, Mikulas Patocka wrote:
> When running in a dax mode, if the user maps a page with MAP_PRIVATE and
> PROT_WRITE, the xfs filesystem would incorrectly update ctime and mtime
> when the user hits a COW fault.
> 
> This breaks building of the Linux kernel.
> How to reproduce:
> 1. extract the Linux kernel tree on dax-mounted xfs filesystem
> 2. run make clean
> 3. run make -j12
> 4. run make -j12
> - at step 4, make would incorrectly rebuild the whole kernel (although it
>   was already built in step 3).
> 
> The reason for the breakage is that almost all object files depend on
> objtool. When we run objtool, it takes COW page fault on its .data
> section, and these faults will incorrectly update the timestamp of the
> objtool binary. The updated timestamp causes make to rebuild the whole
> tree.
> 
> Signed-off-by: Mikulas Patocka 
> Cc: sta...@vger.kernel.org
> 
> ---
>  fs/xfs/xfs_file.c |   11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> Index: linux-2.6/fs/xfs/xfs_file.c
> ===
> --- linux-2.6.orig/fs/xfs/xfs_file.c  2020-09-05 10:01:42.0 +0200
> +++ linux-2.6/fs/xfs/xfs_file.c   2020-09-05 13:59:12.0 +0200
> @@ -1223,6 +1223,13 @@ __xfs_filemap_fault(
>   return ret;
>  }
>  
> +static bool
> +xfs_is_write_fault(

Call this xfs_is_shared_dax_write_fault, and throw in the IS_DAX() test?

You might as well make it a static inline.
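
i.e. something like this (untested):

static inline bool
xfs_is_shared_dax_write_fault(
        struct vm_fault         *vmf)
{
        return IS_DAX(file_inode(vmf->vma->vm_file)) &&
               (vmf->flags & FAULT_FLAG_WRITE) &&
               (vmf->vma->vm_flags & VM_SHARED);
}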

> + struct vm_fault *vmf)
> +{
> + return vmf->flags & FAULT_FLAG_WRITE && vmf->vma->vm_flags & VM_SHARED;

Also, is "shortcutting the normal fault path" the reason for ext2 and
xfs both being broken?

/me puzzles over why write_fault is always true for page_mkwrite and
pfn_mkwrite, but not for fault and huge_fault...

Also: Can you please turn this (checking for timestamp update behavior
wrt shared and private mapping write faults) into an fstest so we don't
mess this up again?

--D

> +}
> +
>  static vm_fault_t
>  xfs_filemap_fault(
>   struct vm_fault *vmf)
> @@ -1230,7 +1237,7 @@ xfs_filemap_fault(
>   /* DAX can shortcut the normal fault path on write faults! */
>   return __xfs_filemap_fault(vmf, PE_SIZE_PTE,
>   IS_DAX(file_inode(vmf->vma->vm_file)) &&
> - (vmf->flags & FAULT_FLAG_WRITE));
> + xfs_is_write_fault(vmf));
>  }
>  
>  static vm_fault_t
> @@ -1243,7 +1250,7 @@ xfs_filemap_huge_fault(
>  
>   /* DAX can shortcut the normal fault path on write faults! */
>   return __xfs_filemap_fault(vmf, pe_size,
> - (vmf->flags & FAULT_FLAG_WRITE));
> + xfs_is_write_fault(vmf));
>  }
>  
>  static vm_fault_t
> 


Re: [PATCH 5/9] iomap: Support arbitrarily many blocks per page

2020-08-25 Thread Darrick J. Wong
On Wed, Aug 26, 2020 at 03:26:23AM +0100, Matthew Wilcox wrote:
> On Tue, Aug 25, 2020 at 02:02:03PM -0700, Darrick J. Wong wrote:
> > >  /*
> > > - * Structure allocated for each page when block size < PAGE_SIZE to track
> > > + * Structure allocated for each page when block size < page size to track
> > >   * sub-page uptodate status and I/O completions.
> > 
> > "for each regular page or head page of a huge page"?  Or whatever we're
> > calling them nowadays?
> 
> Well, that's what I'm calling a "page" ;-)
> 
> How about "for each page or THP"?  The fact that it's stored in the
> head page is incidental -- it's allocated for the THP.

Ok.

--D


Re: [PATCH 9/9] iomap: Change calling convention for zeroing

2020-08-25 Thread Darrick J. Wong
On Mon, Aug 24, 2020 at 03:55:10PM +0100, Matthew Wilcox (Oracle) wrote:
> Pass the full length to iomap_zero() and dax_iomap_zero(), and have
> them return how many bytes they actually handled.  This is preparatory
> work for handling THP, although it looks like DAX could actually take
> advantage of it if there's a larger contiguous area.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/dax.c   | 13 ++---
>  fs/iomap/buffered-io.c | 33 +++--
>  include/linux/dax.h|  3 +--
>  3 files changed, 22 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 95341af1a966..f2b912cb034e 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1037,18 +1037,18 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
>   return ret;
>  }
>  
> -int dax_iomap_zero(loff_t pos, unsigned offset, unsigned size,
> -struct iomap *iomap)
> +loff_t dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap)

Sorry for my ultra-slow response to this.  The u64 length seems ok to me
(or uint64_t, I don't care all /that/ much), but using loff_t as a
return type bothers me because I see that and think that this function
is returning a new file offset, e.g. (pos + number of bytes zeroed).

So please, let's use s64 or something that isn't so misleading.
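
IOWs:

        s64 dax_iomap_zero(loff_t pos, u64 length, struct iomap *iomap);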

FWIW, Linus also[0] doesn't[1] like using loff_t for the number of bytes
copied.

--D

[0] 
https://lore.kernel.org/linux-fsdevel/CAHk-=wgcpafosigmf0xwagfvjw413xn3upatwywhrss+qui...@mail.gmail.com/
[1] 
https://lore.kernel.org/linux-fsdevel/CAHk-=wgvrounrevadvr_zthy8nmyo-_jvjv37o1mddm2de+...@mail.gmail.com/

>  {
>   sector_t sector = iomap_sector(iomap, pos & PAGE_MASK);
>   pgoff_t pgoff;
>   long rc, id;
>   void *kaddr;
>   bool page_aligned = false;
> -
> + unsigned offset = offset_in_page(pos);
> + unsigned size = min_t(u64, PAGE_SIZE - offset, length);
>  
>   if (IS_ALIGNED(sector << SECTOR_SHIFT, PAGE_SIZE) &&
> - IS_ALIGNED(size, PAGE_SIZE))
> + (size == PAGE_SIZE))
>   page_aligned = true;
>  
>   rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, &pgoff);
> @@ -1058,8 +1058,7 @@ int dax_iomap_zero(loff_t pos, unsigned offset, 
> unsigned size,
>   id = dax_read_lock();
>  
>   if (page_aligned)
> - rc = dax_zero_page_range(iomap->dax_dev, pgoff,
> -  size >> PAGE_SHIFT);
> + rc = dax_zero_page_range(iomap->dax_dev, pgoff, 1);
>   else
>   rc = dax_direct_access(iomap->dax_dev, pgoff, 1, &kaddr, NULL);
>   if (rc < 0) {
> @@ -1072,7 +1071,7 @@ int dax_iomap_zero(loff_t pos, unsigned offset, 
> unsigned size,
>   dax_flush(iomap->dax_dev, kaddr + offset, size);
>   }
>   dax_read_unlock(id);
> - return 0;
> + return size;
>  }
>  
>  static loff_t
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 7f618ab4b11e..2dba054095e8 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -901,11 +901,13 @@ iomap_file_unshare(struct inode *inode, loff_t pos, 
> loff_t len,
>  }
>  EXPORT_SYMBOL_GPL(iomap_file_unshare);
>  
> -static int iomap_zero(struct inode *inode, loff_t pos, unsigned offset,
> - unsigned bytes, struct iomap *iomap, struct iomap *srcmap)
> +static loff_t iomap_zero(struct inode *inode, loff_t pos, u64 length,
> + struct iomap *iomap, struct iomap *srcmap)
>  {
>   struct page *page;
>   int status;
> + unsigned offset = offset_in_page(pos);
> + unsigned bytes = min_t(u64, PAGE_SIZE - offset, length);
>  
>   status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap, srcmap);
>   if (status)
> @@ -917,38 +919,33 @@ static int iomap_zero(struct inode *inode, loff_t pos, 
> unsigned offset,
>   return iomap_write_end(inode, pos, bytes, bytes, page, iomap, srcmap);
>  }
>  
> -static loff_t
> -iomap_zero_range_actor(struct inode *inode, loff_t pos, loff_t count,
> - void *data, struct iomap *iomap, struct iomap *srcmap)
> +static loff_t iomap_zero_range_actor(struct inode *inode, loff_t pos,
> + loff_t length, void *data, struct iomap *iomap,
> + struct iomap *srcmap)
>  {
>   bool *did_zero = data;
>   loff_t written = 0;
> - int status;
>  
>   /* already zeroed?  we're done. */
>   if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> - return count;
> + return length;
>  
>   do {
> - unsigned offset, bytes;
> -
> - offset = offset_in_page(pos);
> - bytes = min_t(loff_t, PAGE_SIZE - offset, count);
> + loff_t bytes;
>  
>   if (IS_DAX(inode))
> - status = dax_iomap_zero(pos, offset, bytes, iomap);
> + bytes = dax_iomap_zero(pos, length, iomap);
>   else
> - status = iomap_zero(inode, pos, 

Re: [PATCH 6/9] iomap: Convert read_count to byte count

2020-08-25 Thread Darrick J. Wong
On Tue, Aug 25, 2020 at 10:09:02AM +1000, Dave Chinner wrote:
> On Mon, Aug 24, 2020 at 03:55:07PM +0100, Matthew Wilcox (Oracle) wrote:
> > Instead of counting bio segments, count the number of bytes submitted.
> > This insulates us from the block layer's definition of what a 'same page'
> > is, which is not necessarily clear once THPs are involved.
> 
> I'd like to see a comment on the definition of struct iomap_page to
> indicate that read_count (and write count) reflect the byte count of
> IO currently in flight on the page, not an IO count, because THP
> makes tracking this via bio state hard. Otherwise it is not at all
> obvious why it is done and why it is intentional...

Agreed. :)
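
Maybe something like this (wording off the top of my head):

        /*
         * read_count and write_count are the number of bytes of I/O
         * currently in flight on this page, not the number of I/Os,
         * because THPs make it hard to track I/O counts via bio state.
         */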

--D

> Otherwise the code looks OK.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com


Re: [PATCH 5/9] iomap: Support arbitrarily many blocks per page

2020-08-25 Thread Darrick J. Wong
On Mon, Aug 24, 2020 at 03:55:06PM +0100, Matthew Wilcox (Oracle) wrote:
> Size the uptodate array dynamically to support larger pages in the
> page cache.  With a 64kB page, we're only saving 8 bytes per page today,
> but with a 2MB maximum page size, we'd have to allocate more than 4kB
> per page.  Add a few debugging assertions.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> ---
>  fs/iomap/buffered-io.c | 14 ++
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index dbf9572dabe9..844e95cacea8 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -22,18 +22,19 @@
>  #include "../internal.h"
>  
>  /*
> - * Structure allocated for each page when block size < PAGE_SIZE to track
> + * Structure allocated for each page when block size < page size to track
>   * sub-page uptodate status and I/O completions.

"for each regular page or head page of a huge page"?  Or whatever we're
calling them nowadays?

--D

>   */
>  struct iomap_page {
>   atomic_tread_count;
>   atomic_twrite_count;
>   spinlock_t  uptodate_lock;
> - DECLARE_BITMAP(uptodate, PAGE_SIZE / 512);
> + unsigned long   uptodate[];
>  };
>  
>  static inline struct iomap_page *to_iomap_page(struct page *page)
>  {
> + VM_BUG_ON_PGFLAGS(PageTail(page), page);
>   if (page_has_private(page))
>   return (struct iomap_page *)page_private(page);
>   return NULL;
> @@ -45,11 +46,13 @@ static struct iomap_page *
>  iomap_page_create(struct inode *inode, struct page *page)
>  {
>   struct iomap_page *iop = to_iomap_page(page);
> + unsigned int nr_blocks = i_blocks_per_page(inode, page);
>  
> - if (iop || i_blocks_per_page(inode, page) <= 1)
> + if (iop || nr_blocks <= 1)
>   return iop;
>  
> - iop = kzalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
> + iop = kzalloc(struct_size(iop, uptodate, BITS_TO_LONGS(nr_blocks)),
> + GFP_NOFS | __GFP_NOFAIL);
> + spin_lock_init(&iop->uptodate_lock);
>   attach_page_private(page, iop);
>   return iop;
> @@ -59,11 +62,14 @@ static void
>  iomap_page_release(struct page *page)
>  {
>   struct iomap_page *iop = detach_page_private(page);
> + unsigned int nr_blocks = i_blocks_per_page(page->mapping->host, page);
>  
>   if (!iop)
>   return;
>   WARN_ON_ONCE(atomic_read(&iop->read_count));
>   WARN_ON_ONCE(atomic_read(&iop->write_count));
> + WARN_ON_ONCE(bitmap_full(iop->uptodate, nr_blocks) !=
> + PageUptodate(page));
>   kfree(iop);
>  }
>  
> -- 
> 2.28.0
> 


Re: [PATCH 4/9] iomap: Use bitmap ops to set uptodate bits

2020-08-25 Thread Darrick J. Wong
On Mon, Aug 24, 2020 at 03:55:05PM +0100, Matthew Wilcox (Oracle) wrote:
> Now that the bitmap is protected by a spinlock, we can use the
> more efficient bitmap ops instead of individual test/set bit ops.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 

Yay!
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 12 ++--
>  1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 639d54a4177e..dbf9572dabe9 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -134,19 +134,11 @@ iomap_iop_set_range_uptodate(struct page *page, 
> unsigned off, unsigned len)
>   struct inode *inode = page->mapping->host;
>   unsigned first = off >> inode->i_blkbits;
>   unsigned last = (off + len - 1) >> inode->i_blkbits;
> - bool uptodate = true;
>   unsigned long flags;
> - unsigned int i;
>  
>   spin_lock_irqsave(&iop->uptodate_lock, flags);
> - for (i = 0; i < i_blocks_per_page(inode, page); i++) {
> - if (i >= first && i <= last)
> - set_bit(i, iop->uptodate);
> - else if (!test_bit(i, iop->uptodate))
> - uptodate = false;
> - }
> -
> - if (uptodate)
> + bitmap_set(iop->uptodate, first, last - first + 1);
> + if (bitmap_full(iop->uptodate, i_blocks_per_page(inode, page)))
>   SetPageUptodate(page);
>   spin_unlock_irqrestore(&iop->uptodate_lock, flags);
>  }
> -- 
> 2.28.0
> 


Re: [PATCH 3/9] iomap: Use kzalloc to allocate iomap_page

2020-08-25 Thread Darrick J. Wong
On Mon, Aug 24, 2020 at 03:55:04PM +0100, Matthew Wilcox (Oracle) wrote:
> We can skip most of the initialisation, although spinlocks still
> need explicit initialisation as architectures may use a non-zero
> value to indicate unlocked.  The comment is no longer useful as
> attach_page_private() handles the refcount now.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 

Looks good to me,
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 10 +-
>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 13d5cdab8dcd..639d54a4177e 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -49,16 +49,8 @@ iomap_page_create(struct inode *inode, struct page *page)
>   if (iop || i_blocks_per_page(inode, page) <= 1)
>   return iop;
>  
> - iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
> - atomic_set(&iop->read_count, 0);
> - atomic_set(&iop->write_count, 0);
> + iop = kzalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
>   spin_lock_init(&iop->uptodate_lock);
> - bitmap_zero(iop->uptodate, PAGE_SIZE / SECTOR_SIZE);
> -
> - /*
> -  * migrate_page_move_mapping() assumes that pages with private data have
> -  * their count elevated by 1.
> -  */
>   attach_page_private(page, iop);
>   return iop;
>  }
> -- 
> 2.28.0
> 


Re: [PATCH 2/9] fs: Introduce i_blocks_per_page

2020-08-25 Thread Darrick J. Wong
On Mon, Aug 24, 2020 at 03:55:03PM +0100, Matthew Wilcox (Oracle) wrote:
> This helper is useful for both THPs and for supporting block size larger
> than page size.  Convert all users that I could find (we have a few
> different ways of writing this idiom, and I may have missed some).
> 
> Signed-off-by: Matthew Wilcox (Oracle) 
> Reviewed-by: Christoph Hellwig 

/me wonders what will happen when someone tries to make blocksz >
pagesize work, but as the most likely someone already rvb'd this I guess
it's fine:

Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c  |  8 
>  fs/jfs/jfs_metapage.c   |  2 +-
>  fs/xfs/xfs_aops.c   |  2 +-
>  include/linux/pagemap.h | 16 
>  4 files changed, 22 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index cffd575e57b6..13d5cdab8dcd 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -46,7 +46,7 @@ iomap_page_create(struct inode *inode, struct page *page)
>  {
>   struct iomap_page *iop = to_iomap_page(page);
>  
> - if (iop || i_blocksize(inode) == PAGE_SIZE)
> + if (iop || i_blocks_per_page(inode, page) <= 1)
>   return iop;
>  
>   iop = kmalloc(sizeof(*iop), GFP_NOFS | __GFP_NOFAIL);
> @@ -147,7 +147,7 @@ iomap_iop_set_range_uptodate(struct page *page, unsigned 
> off, unsigned len)
>   unsigned int i;
>  
>   spin_lock_irqsave(&iop->uptodate_lock, flags);
> - for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
> + for (i = 0; i < i_blocks_per_page(inode, page); i++) {
>   if (i >= first && i <= last)
>   set_bit(i, iop->uptodate);
>   else if (!test_bit(i, iop->uptodate))
> @@ -1078,7 +1078,7 @@ iomap_finish_page_writeback(struct inode *inode, struct 
> page *page,
>   mapping_set_error(inode->i_mapping, -EIO);
>   }
>  
> - WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
> + WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
>   WARN_ON_ONCE(iop && atomic_read(&iop->write_count) <= 0);
>  
>   if (!iop || atomic_dec_and_test(&iop->write_count))
> @@ -1374,7 +1374,7 @@ iomap_writepage_map(struct iomap_writepage_ctx *wpc,
>   int error = 0, count = 0, i;
>   LIST_HEAD(submit_list);
>  
> - WARN_ON_ONCE(i_blocksize(inode) < PAGE_SIZE && !iop);
> + WARN_ON_ONCE(i_blocks_per_page(inode, page) > 1 && !iop);
>   WARN_ON_ONCE(iop && atomic_read(&iop->write_count) != 0);
>  
>   /*
> diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c
> index a2f5338a5ea1..176580f54af9 100644
> --- a/fs/jfs/jfs_metapage.c
> +++ b/fs/jfs/jfs_metapage.c
> @@ -473,7 +473,7 @@ static int metapage_readpage(struct file *fp, struct page 
> *page)
>   struct inode *inode = page->mapping->host;
>   struct bio *bio = NULL;
>   int block_offset;
> - int blocks_per_page = PAGE_SIZE >> inode->i_blkbits;
> + int blocks_per_page = i_blocks_per_page(inode, page);
>   sector_t page_start;/* address of page in fs blocks */
>   sector_t pblock;
>   int xlen;
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index b35611882ff9..55d126d4e096 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -544,7 +544,7 @@ xfs_discard_page(
>   page, ip->i_ino, offset);
>  
>   error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
> - PAGE_SIZE / i_blocksize(inode));
> + i_blocks_per_page(inode, page));
>   if (error && !XFS_FORCED_SHUTDOWN(mp))
>   xfs_alert(mp, "page discard unable to remove delalloc 
> mapping.");
>  out_invalidate:
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 7de11dcd534d..853733286138 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -899,4 +899,20 @@ static inline int page_mkwrite_check_truncate(struct 
> page *page,
>   return offset;
>  }
>  
> +/**
> + * i_blocks_per_page - How many blocks fit in this page.
> + * @inode: The inode which contains the blocks.
> + * @page: The page (head page if the page is a THP).
> + *
> + * If the block size is larger than the size of this page, return zero.
> + *
> + * Context: The caller should hold a refcount on the page to prevent it
> + * from being split.
> + * Return: The number of filesystem blocks covered by this page.
> + */
> +static inline
> +unsigned int i_blocks_per_page(struct inode *inode, struct page *page)
> +{
> + return thp_size(page) >> inode->i_blkbits;
> +}
>  #endif /* _LINUX_PAGEMAP_H */
> -- 
> 2.28.0
> 


Re: [PATCH 1/9] iomap: Fix misplaced page flushing

2020-08-25 Thread Darrick J. Wong
On Mon, Aug 24, 2020 at 03:55:02PM +0100, Matthew Wilcox (Oracle) wrote:
> If iomap_unshare_actor() unshares to an inline iomap, the page was
> not being flushed.  block_write_end() and __iomap_write_end() already
> contain flushes, so adding it to iomap_write_end_inline() seems like
> the best place.  That means we can remove it from iomap_write_actor().
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

Seems reasonable to me...
Reviewed-by: Darrick J. Wong 

--D

> ---
>  fs/iomap/buffered-io.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index bcfc288dba3f..cffd575e57b6 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -715,6 +715,7 @@ iomap_write_end_inline(struct inode *inode, struct page 
> *page,
>  {
>   void *addr;
>  
> + flush_dcache_page(page);
>   WARN_ON_ONCE(!PageUptodate(page));
>   BUG_ON(pos + copied > PAGE_SIZE - offset_in_page(iomap->inline_data));
>  
> @@ -811,8 +812,6 @@ iomap_write_actor(struct inode *inode, loff_t pos, loff_t 
> length, void *data,
>  
>   copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
>  
> - flush_dcache_page(page);
> -
>   status = iomap_write_end(inode, pos, bytes, copied, page, iomap,
>   srcmap);
>   if (unlikely(status < 0))
> -- 
> 2.28.0
> 


Re: [RFC PATCH 1/8] fs: introduce get_shared_files() for dax

2020-08-07 Thread Darrick J. Wong
On Fri, Aug 07, 2020 at 09:13:29PM +0800, Shiyang Ruan wrote:
> Under the mode of both dax and reflink on, one page may be shared by
> multiple files and offsets.  In order to track them in memory-failure or
> other cases, we introduce this function by finding out who is sharing
> this block(the page) in a filesystem.  It returns a list that contains
> all the owners, and the offset in each owner.
> 
> For XFS, rmapbt is used to find out the owners of one block.  So, it
> should be turned on when we want to use dax feature together.
> 
> Signed-off-by: Shiyang Ruan 
> ---
>  fs/xfs/xfs_super.c  | 67 +
>  include/linux/dax.h |  7 +
>  include/linux/fs.h  |  2 ++
>  3 files changed, 76 insertions(+)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 379cbff438bc..b71392219c91 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -35,6 +35,9 @@
>  #include "xfs_refcount_item.h"
>  #include "xfs_bmap_item.h"
>  #include "xfs_reflink.h"
> +#include "xfs_alloc.h"
> +#include "xfs_rmap.h"
> +#include "xfs_rmap_btree.h"
>  
>  #include 
>  #include 
> @@ -1097,6 +1100,69 @@ xfs_fs_free_cached_objects(
>   return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
>  }
>  
> +static int _get_shared_files_fn(

Needs an xfs_ prefix...

> + struct xfs_btree_cur *cur,
> + struct xfs_rmap_irec *rec,
> + void *priv)
> +{
> + struct list_head *list = priv;
> + struct xfs_inode *ip;
> + struct shared_files *sfp;
> +
> + /* Get files that incore, filter out others that are not in use. */
> + xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner, XFS_IGET_INCORE, 0, 
> &ip);

No error checking at all?

What if rm_owner refers to metadata?
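
At a minimum I'd have expected something like (untested):

        int error;

        if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner))
                return 0;

        error = xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner,
                         XFS_IGET_INCORE, 0, &ip);
        if (error)
                return error;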

> + if (ip && !ip->i_vnode.i_mapping)
> + return 0;

When is the xfs_inode released?  We don't iput it here, and there's no
way for dax_unlock_page (afaict the only consumer) to do it, so we
leak the reference.

> +
> + sfp = kmalloc(sizeof(*sfp), GFP_KERNEL);

If there are millions of open files reflinked to this range of pmem this
is going to allocate a /lot/ of memory.

> + sfp->mapping = ip->i_vnode.i_mapping;

sfp->mapping = VFS_I(ip)->i_mapping;

> + sfp->index = rec->rm_offset;
> + list_add_tail(&sfp->list, list);

Why do we leave ->cookie uninitialized?  What does it even do?

> +
> + return 0;
> +}
> +
> +static int
> +xfs_fs_get_shared_files(
> + struct super_block  *sb,
> + pgoff_t offset,

Which device does this offset refer to?  XFS supports multiple storage
devices.

Also, uh, is this really a pgoff_t?  If yes, you can't use it with
XFS_B_TO_FSB below without first converting it to a loff_t.
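
If it really is a page index, this would have to be something like
(sketch):

        xfs_fsblock_t   fsbno = XFS_B_TO_FSB(mp, (loff_t)offset << PAGE_SHIFT);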

> + struct list_head *list)
> +{
> + struct xfs_mount *mp = XFS_M(sb);
> + struct xfs_trans *tp = NULL;
> + struct xfs_btree_cur *cur = NULL;
> + struct xfs_rmap_irec rmap_low = { 0 }, rmap_high = { 0 };

No need to memset(0) rmap_low later, or zero rmap_high just to memset it
later.

> + struct xfs_buf  *agf_bp = NULL;
> + xfs_agblock_t   bno = XFS_B_TO_FSB(mp, offset);

"FSB" refers to xfs_fsblock_t.  You just ripped the upper 32 bits off
the fsblock number.
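
This wants to be more like (sketch):

        xfs_fsblock_t   fsbno = XFS_B_TO_FSB(mp, offset);
        xfs_agnumber_t  agno = XFS_FSB_TO_AGNO(mp, fsbno);
        xfs_agblock_t   agbno = XFS_FSB_TO_AGBNO(mp, fsbno);

with the rmap query keyed on agbno, not the mangled fsblock number.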

> + xfs_agnumber_t  agno = XFS_FSB_TO_AGNO(mp, bno);
> + int error = 0;
> +
> + error = xfs_trans_alloc_empty(mp, &tp);
> + if (error)
> + return error;
> +
> + error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
> + if (error)
> + return error;
> +
> + cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agno);
> +
> + memset(&cur->bc_rec, 0, sizeof(cur->bc_rec));

Not necessary, bc_rec is zero in a freshly created cursor.

> + /* Construct the range for one rmap search */
> + memset(&rmap_low, 0, sizeof(rmap_low));
> + memset(&rmap_high, 0xFF, sizeof(rmap_high));
> + rmap_low.rm_startblock = rmap_high.rm_startblock = bno;
> +
> + error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
> +  _get_shared_files_fn, list);
> + if (error == -ECANCELED)
> + error = 0;
> +
> + xfs_btree_del_cursor(cur, error);
> + xfs_trans_brelse(tp, agf_bp);
> + return error;
> +}

Looking at this, I don't think this is the right way to approach memory
poisoning.  Rather than allocating a (potentially huge) linked list and
passing it to the memory poison code to unmap pages, kill processes, and
free the list, I think:

1) "->get_shared_files" should be more targetted.  Call it ->storage_lost
or something, so that it only has one purpose, which is to react to
asynchronous notifications that storage has been lost.

2) The inner _get_shared_files_fn should directly call back into the
memory manager to remove a poisoned page from the mapping and signal
whatever process might have it mapped.

That way, _get_shared_files_fn can look in the xfs buffer cache to see

Re: Can we change the S_DAX flag immediately on XFS without dropping caches?

2020-08-05 Thread Darrick J. Wong
On Wed, Aug 05, 2020 at 04:10:05PM +0800, Li, Hao wrote:
> Hello,
> 
> Ping.
> 
> Thanks,
> Hao Li
> 
> 
> On 2020/7/31 17:12, Li, Hao wrote:
> > On 2020/7/30 0:10, Ira Weiny wrote:
> >
> >> On Wed, Jul 29, 2020 at 11:23:21AM +0900, Yasunori Goto wrote:
> >>> Hi,
> >>>
> >>> On 2020/07/28 11:20, Dave Chinner wrote:
>  On Tue, Jul 28, 2020 at 02:00:08AM +, Li, Hao wrote:
> > Hi,
> >
> > I have noticed that we have to drop caches to make the change of the S_DAX
> > flag take effect after using chattr +x to turn on DAX for an existing
> > regular file. The related function is xfs_diflags_to_iflags, whose
> > second parameter determines whether we should set S_DAX immediately.
>  Yup, as documented in Documentation/filesystems/dax.txt. Specifically:
> 
>    6. When changing the S_DAX policy via toggling the persistent 
>  FS_XFLAG_DAX flag,
>   the change in behaviour for existing regular files may not occur
>   immediately.  If the change must take effect immediately, the 
>  administrator
>   needs to:
> 
>   a) stop the application so there are no active references to the 
>  data set
>  the policy change will affect
> 
>   b) evict the data set from kernel caches so it will be 
>  re-instantiated when
>  the application is restarted. This can be achieved by:
> 
>  i. drop-caches
>  ii. a filesystem unmount and mount cycle
>  iii. a system reboot
> 
> > I can't figure out why we do this. Is this because the page caches in
> > address_space->i_pages are hard to deal with?
>  Because of unfixable races in the page fault path that prevent
>  changing the caching behaviour of the inode while concurrent access
>  is possible. The only way to guarantee races can't happen is to
>  cycle the inode out of cache.
> >>> I understand why the drop_cache operation is necessary. Thanks.
> >>>
> >>> BTW, even if a normal user becomes able to change the DAX flag for an
> >>> inode, the drop_cache operation still requires root permission, right?
> >>>
> >>> So, if the kernel had a feature that lets a normal user drop the cache
> >>> for an inode they have permission on, I think it would remove the above
> >>> limitation, and we would like to try to implement it soon.
> >>>
> >>> Do you have any opinion on making such a feature?
> >>> (Agreement/opposition, or any other comment?)
> >> I would not be opposed but there were many hurdles to that implementation.
> >>
> >> What is the use case you are thinking of here?
> >>
> >> The compromise of dropping caches was reached because we envisioned that 
> >> many
> >> users would simply want to chose the file mode when a file was created and
> >> maintain that mode through the lifetime of the file.  To that end one can
> >> simply create directories which have the desired dax mode and any files 
> >> created
> >> in that directory will inherit the dax mode immediately.  
> > Inheriting mechanism for DAX mode is reasonable but chattr_caches
> > makes things complicated.
> >> So there is no need
> >> to switch the file mode directly as a normal user.
> > The question is, the normal users can indeed use chattr to change the DAX
> > mode for a regular file as long as they want. However, when they do this,
> > they have no way to make the change take effect. I think this behavior is
> > weird. We can say chattr executes successfully because XFS_DIFLAG2_DAX has
> > been set onto xfs_inode->i_d.di_flags2, but we can also say chattr doesn't
> > finish things completely because S_DAX is not set onto inode->i_flags.
> > The user may be confused about why chattr +/-x doesn't work at all. Maybe
> > we should find a way for the normal user to make chattr take effect
> > without calling the administrator, or we can make the chattr +/-x command
> > request root permission, given that if the user has root permission, they
> > can make the DAX change take effect through echo 2 > /proc/sys/vm/drop_caches.

The kernel can sometimes make S_DAX changes take effect on its own,
provided that there are no other users of the file and the filesystem
agrees to reclaim an inode on file close and the program closes the file
after changing the bit.  None of these behaviors are guaranteed to
exist, so this is not mentioned in the documentation.

(And before anyone asks, yes, we did try to build a way to change the
file ops on the fly, but adding more concurrency control to all io paths
to handle an infrequent state change is not acceptable.)

--D

> >
> >
> > Regards,
> >
> > Hao Li
> >
> >> Would that work for your use case?
> >>
> >> Ira
> 
> 


Re: [RFC] Make the memory failure blast radius more precise

2020-06-23 Thread Darrick J. Wong
On Tue, Jun 23, 2020 at 11:40:27PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 23, 2020 at 03:26:58PM -0700, Luck, Tony wrote:
> > On Tue, Jun 23, 2020 at 11:17:41PM +0100, Matthew Wilcox wrote:
> > > It might also be nice to have an madvise() MADV_ZERO option so the
> > > application doesn't have to look up the fd associated with that memory
> > > range, but we haven't floated that idea with the customer yet; I just
> > > thought of it now.
> > 
> > So the conversation between OS and kernel goes like this?
> > 
> > 1) machine check
> > 2) Kernel unmaps the 4K page surroundinng the poison and sends
> >SIGBUS to the application to say that one cache line is gone
> > 3) App says madvise(MADV_ZERO, that cache line)
> > 4) Kernel says ... "oh, you know how to deal with this" and allocates
> >a new page, copying the 63 good cache lines from the old page and
> >zeroing the missing one. New page is mapped to user.
> 
> That could be one way of implementing it.  My understanding is that
> pmem devices will reallocate bad cachelines on writes, so a better
> implementation would be:
> 
> 1) Kernel receives machine check
> 2) Kernel sends SIGBUS to the application
> 3) App send madvise(MADV_ZERO, addr, 1 << granularity)
> 4) Kernel does special writes to ensure the cacheline is zeroed
> 5) App does whatever it needs to recover (reconstructs the data or marks
> it as gone)

Frankly, I've wondered why the filesystem shouldn't just be in charge of
all this--

1. kernel receives machine check
2. kernel tattles to xfs
3. xfs looks up which file(s) own the pmem range
4. xfs zeroes the region, clears the poison, and sets AS_EIO on the
   files
5. xfs sends SIGBUS to any programs that had those files mapped to tell
   them "Your data is gone, we've stabilized the storage you had
   mapped."
6. app does whatever it needs to recover

Apps shouldn't have to do this punch-and-reallocate dance, seeing as
they don't currently do that for SCSI disks and the like.
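
Roughly (total handwaving, none of these hooks exist today):

        /* steps 1-2: machine check handler tattles to the filesystem */
        sb->s_op->storage_lost(sb, pmem_offset, len);

        /*
         * steps 3-5: the fs maps pmem_offset back to its owner(s) via
         * the rmapbt, zeroes the range to clear the poison, sets AS_EIO
         * on the mapping, and SIGBUSes anyone who has it mapped.
         */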

--D

> > Do you have folks lined up to use that?  I don't know that many
> > folks are even catching the SIGBUS :-(
> 
> Had a 75 minute meeting with some people who want to use pmem this
> afternoon ...


Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

2020-06-04 Thread Darrick J. Wong
On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote:
> 
> 
> On 2020/4/28 2:43 PM, Dave Chinner wrote:
> > On Tue, Apr 28, 2020 at 06:09:47AM +, Ruan, Shiyang wrote:
> > > 
> > > On 2020/4/27 20:28:36, "Matthew Wilcox"  wrote:
> > > 
> > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > >   This patchset is a try to resolve the shared 'page cache' problem 
> > > > > for
> > > > >   fsdax.
> > > > > 
> > > > >   In order to track multiple mappings and indexes on one page, I
> > > > >   introduced a dax-rmap rb-tree to manage the relationship.  A dax 
> > > > > entry
> > > > >   will be associated more than once if is shared.  At the second time 
> > > > > we
> > > > >   associate this entry, we create this rb-tree and store its root in
> > > > >   page->private(not used in fsdax).  Insert (->mapping, ->index) when
> > > > >   dax_associate_entry() and delete it when dax_disassociate_entry().
> > > > 
> > > > Do we really want to track all of this on a per-page basis?  I would
> > > > have thought a per-extent basis was more useful.  Essentially, create
> > > > a new address_space for each shared extent.  Per page just seems like
> > > > a huge overhead.
> > > > 
> > > Per-extent tracking is a nice idea for me.  I haven't thought of it
> > > yet...
> > > 
> > > But the extent info is maintained by filesystem.  I think we need a way
> > > to obtain this info from FS when associating a page.  May be a bit
> > > complicated.  Let me think about it...
> > 
> > That's why I want the -user of this association- to do a filesystem
> > callout instead of keeping it's own naive tracking infrastructure.
> > The filesystem can do an efficient, on-demand reverse mapping lookup
> > from it's own extent tracking infrastructure, and there's zero
> > runtime overhead when there are no errors present.
> 
> Hi Dave,
> 
> I ran into some difficulties when trying to implement the per-extent rmap
> tracking.  So, I re-read your comments and found that I was misunderstanding
> what you described here.
> 
> I think what you mean is: we don't need the in-memory dax-rmap tracking now.
> Just ask the FS for the owner information associated with one page when a
> memory-failure occurs.  So, the per-page (even per-extent) dax-rmap is
> unnecessary in this case.  Is this right?

Right.  XFS already has its own rmap tree.

> Based on this, we only need to store the extent information of an fsdax page
> in its ->mapping (by searching from the FS).  Then obtain the owners of this
> page (also by searching from the FS) when memory-failure or another rmap case
> occurs.

I don't even think you need that much.  All you need is the "physical"
offset of that page within the pmem device (e.g. 'this is the 307th 4k
page == offset 1257472 since the start of /dev/pmem0') and xfs can look
up the owner of that range of physical storage and deal with it as
needed.

> So, an fsdax page is no longer associated with a specific file, but with an
> FS (or the pmem device).  I think it's easier to understand and implement.

Yes.  I also suspect this will be necessary to support reflink...

--D

> 
> --
> Thanks,
> Ruan Shiyang.
> > 
> > At the moment, this "dax association" is used to "report" a storage
> > media error directly to userspace. I say "report" because what it
> > does is kill userspace processes dead. The storage media error
> > actually needs to be reported to the owner of the storage media,
> > which in the case of FS-DAX is the filesytem.
> > 
> > That way the filesystem can then look up all the owners of that bad
> > media range (i.e. the filesystem block it corresponds to) and take
> > appropriate action. e.g.
> > 
> > - if it falls in filesytem metadata, shutdown the filesystem
> > - if it falls in user data, call the "kill userspace dead" routines
> >for each mapping/index tuple the filesystem finds for the given
> >LBA address that the media error occurred.
> > 
> > Right now if the media error is in filesystem metadata, the
> > filesystem isn't even told about it. The filesystem can't even shut
> > down - the error is just dropped on the floor and it won't be until
> > the filesystem next tries to reference that metadata that we notice
> > there is an issue.
> > 
> > Cheers,
> > 
> > Dave.
> > 
> 
> 


Re: Reply: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink

2020-04-28 Thread Darrick J. Wong
On Tue, Apr 28, 2020 at 09:24:41PM +1000, Dave Chinner wrote:
> On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote:
> > On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote:
> > > On 2020/4/28 2:43 PM, Dave Chinner wrote:
> > > > On Tue, Apr 28, 2020 at 06:09:47AM +, Ruan, Shiyang wrote:
> > > > > On 2020/4/27 20:28:36, "Matthew Wilcox"  wrote:
> > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > > > >   This patchset is a try to resolve the shared 'page cache' 
> > > > > > > problem for
> > > > > > >   fsdax.
> > > > > > > 
> > > > > > >   In order to track multiple mappings and indexes on one page, I
> > > > > > >   introduced a dax-rmap rb-tree to manage the relationship.  A 
> > > > > > > dax entry
> > > > > > >   will be associated more than once if is shared.  At the second 
> > > > > > > time we
> > > > > > >   associate this entry, we create this rb-tree and store its root 
> > > > > > > in
> > > > > > >   page->private(not used in fsdax).  Insert (->mapping, ->index) 
> > > > > > > when
> > > > > > >   dax_associate_entry() and delete it when 
> > > > > > > dax_disassociate_entry().
> > > > > > 
> > > > > > Do we really want to track all of this on a per-page basis?  I would
> > > > > > have thought a per-extent basis was more useful.  Essentially, 
> > > > > > create
> > > > > > a new address_space for each shared extent.  Per page just seems 
> > > > > > like
> > > > > > a huge overhead.
> > > > > > 
> > > > > Per-extent tracking is a nice idea for me.  I haven't thought of it
> > > > > yet...
> > > > > 
> > > > > But the extent info is maintained by filesystem.  I think we need a 
> > > > > way
> > > > > to obtain this info from FS when associating a page.  May be a bit
> > > > > complicated.  Let me think about it...
> > > > 
> > > > That's why I want the -user of this association- to do a filesystem
> > > > callout instead of keeping it's own naive tracking infrastructure.
> > > > The filesystem can do an efficient, on-demand reverse mapping lookup
> > > > from it's own extent tracking infrastructure, and there's zero
> > > > runtime overhead when there are no errors present.
> > > > 
> > > > At the moment, this "dax association" is used to "report" a storage
> > > > media error directly to userspace. I say "report" because what it
> > > > does is kill userspace processes dead. The storage media error
> > > > actually needs to be reported to the owner of the storage media,
> > > > which in the case of FS-DAX is the filesytem.
> > > 
> > > Understood.
> > > 
> > > BTW, this is the usage in memory-failure, so what about rmap?  I have not
> > > found how to use this tracking in rmap.  Do you have any ideas?
> > > 
> > > > 
> > > > That way the filesystem can then look up all the owners of that bad
> > > > media range (i.e. the filesystem block it corresponds to) and take
> > > > appropriate action. e.g.
> > > 
> > > I tried writing a function to look up all the owners' info of one block in
> > > xfs for memory-failure use.  It was dropped in this patchset because I 
> > > found
> > > out that this lookup function needs 'rmapbt' to be enabled when mkfs.  But
> > > by default, rmapbt is disabled.  I am not sure if it matters...
> > 
> > I'm pretty sure you can't have shared extents on an XFS filesystem if you
> > _don't_ have the rmapbt feature enabled.  I mean, that's why it exists.
> 
> You're confusing reflink with rmap. :)
> 
> rmapbt does all the reverse mapping tracking, reflink just does the
> shared data extent tracking.
> 
> But given that anyone who wants to use DAX with reflink is going to
> have to mkfs their filesystem anyway (to turn on reflink) requiring
> that rmapbt is also turned on is not a big deal. Especially as we
> can check it at mount time in the kernel...

Are we going to turn on rmap by default?  The last I checked, it did
have a 10-20% performance cost on extreme metadata-heavy workloads.
Or do we only enable it by default if mkfs detects a pmem device?

(Admittedly, most people do not run fsx as a productivity app; the
normal hit is usually 3-5% which might not be such a big deal since you
also get (half of) online fsck. :P)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com


Re: [RFC] dax,pmem: Provide a dax operation to zero range of memory

2020-02-04 Thread Darrick J. Wong
On Fri, Jan 31, 2020 at 03:31:58PM -0800, Dan Williams wrote:
> On Thu, Jan 23, 2020 at 11:07 AM Darrick J. Wong
>  wrote:
> >
> > On Thu, Jan 23, 2020 at 11:52:49AM -0500, Vivek Goyal wrote:
> > > Hi,
> > >
> > > This is an RFC patch to provide a dax operation to zero a range of memory.
> > > It will also clear poison in the process. This is primarily a compile-tested
> > > patch. I don't have real hardware to test the poison logic. I am posting
> > > this to figure out if this is the right direction or not.
> > >
> > > Motivation from this patch comes from Christoph's feedback that he will
> > > rather prefer a dax way to zero a range instead of relying on having to
> > > call blkdev_issue_zeroout() in __dax_zero_page_range().
> > >
> > > https://lkml.org/lkml/2019/8/26/361
> > >
> > > My motivation for this change is virtiofs DAX support. There we use DAX
> > > but we don't have a block device. So any dax code which has the assumption
> > > that there is always a block device associated is a problem. So this
> > > is more of a cleanup of one of the places where dax has this dependency
> > > on block device and if we add a dax operation for zeroing a range, it
> > > can help with not having to call blkdev_issue_zeroout() in dax path.
> > >
> > > I have yet to take care of stacked block drivers (dm/md).
> > >
> > > Current poison clearing logic is primarily written with the assumption that
> > > I/O is sector aligned. With this new method, this assumption is broken
> > > and one can pass any range of memory to zero. I have fixed few places
> > > in existing logic to be able to handle an arbitrary start/end. I am
> > > not sure whether there are other dependencies which might need fixing or
> > > prohibit us from providing this method.
> > >
> > > Any feedback or comment is welcome.
> >
> > So who gets to use this? :)
> >
> > Should we (XFS) make fallocate(ZERO_RANGE) detect when it's operating on
> > a written extent in a DAX file and call this instead of what it does now
> > (punch range and reallocate unwritten)?
> 
> If it eliminates more block assumptions, then yes. In general I think
> there are opportunities to use "native" direct_access instead of
> block-i/o for other areas too, like metadata i/o.
> 
> > Is this the kind of thing XFS should just do on its own when DAX tells us that
> > some range of pmem has gone bad and now we need to (a) race with the
> > userland programs to write /something/ to the range to prevent a machine
> > check (b) whack all the programs that think they have a mapping to
> > their data, (c) see if we have a DRAM copy and just write that back, (d)
> > set wb_err so fsyncs fail, and/or (e) regenerate metadata as necessary?
> 
> (a), (b) duplicate what memory error handling already does. So yes,
> could be done but it only helps if machine check handling is broken or
> missing.


> (c) what DRAM copy in the DAX case?

Sorry, I was talking about the fs metadata that we cache in DRAM.

> (d) dax fsync is just cache flush, so it can't fail, or are you
> talking about errors in metadata?

I'm talking about an S_DAX file that someone is doing regular write()s
to:

1. Open file O_RDWR
2. Write something to the file
3. Some time later, something decides the pmem is bad.
4. Program calls fsync(); does it return EIO?

(I shouldn't have mixed the metadata/file data cases, sorry...)

> (e) I thought our solution for dax metadata redundancy is to use a
> realtime data device and raid mirror for the metadata device.

In the end it was set aside on the grounds that reserving space for
a separate metadata device was too costly and too complex for now.
We might get back to it later when the  economics improve.

> >  Will XFS ever get that "your storage went bad" hook that was
> > promised ages ago?
> 
> pmem developers don't scale?

Ah, sorry. :/

> > Though I guess it only does this a single page at a time, which won't be
> > awesome if we're trying to zero (say) 100GB of pmem.  I was expecting to
> > see one big memset() call to zero the entire range followed by
> > pmem_clear_poison() on the entire range, but I guess you did tag this
> > RFC. :)
> 
> Until movdir64b is available the only way to clear poison is by making
> a call to the BIOS. The BIOS may not be efficient at bulk clearing.

Well then let's port XFS to SMM mode. 

(No, please don't...)

--D


Re: [RFC] dax,pmem: Provide a dax operation to zero range of memory

2020-01-23 Thread Darrick J. Wong
On Thu, Jan 23, 2020 at 11:52:49AM -0500, Vivek Goyal wrote:
> Hi,
> 
> This is an RFC patch to provide a dax operation to zero a range of memory.
> It will also clear poison in the process. This is primarily a compile-tested
> patch. I don't have real hardware to test the poison logic. I am posting
> this to figure out if this is the right direction or not.
> 
> Motivation from this patch comes from Christoph's feedback that he will
> rather prefer a dax way to zero a range instead of relying on having to
> call blkdev_issue_zeroout() in __dax_zero_page_range().
> 
> https://lkml.org/lkml/2019/8/26/361
> 
> My motivation for this change is virtiofs DAX support. There we use DAX
> but we don't have a block device. So any dax code which has the assumption
> that there is always a block device associated is a problem. So this
> is more of a cleanup of one of the places where dax has this dependency
> on block device and if we add a dax operation for zeroing a range, it
> can help with not having to call blkdev_issue_zeroout() in dax path.
> 
> I have yet to take care of stacked block drivers (dm/md).
> 
> Current poison clearing logic is primarily written with the assumption that
> I/O is sector aligned. With this new method, this assumption is broken
> and one can pass any range of memory to zero. I have fixed few places
> in existing logic to be able to handle an arbitrary start/end. I am
> not sure whether there are other dependencies which might need fixing or
> prohibit us from providing this method.
> 
> Any feedback or comment is welcome.

So who gets to use this? :)

Should we (XFS) make fallocate(ZERO_RANGE) detect when it's operating on
a written extent in a DAX file and call this instead of what it does now
(punch range and reallocate unwritten)?

Is this the kind of thing XFS should just do on its own when DAX tells us that
some range of pmem has gone bad and now we need to (a) race with the
userland programs to write /something/ to the range to prevent a machine
check, (b) whack all the programs that think they have a mapping to
their data, (c) see if we have a DRAM copy and just write that back, (d)
set wb_err so fsyncs fail, and/or (e) regenerate metadata as necessary?

 Will XFS ever get that "your storage went bad" hook that was
promised ages ago?

Though I guess it only does this a single page at a time, which won't be
awesome if we're trying to zero (say) 100GB of pmem.  I was expecting to
see one big memset() call to zero the entire range followed by
pmem_clear_poison() on the entire range, but I guess you did tag this
RFC. :)
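
i.e. (sketch):

        memset(pmem->virt_addr + offset, 0, len);
        pmem_clear_poison(pmem, offset, len);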

--D

> Thanks
> Vivek
> 
> ---
>  drivers/dax/super.c   |   13 +
>  drivers/nvdimm/pmem.c |   67 
> ++
>  fs/dax.c  |   39 -
>  include/linux/dax.h   |3 ++
>  4 files changed, 85 insertions(+), 37 deletions(-)
> 
> Index: rhvgoyal-linux/drivers/nvdimm/pmem.c
> ===
> --- rhvgoyal-linux.orig/drivers/nvdimm/pmem.c 2020-01-23 11:32:11.075139183 
> -0500
> +++ rhvgoyal-linux/drivers/nvdimm/pmem.c  2020-01-23 11:32:28.660139183 
> -0500
> @@ -52,8 +52,8 @@ static void hwpoison_clear(struct pmem_d
>   if (is_vmalloc_addr(pmem->virt_addr))
>   return;
>  
> - pfn_start = PHYS_PFN(phys);
> - pfn_end = pfn_start + PHYS_PFN(len);
> + pfn_start = PFN_UP(phys);
> + pfn_end = PFN_DOWN(phys + len);
>   for (pfn = pfn_start; pfn < pfn_end; pfn++) {
>   struct page *page = pfn_to_page(pfn);
>  
> @@ -71,22 +71,24 @@ static blk_status_t pmem_clear_poison(st
>   phys_addr_t offset, unsigned int len)
>  {
>   struct device *dev = to_dev(pmem);
> - sector_t sector;
> + sector_t sector_start, sector_end;
>   long cleared;
>   blk_status_t rc = BLK_STS_OK;
> + int nr_sectors;
>  
> - sector = (offset - pmem->data_offset) / 512;
> + sector_start = ALIGN((offset - pmem->data_offset), 512) / 512;
> + sector_end = ALIGN_DOWN((offset - pmem->data_offset + len), 512)/512;
> + nr_sectors =  sector_end - sector_start;
>  
>   cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
>   if (cleared < len)
>   rc = BLK_STS_IOERR;
> - if (cleared > 0 && cleared / 512) {
> + if (cleared > 0 && nr_sectors > 0) {
>   hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
> - cleared /= 512;
> - dev_dbg(dev, "%#llx clear %ld sector%s\n",
> - (unsigned long long) sector, cleared,
> - cleared > 1 ? "s" : "");
> - badblocks_clear(&pmem->bb, sector, cleared);
> + dev_dbg(dev, "%#llx clear %d sector%s\n",
> + (unsigned long long) sector_start, nr_sectors,
> + nr_sectors > 1 ? "s" : "");
> + badblocks_clear(&pmem->bb, sector_start, nr_sectors);
>   

Re: [PATCH 01/19] dax: remove block device dependencies

2020-01-07 Thread Darrick J. Wong
On Tue, Jan 07, 2020 at 10:49:55AM -0800, Dan Williams wrote:
> On Tue, Jan 7, 2020 at 10:33 AM Vivek Goyal  wrote:
> >
> > On Tue, Jan 07, 2020 at 10:07:18AM -0800, Dan Williams wrote:
> > > On Tue, Jan 7, 2020 at 10:02 AM Vivek Goyal  wrote:
> > > >
> > > > On Tue, Jan 07, 2020 at 09:29:17AM -0800, Dan Williams wrote:
> > > > > On Tue, Jan 7, 2020 at 9:08 AM Darrick J. Wong 
> > > > >  wrote:
> > > > > >
> > > > > > On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote:
> > > > > > > On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > > > > > > > > Agree. In retrospect it was my laziness in the dax-device
> > > > > > > > > > implementation to expect the block-device to be available.
> > > > > > > > > >
> > > > > > > > > > It looks like fs_dax_get_by_bdev() is an intercept point 
> > > > > > > > > > where a
> > > > > > > > > > dax_device could be dynamically created to represent the 
> > > > > > > > > > subset range
> > > > > > > > > > indicated by the block-device partition. That would open up 
> > > > > > > > > > more
> > > > > > > > > > cleanup opportunities.
> > > > > > > > >
> > > > > > > > > Hi Dan,
> > > > > > > > >
> > > > > > > > > After a long time I got time to look at it again. Want to 
> > > > > > > > > work on this
> > > > > > > > > cleanup so that I can make progress with virtiofs DAX patches.
> > > > > > > > >
> > > > > > > > > I am not sure I understand the requirements fully. I see that 
> > > > > > > > > right now
> > > > > > > > > dax_device is created per device and all block partitions 
> > > > > > > > > refer to it. If
> > > > > > > > > we want to create one dax_device per partition, then it looks 
> > > > > > > > > like this
> > > > > > > > > will be structured more along the lines of how the block layer 
> > > > > > > > > handles disks and
> > > > > > > > > partitions. (One gendisk for the disk and block_devices for 
> > > > > > > > > partitions,
> > > > > > > > > including partition 0). That probably means state belonging to 
> > > > > > > > > the whole device
> > > > > > > > > will be in a common structure, say dax_device_common, and per 
> > > > > > > > > partition state
> > > > > > > > > will be in dax_device and dax_device can carry a pointer to
> > > > > > > > > dax_device_common.
> > > > > > > > >
> > > > > > > > > I am also not sure what it means to partition dax 
> > > > > > > > > devices. How will
> > > > > > > > > partitions be exported to user space?
> > > > > > > >
> > > > > > > > Dan, last time we talked you agreed that partitioned dax 
> > > > > > > > devices are
> > > > > > > > rather pointless IIRC.  Should we just deprecate partitions on 
> > > > > > > > DAX
> > > > > > > > devices and then remove them after a cycle or two?
> > > > > > >
> > > > > > > That does seem a better plan than trying to force partition 
> > > > > > > support
> > > > > > > where it is not needed.
> > > > > >
> > > > > > Question: if one /did/ have a partitioned DAX device and used 
> > > > > > kpartx to
> > > > > > create dm-linear devices for each partition, will DAX still work 
> > > > > > through
> > > > > > that?
> > > > >
> > > > > The device-mapper support will continue, but it will be limited to
> > > > > whole device sub-components. I.e. you could use kpartx to carve up
> > > >

Re: [PATCH 01/19] dax: remove block device dependencies

2020-01-07 Thread Darrick J. Wong
On Tue, Jan 07, 2020 at 06:22:54AM -0800, Dan Williams wrote:
> On Tue, Jan 7, 2020 at 4:52 AM Christoph Hellwig  wrote:
> >
> > On Mon, Dec 16, 2019 at 01:10:14PM -0500, Vivek Goyal wrote:
> > > > Agree. In retrospect it was my laziness in the dax-device
> > > > implementation to expect the block-device to be available.
> > > >
> > > > It looks like fs_dax_get_by_bdev() is an intercept point where a
> > > > dax_device could be dynamically created to represent the subset range
> > > > indicated by the block-device partition. That would open up more
> > > > cleanup opportunities.
> > >
> > > Hi Dan,
> > >
> > > After a long time I got time to look at it again. Want to work on this
> > > cleanup so that I can make progress with virtiofs DAX patches.
> > >
> > > I am not sure I understand the requirements fully. I see that right now
> > > dax_device is created per device and all block partitions refer to it. If
> > > we want to create one dax_device per partition, then it looks like this
> > > will be structured more along the lines of how the block layer handles disks and
> > > partitions. (One gendisk for the disk and block_devices for partitions,
> > > including partition 0). That probably means state belonging to the whole device
> > > will be in a common structure, say dax_device_common, and per-partition state
> > > will be in dax_device and dax_device can carry a pointer to
> > > dax_device_common.
> > >
> > > I am also not sure what it means to partition dax devices. How will
> > > partitions be exported to user space?
> >
> > Dan, last time we talked you agreed that partitioned dax devices are
> > rather pointless IIRC.  Should we just deprecate partitions on DAX
> > devices and then remove them after a cycle or two?
> 
> That does seem a better plan than trying to force partition support
> where it is not needed.

Question: if one /did/ have a partitioned DAX device and used kpartx to
create dm-linear devices for each partition, will DAX still work through
that?

--D


Re: [RFC PATCH 0/7] xfs: add reflink & dedupe support for fsdax.

2019-10-09 Thread Darrick J. Wong
On Tue, Oct 08, 2019 at 11:31:44PM -0700, Christoph Hellwig wrote:
> Btw, I just had a chat with Dan last week on this.  And he pointed out
> that while this series deals with the read/write path issues of 
> reflink on DAX it doesn't deal with the mmap side issue that
> page->mapping and page->index can point back to exactly one file.
> 
> I think we want a few xfstests that reflink a file and then use the
> different links using mmap, as that should blow up pretty reliably.

Hmm, you're right, we don't actually have a test that checks the
behavior of mwriting all copies of a shared block.  Ok, I'll go write
one.

--D


Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes

2019-05-28 Thread Darrick J. Wong
On Wed, May 29, 2019 at 12:02:40PM +0800, Shiyang Ruan wrote:
> 
> 
> On 5/29/19 10:47 AM, Dave Chinner wrote:
> > On Wed, May 29, 2019 at 10:01:58AM +0800, Shiyang Ruan wrote:
> > > 
> > > On 5/28/19 5:17 PM, Jan Kara wrote:
> > > > On Mon 27-05-19 16:25:41, Shiyang Ruan wrote:
> > > > > On 5/23/19 7:51 PM, Goldwyn Rodrigues wrote:
> > > > > > > 
> > > > > > > Hi,
> > > > > > > 
> > > > > > > I'm working on reflink & dax in XFS, here are some thoughts on 
> > > > > > > this:
> > > > > > > 
> > > > > > > As mentioned above: the second iomap's offset and length must 
> > > > > > > match the
> > > > > > > first.  I thought so at the beginning, but later found that the 
> > > > > > > only
> > > > > > > difference between these two iomaps is @addr.  So, what about 
> > > > > > > adding a
> > > > > > > @saddr, which means the source address of COW extent, into the 
> > > > > > > struct iomap.
> > > > > > > The ->iomap_begin() fills @saddr if the extent is COW, and 0 if 
> > > > > > > not.  Then
> > > > > > > handle this @saddr in each ->actor().  No more modifications in 
> > > > > > > other
> > > > > > > functions.
> > > > > > 
> > > > > > Yes, I started of with the exact idea before being recommended this 
> > > > > > by Dave.
> > > > > > I used two fields instead of one namely cow_pos and cow_addr which 
> > > > > > defined
> > > > > > the source details. I had put it as a iomap flag as opposed to a 
> > > > > > type
> > > > > > which of course did not appeal well.
> > > > > > 
> > > > > > We may want to use iomaps for cases where two inodes are involved.
> > > > > > An example of the other scenario where offset may be different is 
> > > > > > file
> > > > > > comparison for dedup: vfs_dedup_file_range_compare(). However, it 
> > > > > > would
> > > > > > need two inodes in iomap as well.
> > > > > > 
> > > > > Yes, it is reasonable.  Thanks for your explanation.
> > > > > 
> > > > > One more thing RFC:
> > > > > I'd like to add an end-io callback argument in ->dax_iomap_actor() to 
> > > > > update
> > > > > the metadata after one whole COW operation is completed.  The end-io 
> > > > > can
> > > > > also be called in ->iomap_end().  But one COW operation may call
> > > > > ->iomap_apply() many times, and so does the end-io.  Thus, I think it 
> > > > > would
> > > > > be nice to move it to the bottom of ->dax_iomap_actor(), called just 
> > > > > once in
> > > > > each COW operation.
> > > > 
> > > > I'm sorry but I don't follow what you suggest. One COW operation is a 
> > > > call
> > > > to dax_iomap_rw(), isn't it? That may call iomap_apply() several times,
> > > > each invocation calls ->iomap_begin(), ->actor() (dax_iomap_actor()),
> > > > ->iomap_end() once. So I don't see a difference between doing something 
> > > > in
> > > > ->actor() and ->iomap_end() (besides the passed arguments but that does 
> > > > not
> > > > seem to be your concern). So what do you exactly want to do?
> > > 
> > > Hi Jan,
> > > 
> > > Thanks for pointing out, and I'm sorry for my mistake.  It's
> > > ->dax_iomap_rw(), not ->dax_iomap_actor().
> > > 
> > > I want to call the callback function at the end of ->dax_iomap_rw().
> > > 
> > > Like this:
> > > dax_iomap_rw(..., callback) {
> > > 
> > >  ...
> > >  while (...) {
> > >  iomap_apply(...);
> > >  }
> > > 
> > >  if (callback != null) {
> > >  callback();
> > >  }
> > >  return ...;
> > > }
> > 
> > Why does this need to be in dax_iomap_rw()?
> > 
> > We already do post-dax_iomap_rw() "io-end callbacks" directly in
> > xfs_file_dax_write() to update the file size
> 
> Yes, but we also need to call ->xfs_reflink_end_cow() after a COW operation.
> And an is-cow flag(from iomap) is also needed to determine if we call it.  I
> think it would be better to put this into ->dax_iomap_rw() as a callback
> function.

Sort of like how iomap_dio_rw takes a write endio function?
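
(For the record, a rough sketch of that shape -- the endio typedef and
the extra dax_iomap_rw() parameter here are hypothetical, modeled
loosely on iomap_dio_rw():)

	typedef int (*dax_endio_t)(struct kiocb *iocb, ssize_t done, int error);

	ssize_t
	dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
		     const struct iomap_ops *ops, dax_endio_t endio)
	{
		struct inode *inode = file_inode(iocb->ki_filp);
		unsigned flags = iov_iter_rw(iter) == WRITE ? IOMAP_WRITE : 0;
		loff_t pos = iocb->ki_pos, done = 0, ret = 0;

		while (iov_iter_count(iter)) {
			ret = iomap_apply(inode, pos, iov_iter_count(iter),
					  flags, ops, iter, dax_iomap_actor);
			if (ret <= 0)
				break;
			pos += ret;
			done += ret;
		}

		/* called once per write, e.g. to run xfs_reflink_end_cow() */
		if (endio)
			ret = endio(iocb, done, ret < 0 ? ret : 0);

		iocb->ki_pos += done;
		return done ? done : ret;
	}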

--D

> So sorry for my poor expression.
> 
> > 
> > Cheers,
> > 
> > Dave.
> > 
> 
> -- 
> Thanks,
> Shiyang Ruan.
> 
> 


Re: [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults

2019-05-22 Thread Darrick J. Wong
On Wed, May 22, 2019 at 02:11:39PM -0500, Goldwyn Rodrigues wrote:
> On 10:46 21/05, Darrick J. Wong wrote:
> > On Mon, Apr 29, 2019 at 12:26:39PM -0500, Goldwyn Rodrigues wrote:
> > > From: Goldwyn Rodrigues 
> > > 
> > > Change dax_iomap_pfn to return the address as well in order to
> > > use it for performing a memcpy in case the type is IOMAP_DAX_COW.
> > > We don't handle PMD because btrfs does not support hugepages.
> > > 
> > > Question:
> > > The sequence of bdev_dax_pgoff() and dax_direct_access() is
> > > used multiple times to calculate address and pfn's. Would it make
> > > sense to call it while calculating address as well to reduce code?
> > > 
> > > Signed-off-by: Goldwyn Rodrigues 
> > > ---
> > >  fs/dax.c | 19 +++
> > >  1 file changed, 15 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 610bfa861a28..718b1632a39d 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -984,7 +984,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, 
> > > loff_t pos)
> > >  }
> > >  
> > >  static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
> > > -  pfn_t *pfnp)
> > > +  pfn_t *pfnp, void **addr)
> > >  {
> > >   const sector_t sector = dax_iomap_sector(iomap, pos);
> > >   pgoff_t pgoff;
> > > @@ -996,7 +996,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t 
> > > pos, size_t size,
> > >   return rc;
> > >   id = dax_read_lock();
> > >   length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
> > > -NULL, pfnp);
> > > +addr, pfnp);
> > >   if (length < 0) {
> > >   rc = length;
> > >   goto out;
> > > @@ -1286,6 +1286,7 @@ static vm_fault_t dax_iomap_pte_fault(struct 
> > > vm_fault *vmf, pfn_t *pfnp,
> > >   XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
> > >   struct inode *inode = mapping->host;
> > >   unsigned long vaddr = vmf->address;
> > > + void *addr;
> > >   loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
> > >   struct iomap iomap = { 0 };
> > 
> > Ugh, I had forgotten that fs/dax.c open-codes iomap_apply, probably
> > because the actor returns vm_fault_t, not bytes copied.  I guess that
> > makes it a tiny bit more complicated to pass in two (struct iomap *) to
> > the iomap_begin function...
> 
> I am not sure I understand this. We do not use iomap_apply() in
> the fault path: dax_iomap_pte_fault(). We just use iomap_begin()
> and iomap_end(). So, why can we not implement your idea of using two
> iomaps?

Oh, sorry, I wasn't trying to say that calling ->iomap_begin made it
*impossible* to implement.  I was merely complaining about the increased
maintenance burden that results from open coding -- now there are three
places where we have to change a struct iomap declaration, not one
(iomap_apply) as I had originally thought.

> What does open-coding iomap-apply mean?

Any function that calls (1) ->iomap_begin; (2) performs an action on the
returned iomap; and (3) then calls calling ->iomap_end.  That's what
iomap_apply() does.
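
(Schematically -- simplified from fs/iomap.c of this era, eliding the
trimming of @length against the returned extent:)

	loff_t
	iomap_apply(struct inode *inode, loff_t pos, loff_t length,
		    unsigned flags, const struct iomap_ops *ops, void *data,
		    iomap_actor_t actor)
	{
		struct iomap iomap = { 0 };
		loff_t written = 0, ret;

		ret = ops->iomap_begin(inode, pos, length, flags, &iomap); /* (1) */
		if (ret)
			return ret;

		written = actor(inode, pos, length, data, &iomap);	   /* (2) */

		if (ops->iomap_end)					   /* (3) */
			ret = ops->iomap_end(inode, pos, length,
					     written > 0 ? written : 0,
					     flags, &iomap);

		return written ? written : ret;
	}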

Really I'm just being maintainer-cranky.  Ignore me for now. :)

--D

> 
> -- 
> Goldwyn


Re: [PATCH 13/18] fs: dedup file range to use a compare function

2019-05-21 Thread Darrick J. Wong
On Mon, Apr 29, 2019 at 12:26:44PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> With dax we cannot deal with readpage() etc. So, we create a
> function callback to perform the file data comparison and pass
> it to generic_remap_file_range_prep() so it can use iomap-based
> functions.
> 
> This may not be the best way to solve this. Suggestions welcome.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/ctree.h |  9 
>  fs/btrfs/dax.c   |  8 +++
>  fs/btrfs/ioctl.c | 11 +++--
>  fs/dax.c | 65 
>  fs/ocfs2/file.c  |  2 +-
>  fs/read_write.c  | 11 +
>  fs/xfs/xfs_reflink.c |  2 +-
>  include/linux/dax.h  |  4 
>  include/linux/fs.h   |  8 ++-
>  9 files changed, 110 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2b7bdabb44f8..d3d044125619 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3803,11 +3803,20 @@ int btree_readahead_hook(struct extent_buffer *eb, 
> int err);
>  ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
>  ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
>  vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
> +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
> + struct inode *dest, loff_t destoff, loff_t len,
> + bool *is_same);
>  #else
>  static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct 
> iov_iter *from)
>  {
>   return 0;
>  }
> +static inline int btrfs_dax_file_range_compare(struct inode *src, loff_t 
> srcoff,
> + struct inode *dest, loff_t destoff, loff_t len,
> + bool *is_same)
> +{
> + return 0;
> +}
>  #endif /* CONFIG_FS_DAX */
>  
>  static inline int is_fstree(u64 rootid)
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> index de957d681e16..af64696a5337 100644
> --- a/fs/btrfs/dax.c
> +++ b/fs/btrfs/dax.c
> @@ -227,4 +227,12 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
>  
>   return ret;
>  }
> +
> +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
> + struct inode *dest, loff_t destoff, loff_t len,
> + bool *is_same)
> +{
> + return dax_file_range_compare(src, srcoff, dest, destoff, len,
> +   is_same, _iomap_ops);
> +}
>  #endif /* CONFIG_FS_DAX */
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 0138119cd9a3..5ebb52848d5a 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -3939,6 +3939,7 @@ static int btrfs_remap_file_range_prep(struct file 
> *file_in, loff_t pos_in,
>   bool same_inode = inode_out == inode_in;
>   u64 wb_len;
>   int ret;
> + compare_range_t cmp;
>  
>   if (!(remap_flags & REMAP_FILE_DEDUP)) {
>   struct btrfs_root *root_out = BTRFS_I(inode_out)->root;
> @@ -4000,8 +4001,14 @@ static int btrfs_remap_file_range_prep(struct file 
> *file_in, loff_t pos_in,
>   if (ret < 0)
>   goto out_unlock;
>  
> - ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> - len, remap_flags);
> + if (IS_DAX(file_inode(file_in)) && IS_DAX(file_inode(file_out)))

If we're moving towards a world where IS_DAX is a per-file condition, I
think this is going to need quite a bit more work to support doing
mixed-mode comparisons.

That, I think, could involve reworking vfs_dedupe_file_range_compare to
take a pair of (inode, iomap_ops) so that we can use the iomap
information to skip holes, avoid reading pages for uncached unwritten
ranges, etc.

TBH that sounds like a whole series on its own, so maybe we just want to
say no dedupe for now unless both files are in the page cache or both
files are DAX?
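
(Sketch of the signature change I mean -- hypothetical; the iomap-aware
compare loop itself is the hard part:)

	/*
	 * Each side brings its own iomap_ops; NULL could mean "use the
	 * page cache path" for a non-DAX file.
	 */
	int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
					  const struct iomap_ops *src_ops,
					  struct inode *dest, loff_t destoff,
					  const struct iomap_ops *dest_ops,
					  loff_t len, bool *is_same);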

--D

> + cmp = btrfs_dax_file_range_compare;
> + else
> + cmp = vfs_dedupe_file_range_compare;
> +
> + ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
> + pos_out, len, remap_flags, cmp);
> +
>   if (ret < 0 || *len == 0)
>   goto out_unlock;
>  
> diff --git a/fs/dax.c b/fs/dax.c
> index 07e8ff20161d..fa9ccbad7c03 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -39,6 +39,8 @@
>  #define CREATE_TRACE_POINTS
>  #include 
>  
> +#define MIN(a, b) (((a) < (b)) ? (a) : (b))
> +
>  static inline unsigned int pe_order(enum page_entry_size pe_size)
>  {
>   if (pe_size == PE_SIZE_PTE)
> @@ -1795,3 +1797,66 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>   return dax_insert_pfn_mkwrite(vmf, pfn, order);
>  }
>  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> +
> +static inline void *iomap_address(struct iomap *iomap, loff_t off, loff_t 
> len)
> +{
> + loff_t start;
> + void *addr;
> + start = (get_start_sect(iomap->bdev) << 9) + iomap->addr +
> + (off - iomap->offset);
> + dax_direct_access(iomap->dax_dev, 

Re: [PATCH 01/18] btrfs: create a mount option for dax

2019-05-21 Thread Darrick J. Wong
[add Ted to the thread]

On Mon, Apr 29, 2019 at 12:26:32PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> This sets S_DAX in inode->i_flags, which can be used with
> IS_DAX().
> 
> The dax option is restricted to non multi-device mounts.
> dax interacts with the device directly instead of using bio, so
> all bio-hooks which we use for multi-device cannot be performed
> here. While regular read/writes could be manipulated with
> RAID0/1, mmap() is still an issue.
> 
> Auto-setting free space tree, because dealing with free space
> inode (specifically readpages) is a nightmare.
> Auto-setting nodatasum because we don't get callback for writing
> checksums after mmap()s.
> Deny compression because it does not go with direct I/O.
> 
> Store the dax_device in fs_info which will be used in iomap code.
> 
> I am aware of the push to directory-based flags for dax. Until that
> code is in the kernel, we will work with mount flags.

Hmm.  This patchset was sent before LSFMM, and I've heard[1] that the
discussion there yielded some progress on how to move forward with the
user interface.  I've gotten the impression that means no new dax mount
options; a persistent flag that can be inherited by new files; and some
other means for userspace to check if writethrough worked.

However, the LWN article says Ted planned to summarize for fsdevel so
let's table this part until he does that.  Ted? :)

--D

[1] https://lwn.net/SubscriberLink/787973/ad85537bf8747e90/

> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/ctree.h   |  2 ++
>  fs/btrfs/disk-io.c |  4 
>  fs/btrfs/ioctl.c   |  5 -
>  fs/btrfs/super.c   | 30 ++
>  4 files changed, 40 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index b3642367a595..8ca1c0d120f4 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1067,6 +1067,7 @@ struct btrfs_fs_info {
>   u32 metadata_ratio;
>  
>   void *bdev_holder;
> + struct dax_device *dax_dev;
>  
>   /* private scrub information */
>   struct mutex scrub_lock;
> @@ -1442,6 +1443,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct 
> btrfs_fs_info *info)
>  #define BTRFS_MOUNT_FREE_SPACE_TREE  (1 << 26)
>  #define BTRFS_MOUNT_NOLOGREPLAY  (1 << 27)
>  #define BTRFS_MOUNT_REF_VERIFY   (1 << 28)
> +#define BTRFS_MOUNT_DAX  (1 << 29)
>  
>  #define BTRFS_DEFAULT_COMMIT_INTERVAL(30)
>  #define BTRFS_DEFAULT_MAX_INLINE (2048)
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 6fe9197f6ee4..2bbb63b2fcff 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -2805,6 +2806,8 @@ int open_ctree(struct super_block *sb,
>   goto fail_alloc;
>   }
>  
> + fs_info->dax_dev = fs_dax_get_by_bdev(fs_devices->latest_bdev);
> +
>   /*
>* We want to check superblock checksum, the type is stored inside.
>* Pass the whole disk block of size BTRFS_SUPER_INFO_SIZE (4k).
> @@ -4043,6 +4046,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
>  #endif
>  
>   btrfs_close_devices(fs_info->fs_devices);
> + fs_put_dax(fs_info->dax_dev);
>   btrfs_mapping_tree_free(&fs_info->mapping_tree);
>  
>   percpu_counter_destroy(&fs_info->dirty_metadata_bytes);
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index cd4e693406a0..0138119cd9a3 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode 
> *inode)
>   if (binode->flags & BTRFS_INODE_DIRSYNC)
>   new_fl |= S_DIRSYNC;
>  
> + if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && 
> S_ISREG(inode->i_mode))
> + new_fl |= S_DAX;
> +
>   set_mask_bits(&inode->i_flags,
> -   S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
> +   S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | 
> S_DAX,
> new_fl);
>  }
>  
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 120e4340792a..3b85e61e5182 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -326,6 +326,7 @@ enum {
>   Opt_treelog, Opt_notreelog,
>   Opt_usebackuproot,
>   Opt_user_subvol_rm_allowed,
> + Opt_dax,
>  
>   /* Deprecated options */
>   Opt_alloc_start,
> @@ -393,6 +394,7 @@ static const match_table_t tokens = {
>   {Opt_notreelog, "notreelog"},
>   {Opt_usebackuproot, "usebackuproot"},
>   {Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
> + {Opt_dax, "dax"},
>  
>   /* Deprecated options */
>   {Opt_alloc_start, "alloc_start=%s"},
> @@ -745,6 +747,32 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
> *options,
>   case Opt_user_subvol_rm_allowed:
>   btrfs_set_opt(info->mount_opt, 

Re: [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults

2019-05-21 Thread Darrick J. Wong
On Mon, Apr 29, 2019 at 12:26:39PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Change dax_iomap_pfn to return the address as well in order to
> use it for performing a memcpy in case the type is IOMAP_DAX_COW.
> We don't handle PMD because btrfs does not support hugepages.
> 
> Question:
> The sequence of bdev_dax_pgoff() and dax_direct_access() is
> used multiple times to calculate address and pfn's. Would it make
> sense to call it while calculating address as well to reduce code?
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/dax.c | 19 +++
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 610bfa861a28..718b1632a39d 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -984,7 +984,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, 
> loff_t pos)
>  }
>  
>  static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
> -  pfn_t *pfnp)
> +  pfn_t *pfnp, void **addr)
>  {
>   const sector_t sector = dax_iomap_sector(iomap, pos);
>   pgoff_t pgoff;
> @@ -996,7 +996,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, 
> size_t size,
>   return rc;
>   id = dax_read_lock();
>   length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
> -NULL, pfnp);
> +addr, pfnp);
>   if (length < 0) {
>   rc = length;
>   goto out;
> @@ -1286,6 +1286,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
>   struct inode *inode = mapping->host;
>   unsigned long vaddr = vmf->address;
> + void *addr;
>   loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
>   struct iomap iomap = { 0 };

Ugh, I had forgotten that fs/dax.c open-codes iomap_apply, probably
because the actor returns vm_fault_t, not bytes copied.  I guess that
makes it a tiny bit more complicated to pass in two (struct iomap *) to
the iomap_begin function...

>   unsigned flags = IOMAP_FAULT;
> @@ -1375,16 +1376,26 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   sync = dax_fault_is_synchronous(flags, vma, );
>  
>   switch (iomap.type) {
> + case IOMAP_DAX_COW:
>   case IOMAP_MAPPED:
>   if (iomap.flags & IOMAP_F_NEW) {
>   count_vm_event(PGMAJFAULT);
>   count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
>   major = VM_FAULT_MAJOR;
>   }
> > > - error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn);
> > > + error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn, &addr);
>   if (error < 0)
>   goto error_finish_iomap;
>  
> + if (iomap.type == IOMAP_DAX_COW) {
> + if (iomap.inline_data) {
> + error = memcpy_mcsafe(addr, iomap.inline_data,
> +   PAGE_SIZE);
> + if (error < 0)
> + goto error_finish_iomap;
> + } else
> + memset(addr, 0, PAGE_SIZE);

This memcpy_mcsafe/memset chunk shows up a lot in this series.  Maybe it
should be a static inline within dax.c?
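
Something like this, say (hypothetical name, following this patch's
error convention for memcpy_mcsafe()):

	/* Copy from @saddr if a source exists, else zero-fill @daddr. */
	static inline int dax_memcpy_or_zero(void *daddr, void *saddr,
					     size_t len)
	{
		if (saddr)
			return memcpy_mcsafe(daddr, saddr, len);
		memset(daddr, 0, len);
		return 0;
	}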

--D

> + }
>   entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
>0, write && !sync);
>  
> @@ -1597,7 +1608,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>  
>   switch (iomap.type) {
>   case IOMAP_MAPPED:
> - error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn);
> + error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn, NULL);
>   if (error < 0)
>   goto finish_iomap;
>  
> -- 
> 2.16.4
> 


Re: [PATCH 10/18] dax: replace mmap entry in case of CoW

2019-05-21 Thread Darrick J. Wong
On Mon, Apr 29, 2019 at 12:26:41PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> We replace the existing entry with the newly allocated one
> in case of CoW. Also, we mark the entry as PAGECACHE_TAG_TOWRITE
> so writeback marks this entry as write-protected. This
> helps us with snapshots, so new write page faults after a snapshot
> trigger a CoW.
> 
> btrfs does not support hugepages so we don't handle PMD.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/dax.c | 36 
>  1 file changed, 28 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 718b1632a39d..07e8ff20161d 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -700,6 +700,9 @@ static int copy_user_dax(struct block_device *bdev, 
> struct dax_device *dax_dev,
>   return 0;
>  }
>  
> +#define DAX_IF_DIRTY (1ULL << 0)
> +#define DAX_IF_COW   (1ULL << 1)
> +
>  /*
>   * By this point grab_mapping_entry() has ensured that we have a locked entry
>   * of the appropriate size so we don't have to worry about downgrading PMDs 
> to
> @@ -709,14 +712,17 @@ static int copy_user_dax(struct block_device *bdev, 
> struct dax_device *dax_dev,
>   */
>  static void *dax_insert_entry(struct xa_state *xas,
>   struct address_space *mapping, struct vm_fault *vmf,
> - void *entry, pfn_t pfn, unsigned long flags, bool dirty)
> + void *entry, pfn_t pfn, unsigned long flags,
> + unsigned long insert_flags)

I think unsigned int would have sufficed here.

>  {
>   void *new_entry = dax_make_entry(pfn, flags);
> + bool dirty = insert_flags & DAX_IF_DIRTY;
> + bool cow = insert_flags & DAX_IF_COW;
>  
>   if (dirty)
>   __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> - if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
> + if (cow || (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE))) {
>   unsigned long index = xas->xa_index;
>   /* we are replacing a zero page with block mapping */
>   if (dax_is_pmd_entry(entry))
> @@ -728,12 +734,12 @@ static void *dax_insert_entry(struct xa_state *xas,
>  
>   xas_reset(xas);
>   xas_lock_irq(xas);
> - if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
> + if (cow || (dax_entry_size(entry) != dax_entry_size(new_entry))) {
>   dax_disassociate_entry(entry, mapping, false);
>   dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address);
>   }
>  
> - if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
> + if (cow || dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
>   /*
>* Only swap our new entry into the page cache if the current
>* entry is a zero page or an empty entry.  If a normal PTE or
> @@ -753,6 +759,9 @@ static void *dax_insert_entry(struct xa_state *xas,
>   if (dirty)
>   xas_set_mark(xas, PAGECACHE_TAG_DIRTY);
>  
> + if (cow)
> + xas_set_mark(xas, PAGECACHE_TAG_TOWRITE);
> +
>   xas_unlock_irq(xas);
>   return entry;
>  }
> @@ -1032,7 +1041,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas,
>   vm_fault_t ret;
>  
>   *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> - DAX_ZERO_PAGE, false);
> + DAX_ZERO_PAGE, 0);
>  
>   ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
>   trace_dax_load_hole(inode, vmf, ret);
> @@ -1296,6 +1305,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   vm_fault_t ret = 0;
>   void *entry;
>   pfn_t pfn;
> + unsigned long insert_flags = 0;
>  
>   trace_dax_pte_fault(inode, vmf, ret);
>   /*
> @@ -1357,6 +1367,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   error = copy_user_dax(iomap.bdev, iomap.dax_dev,
>   sector, PAGE_SIZE, vmf->cow_page, 
> vaddr);
>   break;
> + case IOMAP_DAX_COW:
> + /* Should not be setting this - fallthrough */
>   default:
>   WARN_ON_ONCE(1);
>   error = -EIO;
> @@ -1377,6 +1389,8 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>  
>   switch (iomap.type) {
>   case IOMAP_DAX_COW:
> + insert_flags |= DAX_IF_COW;
> + /* fallthrough */
>   case IOMAP_MAPPED:
>   if (iomap.flags & IOMAP_F_NEW) {
>   count_vm_event(PGMAJFAULT);
> @@ -1396,8 +1410,10 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   } else
>   memset(addr, 0, PAGE_SIZE);
>   }
> + if (write && !sync)
> + insert_flags |= DAX_IF_DIRTY;
>   entry = dax_insert_entry(&xas, 

Re: [PATCH 14/18] dax: memcpy before zeroing range

2019-05-21 Thread Darrick J. Wong
On Mon, Apr 29, 2019 at 12:26:45PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> However, this needed more iomap fields, so it was easier
> to pass the iomap and compute inside the function rather
> than passing a lot of arguments.
> 
> Note, there is subtle difference between iomap_sector and
> dax_iomap_sector(). Can we replace dax_iomap_sector with
> iomap_sector()? It would need pos & PAGE_MASK though or else
> bdev_dax_pgoff() return -EINVAL.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/dax.c  | 17 -
>  fs/iomap.c|  9 +
>  include/linux/dax.h   | 11 +--
>  include/linux/iomap.h |  6 ++
>  4 files changed, 24 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index fa9ccbad7c03..82a08b0eec23 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1063,11 +1063,16 @@ static bool dax_range_is_aligned(struct block_device 
> *bdev,
>   return true;
>  }
>  
> -int __dax_zero_page_range(struct block_device *bdev,
> - struct dax_device *dax_dev, sector_t sector,
> - unsigned int offset, unsigned int size)
> +int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
> +   unsigned int offset, unsigned int size)
>  {
> - if (dax_range_is_aligned(bdev, offset, size)) {
> + sector_t sector = dax_iomap_sector(iomap, pos & PAGE_MASK);
> + struct block_device *bdev = iomap->bdev;
> + struct dax_device *dax_dev = iomap->dax_dev;
> + int ret = 0;
> +
> + if (!(iomap->type == IOMAP_DAX_COW) &&
> + dax_range_is_aligned(bdev, offset, size)) {
>   sector_t start_sector = sector + (offset >> 9);
>  
>   return blkdev_issue_zeroout(bdev, start_sector,
> @@ -1087,11 +1092,13 @@ int __dax_zero_page_range(struct block_device *bdev,
>   dax_read_unlock(id);
>   return rc;
>   }
> + if (iomap->type == IOMAP_DAX_COW)
> + ret = memcpy_mcsafe(kaddr, iomap->inline_data, offset);

If the memcpy fails, does it make sense to keep going?

>   memset(kaddr + offset, 0, size);

Is it ever the case that offset + size isn't the end of the page?  If
so, then don't we need a second memcpy_mcsafe to handle that too?

>   dax_flush(dax_dev, kaddr + offset, size);
>   dax_read_unlock(id);
>   }
> - return 0;
> + return ret;
>  }
>  EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>  
> diff --git a/fs/iomap.c b/fs/iomap.c
> index abdd18e404f8..90698c854883 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -98,12 +98,6 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t 
> length, unsigned flags,
>   return written ? written : ret;
>  }
>  
> -static sector_t
> -iomap_sector(struct iomap *iomap, loff_t pos)
> -{
> - return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
> -}
> -
>  static struct iomap_page *
>  iomap_page_create(struct inode *inode, struct page *page)
>  {
> @@ -990,8 +984,7 @@ static int iomap_zero(struct inode *inode, loff_t pos, 
> unsigned offset,
>  static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
>   struct iomap *iomap)
>  {
> - return __dax_zero_page_range(iomap->bdev, iomap->dax_dev,
> - iomap_sector(iomap, pos & PAGE_MASK), offset, bytes);
> + return __dax_zero_page_range(iomap, pos, offset, bytes);
>  }
>  
>  static loff_t
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 1370d39c91b6..c469d9ff54b4 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -9,6 +9,7 @@
>  
>  typedef unsigned long dax_entry_t;
>  
> +struct iomap;
>  struct iomap_ops;
>  struct dax_device;
>  struct dax_operations {
> @@ -163,13 +164,11 @@ int dax_file_range_compare(struct inode *src, loff_t 
> srcoff,
>  const struct iomap_ops *ops);
>  
>  #ifdef CONFIG_FS_DAX
> -int __dax_zero_page_range(struct block_device *bdev,
> - struct dax_device *dax_dev, sector_t sector,
> - unsigned int offset, unsigned int length);
> +int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
> + unsigned int offset, unsigned int size);
>  #else
> -static inline int __dax_zero_page_range(struct block_device *bdev,
> - struct dax_device *dax_dev, sector_t sector,
> - unsigned int offset, unsigned int length)
> +static inline int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
> + unsigned int offset, unsigned int size)
>  {
>   return -ENXIO;
>  }
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 6e885c5a38a3..fcfce269db3e 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct address_space;
>  struct fiemap_extent_info;
> @@ -120,6 +121,11 @@ static inline struct iomap_page *to_iomap_page(struct 
> page *page)
>   

Re: [PATCH 07/18] btrfs: add dax write support

2019-05-21 Thread Darrick J. Wong
On Mon, Apr 29, 2019 at 12:26:38PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> IOMAP_DAX_COW informs the dax code to first copy the data edges
> which are not page-aligned before performing the write.
> The responsibility of checking whether the data edges are page-aligned
> lies with ->iomap_begin(), and the source address is
> stored in ->inline_data
> 
> A new struct btrfs_iomap is passed from iomap_begin() to
> iomap_end(), which contains all the accounting and locking information
> for CoW based writes.
> 
> For writing to a hole, iomap->inline_data is set to zero.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/ctree.h |   6 ++
>  fs/btrfs/dax.c   | 182 +--
>  fs/btrfs/file.c  |   4 +-
>  3 files changed, 185 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 1e3e758b83c2..eec01eb92f33 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3801,6 +3801,12 @@ int btree_readahead_hook(struct extent_buffer *eb, int 
> err);
>  #ifdef CONFIG_FS_DAX
>  /* dax.c */
>  ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
> +ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
> +#else
> +static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct 
> iov_iter *from)
> +{
> + return 0;
> +}
>  #endif /* CONFIG_FS_DAX */
>  
>  static inline int is_fstree(u64 rootid)
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> index bf3d46b0acb6..f5cc9bcdbf14 100644
> --- a/fs/btrfs/dax.c
> +++ b/fs/btrfs/dax.c
> @@ -9,30 +9,184 @@
>  #ifdef CONFIG_FS_DAX
>  #include 
>  #include 
> +#include 
>  #include "ctree.h"
>  #include "btrfs_inode.h"
>  
> +struct btrfs_iomap {
> + u64 start;
> + u64 end;
> + bool nocow;
> + struct extent_changeset *data_reserved;
> + struct extent_state *cached_state;
> +};
> +
> +static struct btrfs_iomap *btrfs_iomap_init(struct inode *inode,
> +  struct extent_map **em,
> +  loff_t pos, loff_t length)
> +{
> + int ret = 0;
> + struct extent_map *map = *em;
> + struct btrfs_iomap *bi;
> +
> + bi = kzalloc(sizeof(struct btrfs_iomap), GFP_NOFS);
> + if (!bi)
> + return ERR_PTR(-ENOMEM);
> +
> + bi->start = round_down(pos, PAGE_SIZE);
> + bi->end = PAGE_ALIGN(pos + length);
> +
> + /* Wait for existing ordered extents in range to finish */
> + btrfs_wait_ordered_range(inode, bi->start, bi->end - bi->start);
> +
> + lock_extent_bits(&BTRFS_I(inode)->io_tree, bi->start, bi->end, 
> &bi->cached_state);
> +
> + ret = btrfs_delalloc_reserve_space(inode, &bi->data_reserved,
> + bi->start, bi->end - bi->start);
> + if (ret) {
> + unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, 
> bi->end,
> + &bi->cached_state);
> + kfree(bi);
> + return ERR_PTR(ret);
> + }
> +
> + refcount_inc(&map->refs);
> + ret = btrfs_get_extent_map_write(em, NULL,
> + inode, bi->start, bi->end - bi->start, &bi->nocow);
> + if (ret) {
> + unlock_extent_cached(&BTRFS_I(inode)->io_tree, bi->start, 
> bi->end,
> + &bi->cached_state);
> + btrfs_delalloc_release_space(inode,
> + bi->data_reserved, bi->start,
> + bi->end - bi->start, true);
> + extent_changeset_free(bi->data_reserved);
> + kfree(bi);
> + return ERR_PTR(ret);
> + }
> + free_extent_map(map);
> + return bi;
> +}
> +
> +static void *dax_address(struct block_device *bdev, struct dax_device 
> *dax_dev,
> +  sector_t sector, loff_t pos, loff_t length)

This looks like a common function for fs/dax.c.

--D

> +{
> + size_t size = ALIGN(pos + length, PAGE_SIZE);
> + int id, ret = 0;
> + void *kaddr = NULL;
> + pgoff_t pgoff;
> + long map_len;
> +
> + id = dax_read_lock();
> +
> + ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
> + if (ret)
> + goto out;
> +
> + map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
> + &kaddr, NULL);
> + if (map_len < 0)
> + ret = map_len;
> +
> +out:
> + dax_read_unlock(id);
> + if (ret)
> + return ERR_PTR(ret);
> + return kaddr;
> +}
> +
>  static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
>   loff_t length, unsigned flags, struct iomap *iomap)
>  {
>   struct extent_map *em;
>   struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> + struct btrfs_iomap *bi = NULL;
> + unsigned offset = pos & (PAGE_SIZE - 1);
> + u64 srcblk = 0;
> + loff_t diff;
> +
>   em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
> +
> + iomap->type = IOMAP_MAPPED;
> +
> + if (flags & IOMAP_WRITE) {
> +

Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes

2019-05-21 Thread Darrick J. Wong
On Mon, Apr 29, 2019 at 12:26:35PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> IOMAP_DAX_COW is an iomap type which performs a copy of the
> edges of the data while performing a write if start/end are
> not page-aligned. The source address is expected in
> iomap->inline_data.
> 
> dax_copy_edges() is a helper function that performs a copy from
> one part of the device to another for data not page aligned.
> If iomap->inline_data is NULL, it memset's the area to zero.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/dax.c  | 46 +-
>  include/linux/iomap.h |  1 +
>  2 files changed, 46 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index e5e54da1715f..610bfa861a28 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1084,6 +1084,42 @@ int __dax_zero_page_range(struct block_device *bdev,
>  }
>  EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>  
> +/*
> + * dax_copy_edges - Copies the part of the pages not included in
> + *   the write, but required for CoW because
> + *   offset/offset+length are not page aligned.
> + */
> +static int dax_copy_edges(struct inode *inode, loff_t pos, loff_t length,
> +struct iomap *iomap, void *daddr)
> +{
> + unsigned offset = pos & (PAGE_SIZE - 1);
> + loff_t end = pos + length;
> + loff_t pg_end = round_up(end, PAGE_SIZE);
> + void *saddr = iomap->inline_data;
> + int ret = 0;
> + /*
> +  * Copy the first part of the page
> +  * Note: we pass offset as length
> +  */
> + if (offset) {
> + if (saddr)
> + ret = memcpy_mcsafe(daddr, saddr, offset);
> + else
> + memset(daddr, 0, offset);
> + }
> +
> + /* Copy the last part of the range */
> + if (end < pg_end) {
> + if (saddr)
> + ret = memcpy_mcsafe(daddr + offset + length,
> +saddr + offset + length, pg_end - end);
> + else
> + memset(daddr + offset + length, 0,
> + pg_end - end);
> + }
> + return ret;
> +}
> +
>  static loff_t
>  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>   struct iomap *iomap)
> @@ -1105,9 +1141,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   return iov_iter_zero(min(length, end - pos), iter);
>   }
>  
> - if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
> + if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED
> +  && iomap->type != IOMAP_DAX_COW))

I reiterate (from V3) that the && goes on the previous line...

if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED &&
 iomap->type != IOMAP_DAX_COW))

>   return -EIO;
>  
> +
>   /*
>* Write can allocate block for an area which has a hole page mapped
>* into page tables. We have to tear down these mappings so that data
> @@ -1144,6 +1182,12 @@ dax_iomap_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   break;
>   }
>  
> + if (iomap->type == IOMAP_DAX_COW) {
> + ret = dax_copy_edges(inode, pos, length, iomap, kaddr);
> + if (ret)
> + break;
> + }
> +
>   map_len = PFN_PHYS(map_len);
>   kaddr += offset;
>   map_len -= offset;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 0fefb5455bda..6e885c5a38a3 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -25,6 +25,7 @@ struct vm_fault;
>  #define IOMAP_MAPPED	0x03	/* blocks allocated at @addr */
>  #define IOMAP_UNWRITTEN	0x04	/* blocks allocated at @addr in unwritten state */
>  #define IOMAP_INLINE	0x05	/* data inline in the inode */

> +#define IOMAP_DAX_COW	0x06

DAX isn't going to be the only scenario where we need a way to
communicate to iomap actors the need to implement copy on write.

XFS also uses struct iomap to hand out file leases to clients.  The
lease code /currently/ doesn't support files with shared blocks (because
the only user is pNFS) but one could easily imagine a future where some
client wants to lease a file with shared blocks, in which case XFS will
want to convey the COW details to the lessee.

> +/* Copy data pointed by inline_data before write */

A month ago during the V3 patchset review, I wrote (possibly in an other
thread, sorry) about something that I'm putting my foot down about now
for the V4 patchset, which is the {re,ab}use of @inline_data for the
data source address.

We cannot use @inline_data to convey the source address.  @inline_data
(so far) is used to point to the in-memory representation of the storage
described by @addr.  For data writes, @addr is the location of the 
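
(In other words, if a source address really must travel in the iomap,
give it a field of its own instead of overloading @inline_data -- a
hypothetical sketch, with field comments approximating the real struct
of this era:)

	struct iomap {
		u64			addr;		/* disk offset of mapping */
		loff_t			offset;		/* file offset of mapping */
		u64			length;		/* length of mapping */
		u16			type;		/* type of mapping */
		u16			flags;		/* flags for mapping */
		struct block_device	*bdev;		/* block device for I/O */
		struct dax_device	*dax_dev;	/* dax_dev for dax operations */
		void			*inline_data;	/* IOMAP_INLINE data only */
		void			*cow_src_addr;	/* hypothetical: COW read source */
	};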

Re: [PATCH 03/18] btrfs: basic dax read

2019-05-21 Thread Darrick J. Wong
On Mon, Apr 29, 2019 at 12:26:34PM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Perform a basic read using iomap support. The btrfs_iomap_begin()
> finds the extent at the position and fills the iomap data
> structure with the values.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/Makefile |  1 +
>  fs/btrfs/ctree.h  |  5 +
>  fs/btrfs/dax.c| 49 +
>  fs/btrfs/file.c   | 11 ++-
>  4 files changed, 65 insertions(+), 1 deletion(-)
>  create mode 100644 fs/btrfs/dax.c
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index ca693dd554e9..1fa77b875ae9 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
> root-tree.o dir-item.o \
>  reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>  uuid-tree.o props.o free-space-tree.o tree-checker.o
>  
> +btrfs-$(CONFIG_FS_DAX) += dax.o
>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
>  btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 9512f49262dd..b7bbe5130a3b 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3795,6 +3795,11 @@ int btrfs_reada_wait(void *handle);
>  void btrfs_reada_detach(void *handle);
>  int btree_readahead_hook(struct extent_buffer *eb, int err);
>  
> +#ifdef CONFIG_FS_DAX
> +/* dax.c */
> +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
> +#endif /* CONFIG_FS_DAX */
> +
>  static inline int is_fstree(u64 rootid)
>  {
>   if (rootid == BTRFS_FS_TREE_OBJECTID ||
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> new file mode 100644
> index ..bf3d46b0acb6
> --- /dev/null
> +++ b/fs/btrfs/dax.c
> @@ -0,0 +1,49 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * DAX support for BTRFS
> + *
> + * Copyright (c) 2019  SUSE Linux
> + * Author: Goldwyn Rodrigues 
> + */
> +
> +#ifdef CONFIG_FS_DAX
> +#include 
> +#include 
> +#include "ctree.h"
> +#include "btrfs_inode.h"
> +
> +static int btrfs_iomap_begin(struct inode *inode, loff_t pos,
> + loff_t length, unsigned flags, struct iomap *iomap)
> +{
> + struct extent_map *em;
> + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, length, 0);
> + if (em->block_start == EXTENT_MAP_HOLE) {
> + iomap->type = IOMAP_HOLE;
> + return 0;

I'm not doing a rigorous review of the btrfs-specific pieces, but you're
required to fill out the other iomap fields for a read hole.

--D

> + }
> + iomap->type = IOMAP_MAPPED;
> + iomap->bdev = em->bdev;
> + iomap->dax_dev = fs_info->dax_dev;
> + iomap->offset = em->start;
> + iomap->length = em->len;
> + iomap->addr = em->block_start;
> + return 0;
> +}
> +
> +static const struct iomap_ops btrfs_iomap_ops = {
> + .iomap_begin= btrfs_iomap_begin,
> +};
> +
> +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
> +{
> + ssize_t ret;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + inode_lock_shared(inode);
> + ret = dax_iomap_rw(iocb, to, &btrfs_iomap_ops);
> + inode_unlock_shared(inode);
> +
> + return ret;
> +}
> +#endif /* CONFIG_FS_DAX */
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 34fe8a58b0e9..9194591f9eea 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -3288,9 +3288,18 @@ static int btrfs_file_open(struct inode *inode, struct 
> file *filp)
>   return generic_file_open(inode, filp);
>  }
>  
> +static ssize_t btrfs_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +#ifdef CONFIG_FS_DAX
> + if (IS_DAX(file_inode(iocb->ki_filp)))
> + return btrfs_file_dax_read(iocb, to);
> +#endif
> + return generic_file_read_iter(iocb, to);
> +}
> +
>  const struct file_operations btrfs_file_operations = {
>   .llseek = btrfs_file_llseek,
> - .read_iter  = generic_file_read_iter,
> + .read_iter  = btrfs_file_read_iter,
>   .splice_read= generic_file_splice_read,
>   .write_iter = btrfs_file_write_iter,
>   .mmap   = btrfs_file_mmap,
> -- 
> 2.16.4
> 


Re: [PATCH v7 6/6] xfs: disable map_sync for async flush

2019-05-07 Thread Darrick J. Wong
On Tue, May 07, 2019 at 08:37:01AM -0700, Dan Williams wrote:
> On Thu, Apr 25, 2019 at 10:03 PM Pankaj Gupta  wrote:
> >
> > Dont support 'MAP_SYNC' with non-DAX files and DAX files
> > with asynchronous dax_device. Virtio pmem provides
> > asynchronous host page cache flush mechanism. We don't
> > support 'MAP_SYNC' with virtio pmem and xfs.
> >
> > Signed-off-by: Pankaj Gupta 
> > ---
> >  fs/xfs/xfs_file.c | 9 ++---
> >  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> Darrick, does this look ok to take through the nvdimm tree?

Forgot about this, sorry. :/

> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index a7ceae90110e..f17652cca5ff 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -1203,11 +1203,14 @@ xfs_file_mmap(
> > struct file *filp,
> > struct vm_area_struct *vma)
> >  {
> > +   struct dax_device   *dax_dev;
> > +
> > +   dax_dev = xfs_find_daxdev_for_inode(file_inode(filp));
> > /*
> > -* We don't support synchronous mappings for non-DAX files. At least
> > -* until someone comes with a sensible use case.
> > +* We don't support synchronous mappings for non-DAX files and
> > +* for DAX files if underneath dax_device is not synchronous.
> >  */
> > -   if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> > +   if (!daxdev_mapping_supported(vma, dax_dev))
> > return -EOPNOTSUPP;

LGTM, and I'm fine with it going through nvdimm.  Nothing in
xfs-5.2-merge touches that function so it should be clean.

Reviewed-by: Darrick J. Wong 

--D

> >
> > file_accessed(filp);
> > --
> > 2.20.1
> >


Re: [PATCH v6 6/6] xfs: disable map_sync for async flush

2019-04-23 Thread Darrick J. Wong
On Wed, Apr 24, 2019 at 08:02:17AM +1000, Dave Chinner wrote:
> On Tue, Apr 23, 2019 at 01:36:12PM +0530, Pankaj Gupta wrote:
> > Dont support 'MAP_SYNC' with non-DAX files and DAX files
> > with asynchronous dax_device. Virtio pmem provides
> > asynchronous host page cache flush mechanism. We don't
> > support 'MAP_SYNC' with virtio pmem and xfs.
> > 
> > Signed-off-by: Pankaj Gupta 
> > ---
> >  fs/xfs/xfs_file.c | 10 ++
> >  1 file changed, 6 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 1f2e2845eb76..0e59be018511 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -1196,11 +1196,13 @@ xfs_file_mmap(
> > struct file *filp,
> > struct vm_area_struct *vma)
> >  {
> > -   /*
> > -* We don't support synchronous mappings for non-DAX files. At least
> > -* until someone comes with a sensible use case.
> > +   struct dax_device *dax_dev = xfs_find_daxdev_for_inode
> > +   (file_inode(filp));

tab separation here ^^^ and cut down the indent while you're at it, please:

struct dax_device   *dax_dev;

dax_dev = xfs_find_daxdev_for_inode(file_inode(filp));
if (!dax_is_frobbed(dax))
return -EMEWANTCOOKIE;

--D

> > +
> > +   /* We don't support synchronous mappings for non-DAX files and
> > +* for DAX files if underneath dax_device is not synchronous.
> >  */
> 
>   /*
>* This is the correct multi-line comment format. Please
>* update the patch to maintain the existing comment format.
>*/
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com


Re: [PATCH 08/18] dax: memcpy page in case of IOMAP_DAX_COW for mmap faults

2019-04-17 Thread Darrick J. Wong
On Tue, Apr 16, 2019 at 11:41:44AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Change dax_iomap_pfn to return the address as well in order to
> use it for performing a memcpy in case the type is IOMAP_DAX_COW.
> 
> Question:
> The sequence of bdev_dax_pgoff() and dax_direct_access() is
> used multiple times to calculate address and pfn's. Would it make
> sense to call it while calculating address as well to reduce code?
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/dax.c | 16 
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 4b4ac51fbd16..45fc2e18969a 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -983,7 +983,7 @@ static sector_t dax_iomap_sector(struct iomap *iomap, 
> loff_t pos)
>  }
>  
>  static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
> -  pfn_t *pfnp)
> +  pfn_t *pfnp, void **addr)
>  {
>   const sector_t sector = dax_iomap_sector(iomap, pos);
>   pgoff_t pgoff;
> @@ -995,7 +995,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, 
> size_t size,
>   return rc;
>   id = dax_read_lock();
>   length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
> -NULL, pfnp);
> +addr, pfnp);
>   if (length < 0) {
>   rc = length;
>   goto out;
> @@ -1280,6 +1280,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
>   struct inode *inode = mapping->host;
>   unsigned long vaddr = vmf->address;
> + void *addr;
>   loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
>   struct iomap iomap = { 0 };
>   unsigned flags = IOMAP_FAULT;
> @@ -1369,16 +1370,23 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   sync = dax_fault_is_synchronous(flags, vma, );
>  
>   switch (iomap.type) {
> + case IOMAP_DAX_COW:
>   case IOMAP_MAPPED:
>   if (iomap.flags & IOMAP_F_NEW) {
>   count_vm_event(PGMAJFAULT);
>   count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
>   major = VM_FAULT_MAJOR;
>   }
> - error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn);
> + error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn, &addr);
>   if (error < 0)
>   goto error_finish_iomap;
>  
> + if (iomap.type == IOMAP_DAX_COW) {
> + if (iomap.inline_data)
> + memcpy(addr, iomap.inline_data, PAGE_SIZE);

Same memcpy_mcsafe question from my reply to patch 4 applies here.

> + else
> + memset(addr, 0, PAGE_SIZE);
> + }
>   entry = dax_insert_entry(&xas, mapping, vmf, entry, pfn,
>0, write && !sync);
>  
> @@ -1577,7 +1585,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>  
>   switch (iomap.type) {
>   case IOMAP_MAPPED:
> - error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn);
> + error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn, NULL);

Same (unanswered) question from the v2 series -- doesn't a PMD fault
also require handling IOMAP_DAX_COW?

--D

>   if (error < 0)
>   goto finish_iomap;
>  
> -- 
> 2.16.4
> 


Re: [PATCH 04/18] dax: Introduce IOMAP_DAX_COW to CoW edges during writes

2019-04-17 Thread Darrick J. Wong
On Tue, Apr 16, 2019 at 11:41:40AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> IOMAP_DAX_COW is an iomap type which performs a copy of the
> edges of the data while performing a write if start/end are
> not page-aligned. The source address is expected in
> iomap->inline_data.
> 
> dax_copy_edges() is a helper functions performs a copy from
> one part of the device to another for data not page aligned.
> If iomap->inline_data is NULL, it memset's the area to zero.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/dax.c  | 41 -
>  include/linux/iomap.h |  1 +
>  2 files changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index ca0671d55aa6..4b4ac51fbd16 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1083,6 +1083,40 @@ int __dax_zero_page_range(struct block_device *bdev,
>  }
>  EXPORT_SYMBOL_GPL(__dax_zero_page_range);
>  
> +/*
> + * dax_copy_edges - Copies the part of the pages not included in
> + *   the write, but required for CoW because
> + *   offset/offset+length are not page aligned.
> + */
> +static void dax_copy_edges(struct inode *inode, loff_t pos, loff_t length,
> +struct iomap *iomap, void *daddr)
> +{
> + unsigned offset = pos & (PAGE_SIZE - 1);
> + loff_t end = pos + length;
> + loff_t pg_end = round_up(end, PAGE_SIZE);
> + void *saddr = iomap->inline_data;
> + /*
> +  * Copy the first part of the page
> +  * Note: we pass offset as length
> +  */
> + if (offset) {
> + if (saddr)
> + memcpy(daddr, saddr, offset);

I've been wondering, do we need memcpy_mcsafe here?

> + else
> + memset(daddr, 0, offset);

Or here?

(Or any of the other places we call memcpy/memset in this series...)

Because I think we'd prefer to return EIO on bad pmem over a machine
check.
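
For illustration, here's roughly what I mean for the first edge -- a
sketch only, assuming memcpy_mcsafe() returns bytes-not-copied and that
dax_copy_edges() is changed to return an int:

	if (offset) {
		if (saddr) {
			/* report pmem media errors as EIO */
			if (memcpy_mcsafe(daddr, saddr, offset))
				return -EIO;
		} else {
			memset(daddr, 0, offset);
		}
	}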

--D

> + }
> +
> + /* Copy the last part of the range */
> + if (end < pg_end) {
> + if (saddr)
> + memcpy(daddr + offset + length,
> +        saddr + offset + length, pg_end - end);
> + else
> + memset(daddr + offset + length, 0,
> + pg_end - end);
> + }
> +}
> +
>  static loff_t
>  dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>   struct iomap *iomap)
> @@ -1104,9 +1138,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, 
> loff_t length, void *data,
>   return iov_iter_zero(min(length, end - pos), iter);
>   }
>  
> - if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
> + if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED
> +  && iomap->type != IOMAP_DAX_COW))

Usually the '&&' goes on the first line, right?

>   return -EIO;
>  
> +
>   /*
>* Write can allocate block for an area which has a hole page mapped
>* into page tables. We have to tear down these mappings so that data
> @@ -1143,6 +1179,9 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t 
> length, void *data,
>   break;
>   }
>  
> + if (iomap->type == IOMAP_DAX_COW)
> + dax_copy_edges(inode, pos, length, iomap, kaddr);

No return value?  So the pmem copy never fails?
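
(i.e. if dax_copy_edges() returned 0 or -EIO, the caller could be -- a
sketch:

	if (iomap->type == IOMAP_DAX_COW) {
		ret = dax_copy_edges(inode, pos, length, iomap, kaddr);
		if (ret)
			break;
	}
)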

--D

> +
>   map_len = PFN_PHYS(map_len);
>   kaddr += offset;
>   map_len -= offset;
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 0fefb5455bda..6e885c5a38a3 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -25,6 +25,7 @@ struct vm_fault;
>  #define IOMAP_MAPPED 0x03 /* blocks allocated at @addr */
>  #define IOMAP_UNWRITTEN  0x04 /* blocks allocated at @addr in 
> unwritten state */
>  #define IOMAP_INLINE 0x05 /* data inline in the inode */
> +#define IOMAP_DAX_COW 0x06 /* Copy data pointed to by inline_data 
> before write */
>  
>  /*
>   * Flags for all iomap mappings:
> -- 
> 2.16.4
> 


Re: [PATCH 14/18] dax: memcpy before zeroing range

2019-04-17 Thread Darrick J. Wong
On Tue, Apr 16, 2019 at 11:41:50AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> However, this needed more iomap fields, so it was easier
> to pass the iomap and compute inside the function rather
> than passing a lot of arguments.
> 
> Note, there is a subtle difference between iomap_sector() and
> dax_iomap_sector(). Can we replace dax_iomap_sector() with
> iomap_sector()? It would need pos & PAGE_MASK though, or else
> bdev_dax_pgoff() returns -EINVAL.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/dax.c  | 14 ++
>  fs/iomap.c|  9 +
>  include/linux/dax.h   | 11 +--
>  include/linux/iomap.h |  6 ++
>  4 files changed, 22 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index abbe4a79f219..af94909640ea 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1055,11 +1055,15 @@ static bool dax_range_is_aligned(struct block_device 
> *bdev,
>   return true;
>  }
>  
> -int __dax_zero_page_range(struct block_device *bdev,
> - struct dax_device *dax_dev, sector_t sector,
> - unsigned int offset, unsigned int size)
> +int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
> +   unsigned int offset, unsigned int size)
>  {
> - if (dax_range_is_aligned(bdev, offset, size)) {
> + sector_t sector = dax_iomap_sector(iomap, pos & PAGE_MASK);
> + struct block_device *bdev = iomap->bdev;
> + struct dax_device *dax_dev = iomap->dax_dev;
> +
> + if (!(iomap->type == IOMAP_DAX_COW) &&
> + dax_range_is_aligned(bdev, offset, size)) {
>   sector_t start_sector = sector + (offset >> 9);
>  
>   return blkdev_issue_zeroout(bdev, start_sector,
> @@ -1079,6 +1083,8 @@ int __dax_zero_page_range(struct block_device *bdev,
>   dax_read_unlock(id);
>   return rc;
>   }
> + if (iomap->type == IOMAP_DAX_COW)
> + memcpy(iomap->inline_data, kaddr, offset);

I'm confused, why are we copying into the source page before zeroing
some part of the dax device?

>   memset(kaddr + offset, 0, size);
>   dax_flush(dax_dev, kaddr + offset, size);
>   dax_read_unlock(id);
> diff --git a/fs/iomap.c b/fs/iomap.c
> index abdd18e404f8..90698c854883 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -98,12 +98,6 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t 
> length, unsigned flags,
>   return written ? written : ret;
>  }
>  
> -static sector_t
> -iomap_sector(struct iomap *iomap, loff_t pos)
> -{
> - return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;
> -}
> -
>  static struct iomap_page *
>  iomap_page_create(struct inode *inode, struct page *page)
>  {
> @@ -990,8 +984,7 @@ static int iomap_zero(struct inode *inode, loff_t pos, 
> unsigned offset,
>  static int iomap_dax_zero(loff_t pos, unsigned offset, unsigned bytes,
>   struct iomap *iomap)
>  {
> - return __dax_zero_page_range(iomap->bdev, iomap->dax_dev,
> - iomap_sector(iomap, pos & PAGE_MASK), offset, bytes);
> + return __dax_zero_page_range(iomap, pos, offset, bytes);
>  }
>  
>  static loff_t
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index a11bc7b1f526..892c478d7073 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -9,6 +9,7 @@
>  
>  typedef unsigned long dax_entry_t;
>  
> +struct iomap;
>  struct iomap_ops;
>  struct dax_device;
>  struct dax_operations {
> @@ -161,13 +162,11 @@ int dax_file_range_compare(struct inode *src, loff_t 
> srcoff, struct inode *dest,
>  loff_t destoff, loff_t len, bool *is_same, const struct 
> iomap_ops *ops);
>  
>  #ifdef CONFIG_FS_DAX
> -int __dax_zero_page_range(struct block_device *bdev,
> - struct dax_device *dax_dev, sector_t sector,
> - unsigned int offset, unsigned int length);
> +int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
> + unsigned int offset, unsigned int size);
>  #else
> -static inline int __dax_zero_page_range(struct block_device *bdev,
> - struct dax_device *dax_dev, sector_t sector,
> - unsigned int offset, unsigned int length)
> +static inline int __dax_zero_page_range(struct iomap *iomap, loff_t pos,
> + unsigned int offset, unsigned int size)
>  {
>   return -ENXIO;
>  }
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 6e885c5a38a3..3a803566dea1 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -7,6 +7,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct address_space;
>  struct fiemap_extent_info;
> @@ -120,6 +121,11 @@ static inline struct iomap_page *to_iomap_page(struct 
> page *page)
>   return NULL;
>  }
>  
> +static inline sector_t iomap_sector(struct iomap *iomap, loff_t pos)
> +{
> + return (iomap->addr + pos - iomap->offset) >> SECTOR_SHIFT;

Why 

Re: [PATCH 13/18] fs: dedup file range to use a compare function

2019-04-17 Thread Darrick J. Wong
On Tue, Apr 16, 2019 at 11:41:49AM -0500, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> With dax we cannot deal with readpage() etc. So, we create a
> function callback to perform the file data comparison and pass
> it to generic_remap_file_range_prep() so it can use iomap-based
> functions.
> 
> This may not be the best way to solve this. Suggestions welcome.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/ctree.h |  9 
>  fs/btrfs/dax.c   |  7 +++
>  fs/btrfs/ioctl.c | 11 --
>  fs/dax.c | 58 
> 
>  fs/ocfs2/file.c  |  2 +-
>  fs/read_write.c  | 10 -
>  fs/xfs/xfs_reflink.c |  2 +-
>  include/linux/dax.h  |  2 ++
>  include/linux/fs.h   |  7 ++-
>  9 files changed, 98 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2b7bdabb44f8..d3d044125619 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3803,11 +3803,20 @@ int btree_readahead_hook(struct extent_buffer *eb, 
> int err);
>  ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
>  ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
>  vm_fault_t btrfs_dax_fault(struct vm_fault *vmf);
> +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
> + struct inode *dest, loff_t destoff, loff_t len,
> + bool *is_same);
>  #else
>  static inline ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct 
> iov_iter *from)
>  {
>   return 0;
>  }
> +static inline int btrfs_dax_file_range_compare(struct inode *src, loff_t 
> srcoff,
> + struct inode *dest, loff_t destoff, loff_t len,
> + bool *is_same)
> +{
> + return 0;
> +}
>  #endif /* CONFIG_FS_DAX */
>  
>  static inline int is_fstree(u64 rootid)
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> index de957d681e16..a29628b403b3 100644
> --- a/fs/btrfs/dax.c
> +++ b/fs/btrfs/dax.c
> @@ -227,4 +227,11 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
>  
>   return ret;
>  }
> +
> +int btrfs_dax_file_range_compare(struct inode *src, loff_t srcoff,
> + struct inode *dest, loff_t destoff, loff_t len,
> + bool *is_same)
> +{
> + return dax_file_range_compare(src, srcoff, dest, destoff, len, is_same, 
> &btrfs_iomap_ops);
> +}
>  #endif /* CONFIG_FS_DAX */
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 0138119cd9a3..cd590105bd78 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -4000,8 +4000,15 @@ static int btrfs_remap_file_range_prep(struct file 
> *file_in, loff_t pos_in,
>   if (ret < 0)
>   goto out_unlock;
>  
> - ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
> - len, remap_flags);
> + if (IS_DAX(file_inode(file_in)) && IS_DAX(file_inode(file_out)))
> + ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
> + pos_out, len, remap_flags,
> + btrfs_dax_file_range_compare);
> + else
> + ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
> + pos_out, len, remap_flags,
> + vfs_dedupe_file_range_compare);

I wonder if you could simply have a compare_range_t variable that you
can set to either the vfs and btrfs_dax compare functions, and then only
have to maintain a single generic_remap_file_range_prep callsite?
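
Something like this, perhaps -- a sketch, assuming compare_range_t is
the typedef this series adds for the compare callback:

	compare_range_t cmp = vfs_dedupe_file_range_compare;

	if (IS_DAX(file_inode(file_in)) && IS_DAX(file_inode(file_out)))
		cmp = btrfs_dax_file_range_compare;

	ret = generic_remap_file_range_prep(file_in, pos_in, file_out,
			pos_out, len, remap_flags, cmp);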

> +
>   if (ret < 0 || *len == 0)
>   goto out_unlock;
>  
> diff --git a/fs/dax.c b/fs/dax.c
> index d5100cbe8bd2..abbe4a79f219 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1759,3 +1759,61 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
>   return dax_insert_pfn_mkwrite(vmf, pfn, order);
>  }
>  EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
> +
> +
> +int dax_file_range_compare(struct inode *src, loff_t srcoff, struct inode 
> *dest,
> + loff_t destoff, loff_t len, bool *is_same, const struct 
> iomap_ops *ops)
> +{
> + void *saddr, *daddr;
> + struct iomap s_iomap = {0};
> + struct iomap d_iomap = {0};
> + loff_t dstart, sstart;
> + bool same = true;
> + loff_t cmp_len, l;
> + int id, ret = 0;
> +
> + id = dax_read_lock();
> + while (len) {
> + ret = ops->iomap_begin(src, srcoff, len, 0, &s_iomap);
> + if (ret < 0) {
> + if (ops->iomap_end)
> + ops->iomap_end(src, srcoff, len, ret, 0, 
> &s_iomap);
> + return ret;
> + }
> + cmp_len = len;
> + if (cmp_len > s_iomap.offset + s_iomap.length - srcoff)
> + cmp_len = s_iomap.offset + s_iomap.length - srcoff;
> +
> + ret = ops->iomap_begin(dest, destoff, cmp_len, 0, &d_iomap);
> + if (ret < 0) {
> + if (ops->iomap_end) 

Re: [RFC PATCH 0/4] xfs: add handle for reflink in dax

2019-04-16 Thread Darrick J. Wong
On Wed, Apr 18, 2019 at 09:27:11AM +0800, Shiyang Ruan wrote:
> In XFS (under fsdax mode), reflink did not work correctly because xfs
> iomap operators did not handle the inode with both reflink and dax flag.
> 
> This patchset aims to take care of this issue to make COW operation work
> correctly in XFS.
> 
> XFS uses iomap to do read/write/mmap operations:
>   vfs interface   xfs:
> iomap_begin();  --> xfs_iomap_begin();
> actor();        --> dax_iomap_actor() / mmap actor function
> iomap_end();    --> xfs_iomap_end();
> 
> In xfs_iomap_begin(), COW operation is detected but not told to actor
> function.  To resolve this, a new field 'src_addr' is added into
> 'struct iomap' to pass this COW info.  It means the start address of
> source blocks in a COW operation, for actor functions to copy data
> before writing.
> 
> In actor functions, the value of iomap->src_addr determines if it is a
> COW operation.  If it is, copy data from source blocks to destination
> blocks first, and then write user data.
> 
> After the COW operation, it is supposed to update the metadata of the
> inode.  Added in xfs_iomap_end().

How do the fs/iomap.c changes in your series compare with Goldwyn's
"btrfs dax support" series that he put out today?

Also, there are a few things missing:

1. A DAX-compatible file contents comparison function for the dedupe
ioctl.

2. Checks that we aren't trying to reflink or dedupe between S_DAX and
!S_DAX files (see the sketch below).

3. Do we need to make changes to the hairy
xfs_iolock_two_inodes_and_break_layout function to handle DAX?  Seeing
as it doesn't call xfs_break_dax_layouts...
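
For #2, I'd expect something along these lines in the remap prep path --
a sketch only; the exact placement and errno are debatable:

	/* Don't allow remapping between DAX and non-DAX files. */
	if (IS_DAX(inode_in) != IS_DAX(inode_out))
		return -EINVAL;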

--D

> 
> 
> Shiyang Ruan (4):
>   fs/iomap: Introduce src_addr for COW in fsdax mode.
>   fs/xfs: iomap: add handle for reflink in fsdax mode
>   fs/dax: copy source blocks before writing when COW
>   fs/xfs: iomap: update the extent list after a COW
> 
>  fs/dax.c  | 70 +++
>  fs/xfs/xfs_iomap.c| 23 +++---
>  include/linux/iomap.h |  4 +++
>  3 files changed, 93 insertions(+), 4 deletions(-)
> 
> -- 
> 2.17.0
> 
> 
> 


Re: [Qemu-devel] [PATCH v4 5/5] xfs: disable map_sync for async flush

2019-04-04 Thread Darrick J. Wong
On Thu, Apr 04, 2019 at 06:08:44AM -0400, Pankaj Gupta wrote:
> 
> > On Thu 04-04-19 05:09:10, Pankaj Gupta wrote:
> > > 
> > > > > On Thu, Apr 04, 2019 at 09:09:12AM +1100, Dave Chinner wrote:
> > > > > > On Wed, Apr 03, 2019 at 04:10:18PM +0530, Pankaj Gupta wrote:
> > > > > > > Virtio pmem provides an asynchronous host page cache flush
> > > > > > > mechanism. We don't support 'MAP_SYNC' with virtio pmem
> > > > > > > and xfs.
> > > > > > > 
> > > > > > > Signed-off-by: Pankaj Gupta 
> > > > > > > ---
> > > > > > >  fs/xfs/xfs_file.c | 8 
> > > > > > >  1 file changed, 8 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > > > > > index 1f2e2845eb76..dced2eb8c91a 100644
> > > > > > > --- a/fs/xfs/xfs_file.c
> > > > > > > +++ b/fs/xfs/xfs_file.c
> > > > > > > @@ -1203,6 +1203,14 @@ xfs_file_mmap(
> > > > > > >   if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> > > > > > >   return -EOPNOTSUPP;
> > > > > > >  
> > > > > > > + /* We don't support synchronous mappings with DAX files if
> > > > > > > +  * dax_device is not synchronous.
> > > > > > > +  */
> > > > > > > + if (IS_DAX(file_inode(filp)) && !dax_synchronous(
> > > > > > > + xfs_find_daxdev_for_inode(file_inode(filp))) &&
> > > > > > > + (vma->vm_flags & VM_SYNC))
> > > > > > > + return -EOPNOTSUPP;
> > > > > > > +
> > > > > > >   file_accessed(filp);
> > > > > > >   vma->vm_ops = &xfs_file_vm_ops;
> > > > > > >   if (IS_DAX(file_inode(filp)))
> > > > > > 
> > > > > > All this ad hoc IS_DAX conditional logic is getting pretty nasty.
> > > > > > 
> > > > > > xfs_file_mmap(
> > > > > > 
> > > > > > {
> > > > > > struct inode    *inode = file_inode(filp);
> > > > > > 
> > > > > > if (vma->vm_flags & VM_SYNC) {
> > > > > > if (!IS_DAX(inode))
> > > > > > return -EOPNOTSUPP;
> > > > > > if (!dax_synchronous(xfs_find_daxdev_for_inode(inode)))
> > > > > > return -EOPNOTSUPP;
> > > > > > }
> > > > > > 
> > > > > > file_accessed(filp);
> > > > > > vma->vm_ops = &xfs_file_vm_ops;
> > > > > > if (IS_DAX(inode))
> > > > > > vma->vm_flags |= VM_HUGEPAGE;
> > > > > > return 0;
> > > > > > }
> > > > > > 
> > > > > > 
> > > > > > Even better, factor out all the "MAP_SYNC supported" checks into a
> > > > > > helper so that the filesystem code just doesn't have to care about
> > > > > > the details of checking for DAX+MAP_SYNC support
> > > > > 
> > > > > Seconded, since ext4 has nearly the same flag validation logic.
> > > > 
> > > 
> > > Only issue with this I see is we need the helper function only for
> > > supported
> > > filesystems ext4 & xfs (right now). If I create the function in "fs.h" it
> > > will be compiled for every filesystem, even for those that don't need it.
> > > 
> > > Sample patch below, does below patch is near to what you have in mind?
> > 
> > So I would put the helper in include/linux/dax.h and have it like:
> > 
> > bool daxdev_mapping_supported(struct vm_area_struct *vma,

Should this be static inline if you're putting it in the header file?

A comment ought to be added to describe what this predicate function
does.

> >   struct dax_device *dax_dev)
> > {
> > if (!(vma->vm_flags & VM_SYNC))
> > return true;
> > if (!IS_DAX(file_inode(vma->vm_file)))
> > return false;
> > return dax_synchronous(dax_dev);
> > }
> 
> Sure. This is much better. I was also not sure what to name the helper 
> function.
> I will go ahead with this unless 'Dave' & 'Darrick' have anything to add.

Jan's approach (modulo that one comment) looks good to me.
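
FWIW the end result I'm picturing is something like this -- a sketch
folding both comments into Jan's version:

	/*
	 * Report whether a MAP_SYNC mapping of this file will actually be
	 * synchronous, i.e. whether flushing CPU caches is sufficient to
	 * persist stores to the backing device.
	 */
	static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
			struct dax_device *dax_dev)
	{
		if (!(vma->vm_flags & VM_SYNC))
			return true;
		if (!IS_DAX(file_inode(vma->vm_file)))
			return false;
		return dax_synchronous(dax_dev);
	}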

--D

> Thank you very much.
> 
> Best regards,
> Pankaj 
> 
> > 
> > Honza
> > > 
> > > =
> > > 
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 1f2e2845eb76..614995170cac 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -1196,12 +1196,17 @@ xfs_file_mmap(
> > > struct file *filp,
> > > struct vm_area_struct *vma)
> > >  {
> > > +   struct dax_device *dax_dev =
> > > xfs_find_daxdev_for_inode(file_inode(filp));
> > > +
> > > /*
> > > -* We don't support synchronous mappings for non-DAX files. At
> > > least
> > > -* until someone comes with a sensible use case.
> > > +* We don't support synchronous mappings for non-DAX files and
> > > +* for DAX files if underneath dax_device is not synchronous.
> > >  */
> > > -   if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> > > -   return -EOPNOTSUPP;
> > > +   if (vma->vm_flags & VM_SYNC) {
> > > +   int err = is_synchronous(filp, dax_dev);
> > > +   if (err)
> > > +   return err;
> > > +   }
> > >  
> > > 

Re: [PATCH v4 5/5] xfs: disable map_sync for async flush

2019-04-03 Thread Darrick J. Wong
On Thu, Apr 04, 2019 at 09:09:12AM +1100, Dave Chinner wrote:
> On Wed, Apr 03, 2019 at 04:10:18PM +0530, Pankaj Gupta wrote:
> > Virtio pmem provides an asynchronous host page cache flush
> > mechanism. We don't support 'MAP_SYNC' with virtio pmem 
> > and xfs.
> > 
> > Signed-off-by: Pankaj Gupta 
> > ---
> >  fs/xfs/xfs_file.c | 8 
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 1f2e2845eb76..dced2eb8c91a 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -1203,6 +1203,14 @@ xfs_file_mmap(
> > if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
> > return -EOPNOTSUPP;
> >  
> > +   /* We don't support synchronous mappings with DAX files if
> > +* dax_device is not synchronous.
> > +*/
> > +   if (IS_DAX(file_inode(filp)) && !dax_synchronous(
> > +   xfs_find_daxdev_for_inode(file_inode(filp))) &&
> > +   (vma->vm_flags & VM_SYNC))
> > +   return -EOPNOTSUPP;
> > +
> > file_accessed(filp);
> > vma->vm_ops = &xfs_file_vm_ops;
> > if (IS_DAX(file_inode(filp)))
> 
> All this ad hoc IS_DAX conditional logic is getting pretty nasty.
> 
> xfs_file_mmap(
> 
> {
>   struct inode    *inode = file_inode(filp);
> 
>   if (vma->vm_flags & VM_SYNC) {
>   if (!IS_DAX(inode))
>   return -EOPNOTSUPP;
>   if (!dax_synchronous(xfs_find_daxdev_for_inode(inode)))
>   return -EOPNOTSUPP;
>   }
> 
>   file_accessed(filp);
>   vma->vm_ops = &xfs_file_vm_ops;
>   if (IS_DAX(inode))
>   vma->vm_flags |= VM_HUGEPAGE;
>   return 0;
> }
> 
> 
> Even better, factor out all the "MAP_SYNC supported" checks into a
> helper so that the filesystem code just doesn't have to care about
> the details of checking for DAX+MAP_SYNC support

Seconded, since ext4 has nearly the same flag validation logic.
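
(With such a helper, both xfs_file_mmap() and ext4_file_mmap() could
boil down to -- a sketch, with the helper name invented here:

	if (!daxdev_mapping_supported(vma, dax_dev))
		return -EOPNOTSUPP;

where the helper itself checks VM_SYNC, IS_DAX, and dax_synchronous().)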

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com


Re: [RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax

2019-02-22 Thread Darrick J. Wong
On Sat, Feb 23, 2019 at 10:30:38AM +1100, Dave Chinner wrote:
> On Fri, Feb 22, 2019 at 10:45:25AM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 22, 2019 at 10:28:15AM -0800, Dan Williams wrote:
> > > On Fri, Feb 22, 2019 at 10:21 AM Darrick J. Wong
> > >  wrote:
> > > >
> > > > Hi all!
> > > >
> > > > Uh, we have an internal customer  who's been trying out MAP_SYNC
> > > > on pmem, and they've observed that one has to do a fair amount of
> > > > legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
> > > > 2M PMD mappings.  They (of course) want to mmap hundreds of GB of pmem,
> > > > so the PMD mappings are much more efficient.
> 
> Are you really saying that "mkfs.xfs -d su=2MB,sw=1 " is
> considered "too much legwork" to set up the filesystem for DAX and
> PMD alignment?

Yes.  I mean ... userspace /can/ figure out the page sizes on arm64 &
ppc64le (or extract it from sysfs), but why not just advertise it as an
I/O hint on the pmem "block" device?

Hmm, now having watched various xfstests blow up because they don't
expect blocks to be larger than 64k, maybe I'll rethink this as a
default behavior. :)

> > > > I started poking around w.r.t. what mkfs.xfs was doing and realized that
> > > > if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
> > > > set up all the parameters automatically.  Below is my ham-handed attempt
> > > > to teach the kernel to do this.
> 
> Still need extent size hints so that writes that are smaller than
> the PMD size are allocated correctly aligned and sized to map to
> PMDs...

I think we're generally planning to use the RT device where we can make
2M alignment mandatory, so for the data device the effectiveness of the
extent hint doesn't really matter.

> > > > Comments, flames, "WTF is this guy smoking?" are all welcome. :)
> > > >
> > > > --D
> > > >
> > > > ---
> > > > Configure pmem devices to advertise the default page alignment when said
> > > > block device supports fsdax.  Certain filesystems use these iomin/ioopt
> > > > hints to try to create aligned file extents, which makes it much easier
> > > > for mmaps to take advantage of huge page table entries.
> > > >
> > > > Signed-off-by: Darrick J. Wong 
> > > > ---
> > > >  drivers/nvdimm/pmem.c |5 -
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > > > index bc2f700feef8..3eeb9dd117d5 100644
> > > > --- a/drivers/nvdimm/pmem.c
> > > > +++ b/drivers/nvdimm/pmem.c
> > > > @@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
> > > > blk_queue_logical_block_size(q, pmem_sector_size(ndns));
> > > > blk_queue_max_hw_sectors(q, UINT_MAX);
> > > > blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> > > > -   if (pmem->pfn_flags & PFN_MAP)
> > > > +   if (pmem->pfn_flags & PFN_MAP) {
> > > > blk_queue_flag_set(QUEUE_FLAG_DAX, q);
> > > > +   blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
> > > > +   blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);
> > > 
> > > The device alignment might sometimes be bigger than this default.
> > > Would there be any detrimental effects for filesystems if io_min and
> > > io_opt were set to 1GB?
> > 
> > Hmmm, that's going to be a struggle on ext4 and the xfs data device
> > because we'd be preferentially skipping the 1023.8MB immediately after
> > each allocation group's metadata.  It already does this now with a 2MB
> > io hint, but losing 1.8MB here and there isn't so bad.
> > 
> > We'd have to study it further, though; filesystems historically have
> > interpreted the iomin/ioopt hints as RAID striping geometry, and I don't
> > think very many people set up 1GB raid stripe units.
> 
> Setting sunit=1GB is really going to cause havoc with things like
> inode chunk allocation alignment, and the first write() will either
> have to be >=1GB or use 1GB extent size hints to trigger alignment.
> And, AFAICT, it will prevent us from doing 2MB alignment on other
> files, even with 2MB extent size hints set.
> 
> IOWs, I don't think 1GB alignment is a good idea as a default.



> > (I doubt very many people have done 2M raid stripes either, but it seems
> > to work easily where we've tried it...)

Re: [RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax

2019-02-22 Thread Darrick J. Wong
On Sat, Feb 23, 2019 at 10:11:36AM +1100, Dave Chinner wrote:
> On Fri, Feb 22, 2019 at 10:20:08AM -0800, Darrick J. Wong wrote:
> > Hi all!
> > 
> > Uh, we have an internal customer  who's been trying out MAP_SYNC
> > on pmem, and they've observed that one has to do a fair amount of
> > legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
> > 2M PMD mappings.  They (of course) want to mmap hundreds of GB of pmem,
> > so the PMD mappings are much more efficient.
> > 
> > I started poking around w.r.t. what mkfs.xfs was doing and realized that
> > if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
> > set up all the parameters automatically.  Below is my ham-handed attempt
> > to teach the kernel to do this.
> 
> What's the before and after mkfs output?
> 
> (need to see the context that this "fixes" before I comment)

Here's what we do today assuming no options and 800GB pmem devices:

# blockdev --getiomin --getioopt /dev/pmem0 /dev/pmem1
4096
0
4096
0
# mkfs.xfs -N /dev/pmem0 -r rtdev=/dev/pmem1
meta-data=/dev/pmem0 isize=512    agcount=4, agsize=52428800 blks
 =   sectsz=512   attr=2, projid32bit=1
 =   crc=1finobt=1, sparse=1, rmapbt=0
 =   reflink=0
data =   bsize=4096   blocks=209715200, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0, ftype=1
log  =internal log   bsize=4096   blocks=102400, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =/dev/pmem1 extsz=4096   blocks=209715200, 
rtextents=209715200

And here's what we do to get 2M aligned mappings:

# mkfs.xfs -N /dev/pmem0 -r rtdev=/dev/pmem1,extsize=2m -d su=2m,sw=1
meta-data=/dev/pmem0 isize=512    agcount=32, agsize=6553600 blks
 =   sectsz=512   attr=2, projid32bit=1
 =   crc=1finobt=1, sparse=1, rmapbt=0
 =   reflink=0
data =   bsize=4096   blocks=209715200, imaxpct=25
 =   sunit=512swidth=512 blks
naming   =version 2  bsize=4096   ascii-ci=0, ftype=1
log  =internal log   bsize=4096   blocks=102400, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =/dev/pmem1 extsz=2097152 blocks=209715200, 
rtextents=409600

With this patch, things change as such:

# blockdev --getiomin --getioopt /dev/pmem0 /dev/pmem1
2097152
2097152
2097152
2097152
# mkfs.xfs -N /dev/pmem0 -r rtdev=/dev/pmem1
meta-data=/dev/pmem0 isize=512    agcount=32, agsize=6553600 blks
 =   sectsz=512   attr=2, projid32bit=1
 =   crc=1finobt=1, sparse=1, rmapbt=0
 =   reflink=0
data =   bsize=4096   blocks=209715200, imaxpct=25
 =   sunit=512swidth=512 blks
naming   =version 2  bsize=4096   ascii-ci=0, ftype=1
log  =internal log   bsize=4096   blocks=102400, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =/dev/pmem1 extsz=2097152 blocks=209715200, 
rtextents=409600

I think the only change is the agcount, which for 2M mappings probably
isn't a huge deal.  It's obviously a bigger deal for 1G pages, assuming
we decide that's even advisable.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com


Re: [RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax

2019-02-22 Thread Darrick J. Wong
On Fri, Feb 22, 2019 at 10:28:15AM -0800, Dan Williams wrote:
> On Fri, Feb 22, 2019 at 10:21 AM Darrick J. Wong
>  wrote:
> >
> > Hi all!
> >
> > Uh, we have an internal customer  who's been trying out MAP_SYNC
> > on pmem, and they've observed that one has to do a fair amount of
> > legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
> > 2M PMD mappings.  They (of course) want to mmap hundreds of GB of pmem,
> > so the PMD mappings are much more efficient.
> >
> > I started poking around w.r.t. what mkfs.xfs was doing and realized that
> > if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
> > set up all the parameters automatically.  Below is my ham-handed attempt
> > to teach the kernel to do this.
> >
> > Comments, flames, "WTF is this guy smoking?" are all welcome. :)
> >
> > --D
> >
> > ---
> > Configure pmem devices to advertise the default page alignment when said
> > block device supports fsdax.  Certain filesystems use these iomin/ioopt
> > hints to try to create aligned file extents, which makes it much easier
> > for mmaps to take advantage of huge page table entries.
> >
> > Signed-off-by: Darrick J. Wong 
> > ---
> >  drivers/nvdimm/pmem.c |5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index bc2f700feef8..3eeb9dd117d5 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
> > blk_queue_logical_block_size(q, pmem_sector_size(ndns));
> > blk_queue_max_hw_sectors(q, UINT_MAX);
> > blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> > -   if (pmem->pfn_flags & PFN_MAP)
> > +   if (pmem->pfn_flags & PFN_MAP) {
> > blk_queue_flag_set(QUEUE_FLAG_DAX, q);
> > +   blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
> > +   blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);
> 
> The device alignment might sometimes be bigger than this default.
> Would there be any detrimental effects for filesystems if io_min and
> io_opt were set to 1GB?

Hmmm, that's going to be a struggle on ext4 and the xfs data device
because we'd be preferentially skipping the 1023.8MB immediately after
each allocation group's metadata.  It already does this now with a 2MB
io hint, but losing 1.8MB here and there isn't so bad.

We'd have to study it further, though; filesystems historically have
interpreted the iomin/ioopt hints as RAID striping geometry, and I don't
think very many people set up 1GB raid stripe units.

(I doubt very many people have done 2M raid stripes either, but it seems
to work easily where we've tried it...)

> I'm thinking and xfs-realtime configuration might be able to support
> 1GB mappings in the future.

The xfs realtime device ought to be able to support 1g alignment pretty
easily though. :)

--D


[RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax

2019-02-22 Thread Darrick J. Wong
Hi all!

Uh, we have an internal customer  who's been trying out MAP_SYNC
on pmem, and they've observed that one has to do a fair amount of
legwork (in the form of mkfs.xfs parameters) to get the kernel to set up
2M PMD mappings.  They (of course) want to mmap hundreds of GB of pmem,
so the PMD mappings are much more efficient.

I started poking around w.r.t. what mkfs.xfs was doing and realized that
if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will
set up all the parameters automatically.  Below is my ham-handed attempt
to teach the kernel to do this.

Comments, flames, "WTF is this guy smoking?" are all welcome. :)

--D

---
Configure pmem devices to advertise the default page alignment when said
block device supports fsdax.  Certain filesystems use these iomin/ioopt
hints to try to create aligned file extents, which makes it much easier
for mmaps to take advantage of huge page table entries.

Signed-off-by: Darrick J. Wong 
---
 drivers/nvdimm/pmem.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index bc2f700feef8..3eeb9dd117d5 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
blk_queue_logical_block_size(q, pmem_sector_size(ndns));
blk_queue_max_hw_sectors(q, UINT_MAX);
blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
-   if (pmem->pfn_flags & PFN_MAP)
+   if (pmem->pfn_flags & PFN_MAP) {
blk_queue_flag_set(QUEUE_FLAG_DAX, q);
+   blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
+   blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);
+   }
q->queuedata = pmem;
 
disk = alloc_disk_node(0, nid);


Re: [PATCH v3 5/5] xfs: disable map_sync for virtio pmem

2019-01-09 Thread Darrick J. Wong
On Wed, Jan 09, 2019 at 08:17:36PM +0530, Pankaj Gupta wrote:
> Virtio pmem provides an asynchronous host page cache flush
> mechanism. We don't support 'MAP_SYNC' with virtio pmem 
> and xfs.
> 
> Signed-off-by: Pankaj Gupta 
> ---
>  fs/xfs/xfs_file.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index e474250..eae4aa4 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1190,6 +1190,14 @@ xfs_file_mmap(
>   if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
>   return -EOPNOTSUPP;
>  
> + /* We don't support synchronous mappings with guest direct access
> +  * and virtio based host page cache mechanism.
> +  */
> + if (IS_DAX(file_inode(filp)) && virtio_pmem_host_cache_enabled(

Echoing what Jan said, this ought to be some sort of generic function
that tells us whether or not memory mapped from the dax device will
always still be accessible even after a crash (i.e. supports MAP_SYNC).

What if the underlying file on the host is itself on pmem and can be
MAP_SYNC'd?  Shouldn't the guest be able to use MAP_SYNC as well?

--D

> + xfs_find_daxdev_for_inode(file_inode(filp))) &&
> + (vma->vm_flags & VM_SYNC))
> + return -EOPNOTSUPP;
> +
>   file_accessed(filp);
>   vma->vm_ops = &xfs_file_vm_ops;
>   if (IS_DAX(file_inode(filp)))
> -- 
> 2.9.3
> 


Re: [PATCH v2 2/2] [PATCH] xfs: Close race between direct IO and xfs_break_layouts()

2018-08-10 Thread Darrick J. Wong
On Fri, Aug 10, 2018 at 08:54:00AM -0700, Dave Jiang wrote:
> 
> 
> On 08/10/2018 08:48 AM, Darrick J. Wong wrote:
> > On Wed, Aug 08, 2018 at 10:31:40AM -0700, Dave Jiang wrote:
> >> This patch duplicates Ross's fix for ext4, applied to xfs.
> >>
> >> If the refcount of a page is lowered between the time that it is returned
> >> by dax_busy_page() and when the refcount is again checked in
> >> xfs_break_layouts() => ___wait_var_event(), the waiting function
> >> xfs_wait_dax_page() will never be called.  This means that
> >> xfs_break_layouts() will still have 'retry' set to false, so we'll stop
> >> looping and never check the refcount of other pages in this inode.
> >>
> >> Instead, always continue looping as long as dax_layout_busy_page() gives us
> >> a page which it found with an elevated refcount.
> >>
> >> Signed-off-by: Dave Jiang 
> >> Reviewed-by: Jan Kara 
> >> ---
> >>
> >> Sorry resend, forgot to add Jan's reviewed-by.
> >>
> >> v2:
> >> - Rename parameter from did_unlock to retry (Jan)
> >>
> >>  fs/xfs/xfs_file.c |9 -
> >>  1 file changed, 4 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> >> index a3e7767a5715..cd6f0d8c4922 100644
> >> --- a/fs/xfs/xfs_file.c
> >> +++ b/fs/xfs/xfs_file.c
> >> @@ -721,12 +721,10 @@ xfs_file_write_iter(
> >>  
> >>  static void
> >>  xfs_wait_dax_page(
> >> -  struct inode    *inode,
> >> -  bool            *did_unlock)
> >> +  struct inode    *inode)
> >>  {
> >>    struct xfs_inode    *ip = XFS_I(inode);
> >>  
> >> -  *did_unlock = true;
> >>    xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
> >>    schedule();
> >>    xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> >> @@ -736,7 +734,7 @@ static int
> >>  xfs_break_dax_layouts(
> >>    struct inode    *inode,
> >>    uint            iolock,
> >> -  bool            *did_unlock)
> >> +  bool            *retry)
> > 
> > Uhhh, this hunk doesn't apply.  xfs_break_dax_layouts doesn't have an
> > iolock parameter anymore; was this not generated off of xfs for-next?
> 
> Sorry. It was generated against 4.18-rc8. I'll respin patch against xfs
> for-next.

I think it's just a matter of taking the old patch and changing
"did_unlock" to "retry", right?  If so, I'll just change that and be
done with this one. :)

--D

> > 
> > --D
> > 
> >>  {
> >>struct page *page;
> >>  
> >> @@ -746,9 +744,10 @@ xfs_break_dax_layouts(
> >>if (!page)
> >>return 0;
> >>  
> >> +  *retry = true;
> >> +  return ___wait_var_event(&page->_refcount,
> >>    atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> >> -  0, 0, xfs_wait_dax_page(inode, did_unlock));
> >> +  0, 0, xfs_wait_dax_page(inode));
> >>  }
> >>  
> >>  int
> >>


Re: [PATCH v2 2/2] [PATCH] xfs: Close race between direct IO and xfs_break_layouts()

2018-08-10 Thread Darrick J. Wong
On Wed, Aug 08, 2018 at 10:31:40AM -0700, Dave Jiang wrote:
> This patch duplicates Ross's fix for ext4, applied to xfs.
> 
> If the refcount of a page is lowered between the time that it is returned
> by dax_busy_page() and when the refcount is again checked in
> xfs_break_layouts() => ___wait_var_event(), the waiting function
> xfs_wait_dax_page() will never be called.  This means that
> xfs_break_layouts() will still have 'retry' set to false, so we'll stop
> looping and never check the refcount of other pages in this inode.
> 
> Instead, always continue looping as long as dax_layout_busy_page() gives us
> a page which it found with an elevated refcount.
> 
> Signed-off-by: Dave Jiang 
> Reviewed-by: Jan Kara 
> ---
> 
> Sorry resend, forgot to add Jan's reviewed-by.
> 
> v2:
> - Rename parameter from did_unlock to retry (Jan)
> 
>  fs/xfs/xfs_file.c |9 -
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index a3e7767a5715..cd6f0d8c4922 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -721,12 +721,10 @@ xfs_file_write_iter(
>  
>  static void
>  xfs_wait_dax_page(
> - struct inode    *inode,
> - bool            *did_unlock)
> + struct inode    *inode)
>  {
>   struct xfs_inode    *ip = XFS_I(inode);
>  
> - *did_unlock = true;
>   xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
>   schedule();
>   xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
> @@ -736,7 +734,7 @@ static int
>  xfs_break_dax_layouts(
>   struct inode    *inode,
>   uint            iolock,
> - bool            *did_unlock)
> + bool            *retry)

Uhhh, this hunk doesn't apply.  xfs_break_dax_layouts doesn't have an
iolock parameter anymore; was this not generated off of xfs for-next?

--D

>  {
>   struct page *page;
>  
> @@ -746,9 +744,10 @@ xfs_break_dax_layouts(
>   if (!page)
>   return 0;
>  
> + *retry = true;
> + return ___wait_var_event(&page->_refcount,
>   atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
> - 0, 0, xfs_wait_dax_page(inode, did_unlock));
> + 0, 0, xfs_wait_dax_page(inode));
>  }
>  
>  int
> 


Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings

2018-07-09 Thread Darrick J. Wong
On Mon, Jul 09, 2018 at 02:33:47PM +0200, Jan Kara wrote:
> On Thu 05-07-18 10:53:10, Ross Zwisler wrote:
> > On Wed, Jul 04, 2018 at 08:59:52PM -0700, Darrick J. Wong wrote:
> > > On Thu, Jul 05, 2018 at 09:54:14AM +1000, Dave Chinner wrote:
> > > > On Wed, Jul 04, 2018 at 02:27:23PM +0200, Jan Kara wrote:
> > > > > On Wed 04-07-18 10:49:23, Dave Chinner wrote:
> > > > > > On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote:
> > > > > > > Follow the lead of xfs_break_dax_layouts() and add 
> > > > > > > synchronization between
> > > > > > > operations in ext4 which remove blocks from an inode (hole punch, 
> > > > > > > truncate
> > > > > > > down, etc.) and pages which are pinned due to DAX DMA operations.
> > > > > > > 
> > > > > > > Signed-off-by: Ross Zwisler 
> > > > > > > Reviewed-by: Jan Kara 
> > > > > > > Reviewed-by: Lukas Czerner 
> > > > > > > ---
> > > > > > > 
> > > > > > > Changes since v2:
> > > > > > >  * Added a comment to ext4_insert_range() explaining why we don't 
> > > > > > > call
> > > > > > >ext4_break_layouts(). (Jan)
> > > > > > 
> > > > > > Which I think is wrong and will cause data corruption.
> > > > > > 
> > > > > > > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, 
> > > > > > > loff_t offset, loff_t len)
> > > > > > >   LLONG_MAX);
> > > > > > >   if (ret)
> > > > > > >   goto out_mmap;
> > > > > > > + /*
> > > > > > > +  * We don't need to call ext4_break_layouts() because we aren't
> > > > > > > +  * removing any blocks from the inode.  We are just changing 
> > > > > > > their
> > > > > > > +  * offset by inserting a hole.
> > > > > > > +  */
> > > 
> > > Does calling ext4_break_layouts from insert range not work?
> > > 
> > > It's my understanding that file leases are a mechanism for the
> > > filesystem to delegate some of its authority over physical space
> > > mappings to "client" software.  AFAICT it's used for mmap'ing pmem
> > > directly into userspace and exporting space on shared storage over
> > > pNFS.  Some day we might use the same mechanism for the similar things
> > > that RDMA does, or the swapfile code since that's essentially how it
> > > works today.
> > > 
> > > The other part of these file leases is that the filesystem revokes them
> > > any time it wants to perform a mapping operation on a file.  This breaks
> > > my mental model of how leases work, and if you commit to this for ext4
> > > then I'll have to remember that leases are different between xfs and
> > > ext4.  Worse, since the reason for skipping ext4_break_layouts seems to
> > > be the implementation detail that "DAX won't care", then someone else
> > > wiring up pNFS/future RDMA/whatever will also have to remember to put it
> > > back into ext4 or else kaboom.
> > > 
> > > Granted, Dave said all these things already, but I actually feel
> > > strongly enough to reiterate.
> > 
> > Jan, would you like me to call ext4_break_layouts() in ext4_insert_range() 
> > to
> > keep the lease mechanism consistent between ext4 and XFS, or would you 
> > prefer
> > the s/ext4_break_layouts()/ext4_dax_unmap_pages()/ rename?
> 
> Let's just call it from ext4_insert_range(). I think the simple semantics
> Dave and Darrick defend is more maintainable and insert range isn't really
> performance critical operation.
> 
> The question remains whether equivalent of BREAK_UNMAP is really required
> also for allocation of new blocks using fallocate. Because that doesn't
> really seem fundamentally different from normal write which uses
> BREAK_WRITE for xfs_break_layouts(). And that it more often used operation
> so bothering with GUP synchronization when not needed could hurt there.
> Dave, Darrick?

Hmm, IIRC BREAK_UNMAP is supposed to be for callers who are going to
remove (or move) mappings that already exist, so that the caller blocks
until the lessee acknowledges that they've forgotten all the mappings
they knew about.  So I /think/ for fallocate mode 0 this could be
BREAK_WRITE instead of _UNMAP, though (at least for xfs) the other
modes all need _UNMAP.
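
(Sketch of the dispatch I'm picturing, using the fallocate(2) mode
bits:

	if (mode & (FALLOC_FL_PUNCH_HOLE | FALLOC_FL_COLLAPSE_RANGE |
		    FALLOC_FL_INSERT_RANGE | FALLOC_FL_ZERO_RANGE))
		error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
	else
		error = xfs_break_layouts(inode, &iolock, BREAK_WRITE);
)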

Side question: in xfs_file_aio_write_checks, do we need to do
BREAK_UNMAP if it is possible that writeback will end up performing a copy
write?  Granted, the pnfs export and dax stuff don't support reflink or
cow so I guess this is an academic question for now...

--D

>   Honza
> -- 
> Jan Kara 
> SUSE Labs, CR


Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings

2018-07-09 Thread Darrick J. Wong
On Mon, Jul 09, 2018 at 11:59:07AM +0200, Lukas Czerner wrote:
> On Fri, Jul 06, 2018 at 09:29:34AM +1000, Dave Chinner wrote:
> > On Thu, Jul 05, 2018 at 01:40:17PM -0700, Dan Williams wrote:
> > > On Wed, Jul 4, 2018 at 8:59 PM, Darrick J. Wong  
> > > wrote:
> > > > On Thu, Jul 05, 2018 at 09:54:14AM +1000, Dave Chinner wrote:
> > > >> On Wed, Jul 04, 2018 at 02:27:23PM +0200, Jan Kara wrote:
> > > >> > On Wed 04-07-18 10:49:23, Dave Chinner wrote:
> > > >> > > On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote:
> > > >> > > > Follow the lead of xfs_break_dax_layouts() and add 
> > > >> > > > synchronization between
> > > >> > > > operations in ext4 which remove blocks from an inode (hole 
> > > >> > > > punch, truncate
> > > >> > > > down, etc.) and pages which are pinned due to DAX DMA operations.
> > > >> > > >
> > > >> > > > Signed-off-by: Ross Zwisler 
> > > >> > > > Reviewed-by: Jan Kara 
> > > >> > > > Reviewed-by: Lukas Czerner 
> > > >> > > > ---
> > > >> > > >
> > > >> > > > Changes since v2:
> > > >> > > >  * Added a comment to ext4_insert_range() explaining why we 
> > > >> > > > don't call
> > > >> > > >ext4_break_layouts(). (Jan)
> > > >> > >
> > > >> > > Which I think is wrong and will cause data corruption.
> > > >> > >
> > > >> > > > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode 
> > > >> > > > *inode, loff_t offset, loff_t len)
> > > >> > > > LLONG_MAX);
> > > >> > > > if (ret)
> > > >> > > > goto out_mmap;
> > > >> > > > +   /*
> > > >> > > > +* We don't need to call ext4_break_layouts() because we 
> > > >> > > > aren't
> > > >> > > > +* removing any blocks from the inode.  We are just 
> > > >> > > > changing their
> > > >> > > > +* offset by inserting a hole.
> > > >> > > > +*/
> > > >
> > > > Does calling ext4_break_layouts from insert range not work?
> > > >
> > > > It's my understanding that file leases are a mechanism for the
> > > > filesystem to delegate some of its authority over physical space
> > > > mappings to "client" software.  AFAICT it's used for mmap'ing pmem
> > > > directly into userspace and exporting space on shared storage over
> > > > pNFS.  Some day we might use the same mechanism for the similar things
> > > > that RDMA does, or the swapfile code since that's essentially how it
> > > > works today.
> > > >
> > > > The other part of these file leases is that the filesystem revokes them
> > > > any time it wants to perform a mapping operation on a file.  This breaks
> > > > my mental model of how leases work, and if you commit to this for ext4
> > > > then I'll have to remember that leases are different between xfs and
> > > > ext4.  Worse, since the reason for skipping ext4_break_layouts seems to
> > > > be the implementation detail that "DAX won't care", then someone else
> > > > wiring up pNFS/future RDMA/whatever will also have to remember to put it
> > > > back into ext4 or else kaboom.
> > > >
> > > > Granted, Dave said all these things already, but I actually feel
> > > > strongly enough to reiterate.
> > > 
> > > This patch kit is only for the DAX fix, this isn't full layout lease
> > > support. Even XFS is special casing unmap with the BREAK_UNMAP flag.
> > > So ext4 is achieving feature parity for BREAK_UNMAP, just not
> > > BREAK_WRITE, yet.
> > 
> > BREAK_UNMAP is issued unconditionally by XFS for all fallocate
> > operations. There is no special except for extent shifting (up or
> > down) in XFS as this patch set is making for ext4.  IOWs, this
> > patchset does not implement BREAK_UNMAP with the same semantics as
> > XFS.
> 
> If anything this is a very useful discussion (at least for me), and what
> I do take away from it is that there is no documentation, nor
> specification of the leases nor BREAK_UNMAP nor BREAK_WRITE.
> 
> grep -iR -e break_layout -e BREAK_UNMAP -e BREAK_WRITE Documentation/*
> 
> Maybe someone with a good understanding of how this stuff is supposed to
> be done could write it down so filesystem devs can make it behave the
> same.

Dan? :)

IIRC, BREAK_WRITE means "terminate all leases immediately" as the caller
prepares to write to a file range (which may or may not involve adding
more mappings), whereas BREAK_UNMAP means "terminate all leases and wait
until the lessee acknowledges" as the caller prepares to remove (or
move) file extent mappings.
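
In code terms -- a sketch against the xfs_break_layouts() signature in
xfs for-next:

	/* caller is about to write within existing mappings */
	error = xfs_break_layouts(inode, &iolock, BREAK_WRITE);

	/* caller is about to remove or move mappings; this one waits
	 * for the lessee to acknowledge */
	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);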

--D

> Thanks!
> -Lukas
> 
> 
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > da...@fromorbit.com


Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings

2018-07-04 Thread Darrick J. Wong
On Thu, Jul 05, 2018 at 09:54:14AM +1000, Dave Chinner wrote:
> On Wed, Jul 04, 2018 at 02:27:23PM +0200, Jan Kara wrote:
> > On Wed 04-07-18 10:49:23, Dave Chinner wrote:
> > > On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote:
> > > > Follow the lead of xfs_break_dax_layouts() and add synchronization 
> > > > between
> > > > operations in ext4 which remove blocks from an inode (hole punch, 
> > > > truncate
> > > > down, etc.) and pages which are pinned due to DAX DMA operations.
> > > > 
> > > > Signed-off-by: Ross Zwisler 
> > > > Reviewed-by: Jan Kara 
> > > > Reviewed-by: Lukas Czerner 
> > > > ---
> > > > 
> > > > Changes since v2:
> > > >  * Added a comment to ext4_insert_range() explaining why we don't call
> > > >ext4_break_layouts(). (Jan)
> > > 
> > > Which I think is wrong and will cause data corruption.
> > > 
> > > > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, 
> > > > loff_t offset, loff_t len)
> > > > LLONG_MAX);
> > > > if (ret)
> > > > goto out_mmap;
> > > > +   /*
> > > > +* We don't need to call ext4_break_layouts() because we aren't
> > > > +* removing any blocks from the inode.  We are just changing 
> > > > their
> > > > +* offset by inserting a hole.
> > > > +*/

Does calling ext4_break_layouts from insert range not work?

It's my understanding that file leases are a mechanism for the
filesystem to delegate some of its authority over physical space
mappings to "client" software.  AFAICT it's used for mmap'ing pmem
directly into userspace and exporting space on shared storage over
pNFS.  Some day we might use the same mechanism for the similar things
that RDMA does, or the swapfile code since that's essentially how it
works today.

The other part of these file leases is that the filesystem revokes them
any time it wants to perform a mapping operation on a file.  This breaks
my mental model of how leases work, and if you commit to this for ext4
then I'll have to remember that leases are different between xfs and
ext4.  Worse, since the reason for skipping ext4_break_layouts seems to
be the implementation detail that "DAX won't care", then someone else
wiring up pNFS/future RDMA/whatever will also have to remember to put it
back into ext4 or else kaboom.

Granted, Dave said all these things already, but I actually feel
strongly enough to reiterate.

--D

> > > 
> > > The entire point of these leases is so that a third party can
> > > directly access the blocks underlying the file. That means they are
> > > keeping their own file offset<->disk block mapping internally, and
> > > they are assuming that it is valid for as long as they hold the
> > > lease. If the filesystem modifies the extent map - even something
> > > like a shift here which changes the offset<->disk block mapping -
> > > the userspace app now has a stale mapping and so the lease *must be
> > > broken* to tell it that its mappings are now stale and it needs to
> > > refetch them.
> > 
> > Well, ext4 has no real concept of leases and no pNFS support. And DAX
> > requirements wrt consistency are much weaker than those of pNFS. This is
> > mostly caused by the fact that calls like invalidate_mapping_pages() will
> > flush offset<->pfn mappings DAX maintains in the radix tree automatically
> > (similarly as it happens when page cache is used).
> 
> I'm more concerned about apps that use file leases behaving the same
> way, not just the pNFS stuff. if we are /delegating file layouts/ to
> 3rd parties, then all filesystems *need* to behave the same way.
> We've already defined those semantics with XFS - every time the
> filesystem changes an extent layout in any way it needs to break
> existing layout delegations...
> 
> > What Ross did just keeps ext4 + DAX behaving similarly to how ext4 +
> > page cache behaves
> 
> Sure. But the issue I'm raising is that ext4 is not playing by the
> same extent layout delegation rules that XFS has already defined for
> 3rd party use.
> 
> i.e. don't fuck up layout delegation behaviour consistency right
> from the start just because " is all
> we need right now for ext4". All the filesystems should implement
> the same semantics and behaviour right from the start, otherwise
> we're just going to make life a misery for anyone who tries to use
> layout delegations in future.
> 
> Haven't we learnt this lesson the hard way enough times already?
> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> da...@fromorbit.com


Re: [fstests PATCH] generic/223: skip when using DAX

2018-06-13 Thread Darrick J. Wong
On Wed, Jun 13, 2018 at 03:07:42PM -0600, Ross Zwisler wrote:
> As of these upstream kernel commits:
> 
> commit 6e2608dfd934 ("xfs, dax: introduce xfs_dax_aops")
> commit 5f0663bb4a64 ("ext4, dax: introduce ext4_dax_aops")
> 
> generic/223 fails on XFS and ext4 because filesystems mounted with DAX no
> longer support bmap.  This is desired behavior and will not be fixed,
> according to:
> 
> https://lists.01.org/pipermail/linux-nvdimm/2018-April/015383.html
> 
> So, just skip over generic/223 when using DAX so we don't throw false
> positive test failures.

Just because we decided not to support FIBMAP on XFSDAX doesn't mean we
should let this test bitrot. :)

Just out of curiosity, does the following patch fix g/223 for you?

--D

diff --git a/src/t_stripealign.c b/src/t_stripealign.c
index 05ed36b5..690f743a 100644
--- a/src/t_stripealign.c
+++ b/src/t_stripealign.c
@@ -17,8 +17,13 @@
 #include 
 #include 
 #include 
+#include <linux/fs.h>
+#include <linux/fiemap.h>
 
-#define FIBMAP  _IO(0x00, 1) /* bmap access */
+#define FIEMAP_EXTENT_ACCEPTABLE   (FIEMAP_EXTENT_LAST | \
+   FIEMAP_EXTENT_DATA_ENCRYPTED | FIEMAP_EXTENT_ENCODED | \
+   FIEMAP_EXTENT_UNWRITTEN | FIEMAP_EXTENT_MERGED | \
+   FIEMAP_EXTENT_SHARED)
 
 /*
  * If only filename given, print first block.
@@ -28,11 +33,14 @@
 
 int main(int argc, char ** argv)
 {
-   int fd;
-   int ret;
-   int sunit = 0;  /* in blocks */
-   char        *filename;
-   unsigned int    block = 0;
+   struct stat sb;
+   struct fiemap   *fie;
+   struct fiemap_extent    *fe;
+   int fd;
+   int ret;
+   int sunit = 0;  /* in blocks */
+   char        *filename;
+   unsigned long long  block;
 
 if (argc < 3) {
 printf("Usage: %s  \n", argv[0]);
@@ -48,21 +56,63 @@ int main(int argc, char ** argv)
 return 1;
 }
 
-   ret = ioctl(fd, FIBMAP, &block);
-   if (ret < 0) {
+   ret = fstat(fd, &sb);
+   if (ret) {
+   perror(filename);
close(fd);
-   perror("fibmap");
return 1;
}
 
-   close(fd);
+   fie = calloc(1, sizeof(struct fiemap) + sizeof(struct fiemap_extent));
+   if (!fie) {
+   close(fd);
+   perror("malloc");
+   return 1;
+   }
+   fie->fm_length = 1;
+   fie->fm_flags = FIEMAP_FLAG_SYNC;
+   fie->fm_extent_count = 1;
+
+   ret = ioctl(fd, FS_IOC_FIEMAP, fie);
+   if (ret < 0) {
+   unsigned int    bmap = 0;
+
+   ret = ioctl(fd, FIBMAP, &bmap);
+   if (ret < 0) {
+   perror("fibmap");
+   free(fie);
+   close(fd);
+   return 1;
+   }
+   block = bmap;
+   goto check;
+   }
 
+
+   if (fie->fm_mapped_extents != 1) {
+   printf("%s: no extents?\n", filename);
+   free(fie);
+   close(fd);
+   return 1;
+   }
+   fe = &fie->fm_extents[0];
+   if (fe->fe_flags & ~FIEMAP_EXTENT_ACCEPTABLE) {
+   printf("%s: bad flags 0x%x\n", filename, fe->fe_flags);
+   free(fie);
+   close(fd);
+   return 1;
+   }
+
+   block = fie->fm_extents[0].fe_physical / sb.st_blksize;
+check:
if (block % sunit) {
-   printf("%s: Start block %u not multiple of sunit %u\n",
+   printf("%s: Start block %llu not multiple of sunit %u\n",
filename, block, sunit);
return 1;
} else
printf("%s: well-aligned\n", filename);
+   free(fie);
+   close(fd);
 
return 0;
 }
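
(For context on why generic/223 started failing: with the DAX aops the
filesystem no longer provides ->bmap, so the FIBMAP ioctl itself fails.
A minimal probe for that failure mode -- an illustrative sketch, assuming
fs/ioctl.c of that era rejected the ioctl when ->bmap was absent, and
noting that FIBMAP needs CAP_SYS_RAWIO -- would be:)

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

#define FIBMAP	_IO(0x00, 1)	/* bmap access */

int main(int argc, char **argv)
{
	unsigned int block = 0;		/* file block 0 on input */
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (ioctl(fd, FIBMAP, &block) < 0)
		perror("fibmap");	/* expected on a DAX mount */
	else
		printf("first block: %u\n", block);
	close(fd);
	return 0;
}

This is exactly the call t_stripealign used to make unconditionally; the
patch above demotes it to a fallback behind FS_IOC_FIEMAP.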


Re: [dm-devel] [PATCH v2 2/7] dax: change bdev_dax_supported() to support boolean returns

2018-05-31 Thread Darrick J. Wong
On Thu, May 31, 2018 at 04:52:06PM -0400, Mike Snitzer wrote:
> On Thu, May 31 2018 at  3:13pm -0400,
> Darrick J. Wong  wrote:
> 
> > On Tue, May 29, 2018 at 04:01:14PM -0600, Ross Zwisler wrote:
> > > On Tue, May 29, 2018 at 02:25:10PM -0700, Darrick J. Wong wrote:
> > > > On Tue, May 29, 2018 at 01:51:01PM -0600, Ross Zwisler wrote:
> > > > > From: Dave Jiang 
> > > > > 
> > > > > The function return values are confusing with the way the function is
> > > > > named. We expect a true or false return value but it actually returns
> > > > > 0/-errno.  This makes the code very confusing.  Change the return
> > > > > value to a bool: return true if DAX is supported and false
> > > > > otherwise.
> > > > > 
> > > > > Signed-off-by: Dave Jiang 
> > > > > Signed-off-by: Ross Zwisler 
> > > > 
> > > > Looks ok, do you want me to pull the first two patches through the xfs
> > > > tree?
> > > > 
> > > > Reviewed-by: Darrick J. Wong 
> > > 
> > > Thanks for the review.
> > > 
> > > I'm not sure what's best.  If you do that then Mike will need to have a DM
> > > branch for the rest of the series based on your stable commits, yea?
> > > 
> > > Mike what would you prefer?
> > 
> > I /was/ about to say that I would pull in the first two patches, but now
> > I can't get xfs to mount with pmem at all, and have no way of testing
> > this...?
> 
> Once you get this sorted out, please feel free to pull in the first 2.

Sorted.  It'll be in Friday's for-next.  Ross helped me bang on the pmem
devices w/ ndctl to enable fsdax mode and twist qemu until everything
worked properly. ;)

--D

> I'm unlikely to get to reviewing the DM patches in this series until
> tomorrow at the earliest.
> 
> Mike
> 
> --
> dm-devel mailing list
> dm-de...@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
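
The confusion being fixed here is easy to reproduce in userspace.  The
helper names below are invented purely for illustration; only the
0/-errno-versus-bool calling convention is taken from the patch
description:

#include <stdio.h>
#include <stdbool.h>
#include <errno.h>

/* old convention: 0 means "supported", -errno means "not supported" */
static int dax_supported_errno(bool have_dax)
{
	return have_dax ? 0 : -EOPNOTSUPP;
}

/* new convention: the name and the return value finally agree */
static bool dax_supported_bool(bool have_dax)
{
	return have_dax;
}

int main(void)
{
	/* Buggy-looking pattern: nonzero is "true" in C, so a bare truth
	 * test of the errno-style helper reads backwards. */
	if (dax_supported_errno(false))
		printf("errno-style: truth test wrongly suggests 'supported'\n");

	if (!dax_supported_bool(false))
		printf("bool-style: the test reads the way it is written\n");
	return 0;
}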


Re: [PATCH v2 2/7] dax: change bdev_dax_supported() to support boolean returns

2018-05-31 Thread Darrick J. Wong
On Tue, May 29, 2018 at 04:01:14PM -0600, Ross Zwisler wrote:
> On Tue, May 29, 2018 at 02:25:10PM -0700, Darrick J. Wong wrote:
> > On Tue, May 29, 2018 at 01:51:01PM -0600, Ross Zwisler wrote:
> > > From: Dave Jiang 
> > > 
> > > The function return values are confusing with the way the function is
> > > named. We expect a true or false return value but it actually returns
> > > 0/-errno.  This makes the code very confusing.  Change the return
> > > value to a bool: return true if DAX is supported and false otherwise.
> > > 
> > > Signed-off-by: Dave Jiang 
> > > Signed-off-by: Ross Zwisler 
> > 
> > Looks ok, do you want me to pull the first two patches through the xfs
> > tree?
> > 
> > Reviewed-by: Darrick J. Wong 
> 
> Thanks for the review.
> 
> I'm not sure what's best.  If you do that then Mike will need to have a DM
> branch for the rest of the series based on your stable commits, yea?
> 
> Mike what would you prefer?

I /was/ about to say that I would pull in the first two patches, but now
I can't get xfs to mount with pmem at all, and have no way of testing
this...?

# echo 'file drivers/dax/* +p' > /sys/kernel/debug/dynamic_debug/control
# mount /dev/pmem3 -o rtdev=/dev/pmem4,dax /mnt
# dmesg

SGI XFS with ACLs, security attributes, realtime, scrub, debug enabled
XFS (pmem3): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
pmem3: error: dax support not enabled
pmem4: error: dax support not enabled
XFS (pmem3): DAX unsupported by block device. Turning off DAX.
XFS (pmem3): Mounting V5 Filesystem
XFS (pmem3): Ending clean mount

Evidently the pfn it picks up is missing PFN_MAP in flags because
ND_REGION_PAGEMAP isn't set, and looking at the kernel source, pmem that
comes in via NFIT never gets that set...?

relevant qemu pmem options:

-object 
memory-backend-file,id=memnvdimm0,prealloc=no,mem-path=/dev/shm/a.img,share=yes,size=1341664
-device nvdimm,node=0,memdev=memnvdimm0,id=nvdimm0,slot=0
(repeat for five more devices)



--D

NFIT table contents:

000  4e  46  49  54  78  04  00  00  01  46  42  4f  43  48  53  20
  N   F   I   T   x 004  \0  \0 001   F   B   O   C   H   S
016  42  58  50  43  4e  46  49  54  01  00  00  00  42  58  50  43
  B   X   P   C   N   F   I   T 001  \0  \0  \0   B   X   P   C
032  01  00  00  00  00  00  00  00  00  00  38  00  08  00  03  00
001  \0  \0  \0  \0  \0  \0  \0  \0  \0   8  \0  \b  \0 003  \0
048  00  00  00  00  01  00  00  00  79  d3  f0  66  f3  b4  74  40
 \0  \0  \0  \0 001  \0  \0  \0   y 323 360   f 363 264   t   @
064  ac  43  0d  33  18  b7  8c  db  00  00  00  6c  0a  00  00  00
254   C  \r   3 030 267 214 333  \0  \0  \0   l  \n  \0  \0  \0
080  00  00  00  24  03  00  00  00  08  80  00  00  00  00  00  00
 \0  \0  \0   $ 003  \0  \0  \0  \b 200  \0  \0  \0  \0  \0  \0
096  01  00  30  00  04  00  00  00  00  00  00  00  08  00  09  00
001  \0   0  \0 004  \0  \0  \0  \0  \0  \0  \0  \b  \0  \t  \0
112  00  00  00  24  03  00  00  00  00  00  00  00  00  00  00  00
 \0  \0  \0   $ 003  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
128  00  00  00  00  00  00  00  00  00  00  01  00  00  00  00  00
 \0  \0  \0  \0  \0  \0  \0  \0  \0  \0 001  \0  \0  \0  \0  \0
144  04  00  50  00  09  00  86  80  01  00  01  00  00  00  00  00
004  \0   P  \0  \t  \0 206 200 001  \0 001  \0  \0  \0  \0  \0
160  00  00  00  00  00  00  00  00  59  34  12  00  01  03  00  00
 \0  \0  \0  \0  \0  \0  \0  \0   Y   4 022  \0 001 003  \0  \0
176  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
 \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
224  00  00  38  00  0a  00  03  00  00  00  00  00  00  00  00  00
 \0  \0   8  \0  \n  \0 003  \0  \0  \0  \0  \0  \0  \0  \0  \0
240  79  d3  f0  66  f3  b4  74  40  ac  43  0d  33  18  b7  8c  db
  y 323 360   f 363 264   t   @ 254   C  \r   3 030 267 214 333
256  00  00  00  90  0d  00  00  00  00  00  00  24  03  00  00  00
 \0  \0  \0 220  \r  \0  \0  \0  \0  \0  \0   $ 003  \0  \0  \0
272  08  80  00  00  00  00  00  00  01  00  30  00  05  00  00  00
 \b 200  \0  \0  \0  \0  \0  \0 001  \0   0  \0 005  \0  \0  \0
288  00  00  00  00  0a  00  0b  00  00  00  00  24  03  00  00  00
 \0  \0  \0  \0  \n  \0  \v  \0  \0  \0  \0   $ 003  \0  \0  \0
304  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
 \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
320  00  00  01  00  00  00  00  00  04  00  50  00  0b  00  86  80
 \0  \0 001  \0  \0  \0  \0  \0 004  \0   P  \0  \v  \0 206 200
336  01  00  01  00  00  00  00  00  00  00  00  00  0

Re: Question about Experimental of Filesystem DAX.

2018-05-31 Thread Darrick J. Wong
On Thu, May 31, 2018 at 09:29:15AM -0700, Dan Williams wrote:
> On Thu, May 31, 2018 at 8:07 AM, Ross Zwisler
>  wrote:
> > On Thu, May 31, 2018 at 11:27:33AM +0900, Yasunori Goto wrote:
> >> Hello,
> >>
> >>
> >> I would like to know about the Experimental message of Filesystem DAX.
> >> 
> >> DAX enabled. Warning: EXPERIMENTAL, use at your own risk
> >> 
> >>
> >> AFAIK, the final issue of Filesystem DAX is the metadata update problem,
> >> and it is (or will be?) solved by the great effort of MAP_SYNC and the
> >> "fix dma vs truncate/hole-punch" patch set.
> >> So, I suppose that the Experimental message can be removed,
> >> but I'm not sure.
> >>
> >> Is it possible?
> >> Otherwise, are there any other issues in Filesystem DAX yet?
> >>
> >> If this is a silly question, sorry for the noise.
> >>
> >> Thanks,
> >> ---
> >> Yasunori Goto
> >
> > Adding in the XFS and ext4 developers, as it's really their call when to
> > remove this notice.
> >
> > We've talked about this off and on for a long while, but IMHO we should
> > remove the EXPERIMENTAL warning.  The last few things that we had on our
> > TODO list before this was removed were:
> >
> > 1) Get consistent handling of the DAX mount option.  We currently have this,
> > as both filesystems will behave the same and fall back and remove the DAX
> > mount option if it is unsupported by the block device, etc.



As an aside, I wonder if Christoph's musings about "just have the kernel
determine the appropriate dax/non-dax setting from the acpi tables and
skip the inode flag entirely" ever got resolved?

> > 2) Get consistent handling of the DAX inode option.  We currently have this,
> > as all DAX behavior now happens through the mount option.  If/when we
> > re-enable the per-inode DAX flag we should do it consistently for all DAX
> > enabled filesystems.

The behavior of the inode flag isn't all that consistent.  ext4 doesn't
support it at all.  On XFS, you can set or clear FS_XFLAG_DAX on a
directory which will propagate the setting to any files created in that
directory.

However, if you set or clear it on a file we update the on-disk inode
but we can't change the in-core state flag (S_DAX) until the next
in-core inode instantiation.  It's weird that users can change the flag
but the intended behavior changes won't happen until some ... time ...
in the future??
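
For concreteness, this is how the per-file flag gets toggled from
userspace (the ioctls and FS_XFLAG_DAX are the standard ones from
linux/fs.h; the comments restate the caveat above, and the sketch is
illustrative rather than a recommended workflow):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	struct fsxattr fsx;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSGETXATTR");
		return 1;
	}
	fsx.fsx_xflags |= FS_XFLAG_DAX;
	/* This updates the on-disk inode, but the in-core S_DAX flag --
	 * and therefore the actual I/O behaviour -- stays unchanged
	 * until the inode is next instantiated. */
	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0)
		perror("FS_IOC_FSSETXATTR");
	close(fd);
	return 0;
}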

> > 3) Make DAX work with other XFS features like reflink, etc.  This one isn't
> > done, but we at least disallow DAX with XFS features like reflink where it
> > could be an issue.  Darrick, do you still feel like we need to get these
> > working together to remove EXPERIMENTAL, or are you happy enough that we're
> > keeping them separated and that we're keeping user data safe?

Yes, reflink and dax still need to work together.  I've not heard any
good arguments for why page sharing + copy on write are fundamentally
incompatible with the dax model, or why dax users will never, ever
require reflink.

The recent thread between Jan and Dan makes me wonder if making mappings
share struct pages is going to be a nightmare to add to the mm code,
though...

Also: ideally XFS would also be able to consume poison event
notifications from the pmem so that it can try to deal with metadata
loss, but that's probably a separate effort.

--D

> > Jan and the other ext4 guys, do you have any additional things you need done
> > before removing the EXPERIMENTAL warning from ext4 + DAX?
> 
> The one's on my list are:
> 
> 1/ Get proper support for recovering userspace consumed poison in DAX
> mappings (may not make 4.18)
> 
> 2/ The DAX-DMA vs Truncate fix (queued for 4.18).
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

