[PATCH v6 4/5] fs, xfs: introduce MAP_DIRECT for creating block-map-atomic file ranges

2017-08-23 Thread Dan Williams
MAP_DIRECT is an mmap(2) flag with the following semantics:

  MAP_DIRECT
  When specified with MAP_SHARED, a successful fault in this range
  indicates that the kernel is maintaining the block map (the user linear
  address to file offset to physical address relationship) in such a way
  that no external agent can observe inconsistent changes. In other
  words, the block map of the mapping is effectively pinned, or the kernel
  is otherwise able to exchange a new physical extent atomically with
  respect to any hardware / software agent. As implied by this definition,
  a successful fault in a MAP_DIRECT range bypasses kernel indirections
  like the page-cache, and all updates are carried directly through to the
  underlying physical file blocks (modulo cpu cache effects).

  ETXTBSY may be returned to any third-party operation on the file that
  attempts to update the block map (allocate blocks / convert unwritten
  extents / break shared extents). However, whether a filesystem returns
  ETXTBSY for a given state of the block map relative to a MAP_DIRECT
  mapping is filesystem and kernel version dependent.

  Some filesystems may extend these operation restrictions outside the
  mapped range and return ETXTBSY to any file operation that might mutate
  the block map. MAP_DIRECT faults may fail with SIGBUS if the filesystem
  would need to modify the block map to satisfy the fault, for example
  when the mapping was established over a hole in a sparse file.

  ERRORS
  EACCES A MAP_DIRECT mapping was requested and PROT_WRITE was not set,
  or the requesting process is missing CAP_LINUX_IMMUTABLE.

  EINVAL MAP_ANONYMOUS or MAP_PRIVATE was specified with MAP_DIRECT.

  EOPNOTSUPP The filesystem explicitly does not support the flag.

  SIGBUS Attempted to write a MAP_DIRECT mapping at a file offset that
 might require block-map updates.
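As a rough userspace sketch of the semantics above (hedged: the numeric value of
MAP_DIRECT is not shown in this excerpt, so the constant below is a placeholder,
and map_file_direct is an illustrative helper, not part of the patch), a caller
would request the flag and fall back to an ordinary shared mapping when the
kernel or filesystem rejects it:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Placeholder: the real MAP_DIRECT bit is assigned by the patch series. */
#ifndef MAP_DIRECT
#define MAP_DIRECT 0x0
#endif

/*
 * Try to establish a block-map-atomic mapping; fall back to an
 * ordinary shared mapping on the documented rejection errors.
 * '*direct' reports whether MAP_DIRECT actually took effect.
 */
void *map_file_direct(int fd, size_t len, int *direct)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_DIRECT, fd, 0);
	if (p != MAP_FAILED) {
		*direct = 1;
		return p;
	}
	if (errno == EOPNOTSUPP || errno == EACCES || errno == EINVAL) {
		*direct = 0;
		return mmap(NULL, len, PROT_READ | PROT_WRITE,
			    MAP_SHARED, fd, 0);
	}
	return MAP_FAILED;
}
```

Note that PROT_WRITE is required per the EACCES rule above, and a robust caller
must also be prepared for SIGBUS on fault, which no mmap-time check can rule out.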

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: Alexander Viro 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 fs/xfs/xfs_file.c   |  115 ++-
 fs/xfs/xfs_inode.h  |1 
 fs/xfs/xfs_super.c  |1 
 include/linux/mman.h|6 ++
 include/uapi/asm-generic/mman.h |1 
 mm/mmap.c   |   23 
 6 files changed, 142 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index cacc0162a41a..f82bf9416200 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -40,6 +40,7 @@
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
 
+#include 
 #include 
 #include 
 #include 
@@ -1001,6 +1002,25 @@ xfs_file_llseek(
return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
 }
 
+static const struct vm_operations_struct xfs_file_vm_direct_ops;
+
+STATIC int
+xfs_vma_checks(
+   struct vm_area_struct   *vma,
+   struct inode*inode)
+{
+   if (vma->vm_ops != &xfs_file_vm_direct_ops)
+   return 0;
+
+   if (xfs_is_reflink_inode(XFS_I(inode)))
+   return VM_FAULT_SIGBUS;
+
+   if (!IS_DAX(inode))
+   return VM_FAULT_SIGBUS;
+
+   return 0;
+}
+
 /*
  * Locking for serialisation of IO during page faults. This results in a lock
  * ordering of:
@@ -1031,6 +1051,10 @@ xfs_filemap_page_mkwrite(
file_update_time(vmf->vma->vm_file);
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
+   ret = xfs_vma_checks(vmf->vma, inode);
+   if (ret)
+   goto out_unlock;
+
if (IS_DAX(inode)) {
ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &xfs_iomap_ops);
} else {
@@ -1038,6 +1062,7 @@ xfs_filemap_page_mkwrite(
ret = block_page_mkwrite_return(ret);
}
 
+out_unlock:
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
sb_end_pagefault(inode->i_sb);
 
@@ -1058,10 +1083,15 @@ xfs_filemap_fault(
return xfs_filemap_page_mkwrite(vmf);
 
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+   ret = xfs_vma_checks(vmf->vma, inode);
+   if (ret)
+   goto out_unlock;
+
if (IS_DAX(inode))
ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &xfs_iomap_ops);
else
ret = filemap_fault(vmf);
+out_unlock:
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
return ret;
@@ -1094,7 +1124,9 @@ xfs_filemap_huge_fault(
}
 
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
-   ret = dax_iomap_fault(vmf, pe_size, &xfs_iomap_ops);
+   ret = xfs_vma_checks(vmf->vma, inode);
+   if (ret == 0)
+   ret = dax_iomap_fault(vmf, pe_size, &xfs_iomap_ops);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 
if (vmf->flags & FAULT_FLAG_WRITE)
@@ -1137,6 +1169,61 @@ xfs_filemap_pfn_mkwrite(
 
 }

[PATCH v6 3/5] mm: introduce mmap3 for safely defining new mmap flags

2017-08-23 Thread Dan Williams
The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a
mechanism to define new behavior that is known to fail on older kernels
without the support. Define a new mmap3 syscall that checks for
unsupported flags at syscall entry and add a 'mmap_supported_mask' to
'struct file_operations' so generic code can validate that the ->mmap()
handler knows about the specified flags. This also arranges for the
flags to be passed to the handler so it can locally validate whether the
requested behavior can be fulfilled.
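The entry-point check described above reduces to a mask test; here is a
simplified standalone model (the flag values are made up for illustration, and
validate_mmap_flags is a sketch, not the patch's actual function):

```c
#include <errno.h>

/* Illustrative stand-in flag bits, not the real uapi values. */
#define MAP_SHARED_F   0x01u
#define MAP_PRIVATE_F  0x02u
#define MAP_FIXED_F    0x10u
#define MAP_DIRECT_F   0x80000u

/*
 * Sketch of the mmap3() entry check: any flag bit the ->mmap()
 * handler has not declared in its mmap_supported_mask is rejected
 * up front instead of being silently ignored, which is what lets
 * old kernels fail new flags deterministically.
 */
static inline int validate_mmap_flags(unsigned long flags,
				      unsigned long supported_mask)
{
	if (flags & ~supported_mask)
		return -EOPNOTSUPP;
	return 0;
}
```

A handler that never set mmap_supported_mask would get the legacy mask, so
MAP_DIRECT requests against it fail cleanly with EOPNOTSUPP.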

Cc: Jan Kara 
Cc: Arnd Bergmann 
Cc: Andrew Morton 
Suggested-by: Andy Lutomirski 
Signed-off-by: Dan Williams 
---
 arch/x86/entry/syscalls/syscall_32.tbl |1 +
 arch/x86/entry/syscalls/syscall_64.tbl |1 +
 include/linux/fs.h |1 +
 include/linux/mm.h |2 +-
 include/linux/mman.h   |   42 
 include/linux/syscalls.h   |3 ++
 mm/mmap.c  |   32 ++--
 7 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index 448ac2161112..0618b5b38b45 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
 382i386pkey_free   sys_pkey_free
 383i386statx   sys_statx
 384i386arch_prctl  sys_arch_prctl  
compat_sys_arch_prctl
+385i386mmap3   sys_mmap_pgoff_strict
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..e204c736d7e9 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
 330common  pkey_alloc  sys_pkey_alloc
 331common  pkey_free   sys_pkey_free
 332common  statx   sys_statx
+333common  mmap3   sys_mmap_pgoff_strict
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 33d1ee8f51be..db42da9f98c4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1674,6 +1674,7 @@ struct file_operations {
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *, unsigned long);
+   unsigned long mmap_supported_mask;
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
int (*release) (struct inode *, struct file *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5e8569..49eef48da4b7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2090,7 +2090,7 @@ extern unsigned long get_unmapped_area(struct file *, 
unsigned long, unsigned lo
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-   struct list_head *uf);
+   struct list_head *uf, unsigned long flags);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
diff --git a/include/linux/mman.h b/include/linux/mman.h
index c8367041fafd..64b6cb3dec70 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -7,6 +7,48 @@
 #include 
 #include 
 
+/*
+ * Arrange for undefined architecture specific flags to be rejected by
+ * default.
+ */
+#ifndef MAP_32BIT
+#define MAP_32BIT 0
+#endif
+#ifndef MAP_HUGE_2MB
+#define MAP_HUGE_2MB 0
+#endif
+#ifndef MAP_HUGE_1GB
+#define MAP_HUGE_1GB 0
+#endif
+#ifndef MAP_UNINITIALIZED
+#define MAP_UNINITIALIZED 0
+#endif
+
+/*
+ * The historical set of flags that all mmap implementations implicitly
+ * support when file_operations.mmap_supported_mask is zero. With the
+ * mmap3 syscall the deprecated MAP_DENYWRITE and MAP_EXECUTABLE bit
+ * values are explicitly rejected with EOPNOTSUPP rather than being
+ * silently accepted.
+ */
+#define LEGACY_MAP_SUPPORTED_MASK (MAP_SHARED \
+   | MAP_PRIVATE \
+   | MAP_FIXED \
+   | MAP_ANONYMOUS \
+   | MAP_UNINITIALIZED \
+   | MAP_GROWSDOWN \
+   | MAP_LOCKED \
+   | MAP_NORESERVE \
+   | MAP_POPULATE \
+   | MAP_NONBLOCK \
+   | MAP_STACK \
+   | MAP_HUGETLB \
+   | MAP_32BIT \
+   | MAP_HUGE_2MB \
+   | MAP_HUGE_1GB)
+
+#defineMAP_SUPPORTED_MASK 

[PATCH v6 2/5] fs, xfs: introduce S_IOMAP_SEALED

2017-08-23 Thread Dan Williams
When a filesystem sees this flag set it will not allow changes to the
file-offset to physical-block-offset relationship of any extent in the
file. The scope of the extents covered by the global S_IOMAP_SEALED flag
is filesystem specific. In other words, it is similar to the inode-wide
XFS_DIFLAG2_REFLINK flag, where the distinction applies globally to the
inode even though we could theoretically limit the effect to a sub-range
of the file.

The interface that sets this flag (mmap(..., MAP_DIRECT, ...)) will be
careful to document that it is implementation specific whether the
'sealed' restrictions apply to a sub-range or the whole file.
Applications should be prepared for unrelated ranges in the file to be
affected.

The term 'sealed' is used instead of 'immutable' to better indicate that
this is a file property that is temporary and can be undone.
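A minimal model of the sealed-inode pattern (all names here are illustrative
stand-ins; the real checks are spread across inode_newsize_ok, vfs_fallocate,
xfs_bmapi_write, etc., as the diff shows):

```c
#include <errno.h>
#include <stdbool.h>

/* Miniature stand-in for an inode with a temporary seal bit. */
struct mini_inode {
	unsigned int flags;
};

#define S_MINI_IOMAP_SEALED 0x1u

static inline bool mini_iomap_sealed(const struct mini_inode *inode)
{
	return inode->flags & S_MINI_IOMAP_SEALED;
}

/*
 * Any operation that would mutate the block map bails out early with
 * ETXTBSY while the seal is held; clearing the bit restores normal
 * behavior, which is why 'sealed' rather than 'immutable' fits.
 */
static int mini_fallocate(struct mini_inode *inode)
{
	if (mini_iomap_sealed(inode))
		return -ETXTBSY;
	return 0;
}
```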

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: Alexander Viro 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 fs/attr.c|   10 ++
 fs/open.c|6 ++
 fs/read_write.c  |3 +++
 fs/xfs/libxfs/xfs_bmap.c |5 +
 fs/xfs/xfs_bmap_util.c   |3 +++
 fs/xfs/xfs_ioctl.c   |6 ++
 include/linux/fs.h   |2 ++
 mm/filemap.c |5 +
 8 files changed, 40 insertions(+)

diff --git a/fs/attr.c b/fs/attr.c
index 135304146120..d940386e0ca9 100644
--- a/fs/attr.c
+++ b/fs/attr.c
@@ -112,6 +112,16 @@ EXPORT_SYMBOL(setattr_prepare);
  */
 int inode_newsize_ok(const struct inode *inode, loff_t offset)
 {
+   if (IS_IOMAP_SEALED(inode)) {
+   /*
+* Any size change is disallowed. Size increases may
+* dirty metadata that an application is not prepared to
+* sync, and a size decrease may expose free blocks to
+* in-flight DMA.
+*/
+   return -ETXTBSY;
+   }
+
if (inode->i_size < offset) {
unsigned long limit;
 
diff --git a/fs/open.c b/fs/open.c
index 35bb784763a4..92d89ec2d6b3 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -292,6 +292,12 @@ int vfs_fallocate(struct file *file, int mode, loff_t 
offset, loff_t len)
return -ETXTBSY;
 
/*
+* We cannot allow any allocation changes on an iomap sealed file
+*/
+   if (IS_IOMAP_SEALED(inode))
+   return -ETXTBSY;
+
+   /*
 * Revalidate the write permissions, in case security policy has
 * changed since the files were opened.
 */
diff --git a/fs/read_write.c b/fs/read_write.c
index 0cc7033aa413..55700ca85f7e 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1706,6 +1706,9 @@ int vfs_clone_file_prep_inodes(struct inode *inode_in, 
loff_t pos_in,
if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
return -ETXTBSY;
 
+   if (IS_IOMAP_SEALED(inode_in) || IS_IOMAP_SEALED(inode_out))
+   return -ETXTBSY;
+
/* Don't reflink dirs, pipes, sockets... */
if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
return -EISDIR;
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index c09c16b1ad3b..241f3a272f49 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4481,6 +4481,11 @@ xfs_bmapi_write(
if (XFS_FORCED_SHUTDOWN(mp))
return -EIO;
 
+   /* fail any attempts to mutate data extents */
+   if (IS_IOMAP_SEALED(VFS_I(ip))
+   && !(flags & (XFS_BMAPI_METADATA | XFS_BMAPI_ATTRFORK)))
+   return -ETXTBSY;
+
ifp = XFS_IFORK_PTR(ip, whichfork);
 
XFS_STATS_INC(mp, xs_blk_mapw);
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 93e955262d07..ef4c4e8b0f58 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1294,6 +1294,9 @@ xfs_free_file_space(
 
trace_xfs_free_file_space(ip);
 
+   if (IS_IOMAP_SEALED(VFS_I(ip)))
+   return -ETXTBSY;
+
error = xfs_qm_dqattach(ip, 0);
if (error)
return error;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 9c0c7a920304..845587e6928b 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1730,6 +1730,12 @@ xfs_ioc_swapext(
goto out_put_tmp_file;
}
 
+   if (IS_IOMAP_SEALED(file_inode(f.file)) ||
+   IS_IOMAP_SEALED(file_inode(tmp.file))) {
+   error = -EINVAL;
+   goto out_put_tmp_file;
+   }
+
/*
 * We need to ensure that the fds passed in point to XFS inodes
 * before we cast and access them as XFS structures as we have no
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 47249bbe973c..33d1ee8f51be 

[PATCH v6 5/5] fs, fcntl: add F_MAP_DIRECT

2017-08-23 Thread Dan Williams
A common pattern for granting a privilege to an unprivileged process is
to pass it an established file descriptor over a unix domain socket.
This enables fine-grained access to the MAP_DIRECT mechanism instead of
requiring that the mapping process hold CAP_LINUX_IMMUTABLE.
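The descriptor-passing side of this pattern is standard SCM_RIGHTS ancillary
data; the sketch below (helper names are illustrative) shows how a privileged
broker that called fcntl(fd, F_MAP_DIRECT) could hand the descriptor to an
unprivileged peer:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one file descriptor over a connected unix domain socket. */
int send_fd(int sock, int fd)
{
	char dummy = 'x';
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a file descriptor; returns the new fd or -1. */
int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		struct cmsghdr align;
		char buf[CMSG_SPACE(sizeof(int))];
	} u;
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = u.buf, .msg_controllen = sizeof(u.buf),
	};
	int fd = -1;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
	    cmsg->cmsg_type == SCM_RIGHTS)
		memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The received descriptor carries f_map_direct with it, so the recipient can
mmap(MAP_DIRECT) without holding the capability itself.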

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: Alexander Viro 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 fs/fcntl.c |   15 +++
 include/linux/fs.h |5 +++--
 include/uapi/linux/fcntl.h |5 +
 mm/mmap.c  |3 ++-
 4 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 3b01b646e528..f2375c406e6f 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -318,6 +318,18 @@ static long fcntl_rw_hint(struct file *file, unsigned int 
cmd,
}
 }
 
+static int fcntl_map_direct(struct file *filp)
+{
+   if (!capable(CAP_LINUX_IMMUTABLE))
+   return -EACCES;
+
+   spin_lock(&filp->f_lock);
+   filp->f_map_direct = 1;
+   spin_unlock(&filp->f_lock);
+
+   return 0;
+}
+
 static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
struct file *filp)
 {
@@ -425,6 +437,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned 
long arg,
case F_SET_FILE_RW_HINT:
err = fcntl_rw_hint(filp, cmd, arg);
break;
+   case F_MAP_DIRECT:
+   err = fcntl_map_direct(filp);
+   break;
default:
break;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index db42da9f98c4..ec2e1d6bf22c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -855,11 +855,12 @@ struct file {
const struct file_operations*f_op;
 
/*
-* Protects f_ep_links, f_flags.
+* Protects f_ep_links, f_flags, f_write_hint, and f_map_direct.
 * Must not be taken from IRQ context.
 */
spinlock_t  f_lock;
-   enum rw_hintf_write_hint;
+   enum rw_hintf_write_hint:3;
+   unsigned intf_map_direct:1;
atomic_long_t   f_count;
unsigned intf_flags;
fmode_t f_mode;
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index ec69d55bcec7..2a57a503174e 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -53,6 +53,11 @@
 #define F_SET_FILE_RW_HINT (F_LINUX_SPECIFIC_BASE + 14)
 
 /*
+ * Enable MAP_DIRECT on the file without CAP_LINUX_IMMUTABLE
+ */
+#define F_MAP_DIRECT   (F_LINUX_SPECIFIC_BASE + 15)
+
+/*
  * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
  * used to clear any hints previously set.
  */
diff --git a/mm/mmap.c b/mm/mmap.c
index 32417b2a668c..cf5e0cb7d0e3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1399,7 +1399,8 @@ unsigned long do_mmap(struct file *file, unsigned long 
addr,
if (flags & MAP_DIRECT) {
if (!(prot & PROT_WRITE))
return -EACCES;
-   if (!capable(CAP_LINUX_IMMUTABLE))
+   if (!file->f_map_direct
+   && !capable(CAP_LINUX_IMMUTABLE))
return -EACCES;
}
 

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH v6 1/5] vfs: add flags parameter to ->mmap() in 'struct file_operations'

2017-08-23 Thread Dan Williams
We are running short of vma->vm_flags bits. We can avoid needing a
new VM_* flag in some cases if the original @flags value submitted to
mmap(2) is made available to the ->mmap() 'struct file_operations'
implementation. For example, the proposed addition of MAP_DIRECT can be
implemented without taking up a new vm_flags bit. Another motivation to
avoid vm_flags is that they appear in /proc/$pid/smaps, and we have seen
software that dangerously (TOCTOU) reads smaps to infer the behavior of
a virtual address range.

This conversion was performed by the following semantic patch. There
were a few manual edits for oddities like proc_reg_mmap.

Thanks to Julia for helping me with coccinelle iteration to cover cases
where the mmap routine is defined in a separate file from the 'struct
file_operations' instance that consumes it.

// Usage:
// $ spatch mmap.cocci --no-includes --include-headers --dir .
//  --in-place ./ -j $num_cpus --very-quiet

virtual after_start

@initialize:ocaml@
@@

let tbl = Hashtbl.create(100)

let add_if_not_present fn =
  if not(Hashtbl.mem tbl fn) then Hashtbl.add tbl fn ()

@ a @
identifier fn;
identifier ops;
@@

struct file_operations ops = { ..., .mmap = fn, ...};

@script:ocaml@
fn << a.fn;
@@

add_if_not_present fn

@finalize:ocaml depends on !after_start@
tbls << merge.tbl;
@@

List.iter (fun t -> Hashtbl.iter (fun f _ -> add_if_not_present f) t) tbls;
Hashtbl.iter
(fun f _ ->
  let it = new iteration() in
  it#add_virtual_rule After_start;
  it#add_virtual_identifier Fn f;
  it#register())
tbl

@depends on after_start@
identifier virtual.fn;
identifier x, y;
@@

int fn(struct file *x,
struct vm_area_struct *y
-   )
+   , unsigned long map_flags)
{
...
}

@depends on after_start@
identifier virtual.fn;
identifier x, y;
@@

int fn(struct file *x,
struct vm_area_struct *y
-   );
+   , unsigned long map_flags);

@depends on after_start@
identifier virtual.fn;


@@

int fn(struct file *,
struct vm_area_struct *
-   );
+   , unsigned long);

Cc: Takashi Iwai 
Cc: David Airlie 
Cc: 
Cc: Daniel Vetter 
Signed-off-by: Julia Lawall 
Suggested-by: Jan Kara 
Cc: Andrew Morton 
Signed-off-by: Dan Williams 
---
 arch/arc/kernel/arc_hostlink.c |3 ++-
 arch/mips/kernel/vdso.c|2 +-
 arch/powerpc/kernel/proc_powerpc.c |3 ++-
 arch/powerpc/kvm/book3s_64_vio.c   |3 ++-
 arch/powerpc/platforms/cell/spufs/file.c   |   21 +---
 arch/powerpc/platforms/powernv/opal-prd.c  |3 ++-
 arch/um/drivers/mmapper_kern.c |3 ++-
 drivers/android/binder.c   |3 ++-
 drivers/char/agp/frontend.c|3 ++-
 drivers/char/bsr.c |3 ++-
 drivers/char/hpet.c|6 --
 drivers/char/mbcs.c|3 ++-
 drivers/char/mbcs.h|3 ++-
 drivers/char/mem.c |   11 +++---
 drivers/char/mspec.c   |9 ++---
 drivers/char/uv_mmtimer.c  |6 --
 drivers/dax/device.c   |3 ++-
 drivers/dma-buf/dma-buf.c  |4 +++-
 drivers/firewire/core-cdev.c   |3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c|3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h|3 ++-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c   |5 +++--
 drivers/gpu/drm/arc/arcpgu_drv.c   |5 +++--
 drivers/gpu/drm/ast/ast_drv.h  |3 ++-
 drivers/gpu/drm/ast/ast_ttm.c  |3 ++-
 drivers/gpu/drm/bochs/bochs.h  |3 ++-
 drivers/gpu/drm/bochs/bochs_mm.c   |3 ++-
 drivers/gpu/drm/cirrus/cirrus_drv.h|3 ++-
 drivers/gpu/drm/cirrus/cirrus_ttm.c|3 ++-
 drivers/gpu/drm/drm_gem.c  |3 ++-
 drivers/gpu/drm/drm_gem_cma_helper.c   |6 --
 drivers/gpu/drm/drm_vm.c   |3 ++-
 drivers/gpu/drm/etnaviv/etnaviv_drv.h  |3 ++-
 drivers/gpu/drm/etnaviv/etnaviv_gem.c  |5 +++--
 drivers/gpu/drm/exynos/exynos_drm_gem.c|5 +++--
 drivers/gpu/drm/exynos/exynos_drm_gem.h|3 ++-
 drivers/gpu/drm/hisilicon/hibmc/hibmc_drm_drv.h|3 ++-
 drivers/gpu/drm/hisilicon/hibmc/hibmc_ttm.c|3 ++-
 drivers/gpu/drm/i810/i810_dma.c|3 ++-
 drivers/gpu/drm/i915/i915_gem_dmabuf.c |2 +-
 

[PATCH v6 0/5] MAP_DIRECT and block-map-atomic files

2017-08-23 Thread Dan Williams
Changes since v5 [1]:
* Compile fixes from a much improved coccinelle semantic patch (thanks
  Julia!) that adds a 'flags' argument to all the ->mmap()
  implementations in the kernel. (0day-kbuild-robot)

* Make the deprecated MAP_DENYWRITE and MAP_EXECUTABLE flags return
  EOPNOTSUPP with the new mmap3() syscall. (Kirill)

* Minor changelog updates.

* Updated cover letter with a clarified summary and checklist of
  questions to answer before proceeding further.

---

MAP_DIRECT is a mechanism to ask the kernel to atomically manage the
file-offset to physical-address block relationship of a mapping relative
to any memory-mapped access. It is complementary to the proposed
MAP_SYNC mechanism which makes the same guarantee relative to cpu
faults. MAP_DIRECT goes a step further and makes this guarantee for
agents that may not generate mmu faults, but at the cost of restricting
the kernel's ability to mutate the block-map at will.

MAP_SYNC is preferred for scenarios that want full filesystem feature
support while avoiding fsync/msync overhead, but also do not need to
contend with hypervisors or RDMA agents that do not give the kernel an
mmu fault. In other words, the need for MAP_DIRECT is driven by the
scarcity of SVM capable hardware (Shared Virtual Memory, where hardware
generates mmu faults), hypervisors like Xen that need to interrogate the
physical address layout of a file to maintain their own physical-address
mapping metadata outside the kernel, and peer-to-peer DMA use cases that
always bypass the mmu.

The MAP_DIRECT mechanism allows a filesystem to be used for capacity
provisioning and access control where these aforementioned applications
would otherwise be forced to roll a custom solution on top of a raw
device-file.

Questions:
1/ Is the definition of MAP_DIRECT constrained enough to allow us to
   make the restrictions it imposes on the kernel finer grained over time?

2/ Do the XFS changes look sane? They attempt to avoid adding any
   overhead to the non-MAP_DIRECT case at the expense of the new
   i_mapdcount atomic counter in the XFS inode.

3/ While the generic MAP_DIRECT description warns that the block-map may
   not actually be immutable for the lifetime of the mapping, it also
   does not preclude a filesystem from making that guarantee. In fact,
   Dave wants to be able to get a stable view of the physical mapping
   [2], and Xen has a need to do the same [3]. Do we want userspace to
   start making "XFS + MAP_DIRECT == Immutable" assumptions, or do we
   need a separate mechanism for that guarantee?

[1]: https://lkml.org/lkml/2017/8/16/114
[2]: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1467677.html
[3]: https://lists.xen.org/archives/html/xen-devel/2017-04/msg00419.html

---

Dan Williams (5):
  vfs: add flags parameter to ->mmap() in 'struct file_operations'
  fs, xfs: introduce S_IOMAP_SEALED
  mm: introduce mmap3 for safely defining new mmap flags
  fs, xfs: introduce MAP_DIRECT for creating block-map-atomic file ranges
  fs, fcntl: add F_MAP_DIRECT


 arch/arc/kernel/arc_hostlink.c |3 -
 arch/mips/kernel/vdso.c|2 
 arch/powerpc/kernel/proc_powerpc.c |3 -
 arch/powerpc/kvm/book3s_64_vio.c   |3 -
 arch/powerpc/platforms/cell/spufs/file.c   |   21 ++--
 arch/powerpc/platforms/powernv/opal-prd.c  |3 -
 arch/um/drivers/mmapper_kern.c |3 -
 arch/x86/entry/syscalls/syscall_32.tbl |1 
 arch/x86/entry/syscalls/syscall_64.tbl |1 
 drivers/android/binder.c   |3 -
 drivers/char/agp/frontend.c|3 -
 drivers/char/bsr.c |3 -
 drivers/char/hpet.c|6 +
 drivers/char/mbcs.c|3 -
 drivers/char/mbcs.h|3 -
 drivers/char/mem.c |   11 +-
 drivers/char/mspec.c   |9 +-
 drivers/char/uv_mmtimer.c  |6 +
 drivers/dax/device.c   |3 -
 drivers/dma-buf/dma-buf.c  |4 +
 drivers/firewire/core-cdev.c   |3 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c|3 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.h|3 -
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c   |5 +
 drivers/gpu/drm/arc/arcpgu_drv.c   |5 +
 drivers/gpu/drm/ast/ast_drv.h  |3 -
 drivers/gpu/drm/ast/ast_ttm.c  |3 -
 drivers/gpu/drm/bochs/bochs.h  |3 -
 drivers/gpu/drm/bochs/bochs_mm.c   |3 -
 drivers/gpu/drm/cirrus/cirrus_drv.h|3 -
 drivers/gpu/drm/cirrus/cirrus_ttm.c|3 -
 drivers/gpu/drm/drm_gem.c 

Re: [PATCH 1/2] nfit: Use init_completion() in acpi_nfit_flush_probe()

2017-08-23 Thread Dan Williams
On Wed, Aug 23, 2017 at 8:25 AM, Boqun Feng  wrote:
> There is no need to use COMPLETION_INITIALIZER_ONSTACK() in
> acpi_nfit_flush_probe(), replace it with init_completion().

Now that I see the better version of this patch with the improved
changelog in the -mm tree...

Acked-by: Dan Williams 


[PATCH v3] Fix ext4 fault handling when mounted with -o dax,ro

2017-08-23 Thread rdodgen
From: Randy Dodgen 

If an ext4 filesystem is mounted with both the DAX and read-only
options, executables on that filesystem will fail to start (claiming
'Segmentation fault') due to the fault handler returning
VM_FAULT_SIGBUS.

This is due to the DAX fault handler (see ext4_dax_huge_fault)
attempting to write to the journal when FAULT_FLAG_WRITE is set. This is
the wrong behavior for write faults which will lead to a COW page; in
particular, this fails for readonly mounts.

This change avoids journal writes for faults that are expected to COW.

It might be the case that this could be better handled in
ext4_iomap_begin / ext4_iomap_end (called via iomap_ops inside
dax_iomap_fault). There is some overlap already (e.g. grabbing journal
handles).
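The heart of the change is a single predicate; modeled in isolation below (with
stand-in flag values, since this is a sketch rather than kernel code):

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel's FAULT_FLAG_WRITE / VM_SHARED. */
#define FAULT_FLAG_WRITE_F 0x1u
#define VM_SHARED_F        0x2u

/*
 * A fault only needs a journal handle when it is a write that will
 * actually reach the file, i.e. a write fault on a shared mapping.
 * A write fault on a private mapping COWs, leaves the file untouched,
 * and therefore must not poke the (possibly read-only) journal.
 */
static inline bool needs_journal(unsigned int fault_flags,
				 unsigned int vm_flags)
{
	return (fault_flags & FAULT_FLAG_WRITE_F) &&
	       (vm_flags & VM_SHARED_F);
}
```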

Signed-off-by: Randy Dodgen 
---

This version is simplified as suggested by Ross; all fault sizes and fallbacks
are handled by dax_iomap_fault.

 fs/ext4/file.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0d7cf0cc9b87..dc1e1fb6b54c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -279,7 +279,20 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
handle_t *handle = NULL;
struct inode *inode = file_inode(vmf->vma->vm_file);
struct super_block *sb = inode->i_sb;
-   bool write = vmf->flags & FAULT_FLAG_WRITE;
+
+   /*
+* We have to distinguish real writes from writes which will result in a
+* COW page; COW writes should *not* poke the journal (the file will not
+* be changed). Doing so would cause unintended failures when mounted
+* read-only.
+*
+* We check for VM_SHARED rather than vmf->cow_page since the latter is
+* unset for pe_size != PE_SIZE_PTE (i.e. only in do_cow_fault); for
+* other sizes, dax_iomap_fault will handle splitting / fallback so that
+* we eventually come back with a COW page.
+*/
+   bool write = (vmf->flags & FAULT_FLAG_WRITE) &&
+   (vmf->vma->vm_flags & VM_SHARED);
 
if (write) {
sb_start_pagefault(sb);
-- 
2.14.1.342.g6490525c54-goog



[PATCH] nvdimm: fix potential double-fetch bug

2017-08-23 Thread Meng Xu
From: Meng Xu 

While examining the kernel source code, I found a dangerous operation that
could turn into a double-fetch situation (a race-condition bug), where
the same userspace memory region is fetched twice into the kernel, with
sanity checks applied after the first fetch but not after the second.

In the case of _IOC_NR(ioctl_cmd) == ND_CMD_CALL:

1. The first fetch happens in line 935: copy_from_user(&pkg, p, sizeof(pkg))

2. subsequently `pkg.nd_reserved2` is asserted to be all zeroes
(line 984 to 986).

3. The second fetch happens in line 1022 copy_from_user(buf, p, buf_len)

4. Given that `p` can be fully controlled from userspace, an attacker can
race to overwrite the header part of `p`, say,
`((struct nd_cmd_pkg *)p)->nd_reserved2`, with an arbitrary value
(say nine 0x for `nd_reserved2`) after the first fetch but before the
second fetch. The changed value will be copied to `buf`.

5. There are no checks on the second fetch before its use in
line 1034: nd_cmd_clear_to_send(nvdimm_bus, nvdimm, cmd, buf) and
line 1038: nd_desc->ndctl(nd_desc, nvdimm, cmd, buf, buf_len, &cmd_rc),
which means that the assumed invariant, that `p->nd_reserved2` is all
zeroes, might not hold after the second fetch. And once control reaches
these functions we lose the context to assert it.

6. Based on my manual analysis, `p->nd_reserved2` is not used in
`nd_cmd_clear_to_send` or in the potential implementations of
`nd_desc->ndctl`, so there is no working exploit against it right now.
However, this could easily turn into an exploitable bug if careless
developers later start to use `p->nd_reserved2` and assume it is all
zeroes.

Proposed patch:

The patch explicitly overrides `buf->nd_reserved2` after the second fetch
with the value `pkg.nd_reserved2` from the first fetch. In this way, it is
assured that the invariant, that `buf->nd_reserved2` is all zeroes, holds
after the second fetch.
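The mitigation can be modeled in userspace as follows (the struct layout and
names below are simplified stand-ins for `struct nd_cmd_pkg`, not the driver's
actual definitions):

```c
#include <string.h>

/* Simplified stand-in for the ioctl package header + payload. */
struct pkg_hdr {
	unsigned char nd_reserved2[8];
	unsigned char payload[8];
};

static int hdr_valid(const struct pkg_hdr *h)
{
	for (size_t i = 0; i < sizeof(h->nd_reserved2); i++)
		if (h->nd_reserved2[i])
			return 0;
	return 1;
}

/*
 * 'user' is concurrently writable, as a userspace pointer is in the
 * real ioctl path. Fetch it twice the way __nd_ioctl does, then
 * re-impose the validated header from the first fetch (the proposed
 * fix), so a racing writer cannot smuggle new header bytes into 'out'.
 */
int fetch_twice_fixed(const struct pkg_hdr *user, struct pkg_hdr *out)
{
	struct pkg_hdr first;

	memcpy(&first, user, sizeof(first));     /* first fetch */
	if (!hdr_valid(&first))
		return -1;
	/* ...an attacker may modify *user here... */
	memcpy(out, user, sizeof(*out));         /* second fetch */
	/* the fix: override with the validated first-fetch header */
	memcpy(out->nd_reserved2, first.nd_reserved2,
	       sizeof(out->nd_reserved2));
	return 0;
}
```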

Signed-off-by: Meng Xu 
---
 drivers/nvdimm/bus.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/nvdimm/bus.c b/drivers/nvdimm/bus.c
index 937fafa..20c4d0f 100644
--- a/drivers/nvdimm/bus.c
+++ b/drivers/nvdimm/bus.c
@@ -1024,6 +1024,12 @@ static int __nd_ioctl(struct nvdimm_bus *nvdimm_bus, 
struct nvdimm *nvdimm,
goto out;
}
 
+   if (cmd == ND_CMD_CALL) {
+   struct nd_cmd_pkg *hdr = (struct nd_cmd_pkg *)buf;
+   memcpy(hdr->nd_reserved2, pkg.nd_reserved2,
+   sizeof(pkg.nd_reserved2));
+   }
+
nvdimm_bus_lock(&nvdimm_bus->dev);
rc = nd_cmd_clear_to_send(nvdimm_bus, nvdimm, func, buf);
if (rc)
-- 
2.7.4



Re: [PATCH v2] Fix ext4 fault handling when mounted with -o dax,ro

2017-08-23 Thread Randy Dodgen
That's a nice simplification. I started cautiously by replicating the same
checks for dax.c (dax_iomap_pte_fault checks for cow_page specifically). I
recall that it used to be possible for COW pages to appear in VM_SHARED
mappings, but I'm glad to see that went away in cda540ace6a19. I'll send a new
version today.

One potential advantage hiding in the more complicated checks is that we avoid
repeatedly grabbing the journal as we fall back from PUD -> PMD or PUD -> PMD ->
PTE (see __handle_mm_fault and the VM_FAULT_FALLBACK checks). I will defer to
the ext4 folks w.r.t. whether that is worthwhile; if so, there will need to be
some thought on how to tweak the new .huge_fault protocol, or how to move the
journal bits after the dax_iomap_fault fallbacks (maybe in ext4_iomap_{begin,
end}?)

Regarding ext4's behavior in the non-DAX case, note that those vm_ops don't
have a .huge_fault handler, and .fault delegates to filemap_fault (which as you
mention doesn't care about FAULT_FLAG_WRITE etc). Ignoring .huge_fault, we can
assume that .page_mkwrite will be called at just the right times (e.g. as part
of do_shared_fault but not do_cow_fault).

Meanwhile, implementing .huge_fault is much trickier; there is no
".huge_mkwrite" (so some prediction of COW is needed, as here) and one must
remember to split huge entries before returning VM_FAULT_FALLBACK (see 59bf4fb9
; not doing so in __dax_pmd_fault was resulting in repeated PMD faults not
making progress). Maybe there is room to improve this.

On Wed, Aug 23, 2017 at 9:38 AM, Ross Zwisler
 wrote:
> On Tue, Aug 22, 2017 at 08:37:04PM -0700, rdod...@gmail.com wrote:
>> From: Randy Dodgen 
>>
>> If an ext4 filesystem is mounted with both the DAX and read-only
>> options, executables on that filesystem will fail to start (claiming
>> 'Segmentation fault') due to the fault handler returning
>> VM_FAULT_SIGBUS.
>>
>> This is due to the DAX fault handler (see ext4_dax_huge_fault)
>> attempting to write to the journal when FAULT_FLAG_WRITE is set. This is
>> the wrong behavior for write faults which will lead to a COW page; in
>> particular, this fails for readonly mounts.
>>
>> This changes replicates some check from dax_iomap_fault to more
>> precisely reason about when a journal-write is needed.
>>
>> It might be the case that this could be better handled in
>> ext4_iomap_begin / ext4_iomap_end (called via iomap_ops inside
>> dax_iomap_fault). There is some overlap already (e.g. grabbing journal
>> handles).
>>
>> Signed-off-by: Randy Dodgen 
>> ---
>>
>> I'm resending for some DMARC-proofing (thanks Ted for the explanation), a
>> missing Signed-off-by, and some extra cc's. Oops!
>>
>>  fs/ext4/file.c | 26 +-
>>  1 file changed, 25 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> index 0d7cf0cc9b87..d512fb85a3e3 100644
>> --- a/fs/ext4/file.c
>> +++ b/fs/ext4/file.c
>> @@ -279,7 +279,31 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
>>   handle_t *handle = NULL;
>>   struct inode *inode = file_inode(vmf->vma->vm_file);
>>   struct super_block *sb = inode->i_sb;
>> - bool write = vmf->flags & FAULT_FLAG_WRITE;
>> + bool write;
>> +
>> + /*
>> +  * We have to distinguish real writes from writes which will result in a
>> +  * COW page
>> +  * - COW writes need to fall-back to installing PTEs. See
>> +  *   dax_iomap_pmd_fault.
>> +  * - COW writes should *not* poke the journal (the file will not be
>> +  *   changed). Doing so would cause unintended failures when mounted
>> +  *   read-only.
>> +  */
>> + if (pe_size == PE_SIZE_PTE) {
>> + /* See dax_iomap_pte_fault. */
>> + write = (vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page;
>> + } else if (pe_size == PE_SIZE_PMD) {
>> + /* See dax_iomap_pmd_fault. */
>> + write = vmf->flags & FAULT_FLAG_WRITE;
>> + if (write && !(vmf->vma->vm_flags & VM_SHARED)) {
>> + split_huge_pmd(vmf->vma, vmf->pmd, vmf->address);
>> + count_vm_event(THP_FAULT_FALLBACK);
>> + return VM_FAULT_FALLBACK;
>> + }
>> + } else {
>> + return VM_FAULT_FALLBACK;
>> + }
>
> This works in my setup, though the logic could be simpler.
>
> For all fault sizes you can rely on the fact that a COW write will happen when
> we have FAULT_FLAG_WRITE but not VM_SHARED.  This is the logic that we use to
> know to set up vmf->cow_page() in do_fault() by calling do_cow_fault(), and in
> finish_fault().
>
> I think your test can then just become:
>
> write = (vmf->flags & FAULT_FLAG_WRITE) &&
> (vmf->vma->vm_flags & VM_SHARED);
>
> With some appropriate commenting.
>
> You can then let the DAX fault handlers worry about validating the fault size
> and splitting the PMD on fallback.
>
> I'll let someone with more ext4-fu comment on whether it is okay to skip the
> journal entry when doing a COW fault.  This must be handled in ext4 for the
> non-DAX case, but I don't see any more checks for VM_SHARED or
> FAULT_FLAG_WRITE in fs/ext4, so maybe there is a better way?

Re: [PATCH v5 7/7] libnvdimm: Add blk-mq pmem driver

2017-08-23 Thread Dan Williams
On Wed, Aug 23, 2017 at 12:56 PM, Dave Jiang  wrote:
>
>
> On 08/23/2017 11:39 AM, Dan Williams wrote:
>> On Mon, Aug 21, 2017 at 2:11 PM, Dave Jiang  wrote:
>>> Adding a DMA supported blk-mq driver for pmem.
>>
>> "Add support for offloading pmem block-device I/O operations to a DMA 
>> engine."
>>
>>> This provides signficant CPU
>>
>> *significant
>>
>>> utilization reduction.
>>
>> "at the cost of some increased latency and bandwidth reduction in some 
>> cases."
>>
>>> By default the pmem driver will be using blk-mq with
>>
>> "By default the current cpu-copy based pmem driver will load, but this
>> driver can be manually selected with a modprobe configuration."
>>
>>> DMA through the dmaengine API. DMA can be turned off with use_dma=0 kernel
>>> parameter.
>>
>> Do we need the module option? It seems for debug / testing a user can
>> simply unload the ioatdma driver, otherwise we should use dma by
>> default.
>>
>>> Additional kernel parameters are provided:
>>>
>>> queue_depth: The queue depth for blk-mq. Typically in relation to what the
>>>  DMA engine can provide per queue/channel. This needs to take
>>>  into account of num_sg as well for some DMA engines. i.e.
>>>  num_sg * queue_depth < total descriptors available per queue or
>>>  channel.
>>>
>>> q_per_node: Hardware queues per node. Typically the number of channels the
>>> DMA engine can provide per socket.
>>> num_sg: Number of scatterlist we can handle per I/O request.
>>
>> Why do these need to be configurable?
>
> The concern is with other arch/platforms that have different DMA
> engines. The configurations would be platform dependent.

...but these are answers we should be able to get from dmaengine and
the specific DMA drivers in use. An end user has no chance of guessing
the right values.
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v5 7/7] libnvdimm: Add blk-mq pmem driver

2017-08-23 Thread Dave Jiang


On 08/23/2017 11:39 AM, Dan Williams wrote:
> On Mon, Aug 21, 2017 at 2:11 PM, Dave Jiang  wrote:
>> Adding a DMA supported blk-mq driver for pmem.
> 
> "Add support for offloading pmem block-device I/O operations to a DMA engine."
> 
>> This provides signficant CPU
> 
> *significant
> 
>> utilization reduction.
> 
> "at the cost of some increased latency and bandwidth reduction in some cases."
> 
>> By default the pmem driver will be using blk-mq with
> 
> "By default the current cpu-copy based pmem driver will load, but this
> driver can be manually selected with a modprobe configuration."
> 
>> DMA through the dmaengine API. DMA can be turned off with use_dma=0 kernel
>> parameter.
> 
> Do we need the module option? It seems for debug / testing a user can
> simply unload the ioatdma driver, otherwise we should use dma by
> default.
> 
>> Additional kernel parameters are provided:
>>
>> queue_depth: The queue depth for blk-mq. Typically in relation to what the
>>  DMA engine can provide per queue/channel. This needs to take
>>  into account of num_sg as well for some DMA engines. i.e.
>>  num_sg * queue_depth < total descriptors available per queue or
>>  channel.
>>
>> q_per_node: Hardware queues per node. Typically the number of channels the
>> DMA engine can provide per socket.
>> num_sg: Number of scatterlist we can handle per I/O request.
> 
> Why do these need to be configurable?

The concern is with other arch/platforms that have different DMA
engines. The configurations would be platform dependent.

> 
>>
>> Numbers below are measured against pmem simulated via DRAM using
>> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
>> platform.  Keep in mind the performance for persistent memory
>> will differ.
>> Fio 2.21 was used.
>>
>> 64k: 1 task queuedepth=1
>> CPU Read:  7631 MB/s  99.7% CPU    DMA Read:  2415 MB/s  54% CPU
>> CPU Write: 3552 MB/s  100% CPU     DMA Write: 2173 MB/s  54% CPU
>>
>> 64k: 16 tasks queuedepth=16
>> CPU Read:  36800 MB/s  1593% CPU   DMA Read:  29100 MB/s  607% CPU
>> CPU Write: 20900 MB/s  1589% CPU   DMA Write: 23400 MB/s  585% CPU
>>
>> 2M: 1 task queuedepth=1
>> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
>> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
>>
>> 2M: 16 tasks queuedepth=16
>> CPU Read:  18100 MB/s  1588% CPU   DMA Read:  21300 MB/s  180.9% CPU
>> CPU Write: 14100 MB/s  1594% CPU   DMA Write: 20400 MB/s  446.9% CPU
>>
>> Signed-off-by: Dave Jiang 
>> Reviewed-by: Ross Zwisler 
>>
>> Signed-off-by: Dave Jiang 
>> ---
>>  drivers/nvdimm/Kconfig   |   23 +
>>  drivers/nvdimm/Makefile  |3
>>  drivers/nvdimm/pmem.h|3
>>  drivers/nvdimm/pmem_mq.c |  853 
>> ++
>>  4 files changed, 882 insertions(+)
>>  create mode 100644 drivers/nvdimm/pmem_mq.c
>>
>> diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
>> index 5bdd499..c88c2bb 100644
>> --- a/drivers/nvdimm/Kconfig
>> +++ b/drivers/nvdimm/Kconfig
>> @@ -36,6 +36,29 @@ config BLK_DEV_PMEM
>>
>>   Say Y if you want to use an NVDIMM
>>
>> +config BLK_DEV_PMEM_MQ
> 
> Lets call the driver pmem_dma instead of pmem_mq, the "mq" is just a
> side effect of enabling dma support.
> 
>> +   tristate "PMEM: Persistent memory block device multi-queue support"
>> +   depends on m
> 
> Not sure what the exact kconfig syntax should be, but the policy I'm
> looking for is that this driver should be allowed to be built-in if
> the pmem driver is disabled.
> 
>> +   default LIBNVDIMM
>> +   select DAX
>> +   select ND_BTT if BTT
>> +   select ND_PFN if NVDIMM_PFN
> 
> I think these should probably move to a common symbol used by both
> pmem and pmem_dma.
> 
>> +   select DMA_ENGINE
> 
> Shouldn't this be "depends on"?
> 
>> +   help
>> + Memory ranges for PMEM are described by either an NFIT
>> + (NVDIMM Firmware Interface Table, see CONFIG_NFIT_ACPI), a
>> + non-standard OEM-specific E820 memory type (type-12, see
>> + CONFIG_X86_PMEM_LEGACY), or it is manually specified by the
>> + 'memmap=nn[KMG]!ss[KMG]' kernel command line (see
>> + Documentation/admin-guide/kernel-parameters.rst).  This driver
>> + converts these persistent memory ranges into block devices that are
>> + capable of DAX (direct-access) file system mappings.  See
>> + Documentation/nvdimm/nvdimm.txt for more details. This driver
>> + utilizes block layer multi-queue in order to support using DMA
>> + engines to help offload the data copying.
> 
> Rather than duplicating some of the pmem text I think this help text
> should only talk about the incremental differences relative to the
> base pmem driver.
> 
>> +
>> + Say Y if you want to use an NVDIMM

Re: [PATCH 10/13] mm: Wire up MAP_SYNC

2017-08-23 Thread Christoph Hellwig
On Thu, Aug 17, 2017 at 06:08:12PM +0200, Jan Kara wrote:
> Pretty crude for now...
> 
> Signed-off-by: Jan Kara 
> ---
>  fs/ext4/file.c  | 2 ++
>  include/linux/mm.h  | 1 +
>  include/linux/mman.h| 3 ++-
>  include/uapi/asm-generic/mman.h | 1 +
>  mm/mmap.c   | 5 +
>  5 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index f84bb29e941e..850037e140d7 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -340,6 +340,8 @@ static int ext4_file_mmap(struct file *file, struct 
> vm_area_struct *vma)
>   vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
>   } else {
>   vma->vm_ops = &ext4_file_vm_ops;
> + if (vma->vm_flags & VM_SYNC)
> + return -EOPNOTSUPP;
>   }

So each mmap instance would need to reject the flag explicitly?

Or do I misunderstand this VM_SYNC flag?


Re: [PATCH v5 7/7] libnvdimm: Add blk-mq pmem driver

2017-08-23 Thread Dan Williams
On Mon, Aug 21, 2017 at 2:11 PM, Dave Jiang  wrote:
> Adding a DMA supported blk-mq driver for pmem.

"Add support for offloading pmem block-device I/O operations to a DMA engine."

> This provides signficant CPU

*significant

> utilization reduction.

"at the cost of some increased latency and bandwidth reduction in some cases."

> By default the pmem driver will be using blk-mq with

"By default the current cpu-copy based pmem driver will load, but this
driver can be manually selected with a modprobe configuration."

> DMA through the dmaengine API. DMA can be turned off with use_dma=0 kernel
> parameter.

Do we need the module option? It seems for debug / testing a user can
simply unload the ioatdma driver, otherwise we should use dma by
default.

>Additional kernel parameters are provided:
>
> queue_depth: The queue depth for blk-mq. Typically in relation to what the
>  DMA engine can provide per queue/channel. This needs to take
>  into account of num_sg as well for some DMA engines. i.e.
>  num_sg * queue_depth < total descriptors available per queue or
>  channel.
>
> q_per_node: Hardware queues per node. Typically the number of channels the
> DMA engine can provide per socket.
> num_sg: Number of scatterlist we can handle per I/O request.

Why do these need to be configurable?

>
> Numbers below are measured against pmem simulated via DRAM using
> memmap=NN!SS.  DMA engine used is the ioatdma on Intel Skylake Xeon
> platform.  Keep in mind the performance for persistent memory
> will differ.
> Fio 2.21 was used.
>
> 64k: 1 task queuedepth=1
> CPU Read:  7631 MB/s  99.7% CPU    DMA Read:  2415 MB/s  54% CPU
> CPU Write: 3552 MB/s  100% CPU     DMA Write: 2173 MB/s  54% CPU
>
> 64k: 16 tasks queuedepth=16
> CPU Read:  36800 MB/s  1593% CPU   DMA Read:  29100 MB/s  607% CPU
> CPU Write: 20900 MB/s  1589% CPU   DMA Write: 23400 MB/s  585% CPU
>
> 2M: 1 task queuedepth=1
> CPU Read:  6013 MB/s  99.3% CPU    DMA Read:  7986 MB/s  59.3% CPU
> CPU Write: 3579 MB/s  100% CPU     DMA Write: 5211 MB/s  58.3% CPU
>
> 2M: 16 tasks queuedepth=16
> CPU Read:  18100 MB/s  1588% CPU   DMA Read:  21300 MB/s  180.9% CPU
> CPU Write: 14100 MB/s  1594% CPU   DMA Write: 20400 MB/s  446.9% CPU
>
> Signed-off-by: Dave Jiang 
> Reviewed-by: Ross Zwisler 
>
> Signed-off-by: Dave Jiang 
> ---
>  drivers/nvdimm/Kconfig   |   23 +
>  drivers/nvdimm/Makefile  |3
>  drivers/nvdimm/pmem.h|3
>  drivers/nvdimm/pmem_mq.c |  853 
> ++
>  4 files changed, 882 insertions(+)
>  create mode 100644 drivers/nvdimm/pmem_mq.c
>
> diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
> index 5bdd499..c88c2bb 100644
> --- a/drivers/nvdimm/Kconfig
> +++ b/drivers/nvdimm/Kconfig
> @@ -36,6 +36,29 @@ config BLK_DEV_PMEM
>
>   Say Y if you want to use an NVDIMM
>
> +config BLK_DEV_PMEM_MQ

Lets call the driver pmem_dma instead of pmem_mq, the "mq" is just a
side effect of enabling dma support.

> +   tristate "PMEM: Persistent memory block device multi-queue support"
> +   depends on m

Not sure what the exact kconfig syntax should be, but the policy I'm
looking for is that this driver should be allowed to be built-in if
the pmem driver is disabled.

> +   default LIBNVDIMM
> +   select DAX
> +   select ND_BTT if BTT
> +   select ND_PFN if NVDIMM_PFN

I think these should probably move to a common symbol used by both
pmem and pmem_dma.

> +   select DMA_ENGINE

Shouldn't this be "depends on"?

> +   help
> + Memory ranges for PMEM are described by either an NFIT
> + (NVDIMM Firmware Interface Table, see CONFIG_NFIT_ACPI), a
> + non-standard OEM-specific E820 memory type (type-12, see
> + CONFIG_X86_PMEM_LEGACY), or it is manually specified by the
> + 'memmap=nn[KMG]!ss[KMG]' kernel command line (see
> + Documentation/admin-guide/kernel-parameters.rst).  This driver
> + converts these persistent memory ranges into block devices that are
> + capable of DAX (direct-access) file system mappings.  See
> + Documentation/nvdimm/nvdimm.txt for more details. This driver
> + utilizes block layer multi-queue in order to support using DMA
> + engines to help offload the data copying.

Rather than duplicating some of the pmem text I think this help text
should only talk about the incremental differences relative to the
base pmem driver.

> +
> + Say Y if you want to use an NVDIMM
> +
>  config ND_BLK
> tristate "BLK: Block data window (aperture) device support"
> default LIBNVDIMM
> diff --git a/drivers/nvdimm/Makefile b/drivers/nvdimm/Makefile
> index 909554c..8bfd107 100644
> --- a/drivers/nvdimm/Makefile
> +++ b/drivers/nvdimm/Makefile
> @@ -1,11 +1,14 @@
>  obj-$(CONFIG_LIBNVDIMM) += 

Re: [PATCH 13/13] ext4: Support for synchronous DAX faults

2017-08-23 Thread Christoph Hellwig
> + pfn_t pfn;
>  
>   if (write) {
>   sb_start_pagefault(sb);
> @@ -287,16 +288,39 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
>   down_read(&EXT4_I(inode)->i_mmap_sem);
>   handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
>  EXT4_DATA_TRANS_BLOCKS(sb));
> + if (IS_ERR(handle)) {
> + up_read(&EXT4_I(inode)->i_mmap_sem);
> + sb_end_pagefault(sb);
> + return VM_FAULT_SIGBUS;
> + }
>   } else {
>   down_read(&EXT4_I(inode)->i_mmap_sem);
>   }
> - if (!IS_ERR(handle))
> - result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops, NULL);
> - else
> - result = VM_FAULT_SIGBUS;
> + result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops, &pfn);

Maybe split the error handling refactor into a simple prep patch to
make this one more readable?

> + /* Write fault but PFN mapped only RO? */
> + if (result & VM_FAULT_NEEDDSYNC) {
> + int err;
> + loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
> + size_t len = 0;
> +
> + if (pe_size == PE_SIZE_PTE)
> + len = PAGE_SIZE;
> +#ifdef CONFIG_FS_DAX_PMD
> + else if (pe_size == PE_SIZE_PMD)
> + len = HPAGE_PMD_SIZE;
> +#endif
> + else
> + WARN_ON_ONCE(1);
> + err = vfs_fsync_range(vmf->vma->vm_file, start,
> +   start + len - 1, 1);
> + if (err)
> + result = VM_FAULT_SIGBUS;
> + else
> + result = dax_insert_pfn_mkwrite(vmf, pe_size,
> + pfn);
> + }

I think this needs to become a helper exported from the DAX code,
way too much magic inside the file system as-is.


Re: [PATCH 09/13] dax: Allow dax_iomap_fault() to return pfn

2017-08-23 Thread Christoph Hellwig
> @@ -1416,6 +1416,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>   * @vmf: The description of the fault
>   * @pe_size: Size of the page to fault in
>   * @ops: Iomap ops passed from the file system
> + * @pfnp: PFN to insert for synchronous faults if fsync is required
>   *
>   * When a page fault occurs, filesystems may call this helper in
>   * their fault handler for DAX files. dax_iomap_fault() assumes the caller
> @@ -1423,13 +1424,13 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
>   * successfully.
>   */
>  int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
> - const struct iomap_ops *ops)
> + const struct iomap_ops *ops, pfn_t *pfnp)

Please keep the iomap_ops argument the last one for the exported
function (and probably all others for consistency).

Otherwise looks good:

Reviewed-by: Christoph Hellwig 


Re: [PATCH 08/13] dax: Fix comment describing dax_iomap_fault()

2017-08-23 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 07/13] dax: Inline dax_pmd_insert_mapping() into the callsite

2017-08-23 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 06/13] dax: Inline dax_insert_mapping() into the callsite

2017-08-23 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 04/13] dax: Create local variable for VMA in dax_iomap_pte_fault()

2017-08-23 Thread Christoph Hellwig
Looks fine,

Reviewed-by: Christoph Hellwig 


Re: [PATCH 03/13] dax: Factor out getting of pfn out of iomap

2017-08-23 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 


[PATCH v3] ndctl: daxctl: Adding io option for daxctl

2017-08-23 Thread Dave Jiang
The daxctl io option allows I/O to be performed between file descriptors
and Device-DAX files. It also provides a way to zero a Device-DAX device.

i.e. daxctl io --input=/home/myfile --output=/dev/dax1.0

Signed-off-by: Dave Jiang 
---

v3:
- Added support for size suffix suggested by Ross.
- Fixed the checking of __do_io() return value >32bit problem.

v2:
- Removed dependency on ndctl to match device and address other comments
by Dan.

 Documentation/daxctl/Makefile.am   |3 
 Documentation/daxctl/daxctl-io.txt |   70 
 daxctl/Makefile.am |5 
 daxctl/daxctl.c|2 
 daxctl/io.c|  576 
 5 files changed, 654 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/daxctl/daxctl-io.txt
 create mode 100644 daxctl/io.c

diff --git a/Documentation/daxctl/Makefile.am b/Documentation/daxctl/Makefile.am
index 5913c94..032d48c 100644
--- a/Documentation/daxctl/Makefile.am
+++ b/Documentation/daxctl/Makefile.am
@@ -16,7 +16,8 @@ asciidoc.conf: ../asciidoc.conf.in
 
 man1_MANS = \
daxctl.1 \
-   daxctl-list.1
+   daxctl-list.1 \
+   daxctl-io.1
 
 CLEANFILES = $(man1_MANS)
 
diff --git a/Documentation/daxctl/daxctl-io.txt b/Documentation/daxctl/daxctl-io.txt
new file mode 100644
index 000..a7acc9e
--- /dev/null
+++ b/Documentation/daxctl/daxctl-io.txt
@@ -0,0 +1,70 @@
+daxctl-io(1)
+===
+
+NAME
+
+daxctl-io - Perform I/O on Device-DAX devices or zero a Device-DAX device.
+
+SYNOPSIS
+
+[verse]
+'daxctl io' []
+
+There must be a Device-DAX device involved whether as the input or the output
+device. Read from a Device-DAX device and write to a file descriptor, or
+another Device-DAX device. Write to a Device-DAX device from a file descriptor
+or another Device-DAX device.
+
+If no length is specified, it defaults to the input file/device length. If
+the input is a special char file, the length defaults to the output length.
+
+If no input is given, stdin is used; if no output is given, stdout is used.
+
+For a Device-DAX device, writes will attempt to clear badblocks within
+the written range.
+
+EXAMPLE
+---
+[verse]
+# daxctl io --zero /dev/dax1.0
+
+# daxctl io --input=/dev/dax1.0 --output=/home/myfile --len=2097152 --seek=4096
+
+# cat /dev/zero | daxctl io --output=/dev/dax1.0
+
+# daxctl io --input=/dev/zero --output=/dev/dax1.0 --skip=4096
+
+OPTIONS
+---
+-i::
+--input=::
+   Input device or file to read from.
+
+-o::
+--output=::
+   Output device or file to write to.
+
+-z::
+--zero::
+   Zero 'len' bytes of the output device, or the entire device if no
+   length was provided. The output device must be a Device-DAX device.
+
+-l::
+--len::
+   The length in bytes to perform the I/O.
+
+-s::
+--seek::
+   The number of bytes to skip over on the output before performing a
+   write.
+
+-k::
+--skip::
+   The number of bytes to skip over on the input before performing a read.
+
+COPYRIGHT
+-
+Copyright (c) 2017, Intel Corporation. License GPLv2: GNU GPL
+version 2 .  This is free software:
+you are free to change and redistribute it.  There is NO WARRANTY, to
+the extent permitted by law.
diff --git a/daxctl/Makefile.am b/daxctl/Makefile.am
index fe467d0..1ba1f07 100644
--- a/daxctl/Makefile.am
+++ b/daxctl/Makefile.am
@@ -5,10 +5,13 @@ bin_PROGRAMS = daxctl
 daxctl_SOURCES =\
daxctl.c \
list.c \
+   io.c \
../util/json.c
 
 daxctl_LDADD =\
lib/libdaxctl.la \
+   ../ndctl/lib/libndctl.la \
../libutil.a \
$(UUID_LIBS) \
-   $(JSON_LIBS)
+   $(JSON_LIBS) \
+   -lpmem
diff --git a/daxctl/daxctl.c b/daxctl/daxctl.c
index 91a4600..db2e495 100644
--- a/daxctl/daxctl.c
+++ b/daxctl/daxctl.c
@@ -67,11 +67,13 @@ static int cmd_help(int argc, const char **argv, void *ctx)
 }
 
 int cmd_list(int argc, const char **argv, void *ctx);
+int cmd_io(int argc, const char **argv, void *ctx);
 
 static struct cmd_struct commands[] = {
{ "version", cmd_version },
{ "list", cmd_list },
{ "help", cmd_help },
+   { "io", cmd_io },
 };
 
 int main(int argc, const char **argv)
diff --git a/daxctl/io.c b/daxctl/io.c
new file mode 100644
index 000..27e7463
--- /dev/null
+++ b/daxctl/io.c
@@ -0,0 +1,576 @@
+/*
+ * Copyright(c) 2015-2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include 
+#include 

Re: [PATCH v6 6/6] libnvdimm, btt: rework error clearing

2017-08-23 Thread Kani, Toshimitsu
On Tue, 2017-08-22 at 16:19 -0600, Vishal Verma wrote:
 :
> +
> +		/* The block had a media error, and needs to be cleared */
> +		if (btt_is_badblock(btt, arena, arena->freelist[lane].block)) {
> +			arena->freelist[lane].has_err = 1;
> +			nd_region_release_lane(btt->nd_region, lane);
> +
> +			arena_clear_freelist_error(arena, lane);
> +			/* OK to acquire a different lane/free block */
> +			goto retry;

I hit an infinite clear loop when DSM Clear Uncorrectable Error
function fails.  Haven't looked into the details, but I suspect this
unconditional retry is the cause of this.

Thanks,
-Toshi




Re: [PATCH v2] ndctl: daxctl: Adding io option for daxctl

2017-08-23 Thread Dan Williams
On Wed, Aug 23, 2017 at 9:47 AM, Ross Zwisler
 wrote:
> On Tue, Aug 22, 2017 at 05:06:15PM -0700, Dave Jiang wrote:
>> The daxctl io option allows I/Os to be performed between file descriptor to
>> and from device dax files. It also provides a way to zero a device dax
>> device.
>>
>> i.e. daxctl io --input=/home/myfile --output=/dev/dax1.0
>>
>> Signed-off-by: Dave Jiang 
>> ---
>>
>> v2:
>> - Removed dependency on ndctl to match device and address other comments
>> by Dan.
>>
>>  Documentation/daxctl/Makefile.am   |3 +-
>>  Documentation/daxctl/daxctl-io.txt |   70 
>> 
>>  2 files changed, 72 insertions(+), 1 deletion(-)
>>  create mode 100644 Documentation/daxctl/daxctl-io.txt
>>
>> diff --git a/Documentation/daxctl/Makefile.am 
>> b/Documentation/daxctl/Makefile.am
>> index 5913c94..032d48c 100644
>> --- a/Documentation/daxctl/Makefile.am
>> +++ b/Documentation/daxctl/Makefile.am
>> @@ -16,7 +16,8 @@ asciidoc.conf: ../asciidoc.conf.in
>>
>>  man1_MANS = \
>>   daxctl.1 \
>> - daxctl-list.1
>> + daxctl-list.1 \
>> + daxctl-io.1
>>
>>  CLEANFILES = $(man1_MANS)
>>
>> diff --git a/Documentation/daxctl/daxctl-io.txt 
>> b/Documentation/daxctl/daxctl-io.txt
>> new file mode 100644
>> index 000..a7acc9e
>> --- /dev/null
>> +++ b/Documentation/daxctl/daxctl-io.txt
>> @@ -0,0 +1,70 @@
>> +daxctl-io(1)
>> +===
>> +
>> +NAME
>> +
>> +daxctl-io - Perform I/O on Device-DAX devices or zero a Device-DAX device.
>> +
>> +SYNOPSIS
>> +
>> +[verse]
>> +'daxctl io' []
>> +
>> +There must be a Device-DAX device involved whether as the input or the 
>> output
>> +device. Read from a Device-DAX device and write to a file descriptor, or
>> +another Device-DAX device. Write to a Device-DAX device from a file 
>> descriptor
>> +or another Device-DAX device.
>> +
>> +No length specified will default to input file/device length. If input is
>> +a special char file then length will be the output file/device length.
>> +
>> +No input will default to stdin. No output will default to stdout.
>> +
>> +For a Device-DAX device, attempts to clear badblocks within range of writes
>> +will be performed.
>> +
>> +EXAMPLE
>> +---
>> +[verse]
>> +# daxctl io --zero /dev/dax1.0
>> +
>> +# daxctl io --input=/dev/dax1.0 --output=/home/myfile --len=2097152 
>> --seek=4096
>
> Do you have plans to add support for suffixes so people don't have to do
> --len=2097152 or --len=$((2*1024*1024))?

Good eye, this command should be using the same size parsing scheme as
ndctl create-namespace. See parse_size64()...


Re: [PATCH v2] ndctl: daxctl: Adding io option for daxctl

2017-08-23 Thread Ross Zwisler
On Tue, Aug 22, 2017 at 05:06:15PM -0700, Dave Jiang wrote:
> The daxctl io option allows I/Os to be performed between file descriptor to
> and from device dax files. It also provides a way to zero a device dax
> device.
> 
> i.e. daxctl io --input=/home/myfile --output=/dev/dax1.0
> 
> Signed-off-by: Dave Jiang 
> ---
> 
> v2:
> - Removed dependency on ndctl to match device and address other comments
> by Dan.
> 
>  Documentation/daxctl/Makefile.am   |3 +-
>  Documentation/daxctl/daxctl-io.txt |   70 
> 
>  2 files changed, 72 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/daxctl/daxctl-io.txt
> 
> diff --git a/Documentation/daxctl/Makefile.am 
> b/Documentation/daxctl/Makefile.am
> index 5913c94..032d48c 100644
> --- a/Documentation/daxctl/Makefile.am
> +++ b/Documentation/daxctl/Makefile.am
> @@ -16,7 +16,8 @@ asciidoc.conf: ../asciidoc.conf.in
>  
>  man1_MANS = \
>   daxctl.1 \
> - daxctl-list.1
> + daxctl-list.1 \
> + daxctl-io.1
>  
>  CLEANFILES = $(man1_MANS)
>  
> diff --git a/Documentation/daxctl/daxctl-io.txt 
> b/Documentation/daxctl/daxctl-io.txt
> new file mode 100644
> index 000..a7acc9e
> --- /dev/null
> +++ b/Documentation/daxctl/daxctl-io.txt
> @@ -0,0 +1,70 @@
> +daxctl-io(1)
> +===
> +
> +NAME
> +
> +daxctl-io - Perform I/O on Device-DAX devices or zero a Device-DAX device.
> +
> +SYNOPSIS
> +
> +[verse]
> +'daxctl io' []
> +
> +There must be a Device-DAX device involved whether as the input or the output
> +device. Read from a Device-DAX device and write to a file descriptor, or
> +another Device-DAX device. Write to a Device-DAX device from a file 
> descriptor
> +or another Device-DAX device.
> +
> +No length specified will default to input file/device length. If input is
> +a special char file then length will be the output file/device length.
> +
> +No input will default to stdin. No output will default to stdout.
> +
> +For a Device-DAX device, attempts to clear badblocks within range of writes
> +will be performed.
> +
> +EXAMPLE
> +---
> +[verse]
> +# daxctl io --zero /dev/dax1.0
> +
> +# daxctl io --input=/dev/dax1.0 --output=/home/myfile --len=2097152 
> --seek=4096

Do you have plans to add support for suffixes so people don't have to do
--len=2097152 or --len=$((2*1024*1024))?

It looks like even dd supports suffixes:

$ dd if=/dev/urandom of=/tmp/random bs=4k count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000180118 s, 22.7 MB/s

from the man page:

N and BYTES may be followed by the following multiplicative suffixes: c =1, w
=2, b =512, kB =1000, K =1024, MB =1000*1000, M =1024*1024, xM =M, GB
=1000*1000*1000, G =1024*1024*1024, and so on for T, P, E, Z, Y.


Re: [PATCH v2] Fix ext4 fault handling when mounted with -o dax,ro

2017-08-23 Thread Ross Zwisler
On Tue, Aug 22, 2017 at 08:37:04PM -0700, rdod...@gmail.com wrote:
> From: Randy Dodgen 
> 
> If an ext4 filesystem is mounted with both the DAX and read-only
> options, executables on that filesystem will fail to start (claiming
> 'Segmentation fault') due to the fault handler returning
> VM_FAULT_SIGBUS.
> 
> This is due to the DAX fault handler (see ext4_dax_huge_fault)
> attempting to write to the journal when FAULT_FLAG_WRITE is set. This is
> the wrong behavior for write faults which will lead to a COW page; in
> particular, this fails for readonly mounts.
> 
> This change replicates some checks from dax_iomap_fault to more
> precisely reason about when a journal-write is needed.
> 
> It might be the case that this could be better handled in
> ext4_iomap_begin / ext4_iomap_end (called via iomap_ops inside
> dax_iomap_fault). There is some overlap already (e.g. grabbing journal
> handles).
> 
> Signed-off-by: Randy Dodgen 
> ---
> 
> I'm resending for some DMARC-proofing (thanks Ted for the explanation), a
> missing Signed-off-by, and some extra cc's. Oops!
> 
>  fs/ext4/file.c | 26 +-
>  1 file changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 0d7cf0cc9b87..d512fb85a3e3 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -279,7 +279,31 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
>   handle_t *handle = NULL;
>   struct inode *inode = file_inode(vmf->vma->vm_file);
>   struct super_block *sb = inode->i_sb;
> - bool write = vmf->flags & FAULT_FLAG_WRITE;
> + bool write;
> +
> + /*
> +  * We have to distinguish real writes from writes which will result in a
> +  * COW page
> +  * - COW writes need to fall-back to installing PTEs. See
> +  *   dax_iomap_pmd_fault.
> +  * - COW writes should *not* poke the journal (the file will not be
> +  *   changed). Doing so would cause unintended failures when mounted
> +  *   read-only.
> +  */
> + if (pe_size == PE_SIZE_PTE) {
> + /* See dax_iomap_pte_fault. */
> + write = (vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page;
> + } else if (pe_size == PE_SIZE_PMD) {
> + /* See dax_iomap_pmd_fault. */
> + write = vmf->flags & FAULT_FLAG_WRITE;
> + if (write && !(vmf->vma->vm_flags & VM_SHARED)) {
> + split_huge_pmd(vmf->vma, vmf->pmd, vmf->address);
> + count_vm_event(THP_FAULT_FALLBACK);
> + return VM_FAULT_FALLBACK;
> + }
> + } else {
> + return VM_FAULT_FALLBACK;
> + }

This works in my setup, though the logic could be simpler.

For all fault sizes you can rely on the fact that a COW write will happen when
we have FAULT_FLAG_WRITE but not VM_SHARED.  This is the logic that we use to
know to set up vmf->cow_page in do_fault() by calling do_cow_fault(), and in
finish_fault().

I think your test can then just become:

write = (vmf->flags & FAULT_FLAG_WRITE) && 
(vmf->vma->vm_flags & VM_SHARED);

With some appropriate commenting.

You can then let the DAX fault handlers worry about validating the fault size
and splitting the PMD on fallback.

I'll let someone with more ext4-fu comment on whether it is okay to skip the
journal entry when doing a COW fault.  This must be handled in ext4 for the
non-DAX case, but I don't see any more checks for VM_SHARED or
FAULT_FLAG_WRITE in fs/ext4, so maybe there is a better way?

- Ross
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v2 1/2] dax: fix deadlock due to misaligned PMD faults

2017-08-23 Thread Ross Zwisler
On Wed, Aug 23, 2017 at 11:57:33AM +0200, Jan Kara wrote:
> On Tue 22-08-17 16:24:35, Ross Zwisler wrote:
> > In DAX there are two separate places where the 2MiB range of a PMD is
> > defined.
> > 
> > The first is in the page tables, where a PMD mapping inserted for a given
> > address spans from (vmf->address & PMD_MASK) to
> > ((vmf->address & PMD_MASK) + PMD_SIZE - 1).  That is, from the 2MiB
> > boundary below the address to the 2MiB boundary above the address.
> > 
> > So, for example, a fault at address 3MiB (0x300000) falls within the PMD
> > that ranges from 2MiB (0x200000) to 4MiB (0x400000).
> > 
> > The second PMD range is in the mapping->page_tree, where a given file
> > offset is covered by a radix tree entry that spans from one 2MiB aligned
> > file offset to another 2MiB aligned file offset.
> > 
> > So, for example, the file offset for 3MiB (pgoff 768) falls within the PMD
> > range for the order 9 radix tree entry that ranges from 2MiB (pgoff 512) to
> > 4MiB (pgoff 1024).
> > 
> > This system works so long as the addresses and file offsets for a given
> > mapping both have the same offsets relative to the start of each PMD.
> > 
> > Consider the case where the starting address for a given file isn't 2MiB
> > aligned - say our faulting address is 3 MiB (0x300000), but that
> > corresponds to the beginning of our file (pgoff 0).  Now all the PMDs in
> > the mapping are misaligned so that the 2MiB range defined in the page
> > tables never matches up with the 2MiB range defined in the radix tree.
> > 
> > The current code notices this case for DAX faults to storage with the
> > following test in dax_pmd_insert_mapping():
> > 
> > if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
> > goto unlock_fallback;
> > 
> > This test makes sure that the pfn we get from the driver is 2MiB aligned,
> > and relies on the assumption that the 2MiB alignment of the pfn we get back
> > from the driver matches the 2MiB alignment of the faulting address.
> > 
> > However, faults to holes were not checked and we could hit the problem
> > described above.
> > 
> > This was reported in response to the NVML nvml/src/test/pmempool_sync
> > TEST5:
> > 
> > $ cd nvml/src/test/pmempool_sync
> > $ make TEST5
> > 
> > You can grab NVML here:
> > 
> > https://github.com/pmem/nvml/
> > 
> > The dmesg warning you see when you hit this error is:
> > 
> > WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 
> > dax_insert_mapping_entry+0x2df/0x310
> > 
> > Where we notice in dax_insert_mapping_entry() that the radix tree entry we
> > are about to replace doesn't match the locked entry that we had previously
> > inserted into the tree.  This happens because the initial insertion was
> > done in grab_mapping_entry() using a pgoff calculated from the faulting
> > address (vmf->address), and the replacement in
> > dax_pmd_load_hole() => dax_insert_mapping_entry() is done using vmf->pgoff.
> > 
> > In our failure case those two page offsets (one calculated from
> > vmf->address, one using vmf->pgoff) point to different order 9 radix tree
> > entries.
> > 
> > This failure case can result in a deadlock because the radix tree unlock
> > also happens on the pgoff calculated from vmf->address.  This means that
> > the locked radix tree entry that we swapped in to the tree in
> > dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
> > future faults to that 2MiB range will block forever.
> > 
> > Fix this by validating that the faulting address's PMD offset matches the
> > PMD offset from the start of the file.  This check is done at the very
> > beginning of the fault and covers faults that would have mapped to storage
> > as well as faults to holes.  I left the COLOUR check in
> > dax_pmd_insert_mapping() in place in case we ever hit the insanity
> > condition where the alignment of the pfn we get from the driver doesn't
> > match the alignment of the userspace address.
> 
> Hum, I'm confused with all these offsets and their alignment requirements.
> So let me try to get this straight. We have three different things here:
> 
> 1) virtual address A where we fault
> 2) file offset FA corresponding to the virtual address - computed as
>linear_page_index(vma, A) = (A - vma->start) >> PAGE_SHIFT + vma->pgoff
>- and stored in vmf->pgoff
> 3) physical address (or sector in filesystem terminology) the file offset
>maps to.
> 
> and then we have the aligned virtual address B = (A & PMD_MASK) and
> corresponding file offset FB we've got from linear_page_index(vma, B). Now
> if I've correctly understood what you've written above, the problem is that
> although B is aligned, FB doesn't have to be. That can happen either when
> vma->start is not aligned (which is the example you give above?) or when
> vma->pgoff is non aligned. And as a result FA >> 9 != FB >> 9 leading to
> issues.
> 
> OK, makes sense. You can add:
> 
> Reviewed-by: Jan Kara 

Yep, your understanding matches mine.  What 

[PATCH 1/2] nfit: Use init_completion() in acpi_nfit_flush_probe()

2017-08-23 Thread Boqun Feng
There is no need to use COMPLETION_INITIALIZER_ONSTACK() in
acpi_nfit_flush_probe(), replace it with init_completion().

Signed-off-by: Boqun Feng 
---
 drivers/acpi/nfit/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 19182d091587..1893e416e7c0 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2884,7 +2884,7 @@ static int acpi_nfit_flush_probe(struct nvdimm_bus_descriptor *nd_desc)
 * need to be interruptible while waiting.
 */
 	INIT_WORK_ONSTACK(&flush.work, flush_probe);
-	COMPLETION_INITIALIZER_ONSTACK(flush.cmp);
+	init_completion(&flush.cmp);
 	queue_work(nfit_wq, &flush.work);
 	mutex_unlock(&acpi_desc->init_mutex);
 
-- 
2.14.1



Re: [PATCH v2 2/2] dax: use PG_PMD_COLOUR instead of open coding

2017-08-23 Thread Jan Kara
On Tue 22-08-17 16:24:36, Ross Zwisler wrote:
> Use ~PG_PMD_COLOUR in dax_entry_waitqueue() instead of open coding an
> equivalent page offset mask.
> 
> Signed-off-by: Ross Zwisler 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/dax.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 865d42c..0e2a1fd 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -42,6 +42,9 @@
>  #define DAX_WAIT_TABLE_BITS 12
>  #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
>  
> +/* The 'colour' (ie low bits) within a PMD of a page offset.  */
> +#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
> +
>  static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
>  
>  static int __init init_dax_wait_table(void)
> @@ -98,7 +101,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct 
> address_space *mapping,
>* the range covered by the PMD map to the same bit lock.
>*/
>   if (dax_is_pmd_entry(entry))
> - index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
> + index &= ~PG_PMD_COLOUR;
>  
>   key->mapping = mapping;
>   key->entry_start = index;
> @@ -1262,12 +1265,6 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
>  }
>  
>  #ifdef CONFIG_FS_DAX_PMD
> -/*
> - * The 'colour' (ie low bits) within a PMD of a page offset.  This comes up
> - * more often than one might expect in the below functions.
> - */
> -#define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
> -
>  static int dax_pmd_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
>   loff_t pos, void **entryp)
>  {
> -- 
> 2.9.5
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 1/2] dax: fix deadlock due to misaligned PMD faults

2017-08-23 Thread Jan Kara
On Tue 22-08-17 16:24:35, Ross Zwisler wrote:
> In DAX there are two separate places where the 2MiB range of a PMD is
> defined.
> 
> The first is in the page tables, where a PMD mapping inserted for a given
> address spans from (vmf->address & PMD_MASK) to
> ((vmf->address & PMD_MASK) + PMD_SIZE - 1).  That is, from the 2MiB
> boundary below the address to the 2MiB boundary above the address.
> 
> So, for example, a fault at address 3MiB (0x300000) falls within the PMD
> that ranges from 2MiB (0x200000) to 4MiB (0x400000).
> 
> The second PMD range is in the mapping->page_tree, where a given file
> offset is covered by a radix tree entry that spans from one 2MiB aligned
> file offset to another 2MiB aligned file offset.
> 
> So, for example, the file offset for 3MiB (pgoff 768) falls within the PMD
> range for the order 9 radix tree entry that ranges from 2MiB (pgoff 512) to
> 4MiB (pgoff 1024).
> 
> This system works so long as the addresses and file offsets for a given
> mapping both have the same offsets relative to the start of each PMD.
> 
> Consider the case where the starting address for a given file isn't 2MiB
> aligned - say our faulting address is 3 MiB (0x300000), but that
> corresponds to the beginning of our file (pgoff 0).  Now all the PMDs in
> the mapping are misaligned so that the 2MiB range defined in the page
> tables never matches up with the 2MiB range defined in the radix tree.
> 
> The current code notices this case for DAX faults to storage with the
> following test in dax_pmd_insert_mapping():
> 
>   if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
>   goto unlock_fallback;
> 
> This test makes sure that the pfn we get from the driver is 2MiB aligned,
> and relies on the assumption that the 2MiB alignment of the pfn we get back
> from the driver matches the 2MiB alignment of the faulting address.
> 
> However, faults to holes were not checked and we could hit the problem
> described above.
> 
> This was reported in response to the NVML nvml/src/test/pmempool_sync
> TEST5:
> 
>   $ cd nvml/src/test/pmempool_sync
>   $ make TEST5
> 
> You can grab NVML here:
> 
>   https://github.com/pmem/nvml/
> 
> The dmesg warning you see when you hit this error is:
> 
> WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 
> dax_insert_mapping_entry+0x2df/0x310
> 
> Where we notice in dax_insert_mapping_entry() that the radix tree entry we
> are about to replace doesn't match the locked entry that we had previously
> inserted into the tree.  This happens because the initial insertion was
> done in grab_mapping_entry() using a pgoff calculated from the faulting
> address (vmf->address), and the replacement in
> dax_pmd_load_hole() => dax_insert_mapping_entry() is done using vmf->pgoff.
> 
> In our failure case those two page offsets (one calculated from
> vmf->address, one using vmf->pgoff) point to different order 9 radix tree
> entries.
> 
> This failure case can result in a deadlock because the radix tree unlock
> also happens on the pgoff calculated from vmf->address.  This means that
> the locked radix tree entry that we swapped in to the tree in
> dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
> future faults to that 2MiB range will block forever.
> 
> Fix this by validating that the faulting address's PMD offset matches the
> PMD offset from the start of the file.  This check is done at the very
> beginning of the fault and covers faults that would have mapped to storage
> as well as faults to holes.  I left the COLOUR check in
> dax_pmd_insert_mapping() in place in case we ever hit the insanity
> condition where the alignment of the pfn we get from the driver doesn't
> match the alignment of the userspace address.

Hum, I'm confused with all these offsets and their alignment requirements.
So let me try to get this straight. We have three different things here:

1) virtual address A where we fault
2) file offset FA corresponding to the virtual address - computed as
   linear_page_index(vma, A) = (A - vma->start) >> PAGE_SHIFT + vma->pgoff
   - and stored in vmf->pgoff
3) physical address (or sector in filesystem terminology) the file offset
   maps to.

and then we have the aligned virtual address B = (A & PMD_MASK) and
corresponding file offset FB we've got from linear_page_index(vma, B). Now
if I've correctly understood what you've written above, the problem is that
although B is aligned, FB doesn't have to be. That can happen either when
vma->start is not aligned (which is the example you give above?) or when
vma->pgoff is non aligned. And as a result FA >> 9 != FB >> 9 leading to
issues.

OK, makes sense. You can add:

Reviewed-by: Jan Kara 

Honza

> 
> Signed-off-by: Ross Zwisler 
> Reported-by: "Slusarz, Marcin" 
> Cc: 
> 
> ---
> 
> Sorry for the quick v2, I realized that I didn't talk about 
