Delivery reports about your e-mail

2017-10-11 Thread Bounced mail
The original message was received at Thu, 12 Oct 2017 12:25:23 +0800
from lists.01.org [187.53.147.89]

- The following addresses had permanent fatal errors -


- Transcript of session follows -
  while talking to lists.01.org.:
>>> MAIL From:"Bounced mail" 
<<< 501 "Bounced mail" ... Refused



___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate()

2017-10-11 Thread Dan Williams
On Wed, Oct 11, 2017 at 7:17 PM, Dan Williams  wrote:
> On Wed, Oct 11, 2017 at 6:28 PM, Dan Williams  wrote:
>> On Wed, Oct 11, 2017 at 6:21 PM, Al Viro  wrote:
>>> On Wed, Oct 11, 2017 at 05:47:18PM -0700, Dan Williams wrote:
 The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
 block map changes while the file is mapped. It requires the fd to set up
 an fasync_struct for signalling lease break events to the lease holder.
>>>
>>> *UGH*
>>>
>>> That looks like one hell of a bad API.  You are not even guaranteed that
>>> descriptor will still be open by the time you pass it down to your
>>> helper, nevermind the moment when event actually happens...
>>
>> What am I missing, fcntl(F_SETLEASE) seems to follow a similar pattern?
>
> Ugh, so I think the difference with F_SETLEASE is that the lease ends
> when the fd is closed. In the mmap case the lease follows the lifetime
> of the vma. I'll rethink this interface...

I'm not seeing a lot of good options outside of documenting that if
you close the fd that is registered with MAP_DIRECT you may still get
SIGIO notifications with si_fd set to the stale fd.
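Because the lease would follow the vma rather than the descriptor, a lease holder's SIGIO handler cannot trust si_fd by itself: the number may already be closed, or reused for an unrelated file. A minimal userspace sketch, assuming the application records the file's (st_dev, st_ino) identity when it registers the mapping. `struct lease_registration`, `lease_register()` and `lease_fd_still_valid()` are hypothetical helpers for illustration, not part of the proposed interface.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record kept by the application for a registered fd. */
struct lease_registration {
	int fd;
	dev_t dev;
	ino_t ino;
};

/* Remember which file the fd referred to at registration time. */
static bool lease_register(struct lease_registration *reg, int fd)
{
	struct stat st;

	if (fstat(fd, &st) != 0)
		return false;
	reg->fd = fd;
	reg->dev = st.st_dev;
	reg->ino = st.st_ino;
	return true;
}

/*
 * True only if si_fd still names the registered file. A closed fd makes
 * fstat() fail with EBADF; a reused fd number points at a different inode.
 */
static bool lease_fd_still_valid(const struct lease_registration *reg,
				 int si_fd)
{
	struct stat st;

	if (si_fd != reg->fd || fstat(si_fd, &st) != 0)
		return false;
	return st.st_dev == reg->dev && st.st_ino == reg->ino;
}
```

fstat() is async-signal-safe, so this check could run inside an SA_SIGINFO handler before acting on si_fd. It narrows the stale-fd window rather than closing it: the application still has to serialize its own close() calls against the handler.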


Re: [PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate()

2017-10-11 Thread Dan Williams
On Wed, Oct 11, 2017 at 6:28 PM, Dan Williams  wrote:
> On Wed, Oct 11, 2017 at 6:21 PM, Al Viro  wrote:
>> On Wed, Oct 11, 2017 at 05:47:18PM -0700, Dan Williams wrote:
>>> The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
>>> block map changes while the file is mapped. It requires the fd to set up
>>> an fasync_struct for signalling lease break events to the lease holder.
>>
>> *UGH*
>>
>> That looks like one hell of a bad API.  You are not even guaranteed that
>> descriptor will still be open by the time you pass it down to your
>> helper, nevermind the moment when event actually happens...
>
> What am I missing, fcntl(F_SETLEASE) seems to follow a similar pattern?

Ugh, so I think the difference with F_SETLEASE is that the lease ends
when the fd is closed. In the mmap case the lease follows the lifetime
of the vma. I'll rethink this interface...


Re: [PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate()

2017-10-11 Thread Dan Williams
On Wed, Oct 11, 2017 at 6:21 PM, Al Viro  wrote:
> On Wed, Oct 11, 2017 at 05:47:18PM -0700, Dan Williams wrote:
>> The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
>> block map changes while the file is mapped. It requires the fd to set up
>> an fasync_struct for signalling lease break events to the lease holder.
>
> *UGH*
>
> That looks like one hell of a bad API.  You are not even guaranteed that
> descriptor will still be open by the time you pass it down to your
> helper, nevermind the moment when event actually happens...

What am I missing, fcntl(F_SETLEASE) seems to follow a similar pattern?


Re: [PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate()

2017-10-11 Thread Al Viro
On Wed, Oct 11, 2017 at 05:47:18PM -0700, Dan Williams wrote:
> The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
> block map changes while the file is mapped. It requires the fd to set up
> an fasync_struct for signalling lease break events to the lease holder.

*UGH*

That looks like one hell of a bad API.  You are not even guaranteed that
descriptor will still be open by the time you pass it down to your
helper, nevermind the moment when event actually happens...


[PATCH v9 5/6] fs, xfs, iomap: introduce break_layout_nowait()

2017-10-11 Thread Dan Williams
In preparation for using FL_LAYOUT leases to allow coordination between
the kernel and processes doing userspace flushes / RDMA with DAX
mappings, add this helper that can be used to start the lease break
process in contexts where we cannot sleep waiting for the lease break
timeout.

This is targeted to be used in an ->iomap_begin() implementation where
we may have various filesystem locks held and cannot synchronously wait
for any FL_LAYOUT leases to be released. In particular, an iomap mmap
fault handler running under mmap_sem cannot unlock that semaphore and
wait for these leases to be released. Instead, this signals the lease
holder(s) that a break is requested and immediately returns with an
error.
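The difference between the waiting and non-waiting break can be modeled in userspace. This is a simplified sketch, not the fs/locks.c implementation: `struct layout_lease_model`, `holder_releases()`, `break_layout_model()` and `break_layout_nowait_model()` are hypothetical. It shows why a caller that holds mmap_sem must take the non-blocking path: the break is started (the holder is notified once) and the caller gets -EWOULDBLOCK back immediately, the same value the xfs_break_layouts() loop in this series tests for.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of a single FL_LAYOUT lease on an inode. */
struct layout_lease_model {
	bool held;		/* a lease holder exists */
	bool break_pending;	/* holder has already been notified */
	int notifications;	/* how many times the holder was signalled */
};

/* Stand-in for the lease holder reacting to the break, e.g. munmap(). */
static void holder_releases(struct layout_lease_model *l)
{
	l->held = false;
	l->break_pending = false;
}

/*
 * Model of break_layout(inode, wait): start the break if it is not already
 * pending, then either return -EWOULDBLOCK at once (wait == false) or
 * "sleep" until the holder lets go (simulated by holder_releases()).
 */
static int break_layout_model(struct layout_lease_model *l, bool wait)
{
	if (!l->held)
		return 0;
	if (!l->break_pending) {
		l->break_pending = true;
		l->notifications++;
	}
	if (!wait)
		return -EWOULDBLOCK;
	holder_releases(l);	/* real code sleeps up to lease-break-time */
	return 0;
}

/* Model of break_layout_nowait(): never sleeps. */
static int break_layout_nowait_model(struct layout_lease_model *l)
{
	return break_layout_model(l, false);
}
```

Repeated non-blocking calls do not re-notify the holder; they keep failing until the lease is gone.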

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Al Viro 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Suggested-by: Dave Chinner 
Signed-off-by: Dan Williams 
---
 fs/xfs/xfs_iomap.c  |3 +++
 fs/xfs/xfs_layout.c |5 -
 include/linux/fs.h  |9 +
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f179bdf1644d..840e4080afb5 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1055,6 +1055,9 @@ xfs_file_iomap_begin(
error = -EAGAIN;
goto out_unlock;
}
+   error = break_layout_nowait(inode);
+   if (error)
+   goto out_unlock;
/*
 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 * pages to keep the chunks of work done where somewhat symmetric
diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c
index 71d95e1a910a..7a633b6e9397 100644
--- a/fs/xfs/xfs_layout.c
+++ b/fs/xfs/xfs_layout.c
@@ -19,7 +19,10 @@
  * about exposing unallocated blocks but just want to provide basic
  * synchronization between a local writer and pNFS clients.  mmap writes would
  * also benefit from this sort of synchronization, but due to the tricky locking
- * rules in the page fault path we don't bother.
+ * rules in the page fault path all we can do is start the lease break
+ * timeout. See usage of break_layout_nowait in xfs_file_iomap_begin to
+ * prevent write-faults from allocating blocks or performing extent
+ * conversion.
  */
 int
 xfs_break_layouts(
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 17e0e899e184..2b030a2fccc7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2364,6 +2364,15 @@ static inline int break_layout(struct inode *inode, bool wait)
 
 #endif /* CONFIG_FILE_LOCKING */
 
+/*
+ * For use in paths where we can not wait for the layout to be recalled,
+ * for example when we are holding mmap_sem.
+ */
+static inline int break_layout_nowait(struct inode *inode)
+{
+   return break_layout(inode, false);
+}
+
 /* fs/open.c */
 struct audit_names;
 struct filename {



[PATCH v9 6/6] xfs: wire up MAP_DIRECT

2017-10-11 Thread Dan Williams
MAP_DIRECT is an mmap(2) flag with the following semantics:

  MAP_DIRECT
  When specified with MAP_SHARED_VALIDATE, sets up a file lease with the
  same lifetime as the mapping. Unlike a typical F_RDLCK lease, this lease
  is broken when a "lease breaker" attempts to write(2), change the block
  map (fallocate), or change the size of the file. Otherwise, the mechanism
  of a lease break is identical to the typical lease break case where the
  lease needs to be removed (munmap) within the number of seconds
  specified by /proc/sys/fs/lease-break-time. If the lease holder fails to
  remove the lease in time, the kernel will invalidate the mapping and
  force all future accesses to the mapping to trigger SIGBUS.

  In addition to lease break timeouts causing faults in the mapping to
  result in SIGBUS, other states of the file will trigger SIGBUS at fault
  time:

  * The fault would trigger the filesystem to allocate blocks
  * The fault would trigger the filesystem to perform extent conversion

  In other words, MAP_DIRECT expects and enforces a fully allocated file
  where faults can be satisfied without modifying block map metadata.

  An unprivileged process may establish a MAP_DIRECT mapping on a file
  whose UID (owner) matches the filesystem UID of the process. A process
  with the CAP_LEASE capability may establish a MAP_DIRECT mapping on
  arbitrary files.

  ERRORS
  EACCES Beyond the typical mmap(2) conditions that trigger EACCES,
  MAP_DIRECT also requires the permission to set a file lease.

  EOPNOTSUPP The filesystem explicitly does not support the flag.

  EPERM The file does not permit MAP_DIRECT mappings. Potential reasons
  are that DAX access is not available or the file has reflink extents.

  SIGBUS Attempted to write a MAP_DIRECT mapping at a file offset that
 might require block-map updates, or the lease timed out and the
 kernel invalidated the mapping.
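From userspace, the error list above suggests a fallback ladder: try MAP_SHARED_VALIDATE | MAP_DIRECT first and drop to an ordinary shared mapping when the failure only means "MAP_DIRECT is not available here". A hedged sketch under stated assumptions: the MAP_DIRECT value is not shown in this excerpt, so the caller passes it in as `map_direct_flag`; EINVAL is treated as "kernel predates MAP_SHARED_VALIDATE"; and `classify_map_direct_errno()` and `map_maybe_direct()` are hypothetical helpers, not part of the proposal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03	/* value used by this patch series */
#endif

enum map_direct_outcome {
	MD_FALLBACK_SHARED,	/* retry as an ordinary MAP_SHARED mapping */
	MD_FATAL,		/* failure unrelated to MAP_DIRECT support */
};

/* Decide what to do after mmap(MAP_SHARED_VALIDATE | MAP_DIRECT) fails. */
static enum map_direct_outcome classify_map_direct_errno(int err)
{
	switch (err) {
	case EINVAL:		/* kernel predates MAP_SHARED_VALIDATE */
	case EOPNOTSUPP:	/* filesystem does not support the flag */
	case EPERM:		/* no DAX, or the file has reflink extents */
	case EACCES:		/* no permission to take a lease on the file */
		return MD_FALLBACK_SHARED;
	default:
		return MD_FATAL;
	}
}

/*
 * Hypothetical helper: map_direct_flag is the caller-supplied MAP_DIRECT
 * value. *is_direct is set only when the validated mapping succeeded.
 */
static void *map_maybe_direct(int fd, size_t len, int map_direct_flag,
			      int *is_direct)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | map_direct_flag, fd, 0);

	*is_direct = (p != MAP_FAILED);
	if (p == MAP_FAILED &&
	    classify_map_direct_errno(errno) == MD_FALLBACK_SHARED)
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	return p;	/* MAP_FAILED if both attempts failed */
}
```

Success at mmap() time is only half the contract. As described above, a MAP_DIRECT mapping can still raise SIGBUS later, after a lease timeout or on a fault that would need block-map updates.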

Cc: Jan Kara 
Cc: Arnd Bergmann 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: Alexander Viro 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Jeff Layton 
Cc: "J. Bruce Fields" 
Signed-off-by: Dan Williams 
---
 fs/xfs/Kconfig  |2 -
 fs/xfs/xfs_file.c   |  107 ++-
 include/linux/mman.h|3 +
 include/uapi/asm-generic/mman.h |1 
 4 files changed, 110 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index f62fc6629abb..f8765653a438 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -112,4 +112,4 @@ config XFS_ASSERT_FATAL
 
 config XFS_LAYOUT
def_bool y
-   depends on EXPORTFS_BLOCK_OPS
+   depends on EXPORTFS_BLOCK_OPS || FS_DAX
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 3cc7292b2e9f..71dbe0307746 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -41,12 +41,22 @@
 #include "xfs_reflink.h"
 #include "xfs_layout.h"
 
+#include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 static const struct vm_operations_struct xfs_file_vm_ops;
+static const struct vm_operations_struct xfs_file_vm_direct_ops;
+
+static bool
+xfs_vma_is_direct(
+   struct vm_area_struct   *vma)
+{
+   return vma->vm_ops == &xfs_file_vm_direct_ops;
+}
 
 /*
  * Clear the specified ranges to zero through either the pagecache or DAX.
@@ -1013,6 +1023,25 @@ xfs_file_llseek(
 }
 
 /*
+ * MAP_DIRECT faults can only be serviced while the FL_LAYOUT lease is
+ * valid. See map_direct_invalidate.
+ */
+static bool
+xfs_vma_has_direct_lease(
+   struct vm_area_struct   *vma)
+{
+   /* Non MAP_DIRECT vmas do not require layout leases */
+   if (!xfs_vma_is_direct(vma))
+   return true;
+
+   if (!test_map_direct_valid(vma->vm_private_data))
+   return false;
+
+   /* We have a valid lease */
+   return true;
+}
+
+/*
  * Locking for serialisation of IO during page faults. This results in a lock
  * ordering of:
  *
@@ -1028,7 +1057,8 @@ __xfs_filemap_fault(
enum page_entry_size    pe_size,
bool                    write_fault)
 {
-   struct inode*inode = file_inode(vmf->vma->vm_file);
+   struct vm_area_struct   *vma = vmf->vma;
+   struct inode*inode = file_inode(vma->vm_file);
struct xfs_inode*ip = XFS_I(inode);
int ret;
 
@@ -1036,10 +1066,15 @@ __xfs_filemap_fault(
 
if (write_fault) {
sb_start_pagefault(inode->i_sb);
-   file_update_time(vmf->vma->vm_file);
+   file_update_time(vma->vm_file);
}
 
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+   if (!xfs_vma_has_direct_lease(vma)) {
+   ret = VM_FAULT_SIGBUS;
+   

[PATCH v9 2/6] fs, mm: pass fd to ->mmap_validate()

2017-10-11 Thread Dan Williams
The MAP_DIRECT mechanism for mmap intends to use a file lease to prevent
block map changes while the file is mapped. It requires the fd to set up
an fasync_struct for signalling lease break events to the lease holder.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Andrew Morton 
Signed-off-by: Dan Williams 
---
 arch/mips/kernel/vdso.c |2 +-
 arch/tile/mm/elf.c  |2 +-
 arch/x86/mm/mpx.c   |3 ++-
 fs/aio.c|2 +-
 include/linux/fs.h  |2 +-
 include/linux/mm.h  |9 +
 ipc/shm.c   |3 ++-
 mm/internal.h   |2 +-
 mm/mmap.c   |   13 +++--
 mm/nommu.c  |5 +++--
 mm/util.c   |7 ---
 11 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index cf10654477a9..ab26c7ac0316 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
   VM_READ|VM_WRITE|VM_EXEC|
   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-  0, NULL, 0);
+  0, NULL, 0, -1);
if (IS_ERR_VALUE(base)) {
ret = base;
goto out;
diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c
index 5ffcbe76aef9..61a9588e141a 100644
--- a/arch/tile/mm/elf.c
+++ b/arch/tile/mm/elf.c
@@ -144,7 +144,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
addr = mmap_region(NULL, addr, INTRPT_SIZE,
   VM_READ|VM_EXEC|
   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0,
-  NULL, 0);
+  NULL, 0, -1);
if (addr > (unsigned long) -PAGE_SIZE)
retval = (int) addr;
}
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 9ceaa955d2ba..a8baa94a496b 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -52,7 +52,8 @@ static unsigned long mpx_mmap(unsigned long len)
 
down_write(>mmap_sem);
addr = do_mmap(NULL, 0, len, PROT_READ | PROT_WRITE,
-  MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate, NULL);
+   MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate,
+   NULL, -1);
up_write(>mmap_sem);
if (populate)
mm_populate(addr, populate);
diff --git a/fs/aio.c b/fs/aio.c
index 5a2487217072..d10ca6db2ee6 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -519,7 +519,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 
ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
   PROT_READ | PROT_WRITE,
-  MAP_SHARED, 0, , NULL);
+  MAP_SHARED, 0, , NULL, -1);
up_write(>mmap_sem);
if (IS_ERR((void *)ctx->mmap_base)) {
ctx->mmap_size = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5aee97d64cae..17e0e899e184 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1702,7 +1702,7 @@ struct file_operations {
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*mmap_validate) (struct file *, struct vm_area_struct *,
-   unsigned long);
+   unsigned long, int);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
int (*release) (struct inode *, struct file *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 38f6ed954dde..ec45087348c9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2133,11 +2133,11 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
 
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-   struct list_head *uf, unsigned long map_flags);
+   struct list_head *uf, unsigned long map_flags, int fd);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
-   struct list_head *uf);
+   struct list_head *uf, int fd);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 struct list_head *uf);
 
@@ -2145,9 +2145,10 @@ static inline unsigned long
 

[PATCH v9 1/6] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags

2017-10-11 Thread Dan Williams
The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a
mechanism to define new behavior that is known to fail on older kernels
without the support. Define a new MAP_SHARED_VALIDATE flag pattern that
is guaranteed to fail on all legacy mmap implementations.

It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that could not be supported by all
archs, Linus observed:

I see why you *think* you want a bitmap. You think you want
a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
etc, so that people can do

ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
| MAP_SYNC, fd, 0);

and "know" that MAP_SYNC actually takes.

And I'm saying that whole wish is bogus. You're fundamentally
depending on special semantics, just make it explicit. It's already
not portable, so don't try to make it so.

Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
of 0x3, and make people do

ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
| MAP_SYNC, fd, 0);

and then the kernel side is easier too (none of that random garbage
playing games with looking at the "MAP_VALIDATE bit", but just another
case statement in that map type thing.

Boom. Done.

Similar to ->fallocate(), we also want the ability to validate the
support for new flags on a per ->mmap() 'struct file_operations'
instance basis. Towards that end, arrange for flags to be generically
validated against a mmap_supported_mask exported by 'struct
file_operations'. By default all existing flags are implicitly
supported, but new flags require MAP_SHARED_VALIDATE and
per-instance opt-in.
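The generic check described above can be modeled as a pure function. A sketch under stated assumptions: only the map types (MAP_SHARED 0x1, MAP_PRIVATE 0x2, MAP_SHARED_VALIDATE 0x3) come from the patch; the other bit values, `LEGACY_MASK_MODEL`, and `validate_mmap_flags()` are made up for illustration, with `supported_mask` standing in for the per-file_operations mmap_supported_mask. Plain MAP_SHARED keeps the legacy behavior of silently ignoring unknown bits, while MAP_SHARED_VALIDATE rejects anything the file has not opted into.

```c
#include <assert.h>
#include <errno.h>

/* Map types from the patch; 0x3 is MAP_SHARED_VALIDATE. */
#define M_SHARED		0x01UL
#define M_PRIVATE		0x02UL
#define M_SHARED_VALIDATE	0x03UL
#define M_TYPE			0x0fUL

/* Illustrative values only, not the real kernel bit assignments. */
#define M_LEGACY_FLAG		0x100UL		/* any pre-existing flag */
#define M_NEW_FLAG_A		0x1000UL	/* a MAP_SYNC-like extension */
#define M_NEW_FLAG_B		0x2000UL	/* a MAP_DIRECT-like extension */
#define LEGACY_MASK_MODEL	(M_TYPE | M_LEGACY_FLAG)

/*
 * Return the flags the mapping would proceed with, or a negative errno.
 * supported_mask models the per-file_operations opt-in.
 */
static long validate_mmap_flags(unsigned long flags,
				unsigned long supported_mask)
{
	unsigned long allowed = LEGACY_MASK_MODEL | supported_mask;

	switch (flags & M_TYPE) {
	case M_SHARED:
	case M_PRIVATE:
		/* Legacy behavior: unknown bits are silently ignored. */
		return (long)(flags & LEGACY_MASK_MODEL);
	case M_SHARED_VALIDATE:
		/* New behavior: any unsupported bit fails loudly. */
		if (flags & ~allowed)
			return -EOPNOTSUPP;
		return (long)flags;
	default:
		return -EINVAL;
	}
}
```

This is why a legacy kernel, which has no 0x3 case at all, and a new kernel, which validates, both give the caller an error instead of a silently weaker mapping.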

Cc: Arnd Bergmann 
Cc: Andy Lutomirski 
Cc: Andrew Morton 
Suggested-by: Christoph Hellwig 
Suggested-by: Linus Torvalds 
Reviewed-by: Jan Kara 
Signed-off-by: Dan Williams 
---
 arch/alpha/include/uapi/asm/mman.h   |1 +
 arch/mips/include/uapi/asm/mman.h|1 +
 arch/mips/kernel/vdso.c  |2 +
 arch/parisc/include/uapi/asm/mman.h  |1 +
 arch/tile/mm/elf.c   |3 +-
 arch/xtensa/include/uapi/asm/mman.h  |1 +
 include/linux/fs.h   |2 +
 include/linux/mm.h   |2 +
 include/linux/mman.h |   39 ++
 include/uapi/asm-generic/mman-common.h   |1 +
 mm/mmap.c|   21 --
 tools/include/uapi/asm-generic/mman-common.h |1 +
 12 files changed, 69 insertions(+), 6 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 3b26cc62dadb..f85f18ffbf8c 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -11,6 +11,7 @@
 
 #define MAP_SHARED 0x01/* Share changes */
 #define MAP_PRIVATE0x02/* Changes are private */
+#define MAP_SHARED_VALIDATE 0x3/* share + validate extension flags */
 #define MAP_TYPE   0x0f/* Mask for type of mapping (OSF/1 is _wrong_) */
 #define MAP_FIXED  0x100   /* Interpret addr exactly */
 #define MAP_ANONYMOUS  0x10/* don't use a file */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index da3216007fe0..054314bb062a 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -28,6 +28,7 @@
  */
 #define MAP_SHARED 0x001   /* Share changes */
 #define MAP_PRIVATE0x002   /* Changes are private */
+#define MAP_SHARED_VALIDATE 0x3/* share + validate extension flags */
 #define MAP_TYPE   0x00f   /* Mask for type of mapping */
 #define MAP_FIXED  0x010   /* Interpret addr exactly */
 
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 019035d7225c..cf10654477a9 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
   VM_READ|VM_WRITE|VM_EXEC|
   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
-  0, NULL);
+  0, NULL, 0);
if (IS_ERR_VALUE(base)) {
ret = base;
goto out;
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 775b5d5e41a1..a66fdb9c4b6d 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -11,6 +11,7 @@
 
 #define 

[PATCH v9 4/6] xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT

2017-10-11 Thread Dan Williams
Move xfs_break_layouts() to its own compilation unit so that it can be
used for both pnfs layouts and MAP_DIRECT mappings.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Signed-off-by: Dan Williams 
---
 fs/xfs/Kconfig  |4 
 fs/xfs/Makefile |1 +
 fs/xfs/xfs_file.c   |1 +
 fs/xfs/xfs_ioctl.c  |1 +
 fs/xfs/xfs_iops.c   |1 +
 fs/xfs/xfs_layout.c |   42 ++
 fs/xfs/xfs_layout.h |   13 +
 fs/xfs/xfs_pnfs.c   |   31 +--
 fs/xfs/xfs_pnfs.h   |8 
 9 files changed, 64 insertions(+), 38 deletions(-)
 create mode 100644 fs/xfs/xfs_layout.c
 create mode 100644 fs/xfs/xfs_layout.h

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 1b98cfa342ab..f62fc6629abb 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -109,3 +109,7 @@ config XFS_ASSERT_FATAL
  result in warnings.
 
  This behavior can be modified at runtime via sysfs.
+
+config XFS_LAYOUT
+   def_bool y
+   depends on EXPORTFS_BLOCK_OPS
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index a6e955bfead8..d44135107490 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -135,3 +135,4 @@ xfs-$(CONFIG_XFS_POSIX_ACL) += xfs_acl.o
 xfs-$(CONFIG_SYSCTL)   += xfs_sysctl.o
 xfs-$(CONFIG_COMPAT)   += xfs_ioctl32.o
 xfs-$(CONFIG_EXPORTFS_BLOCK_OPS)   += xfs_pnfs.o
+xfs-$(CONFIG_XFS_LAYOUT)   += xfs_layout.o
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 309e26c9dddb..3cc7292b2e9f 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -39,6 +39,7 @@
 #include "xfs_pnfs.h"
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
+#include "xfs_layout.h"
 
 #include 
 #include 
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index aa75389be8cf..8bfd6db4f06d 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -44,6 +44,7 @@
 #include "xfs_btree.h"
 #include 
 #include "xfs_fsmap.h"
+#include "xfs_layout.h"
 
 #include 
 #include 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 17081c77ef86..4bc2e5ef1a3a 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -39,6 +39,7 @@
 #include "xfs_trans_space.h"
 #include "xfs_pnfs.h"
 #include "xfs_iomap.h"
+#include "xfs_layout.h"
 
 #include 
 #include 
diff --git a/fs/xfs/xfs_layout.c b/fs/xfs/xfs_layout.c
new file mode 100644
index ..71d95e1a910a
--- /dev/null
+++ b/fs/xfs/xfs_layout.c
@@ -0,0 +1,42 @@
+/*
+ * Copyright (c) 2014 Christoph Hellwig.
+ */
+#include "xfs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+
+#include 
+
+/*
+ * Ensure that we do not have any outstanding pNFS layouts that can be used by
+ * clients to directly read from or write to this inode.  This must be called
+ * before every operation that can remove blocks from the extent map.
+ * Additionally we call it during the write operation, where we aren't concerned
+ * about exposing unallocated blocks but just want to provide basic
+ * synchronization between a local writer and pNFS clients.  mmap writes would
+ * also benefit from this sort of synchronization, but due to the tricky locking
+ * rules in the page fault path we don't bother.
+ */
+int
+xfs_break_layouts(
+   struct inode*inode,
+   uint*iolock)
+{
+   struct xfs_inode*ip = XFS_I(inode);
+   int error;
+
+   ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
+
+   while ((error = break_layout(inode, false)) == -EWOULDBLOCK) {
+   xfs_iunlock(ip, *iolock);
+   error = break_layout(inode, true);
+   *iolock = XFS_IOLOCK_EXCL;
+   xfs_ilock(ip, *iolock);
+   }
+
+   return error;
+}
diff --git a/fs/xfs/xfs_layout.h b/fs/xfs/xfs_layout.h
new file mode 100644
index ..f848ee78cc93
--- /dev/null
+++ b/fs/xfs/xfs_layout.h
@@ -0,0 +1,13 @@
+#ifndef _XFS_LAYOUT_H
+#define _XFS_LAYOUT_H 1
+
+#ifdef CONFIG_XFS_LAYOUT
+int xfs_break_layouts(struct inode *inode, uint *iolock);
+#else
+static inline int
+xfs_break_layouts(struct inode *inode, uint *iolock)
+{
+   return 0;
+}
+#endif /* CONFIG_XFS_LAYOUT */
+#endif /* _XFS_LAYOUT_H */
diff --git a/fs/xfs/xfs_pnfs.c b/fs/xfs/xfs_pnfs.c
index 4246876df7b7..ee9de16d7672 100644
--- a/fs/xfs/xfs_pnfs.c
+++ b/fs/xfs/xfs_pnfs.c
@@ -18,36 +18,7 @@
 #include "xfs_shared.h"
 #include "xfs_bit.h"
 #include "xfs_pnfs.h"
-
-/*
- * Ensure that we do not have any outstanding pNFS layouts that can be used by
- * clients to directly read from or write to this inode.  This must be called
- * before every operation that can remove blocks from the extent map.
- * Additionally we call it 
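
The retry loop in xfs_break_layouts() follows a common pattern: try the non-blocking break under the caller's lock, and only if that would block, drop the lock, wait, and retake it exclusively before trying again. The parenthesization of the loop condition matters: `(error = f()) == -EWOULDBLOCK` stores the return value and then compares it, whereas `error = f() == -EWOULDBLOCK` stores only the 0/1 result of the comparison. A userspace sketch of the loop, where `struct break_sim`, `try_break()`, `wait_break()` and the lock-mode enum are hypothetical stand-ins for break_layout() and the XFS iolock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical iolock modes. */
enum iolock_mode_model { IOLOCK_SHARED_MODEL, IOLOCK_EXCL_MODEL };

/* Hypothetical lease/lock state for the simulation. */
struct break_sim {
	int busy_attempts;		/* try_break() fails this many times */
	int waits;			/* how often the slow path slept */
	enum iolock_mode_model mode;	/* lock mode held by the caller */
	bool locked;			/* whether the caller holds the lock */
};

/* Stand-in for break_layout(inode, false). */
static int try_break(struct break_sim *s)
{
	if (s->busy_attempts > 0) {
		s->busy_attempts--;
		return -EWOULDBLOCK;
	}
	return 0;
}

/* Stand-in for break_layout(inode, true): must not sleep with the lock. */
static int wait_break(struct break_sim *s)
{
	assert(!s->locked);
	s->waits++;
	return 0;
}

/* Same shape as xfs_break_layouts(), with the comparison outside. */
static int break_layouts_sim(struct break_sim *s)
{
	int error;

	while ((error = try_break(s)) == -EWOULDBLOCK) {
		s->locked = false;		/* xfs_iunlock() */
		error = wait_break(s);
		s->mode = IOLOCK_EXCL_MODEL;	/* retake exclusively */
		s->locked = true;		/* xfs_ilock() */
	}
	return error;
}
```

After any slow-path iteration the caller ends up holding the lock exclusively, which is why the real function updates `*iolock` for its caller.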

[PATCH v9 0/6] MAP_DIRECT for DAX userspace flush

2017-10-11 Thread Dan Williams
Changes since v8 [1]:
* Move MAP_SHARED_VALIDATE definition next to MAP_SHARED in all arch
  headers (Jan)

* Include xfs_layout.h directly in all the files that call
  xfs_break_layouts() (Dave)

* Clarify / add more comments to the MAP_DIRECT checks at fault time
  (Dave)

* Rename iomap_can_allocate() to break_layout_nowait() to make it plain
  the reason we are bailing out of iomap_begin.

* Defer the lease_direct mechanism and RDMA core changes to a later
  patch series.

* EXT4 support is in the works and will be rebased on Jan's MAP_SYNC
  patches.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-October/012772.html

---

MAP_DIRECT is a mechanism that allows an application to establish a
mapping where the kernel will not change the block-map, or otherwise
dirty the block-map metadata of a file without notification. It supports
a "flush from userspace" model where persistent memory applications can
bypass the overhead of ongoing coordination of writes with the
filesystem, and it provides safety to RDMA operations involving DAX
mappings.

The kernel always has the ability to revoke access and convert the file
back to normal operation after performing a "lease break". Similar to
fcntl leases, there is no way for userspace to cancel the lease break
process once it has started; it can only delay it via the
/proc/sys/fs/lease-break-time setting.
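Since the break cannot be cancelled, the lease-break-time value is effectively the application's budget for tearing down a MAP_DIRECT mapping after notification. A small sketch for reading that budget at startup, assuming the sysctl is a plain integer in seconds. `read_lease_break_time()` is a hypothetical helper, and it falls back to 45 seconds, the kernel's default for this setting, when the file is missing or unparsable.

```c
#include <assert.h>
#include <stdio.h>

#define LEASE_BREAK_TIME_DEFAULT 45	/* seconds, kernel default */

/*
 * Return the lease break timeout in seconds read from path, or the
 * default if the file cannot be opened or does not hold an integer.
 */
static int read_lease_break_time(const char *path)
{
	FILE *f = fopen(path, "r");
	int secs;

	if (!f)
		return LEASE_BREAK_TIME_DEFAULT;
	if (fscanf(f, "%d", &secs) != 1 || secs < 0)
		secs = LEASE_BREAK_TIME_DEFAULT;
	fclose(f);
	return secs;
}
```

Typical use would be read_lease_break_time("/proc/sys/fs/lease-break-time"). A lease holder might then plan to munmap() well inside that window after its SIGIO arrives, rather than risk the mapping being invalidated under it.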

MAP_DIRECT enables XFS to supplant the device-dax interface for
mmap-write access to persistent memory with no ongoing coordination with
the filesystem via fsync/msync syscalls.

The MAP_DIRECT mechanism is complementary to MAP_SYNC. Here are some
scenarios where you would choose one over the other:

* 3rd party DMA / RDMA to DAX with hardware that does not support
  on-demand paging (shared virtual memory) => MAP_DIRECT

* Support for reflinked inodes, fallocate-punch-hole, truncate, or any
  other operation that mutates the block map of an actively
  mapped file => MAP_SYNC

* Userspace flush => MAP_SYNC or MAP_DIRECT

* Assurances that the file's block map metadata is stable, i.e. minimize
  worst case fault latency by locking out updates => MAP_DIRECT

---

Dan Williams (6):
  mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags
  fs, mm: pass fd to ->mmap_validate()
  fs: MAP_DIRECT core
  xfs: prepare xfs_break_layouts() for reuse with MAP_DIRECT
  fs, xfs, iomap: introduce break_layout_nowait()
  xfs: wire up MAP_DIRECT


 arch/alpha/include/uapi/asm/mman.h   |1 
 arch/mips/include/uapi/asm/mman.h|1 
 arch/mips/kernel/vdso.c  |2 
 arch/parisc/include/uapi/asm/mman.h  |1 
 arch/tile/mm/elf.c   |3 
 arch/x86/mm/mpx.c|3 
 arch/xtensa/include/uapi/asm/mman.h  |1 
 fs/Kconfig   |1 
 fs/Makefile  |2 
 fs/aio.c |2 
 fs/mapdirect.c   |  237 ++
 fs/xfs/Kconfig   |4 
 fs/xfs/Makefile  |1 
 fs/xfs/xfs_file.c|  108 
 fs/xfs/xfs_ioctl.c   |1 
 fs/xfs/xfs_iomap.c   |3 
 fs/xfs/xfs_iops.c|1 
 fs/xfs/xfs_layout.c  |   45 +
 fs/xfs/xfs_layout.h  |   13 +
 fs/xfs/xfs_pnfs.c|   31 ---
 fs/xfs/xfs_pnfs.h|8 -
 include/linux/fs.h   |   11 +
 include/linux/mapdirect.h|   40 
 include/linux/mm.h   |9 +
 include/linux/mman.h |   42 +
 include/uapi/asm-generic/mman-common.h   |1 
 include/uapi/asm-generic/mman.h  |1 
 ipc/shm.c|3 
 mm/internal.h|2 
 mm/mmap.c|   28 ++-
 mm/nommu.c   |5 -
 mm/util.c|7 -
 tools/include/uapi/asm-generic/mman-common.h |1 
 33 files changed, 557 insertions(+), 62 deletions(-)
 create mode 100644 fs/mapdirect.c
 create mode 100644 fs/xfs/xfs_layout.c
 create mode 100644 fs/xfs/xfs_layout.h
 create mode 100644 include/linux/mapdirect.h


[PATCH v9 3/6] fs: MAP_DIRECT core

2017-10-11 Thread Dan Williams
Introduce a set of helper APIs for filesystems to establish FL_LAYOUT
leases to protect against writes and block map updates while a
MAP_DIRECT mapping is established. While the lease protects against the
syscall write path and fallocate, it does not protect against allocating
write-faults, so this relies on i_mapdcount to disable block map updates
from write faults.

Like the pNFS case, MAP_DIRECT does its own timeout of the lease since we
need to have a process context for running map_direct_invalidate().
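For a fault in a MAP_DIRECT mapping, the rules spread across this series (refuse the fault once the lease has been invalidated, and also refuse faults that would allocate blocks or convert extents) reduce to a small decision function. This is a userspace model, not kernel code: `struct direct_fault_ctx`, `direct_fault_model()` and the result enum are hypothetical, loosely mirroring the MAPDIRECT_VALID bit and the xfs_vma_has_direct_lease() check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what a MAP_DIRECT fault consults. */
struct direct_fault_ctx {
	bool lease_valid;	/* models test_bit(MAPDIRECT_VALID, ...) */
	bool needs_allocation;	/* fault would allocate blocks */
	bool needs_conversion;	/* fault would perform extent conversion */
};

enum fault_result { FAULT_OK, FAULT_SIGBUS };

static enum fault_result direct_fault_model(const struct direct_fault_ctx *c)
{
	/* Lease timed out and the invalidate work cleared VALID. */
	if (!c->lease_valid)
		return FAULT_SIGBUS;
	/* MAP_DIRECT expects a fully allocated file: no block map changes. */
	if (c->needs_allocation || c->needs_conversion)
		return FAULT_SIGBUS;
	return FAULT_OK;
}
```

In practice this means an application would preallocate and pre-convert the file (for example, by writing it out once) before establishing the MAP_DIRECT mapping, so that no fault ever lands in the SIGBUS branches.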

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Jeff Layton 
Cc: "J. Bruce Fields" 
Signed-off-by: Dan Williams 
---
 fs/Kconfig|1 
 fs/Makefile   |2 
 fs/mapdirect.c|  237 +
 include/linux/mapdirect.h |   40 
 4 files changed, 279 insertions(+), 1 deletion(-)
 create mode 100644 fs/mapdirect.c
 create mode 100644 include/linux/mapdirect.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d699fd6..a7b31a96a753 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -37,6 +37,7 @@ source "fs/f2fs/Kconfig"
 config FS_DAX
bool "Direct Access (DAX) support"
depends on MMU
+   depends on FILE_LOCKING
depends on !(ARM || MIPS || SPARC)
select FS_IOMAP
select DAX
diff --git a/fs/Makefile b/fs/Makefile
index 7bbaca9c67b1..c0e791d235d8 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,7 +29,7 @@ obj-$(CONFIG_TIMERFD) += timerfd.o
 obj-$(CONFIG_EVENTFD)  += eventfd.o
 obj-$(CONFIG_USERFAULTFD)  += userfaultfd.o
 obj-$(CONFIG_AIO)   += aio.o
-obj-$(CONFIG_FS_DAX)   += dax.o
+obj-$(CONFIG_FS_DAX)   += dax.o mapdirect.o
 obj-$(CONFIG_FS_ENCRYPTION)+= crypto/
 obj-$(CONFIG_FILE_LOCKING)  += locks.o
 obj-$(CONFIG_COMPAT)   += compat.o compat_ioctl.o
diff --git a/fs/mapdirect.c b/fs/mapdirect.c
new file mode 100644
index ..9f4dd7395dcd
--- /dev/null
+++ b/fs/mapdirect.c
@@ -0,0 +1,237 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define MAPDIRECT_BREAK 0
+#define MAPDIRECT_VALID 1
+
+struct map_direct_state {
+   atomic_t mds_ref;
+   atomic_t mds_vmaref;
+   unsigned long mds_state;
+   struct inode *mds_inode;
+   struct delayed_work mds_work;
+   struct fasync_struct *mds_fa;
+   struct vm_area_struct *mds_vma;
+};
+
+bool test_map_direct_valid(struct map_direct_state *mds)
+{
+   return test_bit(MAPDIRECT_VALID, &mds->mds_state);
+}
+EXPORT_SYMBOL_GPL(test_map_direct_valid);
+
+static void put_map_direct(struct map_direct_state *mds)
+{
+   if (!atomic_dec_and_test(&mds->mds_ref))
+   return;
+   kfree(mds);
+}
+
+static void put_map_direct_vma(struct map_direct_state *mds)
+{
+   struct vm_area_struct *vma = mds->mds_vma;
+   struct file *file = vma->vm_file;
+   struct inode *inode = file_inode(file);
+   void *owner = mds;
+
+   if (!atomic_dec_and_test(&mds->mds_vmaref))
+   return;
+
+   /*
+* Flush in-flight+forced lm_break events that may be
+* referencing this dying vma.
+*/
+   mds->mds_vma = NULL;
+   set_bit(MAPDIRECT_BREAK, >mds_state);
+   vfs_setlease(vma->vm_file, F_UNLCK, NULL, );
+   flush_delayed_work(>mds_work);
+   iput(inode);
+
+   put_map_direct(mds);
+}
+
+void generic_map_direct_close(struct vm_area_struct *vma)
+{
+   put_map_direct_vma(vma->vm_private_data);
+}
+EXPORT_SYMBOL_GPL(generic_map_direct_close);
+
+static void get_map_direct_vma(struct map_direct_state *mds)
+{
+   atomic_inc(&mds->mds_vmaref);
+}
+
+void generic_map_direct_open(struct vm_area_struct *vma)
+{
+   get_map_direct_vma(vma->vm_private_data);
+}
+EXPORT_SYMBOL_GPL(generic_map_direct_open);
+
+static void map_direct_invalidate(struct work_struct *work)
+{
+   struct map_direct_state *mds;
+   struct vm_area_struct *vma;
+   struct inode *inode;
+   void *owner;
+
+   mds = container_of(work, typeof(*mds), mds_work.work);
+
+   clear_bit(MAPDIRECT_VALID, &mds->mds_state);
+
+   vma = ACCESS_ONCE(mds->mds_vma);
+   inode = mds->mds_inode;
+   
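The visible part of fs/mapdirect.c uses two counters: mds_vmaref counts the vmas sharing one lease, and only the last put_map_direct_vma() drops the lease and then releases the vma side's reference on the map_direct_state object (mds_ref). A minimal userspace sketch of that two-level pattern with C11 atomics follows. The mds_model type and model_* helpers are hypothetical, the lease drop is reduced to a flag, and the initial counts of 1 are an assumption because the setup code is cut off in the patch above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct map_direct_state. */
struct mds_model {
	atomic_int ref;      /* like mds_ref: lifetime of the object */
	atomic_int vmaref;   /* like mds_vmaref: vmas sharing the lease */
	bool lease_dropped;  /* stands in for vfs_setlease(..., F_UNLCK, ...) */
};

/* Like the kernel's atomic_dec_and_test(): true when the count hits zero. */
static bool dec_and_test(atomic_int *v)
{
	return atomic_fetch_sub(v, 1) == 1;
}

/* Assumed initial state: one vma, which owns one object reference. */
static struct mds_model *model_alloc(void)
{
	struct mds_model *m = calloc(1, sizeof(*m));

	if (!m)
		return NULL;
	atomic_init(&m->ref, 1);
	atomic_init(&m->vmaref, 1);
	return m;
}

/* Mirrors put_map_direct(): free on the final reference; true if freed. */
static bool model_put(struct mds_model *m)
{
	assert(atomic_load(&m->ref) > 0); /* catch an extra put */
	if (!dec_and_test(&m->ref))
		return false;
	free(m);
	return true;
}

/* Mirrors get_map_direct_vma(), reached via generic_map_direct_open(). */
static void model_get_vma(struct mds_model *m)
{
	atomic_fetch_add(&m->vmaref, 1);
}

/* Mirrors put_map_direct_vma(): only the last vma drops the lease, then
 * drops its object reference; true if that put freed the object. */
static bool model_put_vma(struct mds_model *m)
{
	if (!dec_and_test(&m->vmaref))
		return false;
	m->lease_dropped = true; /* real code also flushes mds_work and iput()s */
	return model_put(m);
}
```

In the kernel code the ordering also matters: per the comment in put_map_direct_vma(), mds_vma is cleared and MAPDIRECT_BREAK set before the lease is dropped and the delayed work flushed, so a racing break event should not touch the dying vma. This sketch does not model that concurrency.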

Re: ffsb job does not exit on xfs 4.14-rc1+

2017-10-11 Thread Dave Chinner
On Wed, Oct 11, 2017 at 09:54:15PM +0800, Xiong Zhou wrote:
> On Mon, Sep 25, 2017 at 10:49:03AM +0200, Carlos Maiolino wrote:
> > On Mon, Sep 25, 2017 at 01:40:06AM +, Xiong Zhou wrote:
> > > Hi,
> > > 
> > > ffsb test won't exit like this on Linus tree 4.14-rc1+.
> > > Latest commit cd4175b11685
> > 
> > Can you provide more information? Do you have any kernel log from this issue?
> > dmesg, Oopses, traces, etc.
> > Storage configuration might also be required here.
> 
> Turns out this only reproduces on nvdimm devices, xfs without dax
> mount option. More logs are attached.

It's a hang, so what's the output of sysrq-w once it's hung?

> > Have you also tried to reproduce it with another filesystem? If so, is the same
> > problem reproducible with another filesystem or only with XFS?
> 
> Only xfs. Test on ext4 ends shortly.
> 
> > 
> > P.S. please avoid sending it to all lists (mainly LKML).
> 
> Why?  I thought LKML was better archived.

The lists are all archived, but that's irrelevant. The list you
should report problems to is based on the scope of the problem, not
whether the lists are archived or not.

Because the scope of the problem at this point is XFS, it's
inappropriate to report it to lists that are for general kernel or
VFS issues. There's enough noise on those lists without everyone
bombarding them with subsystem specific issues - that's why we have
subsystem specific lists in the first place.

If it turns out to be a problem in some other subsystem, we'll add
cc's to other subsystem or general lists as appropriate.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC] KVM "fake DAX" device flushing

2017-10-11 Thread Dan Williams
On Wed, Oct 11, 2017 at 11:51 AM, Pankaj Gupta  wrote:
> We are sharing the prototype version of 'fake DAX' flushing
> interface for the initial feedback. This is still work in progress
> and not yet ready for merging.
>
> Prototype right now just implements basic functionality without advanced
> features with two major parts:
>
> - Qemu virtio-pmem device
>   It exposes a persistent memory range to KVM guest which at host side is file
>   backed memory and works as persistent memory device. In addition to this it
>   provides a virtio flushing interface for KVM guest to do a Qemu side sync for
>   guest DAX persistent memory range.
>
> - Guest virtio-pmem driver
>   Reads persistent memory range from paravirt device and reserves system memory map.
>   It also allocates a block device corresponding to the pmem range which is accessed
>   by DAX capable file systems. (file system support is still pending).
>
> We shared the project idea for 'fake DAX' flushing interface here [1].
> Based on suggestions here [2], we implemented guest 'virtio-pmem'
> driver and Qemu paravirt device.
>
> [1] https://www.spinics.net/lists/kvm/msg149761.html
> [2] https://www.spinics.net/lists/kvm/msg153095.html
>
> Work yet to be done:
>
> - Separate out the common code used by ACPI pmem interface and
>   reuse it.
>
> - In pmem device memmap allocation and working. There is some parallel work
>   going on upstream related to 'memory_hotplug restructuring' [3] and also hitting
>   a memory section alignment issue [4].
>
>   [3] https://lwn.net/Articles/712099/
>   [4] https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg02978.html
>
> - Provide DAX capable file-system(ext4 & XFS) support.
> - Qemu device flush functionality.
> - Qemu live migration work when host page cache is used.
> - Multiple virtio-pmem disks support.
>
> Prototype implementation for feedback:
>
> Kernel: 
> https://github.com/pagupta/linux/commit/d15cf90074eae91aeed7a228da3faf319566dd40

Please send this as a patch so it can be reviewed over email.


[RFC] KVM "fake DAX" device flushing

2017-10-11 Thread Pankaj Gupta
We are sharing the prototype version of 'fake DAX' flushing
interface for the initial feedback. This is still work in progress
and not yet ready for merging.

Prototype right now just implements basic functionality without advanced
features with two major parts:

- Qemu virtio-pmem device
  It exposes a persistent memory range to KVM guest which at host side is file
  backed memory and works as persistent memory device. In addition to this it
  provides a virtio flushing interface for KVM guest to do a Qemu side sync for
  guest DAX persistent memory range.

- Guest virtio-pmem driver
  Reads persistent memory range from paravirt device and reserves system memory map.
  It also allocates a block device corresponding to the pmem range which is accessed
  by DAX capable file systems. (file system support is still pending).

We shared the project idea for 'fake DAX' flushing interface here [1].
Based on suggestions here [2], we implemented guest 'virtio-pmem'
driver and Qemu paravirt device.

[1] https://www.spinics.net/lists/kvm/msg149761.html
[2] https://www.spinics.net/lists/kvm/msg153095.html

Work yet to be done:

- Separate out the common code used by ACPI pmem interface and
  reuse it.

- In pmem device memmap allocation and working. There is some parallel work
  going on upstream related to 'memory_hotplug restructuring' [3] and also hitting
  a memory section alignment issue [4].
  
  [3] https://lwn.net/Articles/712099/
  [4] https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg02978.html
  
- Provide DAX capable file-system (ext4 & XFS) support.
- Qemu device flush functionality.
- Qemu live migration work when host page cache is used.
- Multiple virtio-pmem disks support.

Prototype implementation for feedback:

Kernel: 
https://github.com/pagupta/linux/commit/d15cf90074eae91aeed7a228da3faf319566dd40
Qemu  : 
https://github.com/pagupta/qemu/commit/9c428db1e1076970e097e2b0ef8afe52509af823

Please provide feedback. Also, I will be attending KVM Forum in Prague (25-27 Oct).
If you are attending KVM Forum/Linux conference, I would love to have a discussion
on ideas and future work.

Thank you,
Pankaj Gupta


Re: [PATCH] libnvdimm: add smart payload fields added in DSM 1.6

2017-10-11 Thread Dan Williams
On Wed, Oct 11, 2017 at 10:57 AM, Dave Jiang  wrote:
> NVDIMM DSM interface v1.6 added additional smart health fields. Updating the
> smart payload data structure accordingly.

I'll also add a note when I merge this that the only reason we are
maintaining this structure in the kernel is in case we want to
translate 3rd party SMART payload formats into the ND_IOCTL_SMART
format. Outside of that we could just delete this since ndctl is
already doing that translation.


[PATCH] libnvdimm: add smart payload fields added in DSM 1.6

2017-10-11 Thread Dave Jiang
NVDIMM DSM interface v1.6 added additional smart health fields. Updating the
smart payload data structure accordingly.

Signed-off-by: Dave Jiang 
---
 include/uapi/linux/ndctl.h |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
index 3f03567..5ca8628 100644
--- a/include/uapi/linux/ndctl.h
+++ b/include/uapi/linux/ndctl.h
@@ -25,6 +25,7 @@ struct nd_cmd_smart {
 #define ND_SMART_USED_VALID    (1 << 2)
 #define ND_SMART_TEMP_VALID    (1 << 3)
 #define ND_SMART_CTEMP_VALID   (1 << 4)
+#define ND_SMART_SHUTDOWN_COUNT_VALID (1 << 5)
 #define ND_SMART_ALARM_VALID   (1 << 9)
 #define ND_SMART_SHUTDOWN_VALID        (1 << 10)
 #define ND_SMART_VENDOR_VALID  (1 << 11)
@@ -44,7 +45,10 @@ struct nd_smart_payload {
__u8 alarm_flags;
__u16 temperature;
__u16 ctrl_temperature;
-   __u8 reserved1[15];
+   __u32 shutdown_count;
+   __u8 ait_status;
+   __u16 pmic_temperature;
+   __u8 reserved1[8];
__u8 shutdown_state;
__u32 vendor_size;
__u8 vendor_data[92];
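The patch carves the new fields out of reserved1 without moving shutdown_state or anything after it: 4 + 1 + 2 + 8 bytes replace the old 15. A compile-time check of that arithmetic is sketched below. It models only the changed region, substitutes <stdint.h> types for __u8/__u16/__u32, and assumes the uapi structure is byte-packed (written here with the GCC/Clang packed attribute); without packing, pmic_temperature would be padded to an even offset and the layout would shift. read_shutdown_count() is a hypothetical helper showing how the new ND_SMART_SHUTDOWN_COUNT_VALID bit would gate use of the field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ND_SMART_SHUTDOWN_COUNT_VALID (1 << 5) /* value from the patch */

/* Bytes between ctrl_temperature and shutdown_state, before the patch. */
struct smart_region_old {
	uint8_t reserved1[15];
	uint8_t shutdown_state;
} __attribute__((packed));

/* The same bytes after the patch. */
struct smart_region_new {
	uint32_t shutdown_count;
	uint8_t ait_status;
	uint16_t pmic_temperature;
	uint8_t reserved1[8];
	uint8_t shutdown_state;
} __attribute__((packed));

/* 4 + 1 + 2 + 8 == 15, so shutdown_state (and what follows) stays put. */
_Static_assert(offsetof(struct smart_region_old, shutdown_state) ==
	       offsetof(struct smart_region_new, shutdown_state),
	       "new SMART fields must fit exactly in the old reserved1[15]");

/* Hypothetical helper: only trust shutdown_count when its valid bit is set. */
static bool read_shutdown_count(uint32_t flags,
				const struct smart_region_new *r,
				uint32_t *count)
{
	assert(r != NULL && count != NULL);
	if (!(flags & ND_SMART_SHUTDOWN_COUNT_VALID))
		return false;
	*count = r->shutdown_count;
	return true;
}
```

Firmware that predates DSM 1.6 would presumably leave bit 5 clear, so a consumer that checks the flag first never interprets old reserved bytes as a shutdown count.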



Re: nfit test deadlock

2017-10-11 Thread Dan Williams
On Wed, Oct 11, 2017 at 9:24 AM, Ross Zwisler
 wrote:
> Hey Dan,
>
> I was getting the ndctl unit tests working again in my setup today, and on the
> first run of ndctl's "make check" hit a deadlock.  This seems to be very easy
> to reproduce, all you have to do is specify a number of jobs to make that is
> larger than 1 (which I was accidentally doing via an alias),
> i.e. "make -j32 check"
>
> This seems to reproduce 100% of the time.
>
> I'll append the output of "echo w > /proc/sysrq-trigger" to the end of this
> mail.
>
> I was using v4.13 and ndctl 58.2.

I'll take a look. Probably just need more synchronization around the
nfit_test setup/teardown path, but my recommendation for now is don't
try to run the unit tests in parallel.


nfit test deadlock

2017-10-11 Thread Ross Zwisler
Hey Dan,

I was getting the ndctl unit tests working again in my setup today, and on the
first run of ndctl's "make check" hit a deadlock.  This seems to be very easy
to reproduce, all you have to do is specify a number of jobs to make that is
larger than 1 (which I was accidentally doing via an alias),
i.e. "make -j32 check"

This seems to reproduce 100% of the time.

I'll append the output of "echo w > /proc/sysrq-trigger" to the end of this
mail.

I was using v4.13 and ndctl 58.2.

- Ross

---

[  132.668043] sysrq: SysRq : Show Blocked State
[  132.668968]   taskPC stack   pid father
[  132.670774] lt-libndctl D0  5991   5983 0x0004
[  132.672102] Call Trace:
[  132.672744]  __schedule+0x411/0xb10
[  132.673266]  ? trace_hardirqs_on+0xd/0x10
[  132.674058]  schedule+0x40/0x90
[  132.674545]  __kernfs_remove+0x1f9/0x310
[  132.675298]  ? remove_wait_queue+0x70/0x70
[  132.676046]  kernfs_remove_by_name_ns+0x45/0x90
[  132.676848]  remove_files.isra.1+0x35/0x70
[  132.677451]  sysfs_remove_group+0x44/0x90
[  132.678259]  sysfs_remove_groups+0x2e/0x50
[  132.679047]  device_remove_attrs+0x4d/0x80
[  132.679438]  device_del+0x1ec/0x330
[  132.679888]  device_unregister+0x1a/0x60
[  132.680266]  nvdimm_bus_unregister+0x17/0x20 [libnvdimm]
[  132.680876]  acpi_nfit_unregister+0x15/0x20 [nfit]
[  132.681329]  devm_action_release+0xf/0x20
[  132.681835]  release_nodes+0x16d/0x2b0
[  132.682196]  devres_release_all+0x3c/0x50
[  132.682573]  device_release_driver_internal+0x175/0x220
[  132.683231]  device_release_driver+0x12/0x20
[  132.683715]  bus_remove_device+0x100/0x180
[  132.684102]  device_del+0x1f4/0x330
[  132.684428]  platform_device_del+0x28/0x90
[  132.684967]  platform_device_unregister+0x12/0x30
[  132.685412]  nfit_test_exit+0x17/0x92f [nfit_test]
[  132.685980]  SyS_delete_module+0x1d8/0x230
[  132.686369]  entry_SYSCALL_64_fastpath+0x1f/0xbe
[  132.686915] RIP: 0033:0x7f841012b317
[  132.687255] RSP: 002b:7fffe5ce0898 EFLAGS: 0206 ORIG_RAX: 00b0
[  132.688070] RAX: ffda RBX: 7f84103e4500 RCX: 7f841012b317
[  132.688850] RDX: 7f84103e5730 RSI: 0800 RDI: 0258ac98
[  132.689501] RBP: 7fffe5ce05b0 R08: 7f8410e19c80 R09: 0017
[  132.690257] R10: 006d R11: 0206 R12: 0038
[  132.690988] R13: 0001 R14:  R15: fbad2887
[  132.691735] lt-dsm-fail D0  5995   5986 0x0004
[  132.692246] Call Trace:
[  132.692481]  __schedule+0x411/0xb10
[  132.692972]  schedule+0x40/0x90
[  132.693288]  schedule_preempt_disabled+0x18/0x30
[  132.694083]  __mutex_lock+0x487/0xa20
[  132.694720]  ? acpi_nfit_flush_probe+0x3a/0x150 [nfit]
[  132.695452]  mutex_lock_nested+0x1b/0x20
[  132.696245]  ? mutex_lock_nested+0x1b/0x20
[  132.696947]  acpi_nfit_flush_probe+0x3a/0x150 [nfit]
[  132.697750]  ? kernfs_seq_start+0x2f/0x90
[  132.698302]  ? __mutex_lock+0x228/0xa20
[  132.699077]  ? lock_acquire+0xea/0x1f0
[  132.699698]  ? kernfs_seq_start+0x37/0x90
[  132.700083]  wait_probe_show+0x25/0x60 [libnvdimm]
[  132.700529]  dev_attr_show+0x20/0x50
[  132.701022]  ? sysfs_file_ops+0x46/0x60
[  132.701392]  sysfs_kf_seq_show+0xb2/0x110
[  132.701910]  kernfs_seq_show+0x27/0x30
[  132.702271]  seq_read+0x103/0x3d0
[  132.702709]  kernfs_fop_read+0x11e/0x190
[  132.703082]  __vfs_read+0x37/0x160
[  132.703399]  ? security_file_permission+0x9e/0xc0
[  132.704000]  vfs_read+0xab/0x150
[  132.704312]  SyS_read+0x58/0xc0
[  132.704737]  entry_SYSCALL_64_fastpath+0x1f/0xbe
[  132.705295] RIP: 0033:0x7fc0be0d4a80
[  132.705964] RSP: 002b:7fff3b5cfd08 EFLAGS: 0246 ORIG_RAX:
[  132.707094] RAX: ffda RBX: 0004 RCX: 7fc0be0d4a80
[  132.708154] RDX: 0400 RSI: 7fff3b5cfd80 RDI: 0004
[  132.709206] RBP: 7fff3b5d02a0 R08: 01a3ec00 R09: 0035
[  132.709968] R10: 0073 R11: 0246 R12: 00401620
[  132.710707] R13: 7fff3b5d0cd0 R14:  R15: 
[  132.711369] lt-parent-uuid  D0  5998   5989 0x0004
[  132.711984] Call Trace:
[  132.712229]  __schedule+0x411/0xb10
[  132.712565]  schedule+0x40/0x90
[  132.713004]  schedule_preempt_disabled+0x18/0x30
[  132.713443]  __mutex_lock+0x487/0xa20
[  132.713891]  ? acpi_nfit_flush_probe+0x3a/0x150 [nfit]
[  132.714378]  mutex_lock_nested+0x1b/0x20
[  132.714853]  ? mutex_lock_nested+0x1b/0x20
[  132.715239]  acpi_nfit_flush_probe+0x3a/0x150 [nfit]
[  132.715818]  ? kernfs_seq_start+0x2f/0x90
[  132.716205]  ? __mutex_lock+0x228/0xa20
[  132.716674]  ? lock_acquire+0xea/0x1f0
[  132.717035]  ? kernfs_seq_start+0x37/0x90
[  132.717412]  wait_probe_show+0x25/0x60 [libnvdimm]
[  132.718006]  dev_attr_show+0x20/0x50
[  132.718344]  ? sysfs_file_ops+0x46/0x60
[  132.718818]  sysfs_kf_seq_show+0xb2/0x110
[  132.719204]  

Re: [PATCH v8 13/14] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings

2017-10-11 Thread Dan Williams
On Wed, Oct 11, 2017 at 4:54 AM, Joerg Roedel  wrote:
> On Tue, Oct 10, 2017 at 07:50:12AM -0700, Dan Williams wrote:
>> +static void ib_umem_lease_break(void *__umem)
>> +{
>> + struct ib_umem *umem = __umem;
>> + struct ib_device *idev = umem->context->device;
>> + struct device *dev = idev->dma_device;
>> + struct scatterlist *sgl = umem->sg_head.sgl;
>> +
>> + iommu_unmap(umem->iommu, sg_dma_address(sgl) & PAGE_MASK,
>> + iommu_sg_num_pages(dev, sgl, umem->npages));
>> +}
>
> This looks like an invitation to break your code by random iommu-driver
> changes. There is no guarantee that an iommu-backed dma-api
> implementation will map exactly iommu_sg_num_pages() pages for a given
> sg-list. In other words, you are mixing the use of the IOMMU-API and the
> DMA-API in an incompatible way that only works because you know the
> internals of the iommu-drivers.
>
> I've seen in another patch that your changes strictly require an IOMMU,
> so what you should do instead is to switch from the DMA-API to the
> IOMMU-API and do the address-space management yourself.
>

Ok, I'll switch over completely to the iommu api for this. It will
also address Robin's concern.


Re: [PATCH] Fix mpage_writepage() for pages with buffers

2017-10-11 Thread Matthew Wilcox
On Tue, Oct 10, 2017 at 01:31:44PM -0700, Linus Torvalds wrote:
> On Tue, Oct 10, 2017 at 12:44 PM, Andrew Morton
>  wrote:
> >
> > This is all pretty mature code (isn't it?).  Any idea why this bug
> > popped up now?

I have no idea why it's suddenly popped up.  It looks like it should
be a bohrbug, but it's actually a heisenbug, and I don't understand
that either.

> Also, while the patch looks sane, the
> 
> clean_buffers(page, PAGE_SIZE);
> 
> line really threw me. That's an insane value to pick, it looks like
> "bytes in page", but it isn't. It's just a random value that is bigger
> than "PAGE_SIZE >> SECTOR_SHIFT".
> 
> I'd prefer to see just ~0u if the intention is just "bigger than
> anything possible".

Actually, I did choose it to be "number of bytes in the page", based on
the reasoning that I didn't want to calculate what the actual block size
was, and the block size surely couldn't be any smaller than one byte.  I
forgot about the SECTOR_SIZE limit on filesystem block size, so your
spelling of "big enough" does look better.

Now that I think about it some more, I suppose we might end up with a
situation where we're eventually passing a hugepage to this routine,
and futureproofing it with ~0U probably makes more sense.


Re: [PATCH v8 01/14] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags

2017-10-11 Thread Dan Williams
On Wed, Oct 11, 2017 at 12:43 AM, Jan Kara  wrote:
> On Tue 10-10-17 07:49:01, Dan Williams wrote:
>> The mmap(2) syscall suffers from the ABI anti-pattern of not validating
>> unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a
>> mechanism to define new behavior that is known to fail on older kernels
>> without the support. Define a new MAP_SHARED_VALIDATE flag pattern that
>> is guaranteed to fail on all legacy mmap implementations.
>>
>> It is worth noting that the original proposal was for a standalone
>> MAP_VALIDATE flag. However, when that could not be supported by all
>> archs Linus observed:
>>
>> I see why you *think* you want a bitmap. You think you want
>> a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
>> etc, so that people can do
>>
>> ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
>>   | MAP_SYNC, fd, 0);
>>
>> and "know" that MAP_SYNC actually takes.
>>
>> And I'm saying that whole wish is bogus. You're fundamentally
>> depending on special semantics, just make it explicit. It's already
>> not portable, so don't try to make it so.
>>
>> Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
>> of 0x3, and make people do
>>
>> ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
>>   | MAP_SYNC, fd, 0);
>>
>> and then the kernel side is easier too (none of that random garbage
>> playing games with looking at the "MAP_VALIDATE bit", but just another
>> case statement in that map type thing.
>>
>> Boom. Done.
>>
>> Similar to ->fallocate() we also want the ability to validate the
>> support for new flags on a per ->mmap() 'struct file_operations'
>> instance basis.  Towards that end arrange for flags to be generically
>> validated against a mmap_supported_mask exported by 'struct
>> file_operations'. By default all existing flags are implicitly
>> supported, but new flags require MAP_SHARED_VALIDATE and
>> per-instance-opt-in.
>>
>> Cc: Jan Kara 
>> Cc: Arnd Bergmann 
>> Cc: Andy Lutomirski 
>> Cc: Andrew Morton 
>> Suggested-by: Christoph Hellwig 
>> Suggested-by: Linus Torvalds 
>> Signed-off-by: Dan Williams 
>> ---
>>  arch/alpha/include/uapi/asm/mman.h   |1 +
>>  arch/mips/include/uapi/asm/mman.h|1 +
>>  arch/mips/kernel/vdso.c  |2 +
>>  arch/parisc/include/uapi/asm/mman.h  |1 +
>>  arch/tile/mm/elf.c   |3 +-
>>  arch/xtensa/include/uapi/asm/mman.h  |1 +
>>  include/linux/fs.h   |2 +
>>  include/linux/mm.h   |2 +
>>  include/linux/mman.h |   39 ++
>>  include/uapi/asm-generic/mman-common.h   |1 +
>>  mm/mmap.c|   21 --
>>  tools/include/uapi/asm-generic/mman-common.h |1 +
>>  12 files changed, 69 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
>> index 3b26cc62dadb..92823f24890b 100644
>> --- a/arch/alpha/include/uapi/asm/mman.h
>> +++ b/arch/alpha/include/uapi/asm/mman.h
>> @@ -14,6 +14,7 @@
>>  #define MAP_TYPE       0x0f            /* Mask for type of mapping (OSF/1 is _wrong_) */
>>  #define MAP_FIXED      0x100           /* Interpret addr exactly */
>>  #define MAP_ANONYMOUS  0x10            /* don't use a file */
>> +#define MAP_SHARED_VALIDATE 0x3        /* share + validate extension flags */
>
> Just a nit but I'd put definition of MAP_SHARED_VALIDATE close to the
> definition of MAP_SHARED and MAP_PRIVATE where it logically belongs (for
> all archs).

Will do.
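Linus's argument is that 0x3 is a map *type* that no pre-patch kernel accepts, so the existing type switch does the validation for free. A minimal userspace model of the old and new behaviour is sketched below. The MODEL_* names are hypothetical; MAP_SHARED (0x01), MAP_PRIVATE (0x02), MAP_TYPE (0x0f), MAP_FIXED (0x10) and MAP_ANONYMOUS (0x20) use the asm-generic values (alpha differs for some of these, as the quoted hunk shows), MODEL_MAP_NEWFLAG stands for a future extension flag, the legacy mask is deliberately incomplete, and returning -EOPNOTSUPP for an unsupported flag is an assumption about the errno.

```c
#include <assert.h>
#include <errno.h>

#define MODEL_MAP_SHARED          0x01
#define MODEL_MAP_PRIVATE         0x02
#define MODEL_MAP_SHARED_VALIDATE 0x03    /* value chosen in the patch */
#define MODEL_MAP_TYPE            0x0f
#define MODEL_MAP_FIXED           0x10
#define MODEL_MAP_ANONYMOUS       0x20
#define MODEL_MAP_NEWFLAG         0x80000 /* hypothetical extension flag */

/* Flags every ->mmap() implicitly supports (incomplete, for illustration). */
#define MODEL_LEGACY_FLAGS \
	(MODEL_MAP_TYPE | MODEL_MAP_FIXED | MODEL_MAP_ANONYMOUS)

/* Pre-patch behaviour: unknown flag bits are silently ignored, but an
 * unknown map type is rejected. */
static int legacy_mmap_check(unsigned long flags)
{
	switch (flags & MODEL_MAP_TYPE) {
	case MODEL_MAP_SHARED:
	case MODEL_MAP_PRIVATE:
		return 0;
	default:
		return -EINVAL;
	}
}

/* Post-patch sketch: MAP_SHARED_VALIDATE also checks every bit against
 * what this file's ->mmap() has opted into (supported_mask). */
static int validating_mmap_check(unsigned long flags,
				 unsigned long supported_mask)
{
	switch (flags & MODEL_MAP_TYPE) {
	case MODEL_MAP_SHARED:
	case MODEL_MAP_PRIVATE:
		return 0;
	case MODEL_MAP_SHARED_VALIDATE:
		if (flags & ~(MODEL_LEGACY_FLAGS | supported_mask))
			return -EOPNOTSUPP;
		return 0;
	default:
		return -EINVAL;
	}
}
```

A caller on a kernel without the patch gets -EINVAL instead of a mapping that silently lacks the requested semantics, which is the property the commit message relies on; plain MAP_SHARED keeps ignoring unknown bits for compatibility.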

>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index f8c10d336e42..5c4c98e4adc9 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2133,7 +2133,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
>>
>>  extern unsigned long mmap_region(struct file *file, unsigned long addr,
>>   unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
>> - struct list_head *uf);
>> + struct list_head *uf, unsigned long map_flags);
>>  extern unsigned long do_mmap(struct file *file, unsigned long addr,
>>   unsigned long len, unsigned long prot, unsigned long flags,
>>   vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
>
> I have to say I'm not very keen on passing down both vm_flags and map_flags
> - vm_flags are almost a subset of map_flags but not quite and the ambiguity
> which needs to be used for a particular check seems to open a space for
> errors. Granted you currently only care about MAP_DIRECT in ->mmap_validate
> and just pass map_flags 

Re: ffsb job does not exit on xfs 4.14-rc1+

2017-10-11 Thread Dan Williams
On Wed, Oct 11, 2017 at 6:54 AM, Xiong Zhou  wrote:
> On Mon, Sep 25, 2017 at 10:49:03AM +0200, Carlos Maiolino wrote:
>> On Mon, Sep 25, 2017 at 01:40:06AM +, Xiong Zhou wrote:
>> > Hi,
>> >
>> > ffsb test won't exit like this on Linus tree 4.14-rc1+.
>> > Latest commit cd4175b11685
>>
>> Can you provide more information? Do you have any kernel log from this issue?
>> dmesg, Oopses, traces, etc.
>> Storage configuration might also be required here.
>
> Turns out this only reproduces on nvdimm devices, xfs without dax
> mount option. More logs are attached.
>
>>
>> Have you also tried to reproduce it with another filesystem? If so, is the same
>> problem reproducible with another filesystem or only with XFS?
>
> Only xfs. Test on ext4 ends shortly.
>
>>
>> P.S. please avoid sending it to all lists (mainly LKML).
>
> Why?  I thought LKML was better archived.
>
>> Unless it's a more generic kernel problem, keep it in fsdevel and/or the
>> respective filesystem list if it's related to a single filesystem only.
>
> 4.14-rc1+ won't survive this script, while 4.13 can.

Can you try a git bisect?


Re: [PATCH v8 13/14] IB/core: use MAP_DIRECT to fix / enable RDMA to DAX mappings

2017-10-11 Thread Joerg Roedel
On Tue, Oct 10, 2017 at 07:50:12AM -0700, Dan Williams wrote:
> +static void ib_umem_lease_break(void *__umem)
> +{
> + struct ib_umem *umem = __umem;
> + struct ib_device *idev = umem->context->device;
> + struct device *dev = idev->dma_device;
> + struct scatterlist *sgl = umem->sg_head.sgl;
> +
> + iommu_unmap(umem->iommu, sg_dma_address(sgl) & PAGE_MASK,
> + iommu_sg_num_pages(dev, sgl, umem->npages));
> +}

This looks like an invitation to break your code by random iommu-driver
changes. There is no guarantee that an iommu-backed dma-api
implementation will map exactly iommu_sg_num_pages() pages for a given
sg-list. In other words, you are mixing the use of the IOMMU-API and the
DMA-API in an incompatible way that only works because you know the
internals of the iommu-drivers.

I've seen in another patch that your changes strictly require an IOMMU,
so what you should do instead is to switch from the DMA-API to the
IOMMU-API and do the address-space management yourself.

Regards,

Joerg
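Joerg's point is that the number of IOVA pages a DMA-API implementation actually maps is that implementation's own business. Two choices that can change the count are sketched below in plain C: the IOMMU page granule, and whether adjacent scatterlist entries are merged into one IOVA range. pages_spanned() uses the usual offset-plus-length, round-up-to-granule arithmetic; struct seg and sum_per_entry() are hypothetical, and the sketch does not claim to reproduce what iommu_sg_num_pages() in the patch computes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scatterlist entry: bus address and length in bytes. */
struct seg {
	unsigned long addr;
	unsigned long len;
};

/* IOMMU pages needed to cover [addr, addr + len) at granule io_page_size
 * (a power of two): offset into the first page plus length, rounded up. */
static unsigned long pages_spanned(unsigned long addr, unsigned long len,
				   unsigned long io_page_size)
{
	unsigned long size;

	assert(io_page_size && !(io_page_size & (io_page_size - 1)));
	size = (addr & (io_page_size - 1)) + len;
	return (size + io_page_size - 1) / io_page_size;
}

/* Count pages as if every sg entry were mapped separately. */
static unsigned long sum_per_entry(const struct seg *s, size_t n,
				   unsigned long io_page_size)
{
	unsigned long total = 0;

	for (size_t i = 0; i < n; i++)
		total += pages_spanned(s[i].addr, s[i].len, io_page_size);
	return total;
}
```

Two adjacent 2 KiB entries count as two 4 KiB pages when mapped separately but as one page when merged, and a 64 KiB granule changes the count again, so an unmap size computed outside the mapping code can disagree with what was actually mapped.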



Fwd: Geometric Dimensioning and Tolerancing

2017-10-11 Thread 张�c佳
To: linux-nvdimm@lists.01.org

For the detailed course schedule and registration information, please see the attachment.


Re: [PATCH v8 01/14] mm: introduce MAP_SHARED_VALIDATE, a mechanism to safely define new mmap flags

2017-10-11 Thread Jan Kara
On Tue 10-10-17 07:49:01, Dan Williams wrote:
> The mmap(2) syscall suffers from the ABI anti-pattern of not validating
> unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a
> mechanism to define new behavior that is known to fail on older kernels
> without the support. Define a new MAP_SHARED_VALIDATE flag pattern that
> is guaranteed to fail on all legacy mmap implementations.
> 
> It is worth noting that the original proposal was for a standalone
> MAP_VALIDATE flag. However, when that could not be supported by all
> archs Linus observed:
> 
> I see why you *think* you want a bitmap. You think you want
> a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
> etc, so that people can do
> 
> ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
>   | MAP_SYNC, fd, 0);
> 
> and "know" that MAP_SYNC actually takes.
> 
> And I'm saying that whole wish is bogus. You're fundamentally
> depending on special semantics, just make it explicit. It's already
> not portable, so don't try to make it so.
> 
> Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
> of 0x3, and make people do
> 
> ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
>   | MAP_SYNC, fd, 0);
> 
> and then the kernel side is easier too (none of that random garbage
> playing games with looking at the "MAP_VALIDATE bit", but just another
> case statement in that map type thing.
> 
> Boom. Done.
> 
> Similar to ->fallocate() we also want the ability to validate the
> support for new flags on a per ->mmap() 'struct file_operations'
> instance basis.  Towards that end arrange for flags to be generically
> validated against a mmap_supported_mask exported by 'struct
> file_operations'. By default all existing flags are implicitly
> supported, but new flags require MAP_SHARED_VALIDATE and
> per-instance-opt-in.
> 
> Cc: Jan Kara 
> Cc: Arnd Bergmann 
> Cc: Andy Lutomirski 
> Cc: Andrew Morton 
> Suggested-by: Christoph Hellwig 
> Suggested-by: Linus Torvalds 
> Signed-off-by: Dan Williams 
> ---
>  arch/alpha/include/uapi/asm/mman.h   |1 +
>  arch/mips/include/uapi/asm/mman.h|1 +
>  arch/mips/kernel/vdso.c  |2 +
>  arch/parisc/include/uapi/asm/mman.h  |1 +
>  arch/tile/mm/elf.c   |3 +-
>  arch/xtensa/include/uapi/asm/mman.h  |1 +
>  include/linux/fs.h   |2 +
>  include/linux/mm.h   |2 +
>  include/linux/mman.h |   39 ++
>  include/uapi/asm-generic/mman-common.h   |1 +
>  mm/mmap.c|   21 --
>  tools/include/uapi/asm-generic/mman-common.h |1 +
>  12 files changed, 69 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 3b26cc62dadb..92823f24890b 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -14,6 +14,7 @@
>  #define MAP_TYPE       0x0f            /* Mask for type of mapping (OSF/1 is _wrong_) */
>  #define MAP_FIXED      0x100           /* Interpret addr exactly */
>  #define MAP_ANONYMOUS  0x10            /* don't use a file */
> +#define MAP_SHARED_VALIDATE 0x3        /* share + validate extension flags */

Just a nit but I'd put definition of MAP_SHARED_VALIDATE close to the
definition of MAP_SHARED and MAP_PRIVATE where it logically belongs (for
all archs).

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f8c10d336e42..5c4c98e4adc9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2133,7 +2133,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
>  
>  extern unsigned long mmap_region(struct file *file, unsigned long addr,
>   unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
> - struct list_head *uf);
> + struct list_head *uf, unsigned long map_flags);
>  extern unsigned long do_mmap(struct file *file, unsigned long addr,
>   unsigned long len, unsigned long prot, unsigned long flags,
>   vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,

I have to say I'm not very keen on passing down both vm_flags and map_flags
- vm_flags are almost a subset of map_flags but not quite and the ambiguity
which needs to be used for a particular check seems to open a space for
errors. Granted you currently only care about MAP_DIRECT in ->mmap_validate
and just pass map_flags through mmap_region() so there's no space for
confusion but future checks could do something different. But OTOH I don't
see a cleaner way of avoiding the need to allocate